Turn a podcast into a 60-second clip

Tutorial: go from a 45–60 minute podcast recording to a captioned, vertical 60-second clip with one prompt and a couple of refinements.

You recorded a 50-minute interview. Somewhere in it is one story that will stop the scroll, and you don't want to spend an evening scrubbing for it. This tutorial takes you from the raw recording to a ready-to-post, captioned, vertical 60-second clip in about 15 minutes, and the agent does almost all of it.

What you'll make

A 60-second vertical (9:16) clip with:

The strongest self-contained story or insight from the episode, opening mid-thought on its best line
Filler words and rambly repeats removed
Face-focused vertical framing
High-contrast boxed captions (the boxed-contrast skin — the default for podcast clips)
A quiet music bed mixed under the voice

Before you start

Import your recording into the project (drag it into the Media panel, then onto the timeline — or just tell the agent to place it).
Know roughly what the episode is about, so you can give the agent a goal. "Find the strongest moment" works; "find the strongest story about hiring" works better.
Long footage burns AI video-analysis time. On the free plan, a 60-minute episode may exceed your monthly analysis budget. See Plans and limits and the Troubleshooting section below.

Step 1: One prompt

Open the agent panel and delegate the whole job:

Turn this podcast into a 60-second vertical clip. Find the strongest story about hiring your first engineer — one self-contained moment, not a montage. Captions on, clean up the filler words, keep my guest's face centered.

That's it. You don't pick tools; the agent runs the podcast-clip flow end to end.

What the agent actually does

Watch the tool rows appear in the panel — this is the sequence behind them:

Storyboard first. The agent states its thesis (the one idea the clip argues) and posts a storyboard card above the chat: target length, format, and 3–6 beats with the source timestamps it plans to pull from. That card is the shared plan — it updates as beats complete. See Storyboard.
Captions before cutting. It runs auto_caption_styled early, because transcribing the audio is what gives it a transcript to search. The captions land as a reviewable text track.
Finds the moment. Given your goal ("hiring your first engineer") and a ~60s target, extract_transcript_highlights selects the high-signal transcript segments and ripple-deletes everything else. The other 49 minutes of a 50-minute recording disappear in one undoable step.
Tightens delivery. remove_transcript_fillers cuts filler-only segments (um, uh, you know) and near-duplicate restarts, keeping every track in sync.
Reframes for vertical. smart_reframe_subject with focus: face sets a 9:16 canvas and frames each clip with face-aware heuristics (when the focus is a face, the crop is biased upward). This is heuristic framing from the visual analysis, not pixel-level face tracking.
Styles the captions. The boxed-contrast skin — white text on a dark box — is the podcast-clip default because it stays readable over any footage.
Sound design. auto_sound_design balances the mix: voice at 0 dB, any music bed at -18 dB, 0.25s fades.
Checks its own work. Before finishing, the agent runs critique_edit, verify, and review_edit — the last one renders the cut and has an AI actually watch it, returning a ship/no-ship verdict. If a check fails, the agent is bounced back to fix it before it's allowed to conclude. See How the agent checks its work.
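To make "ripple-delete" and "filler-only" concrete, here is a minimal Python sketch of the idea behind the highlight and filler steps above. It is an illustration under stated assumptions, not the agent's implementation: the Segment model, the filler list, and the helper names is_filler_only and ripple_close are all hypothetical, and the highlight selection is hand-picked rather than scored.

```python
import re
from dataclasses import dataclass

# Hypothetical data model for illustration; not the product's real schema.
@dataclass
class Segment:
    start: float  # seconds in the source recording
    end: float
    text: str

def is_filler_only(segment: Segment) -> bool:
    """True when the segment holds nothing but filler words (or no words)."""
    text = re.sub(r"[^a-z' ]+", " ", segment.text.lower())
    text = text.replace("you know", " ")  # handle the multi-word filler first
    return all(word in ("um", "uh") for word in text.split())

def ripple_close(kept: list[Segment]) -> list[tuple[float, float, str]]:
    """Lay the kept segments end to end so no gaps remain on the timeline."""
    timeline = []
    cursor = 0.0
    for seg in kept:
        duration = seg.end - seg.start
        timeline.append((cursor, cursor + duration, seg.text))
        cursor += duration
    return timeline

segments = [
    Segment(0.0, 4.0, "Welcome back to the show."),
    Segment(612.0, 630.0, "Our first engineer turned us down twice."),
    Segment(630.0, 631.5, "Um, you know, uh."),
    Segment(631.5, 655.5, "So I flew out to meet her in person."),
]

highlights = segments[1:]  # stand-in for the highlight selection
kept = [s for s in highlights if not is_filler_only(s)]
timeline = ripple_close(kept)

for start, end, text in timeline:
    print(f"{start:5.1f}-{end:5.1f}  {text}")
```

In this toy run, two segments survive and sit back to back from 0.0 to 42.0 seconds. The source timestamps (612.0 onward) are discarded on the timeline, which is why everything after a ripple-delete shifts left.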

What the timeline looks like after

A handful of clips (a tight 60s clip is usually 3–6 segments, not 20 fragments) on the main track, a caption text track above it, a portrait canvas, and every change sitting in undo history. The storyboard card shows each beat marked done.

Step 2: Refine conversationally

The first cut is the agent's best editorial judgment. Push back like you would with a human editor:

Different moment. The story it picked isn't the one you remembered:

Not that story — there's a better moment where she talks about the candidate who turned them down twice. Rebuild the clip around that.

The agent updates the storyboard first, then re-selects from the transcript. Nothing is lost — the full source is still in your media bin.

Tighter. Sixty seconds is the ceiling, not the goal:

Tighten this to 45 seconds. Keep the punchline intact.

Different caption look. Restyling never re-transcribes:

Try the tiktok-bold caption style instead.

That's a single apply_caption_skin call — instant, and instantly undoable if you liked boxed-contrast better.

Make it yours

Batch it. Ask for three clips on three different topics from the same episode, one at a time, and export each.
Add a hook overlay. "Add a short hook using her actual words from the clip." The agent quotes the transcript — it will refuse to slap on a stock phrase like "wait for this."
Introduce the speaker. "Add a lower-third with her name and company in the first few seconds" uses an animated overlay card. See the B-roll and overlay cards tutorial.
Two-camera podcasts. If you have separate guest/host angles, put the angle you want on the main track and tell the agent which speaker the clip is about.

Troubleshooting

The clip cuts mid-crosstalk or keeps the host's interjections. Transcription is segment-level, not word-level, so when two people talk over each other a segment can contain both voices and the cut can't split between them. Fix it conversationally — "trim the clip at 0:22 so the host's 'right, right' is gone" — or nudge the playhead and split manually on the timeline.
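A small Python sketch of why segment-level transcription limits where a transcript-driven cut can land. The Segment model and the cut_points and nearest_cut helpers are hypothetical, and the timestamps are invented for the example.

```python
from dataclasses import dataclass

# Hypothetical transcript model: one entry per segment, no word timings.
@dataclass
class Segment:
    start: float  # seconds on the clip timeline
    end: float
    text: str

def cut_points(segments: list[Segment]) -> list[float]:
    """Times where a transcript-driven cut can land: segment boundaries only."""
    points = set()
    for seg in segments:
        points.add(seg.start)
        points.add(seg.end)
    return sorted(points)

def nearest_cut(segments: list[Segment], wanted: float) -> float:
    """Closest available boundary to the time you actually wanted."""
    return min(cut_points(segments), key=lambda t: abs(t - wanted))

segments = [
    # Guest and host overlap inside a single segment.
    Segment(18.0, 24.5, "and that's when she said no. Right, right. So we tried again"),
    Segment(24.5, 31.0, "with a better offer."),
]

print(cut_points(segments))         # [18.0, 24.5, 31.0]
print(nearest_cut(segments, 22.0))  # 24.5
```

The host's interjection at 0:22 sits inside the first segment, so the closest boundary is 24.5. Removing it takes an explicit trim at a timestamp or a manual split, as described above.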

Analysis fails or the agent says it can't watch the footage. Full-video AI analysis is metered, and an hour-long episode is a big bite. If you're near your monthly limit, the request may be blocked — check the AI usage bar in the agent panel and see Plans and limits. Workaround: rough-trim the episode to the 5–10 minute region you care about first ("keep 12:00 to 21:00, drop the rest"), then run the clip prompt on that.
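To see how much the rough-trim workaround shrinks the job, here is a short Python sketch that converts the timestamps from the example prompt into a region length. The helper names are hypothetical, and the comparison assumes analysis cost scales roughly with footage length, which this tutorial does not confirm.

```python
def to_seconds(stamp: str) -> int:
    """Convert "mm:ss" or "h:mm:ss" into whole seconds."""
    seconds = 0
    for part in stamp.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds

def region_minutes(start: str, end: str) -> float:
    """Length in minutes of the region kept by a rough trim."""
    length = to_seconds(end) - to_seconds(start)
    if length <= 0:
        raise ValueError("end must come after start")
    return length / 60

episode_minutes = 60  # the hour-long episode from the example
kept = region_minutes("12:00", "21:00")

print(kept)                    # 9.0
print(kept / episode_minutes)  # 0.15
```

Keeping 12:00 to 21:00 leaves 9 minutes, or 15% of the original footage, for the agent to analyze.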

The chosen moment starts confusingly mid-context. Ask: "Add 3 seconds of setup before the story starts so it makes sense cold." The agent re-pulls from the source with padding.

Captions miss or mangle a technical term. Segment-level transcription occasionally fumbles jargon and names. Captions are ordinary text elements — click the cue on the timeline and fix the word, or tell the agent: "In the captions, change 'cubeernetes' to 'Kubernetes' everywhere."
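Because captions are ordinary text elements, an "everywhere" fix amounts to a find-and-replace over the cue text that leaves timing alone. A minimal Python sketch, assuming a hypothetical Cue model and a hypothetical replace_term helper (neither is the product's API):

```python
import re
from dataclasses import dataclass

# Hypothetical caption cue: timing plus editable text.
@dataclass
class Cue:
    start: float
    end: float
    text: str

def replace_term(cues: list[Cue], wrong: str, right: str) -> int:
    """Replace whole-word matches of `wrong` in every cue; return the count."""
    pattern = re.compile(rf"\b{re.escape(wrong)}\b", re.IGNORECASE)
    changed = 0
    for cue in cues:
        # A function replacement keeps `right` literal (no escape processing).
        cue.text, n = pattern.subn(lambda m: right, cue.text)
        changed += n
    return changed

cues = [
    Cue(3.0, 5.5, "We moved everything to cubeernetes"),
    Cue(9.0, 11.0, "Cubeernetes was overkill for us"),
    Cue(12.0, 13.0, "but it worked"),
]

changed = replace_term(cues, "cubeernetes", "Kubernetes")
print(changed)  # 2
for cue in cues:
    print(f"{cue.start:4.1f}-{cue.end:4.1f}  {cue.text}")
```

The match is case-insensitive so a sentence-initial "Cubeernetes" is fixed too. Only the text changes, so the cues stay in sync with the audio.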