What Went Well
- Frame extraction for visual analysis worked extremely well. Using ffmpeg to pull frames at 2-second intervals from Veo-generated clips, then visually inspecting them for lip movement, gave concrete evidence to back up what were initially theoretical risk ratings. This turned “Shot 16 might have a conflict” into “Shot 16 definitively shows 8+ characters with open mouths while VO plays.”
- Audio RMS analysis added a quantitative layer. Measuring Veo’s native audio levels at 1-second intervals revealed clear speech bursts (e.g., -16.7 dB spike in Shot 6 at 1s) that confirmed the model generated dialogue audio to match the lip movement in the video.
- Cross-referencing three artifacts (short story → scene list → gen.log/timeline) was the right analytical approach. The short story revealed what dialogue SHOULD have been (rich, flowing, comedically overwrought), the scene list showed what it became (truncated one-liners), and the timeline/gen.log showed how it was assembled. Each artifact alone would have given an incomplete picture.
- Downgrading Shot 23 after visual inspection was a good call. The initial text-based risk rating (HIGH: close-up on face with VO) was wrong because the beard obscured the mouth. Frame analysis corrected this before it led to a wasted reshoot.
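A minimal sketch of the frame-extraction step described above, assuming ffmpeg is installed and on the PATH. The function names, the output naming pattern, and the 2-second default are illustrative; the exact command used during the analysis was not recorded here.

```python
import subprocess
from pathlib import Path


def build_frame_extract_cmd(clip, out_dir, interval_s=2.0):
    """Build an ffmpeg command that saves one frame every `interval_s` seconds."""
    # Hypothetical naming pattern: frame_001.png, frame_002.png, ...
    pattern = str(Path(out_dir) / "frame_%03d.png")
    return [
        "ffmpeg", "-hide_banner", "-y",
        "-i", clip,
        # The fps filter resamples the video to the given output frame rate;
        # 0.5 frames per second corresponds to one frame every 2 seconds.
        "-vf", f"fps={1.0 / interval_s:g}",
        pattern,
    ]


def extract_frames(clip, out_dir, interval_s=2.0):
    """Run ffmpeg and return the extracted frame paths in order."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_frame_extract_cmd(clip, out_dir, interval_s), check=True)
    return sorted(Path(out_dir).glob("frame_*.png"))
```

Calling extract_frames("shot16.mp4", "frames/shot16") (a hypothetical file name) would write the frames into that folder for manual lip-movement inspection. Building the command separately from running it makes it easy to log or review the exact ffmpeg invocation per shot.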
What Didn’t Go Well
- Initial risk ratings were partially wrong. I rated Shot 23 as HIGH and Shot 6 as MEDIUM based on text analysis of the scene list. After viewing frames, Shot 23 was actually LOW and Shot 6 was HIGH. The lesson: text-based analysis of motion prompts is unreliable for predicting Veo’s actual output — Veo frequently deviates from the prompt (Shot 6’s “stands rigidly at attention” became animated gesturing). Always verify with frame extraction.
- No ability to hear the audio. I could measure audio levels and detect speech bursts quantitatively, but couldn’t actually listen to confirm whether the bursts were speech, crowd noise, or ambient sound. I had to infer from the visual context and RMS patterns.
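A rough sketch of the windowed RMS measurement mentioned above. It assumes the clip's audio has already been decoded to mono float samples in the range [-1, 1] and that levels are expressed in dB relative to full scale. The function names and the -20 dB burst threshold are hypothetical choices for illustration, not values from the analysis.

```python
import math


def rms_db(samples):
    """Return the RMS level of `samples` in dB relative to full scale (1.0)."""
    if not samples:
        return float("-inf")
    mean_sq = sum(s * s for s in samples) / len(samples)
    if mean_sq == 0:
        return float("-inf")  # digital silence
    # 10*log10(mean square) is the same as 20*log10(RMS).
    return 10.0 * math.log10(mean_sq)


def windowed_rms_db(samples, sample_rate, window_s=1.0):
    """Measure the RMS level of each consecutive `window_s`-second window."""
    step = int(sample_rate * window_s)
    return [rms_db(samples[i:i + step]) for i in range(0, len(samples), step)]


def find_bursts(levels_db, threshold_db=-20.0):
    """Return the indices of windows louder than `threshold_db` (assumed cutoff)."""
    return [i for i, level in enumerate(levels_db) if level > threshold_db]
```

A flagged window only says that something loud happened in that second. As noted above, it cannot distinguish speech from crowd noise or ambient sound, so each burst still needs to be checked against the extracted frames.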
Failure Modes & Bottlenecks
- No significant stalls or loops. The task was well-scoped and the artifacts were clearly organized in the team folder.
- Minor: the gen.log was truncated. It covered only shots 3 through 9a (two generation runs), not all 25 shots, so I couldn’t verify the exact prompts used for every shot. I relied on the scene_list.md motion prompts instead. These should match what was actually submitted, but the tech lead may have modified prompts during generation.
Key Decisions Made
- Chose to do frame-by-frame visual analysis rather than relying purely on text analysis of the scene list and motion prompts. This was more work but produced much more reliable findings. The alternative (text-only analysis) would have produced incorrect risk ratings for at least 2 shots.
- Recommended compound VO+dialogue conversion for Shot 6 rather than a pure VO-safe reshoot. The rationale: Veo naturally generated the shot as a conversation scene, and working with the model’s tendency is easier than fighting it. Converting to compound audio turns that tendency into a strength.
- Proposed adding a new compound dialogue shot (Gary+Reginald exchange) rather than just fixing existing shots. This goes slightly beyond “fix what’s broken” into “improve what’s missing,” but the absence of any two-character dialogue exchange was a significant gap that warranted addressing during a reshoot pass anyway.
- Wrote the guidelines document at a playbook-ready level of detail rather than a brief summary. Preston indicated this would be incorporated into the playbook, so investing in thorough, structured guidance (with examples, checklists, and Veo-specific strategies) made more sense than bullet points that would need to be expanded later.
Suggestions for Improvement
- Add a [DIALOGUE]/[VO]/[COMPOUND]/[SILENT] tag requirement to the scene_list.md template in the playbook. This is the single highest-leverage change. If teams tag every shot’s audio mode at Step 2, most of the conflicts found in this analysis would be caught before any generation happens.
- Add a “VO-safety check” gate at Step 4 (Storyboard). Before generating storyboard frames, have the Editor verify that every [VO]-tagged shot’s motion prompt does not include speaking/shouting/cheering verbs. This is a 5-minute review that prevents expensive reshoots.
- Include temporal cueing in the scene_list.md format. Even simple hints like “Dialogue (Character, first half)” would dramatically improve editor alignment and reduce guesswork during timeline construction.
- Establish a minimum dialogue word count per character type in the Team Event Guide. The “short declarative statement” pattern seems to be a default behavior of AI scene-list generation — having an explicit minimum (e.g., 15+ words for protagonist lines) would counteract this tendency.
- Consider having the analyst role available earlier in the pipeline — ideally at Step 2 (Beat Sheet review) — rather than only post-production. The dialogue/narration issues identified here were all baked in at the scripting stage. Catching them before generation would save reshoot costs.
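The tagging, VO-safety, and minimum-word-count suggestions above could be partly automated. The sketch below is one possible shape for such a check. The Shot structure, the verb list, and the single 15-word minimum are all assumptions made for illustration: the real scene_list.md layout is not specified here, and the suggestion above calls for per-character-type minimums rather than one global value.

```python
import re
from dataclasses import dataclass
from typing import Optional

AUDIO_TAGS = {"DIALOGUE", "VO", "COMPOUND", "SILENT"}

# Assumed, non-exhaustive list of verbs that imply visible mouth movement.
SPEECH_VERBS = {
    "speak", "speaks", "speaking", "say", "says", "saying",
    "talk", "talks", "talking", "shout", "shouts", "shouting",
    "cheer", "cheers", "cheering", "yell", "yells", "yelling",
}

MIN_DIALOGUE_WORDS = 15  # hypothetical global minimum


@dataclass
class Shot:
    shot_id: str
    audio_tag: Optional[str]  # one of AUDIO_TAGS, or None if untagged
    motion_prompt: str
    dialogue: str = ""


def lint_shot(shot):
    """Return a list of human-readable problems found in one shot."""
    problems = []
    if shot.audio_tag not in AUDIO_TAGS:
        problems.append("missing or unknown audio tag")
    if shot.audio_tag == "VO":
        # VO-safety check: no speech verbs in the motion prompt.
        words = set(re.findall(r"[a-z']+", shot.motion_prompt.lower()))
        hits = sorted(words & SPEECH_VERBS)
        if hits:
            problems.append("VO shot contains speech verbs: " + ", ".join(hits))
    if shot.audio_tag in {"DIALOGUE", "COMPOUND"}:
        count = len(shot.dialogue.split())
        if count < MIN_DIALOGUE_WORDS:
            problems.append(
                f"dialogue has {count} words, below minimum of {MIN_DIALOGUE_WORDS}"
            )
    return problems
```

For example, lint_shot(Shot("X1", "VO", "The crowd is cheering")) would report the speech verb, while an untagged shot would be reported as missing its audio tag. A keyword check like this only screens the prompt text. Since Veo can deviate from the prompt, it complements frame inspection rather than replacing it.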