What Went Well
- Reference-to-Reference Chaining worked extremely well. Using the composite character_sheet.png as the single character reference, combined with setting references, produced consistent characters across 62 storyboard frames and 31 video clips. The lion-instead-of-badger crest was the only drift, and because it was consistent throughout, it ended up a net positive.
- from-frames video generation was the right call. Using storyboard start+end frames as first/last keyframes for Veo 3.1 gave us direct control over the shot composition. 28/31 shots succeeded on the first batch run.
- Parallel generation saved significant time. Running start frames for different shots in parallel (where reference chains allowed it) and delegating the batch video run to a qa-helper agent kept overall wall-clock time manageable.
- verify-dailies caught nothing because the pipeline was clean. 32/32 checks passed with zero failures. The manifest-driven approach made verification trivial.
- Tone anchors prevented genre drift. Including “NOT noir, NOT moody, NOT dark” and “comedic, slapstick, Pixar-like warmth” in every single prompt kept the model from defaulting to drama. Zero frames needed re-generation for tone issues.
- Team communication was efficient. Shared files as source of truth, batch status updates, and clear handoffs between steps kept coordination overhead low.
What Didn’t Go Well
- Safety filter on Shot 9 (finger guns). The storyboard image of Craig pointing finger guns at an armored knight triggered Veo’s safety classifier. Both from-frames and from-image with that image failed. Required falling back to pure text-to-video, which meant losing the visual continuity from the storyboard keyframes for that one shot.
- API transient failures on Shots 9b and 22. Two shots failed with generic “no videos were generated” errors that were resolved by simple retries. Shot 22 needed a second retry via from-image. No clear root cause — likely API-side load issues.
- mograph agent template didn’t exist. The playbook references a “Motion Graphics Agent” but no mograph template was available. The qa-helper fallback produced static PNGs but couldn’t convert them to video. Had to take over and do the ffmpeg conversion manually.
- Context window compaction hit mid-storyboard. Lost context during Scene 4 storyboard generation and had to reconstruct state from the compaction summary. The generation pattern was mechanical enough that recovery was smooth, but it’s a risk for more complex tasks.
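The retry-then-fall-back sequence described in the bullets above (from-frames, then from-image, then pure text-to-video) could be scripted roughly as follows. This is a minimal sketch, not the script used in production: run_mode is a hypothetical stand-in for the real generation call, whose actual command-line flags are not shown here, and the retry budget of 2 per mode is an assumption.

```shell
#!/bin/sh
# Hypothetical stand-in for the real video generation call.
# $1 is the generation mode, $2 is the shot id; return 0 on success.
# Here it simulates a shot where only text-to-video succeeds.
run_mode() {
    [ "$1" = "text-to-video" ]
}

# Try each mode in order of visual fidelity, retrying a mode a few
# times before falling back to the next. Prints the mode that worked.
generate_with_fallback() {
    shot=$1
    max_tries=2   # assumed retry budget per mode
    for mode in from-frames from-image text-to-video; do
        try=1
        while [ "$try" -le "$max_tries" ]; do
            if run_mode "$mode" "$shot"; then
                echo "$mode"
                return 0
            fi
            try=$((try + 1))
        done
    done
    echo "failed"
    return 1
}

generate_with_fallback shot_09
```

A safety-filter rejection is presumably deterministic for a given image, so a real script might skip the retries for that error class and fall back immediately, reserving retries for transient API errors.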
Failure Modes & Bottlenecks
- Batch video generation was the biggest wall-clock bottleneck. 31 shots × ~75s each = ~40 minutes of pure generation time. Delegating to the qa-helper was the right move to protect context, but the overall pipeline was serialized (each shot waited for the previous one). Parallel video generation would cut this dramatically if the API supports it.
- GCS URI requirement for from-frames was unexpected. The from-frames command requires gs:// URIs, not local paths (from-image apparently has the same requirement, unlike the image generation tool, which accepts local paths). Had to upload all 62 storyboard frames to GCS before video generation could begin. This should be documented more prominently.
- Bash arithmetic syntax incompatibility. The generation script used ((FAILED++)), which failed in the qa-helper’s shell. The helper fixed it to FAILED=$((FAILED+1)). A minor issue, but a reminder to write POSIX-compatible scripts when delegating to other agents.
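The portable counter pattern looks like the sketch below. The retrospective does not record why ((FAILED++)) failed. Two common causes are a POSIX sh such as dash, which has no (( )) arithmetic command, and bash running under set -e, where ((FAILED++)) returns a non-zero status when FAILED is 0 and aborts the script. The $(( )) assignment form avoids both. The shot names and the generate_shot stub are hypothetical.

```shell
#!/bin/sh
# Hypothetical stub standing in for the real generation command.
# It fails for two of the shots so the counter has something to count.
generate_shot() {
    case $1 in
        shot_09|shot_22) return 1 ;;
        *) return 0 ;;
    esac
}

FAILED=0
for shot in shot_08 shot_09 shot_10 shot_22; do
    if ! generate_shot "$shot"; then
        # POSIX arithmetic expansion: works in sh, dash, and bash,
        # and the assignment itself returns status 0 even when the
        # old value is 0, so it is safe under set -e.
        FAILED=$((FAILED + 1))
    fi
done

echo "failed shots: $FAILED"
```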
Key Decisions Made
- Accepted lion crest instead of badger. The model consistently generated a lion rampant instead of a rampant badger on Reginald’s surcoat. Re-generating would have broken the reference chain across all 5 Reginald images. Since it was consistent, we accepted it — the visual joke works regardless of which animal is on the crest.
- Used from-frames over from-image for video generation. Could have used from-image (start frame only) or pure text-to-video. from-frames gave us maximum control over both the start and end state of each shot, directly leveraging the storyboard work.
- Generated all clips at 8s with 0 extends. Since all planned shots were 4-8s, an 8s base clip with the Overhang Principle provided sufficient material for every shot without the complexity and consistency risk of extends.
- Sanitized Shot 9 to text-to-video rather than re-generating the storyboard frame. Re-generating a “safe” storyboard frame and then running from-frames would have been more visually consistent, but the time cost wasn’t justified for a 5s shot.
- Two-part music score. Lyria Pro generates ~2:30 max, but we needed ~3:50 of score. Generated two complementary halves with the editor choosing the crossfade point, rather than attempting a single track with an extend.
Suggestions for Improvement
- Add a mograph agent template. The playbook mandates starting a Motion Graphics Agent, but the template doesn’t exist. Either create the template or update the playbook to describe an alternative approach (e.g., using genmedia-image to generate title frames + ffmpeg conversion).
- Support local paths in genmedia-video from-frames. Having to upload storyboard frames to GCS before video generation adds an unnecessary step. The from-image command apparently also requires GCS, but the image generation tool happily takes local paths. Consistency would help.
- Parallel video generation in the batch script. The generation script runs shots sequentially. A future version could run 3-4 shots in parallel (respecting API rate limits) to cut wall-clock time by up to 3-4×.
- Pre-screen storyboard frames through the safety filter before video generation. A quick safety check on each storyboard frame before the batch run would identify problematic shots upfront, allowing prompt fixes before the long generation pass rather than discovering failures 15 minutes in.
- Document the VO content policy edge cases. “Torment” was flagged by the TTS safety filter. A list of known trigger words for the voice API would save trial-and-error time.
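The parallel batch run suggested above could be sketched with xargs -P, which both GNU and BSD xargs support. This assumes the API tolerates concurrent requests, which the retrospective leaves open. The concurrency of 3 and the shot list are illustrative, and echo stands in for the real per-shot generation command so that the example runs as written.

```shell
#!/bin/sh
# Number of shots to generate at once (assumed; tune to API rate limits).
PARALLEL=3

# Read shot ids from stdin, one per line, and run the given command
# once per id, with at most $PARALLEL invocations running at a time.
# The shot id is appended as the last argument of the command.
run_batch() {
    xargs -n 1 -P "$PARALLEL" "$@"
}

# Demo: echo is a placeholder for the real generation command.
# Completion order is not deterministic, so sort before inspecting.
RESULT=$(printf '%s\n' shot_01 shot_02 shot_03 shot_04 shot_05 |
    run_batch echo generated | sort)

echo "$RESULT"
```

With 31 shots at about 75s each, three workers would need 11 rounds, or roughly 14 minutes instead of the ~40 minutes of the serial run, provided rate limits do not throttle the requests. xargs exits non-zero if any invocation fails, so a wrapper can still detect a partially failed batch.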
Addendum: Reshoot Phase
A post-production review by an external analyst identified visual-audio conflicts (characters with open mouths during VO narration) and the lack of back-and-forth dialogue exchanges. Three targeted reshoots were executed:
What went well:
- The reshoot scope was surgically precise — 3 shots out of 31, with clear instructions and approved dialogue. Total reshoot turnaround was ~20 minutes (storyboard frames + GCS upload + video generation + TTS stems).
- The “frozen tableau” approach for Shot 16 elegantly solved the VO-over-crowd-shouting conflict. Prompting “frozen mid-action, silent awe, no open mouths” gave Veo clear direction.
- The compound dialogue shot (14b) added genuine back-and-forth comedy that was missing from the film. The analyst was right — isolated one-liners don’t carry the humor the way exchanges do.
What could improve:
- The visual-audio conflict should have been caught earlier. In retrospect, including “no one is speaking, mouths closed” in VO-tagged shot prompts during initial generation would have prevented the Shot 16 issue entirely. This should be a standard prompt directive for any shot tagged for narrator voiceover.
- The shortened VO stem for Shot 6 came out at 4.3s instead of the target 3s. TTS pace control is imprecise — the “brisk pace” prompt instruction only partially worked. The editor had to adjust compound timing to accommodate.
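Both prompt directives discussed in this document, the tone anchors included in every prompt and the proposed mouths-closed directive for VO-tagged shots, could be applied by one small prompt-building helper. This is a sketch under assumptions: the build_prompt function and its vo tag argument are hypothetical, and only the directive wording is taken from the production notes.

```shell
#!/bin/sh
# Directive text taken from the retrospective.
TONE_ANCHORS='NOT noir, NOT moody, NOT dark. Comedic, slapstick, Pixar-like warmth.'
VO_DIRECTIVE='No one is speaking, mouths closed.'

# build_prompt TAG DESCRIPTION
# TAG is "vo" for shots that play under narrator voiceover;
# any other value means no voiceover. (Hypothetical convention.)
build_prompt() {
    tag=$1
    description=$2
    # Tone anchors go into every prompt.
    prompt="$description $TONE_ANCHORS"
    # VO-tagged shots also get the mouths-closed directive.
    if [ "$tag" = "vo" ]; then
        prompt="$prompt $VO_DIRECTIVE"
    fi
    printf '%s\n' "$prompt"
}

build_prompt vo 'Crowd frozen mid-action, silent awe.'
build_prompt none 'Placeholder description for a shot without voiceover.'
```

Centralizing the directives in one helper means a new rule, such as the mouths-closed directive, reaches every prompt without editing each shot individually.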