What Went Well
- From-frames interpolation was the right call. Using start+end keyframe PNGs to drive Veo 3.1 gave far better continuity than text-to-video or single-image-to-video. The storyboard investment paid off — 39/44 shots succeeded on the first batch run.
- QA helper delegation protected the context window. Spawning single-purpose helpers for batch execution (5 photography shards, SFX helper, score helper, retry helper, extend helper) kept the main agent focused on creative decisions and troubleshooting. The “Rule of 10” sharding pattern worked well — each helper had a manageable scope.
- Retry/extend recovery was resilient. When 5 shots failed due to API timeouts and 14 extends failed during parallel execution, the sequential retry scripts recovered 100% without any prompt changes. The failures were purely transient.
- ffmpeg-based branding clips were cleaner than Veo. Using the embedded ffmpeg binary (/workspace/tools/bin/ffmpeg) to create static video clips from PNGs avoided the AI motion artifacts that Veo would have introduced on clean title card text. 36KB for an 8s static clip vs. megabytes of hallucinated motion.
- Editor collaboration was tight. The editor’s detailed specs (timeline architecture, audio spec, ducking plan, branding spec, verify-dailies reports) made handoffs clean. Shared files as source of truth, minimal back-and-forth.
- Lyria handled SFX surprisingly well. Using lyria-3-clip-preview for sound effects (bell dings, crashes, clock ticks) produced usable stems even though it is a music model. The prompts treated them as “soundscapes” and the results were editorially acceptable.
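The sharding behind the “Rule of 10” can be sketched as below. This is a minimal illustration, assuming the rule means "at most 10 items per helper"; these notes do not define it, and the shot IDs and the shard() name are hypothetical. Under that assumption, 44 shots fall into the same 5 shards the production used.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def shard(items: Sequence[T], max_per_shard: int = 10) -> List[List[T]]:
    """Split items into consecutive shards of at most max_per_shard entries.

    ASSUMPTION: "Rule of 10" is read here as a cap of 10 items per helper.
    """
    if max_per_shard < 1:
        raise ValueError("max_per_shard must be at least 1")
    return [
        list(items[i:i + max_per_shard])
        for i in range(0, len(items), max_per_shard)
    ]


# Hypothetical shot IDs; the real production used scene.shot labels such as 3.3.
shots = [f"shot_{n:02d}" for n in range(1, 45)]
shards = shard(shots)
shard_sizes = [len(s) for s in shards]  # [10, 10, 10, 10, 4]
```

Each shard would then be handed to one single-purpose helper, so no helper sees more than 10 shots.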
What Didn’t Go Well
- Scene 2 continuity was a slog. 13 frames across 9 shots needed regeneration for continuity issues (premature lobby damage, inconsistent furniture, wall color shifts). The fern-only vs. full-damage state distinction required careful per-frame attention. This should have been caught earlier in storyboard review, not after the first pass.
- Shot 3.3 end frame took 3 attempts. The flat pastel background problem persisted through 2 prompt iterations. Root cause was dual: (1) the prompt() helper’s TONE string included “pastel color palette”, which fought photorealistic backgrounds, and (2) the start-frame reference image anchored Veo’s style. Solution required both bypassing the helper AND swapping reference images. Debugging this cost ~20 minutes.
- 10 clips got double-extended to 22s. During the parallel 5-shard photography batch, some clips received extend operations twice. The editor handled this gracefully via source_in/source_out trimming, but it wasted API calls and could have caused confusion. Root cause: the extend helper didn’t check whether a clip was already extended before operating on it.
- No voice directory generated. The editor handled TTS voice generation independently. The handoff here was implicit rather than explicit — it would have been better to formally assign ownership of voice stems.
- False stall notifications from helpers. Multiple QA helpers (rho-qa-photo4, rho-qa-photo5, rho-qa-extend) triggered stall alerts during legitimate long-running generations. These alerts caused unnecessary context-switching to verify the helpers were still working.
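The missing guard behind the double-extend problem above could look roughly like this. It is a sketch, not the extend helper's actual code: the 8s base duration and 0.5s tolerance are assumed values, should_extend() and probe_duration() are hypothetical names, and the probe assumes an ffprobe binary is available, which these notes do not confirm.

```python
import subprocess


def should_extend(duration_s: float,
                  base_duration_s: float = 8.0,   # ASSUMPTION: base clip length
                  tolerance_s: float = 0.5) -> bool:  # ASSUMPTION: slack for encoder rounding
    """Return False when the clip is already longer than a base-length clip."""
    return duration_s <= base_duration_s + tolerance_s


def probe_duration(path: str, ffprobe: str = "ffprobe") -> float:
    """Read the container duration in seconds (assumes ffprobe is installed)."""
    result = subprocess.run(
        [ffprobe, "-v", "error",
         "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1",
         path],
        capture_output=True, text=True, check=True,
    )
    return float(result.stdout.strip())
```

An extend script would call should_extend(probe_duration(clip)) before each extend request and skip the clip when it returns False, so a clip that has already been extended is never extended a second time.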
Failure Modes & Bottlenecks
- API rate limiting during parallel execution. Running 5 photography shards simultaneously hit Veo API load limits, causing timeouts (the 5-minute from-frames timeout) and “high load” errors on 5 shots. Sequential retry resolved all failures. Lesson: 5 parallel shards is at the edge of what the API supports.
- Reference image style anchoring was non-obvious. Veo 3.1’s from-image generation heavily weights the visual style of reference images. A start frame with flat yellow backgrounds would produce flat yellow video regardless of the text prompt. This isn’t documented and was discovered empirically.
- TONE/ANTI_DRIFT helper conflict. The shared prompt() helper in storyboard config injected “pastel color palette” into TONE, which was correct for the Wes Anderson aesthetic on most shots but actively harmful when photorealistic backgrounds were needed. The helper should support a “skip TONE” override for edge cases.
- Context window pressure across sessions. This production spanned 2 full context windows. The compaction summary at the boundary was thorough but still required re-reading key files. Long productions benefit from more persistent state files (progress trackers, decision logs) rather than relying on conversation context.
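One way the TONE override could work is sketched here. The real prompt() helper's signature and the exact TONE and ANTI_DRIFT strings are not shown in these notes, so everything below (the constants, the tone_override parameter, the joining format) is a hypothetical stand-in for illustration only.

```python
from typing import Optional

# HYPOTHETICAL constants; the real strings live in the storyboard config.
TONE = "Wes Anderson aesthetic, pastel color palette"
ANTI_DRIFT = "keep characters, wardrobe, and set dressing consistent"


def prompt(action: str, tone_override: Optional[str] = None) -> str:
    """Build a shot prompt.

    tone_override=None keeps the default TONE.
    tone_override=""   skips TONE entirely.
    Any other string replaces TONE for this shot only.
    """
    tone = TONE if tone_override is None else tone_override
    parts = [action, tone, ANTI_DRIFT]
    return ". ".join(p for p in parts if p)  # drop empty segments


default_prompt = prompt("Chandelier sways above the lobby")
photoreal_prompt = prompt(
    "Chandelier crashes onto the lobby floor",
    tone_override="photorealistic background, natural lighting",
)
```

Keeping None as "use the default" means existing call sites behave exactly as before, while the empty string gives the “skip TONE” escape hatch. Per the style-anchoring finding, the reference images would still need to change as well.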
Key Decisions Made
- From-frames over single-image-to-video: Chose to generate both start and end keyframes for every shot, then use Veo from-frames interpolation. Alternative: generate only start frames and let Veo animate freely. The from-frames approach doubled the storyboard workload but dramatically improved shot-to-shot continuity. Correct decision.
- Parallel 5-shard photography over sequential: Split 44 shots across 5 helper agents for parallel execution. Alternative: run all 44 sequentially through one helper. The parallel approach finished in ~35 minutes vs. estimated ~3 hours sequential. Despite the 5 transient failures requiring retry, this was the right call.
- ffmpeg for branding over Veo: Used raw ffmpeg to create static video clips for title card and credits. Alternative: use genmedia.video_from_image(), which would have added AI-generated motion. Static branding needs static video; using Veo here would have been wrong.
- Lyria for SFX over separate Chirp model: Used lyria-3-clip-preview for all SFX stems since Chirp wasn’t available in the toolkit. Alternative: request Chirp access or try text-to-video with audio-only extraction. Lyria worked well enough for the sound design requirements.
- Sequential extends over parallel: After the parallel photography batch caused API load issues, ran all 14 failed extends sequentially through a single helper. Slower but 100% reliable. Correct decision given the observed failure pattern.
- “No Timeline Shift” for branding assembly: Endorsed the editor’s recommendation to concatenate branding clips after the main render rather than inserting them into the timeline (which would shift all 96 audio clip positions). Simpler, less error-prone.
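For reference, a static branding clip of the kind described above can be built from a PNG with a command along these lines. This is a sketch rather than the command actually run: the 24 fps rate, the libx264 encoder (which assumes the embedded build includes it), and the file names are assumptions; only the binary path and the 8s duration come from these notes.

```python
from typing import List

FFMPEG = "/workspace/tools/bin/ffmpeg"  # path reported in this retrospective


def static_clip_cmd(png: str, out: str,
                    seconds: float = 8.0,
                    fps: int = 24) -> List[str]:  # ASSUMPTION: 24 fps
    """Return an ffmpeg argument list that holds one image for `seconds`."""
    return [
        FFMPEG, "-y",            # overwrite the output if it exists
        "-loop", "1",            # input option: repeat the single image
        "-i", png,
        "-t", str(seconds),      # output option: stop after this duration
        "-r", str(fps),          # output frame rate
        "-c:v", "libx264",       # ASSUMPTION: encoder present in this build
        "-pix_fmt", "yuv420p",   # widely compatible pixel format
        out,
    ]


# Hypothetical file names.
cmd = static_clip_cmd("title_card.png", "title_card.mp4")
```

The list can be passed to subprocess.run(cmd, check=True). The resulting clip has no audio track, which may matter when concatenating it after the main render.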
Suggestions for Improvement
- Add an --already-extended check to extend scripts. The double-extend problem (10 clips at 22s) could be prevented by checking clip duration before extending. If duration > base_duration + tolerance, skip.
- Stall notification thresholds for generation helpers. Video and image generation can legitimately take 3-5 minutes per call. The stall detection threshold should be longer for agents running batch generation scripts, perhaps configurable per agent type.
- Make TONE overridable in storyboard config. The shared prompt() helper should accept an optional tone_override parameter for shots that need to deviate from the default Wes Anderson palette (e.g., photorealistic backgrounds for chandelier damage).
- Formal voice stem ownership assignment. The production pipeline should explicitly assign TTS voice generation to either tech lead or editor at the start of Step 6, not leave it implicit.
- Persistent progress tracker file. For multi-session productions, maintain a progress.json in the shared directory that tracks which shots/assets are complete, failed, or pending. This survives context compaction better than conversation history.
- Document Veo reference image style anchoring. Add a note to the playbook: “Veo from-image and from-frames heavily weight the visual style of input images. To change style, you must change the reference images, not just the text prompt.”
- ffmpeg path should be on PATH or documented. Discovering that ffmpeg existed at /workspace/tools/bin/ffmpeg required detective work through the genmedia-assemble binary. This should be documented in the tech stack or USAGE.md.
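A minimal sketch of the progress.json tracker suggested above, assuming a flat mapping from asset ID to one of the three states named in the suggestion. The function names and the schema are hypothetical; no such file existed in this production.

```python
import json
import os
import tempfile
from pathlib import Path

VALID_STATUSES = {"pending", "complete", "failed"}


def load_progress(path: Path) -> dict:
    """Return the tracked statuses, or an empty dict if no file exists yet."""
    if path.exists():
        return json.loads(path.read_text())
    return {}


def set_status(path: Path, asset_id: str, status: str) -> dict:
    """Record one asset's status and rewrite the file atomically."""
    if status not in VALID_STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    progress = load_progress(path)
    progress[asset_id] = status
    # Write to a temp file in the same directory, then swap it into place,
    # so a reader never sees a half-written progress.json.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(progress, handle, indent=2, sort_keys=True)
    os.replace(tmp_name, path)
    return progress
```

A helper would call set_status(Path("progress.json"), "3.3", "failed") after each attempt, and a fresh session would start from load_progress() instead of re-reading conversation history. The atomic swap does not protect against two helpers doing read-modify-write at the same moment; parallel shards would need one file per shard or a lock.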