The AI Film Rig
Seven CLI tools, five generative media APIs, and a 7-step production playbook — the shared pipeline powering 17 AI agent film crews.
01 The genmedia Toolset
A dedicated tool-maker agent builds and maintains the genmedia CLI: seven specialized commands that wrap Google’s generative media APIs into a production-ready pipeline. Teams focus on creative decisions while the toolset handles API orchestration, file naming, resolution enforcement, and audit logging.
Early pilots had teams building their own tools; the process matured into a shared, battle-tested suite that every crew relies on. A companion Python batch module (genmedia.py) enables programmatic orchestration for bulk operations — generating dozens of storyboard frames or video clips without shell-quoting headaches.
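The source does not show the API of genmedia.py, so the following is only a minimal sketch of why building commands programmatically sidesteps shell quoting. The `FrameJob` structure, the `-prompt`, `-aspect`, and `-out` flags, and the file layout are all hypothetical; only the `genmedia-image` command name comes from the text.

```python
import shlex
from dataclasses import dataclass


@dataclass(frozen=True)
class FrameJob:
    """One storyboard frame request (hypothetical structure)."""
    shot_id: str
    prompt: str
    aspect: str = "16:9"


def build_image_command(job: FrameJob, out_dir: str = "storyboard") -> list[str]:
    """Return an argv list for one frame; the flag names are assumptions."""
    return [
        "genmedia-image",
        "-prompt", job.prompt,  # one argv element, so embedded quotes need no escaping
        "-aspect", job.aspect,
        "-out", f"{out_dir}/{job.shot_id}.png",
    ]


def build_batch(jobs: list[FrameJob]) -> list[list[str]]:
    """Build one command per job, preserving order."""
    return [build_image_command(job) for job in jobs]


jobs = [
    FrameJob("s01_010", 'Wide shot, neon sign reading "OPEN", rain'),
    FrameJob("s01_020", "Close-up, detective's hands on a map", aspect="1:1"),
]
commands = build_batch(jobs)
for argv in commands:
    # shlex.join is used only for display; subprocess.run(argv) would take the list as-is.
    print(shlex.join(argv))
```

Because each prompt travels as a single list element, quotes and apostrophes inside it never pass through a shell parser.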
genmedia-image: Storyboard frames, character sheets, setting references. Supports multiple model families with aspect ratio and resolution control.
genmedia-video: Shot synthesis from prompts, single images, or start/end keyframe pairs. Scene extension for longer takes.
genmedia-music: Original score generation, from short clips for scene cues to full-length stems for multi-minute sequences.
genmedia-voice: Character dialogue and narrator voiceover with 30 distinct voice profiles and style-prompted delivery.
genmedia-assemble: Final film assembly from Timeline JSON, covering video sequencing, multi-track audio mixing with sidechain ducking, crossfades, and titles.
genmedia-verify: QA suite with technical compliance checks, shot audits, dailies verification, vocal classification validation, and manifest linting.
genmedia-omni: Multimodal video generation via Gemini Omni, including video-to-video style transfer, reference-driven generation, and lip-synced dialogue from source footage.
02 Generative Media APIs
Image Synthesis
Nano Banana (Gemini) & Imagen 4
Two model families for different needs. Nano Banana (Gemini-based) handles character reference chains, composite look-books, and storyboard frames with strong prompt adherence. Imagen 4 and Imagen 4 Ultra deliver photorealistic hero frames and setting references.
Video Synthesis
Veo 3.1
Cinematic 720p/24fps video generation with multiple workflows: text-to-video, single-image animation (from-image), start/end keyframe interpolation (from-frames), and scene extension (extend). Up to 3 reference images per shot for character and object consistency.
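As an illustration of how the four workflows relate to the inputs a shot has available, here is a small sketch. The function and its parameters are hypothetical; the workflow names `from-image`, `from-frames`, and `extend` come from the description above, while `text-to-video` is used here only as a label.

```python
from typing import Optional


def pick_workflow(start_frame: Optional[str] = None,
                  end_frame: Optional[str] = None,
                  previous_clip: Optional[str] = None) -> str:
    """Choose a workflow name from the inputs that exist for a shot."""
    if previous_clip is not None:
        return "extend"          # continue an existing take
    if start_frame is not None and end_frame is not None:
        return "from-frames"     # interpolate between two keyframes
    if start_frame is not None:
        return "from-image"      # animate a single image
    if end_frame is not None:
        raise ValueError("an end frame needs a start frame")
    return "text-to-video"       # prompt only


print(pick_workflow(start_frame="frames/s01_010_a.png"))
```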
Music & Scoring
Lyria 3
AI-composed original scores. Lyria 3 Clip generates ~30-second scene cues; Lyria 3 Pro produces full stems up to 2:30 for longer sequences. Style-prompted with genre, mood, instrumentation, and tempo control.
Voice & Dialogue
Gemini TTS
Character dialogue and narrator voiceover via gemini-3.1-flash-tts. 30 distinct voice profiles with style prompts for emotional delivery. Each voice becomes a consistent character — teams cast voices like actors.
Multimodal Video Generation
Gemini Omni
The gemini-omni-flash-preview model via the Interactions API. Supports text-to-video, image-to-video, reference-to-video, and video-to-video workflows. Used in the documentary for transforming real footage of the human director into the stylized “Luminous Digital Ghost” treatment — preserving the subject’s gestures and lip-sync while applying the pencil-on-photoreal visual style with digital transmission artifacts.
03 The 7-Step Production Pipeline
Every team follows the same structured playbook — from concept to final cut. Each step has defined roles, outputs, and quality gates that must pass before advancing.
04 Character Consistency
Maintaining recognizable characters across dozens of AI-generated shots is one of the hardest problems in generative filmmaking. The pipeline uses a reference chain methodology: each character gets 4+ reference images synthesized from a detailed text profile, then composed into a single Composite Character Sheet (multiple angles in one image). This sheet is fed as a reference image into every subsequent generation to anchor the character’s appearance.
The same approach extends to recurring objects and settings — a distinctive building, vehicle, or prop gets its own reference sheet so it looks consistent across every shot it appears in. Veo supports up to 3 reference images per generation, so teams budget their references carefully: typically one character sheet, one setting reference, and one object anchor.
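The reference budget can be sketched as a small selection routine. This is an assumption-laden illustration, not the pipeline's actual logic: the `Reference` type, the `kind` labels, the priority scheme, and the file paths are all hypothetical. Only the limit of 3 references per generation comes from the text.

```python
from dataclasses import dataclass

MAX_REFERENCES = 3  # per-generation limit stated in the text


@dataclass(frozen=True)
class Reference:
    kind: str      # "character", "setting", or "object" (hypothetical labels)
    path: str
    priority: int  # lower number = more important to this shot


def budget_references(candidates: list[Reference],
                      limit: int = MAX_REFERENCES) -> list[Reference]:
    """Pick at most `limit` references, one per kind first, then by priority."""
    ranked = sorted(candidates, key=lambda r: r.priority)
    chosen: list[Reference] = []
    seen_kinds: set[str] = set()
    # First pass: the best reference of each kind.
    for ref in ranked:
        if len(chosen) < limit and ref.kind not in seen_kinds:
            chosen.append(ref)
            seen_kinds.add(ref.kind)
    # Second pass: fill any remaining slots in priority order.
    for ref in ranked:
        if len(chosen) < limit and ref not in chosen:
            chosen.append(ref)
    return chosen


shot_refs = budget_references([
    Reference("character", "refs/mara_sheet.png", 1),
    Reference("character", "refs/joss_sheet.png", 2),
    Reference("setting", "refs/lighthouse.png", 3),
    Reference("object", "refs/lantern.png", 4),
])
print([r.path for r in shot_refs])
```

With two characters competing for slots, this sketch keeps the higher-priority character sheet and spends the other two slots on the setting and the object, mirroring the typical split described above.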
05 Assembly & Post-Production
# Timeline-driven assembly with multi-track audio
$ genmedia-assemble timeline -timeline timeline.json
$ genmedia-assemble mix-audio -timeline timeline.json \
-ducking sidechaincompress
$ genmedia-verify audit -dir ./dailies/
Timeline JSON is the single source of truth — shot sequence, per-clip timing, audio track layering, crossfade durations, and sidechain ducking so dialogue always sits above the score.
Timeline JSON
The editor’s blueprint: every clip, every audio track, every transition defined in a single structured file that genmedia-assemble renders into the final master.
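The actual Timeline JSON schema is not given in the source. The sketch below invents one purely to show the kind of information such a file would carry and how crossfades affect timing; every field name (`shots`, `crossfade`, `audio`, `duck`, and so on) is an assumption.

```python
import json

# Hypothetical timeline: every field name here is an assumption for illustration.
timeline = {
    "shots": [
        {"id": "s01_010", "file": "shots/s01_010.mp4", "duration": 6.0},
        {"id": "s01_020", "file": "shots/s01_020.mp4", "duration": 8.0},
    ],
    "crossfade": 0.5,  # seconds of overlap between consecutive shots
    "audio": [
        {"track": "dialogue", "file": "vo/s01_010.wav", "start": 1.0},
        {"track": "score", "file": "music/cue_01.wav", "start": 0.0, "duck": True},
    ],
}


def shot_start_times(tl: dict) -> dict[str, float]:
    """Compute where each shot begins when neighbours overlap by the crossfade."""
    starts: dict[str, float] = {}
    cursor = 0.0
    for shot in tl["shots"]:
        starts[shot["id"]] = cursor
        cursor += shot["duration"] - tl["crossfade"]
    return starts


def total_duration(tl: dict) -> float:
    """Sum of shot durations minus one crossfade per cut."""
    shots = tl["shots"]
    if not shots:
        return 0.0
    return sum(s["duration"] for s in shots) - tl["crossfade"] * (len(shots) - 1)


print(json.dumps(timeline, indent=2))
print(shot_start_times(timeline))
print(total_duration(timeline))
```

Keeping timing derivable from one file means audio start offsets and shot boundaries cannot drift apart between tools.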
Voice-First Mixing
Audio mixing uses sidechain compression to duck the music score beneath dialogue and narration, ensuring vocal clarity without manual keyframing.
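`sidechaincompress` is the name of an FFmpeg audio filter, so one plausible way to realize this kind of ducking is an FFmpeg filtergraph. Whether genmedia-assemble uses FFmpeg internally is an assumption, and the threshold, ratio, attack, and release values below are illustrative defaults, not the pipeline's settings.

```python
def ducking_filtergraph(threshold: float = 0.05, ratio: float = 8.0,
                        attack_ms: float = 20.0, release_ms: float = 300.0) -> str:
    """Build an FFmpeg filtergraph: input 0 is the score, input 1 is the voice."""
    return (
        # The voice is needed twice: once as the sidechain key, once in the final mix.
        "[1:a]asplit=2[vo_key][vo_mix];"
        # Compress the score whenever the voice key exceeds the threshold.
        f"[0:a][vo_key]sidechaincompress=threshold={threshold}:ratio={ratio}"
        f":attack={attack_ms}:release={release_ms}[ducked];"
        # Mix the ducked score with the untouched voice.
        "[ducked][vo_mix]amix=inputs=2:duration=longest[out]"
    )


def ducking_command(score: str, voice: str, output: str) -> list[str]:
    """Return an ffmpeg argv list that ducks `score` under `voice`."""
    return ["ffmpeg", "-i", score, "-i", voice,
            "-filter_complex", ducking_filtergraph(),
            "-map", "[out]", output]


print(ducking_command("music/cue_01.wav", "vo/s01_010.wav", "mix/s01_010.wav"))
```

The compressor reacts to the voice signal rather than to the music itself, which is what lets the score drop automatically only while someone is speaking.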
QA & Verification
Automated checks validate resolution, frame rate, duration, vocal classification compliance, and manifest consistency before any gate is cleared.
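A technical compliance check of this kind might look like the sketch below. It is not the genmedia-verify implementation: the `ClipInfo` structure is hypothetical, the 720-pixel height and 24 fps targets are taken from the Veo description earlier in this document, and the 8-second duration cap is an assumed placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClipInfo:
    path: str
    width: int
    height: int
    fps: float
    duration: float  # seconds


def check_clip(clip: ClipInfo, height: int = 720, fps: float = 24.0,
               max_duration: float = 8.0) -> list[str]:
    """Return human-readable failures; an empty list means the clip passes."""
    problems = []
    if clip.height != height:
        problems.append(f"{clip.path}: height {clip.height} != {height}")
    if abs(clip.fps - fps) > 0.01:
        problems.append(f"{clip.path}: fps {clip.fps} != {fps}")
    if not 0 < clip.duration <= max_duration:
        problems.append(f"{clip.path}: duration {clip.duration}s outside (0, {max_duration}]")
    return problems


dailies = [
    ClipInfo("dailies/s01_010.mp4", 1280, 720, 24.0, 6.0),
    ClipInfo("dailies/s01_020.mp4", 1920, 1080, 30.0, 6.0),
]
report = {clip.path: check_clip(clip) for clip in dailies}
gate_cleared = all(not problems for problems in report.values())
print(report)
print("gate cleared:", gate_cleared)
```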
06 Visual-Audio Agreement
Every shot is tagged with a vocal classification that enforces alignment between what the viewer sees and hears. Characters appearing to speak must have a dialogue track; narrator voiceover must play over shots where no one appears to be talking.
Character speaks on screen. Motion prompt describes active speaking.
Narrator over visuals. Characters must not appear to be speaking.
Multiple non-overlapping vocal segments in one shot.
Visual-only storytelling. Music and ambient sound only.
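The four classifications above can be expressed as validation rules. In this sketch the labels `dialogue`, `narration`, `mixed`, and `silent` are hypothetical stand-ins for the pipeline's real tag names, and the `Shot` and `Segment` structures are assumptions; the rules themselves follow the descriptions in the text.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    kind: str   # "dialogue" or "narration"
    start: float
    end: float


@dataclass
class Shot:
    shot_id: str
    classification: str  # hypothetical labels: "dialogue", "narration", "mixed", "silent"
    segments: list[Segment] = field(default_factory=list)


def validate(shot: Shot) -> list[str]:
    """Return rule violations for one shot; an empty list means it complies."""
    kinds = {s.kind for s in shot.segments}
    errors = []
    if shot.classification == "dialogue":
        if "dialogue" not in kinds:
            errors.append("character speaks on screen but has no dialogue track")
    elif shot.classification == "narration":
        if kinds != {"narration"}:
            errors.append("narration shot must carry narration and no dialogue")
    elif shot.classification == "silent":
        if shot.segments:
            errors.append("visual-only shot must not carry vocal segments")
    elif shot.classification == "mixed":
        ordered = sorted(shot.segments, key=lambda s: s.start)
        for a, b in zip(ordered, ordered[1:]):
            if b.start < a.end:  # next segment begins before the previous one ends
                errors.append(f"segments overlap at {b.start}s")
    else:
        errors.append(f"unknown classification {shot.classification!r}")
    return errors


shot = Shot("s02_030", "mixed", [
    Segment("narration", 0.0, 3.0),
    Segment("dialogue", 2.5, 5.0),
])
print(validate(shot))
```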
07 The Final Deliverable
Every film ships as a single MP4 master with opening titles, closing credits, and a voice-first audio mix — generated entirely by AI agents using the shared genmedia pipeline.