The AI Film Rig

Seven CLI tools, five generative media APIs, and a 7-step production playbook — the shared pipeline powering 17 AI agent film crews.

01 The genmedia Toolset

A dedicated tool-maker agent builds and maintains the genmedia CLI: seven specialized commands that wrap Google’s generative media APIs into a production-ready pipeline. Teams focus on creative decisions while the toolset handles API orchestration, file naming, resolution enforcement, and audit logging.

Early pilots had teams building their own tools; the process matured into a shared, battle-tested suite that every crew relies on. A companion Python batch module (genmedia.py) enables programmatic orchestration for bulk operations — generating dozens of storyboard frames or video clips without shell-quoting headaches.
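The interface of genmedia.py is not shown here, so the following is only a sketch of why programmatic batching avoids quoting problems: each job becomes an argument list rather than a shell string. The `FrameJob` type, the `build_argv` helper, and the `-prompt` and `-out` flags are hypothetical placeholders, not documented options of genmedia-image.

```python
import shlex
from dataclasses import dataclass


@dataclass(frozen=True)
class FrameJob:
    shot_id: str
    prompt: str


def build_argv(job: FrameJob, out_dir: str = "storyboard") -> list[str]:
    # Hypothetical flags: -prompt and -out are placeholders for illustration.
    return [
        "genmedia-image",
        "-prompt", job.prompt,
        "-out", f"{out_dir}/{job.shot_id}.png",
    ]


jobs = [
    FrameJob("s01_sh01", 'Wide shot: a lighthouse at dusk, "storm" clouds'),
    FrameJob("s01_sh02", "Close-up: keeper's hands on the lamp"),
]

commands = [build_argv(job) for job in jobs]

for argv in commands:
    # Dry run: print a shell-safe rendering instead of executing.
    # To run for real, pass the list to subprocess.run(argv, check=True).
    print(shlex.join(argv))
```

Because each prompt travels as a single list element, quotes and apostrophes inside it never need manual escaping.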

genmedia-image

Storyboard frames, character sheets, setting references. Supports multiple model families with aspect ratio and resolution control.

genmedia-video

Shot synthesis from prompts, single images, or start/end keyframe pairs. Scene extension for longer takes.

genmedia-music

Original score generation — short clips for scene cues or full-length stems for multi-minute sequences.

genmedia-voice

Character dialogue and narrator voiceover with 30 distinct voice profiles and style-prompted delivery.

genmedia-assemble

Final film assembly from Timeline JSON — video sequencing, multi-track audio mixing with sidechain ducking, crossfades, and titles.

genmedia-verify

QA suite: technical compliance checks, shot audits, dailies verification, vocal classification validation, and manifest linting.

genmedia-omni

Multimodal video generation via Gemini Omni — video-to-video style transfer, reference-driven generation, and lip-synced dialogue from source footage.

02 Generative Media APIs

Image Synthesis

Nano Banana (Gemini) & Imagen 4

Two model families for different needs. Nano Banana (Gemini-based) handles character reference chains, composite look-books, and storyboard frames with strong prompt adherence. Imagen 4 and Imagen 4 Ultra deliver photorealistic hero frames and setting references.

Video Synthesis

Veo 3.1

Cinematic 720p/24fps video generation with multiple workflows: text-to-video, single-image animation (from-image), start/end keyframe interpolation (from-frames), and scene extension (extend). Up to 3 reference images per shot for character and object consistency.

Music & Scoring

Lyria 3

AI-composed original scores. Lyria 3 Clip generates ~30-second scene cues; Lyria 3 Pro produces full stems up to 2:30 for longer sequences. Style-prompted with genre, mood, instrumentation, and tempo control.

Voice & Dialogue

Gemini TTS

Character dialogue and narrator voiceover via gemini-3.1-flash-tts. 30 distinct voice profiles with style prompts for emotional delivery. Each voice becomes a consistent character — teams cast voices like actors.

Multimodal Video Generation

Gemini Omni

The gemini-omni-flash-preview model via the Interactions API. Supports text-to-video, image-to-video, reference-to-video, and video-to-video workflows. Used in the documentary for transforming real footage of the human director into the stylized “Luminous Digital Ghost” treatment — preserving the subject’s gestures and lip-sync while applying the pencil-on-photoreal visual style with digital transmission artifacts.

03 The 7-Step Production Pipeline

Every team follows the same structured playbook — from concept to final cut. Each step has defined roles, outputs, and quality gates that must pass before advancing.

Step 1. Concept: story & design brief
Step 2. Beat Sheet: scenes, shots & vocal tags
Step 3. Characters: reference chains & look-books
Step 4. Storyboard: start & end keyframes
Step 5. Photography: video synthesis & dailies
Step 6. Soundstage: score, dialogue & timeline
Step 7. Final Cut: assembly, mix & master
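The gating rule can be pictured as a small state machine: a team moves to the next step only when the current step's quality gate passes. This is a minimal sketch; the `next_step` function is a hypothetical illustration, and the real gate criteria are not specified in this overview.

```python
STEPS = [
    "Concept",
    "Beat Sheet",
    "Characters",
    "Storyboard",
    "Photography",
    "Soundstage",
    "Final Cut",
]


def next_step(current: str, gate_passed: bool) -> str:
    """Return the step a team should be on after a gate review."""
    index = STEPS.index(current)  # raises ValueError for an unknown step
    if not gate_passed or index == len(STEPS) - 1:
        return current  # stay put: gate failed, or already at the last step
    return STEPS[index + 1]


print(next_step("Concept", gate_passed=True))
print(next_step("Storyboard", gate_passed=False))
```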

04 Character Consistency

Maintaining recognizable characters across dozens of AI-generated shots is one of the hardest problems in generative filmmaking. The pipeline uses a reference chain methodology: each character gets 4+ reference images synthesized from a detailed text profile, then composed into a single Composite Character Sheet (multiple angles in one image). This sheet is fed as a reference image into every subsequent generation to anchor the character’s appearance.

The same approach extends to recurring objects and settings — a distinctive building, vehicle, or prop gets its own reference sheet so it looks consistent across every shot it appears in. Veo supports up to 3 reference images per generation, so teams budget their references carefully: typically one character sheet, one setting reference, and one object anchor.
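The reference budget described above can be sketched as a simple selection rule: take the first candidate of each kind, in priority order, and never exceed three. The `pick_references` helper, the kind labels, and the file names are assumptions made for illustration.

```python
MAX_REFERENCES = 3  # per-generation limit stated for Veo above
PRIORITY = ("character", "setting", "object")


def pick_references(candidates: list[tuple[str, str]]) -> list[str]:
    """Choose reference images from (kind, path) pairs.

    Keeps the first candidate of each kind, in PRIORITY order,
    and caps the result at MAX_REFERENCES.
    """
    chosen = []
    for kind in PRIORITY:
        for candidate_kind, path in candidates:
            if candidate_kind == kind:
                chosen.append(path)
                break
    return chosen[:MAX_REFERENCES]


shot_candidates = [
    ("character", "mara_sheet.png"),
    ("character", "joss_sheet.png"),
    ("setting", "lighthouse.png"),
    ("object", "lamp.png"),
]
print(pick_references(shot_candidates))
```

A shot with two characters would need a different policy, for example dropping the object anchor in favor of the second character sheet.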

05 Assembly & Post-Production

# Timeline-driven assembly with multi-track audio
$ genmedia-assemble timeline -timeline timeline.json
$ genmedia-assemble mix-audio -timeline timeline.json \
    -ducking sidechaincompress
$ genmedia-verify audit -dir ./dailies/

Timeline JSON is the single source of truth. It defines the shot sequence, per-clip timing, audio track layering, crossfade durations, and the sidechain ducking that keeps dialogue above the score.

Timeline JSON

The editor’s blueprint: every clip, every audio track, every transition defined in a single structured file that genmedia-assemble renders into the final master.
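The actual Timeline JSON schema is not documented in this overview. As a rough sketch of the idea, the structure below uses invented field names (`clips`, `crossfade_in`, `audio_tracks`, `duck_under`) to show how one file can carry sequencing, timing, and audio layering, and how runtime follows from it.

```python
import json

# Hypothetical schema: every field name here is an assumption.
timeline = {
    "clips": [
        {"file": "s01_sh01.mp4", "duration": 6.0, "crossfade_in": 0.0},
        {"file": "s01_sh02.mp4", "duration": 8.0, "crossfade_in": 0.5},
        {"file": "s01_sh03.mp4", "duration": 4.0, "crossfade_in": 0.5},
    ],
    "audio_tracks": [
        {"role": "dialogue", "file": "s01_dialogue.wav", "start": 1.0},
        {"role": "score", "file": "s01_score.wav", "start": 0.0,
         "duck_under": "dialogue"},
    ],
}


def total_runtime(tl: dict) -> float:
    # A crossfade overlaps two clips, so it shortens the runtime by its length.
    return sum(c["duration"] - c["crossfade_in"] for c in tl["clips"])


print(json.dumps(timeline, indent=2))
print(total_runtime(timeline))
```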

Voice-First Mixing

Audio mixing uses sidechain compression to duck the music score beneath dialogue and narration, ensuring vocal clarity without manual keyframing.
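How genmedia-assemble implements ducking internally is not stated. The flag value `sidechaincompress` happens to match the name of an FFmpeg audio filter, so the sketch below assumes an FFmpeg backend and builds a filtergraph in which the voice track drives compression of the score. The helper names and the threshold, ratio, attack, and release values are illustrative assumptions, not the tool's settings.

```python
def ducking_filtergraph(threshold: float = 0.05, ratio: float = 8.0,
                        attack_ms: float = 20.0,
                        release_ms: float = 300.0) -> str:
    """Build an FFmpeg filtergraph: input 0 is the score, input 1 the voice."""
    return (
        # Split the voice: one copy keys the compressor, one goes to the mix.
        "[1:a]asplit=2[sc][voice];"
        # Compress the score whenever the sidechain (voice) exceeds threshold.
        f"[0:a][sc]sidechaincompress=threshold={threshold}:ratio={ratio}"
        f":attack={attack_ms}:release={release_ms}[ducked];"
        # Mix the ducked score back together with the voice.
        "[ducked][voice]amix=inputs=2:duration=longest[mix]"
    )


def ducking_command(score: str, voice: str, out: str) -> list[str]:
    return [
        "ffmpeg", "-i", score, "-i", voice,
        "-filter_complex", ducking_filtergraph(),
        "-map", "[mix]", out,
    ]


print(" ".join(ducking_command("score.wav", "voice.wav", "mix.wav")))
```

A lower threshold or higher ratio ducks the score harder; a longer release makes the music return more gradually after a line ends.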

QA & Verification

Automated checks validate resolution, frame rate, duration, vocal classification compliance, and manifest consistency before any gate is cleared.
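A technical compliance check of this kind might look like the sketch below, which compares probed metadata against the delivery spec of 1280×720 at 24 fps with a 3 to 5 minute runtime. The `meta` dictionary layout and the `compliance_failures` function are hypothetical; in practice the values would come from a media probe tool, and the runtime check applies to the final master rather than to individual dailies.

```python
SPEC = {"width": 1280, "height": 720, "fps": 24.0}
RUNTIME_RANGE_S = (180.0, 300.0)  # 3 to 5 minutes, the stated target runtime


def compliance_failures(meta: dict, check_runtime: bool = True) -> list[str]:
    """Return human-readable failures; an empty list means the file passes."""
    failures = []
    if (meta["width"], meta["height"]) != (SPEC["width"], SPEC["height"]):
        failures.append(f"resolution is {meta['width']}x{meta['height']}")
    if abs(meta["fps"] - SPEC["fps"]) > 0.001:
        failures.append(f"frame rate is {meta['fps']}")
    low, high = RUNTIME_RANGE_S
    if check_runtime and not low <= meta["duration_s"] <= high:
        failures.append(f"runtime is {meta['duration_s']}s")
    return failures


good = {"width": 1280, "height": 720, "fps": 24.0, "duration_s": 242.5}
bad = {"width": 1920, "height": 1080, "fps": 23.976, "duration_s": 95.0}
print(compliance_failures(good))
print(compliance_failures(bad))
```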

06 Visual-Audio Agreement

Every shot is tagged with a vocal classification that enforces alignment between what the viewer sees and hears. Characters appearing to speak must have a dialogue track; narrator voiceover must play over shots where no one appears to be talking.

[DIALOGUE]

Character speaks on screen. Motion prompt describes active speaking.

[VO]

Narrator over visuals. Characters must not appear to be speaking.

[SEQUENCED]

Multiple non-overlapping vocal segments in one shot.

[SILENT]

Visual-only storytelling. Music and ambient sound only.
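The four tags translate naturally into validation rules. The sketch below is one possible reading of them; the `Shot` and `Segment` types, the `on_screen_speech` flag, and the `tag_violations` function are assumptions for illustration and do not reflect how genmedia-verify represents shots.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    kind: str  # "dialogue" or "narration"
    start: float
    end: float


@dataclass
class Shot:
    tag: str
    on_screen_speech: bool
    segments: list[Segment] = field(default_factory=list)


def tag_violations(shot: Shot) -> list[str]:
    kinds = {s.kind for s in shot.segments}
    problems = []
    if shot.tag == "DIALOGUE":
        if "dialogue" not in kinds:
            problems.append("DIALOGUE shot has no dialogue track")
    elif shot.tag == "VO":
        if "narration" not in kinds:
            problems.append("VO shot has no narration track")
        if shot.on_screen_speech:
            problems.append("VO shot shows a character speaking")
    elif shot.tag == "SEQUENCED":
        ordered = sorted(shot.segments, key=lambda s: s.start)
        for earlier, later in zip(ordered, ordered[1:]):
            if later.start < earlier.end:
                problems.append("SEQUENCED segments overlap")
                break
    elif shot.tag == "SILENT":
        if shot.segments:
            problems.append("SILENT shot has a vocal track")
    else:
        problems.append(f"unknown tag {shot.tag!r}")
    return problems


vo_shot = Shot("VO", on_screen_speech=True,
               segments=[Segment("narration", 0.0, 4.0)])
print(tag_violations(vo_shot))
```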

07 The Final Deliverable

Resolution (strict): 1280×720
Frame rate: 24 fps
Target runtime: 3–5 min

Every film ships as a single MP4 master with opening titles, closing credits, and a voice-first audio mix — generated entirely by AI agents using the shared genmedia pipeline.