Xi Team — API & Model Recommendations for “Meltdown”

Author: xi-techlead (DP)
Date: 2026-05-19

Optimal model selections for the bright claymation aesthetic.

Image Generation (Reference Chains & Storyboard)

Use Case	Model	Rationale
Character references	`gemini-3-pro-image-preview` (Nano Banana Pro)	Supports `--reference-image` chaining — essential for building the 4-image reference chain per character. Best quality for anchoring the claymation look.
Setting references	`gemini-3-pro-image-preview`	Same reasoning — environment anchoring via reference images.
Storyboard frames	`gemini-3-pro-image-preview`	Start/End frames need reference injection for character + setting consistency.
Quick iterations	`gemini-3.1-flash-image-preview` (Nano Banana 2)	Faster; use for exploratory prompts or retries.

Aspect ratio: 16:9 for all frames. Generating intermediate frames above 720p is encouraged.

Video Generation (Principal Photography)

Use Case	Model	Rationale
Primary shooting	`veo-3.1-fast-generate-001` (Veo 3.1 Fast)	Default. 4/6/8s durations. Supports `from-frames` (start + end frame interpolation) — critical for storyboard-to-video pipeline. Generates audio alongside visuals.
Hero shots	`veo-3.1-generate-001` (Veo 3.1)	Higher quality for key moments. Same capabilities as Fast (durations, `from-frames`, audio).
Extend operations	`veo-3.1-lite-generate-001` (Veo 3.1 Lite)	Only model that supports extend. Each extend adds exactly 7 seconds.
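
For shot planning, the extend arithmetic can be sketched as below. This is a minimal sketch, not a call into any Veo API: it assumes a base clip of 4, 6, or 8 seconds and that each extend adds exactly 7 seconds, as stated in the table. The function name `extends_needed` is hypothetical.

```python
import math

EXTEND_STEP_S = 7  # each extend adds exactly 7 seconds (per the table above)


def extends_needed(base_s: int, target_s: float) -> tuple[int, int]:
    """Return (number of extends, resulting total seconds) to reach target_s.

    Assumption: the final clip may overshoot the target and be trimmed in edit.
    """
    if target_s <= base_s:
        return 0, base_s
    count = math.ceil((target_s - base_s) / EXTEND_STEP_S)
    return count, base_s + count * EXTEND_STEP_S


if __name__ == "__main__":
    # An 8s base clip stretched to cover a 20s beat.
    print(extends_needed(8, 20))
```

Because the step is fixed, totals land only on base + 7n seconds, so most targets will overshoot and need trimming.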

Audio prompting: Include ambient/SFX descriptions (squishing clay sounds, alarm ringing, refrigerator hum) directly in the Veo prompt. Generate with audio enabled for [VO] and [SILENT] shots; for [DIALOGUE] shots, consider `--generate-audio=false` if we’ll overlay TTS.
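
The per-shot audio rule above could be encoded as a small helper so the shot list drives the flag. This is a sketch under assumptions: the function name `generate_audio_for` and the `overlay_tts` parameter are hypothetical, and the memo only says to "consider" disabling audio for [DIALOGUE] shots, so treat the dialogue branch as a default, not a rule.

```python
def generate_audio_for(shot_tag: str, overlay_tts: bool = True) -> bool:
    """Decide whether a shot should be generated with audio enabled.

    [VO] and [SILENT] shots keep generated audio (ambient/SFX from the prompt).
    [DIALOGUE] shots drop generated audio only when TTS will be overlaid.
    """
    tag = shot_tag.strip().upper()
    if tag in ("[VO]", "[SILENT]"):
        return True
    if tag == "[DIALOGUE]":
        return not overlay_tts
    raise ValueError(f"Unknown shot tag: {shot_tag!r}")


if __name__ == "__main__":
    for tag in ("[VO]", "[SILENT]", "[DIALOGUE]"):
        print(tag, generate_audio_for(tag))
```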

Music (Score)

Use Case	Model	Rationale
Score segments (~30s)	`lyria-3-clip-preview` (Lyria 3 Clip)	Quick ~30s clips for per-scene scoring.
Extended score (~2:30)	`lyria-3-pro-preview` (Lyria 3 Pro)	Longer arcs spanning multiple scenes. Preferred for the main score to avoid stitching artifacts.

Genre direction for prompts: Whimsical, bright, playful orchestral — xylophones, plucked strings, light percussion. Think Aardman/Wallace & Gromit score. Avoid anything dark, brooding, or electronic.

Voice/TTS (Narration)

Use Case	Model	Rationale
Narrator VO	`gemini-3.1-flash-tts-preview`	Only TTS option. 800-char limit per call — split longer narration into segments.
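
One way to respect the 800-character limit is to split narration at sentence boundaries before sending each segment to TTS. The sketch below only does the text splitting and makes no TTS calls; the function name `split_narration` is hypothetical, and it assumes the limit counts characters as Python's `len()` does.

```python
import re

MAX_CHARS = 800  # per-call limit stated in the table above


def split_narration(text: str, limit: int = MAX_CHARS) -> list[str]:
    """Greedily pack whole sentences into segments of at most `limit` characters."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    segments: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        # A single sentence longer than the limit is wrapped on a word boundary,
        # or cut hard at the limit if it contains no spaces.
        while len(sentence) > limit:
            cut = sentence.rfind(" ", 0, limit + 1)
            if cut <= 0:
                cut = limit
            if current:
                segments.append(current)
                current = ""
            segments.append(sentence[:cut].rstrip())
            sentence = sentence[cut:].lstrip()
        candidate = f"{current} {sentence}".strip() if current else sentence
        if len(candidate) <= limit:
            current = candidate
        else:
            segments.append(current)
            current = sentence
    if current:
        segments.append(current)
    return segments


if __name__ == "__main__":
    script = " ".join(f"The clay figure wobbles through take {i}." for i in range(60))
    for i, segment in enumerate(split_narration(script)):
        print(i, len(segment))
```

Splitting on sentence boundaries keeps each call's prosody self-contained, which should make the joins between segments less audible than mid-sentence cuts.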

Voice selection: Recommend testing Fenrir (warm, measured) or Zephyr (lighter, more playful) for the narrator. The tone should be warm and amused — a storyteller enjoying a silly tale, not a serious documentarian.

TTS timing buffer: Per the playbook mandate, add 20% to the video duration of every dialogue or narration shot.
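
A worked sketch of the buffer arithmetic, assuming the 20% is applied to the measured narration length and the result is rounded up to the nearest duration listed for Veo (4, 6, or 8 seconds). That rounding rule and the function name `pick_clip_duration` are assumptions for illustration, not part of the playbook.

```python
ALLOWED_DURATIONS_S = (4, 6, 8)  # clip lengths listed in the video table
TTS_BUFFER = 0.20                # 20% padding on dialogue/narration shots


def pick_clip_duration(narration_s: float):
    """Return the shortest allowed clip that fits narration plus buffer.

    Returns None when even the longest clip is too short, meaning the shot
    needs an extend or the narration needs to be split.
    """
    needed = narration_s * (1 + TTS_BUFFER)
    for duration in ALLOWED_DURATIONS_S:
        if duration >= needed - 1e-9:  # small tolerance for float rounding
            return duration
    return None


if __name__ == "__main__":
    for seconds in (3.0, 5.0, 6.5, 7.0):
        print(seconds, pick_clip_duration(seconds))
```

Under these assumptions, any narration segment longer than about 6.6 seconds will not fit a single 8-second clip once the buffer is added.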

Key Technical Notes for This Concept