Xi Team — API & Model Recommendations for “Meltdown”
Author: xi-techlead (DP)
Date: 2026-05-19
Recommended model selections for the bright claymation aesthetic.
Image Generation (Reference Chains & Storyboard)
| Use Case | Model | Rationale |
|---|---|---|
| Character references | gemini-3-pro-image-preview (Nano Banana Pro) | Supports --reference-image chaining — essential for building the 4-image reference chain per character. Best quality for anchoring the claymation look. |
| Setting references | gemini-3-pro-image-preview | Same reasoning — environment anchoring via reference images. |
| Storyboard frames | gemini-3-pro-image-preview | Start/End frames need reference injection for character + setting consistency. |
| Quick iterations | gemini-3.1-flash-image-preview (Nano Banana 2) | Faster, for exploratory prompts or retries. |
Aspect ratio: 16:9 for all frames. Intermediate frames above 720p are encouraged.
Video Generation (Principal Photography)
| Use Case | Model | Rationale |
|---|---|---|
| Primary shooting | veo-3.1-fast-generate-001 (Veo 3.1 Fast) | Default. 4/6/8s durations. Supports from-frames (start + end frame interpolation) — critical for storyboard-to-video pipeline. Generates audio alongside visuals. |
| Hero shots | veo-3.1-generate-001 (Veo 3.1) | Higher quality for key moments. Same capabilities as Fast. |
| Extend operations | veo-3.1-lite-generate-001 (Veo 3.1 Lite) | Only model that supports extend. Each extend adds exactly 7 seconds. |
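
The 7-second extend increment makes shot length a small arithmetic problem. The sketch below is illustrative only: `plan_extends` is a hypothetical helper, not part of any Veo tooling, and it assumes the base clip length is known and that every extend adds exactly 7 seconds as stated in the table.

```python
import math
from typing import Tuple

EXTEND_SECONDS = 7  # from the table: each extend adds exactly 7 seconds


def plan_extends(target_seconds: float, base_seconds: float = 8) -> Tuple[int, float]:
    """Return (extend_count, total_seconds) needed to cover target_seconds.

    Hypothetical planning helper. base_seconds is the length of the clip
    being extended (assumed here to default to an 8s base clip).
    """
    if base_seconds <= 0:
        raise ValueError("base_seconds must be positive")
    if target_seconds <= base_seconds:
        return 0, base_seconds
    # Round up: a partial extend is not possible, so overshoot and trim in edit.
    extends = math.ceil((target_seconds - base_seconds) / EXTEND_SECONDS)
    return extends, base_seconds + extends * EXTEND_SECONDS


if __name__ == "__main__":
    count, total = plan_extends(20, base_seconds=8)
    print(f"{count} extends, {total}s total")
```

Because the increment is fixed, the total usually overshoots the target; the surplus would need to be trimmed in the edit.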
Audio prompting: Include ambient/SFX descriptions (squishing clay sounds, alarm ringing, refrigerator hum) directly in the Veo prompt. Generate with audio enabled for [VO] and [SILENT] shots; for [DIALOGUE] shots, consider --generate-audio=false if we’ll overlay TTS.
Music (Score)
| Use Case | Model | Rationale |
|---|---|---|
| Score segments (~30s) | lyria-3-clip-preview (Lyria 3 Clip) | Quick ~30s clips for per-scene scoring. |
| Extended score (~2:30) | lyria-3-pro-preview (Lyria 3 Pro) | Longer arcs spanning multiple scenes. Preferred for the main score to avoid stitching artifacts. |
Genre direction for prompts: Whimsical, bright, playful orchestral — xylophones, plucked strings, light percussion. Think Aardman/Wallace & Gromit score. Avoid anything dark, brooding, or electronic.
Voice/TTS (Narration)
| Use Case | Model | Rationale |
|---|---|---|
| Narrator VO | gemini-3.1-flash-tts-preview | Only TTS option. 800-char limit per call — split longer narration into segments. |
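
One way to respect the 800-character limit is to split narration at sentence boundaries before sending it to TTS. This is a minimal sketch, not the team's actual tooling: `split_narration` is a hypothetical helper, and it assumes the limit counts plain characters of the text passed per call.

```python
import re
from typing import List

MAX_CHARS = 800  # per-call limit noted in the table above


def split_narration(text: str, limit: int = MAX_CHARS) -> List[str]:
    """Split narration into segments of at most `limit` characters.

    Prefers sentence boundaries; a single sentence longer than the limit
    is hard-wrapped at the last space that fits (or mid-word if none).
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    segments: List[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        # Hard-wrap any sentence that cannot fit in one call on its own.
        while len(sentence) > limit:
            if current:
                segments.append(current)
                current = ""
            cut = sentence.rfind(" ", 0, limit + 1)
            if cut <= 0:
                cut = limit
            segments.append(sentence[:cut].rstrip())
            sentence = sentence[cut:].lstrip()
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= limit:
            current = candidate
        else:
            segments.append(current)
            current = sentence
    if current:
        segments.append(current)
    return segments
```

Each returned segment can then be sent as its own TTS call and the resulting audio concatenated in order.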
Voice selection: Recommend testing Fenrir (warm, measured) or Zephyr (lighter, more playful) for the narrator. The tone should be warm and amused — a storyteller enjoying a silly tale, not a serious documentarian.
TTS timing buffer: add 20% to the video duration of every dialogue/narration shot, per playbook mandate.
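
The buffer rule can be worked through as simple arithmetic. The sketch below assumes the 20% is applied to the measured narration length and that the result must fit one of the 4/6/8s clip durations listed for Veo 3.1 Fast; both function names are hypothetical.

```python
from typing import Optional

BUFFER = 0.20               # playbook mandate: 20% extra
CLIP_DURATIONS = (4, 6, 8)  # durations listed for Veo 3.1 Fast


def buffered_seconds(vo_seconds: float) -> float:
    """Narration length plus the 20% timing buffer."""
    return vo_seconds * (1 + BUFFER)


def pick_clip_duration(vo_seconds: float) -> Optional[int]:
    """Shortest listed clip duration that covers the buffered narration.

    Returns None when even the longest clip is too short, in which case
    the narration would need to be split or the shot extended.
    """
    needed = buffered_seconds(vo_seconds)
    for duration in CLIP_DURATIONS:
        if duration >= needed:
            return duration
    return None


if __name__ == "__main__":
    for vo in (3.0, 4.0, 6.5, 7.0):
        print(vo, "->", pick_clip_duration(vo))
```

For example, a 4.0s narration line needs 4.8s of video, so it no longer fits a 4s clip and moves up to 6s.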
Key Technical Notes for This Concept
- Claymation is our ally. Model artifacts (wobbly textures, slight inconsistencies) read as genre-authentic fingerprints, not errors. Lean into this.
- The melting effect is a progressive transformation. Best approach: generate a Start Frame (intact marshmallow) and an End Frame (partially melted) per shot, then use from-frames to interpolate the melt. The video model should handle continuous deformation well.
- Bright, warm, high-key lighting must be encoded in every prompt. Include tone anchors: “tactile, claymation, whimsical, hand-made, bright” per the Tone Contract.
- No lip-sync needed. The design brief explicitly bans spoken dialogue. The entire vocal track is narrator VO; characters are physically expressive but keep their mouths closed. This simplifies video generation significantly.
- Reference budget per shot: 1 character sheet (marshmallow man) + 1 setting reference (kitchen) = 2 refs, leaving 1 slot for a second character (alarm clock) or object reference (fridge).
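
The reference budget in the last note can be checked mechanically before a shot is submitted. This is a sketch under stated assumptions: the cap of 3 references per shot is inferred from "2 refs, leaving 1 slot", and `ShotRefs`, `check_budget`, and the file names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

MAX_REFS_PER_SHOT = 3  # inferred from the note: 2 refs used + 1 slot left


@dataclass
class ShotRefs:
    """Reference images attached to one shot (paths are placeholders)."""
    characters: List[str] = field(default_factory=list)
    settings: List[str] = field(default_factory=list)
    objects: List[str] = field(default_factory=list)

    def used(self) -> int:
        return len(self.characters) + len(self.settings) + len(self.objects)

    def remaining(self) -> int:
        return MAX_REFS_PER_SHOT - self.used()


def check_budget(shot: ShotRefs) -> None:
    """Raise ValueError if the shot carries more references than the cap."""
    if shot.remaining() < 0:
        raise ValueError(
            f"{shot.used()} references exceed the cap of {MAX_REFS_PER_SHOT}"
        )


if __name__ == "__main__":
    shot = ShotRefs(characters=["marshmallow_man.png"], settings=["kitchen.png"])
    print("slots left:", shot.remaining())
    shot.characters.append("alarm_clock.png")
    check_budget(shot)  # passes: exactly at the cap
```

A shot that needs both the alarm clock and the fridge would exceed the cap, so one of them has to be carried by the prompt text or by the start frame instead.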