Mu Team — Step 1 Technical Feasibility Assessment

Author: mu-techlead (Director of Photography)
Date: 2026-05-18

Concept Summary

Two photo-realistic matchbox toy cars (Zip and Rusty) traverse a hand-drawn pencil-sketch world. Happy adventure genre. Cars “speak” by opening/closing their hoods.

Generatability Assessment

1. Mixed-Media Visual Style — FEASIBLE (Medium Risk)

The Ask: Photo-realistic die-cast toy cars composited onto hand-drawn pencil sketch backgrounds in a single frame.

Assessment: This mixed-media look is achievable with careful prompting. Generative image models (especially Gemini/Nano Banana Pro) handle style fusion well when given explicit instructions. The key is prompt specificity:

✅ “Miniature die-cast toy car sitting on top of hand-drawn pencil sketch paper, mixed media, macro photography” — this framing (real object on drawn surface) gives the model a physically plausible interpretation rather than asking it to arbitrarily mix two render styles.
✅ Reference chaining with --reference-image will anchor the car designs across all generations.
⚠️ Risk: The model may homogenize the two styles across successive generations (making the pencil parts too realistic or the cars too illustrated). Tone anchors in every prompt and aggressive re-prompting will be needed to counter this.

Optimal API: gemini-3-pro-image-preview (Nano Banana Pro) for all image generation. It supports --reference-image chaining (critical for character consistency) and handles style-mixing better than Imagen for this use case.

2. Hood-Open Lip-Sync — HIGH RISK (Needs Early Testing)

The Ask: Cars speak by mechanically opening/closing their hoods in sync with dialogue. Use Veo native audio to auto-sync this.

Assessment: This is the single riskiest element in the production.

⚠️ Veo’s native audio sync appears to be tuned for human faces, where speech maps onto mouth and jaw movement. A car hood is geometrically different, so there’s no guarantee Veo will map speech to hood motion.
⚠️ Even if prompted with “the car’s hood opens and closes like a mouth as it speaks,” the model may: (a) ignore the instruction entirely, (b) add a human face to the car, or (c) produce random hood flickering unrelated to speech cadence.
✅ Fallback strategy if native sync fails:
1. Generate car video with hood animation as a visual prompt (no audio dependency) — e.g., “the car’s hood pops open and snaps shut repeatedly in a rhythmic chattering motion”
2. Generate dialogue via TTS (genmedia-voice) separately
3. Overlay in post with genmedia-assemble combine
4. The timing won’t be phoneme-perfect, but for toy cars with mechanical hoods, approximate sync reads as charming rather than uncanny.
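Step 2 of the fallback sends dialogue to TTS, which this assessment caps at 800 characters per call. The following is a minimal sketch of how long lines might be split at sentence boundaries before each call. The function names are hypothetical and are not part of genmedia-voice; only the 800-character figure comes from this document.

```python
import re

TTS_CHAR_LIMIT = 800  # per-call limit stated in this assessment


def _split_long_sentence(sentence: str, limit: int) -> list[str]:
    """Word-level fallback for a single sentence longer than the limit."""
    parts: list[str] = []
    current = ""
    for word in sentence.split():
        # Pathological case: hard-cut a single token that exceeds the limit.
        while len(word) > limit:
            if current:
                parts.append(current)
                current = ""
            parts.append(word[:limit])
            word = word[limit:]
        candidate = f"{current} {word}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            parts.append(current)
            current = word
    if current:
        parts.append(current)
    return parts


def split_for_tts(text: str, limit: int = TTS_CHAR_LIMIT) -> list[str]:
    """Split text into chunks of at most `limit` characters, preferring sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if len(sentence) <= limit:
            pieces = [sentence]
        else:
            pieces = _split_long_sentence(sentence, limit)
        for piece in pieces:
            candidate = f"{current} {piece}".strip()
            if len(candidate) <= limit:
                current = candidate
            else:
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk would then be sent as its own TTS call, with the resulting audio files concatenated in order during assembly.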

Recommendation: We MUST run a quick proof-of-concept test in Step 3 — generate a single 4s clip of a toy car with its hood opening/closing while “speaking” to validate the mechanism before committing to the full pipeline.

3. Character Consistency — FEASIBLE (Standard Pipeline)

The Ask: Two distinct toy cars (yellow vintage sports car, blue battered pickup truck) that look the same across all shots.

Assessment: This is well-suited to the reference-chain workflow:

Generate headshot → body sheet → scene tests → composite character sheet for each car
Use character_sheet.png as the primary reference in all subsequent generations
Toy cars are actually easier to keep consistent than human characters: rigid bodies with no facial features leave less to drift
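The reference-chain workflow above can be sketched as a loop in which each stage receives the outputs of earlier stages as references. This is only an illustration under assumptions: the `generate` callable is a hypothetical stand-in for whatever actually invokes the image tool, and passing every earlier output (rather than only the most recent one) as a reference is a guess, not a documented requirement.

```python
from typing import Callable

# Stage names taken from the workflow described in this section.
STAGES = ("headshot", "body sheet", "scene test", "composite character sheet")


def run_reference_chain(
    car_description: str,
    generate: Callable[[str, list[str]], str],
) -> list[str]:
    """Run each stage in order, feeding earlier outputs back in as references.

    `generate(prompt, reference_paths)` is a hypothetical wrapper that returns
    the path of the image it produced.
    """
    outputs: list[str] = []
    for stage in STAGES:
        prompt = f"{car_description}, {stage}"
        # Copy the list so the callable cannot mutate the chain's state.
        outputs.append(generate(prompt, list(outputs)))
    return outputs
```

The last path returned corresponds to the composite character sheet, which would be saved as character_sheet.png and used as the primary reference from then on.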

Optimal API: gemini-3-pro-image-preview with --reference-image chaining.

4. Anthropomorphism Drift — MEDIUM RISK

The Ask: Cars should look like real die-cast toys. No faces, no eyes, no limbs.

Assessment: Models have a strong bias toward anthropomorphizing objects, especially when they’re described as “speaking” or having personalities. We’ll need:

Aggressive negative prompting: “no eyes, no face, no limbs, no anthropomorphic features, inanimate die-cast metal toy”
The hood-open mechanic helps: it gives the model a specific physical mechanism for speech rather than leaving it to invent one (which would likely default to googly eyes)
Monitor every generation for drift

5. Genre Drift Prevention — STANDARD

The Ask: Joyful, optimistic, warm, playful.

Assessment: The design brief tone anchors are solid. Every prompt must include: “Joyful, bright, warm sunlight, optimistic, childlike wonder, colorful pencil sketch background, tangible macro photography of toy cars, playful, bright.” Without these, the model will default to moody/dramatic/noir.
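Because the anchors must appear in every prompt, it may help to append them programmatically rather than by hand. The sketch below is one possible approach, not an established part of the pipeline: `build_prompt` is a hypothetical helper, and placing the anti-anthropomorphism phrase from section 4 in the main prompt text (rather than in a separate negative-prompt field) is an assumption.

```python
# Tone anchors quoted from the design brief in this section.
TONE_ANCHORS = (
    "Joyful, bright, warm sunlight, optimistic, childlike wonder, "
    "colorful pencil sketch background, tangible macro photography of toy cars, "
    "playful, bright."
)

# Negative phrasing quoted from section 4 (Anthropomorphism Drift).
ANTI_ANTHROPOMORPHISM = (
    "no eyes, no face, no limbs, no anthropomorphic features, "
    "inanimate die-cast metal toy"
)


def build_prompt(shot_description: str) -> str:
    """Append the tone anchors and the anti-anthropomorphism phrase if missing."""
    prompt = shot_description.strip()
    if not prompt.endswith((".", "!", "?")):
        prompt += "."
    if TONE_ANCHORS not in prompt:
        prompt += " " + TONE_ANCHORS
    if ANTI_ANTHROPOMORPHISM not in prompt:
        prompt += " " + ANTI_ANTHROPOMORPHISM + "."
    return prompt
```

The helper is idempotent, so running it over a prompt that already contains the anchors leaves the prompt unchanged.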

Optimal API Map

Asset Type	Tool	Model	Notes
Character refs & storyboard frames	`genmedia-image generate`	Nano Banana Pro (default)	Reference chaining for consistency
High-fidelity object refs	`genmedia-image imagen`	Imagen 4	Use for isolated car renders if Nano Banana struggles with photo-realism
Principal photography (video)	`genmedia-video from-image` / `generate`	Veo 3.1 (`veo-3.1-generate-001`)	Audio enabled for dialogue shots; `--generate-audio=false` for VO shots
Video extension	`genmedia-video extend`	Veo 3.1 Lite (`veo-3.1-lite-generate-001`)	Only model supporting extend
Frame interpolation	`genmedia-video from-frames`	Veo 3.1	Start/end frame storyboard animation
Narrator voiceover	`genmedia-voice generate`	Gemini 3.1 Flash TTS	Voice TBD — suggest Fenrir or Orus for warm narrator
Car dialogue (if TTS fallback)	`genmedia-voice generate`	Gemini 3.1 Flash TTS	Zip: a bright/quick voice; Rusty: a slower/gruffer voice
Score/music	`genmedia-music generate`	Lyria 3 Pro	2:30 tracks for score arcs; Lyria 3 Clip for 30s stingers
Assembly	`genmedia-assemble`	N/A (FFmpeg)	Timeline-based assembly for precise control

Key Technical Constraints to Feed Back to Idea Person

Max 2 characters per shot (playbook mandate) — works perfectly for Zip + Rusty buddy dynamic.
Max 3 reference images per Veo shot — budget: 1 car character sheet + 1 setting ref + 1 second car OR object ref.
Video durations: Base clips 4/6/8s. Extensions add exactly 7s each. Plan shot durations around these increments + 4s overhang.
TTS text limit: 800 characters per call. Long narration lines must be split.
Hood lip-sync is unproven — the story should not depend on perfect phoneme-level sync. Write dialogue that works even with approximate mechanical movement.
16:9 aspect ratio throughout. Final output 1280x720.
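The per-shot limits above lend themselves to an automated check on the shot list. The following is a minimal sketch, assuming a hypothetical `ShotPlan` record; the numeric limits come from this section, while the field names, the `max_extensions` cap, and the choice to ignore the 4s overhang are illustrative assumptions.

```python
from dataclasses import dataclass

BASE_DURATIONS = (4, 6, 8)   # seconds, base clip lengths
EXTENSION_SECONDS = 7        # each extension adds exactly 7s
MAX_CHARACTERS = 2           # playbook mandate
MAX_REFERENCE_IMAGES = 3     # per Veo shot


@dataclass
class ShotPlan:
    """Hypothetical description of one planned shot."""
    name: str
    characters: list[str]
    reference_images: list[str]
    target_seconds: float


def plan_duration(target_seconds: float, max_extensions: int = 3) -> tuple[int, int, int]:
    """Return (base, extensions, total) for the shortest reachable clip >= target."""
    options = sorted(
        (base + n * EXTENSION_SECONDS, base, n)
        for n in range(max_extensions + 1)
        for base in BASE_DURATIONS
    )
    for total, base, n in options:
        if total >= target_seconds:
            return base, n, total
    raise ValueError(f"{target_seconds}s is not reachable with {max_extensions} extensions")


def validate_shot(shot: ShotPlan) -> list[str]:
    """Return a list of constraint violations; an empty list means the shot fits."""
    problems = []
    if len(shot.characters) > MAX_CHARACTERS:
        problems.append(f"{shot.name}: more than {MAX_CHARACTERS} characters")
    if len(shot.reference_images) > MAX_REFERENCE_IMAGES:
        problems.append(f"{shot.name}: more than {MAX_REFERENCE_IMAGES} reference images")
    return problems
```

For example, a 10-second beat is not directly reachable, so `plan_duration(10)` picks a 4s base clip plus one extension for an 11s total, leaving one second to trim in assembly.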

Verdict

The concept is generatable. The mixed-media style is well-suited to prompt engineering, and the two-character structure keeps complexity manageable. The main risk is the hood lip-sync mechanism — we’ll validate it early in Step 3 and have a clear fallback (TTS overlay). The happy adventure tone requires active prompt engineering to prevent genre drift, but the tone anchors in the design brief are strong.

Ready to proceed to Step 2 once the Idea Person delivers high_concept.md.
