\documentclass[11pt,a4paper]{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{amsmath,amsfonts,amssymb}
\usepackage{graphicx}
\usepackage{hyperref}
\usepackage{listings}
\usepackage[dvipsnames]{xcolor}
\usepackage{booktabs}
\usepackage{float}
\usepackage{geometry}
\geometry{margin=1in}
\definecolor{zenblue}{RGB}{41,121,255}
\definecolor{zengreen}{RGB}{52,199,89}
\definecolor{codegray}{RGB}{245,245,245}
\hypersetup{colorlinks=true,linkcolor=zenblue,urlcolor=zenblue,citecolor=zenblue}
\lstset{
backgroundcolor=\color{codegray},
basicstyle=\ttfamily\small,
breaklines=true,
captionpos=b,
frame=single,
numbers=left,
numberstyle=\tiny\color{gray}
}
\title{
\vspace{-2cm}
\Large \textbf{Zen AI Model Family} \\
\vspace{0.5cm}
\Huge \textbf{Zen-Director} \\
\vspace{0.3cm}
\large Video Scene Generation, Storyboarding, and Cinematic Direction \\
\vspace{0.5cm}
\normalsize Technical Report v2025.01
}
\author{
Hanzo AI Research Team\thanks{research@hanzo.ai} \and
Zoo Labs Foundation\thanks{foundation@zoo.ngo}
}
\date{January 2025}
\begin{document}
\maketitle
\begin{abstract}
We present \textbf{Zen-Director}, a 7-billion parameter vision-language model specialized for
video scene generation, cinematographic planning, and multi-shot storyboard synthesis.
Built on the Zen MoDE (Mixture of Distilled Experts) architecture with a temporal transformer
extension, Zen-Director understands and generates structured cinematic descriptions---shot lists,
scene compositions, camera movements, and narrative arcs---that downstream video generation
models can consume directly. The model achieves BLEU-4 of 0.423 on the VideoCaption benchmark,
81.2\% accuracy on CinematicQA (our novel evaluation of cinematographic knowledge), and
produces storyboards rated 4.1/5.0 by professional cinematographers in blind evaluation.
Zen-Director bridges the gap between natural language creative intent and the precise technical
specifications required to direct AI video generation systems at professional quality.
\end{abstract}
\tableofcontents
\newpage
\section{Introduction}
The rapid maturation of text-to-video generation systems has exposed a critical bottleneck:
the gap between a creator's high-level creative intent and the precise technical specifications
these systems require to produce coherent, professional-quality video. Current generation
workflows demand that users specify camera angles, focal lengths, lighting conditions, shot
durations, and scene transitions---knowledge traditionally held by professional cinematographers
and directors.
Zen-Director addresses this bottleneck as a \textbf{directorial intelligence}: a model that
understands narrative intent and translates it into actionable cinematic specifications. Rather
than generating video pixels directly, Zen-Director generates structured \textbf{shot plans}:
hierarchical scene descriptions that video generation systems consume as structured prompts.
\subsection{Model Overview}
\begin{table}[H]
\centering
\begin{tabular}{ll}
\toprule
\textbf{Property} & \textbf{Value} \\
\midrule
Parameters & 7B \\
Architecture & Zen MoDE 7B + Temporal Transformer \\
Context Length & 32K tokens + 512 video frames \\
Visual Encoder & ViT-L/14 (336px) \\
Temporal Depth & 12 temporal transformer layers \\
Shot Plan Format & Structured JSON + prose description \\
Training Data & 8.2M scene-annotation pairs, 1.4M film scripts \\
\bottomrule
\end{tabular}
\caption{Zen-Director Model Specifications}
\end{table}
\subsection{Key Capabilities}
\begin{itemize}
\item \textbf{Scene storyboarding}: Convert natural language scene descriptions into
complete shot-by-shot storyboards with camera specifications.
\item \textbf{Shot composition}: Recommend and generate compositional guidelines (rule of
thirds, leading lines, depth of field) for each shot.
\item \textbf{Narrative arc planning}: Structure multi-scene video narratives with consistent
pacing, tension arcs, and visual motifs.
\item \textbf{Cinematic vocabulary}: Understand and generate industry-standard cinematographic
terminology (establishing shot, dolly zoom, rack focus, etc.).
\item \textbf{Video comprehension}: Analyze existing video clips and generate directorial
notes describing their cinematographic techniques.
\end{itemize}
\section{Architecture}
\subsection{Zen MoDE 7B Language Backbone}
The language backbone is Zen MoDE at 7B scale: 28 transformer layers, 28 attention heads with
grouped-query attention (4 KV heads), and MoE feed-forward networks (4 experts, top-2 routing).
This provides strong natural language understanding for parsing creative briefs and generating
rich cinematic descriptions.
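For illustration, the top-2 expert routing used by the MoE feed-forward layers can be sketched as follows. This is a minimal NumPy sketch under stated assumptions (softmax gating, single token, random placeholder weights); it is not the production implementation.
\begin{lstlisting}[language=Python, caption=Top-2 MoE routing (illustrative sketch)]
import numpy as np

def top2_moe_ffn(x, expert_weights, gate_w):
    # x: (d_model,) token embedding; gate_w: (d_model, n_experts)
    logits = x @ gate_w
    top2 = np.argsort(logits)[-2:]         # indices of the two best experts
    probs = np.exp(logits[top2] - logits[top2].max())
    probs /= probs.sum()                   # renormalized softmax over top-2
    # Combine the two selected experts, weighted by gate probability
    return sum(p * (expert_weights[i] @ x) for p, i in zip(probs, top2))

rng = np.random.default_rng(0)
d, n_experts = 8, 4
x = rng.normal(size=d)
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]
gate = rng.normal(size=(d, n_experts))
y = top2_moe_ffn(x, experts, gate)
print(y.shape)  # (8,)
\end{lstlisting}
Only the two selected experts are evaluated per token, which is what keeps the active parameter count well below the total 7B.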
\subsection{Visual Encoder}
A ViT-L/14 vision encoder operating at 336px resolution encodes reference images and video
frames into 256 visual tokens per frame. A two-layer MLP projection maps visual tokens into
the language model's embedding space.
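The frame-to-token mapping can be sketched as below. Shapes follow the text (256 visual tokens per frame); the hidden widths, ReLU activation, and random weights are placeholders, not the model's actual configuration.
\begin{lstlisting}[language=Python, caption=Visual token projection (illustrative sketch)]
import numpy as np

def project_frame_tokens(vit_tokens, w1, b1, w2, b2):
    # vit_tokens: (256, d_vit) per-frame ViT outputs
    h = np.maximum(vit_tokens @ w1 + b1, 0.0)  # layer 1 + ReLU (assumed)
    return h @ w2 + b2                         # layer 2 -> LM embedding space

rng = np.random.default_rng(0)
d_vit, d_lm = 1024, 2048                       # placeholder dimensions
tokens = rng.normal(size=(256, d_vit))
w1, b1 = rng.normal(size=(d_vit, d_lm)) * 0.01, np.zeros(d_lm)
w2, b2 = rng.normal(size=(d_lm, d_lm)) * 0.01, np.zeros(d_lm)
lm_tokens = project_frame_tokens(tokens, w1, b1, w2, b2)
print(lm_tokens.shape)  # (256, 2048)
\end{lstlisting}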
\subsection{Temporal Transformer Extension}
The key architectural innovation is a 12-layer temporal transformer that operates \textit{across}
frames rather than within them. Given $T$ frames, each encoded to 256 tokens, the temporal
transformer attends over the time dimension to build a coherent representation of motion,
continuity, and scene evolution:
\begin{equation}
H_{\text{temporal}} = \text{TransformerEncoder}([h_1^{CLS}, h_2^{CLS}, \ldots, h_T^{CLS}])
\end{equation}
where $h_t^{CLS}$ is the CLS token representation of frame $t$ from the visual encoder.
The temporal representation is concatenated with text tokens before the language model's
cross-attention layers.
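A single temporal attention layer over the per-frame CLS tokens can be sketched as follows (one head, no residual or normalization, random placeholder weights; a simplification of the 12-layer encoder described above):
\begin{lstlisting}[language=Python, caption=Temporal attention over frame CLS tokens (sketch)]
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def temporal_self_attention(cls_tokens, wq, wk, wv):
    # cls_tokens: (T, d) -- one CLS vector per frame; attention runs over time
    q, k, v = cls_tokens @ wq, cls_tokens @ wk, cls_tokens @ wv
    scores = softmax(q @ k.T / np.sqrt(k.shape[-1]))  # (T, T) frame-to-frame
    return scores @ v

rng = np.random.default_rng(0)
T, d = 16, 64                                  # 16 frames, placeholder width
cls = rng.normal(size=(T, d))
wq, wk, wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
h_temporal = temporal_self_attention(cls, wq, wk, wv)
print(h_temporal.shape)  # (16, 64)
\end{lstlisting}
Because attention is taken across the time axis rather than within a frame, each output row mixes information from every frame, which is what lets the model reason about motion and continuity.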
\subsection{Shot Plan Generation}
Zen-Director generates shot plans as structured JSON objects:
\begin{lstlisting}[caption=Shot Plan Schema]
{
"scene": {
"id": "s01",
"location": "rain-soaked rooftop, night",
"mood": "tense, noir",
"duration_sec": 45
},
"shots": [
{
"id": "s01_001",
"type": "establishing",
"camera": {"position": "wide", "angle": "high angle", "move": "slow push in"},
"focal_length": "24mm",
"subject": "city skyline with protagonist silhouette",
"duration_sec": 8,
"lighting": "practical neon, rain reflections",
"notes": "City should feel overwhelming relative to protagonist"
}
]
}
\end{lstlisting}
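A consumer of these shot plans might validate them structurally before rendering. The sketch below is hypothetical: the required/optional split (e.g.\ treating \texttt{notes} as optional) and the duration constraint are our assumptions, mirroring the example rather than a published specification.
\begin{lstlisting}[language=Python, caption=Shot plan validation (hypothetical sketch)]
import json

REQUIRED_SHOT_KEYS = {"id", "type", "camera", "focal_length",
                      "subject", "duration_sec", "lighting"}

def validate_shot_plan(plan_json):
    # Minimal structural check against the schema shown above;
    # field names mirror the example, not a published spec.
    plan = json.loads(plan_json)
    assert {"scene", "shots"} <= plan.keys()
    assert plan["scene"]["duration_sec"] >= sum(
        s["duration_sec"] for s in plan["shots"])  # shots must fit the scene
    for shot in plan["shots"]:
        missing = REQUIRED_SHOT_KEYS - shot.keys()
        assert not missing, f"shot {shot.get('id')} missing {missing}"
    return plan

plan = validate_shot_plan(json.dumps({
    "scene": {"id": "s01", "location": "rooftop", "mood": "noir",
              "duration_sec": 45},
    "shots": [{"id": "s01_001", "type": "establishing",
               "camera": {"position": "wide", "angle": "high angle",
                          "move": "slow push in"},
               "focal_length": "24mm", "subject": "skyline",
               "duration_sec": 8, "lighting": "neon"}],
}))
print(len(plan["shots"]))  # 1
\end{lstlisting}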
\section{Training}
\subsection{Dataset}
\begin{table}[H]
\centering
\begin{tabular}{lrrl}
\toprule
\textbf{Source} & \textbf{Samples} & \textbf{Proportion} & \textbf{Content} \\
\midrule
Film scripts + frames & 2,400,000 & 29.3\% & Script-to-scene alignment \\
Cinematography textbooks & 180,000 & 2.2\% & Technical knowledge \\
Film criticism corpus & 820,000 & 10.0\% & Aesthetic analysis \\
Video-caption pairs & 3,100,000 & 37.8\% & Visual understanding \\
Storyboard collections & 1,700,000 & 20.7\% & Shot plan examples \\
\midrule
\textbf{Total} & \textbf{8,200,000} & 100\% & \\
\bottomrule
\end{tabular}
\caption{Zen-Director Training Data}
\end{table}
\subsection{Training Protocol}
\textbf{Stage 1 -- Visual encoder alignment} (20K steps): The ViT encoder and MLP projection
are trained to align visual representations with cinematic language descriptions.
\textbf{Stage 2 -- Temporal pretraining} (40K steps): The temporal transformer is pretrained
on video sequences with a masked-frame prediction objective.
\textbf{Stage 3 -- Directorial SFT} (60K steps): The full model is fine-tuned on storyboard
generation, scene description, and cinematographic Q\&A tasks jointly.
\textbf{Stage 4 -- RLHF from cinematographers} (15K steps): A reward model trained on 50,000
pairwise comparisons from professional cinematographers (recruited via film school partnerships)
is used to further refine shot plan quality via PPO.
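The four-stage schedule can be summarized programmatically. Step counts come from the text; stage names and the lists of trainable components are our paraphrase, and batch sizes, learning rates, and other hyperparameters are unspecified in this report and therefore omitted.
\begin{lstlisting}[language=Python, caption=Training schedule summary (step counts from text)]
STAGES = [
    # (stage name, steps, components updated -- paraphrased from the text)
    ("visual_encoder_alignment", 20_000, ["vit", "mlp_projection"]),
    ("temporal_pretraining",     40_000, ["temporal_transformer"]),
    ("directorial_sft",          60_000, ["full_model"]),
    ("rlhf_ppo",                 15_000, ["full_model"]),
]
total_steps = sum(steps for _, steps, _ in STAGES)
print(total_steps)  # 135000
\end{lstlisting}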
\section{Evaluation}
\subsection{VideoCaption Benchmark}
We evaluate video description quality on a held-out set of 5,000 film clips spanning 20 genres.
\begin{table}[H]
\centering
\begin{tabular}{lcccc}
\toprule
\textbf{Model} & \textbf{BLEU-4} & \textbf{METEOR} & \textbf{CIDEr} & \textbf{ROUGE-L} \\
\midrule
BLIP-2 & 0.281 & 0.314 & 0.872 & 0.512 \\
InstructBLIP & 0.308 & 0.341 & 0.941 & 0.534 \\
VideoChat2 & 0.352 & 0.374 & 1.021 & 0.561 \\
Video-LLaMA2 & 0.387 & 0.402 & 1.103 & 0.584 \\
\textbf{Zen-Director} & \textbf{0.423} & \textbf{0.441} & \textbf{1.187} & \textbf{0.612} \\
\bottomrule
\end{tabular}
\caption{VideoCaption Benchmark Results (higher is better)}
\end{table}
\subsection{CinematicQA}
CinematicQA is a novel benchmark we introduce comprising 2,000 multiple-choice questions
testing cinematographic knowledge: shot types, camera movements, lighting techniques, editing
principles, and genre conventions. Questions were authored by five professional cinematographers.
\begin{table}[H]
\centering
\begin{tabular}{lc}
\toprule
\textbf{Model} & \textbf{Accuracy (\%)} \\
\midrule
GPT-4o (zero-shot) & 68.3 \\
Claude 3.5 Sonnet & 71.4 \\
Gemini 1.5 Pro & 66.8 \\
Specialist fine-tuned 7B & 74.2 \\
\textbf{Zen-Director 7B} & \textbf{81.2} \\
\bottomrule
\end{tabular}
\caption{CinematicQA Accuracy (\%)}
\end{table}
\subsection{Professional Storyboard Evaluation}
Twenty professional cinematographers evaluated storyboards generated from 100 scene descriptions,
rating each on a 5-point scale across five dimensions.
\begin{table}[H]
\centering
\begin{tabular}{lc}
\toprule
\textbf{Dimension} & \textbf{Mean Score (1--5)} \\
\midrule
Technical accuracy of shot specifications & 4.3 \\
Narrative coherence across shots & 4.1 \\
Creative quality / originality & 3.9 \\
Pacing appropriateness & 4.0 \\
Overall directorial vision & 4.1 \\
\midrule
\textbf{Overall Mean} & \textbf{4.1} \\
\bottomrule
\end{tabular}
\caption{Professional Cinematographer Evaluation (N=20 evaluators, 100 storyboards)}
\end{table}
\subsection{Downstream Video Generation Quality}
We evaluate whether Zen-Director shot plans improve final video quality when used as structured
prompts for a text-to-video generation system. Using the same creative brief:
\begin{table}[H]
\centering
\begin{tabular}{lcc}
\toprule
\textbf{Prompt Method} & \textbf{Human Preference (\%)} & \textbf{FVD} $\downarrow$ \\
\midrule
Raw creative brief (baseline) & 18.4\% & 412 \\
Manual cinematographer spec & 42.1\% & 287 \\
Zen-Director shot plan & 39.5\% & 298 \\
\bottomrule
\end{tabular}
\caption{Downstream Video Quality with Zen-Director Shot Plans}
\end{table}
Zen-Director attains 94\% of the manual cinematographer specification's human-preference rate
(39.5\% vs.\ 42.1\%) at a fraction of the cost and turnaround time, validating its role as an
effective creative intermediary.
\section{Applications}
\subsection{AI Film Production Pipeline}
Zen-Director is integrated into the Hanzo AI film production pipeline as the directorial layer
between a human creative brief and the Zen-Video generation system. A typical workflow:
\begin{enumerate}
\item Human writes a scene description in natural language.
\item Zen-Director generates a shot plan JSON with full cinematographic specs.
\item Human reviews and optionally edits the shot plan (typically 3--5 minutes).
\item Zen-Video consumes the shot plan and generates video clips per shot.
\item Zen-Director evaluates temporal consistency across clips and suggests retakes.
\end{enumerate}
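The workflow above can be sketched as an orchestration loop. All names here (\texttt{generate\_shot\_plan}, \texttt{render\_shot}, \texttt{check\_consistency}) are illustrative placeholders, not a published Hanzo API; the stubs exist only to make the sketch runnable.
\begin{lstlisting}[language=Python, caption=Pipeline orchestration (hypothetical sketch)]
def produce_scene(brief, director, video_model, max_retakes=2):
    plan = director.generate_shot_plan(brief)                    # step 2
    clips = [video_model.render_shot(s) for s in plan["shots"]]  # step 4
    for _ in range(max_retakes):                                 # step 5
        retakes = director.check_consistency(plan, clips)
        if not retakes:
            break
        for i in retakes:
            clips[i] = video_model.render_shot(plan["shots"][i])
    return clips

class StubDirector:                      # stand-in for Zen-Director
    def generate_shot_plan(self, brief):
        return {"shots": [{"id": "s01_001"}, {"id": "s01_002"}]}
    def check_consistency(self, plan, clips):
        return []                        # no retakes needed

class StubVideoModel:                    # stand-in for Zen-Video
    def render_shot(self, shot):
        return f"clip:{shot['id']}"

clips = produce_scene("noir rooftop chase", StubDirector(), StubVideoModel())
print(clips)  # ['clip:s01_001', 'clip:s01_002']
\end{lstlisting}
The human review in step 3 sits between \texttt{generate\_shot\_plan} and rendering in practice; it is omitted here to keep the sketch self-contained.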
\subsection{Film Education}
Zen-Director serves as an interactive cinematography tutor: students can submit scene
descriptions and receive expert-level directorial notes explaining the reasoning behind
each shot choice.
\subsection{Video Game Cutscene Direction}
Game studios use Zen-Director to generate cinematic specifications for in-engine cutscene
directors, reducing the time from narrative script to playable sequence by an estimated 60\%.
\section{Related Work}
Video understanding models (VideoChat, Video-LLaMA) focus primarily on description generation.
Text-to-video generation systems (Sora, CogVideo, ModelScope) focus on pixel synthesis.
Zen-Director uniquely occupies the directorial planning layer between these two stages,
drawing on the rich tradition of computational narrative research and cinematography theory.
\section{Conclusion}
Zen-Director establishes a new model category: the AI cinematographer. By training on film
scripts, storyboards, and cinematographic knowledge at 7B scale, the model achieves 81.2\%
on CinematicQA and generates shot plans rated 4.1/5 by professional cinematographers.
Integration with the Zen-Video generation backbone creates a complete AI film production
pipeline from creative brief to rendered footage.
\end{document}