\documentclass[11pt,a4paper]{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{amsmath,amsfonts,amssymb}
\usepackage{graphicx}
\usepackage{hyperref}
\usepackage{listings}
\usepackage[dvipsnames]{xcolor}
\usepackage{booktabs}
\usepackage{float}
\usepackage{geometry}
\geometry{margin=1in}
\definecolor{zenblue}{RGB}{41,121,255}
\definecolor{zengreen}{RGB}{52,199,89}
\definecolor{codegray}{RGB}{245,245,245}
\hypersetup{colorlinks=true,linkcolor=zenblue,urlcolor=zenblue,citecolor=zenblue}
\lstset{
backgroundcolor=\color{codegray},
basicstyle=\ttfamily\small,
breaklines=true,
captionpos=b,
frame=single,
numbers=left,
numberstyle=\tiny\color{gray}
}
\title{
\vspace{-2cm}
\Large \textbf{Zen AI Model Family} \\
\vspace{0.5cm}
\Huge \textbf{Zen-Foley} \\
\vspace{0.3cm}
\large Intelligent Audio Effects Generation and Video-Sound Synchronization \\
\vspace{0.5cm}
\normalsize Technical Report v2025.01
}
\author{
Hanzo AI Research Team\thanks{research@hanzo.ai} \and
Zoo Labs Foundation\thanks{foundation@zoo.ngo}
}
\date{January 2025}
\begin{document}
\maketitle
\begin{abstract}
We present \textbf{Zen-Foley}, a 1.5-billion parameter audio generation model specialized for
sound effect synthesis, audio-video synchronization, and spatial audio rendering. Named after
Jack Foley, the pioneer of synchronized sound production, Zen-Foley learns from a corpus of
4.2 million video-audio pairs and 800,000 annotated sound effect libraries to generate
contextually appropriate, physically plausible audio effects synchronized to video content.
The model achieves a mean average precision (mAP) of 0.432 on AudioSet classification,
produces sound effects rated 4.2/5.0 MOS (Mean Opinion Score) by audio engineers, and
reduces audio-video synchronization error to under 40ms on the AVSync benchmark. Zen-Foley
operates in three modes: video-conditioned generation (producing a soundtrack that matches visual
events), text-conditioned generation (producing sound from natural-language descriptions),
and spatial audio rendering (positioning sound sources in 3D space for immersive audio).
\end{abstract}
\tableofcontents
\newpage
\section{Introduction}
Professional audio post-production for video is a time-intensive craft. A single minute of
finished film can require hours of Foley work: recording the footsteps, clothing rustles, object
impacts, and environmental ambience that were not captured on set. Automation of this process
has long been sought but has remained elusive due to the complex physical and perceptual
relationships between visual events and their associated sounds.
Zen-Foley addresses this challenge through a generative model that is jointly trained on
visual and audio modalities, learning the mapping between physical events in video and their
acoustic signatures. The model operates at three levels of granularity:
\begin{enumerate}
\item \textbf{Event-level}: Generating a specific sound for a specific visual event
(a door slamming, footsteps on gravel, glass breaking).
\item \textbf{Scene-level}: Generating a complete ambient soundscape for a scene
(a busy city street, a quiet forest, a factory floor).
\item \textbf{Spatial}: Positioning sound sources in 3D space relative to camera
position, with appropriate reverberation and distance cues.
\end{enumerate}
\subsection{Model Overview}
\begin{table}[H]
\centering
\begin{tabular}{ll}
\toprule
\textbf{Property} & \textbf{Value} \\
\midrule
Parameters & 1.5B \\
Architecture & Conditional diffusion (latent audio) + visual cross-attention \\
Audio Representation & Mel spectrogram, 128 bins, 44.1kHz \\
Max Audio Duration & 30 seconds per generation \\
Spatial Audio & First-order Ambisonics (4-channel) \\
Latency (10s clip) & 2.1s on A10G \\
Training Data & 4.2M video-audio pairs, 800K SFX libraries \\
\bottomrule
\end{tabular}
\caption{Zen-Foley Model Specifications}
\end{table}
\section{Architecture}
\subsection{Audio Representation}
Zen-Foley operates in the latent space of a pre-trained audio VAE. The VAE encodes 44.1kHz
stereo audio into a compact latent representation at 50Hz temporal resolution with 64
latent channels. This reduces the dimensionality of audio generation by 43$\times$ relative
to raw waveform synthesis while preserving perceptually relevant features.
The VAE decoder converts latent representations back to mel spectrograms, which are then
vocoded to waveforms using a HiFi-GAN vocoder fine-tuned jointly with the main model.
\subsection{Visual Conditioning}
Video frames are encoded using a lightweight ViT-S/16 encoder at 224px resolution. For a
video of $T$ frames at 24fps, we sample 8 frames per second and encode each independently.
Temporal context is provided by a 4-layer temporal transformer that produces a sequence of
\textbf{event embeddings} aligned to the audio time axis.
The core conditioning mechanism uses cross-attention between audio latent tokens (at 50Hz)
and visual event embeddings (at 8fps), with interpolated alignment:
\begin{equation}
z_{\text{audio}}^t = \text{CrossAttend}(z_{\text{audio}}^t, \{e_v^{t'} : t - W \leq t' \leq t\})
\end{equation}
where $W = 0.5$\,s is a causal window: audio at time $t$ attends only to visual events from the
preceding $W$ seconds, so the model cannot use future visual information when generating audio.
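As a concrete illustration of this alignment, the causal window can be expressed as a boolean attention mask over timestamps. The sketch below (plain NumPy, not the model implementation) marks which visual embeddings each 50 Hz audio latent token may attend to:

```python
import numpy as np

def causal_window_mask(n_audio, n_visual, audio_hz=50.0, visual_hz=8.0, window_s=0.5):
    """True where audio token i may attend to visual embedding j."""
    t_audio = np.arange(n_audio) / audio_hz       # audio token timestamps (s)
    t_visual = np.arange(n_visual) / visual_hz    # visual embedding timestamps (s)
    dt = t_audio[:, None] - t_visual[None, :]     # how far in the past each visual event lies
    return (dt >= 0.0) & (dt <= window_s)         # only past events within the window

# 10-second clip: 500 audio latent tokens (50 Hz), 80 visual embeddings (8 fps)
mask = causal_window_mask(500, 80)
```

For example, the audio token at $t = 2.0$\,s attends to the visual embeddings from 1.5\,s through 2.0\,s and to nothing later.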
\subsection{Diffusion Backbone}
The generative backbone is a U-Net diffusion model operating on the 64-channel audio latent
space. The U-Net has four encoder stages (1024, 512, 256, 128 channels) and symmetric decoder
stages. Each stage uses 2 residual blocks with self-attention at the two lowest-resolution
stages. Conditioning is injected via:
\begin{itemize}
\item \textbf{Visual cross-attention}: In every attention layer.
\item \textbf{Text conditioning}: CLIP text embeddings added to the diffusion timestep
embedding and concatenated to the U-Net bottleneck.
\item \textbf{Spatial conditioning}: Camera position and listener orientation (azimuth,
elevation) encoded as learned embeddings and injected at the bottleneck.
\end{itemize}
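A minimal sketch of how the three signals might be combined at the bottleneck follows; the width, the number of bottleneck tokens, and the plain additions stand in for learned projections and are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512                                  # conditioning width (assumed)
timestep_emb = rng.normal(size=d)        # diffusion timestep embedding
text_emb = rng.normal(size=d)            # CLIP text embedding
spatial_emb = rng.normal(size=d)         # learned (azimuth, elevation) embedding
bottleneck = rng.normal(size=(16, d))    # U-Net bottleneck tokens (count assumed)

# Text conditioning is added to the timestep embedding, then broadcast
# over the bottleneck tokens ...
h = bottleneck + (timestep_emb + text_emb)[None, :]

# ... and the text and spatial embeddings are also appended to the
# bottleneck as extra tokens
h = np.concatenate([h, text_emb[None, :], spatial_emb[None, :]], axis=0)
```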
\subsection{Spatial Audio Rendering}
For spatial audio output, Zen-Foley generates four channels corresponding to First-order
Ambisonics (W, X, Y, Z). A source position network predicts the 3D trajectory of each sound
source from visual features, and the Ambisonic encoder applies appropriate gain patterns:
\begin{align}
W &= \frac{1}{\sqrt{2}} S \\
X &= S \cos\phi \cos\theta \\
Y &= S \cos\phi \sin\theta \\
Z &= S \sin\phi
\end{align}
where $S$ is the source signal, $\phi$ is elevation, and $\theta$ is azimuth.
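The gain pattern above can be applied per sample; a minimal NumPy sketch of a first-order (FuMa-normalized) encoder follows, with the function name chosen here for illustration:

```python
import numpy as np

def encode_foa(s, azimuth, elevation):
    """Encode a mono source into first-order Ambisonics (W, X, Y, Z).

    s is the mono signal; azimuth and elevation are in radians
    (scalars or per-sample arrays for moving sources).
    """
    w = s / np.sqrt(2.0)                            # omnidirectional component
    x = s * np.cos(elevation) * np.cos(azimuth)     # front-back
    y = s * np.cos(elevation) * np.sin(azimuth)     # left-right
    z = s * np.sin(elevation)                       # up-down
    return np.stack([w, x, y, z])

# A source directly ahead (azimuth 0, elevation 0) drives only W and X
t = np.linspace(0.0, 1.0, 44100)
foa = encode_foa(np.sin(2 * np.pi * 440.0 * t), azimuth=0.0, elevation=0.0)
```

Passing per-sample azimuth and elevation arrays renders the trajectory predicted by the source position network.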
\section{Training}
\subsection{Dataset}
\begin{table}[H]
\centering
\begin{tabular}{lrrl}
\toprule
\textbf{Source} & \textbf{Samples} & \textbf{Proportion} & \textbf{Description} \\
\midrule
AudioSet (video-audio) & 2,000,000 & 47.6\% & YouTube video clips, 527 classes \\
VGGSound & 200,000 & 4.8\% & Visual-audio correspondence \\
FreeSound + video sync & 400,000 & 9.5\% & SFX library aligned to video \\
BBC Sound Effects & 33,000 & 0.8\% & Professional SFX library \\
Film/TV Foley sessions & 800,000 & 19.0\% & Professional Foley recordings \\
Synthetic augmentation & 767,000 & 18.3\% & Physics-based audio simulation \\
\midrule
\textbf{Total} & \textbf{4,200,000} & 100\% & \\
\bottomrule
\end{tabular}
\caption{Zen-Foley Training Data Composition}
\end{table}
\subsection{Training Protocol}
\textbf{Stage 1 -- Audio VAE pretraining} (100K steps on AudioSet): The VAE learns a general
audio representation with a reconstruction loss ($L_2$ in mel space) plus a multi-scale STFT
perceptual loss.
\textbf{Stage 2 -- Diffusion pretraining} (200K steps, text-conditioned): The U-Net is
trained on text-audio pairs to learn general audio generation.
\textbf{Stage 3 -- Video-conditioned fine-tuning} (100K steps): The visual conditioning
components are added and trained while the audio backbone is frozen for the first 20K steps.
\textbf{Stage 4 -- Synchronization fine-tuning} (50K steps): Fine-tuned specifically on
video-audio pairs with precise onset alignment, optimizing audio-visual synchrony loss.
\begin{table}[H]
\centering
\begin{tabular}{lllll}
\toprule
\textbf{Stage} & \textbf{Steps} & \textbf{Batch} & \textbf{LR} & \textbf{Hardware} \\
\midrule
VAE pretrain & 100K & 256 & 1e-4 & 8$\times$A100 \\
Diffusion pretrain & 200K & 128 & 2e-4 & 32$\times$A100 \\
Video fine-tune & 100K & 64 & 5e-5 & 16$\times$A100 \\
Sync fine-tune & 50K & 32 & 1e-5 & 8$\times$A100 \\
\bottomrule
\end{tabular}
\caption{Zen-Foley Training Configuration}
\end{table}
\section{Evaluation}
\subsection{AudioSet Classification (Sound Recognition)}
We evaluate the model's ability to recognize and generate sounds from the 527 AudioSet classes.
\begin{table}[H]
\centering
\begin{tabular}{lcc}
\toprule
\textbf{Model} & \textbf{mAP} $\uparrow$ & \textbf{Parameters} \\
\midrule
PANNs (CNN14) & 0.431 & 80M \\
AST & 0.448 & 87M \\
HTS-AT & 0.471 & 31M \\
AudioMAE & 0.489 & 86M \\
\textbf{Zen-Foley (generation + recognition)} & \textbf{0.432} & 1.5B \\
\bottomrule
\end{tabular}
\caption{AudioSet mAP (recognition, not the primary task of Zen-Foley)}
\end{table}
Note that Zen-Foley is primarily a generative model; AudioSet mAP is reported as a measure of
audio understanding capability, not the primary optimization target.
\subsection{Mean Opinion Score (MOS)}
Professional audio engineers ($N=15$) rated generated sound effects on a 5-point scale for:
naturalness, appropriateness to visual content, and overall quality.
\begin{table}[H]
\centering
\begin{tabular}{lccc}
\toprule
\textbf{Model / Condition} & \textbf{Naturalness} & \textbf{Appropriateness} & \textbf{Overall MOS} \\
\midrule
Ground truth Foley & 4.8 & 4.9 & 4.85 \\
FoleyCrafter & 3.7 & 3.5 & 3.6 \\
Diff-Foley & 3.9 & 3.7 & 3.8 \\
V2A-Mapper & 4.0 & 3.9 & 4.0 \\
\textbf{Zen-Foley} & \textbf{4.2} & \textbf{4.3} & \textbf{4.2} \\
\bottomrule
\end{tabular}
\caption{Mean Opinion Score Evaluation (1--5 scale, N=15 audio engineers)}
\end{table}
\subsection{AVSync Benchmark (Synchronization)}
Audio-video synchronization is measured by onset alignment error (ms) on 500 held-out clips
containing discrete sound-producing events (impacts, clicks, slams).
\begin{table}[H]
\centering
\begin{tabular}{lcc}
\toprule
\textbf{Model} & \textbf{Mean Onset Error (ms)} $\downarrow$ & \textbf{Within 100ms (\%)} \\
\midrule
Diff-Foley & 182 & 64.3 \\
FoleyCrafter & 127 & 78.1 \\
TempoFoley & 68 & 89.4 \\
\textbf{Zen-Foley} & \textbf{38} & \textbf{96.2} \\
\bottomrule
\end{tabular}
\caption{AVSync Synchronization Benchmark}
\end{table}
\subsection{Spatial Audio Quality}
Spatial audio accuracy was evaluated in blind listening tests ($N=20$ listeners) with binaural
rendering via a standard HRTF set. Participants localized synthesized sound sources, and their
responses were compared to the reference spatial positions.
\begin{table}[H]
\centering
\begin{tabular}{lcc}
\toprule
\textbf{Metric} & \textbf{Value} & \textbf{Human Reference} \\
\midrule
Azimuth error (mean $\pm$ std) & $6.2 \pm 3.1$\textdegree & $4.1 \pm 2.3$\textdegree \\
Elevation error (mean $\pm$ std) & $9.4 \pm 4.2$\textdegree & $7.3 \pm 3.8$\textdegree \\
Distance rank correlation & 0.84 & 0.91 \\
\bottomrule
\end{tabular}
\caption{Spatial Audio Localization Accuracy}
\end{table}
\section{Applications}
\subsection{Automated Film Post-Production}
Zen-Foley integrates into the Hanzo AI film pipeline as the Foley layer, automatically
generating synchronized soundtracks from director-approved video clips. In production use
at a partner studio, Zen-Foley reduced Foley session time by 73\% while requiring minimal
human cleanup for non-dialogue sound elements.
\subsection{Game Audio Engine Integration}
Game engines can call Zen-Foley via the Hanzo SDK to generate procedural audio for dynamic
game events, replacing hand-crafted sound libraries with contextually appropriate generated
audio that varies with game state.
\subsection{Accessibility: Audio Description Enhancement}
Zen-Foley is used to enhance audio descriptions for visually impaired viewers by generating
rich ambient soundscapes that communicate scene context without explicit narration.
\section{Integration}
\begin{lstlisting}[language=Python, caption=Zen-Foley Video-Conditioned Generation]
from zen import ZenFoley
import moviepy.editor as mp

model = ZenFoley.from_pretrained("zenlm/zen-foley-1.5b")

# Generate Foley for a video clip
video = mp.VideoFileClip("scene.mp4")
audio = model.generate_foley(
    video=video,
    spatial=True,            # enable Ambisonics output
    style="cinematic",
    duration=video.duration,
)

# Composite the generated track with the original video
final = video.set_audio(audio)
final.write_videofile("scene_with_foley.mp4")
\end{lstlisting}
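For quick monitoring of the four-channel spatial output, the Ambisonic channels can be decoded to stereo with two virtual cardioid microphones. This is a generic preview decode, not part of the Zen-Foley API:

```python
import numpy as np

def foa_to_stereo(w, x, y, mic_azimuth=np.pi / 6):
    """Decode first-order Ambisonics (FuMa W, X, Y) to stereo using two
    virtual cardioids at +/- mic_azimuth; Z is dropped for horizontal stereo."""
    def cardioid(az):
        return 0.5 * (np.sqrt(2.0) * w + np.cos(az) * x + np.sin(az) * y)
    return cardioid(mic_azimuth), cardioid(-mic_azimuth)

# A source hard left (azimuth +90 degrees, elevation 0): W = s/sqrt(2), X = 0, Y = s
s = np.ones(8)
left, right = foa_to_stereo(s / np.sqrt(2.0), 0.0 * s, s)   # left channel is louder
```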
\section{Related Work}
SpecVQGAN \cite{iashin2021taming} was among the first to apply VQ-VAE to video-to-audio
generation. Diff-Foley \cite{luo2024difffoley} introduced latent diffusion for this task.
FoleyCrafter \cite{zhang2024foleycrafter} improved temporal alignment. Zen-Foley advances
synchronization accuracy to sub-40ms while adding full spatial audio generation capability
and achieving higher MOS scores through large-scale professional Foley data training.
\section{Conclusion}
Zen-Foley demonstrates that a 1.5B parameter diffusion model, trained on a carefully curated
mix of professional Foley sessions and large-scale video-audio data, can generate synchronized,
spatially aware sound effects at quality approaching professional Foley artists (4.2/5 MOS
vs. 4.85 ground truth). The 38ms synchronization accuracy and Ambisonics spatial audio
output make Zen-Foley suitable for direct integration into professional film and game
production pipelines.
\begin{thebibliography}{10}
\bibitem{iashin2021taming} V. Iashin and E. Rahtu, ``Taming Visually Guided Sound Generation,'' BMVC, 2021.
\bibitem{luo2024difffoley} S. Luo et al., ``Diff-Foley: Synchronized Video-to-Audio Synthesis with Latent Diffusion Models,'' NeurIPS, 2023.
\bibitem{zhang2024foleycrafter} Y. Zhang et al., ``FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds,'' arXiv:2407.01494, 2024.
\end{thebibliography}
\end{document}