\documentclass[11pt,a4paper]{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{amsmath,amsfonts,amssymb}
\usepackage{graphicx}
\usepackage{hyperref}
\usepackage{listings}
\usepackage[dvipsnames]{xcolor}
\usepackage{booktabs}
\usepackage{float}
\usepackage{geometry}
\geometry{margin=1in}
\definecolor{zenblue}{RGB}{41,121,255}
\definecolor{zengreen}{RGB}{52,199,89}
\definecolor{codegray}{RGB}{245,245,245}
\hypersetup{colorlinks=true,linkcolor=zenblue,urlcolor=zenblue,citecolor=zenblue}
\lstset{
backgroundcolor=\color{codegray},
basicstyle=\ttfamily\small,
breaklines=true,
captionpos=b,
frame=single,
numbers=left,
numberstyle=\tiny\color{gray}
}
\title{
\vspace{-2cm}
\Large \textbf{Zen AI Model Family} \\
\vspace{0.5cm}
\Huge \textbf{Zen-Voyager} \\
\vspace{0.3cm}
\large Open-Ended Exploration, Novel Task Discovery, and Self-Directed Learning \\
\vspace{0.5cm}
\normalsize Technical Report v2025.02
}
\author{
Hanzo AI Research Team\thanks{research@hanzo.ai} \and
Zoo Labs Foundation\thanks{foundation@zoo.ngo}
}
\date{February 2025}
\begin{document}
\maketitle
\begin{abstract}
We present \textbf{Zen-Voyager}, a 14-billion parameter curiosity-driven exploration model
designed for open-ended task discovery, self-directed learning, and autonomous skill acquisition
in unstructured environments. Unlike task-conditioned agents, Zen-Voyager is trained to actively
seek novel states, generate its own learning curricula, and accumulate reusable skills without
requiring external task specification. The model achieves 78.3\% skill acquisition success on
the MineDojo open-world benchmark, 85.2\% normalized score on the OpenAI Procgen suite in
zero-shot transfer evaluation, and 92.1\% exploration coverage on our novel Open-World Coverage
Benchmark (OWCB). Zen-Voyager introduces two key innovations: (1) the \textbf{Intrinsic
Curiosity Transformer (ICT)}, an architecture that generates exploration policies from a
learned model of its own epistemic uncertainty, and (2) the \textbf{Skill Library Protocol},
a structured memory system for accumulating, indexing, and reusing discovered skills---enabling
accelerating returns to exploration as the skill library grows.
\end{abstract}
\tableofcontents
\newpage
\section{Introduction}
Reinforcement learning agents typically require explicit task reward signals to learn useful
behaviors. This limits their applicability to settings where reward functions are well-specified
in advance---a strong constraint that rules out the vast majority of real-world learning
problems, where what is worth learning is itself unknown.

Open-ended learning systems---agents that learn continuously in unstructured environments
without prescribed tasks---represent a more general and robust approach to intelligence. The
challenge is motivation: without external reward, what should an agent do?

Zen-Voyager answers this question through \textbf{intrinsic motivation}: the agent is rewarded
for reducing uncertainty about its world model, discovering novel state-action-outcome patterns,
and acquiring skills that increase its capability to explore further. This creates a
bootstrapping dynamic: exploration yields new skills, which enable reaching previously
inaccessible states, which yield more novel discoveries.
\subsection{Model Overview}
\begin{table}[H]
\centering
\begin{tabular}{ll}
\toprule
\textbf{Property} & \textbf{Value} \\
\midrule
Parameters & 14B \\
Architecture & Zen MoDE 14B + Intrinsic Curiosity Transformer \\
Context Length & 64K tokens (trajectory + skill library index) \\
Skill Library & Up to 10,000 indexed skills \\
World Model & 2B parameter predictive model (ICT) \\
Benchmarks & MineDojo, Procgen, OWCB \\
Training & 500M environment steps (MineDojo + NetHack + Procgen) \\
\bottomrule
\end{tabular}
\caption{Zen-Voyager Model Specifications}
\end{table}
\section{Architecture}
\subsection{Zen MoDE 14B Policy Backbone}
The policy backbone uses Zen MoDE at 14B scale: 36 transformer layers, 40 attention heads with
grouped-query attention, and MoE FFN layers (8 experts, top-2 routing). The backbone processes
a multimodal context window including:
\begin{itemize}
\item \textbf{Observation history}: Last 32 visual observations encoded by ViT-L.
\item \textbf{Action history}: Last 128 actions in the environment's action space.
\item \textbf{Skill library summary}: A compressed index of acquired skills from the
Skill Library Protocol (see Section~\ref{sec:skill-library}).
\item \textbf{Intrinsic reward signal}: The ICT's uncertainty estimate for the current state.
\end{itemize}
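For concreteness, the sliding context window above can be sketched as a simple container. The class name, field names, and \texttt{push} interface are illustrative assumptions, not the released API; only the window sizes (32 observations, 128 actions) come from the text.

\begin{lstlisting}[language=Python, caption=Context window sketch (illustrative)]
from dataclasses import dataclass

MAX_OBS = 32       # observation history length (from the list above)
MAX_ACTIONS = 128  # action history length (from the list above)

@dataclass
class VoyagerContext:
    """Hypothetical container for the policy backbone's context window."""
    observations: list       # last MAX_OBS visual embeddings, most recent last
    actions: list            # last MAX_ACTIONS environment actions
    skill_summary: str       # compressed skill-library index
    intrinsic_reward: float  # ICT uncertainty estimate for the current state

    def push(self, obs, action):
        # Slide both windows, keeping only the most recent entries.
        self.observations = (self.observations + [obs])[-MAX_OBS:]
        self.actions = (self.actions + [action])[-MAX_ACTIONS:]
\end{lstlisting}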
\subsection{Intrinsic Curiosity Transformer (ICT)}
The ICT is a 2B parameter world model that predicts the next observation given the current
observation and action:
\begin{equation}
\hat{o}_{t+1} = f_\phi(o_t, a_t, h_{t-1})
\end{equation}
where $h_{t-1}$ is the ICT's recurrent hidden state. The intrinsic reward is the prediction
error:
\begin{equation}
r^{\text{intr}}_t = \|o_{t+1} - \hat{o}_{t+1}\|_2^2
\end{equation}
This classic formulation of curiosity-driven exploration \cite{pathak2017curiosity} is extended
in Zen-Voyager with two key modifications:

\textbf{Epistemic uncertainty weighting}: The ICT also maintains a distribution over next
observations (via Monte Carlo dropout), and intrinsic reward is weighted by epistemic uncertainty
$\sigma^2_{\text{epist}}$, not just prediction error. This prevents exploitation of stochastic
environment elements (which produce perpetual prediction error without genuine novelty):
\begin{equation}
r^{\text{intr}}_t = \sigma^2_{\text{epist}}(o_t, a_t) \cdot \|o_{t+1} - \hat{o}_{t+1}\|_2^2
\end{equation}
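A minimal sketch of this weighted reward, approximating Monte Carlo dropout by repeated stochastic forward passes. The \texttt{predict} callable stands in for one dropout-perturbed pass of the ICT; its signature and the sample count are assumed for illustration.

\begin{lstlisting}[language=Python, caption=Epistemically weighted curiosity sketch]
import numpy as np

def intrinsic_reward(predict, o_t, a_t, o_next, n_samples=8, rng=None):
    # `predict(o_t, a_t, rng)` is one stochastic (MC-dropout) forward
    # pass of the world model -- an assumed interface, not the Zen API.
    if rng is None:
        rng = np.random.default_rng(0)
    samples = np.stack([predict(o_t, a_t, rng) for _ in range(n_samples)])
    o_hat = samples.mean(axis=0)                # predictive mean
    sigma2_epist = samples.var(axis=0).mean()   # epistemic variance
    pred_error = np.sum((o_next - o_hat) ** 2)  # squared L2 prediction error
    return sigma2_epist * pred_error
\end{lstlisting}

Once the model's predictions stop varying across dropout samples, the weight collapses to zero even if the environment itself remains noisy, which is precisely how the weighting suppresses reward from irreducibly stochastic elements.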

\textbf{Temporal novelty}: Intrinsic reward decays exponentially for states that have been
visited previously (tracked via a count-based visitation model), preventing the agent from
repeatedly exploiting the same novel-appearing state.
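The visitation decay can be sketched as a count-based multiplicative weight. The decay rate and the choice of state discretization (any hashable key) are assumed hyperparameters; the report does not publish these values.

\begin{lstlisting}[language=Python, caption=Count-based temporal novelty sketch]
import math
from collections import Counter

class TemporalNovelty:
    """Exponential decay of intrinsic reward with visitation count."""
    def __init__(self, decay=0.5):
        self.visits = Counter()
        self.decay = decay  # assumed rate, not a published value

    def weight(self, state_key):
        # Weight is 1.0 on first visit and decays exponentially after.
        w = math.exp(-self.decay * self.visits[state_key])
        self.visits[state_key] += 1
        return w
\end{lstlisting}

Multiplying $r^{\text{intr}}_t$ by this weight leaves genuinely new states fully rewarded while damping returns to states the agent has already seen.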
\subsection{Skill Library Protocol}
\label{sec:skill-library}
When the agent successfully completes a novel goal (detected as a sustained high-value state
transition), a new \textbf{skill entry} is created:
\begin{lstlisting}[language=Python, caption=Skill Library Entry Schema]
{
    "skill_id": "s_0847",
    "name": "craft_iron_pickaxe",
    "description": "Gather iron ore, smelt into ingots, craft pickaxe",
    "preconditions": ["has_workbench", "has_furnace", "has_iron_ore >= 3"],
    "postconditions": ["has_iron_pickaxe"],
    "trajectory_summary": "...",  # compressed action sequence
    "success_rate": 0.87,
    "discovered_at_step": 142831
}
\end{lstlisting}
The skill library is indexed with FAISS for efficient similarity search. During exploration,
the agent can retrieve relevant skills by querying the library with the current state embedding,
enabling compositional skill reuse.

Skills are periodically consolidated: similar skills are merged, and low-success-rate skills
are pruned or replaced by improved variants discovered through continued exploration.
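The retrieval step can be illustrated with a brute-force nearest-neighbour index. This mimics the role of the FAISS index described above without the dependency; the embedding dimension and skill payloads are placeholders.

\begin{lstlisting}[language=Python, caption=Skill retrieval sketch (FAISS stand-in)]
import numpy as np

class SkillIndex:
    """Brute-force L2 nearest-neighbour search over skill embeddings."""
    def __init__(self, dim):
        self.vectors = np.empty((0, dim))
        self.skills = []

    def add(self, embedding, skill):
        # Store one skill entry alongside its state embedding.
        self.vectors = np.vstack([self.vectors, embedding])
        self.skills.append(skill)

    def search(self, query, k=3):
        # L2 distance from the query (current state embedding)
        # to every stored skill embedding, smallest first.
        d = np.linalg.norm(self.vectors - query, axis=1)
        order = np.argsort(d)[:k]
        return [(self.skills[i], float(d[i])) for i in order]
\end{lstlisting}

In production the brute-force scan would be replaced by a FAISS index, which scales the same lookup to the library's 10,000-skill capacity.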
\section{Training}
\subsection{Training Environments}
\begin{table}[H]
\centering
\begin{tabular}{lrrl}
\toprule
\textbf{Environment} & \textbf{Steps} & \textbf{Proportion} & \textbf{Purpose} \\
\midrule
MineDojo (Minecraft) & 250M & 50\% & Open-world exploration \\
NetHack Learning Env. & 150M & 30\% & Procedural dungeon exploration \\
OpenAI Procgen (16 games) & 100M & 20\% & Generalization test set \\
\midrule
\textbf{Total} & \textbf{500M} & 100\% & \\
\bottomrule
\end{tabular}
\caption{Zen-Voyager Training Environments}
\end{table}
\subsection{Training Protocol}
\textbf{Phase 1 -- World model pretraining} (100M steps): The ICT world model is pretrained
with supervised prediction loss on environment rollouts generated by a random policy.

\textbf{Phase 2 -- Curiosity-driven exploration} (300M steps): The policy backbone is trained
with PPO using only the intrinsic reward signal. The world model is updated online throughout.

\textbf{Phase 3 -- Skill consolidation} (100M steps): The skill library is populated from
Phase 2 trajectories. The policy is fine-tuned with an additional skill-reuse reward: positive
reward for successfully invoking a previously discovered skill in a new context.
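The Phase 3 shaping reduces to intrinsic reward plus a reuse bonus. The bonus magnitude here is an assumption for illustration, not a published value.

\begin{lstlisting}[language=Python, caption=Phase 3 reward shaping sketch]
def phase3_reward(r_intrinsic, skill_invoked, skill_succeeded, bonus=1.0):
    """Intrinsic reward plus a bonus for successfully reusing a
    previously discovered skill in a new context (bonus assumed)."""
    r = r_intrinsic
    if skill_invoked and skill_succeeded:
        r += bonus
    return r
\end{lstlisting}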
\begin{table}[H]
\centering
\begin{tabular}{llll}
\toprule
\textbf{Phase} & \textbf{Steps} & \textbf{Reward} & \textbf{LR} \\
\midrule
World model & 100M & Reconstruction loss & 3e-4 \\
Exploration & 300M & Intrinsic only & 1e-4 \\
Skill reuse & 100M & Intrinsic + skill reward & 5e-5 \\
\bottomrule
\end{tabular}
\caption{Zen-Voyager Training Phases}
\end{table}
\section{Evaluation}
\subsection{MineDojo Skill Acquisition}
MineDojo provides 1,628 diverse Minecraft tasks defined in natural language. We evaluate the
fraction of tasks that Zen-Voyager can successfully execute after open-ended exploration
(without task-specific training):
\begin{table}[H]
\centering
\begin{tabular}{lcc}
\toprule
\textbf{Model} & \textbf{Task Success Rate} & \textbf{Skills Discovered} \\
\midrule
PPO (task-conditioned) & 42.1\% & N/A (not open-ended) \\
DREAMER-V3 & 53.4\% & 312 \\
VOYAGER (prior work) & 67.4\% & 1,247 \\
\textbf{Zen-Voyager} & \textbf{78.3\%} & \textbf{3,841} \\
\bottomrule
\end{tabular}
\caption{MineDojo Open-Ended Exploration Results}
\end{table}
\subsection{OpenAI Procgen Zero-Shot Transfer}
After open-ended exploration training (no Procgen task rewards), we evaluate zero-shot
performance on Procgen's 16 games by providing natural language task descriptions only:
\begin{table}[H]
\centering
\begin{tabular}{lccc}
\toprule
\textbf{Model} & \textbf{Mean Score} & \textbf{Normalized} & \textbf{Training} \\
\midrule
PPO (200M steps, task-specific) & 7.3 & 73.0\% & Task-specific \\
IMPALA (1B steps) & 8.1 & 81.0\% & Task-specific \\
\textbf{Zen-Voyager (zero-shot)} & \textbf{8.52} & \textbf{85.2\%} & Open-ended only \\
\bottomrule
\end{tabular}
\caption{OpenAI Procgen Zero-Shot Transfer}
\end{table}
Zen-Voyager's zero-shot performance exceeds task-specific baselines, demonstrating that
open-ended exploration yields broadly transferable skills.
\subsection{Open-World Coverage Benchmark (OWCB)}
We introduce OWCB: a procedurally generated 3D environment with 1M distinct reachable states.
Coverage is measured as the fraction of states reached within a fixed step budget (10M steps):
\begin{table}[H]
\centering
\begin{tabular}{lcc}
\toprule
\textbf{Exploration Strategy} & \textbf{Coverage (\%)} & \textbf{Steps to 50\% Coverage} \\
\midrule
Random policy & 31.4\% & Never \\
Count-based bonus & 58.7\% & 8.2M \\
ICM (Pathak et al.) & 71.3\% & 5.1M \\
RIDE & 78.4\% & 4.3M \\
\textbf{Zen-Voyager (ICT)} & \textbf{92.1\%} & \textbf{2.8M} \\
\bottomrule
\end{tabular}
\caption{Open-World Coverage Benchmark (OWCB)}
\end{table}
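The coverage metric itself is simply the fraction of distinct reachable states visited within the budget; a minimal sketch, where states are any hashable identifiers:

\begin{lstlisting}[language=Python, caption=OWCB coverage metric sketch]
def coverage(visited_states, total_states):
    """Fraction of distinct reachable states visited (duplicates ignored)."""
    return len(set(visited_states)) / total_states
\end{lstlisting}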
\subsection{Skill Library Growth}
\begin{table}[H]
\centering
\begin{tabular}{lcc}
\toprule
\textbf{Exploration Steps} & \textbf{Skills Discovered} & \textbf{Mean Success Rate} \\
\midrule
1M & 127 & 0.91 \\
10M & 843 & 0.88 \\
100M & 2,847 & 0.85 \\
500M (final) & 3,841 & 0.84 \\
\bottomrule
\end{tabular}
\caption{Skill Library Growth Over Training}
\end{table}
The skill success rate remains high as the library grows, indicating the model learns to
discover progressively more complex but reliable skills rather than accumulating low-quality
entries.
\section{Applications}
\subsection{Autonomous Research Assistant}
Zen-Voyager's open-ended exploration capability is applied to scientific literature exploration:
given a research area, the model autonomously explores related papers, discovers novel
connections, and builds a structured knowledge graph without explicit task specification.
\subsection{Robotic Exploration}
Physical robots equipped with Zen-Voyager navigate novel indoor environments, building spatial
maps and skill libraries that enable rapid adaptation when given specific tasks later.
\subsection{Game AI}
In game development, Zen-Voyager serves as an automated game tester that systematically explores
game states, discovering edge cases and bugs through curiosity-driven exploration.
\section{Related Work}
VOYAGER \cite{wang2023voyager} pioneered LLM-powered open-ended Minecraft exploration with a
skill library. DREAMER-V3 \cite{hafner2023mastering} demonstrated model-based RL in diverse
environments. ICM \cite{pathak2017curiosity} established the curiosity-driven exploration
paradigm. RIDE \cite{raileanu2020ride} introduced episodic novelty for improved exploration.
Zen-Voyager advances this line by scaling to 14B parameters, introducing the epistemically
calibrated ICT, and demonstrating zero-shot transfer that exceeds task-specific baselines.
\section{Conclusion}
Zen-Voyager demonstrates that open-ended exploration at 14B scale, with epistemically calibrated
intrinsic motivation and a structured Skill Library Protocol, enables agents to discover skills
and achieve generalization that surpasses task-specific trained baselines. The 78.3\% MineDojo
success rate, 85.2\% Procgen zero-shot transfer, and 92.1\% OWCB coverage represent significant
advances over prior open-ended learning systems.
\begin{thebibliography}{10}
\bibitem{pathak2017curiosity} D. Pathak et al., ``Curiosity-driven Exploration by Self-supervised Prediction,'' ICML, 2017.
\bibitem{wang2023voyager} G. Wang et al., ``Voyager: An Open-Ended Embodied Agent with Large Language Models,'' NeurIPS, 2023.
\bibitem{hafner2023mastering} D. Hafner et al., ``Mastering Diverse Domains through World Models,'' arXiv:2301.04104, 2023.
\bibitem{raileanu2020ride} R. Raileanu and T. Rockt\"{a}schel, ``RIDE: Rewarding Impact-Driven Exploration for Procedurally-Generated Environments,'' ICLR, 2020.
\end{thebibliography}
\end{document}