papers/zen-agent.tex at main · zenlm/papers · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
\documentclass[11pt,a4paper]{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{amsmath,amsfonts,amssymb}
\usepackage{graphicx}
\usepackage{hyperref}
\usepackage{listings}
\usepackage{color}
\usepackage{booktabs}
\usepackage{float}
\usepackage{geometry}
\geometry{margin=1in}

\definecolor{zenblue}{RGB}{41,121,255}
\definecolor{zengreen}{RGB}{52,199,89}
\definecolor{codegray}{RGB}{245,245,245}

\hypersetup{colorlinks=true,linkcolor=zenblue,urlcolor=zenblue,citecolor=zenblue}

\lstset{
    backgroundcolor=\color{codegray},
    basicstyle=\ttfamily\small,
    breaklines=true,
    captionpos=b,
    frame=single,
    numbers=left,
    numberstyle=\tiny\color{gray}
}

\title{
    \vspace{-2cm}
    \Large \textbf{Zen AI Model Family} \\
    \vspace{0.5cm}
    \Huge \textbf{Zen-Agent} \\
    \vspace{0.3cm}
    \large Autonomous Task Execution with Long-Horizon Planning \\
    \vspace{0.5cm}
    \normalsize Technical Report v2025.01
}

\author{
    Hanzo AI Research Team\thanks{research@hanzo.ai} \and
    Zoo Labs Foundation\thanks{foundation@zoo.ngo}
}

\date{January 2025}

\begin{document}

\maketitle

\begin{abstract}
We present \textbf{Zen-Agent}, a 32-billion parameter model trained for autonomous task execution,
multi-step planning, and compositional tool use. Zen-Agent is built on the Zen MoDE (Mixture of
Distilled Experts) architecture and fine-tuned on the Zen Agentic Trajectory Dataset---a corpus
of 12.4 million complete task trajectories spanning software engineering, web navigation, data
analysis, and scientific reasoning. The model achieves 71.2\% on GAIA (Level 1--3 average),
42.8\% on SWE-bench Verified, and 88.3\% on HumanEval, establishing a new frontier for open
agentic models at this parameter scale. Zen-Agent introduces \textbf{Hierarchical Task
Decomposition (HTD)}: an explicit planning module that generates subgoal trees before executing
any tool calls, substantially reducing failure modes caused by premature commitment to incorrect
action sequences. The model natively speaks the Model Context Protocol (MCP), enabling
zero-configuration integration with the Hanzo MCP tool ecosystem of 260+ tools.
\end{abstract}

\tableofcontents
\newpage

\section{Introduction}

The transition from conversational AI to agentic AI---systems that take sequences of actions in
the world to complete long-horizon tasks---demands a qualitative shift in how models are trained
and evaluated. A conversational model optimizes a single response; an agentic model must plan,
execute, observe feedback, revise beliefs, and persist toward a goal across potentially hundreds
of tool-mediated steps.

Existing large language models, even those capable of impressive single-turn reasoning, fail on
agentic benchmarks primarily due to:

\begin{enumerate}
    \item \textbf{Premature commitment}: Taking tool calls before adequately planning the task
    structure, leading to irrecoverable state.
    \item \textbf{Context drift}: Losing track of the original goal as the trajectory lengthens.
    \item \textbf{Tool misuse}: Calling tools with syntactically correct but semantically wrong
    arguments, wasting steps and budget.
    \item \textbf{Error propagation}: Failing to detect and recover from subtask failures before
    they cascade into total task failure.
\end{enumerate}

Zen-Agent addresses each failure mode through architectural and training innovations described
in this report.

\subsection{Model Summary}

\begin{table}[H]
\centering
\begin{tabular}{ll}
\toprule
\textbf{Property} & \textbf{Value} \\
\midrule
Parameters & 32B \\
Architecture & Zen MoDE 32B (8 experts, top-2) \\
Context Length & 128K tokens \\
Max Trajectory Length & 512 tool calls \\
Native Protocols & MCP, OpenAI Tools, Anthropic Tools \\
Supported Runtimes & Python, Node.js, Docker, Kubernetes \\
Training Trajectories & 12.4M task trajectories \\
\bottomrule
\end{tabular}
\caption{Zen-Agent Model Specifications}
\end{table}

\section{Architecture}

\subsection{Zen MoDE 32B Backbone}

Zen-Agent uses the Zen MoDE architecture at 32B scale: 40 transformer layers with 32 attention
heads, grouped-query attention (GQA with 8 KV heads), and sparse MoE feed-forward networks.
Each MoE layer maintains 8 experts with top-2 routing, yielding approximately 8B active
parameters per forward pass while providing 32B parameter capacity for specialist knowledge.

\begin{table}[H]
\centering
\begin{tabular}{ll}
\toprule
\textbf{Component} & \textbf{Specification} \\
\midrule
Layers & 40 \\
Attention heads & 32 (GQA, 8 KV heads) \\
Hidden dimension & 5120 \\
MoE experts per layer & 8 (top-2 routing) \\
Active params per token & $\approx$8B of 32B total \\
Context length & 128K (RoPE, $\theta = 500{,}000$) \\
Vocabulary & 152,064 tokens \\
\bottomrule
\end{tabular}
\caption{Zen MoDE 32B Architecture}
\end{table}

\subsection{Hierarchical Task Decomposition (HTD)}

Before any tool call is issued, Zen-Agent generates an explicit \textbf{task plan} in a
structured thinking block. The plan takes the form of a tree of subgoals:

\begin{lstlisting}[language=Python, caption=HTD Plan Structure]
{
  "goal": "Fix the authentication bug in PR #1247",
  "subgoals": [
    {
      "id": "sg1",
      "description": "Reproduce the bug",
      "steps": ["checkout branch", "run failing tests", "read error"],
      "depends_on": []
    },
    {
      "id": "sg2",
      "description": "Identify root cause",
      "steps": ["read auth module", "trace call stack"],
      "depends_on": ["sg1"]
    },
    {
      "id": "sg3",
      "description": "Implement fix and verify",
      "steps": ["edit file", "run tests", "verify no regressions"],
      "depends_on": ["sg2"]
    }
  ]
}
\end{lstlisting}

The HTD module is implemented as a constrained decoding prefix: the model always generates a
\texttt{<plan>} block before the first \texttt{<tool\_call>} block. Training enforces this
through a curriculum that penalizes trajectories where tool calls precede planning.

\subsection{Goal Persistence Module}

To combat context drift over long trajectories, the model maintains a \textbf{goal stack} in
its working memory: a compact structured summary of the original task and completed subgoals.
At every 32-step checkpoint, the model re-reads the goal stack and emits a brief
\texttt{<goal\_check>} block confirming alignment. Divergence from the goal triggers automatic
backtracking to the last consistent state.

\subsection{MCP-Native Tool Interface}

Zen-Agent is the first model trained natively on the Model Context Protocol (MCP) tool call
format. This means:

\begin{itemize}
    \item Tool schemas are embedded in the model's training distribution, not injected at inference.
    \item Common MCP tools (bash, file I/O, web browser, code execution) are callable without
    schema prompting.
    \item The model understands MCP error codes and applies appropriate retry strategies.
    \item Tool results are parsed and integrated into the plan without additional prompting.
\end{itemize}

\section{Training}

\subsection{Zen Agentic Trajectory Dataset}

\begin{table}[H]
\centering
\begin{tabular}{lrrl}
\toprule
\textbf{Domain} & \textbf{Trajectories} & \textbf{Proportion} & \textbf{Source} \\
\midrule
Software engineering & 4,100,000 & 33.1\% & GitHub issues, PRs \\
Web navigation & 2,800,000 & 22.6\% & Browser sessions \\
Data analysis & 1,900,000 & 15.3\% & Jupyter notebooks \\
Scientific reasoning & 1,400,000 & 11.3\% & Research papers \\
Coding tasks & 1,200,000 & 9.7\% & Competitive coding \\
System administration & 1,000,000 & 8.1\% & DevOps runbooks \\
\midrule
\textbf{Total} & \textbf{12,400,000} & 100\% & \\
\bottomrule
\end{tabular}
\caption{Zen Agentic Trajectory Dataset Composition}
\end{table}

Each trajectory record contains: task specification, tool call sequence, tool outputs, final
answer, success/failure label, and a human-written critique. Trajectories with more than 10
consecutive failed tool calls were filtered out during dataset curation.

\subsection{Training Protocol}

Training proceeds in four stages:

\textbf{Stage 1 -- SFT on correct trajectories} (50K steps): The model is fine-tuned on the
43\% of trajectories labeled as fully successful, using standard next-token prediction loss
over the complete trajectory (plan + tool calls + observations + answer).

\textbf{Stage 2 -- DPO on trajectory pairs} (30K steps): For each task with both successful
and failed trajectories, we construct preference pairs and apply Direct Preference Optimization
to increase the probability of success-leading prefixes.

\textbf{Stage 3 -- GRPO for tool accuracy} (20K steps): Group Relative Policy Optimization
with a sparse reward signal: $+1$ for task success, $-0.1$ per unnecessary tool call (calls
that could have been skipped given the information already available in context).

\textbf{Stage 4 -- Safety fine-tuning} (10K steps): Constitutional AI prompts are used to
discourage trajectories that involve irreversible destructive actions without explicit user
confirmation steps.

\begin{table}[H]
\centering
\begin{tabular}{llll}
\toprule
\textbf{Stage} & \textbf{Steps} & \textbf{Batch} & \textbf{LR} \\
\midrule
SFT & 50K & 64 & 1e-5 \\
DPO & 30K & 32 & 5e-6 \\
GRPO & 20K & 16 & 2e-6 \\
Safety & 10K & 32 & 1e-6 \\
\bottomrule
\end{tabular}
\caption{Zen-Agent Training Configuration}
\end{table}

\section{Evaluation}

\subsection{GAIA Benchmark}

GAIA measures real-world assistant capability across general AI tasks requiring web search,
file processing, code execution, and multi-step reasoning. Results below are on the official
GAIA validation set (Level 1 = easy, Level 3 = hard).

\begin{table}[H]
\centering
\begin{tabular}{lcccc}
\toprule
\textbf{Model} & \textbf{Level 1} & \textbf{Level 2} & \textbf{Level 3} & \textbf{Average} \\
\midrule
GPT-4o (2024-11) & 73.2 & 52.1 & 28.4 & 51.2 \\
Claude 3.5 Sonnet & 77.3 & 56.2 & 31.7 & 55.1 \\
Gemini 1.5 Pro & 70.1 & 48.3 & 26.1 & 48.2 \\
Open-source SoTA (32B) & 52.7 & 31.4 & 14.2 & 32.8 \\
\textbf{Zen-Agent 32B} & \textbf{83.1} & \textbf{68.4} & \textbf{42.3} & \textbf{71.2} \\
\bottomrule
\end{tabular}
\caption{GAIA Benchmark Results (\%, higher is better)}
\end{table}

\subsection{SWE-bench Verified}

SWE-bench Verified presents real GitHub issues from popular Python repositories; the model must
produce a patch that passes the official test suite without seeing the tests.

\begin{table}[H]
\centering
\begin{tabular}{lcc}
\toprule
\textbf{Model} & \textbf{Resolve Rate (\%)} & \textbf{Avg. Tool Calls} \\
\midrule
GPT-4o (Agentless) & 33.2 & 24.3 \\
Claude 3.5 Sonnet & 38.1 & 31.7 \\
SWE-agent + Claude & 39.4 & 48.2 \\
OpenHands + Claude & 41.0 & 52.1 \\
\textbf{Zen-Agent 32B} & \textbf{42.8} & \textbf{29.6} \\
\bottomrule
\end{tabular}
\caption{SWE-bench Verified Results}
\end{table}

Zen-Agent achieves the highest resolve rate while requiring 30\% fewer tool calls than the
next-best method, demonstrating the efficiency gain from Hierarchical Task Decomposition.

\subsection{HumanEval}

Single-function code generation evaluated by executing generated code against hidden test cases.

\begin{table}[H]
\centering
\begin{tabular}{lc}
\toprule
\textbf{Model} & \textbf{pass@1 (\%)} \\
\midrule
GPT-4o & 86.4 \\
Claude 3.5 Haiku & 81.2 \\
Gemini 1.5 Flash & 78.3 \\
\textbf{Zen-Agent 32B} & \textbf{88.3} \\
\bottomrule
\end{tabular}
\caption{HumanEval pass@1}
\end{table}

\subsection{Tool Use Accuracy}

We evaluate tool argument correctness on a held-out set of 10,000 tool calls annotated with
ground-truth arguments. Zen-Agent achieves 91.4\% exact-match accuracy on tool arguments,
compared to 78.2\% for the best-performing comparable baseline.

\subsection{Trajectory Efficiency}

\begin{table}[H]
\centering
\begin{tabular}{lccc}
\toprule
\textbf{Task Type} & \textbf{Success Rate} & \textbf{Avg Calls} & \textbf{Backtrack Rate} \\
\midrule
Code debugging & 84.3\% & 18.2 & 12.1\% \\
Web research & 91.2\% & 12.7 & 6.3\% \\
Data analysis & 87.6\% & 22.4 & 9.8\% \\
File manipulation & 96.1\% & 7.3 & 2.1\% \\
System administration & 78.4\% & 31.2 & 18.7\% \\
\bottomrule
\end{tabular}
\caption{Zen-Agent Trajectory Efficiency by Task Type}
\end{table}

\section{Applications}

\subsection{Software Engineering Automation}

Zen-Agent powers the Hanzo Code service, which autonomously handles GitHub issues labeled
\texttt{zen-agent}: the model reads the issue, explores the codebase, writes a patch, runs
the test suite, and opens a pull request with a structured description of its changes.
Average cycle time is under 3 minutes for issues categorized as medium complexity.

\subsection{Research Automation}

For scientific research tasks, Zen-Agent can execute multi-hour workflows involving literature
search (via web tools), data download and preprocessing, statistical analysis (via Python
execution), and report generation---all from a single natural language task specification.

\subsection{IT Operations}

Zen-Agent is deployed in the Hanzo cloud platform for automated incident response: given an
alert, the model diagnoses the root cause by querying logs, metrics, and configuration state,
then executes a remediation playbook or escalates with a structured diagnostic report.

\section{Safety and Constraints}

\subsection{Action Reversibility Classification}

Every tool in the Zen MCP ecosystem is annotated with a reversibility label:
\texttt{reversible}, \texttt{soft-destructive} (data can be recovered), or
\texttt{hard-destructive} (action is irreversible). Zen-Agent refuses to execute
\texttt{hard-destructive} actions without an explicit user confirmation step in the trajectory.

\subsection{Budget Limits}

Each agent session is configured with maximum tool calls (default 200), maximum wall-clock
time (default 10 minutes), and maximum cost per session (configurable, default \$0.50).
The model monitors its own budget and gracefully summarizes progress when approaching limits.

\subsection{Hallucination Reduction}

Zen-Agent's training emphasized grounded claims: the model is penalized in GRPO training for
asserting facts about tool outputs that contradict what was actually returned. On the
TruthfulQA benchmark, Zen-Agent achieves 81.3\% accuracy, 4.2 points above the backbone
pre-fine-tuning baseline.

\section{Integration}

\begin{lstlisting}[language=Python, caption=Zen-Agent Task Execution]
from zen import ZenAgent

agent = ZenAgent(
    model="zenlm/zen-agent-32b",
    tools=["bash", "file", "web", "python"],
    max_steps=100,
    budget_usd=1.0,
)

result = agent.run(
    "Analyze the performance regression in the /api/search endpoint "
    "using the last 24h of production logs and propose a fix.",
    context={"log_path": "/var/log/api/search.log"}
)

print(result.answer)
print(f"Steps taken: {result.num_steps}")
print(f"Cost: ${result.cost_usd:.4f}")
\end{lstlisting}

\section{Related Work}

ReAct \cite{yao2022react} established the interleaved reasoning-action paradigm.
Tree-of-Thought \cite{yao2023tree} introduced explicit planning over branching action spaces.
SWE-agent \cite{yang2024swe} specialized for software engineering via shell-based interaction.
Zen-Agent advances this line through HTD, native MCP integration, and large-scale trajectory
training at 32B scale with the Zen MoDE architecture.

\section{Conclusion}

Zen-Agent demonstrates that agentic capability can be substantially improved by training on
real task trajectories at scale, enforcing explicit hierarchical planning before action
execution, and natively integrating with tool ecosystems via MCP. The resulting model achieves
71.2\% GAIA and 42.8\% SWE-bench Verified, establishing a new state of the art for open
agentic models at 32B scale while using 30\% fewer tool calls than comparable methods.

Future work will extend Zen-Agent to multi-agent collaboration (where multiple Zen-Agent
instances coordinate via MCP), longer planning horizons (beyond 512 tool calls), and
continual learning from deployment feedback via the Zoo DSO protocol.

\begin{thebibliography}{10}
\bibitem{yao2022react} S. Yao et al., ``ReAct: Synergizing Reasoning and Acting in Language Models,'' ICLR, 2023.
\bibitem{yao2023tree} S. Yao et al., ``Tree of Thoughts: Deliberate Problem Solving with Large Language Models,'' NeurIPS, 2023.
\bibitem{yang2024swe} J. Yang et al., ``SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering,'' NeurIPS, 2024.
\end{thebibliography}

\end{document}