papers/zen-alignment.tex at main · zenlm/papers · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
\documentclass[11pt,a4paper]{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{amsmath,amsfonts,amssymb}
\usepackage{graphicx}
\usepackage{hyperref}
\usepackage{listings}
\usepackage{color}
\usepackage{booktabs}
\usepackage{float}
\usepackage{algorithm}
\usepackage{algorithmic}
\usepackage{geometry}
\geometry{margin=1in}
\definecolor{zenblue}{RGB}{41,121,255}
\hypersetup{colorlinks=true,linkcolor=zenblue,urlcolor=zenblue,citecolor=zenblue}

\title{\textbf{Zen Alignment: RLHF and Constitutional AI at Scale}\\
\large Technical Report v2025.04}
\author{Zen LM Research Team\\
\texttt{research@zenlm.org}}
\date{April 2025}

\begin{document}
\maketitle

\begin{abstract}
We describe the alignment infrastructure for the Zen model family, covering reward
modeling, proximal policy optimization, and a constitutional AI framework that reduces
dependence on human preference annotation. Key contributions include multi-reward
fusion that jointly optimizes helpfulness, harmlessness, and honesty; an adaptive
KL penalty that prevents mode collapse while maintaining policy expressiveness; and
a synthetic preference generation pipeline that scales annotation by $12\times$ at
$\sim$90\% of human annotator agreement. On head-to-head human evaluation, aligned
Zen models achieve 68.4\% win rate against SFT-only baselines, with harmlessness
improving from 71.2\% to 94.7\% and helpfulness from 76.8\% to 89.3\%.
\end{abstract}

\section{Introduction}

Aligning large language models to human values is a multi-objective optimization problem:
models must be simultaneously \textit{helpful} (completing user tasks accurately),
\textit{harmless} (avoiding dangerous or offensive outputs), and \textit{honest}
(refusing to assert false claims). These objectives conflict in non-trivial ways.
A model optimized purely for helpfulness tends to hallucinate; one optimized purely
for harmlessness becomes overly cautious and refuses legitimate requests.

The Zen alignment framework addresses this tension through three innovations:

\begin{enumerate}
  \item \textbf{Multi-reward fusion} with learned weighting across reward dimensions
  \item \textbf{Adaptive KL penalty} that adjusts dynamically during training
  \item \textbf{Constitutional synthetic preference generation} at scale
\end{enumerate}

We detail each component, report experimental results, and analyze the
helpfulness-harmlessness tradeoff surface achieved by our approach.

\section{Background}

\subsection{Reinforcement Learning from Human Feedback}

RLHF~\cite{christiano2017deep,ouyang2022rlhf} trains a reward model $r_\phi$ on
human preference data, then uses reinforcement learning (typically PPO~\cite{schulman2017ppo})
to maximize expected reward subject to a KL constraint against the reference policy.

The standard objective is:
\begin{equation}
J(\theta) = \mathbb{E}_{x \sim \mathcal{D}, y \sim \pi_\theta(\cdot|x)}\left[r_\phi(x,y) - \beta \cdot \text{KL}[\pi_\theta(\cdot|x) \| \pi_\text{ref}(\cdot|x)]\right]
\end{equation}

A fixed $\beta$ creates instability: too small permits reward hacking; too large
prevents meaningful policy improvement. We address this with adaptive scheduling.

\subsection{Constitutional AI}

Constitutional AI~\cite{bai2022constitutional} replaces human preference labels
with model-generated critiques based on a set of principles (the ``constitution'').
This scales annotation but risks distribution shift when the critiquing model shares
failure modes with the model being trained. Zen addresses this via an independent
critique model trained on a held-out safety dataset.

\section{Reward Modeling}

\subsection{Multi-Reward Architecture}

Rather than a single scalar reward, we train three specialized reward heads:
\begin{align}
r_\text{help}(x,y) &: \text{Task completion quality and accuracy} \\
r_\text{harm}(x,y) &: \text{Safety and policy compliance} \\
r_\text{honest}(x,y) &: \text{Factual accuracy and calibration}
\end{align}

All three heads share a Zen-7B backbone with task-specific linear projection heads
of dimension 1 (scalar reward). The backbone is initialized from the SFT checkpoint
to preserve instruction-following representations.

\subsubsection{Training Data}

\begin{table}[H]
\centering
\begin{tabular}{lrrr}
\toprule
\textbf{Reward Dimension} & \textbf{Human Pairs} & \textbf{Synthetic Pairs} & \textbf{Total} \\
\midrule
Helpfulness & 120K & 980K & 1.1M \\
Harmlessness & 80K & 420K & 500K \\
Honesty & 60K & 340K & 400K \\
\midrule
Total & 260K & 1.74M & 2.0M \\
\bottomrule
\end{tabular}
\caption{Reward model training data composition.}
\end{table}

Human annotators rated response pairs on a 4-point scale per dimension.
Synthetic pairs were generated via constitutional critiques (Section 4) and filtered
to pairs where the constitutional model assigned high confidence ($>0.85$).

\subsubsection{Loss Function}

Multi-reward training minimizes:
\begin{equation}
\mathcal{L}_\text{RM} = \sum_{d \in \{\text{help,harm,honest}\}} \lambda_d \cdot \mathbb{E}\left[-\log \sigma\left(r_\phi^d(x, y_w) - r_\phi^d(x, y_l)\right)\right]
\end{equation}

with $\lambda_\text{help} = 0.45$, $\lambda_\text{harm} = 0.35$, $\lambda_\text{honest} = 0.20$.
These weights are set via a small held-out human evaluation of $5000$ pairs.

\subsection{Reward Model Accuracy}

\begin{table}[H]
\centering
\begin{tabular}{lcccc}
\toprule
\textbf{RM Variant} & \textbf{Help Acc.} & \textbf{Harm Acc.} & \textbf{Honest Acc.} & \textbf{Avg.} \\
\midrule
Single scalar & 72.1\% & 78.4\% & 68.3\% & 72.9\% \\
Multi-head (ours) & \textbf{79.8\%} & \textbf{85.2\%} & \textbf{76.4\%} & \textbf{80.5\%} \\
Multi-head + synth & 78.9\% & 84.7\% & 76.1\% & 79.9\% \\
\bottomrule
\end{tabular}
\caption{Reward model accuracy on held-out human preference pairs (6K pairs per dimension).}
\end{table}

The multi-head architecture improves over a single scalar reward model by 7.6 points
average accuracy. Adding synthetic data slightly reduces accuracy on human pairs
(distribution shift) but enables higher-quality policy training.

\section{Constitutional Synthetic Preference Generation}

\subsection{Constitution Definition}

The Zen constitution contains 42 principles organized into five categories:

\begin{enumerate}
  \item \textbf{Factual accuracy} (10 principles): truthfulness, uncertainty expression,
    source attribution, numerical accuracy
  \item \textbf{Safety} (12 principles): refusal of harmful requests, child protection,
    no dangerous instructions, bias avoidance
  \item \textbf{Privacy} (6 principles): no PII disclosure, data minimization
  \item \textbf{Helpfulness} (8 principles): task relevance, completeness, clarity
  \item \textbf{Fairness} (6 principles): demographic balance, representation
\end{enumerate}

\subsection{Synthetic Preference Generation Algorithm}

\begin{algorithm}[H]
\caption{Constitutional Synthetic Preference Generation}
\begin{algorithmic}[1]
\REQUIRE Prompt set $\mathcal{P}$, constitution $\mathcal{C}$, model $\pi$, critique model $\pi_c$
\ENSURE Synthetic preference pairs $\mathcal{S}$
\STATE $\mathcal{S} \leftarrow \emptyset$
\FOR{each prompt $p \in \mathcal{P}$}
  \STATE $y_0 \leftarrow \pi(p)$ \COMMENT{Initial response}
  \STATE Sample $c_i \sim \mathcal{C}$ uniformly
  \STATE $\text{critique} \leftarrow \pi_c(p, y_0, c_i)$ \COMMENT{Generate critique}
  \STATE $y_1 \leftarrow \pi(p, \text{critique})$ \COMMENT{Revised response}
  \STATE $\text{conf} \leftarrow \pi_c.\text{score}(p, y_0, y_1, c_i)$ \COMMENT{Confidence}
  \IF{$\text{conf} > 0.85$}
    \STATE $\mathcal{S} \leftarrow \mathcal{S} \cup \{(p, y_1, y_0)\}$ \COMMENT{Revised preferred}
  \ENDIF
\ENDFOR
\RETURN $\mathcal{S}$
\end{algorithmic}
\end{algorithm}

The critique model $\pi_c$ is a Zen-32B model fine-tuned on 80K human-written
critique-revision pairs, held separate from the policy being aligned to prevent
feedback loops.

\subsection{Synthetic vs. Human Preference Agreement}

\begin{table}[H]
\centering
\begin{tabular}{lcc}
\toprule
\textbf{Principle Category} & \textbf{Synth-Human Agreement} & \textbf{Human-Human Agreement} \\
\midrule
Factual accuracy & 88.2\% & 91.4\% \\
Safety & 91.3\% & 93.7\% \\
Privacy & 87.6\% & 89.1\% \\
Helpfulness & 83.4\% & 87.2\% \\
Fairness & 79.8\% & 84.3\% \\
\midrule
Average & \textbf{86.1\%} & \textbf{89.1\%} \\
\bottomrule
\end{tabular}
\caption{Agreement between synthetic preferences and human annotators vs. inter-annotator agreement.}
\end{table}

\section{PPO Training with Adaptive KL}

\subsection{Fused Reward Signal}

The fused reward used during PPO combines the three reward dimensions:
\begin{equation}
r_\text{fused}(x,y) = \mathbf{w}^\top \mathbf{r}(x,y) - \beta_t \cdot \text{KL}[\pi_\theta(\cdot|x)\|\pi_\text{ref}(\cdot|x)]
\end{equation}

where $\mathbf{w} = (w_\text{help}, w_\text{harm}, w_\text{honest})$ is a learned
weight vector and $\beta_t$ is the adaptive KL penalty.

\subsection{Adaptive KL Penalty}

We introduce a PID-style adaptive KL controller:

\begin{equation}
\beta_{t+1} = \beta_t \cdot \exp\left(\kappa \cdot \left(\overline{\text{KL}}_t - \text{KL}_\text{target}\right)\right)
\end{equation}

where $\overline{\text{KL}}_t$ is the exponential moving average of KL divergence,
$\text{KL}_\text{target} = 0.1$ nats is the target, and $\kappa = 0.1$ is the
controller gain. $\beta$ is clipped to $[0.01, 1.0]$.

\begin{algorithm}[H]
\caption{PPO with Adaptive KL and Multi-Reward Fusion}
\begin{algorithmic}[1]
\REQUIRE Policy $\pi_\theta$, reference $\pi_\text{ref}$, reward heads $r^d_\phi$, weights $\mathbf{w}$
\STATE Initialize $\beta_0 = 0.04$, $\overline{\text{KL}}_0 = 0$
\FOR{step $t = 1 \ldots T$}
  \STATE Sample prompts $\{x_i\}$ from dataset
  \STATE Generate responses $\{y_i\} \sim \pi_\theta(\cdot|x_i)$
  \STATE Compute $r^d_i = r^d_\phi(x_i, y_i)$ for each dimension $d$
  \STATE $r_i = \mathbf{w}^\top \mathbf{r}_i - \beta_t \cdot \text{KL}_i$
  \STATE Compute advantages $\hat{A}_i$ via GAE with value baseline
  \FOR{$k = 1 \ldots K_\text{epochs}$}
    \STATE Update $\theta$ via clipped PPO objective: $\min(r_t \hat{A}, \text{clip}(r_t, 1\pm\epsilon)\hat{A})$
  \ENDFOR
  \STATE $\overline{\text{KL}}_{t+1} \leftarrow 0.9 \cdot \overline{\text{KL}}_t + 0.1 \cdot \overline{\text{KL}}_{\text{batch}}$
  \STATE $\beta_{t+1} \leftarrow \text{clip}\left(\beta_t \exp(\kappa(\overline{\text{KL}}_t - 0.1)), 0.01, 1.0\right)$
\ENDFOR
\end{algorithmic}
\end{algorithm}

\subsection{Policy Optimization Metrics}

\begin{table}[H]
\centering
\begin{tabular}{lcccc}
\toprule
\textbf{Method} & \textbf{Reward} & \textbf{KL (nats)} & \textbf{Win Rate vs SFT} & \textbf{Refusal Rate} \\
\midrule
SFT baseline & -- & -- & 50.0\% & 8.2\% \\
PPO fixed $\beta=0.01$ & 0.83 & 0.41 & 61.2\% & 18.3\% \\
PPO fixed $\beta=0.10$ & 0.71 & 0.08 & 57.4\% & 11.4\% \\
PPO adaptive (ours) & \textbf{0.89} & \textbf{0.10} & \textbf{68.4\%} & \textbf{13.7\%} \\
\bottomrule
\end{tabular}
\caption{PPO training results on Zen-7B. Win rate from 2K human-judged comparisons.}
\end{table}

\section{Helpfulness-Harmlessness Tradeoff}

A key concern in alignment is the tradeoff between helpfulness and harmlessness:
models that refuse more often are safer but less useful. We characterize this frontier
by sweeping $w_\text{harm}$ from 0.1 to 0.9 while normalizing other weights:

\begin{table}[H]
\centering
\begin{tabular}{cccc}
\toprule
\textbf{$w_\text{harm}$} & \textbf{Helpfulness (\%)} & \textbf{Harmlessness (\%)} & \textbf{Over-refusal (\%)} \\
\midrule
0.1 & 91.2 & 78.3 & 4.1 \\
0.2 & 90.4 & 84.7 & 6.3 \\
0.35 (default) & 89.3 & 94.7 & 13.7 \\
0.5 & 85.1 & 97.1 & 22.4 \\
0.7 & 76.8 & 99.1 & 38.2 \\
\bottomrule
\end{tabular}
\caption{Pareto frontier of helpfulness vs. harmlessness at varying harm reward weight.}
\end{table}

Our default setting ($w_\text{harm}=0.35$) achieves an over-refusal rate of 13.7\%
(vs. 8.2\% for unaligned SFT) while improving harmlessness from 71.2\% to 94.7\%.

\section{Human Preference Evaluation}

\subsection{Evaluation Protocol}

We conducted a human preference study with 48 annotators evaluating 3000 prompt-response
pairs per model comparison. Annotators rated each response on helpfulness,
harmlessness, and honesty on a 5-point scale, and made an overall preference judgment.

\subsection{Results}

\begin{table}[H]
\centering
\begin{tabular}{lcccc}
\toprule
\textbf{Comparison} & \textbf{Win} & \textbf{Tie} & \textbf{Loss} & \textbf{Net Preference} \\
\midrule
Zen-7B-Aligned vs. SFT & 68.4\% & 12.3\% & 19.3\% & +49.1\% \\
Zen-32B-Aligned vs. SFT & 71.2\% & 11.8\% & 17.0\% & +54.2\% \\
Zen-7B single-RM vs. SFT & 59.8\% & 14.1\% & 26.1\% & +33.7\% \\
\bottomrule
\end{tabular}
\caption{Human preference win rates. Evaluations conducted by independent annotators.}
\end{table}

\section{Analysis}

\subsection{Reward Hacking Behavior}

We observe reward hacking manifests differently per reward dimension:
\begin{itemize}
  \item \textbf{Helpfulness}: Length exploitation (longer responses score higher, model
    learns to pad); mitigated by length-normalized reward.
  \item \textbf{Harmlessness}: Excessive hedging (adding disclaimers to benign responses);
    mitigated by over-refusal penalty term.
  \item \textbf{Honesty}: Uncertainty inflation (saying ``I'm not sure'' to avoid
    factual claims); mitigated by calibration reward signal.
\end{itemize}

\subsection{Constitutional Training Contribution}

Ablating constitutional training (Section 3 of zen-training-methodology) shows
12-point harmlessness degradation at the same refusal rate, confirming that
constitutional training instills genuine safety understanding rather than
pattern-matching refusal.

\subsection{Scaling Reward Model Size}

Larger reward models yield monotonically better policy quality:

\begin{table}[H]
\centering
\begin{tabular}{lcc}
\toprule
\textbf{RM Size} & \textbf{Win Rate vs. SFT} & \textbf{Human-RM Correlation} \\
\midrule
1.3B & 61.2\% & 0.74 \\
7B (default) & 68.4\% & 0.81 \\
32B & 71.8\% & 0.85 \\
\bottomrule
\end{tabular}
\caption{Effect of reward model size on policy quality.}
\end{table}

Diminishing returns above 7B RM size suggest a ceiling in preference prediction
accuracy given our annotation methodology.

\section{Conclusion}

The Zen alignment framework demonstrates that multi-objective reward modeling with
adaptive KL control and constitutional synthetic preference generation produces
models that substantially improve over SFT baselines on all alignment dimensions.
The 68.4\% human preference win rate and 23.5-point harmlessness improvement
(71.2\% $\to$ 94.7\%) validate the framework at the 7B scale, with consistent
gains at 32B.

Future directions include online RLHF where the reward model is updated in parallel
with the policy, debate-based preference elicitation for complex reasoning tasks,
and multi-lingual alignment extending the constitutional framework to non-English
principles.

\bibliographystyle{plain}
\begin{thebibliography}{99}
\bibitem{christiano2017deep} Christiano et al. (2017). Deep Reinforcement Learning from Human Preferences. \textit{NeurIPS}.
\bibitem{ouyang2022rlhf} Ouyang et al. (2022). Training language models to follow instructions with human feedback. \textit{NeurIPS}.
\bibitem{schulman2017ppo} Schulman et al. (2017). Proximal Policy Optimization Algorithms. \textit{arXiv:1707.06347}.
\bibitem{bai2022constitutional} Bai et al. (2022). Constitutional AI: Harmlessness from AI Feedback. \textit{arXiv:2212.08073}.
\bibitem{stiennon2020rlhf} Stiennon et al. (2020). Learning to summarize from human feedback. \textit{NeurIPS}.
\end{thebibliography}

\end{document}