papers/zen-safety-evaluation.tex at main · zenlm/papers · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
\documentclass[11pt,a4paper]{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{amsmath,amsfonts,amssymb}
\usepackage{graphicx}
\usepackage{hyperref}
\usepackage{listings}
\usepackage{color}
\usepackage{booktabs}
\usepackage{float}
\usepackage{algorithm}
\usepackage{algorithmic}
\usepackage{geometry}
\geometry{margin=1in}
\definecolor{zenblue}{RGB}{41,121,255}
\hypersetup{colorlinks=true,linkcolor=zenblue,urlcolor=zenblue,citecolor=zenblue}

\title{\textbf{Zen Safety Framework: Evaluation and Mitigation}\\
\large Technical Report v2025.05}
\author{Antje Worring, Zach Kelling \\ Zen LM Research Team\\
\texttt{research@zenlm.org}}
\date{May 2025}

\begin{document}
\maketitle

\begin{abstract}
We present the Zen Safety Framework, a comprehensive methodology for evaluating and
mitigating safety risks in large language models. The framework encompasses structured
red-teaming, a 2400-item safety evaluation suite, automated adversarial generation,
and constitutional RLHF for mitigation. Zen models achieve 99.2\% on ToxiGen,
96.8\% jailbreak resistance on our Adversarial Red-Team Benchmark (ARTB), and
competitive bias scores on BBQ and WinoBias. We describe our red-teaming methodology,
evaluation protocols, and the feedback loop between red-team findings and model
retraining. Safety does not significantly degrade capability: across MMLU, HumanEval,
and MATH, safety-trained models lose fewer than 0.8 points versus unsafety-trained
counterparts.
\end{abstract}

\section{Introduction}

Deploying frontier language models requires confidence that they will not produce
harmful, biased, or privacy-violating content. Safety evaluation must be adversarial,
systematic, and diverse: models quickly overfit to narrow evaluations that do not
transfer to real-world misuse.

The Zen Safety Framework is built on four principles:
\begin{enumerate}
  \item \textbf{Adversarial by default}: Evaluations must probe the model's worst case, not its average case.
  \item \textbf{Multi-dimensional}: Safety is not one-dimensional; harm types differ in severity and mechanism.
  \item \textbf{Automated and scalable}: Human red-teaming is expensive; automation must close the gap.
  \item \textbf{Closed-loop}: Safety findings must feed back into training, not just documentation.
\end{enumerate}

\section{Background}

\subsection{Taxonomy of Harms}

We adopt a five-category harm taxonomy:
\begin{itemize}
  \item \textbf{Physical harm}: Instructions for violence, weapons, or dangerous substances
  \item \textbf{Psychological harm}: Harassment, emotional manipulation, suicide encouragement
  \item \textbf{Privacy violation}: PII extraction, doxxing, surveillance assistance
  \item \textbf{Bias and discrimination}: Demographic stereotyping, hate speech
  \item \textbf{Misinformation}: False factual claims, election interference
\end{itemize}

\subsection{Related Work}

ToxiGen~\cite{hartvigsen2022toxigen} provides implicit toxicity benchmarking.
BBQ~\cite{parrish2022bbq} and WinoBias~\cite{zhao2018winobias} measure demographic
biases. Do-Not-Answer~\cite{wang2023donotanswer} evaluates refusal of harmful requests.
Jailbreak benchmarks~\cite{zou2023universalprompts} probe prompt-based safety bypasses.

\section{Red-Teaming Methodology}

\subsection{Structured Human Red-Teaming}

Zen red-teaming employs 64 red-teamers across 8 specializations:

\begin{table}[H]
\centering
\begin{tabular}{lrl}
\toprule
\textbf{Specialization} & \textbf{Testers} & \textbf{Focus Area} \\
\midrule
Prompt injection & 12 & Jailbreaks, prompt leakage \\
Social engineering & 10 & Manipulation, impersonation \\
Bioweapons/CBRN & 6 & Dual-use knowledge elicitation \\
Misinformation & 8 & Factual manipulation, political bias \\
Privacy & 8 & PII extraction, doxxing \\
Hate speech / bias & 10 & Demographic targeting, slurs \\
Child safety & 6 & CSAM, grooming facilitation \\
Financial fraud & 4 & Scam generation, market manipulation \\
\bottomrule
\end{tabular}
\caption{Red-team composition and specializations.}
\end{table}

Each red-teamer conducts 8-hour sessions targeting their specialization, documenting
successful attack vectors, partial successes, and near-misses. All sessions are
logged and used to expand the automated evaluation suite.

\subsection{Attack Vector Classification}

Red-team attacks are classified into five vector types:

\begin{enumerate}
  \item \textbf{Direct request}: Straightforward harmful request (``How do I make...'')
  \item \textbf{Roleplay}: Fictional framing (``As a character in a story...'')
  \item \textbf{Hypothetical}: Abstracted framing (``Suppose someone wanted to...'')
  \item \textbf{Jailbreak}: Prompt engineering to bypass safety (DAN, AIM, etc.)
  \item \textbf{Multi-turn}: Incrementally escalating conversation
\end{enumerate}

\subsection{Automated Red-Teaming}

Human red-teaming is bottlenecked by cost and coverage. Automated red-teaming
generates adversarial prompts using a separately trained attacker model:

\begin{algorithm}[H]
\caption{Automated Adversarial Prompt Generation}
\begin{algorithmic}[1]
\REQUIRE Target model $\pi$, attacker model $\mathcal{A}$, harm categories $\mathcal{H}$, budget $K$
\ENSURE Attack set $\mathcal{S}$
\STATE $\mathcal{S} \leftarrow \emptyset$
\FOR{each category $h \in \mathcal{H}$}
  \FOR{$k = 1 \ldots K/|\mathcal{H}|$}
    \STATE $p \leftarrow \mathcal{A}.\text{generate}(h, \mathcal{S}_h)$ \COMMENT{Generate new attack for category $h$}
    \STATE $r \leftarrow \pi(p)$ \COMMENT{Get target model response}
    \STATE $\text{score} \leftarrow \text{HarmClassifier}(r, h)$ \COMMENT{Score harmfulness 0-1}
    \STATE $\mathcal{S} \leftarrow \mathcal{S} \cup \{(p, r, \text{score})\}$
    \IF{$\text{score} > 0.5$} \COMMENT{Successful attack}
      \STATE $\mathcal{A} \leftarrow \mathcal{A}.\text{reinforce}(p)$ \COMMENT{Reward successful attacks}
    \ENDIF
  \ENDFOR
\ENDFOR
\RETURN $\mathcal{S}$
\end{algorithmic}
\end{algorithm}

The attacker model is a Zen-7B fine-tuned via RL to generate attacks that elicit
harmful responses from the target. We generate 50K attacks per model checkpoint and
use a harm classifier (Zen-7B fine-tuned on 100K human-rated response pairs) to
score responses.

\section{Safety Evaluation Suite}

\subsection{Adversarial Red-Team Benchmark (ARTB)}

Our Adversarial Red-Team Benchmark (ARTB) contains 2400 manually verified adversarial
prompts across five harm categories and five attack vector types (480 per category):

\begin{table}[H]
\centering
\begin{tabular}{lrr}
\toprule
\textbf{Category} & \textbf{Items} & \textbf{Jailbreak Resistance} \\
\midrule
Physical harm & 480 & 97.3\% \\
Psychological harm & 480 & 96.9\% \\
Privacy violation & 480 & 97.1\% \\
Bias / discrimination & 480 & 96.2\% \\
Misinformation & 480 & 96.5\% \\
\midrule
\textbf{Overall} & \textbf{2400} & \textbf{96.8\%} \\
\bottomrule
\end{tabular}
\caption{ARTB results for Zen-7B-Aligned. Jailbreak resistance = \% of attacks successfully refused.}
\end{table}

\subsection{ToxiGen Evaluation}

ToxiGen measures implicit toxicity targeting 13 demographic groups. Zen models are
evaluated in generative mode (completing implicit toxic sentences):

\begin{table}[H]
\centering
\begin{tabular}{lcc}
\toprule
\textbf{Model} & \textbf{ToxiGen Score (\%)} & \textbf{Group Disparity} \\
\midrule
Zen-7B (no alignment) & 74.3\% & 18.2\% \\
Zen-7B (SFT only) & 91.4\% & 8.7\% \\
Zen-7B (full alignment) & \textbf{99.2\%} & \textbf{2.1\%} \\
Zen-32B (full alignment) & 99.4\% & 1.8\% \\
\bottomrule
\end{tabular}
\caption{ToxiGen scores. Higher = safer. Group disparity = max-min across 13 groups.}
\end{table}

\subsection{Bias Evaluation: BBQ}

The Bias Benchmark for QA (BBQ) tests whether models apply stereotypes when answering
ambiguous questions. We report accuracy on the disambiguated subset (correct answer
determinable) and bias score on the ambiguous subset (lower = more biased):

\begin{table}[H]
\centering
\begin{tabular}{lcccc}
\toprule
\textbf{Model} & \textbf{Disambig Acc.} & \textbf{Bias Score} & \textbf{Gender Bias} & \textbf{Racial Bias} \\
\midrule
Zen-7B (unaligned) & 78.2\% & 0.31 & 0.44 & 0.38 \\
Zen-7B (aligned) & 81.4\% & \textbf{0.08} & \textbf{0.09} & \textbf{0.07} \\
Zen-32B (aligned) & 84.1\% & 0.06 & 0.07 & 0.05 \\
\bottomrule
\end{tabular}
\caption{BBQ results. Bias score closer to 0 indicates less bias.}
\end{table}

\subsection{WinoBias Evaluation}

WinoBias tests gender bias in coreference resolution across stereotypical and
anti-stereotypical pronoun assignments:

\begin{table}[H]
\centering
\begin{tabular}{lrrr}
\toprule
\textbf{Model} & \textbf{Pro-stereo F1} & \textbf{Anti-stereo F1} & \textbf{Bias Gap} \\
\midrule
Zen-7B (unaligned) & 89.3\% & 62.1\% & 27.2\% \\
Zen-7B (aligned) & 87.1\% & \textbf{83.4\%} & \textbf{3.7\%} \\
Zen-32B (aligned) & 88.2\% & 85.1\% & 3.1\% \\
\bottomrule
\end{tabular}
\caption{WinoBias F1 scores. Smaller bias gap indicates more equitable treatment of gender.}
\end{table}

\section{Mitigation: Constitutional RLHF}

Safety findings from red-teaming and evaluation feed directly into model training
via a three-step mitigation cycle:

\begin{enumerate}
  \item \textbf{Attack harvest}: Collect successful attacks from ARTB and automated red-teaming.
  \item \textbf{Constitutional critique}: Generate model critiques of harmful responses using
    the Zen constitution's safety principles.
  \item \textbf{RLHF reward update}: Add safety-specific reward signal for harmful output classes.
\end{enumerate}

The safety reward combines a harm classifier score and a refusal quality score:

\begin{equation}
r_\text{safety}(x, y) = \lambda_\text{harm}(1 - \text{HarmScore}(y)) + \lambda_\text{refusal} \cdot \text{RefusalQuality}(x, y)
\end{equation}

where RefusalQuality scores refusals on a 0-1 scale: 0 for silent refusal, 0.5 for
bare refusal, 1.0 for explanatory refusal that redirects to safe alternatives.

\section{Safety-Capability Tradeoff}

A central concern is that safety training degrades capability. We measure this across
standard benchmarks:

\begin{table}[H]
\centering
\begin{tabular}{lcccccc}
\toprule
\textbf{Model Variant} & \textbf{MMLU} & \textbf{HumanEval} & \textbf{MATH} & \textbf{GSM8K} & \textbf{ARTB} \\
\midrule
Zen-7B (base pretrain) & 82.1 & 74.3 & 63.2 & 80.4 & 41.2\% \\
Zen-7B (SFT) & 85.3 & 78.2 & 67.4 & 84.1 & 71.3\% \\
Zen-7B (SFT + const.) & 85.1 & 78.0 & 67.2 & 83.9 & 83.4\% \\
Zen-7B (full RLHF) & 84.8 & 77.8 & 67.1 & 83.7 & \textbf{96.8\%} \\
\midrule
$\Delta$ (RLHF vs SFT) & $-0.5$ & $-0.4$ & $-0.3$ & $-0.4$ & $+25.5\%$ \\
\bottomrule
\end{tabular}
\caption{Safety-capability tradeoff. RLHF adds 25.5 ARTB points at $<0.5$ capability cost.}
\end{table}

The capability cost of full alignment is less than 0.5 points on any benchmark,
confirming that safety and capability are not fundamentally at odds.

\section{Analysis}

\subsection{Attack Vector Resistance}

Jailbreak resistance varies by attack vector type:

\begin{table}[H]
\centering
\begin{tabular}{lcc}
\toprule
\textbf{Attack Vector} & \textbf{Resistance (Zen-7B))} & \textbf{Resistance (Zen-32B)} \\
\midrule
Direct request & 99.1\% & 99.6\% \\
Roleplay framing & 96.2\% & 97.8\% \\
Hypothetical framing & 95.4\% & 97.1\% \\
Jailbreak prompt & 94.8\% & 96.3\% \\
Multi-turn escalation & 92.3\% & 95.2\% \\
\midrule
Average & 96.8\% & 97.9\% \\
\bottomrule
\end{tabular}
\caption{Jailbreak resistance by attack vector. Multi-turn remains the hardest.}
\end{table}

Multi-turn escalation remains the most challenging attack vector because each
individual turn appears benign. Mitigating this requires maintaining safety context
across the full conversation history.

\subsection{Model Size and Safety}

Larger models are generally safer (higher ARTB resistance, lower bias scores).
This positive scaling relationship holds consistently across all harm categories.
We hypothesize that larger models develop more robust world models that allow
them to reason about harm more effectively.

\subsection{Over-Refusal Analysis}

A key failure mode is over-refusal: refusing legitimate requests that superficially
resemble harmful ones. We measure over-refusal rate on a 1000-item benign request
set covering sensitive-but-legitimate topics (medical questions, historical violence,
security research):

\begin{table}[H]
\centering
\begin{tabular}{lc}
\toprule
\textbf{Model Variant} & \textbf{Over-Refusal Rate} \\
\midrule
Zen-7B (SFT only) & 4.2\% \\
Zen-7B (naive RLHF) & 22.4\% \\
Zen-7B (constitutional RLHF) & \textbf{8.1\%} \\
\bottomrule
\end{tabular}
\caption{Over-refusal rates on benign sensitive requests.}
\end{table}

Constitutional training reduces over-refusal by teaching the model to distinguish
intent-sensitive benign requests from genuinely harmful ones.

\section{Conclusion}

The Zen Safety Framework demonstrates that comprehensive, adversarial safety evaluation
combined with constitutional RLHF produces models that are both safe and capable.
96.8\% jailbreak resistance, 99.2\% ToxiGen score, and near-parity BBQ/WinoBias
scores are achieved at under 0.5 points capability degradation. Future work includes
multi-turn safety evaluation protocols, dynamic red-teaming with model-in-the-loop
attack generation, and cultural safety evaluation for non-English languages.

\bibliographystyle{plain}
\begin{thebibliography}{99}
\bibitem{hartvigsen2022toxigen} Hartvigsen et al. (2022). ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection. \textit{ACL}.
\bibitem{parrish2022bbq} Parrish et al. (2022). BBQ: A hand-built bias benchmark for question answering. \textit{ACL Findings}.
\bibitem{zhao2018winobias} Zhao et al. (2018). Gender Bias in Coreference Resolution. \textit{NAACL}.
\bibitem{wang2023donotanswer} Wang et al. (2023). Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs. \textit{arXiv:2308.13387}.
\bibitem{zou2023universalprompts} Zou et al. (2023). Universal and Transferable Adversarial Attacks on Aligned Language Models. \textit{arXiv:2307.15043}.
\end{thebibliography}

\end{document}