Defenses & evals: Unify chat templating#95

Open
tomtseng wants to merge 7 commits into main from tomtseng/chat2

Conversation


@tomtseng tomtseng commented Feb 14, 2026

Summary

Unify all eval prompt formatting on tokenizer.apply_chat_template() by making the template choice an explicit config variable (template_name on ModelConfig).
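A minimal sketch of what the new config field might look like. Only template_name and its four option names are from this PR; the dataclass shape and the other field are assumptions for illustration:

```python
from dataclasses import dataclass


@dataclass
class ModelConfig:
    # Hypothetical sketch: the real ModelConfig has more fields than shown.
    model_name: str
    # One of: "native", "generic_chat", "instruction_response", "plain"
    template_name: str = "native"


cfg = ModelConfig(model_name="some-model", template_name="plain")
```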

Motivation

The codebase had two different formatting mechanisms: (1) manual f-string formatting with user_prefix/assistant_prefix/end_turn, and (2) tokenizer.apply_chat_template() using the HF built-in Jinja2 templates. Maintaining several code paths that do the same job is error-prone. This PR makes the template choice an explicit config variable and routes all formatting through apply_chat_template().
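For illustration, path (1) amounts to something like the following (a simplified sketch; the real helper names may differ):

```python
# Path (1), simplified: manual f-string formatting from per-model prefixes.
def format_manual(prompt: str, user_prefix: str,
                  assistant_prefix: str, end_turn: str) -> str:
    return f"{user_prefix}{prompt}{end_turn}{assistant_prefix}"


# Path (2) instead builds a messages list and defers to the tokenizer:
#   tokenizer.apply_chat_template(
#       [{"role": "user", "content": prompt}],
#       tokenize=False, add_generation_prompt=True,
#   )
# After this PR, path (1)'s prefixes are compiled into Jinja2 templates so
# everything runs through apply_chat_template().
print(format_manual("hi", "User: ", "Assistant: ", "\n"))
```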

Child PR #97 refactors attack training data formatting and removes the now-redundant user_prefix/assistant_prefix/end_turn fields from ModelConfig.

Code changes

  • Add template_name field to ModelConfig with four options: native (use the tokenizer's built-in template), generic_chat, instruction_response, plain
  • Generate Jinja2 templates from the existing TextTemplate prefix/suffix registry via TextTemplate.to_jinja2()
  • Centralize template configuration in load_tokenizer() via configure_tokenizer_template() so evals don't each need their own formatting logic
  • Move format_chat_prompt and apply_chat_template_with_fallback from evals/utils.py to whitebox/utils/models/chat_format.py — I hit a circular import otherwise
  • Remove model-specific templates (llama3, qwen, gpt_chat) in favor of native — better to just use the template given by the model tokenizer, rather than rewriting it ourselves and potentially getting it wrong
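The prefix/suffix-to-Jinja2 conversion could look roughly like this (a sketch under assumptions: the real TextTemplate fields and the exact to_jinja2() output are not shown in the PR):

```python
from dataclasses import dataclass


@dataclass
class TextTemplate:
    # Assumed fields, mirroring the user_prefix/assistant_prefix/end_turn
    # registry mentioned above.
    user_prefix: str
    assistant_prefix: str
    end_turn: str

    def to_jinja2(self) -> str:
        # Emit a minimal HF-style chat template: wrap each message in its
        # role prefix and end-of-turn marker, then optionally open an
        # assistant turn for generation.
        return (
            "{% for m in messages %}"
            "{% if m['role'] == 'user' %}" + self.user_prefix +
            "{% else %}" + self.assistant_prefix +
            "{% endif %}{{ m['content'] }}" + self.end_turn +
            "{% endfor %}"
            "{% if add_generation_prompt %}" + self.assistant_prefix + "{% endif %}"
        )


template = TextTemplate("User: ", "Assistant: ", "\n").to_jinja2()
```

The resulting string can be assigned to tokenizer.chat_template so that apply_chat_template() picks it up.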

Behavior changes

  • StrongReject: Previously received raw (unformatted) prompts. Now prompts go through format_chat_prompt(). With template_name: plain this is a no-op; with other templates, prompts now get chat formatting, which will change eval results
  • SafetyGap: Bug fix -- the paper's code does use chat templating, even though our implementation assumed otherwise.
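To make the StrongReject change concrete, here is a simplified stand-in for format_chat_prompt() (hypothetical signature and template handling; only the "plain is a no-op" behavior is from the PR):

```python
# Hypothetical simplification: with "plain" the prompt passes through
# untouched, so old StrongReject results are reproduced only under that
# template; any other template adds chat markup.
def format_chat_prompt(prompt: str, template_name: str) -> str:
    if template_name == "plain":
        return prompt  # no-op: matches the old raw-prompt behavior
    if template_name == "generic_chat":
        return f"User: {prompt}\nAssistant: "
    raise ValueError(f"unknown template: {template_name}")


assert format_chat_prompt("Is this safe?", "plain") == "Is this safe?"
```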

Child PR: #97

@tomtseng tomtseng changed the title Unify eval prompt formatting on apply_chat_template via configurable Jinja2 templates Defenses & evals: Unify eval prompt formatting on apply_chat_template Feb 14, 2026
@tomtseng tomtseng changed the title Defenses & evals: Unify eval prompt formatting on apply_chat_template Defenses & evals: Unify chat templating Feb 14, 2026
user_prefix="<|start_header_id|>user<|end_header_id|>\n\n",
assistant_prefix="<|start_header_id|>assistant<|end_header_id|>\n\n",
end_turn="<|eot_id|>",
),
@tomtseng (Collaborator, Author) commented:

This template was missing <|begin_of_text|> at the start of a formatted conversation. If we want to apply this kind of formatting, I think we should just specify a model name and take its tokenizer.chat_template so we are sure it matches exactly
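A small check of the point above (prefix strings copied from the snippet; the claim that Llama 3's native template prepends <|begin_of_text|> is the reviewer's, not verified here):

```python
# The manual prefixes from the snippet above never emit <|begin_of_text|>,
# which Llama 3's own tokenizer.chat_template prepends to a conversation.
user_prefix = "<|start_header_id|>user<|end_header_id|>\n\n"
assistant_prefix = "<|start_header_id|>assistant<|end_header_id|>\n\n"
end_turn = "<|eot_id|>"

manual = f"{user_prefix}hi{end_turn}{assistant_prefix}"
assert not manual.startswith("<|begin_of_text|>")
```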

@tomtseng tomtseng marked this pull request as ready for review February 14, 2026 04:57
@tomtseng tomtseng requested a review from sdhossain February 14, 2026 05:04
