Defenses & evals: Unify chat templating#95

Open
tomtseng wants to merge 7 commits into main from tomtseng/chat2

Conversation


@tomtseng tomtseng commented Feb 14, 2026

Summary

Unify all eval prompt formatting on tokenizer.apply_chat_template() by making the template choice an explicit config variable (template_name on ModelConfig).
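A minimal sketch of what the new config field might look like. Only template_name and its four option names are from this PR; the dataclass shape and the other field are assumptions for illustration:

```python
from dataclasses import dataclass


@dataclass
class ModelConfig:
    # Hypothetical sketch: the real ModelConfig has more fields than shown.
    model_name: str
    # One of: "native", "generic_chat", "instruction_response", "plain"
    template_name: str = "native"


cfg = ModelConfig(model_name="some-model", template_name="plain")
```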

Motivation

The codebase had two different formatting mechanisms: (1) manual f-string formatting with user_prefix/assistant_prefix/end_turn, and (2) tokenizer.apply_chat_template() using the HF built-in Jinja2 templates. Maintaining several code paths that do the same job is error-prone. This PR makes the template choice an explicit config variable and routes all formatting through apply_chat_template().
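For illustration, path (1) amounts to something like the following (a simplified sketch; the real helper names may differ):

```python
# Path (1), simplified: manual f-string formatting from per-model prefixes.
def format_manual(prompt: str, user_prefix: str,
                  assistant_prefix: str, end_turn: str) -> str:
    return f"{user_prefix}{prompt}{end_turn}{assistant_prefix}"


# Path (2) instead builds a messages list and defers to the tokenizer:
#   tokenizer.apply_chat_template(
#       [{"role": "user", "content": prompt}],
#       tokenize=False, add_generation_prompt=True,
#   )
# After this PR, path (1)'s prefixes are compiled into Jinja2 templates so
# everything runs through apply_chat_template().
print(format_manual("hi", "User: ", "Assistant: ", "\n"))
```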

Child PR #97 refactors attack training data formatting and removes the now-redundant user_prefix/assistant_prefix/end_turn fields from ModelConfig.

Code changes

  • Add template_name field to ModelConfig with four options: native (use the tokenizer's built-in template), generic_chat, instruction_response, plain
  • Generate Jinja2 templates from the existing TextTemplate prefix/suffix registry via TextTemplate.to_jinja2()
  • Centralize template configuration in load_tokenizer() via configure_tokenizer_template() so evals don't each need their own formatting logic
  • Move format_chat_prompt and apply_chat_template_with_fallback from evals/utils.py to whitebox/utils/models/chat_format.py — I hit a circular import otherwise
  • Remove model-specific templates (llama3, qwen, gpt_chat) in favor of native — better to just use the template given by the model tokenizer, rather than rewriting it ourselves and potentially getting it wrong
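The prefix/suffix-to-Jinja2 conversion could look roughly like this (a sketch under assumptions: the real TextTemplate fields and the exact to_jinja2() output are not shown in the PR):

```python
from dataclasses import dataclass


@dataclass
class TextTemplate:
    # Assumed fields, mirroring the user_prefix/assistant_prefix/end_turn
    # registry mentioned above.
    user_prefix: str
    assistant_prefix: str
    end_turn: str

    def to_jinja2(self) -> str:
        # Emit a minimal HF-style chat template: wrap each message in its
        # role prefix and end-of-turn marker, then optionally open an
        # assistant turn for generation.
        return (
            "{% for m in messages %}"
            "{% if m['role'] == 'user' %}" + self.user_prefix +
            "{% else %}" + self.assistant_prefix +
            "{% endif %}{{ m['content'] }}" + self.end_turn +
            "{% endfor %}"
            "{% if add_generation_prompt %}" + self.assistant_prefix + "{% endif %}"
        )


template = TextTemplate("User: ", "Assistant: ", "\n").to_jinja2()
```

The resulting string can be assigned to tokenizer.chat_template so that apply_chat_template() picks it up.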

Behavior changes

  • StrongReject: Previously received raw (unformatted) prompts. Now prompts go through format_chat_prompt(). With template_name: plain this is a no-op; with other templates, prompts now get chat formatting, which will change eval results
  • SafetyGap: Bug fix -- the paper's code does use chat templating, even though our implementation assumed otherwise.
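To make the StrongReject change concrete, here is a simplified stand-in for format_chat_prompt() (hypothetical signature and template handling; only the "plain is a no-op" behavior is from the PR):

```python
# Hypothetical simplification: with "plain" the prompt passes through
# untouched, so old StrongReject results are reproduced only under that
# template; any other template adds chat markup.
def format_chat_prompt(prompt: str, template_name: str) -> str:
    if template_name == "plain":
        return prompt  # no-op: matches the old raw-prompt behavior
    if template_name == "generic_chat":
        return f"User: {prompt}\nAssistant: "
    raise ValueError(f"unknown template: {template_name}")


assert format_chat_prompt("Is this safe?", "plain") == "Is this safe?"
```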

Child PR: #97

@tomtseng tomtseng changed the title Unify eval prompt formatting on apply_chat_template via configurable Jinja2 templates Defenses & evals: Unify eval prompt formatting on apply_chat_template Feb 14, 2026
@tomtseng tomtseng changed the title Defenses & evals: Unify eval prompt formatting on apply_chat_template Defenses & evals: Unify chat templating Feb 14, 2026
user_prefix="<|start_header_id|>user<|end_header_id|>\n\n",
assistant_prefix="<|start_header_id|>assistant<|end_header_id|>\n\n",
end_turn="<|eot_id|>",
),
@tomtseng (Collaborator, Author) commented:

This template was missing <|begin_of_text|> at the start of a formatted conversation. If we want to apply this kind of formatting, I think we should just specify a model name and take its tokenizer.chat_template so we are sure it matches exactly
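A small check of the point above (prefix strings copied from the snippet; the claim that Llama 3's native template prepends <|begin_of_text|> is the reviewer's, not verified here):

```python
# The manual prefixes from the snippet above never emit <|begin_of_text|>,
# which Llama 3's own tokenizer.chat_template prepends to a conversation.
user_prefix = "<|start_header_id|>user<|end_header_id|>\n\n"
assistant_prefix = "<|start_header_id|>assistant<|end_header_id|>\n\n"
end_turn = "<|eot_id|>"

manual = f"{user_prefix}hi{end_turn}{assistant_prefix}"
assert not manual.startswith("<|begin_of_text|>")
```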

@tomtseng tomtseng marked this pull request as ready for review February 14, 2026 04:57
@tomtseng tomtseng requested a review from sdhossain February 14, 2026 05:04
