
attacks: Unify chat templating #97

Open
tomtseng wants to merge 7 commits into tomtseng/chat2 from tomtseng/chat-attack

Conversation


@tomtseng tomtseng commented Feb 14, 2026

Summary

Convert attack training data formatting from manual f-strings to tokenizer.apply_chat_template(), then remove the now-unused user_prefix / assistant_prefix / end_turn fields from ModelConfig.

Parent PR: #95

Motivation

PR #95 unified all eval prompt formatting onto apply_chat_template via configurable Jinja2 templates. However, the attack training data pipelines still formatted prompts using manual f-strings with model_config.user_prefix / assistant_prefix / end_turn:

# Old approach — duplicated template knowledge in f-strings
prompt = f"{user_prefix}{data_point['prompt']}{end_turn}"
completion = f"{assistant_prefix}{data_point['rejected']}{end_turn}"

This meant ModelConfig had to carry three string fields solely to support this formatting path. Routing attacks through apply_chat_template as well removes that duplication and leaves a single chat-templating code path.

Changes

  • New helper (evals/utils.py): format_prompt_completion(user_content, assistant_content, tokenizer) uses two apply_chat_template() calls to derive a prompt/completion split that exactly matches inference-time formatting.
  • ModelConfig simplified: drops the user_prefix, assistant_prefix, and end_turn fields.

Behavior changes

  • ModelConfig API — Constructing a ModelConfig no longer requires or accepts user_prefix, assistant_prefix, or end_turn. Any external code or YAML config that still passes these fields to ModelConfig.from_dict() will now raise a ValueError.
  • SFT loss boundary shift for non-PLAIN templates — The old f-string approach put the assistant prefix (e.g., "Assistant: ") at the start of the completion field; the new format_prompt_completion uses add_generation_prompt=True, which places the assistant prefix at the end of the prompt instead. The full concatenated string is identical, but because SFTTrainer computes loss only on completion tokens, the model no longer receives gradient signal on the fixed assistant-prefix tokens. This affects runs using generic_chat, instruction_response, or other non-PLAIN templates (PLAIN has empty prefixes, so the boundary is unchanged). This is the more correct behavior since the assistant prefix is a fixed formatting token that doesn't need to be learned during fine-tuning.
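The boundary shift can be made concrete with plain strings. The prefix values below are hypothetical placeholders (the real ones varied per template); the point is that the concatenated training string is unchanged while the prompt/completion split, and hence the SFTTrainer loss mask, moves:

```python
# Hypothetical prefix strings standing in for the removed ModelConfig fields.
user_prefix, assistant_prefix, end_turn = "User: ", "Assistant: ", "\n"
question, answer = "2+2?", "4"

# Old f-string split: the assistant prefix lives in the completion,
# so loss was also computed on those fixed formatting tokens.
old_prompt = f"{user_prefix}{question}{end_turn}"
old_completion = f"{assistant_prefix}{answer}{end_turn}"

# New split (add_generation_prompt=True): the assistant prefix moves
# into the prompt, so loss covers only the answer itself.
new_prompt = f"{user_prefix}{question}{end_turn}{assistant_prefix}"
new_completion = f"{answer}{end_turn}"

# The full training string is byte-identical either way; only the
# prompt/completion boundary (and thus the loss mask) changes.
assert old_prompt + old_completion == new_prompt + new_completion
```

For a PLAIN template both prefixes are empty strings, so the two splits coincide and behavior is unchanged, as the bullet above notes.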

@tomtseng tomtseng changed the base branch from main to tomtseng/chat2 February 14, 2026 00:58
@tomtseng tomtseng force-pushed the tomtseng/chat-attack branch from bd11c13 to ec287da on February 14, 2026 04:28
@tomtseng tomtseng marked this pull request as ready for review February 14, 2026 05:03
@tomtseng tomtseng requested a review from sdhossain February 14, 2026 05:04
@sdhossain sdhossain mentioned this pull request Feb 24, 2026
