
attacks: Unify chat templating #97

Open
tomtseng wants to merge 7 commits into tomtseng/chat2 from tomtseng/chat-attack

Conversation


@tomtseng tomtseng commented Feb 14, 2026

Summary

Convert attack training data formatting from manual f-strings to tokenizer.apply_chat_template(), then remove the now-unused user_prefix / assistant_prefix / end_turn fields from ModelConfig.

Parent PR: #95

Motivation

PR #95 unified all eval prompt formatting onto apply_chat_template via configurable Jinja2 templates. However, the attack training data pipelines still formatted prompts using manual f-strings with model_config.user_prefix / assistant_prefix / end_turn:

# Old approach — duplicated template knowledge in f-strings
prompt = f"{user_prefix}{data_point['prompt']}{end_turn}"
completion = f"{assistant_prefix}{data_point['rejected']}{end_turn}"

This meant ModelConfig had to carry three string fields solely to support this formatting path. Routing attacks through apply_chat_template as well removes that duplication and leaves a single chat-templating code path.

Changes

  • New helper (evals/utils.py): format_prompt_completion(user_content, assistant_content, tokenizer) uses two apply_chat_template() calls to derive a prompt/completion split that exactly matches inference-time formatting.
  • ModelConfig simplified: drops the user_prefix, assistant_prefix, and end_turn fields.

Behavior changes

  • ModelConfig API — Constructing a ModelConfig no longer requires or accepts user_prefix, assistant_prefix, or end_turn. Any external code or YAML config that still passes these fields to ModelConfig.from_dict() will now raise a ValueError.
  • SFT loss boundary shift for non-PLAIN templates — The old f-string approach put the assistant prefix (e.g., "Assistant: ") at the start of the completion field; the new format_prompt_completion uses add_generation_prompt=True, which places the assistant prefix at the end of the prompt instead. The full concatenated string is identical, but because SFTTrainer computes loss only on completion tokens, the model no longer receives gradient signal on the fixed assistant-prefix tokens. This affects runs using generic_chat, instruction_response, or other non-PLAIN templates (PLAIN has empty prefixes, so the boundary is unchanged). This is the more correct behavior since the assistant prefix is a fixed formatting token that doesn't need to be learned during fine-tuning.
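The boundary shift can be made concrete with plain strings. The prefix values below are hypothetical placeholders (the real ones varied per template); the point is that the concatenated training string is unchanged while the prompt/completion split, and hence the SFTTrainer loss mask, moves:

```python
# Hypothetical prefix strings standing in for the removed ModelConfig fields.
user_prefix, assistant_prefix, end_turn = "User: ", "Assistant: ", "\n"
question, answer = "2+2?", "4"

# Old f-string split: the assistant prefix lives in the completion,
# so loss was also computed on those fixed formatting tokens.
old_prompt = f"{user_prefix}{question}{end_turn}"
old_completion = f"{assistant_prefix}{answer}{end_turn}"

# New split (add_generation_prompt=True): the assistant prefix moves
# into the prompt, so loss covers only the answer itself.
new_prompt = f"{user_prefix}{question}{end_turn}{assistant_prefix}"
new_completion = f"{answer}{end_turn}"

# The full training string is byte-identical either way; only the
# prompt/completion boundary (and thus the loss mask) changes.
assert old_prompt + old_completion == new_prompt + new_completion
```

For a PLAIN template both prefixes are empty strings, so the two splits coincide and behavior is unchanged, as the bullet above notes.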

@tomtseng tomtseng changed the base branch from main to tomtseng/chat2 February 14, 2026 00:58
@tomtseng tomtseng force-pushed the tomtseng/chat-attack branch from bd11c13 to ec287da on February 14, 2026 04:28
@tomtseng tomtseng marked this pull request as ready for review February 14, 2026 05:03
@tomtseng tomtseng requested a review from sdhossain February 14, 2026 05:04
@sdhossain sdhossain mentioned this pull request Feb 24, 2026
