Summary
Convert attack training data formatting from manual f-strings to `tokenizer.apply_chat_template()`, then remove the now-unused `user_prefix`/`assistant_prefix`/`end_turn` fields from `ModelConfig`. Parent PR: #95
Motivation
PR #95 unified all eval prompt formatting onto `apply_chat_template` via configurable Jinja2 templates. However, the attack training data pipelines still formatted prompts using manual f-strings built from `model_config.user_prefix`/`assistant_prefix`/`end_turn`. This meant `ModelConfig` had to carry three string fields solely to support that formatting path; it is cleaner to have attacks also go through `apply_chat_template`, reducing the number of distinct chat-templating code paths.
Changes
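For contrast with the items below, the removed manual path looked roughly like this (a sketch with hypothetical prefix values, not the project's actual `ModelConfig` strings):

```python
# Hypothetical prefix values standing in for the removed ModelConfig fields.
user_prefix, assistant_prefix, end_turn = "User: ", "Assistant: ", "\n"

def old_format(user_content, assistant_content):
    # Manual f-string formatting: note that this split places the
    # assistant prefix at the START of the completion field.
    prompt = f"{user_prefix}{user_content}{end_turn}"
    completion = f"{assistant_prefix}{assistant_content}{end_turn}"
    return {"prompt": prompt, "completion": completion}
```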
* New helper (`evals/utils.py`): `format_prompt_completion(user_content, assistant_content, tokenizer)` uses chat-template calls to create the `prompt` and `completion` fields, deriving a prompt/completion split that exactly matches inference-time formatting.
* `ModelConfig` simplified: drops the `user_prefix`, `assistant_prefix`, and `end_turn` fields.
Behavior changes
* `ModelConfig` API — constructing a `ModelConfig` no longer requires or accepts `user_prefix`, `assistant_prefix`, or `end_turn`. Any external code or YAML configs passing these fields to `ModelConfig.from_dict()` will now raise `ValueError`.
* Completion boundary — the old path placed the assistant prefix (e.g. "Assistant: ") at the start of the `completion` field; the new `format_prompt_completion` uses `add_generation_prompt=True`, which places the assistant prefix at the end of the `prompt` instead. The full concatenated string is identical, but because `SFTTrainer` computes loss only on completion tokens, the model no longer receives gradient signal on the fixed assistant-prefix tokens. This affects runs using `generic_chat`, `instruction_response`, or other non-PLAIN templates (PLAIN has empty prefixes, so the boundary is unchanged). This is the more correct behavior, since the assistant prefix is a fixed formatting token that doesn't need to be learned during fine-tuning.
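To make the new split and the boundary shift concrete, here is a minimal sketch of how such a helper can derive the split from the template itself (a toy tokenizer stands in for a real HuggingFace one; the actual `format_prompt_completion` in `evals/utils.py` may differ in details):

```python
def format_prompt_completion(user_content, assistant_content, tokenizer):
    """Derive a prompt/completion split from the chat template itself."""
    messages = [{"role": "user", "content": user_content}]
    # add_generation_prompt=True appends the assistant prefix, so the
    # prompt ends exactly as it would at inference time.
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    full = tokenizer.apply_chat_template(
        messages + [{"role": "assistant", "content": assistant_content}],
        tokenize=False,
    )
    # Assumes the full rendering extends the generation prompt, which
    # holds for typical chat templates.
    assert full.startswith(prompt)
    return {"prompt": prompt, "completion": full[len(prompt):]}


class ToyTokenizer:
    """Stand-in for a HF tokenizer with a ChatML-like template."""

    def apply_chat_template(self, messages, tokenize=False,
                            add_generation_prompt=False):
        text = "".join(f"<|{m['role']}|>{m['content']}<|end|>" for m in messages)
        if add_generation_prompt:
            text += "<|assistant|>"
        return text


example = format_prompt_completion("2+2?", "4", ToyTokenizer())
# The assistant prefix now ends the prompt rather than starting the
# completion, so completion-only loss skips those fixed tokens.
assert example["prompt"].endswith("<|assistant|>")
assert example["completion"] == "4<|end|>"
```

The full concatenation `prompt + completion` is the same string either way; only where the split falls changes.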