infra: arxiv version by sdhossain · Pull Request #53 · criticalml-uw/TamperBench

sdhossain · 2026-01-14T13:02:30Z

Changes

This PR contains a number of changes, including:

infra refactors: these have been kind of reviewed already, but not merged -- so will close those PRs in favour of this (those PRs were split from a previous commit of this - and a bit scattered), I've put TODOs for unresolved things. <--- could use some attention, but I think mostly looked at, so a skim is probably fine
Large refactors to some of the fine-tuning attacks (although a lot of these have already been reviewed in some other currently open PRs, and i've looked to address some of the comments over there here, so the code in whitebox/attacks has already mostly been reviewed -- for unresolved items, I've put a TODO. <--- needs a bit of a skim, but I think most of the code here was approved so I think a skim is fine here too
For evals -- I added MBPP, Minerva MATH, and IFEval. They loosely match metrics that I'd seen in the technical reports, but these were pretty quickly put together. <-- Some files are just copied over, <-- will leave at reviewer's discretion how much attention we want to give these. I don't believe we intend to use this for anything major, but if we do, perhaps we can have follow up review just for these?
For analysis --> lots of this code is just for plotting / filtering <--- probably can be skimmed, not something I think users will really use, as it is kind of a bit tailored to our usecase.
For benchmark --> These scripts are the entry points to the repo (how to use)
docs --> These are the docs, which I'm hoping are sufficiently clear / comprehensive <--- probably needs to be quite high quality.

Testing

Describe how you tested the changes in this PR. E.g., added tests, or ran
command foo and checked the results looked good.

_legacy/evaluation/__init__.py

docs/ANALYSIS.md

configs/analysis/default.yaml

scripts/figures/plot_main_heatmap.py

scripts/figures/plot_supplemental_heatmap.py

scripts/whitebox/benchmark_grid.py

scripts/whitebox/optuna_single.py

src/safetunebed/whitebox/utils/analysis/__init__.py

src/safetunebed/whitebox/evals/mbpp/mbpp.py

docs/MISC.md

tomtseng · 2026-01-14T19:51:35Z

Will look today!
I do see there are a bunch of merge conflicts with main, so would recommend figuring out how to address those conflicts

…optuna based hparam sweeps

scripts/whitebox/benchmark_grid.py

These internal instructions I think feel separate from a CONTRIBUTING.md which can be more public facing.

Doesn't change the rendered Markdown of the tables, but makes them easier to read as plaintext

tomtseng

read through the docs, looks good & thorough

configs/analysis/default.yaml

src/safetunebed/whitebox/evals/strong_reject/strong_reject.py

README.md

docs/ANALYSIS.md

docs/USAGE.md

docs/ANALYSIS.md

docs/USAGE.md

These scripts should give a non-zero exit code when they fail, instead of printing an error message but exiting with exit code 0.

Fixing a few nits I have about the benchmark_grid script: 1. Duplicate line (lines 117 and 121) attack_config_cls, attack_cls = ATTACKS_MAP[attack_name] # line 117 for config_name, attack_config_dict in whitebox_attack_config_grids[...]: attack_config_cls, attack_cls = ATTACKS_MAP[attack_name] # line 121 - duplicate Line 121 is redundant since the same unpacking already happens at line 117. 2. Missing required=True for --attacks argument (line 61-70) If --attacks is not provided, args.attacks will be None, causing a crash at line 113 when iterating. 3. Redundant ATTACKS_MAP alias (line 42) ATTACKS_MAP = ATTACKS_REGISTRY isn't necessary, just use ATTACKS_REGISTRY 4. Unnecessary global variable (lines 45-47) whitebox_attack_config_grids is declared at module scope but only used within the if __name__ == "__main__" block. 5. Inconsistent argument naming style Arguments mix underscore (--random-seed) and hyphen (--results_dir) styles. I've changed `--results_dir` to `--results-dir`.

…I flags

- Class comment is wrong - setdefault doesn't need an if guard - making one error message slightly more informative

…hs.from_existing(). StudyPaths attribute order differed from StudyPaths.from_existing() argument order, let's keep them consistent.

tomtseng

OK I think this is about as much as I'm going to review this. Was skimming and skipping a lot of stuff and asking Claude Code to look at a bunch of files.

scripts/analysis/analyze_results.py

src/safetunebed/whitebox/utils/names.py

src/safetunebed/whitebox/attacks/base.py

tomtseng · 2026-01-16T03:34:58Z

src/safetunebed/whitebox/defenses/registry.py

+
+
+def register_defense(
+    name: DefenseName, config_cls: type[H]


I think H is not necessary? could we just do type[AlignmentDefenseConfig] here and delete H?

Weirdly, got type errors when I tried that, put it as a TODO to resolve.

src/safetunebed/whitebox/evals/embedding_attack/embedding_attack.py

src/safetunebed/whitebox/utils/analysis/pipeline.py

src/safetunebed/whitebox/utils/benchmark/sweep.py

tests/evals/test_minerva_math.py

tests/evals/test_mbpp.py

tests/evals/test_strong_reject.py

tomtseng · 2026-01-16T06:36:00Z

pyproject.toml

+  "sentencepiece>=0.2.0",
+  "tokenizers>=0.22.0",
+  "pytest>=8.4.2",
+  "flash-attention>=1.0.0",


the correct package for this is flash-attn rather than flash-attention. however it takes some work to install so I wouldn't try to switch this to flash-attn in this PR, i'd just delete flash-attention here.

I looked at how to install it for an hour in PR #55, but currently we don't actually use flash-attn anywhere except in GCG — which we're not using, and if we want to use it, we could turn off flash attention probably. So not a priority to fix. vllm I think already does some flash attention thing internally without needing to install flash-attn

sdhossain requested a review from tomtseng January 14, 2026 13:02