
infra: added defense hparam sweep scripts#98

Open
sdhossain wants to merge 4 commits intomainfrom
sh/defense_sweep

Conversation

@sdhossain (Collaborator)

[infra] Defense hyperparameter sweep infrastructure

Changes

Adds Optuna sweep and grid-based benchmarking for defenses, mirroring the existing attack sweep system.

The pipeline runs: defend -> eval defense checkpoint -> attack defended model -> eval post-attack checkpoint -> cleanup.

New scripts defense_sweep.py and defense_grid.py drive this. Post-defense attacks support both grid mode (fixed configs) and sweep mode (inner Optuna). Shared Optuna/grid logic was extracted from optuna_single.py and benchmark_grid.py into runners.py so both attack and defense scripts reuse the same core.
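The per-trial flow above can be sketched as a small driver loop. This is a hedged illustration of the described pipeline, not the PR's actual code; the stage functions (`defend`, `evaluate`, `attack`) stand in for the real runner modules:

```python
import shutil
from pathlib import Path

def run_defense_trial(defense_cfg: dict, attack_cfgs: list[dict],
                      defend, evaluate, attack, workdir: Path) -> dict:
    """defend -> eval -> (attack -> eval) per attack config -> cleanup."""
    ckpt = defend(defense_cfg, workdir)          # train the defended model
    results = {"defense_eval": evaluate(ckpt)}   # eval the defense checkpoint
    post = {}
    for cfg in attack_cfgs:                      # grid mode or inner Optuna sweep
        attacked = attack(ckpt, cfg, workdir)
        post[cfg["name"]] = evaluate(attacked)   # eval the post-attack checkpoint
    results["post_attack"] = post
    shutil.rmtree(ckpt, ignore_errors=True)      # cleanup: drop model weights
    return results
```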

Testing

Ran defense_grid.py end-to-end on SmolLM-135M-Instruct with minimal configs for Booster and CRL -- both completed the full defend -> eval -> attack -> eval pipeline successfully.

Note: TAR needs a fix (it currently only saves the LoRA adapter)

uv run python scripts/whitebox/defense_grid.py \
  HuggingFaceTB/SmolLM-135M-Instruct --defense booster --config-name base \
  --results-dir /tmp/defense_test_results --model-alias smollm_test

- name: lora_finetune
  mode: grid
  config_name: base
  # Dataset configuration (Paper Section 5.1)
@sdhossain (Author):

haven't checked these myself yet

    choices: [4, 8, 10, 16]
  num_train_epochs:
    choices: [10, 15, 20, 30]
  weight_decay:
@sdhossain (Author):

haven't checked whether these are fully sensible yet (placeholder-ish for now)

- name: lora_finetune
  mode: grid
  config_name: base
  # Dataset configuration
@sdhossain (Author):

same for these (need to check the values)

@sdhossain sdhossain marked this pull request as ready for review February 19, 2026 09:00
@tomtseng (Collaborator) left a comment:

looks great on a skim!
one general comment is whether things could be made less verbose, e.g., DefenseGridConfig and DefenseSweepConfig seem kinda similar, and so do StudyPaths and DefenseStudyPaths

## Tips

1. **Always use `--model-alias`** -- keeps results organized and enables Optuna resume
2. **Defense checkpoints are cleaned up** -- only eval artifacts and `trial_results.json` persist; model weights are deleted after all attacks complete to save disk

Suggested change
2. **Defense checkpoints are cleaned up** -- only eval artifacts and `trial_results.json` persist; model weights are deleted after all attacks complete to save disk
2. **Defense checkpoints are cleaned up** -- only eval artifacts and `trial_results.json` persist; model weights are deleted after all attacks complete to save disk space

I think this is the right default though I would also suggest including directions for how to run a defense w/o deleting the checkpoint — seems potentially useful for debugging or if the sweep is small
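One way that opt-out could look, as a minimal sketch. The flag name `--keep-defense-checkpoints` is hypothetical and not in this PR; the PR currently deletes checkpoints unconditionally:

```python
import argparse
import shutil
from pathlib import Path

parser = argparse.ArgumentParser()
parser.add_argument(
    "--keep-defense-checkpoints",
    action="store_true",
    help="Skip deleting defended-model weights after all attacks finish "
         "(useful for debugging or small sweeps).",
)

def cleanup_checkpoint(ckpt_dir: Path, keep: bool) -> None:
    """Delete the defense checkpoint unless the user opted to keep it."""
    if keep:
        return  # leave weights on disk for inspection
    shutil.rmtree(ckpt_dir, ignore_errors=True)
```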

return self.model_results_dir / str(self.defense_name) / self.sweep_subdir

@property
def attack_results_dir(self) -> Path:

it could be potentially confusing for attack_results_dir and defense_results_dir to both refer to the same thing. could we make StudyPathsLike expect a more genericly named attribute results_dir rather than attack_results_dir?
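One way to realize that suggestion, as a hedged sketch using structural typing; the actual `StudyPathsLike` definition in the PR may differ, and the concrete classes here are illustrative:

```python
from pathlib import Path
from typing import Protocol

class StudyPathsLike(Protocol):
    """Shared path interface: a generic results_dir instead of the
    attack-specific attack_results_dir name."""
    @property
    def results_dir(self) -> Path: ...

class AttackStudyPaths:
    def __init__(self, root: Path) -> None:
        self._root = root
    @property
    def results_dir(self) -> Path:
        return self._root / "attack"

class DefenseStudyPaths:
    def __init__(self, root: Path) -> None:
        self._root = root
    @property
    def results_dir(self) -> Path:
        return self._root / "defense"

def trial_results_path(paths: StudyPathsLike) -> Path:
    # Works with either concrete class via the shared protocol attribute.
    return paths.results_dir / "trial_results.json"
```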

max_generation_length: 1024
inference_batch_size: 16
evals: [strong_reject_small]
defense_evals: [strong_reject_small]

Suggested change
defense_evals: [strong_reject_small]
defense_evals: [strong_reject]

i think we got rid of strong_reject_small

args: Namespace = parser.parse_args()
multiprocessing.set_start_method("spawn", force=True)

config_root = cast(Path, args.configs_dir)

are these cast()s necessary even when you've set type=Path in the parser.add_argument() call?
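For context on the question (a general typing note, not specific to this PR): `argparse` does convert the value at runtime when `type=Path` is set, but `Namespace.__getattr__` is annotated as `Any` in typeshed, so the `cast()` only serves static type checkers and has no runtime effect:

```python
import argparse
from pathlib import Path
from typing import cast

parser = argparse.ArgumentParser()
parser.add_argument("--configs-dir", type=Path, default=Path("configs"))
args = parser.parse_args([])

# At runtime the value is already a Path; no cast is needed for behavior.
assert isinstance(args.configs_dir, Path)

# mypy sees args.configs_dir as Any, so cast() narrows the static type
# without changing the value.
config_root = cast(Path, args.configs_dir)
assert config_root == args.configs_dir
```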

config_metrics[str(eval_name)] = float(eval_cls.load_result_objective(results_df))
all_config_results[config_name] = config_metrics

return DefenseSweepTrialManager._get_worst_case_metrics(all_config_results, post_attack_eval_names)

looks fine for now, i'm just recalling that we did discuss the right objective function here not being very clear — do we want to pick defenses based on worst-case harmfulness or average-case harmfulness?
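To make that trade-off concrete, here is a hedged sketch of the two candidate objectives; the function names are hypothetical, and the PR currently implements the worst-case variant in `_get_worst_case_metrics`:

```python
from statistics import mean

def worst_case(scores: dict[str, float]) -> float:
    """Pessimistic objective: judge a defense by its weakest point,
    i.e. the most successful attack configuration."""
    return max(scores.values())

def average_case(scores: dict[str, float]) -> float:
    """Optimistic objective: judge a defense by typical behavior
    across attack configurations."""
    return mean(scores.values())
```

The worst-case objective rewards uniformly robust defenses; the average-case one can favor a defense that is strong against most attacks but has one exploitable gap.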
