
infra: added defense hparam sweep scripts#98

Open
sdhossain wants to merge 4 commits intomainfrom
sh/defense_sweep

Conversation

@sdhossain (Collaborator)

[infra] Defense hyperparameter sweep infrastructure

Changes

Adds Optuna sweep and grid-based benchmarking for defenses, mirroring the existing attack sweep system.

The pipeline runs: defend -> eval defense checkpoint -> attack defended model -> eval post-attack checkpoint -> cleanup.

New scripts defense_sweep.py and defense_grid.py drive this. Post-defense attacks support both grid mode (fixed configs) and sweep mode (inner Optuna). Shared Optuna/grid logic was extracted from optuna_single.py and benchmark_grid.py into runners.py so both attack and defense scripts reuse the same core.
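The per-trial flow above can be sketched as a small driver loop. This is a hedged illustration of the described pipeline, not the PR's actual code; the stage functions (`defend`, `evaluate`, `attack`) stand in for the real runner modules:

```python
import shutil
from pathlib import Path

def run_defense_trial(defense_cfg: dict, attack_cfgs: list[dict],
                      defend, evaluate, attack, workdir: Path) -> dict:
    """defend -> eval -> (attack -> eval) per attack config -> cleanup."""
    ckpt = defend(defense_cfg, workdir)          # train the defended model
    results = {"defense_eval": evaluate(ckpt)}   # eval the defense checkpoint
    post = {}
    for cfg in attack_cfgs:                      # grid mode or inner Optuna sweep
        attacked = attack(ckpt, cfg, workdir)
        post[cfg["name"]] = evaluate(attacked)   # eval the post-attack checkpoint
    results["post_attack"] = post
    shutil.rmtree(ckpt, ignore_errors=True)      # cleanup: drop model weights
    return results
```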

Testing

Ran defense_grid.py end-to-end on SmolLM-135M-Instruct with minimal configs for Booster and CRL -- both completed the full defend -> eval -> attack -> eval pipeline successfully.

Note: TAR needs a fix (it currently only saves the LoRA adapter)

uv run python scripts/whitebox/defense_grid.py \
  HuggingFaceTB/SmolLM-135M-Instruct --defense booster --config-name base \
  --results-dir /tmp/defense_test_results --model-alias smollm_test

- name: lora_finetune
  mode: grid
  config_name: base
  # Dataset configuration (Paper Section 5.1)
@sdhossain (Author):

haven't checked these myself yet

    choices: [4, 8, 10, 16]
  num_train_epochs:
    choices: [10, 15, 20, 30]
  weight_decay:
@sdhossain (Author):

haven't checked whether these are fully sensible yet (placeholder-ish for now)

- name: lora_finetune
  mode: grid
  config_name: base
  # Dataset configuration
@sdhossain (Author):

same for these (need to check the values)

@sdhossain sdhossain marked this pull request as ready for review February 19, 2026 09:00
@tomtseng (Collaborator) left a comment:

looks great on a skim!
one general comment is whether things could be made less verbose, e.g., DefenseGridConfig and DefenseSweepConfig seem kinda similar, and so do StudyPaths and DefenseStudyPaths

## Tips

1. **Always use `--model-alias`** -- keeps results organized and enables Optuna resume
2. **Defense checkpoints are cleaned up** -- only eval artifacts and `trial_results.json` persist; model weights are deleted after all attacks complete to save disk

Suggested change
2. **Defense checkpoints are cleaned up** -- only eval artifacts and `trial_results.json` persist; model weights are deleted after all attacks complete to save disk
2. **Defense checkpoints are cleaned up** -- only eval artifacts and `trial_results.json` persist; model weights are deleted after all attacks complete to save disk space

I think this is the right default though I would also suggest including directions for how to run a defense w/o deleting the checkpoint — seems potentially useful for debugging or if the sweep is small
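One way that opt-out could look, as a minimal sketch. The flag name `--keep-defense-checkpoints` is hypothetical and not in this PR; the PR currently deletes checkpoints unconditionally:

```python
import argparse
import shutil
from pathlib import Path

parser = argparse.ArgumentParser()
parser.add_argument(
    "--keep-defense-checkpoints",
    action="store_true",
    help="Skip deleting defended-model weights after all attacks finish "
         "(useful for debugging or small sweeps).",
)

def cleanup_checkpoint(ckpt_dir: Path, keep: bool) -> None:
    """Delete the defense checkpoint unless the user opted to keep it."""
    if keep:
        return  # leave weights on disk for inspection
    shutil.rmtree(ckpt_dir, ignore_errors=True)
```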

return self.model_results_dir / str(self.defense_name) / self.sweep_subdir

@property
def attack_results_dir(self) -> Path:

it could be potentially confusing for attack_results_dir and defense_results_dir to both refer to the same thing. could we make StudyPathsLike expect a more genericly named attribute results_dir rather than attack_results_dir?
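One way to realize that suggestion, as a hedged sketch using structural typing; the actual `StudyPathsLike` definition in the PR may differ, and the concrete classes here are illustrative:

```python
from pathlib import Path
from typing import Protocol

class StudyPathsLike(Protocol):
    """Shared path interface: a generic results_dir instead of the
    attack-specific attack_results_dir name."""
    @property
    def results_dir(self) -> Path: ...

class AttackStudyPaths:
    def __init__(self, root: Path) -> None:
        self._root = root
    @property
    def results_dir(self) -> Path:
        return self._root / "attack"

class DefenseStudyPaths:
    def __init__(self, root: Path) -> None:
        self._root = root
    @property
    def results_dir(self) -> Path:
        return self._root / "defense"

def trial_results_path(paths: StudyPathsLike) -> Path:
    # Works with either concrete class via the shared protocol attribute.
    return paths.results_dir / "trial_results.json"
```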

max_generation_length: 1024
inference_batch_size: 16
evals: [strong_reject_small]
defense_evals: [strong_reject_small]

Suggested change
defense_evals: [strong_reject_small]
defense_evals: [strong_reject]

i think we got rid of strong_reject_small

args: Namespace = parser.parse_args()
multiprocessing.set_start_method("spawn", force=True)

config_root = cast(Path, args.configs_dir)

are these cast()s necessary even when you've set type=Path in the parser.add_argument() call?
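For context on the question (a general typing note, not specific to this PR): `argparse` does convert the value at runtime when `type=Path` is set, but `Namespace.__getattr__` is annotated as `Any` in typeshed, so the `cast()` only serves static type checkers and has no runtime effect:

```python
import argparse
from pathlib import Path
from typing import cast

parser = argparse.ArgumentParser()
parser.add_argument("--configs-dir", type=Path, default=Path("configs"))
args = parser.parse_args([])

# At runtime the value is already a Path; no cast is needed for behavior.
assert isinstance(args.configs_dir, Path)

# mypy sees args.configs_dir as Any, so cast() narrows the static type
# without changing the value.
config_root = cast(Path, args.configs_dir)
assert config_root == args.configs_dir
```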

config_metrics[str(eval_name)] = float(eval_cls.load_result_objective(results_df))
all_config_results[config_name] = config_metrics

return DefenseSweepTrialManager._get_worst_case_metrics(all_config_results, post_attack_eval_names)

looks fine for now, i'm just recalling that we did discuss the right objective function here not being very clear — do we want to pick defenses based on worst-case harmfulness or average-case harmfulness?
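To make that trade-off concrete, here is a hedged sketch of the two candidate objectives; the function names are hypothetical, and the PR currently implements the worst-case variant in `_get_worst_case_metrics`:

```python
from statistics import mean

def worst_case(scores: dict[str, float]) -> float:
    """Pessimistic objective: judge a defense by its weakest point,
    i.e. the most successful attack configuration."""
    return max(scores.values())

def average_case(scores: dict[str, float]) -> float:
    """Optimistic objective: judge a defense by typical behavior
    across attack configurations."""
    return mean(scores.values())
```

The worst-case objective rewards uniformly robust defenses; the average-case one can favor a defense that is strong against most attacks but has one exploitable gap.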
