Conversation
```yaml
- name: lora_finetune
  mode: grid
  config_name: base
  # Dataset configuration (Paper Section 5.1)
```
haven't checked these myself yet
```yaml
  choices: [4, 8, 10, 16]
num_train_epochs:
  choices: [10, 15, 20, 30]
weight_decay:
```
haven't checked whether these are fully sensible yet (placeholder-ish for now)
```yaml
- name: lora_finetune
  mode: grid
  config_name: base
  # Dataset configuration
```
same for these (need to check the values)
tomtseng left a comment:
looks great on a skim!
one general comment is whether things could be made less verbose, e.g., DefenseGridConfig and DefenseSweepConfig seem kinda similar, and so do StudyPaths and DefenseStudyPaths
```markdown
## Tips

1. **Always use `--model-alias`** -- keeps results organized and enables Optuna resume
2. **Defense checkpoints are cleaned up** -- only eval artifacts and `trial_results.json` persist; model weights are deleted after all attacks complete to save disk
```
Suggested change (append the missing word "space"):

```markdown
2. **Defense checkpoints are cleaned up** -- only eval artifacts and `trial_results.json` persist; model weights are deleted after all attacks complete to save disk space
```
I think this is the right default though I would also suggest including directions for how to run a defense w/o deleting the checkpoint — seems potentially useful for debugging or if the sweep is small
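sketching that suggestion: a hypothetical `--keep-checkpoints` flag (the name and wiring are illustrative, not part of this PR) that gates the cleanup step:

```python
import argparse

# Hypothetical flag -- illustrative only; the PR's actual CLI may differ.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--keep-checkpoints",
    action="store_true",
    help="Skip deleting defended model weights after all attacks complete "
         "(useful for debugging or small sweeps).",
)
args = parser.parse_args(["--keep-checkpoints"])

if not args.keep_checkpoints:
    ...  # delete the checkpoint directory here
```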
```python
        return self.model_results_dir / str(self.defense_name) / self.sweep_subdir

    @property
    def attack_results_dir(self) -> Path:
```
it could be confusing for `attack_results_dir` and `defense_results_dir` to both refer to the same thing. could we make `StudyPathsLike` expect a more generically named attribute `results_dir` rather than `attack_results_dir`?
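a sketch of what that could look like, assuming `StudyPathsLike` is (or becomes) a `typing.Protocol` -- the class bodies here are illustrative, not the PR's actual code:

```python
from pathlib import Path
from typing import Protocol, runtime_checkable


@runtime_checkable
class StudyPathsLike(Protocol):
    """Hypothetical protocol: one generic results_dir instead of
    attack_results_dir / defense_results_dir aliases."""

    @property
    def results_dir(self) -> Path: ...


class DefenseStudyPaths:
    """Illustrative implementer exposing its layout under the generic name."""

    def __init__(self, root: Path) -> None:
        self.root = root

    @property
    def results_dir(self) -> Path:
        # Defense-specific layout, but callers only see the generic attribute.
        return self.root / "defense_results"


paths = DefenseStudyPaths(Path("/tmp/study"))
assert isinstance(paths, StudyPathsLike)
```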
```yaml
max_generation_length: 1024
inference_batch_size: 16
evals: [strong_reject_small]
defense_evals: [strong_reject_small]
```
Suggested change:

```yaml
defense_evals: [strong_reject]
```
i think we got rid of `strong_reject_small`
```python
args: Namespace = parser.parse_args()
multiprocessing.set_start_method("spawn", force=True)

config_root = cast(Path, args.configs_dir)
```
are these `cast()`s necessary even when you've set `type=Path` in the `parser.add_argument()` call?
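at runtime `type=Path` already converts the value, so `cast()` would only be narrowing for the type checker (`Namespace` attribute access is typed as `Any`). a minimal check:

```python
import argparse
from pathlib import Path

parser = argparse.ArgumentParser()
parser.add_argument("--configs-dir", type=Path, default=Path("configs"))
args = parser.parse_args([])

# Already a Path at runtime; cast() only affects the static type.
assert isinstance(args.configs_dir, Path)
```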
```python
config_metrics[str(eval_name)] = float(eval_cls.load_result_objective(results_df))
all_config_results[config_name] = config_metrics

return DefenseSweepTrialManager._get_worst_case_metrics(all_config_results, post_attack_eval_names)
```
looks fine for now, i'm just recalling that we did discuss the right objective function here not being very clear — do we want to pick defenses based on worst-case harmfulness or average-case harmfulness?
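to make the trade-off concrete, a toy comparison of the two objectives (the numbers are made up and the helper names are not from the PR):

```python
from statistics import mean

# Hypothetical per-attack-config harmfulness scores for one defense trial.
all_config_results = {
    "gcg": {"strong_reject": 0.30},
    "pair": {"strong_reject": 0.10},
}


def worst_case(results, eval_name):
    # Pessimistic: the defense is only as good as its weakest point.
    return max(m[eval_name] for m in results.values())


def average_case(results, eval_name):
    # Smooths over a single unusually strong attack config.
    return mean(m[eval_name] for m in results.values())


assert worst_case(all_config_results, "strong_reject") == 0.30
assert abs(average_case(all_config_results, "strong_reject") - 0.20) < 1e-9
```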
Force-pushed acd7757 to 433ffec, then 433ffec to fd1ba6e.
[infra] Defense hyperparameter sweep infrastructure

Changes
Adds Optuna sweep and grid-based benchmarking for defenses, mirroring the existing attack sweep system.
The pipeline runs: defend -> eval defense checkpoint -> attack defended model -> eval post-attack checkpoint -> cleanup.
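That sequencing could be sketched as follows (stage and function names are illustrative, not the actual `runners.py` API):

```python
from pathlib import Path


def run_pipeline(checkpoint_dir: Path, stages) -> list[str]:
    """Run each (name, fn) stage in order, then clean up model weights."""
    log = []
    for name, fn in stages:
        fn(checkpoint_dir)
        log.append(name)
    log.append("cleanup")  # delete model weights, keep trial_results.json
    return log


# Stub stages standing in for the real defend/eval/attack steps.
stages = [
    ("defend", lambda d: None),
    ("eval_defense", lambda d: None),
    ("attack", lambda d: None),
    ("eval_post_attack", lambda d: None),
]
order = run_pipeline(Path("/tmp/trial"), stages)
```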
New scripts
`defense_sweep.py` and `defense_grid.py` drive this. Post-defense attacks support both `grid` mode (fixed configs) and `sweep` mode (inner Optuna). Shared Optuna/grid logic was extracted from `optuna_single.py` and `benchmark_grid.py` into `runners.py` so both attack and defense scripts reuse the same core.

Testing
Ran `defense_grid.py` end-to-end on `SmolLM-135M-Instruct` with minimal configs for Booster and CRL -- both completed the full defend -> eval -> attack -> eval pipeline successfully.

Note: TAR needs a fix (it only saves the LoRA adapter).