Skip to content

infra: arxiv version#53

Merged
sdhossain merged 55 commits intomainfrom
sh/arxiv_techdebt
Jan 22, 2026
Merged

infra: arxiv version#53
sdhossain merged 55 commits intomainfrom
sh/arxiv_techdebt

Conversation

@sdhossain
Copy link
Collaborator

Changes

This PR contains a number of changes, including:

  • infra refactors: these have been kind of reviewed already, but not merged -- so will close those PRs in favour of this (those PRs were split from a previous commit of this - and a bit scattered), I've put TODOs for unresolved things. <--- could use some attention, but I think mostly looked at, so a skim is probably fine
  • Large refactors to some of the fine-tuning attacks (although a lot of these have already been reviewed in some other currently open PRs, and i've looked to address some of the comments over there here, so the code in whitebox/attacks has already mostly been reviewed -- for unresolved items, I've put a TODO. <--- needs a bit of a skim, but I think most of the code here was approved so I think a skim is fine here too
  • For evals -- I added MBPP, Minerva MATH, and IFEval. They loosely match metrics that I'd seen in the technical reports, but these were pretty quickly put together. <-- Some files are just copied over, <-- will leave at reviewer's discretion how much attention we want to give these. I don't believe we intend to use this for anything major, but if we do, perhaps we can have follow up review just for these?
  • For analysis --> lots of this code is just for plotting / filtering <--- probably can be skimmed, not something I think users will really use, as it is kind of a bit tailored to our usecase.
  • For benchmark --> These scripts are the entry points to the repo (how to use)
  • docs --> These are the docs, which I'm hoping are sufficiently clear / comprehensive <--- probably needs to be quite high quality.

Testing

Describe how you tested the changes in this PR. E.g., added tests, or ran
command foo and checked the results looked good.

@sdhossain sdhossain requested a review from tomtseng January 14, 2026 13:02
@tomtseng
Copy link
Collaborator

Will look today!
I do see there are a bunch of merge conflicts with main, so would recommend figuring out how to address those conflicts

@sdhossain sdhossain requested a review from tomtseng January 15, 2026 11:40
These internal instructions I think feel separate from a CONTRIBUTING.md
which can be more public facing.
Doesn't change the rendered Markdown of the tables, but makes them
easier to read as plaintext
Copy link
Collaborator

@tomtseng tomtseng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

read through the docs, looks good & thorough

These scripts should give a non-zero exit code when they fail, instead
of printing an error message but exiting with exit code 0.
Fixing a few nits I have about the benchmark_grid script:

1. Duplicate line (lines 117 and 121)
attack_config_cls, attack_cls = ATTACKS_MAP[attack_name]  # line 117
for config_name, attack_config_dict in whitebox_attack_config_grids[...]:
    attack_config_cls, attack_cls = ATTACKS_MAP[attack_name]  # line 121 - duplicate
Line 121 is redundant since the same unpacking already happens at line 117.

2. Missing required=True for --attacks argument (line 61-70)
If --attacks is not provided, args.attacks will be None, causing a crash at line 113 when iterating.

3. Redundant ATTACKS_MAP alias (line 42)
ATTACKS_MAP = ATTACKS_REGISTRY isn't necessary, just use ATTACKS_REGISTRY

4. Unnecessary global variable (lines 45-47)
whitebox_attack_config_grids is declared at module scope but only used within the if __name__ == "__main__" block.

5. Inconsistent argument naming style
Arguments mix underscore (--random-seed) and hyphen (--results_dir) styles. I've changed `--results_dir` to `--results-dir`.
- Class comment is wrong
- setdefault doesn't need an if guard
- making one error message slightly more informative
…hs.from_existing().

StudyPaths attribute order differed from StudyPaths.from_existing()
argument order, let's keep them consistent.
Copy link
Collaborator

@tomtseng tomtseng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK I think this is about as much as I'm going to review this. Was skimming and skipping a lot of stuff and asking Claude Code to look at a bunch of files.



def register_defense(
name: DefenseName, config_cls: type[H]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think H is not necessary? could we just do type[AlignmentDefenseConfig] here and delete H?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Weirdly, got type errors when I tried that, put it as a TODO to resolve.

"sentencepiece>=0.2.0",
"tokenizers>=0.22.0",
"pytest>=8.4.2",
"flash-attention>=1.0.0",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the correct package for this is flash-attn rather than flash-attention. however it takes some work to install so I wouldn't try to switch this to flash-attn in this PR, i'd just delete flash-attention here.

I looked at how to install it for an hour in PR #55, but currently we don't actually use flash-attn anywhere except in GCG — which we're not using, and if we want to use it, we could turn off flash attention probably. So not a priority to fix. vllm I think already does some flash attention thing internally without needing to install flash-attn

@sdhossain sdhossain merged commit 3d58702 into main Jan 22, 2026
1 check passed
@tomtseng tomtseng mentioned this pull request Jan 29, 2026
@sdhossain sdhossain deleted the sh/arxiv_techdebt branch February 5, 2026 22:35
@sdhossain sdhossain mentioned this pull request Feb 24, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants