
Conversation

@finbarrtimbers

Ran the single GPU finetuning script: Beaker.

finbarrtimbers and others added 24 commits January 5, 2026 12:02
…mmand

- Add --dataset_mixer_list, --cache_dataset_only, --output_dir to train subparser
- Handle --cache_dataset_only flag in main function
- Update debug script to use --dataset_mixer_list pattern
- Make --dataset_path optional (required only for training)

The olmo_core.launch.beaker module requires a different beaker-py version
than what's installed. Make the import optional so --cache_dataset_only
mode works without beaker dependencies.

When --dataset_path is not provided but --output_dir is, use output_dir
as the dataset path. This matches how mason.py caches datasets to
output_dir and then runs training.
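In effect, the fallback is a one-liner (a sketch rather than the PR's exact code; the argument names follow the flags listed in the earlier commit):

# Hypothetical sketch: prefer an explicit --dataset_path, otherwise reuse --output_dir.
dataset_path = args.dataset_path if args.dataset_path is not None else args.output_dir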

olmo_core.config.Config doesn't support integer keys properly.
Changed SFTConfig from inheriting Config to a plain @dataclass.
Removed the .merge(overrides) call and debug print statements.
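For illustration, a plain dataclass keeps integer-keyed dict fields intact; the field names below are hypothetical, not the PR's actual SFTConfig:

from dataclasses import dataclass, field

@dataclass
class SFTConfig:
    learning_rate: float = 2e-5
    sequence_length: int = 4096
    # Integer-keyed mappings (e.g. per-layer overrides) survive as-is here,
    # whereas olmo_core.config.Config does not handle int keys properly.
    layer_overrides: dict[int, float] = field(default_factory=dict)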

Uses a two-step approach (both steps are sketched below):
1. Cache HF dataset to numpy format on weka
2. Run training on the cached data
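A rough sketch of the two steps as CLI invocations, using the train subcommand and flags added earlier in this PR (the dataset name, mix weight, and weka path are placeholders, and the exact invocation may differ):

# Step 1: tokenize the HF dataset mix and cache it as numpy files on weka.
python open_instruct/finetune.py train \
    --dataset_mixer_list allenai/tulu-3-sft-mixture 1.0 \
    --cache_dataset_only \
    --output_dir /weka/path/to/sft-cache

# Step 2: launch training against the cached data, pointing --dataset_path
# (or --output_dir, which it falls back to) at the same location.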

Needed for WandB and config saver callbacks since SFTConfig is now
a plain dataclass instead of inheriting from olmo_core.config.Config.

Fixes S3 403 Forbidden errors by using the weka-mounted path directly.

The previous checkpoint path (OLMo-2-1124-7B/step556000-unsharded) was either
empty or in an incompatible format. Switch to an OLMo3 checkpoint, which has
proper .metadata files for olmo-core's distributed checkpoint loading.

- Add MODEL_CONFIGS mapping for supported architectures (olmo2_7B, olmo3_7B, etc.)
- Add --model_config argument to train and launch subcommands
- Remove separate dry_run subcommand, use --dry_run flag instead
- Update debug script to use OLMo3-7B checkpoint and model config

Use --rdzv-backend=c10d --rdzv-endpoint=localhost:0 to let torchrun
automatically find a free port instead of hardcoding a port number.
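For example, a single-node launch along these lines (the script arguments are placeholders):

# c10d rendezvous on localhost:0 lets torchrun bind any free port.
torchrun --nproc-per-node=1 \
    --rdzv-backend=c10d \
    --rdzv-endpoint=localhost:0 \
    open_instruct/finetune.py ...  # remaining training args elided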

Add 'from_checkpoint' option for --model_config that reads the
TransformerConfig directly from the checkpoint's config.json file.
This ensures model architecture always matches the checkpoint.
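A minimal sketch of that flow, assuming a local checkpoint directory and a config.json with a model section; the PR's actual construction of the TransformerConfig may differ:

import json
from pathlib import Path

checkpoint_dir = Path("/weka/path/to/checkpoint")  # placeholder
config = json.loads((checkpoint_dir / "config.json").read_text())
model_cfg = config["model"]
# Values such as d_model, n_layers, n_heads, and the qk_norm / layer_norm
# settings are then used to build a matching olmo-core TransformerConfig.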

resource_path takes folder and filename as separate arguments.
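That is, roughly the following, assuming resource_path comes from olmo-core's IO utilities and checkpoint_dir is a placeholder for the checkpoint folder:

config_path = resource_path(checkpoint_dir, "config.json")  # folder and filename passed separately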

Avoid OmegaConf compatibility issues by manually constructing the
TransformerConfig from the checkpoint's config.json values.

The checkpoint uses per-head QK norm which has different weight shapes.

Use bias, eps, and name from checkpoint config for qk_norm,
layer_norm, and lm_head layer_norm.

Required by OLMo3 checkpoints that were saved with torchao optimizations.

The checkpoint was saved with an older torchao version that had a
different float8 module structure.

The OLMo3 float8 checkpoint is incompatible with current library
versions. Need to use a checkpoint without float8 dependencies.

The previous OLMo3 checkpoint was saved with float8 optimizations
that require an older version of torchao incompatible with our
transformers dependency.

Copy BeakerCallbackV2 from refactor-dpo branch. This callback
updates Beaker experiment descriptions with training progress
and W&B URLs, using the beaker-py v2 API.

@gemini-code-assist

Summary of Changes

Hello @finbarrtimbers, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant architectural shift in the finetuning process by migrating the core finetuning script to the OLMo-core training infrastructure. This change aims to standardize and optimize the training of OLMo models. Consequently, all previous HuggingFace Accelerate-based Beaker configurations and finetuning scripts have been deprecated. The update also includes a new Beaker callback for compatibility and a dedicated subcommand for efficient dataset caching to a numpy format, aligning with OLMo-core's data handling mechanisms.

Highlights

  • Migration to OLMo-core: The primary finetuning script, open_instruct/finetune.py, has been completely rewritten to utilize the OLMo-core training infrastructure, replacing the previous HuggingFace Accelerate-based approach.
  • Deprecation of Old Beaker Configurations: All existing Beaker configuration files for finetuning (e.g., default_finetune.yaml, default_finetune_lora_multinode.yaml) have been marked as deprecated. They now provide a message directing users to the new OLMo-core based finetune.py launch command.
  • Temporary Removal of LoRA/QLoRA Support: LoRA and QLoRA training functionalities are no longer supported in the new finetune.py script. Future re-implementation is planned, as indicated by TODO comments.
  • New BeakerCallbackV2 Module: A new module, open_instruct/beaker_callback.py, has been introduced to provide Beaker v2 compatibility for OLMo-core, which currently uses Beaker v1. This module is intended to be temporary.
  • Enhanced Dataset Caching: The finetune.py script now includes a cache_dataset_only subcommand, allowing users to convert HuggingFace datasets into a numpy format optimized for OLMo-core training, complete with detailed statistics generation.
  • Updated Finetuning Scripts: All relevant finetuning shell scripts have been updated to reflect the new OLMo-core based finetune.py usage, with many older scripts now marked as deprecated and providing guidance to the new approach.



@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a significant and well-executed refactoring of the supervised finetuning (SFT) process, replacing the existing HuggingFace Accelerate-based implementation with a new one built on olmo-core. The changes include deprecating old configurations and scripts, and introducing a new, more structured finetune.py with a clear command-line interface and better configuration management using dataclasses. The addition of a custom BeakerCallbackV2 for Beaker v2 compatibility is also a great improvement.

Overall, the new implementation is much cleaner and more modular. I've identified a few areas for improvement:

  • A potential out-of-memory issue in the dataset caching logic for very large datasets.
  • A minor inconsistency in time measurement within the new Beaker callback.
  • A question regarding the removal of a file from the type-checking configuration.

Details are in the specific comments. Great work on this large-scale refactoring!

train_dataset = train_dataset.select(range(args.num_examples))

print("Collecting tokens from dataset...")
token_ids = []


Severity: high

The current implementation of cache_dataset_only collects all token_ids and labels_mask into lists in memory before writing them to disk. For very large datasets, this can lead to out-of-memory errors, which undermines the benefit of chunking the output files.

To fix this, I suggest refactoring this part to process the dataset in chunks and write to the memory-mapped files incrementally. This would involve iterating through the dataset, accumulating a chunk of a predefined size (e.g., 1GB of tokens), writing it to a file part, and then proceeding to the next chunk. This approach would keep memory usage low regardless of the total dataset size.
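To make the suggestion concrete, here is an illustrative sketch (not the PR's code) of streaming fixed-size chunks to numbered .npy parts so memory stays bounded; the helper name, chunk size, and uint32 dtype are assumptions:

import numpy as np

CHUNK_TOKENS = 256 * 1024 * 1024  # ~1 GB of uint32 token ids per file part

def write_token_parts(tokenized_dataset, output_prefix: str) -> None:
    buffer: list[int] = []
    part = 0
    for example in tokenized_dataset:
        buffer.extend(example["input_ids"])
        if len(buffer) >= CHUNK_TOKENS:
            np.save(f"{output_prefix}-part-{part:05d}.npy", np.asarray(buffer, dtype=np.uint32))
            buffer.clear()
            part += 1
    if buffer:  # flush the final partial chunk
        np.save(f"{output_prefix}-part-{part:05d}.npy", np.asarray(buffer, dtype=np.uint32))
    # labels_mask would be chunked and written the same way, in lockstep with token_ids.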

self.enabled
and get_rank() == 0
and self.step % self.trainer.metrics_collect_interval == 0
and (self._last_update is None or (time.monotonic() - self._last_update) > 10)


Severity: medium

For consistency, it's better to use the same clock for measuring time intervals. _start_time is initialized with time.perf_counter(), so _last_update should also use time.perf_counter() to avoid potential issues from mixing different clocks (time.monotonic() and time.perf_counter()).

Suggested change:
- and (self._last_update is None or (time.monotonic() - self._last_update) > 10)
+ and (self._last_update is None or (time.perf_counter() - self._last_update) > 10)

allow_multiple=False,
distributed=False,
)
self._last_update = time.monotonic()


Severity: medium

For consistency, please use time.perf_counter() here, to match the clock used for _start_time and the check in post_step.

Suggested change:
- self._last_update = time.monotonic()
+ self._last_update = time.perf_counter()

"open_instruct/data_loader.py",
"open_instruct/data_types.py",
"open_instruct/logger_utils.py",
"open_instruct/mix_data.py",


Severity: medium

The file open_instruct/mix_data.py was removed from the ty (type checking) include list. Was this intentional? It seems like it should still be type-checked to maintain code quality.


@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bd45c4f3d1


dataset_config = build_sft_dataset(
root_dir=root_dir, tokenizer_config=tokenizer_config, sequence_length=seq_len, dataset_path=dataset_path
)
gpu_type = CLUSTER_TO_GPU_TYPE[cluster]


P2: Handle clusters accepted by get_root_dir

The new config builder dereferences CLUSTER_TO_GPU_TYPE[cluster] directly, but get_root_dir still treats clusters like ai2/prior, ai2/saturn, ai2/aristo-elara-cirrascale, ai2/allennlp, ai2/mosaic, and local as valid. If a user passes any of those (which the script now allows via get_root_dir), this line raises a KeyError and the run never starts. Please either add GPU type entries for those clusters or handle a default/fallback so these previously supported clusters don’t crash during config construction.
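One minimal way to express the fallback, building on the snippet quoted above (the default GPU type is an assumption, not the PR's choice):

# Look the cluster up with a default instead of indexing directly, so clusters
# that get_root_dir accepts but the mapping lacks don't raise KeyError.
gpu_type = CLUSTER_TO_GPU_TYPE.get(cluster, "h100")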


@BrownianNotion

@finbarrtimbers, the new script looks super clean! Was wondering if it would also be possible to update the scripts/configs for training OLMo-2 1B? Should docs/olmo2.md also be updated or removed?
