Replaces existing SFT with Olmo-core version. #1327
base: main
Conversation
…mmand - Add --dataset_mixer_list, --cache_dataset_only, --output_dir to train subparser - Handle --cache_dataset_only flag in main function - Update debug script to use --dataset_mixer_list pattern - Make --dataset_path optional (required only for training)
The olmo_core.launch.beaker module requires a different beaker-py version than what's installed. Make the import optional so --cache_dataset_only mode works without beaker dependencies.
When --dataset_path is not provided but --output_dir is, use output_dir as the dataset path. This matches how mason.py caches datasets to output_dir and then runs training.
olmo_core.config.Config doesn't support integer keys properly. Changed SFTConfig from inheriting Config to a plain @dataclass. Removed the .merge(overrides) call and debug print statements.
Uses a two-step approach: 1. Cache the HF dataset to numpy format on weka. 2. Run training on the cached data.
Needed for WandB and config saver callbacks since SFTConfig is now a plain dataclass instead of inheriting from olmo_core.config.Config.
Fixes S3 403 Forbidden errors by using the weka-mounted path directly.
The previous checkpoint path (OLMo-2-1124-7B/step556000-unsharded) was either empty or in an incompatible format. Switch to an OLMo3 checkpoint which has proper .metadata files for olmo-core's distributed checkpoint loading.
- Add MODEL_CONFIGS mapping for supported architectures (olmo2_7B, olmo3_7B, etc.) - Add --model_config argument to train and launch subcommands - Remove separate dry_run subcommand, use --dry_run flag instead - Update debug script to use OLMo3-7B checkpoint and model config
Use --rdzv-backend=c10d --rdzv-endpoint=localhost:0 to let torchrun automatically find a free port instead of hardcoding a port number.
Add 'from_checkpoint' option for --model_config that reads the TransformerConfig directly from the checkpoint's config.json file. This ensures the model architecture always matches the checkpoint.
resource_path takes folder and filename as separate arguments.
Avoid OmegaConf compatibility issues by manually constructing the TransformerConfig from the checkpoint's config.json values.
The checkpoint uses per-head QK norm which has different weight shapes.
Use bias, eps, and name from checkpoint config for qk_norm, layer_norm, and lm_head layer_norm.
Required by OLMo3 checkpoints that were saved with torchao optimizations.
The checkpoint was saved with an older torchao version that had different float8 module structure.
The OLMo3 float8 checkpoint is incompatible with current library versions. Need to use a checkpoint without float8 dependencies.
The previous OLMo3 checkpoint was saved with float8 optimizations that require an older version of torchao incompatible with our transformers dependency.
Copy BeakerCallbackV2 from refactor-dpo branch. This callback updates Beaker experiment descriptions with training progress and W&B URLs, using the beaker-py v2 API.
Summary of Changes
Hello @finbarrtimbers, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces a significant architectural shift in the finetuning process by migrating the core finetuning script to the OLMo-core training infrastructure. This change aims to standardize and optimize the training of OLMo models. Consequently, all previous HuggingFace Accelerate-based Beaker configurations and finetuning scripts have been deprecated. The update also includes a new Beaker callback for compatibility and a dedicated subcommand for efficient dataset caching to a numpy format, aligning with OLMo-core's data handling mechanisms.
Code Review
This pull request introduces a significant and well-executed refactoring of the supervised finetuning (SFT) process, replacing the existing HuggingFace Accelerate-based implementation with a new one built on olmo-core. The changes include deprecating old configurations and scripts, and introducing a new, more structured finetune.py with a clear command-line interface and better configuration management using dataclasses. The addition of a custom BeakerCallbackV2 for Beaker v2 compatibility is also a great improvement.
Overall, the new implementation is much cleaner and more modular. I've identified a few areas for improvement:
- A potential out-of-memory issue in the dataset caching logic for very large datasets.
- A minor inconsistency in time measurement within the new Beaker callback.
- A question regarding the removal of a file from the type-checking configuration.
Details are in the specific comments. Great work on this large-scale refactoring!
train_dataset = train_dataset.select(range(args.num_examples))

print("Collecting tokens from dataset...")
token_ids = []
The current implementation of cache_dataset_only collects all token_ids and labels_mask into lists in memory before writing them to disk. For very large datasets, this can lead to out-of-memory errors, which undermines the benefit of chunking the output files.
To fix this, I suggest refactoring this part to process the dataset in chunks and write to the memory-mapped files incrementally. This would involve iterating through the dataset, accumulating a chunk of a predefined size (e.g., 1GB of tokens), writing it to a file part, and then proceeding to the next chunk. This approach would keep memory usage low regardless of the total dataset size.
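A minimal sketch of that chunked approach, assuming a hypothetical per-chunk token budget, helper names (write_chunk, cache_dataset_in_chunks), and a "token_ids" field; labels_mask would be flushed the same way, and the real column names and file layout in finetune.py may differ:

```python
# Sketch only: iterate the dataset, buffer up to CHUNK_TOKENS tokens,
# and flush each buffer to its own file part via a memory-mapped write.
import numpy as np

CHUNK_TOKENS = 250_000_000  # roughly 1 GB of uint32 tokens; tune as needed


def write_chunk(path: str, buffer: list[int], dtype=np.uint32) -> None:
    """Write one buffered chunk to disk as a memory-mapped array."""
    arr = np.memmap(path, dtype=dtype, mode="w+", shape=(len(buffer),))
    arr[:] = np.asarray(buffer, dtype=dtype)
    arr.flush()


def cache_dataset_in_chunks(dataset, output_dir: str) -> None:
    token_buffer: list[int] = []
    part = 0
    for example in dataset:  # assumes each example already carries token ids
        token_buffer.extend(example["token_ids"])
        if len(token_buffer) >= CHUNK_TOKENS:
            write_chunk(f"{output_dir}/tokens-part-{part:05d}.npy", token_buffer)
            token_buffer.clear()
            part += 1
    if token_buffer:  # flush the final partial chunk
        write_chunk(f"{output_dir}/tokens-part-{part:05d}.npy", token_buffer)
```

With this pattern, peak memory is bounded by the chunk size rather than the full dataset.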
self.enabled
and get_rank() == 0
and self.step % self.trainer.metrics_collect_interval == 0
and (self._last_update is None or (time.monotonic() - self._last_update) > 10)
For consistency, it's better to use the same clock for measuring time intervals. _start_time is initialized with time.perf_counter(), so _last_update should also use time.perf_counter() to avoid potential issues from mixing different clocks (time.monotonic() and time.perf_counter()).
- and (self._last_update is None or (time.monotonic() - self._last_update) > 10)
+ and (self._last_update is None or (time.perf_counter() - self._last_update) > 10)
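As an illustration, a minimal pattern that keeps both timestamps on the same clock; the attribute names mirror the callback's, but the class itself is a hypothetical stand-in, not the actual BeakerCallbackV2 code:

```python
import time


class ThrottledUpdateMixin:
    """Sketch: use one clock (time.perf_counter) for both the start time
    and the last-update timestamp so intervals are always comparable."""

    def __init__(self) -> None:
        self._start_time = time.perf_counter()
        self._last_update: float | None = None

    def should_update(self, min_interval: float = 10.0) -> bool:
        now = time.perf_counter()
        if self._last_update is None or (now - self._last_update) > min_interval:
            self._last_update = now
            return True
        return False
```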
    allow_multiple=False,
    distributed=False,
)
self._last_update = time.monotonic()
| "open_instruct/data_loader.py", | ||
| "open_instruct/data_types.py", | ||
| "open_instruct/logger_utils.py", | ||
| "open_instruct/mix_data.py", |
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: bd45c4f3d1
dataset_config = build_sft_dataset(
    root_dir=root_dir, tokenizer_config=tokenizer_config, sequence_length=seq_len, dataset_path=dataset_path
)
gpu_type = CLUSTER_TO_GPU_TYPE[cluster]
Handle clusters accepted by get_root_dir
The new config builder dereferences CLUSTER_TO_GPU_TYPE[cluster] directly, but get_root_dir still treats clusters like ai2/prior, ai2/saturn, ai2/aristo-elara-cirrascale, ai2/allennlp, ai2/mosaic, and local as valid. If a user passes any of those (which the script now allows via get_root_dir), this line raises a KeyError and the run never starts. Please either add GPU type entries for those clusters or handle a default/fallback so these previously supported clusters don’t crash during config construction.
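One possible fallback, sketched with dict.get and an assumed default rather than the repo's actual mapping:

```python
# Sketch: avoid a KeyError for clusters that get_root_dir accepts but that
# CLUSTER_TO_GPU_TYPE doesn't list. The default value here is an assumption.
DEFAULT_GPU_TYPE = "NVIDIA H100 80GB HBM3"  # hypothetical default

gpu_type = CLUSTER_TO_GPU_TYPE.get(cluster, DEFAULT_GPU_TYPE)
```

Alternatively, catching the KeyError and raising a ValueError that names the unsupported cluster would fail earlier with a clearer message.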
@finbarrtimbers, the new script looks super clean! Was wondering if it would also be possible to update the scripts/configs for training OLMo-2 1B? Should …
Ran the single GPU finetuning script: Beaker.