Add Apple Silicon MLX backend (self-contained, GPU-accelerated Muon)#205
Open
elementalcollision wants to merge 1 commit into karpathy:master from
Conversation
Self-contained single-file MLX training script for Apple Silicon Macs. Ports the full training pipeline, including:

- Muon+AdamW optimizer with Newton-Schulz orthogonalization
- Float32 Newton-Schulz fix (prevents NaN divergence on Apple Silicon)
- MLX-native dataloader (numpy buffers → mx.array)
- MLX-native BPB evaluation (computes byte lengths from tiktoken directly)
- Apple Silicon hardware detection for MFU calculation
- All features: GQA, RoPE, value embeddings, sliding window attention, softcap, gradient accumulation, LR warmup/cooldown schedules

No existing files are modified beyond adding mlx as an optional dependency in pyproject.toml. Does not depend on torch at runtime.

Usage:

    pip install mlx tiktoken pyarrow rustbpe numpy
    python prepare.py    # one-time data prep (still needs torch)
    python train_mlx.py

Tested on M1 Max 64GB: val_bpb 2.094 → 1.621 over 25 experiments.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
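The float32 Newton-Schulz fix mentioned above is the key numerical detail: running the iteration in low precision can diverge to NaN. A minimal sketch of the quintic Newton-Schulz orthogonalization used by Muon, shown here in NumPy for portability (the actual script would use `mx.array` operations; the function name is illustrative, and the coefficients are the standard published Muon values):

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Approximately orthogonalize G via the quintic Newton-Schulz iteration.

    Runs in float32 (not bf16/fp16), which is the fix the PR refers to:
    low precision lets the polynomial iteration blow up to NaN.
    """
    a, b, c = 3.4445, -4.7750, 2.0315  # standard Muon quintic coefficients
    X = np.asarray(G, dtype=np.float32)
    X = X / (np.linalg.norm(X) + eps)  # Frobenius norm bounds the spectral norm <= 1

    # Iterate on the "wide" orientation so X @ X.T is the smaller Gram matrix.
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T

    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X  # polynomial update pushes singular values toward 1

    return X.T if transposed else X
```

Because the update is a polynomial in `X @ X.T`, it acts on each singular value independently, driving them all toward 1 without ever computing an SVD.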
Author
Validation log here. M5 testing will be done in ~5 days.
Summary
Adds a self-contained `train_mlx.py` for training on Apple Silicon Macs using MLX. No existing files are modified beyond adding `mlx` as an optional dependency in `pyproject.toml` (no `token_bytes.pt` dependency).

Differences from #202
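Dropping the `token_bytes.pt` dependency works because bits-per-byte only needs the total byte length of the evaluated text, which can be computed from the tokenizer at runtime. A minimal sketch of the conversion (function name is illustrative; in the actual script the per-token byte counts would come from tiktoken, e.g. `len(enc.decode_single_token_bytes(tok))` for each token id, rather than a precomputed table):

```python
import math

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert summed cross-entropy (in nats, over all predicted tokens)
    into bits per byte: bpb = total_nats / (ln 2 * total_bytes)."""
    return total_nll_nats / (math.log(2) * total_bytes)
```

For example, if the model spends exactly ln 2 nats per token and each token decodes to one byte, the result is 1.0 bpb.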
This PR and #202 solve the same problem but take different approaches:
|  | This PR | #202 |
| --- | --- | --- |
| Files touched | `pyproject.toml` only (+5 lines) | `prepare.py` (+70 lines), `pyproject.toml` (+2/−11) |
| `prepare.py` / `torch` dep | `torch==2.9.1` (pinned to CUDA) unchanged | `torch>=2.3.0`, removes CUDA index |
| `mlx` dep | optional (`[project.optional-dependencies]`) | — |
| `token_bytes.pt` | not needed | — |

Key philosophical difference: this PR is additive-only — it adds two files and doesn't touch existing CUDA code or dependencies, so there's zero risk of breaking the default GPU path.
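An optional-dependency entry of roughly this shape would keep the default install untouched (a sketch only — the group name and exact contents are assumptions, since the PR's actual `pyproject.toml` diff is not shown here):

```toml
# pyproject.toml (illustrative; actual +5-line diff not shown in this thread)
[project.optional-dependencies]
mlx = [
    "mlx",  # Apple Silicon array framework; installed only via `pip install .[mlx]`
]
```

Because the group is opt-in, `pip install .` on a CUDA machine resolves exactly as before.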
Files changed
- `train_mlx.py` (new)
- `pyproject.toml`: adds `mlx` as an optional dependency (+5 lines)

Test plan
🤖 Generated with Claude Code