Apple Silicon dual-backend port of autoresearch (PyTorch MPS + MLX) with full Muon optimizer
Updated Mar 14, 2026 · Python
High-performance CUDA implementation of Muon optimizer for LLM training. Features Newton-Schulz polar decomposition, cuBLAS acceleration, and transpose optimization for 8x FLOP savings on transformer FFN layers. Benchmarked on NVIDIA A100 with Llama 3.1 8B architectures (4096×11008 weights).
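The Newton-Schulz polar decomposition mentioned above is the core primitive Muon uses to orthogonalize gradient updates. A minimal sketch in Python/NumPy is shown below, using the quintic iteration coefficients from Keller Jordan's public reference implementation; the function name, defaults, and the transpose trick shown here are illustrative assumptions, not this repository's API. The transpose step iterates on the wider orientation of the matrix so the Gram product `X @ X.T` is formed on the smaller dimension, which is the source of the FLOP savings on rectangular FFN weights.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Approximate the orthogonal factor of G's polar decomposition.

    Illustrative sketch (not this repo's API) of the quintic Newton-Schulz
    iteration used by Muon; coefficients follow the public reference
    implementation. After ~5 steps the singular values of the result are
    driven roughly toward 1, i.e. the output is approximately orthogonal.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)   # normalize so singular values lie in (0, 1]
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                          # iterate on the wide orientation: X @ X.T
                                         # is then the smaller Gram matrix
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X                # quintic update: aX + bA X + cA^2 X
    return X.T if transposed else X

# Usage: orthogonalize a tall, FFN-shaped gradient
rng = np.random.default_rng(0)
G = rng.standard_normal((128, 48))
O = newton_schulz_orthogonalize(G)
```

For a 4096×11008 FFN weight, forming the Gram matrix on the 4096 side rather than the 11008 side is where the claimed FLOP savings come from; a cuBLAS implementation additionally fuses these matrix products on-device.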
Few-shot adaptation for vision-language models. Implements base-to-novel generalization on CLIP using LoRA, LP++, and the Muon optimizer to improve performance on the Oxford Flowers-102 dataset.
A performance-optimized Muon optimizer implementation for PyTorch
ARS2-Neo: slides directly along the geodesic of the loss landscape toward the global optimum.