PyTorch implementation of preconditioned stochastic gradient descent (Kron and affine preconditioners, a low-rank approximation preconditioner, and more)
Updated Jan 11, 2026 · Python
A learning resource for the Muon optimizer: Newton-Schulz orthogonalization, theory, code examples, and production guides
High-performance CUDA implementation of Muon optimizer for LLM training. Features Newton-Schulz polar decomposition, cuBLAS acceleration, and transpose optimization for 8x FLOP savings on transformer FFN layers. Benchmarked on NVIDIA A100 with Llama 3.1 8B architectures (4096×11008 weights).
A performance-optimized Muon optimizer implementation for PyTorch
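Several of the entries above center on Newton-Schulz orthogonalization, the iteration Muon uses to approximate the orthogonal factor of a gradient matrix's polar decomposition. A minimal NumPy sketch may help situate them; the quintic coefficients below follow the widely circulated Muon reference implementation, and `newton_schulz_orthogonalize` is an illustrative name, not an API from any repository listed here.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Approximately orthogonalize G via a quintic Newton-Schulz iteration.

    Illustrative sketch only: coefficients (a, b, c) are those commonly
    used in Muon implementations; they drive the singular values of G
    toward a band around 1 rather than converging exactly.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    # Normalize by the Frobenius norm so all singular values start below 1,
    # which is required for the iteration to behave.
    X = G / (np.linalg.norm(G) + eps)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T  # work with the wide orientation, as Muon implementations do
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 6))
O = newton_schulz_orthogonalize(G)
# After a few steps the singular values of O cluster near 1.
s = np.linalg.svd(O, compute_uv=False)
```

Because the quintic polynomial trades exact convergence for speed, the result is only approximately orthogonal; that approximation is reportedly sufficient for Muon's update step.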