Mixed-curvature transformer experiments built on nanoGPT. The project asks whether adding hyperbolic structure inside the residual stream changes representation geometry in a useful way, and whether that geometric change can translate into faster optimization.
Standard transformer training assumes a Euclidean representation space everywhere. The working hypothesis here is that hidden states are anisotropic enough that a mixed-curvature parameterization can be a better fit: use Euclidean machinery where it is convenient, but let the residual stream operate in a curved space when that improves geometry or optimization.
This repo grew out of two concrete questions:
- Does the mixed-curvature model produce more isotropic hidden representations than a matched Euclidean baseline?
- Does it train faster in practice, and if so, is that effect broad, schedule-sensitive, or optimizer-dependent?
The validated model is mixed-curvature, not fully hyperbolic end to end.
- Token and positional embeddings are combined in Euclidean space.
- In the main validation configs, `use_embedding_curvature=False`.
- Transformer blocks apply hyperbolic residual-style updates internally.
- The final layer norm and LM head remain Euclidean.
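To make the block-internal update concrete, here is a minimal sketch of a hyperbolic residual-style update in the negative-curvature stereographic (Poincaré-ball) model. This is an illustration only, assuming curvature parameter `c`; the actual `model.py` implementation may differ in detail:

```python
import torch

def expmap0(v: torch.Tensor, c: float) -> torch.Tensor:
    """Map a tangent vector at the origin onto the ball of curvature -c."""
    sqrt_c = c ** 0.5
    norm = v.norm(dim=-1, keepdim=True).clamp_min(1e-7)
    return torch.tanh(sqrt_c * norm) * v / (sqrt_c * norm)

def logmap0(y: torch.Tensor, c: float) -> torch.Tensor:
    """Inverse of expmap0: map a ball point back to the origin's tangent space."""
    sqrt_c = c ** 0.5
    norm = y.norm(dim=-1, keepdim=True).clamp_min(1e-7)
    return torch.atanh((sqrt_c * norm).clamp(max=1 - 1e-6)) * y / (sqrt_c * norm)

def mobius_add(x: torch.Tensor, y: torch.Tensor, c: float) -> torch.Tensor:
    """Mobius addition: the curved-space analogue of vector addition."""
    xy = (x * y).sum(dim=-1, keepdim=True)
    x2 = (x * x).sum(dim=-1, keepdim=True)
    y2 = (y * y).sum(dim=-1, keepdim=True)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    den = 1 + 2 * c * xy + c ** 2 * x2 * y2
    return num / den.clamp_min(1e-7)

def hyperbolic_residual(h: torch.Tensor, update: torch.Tensor, c: float = 0.1) -> torch.Tensor:
    """Replace the Euclidean h + update with a Mobius addition on the ball."""
    return logmap0(mobius_add(expmap0(h, c), expmap0(update, c), c), c)
```

At `c -> 0` the Möbius addition degenerates to ordinary vector addition, which is why this parameterization can interpolate between Euclidean and hyperbolic behavior.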
The initial March 2026 validation used:
`curvature_mode='random'`, `dynamic_curvature=True`, `per_head_curvature=True`, `use_embedding_curvature=False`
Later ablations found that a simpler configuration works better.
All main validation runs used 1500 optimizer steps per run.
- Shakespeare budget: 3072 tokens/step, 4.608M tokens/run
- FineWeb budget: 4096 tokens/step, 6.144M tokens/run
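The per-run token budgets follow directly from the shared step budget:

```python
# Tokens per run = tokens per step x optimizer steps (1500 in all main runs).
steps = 1500
shakespeare = 3072 * steps   # 4,608,000 tokens -> 4.608M
fineweb = 4096 * steps       # 6,144,000 tokens -> 6.144M
print(shakespeare, fineweb)
```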
Under the original AdamW setup:
- Isotropy: mixed-curvature improves isotropy relative to baseline.
- Shakespeare: the gain appears mainly in deeper layers.
- FineWeb: the gain appears across all six probed layers.
- Speed: the optimization gain is real but regime-sensitive.
- Shakespeare LR sweep: mixed-curvature beats baseline across the full sweep.
- FineWeb LR sweep: mixed-curvature wins at `5e-4` and `1e-3`, while baseline wins at `1e-4` and `2e-4`.
Under the Muon extension:
- Isotropy survives: deeper-layer isotropy gains remain visible on both Shakespeare and FineWeb.
- Speed does not survive yet: the mixed-curvature Muon speed branch diverged at the Shakespeare gate under the original coarse sweep settings, so the downstream FineWeb speed branch was intentionally canceled.
The ablation suite isolates which parts of the mixed-curvature design actually matter.
Most important takeaways:
- Best overall run: fixed curvature `c=0.1`
- Dynamic curvature: neutral in this setup
- Per-head curvature: tiny gain, not worth the extra complexity/overhead
- Embedding curvature: slightly worse than leaving embeddings Euclidean
- Muon: unstable in the tested regime
Best quality configuration from the graph:
```python
curvature_mode = "fixed"
curvature = 0.1
dynamic_curvature = False
per_head_curvature = False
use_embedding_curvature = False
optimizer = "AdamW"
learning_rate = 5e-4
```

This configuration reached `val_loss = 5.5822` on FineWeb versus the Euclidean baseline at `6.022`.
The initial matched FineWeb 5e-4 rerun showed a substantial overhead for mixed-curvature:
- end-to-end slowdown: about `3.18x`
- train-step-only slowdown: about `5.86x`
The March 25 speed branch then attacked that overhead directly:
- `model_fused.py`: TorchScript-fused hyperbolic ops
- `model_compiled.py`: `torch.compile(mode="default", fullgraph=False)` wrappers
- `model_precompute.py`: cached static curvature transforms
- `model_triton.py`: custom Triton kernels for `mobius_addition`, `expmap`, and `logmap`
- `bench_s4.py`, `bench_s5.py`: reproducible benchmark harnesses
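To illustrate the TorchScript-fusion idea (a sketch only; `model_fused.py` may group the ops differently), scripting the Möbius-addition arithmetic into a single compiled function keeps the whole expression in one graph, reducing Python dispatch and per-op kernel-launch overhead:

```python
import torch

@torch.jit.script
def fused_mobius_add(x: torch.Tensor, y: torch.Tensor, c: float) -> torch.Tensor:
    # All of the Mobius-addition arithmetic lives in one scripted graph,
    # so TorchScript can fuse the elementwise ops instead of launching
    # each kernel separately from Python.
    xy = (x * y).sum(dim=-1, keepdim=True)
    x2 = (x * x).sum(dim=-1, keepdim=True)
    y2 = (y * y).sum(dim=-1, keepdim=True)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    den = 1 + 2 * c * xy + c * c * x2 * y2
    return num / den.clamp_min(1e-7)
```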
Key benchmark findings:
- TorchScript fusion cut kernel launches by about `71%` in the profiling branch
- `torch.compile(..., mode="default")` gave the biggest model-level gain in the benchmark harness
- Triton kernels improved primitive throughput: `mobius_addition` `1.70x`, `expmap` `2.14x`, `logmap` `3.42x`
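A minimal timing harness in the spirit of `bench_s4.py`/`bench_s5.py` (a hypothetical sketch; the real harnesses record more detail) shows how such per-primitive speedups can be measured:

```python
import time
import torch

def bench(fn, *args, iters: int = 100, warmup: int = 10) -> float:
    """Return mean seconds per call of fn(*args), after a warmup phase."""
    for _ in range(warmup):
        fn(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # don't start timing with queued GPU work
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for all timed kernels to finish
    return (time.perf_counter() - t0) / iters

# Speedup = baseline_time / optimized_time, which is how the
# 1.70x / 2.14x / 3.42x primitive numbers above are reported.
```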
These speed modules are included in the public repo as benchmarked implementation paths. The default training path in train.py still uses the standard model unless you explicitly swap in one of the optimized variants.
- `train.py`: main training entrypoint, logging, checkpointing, Muon support
- `model.py`: reference mixed-curvature model
- `model_baseline.py`: Euclidean baseline model
- `model_fused.py`: TorchScript-fused hyperbolic ops
- `model_compiled.py`: compile helpers and compiled op wrappers
- `model_precompute.py`: cached-curvature model variant
- `model_triton.py`: Triton kernels for the core hyperbolic ops
- `run_*.sh`: experiment launchers used in the validation graph
- `runs/`: run logs, summaries, histories, plots
- `analysis_out/`: isotropy analyses and root-level markdown summaries
The current graph-level abstract is mirrored in:
`analysis_out/root_validation_readme_20260320.md`
Shakespeare-char:

```sh
python data/shakespeare_char/prepare.py
```

FineWeb:

```sh
python data/fineweb/prepare.py
```

Install the basic dependencies:

```sh
pip install torch numpy matplotlib
pip install datasets tiktoken
```

Representative runs:

```sh
# AdamW LR sweeps
./run_shakespeare_lr_sweep.sh
./run_fineweb_lr_sweep.sh

# Muon LR sweep
./run_shakespeare_lr_sweep_muon.sh

# Isotropy studies
./run_shakespeare_isotropy.sh
./run_fineweb_isotropy.sh
./run_shakespeare_isotropy_muon.sh
./run_fineweb_isotropy_muon.sh

# Speed benchmarks
python bench_s4.py
python bench_s5.py
```

Plots and summaries can be regenerated from saved run directories:
```sh
python summarize_lr_sweep.py --run_root runs/shakespeare_lr_sweep_modal_20260320
python plot_lr_sweep.py --run_root runs/shakespeare_lr_sweep_modal_20260320
python analyze_representations.py
```

If the goal is best validation quality in the current setup, start with the fixed-curvature configuration above.
If the goal is to keep the original mixed-curvature idea while reducing overhead, use the speed branch as the implementation roadmap:
- fused ops
- `torch.compile(mode="default", fullgraph=False)`
- Triton kernels for the core hyperbolic primitives

Remaining open items:

- Make the mixed-curvature speed story robust under Muon
- Determine whether the instability is optimizer-driven, geometry-driven, or specifically tied to Möbius residual updates
- Validate the speed stack end-to-end in the main training loop, not only in benchmark harnesses
- Push beyond the current negative-curvature-only stereographic setup