Mechanism-level reproduction of Google's Nested Learning (HOPE) architecture (HOPE blocks, CMS, and Self‑Modifying TITANs), matching the quality bar set by lucidrains' TITAN reference while remaining fully open-source and uv-managed.
Faithfulness scope (high level):
- ✅ HOPE / CMS / Self‑Modifying Titans update rules + wiring (mechanism-level)
- ✅ Tensor-level invariants covered by unit tests (teach-signal, δℓ, CMS chunking, causality)
- ⚠️ Online “writes” are stop‑grad (no backprop through online updates / boundary-state training procedure)
- ⚠️ Multi‑GPU “paper-faithful online updates” are not supported in this repo (DDP disables some features)
uv python install 3.12
uv sync --all-extras
uv run bash scripts/data/run_sample.sh
uv run bash scripts/run_smoke.sh pilot # CPU-friendly HOPE block smoke test
uv run bash scripts/run_e2e_smoke.sh # sync + sample data + smoke train + zeroshot eval
uv run python scripts/eval/zeroshot.py \
--config configs/hope/pilot.yaml \
--checkpoint artifacts/examples/pilot_dummy.pt \
--tokenizer-path artifacts/tokenizer/refinedweb_mix/spm_32000_unigram.model \
--tasks piqa --max-samples 32 --device cpu

Requirements:
- Python 3.12+
- `uv` package manager (https://github.com/astral-sh/uv)
- PyTorch 2.9.0 LTS + CUDA-capable GPUs for accelerated runs (CPU works for smoke tests)

Setup:
uv python install 3.12
uv sync --all-extras

Developer checks:
uv run ruff check .
uv run mypy src
uv run pytest
- Tokenizer training
uv run python scripts/data/train_tokenizer.py \
--manifest configs/data/refinedweb_mixture.yaml \
--vocab-size 32000 \
--output-dir artifacts/tokenizer/refinedweb_mix \
--log-file data/mixtures/refinedweb_mix_tokenizer.json
- Corpus filtering + sharding
uv run python scripts/data/process_mixture.py \
configs/data/refinedweb_mixture_filtered.yaml \
--tokenizer-path artifacts/tokenizer/refinedweb_mix/spm_32000_unigram.model \
--log-file data/mixtures/refinedweb_mix_filtered_shards.json
- Sample pipeline (downloads licensed datasets, filters, shards, and records stats)
uv run bash scripts/data/run_sample.sh
- Full pipeline (set env vars like `RW_LIMIT`, `WIKI_LIMIT`, etc. to scale ingestion; see the example below)
uv run bash scripts/data/run_full.sh   # default ~50k docs per corpus; increase limits as needed
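To scale the full pipeline, the limit variables can be prefixed onto the command. A minimal sketch; the limit values below are illustrative placeholders, and the other corpora expose analogous variables:

```bash
# Illustrative scaling of the full ingestion pipeline; the document limits
# here are placeholders, not recommended values.
RW_LIMIT=200000 WIKI_LIMIT=100000 uv run bash scripts/data/run_full.sh
```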
- Single GPU / CPU:
uv run python train.py --config-name pilot_smoke
- Apple Silicon (MPS, if available):
uv run python train.py --config-name pilot_smoke train.device=mps
- DDP (torchrun):
torchrun --nproc_per_node=2 train_dist.py --config-name mid
- CPU-only DDP smoke (verifies `gloo` backend and deterministic seeding):
uv run bash scripts/run_cpu_ddp_smoke.sh
- FSDP (see `docs/FSDP_SCALING_GUIDE.md` for VRAM/batch sizing):
# 760M run
torchrun --nproc_per_node=2 train_fsdp.py --config-name hope/mid_fsdp
# 1.3B run
torchrun --nproc_per_node=2 train_fsdp.py --config-name hope/target_fsdp
- DeepSpeed (requires `deepspeed` installed separately):
deepspeed --num_gpus=2 train_deepspeed.py --config-name target \
deepspeed.config=configs/deepspeed/zero3.json
Use the paper-faithful preset configs (single GPU):
uv run python train.py --config-name pilot_paper_faithful
# HOPE self-mod variant:
uv run python train.py --config-name pilot_selfmod_paper_faithful

Notes:
- Paper-faithful presets set `data.batch_size=1` to avoid cross-sample fast-memory sharing.
- Overrides: `optim.type=m3` (paper optimizer option), `train.steps=...`, `train.device=...` (see the example after these notes).
- See `docs/PAPER_COMPLIANCE.md` for full fidelity notes.
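For instance, a paper-faithful run with the optimizer override might be launched as follows; the step count and device are illustrative placeholders rather than recommended settings:

```bash
# Illustrative paper-faithful launch; optim.type=m3 is the paper optimizer
# option, while the step count and device are placeholder values.
uv run python train.py --config-name pilot_paper_faithful \
optim.type=m3 \
train.steps=2000 \
train.device=cuda:0
```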
- Ensure a `tmux` session is running:
tmux new -s pilot_train
- Launch the long run on `cuda:1` (≈52 h wall clock):
set -a && source git.env && set +a
export UV_CACHE_DIR=/tmp/uv-cache UV_LINK_MODE=copy
uv run python train.py --config-name pilot \
logging.enabled=true logging.backend=wandb \
logging.project=nested-learning logging.run_name=pilot-main-$(date +%Y%m%d%H%M%S) \
train.device=cuda:1
- Checkpoints appear in `artifacts/checkpoints/pilot/step_*.pt` every 1,000 steps; the accompanying W&B run captures full telemetry.
- Copy the final checkpoint, config, logs, and eval JSON/CSV into `artifacts/pilot_release/` for distribution (see the sketch below).
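A minimal sketch of assembling that release bundle; the step number and the eval/log filenames are placeholders for whatever your run actually produced:

```bash
# Assemble the pilot release bundle (paths/filenames are examples only).
mkdir -p artifacts/pilot_release
cp artifacts/checkpoints/pilot/step_230000.pt artifacts/pilot_release/
cp configs/hope/pilot.yaml artifacts/pilot_release/
cp logs/run.json eval/*.json eval/*.csv artifacts/pilot_release/
```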
Set `logging.enabled=true` in Hydra configs (or override via CLI) to send metrics to W&B (default). For local JSON logs, use `logging.backend=json logging.path=logs/run.json`. Sample outputs reside in `logs/` and `artifacts/examples/`.
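As an example, a smoke run can be pointed at the local JSON backend entirely from the CLI (the base config here is the CPU-friendly smoke preset used earlier):

```bash
# Route metrics to a local JSON file instead of W&B.
uv run python train.py --config-name pilot_smoke \
logging.enabled=true \
logging.backend=json \
logging.path=logs/run.json
```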
- Zero-shot:
uv run python scripts/eval/zeroshot.py \
--config configs/hope/mid.yaml \
--checkpoint checkpoints/mid/step_000100.pt \
--tokenizer-path artifacts/tokenizer/refinedweb_mix/spm_32000_unigram.model \
--tasks all --max-samples 200 --device cuda:0
Use `uv run python scripts/eval/zeroshot.py --list-tasks` to display the full benchmark roster (PIQA, HellaSwag, WinoGrande, ARC-E/C, BoolQ, SIQA, CommonsenseQA, OpenBookQA). See `docs/zeroshot_eval.md` for details.
- Needle-in-a-Haystack:
uv run python scripts/eval/niah.py \
--config configs/hope/mid.yaml \
--checkpoint checkpoints/mid/step_000100.pt \
--tokenizer-path artifacts/tokenizer/refinedweb_mix/spm_32000_unigram.model \
--context-lengths 2048 4096 8192 --samples-per-length 20
- Continual-learning forgetting:
uv run python scripts/eval/continual.py \
--config configs/hope/mid.yaml \
--checkpoints checkpoints/mid/step_000050.pt checkpoints/mid/step_000100.pt \
--segments-yaml configs/data/continual_segments_sample.yaml \
--batch-size 4 --max-batches 10 --memorize --memorize-steps 2
Plot forgetting curves via `uv run python scripts/eval/plot_forgetting.py --continual-json eval/continual_mid.json`.
- Long-context diagnostics:
uv run python scripts/eval/passkey.py --config configs/hope/pilot.yaml --checkpoint artifacts/checkpoints/pilot/step_230000.pt \
--tokenizer-path artifacts/tokenizer/refinedweb_mix/spm_32000_unigram.model --samples 64 --memorize
uv run python scripts/eval/pg19_perplexity.py --config configs/hope/pilot.yaml --checkpoint artifacts/checkpoints/pilot/step_230000.pt \
--tokenizer-path artifacts/tokenizer/refinedweb_mix/spm_32000_unigram.model --max-samples 64
Evaluation summaries are written to `eval/` alongside per-task JSON metrics.
Every evaluator supports TITAN-style memorization so you can reproduce test-time adaptation:
uv run python scripts/eval/zeroshot.py \
... \
--memorize \
--memorize-steps 2 \
--memorize-use-correct-answer \
--memorize-no-reset \
--memorize-paths titan,cms_fast \
--memorize-surprise-threshold 0.01

- `--memorize` turns on the learner with one LMS step per example by default.
- `--memorize-steps` controls the number of adaptation passes per prompt.
- `--memorize-use-correct-answer` injects ground-truth text during memorization for ablations.
- `--memorize-no-reset` (optional) carries memories across samples; omit it to reset every question.
- `--memorize-paths` restricts which levels receive teach-signal updates (`titan`, `cms_fast`, or `all`).
- `--memorize-surprise-threshold` gates updates on average teach-signal norm, matching the paper’s surprise trigger.
Memorization metrics (baseline vs adaptive) are emitted alongside task accuracy for easy comparisons.
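For a complete illustrative invocation, the memorization flags can be combined with the quick-start smoke checkpoint; the paths, task, and flag values below are examples, not recommended settings:

```bash
# End-to-end memorization eval on CPU using the dummy pilot checkpoint from
# the quick start; swap in a real checkpoint for meaningful numbers.
uv run python scripts/eval/zeroshot.py \
--config configs/hope/pilot.yaml \
--checkpoint artifacts/examples/pilot_dummy.pt \
--tokenizer-path artifacts/tokenizer/refinedweb_mix/spm_32000_unigram.model \
--tasks piqa --max-samples 32 --device cpu \
--memorize --memorize-steps 2 \
--memorize-paths titan \
--memorize-surprise-threshold 0.01
```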
Select the paper-defined variant via `model.block_variant` in Hydra configs (a CLI override example follows the list):
- `hope_attention` (paper HOPE-Attention): `Attention → CMS` (paper-defined).
- `hope_selfmod` (paper HOPE scaffold): `Self-modifying Titans (Eqs. 83–93; Eq. 91 residual MLP memories) → CMS` with (by default) fixed q and local conv window=4, plus chunked updates via `model.self_mod_chunk_size` (others) and `model.self_mod_chunk_size_memory` (M_memory). See `docs/PAPER_COMPLIANCE.md` for the “differentiable read / update-pass writes” semantics.
- `hope_hybrid` (legacy): `Attention + TitanMemory + CMS` (exploratory; not paper-defined).
- `transformer` (baseline): `Attention → MLP` (no TITAN/CMS learning updates; useful for Phase 2 comparisons).
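As a quick illustration (any config works the same way; `pilot_smoke` is just a convenient CPU-friendly base for trying variants), the variant can be switched from the CLI:

```bash
# Switch block variants via Hydra CLI overrides.
uv run python train.py --config-name pilot_smoke model.block_variant=hope_selfmod
uv run python train.py --config-name pilot_smoke model.block_variant=transformer   # baseline ablation
```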
Self-modifying Titans knobs (ablation-friendly, paper-aligned):
`model.self_mod_objective` (`l2` vs `dot`), `model.self_mod_use_rank1_precond` (DGD-like preconditioner), `model.self_mod_use_alpha` (weight-decay/retention gate), `model.self_mod_stopgrad_vhat`, `model.self_mod_momentum`, `model.self_mod_adaptive_q`, `model.self_mod_local_conv_window`.
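A sketch of an ablation launch combining a few of these knobs; the values are illustrative, and the boolean syntax assumes standard Hydra true/false overrides:

```bash
# Illustrative self-mod ablation: dot-product objective, rank-1 preconditioner
# disabled, smaller local conv window (example values only).
uv run python train.py --config-name pilot_selfmod_paper_faithful \
model.self_mod_objective=dot \
model.self_mod_use_rank1_precond=false \
model.self_mod_local_conv_window=2
```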
In-context updates can run against a per-context fast state so meta parameters never change:
- `HOPEModel.init_fast_state()` / `TitanOnlyModel.init_fast_state()` returns a `ModelFastState`.
- `MemorizeConfig.use_fast_state=true` (default) requires passing `fast_state` into `memorize_tokens()` / `memorize_sequence()`; evaluation scripts handle this automatically.
- Training can also run update passes against a per-batch fast state via `train.use_fast_state=true` (meta+delta fast state: meta params are learnable; online updates write deltas only). If `data.batch_size>1`, CMS/TITAN fast state is shared across the batch; use `data.batch_size=1` for strict per-context semantics (example below). See `docs/PAPER_COMPLIANCE.md`.
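For example, strict per-context fast-state semantics during training combine both overrides (the `pilot` base config here is illustrative):

```bash
# Per-batch fast state with strict per-context semantics: online updates write
# deltas into the fast state only, and batch_size=1 avoids sharing it.
uv run python train.py --config-name pilot \
train.use_fast_state=true \
data.batch_size=1
```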
Before tagging or announcing a new checkpoint, work through `docs/release_checklist.md` so the bundle includes manifest validation reports, tokenizer coverage JSON, zero-shot/NIAH/continual/passkey/PG-19 eval outputs, forgetting plots, and filled checkpoint reports.
- Mixed precision: enable bf16 autocast via `train.mixed_precision.enabled=true train.mixed_precision.dtype=bf16` (already enabled in pilot/mid/target configs).
- torch.compile: accelerate attention/core loops by toggling `train.compile.enable=true train.compile.mode=max-autotune`; failure falls back to eager unless `train.compile.strict=true`.
- Muon hybrid (default): all HOPE configs now set `optim.type=muon`, routing ≥2D tensors through PyTorch 2.9's Muon optimizer while embeddings/norms stay on AdamW. Training logs emit `optim.muon_param_elems` / `optim.adamw_param_elems` so you can confirm the split.
- Fused AdamW fallback: override with `optim.type=adamw optim.fused=auto` if Muon is unavailable or if you want to compare against the AdamW ablation in `reports/ablations.md`.
- Surprise gating: set `model.surprise_threshold=<float>` to gate all inner updates. By default the surprise metric is the average L2 norm of the (scaled/clipped) teach signal (`model.surprise_metric=l2`); you can also use `loss` or `logit_entropy` for ablations. Evaluation CLIs expose `--memorize-surprise-threshold` for ad-hoc gating. A combined override example follows this list.
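Putting a few of these knobs together, a throughput-oriented launch might look like the following; the base config and the threshold value are illustrative, not tuned recommendations:

```bash
# Illustrative performance-tuned launch: bf16 autocast, torch.compile, and
# surprise gating (0.01 is a placeholder threshold).
uv run python train.py --config-name pilot \
train.mixed_precision.enabled=true train.mixed_precision.dtype=bf16 \
train.compile.enable=true train.compile.mode=max-autotune \
model.surprise_threshold=0.01
```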
All Hydra knobs can be overridden from the CLI or composed via config groups (`configs/hope/*.yaml`). Use these flags in tandem with `scripts/run_e2e_smoke.sh` (automation) or `scripts/run_cpu_ddp_smoke.sh` (CPU-only determinism check) to validate releases quickly.
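A minimal pre-release validation pass, assuming a clean checkout, can simply chain the two smoke scripts and stop at the first failure:

```bash
# Quick release validation: end-to-end smoke plus CPU-only DDP determinism check.
uv run bash scripts/run_e2e_smoke.sh && uv run bash scripts/run_cpu_ddp_smoke.sh
```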
- `docs/guide.md` – full onboarding (setup → data → training → eval).
- `docs/release_plan.md` – release readiness checklist.
- `docs/data_pipeline.md` – large-scale sharding/tokenizer workflow.
- `docs/scaling_guidance.md` – roadmap for expanding data + compute footprints.
- `docs/stage1_plan.md`, `docs/stage2_plan.md` – architecture + experiment roadmaps.
- `docs/stage2_progress.md` – latest dual-GPU training/eval status and commands.
- `docs/experiments_report.md` – draft paper covering completed experiments.
- `docs/stability_journal.md` – chronological notes on NaN fixes & teach-scale tuning.
- `docs/future_directions.md` – prioritized roadmap after the initial release.
- `reports/stage2_smoke.md` – exact commands/artifacts for the release-ready smoke workflow.
- `docs/FSDP_SCALING_GUIDE.md` – dual-RTX 6000 Ada instructions for the mid/target FSDP configs.
- `google_papers/` – PDFs/markdown of Nested Learning & TITAN papers.
- `CHANGELOG.md` – user-facing changes per release.
- Run formatting/tests (`uv run ruff check .`, `uv run pytest`).
- Document new configs or scripts in `docs/guide.md` and update `CHANGELOG.md`.
- Open a PR referencing the relevant NL/TITAN spec sections or planner transcript snippets.