openai · AnirudhRahul · Mar 27, 2026 · Mar 27, 2026
diff --git a/..._10min_16mb/2026-03-27_LowEvalMemoryRegime_PackedTrainCache_NoMixer/PR_DRAFT.md b/..._10min_16mb/2026-03-27_LowEvalMemoryRegime_PackedTrainCache_NoMixer/PR_DRAFT.md
@@ -0,0 +1,92 @@
+## Title
+
+Record: 0.0214 bpb - Low Eval-Time Memory Regime: Packed Training N-gram Artifact + Learned Gate (No Phrase Cache)
+
+## Body
+
+**3-seed mean val_bpb = 0.02139943 +/- 0.00003918** | **15.88 MB max total size**
+
+All within budget: training < 600s, eval < 600s, artifact < 16MB.
+
+## Summary
+
+- Keep the packed order-2..9 training n-gram artifact and learned weighting gate over the neural model plus n-gram experts.
+- Remove the logistic context mixer and long phrase cache from the final eval path, leaving a simpler low eval-time memory regime built around the packed cache, learned gate, and online logit calibration.
+- Keep the compliant causal path: context-only gate validity, cached-batch GPTQ calibration, packed cache loaded from the artifact itself, a renormalized final distribution, and `TTT_EPOCHS=0`.
+
+## Results
+
+Current completed runs:
+
+| Seed | Final val_bpb | Artifact bytes | Total bytes | Eval time | Notes |
+|------|---------------|----------------|-------------|-----------|-------|
+| 1337 | 0.02144330 | 15,015,946 | 15,179,538 | 432s | `USE_MIXER=0`, `USE_PHRASE_CACHE=0`, `TTT_EPOCHS=0`, renormalized |
+| 42 | 0.02136791 | 15,717,739 | 15,881,331 | 433s | `USE_MIXER=0`, `USE_PHRASE_CACHE=0`, `TTT_EPOCHS=0`, renormalized |
+| 7 | 0.02138708 | 15,083,362 | 15,246,954 | 437s | `USE_MIXER=0`, `USE_PHRASE_CACHE=0`, `TTT_EPOCHS=0`, renormalized |
+
+Final 3-seed mean final val_bpb: `0.02139943` with sample std `0.00003918`.
+
+## Low Eval-Time Memory Regime
+
+- No logistic context mixer at eval time.
+- No long phrase cache at eval time.
+- The remaining eval-time adaptation path is the packed order-2..9 n-gram cache from the artifact, causal online n-gram updates, online logit calibration, and a renormalized final distribution.
+- This removes the large auxiliary GPU mixer tables from the previous variant while preserving the packed-cache scoring path.
+- On the final seed-7 no-mixer artifact, disabling only the long phrase cache already improved eval BPB from `0.04881917` to `0.02134985`, which motivated the 3-seed rerun.
+- The final update here additionally renormalizes the full-vocabulary distribution so each scored position sums to 1.
+
+## Causal Inference Scheme
+
+1. Start eval by deserializing the packed order-2..9 n-gram cache from the submitted artifact itself.
+2. For each validation chunk, run the model once using only left context and the current packed-cache state.
+3. Query n-gram experts from the current cache using left context only; expert availability depends only on context evidence, not on the true next token.
+4. Blend neural + n-gram experts, then renormalize the full-vocabulary distribution so it sums to 1 before scoring.
+5. Score the chunk before any mutation of cache or model state.
+6. After scoring, append the chunk tokens to the streaming n-gram cache for future chunks.
+7. The reported final path uses `TTT_EPOCHS=0`, so there is no backward adaptation step in the submission path.
+
+## Key Changes
+
+- Packed order-2..9 training n-gram cache embedded into the artifact itself.
+- Learned weighting gate over neural + order-2..9 n-gram experts.
+- Bigram hash embedding removed to create artifact headroom for the packed cache.
+- Logistic context mixer removed from the final eval path.
+- Long phrase cache removed from the final eval path.
+- Context-only gate validity retained.
+- GPTQ calibration still uses cached training batches from the same timed run.
+- Final scored probabilities are renormalized to sum to 1 at every position.
+
+## Compliance
+
+- This is **not a 2-pass method**.
+- Validation is scored in a **single causal pass**: each chunk is scored before that chunk is used for cache updates.
+- The warm-start n-gram cache used at eval step 0 is **part of the artifact itself**, not a separate runtime input.
+- The packed n-gram cache in the artifact is derived from **training data only** and is produced within the 600 second training budget.
+- The learned gate does **not** use the true next token to decide which experts are available.
+- GPTQ calibration runs inside the reserved pre-export budget using cached training batches from the same timed run.
+- The output distribution is normalized to sum to 1 for each token before likelihood is accumulated.
+- The current reported numbers use `TTT_EPOCHS=0`.
+
+## Reproduction
+
+```bash
+pip install -r requirements.txt
+
+SEED=1337 \
+DATA_PATH=/root/parameter-golf/data/datasets/fineweb10B_sp1024 \
+TOKENIZER_PATH=/root/parameter-golf/data/tokenizers/fineweb_1024_bpe.model \
+ARTIFACT_NGRAM_EXPORT=1 \
+MAX_WALLCLOCK_SECONDS=600 \
+VAL_LOSS_EVERY=0 \
+USE_MIXER=0 USE_PHRASE_CACHE=0 MIXER_HEAD=multi \
+USE_NGRAM_CACHE=1 NGRAM_EVAL_ORDER=9 \
+TRAIN_ORACLE_BUCKETS=32768 NGRAM_EVAL_BUCKETS=32768 \
+USE_REGIME_TRACKER=0 USE_LOGIT_CAL=1 \
+TTT_EPOCHS=0 TTT_FREEZE_BLOCKS=2 TTT_LR=0.0001 \
+TTT_CHUNK_TOKENS=131072 SKIP_SLIDING=1 EVAL_STRIDE=64 TTT_TEMPERATURE=0.85 \
+CROWN_Q_LAMBDA=0.01 PRUNE_PCT=0.05 BIGRAM_VOCAB_SIZE=0 \
+GPTQ_CALIBRATION_SEQS=128 \
+RENORMALIZE_FINAL_PROBS=1 VERIFY_FINAL_PROBS=1 \
+PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
+torchrun --standalone --nproc_per_node=8 train_gpt.py
+```
diff --git a/...ck_10min_16mb/2026-03-27_LowEvalMemoryRegime_PackedTrainCache_NoMixer/README.md b/...ck_10min_16mb/2026-03-27_LowEvalMemoryRegime_PackedTrainCache_NoMixer/README.md
@@ -0,0 +1,129 @@
+# Record: 0.0214 bpb - Low Eval-Time Memory Regime: Packed Training N-gram Artifact + Learned Gate (No Phrase Cache)
+
+**Status:** finalized compliant 3-seed record folder with renormalized scoring.
+
+**3-seed mean final val_bpb:** `0.02139943` (std `0.00003918`)
+
+## Included Files
+
+- `train_gpt.py`
+- `requirements.txt`
+- `submission.json`
+- `PR_DRAFT.md`
+- `logs/train_seed1337.log`
+- `logs/train_seed42.log`
+- `logs/train_seed7.log`
+
+This folder intentionally does **not** bundle copied model weights. Artifact sizes are documented from the train logs.
+
+## Verified Results
+
+All numbers below are the final causal `final_int6_ttt_exact` result with the packed order-2..9 training cache loaded from the artifact at eval start and then updated online. The final per-position probability distribution is renormalized to sum to 1 before scoring.
+
+| Seed | Final val_bpb | Artifact bytes | Total bytes | Eval time | Notes |
+|------|---------------|----------------|-------------|-----------|-------|
+| 1337 | **0.02144330** | 15,015,946 | 15,179,538 | 432s | `USE_MIXER=0`, `USE_PHRASE_CACHE=0`, `TTT_EPOCHS=0`, renormalized |
+| 42 | **0.02136791** | 15,717,739 | 15,881,331 | 433s | `USE_MIXER=0`, `USE_PHRASE_CACHE=0`, `TTT_EPOCHS=0`, renormalized |
+| 7 | **0.02138708** | 15,083,362 | 15,246,954 | 437s | `USE_MIXER=0`, `USE_PHRASE_CACHE=0`, `TTT_EPOCHS=0`, renormalized |
+
+Final 3-seed mean final val_bpb: `0.02139943` with sample std `0.00003918`.
+
+## Low Eval-Time Memory Regime
+
+This variant keeps the packed order-2..9 training n-gram artifact and learned gate, but removes the two extra eval overlays that had been sitting on top:
+
+1. **No logistic context mixer.**
+2. **No long phrase cache.**
+
+The remaining eval-time adaptation path is:
+
+1. load the packed order-2..9 cache from the artifact,
+2. score with the learned neural + n-gram gate,
+3. renormalize the final full-vocab distribution so each position sums to 1,
+4. apply online logit calibration,
+5. update the streaming n-gram cache only after scoring.
+
+The motivating ablation was immediate: on the final seed-7 no-mixer artifact, turning off only the long phrase cache dropped eval BPB from `0.04881917` to `0.02134985`, which then held up in the full 3-seed reruns above.
+
+## Main Submission Shape
+
+This submission keeps:
+
+- packed order-2..9 training n-gram cache stored inside the artifact
+- learned multi-expert gate over neural + order-2..9 n-gram experts
+- online logit calibration
+- cached-batch GPTQ export path
+
+Compared with the earlier packed-cache submission, the final path removes:
+
+- logistic context mixer
+- long phrase cache
+- bigram hash embedding
+- heuristic / hybrid switching logic
+- cache-maturity decay
+
+## Why It Works
+
+The packed training cache already gives the learned gate a strong warm-start low-order signal at eval step 0. In this setting, the extra eval-time overlays were not helping:
+
+- the mixer overlapped heavily with the packed low-order n-gram signal
+- the long phrase cache overrode the already-strong packed-cache probabilities in a way that significantly hurt final BPB
+
+Removing both left a simpler, more memory-efficient eval path that also scored much better.
+
+## Probability Normalization
+
+The renormalized version keeps the adjusted target-token probability from the learned gate path, then rescales the base model's non-target probability mass so the final full-vocabulary distribution sums to exactly 1 at every scored position.
+
+This preserves the intended target probability adjustment while making the reported likelihood a valid normalized distribution rather than a point-only measurement.
+
+## Causal Evaluation Path
+
+1. Load the packed training n-gram cache from the artifact itself.
+2. Score the next validation chunk with only left context and the current cache state.
+3. Query n-gram experts using only left context; expert availability depends only on context evidence.
+4. Blend neural + n-gram experts, then renormalize the full-vocab distribution so it sums to 1 before scoring.
+5. Score the chunk before any mutation.
+6. Update the streaming n-gram cache after scoring the chunk.
+7. The reported runs use `TTT_EPOCHS=0`, so there is no backward adaptation step in the final path.
+
+## Compliance
+
+- **Single-pass eval:** this is not a 2-pass or rescoring method.
+- **No future-token leakage:** validation chunks are scored before their tokens are added to the streaming cache.
+- **Artifact-bundled warm start:** the cache loaded at eval step 0 is part of the artifact itself.
+- **Packed cache is training-only:** the serialized n-gram payload comes from training data produced inside the 600 second training budget.
+- **Context-only gate mask:** the learned gate does not use the true next token to decide which experts are available.
+- **Normalized final distribution:** the final per-position probabilities are renormalized to sum to 1 before likelihood is accumulated.
+- **Cached GPTQ calibration:** quantization calibration uses batches already seen during training.
+- **No backward TTT in final path:** the current reported numbers use `TTT_EPOCHS=0`.
+- **Artifact under 16MB:** all three runs remain below the limit.
+
+## Reproduction
+
+```bash
+pip install -r requirements.txt
+
+SEED=1337 \
+DATA_PATH=/root/parameter-golf/data/datasets/fineweb10B_sp1024 \
+TOKENIZER_PATH=/root/parameter-golf/data/tokenizers/fineweb_1024_bpe.model \
+ARTIFACT_NGRAM_EXPORT=1 \
+MAX_WALLCLOCK_SECONDS=600 \
+VAL_LOSS_EVERY=0 \
+USE_MIXER=0 USE_PHRASE_CACHE=0 MIXER_HEAD=multi \
+USE_NGRAM_CACHE=1 NGRAM_EVAL_ORDER=9 \
+TRAIN_ORACLE_BUCKETS=32768 NGRAM_EVAL_BUCKETS=32768 \
+USE_REGIME_TRACKER=0 USE_LOGIT_CAL=1 \
+TTT_EPOCHS=0 TTT_FREEZE_BLOCKS=2 TTT_LR=0.0001 \
+TTT_CHUNK_TOKENS=131072 SKIP_SLIDING=1 EVAL_STRIDE=64 TTT_TEMPERATURE=0.85 \
+CROWN_Q_LAMBDA=0.01 PRUNE_PCT=0.05 BIGRAM_VOCAB_SIZE=0 \
+GPTQ_CALIBRATION_SEQS=128 \
+RENORMALIZE_FINAL_PROBS=1 VERIFY_FINAL_PROBS=1 \
+PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
+torchrun --standalone --nproc_per_node=8 train_gpt.py
+```
+
+## Notes
+
+- `logs/train_seed1337.log`, `logs/train_seed42.log`, and `logs/train_seed7.log` correspond to the final renormalized compliant reruns.
+- `submission.json` reflects the renormalized 3-seed mean and worst-case total size from this final path.