Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
@@ -0,0 +1,92 @@
## Title

Record: 0.0214 bpb - Low Eval-Time Memory Regime: Packed Training N-gram Artifact + Learned Gate (No Phrase Cache)

## Body

**3-seed mean val_bpb = 0.02139943 +/- 0.00003918** | **15.88 MB max total size**

All within budget: training < 600s, eval < 600s, artifact < 16MB.

## Summary

- Keep the packed order-2..9 training n-gram artifact and learned weighting gate over the neural model plus n-gram experts.
- Remove the logistic context mixer and long phrase cache from the final eval path, leaving a simpler low eval-time memory regime built around the packed cache, learned gate, and online logit calibration.
- Keep the compliant causal path: context-only gate validity, cached-batch GPTQ calibration, packed cache loaded from the artifact itself, a renormalized final distribution, and `TTT_EPOCHS=0`.

## Results

Current completed runs:

| Seed | Final val_bpb | Artifact bytes | Total bytes | Eval time | Notes |
|------|---------------|----------------|-------------|-----------|-------|
| 1337 | 0.02144330 | 15,015,946 | 15,179,538 | 432s | `USE_MIXER=0`, `USE_PHRASE_CACHE=0`, `TTT_EPOCHS=0`, renormalized |
| 42 | 0.02136791 | 15,717,739 | 15,881,331 | 433s | `USE_MIXER=0`, `USE_PHRASE_CACHE=0`, `TTT_EPOCHS=0`, renormalized |
| 7 | 0.02138708 | 15,083,362 | 15,246,954 | 437s | `USE_MIXER=0`, `USE_PHRASE_CACHE=0`, `TTT_EPOCHS=0`, renormalized |

Final 3-seed mean final val_bpb: `0.02139943` with sample std `0.00003918`.

## Low Eval-Time Memory Regime

- No logistic context mixer at eval time.
- No long phrase cache at eval time.
- The remaining eval-time adaptation path is the packed order-2..9 n-gram cache from the artifact, causal online n-gram updates, online logit calibration, and a renormalized final distribution.
- This removes the large auxiliary GPU mixer tables from the previous variant while preserving the packed-cache scoring path.
- On the final seed-7 no-mixer artifact, disabling only the long phrase cache already improved eval BPB from `0.04881917` to `0.02134985`, which motivated the 3-seed rerun.
- The final update here additionally renormalizes the full-vocabulary distribution so each scored position sums to 1.

## Causal Inference Scheme

1. Start eval by deserializing the packed order-2..9 n-gram cache from the submitted artifact itself.
2. For each validation chunk, run the model once using only left context and the current packed-cache state.
3. Query n-gram experts from the current cache using left context only; expert availability depends only on context evidence, not on the true next token.
4. Blend neural + n-gram experts, then renormalize the full-vocabulary distribution so it sums to 1 before scoring.
5. Score the chunk before any mutation of cache or model state.
6. After scoring, append the chunk tokens to the streaming n-gram cache for future chunks.
7. The reported final path uses `TTT_EPOCHS=0`, so there is no backward adaptation step in the submission path.

## Key Changes

- Packed order-2..9 training n-gram cache embedded into the artifact itself.
- Learned weighting gate over neural + order-2..9 n-gram experts.
- Bigram hash embedding removed to create artifact headroom for the packed cache.
- Logistic context mixer removed from the final eval path.
- Long phrase cache removed from the final eval path.
- Context-only gate validity retained.
- GPTQ calibration still uses cached training batches from the same timed run.
- Final scored probabilities are renormalized to sum to 1 at every position.

## Compliance

- This is **not a 2-pass method**.
- Validation is scored in a **single causal pass**: each chunk is scored before that chunk is used for cache updates.
- The warm-start n-gram cache used at eval step 0 is **part of the artifact itself**, not a separate runtime input.
- The packed n-gram cache in the artifact is derived from **training data only** and is produced within the 600 second training budget.
- The learned gate does **not** use the true next token to decide which experts are available.
- GPTQ calibration runs inside the reserved pre-export budget using cached training batches from the same timed run.
- The output distribution is normalized to sum to 1 for each token before likelihood is accumulated.
- The current reported numbers use `TTT_EPOCHS=0`.

## Reproduction

```bash
pip install -r requirements.txt

SEED=1337 \
DATA_PATH=/root/parameter-golf/data/datasets/fineweb10B_sp1024 \
TOKENIZER_PATH=/root/parameter-golf/data/tokenizers/fineweb_1024_bpe.model \
ARTIFACT_NGRAM_EXPORT=1 \
MAX_WALLCLOCK_SECONDS=600 \
VAL_LOSS_EVERY=0 \
USE_MIXER=0 USE_PHRASE_CACHE=0 MIXER_HEAD=multi \
USE_NGRAM_CACHE=1 NGRAM_EVAL_ORDER=9 \
TRAIN_ORACLE_BUCKETS=32768 NGRAM_EVAL_BUCKETS=32768 \
USE_REGIME_TRACKER=0 USE_LOGIT_CAL=1 \
TTT_EPOCHS=0 TTT_FREEZE_BLOCKS=2 TTT_LR=0.0001 \
TTT_CHUNK_TOKENS=131072 SKIP_SLIDING=1 EVAL_STRIDE=64 TTT_TEMPERATURE=0.85 \
CROWN_Q_LAMBDA=0.01 PRUNE_PCT=0.05 BIGRAM_VOCAB_SIZE=0 \
GPTQ_CALIBRATION_SEQS=128 \
RENORMALIZE_FINAL_PROBS=1 VERIFY_FINAL_PROBS=1 \
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```
Original file line number Diff line number Diff line change
@@ -0,0 +1,129 @@
# Record: 0.0214 bpb - Low Eval-Time Memory Regime: Packed Training N-gram Artifact + Learned Gate (No Phrase Cache)

**Status:** finalized compliant 3-seed record folder with renormalized scoring.

**3-seed mean final val_bpb:** `0.02139943` (std `0.00003918`)

## Included Files

- `train_gpt.py`
- `requirements.txt`
- `submission.json`
- `PR_DRAFT.md`
- `logs/train_seed1337.log`
- `logs/train_seed42.log`
- `logs/train_seed7.log`

This folder intentionally does **not** bundle copied model weights. Artifact sizes are documented from the train logs.

## Verified Results

All numbers below are the final causal `final_int6_ttt_exact` result with the packed order-2..9 training cache loaded from the artifact at eval start and then updated online. The final per-position probability distribution is renormalized to sum to 1 before scoring.

| Seed | Final val_bpb | Artifact bytes | Total bytes | Eval time | Notes |
|------|---------------|----------------|-------------|-----------|-------|
| 1337 | **0.02144330** | 15,015,946 | 15,179,538 | 432s | `USE_MIXER=0`, `USE_PHRASE_CACHE=0`, `TTT_EPOCHS=0`, renormalized |
| 42 | **0.02136791** | 15,717,739 | 15,881,331 | 433s | `USE_MIXER=0`, `USE_PHRASE_CACHE=0`, `TTT_EPOCHS=0`, renormalized |
| 7 | **0.02138708** | 15,083,362 | 15,246,954 | 437s | `USE_MIXER=0`, `USE_PHRASE_CACHE=0`, `TTT_EPOCHS=0`, renormalized |

Final 3-seed mean final val_bpb: `0.02139943` with sample std `0.00003918`.

## Low Eval-Time Memory Regime

This variant keeps the packed order-2..9 training n-gram artifact and learned gate, but removes the two extra eval overlays that had been sitting on top:

1. **No logistic context mixer.**
2. **No long phrase cache.**

The remaining eval-time adaptation path is:

1. load the packed order-2..9 cache from the artifact,
2. score with the learned neural + n-gram gate,
3. renormalize the final full-vocab distribution so each position sums to 1,
4. apply online logit calibration,
5. update the streaming n-gram cache only after scoring.

The motivating ablation was immediate: on the final seed-7 no-mixer artifact, turning off only the long phrase cache dropped eval BPB from `0.04881917` to `0.02134985`, which then held up in the full 3-seed reruns above.

## Main Submission Shape

This submission keeps:

- packed order-2..9 training n-gram cache stored inside the artifact
- learned multi-expert gate over neural + order-2..9 n-gram experts
- online logit calibration
- cached-batch GPTQ export path

Compared with the earlier packed-cache submission, the final path removes:

- logistic context mixer
- long phrase cache
- bigram hash embedding
- heuristic / hybrid switching logic
- cache-maturity decay

## Why It Works

The packed training cache already gives the learned gate a strong warm-start low-order signal at eval step 0. In this setting, the extra eval-time overlays were not helping:

- the mixer overlapped heavily with the packed low-order n-gram signal
- the long phrase cache overrode the already-strong packed-cache probabilities in a way that significantly hurt final BPB

Removing both left a simpler, more memory-efficient eval path that also scored much better.

## Probability Normalization

The renormalized version keeps the adjusted target-token probability from the learned gate path, then rescales the base model's non-target probability mass so the final full-vocabulary distribution sums to exactly 1 at every scored position.

This preserves the intended target probability adjustment while making the reported likelihood a valid normalized distribution rather than a point-only measurement.

## Causal Evaluation Path

1. Load the packed training n-gram cache from the artifact itself.
2. Score the next validation chunk with only left context and the current cache state.
3. Query n-gram experts using only left context; expert availability depends only on context evidence.
4. Blend neural + n-gram experts, then renormalize the full-vocab distribution so it sums to 1 before scoring.
5. Score the chunk before any mutation.
6. Update the streaming n-gram cache after scoring the chunk.
7. The reported runs use `TTT_EPOCHS=0`, so there is no backward adaptation step in the final path.

## Compliance

- **Single-pass eval:** this is not a 2-pass or rescoring method.
- **No future-token leakage:** validation chunks are scored before their tokens are added to the streaming cache.
- **Artifact-bundled warm start:** the cache loaded at eval step 0 is part of the artifact itself.
- **Packed cache is training-only:** the serialized n-gram payload comes from training data produced inside the 600 second training budget.
- **Context-only gate mask:** the learned gate does not use the true next token to decide which experts are available.
- **Normalized final distribution:** the final per-position probabilities are renormalized to sum to 1 before likelihood is accumulated.
- **Cached GPTQ calibration:** quantization calibration uses batches already seen during training.
- **No backward TTT in final path:** the current reported numbers use `TTT_EPOCHS=0`.
- **Artifact under 16MB:** all three runs remain below the limit.

## Reproduction

```bash
pip install -r requirements.txt

SEED=1337 \
DATA_PATH=/root/parameter-golf/data/datasets/fineweb10B_sp1024 \
TOKENIZER_PATH=/root/parameter-golf/data/tokenizers/fineweb_1024_bpe.model \
ARTIFACT_NGRAM_EXPORT=1 \
MAX_WALLCLOCK_SECONDS=600 \
VAL_LOSS_EVERY=0 \
USE_MIXER=0 USE_PHRASE_CACHE=0 MIXER_HEAD=multi \
USE_NGRAM_CACHE=1 NGRAM_EVAL_ORDER=9 \
TRAIN_ORACLE_BUCKETS=32768 NGRAM_EVAL_BUCKETS=32768 \
USE_REGIME_TRACKER=0 USE_LOGIT_CAL=1 \
TTT_EPOCHS=0 TTT_FREEZE_BLOCKS=2 TTT_LR=0.0001 \
TTT_CHUNK_TOKENS=131072 SKIP_SLIDING=1 EVAL_STRIDE=64 TTT_TEMPERATURE=0.85 \
CROWN_Q_LAMBDA=0.01 PRUNE_PCT=0.05 BIGRAM_VOCAB_SIZE=0 \
GPTQ_CALIBRATION_SEQS=128 \
RENORMALIZE_FINAL_PROBS=1 VERIFY_FINAL_PROBS=1 \
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

## Notes

- `logs/train_seed1337.log`, `logs/train_seed42.log`, and `logs/train_seed7.log` correspond to the final renormalized compliant reruns.
- `submission.json` reflects the renormalized 3-seed mean and worst-case total size from this final path.
Loading