
Review: Rerun of #972 with actual full-vocab normalization#978

Open
AnirudhRahul wants to merge 2 commits into openai:main from AnirudhRahul:pr972-honest-normalization-rerun

Conversation


@AnirudhRahul AnirudhRahul commented Mar 27, 2026

Summary

  • correct eval_val_sliding() so the n-gram score is normalized by the summed hashed-vocab mass pair_counts.sum(...) + beta instead of ctx_count + beta
  • update the #972 record README and submission metadata to reflect the honest rerun rather than the earlier 0.3922 claim
  • show that once the normalization is actually enforced, the n-gram path degrades to 1.51343368 BPB and loses to the neural sliding baseline

Issue in #972

#972 claimed to compute a full-vocab normalized n-gram distribution, but the evaluation code did not actually use that normalization.

What the code effectively did was:

pair_c = ng_pair[order][pair_h].float()              # [chunk, 1024] hashed-candidate counts
raw_correct = pair_c.gather(1, gold_idx).squeeze(1)  # count for the gold token only
p_local = (raw_correct + beta * p_neural) / (ctx_count + beta)  # denominator ignores pair_c's total mass

The problem is that ctx_count + beta is not the normalization constant for the 1024-way hashed candidate distribution. Once pair_c has been materialized, the correct denominator for a full-vocab posterior is the total mass over candidate tokens:

pair_total = pair_c.sum(dim=1)  # total mass over the 1024 hashed candidates
p_local = (raw_correct + beta * p_neural) / (pair_total + beta)

So the previous PR was still scoring a gold-token scalar against a non-normalized denominator, even though it had already gathered counts for all 1024 tokens.
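A toy sketch (with hypothetical numbers, plain Python standing in for the tensor code) shows why the wrong denominator can make the result look impossibly good: hash collisions inflate the summed candidate mass above the raw context count, so dividing by `ctx_count + beta` can produce a "probability" above 1 and hence a negative log loss, which is consistent with an implausibly low BPB like the earlier 0.3922 claim.

```python
beta = 2.0
# 4 hashed candidate buckets stand in for the 1024-way distribution;
# collisions make their total (100) exceed the raw context count (30)
pair_c = [60.0, 10.0, 20.0, 10.0]
ctx_count = 30.0
gold = pair_c[0]  # count gathered for the gold token

p_wrong = gold / (ctx_count + beta)    # 60 / 32  = 1.875 -> not a valid probability
p_right = gold / (sum(pair_c) + beta)  # 60 / 102 ~ 0.588 -> properly normalized

print(p_wrong, p_right)
```

With `p_wrong > 1`, the per-token log loss goes negative, so the flawed path can undercut any honest baseline regardless of actual predictive quality.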

Test plan

  • reran torchrun --standalone --nproc_per_node=8 train_gpt.py on 8xH100 with:
    • DATA_PATH=/root/parameter-golf/data/datasets/fineweb10B_sp1024
    • TOKENIZER_PATH=/root/parameter-golf/data/tokenizers/fineweb_1024_bpe.model
    • CTW_BETA=2.0
    • CTW_BLEND=0.5
  • observed:
    • final_int8_zlib_roundtrip_exact val_loss:1.97320202 val_bpb:1.16864138
    • final_sliding_window_exact sliding_bpb:1.14740867 val_bpb:1.51343368
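For reference, bits-per-byte figures like these are typically derived from the mean token NLL by converting nats to bits and rescaling by the dataset's token-to-byte ratio. The helper below is a hypothetical sketch of that conversion, not code taken from train_gpt.py:

```python
import math

def bits_per_byte(mean_nll_nats: float, num_tokens: int, num_bytes: int) -> float:
    """Convert mean per-token NLL (nats) to bits per byte.

    Hypothetical helper: nats -> bits via log(2), then rescale by the
    validation set's tokens-to-bytes ratio.
    """
    return mean_nll_nats / math.log(2) * (num_tokens / num_bytes)

# Sanity check: an NLL of ln(2) nats on 1-byte tokens is exactly 1 bit/byte.
print(bits_per_byte(math.log(2), 10, 10))  # -> 1.0
```

Under this convention, the reported val_loss of 1.97320202 nats and val_bpb of 1.16864138 imply roughly 2.4 bytes per token for the 1024-vocab BPE tokenizer, which is a plausible ratio.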

Idan3011 and others added 2 commits March 27, 2026 15:46
Correct the eval-time n-gram posterior to normalize by the summed hashed-vocab mass and update the recorded metrics. The honest rerun lands at 1.5134 BPB, showing the earlier 0.3922 result came from the flawed normalization path.

Made-with: Cursor
@AnirudhRahul AnirudhRahul changed the title Honest rerun of #972 with actual full-vocab normalization Rerun of #972 with actual full-vocab normalization Mar 27, 2026
@Idan3011

Thanks for the review. I'd already identified the denominator issue and reran with pair_c.sum(dim=1) + beta; confirmed it degrades to ~1.23-1.51 BPB. The entire n-gram gain is a collision artifact. Closing this PR.

@AnirudhRahul AnirudhRahul changed the title Rerun of #972 with actual full-vocab normalization Review: Rerun of #972 with actual full-vocab normalization Mar 27, 2026