
Record: Complementary Training + Backoff N-gram Mixer — 0.4377 BPB#811

Closed
quietsmile wants to merge 1 commit into openai:main from quietsmile:submission/complementary-backoff-ngram-mixer

Conversation

@quietsmile

Summary

Key Techniques

  1. Complementary Training (COMPLEMENT_ALPHA=0.5): bigram-weighted loss reweighting
  2. BackoffNgramMixer: orders 2-10, entropy-adaptive alpha mixing
  3. Legal score-first AdamW TTT: 4 epochs, lr=5e-4, freeze first 2 blocks
  4. Stride=128: negligible BPB impact, halves eval time

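The bigram-weighted loss reweighting in technique 1 can be sketched as follows. This is a minimal illustration, not the PR's actual code: the function name, the tensor layout, and the way bigram surprisal is normalized into per-token weights are all assumptions; only `COMPLEMENT_ALPHA=0.5` and the general idea (upweight tokens the n-gram cache predicts poorly) come from the description above.

```python
import torch
import torch.nn.functional as F

def complementary_loss(logits, targets, bigram_logprob, alpha=0.5):
    """Bigram-weighted loss reweighting (COMPLEMENT_ALPHA=0.5), sketch only.

    logits:         (B, T, V) neural model outputs
    targets:        (B, T)    next-token ids
    bigram_logprob: (B, T)    log-prob of each target under the bigram cache
    """
    # Per-token cross-entropy of the neural model (no reduction).
    ce = F.cross_entropy(logits.transpose(1, 2), targets, reduction="none")
    # Bigram surprisal is high exactly where the n-gram cache fails,
    # so those tokens get more weight in the neural model's loss.
    surprisal = -bigram_logprob
    weight = (1 - alpha) + alpha * surprisal / surprisal.mean().clamp_min(1e-8)
    return (weight.detach() * ce).mean()
```

With `alpha=0.5`, half of each token's weight is uniform and half is proportional to its bigram surprisal, so the neural model still sees every token but specializes on the cache's blind spots.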
Acknowledgment

Based on PR #803 by @pentxayc. The core complementary-training idea is their contribution.

Test plan

  • Verified on 2 seeds (1337, 42) with consistent results
  • Training completes within 10-min wallclock
  • Eval completes within 10-min budget (450s)
  • Artifact under 16MB
  • All TTT is legal score-first (tokens scored before any update)
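The "legal score-first" constraint in the last bullet can be sketched as a loop shape: every chunk is scored with the current weights before any gradient step has seen it, so no token's BPB is computed by a model already trained on that token. The function name and loop structure below are illustrative assumptions; the AdamW/lr=5e-4 settings come from the description above.

```python
import torch

def score_first_ttt(model, optimizer, chunks, loss_fn):
    """Score-first test-time training sketch: score, THEN adapt.

    chunks: iterable of (inputs, targets) evaluation chunks.
    Returns average per-token NLL accumulated before each update.
    """
    total_nll, total_tokens = 0.0, 0
    for inputs, targets in chunks:
        # 1) Score the chunk with the current weights; this is what
        #    counts toward the reported BPB.
        model.eval()
        with torch.no_grad():
            nll = loss_fn(model(inputs), targets)
        total_nll += nll.item() * targets.numel()
        total_tokens += targets.numel()
        # 2) Only afterwards take a gradient step on the same chunk
        #    (in the PR: AdamW, lr=5e-4, first 2 blocks frozen).
        model.train()
        optimizer.zero_grad()
        loss_fn(model(inputs), targets).backward()
        optimizer.step()
    return total_nll / total_tokens
```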

🤖 Generated with Claude Code

Reproduction of PR openai#803's complementary training approach on 8x L20Z (H100).
Two-seed validation: 0.4377 (seed=1337), 0.4380 (seed=42).

Key: bigram-weighted loss reweighting (COMPLEMENT_ALPHA=0.5) trains the
neural model to specialize on tokens n-gram caches can't predict, combined
with BackoffNgramMixer (orders 2-10) and legal score-first AdamW TTT.
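A minimal sketch of what a `BackoffNgramMixer` over orders 2-10 with entropy-adaptive alpha could look like. The class interface, count storage, and the specific alpha schedule (`base_alpha * exp(-entropy)`, trusting the cache more when its distribution is sharp) are assumptions for illustration, not the submission's actual implementation.

```python
import math
from collections import defaultdict

class BackoffNgramMixer:
    """Backoff n-gram cache (orders 2..max_order) with entropy-adaptive
    mixing into neural next-token probabilities. Sketch only."""

    def __init__(self, max_order=10, base_alpha=0.5):
        self.max_order = max_order
        self.base_alpha = base_alpha
        # counts[n][context_tuple][next_token] -> count, for each order n.
        self.counts = {n: defaultdict(lambda: defaultdict(int))
                       for n in range(2, max_order + 1)}

    def update(self, tokens):
        # Record every (context of length n-1, next token) pair.
        for n in range(2, self.max_order + 1):
            for i in range(n - 1, len(tokens)):
                ctx = tuple(tokens[i - n + 1:i])
                self.counts[n][ctx][tokens[i]] += 1

    def ngram_probs(self, context):
        # Back off from the highest order whose context has been seen.
        for n in range(self.max_order, 1, -1):
            bucket = self.counts[n].get(tuple(context[-(n - 1):]))
            if bucket:
                total = sum(bucket.values())
                return {t: c / total for t, c in bucket.items()}
        return None

    def mix(self, neural_probs, context):
        cache = self.ngram_probs(context)
        if cache is None:
            return neural_probs
        # Entropy-adaptive alpha (assumed schedule): a sharp, low-entropy
        # cache distribution gets more weight; a flat one gets less.
        ent = -sum(p * math.log(p) for p in cache.values() if p > 0)
        alpha = self.base_alpha * math.exp(-ent)
        return [(1 - alpha) * p + alpha * cache.get(t, 0.0)
                for t, p in enumerate(neural_probs)]
```

Under this scheme the mixer degrades gracefully: an unseen context falls through all orders and leaves the neural distribution untouched.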

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
