Record: X-WING 3D Cubric + Complementary Training (val_bpb=0.4820) #814
Open
newjordan wants to merge 8 commits into openai:main from
Conversation
3D cubric pattern recognizer (54 warm-started adaptive multipliers) + complementary training. Seeds: 1337=0.4818, 300=0.4821, 58=0.4821. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three variants targeting the 0.187 BPB gap to openai#1:
- bwing_alpha: clip 0.95, alpha 0.05-0.60 (isolate the alpha curve)
- bwing_entropy_shift: per-order entropy center shift (isolate)
- bwing_full_port: all openai#809 techniques + fixed order mults (fire first)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Cubric 3D back online (CADENCE=32, warm-start)
- Per-order entropy center shift from openai#809
- Alpha 0.05-0.60, clip 0.95
- Our sliding-window TTT spliced in (1 epoch, SGD, freeze 2 blocks)
- TTT runs BEFORE n-gram eval → the adapted model feeds the n-gram mixture
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Port openai#809 LoRA TTT: rank-8 adapters on Q/V/LM head, AdamW, Polyak
- Add LoRA injection to CausalSelfAttention, Block, and GPT forward paths
- 53s vs our old 410s TTT, 6x better BPB gain
- Cubric 3D ON + entropy shift + alpha 0.05-0.60, clip 0.95
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
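A minimal numpy sketch of the rank-8 LoRA adapter idea this commit ports; the PR attaches PyTorch adapters to Q/V and the LM head, while the class name, init scales, and alpha below are illustrative assumptions:

```python
import numpy as np

class LoRALinear:
    """Frozen base weight W plus a trainable low-rank update (B @ A).

    At test-time training only A and B would be optimized; B starts at
    zero so the adapter is a no-op until its first update.
    """
    def __init__(self, W, rank=8, alpha=16.0, seed=0):
        rng = np.random.default_rng(seed)
        self.W = W                                  # frozen, shape (d_out, d_in)
        self.A = rng.normal(0.0, 0.01, (rank, W.shape[1]))
        self.B = np.zeros((W.shape[0], rank))       # zero-init: adapter starts inert
        self.scale = alpha / rank

    def __call__(self, x):
        # Base projection plus scaled low-rank correction.
        return x @ self.W.T + (x @ self.A.T) @ self.B.T * self.scale
```

Because B is zero-initialized, wrapping an existing projection changes nothing until TTT updates the adapter, which is what makes injection into existing forward paths safe.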
Fixed mults + entropy shift + alpha 0.05-0.60 clip 0.95 (no cubric). Base sliding: 1.1194, n-gram9: 0.4512. Delta from X-WING: -0.031. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Deleted LoRA TTT abomination. bwing_III is now a clean copy of our best scoring variant for further iteration. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
bwing_IV: Prime fix only — adds primes 283721, 347237 to eliminate XOR hash collisions for orders 8-9 (the 2.0x-multiplier orders). With 7 primes, prime[7] wrapped to prime[0], causing context tokens at positions j-8 and j-1 to cancel when equal.

bwing_V: Prime fix + cubric 3D stacked on top of fixed mults. Cubric warm-starts at 1.0 (neutral) and refines per (order × entropy × count) cell on top of the fixed order-multiplier scaling.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
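A small, hedged reproduction of the collision mode this commit fixes (the 7 base primes here are placeholders; only 283721 and 347237 come from the commit message):

```python
# Placeholder base primes; the real table's values are not shown in the PR.
PRIMES_7 = [1000003, 1000033, 1000037, 1000039, 1000081, 1000099, 1000117]

def context_hash(tokens, primes):
    """XOR-combine position-scaled tokens into one n-gram table key."""
    h = 0
    for j, t in enumerate(tokens):
        # With only 7 primes, index 7 wraps to primes[0]: the first and
        # last token of an order-8 context share a multiplier, so equal
        # tokens there cancel under XOR and distinct contexts collide.
        h ^= t * primes[j % len(primes)]
    return h

# Order-8 contexts differing only at the two wrapped positions collide:
a = context_hash([5, 1, 2, 3, 4, 6, 7, 5], PRIMES_7)
b = context_hash([9, 1, 2, 3, 4, 6, 7, 9], PRIMES_7)
assert a == b  # collision: positions 0 and 7 cancel pairwise

# Extending the table with the commit's two primes removes the wraparound:
PRIMES_9 = PRIMES_7 + [283721, 347237]
assert context_hash([5, 1, 2, 3, 4, 6, 7, 5], PRIMES_9) != \
       context_hash([9, 1, 2, 3, 4, 6, 7, 9], PRIMES_9)
```

This is why only orders 8-9 were affected: shorter contexts never index past primes[6], so they never hit the wraparound.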
Adapted from old setup.sh. Fixes FA3 detection (old one skipped FA3 when FA2 was present), uses sp1024 dataset, adds zstandard install. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Are you calibrating GPTQ after the wallclock cap fires?
Summary
Key Innovation: 3D Cubric Pattern Recognizer + Complementary Training
Two novel techniques stacked on chunk-based shared n-gram tables:
1. 3D Cubric (original)
54 adaptive multipliers across (order × entropy_bin × count_bin). Each cell independently tracks n-gram beat rates and adjusts its alpha multiplier. Captures patterns invisible to 1D scaling — e.g. "order 7 at mid-entropy with high count → trust fully (2.0x)" vs "order 3 at any entropy → suppress (0.30x)".
Warm-start: multipliers initialize at proven converged values instead of 1.0. Full cubric power from chunk 1 instead of wasting ~30 of 60 chunks converging.
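A sketch of how a 54-cell table like this could operate; the 6 × 3 × 3 cell layout, binning, update rule, learning rate, and alpha schedule are all assumptions for illustration, not the PR's exact code:

```python
import numpy as np

# Assumed layout: 6 n-gram orders x 3 entropy bins x 3 count bins = 54 cells.
N_ORDERS, N_ENT, N_CNT = 6, 3, 3
LR = 0.05

# Warm-start would load proven converged values; neutral 1.0 shown here.
mults = np.full((N_ORDERS, N_ENT, N_CNT), 1.0)

def bins(entropy, count):
    """Map continuous model entropy and n-gram count to cell indices."""
    ent_i = min(int(entropy), N_ENT - 1)
    cnt_i = min(int(np.log2(count + 1)), N_CNT - 1)
    return ent_i, cnt_i

def update_cell(order_i, entropy, count, ngram_beat_neural):
    """Nudge one cell toward trusting (2.0x) or suppressing (0.3x) n-grams
    based on whether the n-gram prediction beat the neural one."""
    ent_i, cnt_i = bins(entropy, count)
    target = 2.0 if ngram_beat_neural else 0.3
    mults[order_i, ent_i, cnt_i] += LR * (target - mults[order_i, ent_i, cnt_i])

def mixed_probs(p_neural, p_ngram, order_i, count, alpha_base=0.4):
    """Blend (1 - a) * P_neural + a * P_ngram, with a scaled by the cell."""
    entropy = -float(np.sum(p_neural * np.log(p_neural + 1e-12)))
    ent_i, cnt_i = bins(entropy, count)
    a = min(alpha_base * mults[order_i, ent_i, cnt_i], 0.95)  # clip 0.95
    return (1.0 - a) * p_neural + a * p_ngram
```

Each cell tracks its own beat rate, so "order 7, mid entropy, high count" can drift toward 2.0x while "order 3" cells sink toward 0.3x, independently.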
2. Complementary Training (adapted from PR #803)
During training, tokens predictable by bigram statistics receive a lower loss weight (COMPLEMENT_ALPHA=0.5). The model specializes on tokens n-grams can't predict. This enables a higher eval-time alpha (20-75% vs 5-70%) because the model is deliberately weak where n-grams are strong.
3. Shared N-gram Tables
All 8 GPU ranks update tables with the same chunk tokens → every rank sees the full 62M-token picture (vs 1/8 with rank-local). Insight from @deanbrr (PR #779).
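Item 2's loss weighting can be sketched as follows (a minimal numpy version; the real loop is PyTorch and the per-token bigram probabilities would come from the shared tables, both assumptions here):

```python
import numpy as np

COMPLEMENT_ALPHA = 0.5  # matches the value quoted above

def complementary_weights(bigram_prob):
    """Per-token loss weights in [1 - alpha, 1]: a token the bigram table
    predicts with probability ~1 gets weight 0.5; an unpredictable one, 1.0."""
    return 1.0 - COMPLEMENT_ALPHA * np.asarray(bigram_prob)

def complementary_loss(neural_nll, bigram_prob):
    """Weighted mean NLL: the model specializes on tokens n-grams miss."""
    weights = complementary_weights(bigram_prob)
    return float(np.mean(weights * np.asarray(neural_nll)))
```

The weights never reach zero, so the model still sees every token; it is only nudged away from competing with the tables on easy continuations.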
Ablation
Legality
(1-α)·P_neural + α·P_ngram, where α is a fixed function of model entropy × cubric multipliers: target-independent, committed before scoring.
Credits
Reproduce
8xH100 SXM, 600s training + ~203s eval.
Test plan