
FlashLM

CPU-Native Ternary Language Models

No GPUs · No pretraining · Trained from scratch on free-tier CPUs

License: MIT

Development Log — full research history from v3 to present


Model Lineup

| Version | Name | Architecture | Params | Hardware | Time | PPL | Status |
|---------|------|--------------|--------|----------|------|-----|--------|
| v4 | Bolt | GatedRecurrence (ternary) | 4.3M | 2 vCPU | 2h | 15.05 | Archived |
| v5 | Thunderbolt | ParallelGatedRecurrence (ternary) | 29.7M | Ryzen 7950X3D | 40h | 1.36 | Complete |
| v5.2 | Nova-Ignition | Transformer (Attention) | 5.0M | 2 vCPU | 2h | 10.56 | Archived |
| v6 | SUPERNOVA | Linear mixer + GLU (ternary) | 4.1M | 2 vCPU | 3h | 14.0 | Data bug |
| v7.4 | CORTEX-VIII | Gated DeltaNet + Local SWA | 6.6M | 2 vCPU | 2h | 2.33 | Best PPL |
| v7.5 | CORTEX-IX | + Unlikelihood + Multi-Token Pred | 7.6M | 2 vCPU | 2h | 3.29 | Archived |
| v7.6 | CORTEX-X | Gated DeltaNet + Curated Data | 6.6M | 2 vCPU | 2h | 7.54 | Archived |
| v8 | SearchLM | Transformer + Lookahead Value Heads | 7.1M | 2 vCPU | 2h | 2.40 | Superseded |
| v8.1 | SearchLM | CORTEX-VIII + Value Heads | 6.6M | 2 vCPU | 2h | 2.40 | Superseded |
| v8.2 | | CORTEX-VIII + Subset Training (20M tok) | 6.6M | 2 vCPU | 2h | 2.42 | Superseded |
| v8.3 | | CORTEX-VIII + 10M subset + Entropy Reg | 6.6M | 2 vCPU | 2h | 2.50 | Current |
| v8.4 | Lean CORTEX | Full Attention + Delta Memory | 1.77M | 2 vCPU | 2h | 7.80 | Too small |
| v9.0 | Reckoning | CPU-native (binary routing + cell mem) | ~1.2M | 2 vCPU | 2h | 130.19 | Failed |
| v9.1 | Reckoning v2 | Delta rule + running state + conv | 17.3M | 4 vCPU | 2h | 24.60 | Improving |
| v9.2 | CORTEX+Compass | CORTEX-VIII + Story Compass | 6.7M | 4 vCPU | 2h | 17.56 | Compass hurt |

Evolution

v4  Bolt                4.3M   PPL 15.05   2h  · 2 vCPU        · ternary recurrence
 ↓
v5  Thunderbolt        29.7M   PPL  1.36  40h  · Ryzen 7950X3D · ternary recurrence ← best overall
 ↓
v5.2 Nova-Ignition      5.0M   PPL 10.56   2h  · 2 vCPU        · float32 attention
 ↓
v6  SUPERNOVA            4.1M   PPL 14.0    3h  · 2 vCPU        · ternary GLU
 ↓
v7.4  CORTEX-VIII       6.6M   PPL  2.33   2h  · 2 vCPU        · Gated DeltaNet + SWA ← best PPL
 ↓
v7.5  CORTEX-IX         7.6M   PPL  3.29   2h  · 2 vCPU        · coherence training
 ↓
v7.6  CORTEX-X          6.6M   PPL  7.54   2h  · 2 vCPU        · curated data
 ↓
v8   SearchLM           7.1M   PPL  2.40   2h  · 2 vCPU        · transformer + lookahead
 ↓
v8.1 SearchLM           6.6M   PPL  2.40   2h  · 2 vCPU        · CORTEX + value heads
 ↓
v8.2 CORTEX-VIII        6.6M   PPL  2.42   2h  · 2 vCPU        · subset + entropy reg
 ↓
v8.3 CORTEX-VIII        6.6M   PPL  2.50   2h  · 2 vCPU        · best generation
 ↓
v8.4 Lean CORTEX        1.8M   PPL  7.80   2h  · 2 vCPU        · too small for CORTEX
 ↓
v9.0 Reckoning          1.2M   PPL 130.19  2h  · 2 vCPU        · CPU-native failed
 ↓
v9.1 Reckoning v2      17.3M   PPL  24.60  2h  · 4 vCPU        · delta rule + running state
 ↓
v9.2 CORTEX+Compass     6.7M   PPL  17.56  2h  · 4 vCPU        · compass competed with CE ← current

v8 — SearchLM: Policy + Value + Search

Inspired by AlphaGo and DeepMind's test-time compute scaling (Snell et al. 2024): the hypothesis is that a smaller model with search-guided decoding can produce more coherent text than standard generation.

Architecture: CORTEX-VIII backbone + lookahead value heads (one per layer) that predict average future CE loss. At inference, K=4 candidate tokens are scored by log_prob - beta * value_pred.
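
A minimal sketch of that scoring step, assuming a hypothetical `value_fn` that stands in for the per-layer lookahead value heads and returns the predicted average future CE loss for a candidate token; K=4 comes from the description above, while `beta=0.5` is an illustrative placeholder (the actual values live in the v8/ training scripts):

```python
import torch

def pick_next_token(logits, value_fn, k=4, beta=0.5):
    """Score the top-k candidate tokens by log_prob - beta * value_pred."""
    log_probs = torch.log_softmax(logits, dim=-1)          # (vocab,)
    top_logp, top_ids = log_probs.topk(k)                  # k candidate next tokens
    values = torch.stack([value_fn(t) for t in top_ids])   # predicted future CE per candidate
    scores = top_logp - beta * values                      # lower predicted loss -> higher score
    return top_ids[scores.argmax()]
```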

Results

| Version | Change | PPL | Speed | Generation |
|---------|--------|-----|-------|------------|
| v8 | Transformer + lookahead | 2.40 | ~1,500 tok/s | Loops + incoherent |
| v8.1 | CORTEX + lookahead | 2.40 | ~2,136 tok/s | Loops, V_Corr +0.66 |
| v8.2 | 20M subset + entropy reg | 2.42 | ~1,688 tok/s | Broke loops, broken grammar |
| v8.3 | 10M subset, D_FF=512 | 2.50 | 1,861 tok/s | Best diversity, broken grammar |

Key Findings

  • Value heads learn — V_Corr +0.66 proves the mechanism works, but search-guided decoding didn't improve generation
  • Entropy regularization works — broke the "Lily x20" repetition loops
  • PPL ≠ coherence — PPL 2.50 but no grammar. Model learned word statistics, not sentence structure
  • Greedy = worst — stuck in "thought looked" loops. High temperature produces more readable text

CORTEX-VIII — Gated DeltaNet (Best PPL)

The Delta Rule

| Mechanism | Operation | Limitation |
|-----------|-----------|------------|
| Attention | Reads ALL past tokens | O(T²), no write/update |
| Hebbian | M += v ⊗ k | Blind accumulation, can't correct |
| Delta Rule | M += β·(v − M·k) ⊗ k | Targeted correction only |

Every layer gets local SWA (W=64) + global delta memory (d_mem=32).
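
A minimal sketch of the two memory updates, plus the causal window mask used by the local SWA path; the shapes (M as a d_v × d_k fast-weight matrix acting on per-token keys k and values v) are assumed for illustration, while β and W=64 follow the numbers quoted above:

```python
import torch

def hebbian_update(M, k, v):
    # Blind accumulation: keeps adding v k^T even if M already stores it.
    return M + torch.outer(v, k)

def delta_update(M, k, v, beta):
    # Targeted correction: write only the prediction error (v - M k),
    # so memory that already retrieves v for key k is left untouched.
    return M + beta * torch.outer(v - M @ k, k)

def sliding_window_mask(T, window=64):
    # Causal mask restricted to the last `window` positions (local SWA).
    i = torch.arange(T).unsqueeze(1)
    j = torch.arange(T).unsqueeze(0)
    return (j <= i) & (j > i - window)
```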

Results

  • Training: 1,699 steps · 13.9M tokens · 120 min · 1,928 tok/s
  • Best val PPL: 2.33 (v5.2 on same tokenizer: 8.32 — 2.6x architecture improvement)
  • Model: 6.56M params · d=256 · 6L · T=256 · LR=5e-4

CORTEX-IX — Coherence Training

CORTEX-VIII achieved PPL 2.33 but generated repetitive text. CORTEX-IX adds 4 changes:

| Technique | What It Does | Cost |
|-----------|--------------|------|
| Unlikelihood Training | Penalizes repeating recent tokens | +5% time |
| Multi-Token Prediction | Forces model to plan 2 tokens ahead | +10% params |
| Entropy Regularization | Prevents overconfident mode collapse | free |
| Word Dropout | Replaces random input tokens with `<unk>` | free |
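
A hedged sketch of the unlikelihood penalty on recently emitted tokens combined with the entropy bonus; the window size and the `alpha`/`ent_weight` weights are illustrative placeholders, not the values used in train_v75.py:

```python
import torch
import torch.nn.functional as F

def coherence_loss(logits, targets, window=32, alpha=1.0, ent_weight=0.01):
    """Cross-entropy + unlikelihood on recent repeats - entropy bonus."""
    B, T, V = logits.shape
    ce = F.cross_entropy(logits.reshape(-1, V), targets.reshape(-1))

    probs = torch.softmax(logits, dim=-1)
    ul = logits.new_zeros(())
    for t in range(1, T):
        prev = targets[:, max(0, t - window):t]           # recently seen tokens
        p_prev = probs[:, t, :].gather(1, prev)           # prob mass assigned to repeats
        ul = ul + (-torch.log1p(-p_prev.clamp(max=1 - 1e-6))).mean()
    ul = ul / max(T - 1, 1)

    entropy = -(probs * torch.log(probs + 1e-9)).sum(-1).mean()
    return ce + alpha * ul - ent_weight * entropy
```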

Result: PPL 3.29 (worse, as expected from a harder objective). Generation is still incoherent: the techniques are sound, but 7.6M params is too small.


CORTEX-X — Curated Data

Filter TinyStories down to only the simplest stories (10-40 words). A 6.6M model trained on the curated tokens sees each pattern 3-4 times during training instead of 0.02 times.
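
A minimal sketch of that length filter, assuming `stories` is an iterable of raw TinyStories strings (the actual curation lives in train_v76.py):

```python
def keep_story(text, lo=10, hi=40):
    # Keep only the simplest stories: whitespace word count in [lo, hi].
    return lo <= len(text.split()) <= hi

curated = [s for s in stories if keep_story(s)]  # `stories`: raw story strings (assumed)
```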

Result: PPL 7.54 (3x worse). Overfit to curated patterns that don't generalize.

Three Experiments, One Conclusion

| Experiment | Approach | PPL | Generation |
|------------|----------|-----|------------|
| CORTEX-VIII | Baseline (best architecture) | 2.33 | Repetitive |
| CORTEX-IX | Coherence training | 3.29 | Still incoherent |
| CORTEX-X | Curated data | 7.54 | Worse |

6.6M params is below the coherence threshold regardless of training tricks or data strategy.


CORTEX Experiments

| Name | Idea | PPL | Verdict |
|------|------|-----|---------|
| v7 RWKV + ternary | RWKV at small scale | 377.66 | RWKV fails below 100M params |
| CORTEX-III | 10+ arch sweep, k=15 won | 18.16 | Dense wide kernel wins |
| CORTEX-IV DDRF | Data-dep exponential taps | 1.13x worse | Sparse taps lose to dense conv |
| CORTEX-V Story Memory | 8 slots x 32d per layer | 1.44x worse | Too slow, concept OK |
| CORTEX-VI | Hebbian d_mem=64 correlation matrix | ~18 | Non-causal mask bug |
| CORTEX-VII | 3 SWA + 3 data-dep Hebbian | 16.88 | Half layers bottlenecked |
| CORTEX-VIII | All-6L delta rule + SWA | 2.33 | Beat v5.2 by 2.6x |

Key lessons: the delta rule corrects stored memory while Hebbian updates only accumulate · PPL ≠ coherence · speed = quality (faster models see more tokens within the fixed 2h budget) · hyperparameters matter enormously


Files

FlashLM/
+-- README.md
+-- DEVLOG.md                     development log (v3→present)
+-- LICENSE
+-- v4/
|   +-- train_v4_bolt.py              v4 Bolt (ternary recurrence)
+-- v5/
|   +-- train_v52_nova.py             v5.2 Nova-Ignition (attention baseline)
+-- v6/
|   +-- train_v6_supernova.py         v6 SUPERNOVA (ternary GLU)
+-- v7/
|   +-- train_v74.py                  CORTEX-VIII (best PPL)
|   +-- train_v75.py                  CORTEX-IX (coherence training)
|   +-- train_v76.py                  CORTEX-X (curated data)
|   +-- train_v7_rwkv.py              v7 RWKV (failed)
|   +-- train_v71_cortex3.py          CORTEX-III
|   +-- train_v72_cortex6.py          CORTEX-VI
|   +-- train_v73_cortex7.py          CORTEX-VII
|   +-- gen_v72.py                    generation test
|   +-- eval_bpc.py                   BPC evaluation
+-- v8/
    +-- train_v8.py                   v8 SearchLM (Transformer + lookahead)
    +-- train_v81.py                  v8.1 (CORTEX + lookahead)
    +-- train_v82.py                  v8.2 (subset + entropy reg)
    +-- train_v83.py                  v8.3 (best generation)
    +-- train_v84.py                  v8.4 Lean CORTEX (too small)
    +-- train_v90.py                  v9.0 Reckoning (CPU-native, failed)
    +-- generate_v81.py               v8.1 generation test
    +-- generate_v83.py               v8.3 generation test
    +-- generate_knn.py               kNN retrieval augmented generation

Philosophy

  1. Train from scratch — no fine-tuning pretrained models
  2. Fixed time budgets — 2 hours, forces efficiency
  3. Honest reporting — all experiments documented, including failures
  4. Constrained hardware — free-tier cloud CPUs, no GPUs
  5. Research-driven — architecture choices backed by systematic experiments

Acknowledgments

  • arki05 for providing the AMD Ryzen 7950X3D used to train v5 Thunderbolt.
  • Code assistance by Claude Code (Anthropic). Architecture design and research direction by Cheng Chang.

Citation

@misc{flashlm,
  author = {Cheng Chang},
  title = {FlashLM: CPU-Native Ternary Language Models},
  year = {2026},
  url = {https://github.com/changcheng967/FlashLM}
}

License

MIT — see LICENSE.