# WaterLOO: Full-Rescore N-gram Cache with Self-Exclusion

**val_bpb: 0.0990 (3-seed mean, std 0.00002) | ~15.87 MB | 8xH100 SXM**

## Results

| Seed | Steps | Pre-Quant BPB | Sliding BPB | N-gram BPB | Artifact |
|------|-------|---------------|-------------|------------|----------|
| 1337 | 6933 | 1.1395 | 1.1253 | **0.09897** | 15.89 MB |
| 42 | 6930 | 1.1409 | 1.1268 | **0.09897** | 15.86 MB |
| 2025 | 6930 | 1.1410 | 1.1271 | **0.09902** | 15.87 MB |
| **Mean** | 6931 | **1.1405** | **1.1264** | **0.09899** | **15.87 MB** |
| **Std** | | | | **0.00002** | |

## The Idea

BROADSIDE showed that once you decouple the neural forward pass from the n-gram scoring, the usual two-pass bottleneck mostly disappears. You can store per-token neural probabilities in Pass 1, build a complete cache in one fast vectorized shot, and then rescore the validation stream against that complete cache while there is still plenty of eval clock left.

WaterLOO keeps that architecture and closes the most obvious self-inclusion path. In the aggressive full-rescore companion submission, each token's own `(context, target)` occurrence is still present in the cache when that token is rescored. Here, Pass 2 instead performs **leave-one-out scoring**:

- subtract `1` from the token's context count
- subtract `1` from the token's `(context,target)` count
- then apply the same backoff, `min_count`, entropy-adaptive alpha, and order multipliers as before

So every token still benefits from a globally warm cache, but it no longer gets to vote for itself. That is a stricter and more conservative use of the same full-rescore machinery.
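The leave-one-out subtraction can be sketched as follows. This is a minimal illustration, not the submission's code: the function and argument names are invented, and the exact placement of the eligibility check relative to backoff is an assumption.

```python
def loo_ngram_prob(ctx_count, pair_count, min_count=2):
    """Leave-one-out n-gram probability for one matched token.

    ctx_count:  occurrences of the hashed context in the full cache
    pair_count: occurrences of (context, target) in the full cache
    Both counts include the token being scored, so its own contribution
    is removed before eligibility and probability are computed.
    """
    ctx = ctx_count - 1    # remove this token's context occurrence
    pair = pair_count - 1  # remove this token's (context, target) occurrence
    if pair < min_count or ctx <= 0:
        return None        # ineligible: fall back toward the neural probability
    return pair / ctx
```

A token whose `(context, target)` pair occurs only once in the stream becomes ineligible under leave-one-out, which is exactly the "no voting for yourself" behavior described above.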

## Architecture

1. **Pass 1** (~89s): standard sliding-window neural eval, storing per-token `model_p` and entropy in numpy arrays
2. **Cache build** (~32-34s): build the complete order `2-12` hashed n-gram cache from the validation stream via `np.bincount`
3. **Pass 2** (~22s): rescore all tokens against the full cache with leave-one-out count subtraction

The important result is that this still lands at `0.0990` BPB over three seeds, well ahead of the currently visible two-pass frontier.
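The vectorized cache build in step 2 can be sketched with a rolling hash plus `np.bincount`. The polynomial hash and bucket scheme below are assumptions for illustration; the submission states only that the cache is hashed, covers orders 2-12, and is built via `np.bincount`.

```python
import numpy as np

def build_order_cache(tokens, order, num_buckets=4_194_304):
    """Count hashed contexts and (context, target) pairs for one n-gram order.

    Returns (ctx_counts, pair_counts), each of length num_buckets.
    The multiplier 1000003 and modular bucketing are illustrative choices.
    """
    tokens = np.asarray(tokens, dtype=np.uint64)
    n = len(tokens)
    if n <= order:
        zeros = np.zeros(num_buckets, dtype=np.int64)
        return zeros, zeros.copy()
    # Rolling hash over each length-`order` context ending before position i.
    ctx_hash = np.zeros(n - order, dtype=np.uint64)
    for k in range(order):
        ctx_hash = ctx_hash * np.uint64(1000003) + tokens[k : n - order + k]
    # Extend the context hash by the target token for the pair buckets.
    pair_hash = ctx_hash * np.uint64(1000003) + tokens[order:]
    ctx_bucket = (ctx_hash % np.uint64(num_buckets)).astype(np.int64)
    pair_bucket = (pair_hash % np.uint64(num_buckets)).astype(np.int64)
    ctx_counts = np.bincount(ctx_bucket, minlength=num_buckets)
    pair_counts = np.bincount(pair_bucket, minlength=num_buckets)
    return ctx_counts, pair_counts
```

Because every position contributes exactly one increment per order, the whole pass is a handful of vectorized numpy operations per order, which is why the cache build fits in ~33s.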

## Key Design Choices

### Full-stream rescore

Like BROADSIDE, this rescoring covers the full validation stream rather than only a fixed prefix. The gain is still mostly structural:

- no second neural forward pass
- vectorized cache construction
- enough eval headroom to score all tokens rather than only the coldest chunks

### Leave-one-out self-exclusion

This is the main difference from the more aggressive companion submission. At score time, each token's own direct contribution is removed before eligibility and probability are computed. The cache stays global; the self-count does not.

### N-gram parameters

- order `2-12`
- `4,194,304` buckets
- alpha range `[0.05, 0.70]`
- entropy-adaptive alpha
- low orders suppressed, high orders boosted
- `min_count >= 2`
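One way to read the entropy-adaptive alpha is as a smooth map from the neural model's per-token entropy to a mixing weight in `[0.05, 0.70]`. The sigmoid shape below is an assumption; the center `3.0` and scale `2.0` are the `NGRAM_ENTROPY_CENTER` / `NGRAM_ENTROPY_SCALE` defaults from `launch.sh`, and the direction (higher entropy leans harder on the n-gram cache) is an assumption consistent with the complementary-training framing.

```python
import math

ALPHA_MIN, ALPHA_MAX = 0.05, 0.70       # NGRAM_ALPHA_MIN / NGRAM_ALPHA_MAX
ENTROPY_CENTER, ENTROPY_SCALE = 3.0, 2.0  # launch.sh defaults

def entropy_adaptive_alpha(entropy):
    """Map per-token neural entropy to an n-gram mixing weight (sketch)."""
    t = 1.0 / (1.0 + math.exp(-(entropy - ENTROPY_CENTER) / ENTROPY_SCALE))
    return ALPHA_MIN + (ALPHA_MAX - ALPHA_MIN) * t

def mix(model_p, ngram_p, entropy):
    """Blend neural and n-gram target probabilities with the adaptive alpha."""
    a = entropy_adaptive_alpha(entropy)
    return (1.0 - a) * model_p + a * ngram_p
```

At the entropy center the weight sits at the midpoint of the range (0.375), and tokens where the model is confident keep alpha near the 0.05 floor.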

### Complementary training

Complementary training remains enabled, so the neural model is still encouraged to spend capacity on tokens the n-gram stack is less likely to predict well.
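A minimal sketch of that steering, assuming a linear down-weighting of the per-token loss by the n-gram's confidence in the target (the `0.5` is the `COMPLEMENT_ALPHA` default from `launch.sh`; the exact weighting scheme is an assumption, not the submission's code):

```python
import numpy as np

COMPLEMENT_ALPHA = 0.5  # launch.sh default

def complementary_loss(model_p_target, ngram_p_target, alpha=COMPLEMENT_ALPHA):
    """Mean per-token NLL, down-weighted where the n-gram cache is confident.

    Both inputs are per-token probabilities of the true target (neural
    and n-gram respectively). Weights fall in [1 - alpha, 1], so tokens
    the n-gram stack already predicts well contribute less gradient.
    """
    model_p = np.asarray(model_p_target, dtype=np.float64)
    ngram_p = np.asarray(ngram_p_target, dtype=np.float64)
    weights = 1.0 - alpha * ngram_p
    return float(np.mean(weights * -np.log(model_p)))
```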

## Timing Budget (8xH100)

| Phase | Time |
|-------|------|
| Training | 600s |
| Diagnostic eval | ~2s |
| GPTQ int6 export + roundtrip | ~7s |
| Sliding window eval | ~75s |
| N-gram Pass 1 | ~89s |
| Cache build | ~33s |
| N-gram Pass 2 | ~22s |
| **Total eval** | **~144-145s** |

## Reproduction

```bash
bash launch.sh base
```

Multi-seed package:

```bash
bash launch_multiseed.sh
```

This uses `SEEDS=1337,42,2025` by default and produces:

```text
logs/ppm_loo_seed1337.txt
logs/ppm_loo_seed42.txt
logs/ppm_loo_seed2025.txt
```

## Notes

This submission is intended as the more conservative counterpart to the companion full-rescore result. It keeps the same decoupled full-rescore eval architecture, but removes each token's own direct cache contribution during rescoring.

Co-authored with Codex.
---

**`launch.sh`**
#!/usr/bin/env bash
# Launch leave-one-out PPM N-gram Rescore follow-up
# Usage: bash launch.sh [base|smoke]
set -euo pipefail

MODE="${1:-base}"
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
TRAIN_SCRIPT="$SCRIPT_DIR/train_gpt.py"

# Shared defaults
export DATA_ROOT_MODE="${DATA_ROOT_MODE:-tmp}"
export COMPLEMENT_ENABLED="${COMPLEMENT_ENABLED:-1}"
export COMPLEMENT_ALPHA="${COMPLEMENT_ALPHA:-0.5}"
export NGRAM_ENABLED="${NGRAM_ENABLED:-1}"
export NGRAM_MIN_ORDER="${NGRAM_MIN_ORDER:-2}"
export NGRAM_MAX_ORDER="${NGRAM_MAX_ORDER:-12}"
export NGRAM_NUM_BUCKETS="${NGRAM_NUM_BUCKETS:-4194304}"
export NGRAM_CHUNK_SIZE="${NGRAM_CHUNK_SIZE:-512}"
export NGRAM_ALPHA_MIN="${NGRAM_ALPHA_MIN:-0.05}"
export NGRAM_ALPHA_MAX="${NGRAM_ALPHA_MAX:-0.70}"
export NGRAM_ENTROPY_CENTER="${NGRAM_ENTROPY_CENTER:-3.0}"
export NGRAM_ENTROPY_SCALE="${NGRAM_ENTROPY_SCALE:-2.0}"
export NGRAM_MIN_COUNT="${NGRAM_MIN_COUNT:-2}"
export NGRAM_LEAVE_ONE_OUT="${NGRAM_LEAVE_ONE_OUT:-1}"
export TTT_ENABLED="${TTT_ENABLED:-0}"
export EVAL_STRIDE="${EVAL_STRIDE:-64}"

# Data paths
if [[ "${DATA_ROOT_MODE}" == "tmp" ]]; then
DATA_BASE="/tmp/parameter-golf-data"
else
DATA_BASE="/workspace/parameter-golf/data"
fi
export DATA_PATH="${DATA_PATH:-${DATA_BASE}/datasets/fineweb10B_sp1024}"
export TOKENIZER_PATH="${TOKENIZER_PATH:-${DATA_BASE}/tokenizers/fineweb_1024_bpe.model}"

case "$MODE" in
smoke)
echo "=== SMOKE TEST (1xGPU, 180s, USE_COMPILE=0) ==="
export NPROC_PER_NODE="${NPROC_PER_NODE:-1}"
export MAX_WALLCLOCK_SECONDS="${MAX_WALLCLOCK_SECONDS:-180}"
export USE_COMPILE="${USE_COMPILE:-0}"
# NGRAM_MAX_ORDER was already exported with the shared default above, so
# "${NGRAM_MAX_ORDER:-9}" would never apply; set the reduced smoke order directly.
export NGRAM_MAX_ORDER=9
;;
base)
echo "=== FULL RUN (8xGPU, 600s) ==="
export NPROC_PER_NODE="${NPROC_PER_NODE:-8}"
export MAX_WALLCLOCK_SECONDS="${MAX_WALLCLOCK_SECONDS:-600}"
export USE_COMPILE="${USE_COMPILE:-1}"
;;
*)
echo "Unknown mode: $MODE (use 'base' or 'smoke')" >&2
exit 1
;;
esac

# Verify data
if [[ -f "/workspace/parameter-golf/verify_runpod_data_ready.sh" ]]; then
bash /workspace/parameter-golf/verify_runpod_data_ready.sh "$DATA_PATH" "$TOKENIZER_PATH"
fi

echo "Train script: $TRAIN_SCRIPT"
echo "Data path: $DATA_PATH"
echo "NGRAM: orders=${NGRAM_MIN_ORDER}-${NGRAM_MAX_ORDER} buckets=${NGRAM_NUM_BUCKETS} alpha=[${NGRAM_ALPHA_MIN},${NGRAM_ALPHA_MAX}] leave_one_out=${NGRAM_LEAVE_ONE_OUT}"
echo "COMPLEMENT: enabled=${COMPLEMENT_ENABLED} alpha=${COMPLEMENT_ALPHA}"

NPROC="${NPROC_PER_NODE:-8}"
if [[ "$NPROC" -eq 1 ]]; then
python3 "$TRAIN_SCRIPT"
else
torchrun --standalone --nproc_per_node="$NPROC" "$TRAIN_SCRIPT"
fi
---

**`launch_multiseed.sh`**
#!/usr/bin/env bash
# Launch the leave-one-out PPM candidate across the standard 3-seed package.
# Usage:
# bash launch_multiseed.sh # seeds 1337,42,2025
# SEEDS=1337,42 bash launch_multiseed.sh
# MODE=smoke bash launch_multiseed.sh
set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
MODE="${MODE:-base}"
SEEDS_CSV="${SEEDS:-1337,42,2025}"

IFS=',' read -r -a SEEDS_ARR <<< "$SEEDS_CSV"

echo "mode=$MODE"
echo "seeds=${SEEDS_CSV}"
echo "leave_one_out=${NGRAM_LEAVE_ONE_OUT:-1}"

for seed in "${SEEDS_ARR[@]}"; do
seed="$(echo "$seed" | xargs)"
if [[ -z "$seed" ]]; then
continue
fi
export SEED="$seed"
export RUN_ID="ppm_loo_seed${seed}"
echo
echo "=== seed ${seed} ==="
bash "$SCRIPT_DIR/launch.sh" "$MODE"
done
---

**Submission metadata (JSON)**
{
"author": "Simon Marcus",
"github_id": "simon-marcus",
"name": "WaterLOO: Full-Rescore N-gram Cache with Self-Exclusion",
"blurb": "Two-pass full-rescore n-gram eval with leave-one-out self-exclusion. Pass 1 stores per-token neural probabilities and entropies, the complete order-2-12 cache is built in a single vectorized pass, and Pass 2 rescoring subtracts each token's own direct cache contribution before matching.",
"date": "2026-03-26",
"val_loss": 0.16713198,
"val_bpb": 0.09898524,
"val_loss_std": 0.00004,
"val_bpb_std": 0.00002,
"seeds": [1337, 42, 2025],
"seed_results": {
"1337": {"val_loss": 0.16710306, "val_bpb": 0.09896811},
"42": {"val_loss": 0.16710815, "val_bpb": 0.09897112},
"2025": {"val_loss": 0.16718473, "val_bpb": 0.09901648}
},
"pre_quant_val_bpb": 1.14047,
"step_stop": 6931,
"wallclock_seconds": 600.0,
"eval_time_seconds": 144.77,
"bytes_total": 15873808,
"bytes_code": 115396,
"bytes_model": 15758412
}