Two-stage character-level next-character prediction system for multilingual text. Given a partial string, predicts the three most likely next characters (character autocomplete).
This implementation achieved 1st out of 75 teams (211 participants) with a 0.9618 success rate and 167.2117 seconds runtime.
See SPECIFICATION.md for the original CSE 447 project specification, and the closely related Interstellar Autocomplete Challenge.
Stage 1 — KenLM: An n-gram language model scores all vocabulary tokens and returns the top-K candidates.
Stage 2 — Reranker: A causal Transformer rescores the K candidates using learned context encoding, blended with KenLM scores via a mixture model. Falls back to KenLM-only if no reranker checkpoint is present.
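The two-stage flow can be sketched as follows. This assumes a simple log-linear interpolation with weight `alpha` on the KenLM score; the actual blend in predict.py may differ, and all names here are illustrative:

```python
def blend_and_pick(candidates, alpha=0.5, top_n=3):
    """Blend KenLM and reranker log-probs for each candidate character
    and return the top_n characters by blended score.

    candidates: list of (char, kenlm_logprob, reranker_logprob).
    alpha: weight on the KenLM score (illustrative; in this project the
    blend weight is saved in the reranker checkpoint).
    """
    scored = [
        (ch, alpha * kenlm_lp + (1.0 - alpha) * rr_lp)
        for ch, kenlm_lp, rr_lp in candidates
    ]
    scored.sort(key=lambda x: x[1], reverse=True)
    return [ch for ch, _ in scored[:top_n]]

# Toy example: KenLM prefers 'e', the reranker prefers 'a'.
cands = [("e", -1.0, -2.5), ("a", -1.8, -0.9), ("t", -2.0, -2.0), ("o", -3.0, -3.0)]
print(blend_and_pick(cands, alpha=0.5))  # blended scores favor 'a', then 'e', then 't'
```

With `alpha=1.0` this degenerates to KenLM-only ranking, which is exactly the fallback path when no reranker checkpoint is present.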
```
config.yaml        # All inference + reranker training hyperparameters
data_config.yaml   # Dataset download + preprocessing settings
src/
  predict.py       # Two-stage inference (KenLM + reranker)
  predict.sh       # Shell wrapper for predict.py
  myprogram.py     # KenLM-only baseline inference (kept for earlier checkpoints; no longer used)
  reranker/        # Reranker package (model, training, data loading)
  data/            # Dataset building + candidate precomputation
  utils/           # Shared text preprocessing utilities
scripts/           # Evaluation, grading, distillation, plotting
work/              # Model binaries + checkpoints (not included)
```
See src/reranker/README.md, src/data/README.md, and scripts/README.md for detailed documentation.
```bash
uv sync
uv run pre-commit install
```

KenLM is installed as a Python package from source via uv (see pyproject.toml). For the CLI tools (lmplz, build_binary), follow the KenLM build instructions.
Note: PyTorch and Lightning are listed as dev dependencies, not core dependencies. This is intentional: the Docker submission image (pytorch/pytorch:2.10.0-cuda13.0-cudnn9-runtime) already has PyTorch pre-installed via system packages, so the submitted container uses that instead of pip-installing it. Install dev dependencies locally with `uv sync` (which includes `--dev` by default).
Settings (languages, samples per language, etc.) live in data_config.yaml.
```bash
uv run python src/data/builddataset.py
```

Settings live in data_config.yaml under `preprocess`.
```bash
uv run python src/data/preprocess.py
```

Output: `data/<output_dir>/` with train.txt, valid.txt, vocab.json, input_valid.txt, answer_valid.txt.
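At its core, the vocab.json written here is a character inventory of the training text. A minimal sketch of such a builder (file format and `min_count` filtering are assumptions, not the project's exact schema):

```python
import json
from collections import Counter

def build_char_vocab(lines, min_count=1):
    """Count characters across training lines and return a sorted
    character vocabulary (illustrative of what vocab.json might hold)."""
    counts = Counter()
    for line in lines:
        counts.update(line)
    return sorted(ch for ch, n in counts.items() if n >= min_count)

vocab = build_char_vocab(["hello world", "héllo"])
print(json.dumps(vocab, ensure_ascii=False))
```

For multilingual data the vocabulary includes non-ASCII characters, which is why `ensure_ascii=False` matters when serializing.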
```bash
# 6-gram model with pruning (adjust order / prune thresholds as needed)
lmplz --verbose_header -o 6 --prune 0 0 0 1 2 2 \
  < data/madlad_multilang_clean_35k_optionB_kenlm/train.txt \
  > work/char6_pruned.arpa
build_binary work/char6_pruned.arpa work/char6.binary
```

Update config.yaml -> model.binary to point to the new binary.
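As a toy illustration of what the character n-gram model does with this data (KenLM itself uses smoothed, pruned counts over 6-grams; this unsmoothed bigram sketch is conceptual only):

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count next-character frequencies for each single-character context."""
    follows = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        follows[prev][nxt] += 1
    return follows

def top_k_next(follows, context, k=3):
    """Rank next characters by count, conditioning on the last character."""
    last = context[-1]
    return [ch for ch, _ in follows[last].most_common(k)]

model = train_bigram("the theory of the thing")
print(top_k_next(model, "th"))  # 'h' is mostly followed by 'e' in this text
```

A 6-gram model does the same thing with a five-character context window, which is what makes the pruning thresholds above necessary at scale.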
Scores every vocab token at every position in train/valid and writes TSVs used by the reranker as hard negatives.
```bash
uv run python src/data/precompute_kenlm_candidates.py --split train --k 64
uv run python src/data/precompute_kenlm_candidates.py --split valid --k 64
```

Update config.yaml -> reranker.data.candidates_train_path and candidates_valid_path to the generated TSV paths. See src/data/README.md for more options (stratified sampling, sibling gold exclusion, etc.).
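Conceptually, each training position pairs the gold character with KenLM's top-K as hard negatives. A sketch of that selection (the dict fields and gold-insertion policy are assumptions, not the project's exact TSV schema):

```python
def make_training_example(context, gold, kenlm_topk):
    """Pair the gold character with KenLM's top-K as hard negatives.
    If the gold char is missing from the top-K, prepend it so the
    reranker always has the correct label among its candidates."""
    candidates = list(kenlm_topk)
    if gold not in candidates:
        candidates = [gold] + candidates[:-1]
    label = candidates.index(gold)
    return {"context": context, "candidates": candidates, "label": label}

ex = make_training_example("predic", "t", ["t", "a", "e", "k"])
print(ex["label"])
```

The point of using KenLM's own top-K (rather than random characters) is that the reranker trains on exactly the confusable candidates it will see at inference time.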
All hyperparameters live in config.yaml under reranker.
```bash
uv run python src/reranker/train.py
```

Outputs:

- work/best_reranker.ckpt — Lightning checkpoint (saved by callback on metric improvement)
- work/<checkpoint_name>.pt — plain inference checkpoint (saved unconditionally at end of run, including on Ctrl+C)
See src/reranker/README.md for details on dataset modes, config options, and checkpoint conversion.
```bash
uv run python src/predict.py \
  --work_dir work \
  --test_data example/input.txt \
  --test_output pred.txt
```

predict.py reads model paths and inference settings from config.yaml automatically:
| config.yaml field (not exhaustive; see the top of predict.py) | what it controls |
|---|---|
| model.binary | KenLM binary path (relative to --work_dir) |
| model.vocab | Vocabulary file path (relative to --work_dir) |
| reranker.output.checkpoint_name | Reranker .pt path (relative to --work_dir); if absent, falls back to KenLM-only |
| reranker.training.candidate_size | Top-K candidates passed from KenLM to the reranker |
| reranker.training.eval_batch_size | Batch size for reranker forward passes |
`max_context_len` is read from the reranker checkpoint itself, so it always matches the value the model was trained with.
Optional CLI overrides:
- `--kenlm-only` — skip the reranker even if a checkpoint exists
- `--alpha <float>` — override the KenLM blend weight saved in the checkpoint
- `--device cpu|cuda` — force a device (auto-detected by default)
```bash
uv run python scripts/grade.py \
  --pred pred.txt \
  --answer example/answer.txt \
  --top-k 3
```

```bash
bash submit.sh
```

Builds submit.zip with the model files and source code, runs inference on the example data, and grades the output.
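The success rate reported above is the fraction of positions where the gold next character appears among the predicted top-k. A minimal sketch of that metric (scripts/grade.py's exact I/O and edge-case handling may differ):

```python
def success_rate(preds, answers, top_k=3):
    """preds: list of strings, each holding up to top_k predicted chars.
    answers: list of single gold characters. Returns the fraction of
    answers found among the corresponding predictions."""
    hits = sum(ans in pred[:top_k] for pred, ans in zip(preds, answers))
    return hits / len(answers)

print(success_rate(["abc", "xyz", "qrs"], ["b", "w", "q"]))  # 2 of 3 positions hit
```

Under this metric, the 0.9618 headline number means the gold character was among the three predictions for roughly 96% of test positions.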
- config.yaml — KenLM inference settings, multiprocessing workers, wandb logging, and all reranker architecture/training hyperparameters
- data_config.yaml — dataset download parameters (languages, sample counts, text length filters) and preprocessing settings
