This directory contains scripts for building and preprocessing multilingual datasets for character-level language modeling with KenLM, and for precomputing reranker training candidates.
`builddataset.py` downloads and builds a multilingual dataset from the MADLAD-400 corpus.
What it does:
- Streams text data from allenai/madlad-400 for the languages listed in `data_config.yaml`
- Collects 1,000 samples per language (configurable)
- Filters texts by length (200-4,000 characters)
- Saves as a Hugging Face Dataset format
- Generates CSV summaries of language counts and rejected languages
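The collection loop above can be sketched as follows. This is an illustrative sketch, not the script's actual code: `iter_docs` stands in for the real per-language allenai/madlad-400 stream, and the helper names are made up.

```python
MIN_CHARS, MAX_CHARS = 200, 4000
TARGET_PER_LANG = 1000

def keep(text: str) -> bool:
    """Length filter applied to each streamed document."""
    return len(text) >= MIN_CHARS

def collect(iter_docs, target=TARGET_PER_LANG):
    """Collect up to `target` documents that pass the filter,
    truncating each to MAX_CHARS as the script does."""
    samples = []
    for doc in iter_docs:
        if keep(doc):
            samples.append(doc[:MAX_CHARS])
            if len(samples) >= target:
                break
    return samples
```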
Output:
`data/madlad_multilang_clean_1k_optionB/` - Dataset directory
- `dataset_dict.json` - Dataset metadata
- `data-*.arrow` - Arrow-format data files
- `language_counts.csv` - Per-language sample counts
- `rejected_langs.csv` - Languages that failed to load (if any)
`preprocess.py` prepares the downloaded dataset for KenLM training by converting it to character-level tokenization.
What it does:
- Loads the dataset built by `builddataset.py`
- Applies text normalization:
- NFC unicode normalization (consistent canonical form)
- Lowercase conversion
- Whitespace normalization (collapse to single spaces)
- Character-level tokenization (spaces → `<sp>` token)
- Splits data into train/validation sets (99%/1%)
- Outputs plain text files suitable for KenLM
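The normalization and tokenization steps above amount to something like this (a sketch; the actual function names in `preprocess.py` may differ):

```python
import re
import unicodedata

SPACE_TOKEN = "<sp>"

def normalize(text: str) -> str:
    """NFC normalization, lowercasing, whitespace collapse."""
    text = unicodedata.normalize("NFC", text)
    text = text.lower()
    text = re.sub(r"\s+", " ", text).strip()
    return text

def char_tokenize(text: str) -> str:
    """One token per character; literal spaces become the <sp> token."""
    return " ".join(SPACE_TOKEN if ch == " " else ch for ch in normalize(text))
```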
Output:
`data/madlad_multilang_clean_1k_optionB_kenlm/` - Output directory
- `train.txt` - Training data (one tokenized line per document)
- `valid.txt` - Validation data
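A quick way to sanity-check that the output files are in the expected format (every whitespace-separated token is a single character or `<sp>`); this throwaway helper is not part of the pipeline:

```python
from pathlib import Path

def check_line(line: str) -> bool:
    """True if every token is a single character or the <sp> token."""
    return all(tok == "<sp>" or len(tok) == 1 for tok in line.split())

def check_file(path, max_lines=100):
    """Spot-check the first max_lines lines of a tokenized file."""
    with Path(path).open(encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i >= max_lines:
                break
            assert check_line(line), f"malformed line {i}"
```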
`precompute_kenlm_candidates.py` precomputes KenLM top-K candidates for every position (or the last position only) in a tokenized text file. It is used to generate hard negatives for reranker training. For each position, it builds the KenLM context state, scores all vocabulary tokens, and keeps the top K. Supports stratified sampling across language groups and sibling gold exclusion.
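The stratified sampling option can be pictured as drawing roughly `total / num_languages` lines per language group. A simplified sketch (the script's actual grouping and capping logic may differ):

```python
import random

def stratified_sample(lines_by_lang, total, seed=42):
    """Draw ~total/num_langs lines per language, capped at availability."""
    rng = random.Random(seed)
    per_lang = max(1, total // len(lines_by_lang))
    picked = []
    for lang, lines in sorted(lines_by_lang.items()):
        k = min(per_lang, len(lines))
        picked.extend(rng.sample(lines, k))
    return picked
```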
Output: TSV file with columns: seq_idx, pos, candidates, kenlm_scores, gold (candidates and scores are \x01-separated)
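Rows can be parsed back with plain string splitting. A minimal sketch, assuming one data row per line with no header:

```python
SEP = "\x01"  # separator used inside the candidates and kenlm_scores columns

def parse_row(line: str) -> dict:
    """Parse one TSV row: seq_idx, pos, candidates, kenlm_scores, gold."""
    seq_idx, pos, cands, scores, gold = line.rstrip("\n").split("\t")
    return {
        "seq_idx": int(seq_idx),
        "pos": int(pos),
        "candidates": cands.split(SEP),
        "kenlm_scores": [float(s) for s in scores.split(SEP)],
        "gold": gold,
    }
```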
`precompute_random_candidates.py` emits the same TSV format as `precompute_kenlm_candidates.py`, but candidates are drawn uniformly at random (no KenLM scoring): all kenlm_scores are 0, and gold is always force-included. Useful as a baseline to isolate the reranker's contribution from KenLM candidate quality.
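The random-candidate logic is simple enough to sketch directly: sample K tokens uniformly from the vocabulary, always force in the gold token, and emit zero scores (names here are illustrative, not the script's):

```python
import random

def random_candidates(vocab, gold, k, rng):
    """Uniform sample of k tokens; gold always included, scores all 0."""
    pool = [t for t in vocab if t != gold]
    cands = [gold] + rng.sample(pool, min(k - 1, len(pool)))
    rng.shuffle(cands)
    scores = [0.0] * len(cands)
    return cands, scores
```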
```bash
uv run python src/data/builddataset.py
```

This will take some time as it streams data from Hugging Face. Progress is printed per language.
Note: Requires datasets==3.6.0 due to compatibility issues with newer versions.
```bash
uv run python src/data/preprocess.py
```

You'll see a progress bar showing processing status.
Once preprocessing is complete, build a KenLM language model:
```bash
# Install KenLM first (if not already installed)
# See: https://github.com/kpu/kenlm
cd ../../data/madlad_multilang_clean_1k_optionB_kenlm

# Build a 5-gram model
lmplz -o 5 < train.txt > model_5gram.arpa

# Optionally, binarize for faster loading
build_binary model_5gram.arpa model_5gram.bin
```

Generate hard-negative candidate TSVs for the reranker. These are referenced in `config.yaml` under `reranker.data.candidates_train_path` / `candidates_valid_path`.
```bash
# Full train set, all positions, K=64
uv run python src/data/precompute_kenlm_candidates.py --split train --k 64

# Last position only (for distillation data), with sibling gold exclusion
uv run python src/data/precompute_kenlm_candidates.py \
    --split train --k 64 --last_position_only --exclude_sibling_golds

# Stratified sample of 35k lines across 52 languages
uv run python src/data/precompute_kenlm_candidates.py \
    --split train --k 64 --stratified_sample 35000

# Limit to 5 random positions per line (reduces TSV size for large datasets)
uv run python src/data/precompute_kenlm_candidates.py \
    --split train --k 64 --positions_per_line 5

# Valid set
uv run python src/data/precompute_kenlm_candidates.py --split valid --k 64
```

Generates a TSV with random candidates instead of KenLM top-K. Useful for ablations to isolate the reranker's contribution from KenLM candidate quality.
```bash
uv run python src/data/precompute_random_candidates.py --split train --k 64 --last_position_only
```

All dataset and preprocessing settings live in the repo root:
```python
TARGET_PER_LANG = 1000        # Samples to collect per language
MIN_CHARS = 200               # Minimum text length
MAX_CHARS = 4000              # Maximum text length (texts are truncated)
MAX_PER_LANG_SCAN = 200_000   # Max documents to scan per language
PATIENCE_PER_LANG = 50_000    # Early stop if no valid samples for N docs
SEED = 42                     # Random seed for reproducibility
```

To add/remove languages, edit the `LANGS` list.
```python
VALID_RATIO = 0.01    # Fraction for validation split
SPACE_TOKEN = "<sp>"  # Token to represent spaces
INPUT_DIR = "data/madlad_multilang_..."   # Input dataset path
OUTPUT_DIR = "data/madlad_multilang_..."  # Output directory
SEED = 42             # Random seed for train/val split
```

Original text:
```
Hello world!
```

After preprocessing:

```
h e l l o <sp> w o r l d !
```
Each character becomes a space-separated token. Spaces in the original text are replaced with the <sp> token.
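For inspecting model output it is handy to invert this tokenization; a small sketch:

```python
SPACE_TOKEN = "<sp>"

def detokenize(line: str) -> str:
    """Invert character-level tokenization: join tokens, map <sp> back to a space."""
    return "".join(" " if tok == SPACE_TOKEN else tok for tok in line.split())
```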
```python
from datasets import Dataset

# Load the built dataset (paths are relative to repo root)
ds = Dataset.load_from_disk("data/madlad_multilang_clean_1k_optionB")
print(ds)
print(ds[0])  # First example with 'lang' and 'text' fields
```

Required packages (already in pyproject.toml):

- `datasets==3.6.0` - Hugging Face datasets library
- `pandas` - CSV output
- `tqdm` - Progress bars
Install with:
```bash
uv sync
```

If you see errors about `trust_remote_code`, ensure you're using `datasets==3.6.0`. Newer versions (4.x) don't support the MADLAD-400 loading script.

```bash
uv add --dev "datasets==3.6.0"
```

The dataset is streamed, but preprocessing loads everything into memory. If you encounter memory issues:
- Reduce `TARGET_PER_LANG` in `builddataset.py`
- Process in batches by modifying `preprocess.py` to write incrementally
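Incremental writing just means streaming one tokenized document to disk at a time instead of accumulating a list in memory; roughly (illustrative, assuming a `tokenize` function like the one `preprocess.py` applies):

```python
def write_incrementally(docs, out_path, tokenize):
    """Tokenize and append one document at a time; memory stays O(one doc)."""
    with open(out_path, "w", encoding="utf-8") as f:
        for doc in docs:
            f.write(tokenize(doc) + "\n")
```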
If builddataset.py reports 0 samples:
- Check your internet connection
- Verify Hugging Face Hub access
- Check `rejected_langs.csv` for error messages
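The rejection CSV can be read with the stdlib `csv` module; the column names used when inspecting it (e.g. a `lang` column) are an assumption here, so check the file's actual header:

```python
import csv

def read_rejections(path):
    """Load rejected_langs.csv as a list of dicts keyed by its header row."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))
```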