
Data Pipeline

This directory contains scripts for building and preprocessing multilingual datasets for character-level language modeling with KenLM, and for precomputing reranker training candidates.

Scripts Overview

builddataset.py

Downloads and builds a multilingual dataset from the MADLAD-400 corpus.

What it does:

  • Streams text data from allenai/madlad-400 for the languages listed in data_config.yaml
  • Collects 1,000 samples per language (configurable)
  • Filters texts by length (200-4,000 characters)
  • Saves as a Hugging Face Dataset format
  • Generates CSV summaries of language counts and rejected languages
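The length filter in the collection loop boils down to a few lines. A minimal sketch (`keep_text` is an illustrative name, not the script's actual function; the real script also tracks per-language counts and patience):

```python
MIN_CHARS = 200
MAX_CHARS = 4000

def keep_text(text):
    # Reject texts shorter than MIN_CHARS; truncate longer ones to MAX_CHARS.
    text = text.strip()
    if len(text) < MIN_CHARS:
        return None
    return text[:MAX_CHARS]
```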

Output:

  • data/madlad_multilang_clean_1k_optionB/ - Dataset directory
    • dataset_dict.json - Dataset metadata
    • data-*.arrow - Arrow-format data files
    • language_counts.csv - Per-language sample counts
    • rejected_langs.csv - Languages that failed to load (if any)

preprocess.py

Prepares the downloaded dataset for KenLM training by converting text to character-level tokens.

What it does:

  • Loads the dataset from builddataset.py
  • Applies text normalization:
    • NFC Unicode normalization (consistent canonical form)
    • Lowercase conversion
    • Whitespace normalization (collapse to single spaces)
  • Character-level tokenization (spaces → <sp> token)
  • Splits data into train/validation sets (99%/1%)
  • Outputs plain text files suitable for KenLM
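Taken together, the normalization and tokenization steps amount to something like the sketch below (function names are illustrative, not the script's actual API):

```python
import re
import unicodedata

SPACE_TOKEN = "<sp>"

def normalize(text):
    # NFC canonical form, lowercase, collapse whitespace runs to single spaces
    text = unicodedata.normalize("NFC", text)
    text = text.lower()
    return re.sub(r"\s+", " ", text).strip()

def char_tokenize(text):
    # One token per character; literal spaces become the <sp> token
    return " ".join(SPACE_TOKEN if ch == " " else ch for ch in text)

print(char_tokenize(normalize("Hello   world!")))  # h e l l o <sp> w o r l d !
```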

Output:

  • data/madlad_multilang_clean_1k_optionB_kenlm/ - Output directory
    • train.txt - Training data (one tokenized line per document)
    • valid.txt - Validation data

precompute_kenlm_candidates.py

Precomputes KenLM top-K candidates for every position (or last position only) in a tokenized text file. Used to generate hard negatives for reranker training. For each position, builds the KenLM context state, scores all vocab tokens, and takes the top-K. Supports stratified sampling across language groups and sibling gold exclusion.
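The per-position top-K step reduces to selecting the K highest-scoring vocabulary tokens. A minimal heapq-based sketch (not the script's actual code):

```python
import heapq

def topk(scores, k):
    # scores: mapping of token -> KenLM log-probability in the current context.
    # Returns the k tokens with the highest scores.
    return heapq.nlargest(k, scores, key=scores.get)
```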

Output: TSV file with columns: seq_idx, pos, candidates, kenlm_scores, gold (candidates and scores are \x01-separated)
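Given that format, a row can be parsed as follows (a sketch assuming the column order above; `parse_row` is a hypothetical helper, not part of the scripts):

```python
SEP = "\x01"

def parse_row(line):
    # Columns: seq_idx, pos, candidates, kenlm_scores, gold (tab-separated);
    # candidates and kenlm_scores are \x01-separated, parallel lists.
    seq_idx, pos, candidates, scores, gold = line.rstrip("\n").split("\t")
    return {
        "seq_idx": int(seq_idx),
        "pos": int(pos),
        "candidates": candidates.split(SEP),
        "kenlm_scores": [float(s) for s in scores.split(SEP)],
        "gold": gold,
    }
```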

precompute_random_candidates.py

Same TSV format as precompute_kenlm_candidates.py, but candidates are drawn uniformly at random (no KenLM scoring). All kenlm_scores are 0. Gold is always force-included. Useful as a baseline to isolate the reranker's contribution vs KenLM candidate quality.
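The sampling described above might look like this (an illustrative sketch under those assumptions, not the script's implementation):

```python
import random

def random_candidates(vocab, gold, k, seed=42):
    # Draw k tokens uniformly at random, then force-include the gold token
    # by overwriting a random slot if it wasn't drawn.
    rng = random.Random(seed)
    cands = rng.sample(vocab, k)
    if gold not in cands:
        cands[rng.randrange(k)] = gold
    return cands
```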


Workflows

1. Build the Dataset

uv run python src/data/builddataset.py

This will take some time as it streams data from Hugging Face. Progress is printed per language.

Note: Requires datasets==3.6.0 due to compatibility issues with newer versions.

2. Preprocess for KenLM

uv run python src/data/preprocess.py

You'll see a progress bar showing processing status.

3. Train KenLM Model

Once preprocessing is complete, build a KenLM language model:

# Install KenLM first (if not already installed)
# See: https://github.com/kpu/kenlm

cd ../../data/madlad_multilang_clean_1k_optionB_kenlm

# Build a 5-gram model
lmplz -o 5 < train.txt > model_5gram.arpa

# Optionally, binarize for faster loading
build_binary model_5gram.arpa model_5gram.bin

4. Precompute KenLM Candidates for Reranker Training

Generate hard-negative candidate TSVs for the reranker. These are referenced in config.yaml under reranker.data.candidates_train_path / candidates_valid_path.

# Full train set, all positions, K=64
uv run python src/data/precompute_kenlm_candidates.py --split train --k 64

# Last position only (for distillation data), with sibling gold exclusion
uv run python src/data/precompute_kenlm_candidates.py \
    --split train --k 64 --last_position_only --exclude_sibling_golds

# Stratified sample of 35k lines across 52 languages
uv run python src/data/precompute_kenlm_candidates.py \
    --split train --k 64 --stratified_sample 35000

# Limit to 5 random positions per line (reduces TSV size for large datasets)
uv run python src/data/precompute_kenlm_candidates.py \
    --split train --k 64 --positions_per_line 5

# Valid set
uv run python src/data/precompute_kenlm_candidates.py --split valid --k 64
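For --positions_per_line, the position subsampling amounts to something like the sketch below (illustrative; the actual script may differ in details):

```python
import random

def sample_positions(line_len, positions_per_line, seed=42):
    # Use all positions if the line is short enough; otherwise take a
    # sorted random subset of positions_per_line positions.
    rng = random.Random(seed)
    positions = list(range(line_len))
    if positions_per_line is not None and line_len > positions_per_line:
        positions = sorted(rng.sample(positions, positions_per_line))
    return positions
```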

5. (Optional) Precompute Random Candidates (Baseline)

Generates a TSV with random candidates instead of KenLM top-K. Useful for ablations to isolate the reranker's contribution from KenLM candidate quality.

uv run python src/data/precompute_random_candidates.py --split train --k 64 --last_position_only

Configuration

Dataset and preprocessing settings live as constants in the scripts and in data_config.yaml at the repo root:

builddataset.py Parameters

TARGET_PER_LANG = 1000        # Samples to collect per language
MIN_CHARS = 200               # Minimum text length
MAX_CHARS = 4000              # Maximum text length (texts are truncated)
MAX_PER_LANG_SCAN = 200_000   # Max documents to scan per language
PATIENCE_PER_LANG = 50_000    # Early stop if no valid samples for N docs
SEED = 42                     # Random seed for reproducibility

To add/remove languages, edit the LANGS list.

preprocess.py Parameters

VALID_RATIO = 0.01                        # Fraction for validation split
SPACE_TOKEN = "<sp>"                      # Token to represent spaces
INPUT_DIR = "data/madlad_multilang_..."   # Input dataset path
OUTPUT_DIR = "data/madlad_multilang_..."  # Output directory
SEED = 42                                 # Random seed for train/val split
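The seeded train/validation split can be sketched as follows (illustrative; the script's actual split logic may differ):

```python
import random

def split_indices(n, valid_ratio=0.01, seed=42):
    # Shuffle indices deterministically, then reserve valid_ratio of them
    # (at least one) for the validation set.
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    n_valid = max(1, int(n * valid_ratio))
    return idx[n_valid:], idx[:n_valid]
```

Using the same SEED on every run keeps the split reproducible.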

Output Format

Character Tokenization Example

Original text:

Hello world!

After preprocessing:

h e l l o <sp> w o r l d !

Each character becomes a space-separated token. Spaces in the original text are replaced with the <sp> token.

Loading the Dataset Programmatically

from datasets import Dataset

# Load the built dataset (paths are relative to repo root)
ds = Dataset.load_from_disk("data/madlad_multilang_clean_1k_optionB")
print(ds)
print(ds[0])  # First example with 'lang' and 'text' fields

Dependencies

Required packages (already in pyproject.toml):

  • datasets==3.6.0 - Hugging Face datasets library
  • pandas - CSV output
  • tqdm - Progress bars

Install with:

uv sync

Troubleshooting

trust_remote_code Error

If you see errors about trust_remote_code, ensure you're using datasets==3.6.0. Newer versions (4.x) don't support the MADLAD-400 loading script.

uv add --dev "datasets==3.6.0"

Memory Issues

The dataset is streamed, but preprocessing loads everything into memory. If you encounter memory issues:

  • Reduce TARGET_PER_LANG in builddataset.py
  • Process in batches by modifying preprocess.py to write incrementally

Empty Dataset

If builddataset.py reports 0 samples:

  • Check your internet connection
  • Verify Hugging Face Hub access
  • Check rejected_langs.csv for error messages