
Data Pipeline

This directory contains scripts for building and preprocessing multilingual datasets for character-level language modeling with KenLM, and for precomputing reranker training candidates.

Scripts Overview

builddataset.py

Downloads and builds a multilingual dataset from the MADLAD-400 corpus.

What it does:

  • Streams text data from allenai/madlad-400 for the languages listed in data_config.yaml
  • Collects 1,000 samples per language (configurable)
  • Filters texts by length (200-4,000 characters)
  • Saves as a Hugging Face Dataset format
  • Generates CSV summaries of language counts and rejected languages
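The length filter in the collection loop boils down to a few lines. A minimal sketch (`keep_text` is an illustrative name, not the script's actual function; the real script also tracks per-language counts and patience):

```python
MIN_CHARS = 200
MAX_CHARS = 4000

def keep_text(text):
    # Reject texts shorter than MIN_CHARS; truncate longer ones to MAX_CHARS.
    text = text.strip()
    if len(text) < MIN_CHARS:
        return None
    return text[:MAX_CHARS]
```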

Output:

  • data/madlad_multilang_clean_1k_optionB/ - Dataset directory
    • dataset_dict.json - Dataset metadata
    • data-*.arrow - Arrow-format data files
    • language_counts.csv - Per-language sample counts
    • rejected_langs.csv - Languages that failed to load (if any)

preprocess.py

Prepares the downloaded dataset for KenLM training by converting text to character-level tokens.

What it does:

  • Loads the dataset from builddataset.py
  • Applies text normalization:
    • NFC Unicode normalization (consistent canonical form)
    • Lowercase conversion
    • Whitespace normalization (collapse to single spaces)
  • Character-level tokenization (spaces → <sp> token)
  • Splits data into train/validation sets (99%/1%)
  • Outputs plain text files suitable for KenLM
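Taken together, the normalization and tokenization steps amount to something like the sketch below (function names are illustrative, not the script's actual API):

```python
import re
import unicodedata

SPACE_TOKEN = "<sp>"

def normalize(text):
    # NFC canonical form, lowercase, collapse whitespace runs to single spaces
    text = unicodedata.normalize("NFC", text)
    text = text.lower()
    return re.sub(r"\s+", " ", text).strip()

def char_tokenize(text):
    # One token per character; literal spaces become the <sp> token
    return " ".join(SPACE_TOKEN if ch == " " else ch for ch in text)

print(char_tokenize(normalize("Hello   world!")))  # h e l l o <sp> w o r l d !
```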

Output:

  • data/madlad_multilang_clean_1k_optionB_kenlm/ - Output directory
    • train.txt - Training data (one tokenized line per document)
    • valid.txt - Validation data

precompute_kenlm_candidates.py

Precomputes KenLM top-K candidates for every position (or last position only) in a tokenized text file. Used to generate hard negatives for reranker training. For each position, builds the KenLM context state, scores all vocab tokens, and takes the top-K. Supports stratified sampling across language groups and sibling gold exclusion.
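The per-position top-K step reduces to selecting the K highest-scoring vocabulary tokens. A minimal heapq-based sketch (not the script's actual code):

```python
import heapq

def topk(scores, k):
    # scores: mapping of token -> KenLM log-probability in the current context.
    # Returns the k tokens with the highest scores.
    return heapq.nlargest(k, scores, key=scores.get)
```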

Output: TSV file with columns: seq_idx, pos, candidates, kenlm_scores, gold (candidates and scores are \x01-separated)
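Given that format, a row can be parsed as follows (a sketch assuming the column order above; `parse_row` is a hypothetical helper, not part of the scripts):

```python
SEP = "\x01"

def parse_row(line):
    # Columns: seq_idx, pos, candidates, kenlm_scores, gold (tab-separated);
    # candidates and kenlm_scores are \x01-separated, parallel lists.
    seq_idx, pos, candidates, scores, gold = line.rstrip("\n").split("\t")
    return {
        "seq_idx": int(seq_idx),
        "pos": int(pos),
        "candidates": candidates.split(SEP),
        "kenlm_scores": [float(s) for s in scores.split(SEP)],
        "gold": gold,
    }
```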

precompute_random_candidates.py

Same TSV format as precompute_kenlm_candidates.py, but candidates are drawn uniformly at random (no KenLM scoring). All kenlm_scores are 0. Gold is always force-included. Useful as a baseline to isolate the reranker's contribution vs KenLM candidate quality.
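The sampling described above might look like this (an illustrative sketch under those assumptions, not the script's implementation):

```python
import random

def random_candidates(vocab, gold, k, seed=42):
    # Draw k tokens uniformly at random, then force-include the gold token
    # by overwriting a random slot if it wasn't drawn.
    rng = random.Random(seed)
    cands = rng.sample(vocab, k)
    if gold not in cands:
        cands[rng.randrange(k)] = gold
    return cands
```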


Workflows

1. Build the Dataset

uv run python src/data/builddataset.py

This will take some time as it streams data from Hugging Face. Progress is printed per language.

Note: Requires datasets==3.6.0 due to compatibility issues with newer versions.

2. Preprocess for KenLM

uv run python src/data/preprocess.py

You'll see a progress bar showing processing status.

3. Train KenLM Model

Once preprocessing is complete, build a KenLM language model:

# Install KenLM first (if not already installed)
# See: https://github.com/kpu/kenlm

cd ../../data/madlad_multilang_clean_1k_optionB_kenlm

# Build a 5-gram model
lmplz -o 5 < train.txt > model_5gram.arpa

# Optionally, binarize for faster loading
build_binary model_5gram.arpa model_5gram.bin

4. Precompute KenLM Candidates for Reranker Training

Generate hard-negative candidate TSVs for the reranker. These are referenced in config.yaml under reranker.data.candidates_train_path / candidates_valid_path.

# Full train set, all positions, K=64
uv run python src/data/precompute_kenlm_candidates.py --split train --k 64

# Last position only (for distillation data), with sibling gold exclusion
uv run python src/data/precompute_kenlm_candidates.py \
    --split train --k 64 --last_position_only --exclude_sibling_golds

# Stratified sample of 35k lines across 52 languages
uv run python src/data/precompute_kenlm_candidates.py \
    --split train --k 64 --stratified_sample 35000

# Limit to 5 random positions per line (reduces TSV size for large datasets)
uv run python src/data/precompute_kenlm_candidates.py \
    --split train --k 64 --positions_per_line 5

# Valid set
uv run python src/data/precompute_kenlm_candidates.py --split valid --k 64
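For --positions_per_line, the position subsampling amounts to something like the sketch below (illustrative; the actual script may differ in details):

```python
import random

def sample_positions(line_len, positions_per_line, seed=42):
    # Use all positions if the line is short enough; otherwise take a
    # sorted random subset of positions_per_line positions.
    rng = random.Random(seed)
    positions = list(range(line_len))
    if positions_per_line is not None and line_len > positions_per_line:
        positions = sorted(rng.sample(positions, positions_per_line))
    return positions
```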

5. (Optional) Precompute Random Candidates (Baseline)

Generates a TSV with random candidates instead of KenLM top-K. Useful for ablations to isolate the reranker's contribution from KenLM candidate quality.

uv run python src/data/precompute_random_candidates.py --split train --k 64 --last_position_only

Configuration

Dataset and preprocessing settings live as constants in the scripts and in data_config.yaml at the repo root:

builddataset.py Parameters

TARGET_PER_LANG = 1000        # Samples to collect per language
MIN_CHARS = 200               # Minimum text length
MAX_CHARS = 4000              # Maximum text length (texts are truncated)
MAX_PER_LANG_SCAN = 200_000   # Max documents to scan per language
PATIENCE_PER_LANG = 50_000    # Early stop if no valid samples for N docs
SEED = 42                     # Random seed for reproducibility

To add/remove languages, edit the LANGS list.

preprocess.py Parameters

VALID_RATIO = 0.01                        # Fraction for validation split
SPACE_TOKEN = "<sp>"                      # Token to represent spaces
INPUT_DIR = "data/madlad_multilang_..."   # Input dataset path
OUTPUT_DIR = "data/madlad_multilang_..."  # Output directory
SEED = 42                                 # Random seed for train/val split
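The seeded train/validation split can be sketched as follows (illustrative; the script's actual split logic may differ):

```python
import random

def split_indices(n, valid_ratio=0.01, seed=42):
    # Shuffle indices deterministically, then reserve valid_ratio of them
    # (at least one) for the validation set.
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    n_valid = max(1, int(n * valid_ratio))
    return idx[n_valid:], idx[:n_valid]
```

Using the same SEED on every run keeps the split reproducible.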

Output Format

Character Tokenization Example

Original text:

Hello world!

After preprocessing:

h e l l o <sp> w o r l d !

Each character becomes a space-separated token. Spaces in the original text are replaced with the <sp> token.

Loading the Dataset Programmatically

from datasets import Dataset

# Load the built dataset (paths are relative to repo root)
ds = Dataset.load_from_disk("data/madlad_multilang_clean_1k_optionB")
print(ds)
print(ds[0])  # First example with 'lang' and 'text' fields

Dependencies

Required packages (already in pyproject.toml):

  • datasets==3.6.0 - Hugging Face datasets library
  • pandas - CSV output
  • tqdm - Progress bars

Install with:

uv sync

Troubleshooting

trust_remote_code Error

If you see errors about trust_remote_code, ensure you're using datasets==3.6.0. Newer versions (4.x) don't support the MADLAD-400 loading script.

uv add --dev "datasets==3.6.0"

Memory Issues

The dataset is streamed, but preprocessing loads everything into memory. If you encounter memory issues:

  • Reduce TARGET_PER_LANG in builddataset.py
  • Process in batches by modifying preprocess.py to write incrementally

Empty Dataset

If builddataset.py reports 0 samples:

  • Check your internet connection
  • Verify Hugging Face Hub access
  • Check rejected_langs.csv for error messages