
TokenTurbine

An end-to-end data pipeline for LLM pre-training. TokenTurbine ingests a sample of unprepared data, cleans and normalizes text, performs deduplication and quality filtering, and outputs a training-ready dataset. Designed for scalability, modularity, and experiment tracking, and built with Ray for distributed processing. The default dataset can be downloaded with the provided script or manually (see Installation & Usage below).

Pipeline stages:

  • Data Ingestion - Load and normalize text; apply initial low-cost filtering
  • Quality Filtering - Language detection, PII handling, toxicity filtering
  • Deduplication - Exact and fuzzy duplicate removal using MinHash LSH
  • Tokenization - Fast tokenization with tiktoken
  • Export - Training-ready format (single JSONL file by default, Parquet available for tokens)

Key Features

  • Distributed Processing - Parallel, distributed execution with Ray
  • Production-Ready - Containerized with Docker for reproducible runs
  • Configurable - YAML-based configuration for easy experimentation
  • Efficient - Vectorized operations with PyArrow for high throughput
  • Observable - Comprehensive logging and statistics tracking


Prerequisites

  • Docker 20.10+ and Docker Compose 2.0+ (Install Docker)
  • At least 6GB RAM and 1GB disk space (or 3x input data size if using a different dataset).

Verify prerequisites:

docker --version           # Should be 20.10+
docker-compose --version   # Should be 2.0+

Installation & Usage

TokenTurbine offers 3 setup options - choose the one that fits your workflow:

  • Option 1: Script-based (Recommended)
  • Option 2: Makefile
  • Option 3: Manual Setup

Option 1: Script-based (Recommended)

# 1. Clone the repository
git clone https://github.com/EMventura/TokenTurbine.git 
cd TokenTurbine

# 2. Make download script executable
chmod +x scripts/download_data.sh

# 3. Run the download script
./scripts/download_data.sh

# 4. Build the Docker image
docker-compose build

# 5. Run the pipeline
docker-compose up

# 6. Access results
ls -lh data/processed/cleaned_dataset.jsonl

Option 2: Makefile

# 1. Clone the repository
git clone https://github.com/EMventura/TokenTurbine.git
cd TokenTurbine

# 2. Download data
make download

# 3. Build the container
make build

# 4. Run the pipeline
make run

# 5. Access results
ls -lh data/processed/cleaned_dataset.jsonl

Option 3: Manual Setup

# 1. Clone the repository
git clone https://github.com/EMventura/TokenTurbine.git
cd TokenTurbine

# 2. Create data directory
mkdir -p data/raw

# 3. Download the dataset manually
curl -L "https://s3.us-east-1.amazonaws.com/mainpipe.maincode.com/mainpipe_data_v1.jsonl" \
  -o data/raw/mainpipe_data_v1.jsonl

# 4. Verify the download
ls -lh data/raw/mainpipe_data_v1.jsonl

# 5. Build the Docker image
docker-compose build

# 6. Run the pipeline
docker-compose up

# 7. Access results
ls -lh data/processed/cleaned_dataset.jsonl

That's it! Your cleaned dataset is ready at data/processed/cleaned_dataset.jsonl


Configuration

Edit configs/base.yaml to customize the pipeline:

# Example: Adjust filtering thresholds
filtering:
  enabled: true
  min_lang_score: 0.65        # Language confidence (0-1)
  max_punc_ratio: 0.3         # Max punctuation ratio
  enable_pii: true
  pii_action: "redact"        # "redact" or "drop"
  enable_toxicity: true

# Example: Tune deduplication
deduplication:
  enabled: true
  num_perm: 128               # MinHash permutations (64-256)
  threshold: 0.85             # Similarity threshold (0.7-0.95)
  max_lsh_items: 1000000      # Memory limit

# Example: Enable tokenization
tokenization:
  enabled: true
  tokenizer_model: "gpt2"     # "gpt2", "cl100k_base", etc.
  max_length: 2048
  export_format: "parquet"    # "parquet" or "jsonl"

See configs/base.yaml for all available options.
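
As a rough illustration of how these settings might be consumed in Python (a hedged sketch using PyYAML, not the repository's actual config_loader API):

import yaml  # PyYAML

# Load the pipeline configuration (default path assumed)
with open("configs/base.yaml") as f:
    config = yaml.safe_load(f)

# Keys mirror the example above; the real file may expose more options
dedup_cfg = config.get("deduplication", {})
print(dedup_cfg.get("num_perm", 128), dedup_cfg.get("threshold", 0.85))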


Pipeline Stages

1. Ingestion

  • Loads JSONL input files
  • Cleans HTML and normalizes Unicode
  • Filters code-like content and short documents
  • Generates document IDs with xxhash (see the sketch below)
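
A minimal sketch of these ingestion ideas, assuming xxhash-based IDs and NFKC Unicode normalization; the actual src/data_load.py may differ in its cleaning rules and thresholds:

import re
import unicodedata
import xxhash

def ingest(raw_text: str, min_chars: int = 200):  # min_chars is an assumed threshold
    # Strip simple HTML tags and normalize Unicode (NFKC)
    text = re.sub(r"<[^>]+>", " ", raw_text)
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"\s+", " ", text).strip()
    # Low-cost filter: drop very short documents
    if len(text) < min_chars:
        return None
    # Stable document ID derived from the cleaned text
    doc_id = xxhash.xxh64(text.encode("utf-8")).hexdigest()
    return {"doc_id": doc_id, "text": text}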

2. Quality Filtering

  • Language Detection: FastText-based filtering (176 languages)
  • PII Handling: Detects and redacts emails, phone numbers, IPs (see the sketch below)
  • Toxicity Filter: Removes hate speech and toxic content
  • Quality Heuristics: Punctuation ratio, character distribution
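
For illustration, a hedged sketch of language scoring with fastText and regex-based PII redaction; the model path, patterns, and thresholds here are assumptions, not the repository's exact rules:

import re
import fasttext

lang_model = fasttext.load_model("data/lid.176.bin")  # path assumed

# Deliberately simple PII patterns (emails, IPv4 addresses)
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def filter_doc(text: str, min_lang_score: float = 0.65):
    labels, scores = lang_model.predict(text.replace("\n", " "))
    if labels[0] != "__label__en" or scores[0] < min_lang_score:
        return None  # drop non-English or low-confidence documents
    # Redact PII rather than dropping the document (pii_action: "redact")
    text = EMAIL.sub("[EMAIL]", text)
    text = IPV4.sub("[IP]", text)
    return text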

3. Deduplication

  • Exact Dedup: Hash-based duplicate removal
  • Fuzzy Dedup: MinHash LSH for near-duplicate detection
  • Configurable Jaccard similarity threshold (see the sketch below)
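
A minimal sketch of MinHash LSH near-duplicate detection using the datasketch library; the repository's Ray-based implementation may differ in detail:

from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for token in set(text.lower().split()):
        m.update(token.encode("utf-8"))
    return m

lsh = MinHashLSH(threshold=0.85, num_perm=128)  # Jaccard similarity threshold
kept = []
for doc in docs:  # docs: iterable of {"doc_id": ..., "text": ...}
    m = minhash(doc["text"])
    if lsh.query(m):          # a near-duplicate is already indexed
        continue
    lsh.insert(doc["doc_id"], m)
    kept.append(doc)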

4. Tokenization (Optional)

  • Fast BPE tokenization with tiktoken (see the sketch below)
  • Multiple tokenizer support (GPT-2, GPT-4, custom)
  • Configurable sequence length and truncation
  • Exports to Parquet or JSONL
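
A hedged sketch of this step with tiktoken, using the values from the config example above; truncation behaviour in the actual src/tokenization.py may differ:

import tiktoken

enc = tiktoken.get_encoding("gpt2")  # or "cl100k_base", per tokenizer_model

def tokenize(doc: dict, max_length: int = 2048) -> dict:
    # Allow special-token strings in raw text instead of raising
    input_ids = enc.encode(doc["text"], disallowed_special=())[:max_length]
    return {
        "doc_id": doc["doc_id"],
        "input_ids": input_ids,
        "token_count": len(input_ids),
    }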

5. Export

  • Single consolidated JSONL file (see the sketch below)
  • Preserves document metadata (source, URL, timestamps)
  • Optional tokenized shards for training
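
As an illustration of the two export formats, a sketch that writes the consolidated JSONL file plus an optional tokenized Parquet shard with PyArrow; the shard file name is an assumption:

import json
import pyarrow as pa
import pyarrow.parquet as pq

# Consolidated JSONL with document metadata preserved
with open("data/processed/cleaned_dataset.jsonl", "w", encoding="utf-8") as f:
    for doc in cleaned_docs:  # dicts with doc_id, text, url, source
        f.write(json.dumps(doc, ensure_ascii=False) + "\n")

# Optional tokenized shard as Parquet
table = pa.Table.from_pylist(tokenized_docs)  # dicts with doc_id, input_ids, token_count
pq.write_table(table, "data/processed/tokenized/shard_000.parquet")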

Project Structure

TokenTurbine/
├── src/
│   ├── main.py                 # Pipeline orchestrator
│   ├── data_load.py            # Ingestion stage
│   ├── filtering.py            # Quality filtering
│   ├── deduplication.py        # Dedup stage
│   ├── tokenization.py         # Tokenization stage
│   └── utils/
│       ├── config_loader.py    # Config management
│       ├── helper.py           # Helper functions
│       └── single_jsonl.py     # Export utilities
├── configs/
│   └── base.yaml               # Default configuration
├── scripts/                    # Helper scripts (data download for Option 1)
│   └── download_data.sh
├── reports/                    # Final report and metric plots
│   ├── report_mainpipe.pdf     # Final report
│   └── plots/                  # Metric plots
│       ├── Char_Words_Dist.png    
│       ├── Dedup_Check.png           
│       ├── Integrity_Check.png    
│       ├── Language_Check.png          
│       ├── Perplexity_Dist.png    
│       ├── PII_Check.png
│       ├── pipeline_steps.png          
│       └── Toxicity_Check.png     
├── data/
│   ├── raw/                    # Input data 
│   └── processed/              # Output data (cleaned non-tokenized dataset)
│       └── tokenized/          # Tokenized files
├── notebooks/                  # Jupyter notebooks for output analysis
│   ├── read_input.ipynb        # Analysis of the raw dataset
│   └── inspect_dataset.ipynb   # Analysis of the cleaned dataset
├── Dockerfile                  # Container definition
├── docker-compose.yml          # Container orchestration
├── requirements.txt            # Python dependencies
├── Makefile                    # Helper commands
└── README.md                   # This file

Makefile Commands

Command          Description
make download    Download the input dataset
make build       Build Docker image
make run         Run pipeline with default config
make shell       Open interactive shell in container
make logs        View pipeline logs
make stop        Stop running containers
make clean       Remove containers and images
make clean-data  Remove all processed data
make validate    Verify installation

Run make help to see all available commands.


Performance

System: 4 CPU cores, 8GB RAM

Stage           Time (with compute_counts)   Time (no compute_counts)
Ingestion       ~2 min                       <5 sec
Filtering       ~2 min                       <5 sec
Deduplication   ~8 min                       <5 sec
Export          ~8 min                       ~8 min
Tokenization    ~8 min                       ~8 min
Total           ~28 min                      ~17 min

Troubleshooting

Common Issues

Issue: Pipeline runs slowly
Solution: Reduce batch_size in config or increase CPU allocation

Issue: "FastText model not found"
Solution: Model should be auto-downloaded. If not: cd data && wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin

Issue: Permission denied on data files
Solution: chmod -R 755 data/ or run with your user: docker-compose run --user $(id -u):$(id -g) pipeline

Issue: Data file not found (FileNotFoundError: data/raw/mainpipe_data_v1.jsonl)
Solution: Download the data file manually, with the scripts/download_data.sh script, or with make download

Issue: Old containers interfering (ERROR: 'ContainerConfig')
Solution:

# Clean everything
docker-compose down -v
docker-compose build --no-cache
docker-compose up

Advanced Usage

Custom Configurations

Create multiple configs for different experiments:

# Create custom config
cp configs/base.yaml configs/experiment_1.yaml

# Edit thresholds
vim configs/experiment_1.yaml

# Run with custom config
docker-compose run --rm pipeline --config configs/experiment_1.yaml

Use your own data

# Copy your data file
cp /path/to/your/data.jsonl data/raw/mainpipe_data_v1.jsonl

# Run pipeline
docker-compose up

Output Format

Cleaned Dataset (data/processed/cleaned_dataset.jsonl)

{
  "doc_id": "a1b2c3d4e5f6...",
  "text": "This is the cleaned document text...",
  "url": "https://example.com/page",
  "source": "mainpipe_data_v1.jsonl"
}
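
The output is plain JSONL, so it can be streamed one record per line, for example:

import json

with open("data/processed/cleaned_dataset.jsonl", encoding="utf-8") as f:
    for line in f:
        doc = json.loads(line)
        # doc["doc_id"], doc["text"], doc["url"], doc["source"]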

Tokenized Dataset (data/processed/tokenized/*.parquet)

Parquet files containing:

  • doc_id: Document identifier
  • input_ids: List of token IDs
  • token_count: Number of tokens
  • text: Original text (optional)
  • source, url: Metadata
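
A quick way to inspect the tokenized shards with PyArrow (already a pipeline dependency); the exact columns depend on your tokenization config:

import pyarrow.parquet as pq

table = pq.read_table("data/processed/tokenized/")  # reads every shard in the directory
print(table.num_rows, table.column_names)
print(table.column("token_count").to_pylist()[:5])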

Metric Plots (reports/plots/*.png)

Plots include (generated using notebooks/inspect_dataset.ipynb):

  • Char_Words_Dist.png: Histogram of lengths
  • Dedup_Check.png: Unique vs duplicate docs
  • Integrity_Check.png: Docs with artifacts
  • Language_Check.png: English vs Non-English docs
  • Perplexity_Dist.png: Perplexity scores
  • PII_Check.png: PII docs
  • pipeline_steps.png: Metadata
  • Toxicity_Check.png: Toxic docs

License

This project is licensed under the MIT License - see the LICENSE file for details.


Acknowledgments

Built with: Ray, Docker, PyArrow, tiktoken, fastText, and xxhash.


Contact

For questions or issues:

  • Review troubleshooting section above
  • Open an issue on GitHub

Current Version (v1.0)

  • Basic pipeline (ingestion, filtering, dedup, tokenization)
  • Docker containerization
  • Configuration system
  • Quality filters (language, PII, toxicity)
  • MinHash LSH deduplication
  • Inspectability dashboard (metrics, histograms)
