The fastest no-code data preprocessing engine for Machine Learning.
Powered by Rust & Polars.
📺 Watch: mlprep Introduction (NotebookLM)
Stop writing slow, fragile pandas boilerplate.
Start defining robust, reproducible pipelines.
mlprep is a high-performance CLI tool and Python library that handles the dirty work of ML engineers: type inference, missing value imputation, complex joins, and feature engineering—all defined in a simple YAML config.
Built on Rust and Polars, mlprep processes gigabytes of data in seconds, not minutes. It leverages multi-threading and SIMD vectorization out of the box.
Define your entire preprocessing workflow in pipeline.yaml. No more "spaghetti code" notebooks that no one can read.
Don't let dirty data crash your training. mlprep isolates invalid rows (schema mismatch, outliers) into a separate "quarantine" file, so your pipeline stays green and your models stay clean.
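The quarantine semantics are easy to picture in plain Polars. Here is a minimal sketch of the idea (an illustration of the behavior, not mlprep's internals; the clean-output path is a placeholder):

```python
import polars as pl

df = pl.read_csv("data/raw_users.csv")

# Rows matching the email pattern pass; everything else is quarantined.
is_valid = pl.col("email").str.contains(r"^.+@.+\..+$")

df.filter(is_valid).write_parquet("data/clean_users.parquet")  # placeholder path
df.filter(~is_valid).write_parquet("quarantine.parquet")
# Note: null emails match neither branch; real validation also needs a null check.
```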
`fit` your feature engineering steps (scaling, encoding) on training data, then `transform` production data with exact reproducibility. No more training-serving skew.
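This mirrors scikit-learn's fit/transform contract. For intuition, here is the same discipline in plain scikit-learn (an illustration of the pattern, not mlprep's API; the arrays are made up):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[18.0, 30_000.0], [35.0, 72_000.0], [52.0, 58_000.0]])
X_prod = np.array([[41.0, 64_000.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std on train only
X_prod_scaled = scaler.transform(X_prod)        # reuse the same fitted parameters
```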
```bash
pip install mlprep
```

```yaml
# pipeline.yaml
inputs:
  - path: "data/raw_users.csv"
    format: csv

steps:
  # ETL
  - fillna:
      strategy: mean
      columns: [age, income]
  - filter: "age >= 18"

  # Data Quality Check
  - validate:
      mode: quarantine  # Bad rows go to 'quarantine.parquet'
      checks:
        - name: email
          regex: "^.+@.+\\..+$"

  # Feature Engineering
  - features:
      config: features.yaml

outputs:
  - path: "data/processed_users.parquet"
    format: parquet
    compression: zstd
```

```bash
mlprep run pipeline.yaml
```

Result: a clean, highly compressed Parquet file ready for training. 🚀
| Feature | Pandas | mlprep |
|---|---|---|
| Speed | 🐢 Single-threaded | 🐆 Multi-threaded (Rust) |
| Pipeline | Python Script | YAML Config |
| Validation | Manual `.loc[]` checks | Built-in Quality Engine |
| Bad Data | Crash or Silent Fail | Quarantine Execution |
| Memory | Bloated Objects | Zero-Copy Arrow |
mlprep is designed for speed, leveraging Rust's ownership model and Polars' query engine.
| Operation | vs Pandas | Note |
|---|---|---|
| CSV Read | ~3-5x Faster | Multi-threaded parsing |
| Pipeline | ~10x Faster | Lazy evaluation & query optimization |
| Memory | ~1/4 Usage | Zero-copy Arrow memory format |
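These figures are workload-dependent; a quick way to eyeball the CSV-read gap on your own machine is a hand-rolled timing check with plain pandas and Polars (the path is a placeholder):

```python
import time

import pandas as pd
import polars as pl

PATH = "data/raw_users.csv"  # placeholder: point this at any large CSV

t0 = time.perf_counter()
pd.read_csv(PATH)
print(f"pandas read_csv: {time.perf_counter() - t0:.2f}s")

t0 = time.perf_counter()
pl.read_csv(PATH)
print(f"polars read_csv: {time.perf_counter() - t0:.2f}s")
```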
The figures above come from a 1 GB generated dataset. To run the full benchmark yourself:

```bash
python scripts/benchmark.py --size 1.0 --compare-pandas
```

We are actively building the MVP (Phase 1). Check out our documentation.
Explore full examples in the `examples/` directory:

1. Scenario: Filter, select columns, and convert CSV to Parquet.
   - Key Features: `filter`, `select`, `write_parquet`.
2. Scenario: Ensure data quality before training.
   - Key Features: Schema validation, `quarantine` mode for invalid rows.
3. Scenario: Generate features for ML training.
   - Key Features: `fit` (train) / `transform` (prod) pattern, `standard_scaler`, `one_hot_encoding`.
4. Scenario: Use mlprep as a preprocessing step in a Scikit-Learn pipeline.
   - Key Features: Seamless integration with the Python ML ecosystem.
5. Scenario: Track preprocessing parameters and artifacts in MLflow (see the sketch after this list).
   - Key Features: Reproducibility and experiment management.
6. Airflow DAG
   - Scenario: Schedule and monitor `mlprep run` as part of an Airflow DAG (see the DAG sketch after this list).
   - Key Features: Production-friendly orchestration with `BashOperator`.
7. DVC Pipeline
   - Scenario: Version control processed datasets with a DVC stage that calls `mlprep`.
   - Key Features: Reproducible data artifacts (`dvc repro` + `mlprep run pipeline.yaml`).
We welcome contributions! Please see the issue tracker for good first issues.
License: MIT
