The fastest no-code data preprocessing engine for Machine Learning.
Powered by Rust & Polars.
📺 Watch: mlprep Introduction (NotebookLM)
Stop writing slow, fragile pandas boilerplate.
Start defining robust, reproducible pipelines.
mlprep is a high-performance CLI tool and Python library that handles the dirty work of ML engineers: type inference, missing value imputation, complex joins, and feature engineering—all defined in a simple YAML config.
Built on Rust and Polars, mlprep processes gigabytes of data in seconds, not minutes. It leverages multi-threading and SIMD vectorization out of the box.
Define your entire preprocessing workflow in pipeline.yaml. No more "spaghetti code" notebooks that no one can read.
Don't let dirty data crash your training. mlprep isolates invalid rows (schema mismatch, outliers) into a separate "quarantine" file, so your pipeline stays green and your models stay clean.
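The quarantine semantics are easy to picture in plain Polars. Here is a minimal sketch of the idea (an illustration of the behavior, not mlprep's internals; the clean-output path is a placeholder):

```python
import polars as pl

df = pl.read_csv("data/raw_users.csv")

# Rows matching the email pattern pass; everything else is quarantined.
is_valid = pl.col("email").str.contains(r"^.+@.+\..+$")

df.filter(is_valid).write_parquet("data/clean_users.parquet")  # placeholder path
df.filter(~is_valid).write_parquet("quarantine.parquet")
# Note: null emails match neither branch; real validation also needs a null check.
```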
`fit` your feature engineering steps (scaling, encoding) on training data, then `transform` production data with exact reproducibility. No more training-serving skew.
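This mirrors scikit-learn's fit/transform contract. For intuition, here is the same discipline in plain scikit-learn (an illustration of the pattern, not mlprep's API; the arrays are made up):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[18.0, 30_000.0], [35.0, 72_000.0], [52.0, 58_000.0]])
X_prod = np.array([[41.0, 64_000.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std on train only
X_prod_scaled = scaler.transform(X_prod)        # reuse the same fitted parameters
```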
```bash
pip install mlprep
```

```yaml
# pipeline.yaml
inputs:
  - path: "data/raw_users.csv"
    format: csv

steps:
  # ETL
  - fillna:
      strategy: mean
      columns: [age, income]
  - filter: "age >= 18"

  # Data Quality Check
  - validate:
      mode: quarantine  # Bad rows go to 'quarantine.parquet'
      checks:
        - name: email
          regex: "^.+@.+\\..+$"

  # Feature Engineering
  - features:
      config: features.yaml

outputs:
  - path: "data/processed_users.parquet"
    format: parquet
    compression: zstd
```

```bash
mlprep run pipeline.yaml
```

Result: a clean, highly compressed Parquet file ready for training. 🚀
| Feature | Pandas | mlprep |
|---|---|---|
| Speed | 🐢 Single-threaded | 🐆 Multi-threaded (Rust) |
| Pipeline | Python Script | YAML Config |
| Validation | Manual `.loc[]` checks | Built-in Quality Engine |
| Bad Data | Crash or Silent Fail | Quarantine Execution |
| Memory | Bloated Objects | Zero-Copy Arrow |
mlprep is designed for speed, leveraging Rust's ownership model and Polars' query engine.
| Operation | vs Pandas | Note |
|---|---|---|
| CSV Read | ~3-5x Faster | Multi-threaded parsing |
| Pipeline | ~10x Faster | Lazy evaluation & query optimization |
| Memory | ~1/4 Usage | Zero-copy Arrow memory format |
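These figures are workload-dependent; a quick way to eyeball the CSV-read gap on your own machine is a hand-rolled timing check with plain pandas and Polars (the path is a placeholder):

```python
import time

import pandas as pd
import polars as pl

PATH = "data/raw_users.csv"  # placeholder: point this at any large CSV

t0 = time.perf_counter()
pd.read_csv(PATH)
print(f"pandas read_csv: {time.perf_counter() - t0:.2f}s")

t0 = time.perf_counter()
pl.read_csv(PATH)
print(f"polars read_csv: {time.perf_counter() - t0:.2f}s")
```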
The figures above come from a 1 GB generated dataset. To run the full benchmark yourself:

```bash
python scripts/benchmark.py --size 1.0 --compare-pandas
```

We are actively building the MVP (Phase 1). Check out our documentation.
Explore full examples in the `examples/` directory:

1. Scenario: Filter, select columns, and convert CSV to Parquet.
   - Key Features: `filter`, `select`, `write_parquet`.
2. Scenario: Ensure data quality before training.
   - Key Features: Schema validation, `quarantine` mode for invalid rows.
3. Scenario: Generate features for ML training.
   - Key Features: `fit` (train) / `transform` (prod) pattern, `standard_scaler`, `one_hot_encoding`.
4. Scenario: Use mlprep as a preprocessing step in a Scikit-Learn pipeline.
   - Key Features: Seamless integration with the Python ML ecosystem.
5. Scenario: Track preprocessing parameters and artifacts in MLflow (see the sketch after this list).
   - Key Features: Reproducibility and experiment management.
6. Airflow DAG
   - Scenario: Schedule and monitor `mlprep run` as part of an Airflow DAG (see the DAG sketch after this list).
   - Key Features: Production-friendly orchestration with `BashOperator`.
7. DVC Pipeline
   - Scenario: Version control processed datasets with a DVC stage that calls `mlprep`.
   - Key Features: Reproducible data artifacts (`dvc repro` + `mlprep run pipeline.yaml`).
We welcome contributions! Please see the issue tracker for good first issues.
License: MIT
