A declarative pipeline for reproducible ML preprocessing
Modern machine learning (ML) workflows live or die by their data‑preprocessing steps, yet in Python—a language with a
rich ecosystem for data science and ML—these steps are often scattered across ad‑hoc scripts or opaque Scikit-Learn
(sklearn) snippets that are hard to read, audit, or reuse. ReciPies provides a concise, human‑readable, and fully
reproducible way to declare, execute, and share preprocessing pipelines, adhering to Configuration as Code principles.
It lets users describe transformations as a recipe made of ordered steps (e.g., imputing, encoding, normalizing)
applied to variables identified by semantic roles (predictor, outcome, ID, time stamp, etc.). Recipes can be prepped
(trained) once, baked many times, and cleanly separated between training and new data—preventing data leakage by
construction. Under the hood, ReciPies targets both Pandas and Polars backends for performance and flexibility, and
it is easily extensible: users can register custom steps with minimal boilerplate. Each recipe is serializable to
JSON/YAML for provenance tracking, collaboration, and publication, and integrates smoothly with downstream modeling
libraries. By packaging preprocessing as clear, declarative objects, ReciPies lowers the cognitive load of feature
engineering, improves reproducibility, and makes methodological choices explicit, benefiting individual researchers,
engineering teams, and peer reviewers alike.
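The prep/bake separation described above can be illustrated with a small, library-free sketch. The names `ToyMeanImputer` and `ToyRecipe` are hypothetical and are not part of the ReciPies API; they only show the idea that statistics are learned once from training data and then reused unchanged on new data.

```python
from statistics import mean


class ToyMeanImputer:
    """Hypothetical step: fills missing values with a mean learned during prep."""

    def __init__(self, column):
        self.column = column
        self.fill_value = None  # learned in fit(), reused in transform()

    def fit(self, rows):
        # Estimate the fill value from the observed (non-missing) entries only.
        observed = [r[self.column] for r in rows if r[self.column] is not None]
        self.fill_value = mean(observed)

    def transform(self, rows):
        # Replace missing entries with the stored fill value; leave others as-is.
        return [
            {**r, self.column: self.fill_value if r[self.column] is None else r[self.column]}
            for r in rows
        ]


class ToyRecipe:
    """Hypothetical recipe: an ordered list of steps with prep and bake phases."""

    def __init__(self, steps):
        self.steps = steps

    def prep(self, train_rows):
        # Fit each step on the training data only, then transform it.
        for step in self.steps:
            step.fit(train_rows)
            train_rows = step.transform(train_rows)
        return train_rows

    def bake(self, new_rows):
        # Apply the already-fitted steps; nothing is re-estimated from new data.
        for step in self.steps:
            new_rows = step.transform(new_rows)
        return new_rows


train = [{"x1": 2.0}, {"x1": None}, {"x1": 4.0}]
test = [{"x1": None}, {"x1": 100.0}]

recipe = ToyRecipe([ToyMeanImputer("x1")])
train_out = recipe.prep(train)
test_out = recipe.bake(test)
```

In this sketch the missing test value is filled with the training mean, so the extreme test value never influences the imputation. That is the sense in which fitting only during prep avoids leakage.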
ReciPies works with either Polars or Pandas DataFrames as its backend. Its design is inspired by the R package recipes. Please check the documentation for more details.
You can install ReciPies from pip using:
pip install recipies
You can also install ReciPies with uv (a Python package and project manager) using the following command:
uv add recipies
Note that the package is called recipies on pip.
# with conda (optional)
conda create -n ReciPies python=3.12
conda activate ReciPies
Then, from the root of the repository, run with pip:
pip install -e .
Or with uv (if you have uv installed):
uv venv && source .venv/bin/activate
Here's a simple example of using ReciPies:
# Import necessary libraries
import polars as pl
import numpy as np
from datetime import datetime, MINYEAR
from recipies import Ingredients, Recipe
from recipies.selector import all_numeric_predictors, all_predictors
from recipies.step import StepSklearn, StepHistorical, Accumulator, StepImputeFill
from sklearn.impute import MissingIndicator
# Set up random state for reproducible results
rand_state = np.random.RandomState(42)
# Create time columns for two different groups
timecolumn = pl.concat(
    [
        pl.datetime_range(datetime(MINYEAR, 1, 1, 0), datetime(MINYEAR, 1, 1, 5), "1h", eager=True),
        pl.datetime_range(datetime(MINYEAR, 1, 1, 0), datetime(MINYEAR, 1, 1, 3), "1h", eager=True),
    ]
)
# Create sample DataFrame
df = pl.DataFrame(
    {
        "id": [1] * 6 + [2] * 4,
        "time": timecolumn,
        "y": rand_state.normal(size=(10,)),
        "x1": rand_state.normal(loc=10, scale=5, size=(10,)),
        "x2": rand_state.binomial(n=1, p=0.3, size=(10,)),
        "x3": pl.Series(["a", "b", "c", "a", "c", "b", "c", "a", "b", "c"], dtype=pl.Categorical),
        "x4": pl.Series(["x", "y", "y", "x", "y", "y", "x", "x", "y", "x"], dtype=pl.Categorical),
    }
)
# Introduce some missing values
df = df.with_columns(pl.when(pl.int_range(pl.len()).is_in([1, 2, 4, 7])).then(None).otherwise(pl.col("x1")).alias("x1"))
df2 = df.clone()
# Create Ingredients and Recipe
ing = Ingredients(df)
rec = Recipe(ing, outcomes=["y"], predictors=["x1", "x2", "x3", "x4"], groups=["id"], sequences=["time"])
rec.add_step(StepSklearn(MissingIndicator(features="all"), sel=all_predictors()))
rec.add_step(StepImputeFill(sel=all_predictors(), strategy="forward"))
rec.add_step(StepHistorical(sel=all_predictors(), fun=Accumulator.MEAN, suffix="mean_hist"))
# Prep (fit) the recipe on the training ingredients and get the transformed data
df = rec.prep()
# Apply the recipe to a new DataFrame (e.g., test set)
df2 = rec.bake(df2)
Below is a schematic overview of ReciPies' architecture. We 1) load a Pandas or Polars (training) dataframe, then 2) wrap it in an Ingredients object that maintains column role information (i.e., what each column does in the dataset). Next, we 3) define a Recipe consisting of multiple Steps that operate on selected columns. Finally, we 4) prep the Recipe on the training data and 5) bake it on new data. We can then 6) run our ML pipeline on train and test data.
The main building blocks of ReciPies are:
- Ingredients: A wrapper around DataFrames that maintains column role information, ensuring data semantics are preserved during transformations.
- Recipe: A collection of processing steps that can be applied to Ingredients objects to create reproducible data pipelines.
- Step: Individual data transformation operations that understand column roles and can work with both Polars and Pandas backends.
- Selector: Utilities for selecting columns based on their roles or other criteria.
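To make the idea of role-based selection concrete, here is a minimal, library-free sketch. The role table, the dtype table, and the helpers `has_role`, `is_numeric`, and `select` are hypothetical stand-ins for illustration only; they are not the ReciPies selector API.

```python
# Hypothetical role table: column name -> set of roles,
# loosely mirroring the roles used in the example above.
roles = {
    "id": {"group"},
    "time": {"sequence"},
    "y": {"outcome"},
    "x1": {"predictor"},
    "x2": {"predictor"},
    "x3": {"predictor"},
}

# Hypothetical dtype table, used to combine a role with a type criterion.
dtypes = {
    "id": "int",
    "time": "datetime",
    "y": "float",
    "x1": "float",
    "x2": "int",
    "x3": "categorical",
}


def has_role(role):
    """Build a predicate that is true for columns carrying the given role."""
    return lambda col: role in roles.get(col, set())


def is_numeric(col):
    """Predicate that is true for columns with an integer or float dtype."""
    return dtypes[col] in {"int", "float"}


def select(columns, *predicates):
    """Keep the columns that satisfy every predicate, preserving order."""
    return [c for c in columns if all(p(c) for p in predicates)]


columns = list(roles)
predictors = select(columns, has_role("predictor"))
numeric_predictors = select(columns, has_role("predictor"), is_numeric)
```

Because a step in this sketch receives a selector rather than a hard-coded column list, the same step definition keeps working when columns are added or renamed, as long as their roles are declared.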
ReciPies supports both Polars and Pandas backends:
- Polars: High-performance DataFrame library with lazy evaluation
- Pandas: Traditional DataFrame library with extensive ecosystem support
The package automatically detects the backend and provides a consistent API regardless of the underlying DataFrame implementation.
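One way such backend detection could work is sketched below. This is an assumption for illustration and not necessarily how ReciPies implements it: the hypothetical helper `detect_backend` inspects the top-level module of the object's type, so neither library has to be imported. A stand-in class is used here in place of a real DataFrame to keep the snippet dependency-free.

```python
def detect_backend(df):
    """Return 'polars' or 'pandas' based on the top-level module of the object's type.

    Hypothetical helper for illustration; not part of the ReciPies API.
    """
    top_level = type(df).__module__.split(".")[0]
    if top_level in ("polars", "pandas"):
        return top_level
    raise TypeError(f"Unsupported DataFrame type: {type(df)!r}")


# Stand-in class that only mimics the module path of a Pandas DataFrame.
FakePandasFrame = type("DataFrame", (), {"__module__": "pandas.core.frame"})

backend = detect_backend(FakePandasFrame())
```

With the real libraries installed, passing a `pl.DataFrame` or `pd.DataFrame` instance would take the same code path, since both types live under their library's top-level package.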
Check out the examples/ directory for Jupyter notebooks demonstrating various use cases of ReciPies.
Check out the benchmarks/ directory for performance comparisons between Polars and Pandas backends.
Contributions are welcome! Please see our contributing guidelines and open an issue or submit a pull request on the GitHub repository.
This project is licensed under the MIT License. See the LICENSE file for details.
If you use ReciPies in your work, please cite the newly accepted JOSS publication. Use 'Cite this repository', or:
@article{van_de_Water2026,
doi = {10.21105/joss.09261},
url = {https://doi.org/10.21105/joss.09261},
year = {2026},
publisher = {The Open Journal},
volume = {11},
number = {117},
pages = {9261},
author = {van de Water, Robin P. and Schmidt, Hendrik and Rockenschaub, Patrick},
title = {ReciPies: A Lightweight Data Transformation Pipeline for Reproducible ML},
journal = {Journal of Open Source Software}
}
If you use Yet Another ICU Benchmark in your research, please cite the following:
@inproceedings{vandewaterYetAnotherICUBenchmark2024,
title = {Yet Another ICU Benchmark: A Flexible Multi-Center Framework for Clinical ML},
shorttitle = {Yet Another ICU Benchmark},
booktitle = {The Twelfth International Conference on Learning Representations},
author = {van de Water, Robin and Schmidt, Hendrik Nils Aurel and Elbers, Paul and Thoral, Patrick and Arnrich, Bert and Rockenschaub, Patrick},
year = {2024},
month = oct,
urldate = {2024-02-19},
langid = {english},
}
This paper can also be found on arXiv.