DataRefinery

A small data preprocessing demo project that shows simple cleaning, outlier handling, scaling, and categorical encoding using Python and common data libraries.

Overview

This repository contains a lightweight example pipeline (app.py) that demonstrates basic data preprocessing steps on a dummy dataset. Sample data files are included in the data/ folder, and running the script writes the preprocessed output to preprocessed_dummy_data.csv.

Project structure

app.py — Main script that creates a dummy dataset, handles missing values, removes outliers, scales numeric features, encodes categorical variables, and saves the result.
preprocessed_dummy_data.csv — Output produced by app.py (generated during a run).
requirements.txt — Python package dependencies.
data/ — Example CSV files included for reference.
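
The steps listed for app.py can be illustrated with a minimal sketch. This is not the code in app.py: the column names (age, city), the sample values, and the specific techniques (median and mode imputation, the 1.5 * IQR outlier rule, min-max scaling, one-hot encoding) are assumptions chosen for illustration, and the actual script may use different ones.

```python
import numpy as np
import pandas as pd

# Hypothetical dummy data; these column names and values are assumptions,
# not the ones used in app.py.
df = pd.DataFrame({
    "age": [25.0, 32.0, np.nan, 41.0, 29.0, 35.0, 38.0, 300.0],
    "city": ["Paris", "Lyon", "Paris", None, "Nice", "Lyon", "Paris", "Nice"],
})

# 1. Missing values: median for the numeric column,
#    most frequent value for the categorical column.
df["age"] = df["age"].fillna(df["age"].median())
df["city"] = df["city"].fillna(df["city"].mode()[0])

# 2. Outliers: keep only rows within 1.5 * IQR of the quartiles (assumed rule).
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["age"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)].copy()

# 3. Scaling: min-max scaling of the numeric column to the range [0, 1].
age_min, age_max = df["age"].min(), df["age"].max()
df["age"] = (df["age"] - age_min) / (age_max - age_min)

# 4. Encoding: one-hot encode the categorical column.
result = pd.get_dummies(df, columns=["city"])

# 5. Save the result under the output file name described above.
result.to_csv("preprocessed_dummy_data.csv", index=False)
print(result)
```

In this sketch the row with age 300 falls outside the IQR bounds and is dropped before scaling, so the extreme value does not compress the remaining values toward zero.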

Requirements

Python 3.8+ (recommended)
A virtual environment (optional but recommended)
Install dependencies with: pip install -r requirements.txt

Usage

Activate your virtual environment, then run: python app.py
The script prints progress to the console and writes preprocessed_dummy_data.csv when finished.

Common issue explained (brief)

If you see a traceback that ends with KeyError: 'column_name', the code tried to access a DataFrame column literally named column_name, and no such column exists. This typically happens when a placeholder column name was left in the script, or when a column was renamed or removed earlier in the pipeline. It does not mean the data is "already clean"; it indicates a mismatch between the column names the code expects and the columns actually present in the data. Printing the DataFrame's columns just before the failing line usually reveals the mismatch.
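
One way to catch this kind of mismatch early is to check for the expected columns before using them. The sketch below is only an illustration: require_columns is a hypothetical helper that is not part of app.py, and the example column names are assumptions.

```python
import pandas as pd


def require_columns(df, expected):
    """Raise a descriptive KeyError if any expected column is missing.

    Hypothetical helper for illustration; not part of app.py.
    """
    missing = [col for col in expected if col not in df.columns]
    if missing:
        raise KeyError(
            f"Missing columns {missing}; available columns: {list(df.columns)}"
        )


# Assumed example data.
df = pd.DataFrame({"age": [25, 32], "city": ["Paris", "Lyon"]})

require_columns(df, ["age", "city"])  # all columns present, so nothing is raised

try:
    # 'column_name' stands in for a placeholder left in the script.
    require_columns(df, ["age", "column_name"])
except KeyError as err:
    message = str(err)
    print(message)
```

The error message lists both the missing names and the columns that are available, which makes a leftover placeholder or a renamed column easier to spot than the bare KeyError.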

Contributing

Contributions are welcome. For small changes (typos, documentation), open a pull request. For larger changes, please open an issue first to discuss the proposal.

License

This project is released under the MIT License.
