DataRefinery

A small data preprocessing demo project that shows simple cleaning, outlier handling, scaling, and categorical encoding using Python and common data libraries.

Overview

This repository contains a lightweight example pipeline (app.py) that demonstrates basic data preprocessing steps on a dummy dataset. Sample data files are included in the data/ folder, and running the script writes the preprocessed result to preprocessed_dummy_data.csv.

Project structure

  • app.py — Main script that creates a dummy dataset, handles missing values, removes outliers, scales numeric features, encodes categorical variables, and saves the result.
  • preprocessed_dummy_data.csv — Output produced by app.py (generated during a run).
  • requirements.txt — Python package dependencies.
  • data/ — Example CSV files included for reference.
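
The steps that app.py is described as performing can be sketched roughly as follows. This is a minimal illustration, not the actual contents of app.py: the function name preprocess, the sample columns (age, income, city), and the specific techniques (median imputation, the 1.5 × IQR outlier rule, z-score scaling, one-hot encoding) are all assumptions chosen for the example.

```python
import pandas as pd


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical pipeline: impute, drop outliers, scale, encode."""
    df = df.copy()
    numeric_cols = df.select_dtypes(include="number").columns
    categorical_cols = df.select_dtypes(exclude="number").columns

    # 1. Missing values: median for numeric columns, a placeholder label otherwise.
    df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())
    for col in categorical_cols:
        df[col] = df[col].fillna("missing")

    # 2. Outliers: keep rows where every numeric value lies within 1.5 * IQR.
    q1 = df[numeric_cols].quantile(0.25)
    q3 = df[numeric_cols].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    within = ((df[numeric_cols] >= lower) & (df[numeric_cols] <= upper)).all(axis=1)
    df = df[within].copy()

    # 3. Scaling: z-score each numeric column (guard against zero variance).
    std = df[numeric_cols].std(ddof=0).replace(0, 1.0)
    df[numeric_cols] = (df[numeric_cols] - df[numeric_cols].mean()) / std

    # 4. Encoding: one-hot encode the categorical columns.
    return pd.get_dummies(df, columns=list(categorical_cols))


# Made-up sample data, for illustration only.
sample = pd.DataFrame(
    {
        "age": [25, 32, None, 41, 29, 120],
        "income": [40000, 52000, 48000, None, 45000, 51000],
        "city": ["Paris", "Lima", "Paris", None, "Lima", "Paris"],
    }
)
result = preprocess(sample)
print(result)
```

In this sample, the row with age 120 falls outside the IQR bounds and is dropped, and the missing city becomes its own one-hot column.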

Requirements

  • Python 3.8+ (recommended)
  • A virtual environment (optional but recommended)
  • Install dependencies with: pip install -r requirements.txt

Usage

  • Activate your virtual environment, then run: python app.py
  • The script prints progress to the console and writes preprocessed_dummy_data.csv when finished.

Common issue explained (brief)

If you see a traceback that ends with KeyError: 'column_name', it means the code tried to access a DataFrame column literally named column_name that does not exist. This typically happens when a placeholder column name was left in the script or when a column was renamed or removed earlier in the pipeline. It does not mean the data is "already clean" — it indicates a mismatch between the column names the code expects and the actual columns in the data.
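
One way to surface this mismatch early is to check the expected column names before using them. The sketch below assumes a hypothetical helper, require_columns, and made-up column names; nothing like it is confirmed to exist in app.py.

```python
import pandas as pd


def require_columns(df: pd.DataFrame, expected: list) -> None:
    """Hypothetical helper: fail early with a message listing what is missing."""
    missing = [col for col in expected if col not in df.columns]
    if missing:
        raise KeyError(
            f"Missing columns {missing}; available columns: {list(df.columns)}"
        )


# Made-up data: 'city' is renamed earlier in the pipeline...
df = pd.DataFrame({"age": [25, 32], "city": ["Paris", "Lima"]})
renamed = df.rename(columns={"city": "location"})

# ...so later code that still expects 'city' no longer matches the data.
try:
    require_columns(renamed, ["age", "city"])
except KeyError as err:
    message = str(err)
    print(message)
```

The error message names both the missing column and the columns that are actually present, which makes a leftover placeholder or an earlier rename easier to spot than a bare KeyError.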

Contributing

Contributions are welcome. For small changes (typos, documentation), open a pull request. For larger changes, please open an issue first to discuss the proposal.

License

This project is released under the MIT License.