Pieter1821/Data-Refinery

DataRefinery

A small data preprocessing demo project that shows simple cleaning, outlier handling, scaling, and categorical encoding using Python and common data libraries.

Overview

This repository contains a lightweight example pipeline (app.py) that demonstrates basic data preprocessing steps on a dummy dataset. Sample data files are included in the data/ folder, and running app.py writes the preprocessed output to preprocessed_dummy_data.csv.

Project structure

  • app.py — Main script that creates a dummy dataset, handles missing values, removes outliers, scales numeric features, encodes categorical variables, and saves the result.
  • preprocessed_dummy_data.csv — Output produced by app.py (generated during a run).
  • requirements.txt — Python package dependencies.
  • data/ — Example CSV files included for reference.
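The steps listed for app.py can be sketched roughly as follows. This is a minimal illustration, not the actual contents of app.py: the column names (age, income, city), the helper names build_dummy_data and preprocess, and the specific techniques (median imputation, the 1.5 × IQR outlier rule, standardization, one-hot encoding) are all assumptions chosen for the example, and it uses only pandas and NumPy.

```python
import numpy as np
import pandas as pd


def build_dummy_data(seed: int = 0, n: int = 50) -> pd.DataFrame:
    """Create a small synthetic dataset (hypothetical columns, not app.py's)."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "age": rng.integers(18, 65, size=n).astype(float),
        "income": rng.normal(50_000, 8_000, size=n),
        "city": rng.choice(["A", "B", "C"], size=n),
    })
    df.loc[0, "age"] = np.nan        # inject a missing value
    df.loc[1, "income"] = 1_000_000  # inject an obvious outlier
    return df


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the four steps in order; every choice here is an assumption."""
    numeric_cols = ["age", "income"]
    out = df.copy()

    # 1. Missing values: fill numeric gaps with the column median.
    out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].median())

    # 2. Outliers: keep rows within 1.5 * IQR of the quartiles
    #    in every numeric column.
    q1 = out[numeric_cols].quantile(0.25)
    q3 = out[numeric_cols].quantile(0.75)
    iqr = q3 - q1
    in_range = (
        (out[numeric_cols] >= q1 - 1.5 * iqr)
        & (out[numeric_cols] <= q3 + 1.5 * iqr)
    ).all(axis=1)
    out = out[in_range].reset_index(drop=True)

    # 3. Scaling: standardize to zero mean and unit variance.
    means = out[numeric_cols].mean()
    stds = out[numeric_cols].std(ddof=0)
    out[numeric_cols] = (out[numeric_cols] - means) / stds

    # 4. Encoding: one-hot encode the categorical column.
    return pd.get_dummies(out, columns=["city"])


if __name__ == "__main__":
    result = preprocess(build_dummy_data())
    result.to_csv("preprocessed_dummy_data.csv", index=False)
    print(result.head())
```

The order matters in this sketch: outliers are removed before scaling so that extreme values do not distort the mean and standard deviation used for standardization.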

Requirements

  • Python 3.8+ (recommended)
  • A virtual environment (optional but recommended)
  • Install dependencies with: pip install -r requirements.txt

Usage

  • Activate your virtual environment (if you use one), then run: python app.py
  • The script prints progress to the console and writes preprocessed_dummy_data.csv when finished.

Common issue explained (brief)

If you see a traceback that ends with KeyError: 'column_name', it means the code tried to access a DataFrame column literally named column_name that does not exist. This typically happens when a placeholder column name was left in the script or when a column was renamed or removed earlier in the pipeline. It does not mean the data is "already clean" — it indicates a mismatch between the column names the code expects and the actual columns in the data.
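One way to surface this mismatch early is to check the expected column names before the pipeline touches them. The sketch below assumes pandas; the helper name require_columns and the example columns are hypothetical and do not come from app.py.

```python
from typing import Iterable

import pandas as pd


def require_columns(df: pd.DataFrame, expected: Iterable[str]) -> None:
    """Raise a descriptive KeyError if any expected column is absent."""
    missing = [col for col in expected if col not in df.columns]
    if missing:
        raise KeyError(
            f"Missing columns: {missing}. "
            f"Available columns: {list(df.columns)}"
        )


# Hypothetical data standing in for the real dataset.
df = pd.DataFrame({"age": [25, 31], "city": ["A", "B"]})

require_columns(df, ["age", "city"])  # all present, passes silently

try:
    # A placeholder name left in the script triggers the check.
    require_columns(df, ["age", "column_name"])
except KeyError as err:
    print(err)
```

Printing df.columns just before the failing line is a quicker alternative when debugging interactively; the helper is more useful when the check should run on every execution.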

Contributing

Contributions are welcome. For small changes (typos, documentation), open a pull request. For larger changes, please open an issue first to discuss the proposal.

License

This project is released under the MIT License.

