A small data preprocessing demo project that shows simple cleaning, outlier handling, scaling, and categorical encoding using Python and common data libraries.
This repository contains a lightweight example pipeline (app.py) that demonstrates basic data preprocessing steps on a dummy dataset. The repository includes sample data files in the data/ folder and saves a preprocessed output file (preprocessed_dummy_data.csv).
- app.py: Main script that creates a dummy dataset, handles missing values, removes outliers, scales numeric features, encodes categorical variables, and saves the result.
- preprocessed_dummy_data.csv: Output produced by app.py (generated during a run).
- requirements.txt: Python package dependencies.
- data/: Example CSV files included for reference.
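For orientation, the steps listed for app.py can be sketched roughly as follows. This is an illustrative sketch, not the actual contents of app.py: the function name preprocess, the column names, and the specific techniques (median/mode imputation, the 1.5 * IQR outlier rule, z-score scaling, one-hot encoding) are assumptions chosen for the example, and the script itself may use different methods or libraries.

```python
import numpy as np
import pandas as pd


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical pipeline: impute, drop IQR outliers, standardize, one-hot encode."""
    df = df.copy()
    numeric_cols = df.select_dtypes(include="number").columns
    categorical_cols = df.columns.difference(numeric_cols)

    # 1. Missing values: median for numeric columns, most frequent value otherwise.
    #    Assumes every categorical column has at least one non-missing value.
    df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())
    for col in categorical_cols:
        df[col] = df[col].fillna(df[col].mode().iloc[0])

    # 2. Outliers: keep rows that lie within 1.5 * IQR of the quartiles
    #    in every numeric column.
    q1 = df[numeric_cols].quantile(0.25)
    q3 = df[numeric_cols].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    within = ((df[numeric_cols] >= lower) & (df[numeric_cols] <= upper)).all(axis=1)
    df = df[within].reset_index(drop=True)

    # 3. Scaling: z-score each numeric column (constant columns are left at 0).
    std = df[numeric_cols].std(ddof=0).replace(0, 1)
    df[numeric_cols] = (df[numeric_cols] - df[numeric_cols].mean()) / std

    # 4. Encoding: one indicator column per category.
    return pd.get_dummies(df, columns=list(categorical_cols))


# Made-up dummy data with two missing numbers, one missing label, and one extreme age.
raw = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 29, 38, 35, 120],
    "income": [40000, 52000, 61000, np.nan, 45000, 58000, 50000, 55000],
    "city": ["Paris", "Lima", None, "Paris", "Lima", "Oslo", "Paris", "Oslo"],
})
clean = preprocess(raw)
```

In this sketch the row with age 120 is dropped by the IQR rule, so clean has seven rows. Writing the result with clean.to_csv("preprocessed_dummy_data.csv", index=False) would mirror the final step described above.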
- Python 3.8+ (recommended)
- A virtual environment (optional but recommended)
- Install dependencies with:
pip install -r requirements.txt
- Activate your virtual environment and run:
python app.py
- The script prints progress to the console and writes preprocessed_dummy_data.csv when finished.
If you see a traceback that ends with KeyError: 'column_name', it means the code tried to access a DataFrame column literally named column_name that does not exist. This typically happens when a placeholder column name was left in the script or when a column was renamed or removed earlier in the pipeline. It does not mean the data is "already clean" — it indicates a mismatch between the column names the code expects and the actual columns in the data.
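One way to catch this mismatch early is to compare the expected column names against the DataFrame before the pipeline touches them. The helper below is a hypothetical sketch (require_columns is not part of app.py, and the sample columns are made up); it raises the same kind of error, but with a message that lists what is missing and what is actually available.

```python
from typing import List

import pandas as pd


def require_columns(df: pd.DataFrame, expected: List[str]) -> None:
    """Raise a descriptive KeyError if any expected column is absent."""
    missing = [col for col in expected if col not in df.columns]
    if missing:
        raise KeyError(
            f"Missing columns: {missing}. Available columns: {list(df.columns)}"
        )


df = pd.DataFrame({"age": [25, 32], "city": ["Paris", "Lima"]})

# All expected columns exist, so this returns without raising.
require_columns(df, ["age", "city"])

# A leftover placeholder name triggers the descriptive error.
try:
    require_columns(df, ["age", "column_name"])
except KeyError as err:
    message = str(err)
    print(message)
```

Printing df.columns.tolist() just before the failing line is a quicker alternative when you only need to see which names the data actually contains.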
Contributions are welcome. For small changes (typos, documentation), open a pull request. For larger changes, please open an issue first to discuss the proposal.
This project is released under the MIT License.