A hands-on data science workshop that demonstrates building an end-to-end machine learning pipeline using Kedro. This project predicts Singapore HDB (Housing & Development Board) resale prices based on proximity to MRT stations and shopping malls.
- Data Engineering: Extract, clean, and transform housing, transport, and geolocation data
- Feature Engineering: Calculate distances to amenities using geographical coordinates
- Machine Learning: Train a linear regression model to predict property prices
- Data Visualization: Create interactive maps showing Singapore's housing and transport infrastructure
Option A: Using uv (Recommended)

```bash
uv sync
```

Option B: Using pip

```bash
pip install -r requirements.txt
```

To run the complete pipeline:

```bash
kedro run
```

This will:
- Extract and clean HDB resale data, MRT stations, and mall locations
- Generate geographical features (distances to nearest amenities)
- Train a linear regression model
- Create visualizations and performance reports
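Under the hood, `kedro run` executes the project's default pipeline, which chains the four stages together. As a rough sketch (module and function names here are assumptions, not the project's actual source), the pipeline registry might look like:

```python
# src/<package>/pipeline_registry.py -- illustrative sketch, not the actual project file
from kedro.pipeline import Pipeline

from .pipelines import extract, clean, transform, model  # hypothetical pipeline modules


def register_pipelines() -> dict[str, Pipeline]:
    """Register the project's pipelines so `kedro run --pipeline <name>` works."""
    extract_pipeline = extract.create_pipeline()
    clean_pipeline = clean.create_pipeline()
    transform_pipeline = transform.create_pipeline()
    model_pipeline = model.create_pipeline()
    return {
        "extract": extract_pipeline,
        "clean": clean_pipeline,
        "transform": transform_pipeline,
        "model": model_pipeline,
        # `kedro run` with no arguments executes the default pipeline
        "__default__": extract_pipeline + clean_pipeline + transform_pipeline + model_pipeline,
    }
```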
Pipeline Visualization: View the interactive pipeline graph:
```bash
kedro viz
```

Open your browser to see the data flow, pipeline dependencies, and execution status.
Interactive Map: Open the Jupyter notebook to view Singapore's housing locations:
```bash
kedro jupyter notebook
```

Navigate to notebooks/map_view.ipynb to see HDB locations (red), MRT stations (blue), and malls (green) on an interactive map.
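If you want to recreate the map outside the notebook, here is a minimal sketch using folium (an assumption about the mapping library; the file paths and column names are also illustrative):

```python
import folium
import pandas as pd

# Hypothetical cleaned datasets; paths and column names are assumptions
hdb = pd.read_csv("data/02_intermediate/hdb_clean.csv")
mrt = pd.read_csv("data/02_intermediate/mrt_clean.csv")
malls = pd.read_csv("data/02_intermediate/malls_clean.csv")

# Centre the map on Singapore
m = folium.Map(location=[1.3521, 103.8198], zoom_start=12)

# One colour per layer: HDB red, MRT blue, malls green
for df, colour in [(hdb, "red"), (mrt, "blue"), (malls, "green")]:
    for _, row in df.iterrows():
        folium.CircleMarker(
            location=[row["latitude"], row["longitude"]],
            radius=3,
            color=colour,
            fill=True,
        ).add_to(m)

m.save("singapore_map.html")  # or display `m` directly in a notebook cell
```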
Model Outputs: Check the data/08_reporting/ folder for:
- Model performance metrics
- Accessibility heatmap visualization
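The metrics come from a plain linear regression evaluated on held-out data. A hedged sketch of how such a report could be produced (feature columns, file paths, and the train/test split are illustrative assumptions, not the project's exact code):

```python
import json

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Hypothetical feature table produced by the transform pipeline
features = pd.read_csv("data/04_feature/hdb_features.csv")
X = features[["dist_to_nearest_mrt", "dist_to_nearest_mall"]]  # assumed column names
y = features["resale_price"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Kedro would normally route this output through the Data Catalog;
# it is written directly here for clarity
metrics = {
    "mae": float(mean_absolute_error(y_test, y_pred)),
    "r2": float(r2_score(y_test, y_pred)),
}
with open("data/08_reporting/model_metrics.json", "w") as f:
    json.dump(metrics, f, indent=2)
```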
The project is organized into four modular pipelines:

- Extract Pipeline: Fetches HDB resale prices, MRT station data, and mall geodata
- Clean Pipeline: Validates and standardizes the datasets
- Transform Pipeline: Calculates geographical features and distances (see the sketch after this list)
- Model Pipeline: Trains and evaluates the price prediction model
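The transform step's distance features boil down to the great-circle (haversine) distance between coordinate pairs. A minimal sketch, with illustrative names only:

```python
import numpy as np


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))  # Earth radius ~6371 km


# e.g. distance from an HDB flat in Bishan to Bishan MRT station
print(haversine_km(1.3508, 103.8486, 1.3513, 103.8492))
```

Each flat's nearest-amenity feature is then simply the minimum of this distance over all MRT stations (or malls).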
Run individual pipelines:

```bash
kedro run --pipeline extract    # Data extraction only
kedro run --pipeline clean      # Data cleaning only
kedro run --pipeline transform  # Feature engineering only
kedro run --pipeline model      # Model training only
```

This workshop showcases several data science and engineering best practices:
- Modular Pipeline Design: Code is organized into reusable, testable pipeline components (extract, clean, transform, model)
- Data Catalog: Centralized data management with automatic loading/saving and format handling (a sample entry is sketched after this list)
- Data Versioning: Automatic versioning of model outputs and datasets for reproducibility
- Configuration Management: Parameters separated from code using YAML configuration files
- Environment Isolation: Dependencies managed with `uv.lock` for reproducible environments
- Testing: Unit tests for pipeline components to ensure code quality
- Documentation: Clear separation between raw, cleaned, and processed data layers
- Visualization: Interactive pipeline exploration with Kedro Viz
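For example, a catalog entry and a parameter block might look like the following (dataset names, paths, and values are illustrative; the project's real configuration lives under conf/base/). Note that the dataset class is spelled `pandas.CSVDataSet` on older Kedro versions:

```yaml
# conf/base/catalog.yml (illustrative entry)
hdb_resale_prices:
  type: pandas.CSVDataset
  filepath: data/01_raw/hdb_resale_prices.csv

# conf/base/parameters.yml (illustrative)
model_options:
  test_size: 0.2
  random_state: 42
```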
Run the unit tests with:

```bash
pytest
```
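A unit test for the haversine helper sketched earlier might look like this (illustrative only; the module path is hypothetical):

```python
# tests/test_distances.py -- illustrative unit test for the assumed haversine_km helper
import pytest

from hdb_workshop.distances import haversine_km  # hypothetical module path


def test_haversine_zero_for_identical_points():
    assert haversine_km(1.3521, 103.8198, 1.3521, 103.8198) == pytest.approx(0.0)


def test_haversine_known_distance():
    # Changi Airport to Jurong East is roughly 28 km as the crow flies
    assert haversine_km(1.3644, 103.9915, 1.3329, 103.7436) == pytest.approx(27.8, abs=2.0)
```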