Singapore HDB Resale Price Prediction Workshop

Powered by Kedro

Overview

A hands-on data science workshop that demonstrates building an end-to-end machine learning pipeline using Kedro. This project predicts Singapore HDB (Housing & Development Board) resale prices based on proximity to MRT stations and shopping malls.

What This Covers

  • Data Engineering: Extract, clean, and transform housing, transport, and geolocation data
  • Feature Engineering: Calculate distances to amenities using geographical coordinates (a distance calculation is sketched below)
  • Machine Learning: Train a linear regression model to predict property prices
  • Data Visualization: Create interactive maps showing Singapore's housing and transport infrastructure
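
The distance features come from latitude/longitude pairs. As a minimal sketch of the kind of calculation involved (the function and example coordinates are illustrative, not taken from the project code):

```python
import math


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


# Example: approximate distance between two points in central Singapore
print(round(haversine_km(1.3324, 103.8472, 1.3040, 103.8318), 2))
```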

Quick Setup

1. Install Dependencies

Option A: Using uv (Recommended)

uv sync

Option B: Using pip

pip install -r requirements.txt

2. Run the Pipeline

kedro run

This will:

  • Extract and clean HDB resale data, MRT stations, and mall locations
  • Generate geographical features (distances to nearest amenities)
  • Train a linear regression model (sketched below)
  • Create visualizations and performance reports
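
The training step is a plain linear regression; with scikit-learn, for example, it might look like this (the feature names and toy data are illustrative assumptions, not the project's actual code or data):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Toy stand-in for the model input table produced by the transform pipeline.
df = pd.DataFrame({
    "distance_to_mrt_km": [0.3, 1.2, 0.8, 2.5, 0.5, 1.9],
    "distance_to_mall_km": [0.5, 0.9, 1.4, 2.0, 0.7, 1.1],
    "resale_price": [620_000, 480_000, 530_000, 400_000, 600_000, 450_000],
})

X = df[["distance_to_mrt_km", "distance_to_mall_km"]]
y = df["resale_price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)
print(f"R^2: {r2_score(y_test, pred):.3f}, MAE: {mean_absolute_error(y_test, pred):,.0f}")
```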

3. Explore Results

Pipeline Visualization: View the interactive pipeline graph:

kedro viz

Open your browser to see the data flow, pipeline dependencies, and execution status.

Interactive Map: Open the Jupyter notebook to view Singapore's housing locations:

kedro jupyter notebook

Navigate to notebooks/map_view.ipynb to see HDB locations (red), MRT stations (blue), and malls (green) on an interactive map.
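
Such a map can be built with a library like folium; a minimal sketch of the idea (coordinates and styling here are illustrative, not the notebook's actual code):

```python
import folium

# Centre the map on Singapore.
m = folium.Map(location=[1.3521, 103.8198], zoom_start=12)

# Illustrative points only: one HDB block (red), one MRT station (blue), one mall (green).
points = [
    ("HDB block", 1.3324, 103.8472, "red"),
    ("MRT station", 1.3040, 103.8318, "blue"),
    ("Mall", 1.3006, 103.8390, "green"),
]
for name, lat, lon, colour in points:
    folium.CircleMarker(
        location=[lat, lon], radius=6, color=colour, fill=True, popup=name
    ).add_to(m)

m.save("singapore_map.html")  # or simply display `m` in a notebook cell
```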

Model Outputs: Check the data/08_reporting/ folder for:

  • Model performance metrics
  • Accessibility heatmap visualization

Project Structure

  • Extract Pipeline: Fetches HDB resale prices, MRT station data, and mall geodata
  • Clean Pipeline: Validates and standardizes the datasets (see the node sketch below)
  • Transform Pipeline: Calculates geographical features and distances
  • Model Pipeline: Trains and evaluates the price prediction model
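
Each of these follows the same Kedro pattern: plain Python functions wrapped in node() and composed with pipeline(). A minimal sketch of the pattern (the function and dataset names are illustrative, not this project's):

```python
import pandas as pd
from kedro.pipeline import node, pipeline


def clean_resale_data(raw: pd.DataFrame) -> pd.DataFrame:
    """Example cleaning step: drop rows with missing coordinates."""
    return raw.dropna(subset=["latitude", "longitude"])


def create_pipeline(**kwargs):
    return pipeline(
        [
            node(
                func=clean_resale_data,
                inputs="raw_hdb_resale",      # defined in the Data Catalog
                outputs="clean_hdb_resale",
                name="clean_resale_data_node",
            ),
        ]
    )
```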

Running Individual Pipelines

kedro run --pipeline extract    # Data extraction only
kedro run --pipeline clean      # Data cleaning only
kedro run --pipeline transform  # Feature engineering only
kedro run --pipeline model      # Model training only
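
These names resolve through the project's pipeline registry, which maps each modular pipeline to a key. A minimal sketch of that pattern, using stub pipelines so it stands alone (the real project wires in its actual extract/clean/transform/model pipelines):

```python
from kedro.pipeline import Pipeline, node, pipeline


def passthrough(df):
    """Placeholder so this sketch is self-contained; real nodes live in each pipeline module."""
    return df


def make_stub(key: str) -> Pipeline:
    return pipeline(
        [node(passthrough, inputs=f"{key}_input", outputs=f"{key}_output", name=f"{key}_node")]
    )


def register_pipelines() -> dict[str, Pipeline]:
    pipes = {key: make_stub(key) for key in ("extract", "clean", "transform", "model")}
    # `kedro run` with no --pipeline flag runs the combined default pipeline.
    pipes["__default__"] = sum(pipes.values(), Pipeline([]))
    return pipes
```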

Best Practices Demonstrated

This workshop showcases several data science and engineering best practices:

  • Modular Pipeline Design: Code is organized into reusable, testable pipeline components (extract, clean, transform, model)
  • Data Catalog: Centralized data management with automatic loading/saving and format handling
  • Data Versioning: Automatic versioning of model outputs and datasets for reproducibility
  • Configuration Management: Parameters separated from code using YAML configuration files (see the parameter-wiring sketch below)
  • Environment Isolation: Dependencies managed with uv.lock for reproducible environments
  • Testing: Unit tests for pipeline components to ensure code quality
  • Documentation: Clear separation between raw, cleaned, and processed data layers
  • Visualization: Interactive pipeline exploration with Kedro Viz
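
As an example of configuration-driven code, model settings would typically live in conf/base/parameters.yml and reach a node as a params: input instead of being hard-coded. A minimal sketch of that wiring (the parameter group and feature names are assumptions, not this project's configuration):

```python
import pandas as pd
from kedro.pipeline import node


def train_model(data: pd.DataFrame, model_options: dict) -> pd.Series:
    """Placeholder training step whose behaviour is driven entirely by configuration."""
    features = model_options["features"]   # e.g. ["distance_to_mrt_km"]
    return data[features].mean()           # stand-in for fitting a real model


# Kedro injects the "model_options" group from parameters.yml at run time.
train_node = node(
    func=train_model,
    inputs=["model_input_table", "params:model_options"],
    outputs="model_summary",
    name="train_model_node",
)
```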

Testing

pytest
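
Because Kedro nodes are plain Python functions, they can be unit-tested in isolation. A minimal sketch of such a test (the node is defined inline here for illustration; in the project it would be imported from the pipeline package):

```python
import pandas as pd


def clean_resale_data(raw: pd.DataFrame) -> pd.DataFrame:
    """Illustrative node: drop rows with missing coordinates."""
    return raw.dropna(subset=["latitude", "longitude"])


def test_clean_resale_data_drops_missing_coordinates():
    raw = pd.DataFrame({
        "latitude": [1.33, None],
        "longitude": [103.85, 103.80],
        "resale_price": [620_000, 480_000],
    })
    cleaned = clean_resale_data(raw)
    assert len(cleaned) == 1
    assert cleaned["resale_price"].iloc[0] == 620_000
```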
