Real estate data analytics platform with ML price prediction, Snowflake data warehouse, and automated market intelligence pipeline.

adeeshperera/real-estate-analytics-platform

Real Estate Data Analytics Platform

A comprehensive big data analytics project for real estate market analysis using modern data engineering and machine learning techniques. This project implements a complete data pipeline from ingestion to predictive modeling using Snowflake, Python, and machine learning algorithms.

🏗️ Project Architecture

This project implements a modern data analytics architecture with the following components:

  • Data Source: Otodom.pl real estate listings (100K+ records)
  • Real-time Data Pipeline: AWS S3 + Snowpipe for automated data ingestion
  • Data Warehouse: Snowflake Cloud Data Platform
  • Data Processing: Python with pandas, dask for parallel processing
  • Machine Learning: Scikit-learn Random Forest for price prediction
  • Geocoding: Geopy for address enrichment
  • Translation: Google Translate API for multilingual support

📊 Dataset Overview

The project analyzes apartment listings from major Polish cities with the following attributes:

  • Property Details: Price, surface area, number of rooms, location
  • Features: Balcony/garden/terrace, parking space, heating type
  • Metadata: Advertiser type, market type (primary/secondary), descriptions
  • Location Data: GPS coordinates, detailed address information
  • Translations: Multi-language support for property descriptions

Sample Data Structure

title,price,market,surface,location,no_of_rooms,form_of_property,url,is_for_sale,posting_id
"Przestronne|Zadbane Osiedle|Ochrona|Monitoring",899000,"[""market"",""secondary""]",48.8,"ul. Jana Kazimierza, Wola, Warszawa, mazowieckie",2,"pełna własność","https://www.otodom.pl/pl/oferta/...",true,"4tFaF"

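The sample row stores the market type as a JSON array embedded in a CSV field, so it needs a second parsing step after the CSV reader. The following is a minimal sketch of that step using only the standard library. The helper name `parse_listing`, the output keys, and the assumption that the second array element holds the market type are illustrative and not taken from the project's scripts.

```python
import csv
import io
import json

# Header and row copied from the sample above (the URL is truncated in the source).
SAMPLE = '''title,price,market,surface,location,no_of_rooms,form_of_property,url,is_for_sale,posting_id
"Przestronne|Zadbane Osiedle|Ochrona|Monitoring",899000,"[""market"",""secondary""]",48.8,"ul. Jana Kazimierza, Wola, Warszawa, mazowieckie",2,"pełna własność","https://www.otodom.pl/pl/oferta/...",true,"4tFaF"
'''


def parse_listing(row: dict) -> dict:
    """Convert one raw CSV row (all strings) into typed values."""
    market = json.loads(row["market"])  # e.g. ["market", "secondary"]
    price = float(row["price"])
    surface = float(row["surface"])
    return {
        "posting_id": row["posting_id"],
        "price": price,
        "surface": surface,
        "no_of_rooms": int(row["no_of_rooms"]),
        # Assumption: the second element of the array is the market type.
        "market_type": market[1],
        "is_for_sale": row["is_for_sale"].lower() == "true",
        "price_per_sqm": round(price / surface, 2),
    }


listing = parse_listing(next(csv.DictReader(io.StringIO(SAMPLE))))
print(listing["market_type"], listing["price_per_sqm"])
```
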
🚀 Features

1. Real-time Data Pipeline

  • Automated Data Ingestion: Snowpipe with S3 auto-ingest
  • Data Chunking: 1000-record chunks for simulated real-time processing (63 chunks)
  • Error Handling: Robust error handling and monitoring
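
As an illustration of the chunking step, here is a minimal sketch that splits a CSV file into 1000-record files named `chunk_0.csv`, `chunk_1.csv`, and so on. It is not the project's `split.py`; the function name `split_csv` and its parameters are hypothetical. Repeating the header in every chunk is an assumption that matches a file format configured with `SKIP_HEADER = 1`.

```python
import csv
from itertools import islice
from pathlib import Path


def split_csv(source: Path, out_dir: Path, chunk_size: int = 1000) -> int:
    """Write `source` as chunk_0.csv, chunk_1.csv, ... and return the chunk count."""
    out_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    with source.open(newline="", encoding="utf-8") as src:
        reader = csv.reader(src)
        header = next(reader)  # repeated at the top of every chunk
        while True:
            rows = list(islice(reader, chunk_size))  # next chunk_size records
            if not rows:
                break
            chunk_path = out_dir / f"chunk_{count}.csv"
            with chunk_path.open("w", newline="", encoding="utf-8") as dst:
                writer = csv.writer(dst)
                writer.writerow(header)
                writer.writerows(rows)
            count += 1
    return count
```

Reading through `csv.reader` rather than splitting on raw lines keeps quoted fields that contain commas or line breaks intact.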

2. Data Transformation & Enrichment

  • Address Geocoding: Reverse-geocode GPS coordinates into detailed address information
  • Text Translation: Multi-language support using Google Translate API
  • Data Cleaning: Outlier detection and removal using IQR method
  • Feature Engineering: Price per square meter, interaction features
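
The cleaning and feature-engineering steps can be sketched as follows. This is an illustrative example rather than the project's code: it assumes the IQR filter is applied to price per square meter with the conventional 1.5 multiplier, and the names `iqr_bounds` and `clean_listings` are hypothetical.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return (low, high) fences at Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def clean_listings(listings, k=1.5):
    """Add price_per_sqm to each listing and drop rows outside the IQR fences."""
    enriched = [
        {**row, "price_per_sqm": row["price"] / row["surface"]}
        for row in listings
        if row["surface"] > 0  # avoid division by zero on bad records
    ]
    low, high = iqr_bounds([row["price_per_sqm"] for row in enriched], k)
    return [row for row in enriched if low <= row["price_per_sqm"] <= high]


# Made-up listings: five ordinary apartments and one obvious outlier.
listings = [
    {"price": 500_000, "surface": 50.0},
    {"price": 550_000, "surface": 50.0},
    {"price": 600_000, "surface": 50.0},
    {"price": 650_000, "surface": 50.0},
    {"price": 700_000, "surface": 50.0},
    {"price": 5_000_000, "surface": 50.0},
]
cleaned = clean_listings(listings)
print(len(cleaned))  # the 100,000 PLN/m² listing is removed
```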

3. Machine Learning Model

  • Algorithm: Random Forest Regressor with hyperparameter tuning
  • Performance: 96.58% R² score on test data
  • Features: Automated feature selection, cross-validation
  • Evaluation: RMSE: 182,424 PLN, robust cross-validation

4. Data Quality & Validation

  • Data Profiling: Comprehensive data quality checks
  • Missing Value Handling: Intelligent imputation strategies
  • Outlier Detection: Statistical methods for anomaly detection

📁 Project Structure

data-analytics/
├── README.md                          # Project documentation
├── Guide.pdf                          # Comprehensive project guide
├── sample_dataset.csv                 # Sample data for testing
│
├── Datasets/                          # Raw and processed datasets
│   ├── Otodom_Apartment_major_cities_dataset_ORG_JSON_Format_Part1.csv
│   ├── Otodom_Apartment_major_cities_dataset_ORG_JSON_Format_Part2.csv
│   ├── Otodom_Apartment_major_cities_dataset_ORG_JSON_Format_Part3.csv
│   ├── split.py                       # Data chunking script
│   └── real-time-data-pipeline/
│       ├── upload.py                  # S3 upload automation
│       └── data_chunks/               # Chunked data files (63 chunks)
│           ├── chunk_0.csv
│           ├── chunk_1.csv
│           └── ... (chunk_62.csv)
│
├── Address and Title/                 # Geocoded and translated data
│   ├── Apartment_major_cities_dataset_Address.csv
│   └── Apartment_major_cities_dataset_Translate.csv
│
├── Prediction Model/                  # Machine learning components
│   ├── predict_prices.py             # ML model implementation
│   ├── Prediction_Model_Output.jpg   # Model results visualization
│   └── real estate price prediction model.txt
│
└── Scripts/                          # Implementation scripts
    ├── Python_scripts/
    │   ├── 1. Python_Prerequisites_Otodom_Analysis.txt
    │   ├── 2. fetch_address_Analysis.py        # Geocoding implementation
    │   ├── 3. fetch_address_Analysis2.py       # Enhanced geocoding
    │   ├── 4. translate_text_gsheet_Analysis.py # Translation service
    │   └── 5. load_data_gsheet_to_SF_Analysis.py # Snowflake integration
    └── Snowflake_scripts/
        ├── 1. Load_Dataset_to_SF.txt           # Data loading procedures
        ├── 2. Snowflake_script_Otodom_Analysis.txt # SQL transformations
        └── 3. Problems_and_Solutions.txt       # Troubleshooting guide

🛠️ Installation & Setup

Prerequisites

  1. Python Environment (using zsh shell)
conda create --name real_estate_analytics python=3.12
conda activate real_estate_analytics
  2. Required Packages
pip install pandas SQLAlchemy "snowflake-connector-python[pandas]"
pip install snowflake-sqlalchemy gspread gspread-dataframe
pip install geopy dask deep-translator boto3
pip install scikit-learn numpy matplotlib seaborn
  3. External Services
  • Snowflake account with appropriate warehouse setup
  • AWS S3 bucket for data storage
  • Google Cloud credentials for translation API
  • Google Sheets API access (optional)

Environment Configuration

  1. Snowflake Setup
-- Create database and warehouse
CREATE OR REPLACE DATABASE REAL_ESTATE_DB;
CREATE OR REPLACE WAREHOUSE SNOWPIP_WH;

-- Create staging area and file format
CREATE OR REPLACE STAGE real_estate_stage;
CREATE OR REPLACE FILE FORMAT csv_format
    TYPE = 'CSV'
    FIELD_DELIMITER = ','
    SKIP_HEADER = 1;
  2. AWS Configuration
# Configure AWS credentials
aws_access_key_id = 'your_access_key'
aws_secret_access_key = 'your_secret_key'
bucket_name = 'your_s3_bucket'

📈 Usage

1. Data Ingestion

# Split large dataset into chunks
python Datasets/split.py

# Upload to S3 with simulated real-time delay
python Datasets/real-time-data-pipeline/upload.py
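
A minimal sketch of an upload loop with a simulated real-time delay is shown below. It is not the project's `upload.py`: the function names, the `data_chunks/` key prefix, and the 5-second default delay are assumptions made for illustration. The S3 call is injected as a callable so the loop can be exercised without AWS access; `make_s3_uploader` wraps boto3's `upload_file` and lets boto3 resolve credentials from the environment or the AWS config files.

```python
import time
from pathlib import Path
from typing import Callable


def upload_chunks(
    chunk_dir: Path,
    upload: Callable[[Path, str], None],
    prefix: str = "data_chunks/",
    delay_seconds: float = 5.0,
) -> list[str]:
    """Upload chunk_*.csv in numeric order, pausing between files."""
    chunks = sorted(
        chunk_dir.glob("chunk_*.csv"),
        key=lambda p: int(p.stem.split("_")[1]),  # chunk_10 sorts after chunk_2
    )
    keys = []
    for i, path in enumerate(chunks):
        key = f"{prefix}{path.name}"
        upload(path, key)
        keys.append(key)
        if i < len(chunks) - 1:
            time.sleep(delay_seconds)  # simulate records arriving over time
    return keys


def make_s3_uploader(bucket_name: str) -> Callable[[Path, str], None]:
    """Return an uploader backed by boto3 (imported lazily)."""
    import boto3

    s3 = boto3.client("s3")
    return lambda path, key: s3.upload_file(str(path), bucket_name, key)
```

A call such as `upload_chunks(Path("data_chunks"), make_s3_uploader("your_s3_bucket"))` would then send the files one by one, and each new object can trigger Snowpipe auto-ingest if the pipe is configured for that bucket.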

2. Data Processing & Enrichment

# Geocode addresses from coordinates
python "Scripts/Python_scripts/2. fetch_address_Analysis.py"

# Translate property descriptions
python "Scripts/Python_scripts/4. translate_text_gsheet_Analysis.py"

# Load processed data to Snowflake
python "Scripts/Python_scripts/5. load_data_gsheet_to_SF_Analysis.py"
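
For orientation, the coordinate-to-address step might look like the sketch below. It assumes geopy's Nominatim geocoder with a one-second rate limit; the README only states that geopy is used, so the choice of Nominatim, the user agent string, and the function names are assumptions. The reverse-geocoding callable is passed in so the lookup logic can be tried with a stub instead of a live service.

```python
from typing import Callable, Optional


def make_reverse_geocoder(user_agent: str = "real-estate-analytics-example") -> Callable:
    """Build a rate-limited reverse geocoder (geopy is imported lazily)."""
    from geopy.extra.rate_limiter import RateLimiter
    from geopy.geocoders import Nominatim

    geolocator = Nominatim(user_agent=user_agent)
    # Nominatim's public service expects at most about one request per second.
    return RateLimiter(geolocator.reverse, min_delay_seconds=1)


def fetch_address(lat: float, lon: float, reverse: Callable) -> Optional[dict]:
    """Return the structured address for a coordinate pair, or None if not found."""
    location = reverse((lat, lon))
    if location is None:
        return None
    return location.raw.get("address")
```

With network access, `fetch_address(52.23, 21.01, make_reverse_geocoder())` would return a dictionary of address components for a point in Warsaw; the exact keys depend on the geocoding service.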

3. Machine Learning Model

# Train and evaluate the price prediction model
python "Prediction Model/predict_prices.py"

🎯 Model Performance

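The Random Forest price prediction model reports the following results on the held-out test set. R² is computed on log-transformed prices, while RMSE is reported in PLN on the original price scale:

  • R² Score: 96.58% (log scale)
  • Cross-validation: 96.22% ± 0.52%
  • RMSE: 182,424 PLN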
  • Primary Feature: Price per square meter
  • Optimization: Grid search with 5-fold CV

Model Features

  • Logarithmic transformation for price normalization
  • Feature selection using model-based selection
  • Hyperparameter optimization with GridSearchCV
  • Robust cross-validation strategy
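
The training approach described above can be sketched as follows. This is not `predict_prices.py`: the data is synthetic, the two features and the small parameter grid are chosen only to keep the example fast, `log1p`/`expm1` is one possible choice of logarithmic transform, and the model-based feature selection step is omitted. It does show the two metrics being computed on different scales, with R² on log prices and RMSE on prices converted back to PLN.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic listings (illustrative only): price grows with surface, with noise.
rng = np.random.default_rng(42)
surface = rng.uniform(27, 102, size=300)
rooms = rng.integers(1, 5, size=300)
price = surface * 10_000 * rng.lognormal(mean=0.0, sigma=0.1, size=300)

X = np.column_stack([surface, rooms])
y_log = np.log1p(price)  # train on log prices to reduce skew

X_train, X_test, y_train, y_test = train_test_split(
    X, y_log, test_size=0.2, random_state=42
)

# Small grid with 5-fold cross-validation, scored by R² on the log scale.
param_grid = {"max_depth": [4, 8], "min_samples_leaf": [3, 6]}
search = GridSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=42),
    param_grid,
    cv=5,
    scoring="r2",
)
search.fit(X_train, y_train)

pred_log = search.predict(X_test)
r2_log = r2_score(y_test, pred_log)  # log scale
rmse_pln = float(
    np.sqrt(mean_squared_error(np.expm1(y_test), np.expm1(pred_log)))
)  # back on the original price scale

print("Best parameters:", search.best_params_)
print(f"R-squared (log scale): {r2_log:.4f}")
print(f"RMSE (PLN): {rmse_pln:.2f}")
```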

Model Output Example

Selected features: ['PRICE_PER_SQM']
Cross-validation R2 scores: [0.9642408  0.95988394 0.96591911 0.95903045 0.96197944]
Mean CV R2 score: 0.9622 (+/- 0.0052)
Best parameters: {'max_depth': 8, 'min_samples_leaf': 6, 'min_samples_split': 15, 'n_estimators': 100}

Root Mean Squared Error (RMSE): 182424.33
R-squared Score (log scale): 0.9658
Average Price in Test Set: 507834.36

🏢 Business Applications

This platform enables various real estate analytics use cases:

  1. Price Prediction: Automated property valuation for buyers and sellers
  2. Market Analysis: Trend identification and forecasting for investors
  3. Investment Decisions: Data-driven investment strategies
  4. Portfolio Management: Performance tracking and optimization
  5. Market Research: Comprehensive market intelligence for agencies

📊 Key Insights

  • Data Volume: 100K+ property listings across major Polish cities
  • Geographic Coverage: Warsaw, Krakow, Wroclaw, Katowice, and more
  • Price Range: From 467K to 899K PLN for apartments
  • Surface Area: 27-102 m² apartment sizes
  • Market Split: Primary and secondary market analysis
  • Data Processing: 63 chunks for real-time simulation

🔧 Technical Highlights

  • Scalable Architecture: Cloud-native design with Snowflake
  • Real-time Processing: Snowpipe for automated data ingestion
  • Parallel Processing: Dask for efficient data processing
  • Modern ML Pipeline: Feature engineering and model optimization
  • Comprehensive Logging: Full audit trail and error handling
  • Multi-language Support: Translation capabilities for international use
  • Geocoding Integration: Enhanced location intelligence

🤝 Contributing

This project is actively maintained and open to contributions from the community! Whether you're interested in data engineering, machine learning, or cloud infrastructure, there are many ways to contribute.

How to Contribute

  1. Fork the repository and clone it locally
  2. Create a feature branch (git checkout -b feature/AmazingFeature)
  3. Make your changes and test thoroughly
  4. Commit your changes (git commit -m 'Add some AmazingFeature')
  5. Push to your branch (git push origin feature/AmazingFeature)
  6. Open a Pull Request with a clear description of your changes

Areas for Contribution

  • Data Engineering: Enhance the real-time pipeline, add new data sources
  • Machine Learning: Improve model performance, add new prediction features
  • Data Visualization: Create dashboards and interactive visualizations
  • Documentation: Improve guides, add tutorials, translate content
  • Testing: Add unit tests, integration tests, data validation
  • Performance: Optimize processing speed and resource usage
  • New Features: Add support for new property types, markets, or regions

Getting Started

  1. Check the Issues for beginner-friendly tasks labeled good first issue
  2. Read the project documentation and setup guide
  3. Join discussions in Discussions for questions and ideas
  4. Follow the code style and testing conventions established in the project

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

👥 Project Maintainer

Lead Developer & Maintainer - Passionate about data analytics and open-source collaboration

  • Full-Stack Implementation: Data pipeline, ML models, cloud infrastructure
  • Project Vision: Building a comprehensive real estate analytics platform
  • Community Focus: Welcoming contributors and fostering open-source collaboration

Looking for contributors in data engineering, machine learning, and visualization!

About the Developer

This comprehensive real estate analytics platform was developed to demonstrate modern data engineering practices and advanced machine learning techniques in production environments. The project showcases professional expertise across multiple domains:

Technical Leadership & Architecture

  • Enterprise Data Engineering: Designed scalable cloud-native architectures using Snowflake and AWS
  • Real-time Data Processing: Implemented production-grade ETL pipelines with automated ingestion
  • Advanced Analytics: Developed an ML price model with a 96.58% R² score (log scale) on test data
  • Full-Stack Development: End-to-end platform implementation from data acquisition to insights

Professional Standards & Best Practices

  • Code Quality: Robust error handling, comprehensive logging, and modular design
  • Performance Optimization: Efficient parallel processing and resource management
  • Documentation Excellence: Detailed technical documentation and implementation guides
  • Open-source Leadership: Community-focused development and collaborative practices

Industry Impact & Innovation

  • Real Estate Intelligence: Automated valuation models for market analysis
  • Scalable Solutions: Processing 100K+ records with enterprise-grade performance
  • Modern Tech Stack: Integration of cutting-edge technologies and frameworks
  • Business Value: Practical applications for investment decisions and market research

Connect with the Project

  • GitHub: Follow this repository for updates and contributions
  • Issues: Report bugs, request features, or ask questions
  • Discussions: Join community conversations about real estate analytics

📞 Support

For questions, issues, or feature requests, please create an issue in the repository. I'm committed to helping contributors get started and making this project accessible to developers of all skill levels.

Common Questions

  • Getting Started: Check the installation guide and prerequisites
  • Data Access: Sample dataset provided; full dataset available on request
  • Cloud Setup: Detailed Snowflake and AWS configuration instructions included
  • Model Training: Step-by-step ML pipeline with example outputs

Built with ❤️ for the real estate analytics community

Empowering data-driven decisions in real estate through open-source collaboration
