Real Estate Data Analytics Platform

A comprehensive big data analytics project for real estate market analysis using modern data engineering and machine learning techniques. This project implements a complete data pipeline from ingestion to predictive modeling using Snowflake, Python, and machine learning algorithms.

🏗️ Project Architecture

This project implements a modern data analytics architecture with the following components:

Data Source: Otodom.pl real estate listings (100K+ records)
Real-time Data Pipeline: AWS S3 + Snowpipe for automated data ingestion
Data Warehouse: Snowflake Cloud Data Platform
Data Processing: Python with pandas, dask for parallel processing
Machine Learning: Scikit-learn Random Forest for price prediction
Geocoding: Geopy for address enrichment
Translation: Google Translate API for multilingual support

📊 Dataset Overview

The project analyzes apartment listings from major Polish cities with the following attributes:

Property Details: Price, surface area, number of rooms, location
Features: Balcony/garden/terrace, parking space, heating type
Metadata: Advertiser type, market type (primary/secondary), descriptions
Location Data: GPS coordinates, detailed address information
Translations: Multi-language support for property descriptions

Sample Data Structure

title,price,market,surface,location,no_of_rooms,form_of_property,url,is_for_sale,posting_id
"Przestronne|Zadbane Osiedle|Ochrona|Monitoring",899000,"[""market"",""secondary""]",48.8,"ul. Jana Kazimierza, Wola, Warszawa, mazowieckie",2,"pełna własność","https://www.otodom.pl/pl/oferta/...",true,"4tFaF"

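As a rough illustration of how one row of this structure might be read, the sketch below parses the sample row with the standard library. The `parse_listing` helper and the assumption that `location` ends with "city, voivodeship" are illustrative only and are not taken from the project scripts.

```python
import csv
import io
import json

# Header and row copied from the sample above (URL left truncated as in the source).
SAMPLE = '''title,price,market,surface,location,no_of_rooms,form_of_property,url,is_for_sale,posting_id
"Przestronne|Zadbane Osiedle|Ochrona|Monitoring",899000,"[""market"",""secondary""]",48.8,"ul. Jana Kazimierza, Wola, Warszawa, mazowieckie",2,"pełna własność","https://www.otodom.pl/pl/oferta/...",true,"4tFaF"
'''


def parse_listing(row):
    """Convert one raw CSV row (a dict of strings) into typed Python values."""
    market = json.loads(row["market"])  # JSON array, e.g. ["market", "secondary"]
    return {
        "posting_id": row["posting_id"],
        "price": float(row["price"]),
        "surface": float(row["surface"]),
        "no_of_rooms": int(row["no_of_rooms"]),
        "market_type": market[-1],
        # Assumption: the location string ends with "..., city, voivodeship".
        "city": row["location"].split(", ")[-2],
        "is_for_sale": row["is_for_sale"].lower() == "true",
    }


listing = parse_listing(next(csv.DictReader(io.StringIO(SAMPLE))))
print(listing["city"], listing["market_type"], listing["price"])
```

The `market` column holds a JSON array rather than a plain string, so it needs a JSON parse after the CSV parse.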
🚀 Features

1. Real-time Data Pipeline

Automated Data Ingestion: Snowpipe with S3 auto-ingest
Data Chunking: 1000-record chunks for simulated real-time processing (63 chunks)
Error Handling: Robust error handling and monitoring
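
The chunking step could look roughly like the following. This is a minimal standard-library sketch, not the contents of `split.py`; the function name `split_csv` and its parameters are hypothetical. It repeats the header in every chunk, which fits a file format that skips one header line per file.

```python
import csv
from itertools import islice
from pathlib import Path


def split_csv(source, out_dir, chunk_size=1000):
    """Write `source` as chunk_0.csv, chunk_1.csv, ... and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with open(source, newline="", encoding="utf-8") as src:
        reader = csv.reader(src)
        header = next(reader)
        while True:
            rows = list(islice(reader, chunk_size))  # next block of records
            if not rows:
                break
            path = out_dir / f"chunk_{len(written)}.csv"
            with open(path, "w", newline="", encoding="utf-8") as dst:
                writer = csv.writer(dst)
                writer.writerow(header)  # repeat the header in each chunk
                writer.writerows(rows)
            written.append(path)
    return written
```

Streaming through `csv.reader` keeps memory use flat regardless of the input size.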

2. Data Transformation & Enrichment

Address Geocoding: Reverse-geocode GPS coordinates into detailed address information
Text Translation: Multi-language support using Google Translate API
Data Cleaning: Outlier detection and removal using IQR method
Feature Engineering: Price per square meter, interaction features
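
To make the cleaning and feature steps concrete, here is a small sketch of IQR-based outlier removal and the price-per-square-meter feature. The fence multiplier `k=1.5` is the conventional default and an assumption here, as is applying the filter to the price column; the helper names are hypothetical.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return (low, high) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def add_price_per_sqm(listings):
    """Attach a price_per_sqm field to each listing dict."""
    return [{**row, "price_per_sqm": row["price"] / row["surface"]} for row in listings]


def drop_price_outliers(listings, k=1.5):
    """Keep only listings whose price lies within the IQR fences."""
    low, high = iqr_bounds([row["price"] for row in listings], k)
    return [row for row in listings if low <= row["price"] <= high]
```

For the sample row above (899,000 PLN, 48.8 m²), `add_price_per_sqm` yields roughly 18,422 PLN per square meter.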

3. Machine Learning Model

Algorithm: Random Forest Regressor with hyperparameter tuning
Performance: R² of 0.9658 on test data (computed on log-transformed prices)
Features: Automated feature selection, cross-validation
Evaluation: RMSE: 182,424 PLN, robust cross-validation
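
A tuning loop of this kind might be set up as below. This is a sketch under stated assumptions, not the code in `predict_prices.py`: the candidate grid values are invented for illustration (each list merely includes the value reported as best later in this README), the natural log is assumed for the price transform, and `tune_random_forest` is a hypothetical name.

```python
import math

# Candidate values are illustrative assumptions; each list includes the value
# reported as best in this README's model output.
PARAM_GRID = {
    "n_estimators": [100, 200],
    "max_depth": [6, 8, 10],
    "min_samples_split": [10, 15],
    "min_samples_leaf": [4, 6],
}


def to_log_price(prices):
    """Natural log of each price, used here as the training target."""
    return [math.log(p) for p in prices]


def tune_random_forest(X, prices, cv=5, random_state=42):
    """Grid-search a RandomForestRegressor on log prices, scored by R²."""
    # Imported lazily so the helpers above load without scikit-learn installed.
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import GridSearchCV

    search = GridSearchCV(
        RandomForestRegressor(random_state=random_state),
        PARAM_GRID,
        cv=cv,
        scoring="r2",
        n_jobs=-1,
    )
    search.fit(X, to_log_price(prices))
    return search.best_estimator_, search.best_params_, search.best_score_
```

Predictions from the returned estimator are on the log scale, so they need `math.exp` before being read as PLN.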

4. Data Quality & Validation

Data Profiling: Comprehensive data quality checks
Missing Value Handling: Intelligent imputation strategies
Outlier Detection: Statistical methods for anomaly detection

📁 Project Structure

data-analytics/
├── README.md                          # Project documentation
├── Guide.pdf                          # Comprehensive project guide
├── sample_dataset.csv                 # Sample data for testing
│
├── Datasets/                          # Raw and processed datasets
│   ├── Otodom_Apartment_major_cities_dataset_ORG_JSON_Format_Part1.csv
│   ├── Otodom_Apartment_major_cities_dataset_ORG_JSON_Format_Part2.csv
│   ├── Otodom_Apartment_major_cities_dataset_ORG_JSON_Format_Part3.csv
│   ├── split.py                       # Data chunking script
│   └── real-time-data-pipeline/
│       ├── upload.py                  # S3 upload automation
│       └── data_chunks/               # Chunked data files (63 chunks)
│           ├── chunk_0.csv
│           ├── chunk_1.csv
│           └── ... (chunk_62.csv)
│
├── Address and Title/                 # Geocoded and translated data
│   ├── Apartment_major_cities_dataset_Address.csv
│   └── Apartment_major_cities_dataset_Translate.csv
│
├── Prediction Model/                  # Machine learning components
│   ├── predict_prices.py             # ML model implementation
│   ├── Prediction_Model_Output.jpg   # Model results visualization
│   └── real estate price prediction model.txt
│
└── Scripts/                          # Implementation scripts
    ├── Python_scripts/
    │   ├── 1. Python_Prerequisites_Otodom_Analysis.txt
    │   ├── 2. fetch_address_Analysis.py        # Geocoding implementation
    │   ├── 3. fetch_address_Analysis2.py       # Enhanced geocoding
    │   ├── 4. translate_text_gsheet_Analysis.py # Translation service
    │   └── 5. load_data_gsheet_to_SF_Analysis.py # Snowflake integration
    └── Snowflake_scripts/
        ├── 1. Load_Dataset_to_SF.txt           # Data loading procedures
        ├── 2. Snowflake_script_Otodom_Analysis.txt # SQL transformations
        └── 3. Problems_and_Solutions.txt       # Troubleshooting guide

🛠️ Installation & Setup

Prerequisites

Python Environment (using zsh shell)

conda create --name real_estate_analytics python=3.12
conda activate real_estate_analytics

Required Packages

pip install pandas SQLAlchemy "snowflake-connector-python[pandas]"
pip install snowflake-sqlalchemy gspread gspread-dataframe
pip install geopy dask deep-translator boto3
pip install scikit-learn numpy matplotlib seaborn
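
After installing, a quick check like the one below can confirm that the packages are importable. It is a convenience sketch rather than part of the project; the list maps the pip names above to their import names (for example, scikit-learn is imported as `sklearn`).

```python
from importlib.util import find_spec

# Import names for the packages installed above.
REQUIRED_MODULES = [
    "pandas", "sqlalchemy", "snowflake.connector", "snowflake.sqlalchemy",
    "gspread", "gspread_dataframe", "geopy", "dask", "deep_translator",
    "boto3", "sklearn", "numpy", "matplotlib", "seaborn",
]


def missing_modules(names):
    """Return the names that cannot be located in the current environment."""
    missing = []
    for name in names:
        try:
            found = find_spec(name) is not None
        except ModuleNotFoundError:  # raised when a parent package is absent
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    print("Missing:", missing_modules(REQUIRED_MODULES) or "none")
```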

External Services

Snowflake account with appropriate warehouse setup
AWS S3 bucket for data storage
Google Cloud credentials for translation API
Google Sheets API access (optional)

Environment Configuration

Snowflake Setup

-- Create database and warehouse
CREATE OR REPLACE DATABASE REAL_ESTATE_DB;
CREATE OR REPLACE WAREHOUSE SNOWPIP_WH;

-- Create staging area and file format
CREATE OR REPLACE STAGE real_estate_stage;
CREATE OR REPLACE FILE FORMAT csv_format
    TYPE = 'CSV'
    FIELD_DELIMITER = ','
    SKIP_HEADER = 1;

AWS Configuration

# Configure AWS credentials
aws_access_key_id = 'your_access_key'
aws_secret_access_key = 'your_secret_key'
bucket_name = 'your_s3_bucket'
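
Rather than writing real keys into a script, the same three settings can be read from environment variables. The sketch below assumes that approach; `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` are the standard names that boto3 also reads on its own, while `S3_BUCKET_NAME` and both helper functions are hypothetical.

```python
import os


def load_aws_config(env=os.environ):
    """Read S3 settings from environment variables instead of source code."""
    required = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "S3_BUCKET_NAME")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "aws_access_key_id": env["AWS_ACCESS_KEY_ID"],
        "aws_secret_access_key": env["AWS_SECRET_ACCESS_KEY"],
        "bucket_name": env["S3_BUCKET_NAME"],
    }


def make_s3_client(config):
    """Create a boto3 S3 client from the loaded settings."""
    import boto3  # imported lazily so load_aws_config works without boto3

    return boto3.client(
        "s3",
        aws_access_key_id=config["aws_access_key_id"],
        aws_secret_access_key=config["aws_secret_access_key"],
    )
```

Failing fast on a missing variable gives a clearer error than a rejected request halfway through an upload run.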

📈 Usage

1. Data Ingestion

# Split large dataset into chunks
python Datasets/split.py

# Upload to S3 with simulated real-time delay
python Datasets/real-time-data-pipeline/upload.py
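
The simulated real-time upload can be pictured as a loop that sends one chunk, waits, and sends the next. The sketch below is not `upload.py`; the 60-second default delay and the function names are assumptions. The upload action is passed in as a callable so the loop can be exercised without AWS access.

```python
import re
import time
from pathlib import Path


def chunk_index(path):
    """Extract the integer N from a file named chunk_N.csv."""
    match = re.fullmatch(r"chunk_(\d+)\.csv", Path(path).name)
    if match is None:
        raise ValueError(f"Unexpected chunk file name: {path}")
    return int(match.group(1))


def upload_in_order(chunk_dir, upload, delay_seconds=60, sleep=time.sleep):
    """Call upload(path) for every chunk in numeric order, pausing in between."""
    paths = sorted(Path(chunk_dir).glob("chunk_*.csv"), key=chunk_index)
    for i, path in enumerate(paths):
        upload(path)
        if i < len(paths) - 1:  # no pause needed after the last chunk
            sleep(delay_seconds)
    return paths
```

Sorting by the numeric index matters because a plain string sort would place `chunk_10.csv` before `chunk_2.csv`. With boto3, `upload` could wrap a call such as `s3_client.upload_file(str(path), bucket_name, path.name)`.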

2. Data Processing & Enrichment

# Geocode addresses from coordinates
python "Scripts/Python_scripts/2. fetch_address_Analysis.py"

# Translate property descriptions
python "Scripts/Python_scripts/4. translate_text_gsheet_Analysis.py"

# Load processed data to Snowflake
python "Scripts/Python_scripts/5. load_data_gsheet_to_SF_Analysis.py"
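
For orientation, reverse geocoding with Geopy might be structured as follows. This is a sketch, not the project's geocoding script: the choice of the Nominatim geocoder, the one-second rate limit, the coordinate rounding, and both function names are assumptions. Repeated coordinates are cached so each distinct point is looked up only once.

```python
def reverse_geocode_all(coords, reverse):
    """Map each (lat, lon) pair to an address string, caching repeated lookups."""
    cache = {}
    addresses = []
    for lat, lon in coords:
        key = (round(lat, 5), round(lon, 5))
        if key not in cache:
            location = reverse(key)  # geopy returns None when nothing is found
            cache[key] = location.address if location is not None else None
        addresses.append(cache[key])
    return addresses


def make_nominatim_reverse(user_agent, min_delay_seconds=1.0):
    """Build a rate-limited geopy reverse-geocoding callable."""
    # Imported lazily so reverse_geocode_all can be used and tested without geopy.
    from geopy.extra.rate_limiter import RateLimiter
    from geopy.geocoders import Nominatim

    geolocator = Nominatim(user_agent=user_agent)
    return RateLimiter(geolocator.reverse, min_delay_seconds=min_delay_seconds)
```

A typical call would be `reverse_geocode_all(points, make_nominatim_reverse("my-app-name"))`, where the user agent string is a placeholder.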

3. Machine Learning Model

# Train and evaluate the price prediction model
python "Prediction Model/predict_prices.py"

🎯 Model Performance

The Random Forest price prediction model reports the following results. Note that R² is computed on log-transformed prices, while RMSE is expressed in PLN:

R² Score: 0.9658 (log scale, test set)
Cross-validation R²: 0.9622 ± 0.0052
RMSE: 182,424 PLN
Primary Feature: Price per square meter
Optimization: Grid search with 5-fold CV

Model Features

Logarithmic transformation for price normalization
Feature selection using model-based selection
Hyperparameter optimization with GridSearchCV
Robust cross-validation strategy
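
Because the target is log-transformed, the two headline metrics live on different scales. The sketch below shows one plausible way they could be computed, assuming a natural-log transform and an RMSE taken after converting predictions back to PLN; the function names are illustrative and the exact procedure in `predict_prices.py` may differ.

```python
import math


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean squared error in the units of y_true."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def evaluate(prices, log_predictions):
    """Score log-scale predictions: R² on logs, RMSE after exponentiating."""
    log_prices = [math.log(p) for p in prices]
    predicted_prices = [math.exp(lp) for lp in log_predictions]
    return r2_score(log_prices, log_predictions), rmse(prices, predicted_prices)
```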

Model Output Example

Selected features: ['PRICE_PER_SQM']
Cross-validation R2 scores: [0.9642408  0.95988394 0.96591911 0.95903045 0.96197944]
Mean CV R2 score: 0.9622 (+/- 0.0052)
Best parameters: {'max_depth': 8, 'min_samples_leaf': 6, 'min_samples_split': 15, 'n_estimators': 100}

Root Mean Squared Error (RMSE): 182424.33
R-squared Score (log scale): 0.9658
Average Price in Test Set: 507834.36

🏢 Business Applications

This platform enables various real estate analytics use cases:

Price Prediction: Automated property valuation for buyers and sellers
Market Analysis: Trend identification and forecasting for investors
Investment Decisions: Data-driven investment strategies
Portfolio Management: Performance tracking and optimization
Market Research: Comprehensive market intelligence for agencies

📊 Key Insights

Data Volume: 100K+ property listings across major Polish cities
Geographic Coverage: Warsaw, Krakow, Wroclaw, Katowice, and more
Price Range: From 467K to 899K PLN for apartments
Surface Area: 27-102 m² apartment sizes
Market Split: Primary and secondary market analysis
Data Processing: 63 chunks for real-time simulation

🔧 Technical Highlights

Scalable Architecture: Cloud-native design with Snowflake
Real-time Processing: Snowpipe for automated data ingestion
Parallel Processing: Dask for efficient data processing
Modern ML Pipeline: Feature engineering and model optimization
Comprehensive Logging: Full audit trail and error handling
Multi-language Support: Translation capabilities for international use
Geocoding Integration: Enhanced location intelligence

🤝 Contributing

This project is actively maintained and open to contributions from the community! Whether you're interested in data engineering, machine learning, or cloud infrastructure, there are many ways to contribute.

How to Contribute

Fork the repository and clone it locally
Create a feature branch (git checkout -b feature/AmazingFeature)
Make your changes and test thoroughly
Commit your changes (git commit -m 'Add some AmazingFeature')
Push to your branch (git push origin feature/AmazingFeature)
Open a Pull Request with a clear description of your changes

Areas for Contribution

Data Engineering: Enhance the real-time pipeline, add new data sources
Machine Learning: Improve model performance, add new prediction features
Data Visualization: Create dashboards and interactive visualizations
Documentation: Improve guides, add tutorials, translate content
Testing: Add unit tests, integration tests, data validation
Performance: Optimize processing speed and resource usage
New Features: Add support for new property types, markets, or regions

Getting Started

Check the Issues for beginner-friendly tasks labeled good first issue
Read the project documentation and setup guide
Join discussions in Discussions for questions and ideas
Follow the code style and testing conventions established in the project

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

👥 Project Maintainer

Lead Developer & Maintainer - Passionate about data analytics and open-source collaboration

Full-Stack Implementation: Data pipeline, ML models, cloud infrastructure
Project Vision: Building a comprehensive real estate analytics platform
Community Focus: Welcoming contributors and fostering open-source collaboration

Looking for contributors in data engineering, machine learning, and visualization!

About the Developer

This comprehensive real estate analytics platform was developed to demonstrate modern data engineering practices and advanced machine learning techniques in production environments. The project showcases professional expertise across multiple domains:

Technical Leadership & Architecture

Enterprise Data Engineering: Designed scalable cloud-native architectures using Snowflake and AWS
Real-time Data Processing: Implemented production-grade ETL pipelines with automated ingestion
Advanced Analytics: Developed ML models achieving an R² of 0.9658 (log scale) on test data
Full-Stack Development: End-to-end platform implementation from data acquisition to insights

Professional Standards & Best Practices

Code Quality: Robust error handling, comprehensive logging, and modular design
Performance Optimization: Efficient parallel processing and resource management
Documentation Excellence: Detailed technical documentation and implementation guides
Open-source Leadership: Community-focused development and collaborative practices

Industry Impact & Innovation

Real Estate Intelligence: Automated valuation models for market analysis
Scalable Solutions: Processing 100K+ records with enterprise-grade performance
Modern Tech Stack: Integration of cutting-edge technologies and frameworks
Business Value: Practical applications for investment decisions and market research

Connect with the Project

GitHub: Follow this repository for updates and contributions
Issues: Report bugs, request features, or ask questions
Discussions: Join community conversations about real estate analytics

📞 Support

For questions, issues, or feature requests, please create an issue in the repository. I'm committed to helping contributors get started and making this project accessible to developers of all skill levels.

Common Questions

Getting Started: Check the installation guide and prerequisites
Data Access: Sample dataset provided; full dataset available on request
Cloud Setup: Detailed Snowflake and AWS configuration instructions included
Model Training: Step-by-step ML pipeline with example outputs

Built with ❤️ for the real estate analytics community

Empowering data-driven decisions in real estate through open-source collaboration

adeeshperera/real-estate-analytics-platform
