A comprehensive big data analytics project for real estate market analysis using modern data engineering and machine learning techniques. This project implements a complete data pipeline from ingestion to predictive modeling using Snowflake, Python, and machine learning algorithms.
The architecture consists of the following components:
- Data Source: Otodom.pl real estate listings (100K+ records)
- Real-time Data Pipeline: AWS S3 + Snowpipe for automated data ingestion
- Data Warehouse: Snowflake Cloud Data Platform
- Data Processing: Python with pandas, dask for parallel processing
- Machine Learning: Scikit-learn Random Forest for price prediction
- Geocoding: Geopy for address enrichment
- Translation: Google Translate API for multilingual support
The project analyzes apartment listings from major Polish cities with the following attributes:
- Property Details: Price, surface area, number of rooms, location
- Features: Balcony/garden/terrace, parking space, heating type
- Metadata: Advertiser type, market type (primary/secondary), descriptions
- Location Data: GPS coordinates, detailed address information
- Translations: Multi-language support for property descriptions
title,price,market,surface,location,no_of_rooms,form_of_property,url,is_for_sale,posting_id
"Przestronne|Zadbane Osiedle|Ochrona|Monitoring",899000,"[""market"",""secondary""]",48.8,"ul. Jana Kazimierza, Wola, Warszawa, mazowieckie",2,"peΕna wΕasnoΕΔ","https://www.otodom.pl/pl/oferta/...",true,"4tFaF"
- Automated Data Ingestion: Snowpipe with S3 auto-ingest
- Data Chunking: 1000-record chunks for simulated real-time processing (63 chunks)
- Error Handling: Robust error handling and monitoring
- Address Geocoding: Convert coordinates to detailed address information
- Text Translation: Multi-language support using Google Translate API
- Data Cleaning: Outlier detection and removal using IQR method
- Feature Engineering: Price per square meter, interaction features
- Algorithm: Random Forest Regressor with hyperparameter tuning
- Performance: 96.58% R² score on test data
- Features: Automated feature selection, cross-validation
- Evaluation: RMSE: 182,424 PLN, robust cross-validation
- Data Profiling: Comprehensive data quality checks
- Missing Value Handling: Intelligent imputation strategies
- Outlier Detection: Statistical methods for anomaly detection
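The price-per-square-meter feature and the IQR outlier rule listed above can be sketched as follows. This is an illustration rather than the code in `predict_prices.py`: the function names, the conventional multiplier `k=1.5`, the quartile method, and the sample rows are all assumptions.

```python
import statistics


def add_price_per_sqm(rows):
    """Return copies of the rows with a derived price-per-square-meter field."""
    return [
        {**r, "price_per_sqm": r["price"] / r["surface"]}
        for r in rows
        if r["surface"] > 0  # skip rows that would divide by zero
    ]


def iqr_bounds(values, k=1.5):
    """Lower and upper fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def drop_outliers(rows, key, k=1.5):
    """Keep only rows whose value for `key` lies within the IQR fences."""
    low, high = iqr_bounds([r[key] for r in rows], k)
    return [r for r in rows if low <= r[key] <= high]


# Made-up listings for illustration only; the last one has an implausible price.
rows = [
    {"price": 467_000, "surface": 27.0},
    {"price": 520_000, "surface": 40.0},
    {"price": 610_000, "surface": 48.8},
    {"price": 700_000, "surface": 55.0},
    {"price": 760_000, "surface": 62.0},
    {"price": 899_000, "surface": 102.0},
    {"price": 9_500_000, "surface": 50.0},
]
enriched = add_price_per_sqm(rows)
cleaned = drop_outliers(enriched, key="price_per_sqm")
print(len(enriched), "->", len(cleaned))
```

Filtering on the derived feature rather than on raw price keeps large but fairly priced apartments while still removing listings whose price is out of proportion to their size.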
data-analytics/
├── README.md                            # Project documentation
├── Guide.pdf                            # Comprehensive project guide
├── sample_dataset.csv                   # Sample data for testing
│
├── Datasets/                            # Raw and processed datasets
│   ├── Otodom_Apartment_major_cities_dataset_ORG_JSON_Format_Part1.csv
│   ├── Otodom_Apartment_major_cities_dataset_ORG_JSON_Format_Part2.csv
│   ├── Otodom_Apartment_major_cities_dataset_ORG_JSON_Format_Part3.csv
│   ├── split.py                         # Data chunking script
│   └── real-time-data-pipeline/
│       ├── upload.py                    # S3 upload automation
│       └── data_chunks/                 # Chunked data files (63 chunks)
│           ├── chunk_0.csv
│           ├── chunk_1.csv
│           └── ... (chunk_62.csv)
│
├── Address and Title/                   # Geocoded and translated data
│   ├── Apartment_major_cities_dataset_Address.csv
│   └── Apartment_major_cities_dataset_Translate.csv
│
├── Prediction Model/                    # Machine learning components
│   ├── predict_prices.py                # ML model implementation
│   ├── Prediction_Model_Output.jpg      # Model results visualization
│   └── real estate price prediction model.txt
│
└── Scripts/                             # Implementation scripts
    ├── Python_scripts/
    │   ├── 1. Python_Prerequisites_Otodom_Analysis.txt
    │   ├── 2. fetch_address_Analysis.py             # Geocoding implementation
    │   ├── 3. fetch_address_Analysis2.py            # Enhanced geocoding
    │   ├── 4. translate_text_gsheet_Analysis.py     # Translation service
    │   └── 5. load_data_gsheet_to_SF_Analysis.py    # Snowflake integration
    └── Snowflake_scripts/
        ├── 1. Load_Dataset_to_SF.txt                # Data loading procedures
        ├── 2. Snowflake_script_Otodom_Analysis.txt  # SQL transformations
        └── 3. Problems_and_Solutions.txt            # Troubleshooting guide
- Python Environment (using zsh shell)
conda create --name real_estate_analytics python=3.12
conda activate real_estate_analytics
- Required Packages
pip install pandas SQLAlchemy "snowflake-connector-python[pandas]"
pip install snowflake-sqlalchemy gspread gspread-dataframe
pip install geopy dask deep-translator boto3
pip install scikit-learn numpy matplotlib seaborn
- External Services
- Snowflake account with appropriate warehouse setup
- AWS S3 bucket for data storage
- Google Cloud credentials for translation API
- Google Sheets API access (optional)
- Snowflake Setup
-- Create database and warehouse
CREATE OR REPLACE DATABASE REAL_ESTATE_DB;
CREATE OR REPLACE WAREHOUSE SNOWPIP_WH;
-- Create staging area and file format
CREATE OR REPLACE STAGE real_estate_stage;
CREATE OR REPLACE FILE FORMAT csv_format
TYPE = 'CSV'
FIELD_DELIMITER = ','
SKIP_HEADER = 1;
- AWS Configuration
# Configure AWS credentials
aws_access_key_id = 'your_access_key'
aws_secret_access_key = 'your_secret_key'
bucket_name = 'your_s3_bucket'
# Split large dataset into chunks
python Datasets/split.py
# Upload to S3 with simulated real-time delay
python Datasets/real-time-data-pipeline/upload.py
# Geocode addresses from coordinates
python "Scripts/Python_scripts/2. fetch_address_Analysis.py"
# Translate property descriptions
python "Scripts/Python_scripts/4. translate_text_gsheet_Analysis.py"
# Load processed data to Snowflake
python "Scripts/Python_scripts/5. load_data_gsheet_to_SF_Analysis.py"
# Train and evaluate the price prediction model
python "Prediction Model/predict_prices.py"
The Random Forest price prediction model reports the following results:
- R² Score: 96.58% (log scale)
- Cross-validation: 96.22% ± 0.52%
- RMSE: 182,424 PLN
- Primary Feature: Price per square meter
- Optimization: Grid search with 5-fold CV
- Logarithmic transformation for price normalization
- Feature selection using model-based selection
- Hyperparameter optimization with GridSearchCV
- Robust cross-validation strategy
Selected features: ['PRICE_PER_SQM']
Cross-validation R2 scores: [0.9642408 0.95988394 0.96591911 0.95903045 0.96197944]
Mean CV R2 score: 0.9622 (+/- 0.0052)
Best parameters: {'max_depth': 8, 'min_samples_leaf': 6, 'min_samples_split': 15, 'n_estimators': 100}
Root Mean Squared Error (RMSE): 182424.33
R-squared Score (log scale): 0.9658
Average Price in Test Set: 507834.36
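The output above reports R² on the log scale but RMSE in PLN, so the two metrics are measured in different spaces. The sketch below shows one way such a pair can be computed, assuming the model is trained on the natural log of price and predictions are exponentiated back before computing RMSE. Whether `predict_prices.py` uses exactly this transform is not stated here, and the prices and predictions are made up.

```python
import math


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean squared error in the units of y_true."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Made-up prices in PLN and made-up model outputs on the log scale.
actual_prices = [467_000.0, 520_000.0, 610_000.0, 899_000.0]
predicted_log = [math.log(v) for v in (480_000.0, 505_000.0, 640_000.0, 850_000.0)]

actual_log = [math.log(p) for p in actual_prices]
predicted_prices = [math.exp(v) for v in predicted_log]  # back to PLN

print(f"R2 (log scale): {r2_score(actual_log, predicted_log):.4f}")
print(f"RMSE (PLN):     {rmse(actual_prices, predicted_prices):.2f}")
```

Because the log transform compresses large prices, an R² computed on the log scale is not directly comparable to one computed on raw prices. Reading the RMSE against the average test-set price gives a complementary view of the error.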
This platform enables various real estate analytics use cases:
- Price Prediction: Automated property valuation for buyers and sellers
- Market Analysis: Trend identification and forecasting for investors
- Investment Decisions: Data-driven investment strategies
- Portfolio Management: Performance tracking and optimization
- Market Research: Comprehensive market intelligence for agencies
- Data Volume: 100K+ property listings across major Polish cities
- Geographic Coverage: Warsaw, Krakow, Wroclaw, Katowice, and more
- Price Range: From 467K to 899K PLN for apartments
- Surface Area: 27-102 m² apartment sizes
- Market Split: Primary and secondary market analysis
- Data Processing: 63 chunks for real-time simulation
- Scalable Architecture: Cloud-native design with Snowflake
- Real-time Processing: Snowpipe for automated data ingestion
- Parallel Processing: Dask for efficient data processing
- Modern ML Pipeline: Feature engineering and model optimization
- Comprehensive Logging: Full audit trail and error handling
- Multi-language Support: Translation capabilities for international use
- Geocoding Integration: Enhanced location intelligence
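The chunking used for the real-time simulation (1000-record files such as `chunk_0.csv`) can be sketched with the standard library alone. The contents of `split.py` are not shown in this document, so the function names below are hypothetical and the input is synthetic.

```python
import csv
import io
from itertools import islice


def iter_chunks(reader, chunk_size=1000):
    """Yield lists of at most chunk_size rows from a csv.reader."""
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


def split_csv(source, chunk_size=1000):
    """Split CSV text from a file-like object into chunk texts that each repeat the header."""
    reader = csv.reader(source)
    header = next(reader)
    chunks = []
    for rows in iter_chunks(reader, chunk_size):
        buffer = io.StringIO()
        writer = csv.writer(buffer)
        writer.writerow(header)
        writer.writerows(rows)
        chunks.append(buffer.getvalue())
    return chunks


# Synthetic input: 2,500 data rows yield three chunks of 1000, 1000 and 500 rows.
source = io.StringIO(
    "posting_id,price\n" + "".join(f"id{i},{400_000 + i}\n" for i in range(2500))
)
chunks = split_csv(source)
print([len(c.splitlines()) - 1 for c in chunks])
```

Repeating the header in every chunk fits a file format that skips one header line per file, as `SKIP_HEADER = 1` does in the Snowflake setup. In practice each chunk text would be written to its own file (for example `data_chunks/chunk_0.csv`) before upload.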
This project is actively maintained and open to contributions from the community! Whether you're interested in data engineering, machine learning, or cloud infrastructure, there are many ways to contribute.
- Fork the repository and clone it locally
- Create a feature branch (git checkout -b feature/AmazingFeature)
- Make your changes and test thoroughly
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to your branch (git push origin feature/AmazingFeature)
- Open a Pull Request with a clear description of your changes
- Data Engineering: Enhance the real-time pipeline, add new data sources
- Machine Learning: Improve model performance, add new prediction features
- Data Visualization: Create dashboards and interactive visualizations
- Documentation: Improve guides, add tutorials, translate content
- Testing: Add unit tests, integration tests, data validation
- Performance: Optimize processing speed and resource usage
- New Features: Add support for new property types, markets, or regions
- Check the Issues for beginner-friendly tasks labeled good first issue
- Read the project documentation and setup guide
- Join discussions in Discussions for questions and ideas
- Follow the code style and testing conventions established in the project
This project is licensed under the MIT License - see the LICENSE file for details.
Lead Developer & Maintainer - Passionate about data analytics and open-source collaboration
- Full-Stack Implementation: Data pipeline, ML models, cloud infrastructure
- Project Vision: Building a comprehensive real estate analytics platform
- Community Focus: Welcoming contributors and fostering open-source collaboration
Looking for contributors in data engineering, machine learning, and visualization!
This comprehensive real estate analytics platform was developed to demonstrate modern data engineering practices and advanced machine learning techniques in production environments. The project showcases professional expertise across multiple domains:
Technical Leadership & Architecture
- Enterprise Data Engineering: Designed scalable cloud-native architectures using Snowflake and AWS
- Real-time Data Processing: Implemented production-grade ETL pipelines with automated ingestion
- Advanced Analytics: Developed a price prediction model with an R² of 96.58% (log scale)
- Full-Stack Development: End-to-end platform implementation from data acquisition to insights
Professional Standards & Best Practices
- Code Quality: Robust error handling, comprehensive logging, and modular design
- Performance Optimization: Efficient parallel processing and resource management
- Documentation Excellence: Detailed technical documentation and implementation guides
- Open-source Leadership: Community-focused development and collaborative practices
Industry Impact & Innovation
- Real Estate Intelligence: Automated valuation models for market analysis
- Scalable Solutions: Processing 100K+ records with enterprise-grade performance
- Modern Tech Stack: Integration of cutting-edge technologies and frameworks
- Business Value: Practical applications for investment decisions and market research
- GitHub: Follow this repository for updates and contributions
- Issues: Report bugs, request features, or ask questions
- Discussions: Join community conversations about real estate analytics
For questions, issues, or feature requests, please create an issue in the repository. I'm committed to helping contributors get started and making this project accessible to developers of all skill levels.
- Getting Started: Check the installation guide and prerequisites
- Data Access: Sample dataset provided; full dataset available on request
- Cloud Setup: Detailed Snowflake and AWS configuration instructions included
- Model Training: Step-by-step ML pipeline with example outputs
Built with ❤️ for the real estate analytics community
Empowering data-driven decisions in real estate through open-source collaboration