for those who get the reference ;)
This is an extension of my original Zimmo scraping project, now orchestrated with Apache Airflow. As a first-timer with Airflow, I've learned that:
- Setting up workflows can be challenging, much like solving a complex puzzle with no clear solution.
- Debugging Docker containers often requires understanding unfamiliar systems and interactions.
- But when everything runs smoothly, it delivers clean, structured data efficiently and reliably.
Zimmo Airflow Scraper extracts Belgian real estate data from zimmo.be using:
- 🕷️ Web Scraping: CloudScraper to bypass protections
- 🐘 PostgreSQL: For data storage
- 🌀 Apache Airflow: Workflow orchestration
- 🐳 Docker: Containerized deployment
- Multi-range price scraping – Collects data across all property price ranges
- Automatic retry logic – Ensures reliable scraping even when network or server issues occur
- Database conflict handling – Uses smart upserts to prevent data loss and maintain integrity
- Error fallback system – Generates placeholder or sample data when scraping fails
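The conflict-handling upsert can be sketched as a small SQL builder. This is a hypothetical illustration, assuming a `listings` table keyed on a unique Zimmo ID; the project's real table and column names live in `sql/zimmo_schema.sql` and may differ:

```python
# Hypothetical sketch of the "smart upsert" used for database conflict handling.
# Table and column names are illustrative, not the project's actual schema.

def build_upsert_sql(table: str, columns: list[str], conflict_key: str) -> str:
    """Build an INSERT ... ON CONFLICT DO UPDATE statement for psycopg2."""
    cols = ", ".join(columns)
    placeholders = ", ".join(f"%({c})s" for c in columns)
    # On a duplicate key, overwrite the mutable fields with the freshly scraped values
    updates = ", ".join(f"{c} = EXCLUDED.{c}" for c in columns if c != conflict_key)
    return (
        f"INSERT INTO {table} ({cols}) VALUES ({placeholders}) "
        f"ON CONFLICT ({conflict_key}) DO UPDATE SET {updates}"
    )

sql = build_upsert_sql("listings", ["zimmo_id", "price", "scraped_at"], "zimmo_id")
# Execute with a psycopg2 cursor:
# cur.execute(sql, {"zimmo_id": "X123", "price": 250000, "scraped_at": "2024-01-01"})
```

Re-scraping the same listing then updates its price instead of raising a duplicate-key error, which is what keeps repeated runs idempotent.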
- Docker and Docker Compose installed
- At least 4GB RAM and 2 CPUs available for Docker
- 10GB+ disk space
git clone https://github.com/jgchoti/immoeliza-airflow.git
cd immoeliza-airflow

# Copy environment template
cp .env.template .env

Edit the .env file with these essential settings:
# User ID (CRITICAL - prevents permission issues)
AIRFLOW_UID=1000 # Replace with output of: id -u
# Security (REQUIRED)
# Generate with: python3 -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"
AIRFLOW_FERNET_KEY=your_generated_fernet_key_here
# Credentials (Change in production!)
_AIRFLOW_WWW_USER_USERNAME=airflow
_AIRFLOW_WWW_USER_PASSWORD=airflow
POSTGRES_PASSWORD=airflow
PGADMIN_DEFAULT_PASSWORD=root
# Performance settings
AIRFLOW_PARALLELISM=8
AIRFLOW_MAX_ACTIVE_TASKS_PER_DAG=4
# Port configuration
AIRFLOW_WEBSERVER_PORT=8080
POSTGRES_PORT=5432
PGADMIN_PORT=5050
# Additional Python packages
_PIP_ADDITIONAL_REQUIREMENTS=pandas==1.5.0 requests==2.28.0 cloudscraper beautifulsoup4 psycopg2-binary

# Set your user ID
echo "AIRFLOW_UID=$(id -u)" >> .env
# Generate Fernet key
python3 -c "from cryptography.fernet import Fernet; print('AIRFLOW_FERNET_KEY=' + Fernet.generate_key().decode())" >> .env

# Start all services
docker-compose up -d
# Or start with pgAdmin
docker-compose --profile tools up -d

# Wait for services to be healthy (2-3 minutes), then run:
docker exec -i $(docker-compose ps -q postgres) psql -U airflow -d airflow < sql/zimmo_schema.sql

- Airflow Web UI: http://localhost:8080
  - Username: `airflow` | Password: `airflow`
- pgAdmin (if started with tools profile): http://localhost:5050
  - Email: `admin@admin.com` | Password: `root`
For connecting to PostgreSQL from external tools or scripts:

- Host: `localhost` (external) / `postgres` (internal)
- Port: `5432`
- Database: `airflow`
- Username: `airflow`
- Password: `airflow`
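For external scripts, those settings combine into a standard PostgreSQL connection URL. A minimal sketch using the default dev credentials from the table above (change them in production):

```python
# Build a PostgreSQL connection URL from the connection details above.
# These are the default dev credentials; change them in production.
user, password, host, port, db = "airflow", "airflow", "localhost", 5432, "airflow"
dsn = f"postgresql://{user}:{password}@{host}:{port}/{db}"
print(dsn)  # postgresql://airflow:airflow@localhost:5432/airflow
```

The resulting URL works with `psycopg2.connect(dsn)` or SQLAlchemy's `create_engine(dsn)`.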
immoeliza-airflow/
├── 📁 dags/                  # Airflow workflows
├── 🔧 plugins/               # Custom scrapers & utilities
├── 📊 sql/                   # Database schema files
├── 📜 logs/                  # All the debugging adventures
├── 📁 scripts/               # ML training & dashboard generation
├── 📄 .env.template          # Environment variables template
├── 🐳 docker-compose.yml     # Container orchestration
└── 📖 README.md              # This file
start_pipeline → check_dependencies → scrape_apartments → scrape_houses → deduplicate_data
                                                                                ↓
end_pipeline ← final_summary ← train_regression_model
                                        ↓
                                generate_dashboard_data
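The task graph above can be modeled as a plain dependency map to see the execution order the scheduler resolves. This is a simplified stdlib sketch, not the actual DAG file, and it flattens the dashboard branch into a linear chain:

```python
# Simplified model of the pipeline's task dependencies (not Airflow code).
# graphlib resolves a valid execution order the same way the scheduler would.
from graphlib import TopologicalSorter

deps = {  # task -> set of upstream tasks it waits for
    "check_dependencies": {"start_pipeline"},
    "scrape_apartments": {"check_dependencies"},
    "scrape_houses": {"scrape_apartments"},
    "deduplicate_data": {"scrape_houses"},
    "train_regression_model": {"deduplicate_data"},
    "generate_dashboard_data": {"train_regression_model"},
    "final_summary": {"generate_dashboard_data"},
    "end_pipeline": {"final_summary"},
}
order = list(TopologicalSorter(deps).static_order())
print(order[0], "->", order[-1])  # start_pipeline -> end_pipeline
```

In the real DAG these dependencies are declared with Airflow's `>>` operator between tasks.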
# All services
docker-compose logs -f
# Specific service
docker-compose logs -f airflow-scheduler

# Access Airflow CLI
docker-compose exec airflow-scheduler bash
# List DAGs
docker-compose exec airflow-scheduler airflow dags list

# Check service status
docker-compose ps
# Reset everything (β οΈ removes all data)
docker-compose down -v
docker-compose up -d

While building this project, I also picked up some Airflow best practices that made things more reliable and maintainable:
- Keep imports at the task level. Instead of importing heavy libraries at the top of the DAG file, import them inside the task function to keep the scheduler lightweight.
- Use a linter/formatter (Ruff):

  ruff check dags/ --select AIR3 --preview

  (This runs Ruff's rule for detecting imports in Airflow DAGs, which is a common pitfall.)
- Small, modular tasks. Prefer several lightweight tasks over one heavy one; they fail in isolation and retry faster.
- Don't overload the scheduler. Use task parallelism wisely (set via AIRFLOW_PARALLELISM and AIRFLOW_MAX_ACTIVE_TASKS_PER_DAG).
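The task-level import tip looks like this in practice. A minimal sketch, with a stdlib import standing in for the heavy dependencies (pandas, cloudscraper, ...) a real task would pull in:

```python
# Sketch of "imports at the task level": the scheduler re-parses DAG files
# constantly, so heavy imports should only happen when the task actually runs.

def parse_listing(raw: str) -> dict:
    # Stand-in for heavy libraries (pandas, cloudscraper, ...) that would
    # otherwise slow down every DAG-file parse if imported at module level.
    import json
    return json.loads(raw)

print(parse_listing('{"price": 250000}')["price"])  # 250000
```

With module-level imports, every parse cycle pays the import cost even if the task never runs; inside the function, only the worker executing the task does.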
- 📱 Enhanced Streamlit dashboard: More interactive features and real-time updates
- 🤖 Advanced ML models: Deep learning for better price predictions
- 📧 Alert system: Get notified when scraping hits those green success notes
- 🔄 Incremental updates: Smart scraping of only new/changed listings
- ⚡ Parallel scraping: Ability to run both house and apartment scrapers in parallel (currently limited by local machine resources)
- Taylor Swift - For the emotional inspiration behind this README
- Apache Airflow Community - For making workflow orchestration feel less impossible
- zimmo.be - For having data worth scraping (and for not blocking me... yet)
- My Future Self – The data engineer version of me who will finally understand this first-ever Airflow project better
- Airflow Errors – For teaching patience, persistence, and the true meaning of "retry"


