
🏠 Zimmo Airflow Scraper

Because Data Collection Should (Not) Feel Like Burning in Red (Taylor's Version)

[Red (Taylor's Version) album cover]

for those who get the reference ;)



🎡 The Story Behind This Project

This is an extension of my original Zimmo scraping project, now orchestrated with Apache Airflow. As a first-timer with Airflow, I've learned that:

  • Setting up workflows can be challenging, much like solving a complex puzzle with no clear solution.
  • Debugging Docker containers often requires understanding unfamiliar systems and interactions.
  • But when everything runs smoothly, it delivers clean, structured data efficiently and reliably.

🌟 What This Project Does

Zimmo Airflow Scraper extracts Belgian real estate data from zimmo.be using:

  • πŸ•·οΈ Web Scraping: CloudScraper to bypass protections
  • 🐘 PostgreSQL: For data storage
  • 🌊 Apache Airflow: Workflow orchestration
  • 🐳 Docker: Containerized deployment

[Airflow DAG screenshot]

Features

  • Multi-range price scraping – Collects data across all property price ranges
  • Automatic retry logic – Ensures reliable scraping even when network or server issues occur
  • Database conflict handling – Uses smart upserts to prevent data loss and maintain integrity
  • Error fallback system – Generates placeholder or sample data when scraping fails
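The automatic retry logic above can be sketched as a small backoff decorator; this is an illustrative stand-alone version with names of my choosing, not the project's actual code:

```python
import functools
import time

def with_retries(max_attempts=3, base_delay=1.0):
    """Retry a flaky callable with exponential backoff between attempts."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts:
                        raise  # out of attempts: let the error surface
                    # wait base_delay, then 2x, 4x, ... between attempts
                    time.sleep(base_delay * 2 ** (attempt - 1))
        return wrapper
    return decorator
```

In Airflow itself, the per-task retries and retry_delay arguments cover much of this, so a wrapper like this mainly helps inside plain scraper functions.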

🚀 Quick Start

Prerequisites

  • Docker and Docker Compose installed
  • At least 4GB RAM and 2 CPUs available for Docker
  • 10GB+ disk space

1. Clone and Setup

git clone https://github.com/jgchoti/immoeliza-airflow.git
cd immoeliza-airflow

2. Configure Environment

# Copy environment template
cp .env.template .env

3. Essential Configuration

Edit .env file with these essential settings:

# User ID (CRITICAL - prevents permission issues)
AIRFLOW_UID=1000  # Replace with output of: id -u

# Security (REQUIRED)
# Generate with: python3 -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"
AIRFLOW_FERNET_KEY=your_generated_fernet_key_here

# Credentials (Change in production!)
_AIRFLOW_WWW_USER_USERNAME=airflow
_AIRFLOW_WWW_USER_PASSWORD=airflow
POSTGRES_PASSWORD=airflow
PGADMIN_DEFAULT_PASSWORD=root

# Performance settings
AIRFLOW_PARALLELISM=8
AIRFLOW_MAX_ACTIVE_TASKS_PER_DAG=4

# Port configuration
AIRFLOW_WEBSERVER_PORT=8080
POSTGRES_PORT=5432
PGADMIN_PORT=5050

# Additional Python packages
_PIP_ADDITIONAL_REQUIREMENTS=pandas==1.5.0,requests==2.28.0,cloudscraper,beautifulsoup4,psycopg2-binary

4. Generate Required Keys

# Set your user ID
echo "AIRFLOW_UID=$(id -u)" >> .env

# Generate Fernet key
python3 -c "from cryptography.fernet import Fernet; print('AIRFLOW_FERNET_KEY=' + Fernet.generate_key().decode())" >> .env

5. Start Airflow

# Start all services
docker-compose up -d

# Or start with pgAdmin
docker-compose --profile tools up -d

6. Initialize Database Schema

# Wait for services to be healthy (2-3 minutes), then run:
docker exec -i $(docker-compose ps -q postgres) psql -U airflow -d airflow < sql/zimmo_schema.sql
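With the schema loaded, the upsert-based conflict handling mentioned under Features maps onto Postgres INSERT ... ON CONFLICT statements; a minimal sketch of a statement builder, using hypothetical table and column names (see sql/zimmo_schema.sql for the real schema):

```python
def upsert_sql(table, columns, conflict_col):
    """Build a Postgres INSERT ... ON CONFLICT DO UPDATE (upsert) statement."""
    col_list = ", ".join(columns)
    placeholders = ", ".join(["%s"] * len(columns))
    updates = ", ".join(f"{c} = EXCLUDED.{c}" for c in columns if c != conflict_col)
    return (
        f"INSERT INTO {table} ({col_list}) VALUES ({placeholders}) "
        f"ON CONFLICT ({conflict_col}) DO UPDATE SET {updates}"
    )

# Hypothetical example: a re-scraped listing updates its price instead of
# failing on the duplicate key.
sql = upsert_sql("properties", ["zimmo_code", "price", "city"], "zimmo_code")
```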

🌐 Access Points

Database Connection Details

For connecting to PostgreSQL from external tools or scripts:

  • Host: localhost (external) / postgres (internal)
  • Port: 5432
  • Database: airflow
  • Username: airflow
  • Password: airflow
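For scripts, those details assemble into a standard libpq/SQLAlchemy-style connection URI; a small helper whose defaults mirror the development credentials above (change them in production):

```python
def pg_uri(user="airflow", password="airflow", host="localhost",
           port=5432, database="airflow"):
    """Build a PostgreSQL connection URI from the settings above."""
    return f"postgresql://{user}:{password}@{host}:{port}/{database}"

# From the host machine use localhost; from another container on the
# compose network use host="postgres".
external = pg_uri()
internal = pg_uri(host="postgres")
```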

πŸ“ Project Structure

immoeliza-airflow/
├── 🎭 dags/                    # Airflow workflows
├── 🔧 plugins/                 # Custom scrapers & utilities
├── 📜 sql/                     # Database schema files
├── 📊 logs/                    # All the debugging adventures
├── 📁 scripts/                 # ML training & dashboard generation
├── 📋 .env.template            # Environment variables template
├── 🐳 docker-compose.yml       # Container orchestration
└── 📖 README.md                # This file

🔄 Data Pipeline Flow

start_pipeline → check_dependencies → scrape_apartments → scrape_houses → deduplicate_data
                                                                              ↓
                            end_pipeline ← final_summary ← train_regression_model
                                                        ↖ generate_dashboard_data
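The same flow can be written down as plain data and sanity-checked with the standard library's topological sorter; the fan-out after deduplicate_data into the two branches is my reading of the arrows, and the real DAG wires these tasks with Airflow operators instead:

```python
from graphlib import TopologicalSorter

# Upstream task -> downstream tasks, read off the diagram above.
DOWNSTREAM = {
    "start_pipeline": ["check_dependencies"],
    "check_dependencies": ["scrape_apartments"],
    "scrape_apartments": ["scrape_houses"],
    "scrape_houses": ["deduplicate_data"],
    "deduplicate_data": ["train_regression_model", "generate_dashboard_data"],
    "train_regression_model": ["final_summary"],
    "generate_dashboard_data": ["final_summary"],
    "final_summary": ["end_pipeline"],
    "end_pipeline": [],
}

# TopologicalSorter expects node -> predecessors, so invert the edges.
predecessors = {task: set() for task in DOWNSTREAM}
for upstream, downstream_tasks in DOWNSTREAM.items():
    for task in downstream_tasks:
        predecessors[task].add(upstream)

order = list(TopologicalSorter(predecessors).static_order())
```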

πŸ› οΈ Common Operations

View Logs

# All services
docker-compose logs -f

# Specific service
docker-compose logs -f airflow-scheduler

Execute Airflow Commands

# Access Airflow CLI
docker-compose exec airflow-scheduler bash

# List DAGs
docker-compose exec airflow-scheduler airflow dags list

Troubleshooting

# Check service status
docker-compose ps

# Reset everything (⚠️ removes all data)
docker-compose down -v
docker-compose up -d

✅ Best Practices

While building this project, I also picked up some Airflow best practices that made things more reliable and maintainable:

  • Keep imports at the task level – Instead of importing heavy libraries at the top of the DAG file, import them inside the task function to keep the scheduler lightweight.

  • Use a linter/formatter (Ruff)

ruff check dags/ --select AIR3 --preview

(This runs Ruff’s AIR3 rules for Airflow DAGs, which catch common pitfalls such as top-level imports.)

  • Small, modular tasks – Prefer many small, lightweight tasks over a few heavy ones; they are easier to retry and debug.

  • Don’t overload the scheduler – Use task parallelism wisely (set via AIRFLOW_PARALLELISM and AIRFLOW_MAX_ACTIVE_TASKS_PER_DAG).
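The task-level import guideline looks like this in practice; a minimal sketch with a stdlib module standing in for a heavy dependency like pandas:

```python
def summarize_prices(prices):
    """Task callable: heavy imports live inside, not at module top level."""
    # The scheduler re-parses DAG files continuously; importing heavy
    # libraries here instead of at the top keeps that parse loop fast.
    # statistics is a stdlib stand-in for pandas/sklearn in this sketch.
    import statistics

    return {"count": len(prices), "mean": statistics.mean(prices)}
```

At module level the DAG file then only needs Airflow itself, so parsing and listing DAGs stays fast even when the tasks pull in heavy dependencies.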

πŸ“ Future Enhancements

  • 📱 Enhanced Streamlit dashboard: More interactive features and real-time updates
  • 🤖 Advanced ML models: Deep learning for better price predictions
  • 📧 Alert system: Get notified when scraping hits those green success notes
  • 🔄 Incremental updates: Smart scraping of only new/changed listings
  • ⚡ Parallel scraping: Ability to run both house and apartment scrapers in parallel (currently limited by local machine resources)

πŸ™ Acknowledgments

  • Taylor Swift - For the emotional inspiration behind this README
  • Apache Airflow Community - For making workflow orchestration feel less impossible
  • zimmo.be - For having data worth scraping (and for not blocking me... yet)
  • My Future Self – The data engineer version of me who will finally understand this first-ever Airflow project better
  • Airflow Errors – For teaching patience, persistence, and the true meaning of "retry"

📊 Demo Dashboard

Visit the Demo Dashboard.

