for those who get the reference ;)
This is an extension of my original Zimmo scraping project, now orchestrated with Apache Airflow. As a first-timer with Airflow, I've learned that:
- Setting up workflows can be challenging, much like solving a complex puzzle with no clear solution.
- Debugging Docker containers often requires understanding unfamiliar systems and interactions.
- But when everything runs smoothly, it delivers clean, structured data efficiently and reliably.
Zimmo Airflow Scraper extracts Belgian real estate data from zimmo.be using:
- 🕷️ Web Scraping: CloudScraper to bypass protections
- 🐘 PostgreSQL: For data storage
- 🌀 Apache Airflow: Workflow orchestration
- 🐳 Docker: Containerized deployment
- Multi-range price scraping – Collects data across all property price ranges
- Automatic retry logic – Ensures reliable scraping even when network or server issues occur
- Database conflict handling – Uses smart upserts to prevent data loss and maintain integrity
- Error fallback system – Generates placeholder or sample data when scraping fails
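The conflict-handling upsert can be sketched as a small SQL builder. This is a hypothetical illustration, assuming a `listings` table keyed on a unique Zimmo ID; the project's real table and column names live in `sql/zimmo_schema.sql` and may differ:

```python
# Hypothetical sketch of the "smart upsert" used for database conflict handling.
# Table and column names are illustrative, not the project's actual schema.

def build_upsert_sql(table: str, columns: list[str], conflict_key: str) -> str:
    """Build an INSERT ... ON CONFLICT DO UPDATE statement for psycopg2."""
    cols = ", ".join(columns)
    placeholders = ", ".join(f"%({c})s" for c in columns)
    # On a duplicate key, overwrite the mutable fields with the freshly scraped values
    updates = ", ".join(f"{c} = EXCLUDED.{c}" for c in columns if c != conflict_key)
    return (
        f"INSERT INTO {table} ({cols}) VALUES ({placeholders}) "
        f"ON CONFLICT ({conflict_key}) DO UPDATE SET {updates}"
    )

sql = build_upsert_sql("listings", ["zimmo_id", "price", "scraped_at"], "zimmo_id")
# Execute with a psycopg2 cursor:
# cur.execute(sql, {"zimmo_id": "X123", "price": 250000, "scraped_at": "2024-01-01"})
```

Re-scraping the same listing then updates its price instead of raising a duplicate-key error, which is what keeps repeated runs idempotent.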
- Docker and Docker Compose installed
- At least 4GB RAM and 2 CPUs available for Docker
- 10GB+ disk space
git clone https://github.com/jgchoti/immoeliza-airflow.git
cd immoeliza-airflow

# Copy environment template
cp .env.template .env

Edit the .env file with these essential settings:
# User ID (CRITICAL - prevents permission issues)
AIRFLOW_UID=1000 # Replace with output of: id -u
# Security (REQUIRED)
# Generate with: python3 -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"
AIRFLOW_FERNET_KEY=your_generated_fernet_key_here
# Credentials (Change in production!)
_AIRFLOW_WWW_USER_USERNAME=airflow
_AIRFLOW_WWW_USER_PASSWORD=airflow
POSTGRES_PASSWORD=airflow
PGADMIN_DEFAULT_PASSWORD=root
# Performance settings
AIRFLOW_PARALLELISM=8
AIRFLOW_MAX_ACTIVE_TASKS_PER_DAG=4
# Port configuration
AIRFLOW_WEBSERVER_PORT=8080
POSTGRES_PORT=5432
PGADMIN_PORT=5050
# Additional Python packages
_PIP_ADDITIONAL_REQUIREMENTS=pandas==1.5.0 requests==2.28.0 cloudscraper beautifulsoup4 psycopg2-binary

# Set your user ID
echo "AIRFLOW_UID=$(id -u)" >> .env
# Generate Fernet key
python3 -c "from cryptography.fernet import Fernet; print('AIRFLOW_FERNET_KEY=' + Fernet.generate_key().decode())" >> .env

# Start all services
docker-compose up -d
# Or start with pgAdmin
docker-compose --profile tools up -d

# Wait for services to be healthy (2-3 minutes), then run:
docker exec -i $(docker-compose ps -q postgres) psql -U airflow -d airflow < sql/zimmo_schema.sql

- Airflow Web UI: http://localhost:8080
  - Username: `airflow` | Password: `airflow`
- pgAdmin (if started with tools profile): http://localhost:5050
  - Email: `admin@admin.com` | Password: `root`
For connecting to PostgreSQL from external tools or scripts:

- Host: `localhost` (external) / `postgres` (internal)
- Port: `5432`
- Database: `airflow`
- Username: `airflow`
- Password: `airflow`
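For external scripts, those settings combine into a standard PostgreSQL connection URL. A minimal sketch using the default dev credentials from the table above (change them in production):

```python
# Build a PostgreSQL connection URL from the connection details above.
# These are the default dev credentials; change them in production.
user, password, host, port, db = "airflow", "airflow", "localhost", 5432, "airflow"
dsn = f"postgresql://{user}:{password}@{host}:{port}/{db}"
print(dsn)  # postgresql://airflow:airflow@localhost:5432/airflow
```

The resulting URL works with `psycopg2.connect(dsn)` or SQLAlchemy's `create_engine(dsn)`.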
immoeliza-airflow/
├── 📁 dags/                  # Airflow workflows
├── 🔧 plugins/               # Custom scrapers & utilities
├── 📊 sql/                   # Database schema files
├── 📜 logs/                  # All the debugging adventures
├── 📁 scripts/               # ML training & dashboard generation
├── 📄 .env.template          # Environment variables template
├── 🐳 docker-compose.yml     # Container orchestration
└── 📖 README.md              # This file
start_pipeline → check_dependencies → scrape_apartments → scrape_houses → deduplicate_data
                                                                                ↓
end_pipeline ← final_summary ← train_regression_model
                                        ↓
                                generate_dashboard_data
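The task graph above can be modeled as a plain dependency map to see the execution order the scheduler resolves. This is a simplified stdlib sketch, not the actual DAG file, and it flattens the dashboard branch into a linear chain:

```python
# Simplified model of the pipeline's task dependencies (not Airflow code).
# graphlib resolves a valid execution order the same way the scheduler would.
from graphlib import TopologicalSorter

deps = {  # task -> set of upstream tasks it waits for
    "check_dependencies": {"start_pipeline"},
    "scrape_apartments": {"check_dependencies"},
    "scrape_houses": {"scrape_apartments"},
    "deduplicate_data": {"scrape_houses"},
    "train_regression_model": {"deduplicate_data"},
    "generate_dashboard_data": {"train_regression_model"},
    "final_summary": {"generate_dashboard_data"},
    "end_pipeline": {"final_summary"},
}
order = list(TopologicalSorter(deps).static_order())
print(order[0], "->", order[-1])  # start_pipeline -> end_pipeline
```

In the real DAG these dependencies are declared with Airflow's `>>` operator between tasks.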
# All services
docker-compose logs -f
# Specific service
docker-compose logs -f airflow-scheduler

# Access Airflow CLI
docker-compose exec airflow-scheduler bash
# List DAGs
docker-compose exec airflow-scheduler airflow dags list

# Check service status
docker-compose ps
# Reset everything (β οΈ removes all data)
docker-compose down -v
docker-compose up -d

While building this project, I also picked up some Airflow best practices that made things more reliable and maintainable:
- Keep imports at the task level. Instead of importing heavy libraries at the top of the DAG file, import them inside the task function to keep the scheduler lightweight.
- Use a linter/formatter (Ruff):

  ruff check dags/ --select AIR3 --preview

  (This runs Ruff's rule for detecting imports in Airflow DAGs, which is a common pitfall.)
- Small, modular tasks. Prefer several lightweight tasks over one heavy one; they fail in isolation and retry faster.
- Don't overload the scheduler. Use task parallelism wisely (set via AIRFLOW_PARALLELISM and AIRFLOW_MAX_ACTIVE_TASKS_PER_DAG).
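The task-level import tip looks like this in practice. A minimal sketch, with a stdlib import standing in for the heavy dependencies (pandas, cloudscraper, ...) a real task would pull in:

```python
# Sketch of "imports at the task level": the scheduler re-parses DAG files
# constantly, so heavy imports should only happen when the task actually runs.

def parse_listing(raw: str) -> dict:
    # Stand-in for heavy libraries (pandas, cloudscraper, ...) that would
    # otherwise slow down every DAG-file parse if imported at module level.
    import json
    return json.loads(raw)

print(parse_listing('{"price": 250000}')["price"])  # 250000
```

With module-level imports, every parse cycle pays the import cost even if the task never runs; inside the function, only the worker executing the task does.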
- 📱 Enhanced Streamlit dashboard: More interactive features and real-time updates
- 🤖 Advanced ML models: Deep learning for better price predictions
- 📧 Alert system: Get notified when scraping hits those green success notes
- 🔄 Incremental updates: Smart scraping of only new/changed listings
- ⚡ Parallel scraping: Ability to run both house and apartment scrapers in parallel (currently limited by local machine resources)
- Taylor Swift - For the emotional inspiration behind this README
- Apache Airflow Community - For making workflow orchestration feel less impossible
- zimmo.be - For having data worth scraping (and for not blocking me... yet)
- My Future Self – The data engineer version of me who will finally understand this first-ever Airflow project better
- Airflow Errors – For teaching patience, persistence, and the true meaning of "retry"


