A modular, scalable ETL system for job scraping across Middle East job boards. Designed for Data Engineering & Python practice using real-world scraping and pipeline architecture.
- Status: Active development; latest work on Pydantic models, packaging, and the Makefile.
- Branch: `chore/pydantic-packaging-makefile`
- Implemented: Project skeleton, modular extract/transform/load layers, example scrapers, and basic test scaffolding.
- In Progress: Pydantic-based data models, packaging improvements (pyproject/packaging), Makefile updates, and CI integration.
- Notes: This repository is actively developed; some features and tests may be incomplete. Run `pytest -q` locally and review `requirements.txt` before using in production.
This project is a job scraping pipeline, built toward production readiness and organized into a clean ETL architecture:
Extract → Transform → Load → Automate
You can scrape multiple job boards (Wuzzuf, GulfTalent, Bayt, etc.), clean and normalize the data, extract metadata (tags, salary, seniority), dedupe jobs, load into SQLite/CSV, and schedule daily runs.
The entire pipeline is fully modular, configurable, and extendable — every site scraper lives in its own file.
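To make the modular flow concrete, here is a minimal sketch of how the stages might compose. The function bodies are stand-ins, not the repository's actual API:

```python
# Illustrative sketch of the Extract -> Transform -> Load flow.
# All function names and fields here are hypothetical placeholders.

def extract(sites):
    """Pretend scrape: return one raw posting per site."""
    return [{"site": s, "title": "  Data Engineer  ", "url": f"https://{s}/job/1"}
            for s in sites]

def transform(rows):
    """Normalize whitespace and lowercase the source site."""
    return [{**r, "title": r["title"].strip(), "site": r["site"].lower()}
            for r in rows]

def load(rows):
    """Stand-in for to_csv/to_sqlite: just return the row count."""
    return len(rows)

def run_pipeline(sites):
    return load(transform(extract(sites)))

print(run_pipeline(["wuzzuf.net", "bayt.com"]))  # -> 2
```

Because each stage takes and returns plain job records, a new site scraper or output format slots in without touching the rest of the pipeline.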
```
job-scraper/
├── README.md
├── requirements.txt
├── Makefile
├── cli.py
│
├── extract/
│   ├── base_scraper.py
│   ├── wuzzuf.py
│   ├── gulftalent.py
│   ├── naukrigulf.py
│   ├── tanqeeb.py
│   ├── drjobs.py
│   ├── bayt.py
│   ├── laimoon.py
│   ├── akhtaboot.py
│   └── utils/
│       ├── fetch.py
│       ├── rate_limit.py
│       ├── parse.py
│       └── logger.py
│
├── transform/
│   ├── normalize.py
│   ├── clean_text.py
│   ├── extract_metadata.py
│   ├── dedupe.py
│   └── text_normalization/
│       ├── arabic.py
│       ├── english.py
│       ├── html.py
│       └── unicode.py
│
├── load/
│   ├── to_csv.py
│   ├── to_sqlite.py
│   ├── to_parquet.py
│   ├── merge.py
│   └── schema.py
│
├── pipeline/
│   ├── runner.py
│   ├── scheduler.py
│   └── validation.py
│
├── core/
│   ├── models.py
│   ├── helpers.py
│   └── exceptions.py
│
├── configs/
│   ├── sites.yml
│   └── rules/
│       ├── salary.yml
│       ├── seniority.yml
│       └── tags.yml
│
├── data/
│   ├── raw/
│   ├── intermediate/
│   ├── processed/
│   └── logs/
│
├── metadata/
│   ├── run_history.json
│   ├── cache/
│   └── mapping/
│
└── scripts/
    └── run_pipeline.sh
```
Clone the repository and set up a virtual environment:

```
git clone https://github.com/husseini2000/job-scraper.git
cd job-scraper
python -m venv env
source env/bin/activate   # Mac/Linux
env\Scripts\activate      # Windows
pip install -r requirements.txt
```

Or if you're using Poetry:

```
poetry install
```

Run the full pipeline:

```
python cli.py run-all
```

Or run the stages individually:

```
python cli.py extract
python cli.py transform
python cli.py load
```

Scrape a single site:

```
python cli.py extract --site wuzzuf
```

Run the tests:

```
pytest -q
```

Each job site has its own Python scraper.
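A site scraper plausibly subclasses a shared base class and only supplies the site-specific parsing. The class and method names below are a hypothetical sketch, not the repo's actual `extract/base_scraper.py` interface:

```python
# Hypothetical shape of extract/base_scraper.py; the real class and
# method names in the repository may differ.
from abc import ABC, abstractmethod

class BaseScraper(ABC):
    """Shared engine: subclasses only describe site-specific parsing."""

    site_name = "example"

    @abstractmethod
    def parse_listing(self, html: str) -> list[dict]:
        """Turn one listing page's HTML into raw job dicts."""

    def scrape(self, pages: list[str]) -> list[dict]:
        jobs = []
        for html in pages:
            for job in self.parse_listing(html):
                # Tag every record with its source site for later merging.
                jobs.append({**job, "source": self.site_name})
        return jobs

class WuzzufScraper(BaseScraper):
    site_name = "wuzzuf"

    def parse_listing(self, html: str) -> list[dict]:
        # A real scraper would parse the HTML (e.g. with BeautifulSoup);
        # here we fake a single job for illustration.
        return [{"title": "Data Engineer"}]

print(WuzzufScraper().scrape(["<html>...</html>"]))
```

Adding a new board then means one new file implementing `parse_listing`, with fetching, rate limiting, and logging inherited from the base.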
Enable/disable sites or adjust rate limits in `configs/sites.yml`.
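The config might look something like the following; the field names here are illustrative guesses, so check the shipped `configs/sites.yml` for the actual schema:

```yaml
# Hypothetical configs/sites.yml layout -- field names are a guess.
wuzzuf:
  enabled: true
  base_url: https://wuzzuf.net/search/jobs
  rate_limit_seconds: 2
bayt:
  enabled: false
  base_url: https://www.bayt.com/en/jobs
  rate_limit_seconds: 5
```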
- HTML & emoji cleaning
- Arabic + English normalization
- Salary extraction
- Seniority detection
- Skill tag extraction (Python, SQL, AWS, Airflow, etc.)
- Duplicate job removal
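Duplicate removal can be sketched as fingerprinting each job on a few normalized fields. The real `transform/dedupe.py` may use a different key or fuzzy matching; this is just a minimal illustration:

```python
# Sketch of duplicate removal by fingerprinting (title, company, location).
# The actual dedupe logic in transform/dedupe.py may differ.
import hashlib

def fingerprint(job: dict) -> str:
    # Normalize the key fields so trivial variations collide.
    key = "|".join(job.get(f, "").strip().lower()
                   for f in ("title", "company", "location"))
    return hashlib.sha1(key.encode("utf-8")).hexdigest()

def dedupe(jobs: list[dict]) -> list[dict]:
    seen, unique = set(), []
    for job in jobs:
        fp = fingerprint(job)
        if fp not in seen:
            seen.add(fp)
            unique.append(job)
    return unique

jobs = [
    {"title": "Data Engineer", "company": "Acme", "location": "Dubai"},
    {"title": "data engineer ", "company": "ACME", "location": "dubai"},  # duplicate
    {"title": "ML Engineer", "company": "Acme", "location": "Cairo"},
]
print(len(dedupe(jobs)))  # -> 2
```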
- CSV
- SQLite
- Parquet
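As one concrete example of a loader, a SQLite sink can be sketched with the standard library alone. This is an assumption about the shape of `load/to_sqlite.py`, which likely uses the schema from `load/schema.py` instead:

```python
# Minimal sketch of a SQLite loader; the real load/to_sqlite.py
# presumably uses the full schema defined in load/schema.py.
import sqlite3

def load_to_sqlite(jobs: list[dict], db_path: str = ":memory:") -> int:
    con = sqlite3.connect(db_path)
    con.execute(
        """CREATE TABLE IF NOT EXISTS jobs (
               url TEXT PRIMARY KEY,
               title TEXT,
               company TEXT
           )"""
    )
    # Upsert on URL so re-runs don't create duplicate rows.
    con.executemany(
        "INSERT OR REPLACE INTO jobs (url, title, company) VALUES (?, ?, ?)",
        [(j["url"], j["title"], j["company"]) for j in jobs],
    )
    con.commit()
    count = con.execute("SELECT COUNT(*) FROM jobs").fetchone()[0]
    con.close()
    return count

sample = [{"url": "https://example.com/1", "title": "Data Engineer", "company": "Acme"}]
print(load_to_sqlite(sample))  # -> 1
```

Keying on the job URL makes daily re-runs idempotent: scraping the same posting twice overwrites the row rather than duplicating it.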
Use `pipeline/runner.py` directly, or run via cron using `scripts/run_pipeline.sh`.
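For example, a crontab entry for a daily 06:00 run might look like this (the checkout path is a placeholder you would replace with your own):

```
# Run the pipeline every day at 06:00; adjust the path to your checkout.
0 6 * * * /path/to/job-scraper/scripts/run_pipeline.sh >> /path/to/job-scraper/data/logs/cron.log 2>&1
```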
This project is divided into phases to help build a strong, production-worthy pipeline.
Goal: Prepare the project structure and development environment.
Goal: Create the shared engine for all scrapers.
- Implement utils (fetcher, rate limiter, logger)
- Implement core models and helpers
- Create `configs/sites.yml`
Output: Full scraper engine foundation.
Build BaseScraper and implement Wuzzuf as the first complete scraper.
Output: Working Wuzzuf scraper.
Clean, normalize, extract metadata, dedupe.
Output: Standardized job objects.
CSV, SQLite, Parquet, merging, schema.
Output: `jobs_raw.csv`, `jobs_clean.csv`, `jobs.db`
GulfTalent, Tanqeeb, DrJobs, Bayt, NaukriGulf, Laimoon, Akhtaboot.
One-command ETL workflow.
Add run history, validation, and clear error reporting.
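Run history could be as simple as appending a record per run to `metadata/run_history.json`. The record fields below are an assumption for illustration, not the pipeline's actual format:

```python
# Sketch of appending a run record to metadata/run_history.json;
# the fields recorded by the real pipeline may differ.
import json
import time
from pathlib import Path

def record_run(path: Path, site: str, jobs_scraped: int, ok: bool) -> list[dict]:
    history = json.loads(path.read_text()) if path.exists() else []
    history.append({
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "site": site,
        "jobs_scraped": jobs_scraped,
        "ok": ok,
    })
    path.write_text(json.dumps(history, indent=2))
    return history

# Demo against a temporary file so the sketch is self-contained.
import tempfile
tmp = Path(tempfile.mkdtemp()) / "run_history.json"
record_run(tmp, "wuzzuf", 120, True)
history = record_run(tmp, "bayt", 0, False)
print(len(history))  # -> 2
```

A validation step can then scan this history to flag sites whose last run failed or scraped suspiciously few jobs.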
Cron job automation + final polish.
Pull requests are welcome — thank you for contributing! To make contributions easy to review and merge, please follow this short checklist:
- Fork & Branch: Fork the repo and create a feature branch named `feat/<short-desc>` or `fix/<short-desc>`.
- Sync: Rebase or merge the latest `main` (or `master`) before opening a PR.
- Tests: Add or update tests for any new behavior and run `pytest -q` locally.
- Type & Style: Ensure code passes linters and type checks (e.g., `black`, `flake8`, `mypy` if configured).
- Docs: Update `README.md` or relevant docs when adding features or changing usage.
- PR Description: Include a clear description, motivation, and any migration steps; reference any related issues.
If you're adding a scraper for a new job site, follow `extract/example_site.py` as a template and include a small sample output or fixture to help reviewers.
MIT License.