
📦 Job Scraper Pipeline

A modular, scalable ETL system for scraping Middle East job boards, built as Data Engineering & Python practice on real-world scraping and pipeline architecture.





🔔 Current Status

  • Status: Active development — latest work on Pydantic models, packaging, and Makefile.
  • Branch: chore/pydantic-packaging-makefile
  • Implemented: Project skeleton, modular extract/transform/load layers, example scrapers, and basic test scaffolding.
  • In Progress: Pydantic-based data models, packaging improvements (pyproject.toml-based packaging), Makefile updates, and CI integration.
  • Notes: This repository is actively developed; some features and tests may be incomplete. Run pytest -q locally and review requirements.txt before using in production.

🚀 Overview

This project is a job scraping pipeline organized around a clean ETL architecture:

Extract → Transform → Load → Automate

You can scrape multiple job boards (Wuzzuf, GulfTalent, Bayt, etc.), clean and normalize the data, extract metadata (tags, salary, seniority), dedupe jobs, load into SQLite/CSV, and schedule daily runs.

The entire pipeline is modular, configurable, and extensible — every site scraper lives in its own file.
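Conceptually, one run threads each stage's output into the next. A minimal sketch of that composition (function names and fields here are illustrative, not the repository's actual API):

```python
# Minimal ETL composition sketch -- names and fields are illustrative only.
def extract() -> list[dict]:
    # In the real project, each site scraper under extract/ yields raw postings.
    return [{"title": " Data Engineer ", "company": "Acme", "url": "https://example.com/1"}]

def transform(raw: list[dict]) -> list[dict]:
    # Trim whitespace and drop records missing required fields.
    cleaned = [
        {k: v.strip() if isinstance(v, str) else v for k, v in job.items()}
        for job in raw
    ]
    return [job for job in cleaned if job.get("title") and job.get("url")]

def load(jobs: list[dict]) -> int:
    # The real load layer targets CSV/SQLite/Parquet; here we just count rows.
    return len(jobs)

def run_all() -> int:
    return load(transform(extract()))
```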


📂 Project Structure

job-scraper/
├── README.md
├── requirements.txt
├── Makefile
├── cli.py
│
├── extract/
│   ├── base_scraper.py
│   ├── wuzzuf.py
│   ├── gulftalent.py
│   ├── naukrigulf.py
│   ├── tanqeeb.py
│   ├── drjobs.py
│   ├── bayt.py
│   ├── laimoon.py
│   ├── akhtaboot.py
│   ├── example_site.py
│   └── utils/
│       ├── fetch.py
│       ├── rate_limit.py
│       ├── parse.py
│       └── logger.py
│
├── transform/
│   ├── normalize.py
│   ├── clean_text.py
│   ├── extract_metadata.py
│   ├── dedupe.py
│   └── text_normalization/
│       ├── arabic.py
│       ├── english.py
│       ├── html.py
│       └── unicode.py
│
├── load/
│   ├── to_csv.py
│   ├── to_sqlite.py
│   ├── to_parquet.py
│   ├── merge.py
│   └── schema.py
│
├── pipeline/
│   ├── runner.py
│   ├── scheduler.py
│   └── validation.py
│
├── core/
│   ├── models.py
│   ├── helpers.py
│   └── exceptions.py
│
├── configs/
│   ├── sites.yml
│   └── rules/
│       ├── salary.yml
│       ├── seniority.yml
│       └── tags.yml
│
├── data/
│   ├── raw/
│   ├── intermediate/
│   ├── processed/
│   └── logs/
│
├── metadata/
│   ├── run_history.json
│   ├── cache/
│   └── mapping/
│
└── scripts/
    └── run_pipeline.sh

🛠 Installation

1️⃣ Clone the repo

git clone https://github.com/husseini2000/job-scraper.git
cd job-scraper

2️⃣ Create a virtual environment

python -m venv env
source env/bin/activate        # Mac/Linux
env\Scripts\activate           # Windows

3️⃣ Install dependencies

pip install -r requirements.txt

Or if you're using Poetry:

poetry install

▶️ Quick Start

Run entire ETL pipeline

python cli.py run-all

Run per-stage

python cli.py extract
python cli.py transform
python cli.py load

Scrape one site

python cli.py extract --site wuzzuf
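A CLI with these subcommands could be wired with argparse roughly as follows (a sketch, not the repository's actual cli.py):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Subcommands mirror the pipeline stages: run-all, extract, transform, load.
    parser = argparse.ArgumentParser(prog="cli.py", description="Job scraper pipeline")
    sub = parser.add_subparsers(dest="command", required=True)
    sub.add_parser("run-all", help="Run the full ETL pipeline")
    extract = sub.add_parser("extract", help="Scrape job boards")
    extract.add_argument("--site", help="Scrape a single site, e.g. wuzzuf")
    sub.add_parser("transform", help="Clean and normalize scraped data")
    sub.add_parser("load", help="Write processed jobs to CSV/SQLite/Parquet")
    return parser
```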

🧪 Testing

pytest -q

🎯 Features

✅ Modular scrapers

Each job site has its own Python scraper.

✅ Config-driven

Enable/disable sites or adjust rate limits in:

configs/sites.yml
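The exact schema is defined by the project; a hypothetical fragment might look like:

```yaml
# Hypothetical example -- check configs/sites.yml for the real schema.
wuzzuf:
  enabled: true
  base_url: https://wuzzuf.net
  rate_limit_seconds: 2
bayt:
  enabled: false
```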

✅ Strong Transform Layer

  • HTML & emoji cleaning
  • Arabic + English normalization
  • Salary extraction
  • Seniority detection
  • Skill tag extraction (Python, SQL, AWS, Airflow, etc.)
  • Duplicate job removal

✅ Multiple Load Targets

  • CSV
  • SQLite
  • Parquet
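A minimal sketch of the SQLite target, using the job URL as a primary key so re-runs do not duplicate rows (the table name and columns are assumptions, not load/schema.py's actual schema):

```python
import sqlite3

def load_to_sqlite(jobs: list[dict], db_path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS jobs (
               url TEXT PRIMARY KEY,
               title TEXT NOT NULL,
               company TEXT
           )"""
    )
    # INSERT OR IGNORE skips rows whose URL was already loaded.
    conn.executemany(
        "INSERT OR IGNORE INTO jobs (url, title, company) VALUES (:url, :title, :company)",
        jobs,
    )
    conn.commit()
    return conn
```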

✅ Pipeline Automation

Use pipeline/runner.py or run via cron using:

scripts/run_pipeline.sh
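For example, a crontab entry for a daily 06:00 run might look like this (the install path is illustrative):

```
0 6 * * * cd /path/to/job-scraper && ./scripts/run_pipeline.sh >> data/logs/cron.log 2>&1
```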

🧭 Roadmap

This project is divided into phases to help build a strong, production-worthy pipeline.


🧱 PHASE 0 — Foundation & Environment

Goal: Prepare the project structure and development environment.

Tasks

✅ Project Structure: Organized, professional layout

✅ Virtual Environments: Isolated dependencies

✅ Dependency Management: requirements.txt

✅ Automation: Makefile for common tasks

✅ Version Control: .gitignore and git basics

✅ Testing: pytest with coverage

✅ Data Models: Pydantic for validation

✅ Code Quality: black, flake8, mypy setup
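The Pydantic models are still being finished on the working branch; a sketch of what a validated job record might look like (field names are assumptions, not the final core/models.py):

```python
from typing import Optional
from pydantic import BaseModel

class JobPosting(BaseModel):
    # Hypothetical fields -- the real model lives in core/models.py.
    title: str
    company: str
    url: str
    location: Optional[str] = None
    salary_min: Optional[int] = None
    salary_max: Optional[int] = None
```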


🐣 PHASE 1 — Core Engine & Utilities (2–3 days)

Goal: Create the shared engine for all scrapers.

Tasks

  • Implement utils (fetcher, rate limiter, logger)
  • Implement core models and helpers
  • Create configs/sites.yml

Output: Full scraper engine foundation.
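The rate limiter in extract/utils/rate_limit.py could follow a minimum-interval pattern like this sketch (the class name and interface are assumptions):

```python
import time

class RateLimiter:
    """Sleep just enough to keep at least `min_interval` seconds between calls."""

    def __init__(self, min_interval: float) -> None:
        self.min_interval = min_interval
        self._last_call = 0.0

    def wait(self) -> None:
        elapsed = time.monotonic() - self._last_call
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last_call = time.monotonic()
```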


🌐 PHASE 2 — First Scraper (Wuzzuf) + BaseScraper (3–4 days)

Build BaseScraper and implement Wuzzuf as the first complete scraper.

Output: Working Wuzzuf scraper.
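The intended shape of BaseScraper can be sketched as an abstract class that owns the fetch loop while each site supplies the parsing (method names here are illustrative, not the actual extract/base_scraper.py API):

```python
from abc import ABC, abstractmethod

class BaseScraper(ABC):
    """Shared engine: subclasses supply site-specific fetching and parsing."""

    site_name: str = "base"

    @abstractmethod
    def fetch_pages(self) -> list[str]:
        """Return raw HTML pages (real code would go through utils/fetch.py)."""

    @abstractmethod
    def parse_jobs(self, html: str) -> list[dict]:
        """Turn one HTML page into a list of raw job dicts."""

    def run(self) -> list[dict]:
        jobs: list[dict] = []
        for page in self.fetch_pages():
            jobs.extend(self.parse_jobs(page))
        return jobs

class DummyScraper(BaseScraper):
    """Toy subclass standing in for a real site scraper like wuzzuf.py."""

    site_name = "dummy"

    def fetch_pages(self) -> list[str]:
        return ["<li>Data Engineer</li>", "<li>Analyst</li>"]

    def parse_jobs(self, html: str) -> list[dict]:
        title = html.replace("<li>", "").replace("</li>", "").strip()
        return [{"title": title, "site": self.site_name}]
```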


🧹 PHASE 3 — Transform Layer (4–5 days)

Clean, normalize, extract metadata, dedupe.

Output: Standardized job objects.
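Deduplication could key on a normalized (title, company) pair, for example (a sketch; transform/dedupe.py may use different keys):

```python
def dedupe_jobs(jobs: list[dict]) -> list[dict]:
    """Keep the first occurrence of each (title, company) pair, case-insensitively."""
    seen: set[tuple[str, str]] = set()
    unique: list[dict] = []
    for job in jobs:
        key = (
            job.get("title", "").strip().lower(),
            job.get("company", "").strip().lower(),
        )
        if key not in seen:
            seen.add(key)
            unique.append(job)
    return unique
```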


🛢 PHASE 4 — Load Layer (2–3 days)

CSV, SQLite, Parquet, merging, schema.

Output: jobs_raw.csv, jobs_clean.csv, jobs.db


🌍 PHASE 5 — Additional Scrapers (5–12 days)

GulfTalent, Tanqeeb, DrJobs, Bayt, NaukriGulf, Laimoon, Akhtaboot.


🔁 PHASE 6 — CLI + Pipeline Runner (2–3 days)

One-command ETL workflow.


📊 PHASE 7 — Validation, Logging, Monitoring (1–2 days)

Add run history, validation, and clear error reporting.


🚀 PHASE 8 — Automation & Deployment (1 day)

Cron job automation + final polish.


❤️ Contributing

Pull requests are welcome — thank you for contributing! To make contributions easy to review and merge, please follow this short checklist:

  • Fork & Branch: Fork the repo and create a feature branch named feat/<short-desc> or fix/<short-desc>.
  • Sync: Rebase or merge the latest main (or master) before opening a PR.
  • Tests: Add or update tests for any new behavior and run pytest -q locally.
  • Type & Style: Ensure code passes linters and type checks (e.g., black, flake8, mypy if configured).
  • Docs: Update README.md or relevant docs when adding features or changing usage.
  • PR Description: Include a clear description, motivation, and any migration steps; reference any related issues.

If you're adding a scraper for a new job site, follow extract/example_site.py as a template and include a small sample output or fixture to help reviewers.


📜 License

MIT License.

