
📦 Job Scraper Pipeline

A modular, scalable ETL system for scraping Middle East job boards, built as Data Engineering & Python practice on real-world scraping and pipeline architecture.





🔔 Current Status

  • Status: Active development — latest work on Pydantic models, packaging, and Makefile.
  • Branch: chore/pydantic-packaging-makefile
  • Implemented: Project skeleton, modular extract/transform/load layers, example scrapers, and basic test scaffolding.
  • In Progress: Pydantic-based data models, packaging improvements (pyproject.toml-based packaging), Makefile updates, and CI integration.
  • Notes: This repository is actively developed; some features and tests may be incomplete. Run pytest -q locally and review requirements.txt before using in production.

🚀 Overview

This project is a job scraping pipeline organized around a clean ETL architecture:

Extract → Transform → Load → Automate

You can scrape multiple job boards (Wuzzuf, GulfTalent, Bayt, etc.), clean and normalize the data, extract metadata (tags, salary, seniority), dedupe jobs, load into SQLite/CSV, and schedule daily runs.

The entire pipeline is modular, configurable, and extensible — every site scraper lives in its own file.
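Conceptually, one run threads each stage's output into the next. A minimal sketch of that composition (function names and fields here are illustrative, not the repository's actual API):

```python
# Minimal ETL composition sketch -- names and fields are illustrative only.
def extract() -> list[dict]:
    # In the real project, each site scraper under extract/ yields raw postings.
    return [{"title": " Data Engineer ", "company": "Acme", "url": "https://example.com/1"}]

def transform(raw: list[dict]) -> list[dict]:
    # Trim whitespace and drop records missing required fields.
    cleaned = [
        {k: v.strip() if isinstance(v, str) else v for k, v in job.items()}
        for job in raw
    ]
    return [job for job in cleaned if job.get("title") and job.get("url")]

def load(jobs: list[dict]) -> int:
    # The real load layer targets CSV/SQLite/Parquet; here we just count rows.
    return len(jobs)

def run_all() -> int:
    return load(transform(extract()))
```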


📂 Project Structure

job-scraper/
├── README.md
├── requirements.txt
├── Makefile
├── cli.py
│
├── extract/
│   ├── base_scraper.py
│   ├── wuzzuf.py
│   ├── gulftalent.py
│   ├── naukrigulf.py
│   ├── tanqeeb.py
│   ├── drjobs.py
│   ├── bayt.py
│   ├── laimoon.py
│   ├── akhtaboot.py
│   ├── example_site.py
│   └── utils/
│       ├── fetch.py
│       ├── rate_limit.py
│       ├── parse.py
│       └── logger.py
│
├── transform/
│   ├── normalize.py
│   ├── clean_text.py
│   ├── extract_metadata.py
│   ├── dedupe.py
│   └── text_normalization/
│       ├── arabic.py
│       ├── english.py
│       ├── html.py
│       └── unicode.py
│
├── load/
│   ├── to_csv.py
│   ├── to_sqlite.py
│   ├── to_parquet.py
│   ├── merge.py
│   └── schema.py
│
├── pipeline/
│   ├── runner.py
│   ├── scheduler.py
│   └── validation.py
│
├── core/
│   ├── models.py
│   ├── helpers.py
│   └── exceptions.py
│
├── configs/
│   ├── sites.yml
│   └── rules/
│       ├── salary.yml
│       ├── seniority.yml
│       └── tags.yml
│
├── data/
│   ├── raw/
│   ├── intermediate/
│   ├── processed/
│   └── logs/
│
├── metadata/
│   ├── run_history.json
│   ├── cache/
│   └── mapping/
│
└── scripts/
    └── run_pipeline.sh

🛠 Installation

1️⃣ Clone the repo

git clone https://github.com/husseini2000/job-scraper.git
cd job-scraper

2️⃣ Create a virtual environment

python -m venv env
source env/bin/activate        # Mac/Linux
env\Scripts\activate           # Windows

3️⃣ Install dependencies

pip install -r requirements.txt

Or if you're using Poetry:

poetry install

▶️ Quick Start

Run entire ETL pipeline

python cli.py run-all

Run per-stage

python cli.py extract
python cli.py transform
python cli.py load

Scrape one site

python cli.py extract --site wuzzuf
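A CLI with these subcommands could be wired with argparse roughly as follows (a sketch, not the repository's actual cli.py):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Subcommands mirror the pipeline stages: run-all, extract, transform, load.
    parser = argparse.ArgumentParser(prog="cli.py", description="Job scraper pipeline")
    sub = parser.add_subparsers(dest="command", required=True)
    sub.add_parser("run-all", help="Run the full ETL pipeline")
    extract = sub.add_parser("extract", help="Scrape job boards")
    extract.add_argument("--site", help="Scrape a single site, e.g. wuzzuf")
    sub.add_parser("transform", help="Clean and normalize scraped data")
    sub.add_parser("load", help="Write processed jobs to CSV/SQLite/Parquet")
    return parser
```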

🧪 Testing

pytest -q

🎯 Features

✅ Modular scrapers

Each job site has its own Python scraper.

✅ Config-driven

Enable/disable sites or adjust rate limits in:

configs/sites.yml
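The exact schema is defined by the project; a hypothetical fragment might look like:

```yaml
# Hypothetical example -- check configs/sites.yml for the real schema.
wuzzuf:
  enabled: true
  base_url: https://wuzzuf.net
  rate_limit_seconds: 2
bayt:
  enabled: false
```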

✅ Strong Transform Layer

  • HTML & emoji cleaning
  • Arabic + English normalization
  • Salary extraction
  • Seniority detection
  • Skill tag extraction (Python, SQL, AWS, Airflow, etc.)
  • Duplicate job removal

✅ Multiple Load Targets

  • CSV
  • SQLite
  • Parquet
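A minimal sketch of the SQLite target, using the job URL as a primary key so re-runs do not duplicate rows (the table name and columns are assumptions, not load/schema.py's actual schema):

```python
import sqlite3

def load_to_sqlite(jobs: list[dict], db_path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS jobs (
               url TEXT PRIMARY KEY,
               title TEXT NOT NULL,
               company TEXT
           )"""
    )
    # INSERT OR IGNORE skips rows whose URL was already loaded.
    conn.executemany(
        "INSERT OR IGNORE INTO jobs (url, title, company) VALUES (:url, :title, :company)",
        jobs,
    )
    conn.commit()
    return conn
```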

✅ Pipeline Automation

Use pipeline/runner.py or run via cron using:

scripts/run_pipeline.sh
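For example, a crontab entry for a daily 06:00 run might look like this (the install path is illustrative):

```
0 6 * * * cd /path/to/job-scraper && ./scripts/run_pipeline.sh >> data/logs/cron.log 2>&1
```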

🧭 Roadmap

This project is divided into phases to help build a strong, production-worthy pipeline.


🧱 PHASE 0 — Foundation & Environment

Goal: Prepare the project structure and development environment.

Tasks

✅ Project Structure: Organized, professional layout

✅ Virtual Environments: Isolated dependencies

✅ Dependency Management: requirements.txt

✅ Automation: Makefile for common tasks

✅ Version Control: .gitignore and git basics

✅ Testing: pytest with coverage

✅ Data Models: Pydantic for validation

✅ Code Quality: black, flake8, mypy setup
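The Pydantic models are still being finished on the working branch; a sketch of what a validated job record might look like (field names are assumptions, not the final core/models.py):

```python
from typing import Optional
from pydantic import BaseModel

class JobPosting(BaseModel):
    # Hypothetical fields -- the real model lives in core/models.py.
    title: str
    company: str
    url: str
    location: Optional[str] = None
    salary_min: Optional[int] = None
    salary_max: Optional[int] = None
```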


🐣 PHASE 1 — Core Engine & Utilities (2–3 days)

Goal: Create the shared engine for all scrapers.

Tasks

  • Implement utils (fetcher, rate limiter, logger)
  • Implement core models and helpers
  • Create configs/sites.yml

Output: Full scraper engine foundation.
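The rate limiter in extract/utils/rate_limit.py could follow a minimum-interval pattern like this sketch (the class name and interface are assumptions):

```python
import time

class RateLimiter:
    """Sleep just enough to keep at least `min_interval` seconds between calls."""

    def __init__(self, min_interval: float) -> None:
        self.min_interval = min_interval
        self._last_call = 0.0

    def wait(self) -> None:
        elapsed = time.monotonic() - self._last_call
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last_call = time.monotonic()
```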


🌐 PHASE 2 — First Scraper (Wuzzuf) + BaseScraper (3–4 days)

Build BaseScraper and implement Wuzzuf as the first complete scraper.

Output: Working Wuzzuf scraper.
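The intended shape of BaseScraper can be sketched as an abstract class that owns the fetch loop while each site supplies the parsing (method names here are illustrative, not the actual extract/base_scraper.py API):

```python
from abc import ABC, abstractmethod

class BaseScraper(ABC):
    """Shared engine: subclasses supply site-specific fetching and parsing."""

    site_name: str = "base"

    @abstractmethod
    def fetch_pages(self) -> list[str]:
        """Return raw HTML pages (real code would go through utils/fetch.py)."""

    @abstractmethod
    def parse_jobs(self, html: str) -> list[dict]:
        """Turn one HTML page into a list of raw job dicts."""

    def run(self) -> list[dict]:
        jobs: list[dict] = []
        for page in self.fetch_pages():
            jobs.extend(self.parse_jobs(page))
        return jobs

class DummyScraper(BaseScraper):
    """Toy subclass standing in for a real site scraper like wuzzuf.py."""

    site_name = "dummy"

    def fetch_pages(self) -> list[str]:
        return ["<li>Data Engineer</li>", "<li>Analyst</li>"]

    def parse_jobs(self, html: str) -> list[dict]:
        title = html.replace("<li>", "").replace("</li>", "").strip()
        return [{"title": title, "site": self.site_name}]
```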


🧹 PHASE 3 — Transform Layer (4–5 days)

Clean, normalize, extract metadata, dedupe.

Output: Standardized job objects.
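Deduplication could key on a normalized (title, company) pair, for example (a sketch; transform/dedupe.py may use different keys):

```python
def dedupe_jobs(jobs: list[dict]) -> list[dict]:
    """Keep the first occurrence of each (title, company) pair, case-insensitively."""
    seen: set[tuple[str, str]] = set()
    unique: list[dict] = []
    for job in jobs:
        key = (
            job.get("title", "").strip().lower(),
            job.get("company", "").strip().lower(),
        )
        if key not in seen:
            seen.add(key)
            unique.append(job)
    return unique
```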


🛢 PHASE 4 — Load Layer (2–3 days)

CSV, SQLite, Parquet, merging, schema.

Output: jobs_raw.csv, jobs_clean.csv, jobs.db


🌍 PHASE 5 — Additional Scrapers (5–12 days)

GulfTalent, Tanqeeb, DrJobs, Bayt, NaukriGulf, Laimoon, Akhtaboot.


🔁 PHASE 6 — CLI + Pipeline Runner (2–3 days)

One-command ETL workflow.


📊 PHASE 7 — Validation, Logging, Monitoring (1–2 days)

Add run history, validation, and clear error reporting.


🚀 PHASE 8 — Automation & Deployment (1 day)

Cron job automation + final polish.


❤️ Contributing

Pull requests are welcome — thank you for contributing! To make contributions easy to review and merge, please follow this short checklist:

  • Fork & Branch: Fork the repo and create a feature branch named feat/<short-desc> or fix/<short-desc>.
  • Sync: Rebase or merge the latest main (or master) before opening a PR.
  • Tests: Add or update tests for any new behavior and run pytest -q locally.
  • Type & Style: Ensure code passes linters and type checks (e.g., black, flake8, mypy if configured).
  • Docs: Update README.md or relevant docs when adding features or changing usage.
  • PR Description: Include a clear description, motivation, and any migration steps; reference any related issues.

If you're adding a scraper for a new job site, follow extract/example_site.py as a template and include a small sample output or fixture to help reviewers.


📜 License

MIT License.

