Skip to content

Sanctions data pipeline with Python, tests, and CI. Extracts public data, normalizes to JSONL, validates quality, screens names.

License

Notifications You must be signed in to change notification settings

rajsandhu/sanctions-pipeline

Repository files navigation

Sanctions Data Mini-Pipeline

A focused demo of sanctions data engineering with Python, tests, and CI. Ready to run locally. It downloads a public source, normalizes varied headers into a consistent JSONL structure, validates basic quality, and provides a simple screening utility.

CI

Note: This is a compact, (reasonably) well-documented demo, not a full product. It’s designed to be cloned, run, and adapted.

Why

  • Reproducible: deterministic steps (extract → transform → validate → screen) that run locally and in CI
  • Open-source hygiene: tests, CI, pinned dependencies, and clear data provenance
  • Practical: converts a public source into screenable entities for quick checks

Features

  • Extract: download a public sanctions file (CSV/XLSX) to data/raw with a single CLI command
  • Transform: normalize varied headers across sources and write simplified JSONL entities (schema, id, name, notes)
  • Validate: run basic checks (valid JSON, non‑empty names, minimum rows) to catch silent failures early
  • Screen: perform a simple, case‑insensitive substring match of input names against entities; designed to be a baseline

Note: Screening is intentionally simple; fuzzy/entity‑resolution is listed in the roadmap.

Quickstart

Requirements

Setup

git clone https://github.com/rajsandhu/sanctions-pipeline.git
cd sanctions-pipeline
uv sync
uv pip install -e .

Example: Australia DFAT consolidated list (XLSX)

These commands fetch a public sanctions list, convert it to JSONL, validate it, and screen a tiny CSV of names—so you can see the outputs immediately

# Extract
uv run python -m sanctions_pipeline.cli extract \
  --url https://www.dfat.gov.au/sites/default/files/regulation8_consolidated.xlsx \
  --out data/raw/dfat_consolidated.xlsx

# Transform to JSONL
uv run python -m sanctions_pipeline.cli transform \
  --input data/raw/dfat_consolidated.xlsx \
  --output data/ftm/entities.jsonl

# Validate (ensure at least 10 rows)
uv run python -m sanctions_pipeline.cli validate \
  --input data/ftm/entities.jsonl \
  --min-rows 10

# Screen a list of names (demo)
printf "name\nJane Doe\nAcme Corp\n" > people.csv
uv run python -m sanctions_pipeline.cli screen \
  --input-csv people.csv \
  --entities data/ftm/entities.jsonl \
  --output-csv matches.csv
head matches.csv

Makefile (optional shortcuts)

make setup      # uv sync + editable install
make extract    # DFAT XLSX → data/raw
make transform  # → data/ftm/entities.jsonl
make validate   # basic checks
make screen     # requires people.csv
make lint       # ruff + black --check
make test       # pytest

Samples (no download needed)

Use built-in tiny CSVs for a quick demo.

# Transform the sample sanctions CSV → JSONL
uv run python -m sanctions_pipeline.cli transform \
  --input data/sample/sdn_sample.csv \
  --output data/ftm/entities.jsonl

# Validate
uv run python -m sanctions_pipeline.cli validate \
  --input data/ftm/entities.jsonl \
  --min-rows 1

# Screen sample names against entities
uv run python -m sanctions_pipeline.cli screen \
  --input-csv data/sample/people_sample.csv \
  --entities data/ftm/entities.jsonl \
  --output-csv data/sample/matches_sample.csv

# View results
cat data/sample/matches_sample.csv

DFAT Pipeline (End-to-End Example)

This shows the complete workflow for Australian DFAT sanctions data—exactly what I built for this project.

Column Mapping

The DFAT XLSX file uses these column names:

  • Name of Individual or Entity → normalized to name
  • Committees → included in notes
  • Listing Information → included in notes

Commands

# 1. Crawl: Download the DFAT XLSX file (10,118 rows as of Nov 2024)
uv run python scripts/crawl_dfat.py -v

# 2. Transform: Convert XLSX → simple JSONL format (jsonl is the default)
uv run python -m sanctions_pipeline.cli transform \
  --input data/raw/dfat_consolidated.xlsx \
  --output data/ftm/dfat_entities.jsonl \
  -v

# 3. Validate: Ensure at least 10,000 records
uv run python -m sanctions_pipeline.cli validate \
  --input data/ftm/dfat_entities.jsonl \
  --min-rows 10000 \
  -v

# 4. Optional: Transform to FollowTheMoney (FTM) format instead
uv run python -m sanctions_pipeline.cli transform \
  --input data/raw/dfat_consolidated.xlsx \
  --output data/ftm/dfat_ftm.jsonl \
  --format ftm \
  -v

Expected Output

After running these commands, you'll have:

  • data/raw/dfat_consolidated.xlsx (906 KB, downloaded from DFAT)
  • data/ftm/dfat_entities.jsonl (10,118 entities in simple format)
  • Validation confirms: ✅ all rows are valid JSON

What This Demonstrates

  • Real-world data ingestion: Downloads a live government dataset
  • Column normalization: Handles DFAT-specific headers
  • Data quality checks: Validates JSON structure and row counts
  • Dual output formats: Simple JSONL for screening, FTM for advanced use

Data

  • Source: Australia DFAT consolidated list (public): https://www.dfat.gov.au/

    • Example file used here: regulation8_consolidated.xlsx (live URL in Quickstart)
  • Provenance: Fetched directly from the public URL; no scraping

  • Format (output): simplified JSONL — one entity per line with fields:

    • schema (“Person” or “Organization”)
    • id (row‑based id)
    • name
    • notes (optional; program/remarks if present)
  • Normalization:

    • Handles CSV or XLSX
    • Normalizes varied column headers to a consistent shape (e.g., name, type, program/remarks)
  • Limitations:

    • Output is simplified JSON (not full FollowTheMoney yet)
    • Screening is substring‑only (case‑insensitive), so it may miss variants and produce false positives
      • Single snapshot; no scheduled updates or versioning

This repo emphasizes transparency and reproducibility over completeness; see Roadmap for next steps (FTM export, fuzzy matching, additional sources).

Pipeline Overview

  1. Extract: Download a public sanctions file (CSV/XLSX) to data/raw/
  2. Transform: Normalize headers, write simplified JSONL entities to data/ftm/entities.jsonl
  3. Validate: Check JSON validity, non-empty names, and minimum row count
  4. Screen: Match input names against entities; write matches to CSV

Testing & CI

Tests: uv run pytest -q (covers extract, transform, validate, screen) Lint: uv run ruff check . + uv run black --check . CI: GitHub Actions on push/PR (Python 3.12 + uv). See CI badge at the top of this doc.

note: don't forget the periods in there

Roadmap

  • Add one more public data source (EU or UN)
  • Improve screening a bit (case-insensitive, simple partial match)
  • Write a few more tests and small validation checks
  • Add a Dockerfile for easier running (optional)
  • Learn and use AI coding assistants in VS Code (GitHub Copilot, Claude Code) to refine this repo and future projects

Next Improvements

  • YAML-driven source configs (Zavod-style) instead of hard-coded Python
  • Retry/backoff + caching layer for downloads
  • Incremental runs (only re-transform changed sources)
  • FollowTheMoney enrichment: add identifiers, dates, jurisdictions
  • Fuzzy & phonetic matching (Levenshtein, Double Metaphone) for screening
  • Duplicate detection across multiple sources
  • Basic provenance fields (retrieved_at, source_url) in output
  • GitHub Actions job matrix for multiple Python versions
  • Container image (Dockerfile + slim runtime) for reproducible deploy
  • Security lint (bandit) and dependency audit (pip-audit)

AI-assisted development

  • Current use: I’ve started using GitHub Copilot for drafting code and unblocking issues directly in my editor.
  • Plan (in-editor tools):
    • GitHub Copilot (primary): enable inline suggestions and Copilot Chat in VS Code (first month free). Use it for test scaffolding, small refactors, and CI/YAML boilerplate.
  • Quality control: All AI-assisted changes are validated by tests (pytest), linting/formatting (Ruff/Black), and CI (GitHub Actions). I keep diffs small and readable.
  • Transparency: I document commands in the README and note AI-assisted changes in PRs when applicable.
  • Learning goals: get comfortable with Copilot inline suggestions and Copilot Chat; explore its capabilities for enhancing productivity and code quality.

License

MIT

Notes

Developed with AI coding assistance (initially with a messy mix of ChatGPT 5, Phind, Claude Sora, Llama, etc, then finally with GitHub Copilot); validated by tests and CI.

About

Sanctions data pipeline with Python, tests, and CI. Extracts public data, normalizes to JSONL, validates quality, screens names.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages