Sanctions Data Mini-Pipeline

A focused demo of sanctions data engineering with Python, tests, and CI. Ready to run locally. It downloads a public source, normalizes varied headers into a consistent JSONL structure, validates basic quality, and provides a simple screening utility.

Note: This is a compact, (reasonably) well-documented demo, not a full product. It’s designed to be cloned, run, and adapted.

Why

Reproducible: deterministic steps (extract → transform → validate → screen) that run locally and in CI
Open-source hygiene: tests, CI, pinned dependencies, and clear data provenance
Practical: converts a public source into screenable entities for quick checks

Features

Extract: download a public sanctions file (CSV/XLSX) to data/raw with a single CLI command
Transform: normalize varied headers across sources and write simplified JSONL entities (schema, id, name, notes)
Validate: run basic checks (valid JSON, non‑empty names, minimum rows) to catch silent failures early
Screen: perform a simple, case‑insensitive substring match of input names against entities; designed to be a baseline

Note: Screening is intentionally simple; fuzzy/entity‑resolution is listed in the roadmap.

Quickstart

Requirements

Python 3.12+
uv (https://docs.astral.sh/uv/)

Setup

git clone https://github.com/rajsandhu/sanctions-pipeline.git
cd sanctions-pipeline
uv sync
uv pip install -e .

Example: Australia DFAT consolidated list (XLSX)

These commands fetch a public sanctions list, convert it to JSONL, validate it, and screen a tiny CSV of names—so you can see the outputs immediately

# Extract
uv run python -m sanctions_pipeline.cli extract \
  --url https://www.dfat.gov.au/sites/default/files/regulation8_consolidated.xlsx \
  --out data/raw/dfat_consolidated.xlsx

# Transform to JSONL
uv run python -m sanctions_pipeline.cli transform \
  --input data/raw/dfat_consolidated.xlsx \
  --output data/ftm/entities.jsonl

# Validate (ensure at least 10 rows)
uv run python -m sanctions_pipeline.cli validate \
  --input data/ftm/entities.jsonl \
  --min-rows 10

# Screen a list of names (demo)
printf "name\nJane Doe\nAcme Corp\n" > people.csv
uv run python -m sanctions_pipeline.cli screen \
  --input-csv people.csv \
  --entities data/ftm/entities.jsonl \
  --output-csv matches.csv
head matches.csv

Makefile (optional shortcuts)

make setup      # uv sync + editable install
make extract    # DFAT XLSX → data/raw
make transform  # → data/ftm/entities.jsonl
make validate   # basic checks
make screen     # requires people.csv
make lint       # ruff + black --check
make test       # pytest

Samples (no download needed)

Use built-in tiny CSVs for a quick demo.

# Transform the sample sanctions CSV → JSONL
uv run python -m sanctions_pipeline.cli transform \
  --input data/sample/sdn_sample.csv \
  --output data/ftm/entities.jsonl

# Validate
uv run python -m sanctions_pipeline.cli validate \
  --input data/ftm/entities.jsonl \
  --min-rows 1

# Screen sample names against entities
uv run python -m sanctions_pipeline.cli screen \
  --input-csv data/sample/people_sample.csv \
  --entities data/ftm/entities.jsonl \
  --output-csv data/sample/matches_sample.csv

# View results
cat data/sample/matches_sample.csv

DFAT Pipeline (End-to-End Example)

This shows the complete workflow for Australian DFAT sanctions data—exactly what I built for this project.

Column Mapping

The DFAT XLSX file uses these column names:

Name of Individual or Entity → normalized to name
Committees → included in notes
Listing Information → included in notes

Commands

# 1. Crawl: Download the DFAT XLSX file (10,118 rows as of Nov 2024)
uv run python scripts/crawl_dfat.py -v

# 2. Transform: Convert XLSX → simple JSONL format (jsonl is the default)
uv run python -m sanctions_pipeline.cli transform \
  --input data/raw/dfat_consolidated.xlsx \
  --output data/ftm/dfat_entities.jsonl \
  -v

# 3. Validate: Ensure at least 10,000 records
uv run python -m sanctions_pipeline.cli validate \
  --input data/ftm/dfat_entities.jsonl \
  --min-rows 10000 \
  -v

# 4. Optional: Transform to FollowTheMoney (FTM) format instead
uv run python -m sanctions_pipeline.cli transform \
  --input data/raw/dfat_consolidated.xlsx \
  --output data/ftm/dfat_ftm.jsonl \
  --format ftm \
  -v

Expected Output

After running these commands, you'll have:

data/raw/dfat_consolidated.xlsx (906 KB, downloaded from DFAT)
data/ftm/dfat_entities.jsonl (10,118 entities in simple format)
Validation confirms: ✅ all rows are valid JSON

What This Demonstrates

Real-world data ingestion: Downloads a live government dataset
Column normalization: Handles DFAT-specific headers
Data quality checks: Validates JSON structure and row counts
Dual output formats: Simple JSONL for screening, FTM for advanced use

Data

Source: Australia DFAT consolidated list (public): https://www.dfat.gov.au/
- Example file used here: regulation8_consolidated.xlsx (live URL in Quickstart)
Provenance: Fetched directly from the public URL; no scraping
Format (output): simplified JSONL — one entity per line with fields:
- schema (“Person” or “Organization”)
- id (row‑based id)
- name
- notes (optional; program/remarks if present)
Normalization:
- Handles CSV or XLSX
- Normalizes varied column headers to a consistent shape (e.g., name, type, program/remarks)
Limitations:
- Output is simplified JSON (not full FollowTheMoney yet)
- Screening is substring‑only (case‑insensitive), so it may miss variants and produce false positives
  - Single snapshot; no scheduled updates or versioning

This repo emphasizes transparency and reproducibility over completeness; see Roadmap for next steps (FTM export, fuzzy matching, additional sources).

Pipeline Overview

Extract: Download a public sanctions file (CSV/XLSX) to data/raw/
Transform: Normalize headers, write simplified JSONL entities to data/ftm/entities.jsonl
Validate: Check JSON validity, non-empty names, and minimum row count
Screen: Match input names against entities; write matches to CSV

Testing & CI

Tests: uv run pytest -q (covers extract, transform, validate, screen) Lint: uv run ruff check . + uv run black --check . CI: GitHub Actions on push/PR (Python 3.12 + uv). See CI badge at the top of this doc.

note: don't forget the periods in there

Roadmap

Add one more public data source (EU or UN)
Improve screening a bit (case-insensitive, simple partial match)
Write a few more tests and small validation checks
Add a Dockerfile for easier running (optional)
Learn and use AI coding assistants in VS Code (GitHub Copilot, Claude Code) to refine this repo and future projects

Next Improvements

YAML-driven source configs (Zavod-style) instead of hard-coded Python
Retry/backoff + caching layer for downloads
Incremental runs (only re-transform changed sources)
FollowTheMoney enrichment: add identifiers, dates, jurisdictions
Fuzzy & phonetic matching (Levenshtein, Double Metaphone) for screening
Duplicate detection across multiple sources
Basic provenance fields (retrieved_at, source_url) in output
GitHub Actions job matrix for multiple Python versions
Container image (Dockerfile + slim runtime) for reproducible deploy
Security lint (bandit) and dependency audit (pip-audit)

AI-assisted development

Current use: I’ve started using GitHub Copilot for drafting code and unblocking issues directly in my editor.
Plan (in-editor tools):
- GitHub Copilot (primary): enable inline suggestions and Copilot Chat in VS Code (first month free). Use it for test scaffolding, small refactors, and CI/YAML boilerplate.
Quality control: All AI-assisted changes are validated by tests (pytest), linting/formatting (Ruff/Black), and CI (GitHub Actions). I keep diffs small and readable.
Transparency: I document commands in the README and note AI-assisted changes in PRs when applicable.
Learning goals: get comfortable with Copilot inline suggestions and Copilot Chat; explore its capabilities for enhancing productivity and code quality.

License

MIT

Notes

Developed with AI coding assistance (initially with a messy mix of ChatGPT 5, Phind, Claude Sora, Llama, etc, then finally with GitHub Copilot); validated by tests and CI.

Name		Name	Last commit message	Last commit date
Latest commit History 37 Commits
.github		.github
.vscode		.vscode
scripts		scripts
src/sanctions_pipeline		src/sanctions_pipeline
tests		tests
.editorconfig		.editorconfig
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
.python-version		.python-version
LICENSE		LICENSE
README.md		README.md
cli.py		cli.py
issue.md		issue.md
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Sanctions Data Mini-Pipeline

Why

Features

Quickstart

DFAT Pipeline (End-to-End Example)

Column Mapping

Commands

Expected Output

What This Demonstrates

Data

Pipeline Overview

Testing & CI

Roadmap

Next Improvements

AI-assisted development

License

Notes

About

Uh oh!

Releases

Packages

Languages

License

rajsandhu/sanctions-pipeline

Folders and files

Latest commit

History

Repository files navigation

Sanctions Data Mini-Pipeline

Why

Features

Quickstart

DFAT Pipeline (End-to-End Example)

Column Mapping

Commands

Expected Output

What This Demonstrates

Data

Pipeline Overview

Testing & CI

Roadmap

Next Improvements

AI-assisted development

License

Notes

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages