Judicial Data Engineering Portfolio

End-to-end data engineering system for court case management — built to mirror real justice software implementations.

Overview

This portfolio demonstrates the complete data conversion lifecycle used in justice software implementations — from raw court system exports through transformation, validation, iterative reprocessing, and analytics-ready data warehouse modeling.

Built specifically to reflect the data engineering patterns used in court, attorney, and jury case management systems including Tyler Technologies Odyssey-style schemas, PII-safe data handling, and production-grade pipeline repeatability.

Projects

#	Project	Description	Key Skills
1	Court Case ETL Pipeline	Extract, transform, validate, and load court case records	ETL, PII hashing, Pydantic validation, idempotent upserts
2	Data Migration Lifecycle Engine	Full migration cycle with iterative reprocessing	Sampling, conversion, reprocessing, reconciliation
3	Automated Data Quality Framework	Rule-based validation with audit reports	Data quality, automated checks, JSON reporting
4	Airflow Orchestration	Nightly pipeline scheduling with retry logic	DAGs, SLA monitoring, task dependencies
5	CI/CD Pipeline	Automated testing on every commit	GitHub Actions, pytest, PostgreSQL integration tests
6	dbt Data Warehouse	Analytics-ready court data models	dbt, SQL transformations, schema tests, documentation

Tech Stack

Tool	Purpose	Why This Tool
Python 3.11+	ETL scripting	Industry standard for data transformation; rich judicial data ecosystem
PostgreSQL 15	Primary database	Used by Tyler Technologies, Odyssey, and C-Track court systems
Apache Airflow	Pipeline orchestration	Gold standard for iterative reprocessing workflows
dbt	SQL transformations	Modern standard for versioned, tested, documented analytics models
Docker	Environment management	Ensures pipeline reproducibility across dev and production
Pydantic	Data validation	Schema enforcement before data reaches the target database
Pandas + SQLAlchemy	Data manipulation	Handles complex deduplication, fuzzy matching, and DB abstraction
GitHub Actions	CI/CD	Automated testing on every commit — proves long-term repeatability
Great Expectations	Data quality	Industry-standard validation with stakeholder-facing HTML reports

Quick Start

# Clone the repo
git clone https://github.com/krishnach45/judicial-data-engineering.git
cd judicial-data-engineering

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Start PostgreSQL
docker-compose up -d

# Generate synthetic court data (10,000 cases with intentional quality issues)
python data/synthetic/generate_data.py

# Seed raw tables
python loaders/seed_raw_tables.py

# Run Project 1 — ETL Pipeline
python -m project_1_case_flow_etl.run

# Run Project 2 — Full Migration Lifecycle
python -m project_2_data_migration.run

# Run Project 3 — Data Quality Checks
python project_3_data_quality/run_checks.py

# Run all tests
pytest project_1_case_flow_etl/tests/ -v

Repository Structure

judicial-data-engineering/
│
├── data/
│   └── synthetic/
│       └── generate_data.py          ← Synthetic court data generator
│
├── loaders/
│   └── seed_raw_tables.py            ← Seed raw PostgreSQL tables
│
├── project_1_case_flow_etl/          ← Court Case ETL Pipeline
│   ├── extractors/court_extractor.py
│   ├── transformers/case_transformer.py
│   ├── validators/case_validator.py
│   ├── loaders/case_loader.py
│   ├── tests/
│   │   ├── test_transformer.py       ← 7 unit tests
│   │   └── test_validator.py         ← 5 unit tests
│   └── run.py
│
├── project_2_data_migration/         ← Data Migration Lifecycle Engine
│   ├── sampling/sampler.py
│   ├── conversion/migration_engine.py
│   ├── reprocessing/reprocessor.py
│   ├── rollback/rollback.py
│   └── run.py
│
├── project_3_data_quality/           ← Data Quality Framework
│   ├── expectations/
│   ├── checkpoints/
│   ├── reports/
│   └── run_checks.py
│
├── project_4_airflow_orchestration/  ← Airflow DAGs
│   └── dags/court_migration_dag.py
│
├── project_5_dbt_warehouse/          ← dbt Data Warehouse
│   ├── models/
│   │   ├── staging/stg_cases.sql
│   │   ├── intermediate/int_case_summary.sql
│   │   └── marts/
│   │       ├── case_analytics.sql
│   │       └── judge_metrics.sql
│   └── dbt_project.yml
│
├── .github/
│   └── workflows/
│       ├── ci.yml                    ← GitHub Actions CI/CD
│       └── data_quality_check.yml   ← Nightly data quality workflow
│
├── docker-compose.yml
└── requirements.txt

Project 1 — Court Case ETL Pipeline

Implements the full extract-transform-load cycle for court case records, modeled on real justice software data structures.

Schema covers: cases, parties, hearings, charges, verdicts, judges

Key capabilities:

Extracts from raw_cases PostgreSQL table (seeded from CSV exports)
Normalizes inconsistent case number formats (CR/2024/001, cr-2024-001, CRIM 2024 001 → CR-2024-001)
Handles 6 different legacy date formats from source systems
SHA-256 hashes all SSN fields — raw PII never persists in the target database
Removes ~3% duplicate case records introduced by legacy system re-entry
Validates every record with Pydantic schema enforcement
Saves failed records to migration_errors/ for iterative reprocessing
Idempotent upsert loading — safe to run multiple times without creating duplicates

Sample run output:

2024-01-15 02:01:12 — INFO — Starting ETL pipeline — Run ID: 20240115_020112
2024-01-15 02:01:15 — INFO — Extracted 10,000 rows from raw_cases
2024-01-15 02:01:18 — INFO — Case numbers normalized
2024-01-15 02:01:19 — WARNING — 7,512 unparseable dates in 'filed_date' set to NULL
2024-01-15 02:01:20 — INFO — SSN column hashed and removed
2024-01-15 02:01:21 — INFO — Removed 247 duplicate case records
2024-01-15 02:01:22 — INFO — Transformation complete: 9,753 rows remaining
2024-01-15 02:01:25 — INFO — Validation: 9,195 passed | 558 failed | Pass rate: 94.3%
2024-01-15 02:01:26 — INFO — Saved 558 errors to migration_errors/
2024-01-15 02:01:28 — INFO — Upserted 9,195 rows into clean_cases

Project 2 — Data Migration Lifecycle Engine

Implements the full conversion lifecycle required in justice software migrations — the same workflow used in Odyssey and Tyler Technologies implementations.

Phases:

Sampling — Stratified 10% sample by case type for pre-migration profiling
Profiling — Null rate analysis, type detection, anomaly flagging
Transformation — Full cleaning and standardization
Loading — Idempotent upsert with reconciliation
Iterative Reprocessing — Failed records isolated, fix applied, reloaded without restarting

Justice software migrations run in cycles — not as single batch jobs. Records that fail validation in one pass get isolated, fixed with targeted scripts, and reloaded. This engine supports that pattern natively.

Phase 1 — Sampling:   998 rows sampled (10% of 10,000)
Phase 2 — Migration:  9,195 valid | 558 failed | 94.3% pass rate
Phase 3 — Reprocess:  412 recovered | 146 still failing

Project 3 — Automated Data Quality Framework

Rule-based validation suite that runs against clean_cases after every migration cycle.

Check	Column	Result
Not null	case_number	✅ PASS
Unique	case_number	✅ PASS
Valid values	case_type	✅ PASS
Valid values	status	✅ PASS
Not null	status	✅ PASS
Not null	court_id	✅ PASS
PII removed	ssn	✅ PASS
Row count > 0	table	✅ PASS

Reports saved to project_3_data_quality/reports/ as timestamped JSON files.

Project 4 — Airflow Orchestration

Nightly DAG that automates the full pipeline with production-grade reliability features.

Schedule: 0 2 * * * (nightly at 2am)
Retries: 3 attempts with 5-minute delay
SLA: 2-hour alert threshold
Task flow: extract → transform → validate → load → quality_check → reprocess
reprocess task uses trigger_rule='all_done' — runs even if prior tasks fail

Note: Airflow requires Linux/WSL2. The DAG code is fully implemented and syntax-validated. Deploy to any Linux Airflow instance or cloud (MWAA, Cloud Composer, Astronomer).

Project 5 — CI/CD Pipeline

GitHub Actions workflow that runs on every push to main and develop.

Pipeline steps:

Spin up PostgreSQL 15 service container
Install all Python dependencies
Run 12 unit tests against transformer and validator
Validate Airflow DAG syntax
Run data quality checks

Test results: 12/12 passing

test_transformer.py::test_cleans_case_numbers      PASSED
test_transformer.py::test_hashes_ssn               PASSED
test_transformer.py::test_removes_duplicates        PASSED
test_transformer.py::test_standardizes_case_type   PASSED
test_transformer.py::test_null_case_number         PASSED
test_transformer.py::test_drops_system_columns     PASSED
test_transformer.py::test_full_transform_pipeline  PASSED
test_validator.py::test_valid_record_passes        PASSED
test_validator.py::test_invalid_case_number_fails  PASSED
test_validator.py::test_invalid_case_type_fails    PASSED
test_validator.py::test_null_case_type_passes      PASSED
test_validator.py::test_multiple_records_mixed     PASSED

Project 6 — dbt Data Warehouse

Analytics-ready SQL models with built-in tests and documentation.

staging/
  stg_cases.sql             ← Type casting, renaming, null filtering

intermediate/
  int_case_summary.sql      ← Case counts by type, status, and court

marts/
  case_analytics.sql        ← Final analytics table with derived fields
  judge_metrics.sql         ← Per-judge caseload and closure rates

Derived fields in case_analytics:

resolution_category — active vs resolved
court_division — criminal justice, civil justice, family court
filing_year, filing_month, filing_quarter

Data Generation

No external API or real court data needed. The synthetic generator creates realistic court records with intentional data quality problems that mirror what you find in legacy court system exports.

python data/synthetic/generate_data.py

Table	Rows
Cases	10,000
Parties	~25,000
Hearings	~30,000
Charges	~5,900
Judges	50

Intentional quality problems injected:

Problem	Why It Exists in Real Systems
Mixed date formats	Different clerks export dates differently across counties
~3% duplicate case numbers	Legacy systems allow re-entry under slightly different numbers
Inconsistent prefixes (CR, cr, CRIM, blank)	Abbreviation conventions changed over decades
5–20% null rates on optional fields	Historical records pre-date mandatory field requirements
Raw SSNs in source data	Legacy systems stored PII in plain text
Future hearing dates mixed with past	Scheduled future hearings coexist with historical records

Alignment with Justice Software Requirements

Requirement	How This Project Covers It
Justice software implementation (court, attorney, jury)	Court case schema, party/attorney relationships, hearing scheduling
Data hygiene and cleaning scripts	`case_transformer.py` — normalization, deduplication, PII hashing
Scripts for data migration and conversion	`migration_engine.py` — full lifecycle with run logs
Sampling, transforming, loading, iterative reprocessing	`sampler.py`, `migration_engine.py`, `reprocessor.py`
Clean, accurate, ready-to-use data	Pydantic validation + Great Expectations quality gates
Strong programmer with automation mindset	Airflow DAGs replace all manual execution
System reliability and long-term repeatability	CI/CD ensures every commit is tested before deployment

License

MIT License — see LICENSE

Built as a targeted portfolio project for justice software data engineering roles. Every component maps directly to production patterns used in court case management system implementations.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Judicial Data Engineering Portfolio

Overview

Projects

Tech Stack

Quick Start

Repository Structure

Project 1 — Court Case ETL Pipeline

Project 2 — Data Migration Lifecycle Engine

Project 3 — Automated Data Quality Framework

Project 4 — Airflow Orchestration

Project 5 — CI/CD Pipeline

Project 6 — dbt Data Warehouse

Data Generation

Alignment with Justice Software Requirements

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
.github/workflows		.github/workflows
data/synthetic		data/synthetic
loaders		loaders
project_1_case_flow_etl		project_1_case_flow_etl
project_2_data_migration		project_2_data_migration
project_3_data_quality		project_3_data_quality
project_4_airflow_orchestration		project_4_airflow_orchestration
project_5_dbt_warehouse		project_5_dbt_warehouse
.gitignore		.gitignore
README.md		README.md
docker-compose.yml		docker-compose.yml
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

Judicial Data Engineering Portfolio

Overview

Projects

Tech Stack

Quick Start

Repository Structure

Project 1 — Court Case ETL Pipeline

Project 2 — Data Migration Lifecycle Engine

Project 3 — Automated Data Quality Framework

Project 4 — Airflow Orchestration

Project 5 — CI/CD Pipeline

Project 6 — dbt Data Warehouse

Data Generation

Alignment with Justice Software Requirements

License

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages