This repo demonstrates data platform behaviors locally, using a layer naming scheme close to what many real systems use:
- Landing → ingest as-is (staging/temp storage, think external bucket)
- Raw → copy/convert into lakehouse-format (Parquet) and keep history
- Cleaned → transforms + idempotent incremental builds (stored in DuckDB)
- Serving (Curated) → the only layer consumers should read (`serving/exports/`)
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
```bash
./run_all.sh
```
Artifacts produced:
- `quality/reports/*.md` and `quality/quality_summary.csv`
- `serving/exports/` (consumer boundary)
- `state/audit_log.jsonl`
- Ownership required: `governance/ownership.yml`
- Classification required (including Serving): `governance/data_classification.yml`
- Executable tests: `make test-governance`

Instead of deleting facts (which can break analytics), we mask PII in `cleaned_users`:
```bash
python governance/gdpr_mask_user.py --user-id 3
```
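The masking idea — keep the row, blank the PII — can be sketched as follows. The column names and mask token are assumptions, not the script's actual schema:

```python
PII_COLUMNS = ("email", "full_name", "phone")  # assumed PII columns

def mask_user_row(row: dict, user_id: int) -> dict:
    """Return the row with PII columns replaced by a fixed token,
    leaving non-PII facts intact so joins and counts still work."""
    if row.get("user_id") != user_id:
        return row
    masked = dict(row)
    for col in PII_COLUMNS:
        if col in masked:
            masked[col] = "***MASKED***"
    return masked
```

Because the row survives with its key, downstream aggregates keep their totals; only the identifying attributes are destroyed.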
```bash
python pipelines/build_serving_daily_metrics.py
```
Output: `serving/exports/user_daily_metrics.parquet`
- keyed by `user_key` (a stable anonymized key derived from `user_id`)
- allows user-level analytics without exposing PII
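One common way to derive a stable anonymized key is a keyed hash of the raw id; a sketch under that assumption (the repo may derive `user_key` differently, and the salt handling here is illustrative):

```python
import hashlib
import hmac

def user_key(user_id: int, salt: bytes) -> str:
    """Stable anonymized key: the same user_id + salt always yields the
    same key, but the key does not reveal the raw user_id."""
    return hmac.new(salt, str(user_id).encode(), hashlib.sha256).hexdigest()[:16]
```

Stability is the point: the same user aggregates consistently across daily rebuilds, yet the export carries no raw identifier.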
Instead of rebuilding everything, this demo shows partition-based remediation after a GDPR request.
```bash
./run_gdpr_flow.sh 3
```
What it does:
- records the request in `gdpr_requests` (DuckDB)
- masks PII in `cleaned_users`
- recomputes only the affected `event_date` partitions for that user's `serving_user_daily_metrics`
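Partition-pruned remediation boils down to: find which `event_date` partitions contain the user, then rebuild only those. A stdlib sketch with in-memory "partitions" — the real flow runs against DuckDB tables, and these function names are hypothetical:

```python
from collections import defaultdict

def affected_partitions(events: list[dict], user_id: int) -> list[str]:
    """event_dates containing rows for this user: only these need rebuilding."""
    return sorted({e["event_date"] for e in events if e["user_id"] == user_id})

def rebuild_partitions(events: list[dict], dates: list[str]) -> dict:
    """Recompute per-user daily event counts, but only for the given dates."""
    metrics = defaultdict(int)
    for e in events:
        if e["event_date"] in dates:
            metrics[(e["user_id"], e["event_date"])] += 1
    return dict(metrics)
```

The win is proportionality: a user active on 5 days triggers 5 partition rebuilds, not a full-history recompute of the serving table.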