A small end-to-end data platform demo showing a practical raw → clean → curated → serving flow, with lightweight validation and governance concepts.

anhkhoadx/data-platform-local

Local Data Platform Demo (Landing → Raw → Cleaned → Serving)

This repo demonstrates common data platform behaviors on a local machine, using layer names similar to those found in many production systems:

  • Landing → ingest data as-is (staging/temporary storage; think of an external bucket)
  • Raw → copy/convert into a lakehouse format (Parquet) and keep history
  • Cleaned → transforms + idempotent incremental builds (stored in DuckDB)
  • Serving (Curated) → the only layer consumers should read (serving/exports/)
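
To make "idempotent incremental builds" concrete, here is a minimal sketch of one common pattern: delete a partition and re-insert it inside a single transaction, so rerunning the same build never duplicates rows. It is not taken from the repo's pipelines. It uses Python's built-in sqlite3 as a stand-in for DuckDB so it runs without extra dependencies, and the table cleaned_events, its columns, and the sample rows are assumptions made for illustration.

```python
import sqlite3


def build_partition(conn, event_date, rows):
    """Rebuild one event_date partition; safe to rerun with the same input."""
    with conn:  # the delete and the insert commit (or roll back) together
        conn.execute(
            "DELETE FROM cleaned_events WHERE event_date = ?", (event_date,)
        )
        conn.executemany(
            "INSERT INTO cleaned_events (event_date, user_id, event_type) "
            "VALUES (?, ?, ?)",
            [(event_date, user_id, event_type) for user_id, event_type in rows],
        )


# Hypothetical table and data, in-memory only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cleaned_events (event_date TEXT, user_id INTEGER, event_type TEXT)"
)

batch = [(1, "login"), (3, "purchase")]
build_partition(conn, "2024-01-01", batch)
build_partition(conn, "2024-01-01", batch)  # rerun of the same partition

row_count = conn.execute("SELECT COUNT(*) FROM cleaned_events").fetchone()[0]
print(row_count)  # 2: the rerun replaced the partition instead of appending
```

Scoping each build to one partition also keeps backfills cheap, since only the dates being rebuilt are touched.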

Run

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

./run_all.sh

Outputs

  • quality/reports/*.md and quality/quality_summary.csv
  • serving/exports/ (consumer boundary)
  • state/audit_log.jsonl

Governance + Tests

  • Ownership required: governance/ownership.yml
  • Classification required (including Serving): governance/data_classification.yml
  • Executable tests:
make test-governance
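
As a rough illustration of what an executable governance test could check, the sketch below reports datasets that lack an owner or a classification. The structure of ownership.yml and data_classification.yml is not shown in this README, so the code assumes both files have already been parsed into simple dataset-to-value mappings; the function name, dataset names, owners, and labels are all hypothetical.

```python
def find_governance_gaps(datasets, ownership, classification):
    """Return {dataset: [missing items]} for datasets lacking governance metadata."""
    gaps = {}
    for name in datasets:
        missing = []
        if not ownership.get(name):
            missing.append("owner")
        if not classification.get(name):
            missing.append("classification")
        if missing:
            gaps[name] = missing
    return gaps


# Hypothetical, already-parsed contents of the two governance files.
ownership = {"cleaned_users": "data-eng", "serving_user_daily_metrics": "analytics"}
classification = {"cleaned_users": "pii"}

gaps = find_governance_gaps(
    ["cleaned_users", "serving_user_daily_metrics"], ownership, classification
)
print(gaps)  # {'serving_user_daily_metrics': ['classification']}
```

A test built on this idea would fail whenever the returned mapping is non-empty, which is how a Serving dataset without a classification would be caught.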

GDPR masking

Instead of deleting facts (which can break analytics), we mask PII in cleaned_users:

python governance/gdpr_mask_user.py --user-id 3
python pipelines/build_serving_daily_metrics.py

Additional Serving model (PII-free user-level)

  • serving/exports/user_daily_metrics.parquet
    • keyed by user_key (stable anonymized key derived from user_id)
    • allows user-level analytics without exposing PII
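
The README does not say how user_key is derived. One common way to get a stable, non-reversible key is a keyed hash (HMAC-SHA256) of user_id, sketched below under that assumption. The function name, the secret, and the 16-character truncation are illustrative choices, not the repo's implementation.

```python
import hashlib
import hmac


def make_user_key(user_id, secret):
    """Derive a stable pseudonymous key from user_id using HMAC-SHA256."""
    digest = hmac.new(secret, str(user_id).encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]  # truncated for readability (assumption)


# Hypothetical secret; in practice it would live outside the codebase.
SECRET = b"demo-only-secret"

key_a = make_user_key(3, SECRET)
key_b = make_user_key(3, SECRET)
key_other = make_user_key(4, SECRET)

print(key_a == key_b)      # True: same user_id always maps to the same key
print(key_a == key_other)  # False: different users get different keys
```

A keyed hash is preferable to a plain hash here because small integer IDs are easy to enumerate; without the secret, the mapping from user_id to user_key cannot simply be recomputed by a consumer.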

Targeted recompute after GDPR (v5)

Instead of rebuilding everything, this demo shows partition-based remediation after a GDPR request.

./run_gdpr_flow.sh 3

What it does:

  1. records the request in gdpr_requests (DuckDB)
  2. masks PII in cleaned_users
  3. recomputes only the affected event_date partitions of serving_user_daily_metrics for that user
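
The three steps above can be sketched as follows. This is an illustration of the idea, not the contents of run_gdpr_flow.sh: it uses sqlite3 as a stand-in for DuckDB, the cleaned_events table and all column names are assumptions, and for brevity the serving table is keyed by user_id rather than the anonymized user_key.

```python
import sqlite3


def run_gdpr_flow(conn, user_id):
    """Record the request, mask PII, and rebuild only the affected partitions."""
    with conn:
        # 1. Record the request.
        conn.execute("INSERT INTO gdpr_requests (user_id) VALUES (?)", (user_id,))
        # 2. Mask PII in place; the row stays so facts still join to it.
        conn.execute(
            "UPDATE cleaned_users SET email = 'MASKED' WHERE user_id = ?",
            (user_id,),
        )
        # 3. Find the event_date partitions this user appears in.
        dates = [
            row[0]
            for row in conn.execute(
                "SELECT DISTINCT event_date FROM cleaned_events "
                "WHERE user_id = ? ORDER BY event_date",
                (user_id,),
            )
        ]
        # Delete-then-insert this user's serving rows, one partition at a time.
        for event_date in dates:
            conn.execute(
                "DELETE FROM serving_user_daily_metrics "
                "WHERE user_id = ? AND event_date = ?",
                (user_id, event_date),
            )
            conn.execute(
                "INSERT INTO serving_user_daily_metrics "
                "SELECT event_date, user_id, COUNT(*) FROM cleaned_events "
                "WHERE user_id = ? AND event_date = ? "
                "GROUP BY event_date, user_id",
                (user_id, event_date),
            )
    return dates


# Hypothetical schema and sample data, in-memory only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gdpr_requests (user_id INTEGER);
    CREATE TABLE cleaned_users (user_id INTEGER, email TEXT);
    CREATE TABLE cleaned_events (event_date TEXT, user_id INTEGER);
    CREATE TABLE serving_user_daily_metrics (
        event_date TEXT, user_id INTEGER, event_count INTEGER
    );
    INSERT INTO cleaned_users VALUES (3, 'user3@example.com');
    INSERT INTO cleaned_events VALUES
        ('2024-01-01', 3), ('2024-01-01', 3), ('2024-01-02', 5);
""")

recomputed = run_gdpr_flow(conn, 3)
email = conn.execute(
    "SELECT email FROM cleaned_users WHERE user_id = 3"
).fetchone()[0]
serving_rows = conn.execute(
    "SELECT * FROM serving_user_daily_metrics"
).fetchall()
print(recomputed, email, serving_rows)
```

Only the 2024-01-01 partition is rebuilt in this example, because user 3 has no events on 2024-01-02; partitions belonging solely to other users are never touched.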
