- Python 3.10+
- PostgreSQL 14+
python generate_data.pyThis creates a data/ folder containing three CSVs:
| File | Rows | Description |
|---|---|---|
users.csv |
~1,000 | User accounts (signups spanning 90 days before + during the month) |
transactions.csv |
~5,000 | P2P transactions with log-normal amount distribution |
app_events.csv |
~10,000 | In-app events grouped into realistic user sessions |
The script uses a fixed seed (42) so results are fully reproducible.
psql -d your_database -f sql/01_create_tables.sql
# Then load the CSVs:
psql -d your_database -c "\copy users FROM 'data/users.csv' CSV HEADER"
psql -d your_database -c "\copy transactions FROM 'data/transactions.csv' CSV HEADER"
psql -d your_database -c "\copy app_events FROM 'data/app_events.csv' CSV HEADER"psql -d your_database -f sql/02_identify_anomalies.sql
psql -d your_database -f sql/03_clean_data.sql
psql -d your_database -f sql/04_metrics.sql The generator introduces five distinct anomaly types:
| # | Anomaly | Count | Real-World Cause |
|---|---|---|---|
| 1 | Negative / zero transaction amounts | ~60 | A bug in the refund system sends negative dollar values instead of positive ones |
| 2 | Orphaned foreign keys | ~40 | A user's account was deleted, but their transactions were not removed alongside it |
| 3 | Future-dated event timestamps | ~80 | A user's phone clock is set to the wrong date, so events appear to happen in the future |
| 4 | Impossible event sequences | ~100 | A user somehow confirmed a payment without ever visiting the send-money screen first |
| 5 | Inconsistent status enum values | ~70 | Different app versions write the same status differently (e.g. "completed" vs "COMPLETED" vs "complted") |
Anomalies are sampled independently, so they can co-occur on the same row (as happens in real data).
- Transaction amounts follow a log-normal distribution (median ≈ $45, mean ≈ $78, 95th percentile ≈ $253) to mimic real P2P payment patterns.
- App events are generated as multi-step user journeys (login → browse → send → confirm → logout) rather than independent random events, with shared session IDs within each journey.
.
├── README.md
├── generate_data.py
├── data/
│ ├── users.csv
│ ├── transactions.csv
│ └── app_events.csv
└── sql/
├── 01_create_tables.sql
├── 02_identify_anomalies.sql
├── 03_clean_data.sql
└── 04_metrics.sql