Skip to content

lukessmalley/-P2P-Payment-Feature-Sample-Dataset-and-Analysis

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

P2P Payment Feature Sample Dataset and Analysis

Quick Start

Prerequisites

  • Python 3.10+
  • PostgreSQL 14+

1. Generate the data

python generate_data.py

This creates a data/ folder containing three CSVs:

File Rows Description
users.csv ~1,000 User accounts (signups spanning 90 days before + during the month)
transactions.csv ~5,000 P2P transactions with log-normal amount distribution
app_events.csv ~10,000 In-app events grouped into realistic user sessions

The script uses a fixed seed (42) so results are fully reproducible.

2. Load into PostgreSQL

psql -d your_database -f sql/01_create_tables.sql

# Then load the CSVs:
psql -d your_database -c "\copy users       FROM 'data/users.csv'        CSV HEADER"
psql -d your_database -c "\copy transactions FROM 'data/transactions.csv' CSV HEADER"
psql -d your_database -c "\copy app_events  FROM 'data/app_events.csv'   CSV HEADER"

3. Run the SQL scripts in order

psql -d your_database -f sql/02_identify_anomalies.sql   
psql -d your_database -f sql/03_clean_data.sql           
psql -d your_database -f sql/04_metrics.sql              

Intentional Data Anomalies

The generator introduces five distinct anomaly types:

# Anomaly Count Real-World Cause
1 Negative / zero transaction amounts ~60 A bug in the refund system sends negative dollar values instead of positive ones
2 Orphaned foreign keys ~40 A user's account was deleted, but their transactions were not removed alongside it
3 Future-dated event timestamps ~80 A user's phone clock is set to the wrong date, so events appear to happen in the future
4 Impossible event sequences ~100 A user somehow confirmed a payment without ever visiting the send-money screen first
5 Inconsistent status enum values ~70 Different app versions write the same status differently (e.g. "completed" vs "COMPLETED" vs "complted")

Anomalies are sampled independently, so they can co-occur on the same row (as happens in real data).

Data realism notes

  • Transaction amounts follow a log-normal distribution (median ≈ $45, mean ≈ $78, 95th percentile ≈ $253) to mimic real P2P payment patterns.
  • App events are generated as multi-step user journeys (login → browse → send → confirm → logout) rather than independent random events, with shared session IDs within each journey.

Project Structure

.
├── README.md
├── generate_data.py           
├── data/
│   ├── users.csv
│   ├── transactions.csv
│   └── app_events.csv
└── sql/
    ├── 01_create_tables.sql    
    ├── 02_identify_anomalies.sql  
    ├── 03_clean_data.sql          
    └── 04_metrics.sql           

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages