traceyho59/fraud-detection-market-analysis

Financial Fraud Detection with Market Volatility Analysis

A comprehensive machine learning approach to fraud detection combining unsupervised clustering, supervised classification, and financial market analysis to identify fraudulent transactions in real-time banking data.

🎯 Project Overview

This project tackles financial fraud detection using a multi-faceted analytical approach that integrates traditional machine learning with financial market indicators. By analyzing transaction patterns alongside stock market volatility, we built a system whose best model (Random Forest) reaches 0.94 AUC-ROC with 0.89 precision and 0.87 recall.

Business Problem

Financial institutions lose billions annually to fraud. Traditional rule-based systems miss sophisticated fraud patterns, while modern machine learning approaches often ignore external market conditions that may influence fraudulent behavior. This project addresses both gaps by:

  1. Detecting anomalous transaction patterns using unsupervised learning
  2. Classifying fraud with high accuracy using supervised models
  3. Analyzing fraud correlation with market volatility to improve risk assessment

📊 Key Results

Model Performance

Model                 AUC-ROC   Precision   Recall   Key Strength
Random Forest         0.94      0.89        0.87     Best overall performance
Logistic Regression   0.88      0.82        0.80     Interpretable baseline
Decision Tree         0.85      0.78        0.81     Fast inference
Isolation Forest      0.91      0.85        -        Unsupervised anomaly detection

Business Impact

  • 94% AUC-ROC - Excellent discrimination between fraud and legitimate transactions
  • 89% Precision - Minimizes false positives, reducing customer friction
  • Market Correlation - Fraud rates increase by 15% during high volatility periods
  • Real-time Capability - Models optimized for production deployment
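
For readers less familiar with the metrics reported here, the following is a minimal base-R sketch of how precision, recall, and AUC-ROC can be computed from labels and scores. The toy vectors and the function names `precision_recall` and `auc_roc` are illustrative assumptions, not code from the repository (which lists pROC and caret for evaluation).

```r
# Toy labels (1 = fraud) and model scores; values are made up for illustration.
actual <- c(1, 0, 1, 1, 0, 0, 1, 0, 0, 0)
score  <- c(0.9, 0.2, 0.7, 0.4, 0.6, 0.1, 0.8, 0.3, 0.2, 0.5)

# Precision and recall at a fixed decision threshold.
precision_recall <- function(actual, score, threshold = 0.5) {
  predicted <- as.integer(score >= threshold)
  tp <- sum(predicted == 1 & actual == 1)
  fp <- sum(predicted == 1 & actual == 0)
  fn <- sum(predicted == 0 & actual == 1)
  c(precision = tp / (tp + fp), recall = tp / (tp + fn))
}

# AUC-ROC via the rank-sum (Mann-Whitney) identity: the probability that a
# randomly chosen fraud case scores higher than a randomly chosen legitimate one.
auc_roc <- function(actual, score) {
  r <- rank(score)  # average ranks handle tied scores
  n_pos <- sum(actual == 1)
  n_neg <- sum(actual == 0)
  (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

pr  <- precision_recall(actual, score, threshold = 0.5)
auc <- auc_roc(actual, score)
```

Precision and recall depend on the chosen threshold, while AUC-ROC summarizes ranking quality across all thresholds, which is why the table reports them side by side.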

🔬 Methodology

Research Questions

Q1: Can unsupervised learning identify fraud patterns without labels?

  • Approach: PCA + K-means clustering
  • Result: Successfully identified 4 distinct transaction clusters with varying fraud rates

Q2: Which supervised models best classify fraudulent transactions?

  • Approach: Logistic Regression, Decision Trees, Random Forest with SMOTE
  • Result: Random Forest achieved 94% AUC-ROC with balanced precision-recall

Q3: Does market volatility correlate with fraud frequency?

  • Approach: Stock market data integration, ANOVA, t-tests
  • Result: Fraud rates differ significantly across volatility levels (ANOVA, p < 0.001) and are higher in high-volatility periods

💻 Technical Implementation

Tech Stack

# Core Libraries
R 4.0+
├── Machine Learning
│   ├── caret          # Model training & evaluation
│   ├── randomForest   # Ensemble learning
│   └── isotree        # Isolation Forest
├── Data Processing
│   ├── data.table     # High-performance data manipulation
│   ├── dplyr          # Data wrangling
│   └── recipes        # Feature engineering pipeline
├── Imbalanced Learning
│   └── smotefamily    # SMOTE implementation
└── Visualization
    ├── ggplot2        # Statistical graphics
    ├── corrplot       # Correlation matrices
    └── factoextra     # PCA visualization

Architecture

Data Pipeline → Feature Engineering → Class Balancing → Model Training → Evaluation
     ↓               ↓                      ↓                ↓              ↓
[CSV Load]      [Normalize]           [SMOTE]         [RF/LR/DT]    [AUC/Precision]
                [Encode]              [Downsample]     [IsoForest]   [Confusion Matrix]
                [PCA]                                                [ROC Curves]

📁 Repository Structure

fraud-detection-market-analysis/
├── README.md                              # Project documentation
├── code/
│   ├── Q1_unsupervised_clustering.R      # PCA + K-means analysis
│   ├── Q2_supervised_classification.R     # ML model training
│   └── Q3_market_volatility_analysis.R    # Financial correlation study
├── docs/
│   ├── Final_Report.docx                  # Complete analysis report
│   └── Presentation.pptx                  # Executive presentation
└── visualizations/
    ├── fraud_rate_by_volatility.png       # Market correlation plot
    ├── fraud_type_by_volatility.png       # Transaction type analysis
    └── fraud_timing_by_volatility.png     # Temporal patterns

🚀 Getting Started

Prerequisites

# Install required packages
install.packages(c(
  "caret", "dplyr", "ggplot2", "pROC", "randomForest",
  "smotefamily", "isotree", "data.table", "recipes",
  "corrplot", "FactoMineR", "factoextra", "cluster"
))

Running the Analysis

# 1. Unsupervised Clustering (Q1)
source("code/Q1_unsupervised_clustering.R")
# Outputs: PCA components, cluster assignments, fraud rates per cluster

# 2. Supervised Classification (Q2)
source("code/Q2_supervised_classification.R")
# Outputs: Trained models, confusion matrices, ROC curves, AUC scores

# 3. Market Volatility Analysis (Q3)
source("code/Q3_market_volatility_analysis.R")
# Outputs: Volatility metrics, correlation plots, ANOVA results

Data Requirements

Input: CLEANED_transactions_with_stock.csv

Key Features:

  • Transaction attributes: Amount, Hour, Type, Period
  • User demographics: Age, Gender, Account Type
  • Market indicators: AAPL, AMZN, GOOGL, META, MSFT, NVDA, TSLA stock prices
  • Target variable: Is_Fraud (binary)
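
Before running the scripts it can help to confirm that the input file has the expected columns. The sketch below is a hypothetical helper, not part of the repository: the exact column spellings (for example `Account_Type`) and the 0/1 coding of `Is_Fraud` are assumptions inferred from the feature list.

```r
# Column names inferred from the feature list; the exact spellings used in the
# CSV (for example "Account_Type") are assumptions.
required_cols <- c("Amount", "Hour", "Type", "Period", "Age", "Gender",
                   "Account_Type", "AAPL", "AMZN", "GOOGL", "META", "MSFT",
                   "NVDA", "TSLA", "Is_Fraud")

# Return the required columns that are absent from a data frame.
missing_columns <- function(df, required = required_cols) {
  setdiff(required, names(df))
}

# Hypothetical helper: fail early on a bad schema or a non-binary target.
check_transactions <- function(df) {
  missing <- missing_columns(df)
  if (length(missing) > 0) {
    stop("Missing columns: ", paste(missing, collapse = ", "))
  }
  if (!all(df$Is_Fraud %in% c(0, 1))) {
    stop("Is_Fraud must be coded 0/1")  # assumed coding
  }
  invisible(TRUE)
}

# Usage:
# transactions <- read.csv("CLEANED_transactions_with_stock.csv")
# check_transactions(transactions)
```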

📈 Detailed Methodology

Part 1: Unsupervised Learning (Q1)

Objective: Discover natural transaction clusters and identify high-risk groups

Approach:

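library(caret)       # preProcess()
library(FactoMineR)  # PCA()

# 1. Data preprocessing & standardization (train_data: numeric feature columns)
preproc <- preProcess(train_data, method = c("center", "scale"))
scaled_data <- predict(preproc, train_data)

# 2. PCA for dimensionality reduction
n_components <- 8  # Based on eigenvalue > 1 criterion
pca_result <- PCA(scaled_data, ncp = n_components, graph = FALSE)
pca_components <- pca_result$ind$coord  # row scores on the retained components

# 3. K-means clustering
set.seed(42)  # nstart uses random starts; fix the seed for reproducibility
km <- kmeans(pca_components, centers = 4, nstart = 25)

# 4. Analyze fraud rates per cluster
# train_labels: 0/1 fraud labels aligned with the rows of train_data (name assumed)
train_results <- data.frame(Is_Fraud = train_labels, Cluster = km$cluster)
fraud_rates <- aggregate(Is_Fraud ~ Cluster, data = train_results, mean)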

Key Findings:

  • 4 optimal clusters identified via elbow method
  • Cluster 3 had highest fraud rate (23%)
  • PCA reduced dimensionality from 15 features to 8 components
  • 85% of variance explained by first 8 components
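
The eigenvalue > 1 criterion and the elbow method mentioned in these findings can be illustrated with base R alone. This is a sketch on synthetic data, using `prcomp` rather than the FactoMineR call shown earlier; the data, dimensions, and variable names are assumptions for illustration only.

```r
set.seed(42)
# Synthetic stand-in for scaled transaction features: 300 rows, 6 columns,
# built from three shifted groups so that some cluster structure exists.
group <- rep(1:3, each = 100)
x <- matrix(rnorm(300 * 6), ncol = 6) + group * 2
x <- scale(x)

# PCA with base R; eigenvalues are the squared component standard deviations.
pca <- prcomp(x, center = FALSE, scale. = FALSE)
eigenvalues <- pca$sdev^2
n_keep <- max(1, sum(eigenvalues > 1))   # eigenvalue > 1 (Kaiser) criterion
var_explained <- cumsum(eigenvalues) / sum(eigenvalues)

# Elbow method: total within-cluster sum of squares for k = 1..8.
scores <- pca$x[, seq_len(n_keep), drop = FALSE]
wss <- sapply(1:8, function(k) {
  kmeans(scores, centers = k, nstart = 25)$tot.withinss
})

# plot(1:8, wss, type = "b")  # look for the bend where the curve flattens
```

`var_explained[n_keep]` gives the cumulative share of variance retained, which is the kind of figure reported above for the first 8 components.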

Part 2: Supervised Classification (Q2)

Objective: Build production-ready fraud detection models

Approach:

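library(smotefamily)
library(recipes)
library(caret)

# 1. Handle class imbalance with SMOTE (train_features must be all numeric)
smote_result <- SMOTE(
  X = train_features,
  target = train_labels,
  K = 5,
  dup_size = 5
)
# SMOTE() returns the resampled rows in $data, with the label in a column named "class"
train_smote <- smote_result$data
names(train_smote)[names(train_smote) == "class"] <- "Is_Fraud"
# classProbs = TRUE needs factor levels that are valid R names (e.g. "Fraud")
train_smote$Is_Fraud <- factor(train_smote$Is_Fraud)

# 2. Feature engineering pipeline
rec <- recipe(Is_Fraud ~ ., data = train_smote) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%                # drop constant columns before scaling
  step_normalize(all_numeric_predictors())

# 3. Train multiple models with cross-validation
ctrl <- trainControl(method = "cv", number = 3,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)

rf_model <- train(Is_Fraud ~ ., data = train_smote,
                  method = "rf", metric = "ROC", trControl = ctrl)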

Model Comparison:

  • Random Forest: Best AUC (0.94), balanced performance
  • Logistic Regression: Most interpretable, fast inference
  • Isolation Forest: Best for unsupervised detection (95th percentile threshold)
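
The 95th percentile threshold noted for Isolation Forest amounts to flagging the top 5% of anomaly scores. A minimal sketch follows; `flag_anomalies` is a hypothetical helper, and the random scores stand in for scores from a fitted model (for example one built with the isotree package from the tech stack).

```r
# Flag the top (1 - p) share of anomaly scores; higher score = more anomalous.
flag_anomalies <- function(scores, p = 0.95) {
  cutoff <- quantile(scores, probs = p, names = FALSE)
  list(cutoff = cutoff, flagged = scores > cutoff)
}

set.seed(1)
scores <- runif(1000)                 # stand-in for Isolation Forest scores
result <- flag_anomalies(scores, p = 0.95)
flagged_share <- mean(result$flagged) # share of transactions flagged
```

Because the cut-off is a quantile of the scores themselves, the share of flagged transactions is fixed by `p` rather than by the true fraud rate, which is worth keeping in mind when comparing precision against the supervised models.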

Optimization Techniques:

  • SMOTE oversampling for minority class
  • Downsampling majority class (50K samples)
  • 3-fold cross-validation
  • Hyperparameter tuning via grid search
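
To show what SMOTE oversampling does conceptually, here is a hand-rolled sketch in base R: each synthetic row is placed at a random point on the segment between a minority sample and one of its nearest minority neighbours. This is a simplified illustration, not the smotefamily implementation used in the project; `smote_sketch` and the toy data are assumptions.

```r
# Minimal SMOTE-style oversampling for a numeric minority-class matrix.
smote_sketch <- function(minority, k = 5, n_new = nrow(minority)) {
  minority <- as.matrix(minority)
  n <- nrow(minority)
  k <- min(k, n - 1)
  d <- as.matrix(dist(minority))   # pairwise Euclidean distances
  diag(d) <- Inf                   # a point is not its own neighbour
  synthetic <- matrix(NA_real_, nrow = n_new, ncol = ncol(minority))
  for (i in seq_len(n_new)) {
    i_base <- sample.int(n, 1)                     # random minority point
    neighbours <- order(d[i_base, ])[seq_len(k)]   # its k nearest neighbours
    i_nb <- neighbours[sample.int(k, 1)]           # pick one of them
    gap <- runif(1)                                # position along the segment
    synthetic[i, ] <- minority[i_base, ] +
      gap * (minority[i_nb, ] - minority[i_base, ])
  }
  colnames(synthetic) <- colnames(minority)
  synthetic
}

set.seed(7)
fraud_rows <- matrix(rnorm(40), ncol = 2,
                     dimnames = list(NULL, c("Amount", "Hour")))
new_rows <- smote_sketch(fraud_rows, k = 5, n_new = 30)
```

Since every synthetic row is an interpolation between two real minority rows, the new points stay inside the range of the observed fraud cases.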

Part 3: Market Volatility Analysis (Q3)

Objective: Quantify relationship between market volatility and fraud

Approach:

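library(dplyr)

# 1. Calculate market returns (% change in Stock_Avg; assumes rows are in date order)
merged_df <- merged_df %>%
  mutate(Market_Return = (Stock_Avg - lag(Stock_Avg)) / lag(Stock_Avg) * 100)

# 2. Define volatility quintiles (1 = calmest, 5 = most volatile)
merged_df <- merged_df %>%
  mutate(Volatility_Level = ntile(abs(Market_Return), 5))

# 3. Statistical testing (Volatility_Level as a factor so aov() compares group means)
anova_result <- aov(Fraud_Rate ~ factor(Volatility_Level), data = daily_fraud)
summary(anova_result)
t.test(Fraud_Rate ~ Volatility_Bin, data = daily_fraud_bin)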

Key Findings:

  • 15% increase in fraud during high volatility periods
  • ANOVA p-value < 0.001: Significant differences across volatility levels
  • Evening transactions showed highest fraud correlation with volatility
  • Wire transfers most affected by market conditions
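
The volatility workflow can be tried end to end on synthetic data using base R only. Everything in this sketch is invented for illustration (the price series, the fraud-rate relationship, and the column names beyond those shown earlier), so its output says nothing about the project's actual results.

```r
set.seed(123)
n_days <- 250

# Synthetic daily series: an average stock price and, further below, a daily
# fraud rate that rises slightly with the size of the market move.
price <- 100 * cumprod(1 + rnorm(n_days, mean = 0, sd = 0.02))
daily <- data.frame(day = seq_len(n_days), Stock_Avg = price)

# Daily percentage return; the first day has no previous price.
daily$Market_Return <- c(NA, diff(daily$Stock_Avg) / head(daily$Stock_Avg, -1) * 100)
daily <- daily[!is.na(daily$Market_Return), ]
daily$Fraud_Rate <- 0.02 + 0.002 * abs(daily$Market_Return) +
  rnorm(nrow(daily), sd = 0.003)

# Volatility quintiles from absolute returns (1 = calmest, 5 = most volatile).
breaks <- quantile(abs(daily$Market_Return), probs = seq(0, 1, 0.2))
daily$Volatility_Level <- cut(abs(daily$Market_Return), breaks = breaks,
                              include.lowest = TRUE, labels = 1:5)

# One-way ANOVA across the five levels, then a t-test of top vs bottom quintile.
anova_result <- aov(Fraud_Rate ~ Volatility_Level, data = daily)
extremes <- droplevels(daily[daily$Volatility_Level %in% c("1", "5"), ])
t_result <- t.test(Fraud_Rate ~ Volatility_Level, data = extremes)

summary(anova_result)
t_result$p.value
```

Note that returns and fraud rates are computed per day here; running the same steps on transaction-level rows would weight busy days more heavily.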

💡 Applications for Tech Companies

1. Fintech & Payment Processors (Stripe, Square, PayPal)

Use Case: Real-time transaction monitoring

  • Deploy Random Forest model in production for live fraud scoring
  • Integrate market volatility feeds for dynamic risk thresholds
  • Reduce false positives by 30% using context-aware models

2. Banking & Financial Institutions

Use Case: Risk assessment & fraud prevention

  • Implement Isolation Forest for anomaly detection in customer behavior
  • Use clustering insights to segment customers by risk profile
  • Adjust fraud monitoring during high-volatility market periods

3. E-commerce Platforms (Amazon, Shopify)

Use Case: Seller fraud & account takeover detection

  • Apply supervised models to identify fraudulent seller accounts
  • Use unsupervised clustering to detect coordinated fraud rings
  • Monitor transaction patterns during major sales events

4. Insurance & Risk Analytics

Use Case: Claims fraud detection

  • Leverage ensemble methods for suspicious claim identification
  • Incorporate external economic indicators (like stock market data)
  • Build explainable models for regulatory compliance

🔧 Model Deployment Considerations

Production Readiness

# Save trained model
saveRDS(rf_model, "models/random_forest_fraud_detector.rds")

# Load and predict in production
model <- readRDS("models/random_forest_fraud_detector.rds")
predictions <- predict(model, new_transactions, type = "prob")

# Apply risk threshold
high_risk <- predictions[, "Fraud"] > 0.3  # 30% threshold

Performance Optimization

  • Inference Time: <10ms per transaction (Random Forest)
  • Memory Footprint: ~50MB model size
  • Scalability: Parallel processing with doParallel
  • Monitoring: Track concept drift with periodic retraining
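
One common way to track drift between training-time and production score distributions is the Population Stability Index (PSI). The README does not say which drift statistic the project uses, so the following is a sketch of one possible choice, with simulated scores and a hypothetical `psi` function.

```r
# Population Stability Index between a baseline and a current score sample.
psi <- function(baseline, current, n_bins = 10, eps = 1e-6) {
  # Bin edges are baseline quantiles, opened up to cover any new value.
  breaks <- quantile(baseline, probs = seq(0, 1, length.out = n_bins + 1))
  breaks[1] <- -Inf
  breaks[length(breaks)] <- Inf
  p_base <- as.numeric(table(cut(baseline, breaks))) / length(baseline)
  p_curr <- as.numeric(table(cut(current, breaks))) / length(current)
  p_base <- pmax(p_base, eps)   # avoid log(0) for empty bins
  p_curr <- pmax(p_curr, eps)
  sum((p_curr - p_base) * log(p_curr / p_base))
}

set.seed(99)
train_scores   <- rbeta(5000, 2, 8)   # stand-in for scores at training time
same_scores    <- rbeta(5000, 2, 8)   # same distribution
shifted_scores <- rbeta(5000, 4, 6)   # shifted distribution
psi_same    <- psi(train_scores, same_scores)
psi_shifted <- psi(train_scores, shifted_scores)
```

A PSI close to zero suggests stable scores; a large value is a signal to investigate and possibly retrain. Any alert cut-off is a convention to be tuned, not something stated in this project.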

Production Pipeline

Transaction → Feature Engineering → Model Scoring → Risk Decision
     ↓              ↓                     ↓              ↓
[Real-time]    [Normalize]          [RF Ensemble]  [Block/Allow]
               [Encode]              [Prob > 0.3]   [Alert Team]
               [Market Data]
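
The final Risk Decision stage could be sketched as a simple mapping from fraud probability to an action. The 0.3 cut-off matches the threshold in the Production Readiness snippet; the 0.7 block cut-off and the function name `risk_decision` are illustrative assumptions.

```r
# Hypothetical decision rule: probabilities above review_at trigger an alert,
# probabilities above block_at block the transaction outright.
risk_decision <- function(fraud_prob, review_at = 0.3, block_at = 0.7) {
  stopifnot(review_at <= block_at)
  cut(fraud_prob,
      breaks = c(-Inf, review_at, block_at, Inf),
      labels = c("Allow", "Alert Team", "Block"),
      right = TRUE)   # intervals are (lower, upper], so exactly 0.3 is allowed
}

decisions <- risk_decision(c(0.1, 0.3, 0.31, 0.7, 0.9))
```

Keeping the cut-offs as arguments makes it straightforward to lower them during high-volatility periods, in line with the dynamic risk thresholds suggested earlier.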

📊 Statistical Rigor

Validation Techniques

  • Train-Test Split: 80-20 stratified by fraud label
  • Cross-Validation: 3-fold CV for model selection
  • Class Imbalance: SMOTE + downsampling
  • Multiple Metrics: AUC, Precision, Recall, F1-Score
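
A stratified 80-20 split keeps the fraud share the same in both partitions, which matters when fraud is rare. Here is a base-R sketch under assumed synthetic labels; the project itself may use a caret helper instead, and `stratified_split` is a hypothetical name.

```r
# Stratified split: sample train_frac of the row indices within each class.
stratified_split <- function(labels, train_frac = 0.8) {
  idx_by_class <- split(seq_along(labels), labels)
  train_idx <- unlist(lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(train_frac * length(idx)))]
  }), use.names = FALSE)
  list(train = sort(train_idx),
       test  = setdiff(seq_along(labels), train_idx))
}

set.seed(2024)
is_fraud <- rep(c(0, 1), times = c(950, 50))   # synthetic labels, 5% fraud
parts <- stratified_split(is_fraud)
train_share <- mean(is_fraud[parts$train])     # fraud share in the training set
test_share  <- mean(is_fraud[parts$test])      # fraud share in the test set
```

Oversampling such as SMOTE should then be applied to the training partition only, so the test set keeps the original class balance.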

Statistical Tests

  • Bartlett's Test: Confirmed PCA suitability (p < 0.001)
  • KMO Test: Sampling adequacy = 0.82 (good)
  • ANOVA: Volatility effect on fraud (F = 45.2, p < 0.001)
  • T-Test: High vs Low volatility fraud rates (p = 0.002)
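
The two PCA suitability checks can be computed directly from a correlation matrix. The sketch below implements Bartlett's test of sphericity (not base R's `bartlett.test`, which tests equality of variances) and the overall KMO measure on synthetic data; function names and data are assumptions, and the project may have used a package for this.

```r
# Bartlett's test of sphericity: H0 is that the correlation matrix is identity.
bartlett_sphericity <- function(x) {
  n <- nrow(x)
  p <- ncol(x)
  R <- cor(x)
  chi_sq <- -(n - 1 - (2 * p + 5) / 6) * log(det(R))
  df <- p * (p - 1) / 2
  list(statistic = chi_sq, df = df,
       p_value = pchisq(chi_sq, df, lower.tail = FALSE))
}

# Overall Kaiser-Meyer-Olkin measure: squared correlations relative to
# squared correlations plus squared partial correlations (off-diagonal only).
kmo_overall <- function(x) {
  R <- cor(x)
  R_inv <- solve(R)
  partial <- -R_inv / sqrt(outer(diag(R_inv), diag(R_inv)))
  off <- row(R) != col(R)
  sum(R[off]^2) / (sum(R[off]^2) + sum(partial[off]^2))
}

set.seed(11)
shared <- rnorm(500)                                  # common factor
x <- sapply(1:5, function(j) shared + rnorm(500))     # 5 correlated columns
sphericity <- bartlett_sphericity(x)
kmo <- kmo_overall(x)
```

A small sphericity p-value and a KMO well above 0.5 both indicate that the variables share enough correlation for PCA to be worthwhile.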

⚠️ Limitations & Future Work

Current Limitations

  • Data Period: Analysis based on 6-month transaction window
  • Geographic Scope: US-based transactions only
  • Stock Selection: Limited to 7 large-cap tech stocks (AAPL, AMZN, GOOGL, META, MSFT, NVDA, TSLA)
  • Temporal Lag: Market data may not reflect same-day impact

Future Enhancements

  1. Deep Learning: LSTM networks for sequential pattern detection
  2. Additional Features: Device fingerprinting, geolocation, velocity checks
  3. Real-time Updates: Online learning for model adaptation
  4. Explainability: SHAP values for individual predictions
  5. Multi-currency: Extend to international transactions

👥 Team & Context

Project Team: Group 5

  • Data preprocessing & feature engineering
  • Model development & optimization
  • Statistical analysis & validation
  • Visualization & documentation

Academic Context:

  • Course: APAN 5205 - Applied Analytics Frameworks II
  • Institution: Columbia University
  • Program: Master of Science in Applied Analytics
  • Semester: Fall 2024

🎯 Skills Demonstrated

Machine Learning

  • Supervised learning (Classification)
  • Unsupervised learning (Clustering, PCA)
  • Ensemble methods (Random Forest)
  • Anomaly detection (Isolation Forest)
  • Class imbalance techniques (SMOTE)

Data Science

  • Feature engineering & preprocessing
  • Dimensionality reduction (PCA)
  • Model evaluation & selection
  • Statistical hypothesis testing
  • Data visualization (ggplot2)

Domain Knowledge

  • Financial fraud detection
  • Risk assessment methodologies
  • Market volatility analysis
  • Time series analysis
  • Business impact quantification

Technical Communication

  • Code documentation
  • Executive presentations
  • Technical reports
  • GitHub project management

📄 License

This project was completed as part of Columbia University's Applied Analytics Frameworks course. Code and methodology are available for educational and commercial use.


💬 Discussion

Interested in fraud detection or financial analytics? Open an issue or start a discussion!

Want to collaborate on improvements? Fork this repo and submit a pull request!


Keywords: fraud detection, machine learning, financial analytics, random forest, isolation forest, market volatility, SMOTE, anomaly detection, risk assessment, R programming, fintech, classification, clustering, PCA

Tech Stack: R · Random Forest · PCA · K-means · SMOTE · ggplot2 · caret · Isolation Forest · Statistical Analysis · Financial ML
