A comprehensive machine learning approach to fraud detection combining unsupervised clustering, supervised classification, and financial market analysis to identify fraudulent transactions in real-time banking data.
This project tackles financial fraud detection using a multi-faceted analytical approach that integrates traditional machine learning with financial market indicators. By analyzing transaction patterns alongside stock market volatility, we developed a robust system capable of detecting fraudulent activities with high precision.
Financial institutions lose billions annually to fraud. Traditional rule-based systems miss sophisticated fraud patterns, while modern machine learning approaches often ignore external market conditions that may influence fraudulent behavior. This project addresses both gaps by:
- Detecting anomalous transaction patterns using unsupervised learning
- Classifying fraud with high accuracy using supervised models
- Analyzing fraud correlation with market volatility to improve risk assessment
| Model | AUC-ROC | Precision | Recall | Key Strength |
|---|---|---|---|---|
| Random Forest | 0.94 | 0.89 | 0.87 | Best overall performance |
| Logistic Regression | 0.88 | 0.82 | 0.80 | Interpretable baseline |
| Decision Tree | 0.85 | 0.78 | 0.81 | Fast inference |
| Isolation Forest | 0.91 | 0.85 | - | Unsupervised anomaly detection |
- 0.94 AUC-ROC - Excellent discrimination between fraud and legitimate transactions
- 89% Precision - Minimizes false positives, reducing customer friction
- Market Correlation - Fraud rates increase by 15% during high volatility periods
- Real-time Capability - Models optimized for production deployment
Q1: Can unsupervised learning identify fraud patterns without labels?
- Approach: PCA + K-means clustering
- Result: Successfully identified 4 distinct transaction clusters with varying fraud rates
Q2: Which supervised models best classify fraudulent transactions?
- Approach: Logistic Regression, Decision Trees, Random Forest with SMOTE
- Result: Random Forest achieved 0.94 AUC-ROC with balanced precision and recall
Q3: Does market volatility correlate with fraud frequency?
- Approach: Stock market data integration, ANOVA, t-tests
- Result: Significant positive correlation (p < 0.001) between volatility and fraud
```
# Core Libraries
R 4.0+
├── Machine Learning
│   ├── caret          # Model training & evaluation
│   ├── randomForest   # Ensemble learning
│   └── isotree        # Isolation Forest
├── Data Processing
│   ├── data.table     # High-performance data manipulation
│   ├── dplyr          # Data wrangling
│   └── recipes        # Feature engineering pipeline
├── Imbalanced Learning
│   └── smotefamily    # SMOTE implementation
└── Visualization
    ├── ggplot2        # Statistical graphics
    ├── corrplot       # Correlation matrices
    └── factoextra     # PCA visualization
```

```
Data Pipeline → Feature Engineering → Class Balancing → Model Training → Evaluation
      ↓                 ↓                   ↓                 ↓               ↓
 [CSV Load]        [Normalize]          [SMOTE]          [RF/LR/DT]    [AUC/Precision]
                   [Encode]             [Downsample]     [IsoForest]   [Confusion Matrix]
                   [PCA]                                               [ROC Curves]
```
```
fraud-detection-market-analysis/
├── README.md                            # Project documentation
├── code/
│   ├── Q1_unsupervised_clustering.R     # PCA + K-means analysis
│   ├── Q2_supervised_classification.R   # ML model training
│   └── Q3_market_volatility_analysis.R  # Financial correlation study
├── docs/
│   ├── Final_Report.docx                # Complete analysis report
│   └── Presentation.pptx                # Executive presentation
└── visualizations/
    ├── fraud_rate_by_volatility.png     # Market correlation plot
    ├── fraud_type_by_volatility.png     # Transaction type analysis
    └── fraud_timing_by_volatility.png   # Temporal patterns
```
```r
# Install required packages
install.packages(c(
  "caret", "dplyr", "ggplot2", "pROC", "randomForest",
  "smotefamily", "isotree", "data.table", "recipes",
  "corrplot", "FactoMineR", "factoextra", "cluster"
))

# 1. Unsupervised Clustering (Q1)
source("code/Q1_unsupervised_clustering.R")
# Outputs: PCA components, cluster assignments, fraud rates per cluster

# 2. Supervised Classification (Q2)
source("code/Q2_supervised_classification.R")
# Outputs: Trained models, confusion matrices, ROC curves, AUC scores

# 3. Market Volatility Analysis (Q3)
source("code/Q3_market_volatility_analysis.R")
# Outputs: Volatility metrics, correlation plots, ANOVA results
```

Input: `CLEANED_transactions_with_stock.csv`
Key Features:
- Transaction attributes: Amount, Hour, Type, Period
- User demographics: Age, Gender, Account Type
- Market indicators: AAPL, AMZN, GOOGL, META, MSFT, NVDA, TSLA stock prices
- Target variable: Is_Fraud (binary)
Objective: Discover natural transaction clusters and identify high-risk groups
Approach:

```r
library(caret)
library(FactoMineR)

# 1. Data preprocessing & standardization
preproc <- preProcess(train_data, method = c("center", "scale"))
scaled_data <- predict(preproc, train_data)

# 2. PCA for dimensionality reduction
n_components <- 8  # Based on eigenvalue > 1 criterion
pca_result <- PCA(scaled_data, ncp = n_components, graph = FALSE)
pca_components <- pca_result$ind$coord  # retained component scores

# 3. K-means clustering
km <- kmeans(pca_components, centers = 4, nstart = 25)

# 4. Analyze fraud rates per cluster
fraud_rates <- aggregate(Is_Fraud ~ Cluster, data = train_results, mean)
```

Key Findings:
- 4 optimal clusters identified via elbow method
- Cluster 3 had highest fraud rate (23%)
- PCA reduced dimensionality from 15 features to 8 components
- 85% of variance explained by first 8 components
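The component-count rule above (keep components with eigenvalue > 1, check cumulative variance explained) is easy to verify outside R. Below is an illustrative Python sketch of that criterion — the project itself uses FactoMineR, and the random matrix here merely stands in for the 15 scaled features:

```python
import numpy as np

def select_components(X):
    """Kaiser criterion sketch: standardize the features, take the
    eigenvalues of their correlation/covariance matrix, keep components
    with eigenvalue > 1, and report the variance they explain."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)            # center & scale
    eigvals = np.sort(np.linalg.eigvalsh(np.cov(Xs, rowvar=False)))[::-1]
    n_keep = int((eigvals > 1).sum())                    # eigenvalue > 1 rule
    explained = eigvals[:n_keep].sum() / eigvals.sum()   # cumulative variance
    return n_keep, explained

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 15))   # synthetic stand-in for the 15 features
n_keep, explained = select_components(X)
```

On the real transaction data this procedure returned 8 components covering 85% of the variance; on pure noise, as here, roughly half the eigenvalues exceed 1.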
Objective: Build production-ready fraud detection models
Approach:

```r
library(caret)
library(smotefamily)
library(recipes)

# 1. Handle class imbalance with SMOTE
smote_result <- SMOTE(
  X = train_features,
  target = train_labels,
  K = 5,
  dup_size = 5
)

# 2. Feature engineering pipeline
rec <- recipe(Is_Fraud ~ ., data = train_smote) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_normalize(all_numeric()) %>%
  step_zv(all_predictors())

# 3. Train multiple models with cross-validation
ctrl <- trainControl(method = "cv", number = 3,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)

rf_model <- train(Is_Fraud ~ ., data = train_smote,
                  method = "rf", trControl = ctrl,
                  metric = "ROC")
```

Model Comparison:
- Random Forest: Best AUC (0.94), balanced performance
- Logistic Regression: Most interpretable, fast inference
- Isolation Forest: Best for unsupervised detection (95th percentile threshold)
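The Isolation Forest decision rule noted above — flag everything beyond the 95th-percentile anomaly score — reduces to a one-line threshold. A minimal Python sketch (the project uses `isotree` in R; the normal scores here are synthetic stand-ins for real anomaly scores):

```python
import numpy as np

# Synthetic stand-in for Isolation Forest anomaly scores
rng = np.random.default_rng(3)
scores = rng.normal(size=1000)

threshold = np.quantile(scores, 0.95)   # 95th-percentile cutoff
flagged = scores > threshold            # ~5% of transactions flagged
```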
Optimization Techniques:
- SMOTE oversampling for minority class
- Downsampling majority class (50K samples)
- 3-fold cross-validation
- Hyperparameter tuning via grid search
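For readers unfamiliar with SMOTE's mechanics: each synthetic minority sample is an interpolation between a real minority sample and one of its k nearest minority-class neighbours. An illustrative Python sketch (the project uses `smotefamily::SMOTE` in R; this brute-force version is for exposition only):

```python
import numpy as np

def smote(X_min, k=5, dup_size=5, seed=0):
    """Minimal SMOTE sketch: for each minority sample, create dup_size
    synthetic points by interpolating toward a randomly chosen one of
    its k nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    # pairwise distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)              # exclude self-matches
    nn = np.argsort(d, axis=1)[:, :k]        # k nearest neighbours per sample
    synth = []
    for i, neigh in enumerate(nn):
        for _ in range(dup_size):
            j = rng.choice(neigh)
            gap = rng.random()               # interpolation factor in [0, 1)
            synth.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synth)

X_min = np.random.default_rng(1).normal(size=(20, 4))  # toy minority class
X_new = smote(X_min)   # 20 samples x dup_size = 100 synthetic points
```

With `K = 5` and `dup_size = 5` as in the R call above, the minority class grows six-fold (original plus five synthetic points per sample).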
Objective: Quantify relationship between market volatility and fraud
Approach:

```r
library(dplyr)

# 1. Calculate market returns (daily % change in average stock price)
merged_df <- merged_df %>%
  mutate(Market_Return = (Stock_Avg - lag(Stock_Avg)) / lag(Stock_Avg) * 100)

# 2. Define volatility quintiles on absolute returns
merged_df <- merged_df %>%
  mutate(Volatility_Level = ntile(abs(Market_Return), 5))

# 3. Statistical testing
anova_result <- aov(Fraud_Rate ~ factor(Volatility_Level), data = daily_fraud)
t.test(Fraud_Rate ~ Volatility_Bin, data = daily_fraud_bin)
```

Key Findings:
- 15% increase in fraud during high volatility periods
- ANOVA p-value < 0.001: Significant differences across volatility levels
- Evening transactions showed highest fraud correlation with volatility
- Wire transfers most affected by market conditions
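The return and quintile computation above has a direct pandas analogue, sketched here for illustration (`pct_change` is the `(x - lag(x)) / lag(x)` formula, `qcut` plays the role of `dplyr::ntile`; the price series is synthetic):

```python
import numpy as np
import pandas as pd

# Synthetic daily average price series standing in for Stock_Avg
rng = np.random.default_rng(2)
daily = pd.DataFrame({"Stock_Avg": 100 + rng.normal(0, 1, 250).cumsum()})

# Daily percentage return: (x - lag(x)) / lag(x) * 100
daily["Market_Return"] = daily["Stock_Avg"].pct_change() * 100

# Volatility quintiles on |return| (pandas analogue of dplyr::ntile)
daily["Volatility_Level"] = pd.qcut(
    daily["Market_Return"].abs(), 5, labels=False, duplicates="drop") + 1
```

Note the first row has no lagged price, so its return and quintile are NaN — the R `lag()` version behaves the same way.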
Use Case: Real-time transaction monitoring
- Deploy Random Forest model in production for live fraud scoring
- Integrate market volatility feeds for dynamic risk thresholds
- Reduce false positives by 30% using context-aware models
Use Case: Risk assessment & fraud prevention
- Implement Isolation Forest for anomaly detection in customer behavior
- Use clustering insights to segment customers by risk profile
- Adjust fraud monitoring during high-volatility market periods
Use Case: Seller fraud & account takeover detection
- Apply supervised models to identify fraudulent seller accounts
- Use unsupervised clustering to detect coordinated fraud rings
- Monitor transaction patterns during major sales events
Use Case: Claims fraud detection
- Leverage ensemble methods for suspicious claim identification
- Incorporate external economic indicators (like stock market data)
- Build explainable models for regulatory compliance
```r
# Save trained model
saveRDS(rf_model, "models/random_forest_fraud_detector.rds")

# Load and predict in production
model <- readRDS("models/random_forest_fraud_detector.rds")
predictions <- predict(model, new_transactions, type = "prob")

# Apply risk threshold
high_risk <- predictions[, "Fraud"] > 0.3  # 30% threshold
```

- Inference Time: <10ms per transaction (Random Forest)
- Memory Footprint: ~50MB model size
- Scalability: Parallel processing with `doParallel`
- Monitoring: Track concept drift with periodic retraining
```
Transaction → Feature Engineering → Model Scoring → Risk Decision
     ↓               ↓                    ↓               ↓
[Real-time]     [Normalize]         [RF Ensemble]    [Block/Allow]
                [Encode]            [Prob > 0.3]     [Alert Team]
                [Market Data]
```
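The final risk-decision stage is just a threshold on the model's fraud probability. A hedged Python sketch (the 0.3 cutoff comes from the deployment snippet above; the function and action names are illustrative, not part of the project's code):

```python
def risk_decision(fraud_prob: float, threshold: float = 0.3) -> str:
    """Map a model's fraud probability to a pipeline action.
    Action names ('block_and_alert', 'allow') are illustrative."""
    return "block_and_alert" if fraud_prob > threshold else "allow"
```

In production the threshold would be tuned against the precision/recall trade-off, and could shift dynamically with the market-volatility feed.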
- Train-Test Split: 80-20 stratified by fraud label
- Cross-Validation: 3-fold CV for model selection
- Class Imbalance: SMOTE + downsampling
- Multiple Metrics: AUC, Precision, Recall, F1-Score
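The headline AUC metric has a useful interpretation worth making concrete: it is the probability that a randomly chosen fraud case receives a higher score than a randomly chosen legitimate one. An illustrative Python sketch of that rank-based identity (the project computes AUC with `pROC` in R):

```python
import numpy as np

def auc_roc(y_true, scores):
    """AUC via the rank-sum (Mann-Whitney) identity: the fraction of
    (fraud, legitimate) pairs where the fraud case scores higher,
    counting ties as half."""
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))
```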
- Bartlett's Test: Confirmed PCA suitability (p < 0.001)
- KMO Test: Sampling adequacy = 0.82 (good)
- ANOVA: Volatility effect on fraud (F = 45.2, p < 0.001)
- T-Test: High vs Low volatility fraud rates (p = 0.002)
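For reference, the one-way ANOVA behind the volatility result compares between-group to within-group variance in fraud rates. A minimal Python sketch of the F statistic (the project uses R's `aov`; the toy groups here are for illustration only):

```python
import numpy as np

def anova_f(groups):
    """One-way ANOVA F statistic: mean square between groups divided
    by mean square within groups (here, fraud rates grouped by
    volatility level)."""
    all_x = np.concatenate(groups)
    grand = all_x.mean()
    k, n = len(groups), len(all_x)
    ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

A large F (like the reported F = 45.2) means fraud-rate differences across volatility levels dwarf the day-to-day noise within each level.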
- Data Period: Analysis based on 6-month transaction window
- Geographic Scope: US-based transactions only
- Stock Selection: Limited to 7 large-cap tech stocks (AAPL, AMZN, GOOGL, META, MSFT, NVDA, TSLA)
- Temporal Lag: Market data may not reflect same-day impact
- Deep Learning: LSTM networks for sequential pattern detection
- Additional Features: Device fingerprinting, geolocation, velocity checks
- Real-time Updates: Online learning for model adaptation
- Explainability: SHAP values for individual predictions
- Multi-currency: Extend to international transactions
Project Team: Group 5
- Data preprocessing & feature engineering
- Model development & optimization
- Statistical analysis & validation
- Visualization & documentation
Academic Context:
- Course: APAN 5205 - Applied Analytics Frameworks II
- Institution: Columbia University
- Program: Master of Science in Applied Analytics
- Semester: Fall 2024
- Supervised learning (Classification)
- Unsupervised learning (Clustering, PCA)
- Ensemble methods (Random Forest)
- Anomaly detection (Isolation Forest)
- Class imbalance techniques (SMOTE)
- Feature engineering & preprocessing
- Dimensionality reduction (PCA)
- Model evaluation & selection
- Statistical hypothesis testing
- Data visualization (ggplot2)
- Financial fraud detection
- Risk assessment methodologies
- Market volatility analysis
- Time series analysis
- Business impact quantification
- Code documentation
- Executive presentations
- Technical reports
- GitHub project management
- Workplace Napping & Productivity - Statistical simulation (R)
- NYC Rent Analysis - Real estate analytics (Python)
- CTR Prediction - Marketing ML (R/XGBoost)
This project was completed as part of Columbia University's Applied Analytics Frameworks course. Code and methodology are available for educational and commercial use.
Interested in fraud detection or financial analytics? Open an issue or start a discussion!
Want to collaborate on improvements? Fork this repo and submit a pull request!
Keywords: fraud detection, machine learning, financial analytics, random forest, isolation forest, market volatility, SMOTE, anomaly detection, risk assessment, R programming, fintech, classification, clustering, PCA
Tech Stack: R Β· Random Forest Β· PCA Β· K-means Β· SMOTE Β· ggplot2 Β· caret Β· Isolation Forest Β· Statistical Analysis Β· Financial ML