A comprehensive machine learning approach to fraud detection combining unsupervised clustering, supervised classification, and financial market analysis to identify fraudulent transactions in real-time banking data.
This project tackles financial fraud detection using a multi-faceted analytical approach that integrates traditional machine learning with financial market indicators. By analyzing transaction patterns alongside stock market volatility, we developed a robust system capable of detecting fraudulent activities with high precision.
Financial institutions lose billions annually to fraud. Traditional rule-based systems miss sophisticated fraud patterns, while modern machine learning approaches often ignore external market conditions that may influence fraudulent behavior. This project addresses both gaps by:
- Detecting anomalous transaction patterns using unsupervised learning
- Classifying fraud with high accuracy using supervised models
- Analyzing fraud correlation with market volatility to improve risk assessment
| Model | AUC-ROC | Precision | Recall | Key Strength |
|---|---|---|---|---|
| Random Forest | 0.94 | 0.89 | 0.87 | Best overall performance |
| Logistic Regression | 0.88 | 0.82 | 0.80 | Interpretable baseline |
| Decision Tree | 0.85 | 0.78 | 0.81 | Fast inference |
| Isolation Forest | 0.91 | 0.85 | - | Unsupervised anomaly detection |
- 0.94 AUC-ROC - Excellent discrimination between fraud and legitimate transactions
- 89% Precision - Minimizes false positives, reducing customer friction
- Market Correlation - Fraud rates increase by 15% during high volatility periods
- Real-time Capability - Models optimized for production deployment
Q1: Can unsupervised learning identify fraud patterns without labels?
- Approach: PCA + K-means clustering
- Result: Successfully identified 4 distinct transaction clusters with varying fraud rates
Q2: Which supervised models best classify fraudulent transactions?
- Approach: Logistic Regression, Decision Trees, Random Forest with SMOTE
- Result: Random Forest achieved 0.94 AUC-ROC with balanced precision and recall
Q3: Does market volatility correlate with fraud frequency?
- Approach: Stock market data integration, ANOVA, t-tests
- Result: Significant positive correlation (p < 0.001) between volatility and fraud
```
# Core Libraries
R 4.0+
├── Machine Learning
│   ├── caret          # Model training & evaluation
│   ├── randomForest   # Ensemble learning
│   └── isotree        # Isolation Forest
├── Data Processing
│   ├── data.table     # High-performance data manipulation
│   ├── dplyr          # Data wrangling
│   └── recipes        # Feature engineering pipeline
├── Imbalanced Learning
│   └── smotefamily    # SMOTE implementation
└── Visualization
    ├── ggplot2        # Statistical graphics
    ├── corrplot       # Correlation matrices
    └── factoextra     # PCA visualization
```

```
Data Pipeline → Feature Engineering → Class Balancing → Model Training → Evaluation
      ↓                 ↓                   ↓                 ↓               ↓
 [CSV Load]        [Normalize]          [SMOTE]          [RF/LR/DT]    [AUC/Precision]
                   [Encode]             [Downsample]     [IsoForest]   [Confusion Matrix]
                   [PCA]                                               [ROC Curves]
```
```
fraud-detection-market-analysis/
├── README.md                            # Project documentation
├── code/
│   ├── Q1_unsupervised_clustering.R     # PCA + K-means analysis
│   ├── Q2_supervised_classification.R   # ML model training
│   └── Q3_market_volatility_analysis.R  # Financial correlation study
├── docs/
│   ├── Final_Report.docx                # Complete analysis report
│   └── Presentation.pptx                # Executive presentation
└── visualizations/
    ├── fraud_rate_by_volatility.png     # Market correlation plot
    ├── fraud_type_by_volatility.png     # Transaction type analysis
    └── fraud_timing_by_volatility.png   # Temporal patterns
```
```r
# Install required packages
install.packages(c(
  "caret", "dplyr", "ggplot2", "pROC", "randomForest",
  "smotefamily", "isotree", "data.table", "recipes",
  "corrplot", "FactoMineR", "factoextra", "cluster"
))

# 1. Unsupervised Clustering (Q1)
source("code/Q1_unsupervised_clustering.R")
# Outputs: PCA components, cluster assignments, fraud rates per cluster

# 2. Supervised Classification (Q2)
source("code/Q2_supervised_classification.R")
# Outputs: Trained models, confusion matrices, ROC curves, AUC scores

# 3. Market Volatility Analysis (Q3)
source("code/Q3_market_volatility_analysis.R")
# Outputs: Volatility metrics, correlation plots, ANOVA results
```

Input: `CLEANED_transactions_with_stock.csv`
Key Features:
- Transaction attributes: Amount, Hour, Type, Period
- User demographics: Age, Gender, Account Type
- Market indicators: AAPL, AMZN, GOOGL, META, MSFT, NVDA, TSLA stock prices
- Target variable: Is_Fraud (binary)
Objective: Discover natural transaction clusters and identify high-risk groups
Approach:

```r
library(caret)
library(FactoMineR)

# 1. Data preprocessing & standardization
preproc <- preProcess(train_data, method = c("center", "scale"))
scaled_data <- predict(preproc, train_data)

# 2. PCA for dimensionality reduction
n_components <- 8  # Based on eigenvalue > 1 criterion
pca_result <- PCA(scaled_data, ncp = n_components, graph = FALSE)
pca_components <- pca_result$ind$coord  # retained component scores

# 3. K-means clustering
km <- kmeans(pca_components, centers = 4, nstart = 25)

# 4. Analyze fraud rates per cluster
fraud_rates <- aggregate(Is_Fraud ~ Cluster, data = train_results, mean)
```

Key Findings:
- 4 optimal clusters identified via elbow method
- Cluster 3 had highest fraud rate (23%)
- PCA reduced dimensionality from 15 features to 8 components
- 85% of variance explained by first 8 components
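The component-count rule above (keep components with eigenvalue > 1, check cumulative variance explained) is easy to verify outside R. Below is an illustrative Python sketch of that criterion — the project itself uses FactoMineR, and the random matrix here merely stands in for the 15 scaled features:

```python
import numpy as np

def select_components(X):
    """Kaiser criterion sketch: standardize the features, take the
    eigenvalues of their correlation/covariance matrix, keep components
    with eigenvalue > 1, and report the variance they explain."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)            # center & scale
    eigvals = np.sort(np.linalg.eigvalsh(np.cov(Xs, rowvar=False)))[::-1]
    n_keep = int((eigvals > 1).sum())                    # eigenvalue > 1 rule
    explained = eigvals[:n_keep].sum() / eigvals.sum()   # cumulative variance
    return n_keep, explained

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 15))   # synthetic stand-in for the 15 features
n_keep, explained = select_components(X)
```

On the real transaction data this procedure returned 8 components covering 85% of the variance; on pure noise, as here, roughly half the eigenvalues exceed 1.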
Objective: Build production-ready fraud detection models
Approach:

```r
library(caret)
library(smotefamily)
library(recipes)

# 1. Handle class imbalance with SMOTE
smote_result <- SMOTE(
  X = train_features,
  target = train_labels,
  K = 5,
  dup_size = 5
)

# 2. Feature engineering pipeline
rec <- recipe(Is_Fraud ~ ., data = train_smote) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_normalize(all_numeric()) %>%
  step_zv(all_predictors())

# 3. Train multiple models with cross-validation
ctrl <- trainControl(method = "cv", number = 3,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)

rf_model <- train(Is_Fraud ~ ., data = train_smote,
                  method = "rf", trControl = ctrl,
                  metric = "ROC")
```

Model Comparison:
- Random Forest: Best AUC (0.94), balanced performance
- Logistic Regression: Most interpretable, fast inference
- Isolation Forest: Best for unsupervised detection (95th percentile threshold)
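The Isolation Forest decision rule noted above — flag everything beyond the 95th-percentile anomaly score — reduces to a one-line threshold. A minimal Python sketch (the project uses `isotree` in R; the normal scores here are synthetic stand-ins for real anomaly scores):

```python
import numpy as np

# Synthetic stand-in for Isolation Forest anomaly scores
rng = np.random.default_rng(3)
scores = rng.normal(size=1000)

threshold = np.quantile(scores, 0.95)   # 95th-percentile cutoff
flagged = scores > threshold            # ~5% of transactions flagged
```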
Optimization Techniques:
- SMOTE oversampling for minority class
- Downsampling majority class (50K samples)
- 3-fold cross-validation
- Hyperparameter tuning via grid search
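For readers unfamiliar with SMOTE's mechanics: each synthetic minority sample is an interpolation between a real minority sample and one of its k nearest minority-class neighbours. An illustrative Python sketch (the project uses `smotefamily::SMOTE` in R; this brute-force version is for exposition only):

```python
import numpy as np

def smote(X_min, k=5, dup_size=5, seed=0):
    """Minimal SMOTE sketch: for each minority sample, create dup_size
    synthetic points by interpolating toward a randomly chosen one of
    its k nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    # pairwise distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)              # exclude self-matches
    nn = np.argsort(d, axis=1)[:, :k]        # k nearest neighbours per sample
    synth = []
    for i, neigh in enumerate(nn):
        for _ in range(dup_size):
            j = rng.choice(neigh)
            gap = rng.random()               # interpolation factor in [0, 1)
            synth.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synth)

X_min = np.random.default_rng(1).normal(size=(20, 4))  # toy minority class
X_new = smote(X_min)   # 20 samples x dup_size = 100 synthetic points
```

With `K = 5` and `dup_size = 5` as in the R call above, the minority class grows six-fold (original plus five synthetic points per sample).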
Objective: Quantify relationship between market volatility and fraud
Approach:

```r
library(dplyr)

# 1. Calculate market returns (daily % change in average stock price)
merged_df <- merged_df %>%
  mutate(Market_Return = (Stock_Avg - lag(Stock_Avg)) / lag(Stock_Avg) * 100)

# 2. Define volatility quintiles on absolute returns
merged_df <- merged_df %>%
  mutate(Volatility_Level = ntile(abs(Market_Return), 5))

# 3. Statistical testing
anova_result <- aov(Fraud_Rate ~ factor(Volatility_Level), data = daily_fraud)
t.test(Fraud_Rate ~ Volatility_Bin, data = daily_fraud_bin)
```

Key Findings:
- 15% increase in fraud during high volatility periods
- ANOVA p-value < 0.001: Significant differences across volatility levels
- Evening transactions showed highest fraud correlation with volatility
- Wire transfers most affected by market conditions
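The return and quintile computation above has a direct pandas analogue, sketched here for illustration (`pct_change` is the `(x - lag(x)) / lag(x)` formula, `qcut` plays the role of `dplyr::ntile`; the price series is synthetic):

```python
import numpy as np
import pandas as pd

# Synthetic daily average price series standing in for Stock_Avg
rng = np.random.default_rng(2)
daily = pd.DataFrame({"Stock_Avg": 100 + rng.normal(0, 1, 250).cumsum()})

# Daily percentage return: (x - lag(x)) / lag(x) * 100
daily["Market_Return"] = daily["Stock_Avg"].pct_change() * 100

# Volatility quintiles on |return| (pandas analogue of dplyr::ntile)
daily["Volatility_Level"] = pd.qcut(
    daily["Market_Return"].abs(), 5, labels=False, duplicates="drop") + 1
```

Note the first row has no lagged price, so its return and quintile are NaN — the R `lag()` version behaves the same way.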
Use Case: Real-time transaction monitoring
- Deploy Random Forest model in production for live fraud scoring
- Integrate market volatility feeds for dynamic risk thresholds
- Reduce false positives by 30% using context-aware models
Use Case: Risk assessment & fraud prevention
- Implement Isolation Forest for anomaly detection in customer behavior
- Use clustering insights to segment customers by risk profile
- Adjust fraud monitoring during high-volatility market periods
Use Case: Seller fraud & account takeover detection
- Apply supervised models to identify fraudulent seller accounts
- Use unsupervised clustering to detect coordinated fraud rings
- Monitor transaction patterns during major sales events
Use Case: Claims fraud detection
- Leverage ensemble methods for suspicious claim identification
- Incorporate external economic indicators (like stock market data)
- Build explainable models for regulatory compliance
```r
# Save trained model
saveRDS(rf_model, "models/random_forest_fraud_detector.rds")

# Load and predict in production
model <- readRDS("models/random_forest_fraud_detector.rds")
predictions <- predict(model, new_transactions, type = "prob")

# Apply risk threshold
high_risk <- predictions[, "Fraud"] > 0.3  # 30% threshold
```

- Inference Time: <10ms per transaction (Random Forest)
- Memory Footprint: ~50MB model size
- Scalability: Parallel processing with `doParallel`
- Monitoring: Track concept drift with periodic retraining
```
Transaction → Feature Engineering → Model Scoring → Risk Decision
     ↓               ↓                    ↓               ↓
[Real-time]     [Normalize]         [RF Ensemble]    [Block/Allow]
                [Encode]            [Prob > 0.3]     [Alert Team]
                [Market Data]
```
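The final risk-decision stage is just a threshold on the model's fraud probability. A hedged Python sketch (the 0.3 cutoff comes from the deployment snippet above; the function and action names are illustrative, not part of the project's code):

```python
def risk_decision(fraud_prob: float, threshold: float = 0.3) -> str:
    """Map a model's fraud probability to a pipeline action.
    Action names ('block_and_alert', 'allow') are illustrative."""
    return "block_and_alert" if fraud_prob > threshold else "allow"
```

In production the threshold would be tuned against the precision/recall trade-off, and could shift dynamically with the market-volatility feed.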
- Train-Test Split: 80-20 stratified by fraud label
- Cross-Validation: 3-fold CV for model selection
- Class Imbalance: SMOTE + downsampling
- Multiple Metrics: AUC, Precision, Recall, F1-Score
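The headline AUC metric has a useful interpretation worth making concrete: it is the probability that a randomly chosen fraud case receives a higher score than a randomly chosen legitimate one. An illustrative Python sketch of that rank-based identity (the project computes AUC with `pROC` in R):

```python
import numpy as np

def auc_roc(y_true, scores):
    """AUC via the rank-sum (Mann-Whitney) identity: the fraction of
    (fraud, legitimate) pairs where the fraud case scores higher,
    counting ties as half."""
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))
```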
- Bartlett's Test: Confirmed PCA suitability (p < 0.001)
- KMO Test: Sampling adequacy = 0.82 (good)
- ANOVA: Volatility effect on fraud (F = 45.2, p < 0.001)
- T-Test: High vs Low volatility fraud rates (p = 0.002)
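For reference, the one-way ANOVA behind the volatility result compares between-group to within-group variance in fraud rates. A minimal Python sketch of the F statistic (the project uses R's `aov`; the toy groups here are for illustration only):

```python
import numpy as np

def anova_f(groups):
    """One-way ANOVA F statistic: mean square between groups divided
    by mean square within groups (here, fraud rates grouped by
    volatility level)."""
    all_x = np.concatenate(groups)
    grand = all_x.mean()
    k, n = len(groups), len(all_x)
    ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

A large F (like the reported F = 45.2) means fraud-rate differences across volatility levels dwarf the day-to-day noise within each level.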
- Data Period: Analysis based on 6-month transaction window
- Geographic Scope: US-based transactions only
- Stock Selection: Limited to 7 large-cap tech stocks (AAPL, AMZN, GOOGL, META, MSFT, NVDA, TSLA)
- Temporal Lag: Market data may not reflect same-day impact
- Deep Learning: LSTM networks for sequential pattern detection
- Additional Features: Device fingerprinting, geolocation, velocity checks
- Real-time Updates: Online learning for model adaptation
- Explainability: SHAP values for individual predictions
- Multi-currency: Extend to international transactions
Project Team: Group 5
- Data preprocessing & feature engineering
- Model development & optimization
- Statistical analysis & validation
- Visualization & documentation
Academic Context:
- Course: APAN 5205 - Applied Analytics Frameworks II
- Institution: Columbia University
- Program: Master of Science in Applied Analytics
- Semester: Fall 2024
- Supervised learning (Classification)
- Unsupervised learning (Clustering, PCA)
- Ensemble methods (Random Forest)
- Anomaly detection (Isolation Forest)
- Class imbalance techniques (SMOTE)
- Feature engineering & preprocessing
- Dimensionality reduction (PCA)
- Model evaluation & selection
- Statistical hypothesis testing
- Data visualization (ggplot2)
- Financial fraud detection
- Risk assessment methodologies
- Market volatility analysis
- Time series analysis
- Business impact quantification
- Code documentation
- Executive presentations
- Technical reports
- GitHub project management
- Workplace Napping & Productivity - Statistical simulation (R)
- NYC Rent Analysis - Real estate analytics (Python)
- CTR Prediction - Marketing ML (R/XGBoost)
This project was completed as part of Columbia University's Applied Analytics Frameworks course. Code and methodology are available for educational and commercial use.
Interested in fraud detection or financial analytics? Open an issue or start a discussion!
Want to collaborate on improvements? Fork this repo and submit a pull request!
Keywords: fraud detection, machine learning, financial analytics, random forest, isolation forest, market volatility, SMOTE, anomaly detection, risk assessment, R programming, fintech, classification, clustering, PCA
Tech Stack: R Β· Random Forest Β· PCA Β· K-means Β· SMOTE Β· ggplot2 Β· caret Β· Isolation Forest Β· Statistical Analysis Β· Financial ML