This project tackles the problem of fraud detection using machine learning techniques to handle imbalanced data. The dataset used contains transaction details, where the target variable (Class) indicates whether a transaction is fraudulent (1) or non-fraudulent (0). The project applies various sampling techniques to balance the dataset and compares the performance of different models.
- Dataset: The project uses a dataset named creditcard.csv that contains transaction data.
- Objective: Detect fraudulent transactions and handle the imbalanced nature of the dataset, where fraud is a rare occurrence.
- Oversampling:
  - SMOTE (Synthetic Minority Over-sampling Technique): Synthetic data points are generated for the minority class (fraud cases).
  - ADASYN (Adaptive Synthetic Sampling): Focuses on generating synthetic data in regions where the minority class is harder to classify.
  - Bootstrap & Bagging: Balances the dataset by resampling and creating multiple sub-datasets to reduce the imbalance and improve model performance.
- Undersampling:
  - Cluster Centroid: Reduces the majority class by clustering it and keeping only the cluster centroids as representatives, which balances the dataset for training.
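To make the SMOTE idea above concrete, here is a minimal pure-Python sketch of its core step: a synthetic point is placed somewhere on the line segment between a minority sample and one of its nearest minority neighbours. This is an illustration only, not the project's code (the project relies on imblearn); the function name `smote_like_oversample`, its parameters, and the toy points are hypothetical.

```python
import random

def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours.
    Assumes `minority` holds at least two points."""
    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority points
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: sq_dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic

# Toy minority class: hypothetical 2-D "fraud" samples, not real data
fraud = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like_oversample(fraud, n_new=6)
print(new_points)
```

Because every synthetic point is an interpolation, it always stays inside the region spanned by the existing minority samples. ADASYN follows the same interpolation scheme but, as described above, generates more points around minority samples that are harder to classify.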
- Random Forest Classifier: An ensemble method that builds multiple decision trees and combines their predictions to improve accuracy and reduce overfitting.
- Bagging Classifier with Decision Trees: Uses bootstrap sampling to create multiple subsets of the data and trains a decision tree on each subset.
- Logistic Regression: A linear model trained on data resampled with the Cluster Centroid method.
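The bagging idea behind the second model can be sketched in a few lines of plain Python: draw bootstrap samples (sampling with replacement), fit one weak learner per sample, and predict by majority vote. As a simplifying assumption, the sketch uses a one-feature threshold rule (a "stump") instead of a full decision tree; all function names and the toy data are hypothetical and not taken from the project.

```python
import random
from collections import Counter

def fit_stump(points):
    """Fit a 1-D rule 'predict 1 if x > t' that minimises training errors."""
    best_t, best_err = None, None
    for t in sorted({x for x, _ in points}):
        err = sum((x > t) != bool(y) for x, y in points)
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return best_t

def fit_bagging(points, n_estimators=15, seed=0):
    """Train one stump per bootstrap sample of the training data."""
    rng = random.Random(seed)
    stumps = []
    for _ in range(n_estimators):
        # Bootstrap sample: draw len(points) items with replacement
        sample = [rng.choice(points) for _ in points]
        stumps.append(fit_stump(sample))
    return stumps

def predict(stumps, x):
    """Majority vote over the individual stump predictions."""
    votes = Counter(int(x > t) for t in stumps)
    return votes.most_common(1)[0][0]

# Toy (feature, label) pairs: label 1 marks the rare class
data = [(0.1, 0), (0.3, 0), (0.4, 0), (0.5, 0), (0.6, 0),
        (2.0, 1), (2.2, 1), (2.5, 1)]
model = fit_bagging(data)
print(predict(model, 0.05), predict(model, 3.0))
```

A random forest extends this scheme by also randomising which features each tree may consider at a split.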
- Data Loading and Exploration:
  - Load the creditcard.csv dataset.
  - Explore the imbalance in the dataset:
    - No Frauds: Approx. 99.83% of the dataset.
    - Frauds: Approx. 0.17% of the dataset.
- Data Preprocessing:
  - Split the dataset into features (X) and target (y).
  - Perform a train-test split to prepare data for model training and evaluation.
- Oversampling and Undersampling:
  - Apply SMOTE and ADASYN to oversample the minority class.
  - Use bootstrap bagging and undersampling methods (Cluster Centroid) to balance the data.
- Model Training and Evaluation:
  - Train the models using different sampling techniques.
  - Plot the class distribution after resampling.
  - Generate classification reports and confusion matrices to compare the performance of each model.
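The exploration and splitting steps above can be illustrated with a small standard-library sketch. It computes the class percentages and performs a train-test split that keeps the class ratio in both parts. Note two assumptions: the labels are synthetic, built only to match the reported imbalance, and the split is stratified, which the steps above do not state explicitly. The function names are hypothetical.

```python
import random
from collections import Counter

def class_distribution(labels):
    """Return {class: percentage of rows} for a sequence of labels."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: 100.0 * n / total for cls, n in counts.items()}

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) so that each class keeps roughly
    the same share in both parts."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Synthetic labels with the imbalance reported above (not the real data)
y = [0] * 9983 + [1] * 17
dist = class_distribution(y)
train_idx, test_idx = stratified_split(y)
print(dist)
print(len(train_idx), len(test_idx), sum(y[i] for i in test_idx))
```

With so few fraud rows, an unstratified split can leave the test set with almost no positives, which is why the sketch splits each class separately. Resampling (SMOTE, ADASYN, Cluster Centroid) would normally be applied to the training part only, so that the test set keeps the original distribution.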
- Confusion Matrices: Plots of confusion matrices after applying different sampling methods.
- Classification Reports: Include precision, recall, F1-score, and support for each sampling method (SMOTE, ADASYN, Bagging, and Cluster Centroid).
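For readers unfamiliar with the metrics in a classification report, the following sketch shows how precision, recall, and F1-score for the fraud class follow from the confusion-matrix counts. The labels and predictions are made-up toy values, not results from this project, and the helper names are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) with `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn

def precision_recall_f1(tp, fp, fn):
    """Precision = tp / (tp + fp), recall = tp / (tp + fn),
    F1 = harmonic mean of the two (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Toy ground truth and predictions (1 = fraud), for illustration only
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(tp, fp, fn, tn)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

On heavily imbalanced data, plain accuracy is dominated by the majority class, so the fraud-class recall and precision are usually the more informative numbers when comparing sampling methods. The support column in a report is simply the number of true samples in each class.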
- pandas: For data manipulation and analysis.
- matplotlib and seaborn: For data visualization.
- scikit-learn: For machine learning algorithms, data splitting, and evaluation metrics.
- imblearn: For handling imbalanced datasets (oversampling and undersampling techniques).
- Clone this repository:
    git clone https://github.com/your-repo/fraud-detection.git
- Install the required dependencies:
    pip install -r requirements.txt
- Run the Python script:
    python ds.py
This project demonstrates how to handle imbalanced data in fraud detection using various oversampling and undersampling techniques. The classification reports and confusion matrices make it possible to compare how well each method improves the classification of fraudulent transactions.