diff --git a/your-code/lab_imbalance.ipynb b/your-code/lab_imbalance.ipynb
index dbb15e1..9fc96cd 100644
--- a/your-code/lab_imbalance.ipynb
+++ b/your-code/lab_imbalance.ipynb
@@ -1,147 +1,498 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Inbalanced Classes\n",
- "## In this lab, we are going to explore a case of imbalanced classes. \n",
- "\n",
- "\n",
- "Like we disussed in class, when we have noisy data, if we are not careful, we can end up fitting our model to the noise in the data and not the 'signal'-- the factors that actually determine the outcome. This is called overfitting, and results in good results in training, and in bad results when the model is applied to real data. Similarly, we could have a model that is too simplistic to accurately model the signal. This produces a model that doesnt work well (ever). \n"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Note: before doing the first commit, make sure you don't include the large csv file, either by adding it to .gitignore, or by deleting it."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### First, download the data from: https://www.kaggle.com/datasets/chitwanmanchanda/fraudulent-transactions-data?resource=download . Import the dataset and provide some discriptive statistics and plots. What do you think will be the important features in determining the outcome?\n",
- "### Note: don't use the entire dataset, use a sample instead, with n=100000 elements, so your computer doesn't freeze."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 1,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Your code here"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### What is the distribution of the outcome? "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Your response here"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Clean the dataset. Pre-process it to make it suitable for ML training. Feel free to explore, drop, encode, transform, etc. Whatever you feel will improve the model score."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 1,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Your code here\n"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Run a logisitc regression classifier and evaluate its accuracy."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Your code here"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Now pick a model of your choice and evaluate its accuracy."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Your code here"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Which model worked better and how do you know?"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 2,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Your response here"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Note: before doing the first commit, make sure you don't include the large csv file, either by adding it to .gitignore, or by deleting it."
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.6.8"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 2
-}
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Inbalanced Classes\n",
+ "## In this lab, we are going to explore a case of imbalanced classes. \n",
+ "\n",
+ "\n",
+ "Like we disussed in class, when we have noisy data, if we are not careful, we can end up fitting our model to the noise in the data and not the 'signal'-- the factors that actually determine the outcome. This is called overfitting, and results in good results in training, and in bad results when the model is applied to real data. Similarly, we could have a model that is too simplistic to accurately model the signal. This produces a model that doesnt work well (ever). \n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Note: before doing the first commit, make sure you don't include the large csv file, either by adding it to .gitignore, or by deleting it."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### First, download the data from: https://www.kaggle.com/datasets/chitwanmanchanda/fraudulent-transactions-data?resource=download . Import the dataset and provide some discriptive statistics and plots. What do you think will be the important features in determining the outcome?\n",
+ "### Note: don't use the entire dataset, use a sample instead, with n=100000 elements, so your computer doesn't freeze."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Your code here\n",
+ "import pandas as pd\n",
+ "\n",
+ "data = pd.read_csv('Fraud.csv', nrows=100000)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "
\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " step | \n",
+ " type | \n",
+ " amount | \n",
+ " nameOrig | \n",
+ " oldbalanceOrg | \n",
+ " newbalanceOrig | \n",
+ " nameDest | \n",
+ " oldbalanceDest | \n",
+ " newbalanceDest | \n",
+ " isFraud | \n",
+ " isFlaggedFraud | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " | 0 | \n",
+ " 1 | \n",
+ " PAYMENT | \n",
+ " 9839.64 | \n",
+ " C1231006815 | \n",
+ " 170136.0 | \n",
+ " 160296.36 | \n",
+ " M1979787155 | \n",
+ " 0.0 | \n",
+ " 0.0 | \n",
+ " 0 | \n",
+ " 0 | \n",
+ "
\n",
+ " \n",
+ " | 1 | \n",
+ " 1 | \n",
+ " PAYMENT | \n",
+ " 1864.28 | \n",
+ " C1666544295 | \n",
+ " 21249.0 | \n",
+ " 19384.72 | \n",
+ " M2044282225 | \n",
+ " 0.0 | \n",
+ " 0.0 | \n",
+ " 0 | \n",
+ " 0 | \n",
+ "
\n",
+ " \n",
+ " | 2 | \n",
+ " 1 | \n",
+ " TRANSFER | \n",
+ " 181.00 | \n",
+ " C1305486145 | \n",
+ " 181.0 | \n",
+ " 0.00 | \n",
+ " C553264065 | \n",
+ " 0.0 | \n",
+ " 0.0 | \n",
+ " 1 | \n",
+ " 0 | \n",
+ "
\n",
+ " \n",
+ " | 3 | \n",
+ " 1 | \n",
+ " CASH_OUT | \n",
+ " 181.00 | \n",
+ " C840083671 | \n",
+ " 181.0 | \n",
+ " 0.00 | \n",
+ " C38997010 | \n",
+ " 21182.0 | \n",
+ " 0.0 | \n",
+ " 1 | \n",
+ " 0 | \n",
+ "
\n",
+ " \n",
+ " | 4 | \n",
+ " 1 | \n",
+ " PAYMENT | \n",
+ " 11668.14 | \n",
+ " C2048537720 | \n",
+ " 41554.0 | \n",
+ " 29885.86 | \n",
+ " M1230701703 | \n",
+ " 0.0 | \n",
+ " 0.0 | \n",
+ " 0 | \n",
+ " 0 | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " step type amount nameOrig oldbalanceOrg newbalanceOrig \\\n",
+ "0 1 PAYMENT 9839.64 C1231006815 170136.0 160296.36 \n",
+ "1 1 PAYMENT 1864.28 C1666544295 21249.0 19384.72 \n",
+ "2 1 TRANSFER 181.00 C1305486145 181.0 0.00 \n",
+ "3 1 CASH_OUT 181.00 C840083671 181.0 0.00 \n",
+ "4 1 PAYMENT 11668.14 C2048537720 41554.0 29885.86 \n",
+ "\n",
+ " nameDest oldbalanceDest newbalanceDest isFraud isFlaggedFraud \n",
+ "0 M1979787155 0.0 0.0 0 0 \n",
+ "1 M2044282225 0.0 0.0 0 0 \n",
+ "2 C553264065 0.0 0.0 1 0 \n",
+ "3 C38997010 21182.0 0.0 1 0 \n",
+ "4 M1230701703 0.0 0.0 0 0 "
+ ]
+ },
+ "execution_count": 3,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "data.head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### What is the distribution of the outcome? "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0 99884\n",
+ "1 116\n",
+ "Name: isFraud, dtype: int64"
+ ]
+ },
+ "execution_count": 4,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Your response here\n",
+ "data['isFraud'].value_counts()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Clean the dataset. Pre-process it to make it suitable for ML training. Feel free to explore, drop, encode, transform, etc. Whatever you feel will improve the model score."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Your code here\n",
+ "data.drop(columns=['nameOrig','nameDest','isFlaggedFraud'], inplace=True)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "9 37628\n",
+ "10 27274\n",
+ "8 21097\n",
+ "7 6837\n",
+ "1 2708\n",
+ "6 1660\n",
+ "2 1014\n",
+ "5 665\n",
+ "4 565\n",
+ "3 552\n",
+ "Name: step, dtype: int64"
+ ]
+ },
+ "execution_count": 12,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "data['step'].value_counts()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "PAYMENT 39512\n",
+ "CASH_OUT 30718\n",
+ "CASH_IN 20185\n",
+ "TRANSFER 8597\n",
+ "DEBIT 988\n",
+ "Name: type, dtype: int64"
+ ]
+ },
+ "execution_count": 13,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "data['type'].value_counts()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\n",
+ "RangeIndex: 100000 entries, 0 to 99999\n",
+ "Data columns (total 12 columns):\n",
+ " # Column Non-Null Count Dtype \n",
+ "--- ------ -------------- ----- \n",
+ " 0 step 100000 non-null int64 \n",
+ " 1 amount 100000 non-null float64\n",
+ " 2 oldbalanceOrg 100000 non-null float64\n",
+ " 3 newbalanceOrig 100000 non-null float64\n",
+ " 4 oldbalanceDest 100000 non-null float64\n",
+ " 5 newbalanceDest 100000 non-null float64\n",
+ " 6 isFraud 100000 non-null int64 \n",
+ " 7 type_CASH_IN 100000 non-null uint8 \n",
+ " 8 type_CASH_OUT 100000 non-null uint8 \n",
+ " 9 type_DEBIT 100000 non-null uint8 \n",
+ " 10 type_PAYMENT 100000 non-null uint8 \n",
+ " 11 type_TRANSFER 100000 non-null uint8 \n",
+ "dtypes: float64(5), int64(2), uint8(5)\n",
+ "memory usage: 5.8 MB\n"
+ ]
+ }
+ ],
+ "source": [
+ "data_dummies = pd.get_dummies(data)\n",
+ "data_dummies.info()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Run a logisitc regression classifier and evaluate its accuracy."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 22,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Your code here\n",
+ "from sklearn.linear_model import LogisticRegression\n",
+ "from sklearn.model_selection import train_test_split\n",
+ "from sklearn.metrics import confusion_matrix, accuracy_score"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "X = data_dummies.drop(columns=['isFraud'])\n",
+ "Y = data_dummies['isFraud']"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 19,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "X_train, X_test, y_train, y_test = train_test_split(X, Y)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 21,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "LogisticRegression(max_iter=1000)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org. "
+ ],
+ "text/plain": [
+ "LogisticRegression(max_iter=1000)"
+ ]
+ },
+ "execution_count": 21,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "model = LogisticRegression(max_iter=1000)\n",
+ "model.fit(X_train, y_train)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 24,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Accuracy: 0.99892\n"
+ ]
+ },
+ {
+ "data": {
+ "text/plain": [
+ "array([[24969, 1],\n",
+ " [ 26, 4]])"
+ ]
+ },
+ "execution_count": 24,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "y_pred = model.predict(X_test)\n",
+ "print(\"Accuracy: \", accuracy_score(y_test, y_pred))\n",
+ "confusion_matrix(y_test, y_pred)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Now pick a model of your choice and evaluate its accuracy."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 26,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Accuracy: 0.99884\n"
+ ]
+ },
+ {
+ "data": {
+ "text/plain": [
+ "array([[24970, 0],\n",
+ " [ 29, 1]])"
+ ]
+ },
+ "execution_count": 26,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Your code here\n",
+ "from sklearn.neighbors import KNeighborsClassifier\n",
+ "model = KNeighborsClassifier(n_neighbors = 11)\n",
+ "model.fit(X_train, y_train)\n",
+ "y_pred = model.predict(X_test)\n",
+ "print(\"Accuracy: \", accuracy_score(y_test, y_pred))\n",
+ "confusion_matrix(y_test, y_pred)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Which model worked better and how do you know?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Your response here\n",
+ "# The logestic regression worked better than KNN because the accuracy of logestic regression is higher\n",
+ "# which can be seen in the accuracy and confusion matrix"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Note: before doing the first commit, make sure you don't include the large csv file, either by adding it to .gitignore, or by deleting it."
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.10.9"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}