diff --git a/examples/Frank's Tutorial on Reweighing.ipynb b/examples/Frank's Tutorial on Reweighing.ipynb new file mode 100644 index 00000000..e4ddf061 --- /dev/null +++ b/examples/Frank's Tutorial on Reweighing.ipynb @@ -0,0 +1,736 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Frank's In-Depth Tutorial on the Reweighing Technique\n", + "##### Author: Guanzhong Chen\n", + "##### Date: 04/17/2020\n", + "*Reference: F. Kamiran and T. Calders, \"Data Preprocessing Techniques for Classification without Discrimination,\" Knowledge and Information Systems, 2012.*" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1. Introduction to Bias in Machine Learning" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This tutorial introduces one of the techniques in the AI Fairness 360 (AIF360) package, called \"Reweighing.\" The AIF360 toolkit is an open-source library that helps machine learning researchers and the wider community detect and mitigate bias in machine learning models." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To give some brief background on bias in machine learning, we look at a very simple dataset. This dataset classifies people described by a set of attributes as good or bad credit risks. The file used is `german.data`, which consists of 1000 instances and 20 features. We focus on the supervised machine learning problem with a binary target: the credit risk is either \"good\" or \"bad.\" A machine learning model learns and generalizes patterns from a training dataset and makes predictions on a test dataset based on what it has learned. However, there is a problem. The training dataset may not be representative of the true population of people of all age groups. 
For example, in the training dataset, people aged 25 or older are more likely to be labeled a good credit risk, whether because of how the dataset was collected or for some other reason. However, the true distribution might be different. A model trained on such data will inherit this bias, to the disadvantage of people younger than 25. In this case, \"age\" is our protected attribute, and it separates the instances into two groups: 25 or older, and younger than 25." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Before investigating further, let's import the German credit dataset, set the protected attribute and its threshold, split the data into training and test sets, and drop the other sensitive attributes." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Matplotlib Error, comment out matplotlib.use('TkAgg')\n" + ] + } + ], + "source": [ + "# Load all necessary packages\n", + "import sys\n", + "sys.path.insert(1, \"../\") \n", + "\n", + "import numpy as np\n", + "np.random.seed(0)\n", + "\n", + "from aif360.datasets import GermanDataset\n", + "from aif360.metrics import BinaryLabelDatasetMetric\n", + "from aif360.algorithms.preprocessing import Reweighing\n", + "\n", + "from IPython.display import Markdown, display" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "dataset_orig = GermanDataset(\n", + " protected_attribute_names=['age'], # age is the protected attribute\n", + " privileged_classes=[lambda x: x >= 25], # age >= 25 is the privileged group\n", + " features_to_drop=['personal_status', 'sex'] # drop the other sensitive attributes\n", + ")\n", + "\n", + "dataset_orig_train, dataset_orig_test = dataset_orig.split([0.7], shuffle=True)\n", + "\n", + "privileged_groups = [{'age': 1}]\n", + "unprivileged_groups = [{'age': 0}]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2. 
Preprocessing Techniques to Mitigate Bias" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "There are many ways to mitigate bias in machine learning. In this tutorial, we focus on those that are applied to the data before training. They are known as \"preprocessing\" techniques." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We list three commonly used ones here, with a brief description of each and its pros and cons, to help readers choose the most suitable one." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "1. **Suppression**: First, we identify the attributes that are most correlated with the protected attribute $A$. Then, we remove $A$ and these most correlated attributes. \n", + " * **Pros**: The algorithm is straightforward to understand and easy to implement. \n", + " * **Cons**: Sometimes we cannot simply remove the protected attribute and its correlates. Some of them may be critical for a company's business analysis, and some may be important for the classification itself." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "2. **Dataset Massaging**: We change the labels of some objects in the dataset. The labels to change are selected by a ranker based on a Naive Bayes classifier.\n", + " * **Pros**: It helps mitigate bias even if the protected attributes are not allowed to be removed. It is also relatively easy to understand.\n", + " * **Cons**: It is, in a sense, rather intrusive, as it changes the labels of the instances." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "3. **Reweighing**: We assign weights to the training instances so that, in the weighted dataset, the protected attribute and the class label are statistically independent.\n", + " * **Pros**: It is computationally cheap: the weights can be computed from frequency counts alone. 
Unlike dataset massaging, it mitigates bias without changing any labels.\n", + " * **Cons**: It is more difficult to implement than the two techniques mentioned above." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3. Elaboration on Reweighing" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### a. Notations and Weight Concept" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To understand how this technique works, we first introduce some notation. We assume the protected attribute and the target variable are binary. Specifically, we denote the protected attribute as $A$ with two values $\\{b,w\\}$. We denote the target class as $T$ with two values $\\{+, -\\}$. The classifier is denoted as $C$, a random unlabeled data object as $X$, and the training dataset as $D$." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If the dataset $D$ is unbiased, $A$ and $T$ are statistically independent. Under independence, the expected probability of seeing an instance with a given protected attribute value and class, here $A=b$ and $T=+$, is the product of the two marginal frequencies: \n", + "\n", + "$$\n", + "P_{exp}(A=b \\wedge T=+) = \\frac{|\\{X \\in D | X(A) = b\\}|}{|D|} * \\frac{|\\{X \\in D | X(T) = +\\}|}{|D|}\n", + "$$" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In the actual data, the observed probability of seeing such an instance is the fraction of instances that have both values: \n", + "\n", + "$$\n", + "P_{obs}(A=b \\wedge T=+) = \\frac{|\\{X \\in D | X(A) = b \\wedge X(T) = +\\}|}{|D|}\n", + "$$" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We notice that $P_{exp}$ and $P_{obs}$ are usually different. 
We assign each instance $X$ with $X(A)=b$ and $X(T)=+$ the weight $W(X)$, calculated as follows:\n", + "\n", + "$$\n", + "W(X) = \\frac{P_{exp}(A=b \\wedge T=+)}{P_{obs}(A=b \\wedge T=+)}\n", + "$$" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In other words, $W(X)$ is the expected probability divided by the observed probability, and the same ratio is computed for each of the other combinations of $A$ and $T$. Combinations that are over-represented relative to independence receive weights below 1, and under-represented combinations receive weights above 1, which compensates for the bias." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### b. A Concrete Example of Weight Calculation" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Notation can be abstract, so we now work through a concrete example using the dataset introduced above." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Recall that our protected attribute is \"age,\" which has already been binarized (1.0 vs. 0.0) using the threshold of 25. A person who is 25 or older is labeled 1.0 in the age attribute, and 0.0 otherwise. The target class is also binary: 1.0 means good credit risk, and 2.0 means bad credit risk. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "First, we calculate the expected probability under independence. Specifically, we start with $P_{exp}(A=1.0 \\wedge T=1.0)$." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As the equation above indicates, we first need the frequency count of $A = 1.0$ in the training set, that is, $|\\{X \\in D | X(A) = 1.0\\}|$."
+ ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "587" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "freq_count_of_age_bigger_25 = sum(dataset_orig_train.convert_to_dataframe()[0][\"age\"] == 1.0)\n", + "freq_count_of_age_bigger_25" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In other words, in this training set, **587 people are 25 or older**." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Next, we need the frequency count of $T = 1.0$, that is, $|\\{X \\in D | X(T) = 1.0\\}|$." + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "490" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "freq_count_of_good_credit = sum(dataset_orig_train.convert_to_dataframe()[0][\"credit\"] == 1.0)\n", + "freq_count_of_good_credit" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In other words, in this training set, **490 people have a good credit risk**." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Finally, we need the total number of training instances, that is, $|D|$."
+ ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "700" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "total_instance = len(dataset_orig_train.convert_to_dataframe()[0][\"credit\"])\n", + "total_instance" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Combining the three values above, we have:\n", + "\n", + "$$\n", + "P_{exp}(A=1.0 \\wedge T=1.0) = \\frac{|\\{X \\in D | X(A) = 1.0\\}|}{|D|} * \\frac{|\\{X \\in D | X(T) = 1.0\\}|}{|D|}\n", + "$$\n", + "\n", + "We can calculate this value in Python." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0.587" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "P_exp = freq_count_of_age_bigger_25/total_instance * freq_count_of_good_credit/total_instance\n", + "P_exp" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Next, we calculate the observed probability $P_{obs}(A=1.0 \\wedge T=1.0)$." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As the equation above indicates, we first need the number of training instances with both $A = 1.0$ and $T = 1.0$, that is, $|\\{X \\in D | X(A) = 1.0 \\wedge X(T) = 1.0\\}|$."
+ ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "427" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df = dataset_orig_train.convert_to_dataframe()[0]\n", + "freq_count_both = df[(df['age']==1.0)&(df['credit']==1.0)].shape[0]\n", + "freq_count_both" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In other words, in this training set, **427 people are 25 or older and have a good credit risk**." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Since we already know that the total number of training instances is 700, we can calculate $P_{obs}$ using the following equation.\n", + "\n", + "$$\n", + "P_{obs}(A=1.0 \\wedge T=1.0) = \\frac{|\\{X \\in D | X(A) = 1.0 \\wedge X(T) = 1.0\\}|}{|D|}\n", + "$$\n", + "\n", + "We can calculate this specific value in Python." + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0.61" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "P_obs = freq_count_both / total_instance\n", + "P_obs" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Finally, we can calculate the weight using the two probabilities.\n", + "\n", + "$$\n", + "W(X) = \\frac{P_{exp}(A=1.0 \\wedge T=1.0)}{P_{obs}(A=1.0 \\wedge T=1.0)}\n", + "$$\n", + "\n", + "Again, we can calculate this value in Python."
+ ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0.962295081967213" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "weight = P_exp / P_obs\n", + "weight" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This is the weight we assign to each instance with $A = 1.0$ and $T = 1.0$ in the training set." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Summary: this means that for this dataset, we give people who are 25 or older AND have a good credit risk a weight of approximately 0.962.**" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Thanks to the AIF360 package, we do not have to compute these weights by hand. Its `Reweighing` class calculates them for us automatically. " + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [], + "source": [ + "RW = Reweighing(unprivileged_groups=unprivileged_groups,\n", + " privileged_groups=privileged_groups)\n", + "dataset_transf_train = RW.fit_transform(dataset_orig_train)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can use the package to confirm that our calculation is correct." + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "0.9622950819672131\n" + ] + } + ], + "source": [ + "print(RW.w_p_fav) # A is 1.0 and T is 1.0" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Congratulations! We have obtained the same result, up to floating-point rounding in the last digit." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Also note that since $A$ and $T$ are both binary, there are 4 different weights to assign in total. 
Interested readers can calculate the other 3 weights by hand and confirm them with AIF360." + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "1.100625\n", + "1.2555555555555555\n", + "0.678\n" + ] + } + ], + "source": [ + "print(RW.w_p_unfav)\n", + "print(RW.w_up_fav)\n", + "print(RW.w_up_unfav)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Summary:** \n", + "**1. The first value says that for this dataset, we give people who are 25 or older AND have a bad credit risk a weight of 1.100625.** \n", + "**2. The second value says that for this dataset, we give people who are younger than 25 AND have a good credit risk a weight of 1.2555555555555555.** \n", + "**3. The third value says that for this dataset, we give people who are younger than 25 AND have a bad credit risk a weight of 0.678.**" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can further visualize how many instances receive each weight using a bar plot." + ] + }, + { + "cell_type": "code", + "execution_count": 39, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "old_good = df[(df['age']==1.0)&(df['credit']==1.0)].shape[0]\n", + "young_good = df[(df['age']==0.0)&(df['credit']==1.0)].shape[0]\n", + "old_bad = df[(df['age']==1.0)&(df['credit']==2.0)].shape[0]\n", + "young_bad = df[(df['age']==0.0)&(df['credit']==2.0)].shape[0]\n", + "\n", + "import matplotlib.pyplot as plt\n", + "fig = plt.figure()\n", + "ax = fig.add_axes([0,0,1,1])\n", + "labels = ['old good(w=0.96)', 'young good(w=1.26)', 'old bad(w=1.10)', 'young bad(w=0.68)']\n", + "counts = [old_good,young_good,old_bad,young_bad]\n", + "ax.bar(labels,counts)\n", + "for i, v in enumerate(counts):\n", + "    ax.text(i, v, str(v), ha='center', va='bottom')  # print each count above its bar\n", + "plt.title(\"Count of Each Combination of Age and Credit Risk with Weight Value w\")\n", + "plt.xlabel('Combinations of Two Categories and Corresponding Weight Values')\n", + "plt.ylabel('Count')\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 4. Results with Reweighing" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now that we understand how Reweighing works, we want to check whether it actually reduces bias. We will use the `mean_difference` metric provided by the `BinaryLabelDatasetMetric` class, which is the rate of favorable outcomes in the unprivileged group minus the rate in the privileged group." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We first apply this metric to the original training dataset."
+ ] + }, + { + "cell_type": "code", + "execution_count": 40, + "metadata": {}, + "outputs": [ + { + "data": { + "text/markdown": [ + "#### Original training dataset" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Difference in mean outcomes between unprivileged and privileged groups = -0.169905\n" + ] + } + ], + "source": [ + "metric_orig_train = BinaryLabelDatasetMetric(dataset_orig_train, \n", + " unprivileged_groups=unprivileged_groups,\n", + " privileged_groups=privileged_groups)\n", + "display(Markdown(\"#### Original training dataset\"))\n", + "print(\"Difference in mean outcomes between unprivileged and privileged groups = %f\" % metric_orig_train.mean_difference())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As we can see, the rate of favorable outcomes in the privileged group is about 17 percentage points higher than in the unprivileged group in the training dataset, so the dataset is biased. Let's see what happens to the metric after applying the Reweighing preprocessing technique."
+ ] + }, + { + "cell_type": "code", + "execution_count": 41, + "metadata": {}, + "outputs": [ + { + "data": { + "text/markdown": [ + "#### Transformed training dataset" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Difference in mean outcomes between unprivileged and privileged groups = 0.000000\n" + ] + } + ], + "source": [ + "metric_transf_train = BinaryLabelDatasetMetric(dataset_transf_train, \n", + " unprivileged_groups=unprivileged_groups,\n", + " privileged_groups=privileged_groups)\n", + "display(Markdown(\"#### Transformed training dataset\"))\n", + "print(\"Difference in mean outcomes between unprivileged and privileged groups = %f\" % metric_transf_train.mean_difference())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "After applying Reweighing, the weighted mean difference on the training set is 0, so the bias measured by this metric has been removed from the training data." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.3" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +}