
## Introduction

The idea of out-of-sample prediction is described in detail throughout the literature[1]. The basic idea is to split the data into two groups, a training sample and a testing sample. Once the data is split, a statistical model is trained on the training sample. The trained model is then used to predict the dependent variable from the independent variables[4] in the testing sample. Finally, a loss metric is used to compare the predicted dependent variable against the ground-truth dependent variable: mean squared error[2] for a regression problem, or cross entropy[3] for classification.
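The procedure above can be sketched in a few lines. This is a minimal, illustrative example using a synthetic regression dataset; the dataset and parameter choices are assumptions for demonstration, not part of the method itself.

```python
# Minimal sketch of a single train/test split evaluation on synthetic data.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data (illustrative choices).
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)

# Split once: train on one part, hold the other out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Train on the training sample only.
model = LinearRegression().fit(X_train, y_train)

# Compare predictions against the held-out ground truth with a loss metric.
mse = mean_squared_error(y_test, model.predict(X_test))
print(mse)
```

Because this is a regression problem, mean squared error is the loss; for classification, cross entropy (log loss) would take its place.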

This method can be useful as a first pass to assess model quality; however, it has many deficiencies[5]#To Do add more references here#. Since we only split the data once and we are dealing with a classification problem, we must hope for two things:

1. We don't get a substantially different balance of the label classes in the training and testing samples, and that balance does not differ from the full data set or from the underlying population.

2. We don't get a concentration of independent variables caused by one specific exogenous effect[6] in the training data and by a different exogenous effect in the testing data.
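The first condition is easy to observe empirically: the class balance of the held-out sample drifts with the random seed. The following sketch (synthetic labels, illustrative parameters) shows the test-set positive rate varying across splits even though the underlying rate is fixed.

```python
# Sketch: the class balance in a single test split depends on the random seed.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
y = rng.binomial(1, 0.3, size=200)  # labels with an overall positive rate near 0.3

rates = []
for seed in range(50):
    # Split the labels with a different seed each time.
    _, y_test = train_test_split(y, test_size=0.2, random_state=seed)
    rates.append(y_test.mean())

# The test-set positive rate spreads around the true rate across seeds.
print(min(rates), max(rates))
```

A single unlucky seed can therefore hand the model a test set whose label balance is meaningfully different from the population's.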

If either of these conditions fails, our loss metric may record either a far too optimistic or a far too pessimistic view of how well the model does. This in turn has consequences for a whole host of things: we may fail to select the correct model, for instance choosing a logistic regression model[7] when a decision tree model[8] is more appropriate, or we may select the wrong hyperparameters for a given model class. A direct consequence of a bad model is poor inference, which in some cases may have consequences that are difficult or impossible to recognize[9][10][11][12]. Therefore it is of paramount importance that our models be 'honest' and that their error be well captured.

To deal with this failure to generalize from a single training and testing split, cross validation[13] was created to increase the number of training and testing splits and then average the error metric or metrics. It works by creating a number of random partitions of the data and treating one partition as out of sample while the rest are treated as in sample. The model is trained on the in-sample partitions and the out-of-sample partition is left for predicting against, just as before. The procedure is repeated for each partition, so that each partition is treated as both training and testing data. Finally, the metrics recorded on each partition are averaged and reported, alongside the individual loss metrics. The issue with this strategy is that you need to tune the number of partitions: too many and the individual partitions won't generalize well; too few and you run into the same issues as with a single train-test split.
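The partition-and-average procedure is what scikit-learn's `cross_val_score` implements. A short sketch on synthetic classification data (dataset and `cv` choice are illustrative):

```python
# Sketch of k-fold cross-validation: each fold serves once as the held-out
# set, and the per-fold scores are averaged.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# cv=5 partitions the data into 5 folds; choosing this number is exactly
# the tuning problem described above.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores, scores.mean())
```

Note that `scores` has one entry per fold, and the reported figure is their average, as described above.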

In theory, both of the methods described are sufficient; the issue comes down to what happens in practice. Therefore we have created [honest_ml](https://github.com/EricSchles/honest_ml), a library that performs many individual splits of the data, typically on the order of 500 to several thousand. The idea is to iterate over the random seed used in a typical train-test split implementation; for this library, we use scikit-learn's implementation[14], considered the gold standard by many. By doing so we remove the need to choose the right number of partitions. Additionally, we are far less likely to be misled by a lucky or unlucky split, because we split so many times.
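The seed-iteration idea can be sketched directly (this is a hand-rolled illustration of the technique, not honest_ml's actual API; the dataset, model, and number of seeds are assumptions for demonstration):

```python
# Sketch of seed iteration: repeat the train/test split many times by
# varying random_state, then examine the distribution of the metric
# rather than a single point estimate.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

accuracies = []
for seed in range(100):  # in practice, 500 to several thousand seeds
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    accuracies.append(accuracy_score(y_te, model.predict(X_te)))

accuracies = np.array(accuracies)
# Report the distribution of the metric, not a single value.
print(accuracies.mean(), accuracies.std())
```

Looking at the spread of `accuracies` shows how much a single split's metric can mislead: an unlucky seed lands in the tail of this distribution, not at its center.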

## Honest ML

citation:

13 - [Cross Validation Wikipedia](https://en.wikipedia.org/wiki/Cross-validation_(statistics))

14 - [scikit-learn's train test split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)