diff --git a/draft.md b/draft.md index 3409ef1..b18ed0c 100644 --- a/draft.md +++ b/draft.md @@ -8,19 +8,19 @@ Machine learning metrics are built around the idea of training a model and then ## Introduction -The idea of out of sample prediction is described in detail throughout the literature[1], the basic idea is to split the data into two groups, a training sample and a testing sample. Once the data is split then a statistical model is trained on the training sample. Then the trained model is used to predict the dependent variable from the independent variables[4] in the testing sample. Finally, a loss metric, like mean squared error[2] is used if it is a regression problem or cross entropy[3] is used if it's classification, to compare the predicted dependent variable against the ground truth dependent variable. +The idea of out of sample prediction is described in detail throughout the literature[1]; the basic idea is to split the data into two groups, a training sample and a testing sample. Once the data is split, a statistical model is trained on the training sample. Then the trained model is used to predict the dependent variable from the independent variables[4] in the testing sample. Finally, a loss metric (mean squared error[2] for regression, cross entropy[3] for classification) is used to compare the predicted dependent variable against the ground truth dependent variable. This method can be useful as a first pass to assess model quality; however, it has many deficiencies[5]#To Do add more references here#. Since we only split the data once and we are dealing with a classification problem, we must hope for a few things: 1. We don't get a substantially different balance of label classes in training and testing, and that this balance does not differ from the total data set or from the population in question. -2. 
We don't get a concentration of indepedent variables that are caused by a specific exogenous effect[6] in the training data and a different exogenous effect in the testing data. +2. We don't get a concentration of independent variables that are caused by a specific exogenous effect[6] in the training data and a different exogenous effect in the testing data. If either of these conditions fails, then our loss metric may record either a far too optimistic or a far too pessimistic view of how well the model does. This in turn may have consequences for a whole host of things, such as failure to select the correct model: for instance, we may select a logistic regression model[7] when a decision tree model[8] is more appropriate, or we may select the wrong hyperparameters for a given model class. A direct consequence of a bad model is poor inference, which in some cases may have consequences that are difficult or impossible to recognize[9][10][11][12]. Therefore it is of paramount importance that our models be 'honest' and the error well captured. To deal with this failure to generalize from a single training and testing split, cross validation[13] was created to increase the number of training and testing splits and then average the error metric or metrics. The way this works is by creating a number of random partitions of the data, and then treating one of the partitions as out of sample, while the rest are treated as in sample. Then the model is trained on all in sample partitions and the out of sample partition is left for predicting against, just like before. The procedure is repeated for each partition, so that each partition is treated as both training and testing. Finally the recorded metrics across each partition are averaged and reported, as well as the individual loss metrics. The issue with this strategy is that you need to tune the number of partitions: too many and individual partitions won't generalize well, too few and you will run into the same issues as with a single train-test split. 
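The cross validation procedure described above can be sketched with scikit-learn (a minimal illustration; the synthetic dataset and logistic regression classifier are stand-ins for whatever model and data are actually in play):

```python
# Minimal k-fold cross validation sketch: each partition serves once as the
# held-out test set while the remaining partitions form the training set,
# and the per-partition losses are averaged at the end.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=500, random_state=0)

losses = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    # cross entropy on the held-out partition
    losses.append(log_loss(y[test_idx], model.predict_proba(X[test_idx])))

# report the average as well as the individual loss metrics
print(np.mean(losses), losses)
```

Note that `n_splits` is exactly the knob the paragraph above warns about: the sketch fixes it at 5, but there is no universally correct value.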
-In theory, both of the methods described are enough, the issue comes down to what happens in practice. Therefore we have created [honest_ml](https://github.com/EricSchles/honest_ml) a library to do many individual splits of the data, typically on the order of 500 to several thousand. The idea is to iterate over the random seed used in a typical train-test split implementation. For this library, we use scikit-learn's implementation[14], consider the gold standard by many. By doing so we remove the need to consider how many partitions is the right number. Additionally, we far less likely to deal with a lucky or unlucky split, because we are splitting so many times. +In theory, both of the methods described are enough; the issue comes down to what happens in practice. Therefore we have created [honest_ml](https://github.com/EricSchles/honest_ml), a library that performs many individual splits of the data, typically on the order of 500 to several thousand. The idea is to iterate over the random seed used in a typical train-test split implementation. For this library, we use scikit-learn's implementation[14], considered the gold standard by many. By doing so we remove the need to decide how many partitions is the right number. Additionally, we are far less likely to be misled by a lucky or unlucky split, because we split so many times. 
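The underlying idea of iterating over the random seed can be sketched as follows. This is a hand-rolled illustration, not honest_ml's actual API; the dataset, classifier, and seed count are stand-ins (the text above suggests 500 to several thousand seeds in practice):

```python
# Sketch of the many-splits idea: re-run a standard train-test split under
# many different random seeds and keep the full distribution of the loss
# metric, instead of trusting a single (possibly lucky or unlucky) split.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, random_state=0)

losses = []
for seed in range(100):  # in practice, on the order of 500 to several thousand
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    losses.append(log_loss(y_te, model.predict_proba(X_te)))

losses = np.array(losses)
# the spread of the distribution shows how much a single split can mislead
print(losses.mean(), losses.std(), losses.min(), losses.max())
```

The spread between `losses.min()` and `losses.max()` is the point: any one of those values is what a single train-test split would have reported.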
## Honest ML @@ -62,4 +62,4 @@ citation: 13 - [Cross Validation Wikipedia](https://en.wikipedia.org/wiki/Cross-validation_(statistics)) -14 - [Sci-kit learn's train test split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) \ No newline at end of file +14 - [Sci-kit learn's train test split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) diff --git a/paper1/Figure1.jpg b/paper1/Figure1.jpg new file mode 100644 index 0000000..b223192 Binary files /dev/null and b/paper1/Figure1.jpg differ diff --git a/paper1/Figure2.jpg b/paper1/Figure2.jpg new file mode 100644 index 0000000..fe02cb6 Binary files /dev/null and b/paper1/Figure2.jpg differ diff --git a/paper1/Figure3.jpg b/paper1/Figure3.jpg new file mode 100644 index 0000000..312d26f Binary files /dev/null and b/paper1/Figure3.jpg differ diff --git a/paper1/Honest ML paper b/paper1/Honest ML paper new file mode 100644 index 0000000..8b13789 --- /dev/null +++ b/paper1/Honest ML paper @@ -0,0 +1 @@ + diff --git a/paper1/Honest ML.docx b/paper1/Honest ML.docx new file mode 100644 index 0000000..0046ae8 Binary files /dev/null and b/paper1/Honest ML.docx differ diff --git a/paper1/paper.bib b/paper1/paper.bib new file mode 100644 index 0000000..24b316c --- /dev/null +++ b/paper1/paper.bib @@ -0,0 +1,370 @@ +@misc{Andersen:2002, + author = {Andersen, Per Kragh}, + title = {3. Applied Logistic Regression. 2nd edn. David W. Hosmer and Stanley Lemeshow. Wiley, New York, 2000. No. of pages: xii+373. Price: £60.95. 
ISBN 0-471-35632-8}, + publisher = {John Wiley \& Sons, Ltd}, + volume = {21}, + pages = {1963-1964}, + note = {ArticleID:SIM1236}, + ISSN = {0277-6715}, + DOI = {10.1002/sim.1236}, + year = {2002}, + type = {Generic} +} + +@article{Arlot:2010, + author = {Arlot, Sylvain and Celisse, Alain}, + title = {A survey of cross-validation procedures for model selection}, + journal = {Statistics surveys}, + volume = {4}, + number = {none}, + pages = {40-79}, + abstract = {Used to estimate the risk of an estimator or to perform model selection, cross-validation is a widespread strategy because of its simplicity and its apparent universality. Many results exist on the model selection performances of cross-validation procedures. This survey intends to relate these results to the most recent advances of model selection theory, with a particular emphasis on distinguishing empirical statements from rigorous theoretical results. As a conclusion, guidelines are provided for choosing the best cross-validation procedure according to the particular features of the problem in hand.}, + keywords = {Applications +Bayesian analysis +Cross-validation +Leave-one-out +Mathematics +Methodology +Model selection +Other Statistics +Statistics +Statistics Theory}, + ISSN = {1935-7516}, + DOI = {10.1214/09-SS054}, + year = {2010}, + type = {Journal Article} +} + +@book{Bickel:2015, + author = {Bickel, Peter J. 
and Doksum, Kjell A.}, + title = {Mathematical statistics: Basic ideas and selected topics, second edition}, + volume = {1}, + ISBN = {9781498723817}, + DOI = {10.1201/b18312}, + year = {2015}, + type = {Book} +} + +@article{Buitinck:2013, + author = {Buitinck, Lars and Louppe, Gilles and Blondel, Mathieu and Pedregosa, Fabian and Mueller, Andreas and Grisel, Olivier and Niculae, Vlad and Prettenhofer, Peter and Gramfort, Alexandre and Grobler, Jaques}, + title = {API design for machine learning software: experiences from the scikit-learn project}, + journal = {arXiv preprint arXiv:1309.0238}, + year = {2013}, + type = {Journal Article} +} + +@article{Chernozhukov:2022, + author = {Chernozhukov, Victor and Newey, Whitney K. and Singh, Rahul}, + title = {Automatic Debiased Machine Learning of Causal and Structural Effects}, + journal = {Econometrica}, + volume = {90}, + number = {3}, + pages = {967-1027}, + abstract = {Many causal and structural effects depend on regressions. Examples include policy effects, average derivatives, regression decompositions, average treatment effects, causal mediation, and parameters of economic structural models. The regressions may be high‐dimensional, making machine learning useful. Plugging machine learners into identifying equations can lead to poor inference due to bias from regularization and/or model selection. This paper gives automatic debiasing for linear and nonlinear functions of regressions. The debiasing is automatic in using Lasso and the function of interest without the full form of the bias correction. The debiasing can be applied to any regression learner, including neural nets, random forests, Lasso, boosting, and other high‐dimensional methods. In addition to providing the bias correction, we give standard errors that are robust to misspecification, convergence rates for the bias correction, and primitive conditions for asymptotic inference for estimators of a variety of estimators of structural and causal effects. 
The automatic debiased machine learning is used to estimate the average treatment effect on the treated for the NSW job training data and to estimate demand elasticities from Nielsen scanner data while allowing preferences to be correlated with prices and income.}, + keywords = {Analysis +Automatic +Averages +Bias +causal parameters +Convergence +Debiased machine learning +Elasticity of demand +Forests +Lasso +Learning +Machine learning +Mediation +Nonlinear functions +Occupational training +Parameters +Prices +regression effects +Riesz representation +Structural models +structural parameters +Vocational education}, + ISSN = {0012-9682}, + DOI = {10.3982/ECTA18515}, + year = {2022}, + type = {Journal Article} +} + +@article{deVille:2013, + author = {de Ville, Barry}, + title = {Decision trees}, + journal = {Wiley interdisciplinary reviews. Computational statistics}, + volume = {5}, + number = {6}, + pages = {448-455}, + note = {ArticleID:WICS1278}, + abstract = {Decision trees trace their origins to the era of the early development of written records. This history illustrates a major strength of trees: exceptionally interpretable results which have an intuitive tree‐like display which, in turn, enhances understanding and the dissemination of results. The computational origins of decision trees—sometimes called classification trees or regression trees—are models of biological and cognitive processes. This common heritage drives complementary developments of both statistical decision trees and trees designed for machine learning. The unfolding and progressive elucidation of the various features of trees throughout their early history in the late 20th century is discussed along with the important associated reference points and responsible authors. Statistical approaches, such as a hypothesis testing and various resampling approaches, have coevolved along with machine learning implementations. 
This had resulted in exceptionally adaptable decision tree tools, appropriate for various statistical and machine learning tasks, across various levels of measurement, with varying levels of data quality. Trees are robust in the presence of missing data and offer multiple ways of incorporating missing data in the resulting models. Although trees are powerful, they are also flexible and easy to use methods. This assures the production of high quality results that require few assumptions to deploy. The treatment ends with a discussion of the most current developments which continue to rely on the synergies and cross‐fertilization between statistical and machine learning communities. Current developments with the emergence of multiple trees and the various resampling approaches that are employed are discussed. WIREs Comput Stat 2013, 5:448–455. doi: 10.1002/wics.1278 This article is categorized under: Statistical Learning and Exploratory Methods of the Data Sciences > Clustering and Classification Statistical Learning and Exploratory Methods of the Data Sciences > Pattern Recognition Statistical Learning and Exploratory Methods of the Data Sciences > Rule-Based Mining}, + keywords = {Boosting +Computation +Decision trees +Machine learning +Missing data +Origins +Predictive models +Random forests +Resampling +Rule induction +Trees +Wire}, + ISSN = {1939-5108}, + DOI = {10.1002/wics.1278}, + year = {2013}, + type = {Journal Article} +} + +@article{Doan:2022, + author = {Doan, Quoc Hoan and Mai, Sy-Hung and Do, Quang Thang and Thai, Duc-Kien}, + title = {A cluster-based data splitting method for small sample and class imbalance problems in impact damage classification}, + journal = {Applied soft computing}, + volume = {120}, + pages = {108628}, + abstract = {From collected experimental data, a rapid and precise classification model for impact damage modes (IMDs) can be developed using machine learning (ML) techniques to evaluate impact resistant capabilities of 
reinforced concrete (RC) building walls. However, experimental data is often small and imbalanced, resulting in significant degradation and instability in classification performance. In this study, an imbalanced 4-classes dataset consisted of 240 missile impact tests is employed, with the most minor class containing only 10 samples. The paper aims to develop an automated classification model for IDMs, using a clustering-based within-class stratified splitting technique, named WICS, combining with a well-known oversampling technique, namely SMOTE-NC, that considers not only the between-class imbalance but also the within-class distribution to stabilize the classification performance. Four classifiers and five data splitting techniques are developed and implemented to address classification performance. We found that the support vector machine (SVM) classifier using WICS and SMOTE-NC achieves the best micro F1 score (0.821), Cohen’s kappa score (0.700), and AUC value (0.949) with highly stable performance. Friedman and Holm’s post-hoc statistical tests also confirm the outperformance of WICS+SMOTE-NC over other techniques. 
•A new splitting technique is proposed to address the small-imbalanced data problem.•A classification model for impact damages of resistant RC walls is developed.•The developed machine learning (ML) model can rapidly assess the impact damages.•The ML model applying the proposed technique achieves high and stable performance.}, + keywords = {Analysis +Green technology +Imbalanced dataset +Impact damage +Impact loading +Machine learning +Methods +RC walls +School construction +Small dataset}, + ISSN = {1568-4946}, + DOI = {10.1016/j.asoc.2022.108628}, + year = {2022}, + type = {Journal Article} +} + +@book{Edelkamp:2021, + author = {Edelkamp, Stefan and Möller, Ralf and Rueckert, Elmar}, + title = {KI 2021: advances in artificial intelligence : 44th German Conference on AI, virtual event, September 27 - October 1, 2021 : proceedings}, + publisher = {Springer}, + address = {Cham, Switzerland}, + series = {Lecture Notes in Computer Science Ser. ; v.12873}, + note = {Includes bibliographical references and index.}, + keywords = {Optical data processing +Electronic books}, + ISBN = {3-030-87626-8}, + year = {2021}, + type = {Book} +} + +@misc{Gortmaker:1994, + author = {Gortmaker, Steven L.}, + title = {Theory and methods -- Applied Logistic Regression by David W. Hosmer Jr and Stanley Lemeshow}, + publisher = {Sage Publications Ltd}, + volume = {23}, + pages = {159}, + keywords = {Epidemiology +Nonfiction +Statistical methods +Statistics +Variables}, + ISBN = {0094-3061}, + year = {1994}, + type = {Generic} +} + +@book{James:2021, + author = {James, Gareth}, + title = {An introduction to statistical learning : with applications in R}, + publisher = {Springer}, + address = {New York, New York}, + edition = {2nd}, + series = {Springer Texts in Statistics}, + note = {(Gareth Michael)}, + abstract = {Presents an essential statistical learning toolkit for practitioners in science, industry, and other fields. Demonstrates application of the statistical learning methods in R. 
Includes new chapters on deep learning, survival analysis, and multiple testing. Covers a range of topics, such as linear regression, classification, resampling methods, shrinkage approaches, tree-based methods, support vector machines, clustering, and deep learning. Features extensive color graphics for a dynamic learning experience.}, + keywords = {Mathematical statistics +Mathematical models +R (Computer program language) +Electronic books}, + ISBN = {1-0716-1418-5}, + year = {2021}, + type = {Book} +} + +@inproceedings{Kohavi:1995, + author = {Kohavi, Ron}, + title = {A study of cross-validation and bootstrap for accuracy estimation and model selection}, + booktitle = {IJCAI}, + publisher = {Montreal, Canada}, + volume = {14}, + pages = {1137-1145}, + year = {1995}, + type = {Conference Proceedings} + +} + +@book{Kok:2007, + author = {Kok, Joost N.}, + title = {Machine learning : ECML 2007 : 18th European Conference on Machine Learning, Warsaw, Poland, September 17-21, 2007 : proceedings}, + publisher = {Springer}, + address = {Berlin, Germany ;}, + edition = {1st 2007.}, + series = {Lecture notes in computer science. Lecture notes in artificial intelligence ;4701}, + note = {Includes bibliographical references and index.}, + abstract = {The two premier annual European conferences in the areas of machine learning and data mining have been collocated ever since the first joint conference in Freiburg, 2001. The European Conference on Machine Learning (ECML) traces its origins to 1986, when the first European Working Session on Learning was held in Orsay, France. The European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD) was first held in 1997 in Trondheim, Norway. Over the years, the ECML/PKDD series has evolved into one of the largest and most selective international conferences in machine learning and data mining. 
In 2007, the seventh collocated ECML/PKDD took place during September 17–21 on the central campus of Warsaw University and in the nearby Staszic Palace of the Polish Academy of Sciences. The conference for the third time used a hierarchical reviewing process. We nominated 30 Area Chairs, each of them responsible for one sub-field or several closely related research topics. Suitable areas were selected on the basis of the submission statistics for ECML/PKDD 2006 and for last year’s International Conference on Machine Learning (ICML 2006) to ensure a proper load balance among the Area Chairs. A joint Program Committee (PC) was nominated for the two conferences, consisting of some 300 renowned researchers, mostly proposed by the Area Chairs. This joint PC, the largest of the series to date, allowed us to exploit synergies and deal competently with topic overlaps between ECML and PKDD. ECML/PKDD 2007 received 592 abstract submissions. As in previous years, to assist the reviewers and the Area Chairs in their final recommendation, authors had the opportunity to communicate their feedback after the reviewing phase.}, + keywords = {Machine learning +Electronic books}, + ISBN = {3-540-74958-6}, + DOI = {10.1007/978-3-540-74958-5}, + year = {2007}, + type = {Book} +} + +@book{Kuhn:2013, + author = {Kuhn, Max and Johnson, Kjell}, + title = {Applied Predictive Modeling}, + publisher = {Springer New York}, + address = {New York, NY}, + keywords = {Linear Discriminant Analysis +Multivariate Adaptive Regression Spline +Predictive Model +Recursive Partitioning}, + pages = {595}, + ISBN = {9781461468486}, + DOI = {10.1007/978-1-4614-6849-3}, + year = {2013}, + type = {Book} +} + +@article{Marsili:2022, + author = {Marsili, Matteo and Roudi, Yasser}, + title = {Quantifying relevance in learning and inference}, + journal = {Physics reports}, + volume = {963}, + pages = {1-43}, + abstract = {Learning is a distinctive feature of intelligent behaviour. 
High-throughput experimental data and Big Data promise to open new windows on complex systems such as cells, the brain or our societies. Yet, the puzzling success of Artificial Intelligence and Machine Learning shows that we still have a poor conceptual understanding of learning. These applications push statistical inference into uncharted territories where data is high-dimensional and scarce, and prior information on “true” models is scant if not totally absent. Here we review recent progress on understanding learning, based on the notion of “relevance”. The relevance, as we define it here, quantifies the amount of information that a dataset or the internal representation of a learning machine contains on the generative model of the data. This allows us to define maximally informative samples, on one hand, and optimal learning machines on the other. These are ideal limits of samples and of machines, that contain the maximal amount of information about the unknown generative process, at a given resolution (or level of compression). Both ideal limits exhibit critical features in the statistical sense: Maximally informative samples are characterised by a power-law frequency distribution (statistical criticality) and optimal learning machines by an anomalously large susceptibility. The trade-off between resolution (i.e. compression) and relevance distinguishes the regime of noisy representations from that of lossy compression. These are separated by a special point characterised by Zipf’s law statistics. This identifies samples obeying Zipf’s law as the most compressed loss-less representations that are optimal in the sense of maximal relevance. Criticality in optimal learning machines manifests in an exponential degeneracy of energy levels, that leads to unusual thermodynamic properties. This distinctive feature is consistent with the invariance of the classification under coarse graining of the output, which is a desirable property of learning machines. 
This theoretical framework is corroborated by empirical analysis showing (i) how the concept of relevance can be useful to identify relevant variables in high-dimensional inference and (ii) that widely used machine learning architectures approach reasonably well the ideal limit of optimal learning machines, within the limits of the data with which they are trained.}, + keywords = {Analysis +Artificial intelligence +Big Data +Complex systems +Electric power distribution +Energy levels +Frequency distribution +Information theory +Machine learning +Machinery +Magneto-electric machines +Relevance +Representations +Samples +Statistical analysis +Statistical inference +Statistical methods +Thermodynamic properties +Zipf's Law}, + ISSN = {0370-1573}, + DOI = {10.1016/j.physrep.2022.03.001}, + year = {2022}, + type = {Journal Article} +} + +@article{Montgomery:1991, + author = {Montgomery, Douglas C}, + title = {Response surface methods and designs}, + journal = {Design and analysis of experiments}, + year = {1991}, + type = {Journal Article} +} + +@article{Pawluszek-Filipiak:2020, + author = {Pawluszek-Filipiak, Kamila and Borkowski, Andrzej}, + title = {On the Importance of Train–Test Split Ratio of Datasets in Automatic Landslide Detection by Supervised Classification}, + journal = {Remote sensing (Basel, Switzerland)}, + volume = {12}, + number = {18}, + pages = {3054}, + abstract = {Many automatic landslide detection algorithms are based on supervised classification of various remote sensing (RS) data, particularly satellite images and digital elevation models (DEMs) delivered by Light Detection and Ranging (LiDAR). Machine learning methods require the collection of both training and testing data to produce and evaluate the classification results. The collection of good quality landslide ground truths to train classifiers and detect landslides in other regions is a challenge, with a significant impact on classification accuracy. 
Taking this into account, the following research question arises: What is the appropriate training–testing dataset split ratio in supervised classification to effectively detect landslides in a testing area based on DEMs? We investigated this issue for both the pixel-based approach (PBA) and object-based image analysis (OBIA). In both approaches, the random forest (RF) classification was implemented. The experiments were performed in the most landslide-affected area in Poland in the Outer Carpathians-Rożnów Lake vicinity. Based on the accuracy assessment, we found that the training area should be of a similar size to the testing area. We also found that the OBIA approach performs slightly better than PBA when the quantity of training samples is significantly lower than the testing samples. To increase detection performance, the intersection of the OBIA and PBA results together with median filtering and the removal of small elongated objects were performed. This allowed an overall accuracy (OA) = 80% and F1 Score = 0.50 to be achieved. The achieved results are compared and discussed with other landslide detection-related studies.}, + keywords = {automatic landslide detection +OBIA +PBA +random forests +supervised classification}, + ISSN = {2072-4292}, + DOI = {10.3390/rs12183054}, + year = {2020}, + type = {Journal Article} +} + +@article{Pedregosa:2011, + author = {Pedregosa, Fabian and Varoquaux, Gaël and Gramfort, Alexandre and Michel, Vincent and Thirion, Bertrand and Grisel, Olivier and Blondel, Mathieu and Prettenhofer, Peter and Weiss, Ron and Dubourg, Vincent}, + title = {Scikit-learn: Machine learning in Python}, + journal = {the Journal of machine Learning research}, + volume = {12}, + pages = {2825-2830}, + ISSN = {1532-4435}, + year = {2011}, + type = {Journal Article} +} + +@article{Salazar:2022, + author = {Salazar, Jose J. 
and Garland, Lean and Ochoa, Jesus and Pyrcz, Michael J.}, + title = {Fair train-test split in machine learning: Mitigating spatial autocorrelation for improved prediction accuracy}, + journal = {Journal of Petroleum Science and Engineering}, + volume = {209}, + pages = {109885}, + abstract = {Machine learning supports prediction and inference in multivariate and complex datasets where observations are spatially related to one another. Frequently, these datasets depict spatial autocorrelation that violates the assumption of identically and independently distributed data. Overlooking this correlation result in over-optimistic models that fail to account for the geographical configuration of data. Furthermore, although different data split methods account for spatial autocorrelation, these methods are inflexible, and the parameter training and hyperparameter tuning of the machine learning model is set with a different prediction difficulty than the planned real-world use of the model. In other words, it is an unfair training-testing process. We present a novel method that considers spatial autocorrelation and planned real-world use of the spatial prediction model to design a fair train-test split. Demonstrations include two examples of the planned real-world use of the model using a realistic multivariate synthetic dataset and the analysis of 148 wells from an undisclosed Equinor play. First, the workflow applies the semivariogram model of the target to compute the simple kriging variance as a proxy of spatial estimation difficulty based on the spatial data configuration. Second, the workflow employs a modified rejection sampling to generate a test set with similar prediction difficulty as the planned real-world use of the model. Third, we compare 100 test sets' realizations to the model's planned real-world use, using probability distributions and two divergence metrics: the Jensen-Shannon distance and the mean squared error. 
The analysis ranks the spatial fair train-test split method as the only one to replicate the difficulty (i.e., kriging variance) compared to the validation set approach and spatial cross-validation. Moreover, the proposed method outperforms the validation set approach, yielding a minor mean percentage error when predicting a target feature in an undisclosed Equinor play using a random forest model. The resulting outputs are training and test sets ready for model fit and assessment with any machine learning algorithm. Thus, the proposed workflow offers spatial aware sets ready for predictive machine learning problems with similar estimation difficulty as the planned real-world use of the model and compatible with any spatial data analysis task.}, + keywords = {Fairness +Spatial autocorrelation +Train-test split +Kriging +Cross-validation}, + ISSN = {0920-4105}, + DOI = {https://doi.org/10.1016/j.petrol.2021.109885}, + url = {https://www.sciencedirect.com/science/article/pii/S0920410521015023}, + year = {2022}, + type = {Journal Article} +} + +@book{Shalev-Shwartz:2013, + author = {Shalev-Shwartz, Shai and Ben-David, Shai}, + title = {Understanding machine learning: From theory to algorithms}, + volume = {9781107057135}, + ISBN = {9781107057135}, + DOI = {10.1017/CBO9781107298019}, + year = {2013}, + type = {Book} +} + +@article{Tan:2021, + author = {Tan, Jimin and Yang, Jianan and Wu, Sai and Chen, Gang and Zhao, Jake}, + title = {A critical look at the current train/test split in machine learning}, + abstract = {The randomized or cross-validated split of training and testing sets has been adopted as the gold standard of machine learning for decades. The establishment of these split protocols are based on two assumptions: (i)-fixing the dataset to be eternally static so we could evaluate different machine learning algorithms or models; (ii)-there is a complete set of annotated data available to researchers or industrial practitioners. 
However, in this article, we intend to take a closer and critical look at the split protocol itself and point out its weakness and limitation, especially for industrial applications. In many real-world problems, we must acknowledge that there are numerous situations where assumption (ii) does not hold. For instance, for interdisciplinary applications like drug discovery, it often requires real lab experiments to annotate data which poses huge costs in both time and financial considerations. In other words, it can be very difficult or even impossible to satisfy assumption (ii). In this article, we intend to access this problem and reiterate the paradigm of active learning, and investigate its potential on solving problems under unconventional train/test split protocols. We further propose a new adaptive active learning architecture (AAL) which involves an adaptation policy, in comparison with the traditional active learning that only unidirectionally adds data points to the training pool. We primarily justify our points by extensively investigating an interdisciplinary drug-protein binding problem. We additionally evaluate AAL on more conventional machine learning benchmarking datasets like CIFAR-10 to demonstrate the generalizability and efficacy of the new framework.}, + year = {2021}, + type = {Journal Article} +} + +@book{Vittinghoff:2012, + author = {Vittinghoff, Eric and Glidden, David V. and Shiboski, Stephen C. 
and McCulloch, Charles E.},
+  title = {Regression Methods in Biostatistics: Linear, Logistic, Survival, and Repeated Measures Models},
+  publisher = {Springer New York},
+  address = {New York, NY},
+  edition = {2nd 2012.},
+  series = {Statistics for Biology and Health},
+  note = {Includes bibliographical references and index.},
+  abstract = {This new edition provides a unified, in-depth, readable introduction to the multipredictor regression methods most widely used in biostatistics: linear models for continuous outcomes, logistic models for binary outcomes, the Cox model for right-censored survival times, repeated-measures models for longitudinal and hierarchical outcomes, and generalized linear models for counts and other outcomes. Treating these topics together takes advantage of all they have in common. The authors point out the many-shared elements in the methods they present for selecting, estimating, checking, and interpreting each of these models. They also show that these regression methods deal with confounding, mediation, and interaction of causal effects in essentially the same way. The examples, analyzed using Stata, are drawn from the biomedical context but generalize to other areas of application. While a first course in statistics is assumed, a chapter reviewing basic statistical methods is included. Some advanced topics are covered but the presentation remains intuitive. A brief introduction to regression analysis of complex surveys and notes for further reading are provided. For many students and researchers learning to use these methods, this one book may be all they need to conduct and interpret multipredictor regression analyses.
In the second edition, the authors have substantially expanded the core chapters, including new coverage of exact, ordinal, and multinomial logistic models, discrete time and competing risks survival models, within and between effects in longitudinal models, zero-inflated Poisson and negative binomial models, cross-validation for prediction model selection, directed acyclic graphs, and sample size, power and minimum detectable effect calculations; Stata code is also updated. In addition, there are new chapters on methods for strengthening causal inference, including propensity scores, marginal structural models, and instrumental variables, and on methods for handling missing data, using maximum likelihood, multiple imputation, inverse weighting, and pattern mixture models. From the reviews of the first edition: "This book provides a unified introduction to the regression methods listed in the title...The methods are well illustrated by data drawn from medical studies...A real strength of this book is the careful discussion of issues common to all of the multipredictor methods covered." Journal of Biopharmaceutical Statistics, 2005 "This book is not just for biostatisticians. It is, in fact, a very good, and relatively nonmathematical, overview of multipredictor regression models. Although the examples are biologically oriented, they are generally easy to understand and follow...I heartily recommend the book" Technometrics, February 2006 "Overall, the text provides an overview of regression methods that is particularly strong in its breadth of coverage and emphasis on insight in place of mathematical detail. As intended, this well-unified approach should appeal to students who learn conceptually and verbally." 
Journal of the American Statistical Association, March 2006.},
+  keywords = {UmU kursbok
+Statistics
+Public health
+Statistics for Life Sciences, Medicine, Health Sciences
+Epidemiology},
+  ISBN = {1-4614-1353-2},
+  DOI = {10.1007/978-1-4614-1353-0},
+  year = {2012},
+  type = {Book}
+}
+
+@misc{z_ai:2020,
+  author = {z_ai},
+  title = {The Ultimate Guide to Debugging your Machine Learning models},
+  url = {https://towardsdatascience.com/the-ultimate-guide-to-debugging-your-machine-learning-models-103dc0f9e421},
+  month = {8/20/2022},
+  year = {2020},
+  type = {Online Multimedia}
+}
+
diff --git a/paper1/paper.md b/paper1/paper.md
new file mode 100644
index 0000000..c6c2ea7
--- /dev/null
+++ b/paper1/paper.md
@@ -0,0 +1,61 @@
+---
+title: "Honest ML: A library for building confidence in statistical models"
+tags:
+  - Python
+  - Machine Learning
+  - Intervals
+
+authors:
+  - name: Eric Schles
+    orcid: ###
+    equal-contrib: true
+    affiliation: "1, 2"
+    corresponding: true
+  - name: Abdul-Rashid Zakaria
+    orcid: 0000-0002-3694-7082
+    equal-contrib: false
+    affiliation: 3
+affiliations:
+  - name: Johns Hopkins University Hospital, USA
+    index: 1
+  - name: The City University of New York, USA
+    index: 2
+  - name: Michigan Technological University, USA
+    index: 3
+date: 16 September 2022
+bibliography: paper.bib
+---
+
+# Summary
+
+Machine learning metrics are built around the idea of training a model and then making out-of-sample predictions to test generalizability. There are a few standard approaches: splitting the data into a training set and a testing set, then predicting once on the testing (out-of-sample) data; cross-validation, which trains on partitions of the data, tests with one partition as the holdout, and averages the metric across all partitions; and stratified partitioning, which splits the data subject to some condition, usually preserving the proportion of labels in the entire dataset.
This paper will look at a library that implements a different method: training the model on many train-test splits and recording the out-of-sample error across these five hundred to more than a thousand splits. This creates higher confidence in the model and more closely simulates the scenarios likely to be found in a production setting, even with reasonably small datasets. Through this library, users can present statistical models based on confidence intervals that capture the uncertainty in inferences, instead of point statistics, for different machine learning models.
+
+# Introduction
+
+The idea of out-of-sample prediction is described in detail throughout the literature [@Montgomery:1991]. The basic idea is to split the data into two groups, a training sample and a testing sample. Once the data is split, a statistical model is trained on the training sample. Then the trained model is used to predict the dependent variable from the independent variables [@Kuhn:2013; @Pawluszek-Filipiak:2020] in the testing sample. Finally, a loss metric is used to compare the predicted dependent variable against the ground truth dependent variable: mean squared error if it is a regression problem, or cross-entropy [@Bickel:2015; @James:2021] if it is a classification problem. This method can be helpful as a first pass to assess model quality; however, it has many deficiencies [@Doan:2022; @Salazar:2022; @Tan:2021]. Because the data is split only once, for a classification problem issues such as the following may arise:
+
+1. An imbalance in the label classes between the training and testing data, or a class balance that differs from that of the entire data set and of the population being modeled.
+2. A concentration of independent variables caused by one exogenous effect [@Edelkamp:2021] in the training data and a different exogenous effect in the testing data.
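The single-split procedure described above can be sketched with scikit-learn as follows; the synthetic dataset, logistic regression model, split size, and seed are illustrative assumptions, not part of the library:

```python
# A single train-test split evaluation: split, train, predict, score.
# The synthetic dataset, model choice, and random seed are illustrative.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic classification data standing in for a real dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# One split: every conclusion rests on how this single partition falls out.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Cross-entropy (log loss) on the held-out sample, as used for classification.
loss = log_loss(y_test, model.predict_proba(X_test))
print(f"out-of-sample cross-entropy: {loss:.3f}")
```

Note that the recorded loss depends entirely on the one `random_state` chosen for the split, which is precisely the weakness discussed next.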
+
+If either of these conditions occurs, our loss metric may record a far too optimistic or pessimistic view of how well the model performs. This, in turn, may have consequences for a whole host of things, such as failure to select the correct model: for instance, we may choose a logistic regression model [@Gortmaker:1994; @Vittinghoff:2012] when a decision tree model [@de Ville:2013; @Shalev-Shwartz:2013] is more appropriate, or we may select the wrong hyperparameters for a given model class. A direct consequence of a flawed model is poor inference, which may have consequences that are difficult or impossible to recognize [@Chernozhukov:2022; @Kok:2007; @Marsili:2022; @z\_ai:2020]. Therefore, it is of paramount importance that our models be 'honest' and the error well captured.
+
+To deal with this failure to generalize from a single training and testing split, cross-validation [@Arlot:2010; @Kohavi:1995] was created to increase the number of training and testing splits and then average the error metric or metrics. This works by creating several random partitions of the data and treating one partition as out-of-sample while the rest are treated as in-sample. A model is trained on the in-sample partitions, and the out-of-sample partition is held out for prediction. The procedure is repeated so that each partition serves once as the out-of-sample set. The difficulty with this strategy is choosing the number of partitions: with too many, individual partitions may not generalize well; with too few, the same problems arise as with a single train-test split.
+
+In theory, the methods described are sound; the issues raised come down to how models are viewed and interpreted in practice. Therefore, [honest\_ml](https://github.com/EricSchles/honest_ml) is a library for performing many individual data splits, typically on the order of 500 to several thousand. The idea is to iterate over the random seed used in a typical train-test split implementation.
For this library, we use scikit-learn's train-test split implementation [@Buitinck:2013; @Pedregosa:2011], considered the gold standard by machine learning engineers. Doing so removes the need to consider how many partitions are required for a particular dataset, and it further decreases the possibility of a "lucky" or "unlucky" split. In addition, this approach helps identify how sensitive a trained model with specific hyperparameters is to the data used to train it.
+
+# Utilization
+
+[honest\_ml](https://github.com/EricSchles/honest_ml) has an EvaluateModel class that allows users to pass in their classifier of choice, a target data set, a feature data set, and a number of trials, where the data split in each trial uses a different random seed. The relevant performance metrics are calculated for each train-test split. For example, in \autoref{fig:Figure 1}, users create an EvaluateModel object; the performance metrics for each trial are saved after fitting the model.
+
+![Using the EvaluateModel class in honest_ml.\label{fig:Figure 1}](https://github.com/ZachJon1/honest_ml/blob/main/paper1/Figure1.jpg)
+
+The [honest\_ml](https://github.com/EricSchles/honest_ml) library also has a visualization tool that allows users to view the results of each trial relative to the other trials stored by the EvaluateModel class in a user-defined variable.
+
+For example, using the model\_instances created above for the logistic regression model, users can compare metrics such as precision, recall, and f1-score for classification models. Figure 2 and Figure 3 show the distributions of precision and recall for 200 trials of the logistic regression model with two classes, 0 and 1. Distributions that are further from normal indicate sensitivity of the model to the training data, and they give users a more realistic expectation of the model in production than a point statistic would.
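The seed-iteration idea can be approximated with plain scikit-learn; the following is a hypothetical sketch for illustration only, not the honest\_ml EvaluateModel API, and the dataset, model, metric, and trial count are assumptions:

```python
# Hypothetical sketch of the seed-iteration idea using plain scikit-learn;
# this is NOT the honest_ml EvaluateModel API, only an illustration of it.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Illustrative synthetic dataset and model.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

recalls = []
for seed in range(200):  # one trial per random seed
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    recalls.append(recall_score(y_te, model.predict(X_te)))

recalls = np.array(recalls)
# Report the distribution of the metric rather than a single point statistic.
print(f"recall over 200 trials: mean={recalls.mean():.3f}, "
      f"std={recalls.std():.3f}, min={recalls.min():.3f}, max={recalls.max():.3f}")
```

The spread between the minimum and maximum recall is exactly what a single train-test split hides, and it is the quantity the library's per-trial metrics and visualizations expose.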
+
+![Distributions of precision and recall across trials for class 0.\label{fig:Figure 2}](https://github.com/ZachJon1/honest_ml/blob/main/paper1/Figure2.jpg)
+
+![Sensitivity of class 1 across trials, shown by the recall and precision distributions.\label{fig:Figure 3}](https://github.com/ZachJon1/honest_ml/blob/main/paper1/Figure3.jpg)
+
+# References
+