Convenience library for cross-validated grid-search optimization of scikit-learn classifiers.
This is a convenience library for use with scikit-learn supervised learning algorithms (kNN, Logistic Regression, etc.). It provides methods that automatically conduct a cross-validated grid search for hyperparameter tuning, so you don't need to code the loops every time.
It also includes a ModelSelector, which runs comparisons across all models, returning runtimes and accuracy scores for each.
Assume the user splits their dataset into training and test sets: X_tr/Y_tr and X_te/Y_te, respectively.
logreg = LogRegress()
Y_te_pred = logreg.fit(X_tr, Y_tr).predict(X_tr, Y_tr, X_te)
We could then compare the accuracy of Y_te_pred to the known Y_te. Alternatively, we can do it automatically:
logreg = LogRegress()
accu = logreg.score(X_tr, Y_tr, X_te, Y_te) # compares Y_te (predicted) to Y_te (actual) automatically
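The manual comparison mentioned above can be sketched with scikit-learn's accuracy_score (the label arrays here are made-up illustrations, not output of this library):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted test labels
Y_te = np.array([0, 1, 1, 0, 1])
Y_te_pred = np.array([0, 1, 0, 0, 1])

# Fraction of predictions that match the known labels
accu = accuracy_score(Y_te, Y_te_pred)  # 4 of 5 correct -> 0.8
```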
Using the full known dataset X/Y and an unknown dataset X_un, just call
logreg = LogRegress()
Y_un_pred = logreg.fit(X, Y).predict(X, Y, X_un)
The ModelSelector provides an easy way to compare preliminary (though optimized) results on different models.
Calling
model_sel = ModelSelector()
model_sel.fit(X_tr, Y_tr, X_te, Y_te)
will return several important elements: the (ranked) accuracy scores of the different models and the computation times each required. These can be accessed via the pandas DataFrame
model_sel.summary_df
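A summary of this shape can be sorted and inspected like any pandas DataFrame. As a rough sketch (the column names and values below are assumptions for illustration, not the library's actual schema):

```python
import pandas as pd

# Hypothetical results table mimicking a model-comparison summary
summary_df = pd.DataFrame({
    "model": ["LogRegress", "kNN"],
    "accuracy": [0.88, 0.91],
    "runtime_s": [0.2, 0.4],
})

# Rank models from best to worst accuracy
ranked = summary_df.sort_values("accuracy", ascending=False)
```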
The best model (instance) is also returned, with the best parameters it found in training.
best_mod = model_sel.models[model_sel.best_model]
Note that running
Y_un_pred = best_mod.fit(X,Y).predict(X,Y,X_un)
or
best_mod.best_params = None # this line would reset the params, and fit() would re-optimize for the full set.
Y_un_pred = best_mod.fit(X,Y).predict(X,Y,X_un)
will yield different results. In the latter case, the prediction uses parameters optimized over all of X, Y; in the former, it uses parameters optimized over X_tr, Y_tr. The latter requires recomputing the grid search but tunes on more of the data.
By default ModelSelector will check every model. We can pass a list of model names (strings) to either check or ignore.
To only run and compare kNN and LogRegress models:
model_sel = ModelSelector(check=['kNN','LogRegress'])
To check every model except kNN and LogRegress:
model_sel = ModelSelector(ignore=['kNN','LogRegress'])
For now, we support the following model names. The equivalent scikit-learn models can be found in the import section of the source code.
['GMM', 'LogRegress', 'DecTree', 'RandForest', 'SupportVC', 'kNN', 'BGMM', 'GaussNB', 'MultiNB']
Note that GMM and BGMM are kernel-density-style classifiers that are not in scikit-learn. They fit a GaussianMixture or BayesianGaussianMixture to each label and classify an unknown point by the probability that it was sampled from each label's distribution.
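That classification scheme can be sketched with scikit-learn directly. This is an illustration of the technique on assumed toy data, not this library's implementation; class priors are ignored here, which is reasonable only for balanced classes:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Toy two-class dataset with well-separated clusters
X, y = make_blobs(n_samples=200, centers=2, random_state=0)

# Fit one GaussianMixture per class label
mixtures = {label: GaussianMixture(n_components=1, random_state=0).fit(X[y == label])
            for label in np.unique(y)}

def predict(X_new):
    # Pick, for each point, the class whose mixture assigns the highest log-likelihood
    labels = np.array(sorted(mixtures))
    log_likes = np.column_stack([mixtures[label].score_samples(X_new) for label in labels])
    return labels[np.argmax(log_likes, axis=1)]

preds = predict(X)
```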