If this looks interesting, I recommend first looking at the docs directory. There you will find a set of slides describing this project in brief. There is also a report detailing the work and results.
Kudos to DrivenData.com for hosting the data and competition and ERS for the nice problem.
The docs directory also contains a tutorial notebook describing (with hands-on examples) different multi-target classification schemes and attempting to address some of the confusing terminology in this space. Enjoy.
M.
-
data/ :: training data and holdout set
-
dc_course :: Original version of the tutorial on the DrivenData competition.
-
docs/ :: Project report, presentation, proposal, etc.
-
EDA_ddbpfe :: Exploratory Data Analysis of Box Plots for Education data.
-
ensemble :: Make an ensemble from two submissions: the best logistic regression so far and a default random forest.
-
feature_importances :: A look at feature importance - what are the most predictive tokens?
-
first_models :: Like dc_course, but slimmed down for faster reading.
-
first_models_metrics :: Like the above, but this version fits the models and saves the probability predictions and y values for train and test to disk.
-
fmm_out :: Saved probability predictions from first_models_metrics. Used by fm_standard_metrics.
-
fm_standard_metrics :: Calculate F1, accuracy, log loss, and ROC AUC separately for each target and each DrivenData model. Summarize.
-
flat_to_labels :: Shows the development of flat_to_labels, a utility that turns raw probability output from OneVsRest into properly normalized probabilities (for log loss, etc.) and label outputs (for accuracy, F1, confusion matrix, etc.).
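The core idea is that OneVsRest produces one independent per-class probability per column, so a target's block of columns need not sum to 1 across a row; renormalizing fixes that for log loss, and an argmax recovers labels. A minimal sketch of that step (function name and toy data are illustrative, not the project's exact code):

```python
import numpy as np

def normalize_block(raw_probs):
    """Renormalize one target's block of OvR probabilities row-wise."""
    row_sums = raw_probs.sum(axis=1, keepdims=True)
    return raw_probs / row_sums

# Raw OvR output for one target with 3 classes; rows do not sum to 1.
raw = np.array([[0.9, 0.4, 0.1],
                [0.2, 0.2, 0.2]])
probs = normalize_block(raw)
print(probs.sum(axis=1))       # each row now sums to 1.0
labels = probs.argmax(axis=1)  # label output for accuracy, F1, etc.
print(labels)                  # [0 0]
```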
-
mod_200/ :: Mod4 with my feature interaction scheme on 200 best features and regularization.
-
mod_400/ :: Mod4 with my feature interaction scheme on 400 best features and regularization.
-
mod_1000/ :: Mod4 with my feature interaction scheme on 1000 best features and regularization - these take a day to run.
-
mod3_1/ :: Best model with various tweaks and regularization (and bug fixes/workarounds).
-
mod0_multiple_ways :: Explores two ways to use classifiers for multi-target, multi-class problems:
- One-hot encode the targets, then use sklearn's multilabel support to drive 104 binary classifiers over this input.
- Use logistic regression in multiclass fashion directly on the unencoded targets, with nine classifiers (one per target).
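A hedged sketch of the two schemes on toy data (the 104/9 counts are replaced by tiny stand-ins, and the classifier choices are illustrative, not the notebook's exact code):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.RandomState(0)
X = rng.rand(30, 4)
# Two string-valued targets, each with its own label set (3 and 2 labels).
y = np.column_stack([np.tile(["a", "b", "c"], 10),
                     np.tile(["x", "y"], 15)])

# Scheme 1: one-hot the targets into a binary indicator matrix, then let
# OneVsRestClassifier fit one binary classifier per indicator column.
Y_cols = []
for k in range(y.shape[1]):
    for lab in np.unique(y[:, k]):
        Y_cols.append((y[:, k] == lab).astype(int))
Y_ind = np.column_stack(Y_cols)                      # shape (30, 5)
ovr = OneVsRestClassifier(LogisticRegression()).fit(X, Y_ind)
print(len(ovr.estimators_), "binary classifiers")    # 5 here; 104 in the project

# Scheme 2: one multiclass classifier per target column, no encoding.
per_target = [LogisticRegression().fit(X, y[:, k]) for k in range(y.shape[1])]
print(len(per_target), "multiclass classifiers")     # 2 here; 9 in the project
```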
-
mod_04_99 :: Mod4 (all tutorial features), text data only, 1600 features. A final attempt to get something out of the DrivenData feature interaction scheme (nothing there).
-
model_deltas :: Spell out each change in the DrivenData models. A preliminary to the one_at_a_time notebooks.
-
model_out/ :: Place to save model output.
-
model_store/ :: Place to save fitted models.
-
multiclass_classifiers_examples / multioutput_classifiers_examples :: These two notebooks explore classifier probability outputs; they differ in how the targets are represented.
In multiclass, the target is an np.ndarray of shape (n_samples, 1) containing L different labels, and the probability output is an np.ndarray of shape (n_samples, L). In multioutput, the target is an np.ndarray of shape (n_samples, n_outputs), each output with its own label set, and the probability output is a list of arrays of shape (n_samples, n_labels), one element per output.
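The shape difference can be seen in a few lines with synthetic data (classifier and sizes are illustrative only):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.RandomState(0)
X = rng.rand(20, 5)

# Multiclass: a single target column with L=3 labels.
y_multiclass = np.arange(20) % 3
clf = LogisticRegression().fit(X, y_multiclass)
proba = clf.predict_proba(X)
print(proba.shape)  # (n_samples, L) -> (20, 3)

# Multioutput: n_outputs=2 target columns, each with its own labels.
y_multioutput = np.column_stack([y_multiclass, (y_multiclass + 1) % 3])
multi = MultiOutputClassifier(LogisticRegression()).fit(X, y_multioutput)
proba_list = multi.predict_proba(X)
print(len(proba_list), proba_list[0].shape)  # 2 arrays, each (20, 3)
```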
-
my_pipe :: New feature interaction scheme: get just the interactions of the best features, then combine with all original features. Development.
-
one_at_a_time_part_1.ipynb :: Go through the first 2 models adding one feature engineering change at a time and see how the change impacts log loss and standard metrics.
-
one_at_a_time_part_2.ipynb :: Continue adding one change at a time.
-
one_at_a_time_p1_hv :: HashingVectorizer experiments with part 1 models.
-
one_at_a_time_p2_scale :: Scaling experiments with part 2 models.
-
one_at_a_time_mod4 :: Mod4, incremental changes.
-
rf/ :: Experiments with RandomForestClassifier
-
python/ :: Utility code
- flat_to_labels - take either probabilities or y values (as n_samples by 104) and transform them to 9 columns of string labels (like the original data)
- multilabel - split a matrix, ensuring all splits have at least n of every label
- plot_confusion_matrix - make a nice picture of a confusion matrix
- sparse_interactions - compute pairwise feature products (the cartesian product of columns) of a (sparse) feature matrix
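For sparse_interactions, a stand-in sketch of the idea (not the project's actual implementation): take each pair of columns, multiply them element-wise, and stack the results while staying sparse.

```python
import numpy as np
from scipy import sparse

def sparse_interactions(X):
    """Return column-wise products X[:, i] * X[:, j] for i <= j, kept sparse."""
    X = sparse.csc_matrix(X)
    cols = []
    for i in range(X.shape[1]):
        for j in range(i, X.shape[1]):
            # Element-wise product of two sparse columns stays sparse.
            cols.append(X[:, i].multiply(X[:, j]))
    return sparse.hstack(cols).tocsc()

X = sparse.csc_matrix(np.array([[1.0, 2.0],
                                [0.0, 3.0]]))
XI = sparse_interactions(X)
print(XI.shape)      # (2, 3): columns x0*x0, x0*x1, x1*x1
print(XI.toarray())
```

This quadratic expansion is why the mod_200/mod_400/mod_1000 runs above restrict interactions to the best-scoring features first.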