Multi-class multi-output classification of school budget line items


leonkato/budgets

Repo contents

If this looks interesting, I recommend first looking at the docs directory. There you will find a set of slides describing this project in brief. There is also a report detailing the work and results.

Kudos to DrivenData.com for hosting the data and competition, and to ERS for the nice problem.

The docs directory also contains a tutorial notebook describing (with hands-on examples) different multi-target classification schemes and attempting to address some of the confusing terminology in this space. Enjoy.

M.

Notebooks, files and directories

  • data/ :: training data and holdout set

  • dc_course :: original version of the tutorial on the DrivenData competition.

  • docs/ :: Project report, presentation, proposal, etc.

  • EDA_ddbpfe :: Exploratory Data Analysis of Box Plots for Education data.

  • ensemble :: Make an ensemble from 2 submissions, best log reg so far and default RF.

  • feature_importances :: A look at feature importance - what are the most predictive tokens?

  • first_models :: Like dc_course, but slimmed down for faster reading.

  • first_models_metrics :: Like the above but this version fits the models and saves the probability predictions and y values for train and test to disk.

  • fmm_out :: Saved probability predictions from first_models_metrics. Used by fm_standard_metrics.

  • fm_standard_metrics :: Calculate F1, accuracy, log loss, and ROC_AUC for all targets separately for each DD model separately. Summarize.

  • flat_to_labels :: Shows the development of flat_to_labels, a utility to turn raw probability output from OneVsRest into properly normalized probabilities (for log loss, etc.) and label outputs (for accuracy, F1, confusion matrix, etc).
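The renormalization step flat_to_labels performs can be sketched as follows. This is my own minimal illustration, not the repo's actual code: the function name, signature, and toy group sizes here are assumptions, and the real utility also maps columns back to string labels.

```python
import numpy as np

def flat_to_normalized(probs, group_sizes):
    """Split a flat (n_samples, sum(group_sizes)) probability matrix into
    per-target blocks and renormalize each block so its rows sum to 1."""
    blocks, start = [], 0
    for size in group_sizes:
        block = probs[:, start:start + size]
        block = block / block.sum(axis=1, keepdims=True)  # renormalize rows
        blocks.append(block)
        start += size
    return blocks

# Toy example: two targets with 2 and 3 labels (5 flat columns total).
raw = np.array([[0.9, 0.3, 0.2, 0.2, 0.6]])
norm = flat_to_normalized(raw, [2, 3])
```

With per-target rows summing to 1, the blocks can feed log loss directly, and argmax over each block recovers a label per target.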

  • mod_200/ :: Mod4 with my feature interaction scheme on 200 best features and regularization.

  • mod_400/ :: Mod4 with my feature interaction scheme on 400 best features and regularization.

  • mod_1000/ :: Mod4 with my feature interaction scheme on 1000 best features and regularization - these take a day to run.

  • mod3_1/ :: Best model with various tweaks and regularization (and bug fixes/workarounds)

  • mod0_multiple_ways :: Explores 2 ways to use classifiers for multi-target, multi-class:

    1. One-hot encode targets; use sklearn's multilabel support to drive 104 binary classifiers across this input.
    2. Use logistic regression in a multiclass fashion directly on the input (unencoded). Uses 9 different classifiers (one for each target).
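The two schemes can be contrasted on toy data like this. A hedged sketch only: the data, label sets, and shapes here are made up for illustration, and the real notebooks use 9 targets with 104 labels total rather than 2 targets with 5.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

rng = np.random.RandomState(0)
X = rng.rand(60, 5)
# Two targets, each with its own label set ('a'/'b' and 'x'/'y'/'z').
y = np.column_stack([rng.choice(['a', 'b'], 60), rng.choice(['x', 'y', 'z'], 60)])

# Scheme 1: one-hot encode the combined targets, then drive one binary
# classifier per label column (5 here, 104 in the real problem).
mlb = MultiLabelBinarizer()
Y_hot = mlb.fit_transform([set(row) for row in y])  # (60, 5) indicator matrix
ovr = OneVsRestClassifier(LogisticRegression()).fit(X, Y_hot)
flat_probs = ovr.predict_proba(X)  # (60, 5): one column per label

# Scheme 2: one multiclass classifier per target, on the unencoded labels.
per_target = [LogisticRegression().fit(X, y[:, j]) for j in range(y.shape[1])]
probs = [clf.predict_proba(X) for clf in per_target]  # list of (60, k_j) arrays
```

Scheme 1 yields one flat probability matrix whose rows need per-target renormalization; scheme 2 yields properly normalized distributions per target out of the box.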
  • mod_04_99 :: Mod4 (all tutorial features), text data only, 1600 features. A final attempt to get something out of the DrivenData feature interaction scheme (nothing there).

  • model_deltas :: Spell out each change in the DrivenData models. Preliminary to the one_at_a_time notebooks.

  • model_out/ :: place to save output

  • model_store/ :: place to save models

  • multiclass_classifiers_examples/multioutput_classifiers_examples notebooks :: These two files explore classifier probability outputs. They differ in the way the targets are represented.

    In multiclass, the target is an ndarray of shape (n_samples, 1) containing L distinct labels, and the probability output is an ndarray of shape (n_samples, L). In multioutput, the target is an ndarray of shape (n_samples, n_outputs), each output with its own label set, and the probability output is a list of (n_samples, n_labels) arrays, one element per output.
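The two output shapes can be checked directly with scikit-learn. This is a minimal sketch with made-up data, not taken from the notebooks:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.RandomState(1)
X = rng.rand(50, 4)

# Multiclass: one target column with L = 3 labels.
# predict_proba returns a single (n_samples, L) array.
y_mc = rng.choice(['a', 'b', 'c'], 50)
p_mc = LogisticRegression().fit(X, y_mc).predict_proba(X)

# Multioutput: n_outputs = 2 target columns, each with its own labels.
# predict_proba returns a list of (n_samples, n_labels) arrays, one per output.
Y_mo = np.column_stack([rng.choice(['a', 'b', 'c'], 50),
                        rng.choice(['x', 'y'], 50)])
p_mo = MultiOutputClassifier(LogisticRegression()).fit(X, Y_mo).predict_proba(X)
```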

  • my_pipe :: New feature interaction scheme: get just the interactions of the best features then combine with all original features. Development.

  • one_at_a_time_part_1.ipynb :: Go through the first 2 models adding one feature engineering change at a time and see how the change impacts log loss and standard metrics.

  • one_at_a_time_part_2.ipynb :: Continue adding one change at a time.

  • one_at_a_time_p1_hv :: HashingVectorizer experiments with part 1 models.

  • one_at_a_time_p2_scale :: Scaling experiments with part 2 models.

  • one_at_a_time_mod4 :: Mod4, incremental changes.

  • rf/ :: Experiments with RandomForestClassifier

  • python/ :: Utility code

    • flat_to_labels - take either probabilities or y values (as n_samples by 104) and transform to 9 columns of string labels (like original data)
    • multilabel - split matrix ensuring all splits have at least n of every label
    • plot_confusion_matrix - make nice picture of cm
    • sparse_interactions - compute pairwise interaction features (cartesian product of columns) for a sparse feature matrix
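The sparse_interactions idea can be sketched roughly as below. This is my own illustration of pairwise column products on a sparse matrix, not the repo's implementation; the function name and exact column ordering are assumptions.

```python
import numpy as np
from scipy import sparse

def sparse_interactions(X):
    """Append all pairwise column products (i <= j) to a sparse matrix,
    keeping the original columns first."""
    X = sparse.csc_matrix(X)
    cols = [X]
    n = X.shape[1]
    for i in range(n):
        for j in range(i, n):
            # Element-wise product of two columns stays sparse.
            cols.append(X[:, i].multiply(X[:, j]))
    return sparse.hstack(cols).tocsr()

X = sparse.csr_matrix(np.array([[1.0, 2.0], [0.0, 3.0]]))
Xi = sparse_interactions(X)  # 2 original + 3 interaction columns = 5
```

Working column-by-column in sparse format avoids densifying the (huge) text feature matrix, which is the point of doing this instead of sklearn's dense PolynomialFeatures.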
