If this looks interesting, I recommend first looking at the docs directory. There you will find a set of slides describing this project in brief. There is also a report detailing the work and results.
Kudos to DrivenData.com for hosting the data and competition and ERS for the nice problem.
The docs directory also contains a tutorial notebook describing (with hands-on examples) different multi-target classification schemes and attempting to address some of the confusing terminology in this space. Enjoy.
M.
-
data/ :: training data and holdout set
-
dc_course :: Original version of the tutorial on the DrivenData competition.
-
docs/ :: Project report, presentation, proposal, etc.
-
EDA_ddbpfe :: Exploratory Data Analysis of Box Plots for Education data.
-
ensemble :: Make an ensemble from two submissions: the best logistic regression so far and a default random forest.
-
feature_importances :: A look at feature importance - what are the most predictive tokens?
-
first_models :: Like dc_course, but slimmed down for faster reading.
-
first_models_metrics :: Like the above, but this version fits the models and saves the probability predictions and y values for train and test to disk.
-
fmm_out :: Saved probability predictions from first_models_metrics. Used by fm_standard_metrics.
-
fm_standard_metrics :: Calculate F1, accuracy, log loss, and ROC AUC separately for each target and each DrivenData model. Summarize.
-
flat_to_labels :: Shows the development of flat_to_labels, a utility that turns raw probability output from OneVsRest into properly normalized probabilities (for log loss, etc.) and label outputs (for accuracy, F1, confusion matrix, etc.).
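The core idea is that OneVsRest produces one independent per-class probability per column, so a target's block of columns need not sum to 1 across a row; renormalizing fixes that for log loss, and an argmax recovers labels. A minimal sketch of that step (function name and toy data are illustrative, not the project's exact code):

```python
import numpy as np

def normalize_block(raw_probs):
    """Renormalize one target's block of OvR probabilities row-wise."""
    row_sums = raw_probs.sum(axis=1, keepdims=True)
    return raw_probs / row_sums

# Raw OvR output for one target with 3 classes; rows do not sum to 1.
raw = np.array([[0.9, 0.4, 0.1],
                [0.2, 0.2, 0.2]])
probs = normalize_block(raw)
print(probs.sum(axis=1))       # each row now sums to 1.0
labels = probs.argmax(axis=1)  # label output for accuracy, F1, etc.
print(labels)                  # [0 0]
```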
-
mod_200/ :: Mod4 with my feature interaction scheme on 200 best features and regularization.
-
mod_400/ :: Mod4 with my feature interaction scheme on 400 best features and regularization.
-
mod_1000/ :: Mod4 with my feature interaction scheme on 1000 best features and regularization - these take a day to run.
-
mod3_1/ :: Best model with various tweaks and regularization (and bug fixes/workarounds).
-
mod0_multiple_ways :: Explores two ways to use classifiers for multi-target, multi-class problems:
- One-hot encode the targets, then use sklearn's multilabel support to drive 104 binary classifiers over this input.
- Use logistic regression in multiclass fashion directly on the unencoded targets, with nine classifiers (one per target).
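A hedged sketch of the two schemes on toy data (the 104/9 counts are replaced by tiny stand-ins, and the classifier choices are illustrative, not the notebook's exact code):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.RandomState(0)
X = rng.rand(30, 4)
# Two string-valued targets, each with its own label set (3 and 2 labels).
y = np.column_stack([np.tile(["a", "b", "c"], 10),
                     np.tile(["x", "y"], 15)])

# Scheme 1: one-hot the targets into a binary indicator matrix, then let
# OneVsRestClassifier fit one binary classifier per indicator column.
Y_cols = []
for k in range(y.shape[1]):
    for lab in np.unique(y[:, k]):
        Y_cols.append((y[:, k] == lab).astype(int))
Y_ind = np.column_stack(Y_cols)                      # shape (30, 5)
ovr = OneVsRestClassifier(LogisticRegression()).fit(X, Y_ind)
print(len(ovr.estimators_), "binary classifiers")    # 5 here; 104 in the project

# Scheme 2: one multiclass classifier per target column, no encoding.
per_target = [LogisticRegression().fit(X, y[:, k]) for k in range(y.shape[1])]
print(len(per_target), "multiclass classifiers")     # 2 here; 9 in the project
```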
-
mod_04_99 :: Mod4 (all tutorial features), text data only, 1600 features. A final attempt to get something out of the DrivenData feature interaction scheme (nothing there).
-
model_deltas :: Spell out each change in the DrivenData models. A preliminary to the one_at_a_time notebooks.
-
model_out/ :: Place to save model output.
-
model_store/ :: Place to save fitted models.
-
multiclass_classifiers_examples / multioutput_classifiers_examples :: These two notebooks explore classifier probability outputs; they differ in how the targets are represented.
In multiclass, the target is an np.ndarray of shape (n_samples, 1) containing L different labels, and the probability output is an np.ndarray of shape (n_samples, L). In multioutput, the target is an np.ndarray of shape (n_samples, n_outputs), each output with its own label set, and the probability output is a list of arrays of shape (n_samples, n_labels), one element per output.
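The shape difference can be seen in a few lines with synthetic data (classifier and sizes are illustrative only):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.RandomState(0)
X = rng.rand(20, 5)

# Multiclass: a single target column with L=3 labels.
y_multiclass = np.arange(20) % 3
clf = LogisticRegression().fit(X, y_multiclass)
proba = clf.predict_proba(X)
print(proba.shape)  # (n_samples, L) -> (20, 3)

# Multioutput: n_outputs=2 target columns, each with its own labels.
y_multioutput = np.column_stack([y_multiclass, (y_multiclass + 1) % 3])
multi = MultiOutputClassifier(LogisticRegression()).fit(X, y_multioutput)
proba_list = multi.predict_proba(X)
print(len(proba_list), proba_list[0].shape)  # 2 arrays, each (20, 3)
```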
-
my_pipe :: New feature interaction scheme: get just the interactions of the best features, then combine with all original features. Development.
-
one_at_a_time_part_1.ipynb :: Go through the first 2 models adding one feature engineering change at a time and see how the change impacts log loss and standard metrics.
-
one_at_a_time_part_2.ipynb :: Continue adding one change at a time.
-
one_at_a_time_p1_hv :: HashingVectorizer experiments with part 1 models.
-
one_at_a_time_p2_scale :: Scaling experiments with part 2 models.
-
one_at_a_time_mod4 :: Mod4, incremental changes.
-
rf/ :: Experiments with RandomForestClassifier
-
python/ :: Utility code
- flat_to_labels - take either probabilities or y values (as n_samples by 104) and transform them to 9 columns of string labels (like the original data)
- multilabel - split a matrix, ensuring all splits have at least n of every label
- plot_confusion_matrix - make a nice picture of a confusion matrix
- sparse_interactions - compute pairwise feature products (the cartesian product of columns) of a (sparse) feature matrix
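For sparse_interactions, a stand-in sketch of the idea (not the project's actual implementation): take each pair of columns, multiply them element-wise, and stack the results while staying sparse.

```python
import numpy as np
from scipy import sparse

def sparse_interactions(X):
    """Return column-wise products X[:, i] * X[:, j] for i <= j, kept sparse."""
    X = sparse.csc_matrix(X)
    cols = []
    for i in range(X.shape[1]):
        for j in range(i, X.shape[1]):
            # Element-wise product of two sparse columns stays sparse.
            cols.append(X[:, i].multiply(X[:, j]))
    return sparse.hstack(cols).tocsc()

X = sparse.csc_matrix(np.array([[1.0, 2.0],
                                [0.0, 3.0]]))
XI = sparse_interactions(X)
print(XI.shape)      # (2, 3): columns x0*x0, x0*x1, x1*x1
print(XI.toarray())
```

This quadratic expansion is why the mod_200/mod_400/mod_1000 runs above restrict interactions to the best-scoring features first.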