Training & testing #12

@ssobel

Description

Hello Laurae, thanks for your earlier response to my question about emulating daForest.

This time I have a question related to validation_data=NULL #6, in that I want to make sure I understand how to do training & testing properly to avoid overfitting. I ran CascadeForest and got excellent results on the training data and on held-out validation data (where I knew the labels), but when I applied the model to test data (exclusive of my train & validation data, where I did not know the labels but the contest website gave me my score), the model did not perform nearly as well. So, I believe I am overfitting.

Basically, I trained CascadeForest using d_train & d_valid like this:

CascadeForest(training_data     = d_train,
              validation_data   = d_valid,
              training_labels   = labels_train,
              validation_labels = labels_valid, ...)

Where: d_train & labels_train = predictor columns & known labels (65% of my total training data)
d_valid & labels_valid = predictor columns & known labels, exclusive of d_train (the other 35% of my total training data).
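
For concreteness, this is roughly how I built that split (a minimal sketch in base R; the exact code I used may have differed slightly):

set.seed(42)
n   <- nrow(d_alltrain)                            # d_alltrain = my full training data, see below
idx <- sample(seq_len(n), size = floor(0.65 * n))  # random 65% of rows for training
d_train      <- d_alltrain[idx, ]
labels_train <- labels_alltrain[idx]
d_valid      <- d_alltrain[-idx, ]                 # remaining 35% for validation
labels_valid <- labels_alltrain[-idx]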

My AUC was about 0.96 both when I predicted on d_train and when I predicted on d_valid. That made me happy, so I then applied the predict function to d_test, which is exclusive of d_train and d_valid and where I don't know the true labels. But when I submitted my predictions to the contest website, I got an AUC of 0.75, not nearly as good as 0.96.
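
For reference, this is roughly how I computed those AUCs (a sketch only: I am assuming a predict() method on the fitted model, called 'model' here, and using the pROC package for AUC; the actual CascadeForest prediction interface may differ):

library(pROC)

pred_train <- predict(model, d_train)       # assumed interface, returning probabilities
pred_valid <- predict(model, d_valid)
as.numeric(auc(labels_train, pred_train))   # ~0.96
as.numeric(auc(labels_valid, pred_valid))   # ~0.96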

So that made me think I should use cross-validation in CascadeForest, like this:

CascadeForest(training_data     = d_alltrain,
              validation_data   = NULL,
              training_labels   = labels_alltrain,
              validation_labels = NULL, ...)

Where: d_alltrain is all of my training data (65% + 35% = 100%), and labels_alltrain contains the corresponding known labels.

But I got the error noted in validation_data=NULL #6. I have not yet tried the fix you suggested there to make the code work for cross-validation, but is this the proper way to do cross-validation? And if the model then reports a good AUC (cross-validated on d_alltrain), and I apply the model to d_test, is that the right way to avoid overfitting, so that I can hope for a better score?
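
In case it clarifies what I mean, here is a sketch of a manual 5-fold cross-validation loop I could run instead of relying on validation_data = NULL (same assumed predict() interface as above; the other CascadeForest arguments are omitted):

set.seed(42)
k      <- 5
folds  <- sample(rep(seq_len(k), length.out = nrow(d_alltrain)))  # random fold assignment
cv_auc <- numeric(k)
for (i in seq_len(k)) {
  hold <- folds == i                                 # fold i is held out
  model_i <- CascadeForest(training_data     = d_alltrain[!hold, ],
                           validation_data   = d_alltrain[hold, ],
                           training_labels   = labels_alltrain[!hold],
                           validation_labels = labels_alltrain[hold])
  cv_auc[i] <- as.numeric(pROC::auc(labels_alltrain[hold],
                                    predict(model_i, d_alltrain[hold, ])))
}
mean(cv_auc)  # out-of-fold estimate of the AUC to expect on d_test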

Thank you very much.
