Hello Laurae, thanks for your earlier response to my question about emulating daForest.
This time I have a question somewhat related to validation_data=NULL #6: I want to make sure I understand how to properly set up training and testing so I can avoid overfitting. I ran CascadeForest and got excellent results on both the training data and held-out validation data (where I knew the labels). But when I applied the model to test data (separate from my train and validation data, where I did not know the labels but the contest website scored my submission), it did not perform nearly as well. So I believe I am overfitting.
Basically, I trained CascadeForest using d_train & d_valid like this:
CascadeForest(training_data = d_train,
              validation_data = d_valid,
              training_labels = labels_train,
              validation_labels = labels_valid, ...)
Where: d_train & labels_train = predictor columns & known labels (65% of my total training data)
d_valid & labels_valid = predictor columns & known labels, exclusive of d_train (the other 35% of my total training data).
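For reference, this is roughly how I created the split (just a sketch; d_all and labels_all stand in for my full training data and labels):

set.seed(42)                                    # for a reproducible split
n <- nrow(d_all)
idx_train <- sample(n, size = floor(0.65 * n))  # 65% of rows go to training

d_train      <- d_all[idx_train, ]
labels_train <- labels_all[idx_train]
d_valid      <- d_all[-idx_train, ]             # the remaining 35%
labels_valid <- labels_all[-idx_train]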
My AUC was something like 0.96 when I predicted on d_train, and also when I predicted on d_valid. That made me happy, so I applied the predict function to d_test, which is exclusive of d_train and d_valid and where I don't know the true labels. When I submitted my predictions to the contest website I got an AUC of 0.75, nowhere near 0.96.
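For what it's worth, the 0.96 numbers came from something like this (a sketch using the pROC package; preds_train and preds_valid stand for whatever predicted probabilities the model produced for d_train and d_valid):

library(pROC)

# AUC on the data the model saw during training and on the held-out validation set
auc_train <- auc(roc(labels_train, preds_train))
auc_valid <- auc(roc(labels_valid, preds_valid))
c(train = auc_train, valid = auc_valid)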
So that made me think I should use cross-validation in CascadeForest, like this:
CascadeForest(training_data = d_alltrain,
              validation_data = NULL,
              training_labels = labels_alltrain,
              validation_labels = NULL, ...)
Where: d_alltrain is all my training data (= 65% + 35% = 100%), and labels_alltrain is all my known labels for all my training data.
But I got the error noted in validation_data=NULL #6. I have not yet tried the fix you suggested to make those lines of code work for cross-validation, but is this the proper way to do cross-validation? And if the cross-validated AUC on d_alltrain looks good and I then apply the model to d_test, is that the proper way to avoid overfitting, so that I should hope for a better score?
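In case that code path stays broken, I was also thinking I could run the cross-validation manually around CascadeForest, something like this (just a sketch of 5-fold CV; the ... stands for the same other arguments as above, and I'm assuming CascadeForest can simply be called once per fold):

set.seed(42)
k <- 5
fold_id <- sample(rep(1:k, length.out = nrow(d_alltrain)))  # random fold assignment

cv_auc <- numeric(k)
for (i in 1:k) {
  in_fold <- fold_id == i
  model_i <- CascadeForest(training_data = d_alltrain[!in_fold, ],
                           validation_data = d_alltrain[in_fold, ],
                           training_labels = labels_alltrain[!in_fold],
                           validation_labels = labels_alltrain[in_fold], ...)
  # cv_auc[i] <- the validation AUC reported for fold i
}
mean(cv_auc)  # average out-of-fold AUC

The idea being that the averaged out-of-fold AUC should be a more honest estimate of how the model will do on d_test than the 0.96 I measured above.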
Thank you very much.