Hi,
I am doing text classification with Auto-WEKA and seem to have the same problem as described in #50.
The classifier classified 100% of the instances correctly. That must be a mistake, since the whole dataset has only 156 texts divided into 5 categories.
Here is the Auto-WEKA output:
Auto-WEKA result:
best classifier: weka.classifiers.meta.AdaBoostM1
arguments: [-P, 82, -I, 45, -Q, -S, 1, -W, weka.classifiers.trees.RandomForest, --, -I, 38, -K, 0, -depth, 14]
attribute search: null
attribute search arguments: []
attribute evaluation: null
attribute evaluation arguments: []
metric: errorRate
estimated errorRate: 0.0
training time on evaluation dataset: 0.83 seconds
You can use the chosen classifier in your own code as follows:
Classifier classifier = AbstractClassifier.forName("weka.classifiers.meta.AdaBoostM1", new String[]{"-P", "82", "-I", "45", "-Q", "-S", "1", "-W", "weka.classifiers.trees.RandomForest", "--", "-I", "38", "-K", "0", "-depth", "14"});
classifier.buildClassifier(instances);
Correctly Classified Instances 156 100 %
Incorrectly Classified Instances 0 0 %
Kappa statistic 1
Mean absolute error 0
Root mean squared error 0
Relative absolute error 0 %
Root relative squared error 0 %
Total Number of Instances 156
=== Confusion Matrix ===
a b c d e <-- classified as
72 0 0 0 0 | a = 1900-tallet_–_Nyrealisme_og_modernisme
0 55 0 0 0 | b = 1855-1900_–_Realisme_og_naturalisme
0 0 9 0 0 | c = 1840-1860_–_Nasjonalromantikk
0 0 0 12 0 | d = 1890-årene_–_Nyromantikk
0 0 0 0 8 | e = 1700-tallet_–_Opplysningstid_og_klassisime
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
1,000 0,000 1,000 1,000 1,000 1,000 1,000 1,000 1900-tallet_–_Nyrealisme_og_modernisme
1,000 0,000 1,000 1,000 1,000 1,000 1,000 1,000 1855-1900_–_Realisme_og_naturalisme
1,000 0,000 1,000 1,000 1,000 1,000 1,000 1,000 1840-1860_–_Nasjonalromantikk
1,000 0,000 1,000 1,000 1,000 1,000 1,000 1,000 1890-årene_–_Nyromantikk
1,000 0,000 1,000 1,000 1,000 1,000 1,000 1,000 1700-tallet_–_Opplysningstid_og_klassisime
Weighted Avg. 1,000 0,000 1,000 1,000 1,000 1,000 1,000 1,000
------- 2 BEST CONFIGURATIONS -------
These are the 2 best configurations, as ranked by SMAC
Please note that this list only contains configurations evaluated on at least 10 folds,
If you need more configurations, please consider running Auto-WEKA for a longer time.
Configuration #1:
SMAC Score: 0.23874999999999996
Argument String:
-_0__wekaclassifiersmetaadaboostm1_00_p_HIDDEN 1 -_0__wekaclassifiersmetaadaboostm1_02_2_INT_P 82 -_0__wekaclassifiersmetaadaboostm1_03_INT_I 45 -_0__wekaclassifiersmetaadaboostm1_04_Q REMOVED -_0__wekaclassifiersmetaadaboostm1_05_S 1 -_1_W weka.classifiers.trees.RandomForest -_1_W_0_DASHDASH REMOVED -_1_W_1__wekaclassifierstreesrandomforest_00_INT_I 38 -_1_W_1__wekaclassifierstreesrandomforest_01_features_HIDDEN 0 -_1_W_1__wekaclassifierstreesrandomforest_02_1_INT_K 0 -_1_W_1__wekaclassifierstreesrandomforest_04_depth_HIDDEN 1 -_1_W_1__wekaclassifierstreesrandomforest_06_2_INT_depth 14 -attributesearch NONE -attributetime 180.0 -targetclass weka.classifiers.meta.AdaBoostM1
Configuration #2:
SMAC Score: 0.24416666666666664
Argument String:
-_0__wekaclassifiersfunctionssimplelogistic_00_S REMOVED -_0__wekaclassifiersfunctionssimplelogistic_01_W_HIDDEN 0 -_0__wekaclassifiersfunctionssimplelogistic_02_1_W 0 -_0__wekaclassifiersfunctionssimplelogistic_04_A REMOVED -attributesearch NONE -attributetime 180.0 -targetclass weka.classifiers.functions.SimpleLogistic
----END OF CONFIGURATION RANKING----
Temporary run directories:
/var/folders/42/bq9r97tx5fg78j5d1k_td0pr0000gn/T/autoweka4462961014404761952/
/var/folders/42/bq9r97tx5fg78j5d1k_td0pr0000gn/T/autoweka6011276688207838292/
/var/folders/42/bq9r97tx5fg78j5d1k_td0pr0000gn/T/autoweka16926395897855063430/
/var/folders/42/bq9r97tx5fg78j5d1k_td0pr0000gn/T/autoweka16685207695737687558/
For better performance, try giving Auto-WEKA more time.
Tried 368 configurations; to get good results reliably you may need to allow for trying thousands of configurations.
Are these results reliable?
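To make the concern concrete, here is a minimal plain-Java sketch of why a perfect score can be meaningless when a model is tested on the same instances it was built from. It is not Auto-WEKA or WEKA code: the class `ResubstitutionCheck`, its methods, and the random toy data are all hypothetical, and I am only assuming that the summary above was computed on the 156 training instances. The toy classifier simply memorises its training set (1-nearest-neighbour), and the labels are pure noise, so there is nothing real to learn.

```java
import java.util.Random;

/** Toy illustration (hypothetical names): training error versus cross-validated error. */
class ResubstitutionCheck {

    /** Random feature vectors in [0, 1); stands in for a real dataset. */
    static double[][] randomFeatures(int n, int dims, Random rng) {
        double[][] x = new double[n][dims];
        for (double[] row : x) {
            for (int j = 0; j < dims; j++) row[j] = rng.nextDouble();
        }
        return x;
    }

    /** Random class labels, unrelated to the features. */
    static int[] randomLabels(int n, int classes, Random rng) {
        int[] y = new int[n];
        for (int i = 0; i < n; i++) y[i] = rng.nextInt(classes);
        return y;
    }

    /** 1-nearest-neighbour: returns the label of the closest training point. */
    static int predict(double[][] trainX, int[] trainY, double[] query) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int i = 0; i < trainX.length; i++) {
            double d = 0.0;
            for (int j = 0; j < query.length; j++) {
                double diff = trainX[i][j] - query[j];
                d += diff * diff;
            }
            if (d < bestDist) { bestDist = d; best = i; }
        }
        return trainY[best];
    }

    /** Error rate when the model is tested on the instances it was built from. */
    static double trainingError(double[][] x, int[] y) {
        int wrong = 0;
        for (int i = 0; i < x.length; i++) {
            if (predict(x, y, x[i]) != y[i]) wrong++;
        }
        return (double) wrong / x.length;
    }

    /** k-fold cross-validated error; instance i is held out in fold i % folds. */
    static double crossValidatedError(double[][] x, int[] y, int folds) {
        int wrong = 0;
        for (int f = 0; f < folds; f++) {
            int nTrain = 0;
            for (int i = 0; i < x.length; i++) if (i % folds != f) nTrain++;
            double[][] trainX = new double[nTrain][];
            int[] trainY = new int[nTrain];
            int t = 0;
            for (int i = 0; i < x.length; i++) {
                if (i % folds != f) { trainX[t] = x[i]; trainY[t] = y[i]; t++; }
            }
            // Score only the held-out instances of this fold.
            for (int i = 0; i < x.length; i++) {
                if (i % folds == f && predict(trainX, trainY, x[i]) != y[i]) wrong++;
            }
        }
        return (double) wrong / x.length;
    }

    public static void main(String[] args) {
        Random rng = new Random(1);
        double[][] x = randomFeatures(156, 10, rng); // same size as my dataset
        int[] y = randomLabels(156, 5, rng);         // 5 classes, labels are noise
        System.out.println("training error:        " + trainingError(x, y));
        System.out.println("cross-validated error: " + crossValidatedError(x, y, 10));
    }
}
```

In this sketch the training error is zero by construction, because every instance is its own nearest neighbour, while the cross-validated error stays near chance level. The SMAC score of about 0.239 reported for the winning configuration, which appears to come from fold-based evaluation, points in the same direction: it is far from the 0.0 error rate shown in the summary.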