Following the author's code through to the end, the AUC reported at the cross-validation step is indeed quite high, reaching 0.79. However, when I split the training and validation sets myself and trained a model with the same parameters, the results were disappointing.
```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score

# Hold out 20% of the training data as a validation set
voting = VotingClassifier(estimators=estimators, voting='soft')
X_train_new, X_val, y_train_new, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=0)
voting.fit(X_train_new, y_train_new)

y_train_pred = voting.predict(X_train_new)
y_val_pred = voting.predict(X_val)
print(classification_report(y_train_new, y_train_pred))
print(roc_auc_score(y_train_new, y_train_pred))
print(classification_report(y_val, y_val_pred))
print(roc_auc_score(y_val, y_val_pred))
```

The output is:
```
              precision    recall  f1-score   support

           0       0.94      1.00      0.97     21212
           1       1.00      0.00      0.00      1247

   micro avg       0.94      0.94      0.94     22459
   macro avg       0.97      0.50      0.49     22459
weighted avg       0.95      0.94      0.92     22459

0.5008019246190858

              precision    recall  f1-score   support

           0       0.94      1.00      0.97      5306
           1       0.00      0.00      0.00       309

   micro avg       0.94      0.94      0.94      5615
   macro avg       0.47      0.50      0.49      5615
weighted avg       0.89      0.94      0.92      5615

0.49962306822465136
```
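One caveat about the numbers above, separate from the bug discussed below: scikit-learn's `'roc_auc'` scorer used in cross-validation ranks samples by `predict_proba`, while the snippet here feeds `roc_auc_score` the hard 0/1 labels from `predict()`, so the two AUC values are not strictly comparable. A toy sketch (hypothetical data, not the original dataset) of the difference:

```python
# Passing hard labels to roc_auc_score collapses the ROC curve to a single
# operating point, while probability scores keep the full ranking.
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 0, 0, 1, 1])
proba = np.array([0.1, 0.2, 0.3, 0.45, 0.4, 0.6])  # full ranking kept
labels = (proba >= 0.5).astype(int)                # thresholding discards ranking

print(roc_auc_score(y_true, proba))   # 0.875
print(roc_auc_score(y_true, labels))  # 0.75
```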
The model's AUC is barely 0.5, and recall on the minority class is essentially zero. Since this is an imbalanced dataset with relatively few defaulters, the model is most likely classifying every sample as 0: accuracy looks high, but such a model is useless.
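That "predict everything as 0" failure mode is easy to reproduce in isolation. A minimal sketch on synthetic imbalanced data (generated here, not the original loan data) shows a majority-class baseline producing exactly this signature:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.06).astype(int)  # ~6% positives, similar imbalance

# Baseline that always predicts the majority class (0)
clf = DummyClassifier(strategy='most_frequent').fit(X, y)
pred = clf.predict(X)

print(accuracy_score(y, pred))  # high, roughly 0.94
print(recall_score(y, pred))    # 0.0 -- no defaulter is ever caught
print(roc_auc_score(y, pred))   # 0.5 -- no discriminative power at all
```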
So where did things go wrong, and why was the cross-validation AUC so high? Reading through the code, I found a bug. Here:
```python
cv = StratifiedKFold(n_splits=3, shuffle=True)

def estimate(estimator, name='estimator'):
    # Each metric triggers its own, independent cross-validation run
    auc = cross_val_score(estimator, X_train, y_train, scoring='roc_auc', cv=cv).mean()
    accuracy = cross_val_score(estimator, X_train, y_train, scoring='accuracy', cv=cv).mean()
    recall = cross_val_score(estimator, X_train, y_train, scoring='recall', cv=cv).mean()
    print("{}: auc:{:f}, recall:{:f}, accuracy:{:f}".format(name, auc, recall, accuracy))
```

The author passes a `StratifiedKFold` instance as the `cv` parameter of `cross_val_score`. Reading the source shows that if an integer is passed instead, `cross_val_score` also defaults to `StratifiedKFold(cv)` for classifiers, but without `shuffle=True`. In addition, running three separate cross-validations just to compute three metrics is questionable. So I tried rewriting the code as follows:
```python
def estimate(estimator, name='estimator'):
    # One cross-validation run; all three metrics computed on the same folds
    scoring = {'roc_auc': 'roc_auc',
               'accuracy': 'accuracy',
               'recall': 'recall'}
    scoring_result_dict = cross_validate(estimator, X_train, y_train,
                                         scoring=scoring, cv=3,
                                         return_estimator=True)
    auc = scoring_result_dict['test_roc_auc'].mean()
    accuracy = scoring_result_dict['test_accuracy'].mean()
    recall = scoring_result_dict['test_recall'].mean()
    print(scoring_result_dict)
    print("{}: auc:{:f}, recall:{:f}, accuracy:{:f}".format(name, auc, recall, accuracy))
```

With this version the AUC comes out at only about 0.5, consistent with the results above. I also tried passing `cv = StratifiedKFold(n_splits=3, shuffle=True)`, and the AUC was again only about 0.5. My guess is that `shuffle=True` is what inflated the AUC, but I have not yet pinned down the exact cause.
The author put a great deal of work into data cleaning and feature engineering, which is genuinely instructive, but the final model-tuning part comes across as a bit rough.