Following the author's code through to the end, the AUC reported at the cross-validation step is indeed quite high, reaching 0.79. However, when I split the training and validation sets myself and trained a model with the same parameters, the results were disappointing.
```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score

# Hold out 20% of the training data as a validation set
voting = VotingClassifier(estimators=estimators, voting='soft')
X_train_new, X_val, y_train_new, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=0)
voting.fit(X_train_new, y_train_new)

y_train_pred = voting.predict(X_train_new)
y_val_pred = voting.predict(X_val)
print(classification_report(y_train_new, y_train_pred))
print(roc_auc_score(y_train_new, y_train_pred))
print(classification_report(y_val, y_val_pred))
print(roc_auc_score(y_val, y_val_pred))
```

The output is:
```
              precision    recall  f1-score   support

           0       0.94      1.00      0.97     21212
           1       1.00      0.00      0.00      1247

   micro avg       0.94      0.94      0.94     22459
   macro avg       0.97      0.50      0.49     22459
weighted avg       0.95      0.94      0.92     22459

0.5008019246190858

              precision    recall  f1-score   support

           0       0.94      1.00      0.97      5306
           1       0.00      0.00      0.00       309

   micro avg       0.94      0.94      0.94      5615
   macro avg       0.47      0.50      0.49      5615
weighted avg       0.89      0.94      0.92      5615

0.49962306822465136
```
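One caveat about the numbers above, separate from the bug discussed below: scikit-learn's `'roc_auc'` scorer used in cross-validation ranks samples by `predict_proba`, while the snippet here feeds `roc_auc_score` the hard 0/1 labels from `predict()`, so the two AUC values are not strictly comparable. A toy sketch (hypothetical data, not the original dataset) of the difference:

```python
# Passing hard labels to roc_auc_score collapses the ROC curve to a single
# operating point, while probability scores keep the full ranking.
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 0, 0, 1, 1])
proba = np.array([0.1, 0.2, 0.3, 0.45, 0.4, 0.6])  # full ranking kept
labels = (proba >= 0.5).astype(int)                # thresholding discards ranking

print(roc_auc_score(y_true, proba))   # 0.875
print(roc_auc_score(y_true, labels))  # 0.75
```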
The model's AUC is barely 0.5, and recall on the minority class is essentially zero. Since this is an imbalanced dataset with relatively few defaulters, the model is most likely classifying every sample as 0: accuracy looks high, but such a model is useless.
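That "predict everything as 0" failure mode is easy to reproduce in isolation. A minimal sketch on synthetic imbalanced data (generated here, not the original loan data) shows a majority-class baseline producing exactly this signature:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.06).astype(int)  # ~6% positives, similar imbalance

# Baseline that always predicts the majority class (0)
clf = DummyClassifier(strategy='most_frequent').fit(X, y)
pred = clf.predict(X)

print(accuracy_score(y, pred))  # high, roughly 0.94
print(recall_score(y, pred))    # 0.0 -- no defaulter is ever caught
print(roc_auc_score(y, pred))   # 0.5 -- no discriminative power at all
```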
So where did things go wrong, and why was the cross-validation AUC so high? Reading through the code, I found a bug. Here:
```python
cv = StratifiedKFold(n_splits=3, shuffle=True)

def estimate(estimator, name='estimator'):
    # Each metric triggers its own, independent cross-validation run
    auc = cross_val_score(estimator, X_train, y_train, scoring='roc_auc', cv=cv).mean()
    accuracy = cross_val_score(estimator, X_train, y_train, scoring='accuracy', cv=cv).mean()
    recall = cross_val_score(estimator, X_train, y_train, scoring='recall', cv=cv).mean()
    print("{}: auc:{:f}, recall:{:f}, accuracy:{:f}".format(name, auc, recall, accuracy))
```

The author passes a `StratifiedKFold` instance as the `cv` parameter of `cross_val_score`. Reading the source shows that if an integer is passed instead, `cross_val_score` also defaults to `StratifiedKFold(cv)` for classifiers, but without `shuffle=True`. In addition, running three separate cross-validations just to compute three metrics is questionable. So I tried rewriting the code as follows:
```python
def estimate(estimator, name='estimator'):
    # One cross-validation run; all three metrics computed on the same folds
    scoring = {'roc_auc': 'roc_auc',
               'accuracy': 'accuracy',
               'recall': 'recall'}
    scoring_result_dict = cross_validate(estimator, X_train, y_train,
                                         scoring=scoring, cv=3,
                                         return_estimator=True)
    auc = scoring_result_dict['test_roc_auc'].mean()
    accuracy = scoring_result_dict['test_accuracy'].mean()
    recall = scoring_result_dict['test_recall'].mean()
    print(scoring_result_dict)
    print("{}: auc:{:f}, recall:{:f}, accuracy:{:f}".format(name, auc, recall, accuracy))
```

With this version the AUC comes out at only about 0.5, consistent with the results above. I also tried passing `cv = StratifiedKFold(n_splits=3, shuffle=True)`, and the AUC was again only about 0.5. My guess is that `shuffle=True` is what inflated the AUC, but I have not yet pinned down the exact cause.
The author put a great deal of work into data cleaning and feature engineering, which is genuinely instructive, but the final model-tuning part comes across as a bit rough.