This repository was archived by the owner on Apr 15, 2026. It is now read-only.

Bow_research

kfritsch edited this page Jul 7, 2017 · 1 revision

Bow tutorial

  • Don't use RandomForest -> not good for sparse, high-dimensional data
  • Use TF-IDF instead of just sparse word2vec
  • Use AUC instead of accuracy (the ROC curve plots the true positive rate against the false positive rate as the decision threshold is varied -> the area under that curve is the indicator)
    • can deal with unbalanced data
    • handles all possible thresholds at once
    • doesn't care about the value range of the scores, only their ranking
    • available as a metric in sklearn (sklearn.metrics.roc_auc_score)
  • Use a linear classifier (like LogisticRegression) - it's faster and better and can handle more features than samples (d > n)
  • The more features you have, the better -> consider n-grams
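
The points above can be combined into one small sklearn sketch: TF-IDF features over unigrams and bigrams, a LogisticRegression classifier, and AUC as the metric. The toy corpus, labels and variable names are made up for illustration, and the settings (ngram_range=(1, 2), max_iter=1000) are assumptions, not values taken from these notes.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import make_pipeline

# Toy corpus, invented for illustration; 1 = positive, 0 = negative.
train_texts = [
    "great movie, loved it",
    "what a wonderful film",
    "really great acting",
    "terrible movie, hated it",
    "what an awful film",
    "really bad acting",
]
train_labels = [1, 1, 1, 0, 0, 0]
test_texts = ["loved the wonderful acting", "hated the awful movie"]
test_labels = [1, 0]

# TF-IDF over unigrams and bigrams feeds a linear classifier.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
model.fit(train_texts, train_labels)

# With n-grams the feature count d easily exceeds the sample count n.
n_features = len(model.named_steps["tfidfvectorizer"].vocabulary_)
print(f"n = {len(train_texts)} samples, d = {n_features} features")

# AUC is computed from continuous scores (class-1 probabilities),
# not from hard 0/1 predictions.
scores = model.predict_proba(test_texts)[:, 1]
auc = roc_auc_score(test_labels, scores)
print(f"AUC: {auc:.3f}")
```

Since AUC only depends on how the scores rank the examples, passing model.decision_function(test_texts) instead of the probabilities should give the same value.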
