rcallah/Spark-Project

Nationwide Investigation of Federal Prosecutors

This is a Natural Language Processing (NLP) and supervised Machine Learning (ML) problem: determining whether prosecutorial misconduct is involved in a federal case. We have a dataset of 624 labeled cases (467 "no misconduct" and 157 "misconduct"), of which we use 80% to train the model and 20% to validate it. We use the StratifiedKFold cross-validator so that the "misconduct" and "no misconduct" cases keep the same proportions in both the training and testing splits. We tried Logistic Regression, Random Forest, Support Vector Machine (SVM), Recurrent Neural Network (RNN), and Convolutional Neural Network (CNN) models. Of these, Logistic Regression achieved the highest accuracy, at 80%.
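As a rough illustration of the stratified splitting described above, the sketch below runs scikit-learn's `StratifiedKFold` on placeholder labels rebuilt from the stated class counts. The helper name `fold_class_counts`, the choice of `n_splits=5` (so each fold holds out about 20%), and the `shuffle`/`seed` settings are assumptions for this example, not settings taken from the project.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Placeholder labels built only from the class counts quoted above:
# 0 = "no misconduct" (467 cases), 1 = "misconduct" (157 cases).
labels = np.array([0] * 467 + [1] * 157)


def fold_class_counts(y, n_splits=5, seed=0):
    """Return (test size, misconduct count) for each stratified test fold."""
    # n_splits=5 is an assumption: each fold then holds out about 20%.
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    # split() only uses the number of rows in X, so zeros stand in
    # for the real feature matrix here.
    placeholder_features = np.zeros((len(y), 1))
    counts = []
    for _, test_idx in skf.split(placeholder_features, y):
        counts.append((len(test_idx), int(y[test_idx].sum())))
    return counts


for fold, (n_test, n_misconduct) in enumerate(fold_class_counts(labels), start=1):
    print(f"fold {fold}: {n_test} test cases, {n_misconduct} misconduct")
```

Every test fold ends up with 124 or 125 cases, of which 31 or 32 are "misconduct", so each fold stays close to the overall 157/624 (about 25%) misconduct rate.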

Implementations

Tools

Requirements

  • pandas >= 0.23.4
  • numpy >= 1.14.5
  • wordcloud >= 1.5.0
  • matplotlib >= 3.0.0
  • scikit-learn >= 0.20.1
  • glove.6B.zip (pre-trained GloVe word vectors; download separately)
  • textract (install separately)
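The README does not say how glove.6B.zip is used; presumably it supplies pre-trained word embeddings for the neural models. As a minimal sketch under that assumption, the text files inside the archive store one word per line followed by its space-separated vector components, and can be parsed as below. The function name `load_glove` and the sample words and values are hypothetical.

```python
import io

import numpy as np


def load_glove(lines):
    """Parse GloVe text lines ("word v1 v2 ...") into a {word: vector} dict."""
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return embeddings


# Tiny in-memory stand-in with made-up 3-dimensional vectors; the real
# files hold 50 to 300 dimensions per word.
sample = io.StringIO("court 0.1 -0.2 0.3\njury 0.0 0.5 -0.4\n")
vectors = load_glove(sample)
print(vectors["court"].shape)
```

With the real data, the same function could be given an open file handle for one of the extracted files, for example `glove.6B.100d.txt` opened with `encoding="utf-8"`; which dimensionality the project uses is not stated.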
