The project is divided into three parts.
Part one focuses on choosing the best-performing classifier and feature extractor, by testing and analyzing different classifiers with different features using nltk and scikit-learn; a sketch of this kind of comparison follows.
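As an illustration only, here is a minimal sketch of such a comparison, assuming a made-up toy corpus and two off-the-shelf scikit-learn classifiers; the project's actual data sets, classifiers, and parameters differ, and real runs score on a held-out test split rather than the training data.

    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.neural_network import MLPClassifier

    # Toy corpus for illustration only; the project uses its own data sets.
    texts = ["great acting and a great story", "boring and predictable",
             "loved every minute", "a dull, terrible film",
             "wonderful, touching story", "awful plot and bad acting"]
    labels = [1, 0, 1, 0, 1, 0]

    for feat_name, vectorizer in [("Raw Count", CountVectorizer()),
                                  ("TF-IDF", TfidfVectorizer())]:
        X = vectorizer.fit_transform(texts)  # documents -> feature matrix
        for clf in (MultinomialNB(), MLPClassifier(max_iter=1000)):
            clf.fit(X, labels)
            # Training accuracy only; a real comparison uses a held-out split.
            print(f"{feat_name} + {type(clf).__name__}: {clf.score(X, labels):.3f}")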
Part two implements almost every stage covered in our course with little use of machine learning or natural language processing libraries: from text tokenization and stemming, through feature selection, to model construction. We also analyze a few data sets and compare the results against the results and source code from part one; the comparison reveals several ways to improve.
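To give a flavor of part two, the following is a from-scratch sketch of the same pipeline, assuming a crude regex tokenizer, a naive suffix-stripping stemmer, and a multinomial Naive Bayes with add-one smoothing; the project's own implementations are more complete than this.

    import math
    import re
    from collections import Counter, defaultdict

    def tokenize(text):
        # Lowercase and keep runs of letters; the project's tokenizer may differ.
        return re.findall(r"[a-z]+", text.lower())

    def stem(token):
        # Crude suffix stripping, standing in for a real stemming algorithm.
        for suffix in ("ing", "ed", "s"):
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                return token[:-len(suffix)]
        return token

    class NaiveBayes:
        # Multinomial Naive Bayes with add-one smoothing, no ML libraries.
        def fit(self, docs, labels):
            self.word_counts = defaultdict(Counter)  # token counts per class
            self.class_counts = Counter(labels)      # document counts per class
            self.vocab = set()
            for doc, label in zip(docs, labels):
                tokens = [stem(t) for t in tokenize(doc)]
                self.word_counts[label].update(tokens)
                self.vocab.update(tokens)
            return self

        def predict(self, doc):
            tokens = [stem(t) for t in tokenize(doc)]
            n_total = sum(self.class_counts.values())
            best_label, best_score = None, float("-inf")
            for label, n_docs in self.class_counts.items():
                score = math.log(n_docs / n_total)  # log prior
                denom = sum(self.word_counts[label].values()) + len(self.vocab)
                for t in tokens:
                    score += math.log((self.word_counts[label][t] + 1) / denom)
                if score > best_score:
                    best_label, best_score = label, score
            return best_label

    model = NaiveBayes().fit(["loved it", "hated it, terrible"], ["pos", "neg"])
    print(model.predict("a film I loved"))  # prints: pos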
In part three, we write a program that interacts with the user through the command line.
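A hypothetical sketch of such an interaction loop is shown below; the real prompts and model wiring live in analysis_model.py, and classify here is only a placeholder for the trained model's prediction.

    def classify(text):
        # Placeholder standing in for the trained model's prediction.
        return "positive" if "good" in text.lower() else "negative"

    if __name__ == "__main__":
        print("Enter a sentence to classify, or type 'quit' to exit.")
        while True:
            line = input("> ").strip()
            if line.lower() == "quit":
                break
            print("Prediction:", classify(line))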
This part is in the file analysis_model.py. To run this code, make sure that numpy, pandas, and scikit-learn are installed (for example, via pip3 install numpy pandas scikit-learn). These libraries are used to pre-process the text data and extract features before the classifiers are built.
The code can be run in the PyCharm IDE as well as from the command line: just type python3 analysis_model.py, press Enter, and the results will be printed.
After running the code, you should see output like this in the command line or IDE:
Feature: Raw Count
Self-built NeuralNetwork Accuracy with Raw Count: 61.93895870736086
Self-built NaiveBayes Accuracy with Raw Count: 87.07360861759426
Feature: TF-IDF
Self-built NeuralNetwork Accuracy with TF-IDF: 85.278276481149
Self-built NaiveBayes Accuracy with TF-IDF: 87.07360861759426
Several warnings may appear between these lines; they do not affect the results, so please ignore them.