Project objective: To save you time and effort rewriting this script!
Text processing pipeline for NLP problems with ready-to-use functions and text classification models.
Description
- Code file environment - Jupyter notebook
- Programming language - Python (any recent stable version)
Input dataset:
Data source - https://www.kaggle.com/team-ai/spam-text-message-classification
Data format - .csv
- This is only sample data for demonstrating and testing the code. You can replace it with any text dataset.
- Note: This pipeline expects text input. It will not work on quantitative (numeric) datasets or on classification, regression, or clustering problems built on such data.
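Since any text dataset in .csv format can stand in for the sample data, the loading step can be sketched roughly as below. This is a minimal illustration using only the standard library, not the notebook's own loader. The function name load_text_dataset and the column names Message and Category are assumptions; change them to match the headers of your file.

```python
import csv
import io

def load_text_dataset(csv_file, text_col="Message", label_col="Category"):
    """Read (text, label) pairs from an open CSV file.

    The default column names are assumptions; adjust them to your dataset.
    """
    reader = csv.DictReader(csv_file)
    texts, labels = [], []
    for row in reader:
        text = (row.get(text_col) or "").strip()
        if not text:
            continue  # skip rows that have no text
        texts.append(text)
        labels.append(row.get(label_col))
    return texts, labels

# Tiny in-memory sample (made-up rows) standing in for a real .csv file.
sample = io.StringIO(
    "Category,Message\n"
    "ham,See you at lunch\n"
    "spam,WIN a free prize now\n"
    "ham,\n"
)
texts, labels = load_text_dataset(sample)
print(texts)
print(labels)
```

For a file on disk, pass an open handle instead of the in-memory sample, for example one created with open(path, newline="", encoding="utf-8").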
Getting Started
Dependencies
-> NLTK (Natural Language Toolkit) is the NLP package used to build the full pipeline. To install it:
Run pip install nltk in your terminal or command prompt.
If NLTK is not installed, Python raises a ModuleNotFoundError when the script reaches import nltk.
-> If you want to use an NLTK dataset, add the following to your script (in place of the existing file import):
import nltk
nltk.download('dataset name')  # or nltk.download('all') to fetch every dataset
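To avoid hitting a ModuleNotFoundError partway through a run, you could check for the packages up front. This is a small sketch, assuming the import names nltk and sklearn (the import name of scikit-learn). The helper missing_modules is hypothetical and not part of the repository.

```python
import importlib.util

def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# Import names assumed for this project; extend the list as needed.
required = ["nltk", "sklearn"]
missing = missing_modules(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

Note that the name used with pip can differ from the import name: sklearn is installed with pip install scikit-learn.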
** I'd recommend Gensim for large datasets, as NLTK tends to slow down as the input data grows. **
The repository includes all other required files, and the code makes all necessary imports, including scikit-learn.
Simply fork the repository and run it like any other Python project.