
This notebook contains an entire text preprocessing pipeline for NLP problems. The ready-to-use functions require the NLTK and scikit-learn packages. It also contains some prominent text classification models.


Shubha23/Text-processing-NLP


Author - Shubha Mishra (@Shubha23)

The Complete Natural Language Processing (NLP) Pipeline For Text Processing


Project objective: To save you time and effort rewriting this script!

Text processing pipeline for NLP problems with ready-to-use functions and text classification models.


Description

  • Code file environment - Jupyter notebook

  • Programming language - Python (any recent stable version should work)


Input dataset:

Data source - https://www.kaggle.com/team-ai/spam-text-message-classification

Data format - .csv

  • This is just sample data to show and test the code. You can replace it with ANY text dataset.

  • Note: The pipeline expects text input. It will not work on quantitative (purely numerical) datasets, or on classification, regression, or clustering problems built on such data.
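As a rough illustration of swapping in your own text dataset, here is a minimal sketch that reads a .csv file into parallel lists of texts and labels using only the standard library. The function name `load_text_dataset` is hypothetical and not part of the notebook, and the column names `Category` and `Message` are assumptions about the sample file; adjust them to match your data.

```python
import csv
import io


def load_text_dataset(source, text_column="Message", label_column="Category"):
    """Read rows from an open CSV file and return (texts, labels).

    `text_column` and `label_column` are assumed column names; change them
    to whatever headers your own dataset uses.
    """
    reader = csv.DictReader(source)
    texts, labels = [], []
    for row in reader:
        text = (row.get(text_column) or "").strip()
        if not text:
            # Skip rows with no usable text, since the pipeline needs text input.
            continue
        texts.append(text)
        labels.append(row.get(label_column))
    return texts, labels


# Small in-memory sample standing in for a real file opened with
# open("your_file.csv", newline="", encoding="utf-8").
sample = io.StringIO(
    "Category,Message\n"
    "ham,See you at lunch?\n"
    "spam,WINNER!! Claim your prize now\n"
)
texts, labels = load_text_dataset(sample)
```

The notebook itself may load the data differently (for example with pandas); this only shows the shape of the input the pipeline expects, namely one free-text field per row and, optionally, a label.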


Getting Started

Dependencies

-> NLTK (Natural Language Toolkit) is the NLP package used to build the full pipeline. To install it:

   Run pip install nltk in your terminal or command prompt

   If NLTK is not installed, importing it raises a ModuleNotFoundError.

-> If you want to use an NLTK dataset, add the following to your script (in place of the existing file import)

   import nltk
   nltk.download('dataset name')  # or nltk.download('all') to fetch everything
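The install check and download steps above can be sketched as follows. This is only an illustration: the helper names `nltk_is_installed` and `download_nltk_resources` are hypothetical, and the resource names in the usage comment (`stopwords`, `punkt`) are examples rather than a list of what the notebook requires.

```python
import importlib.util


def nltk_is_installed() -> bool:
    """Return True if the nltk package can be found, without importing it."""
    return importlib.util.find_spec("nltk") is not None


def download_nltk_resources(resources):
    """Download each named NLTK resource and report whether it succeeded.

    Raises ModuleNotFoundError with an installation hint if NLTK is missing.
    """
    if not nltk_is_installed():
        raise ModuleNotFoundError(
            "NLTK is not installed; run `pip install nltk` first."
        )
    import nltk  # imported lazily so the check above can fail cleanly

    results = {}
    for name in resources:
        # nltk.download returns True on success and False otherwise.
        results[name] = nltk.download(name, quiet=True)
    return results


# Example usage (requires network access):
# download_nltk_resources(["stopwords", "punkt"])
```

Downloading only the resources you need is usually much faster than `nltk.download('all')`, which fetches the full NLTK data collection.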

** I'd recommend using Gensim for large datasets, as NLTK tends to become slower as the input data size increases. **

The repository includes all other required files, and the code makes all necessary imports, including scikit-learn.

Simply fork the repository and run it as any Python project.


----------------------- End of file ---------------------------------------------------------------------------------------------
