NLP-SCAN is an unsupervised text classification method whose embedding technique can be tailored to different classification tasks. It is based on the code for the paper "DocSCAN: Unsupervised Text Classification via Learning from Neighbors", which adapts the SCAN method to NLP. NLP-SCAN additionally incorporates the Fine-Tuning through Self-Labeling step from the original SCAN method, resulting in increased accuracy.
NLP-SCAN is a transformer-embedding-based unsupervised text classification algorithm with the following main advantages:
- Trains a text classification model on unlabeled text datasets
- The semantic focus of the classification model can be tailored to the task using the Indicative Sentence embedding method
- Fast, accurate, and requires minimal human input
Assuming Anaconda and Linux, the environment can be installed with the following commands:
conda create -n nlpscan python=3.8
conda activate nlpscan
pip install -r requirements.txt

This tutorial will guide you through the steps to apply NLP-SCAN to your unlabeled text dataset.
Ensure your text data is saved as a text file with one text per line.
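If your texts contain line breaks, they need to be flattened before they fit the one-text-per-line format. The following is a minimal sketch of one way to do that, not part of NLP-SCAN itself: the helper name `write_one_text_per_line` is hypothetical, and UTF-8 encoding is an assumption.

```python
from pathlib import Path


def write_one_text_per_line(texts, path):
    """Write each text on its own line and return the number of texts written.

    Hypothetical helper, not part of NLP-SCAN. Assumes UTF-8 output.
    """
    cleaned = []
    for text in texts:
        # Collapse newlines, tabs, and repeated spaces so the text stays on one line.
        single_line = " ".join(text.split())
        if single_line:  # skip texts that are empty after cleaning
            cleaned.append(single_line)
    # One text per line, each terminated by a newline.
    Path(path).write_text("".join(line + "\n" for line in cleaned), encoding="utf-8")
    return len(cleaned)
```

For example, `write_one_text_per_line(my_texts, "my_dataset.txt")` would produce a file whose path can then be assigned to `file_path_data`; both `my_texts` and the file name are placeholders.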
Open the NLPSCAN.py script. Set the path to your dataset in the file_path_data variable and the path for the result in the file_path_result variable.
If needed, adjust other configurations within the NLPSCAN.py script to suit your specific requirements.
Follow the installation instructions to create the NLP-SCAN environment using Anaconda.
Activate the NLP-SCAN environment:
conda activate nlpscan
Execute the NLP-SCAN script:
python NLPSCAN.py
NLP-SCAN will process your dataset and save the classified results as a CSV file with "sentence" and "class" columns.
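Once the result file exists, a quick check of the class distribution can show whether the clusters are reasonably balanced. The sketch below assumes the CSV is comma-delimited, UTF-8 encoded, and has a header row with the "sentence" and "class" columns; the function name `class_counts` is hypothetical and not part of NLP-SCAN.

```python
import csv
from collections import Counter


def class_counts(result_path):
    """Count how many sentences were assigned to each class.

    Hypothetical helper. Assumes a comma-delimited UTF-8 CSV with a header
    row containing a "class" column. Class labels are read as strings.
    """
    with open(result_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        return Counter(row["class"] for row in reader)
```

Calling `class_counts(...)` with the path you set in `file_path_result` returns a `Counter` keyed by class label; `.most_common()` lists the classes from largest to smallest.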