The objective of this script is to apply NLP techniques in order to extract topics (tags) from a body of text.
It is used for the project wuwana2 to automatically classify a company based on the text from their website and social media.
- Python 3.7 or higher.
- A MySQL database
- Launch the following command to install all the libraries required:
pip install -r requirements.txt
- Download a Spacy pretrained model in the desired folder:
python -m spacy download en_core_web_lg
This model can be switched with some other english one from https://spacy.io/models/en
Use the config.py to define general aspects of the script:
- max_words = 5 - Max number of words/tags to be returned by default. Gensims is limited to 3 by default.
- lib = "keybert" # library to be used (wordcloud / gensim / keybert).
- desc_field = "description" - field from company table where description text is taken.
- weight_field = "TagInfo" - Field in company table where weights will be stored.
- remove_words = "./data/words_to_remove.txt" - path to file of words to be removed.
- replace_words = "./data/words_to_replace.txt" - replace_words: path to file of words to be replaced.
- empha_words = "./data/words_to_emphasize.txt" - *path to file of words to be emphasized.
- empha_multi = 5 - *number of times to repeat/emphasize words.
- finaltags_toremove = "./data/finaltags_toremove.txt" - *tags that never will appear.
- finaltags_alwaysmain = "./data/finaltags_alwaysmain.txt" - *these tags will always be main tags (in order of appeareance).
- languages = ["es","fr","zh-cn"] - list of languages to be translated, apart from english. Same order is followed in output.
- spacy_model = "./en_core_web_lg" - spacy pretrained model.
Use the database_config.py to connect to the MySQL database.
Once defined properly the config file, we can launch the script from a command line this way:
python main.py --onlyid ID
Where ID the id of the company in the database we want to update. This command will launch the script which will take the text from desc_field column (in the company table) and extract the main topics (including weights).
Topics will be stored in the following table/fields:
- Tag table: This table holds tags ID and descriptions (translated according to languages in config file). The table is updated when a new tag is discovered.
- Company table: For the selected ID, four columns are updated:
- FirstTagID (main tag)
- SecondTagID (second main tag)
- OtherTags (rest of them)
- TagInfo(or other field selected in config file, with the weights returned by NLP algos)