Nltk #108
Merged
…ionality to visualize_word_frequency_plot
… installed implicitly
Description
Removes the nltk dependency.
nltk was used for three things in pvOps:
1. The stopwords list
2. The word_tokenize function
3. pvops.text.visualize.visualize_word_frequency_plot
This PR addresses the points by:
1. A stopwords.txt file under pvops.text, which is read by the function pvops.text.nltk_utils.create_stopwords using importlib.resources (part of the standard library in Python >= 3.7)
   a. Removed the first argument to this function, which denoted the language to grab the stopwords for.
   b. Removed the text module test that checked the stopwords list download from nltk.
2. A new pvops.text.preprocess.regex_tokenize function
   a. Replaced all references to word_tokenize with the new function.
   b. It can be used with the default regex pattern (described in the associated docstring), and users can also supply their own regex pattern for tokenizing.
3. pvops.text.visualize.visualize_word_frequency_plot
   a. Care was taken to ensure arguments from the previous version still work in the new version.
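The first two points can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: the package name demo_text_pkg, the helper names load_stopwords and regex_tokenize_sketch, and the default pattern `\w+` are all hypothetical, and the real default pattern is the one described in the regex_tokenize docstring. The sketch uses importlib.resources.files, which needs Python 3.9 or newer; on 3.7 and 3.8 importlib.resources.open_text plays the same role.

```python
import importlib
import importlib.resources
import re
import sys
import tempfile
from pathlib import Path

# Hypothetical default pattern (runs of word characters). The real default
# used by pvOps may differ; see the regex_tokenize docstring.
DEFAULT_PATTERN = r"\w+"


def load_stopwords(package, resource="stopwords.txt"):
    """Read a one-word-per-line stopword file bundled inside `package`."""
    path = importlib.resources.files(package).joinpath(resource)
    text = path.read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def regex_tokenize_sketch(text, pattern=DEFAULT_PATTERN):
    """Split `text` into tokens: every non-overlapping match of `pattern`."""
    return re.findall(pattern, text)


# Build a throwaway package with a stopwords.txt so the sketch runs on its
# own. In a real project the file simply ships inside the installed package.
tmp_dir = tempfile.mkdtemp()
pkg_dir = Path(tmp_dir) / "demo_text_pkg"
pkg_dir.mkdir()
(pkg_dir / "__init__.py").write_text("", encoding="utf-8")
(pkg_dir / "stopwords.txt").write_text("the\nwas\na\n", encoding="utf-8")
sys.path.insert(0, tmp_dir)
importlib.invalidate_caches()

stopwords = load_stopwords("demo_text_pkg")
tokens = regex_tokenize_sketch("The inverter was offline after a grid fault")
filtered = [t for t in tokens if t.lower() not in stopwords]

# A caller-supplied pattern changes what counts as a token, e.g. keeping
# hyphenated terms together by splitting on whitespace only.
hyphen_tokens = regex_tokenize_sketch("dc-side fault", r"\S+")
```

Because the stopword file is static and shipped with the package, nothing is downloaded at runtime; supporting another language would mean supplying a different word list.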
The following changes were made to requirements.txt and setup.py:
Additionally:
test_visualize_word_frequency_plot from test_text has been reenabled to confirm the new return type structure.
Motivation and Context
nltk has caused two problems recently for pvOps:
Since nltk ultimately is not part of the core pvOps functionality, it was determined that removing it from the
package was the best option.
How has this been tested?
The relevant tutorials were re-run to ensure they still work as they did before. The modified functions were
tested with pvOps example data to confirm they produced the same, or near the same, result as before.
Types of changes
Two breaking changes:
1. Removed the first argument of pvops.text.nltk_utils.create_stopwords. It was used for selecting which language to download the nltk stopwords for. Now that the stopwords are static, other languages would need to be manually downloaded.
2. The return type of pvops.text.visualize.visualize_word_frequency_plot has changed. It is now a tuple (figure, dict), where figure is the matplotlib Figure instance and dict is in the format {token: count}.
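For callers adapting to the new return type, the sketch below shows what a {token: count} mapping looks like and how a (figure, dict) tuple would be unpacked. It is an illustration only: count_tokens is a hypothetical helper built on collections.Counter, and None stands in for the matplotlib Figure so the example needs no plotting dependency. It does not call pvOps itself.

```python
from collections import Counter


def count_tokens(tokens):
    """Return a plain dict in the {token: count} format."""
    return dict(Counter(tokens))


tokens = ["inverter", "fault", "inverter", "grid", "fault", "inverter"]

# Stand-in for the new return value: (figure, dict). None replaces the
# matplotlib Figure here purely to keep the sketch dependency-free.
result = (None, count_tokens(tokens))

# Callers that previously used the return value directly now unpack a tuple.
figure, counts = result

# The dict can be post-processed without touching the figure, for example
# to list the two most frequent tokens.
top_two = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:2]
```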
Checklist:
The docs automatically updated to include the changes.