Skip to content

Conversation

@agmoore4
Copy link
Collaborator

@agmoore4 agmoore4 commented Mar 4, 2025

Description

Removes the nltk dependency.

nltk was used for three things in pvOps:

  1. Dynamically downloading the current stopwords list
  2. Tokenizing text documents with its word_tokenize function
  3. Plotting the token frequency from a tokenized document in pvops.text.visualize.visualize_word_frequency_plot

This PR addresses the points by:

  1. Including a static version of the nltk English stopwords in stopwords.txt under pvops.text, which is read by the function pvops.text.nltk_utils.create_stopwords using importlib.resources (comes default with python>=3.7)
    a. Removed the first argument to this function, which denoted the language to grab the stopwords for.
    b. Removed the text module test that checked the stopwords list download from nltk
  2. Writing a new pvops.text.preprocess.regex_tokenize function
    a. Replaced all references to word_tokenize with the new function
    b. Can be used with the default regex pattern (the functionality of which is described in the associated docstring)
    and also allows increased functionality for the user to create their own regex pattern for tokenizing.
  3. Manually providing the functionality previously in pvops.text.visualize.visualize_word_frequency_plot
    a. Care was taken to ensure arguments from the previous version to still work in the new version.

The following changes were made to requirements.txt and setup.py:

  • nltk has been removed.
  • tqdm was added, as it used in pvOps modules but was previously only installed implicitly as a dependency of nltk.

Additionally:

  • test_visualize_word_frequency_plot from test_text has been reenabled to confirm the new return type structure

Motivation and Context

nltk has caused two problems recently for pvOps:

Since nltk ultimately is not part of the core pvOps functionality, it was determined that removing it from the
package was the best option.

How has this been tested?

The relevant tutorials were re-run to ensure they still work as they did before. The modified functions were
tested with pvOps example data to confirm they produced the same, or near the same, result as before.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Two breaking changes:

  1. Removing the first argument of pvops.text.nltk_utils.create_stopwords.
    It was used for selecting which language to download the nltk stopwords for. Now that the stopwords
    are static, other languages would need to be manually downloaded.
  2. The return type of pvops.text.visualize.visualize_word_frequency_plot has changed. It is now a
    tuple (figure, dict) where figure is the matplotlib Figure instance and dict is in the format {token: count}

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.

The docs automatically updated to include the changes.

@tgunda tgunda self-requested a review March 5, 2025 11:55
@agmoore4 agmoore4 merged commit 26c4274 into sandialabs:master Mar 17, 2025
16 checks passed
@agmoore4 agmoore4 deleted the nltk branch March 17, 2025 15:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant