Nltk #108

agmoore4 · 2025-03-04T21:40:39Z

Description

Removes the nltk dependency.

nltk was used for three things in pvOps:

1. Dynamically downloading the current stopwords list
2. Tokenizing text documents with its word_tokenize function
3. Plotting the token frequency from a tokenized document in pvops.text.visualize.visualize_word_frequency_plot

This PR addresses the points by:

1. Including a static copy of the nltk English stopwords in stopwords.txt under pvops.text, which is read by the function pvops.text.nltk_utils.create_stopwords using importlib.resources (part of the standard library for python>=3.7)
   a. Removed the first argument to this function, which selected the language of the stopwords.
   b. Removed the text module test that checked the stopwords list download from nltk
2. Writing a new pvops.text.preprocess.regex_tokenize function
   a. Replaced all references to word_tokenize with the new function
   b. Can be used with the default regex pattern (described in the associated docstring), and also lets users supply their own regex pattern for tokenizing.
3. Reimplementing the functionality that nltk previously provided in pvops.text.visualize.visualize_word_frequency_plot
   a. Care was taken to ensure arguments from the previous version still work in the new version.
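
For readers unfamiliar with the approach in points 1 and 2, here is a minimal sketch of the general technique, not the code in this PR. The function names, the default pattern, and the one-word-per-line file layout are all assumptions made for illustration; the actual default pattern is the one documented in the regex_tokenize docstring.

```python
import re
from importlib import resources

# Hypothetical default pattern (an assumption, not the pattern pvOps ships):
# a run of word characters, or a single non-space punctuation character.
DEFAULT_PATTERN = r"\w+|[^\w\s]"


def sketch_regex_tokenize(text, pattern=DEFAULT_PATTERN):
    """Split text into tokens with a regex; callers may pass their own pattern."""
    return re.findall(pattern, text)


def sketch_load_stopwords(package, filename="stopwords.txt"):
    """Read a stopword file bundled inside a package (assumed one word per line)."""
    # resources.files() needs Python >= 3.9; on older interpreters
    # resources.read_text(package, filename) is the equivalent call.
    raw = resources.files(package).joinpath(filename).read_text(encoding="utf-8")
    return [line.strip() for line in raw.splitlines() if line.strip()]


def remove_stopwords(tokens, stopwords):
    """Drop tokens that appear in the stopword list, ignoring case."""
    stop = {word.lower() for word in stopwords}
    return [token for token in tokens if token.lower() not in stop]


if __name__ == "__main__":
    tokens = sketch_regex_tokenize("The inverter tripped, reset at 10:30.")
    # A tiny inline list stands in for the packaged stopwords.txt here.
    print(remove_stopwords(tokens, ["the", "at"]))
```

Because the stopword file ships inside the package, the result no longer changes when an upstream list is updated, which is the failure mode described under Motivation and Context.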

The following changes were made to requirements.txt and setup.py:

nltk has been removed.
tqdm was added, as it is used in pvOps modules but was previously only installed implicitly as a dependency of nltk.

Additionally:

test_visualize_word_frequency_plot from test_text has been reenabled to confirm the new return type structure

Motivation and Context

nltk has caused two problems recently for pvOps:

1. A security vulnerability, described and resolved in: Requiring nltk>=3.9.1; switch punkt to punkt_tab #102
2. A broken test case caused by an undocumented change to the English stopwords list: nltk English stopwords changed, breaking test cases #106

Since nltk ultimately is not part of the core pvOps functionality, it was determined that removing it from the
package was the best option.

How has this been tested?

The relevant tutorials were re-run to ensure they still work as they did before. The modified functions were
tested with pvOps example data to confirm they produced the same, or near the same, result as before.

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)

Two breaking changes:

1. Removing the first argument of pvops.text.nltk_utils.create_stopwords. It was used to select which language to download the nltk stopwords for. Now that the stopwords are static, stopwords for other languages would need to be obtained manually.
2. The return type of pvops.text.visualize.visualize_word_frequency_plot has changed. It is now a tuple (figure, dict), where figure is the matplotlib Figure instance and dict is in the format {token: count}.
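
To illustrate the new return shape from the second breaking change, here is a hedged stand-in, not the pvOps implementation. The names count_tokens and sketch_word_frequency_plot and the top_n parameter are hypothetical; only the (figure, dict) shape with {token: count} comes from the description above.

```python
from collections import Counter


def count_tokens(tokens, top_n=None):
    """Return {token: count}, most frequent first; top_n limits the entries."""
    return dict(Counter(tokens).most_common(top_n))


def sketch_word_frequency_plot(tokens, top_n=10):
    """Stand-in returning (figure, dict), mirroring the described return shape."""
    # matplotlib is imported lazily so the counting helper works without it.
    import matplotlib
    matplotlib.use("Agg")  # headless backend, chosen only for this sketch
    import matplotlib.pyplot as plt

    counts = count_tokens(tokens, top_n)
    fig, ax = plt.subplots()
    ax.bar(list(counts.keys()), list(counts.values()))
    ax.set_xlabel("token")
    ax.set_ylabel("count")
    return fig, counts


if __name__ == "__main__":
    tokens = ["fault", "inverter", "fault", "reset", "inverter", "fault"]
    fig, counts = sketch_word_frequency_plot(tokens)
    print(counts)
```

Callers that previously used the single return value would now unpack two values, as in the last lines of the sketch.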

Checklist:

My code follows the code style of this project.
My change requires a change to the documentation.
I have updated the documentation accordingly.

The docs automatically updated to include the changes.

agmoore4 added 12 commits February 27, 2025 12:03

new regex tokenizer with comparisons to nltk

05db260

removing nltk dependency

fdf28d3

small change to text tutorial to align with nltk change

7e5c286

removing nltk import from text.preprocess

a6a87d2

removing nltk import from text.visualize

06cad5c

removed nltk import from text.visualize and added nltk-inspired functionality to visualize_word_frequency_plot

b6177bf

lint changes

d5877d2

adding tqdm as dependency; previously was a dependency of nltk and so installed implicitly

86d29b3

removing nltk from imports in test_text

26a2316

removing all nltk references from test_text

d6b6a80

reenable test_visualize_word_frequency_plot by removing initial x from name

cb76f80

small fix to new visualize function

6ed52cf

tgunda self-requested a review March 5, 2025 11:55

Merge branch 'sandialabs:master' into nltk

080fe70

agmoore4 mentioned this pull request Mar 10, 2025

nltk English stopwords changed, breaking test cases #106

Closed

agmoore4 and others added 7 commits March 12, 2025 10:33

moving stopwords to a static txt file; requiring python>=3.9 for importlib

eb85f13

Merge branch 'sandialabs:master' into nltk

6c4d5e3

Merge branch 'nltk' of https://github.com/agmoore4/pvOps into nltk

014b1f0

removing extraneous commented code

cf93d6c

requiring python>=3.7 for importlib.resources

aefdfaa

using python_requires argument in setup.py

687f2ff

reverting python version requirement for now; to be included in issue

c72fd60

agmoore4 mentioned this pull request Mar 12, 2025

Define minimum python version; update python test versions #110

Open

final changes to documentation and version number

3f43c88

agmoore4 merged commit 26c4274 into sandialabs:master Mar 17, 2025
16 checks passed

agmoore4 deleted the nltk branch March 17, 2025 15:08
