Adding remove_duplicates to llm_util#35

Closed
hnikolov-solirius wants to merge 29 commits into main from
feat/NRR-119-StopWordListv3

Conversation

@hnikolov-solirius
Contributor

Placing stopword removal within analyse_text

@hnikolov-solirius
Contributor Author

Pulling StopWord list in

@hnikolov-solirius force-pushed the feat/NRR-119-StopWordListv3 branch from db40090 to 1e8d67f on February 11, 2026 16:20
"""
stopwords = safe_load(open(os.path.join("config", "stopwords.yaml"), "r"))
stopwords_list = stopwords["stopwords"]
text_to_redact = list(set(text_to_redact) - set(stopwords_list))
Contributor

@shannon-wms Feb 12, 2026

I've realised a better way to do this is text_to_redact = [x for x in text_to_redact if x.lower() not in stopwords_list] (since converting to a set does not preserve the order of the list, and you also want the match to be case-insensitive). This change should make all tests pass.
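
For illustration, here is a minimal sketch comparing the two approaches discussed in this thread. The function names and sample inputs are hypothetical and not taken from the PR, and the sketch assumes the stopword list is stored in lowercase. Note that a Python set has no guaranteed iteration order, so the set-based version can return the surviving items in any order, and it also collapses duplicates.

```python
def remove_stopwords_set(text_to_redact, stopwords_list):
    """Set-difference approach from the original diff (hypothetical wrapper).

    Order is not preserved, duplicates are collapsed, and the comparison
    is case-sensitive, so "The" is kept when only "the" is a stopword.
    """
    return list(set(text_to_redact) - set(stopwords_list))


def remove_stopwords_ordered(text_to_redact, stopwords_list):
    """List-comprehension approach suggested in the review (hypothetical wrapper).

    Keeps the original order and compares in lowercase. Assumes the
    entries of stopwords_list are already lowercase.
    """
    return [x for x in text_to_redact if x.lower() not in stopwords_list]


if __name__ == "__main__":
    # Sample data for illustration only.
    stopwords_list = ["the", "and", "of"]
    text_to_redact = ["Smith", "The", "London", "and", "Alice"]

    # Sorted here only to make the printed result deterministic.
    print(sorted(remove_stopwords_set(text_to_redact, stopwords_list)))
    print(remove_stopwords_ordered(text_to_redact, stopwords_list))
```

If the stopword list grows large, building a set of stopwords once and testing membership against it would keep the comprehension's ordering behaviour while making each lookup cheaper.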

Contributor Author

I like it, updated

Contributor

@shannon-wms left a comment

Looks good, just some nitpicks

hnikolov and others added 15 commits February 12, 2026 10:26
Read and store PDF text line-by-line with bounding boxes, and search lines to validate redaction candidates found by initially searching the PDF page. Reduces time spent on redaction without spawning additional processes.
Co-authored-by: hnikolov <hristo.nikolov@solirus.com>
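
The line-by-line validation described in the commit message above could look roughly like the following. This is only a sketch under stated assumptions, not the code from the commit: the PdfLine type, the validate_candidates function and the bounding-box layout (x0, y0, x1, y1) are all hypothetical, and the PDF extraction step itself is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PdfLine:
    """One line of extracted PDF text with its bounding box (hypothetical type)."""
    text: str
    bbox: tuple  # assumed layout: (x0, y0, x1, y1)


def validate_candidates(lines, candidates):
    """Return (candidate, bbox) pairs for candidates found in a stored line.

    Candidates that appear in no stored line are dropped, which is one way
    to validate hits from an initial page-level search.
    """
    matches = []
    for candidate in candidates:
        for line in lines:
            if candidate in line.text:
                matches.append((candidate, line.bbox))
    return matches


if __name__ == "__main__":
    # Sample data for illustration only.
    lines = [
        PdfLine("Dear Alice Smith,", (72.0, 700.0, 300.0, 712.0)),
        PdfLine("Meet in London.", (72.0, 680.0, 250.0, 692.0)),
    ]
    print(validate_candidates(lines, ["Alice Smith", "Paris"]))
```

Because the lines are read and stored once, each candidate check is a plain in-memory string search, which fits the commit's stated aim of reducing redaction time without spawning additional processes.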
* Add more comprehensive logging. Update and improve test coverage

* Improve error message when writing blobs. Improve logging. Fix bug in face redaction which resulted in the bounding box being incorrect when highlighting

* Improve styling

* Update test case

* Improve logs to be clearer. Fix verbose imports

* Fix broken test
shannon-wms and others added 5 commits February 13, 2026 10:38
Read and store PDF text line-by-line with bounding boxes, and search lines to validate redaction candidates found by initially searching the PDF page. Reduces time spent on redaction without spawning additional processes.
* Add more comprehensive logging. Update and improve test coverage

* Improve error message when writing blobs. Improve logging. Fix bug in face redaction which resulted in the bounding box being incorrect when highlighting

* Improve styling

* Update test case

* Improve logs to be clearer. Fix verbose imports

* Fix broken test
@HarrisonBoyleThomas
Collaborator

Replaced by #98

3 participants