Adding remove_duplicates to llm_util by hnikolov-solirius · Pull Request #35 · Planning-Inspectorate/redaction-system

hnikolov-solirius · 2026-02-03T12:16:27Z

Placing stopword removal within analyse_text

hnikolov-solirius · 2026-02-03T13:35:41Z

Pulling StopWord list in

shannon-wms · 2026-02-12T09:43:38Z

redactor/core/redaction/redactor.py

+        """
+        stopwords = safe_load(open(os.path.join("config", "stopwords.yaml"), "r"))
+        stopwords_list = stopwords["stopwords"]
+        text_to_redact = list(set(text_to_redact) - set(stopwords_list))


I've realised a better way to do this is `text_to_redact = [x for x in text_to_redact if x.lower() not in stopwords_list]`, since the set difference does not preserve the order of the list and you also want the match to be case-insensitive. This change should make all tests pass.

I like it, updated
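A minimal sketch of the filter suggested in this thread, assuming the stopwords have already been loaded into a plain list (the diff reads them from `config/stopwords.yaml`). The function signature and sample values are illustrative and not taken from the repository; lower-casing the stopword entries as well is an added assumption so that mixed-case entries in the list still match.

```python
def remove_stopwords(text_to_redact, stopwords_list):
    """Drop stopwords case-insensitively while keeping the original order."""
    # Normalise the stopwords once; a set gives constant-time membership checks.
    stopwords = {word.lower() for word in stopwords_list}
    # A list comprehension keeps the input order and any repeated candidates.
    return [x for x in text_to_redact if x.lower() not in stopwords]


# Hypothetical redaction candidates, for illustration only.
candidates = ["The", "John Smith", "and", "Bristol", "John Smith"]
print(remove_stopwords(candidates, ["the", "and"]))
# -> ['John Smith', 'Bristol', 'John Smith']
```

Unlike `list(set(a) - set(b))`, this keeps the candidates in their original order and does not collapse duplicates, so any deduplication would have to happen as a separate step.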

shannon-wms

Looks good, just some nitpicks

Read and store PDF text line-by-line with bounding boxes, and search lines to validate redaction candidates found by initially searching the PDF page. Reduces time spent on redaction without spawning additional processes.

Co-authored-by: hnikolov <hristo.nikolov@solirus.com>

* Add more comprehensive logging. Update and improve test coverage
* Improve error message when writing blobs. Improve logging. Fix bug in face redaction which resulted in the bounding box being incorrect when highlighting
* Improve styling
* Update test case
* Improve logs to be clearer. Fix verbose imports
* Fix broken test

HarrisonBoyleThomas · 2026-03-25T17:19:29Z

Replaced by #98

hnikolov-solirius requested a review from shannon-wms February 3, 2026 13:37

hnikolov-solirius force-pushed the feat/NRR-119-StopWordListv3 branch from e13e9d2 to d59167e Compare February 5, 2026 15:31

hnikolov added 8 commits February 11, 2026 16:20

Adding remove_duplicates to llm_util

bd72d1a

Back to redactor.py, no config_processor dependency

b914e51

Tests with Shannon

002fba8

_

ed2b1a3

Fixing error re positional argument

0c11acd

patching in LLMTextRedactor to remove_stopwords test

fc492df

Objectifying LLM_text_redactor

d470922

rebased

1e8d67f

hnikolov-solirius force-pushed the feat/NRR-119-StopWordListv3 branch from db40090 to 1e8d67f Compare February 11, 2026 16:20

shannon-wms reviewed Feb 12, 2026

Changed to capture case variance

b546cb9

shannon-wms reviewed Feb 12, 2026

hnikolov and others added 15 commits February 12, 2026 10:26

Ruff formatting

7a699d8

failing integration test - moving stopwords timing

66c3979

Addressing failed imports

ca97d4b

removed redundant reference in ImageLLMTextRedaction

699708d

Alternative PDF redaction highlighting method (#49)

116f364

Read and store PDF text line-by-line with bounding boxes, and search lines to validate redaction candidates found by initially searching the PDF page. Reduces time spent on redaction without spawning additional processes.
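The idea in this commit message can be sketched roughly as follows. This is not the code from #49: the `PdfLine` type, the `validate_candidates` function and the sample coordinates are all hypothetical, and plain substring matching stands in for whatever search the real implementation uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PdfLine:
    """One line of page text with its bounding box (x0, y0, x1, y1)."""
    text: str
    bbox: tuple[float, float, float, float]


def validate_candidates(lines, candidates):
    """Map each candidate to the bounding boxes of the stored lines containing it."""
    matches = {}
    for candidate in candidates:
        boxes = [line.bbox for line in lines if candidate in line.text]
        if boxes:  # keep only candidates confirmed by at least one stored line
            matches[candidate] = boxes
    return matches


# Hypothetical lines, as if extracted once from a single page.
page_lines = [
    PdfLine("Dear John Smith,", (72.0, 100.0, 200.0, 112.0)),
    PdfLine("Thank you for your letter.", (72.0, 116.0, 260.0, 128.0)),
]
print(validate_candidates(page_lines, ["John Smith", "Jane Doe"]))
# -> {'John Smith': [(72.0, 100.0, 200.0, 112.0)]}
```

Because the lines are read and stored once, each candidate is checked against in-memory text rather than triggering another pass over the PDF.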

Combined prompt (#53)

dd79f67

Co-authored-by: hnikolov <hristo.nikolov@solirus.com>

Adding remove_duplicates to llm_util

33ce118

Back to redactor.py, no config_processor dependency

c6a10b9

Addressing failed imports

2bb7109

Alternative PDF redaction highlighting method (#49)

2b35c70

Read and store PDF text line-by-line with bounding boxes, and search lines to validate redaction candidates found by initially searching the PDF page. Reduces time spent on redaction without spawning additional processes.

missing imports

967d490

Adding remove_duplicates to llm_util

8ebc1b7

Addressing failed imports

2586365

shannon-wms and others added 5 commits February 13, 2026 10:38

Alternative PDF redaction highlighting method (#49)

cdf226b

Read and store PDF text line-by-line with bounding boxes, and search lines to validate redaction candidates found by initially searching the PDF page. Reduces time spent on redaction without spawning additional processes.

Alternative PDF redaction highlighting method (#49)

7457815

Read and store PDF text line-by-line with bounding boxes, and search lines to validate redaction candidates found by initially searching the PDF page. Reduces time spent on redaction without spawning additional processes.

.

1cd5cca

.

ad94b5c

HarrisonBoyleThomas closed this Mar 25, 2026

hnikolov-solirius wants to merge 29 commits into main from feat/NRR-119-StopWordListv3
