Conversation
Pull request overview
Adds backend support for KWIC (keyword-in-context) search and project-level n‑gram frequency analysis backed by Elasticsearch, exposing new analysis endpoints for the frontend.
Changes:
- Introduces `/analysis/kwic` endpoint and KWIC snippet extraction/sorting logic.
- Introduces `/analysis/ngrams` endpoint and Elasticsearch aggregation to fetch frequent n‑grams.
- Extends the SDoc Elasticsearch index mapping/settings with unigram/bigram/trigram subfields and analyzers.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 15 comments.
| File | Description |
|---|---|
| docker/compose.yml | Clarifies Elasticsearch JVM heap env var comment. |
| backend/src/modules/analysis/kwic_analysis.py | Implements KWIC snippet extraction and frequency-based sorting. |
| backend/src/modules/analysis/analysis_endpoint.py | Adds /analysis/kwic and /analysis/ngrams endpoints. |
| backend/src/core/doc/sdoc_kwic_dto.py | Adds DTOs/enums for KWIC + n‑gram responses. |
| backend/src/core/doc/sdoc_elastic_index.py | Updates ES index settings/mappings to support n‑grams + fast-vector-highlighter. |
| backend/src/core/doc/sdoc_elastic_crud.py | Adds ES CRUD methods for KWIC search and n‑gram aggregation. |
```diff
   - xpack.security.enabled=false # no auth
   - discovery.type=single-node # single node cluster
-  - ES_JAVA_OPTS=-Xms512m -Xmx512m
+  - ES_JAVA_OPTS=-Xms512m -Xmx512m # heap size, can be changed to -Xms2g
```
The comment suggests changing only `-Xms` to `-Xms2g`, but heap settings should typically keep `-Xms` and `-Xmx` aligned to avoid resizing overhead and to make the effective heap limit explicit. Consider adjusting the comment to suggest updating both values together.
Suggested change:
```diff
-  - ES_JAVA_OPTS=-Xms512m -Xmx512m # heap size, can be changed to -Xms2g
+  - ES_JAVA_OPTS=-Xms512m -Xmx512m # heap size, e.g. can be changed to -Xms2g -Xmx2g
```
```python
if "<em>" in w:
    words[i] = w[4:-5]
```
`_get_kwic_from_highlight` assumes the highlighted token is exactly `<em>word</em>` and strips with fixed slicing (`w[4:-5]`). ES highlights can wrap punctuation (e.g. `<em>word</em>,`) or span multiple tokens for phrase matches, which will produce corrupted keywords / missed indices. Consider stripping tags with a safer approach (e.g., remove `<em>`/`</em>` via regex/HTML parsing) and handling cases where the opening and closing tags occur in different tokens.
Suggested change:
```diff
-if "<em>" in w:
-    words[i] = w[4:-5]
+# Remember whether this token originally contained an opening <em> tag.
+has_open_tag = "<em>" in w
+# Safely strip both opening and closing emphasis tags without assuming
+# a fixed position or length, so punctuation and multi-token phrases
+# are handled correctly.
+if "<em>" in w or "</em>" in w:
+    w = w.replace("<em>", "").replace("</em>", "")
+    words[i] = w
+else:
+    words[i] = w
+# Preserve existing behavior: only tokens that originally had an
+# opening <em> tag are treated as keyword centers.
+if has_open_tag:
```
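The same cleanup can also be expressed with a small regex helper. The sketch below is hypothetical (the function name and return shape are not from the PR); it assumes ES emits `<em>`/`</em>` highlight tags and that any token containing an opening tag marks a keyword position:

```python
import re

# Matches both the opening and closing highlight tags emitted by Elasticsearch.
EM_TAG_RE = re.compile(r"</?em>")

def strip_highlight_tags(words: list[str]) -> tuple[list[str], list[int]]:
    # Record which tokens contained an opening <em> tag before stripping,
    # so multi-token phrase highlights still yield a keyword index.
    keyword_indices = [i for i, w in enumerate(words) if "<em>" in w]
    cleaned = [EM_TAG_RE.sub("", w) for w in words]
    return cleaned, keyword_indices
```

With this approach a highlighted token with trailing punctuation such as `<em>word</em>,` survives as `word,` instead of being mangled by fixed slicing.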
```python
    window: int = 5,
    direction: Direction = Direction.LEFT,
    # fuzziness: int = 0,
    page_number: int = 1,
    page_size: int = 10,
    authz_user: AuthzUser = Depends(),
) -> PaginatedElasticSearchKwicSnippets:
    authz_user.assert_in_project(project_id)
    skip = ((page_number - 1) * page_size) if page_number and page_size else 0
    limit = page_size
```
`page_number`, `page_size`, and `window` are not validated. Negative/zero values can produce a negative `skip` (which will return results from the end via pandas slicing) or an invalid highlight `fragment_size`. Add FastAPI validation (e.g., `ge=1` for `page_number`/`page_size`, `ge=1` for `window`) or clamp these values before computing `skip`.
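At the framework level this maps to FastAPI's `Query(..., ge=1)` constraints; the clamping alternative can be sketched as a plain function (the function name and the upper bound are assumptions, not the project's code):

```python
def compute_pagination(
    page_number: int, page_size: int, max_page_size: int = 100
) -> tuple[int, int]:
    # Clamp to sane bounds so negative/zero inputs can never produce a
    # negative skip (which pandas slicing would interpret as "from the end").
    page_number = max(1, page_number)
    page_size = min(max(1, page_size), max_page_size)
    skip = (page_number - 1) * page_size
    return skip, page_size
```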
```python
def search_sdocs_unigrams(
    *,
    project_id: int,
    search_query: str = "",
    limit: int = 20,
    exact: bool = False,
    ngrams: Ngrams = Ngrams.BIGRAM,
    ascending: bool = False,
    authz_user: AuthzUser = Depends(),
```
Endpoint handler is named `search_sdocs_unigrams`, but it defaults to `Ngrams.BIGRAM` and supports multiple n-gram sizes. Rename the function to reflect what it does (e.g., `search_sdocs_ngrams`) to avoid confusion in logs/metrics and when importing.
```python
"include": f"(.*[^A-Za-z0-9_])?{term}([^A-Za-z0-9_].*)?"
if exact
else f".*{term}.*",
"order": {"_count": "asc" if ascending else "desc"},
```
`fetch_ngrams` interpolates the user-provided `term` directly into the terms aggregation `include` regex. Regex metacharacters in `term` can change the query semantics or cause expensive regex evaluation (potential ReDoS). Escape `term` before building the regex (and consider rejecting overly long terms).
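One way to do this, sketched with Python's `re.escape` (the function name and length cap are hypothetical; note that Lucene's regexp dialect is close to but not identical to Python's, so the escaping should be verified against the Elasticsearch docs):

```python
import re

def build_include_regex(term: str, exact: bool, max_len: int = 100) -> str:
    # Bound the term length up front to limit regex evaluation cost.
    if len(term) > max_len:
        raise ValueError("search term too long")
    # Backslash-escape regex metacharacters so user input is matched
    # literally inside the terms-aggregation include pattern.
    escaped = re.escape(term)
    if exact:
        return f"(.*[^A-Za-z0-9_])?{escaped}([^A-Za-z0-9_].*)?"
    return f".*{escaped}.*"
```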
```python
# TODO define ENUM here and use it in the endpoint ngram
# TODO use enum for direction in kwic analysis
```
The TODO comments about defining/using enums are now stale (enums are already defined and used). Consider removing or updating them to reflect the remaining work, otherwise they add noise and can mislead future readers.
Suggested change:
```diff
-# TODO define ENUM here and use it in the endpoint ngram
-# TODO use enum for direction in kwic analysis
```
```python
                "fragment_size": window * 20,
                "number_of_fragments": 9999,
                "no_match_size": 0,
                "pre_tags": ["<em>"],
                "post_tags": ["</em>"],
            }
        }
    },
    size=9999,
)
```
`search_sdocs_for_kwic` hard-codes `size=9999` and sets `number_of_fragments=9999`. This can create very large responses and load on Elasticsearch for common terms, and it bypasses any pagination semantics from the API layer. Consider adding a hard upper bound and/or using proper pagination/scrolling, and make the limits configurable.
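A minimal sketch of such a bound, assuming the API-level page parameters are pushed down into the ES request (the function name and limit values are illustrative only):

```python
def bounded_kwic_params(
    page_number: int,
    page_size: int,
    max_page_size: int = 100,
    max_fragments: int = 50,
) -> dict:
    # Let Elasticsearch paginate via from/size instead of fetching 9999 hits
    # and slicing in Python; cap both the page size and the fragment count.
    page_number = max(1, page_number)
    page_size = min(max(1, page_size), max_page_size)
    return {
        "from": (page_number - 1) * page_size,
        "size": page_size,
        "number_of_fragments": max_fragments,
    }
```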
```python
    }
    search_res = client.search(
        index=index,
        body=query,
```
This module generally calls `client.search(...)` using the `query=` keyword (see `ElasticCrudBase.search`). Here `fetch_ngrams` uses `body=` instead, which is inconsistent and may break depending on the Elasticsearch client version/config. Align the call signature with the rest of the codebase (or centralize the ES search invocation) to avoid version-specific behavior.
Suggested change:
```diff
-        body=query,
+        query=query,
```
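Centralizing the invocation could look roughly like this (a hypothetical sketch: the wrapper name and body are not the project's code, and it assumes the 8.x client's top-level `query=` keyword):

```python
class ElasticSearchWrapper:
    """Single place that talks to the ES client, so every caller uses the
    same keyword arguments regardless of client version quirks."""

    def __init__(self, client):
        self.client = client

    def search(self, *, index: str, query: dict, **kwargs):
        # elasticsearch-py 8.x accepts the query as a top-level keyword.
        return self.client.search(index=index, query=query, **kwargs)
```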
```python
"unigrams": {
    "type": "text",
    "analyzer": "unigram_analyzer",
    "fielddata": True,
},
"bigrams": {
    "type": "text",
    "analyzer": "bigram_analyzer",
    "fielddata": True,
},
"trigrams": {
    "type": "text",
    "analyzer": "trigram_analyzer",
    "fielddata": True,
},
```
The new n-gram subfields enable `fielddata` on `text` fields. This can significantly increase Elasticsearch heap usage (fielddata is loaded into memory for aggregations), especially for large corpora. If possible, consider alternative mappings/approaches that avoid fielddata, or at least document the expected heap impact and ensure operational limits/monitoring are in place.
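One fielddata-free alternative, as a sketch rather than the project's actual approach: compute the n-grams in Python while building the ES document and store them in `keyword` array fields, which support terms aggregations through on-disk doc_values instead of in-heap fielddata (helper names are hypothetical):

```python
def build_ngrams(tokens: list[str], n: int) -> list[str]:
    # Slide an n-token window over the token list and join each window
    # into a single string, e.g. ["a", "b", "c"] with n=2 -> ["a b", "b c"].
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_fields(tokens: list[str]) -> dict:
    # Values for keyword-typed array fields in the ES document body.
    return {
        "unigrams": build_ngrams(tokens, 1),
        "bigrams": build_ngrams(tokens, 2),
        "trigrams": build_ngrams(tokens, 3),
    }
```

The trade-off is larger index size and losing the analyzer-driven normalization, so it is worth benchmarking against the fielddata mapping on a realistic corpus.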
```python
# STOPWORDS = {"the", "is", "in", "and", "to", "a"}
# def normalize_tokens(words: list[str]) -> list[str]:
#     return [
#         w.strip(string.punctuation).lower()
#         for w in words
#         if w.strip(string.punctuation).lower() not in STOPWORDS
#     ]
```
This comment appears to contain commented-out code.
Suggested change:
```diff
-# STOPWORDS = {"the", "is", "in", "and", "to", "a"}
-# def normalize_tokens(words: list[str]) -> list[str]:
-#     return [
-#         w.strip(string.punctuation).lower()
-#         for w in words
-#         if w.strip(string.punctuation).lower() not in STOPWORDS
-#     ]
```