Conversation
Pull request overview
Adds backend support for KWIC (keyword-in-context) search and project-level n‑gram frequency analysis backed by Elasticsearch, exposing new analysis endpoints for the frontend.
Changes:
- Introduces `/analysis/kwic` endpoint and KWIC snippet extraction/sorting logic.
- Introduces `/analysis/ngrams` endpoint and Elasticsearch aggregation to fetch frequent n‑grams.
- Extends the SDoc Elasticsearch index mapping/settings with unigram/bigram/trigram subfields and analyzers.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 15 comments.
| File | Description |
|---|---|
| docker/compose.yml | Clarifies Elasticsearch JVM heap env var comment. |
| backend/src/modules/analysis/kwic_analysis.py | Implements KWIC snippet extraction and frequency-based sorting. |
| backend/src/modules/analysis/analysis_endpoint.py | Adds /analysis/kwic and /analysis/ngrams endpoints. |
| backend/src/core/doc/sdoc_kwic_dto.py | Adds DTOs/enums for KWIC + n‑gram responses. |
| backend/src/core/doc/sdoc_elastic_index.py | Updates ES index settings/mappings to support n‑grams + fast-vector-highlighter. |
| backend/src/core/doc/sdoc_elastic_crud.py | Adds ES CRUD methods for KWIC search and n‑gram aggregation. |
```diff
   - xpack.security.enabled=false # no auth
   - discovery.type=single-node # single node cluster
-  - ES_JAVA_OPTS=-Xms512m -Xmx512m
+  - ES_JAVA_OPTS=-Xms512m -Xmx512m # heap size, can be changed to -Xms2g
```
The comment suggests changing only `-Xms` to `-Xms2g`, but heap settings should typically keep `-Xms` and `-Xmx` aligned to avoid resizing overhead and to make the effective heap limit explicit. Consider adjusting the comment to suggest updating both values together.
Suggested change:
```diff
-  - ES_JAVA_OPTS=-Xms512m -Xmx512m # heap size, can be changed to -Xms2g
+  - ES_JAVA_OPTS=-Xms512m -Xmx512m # heap size, e.g. can be changed to -Xms2g -Xmx2g
```
```python
if "<em>" in w:
    words[i] = w[4:-5]
```
`_get_kwic_from_highlight` assumes the highlighted token is exactly `<em>word</em>` and strips with fixed slicing (`w[4:-5]`). ES highlights can wrap punctuation (e.g. `<em>word</em>,`) or span multiple tokens for phrase matches, which will produce corrupted keywords / missed indices. Consider stripping tags with a safer approach (e.g., remove `<em>`/`</em>` via regex/HTML parsing) and handling cases where the opening and closing tags occur in different tokens.
Suggested change:
```diff
-if "<em>" in w:
-    words[i] = w[4:-5]
+# Remember whether this token originally contained an opening <em> tag.
+has_open_tag = "<em>" in w
+# Safely strip both opening and closing emphasis tags without assuming
+# a fixed position or length, so punctuation and multi-token phrases
+# are handled correctly.
+if "<em>" in w or "</em>" in w:
+    w = w.replace("<em>", "").replace("</em>", "")
+    words[i] = w
+else:
+    words[i] = w
+# Preserve existing behavior: only tokens that originally had an
+# opening <em> tag are treated as keyword centers.
+if has_open_tag:
```
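The same cleanup can also be expressed with a small regex helper. The sketch below is hypothetical (the function name and return shape are not from the PR); it assumes ES emits `<em>`/`</em>` highlight tags and that any token containing an opening tag marks a keyword position:

```python
import re

# Matches both the opening and closing highlight tags emitted by Elasticsearch.
EM_TAG_RE = re.compile(r"</?em>")

def strip_highlight_tags(words: list[str]) -> tuple[list[str], list[int]]:
    # Record which tokens contained an opening <em> tag before stripping,
    # so multi-token phrase highlights still yield a keyword index.
    keyword_indices = [i for i, w in enumerate(words) if "<em>" in w]
    cleaned = [EM_TAG_RE.sub("", w) for w in words]
    return cleaned, keyword_indices
```

With this approach a highlighted token with trailing punctuation such as `<em>word</em>,` survives as `word,` instead of being mangled by fixed slicing.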
```python
    window: int = 5,
    direction: Direction = Direction.LEFT,
    # fuzziness: int = 0,
    page_number: int = 1,
    page_size: int = 10,
    authz_user: AuthzUser = Depends(),
) -> PaginatedElasticSearchKwicSnippets:
    authz_user.assert_in_project(project_id)
    skip = ((page_number - 1) * page_size) if page_number and page_size else 0
    limit = page_size
```
`page_number`, `page_size`, and `window` are not validated. Negative/zero values can produce a negative `skip` (which will return results from the end via pandas slicing) or an invalid highlight `fragment_size`. Add FastAPI validation (e.g., `ge=1` for `page_number`/`page_size`, `ge=1` for `window`) or clamp these values before computing `skip`.
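At the framework level this maps to FastAPI's `Query(..., ge=1)` constraints; the clamping alternative can be sketched as a plain function (the function name and the upper bound are assumptions, not the project's code):

```python
def compute_pagination(
    page_number: int, page_size: int, max_page_size: int = 100
) -> tuple[int, int]:
    # Clamp to sane bounds so negative/zero inputs can never produce a
    # negative skip (which pandas slicing would interpret as "from the end").
    page_number = max(1, page_number)
    page_size = min(max(1, page_size), max_page_size)
    skip = (page_number - 1) * page_size
    return skip, page_size
```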
```python
def search_sdocs_unigrams(
    *,
    project_id: int,
    search_query: str = "",
    limit: int = 20,
    exact: bool = False,
    ngrams: Ngrams = Ngrams.BIGRAM,
    ascending: bool = False,
    authz_user: AuthzUser = Depends(),
```
Endpoint handler is named `search_sdocs_unigrams`, but it defaults to `Ngrams.BIGRAM` and supports multiple n-gram sizes. Rename the function to reflect what it does (e.g., `search_sdocs_ngrams`) to avoid confusion in logs/metrics and when importing.
```python
"include": f"(.*[^A-Za-z0-9_])?{term}([^A-Za-z0-9_].*)?"
if exact
else f".*{term}.*",
"order": {"_count": "asc" if ascending else "desc"},
```
`fetch_ngrams` interpolates the user-provided `term` directly into the terms aggregation `include` regex. Regex metacharacters in `term` can change the query semantics or cause expensive regex evaluation (potential ReDoS). Escape `term` before building the regex (and consider rejecting overly long terms).
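One way to do this, sketched with Python's `re.escape` (the function name and length cap are hypothetical; note that Lucene's regexp dialect is close to but not identical to Python's, so the escaping should be verified against the Elasticsearch docs):

```python
import re

def build_include_regex(term: str, exact: bool, max_len: int = 100) -> str:
    # Bound the term length up front to limit regex evaluation cost.
    if len(term) > max_len:
        raise ValueError("search term too long")
    # Backslash-escape regex metacharacters so user input is matched
    # literally inside the terms-aggregation include pattern.
    escaped = re.escape(term)
    if exact:
        return f"(.*[^A-Za-z0-9_])?{escaped}([^A-Za-z0-9_].*)?"
    return f".*{escaped}.*"
```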
```python
# TODO define ENUM here and use it in the endpoint ngram
# TODO use enum for direction in kwic analysis
```
The TODO comments about defining/using enums are now stale (enums are already defined and used). Consider removing or updating them to reflect the remaining work, otherwise they add noise and can mislead future readers.
Suggested change:
```diff
-# TODO define ENUM here and use it in the endpoint ngram
-# TODO use enum for direction in kwic analysis
```
```python
                "fragment_size": window * 20,
                "number_of_fragments": 9999,
                "no_match_size": 0,
                "pre_tags": ["<em>"],
                "post_tags": ["</em>"],
            }
        }
    },
    size=9999,
)
```
`search_sdocs_for_kwic` hard-codes `size=9999` and sets `number_of_fragments=9999`. This can create very large responses and load on Elasticsearch for common terms, and it bypasses any pagination semantics from the API layer. Consider adding a hard upper bound and/or using proper pagination/scrolling, and make the limits configurable.
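A minimal sketch of such a bound, assuming the API-level page parameters are pushed down into the ES request (the function name and limit values are illustrative only):

```python
def bounded_kwic_params(
    page_number: int,
    page_size: int,
    max_page_size: int = 100,
    max_fragments: int = 50,
) -> dict:
    # Let Elasticsearch paginate via from/size instead of fetching 9999 hits
    # and slicing in Python; cap both the page size and the fragment count.
    page_number = max(1, page_number)
    page_size = min(max(1, page_size), max_page_size)
    return {
        "from": (page_number - 1) * page_size,
        "size": page_size,
        "number_of_fragments": max_fragments,
    }
```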
```python
    }
    search_res = client.search(
        index=index,
        body=query,
```
This module generally calls `client.search(...)` using the `query=` keyword (see `ElasticCrudBase.search`). Here `fetch_ngrams` uses `body=` instead, which is inconsistent and may break depending on the Elasticsearch client version/config. Align the call signature with the rest of the codebase (or centralize the ES search invocation) to avoid version-specific behavior.
Suggested change:
```diff
-        body=query,
+        query=query,
```
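Centralizing the invocation could look roughly like this (a hypothetical sketch: the wrapper name and body are not the project's code, and it assumes the 8.x client's top-level `query=` keyword):

```python
class ElasticSearchWrapper:
    """Single place that talks to the ES client, so every caller uses the
    same keyword arguments regardless of client version quirks."""

    def __init__(self, client):
        self.client = client

    def search(self, *, index: str, query: dict, **kwargs):
        # elasticsearch-py 8.x accepts the query as a top-level keyword.
        return self.client.search(index=index, query=query, **kwargs)
```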
```python
"unigrams": {
    "type": "text",
    "analyzer": "unigram_analyzer",
    "fielddata": True,
},
"bigrams": {
    "type": "text",
    "analyzer": "bigram_analyzer",
    "fielddata": True,
},
"trigrams": {
    "type": "text",
    "analyzer": "trigram_analyzer",
    "fielddata": True,
},
```
The new n-gram subfields enable `fielddata` on `text` fields. This can significantly increase Elasticsearch heap usage (fielddata is loaded into memory for aggregations), especially for large corpora. If possible, consider alternative mappings/approaches that avoid fielddata, or at least document the expected heap impact and ensure operational limits/monitoring are in place.
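One fielddata-free alternative, as a sketch rather than the project's actual approach: compute the n-grams in Python while building the ES document and store them in `keyword` array fields, which support terms aggregations through on-disk doc_values instead of in-heap fielddata (helper names are hypothetical):

```python
def build_ngrams(tokens: list[str], n: int) -> list[str]:
    # Slide an n-token window over the token list and join each window
    # into a single string, e.g. ["a", "b", "c"] with n=2 -> ["a b", "b c"].
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_fields(tokens: list[str]) -> dict:
    # Values for keyword-typed array fields in the ES document body.
    return {
        "unigrams": build_ngrams(tokens, 1),
        "bigrams": build_ngrams(tokens, 2),
        "trigrams": build_ngrams(tokens, 3),
    }
```

The trade-off is larger index size and losing the analyzer-driven normalization, so it is worth benchmarking against the fielddata mapping on a realistic corpus.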
```python
# STOPWORDS = {"the", "is", "in", "and", "to", "a"}
# def normalize_tokens(words: list[str]) -> list[str]:
#     return [
#         w.strip(string.punctuation).lower()
#         for w in words
#         if w.strip(string.punctuation).lower() not in STOPWORDS
#     ]
```
This comment appears to contain commented-out code.
Suggested change:
```diff
-# STOPWORDS = {"the", "is", "in", "and", "to", "a"}
-# def normalize_tokens(words: list[str]) -> list[str]:
-#     return [
-#         w.strip(string.punctuation).lower()
-#         for w in words
-#         if w.strip(string.punctuation).lower() not in STOPWORDS
-#     ]
```