
Fix/four bugs #12

Open
ToruOkadaOi wants to merge 6 commits into Geeksfino:main from ToruOkadaOi:fix/four-bugs

Conversation


@ToruOkadaOi ToruOkadaOi commented Feb 26, 2026

Fix four bugs affecting KB-loaded MCP server instances

Bug 1 — fix(config): Auto-detect embedding model from KB archive

The hardcoded all-MiniLM-L6-v2 model caused a silent dimension mismatch
when the KB was built with a different model. Adds _detect_model_from_archive()
to read the model path from config.json inside the tar archive at load time.
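The detection logic described above can be sketched as follows. The function name, candidate paths, and fallback behavior come from the PR description and review diffs; the exact signature (a module-level function taking the archive path and a default model) is an assumption for illustration.

```python
import json
import logging
import tarfile

logger = logging.getLogger(__name__)

def detect_model_from_archive(kb_path: str, default_model: str) -> str:
    """Read the embedding model path from config.json inside a KB tar archive."""
    try:
        with tarfile.open(kb_path, "r:gz") as tar:
            for candidate in ["config.json", "./config.json", "embeddings/config.json"]:
                try:
                    f = tar.extractfile(candidate)
                    if f:
                        with f:
                            model = json.load(f).get("path")
                        if model:
                            logger.info("Detected embedding model from KB archive: %s", model)
                            return model
                except KeyError:
                    continue  # this candidate path is not present in the archive
    except (tarfile.TarError, OSError, json.JSONDecodeError) as e:
        logger.warning("Could not detect model from KB archive: %s", e)
    return default_model  # safe fallback when detection fails
```

The narrowed except clause reflects the reviewer suggestion later in this thread; the actual PR initially caught all exceptions.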

Bug 2 — fix(qa): Check answer field value not just key presence

results[0]["answer"] can exist as a key with a value of None in plain-text
KBs. Checking truthiness via .get("answer") prevents None from propagating
as the return value, fixing the Pydantic "Input should be a valid string" error.
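The check boils down to a one-line change. The helper below is a hypothetical distillation of the fix (the real code lives inside answer_question in qa.py); it shows why value truthiness, not key presence, is the right test.

```python
def extract_answer(results):
    """Return the QA answer only when it is actually present and non-empty.

    'answer' may exist as a key with value None in plain-text KBs that lack
    an extractive QA pipeline, so test the value, not just the key.
    """
    if results and results[0].get("answer"):
        return results[0]["answer"]
    return None  # caller falls back instead of returning None to Pydantic
```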

Bug 3 — fix(search): Replace undefined neighbors() with edges()

The txtai NetworkX wrapper does not expose neighbors(); calling it
raises an AttributeError on any semantic_search call with graph=True.
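A minimal sketch of the corrected connection count, with the empty fallback mentioned in the walkthrough below. The helper name is illustrative; the PR applies this inline in search.py against the graph object returned by txtai.

```python
def connection_count(graph, node_id):
    """Count a node's connections via edges(), which the txtai wrapper defines.

    neighbors() is not exposed on txtai's NetworkX wrapper; edges() returns
    equivalent adjacency data.
    """
    try:
        return len(list(graph.edges(node_id)))
    except Exception:
        return 0  # fall back to zero connections rather than failing the search
```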

Bug 4 — fix(search): SQL fallback when document cache empty on startup

document_cache is only populated via add_documents() at runtime, so a
fresh startup from a tar archive always has an empty cache and every result
returns "No text available" despite the content existing in SQLite.
The SQL fallback resolves text directly from the DB on a cache miss.
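The cache-then-SQL fallback can be sketched as below. The str() cast and quote escaping reflect the review comments later in this thread; resolve_text is a hypothetical helper (the real logic is inline in semantic_search), and the quote doubling stands in for the escape_sql_string helper the reviewers refer to.

```python
def resolve_text(app, doc_id, cache):
    """Resolve document text: in-memory cache first, then txtai SQL fallback."""
    text = cache.get(str(doc_id))
    if text:
        return text
    try:
        # Fall back to the underlying SQLite store. Cast doc_id to str so
        # numeric IDs escape cleanly, then double embedded single quotes.
        safe_id = str(doc_id).replace("'", "''")
        rows = app.search(f"select text from txtai where id = '{safe_id}'")
        if rows and rows[0].get("text"):
            return rows[0]["text"]
    except Exception:
        pass  # fall through to the placeholder below
    return "No text available"
```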

Summary by CodeRabbit

  • New Features

    • Automatically detect and load the correct embedding model from archived knowledge bases at startup.
  • Bug Fixes

    • QA tool now reliably checks for actual answer content.
    • Document retrieval falls back from cache to database when text is missing.
    • Graph search shows more accurate connection counts.

Commit messages

fix(qa): Check answer field value not just key presence

results[0]['answer'] can exist as a key but with value None in plain-text
KBs that don't have an extractive QA pipeline. Checking truthiness
via .get() instead of 'in' prevents None propagating as the return value.
Fixes Pydantic validation error: Input should be a valid string [input_value=None]

fix(search): Replace undefined neighbors() with edges()

The txtai NetworkX wrapper does not expose a neighbors() method.
Calling it raises AttributeError on any semantic_search with graph=True.
edges() returns equivalent adjacency data and IS defined on the wrapper.

fix(search): SQL fallback when document cache empty on startup

document_cache is only populated via add_documents() at runtime.
On fresh server startup from a tar archive the cache is always empty,
causing all semantic_search graph=False results to return 'No text available'
despite content existing in the underlying SQLite DB.

fix(config): Auto-detect embedding model from KB archive

Hardcoded all-MiniLM-L6-v2 (384-dim) causes a silent dimension mismatch
when the KB was built with a different model (e.g. allenai-specter, 768-dim).
Similarity scores become garbage, so answer_question returns None.

Adds _detect_model_from_archive() which reads the model path from
config.json inside the KB tar at load time. It checks three candidate paths:
config.json, ./config.json, embeddings/config.json, with a safe fallback
to the current model_path if detection fails.
Fixes Pydantic validation error: Input should be a valid string [input_value=None]
@gemini-code-assist

Summary of Changes

Hello @ToruOkadaOi, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses four critical bugs affecting Knowledge Base (KB) loaded MCP server instances, significantly improving the robustness and reliability of the system. The changes ensure correct embedding model loading, prevent invalid answer propagation, fix graph traversal errors, and guarantee document text availability regardless of cache state, leading to a more stable and predictable application behavior.

Highlights

  • Embedding Model Auto-detection: Implemented automatic detection of the embedding model from the KB archive's config.json to prevent dimension mismatches when loading knowledge bases built with different models.
  • QA Answer Field Validation: Modified the QA tool to check the actual value of the 'answer' field, not just its presence, to avoid propagating None values which caused Pydantic validation errors.
  • Graph Search API Correction: Corrected the graph search functionality by replacing the undefined neighbors() method with the appropriate edges() method in the txtai NetworkX wrapper.
  • Document Cache SQL Fallback: Added a SQL fallback mechanism for document text retrieval when the in-memory cache is empty, ensuring content is available even on fresh server startups from archive.


Changelog
  • src/txtai_mcp_server/core/config.py
    • Added a private method _detect_model_from_archive to safely extract the embedding model path from a .tar.gz or .tgz knowledge base archive by inspecting its config.json.
    • Modified the create_application method to utilize _detect_model_from_archive when loading an application from an archive, ensuring the correct embedding model is used and overriding the default model_path if a different one is detected.
  • src/txtai_mcp_server/tools/qa.py
    • Updated the logic in answer_question to check the truthiness of the answer field's value using .get("answer") instead of just checking for key presence, preventing None values from being returned and causing Pydantic validation errors.
  • src/txtai_mcp_server/tools/search.py
    • Replaced the call to the non-existent results.neighbors(node_id) with results.edges(node_id) within the semantic_search function when graph=True, resolving an AttributeError.
    • Implemented a SQL fallback mechanism in semantic_search to retrieve document text directly from the txtai SQLite database if the document_cache is empty or misses an entry, addressing issues where documents were not found on fresh server startups.


coderabbitai bot commented Feb 26, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b05b0fc and c633388.

📒 Files selected for processing (1)
  • src/txtai_mcp_server/tools/search.py

📝 Walkthrough

Detects embedding model paths inside KB archive files during application creation, overrides configured model_path when different, adjusts QA answer-presence check to inspect the first result's "answer" value, and changes search to use graph edges and a cache-then-SQL fallback for document text retrieval.

Changes

  • Embedding Model Detection from Archives (src/txtai_mcp_server/core/config.py)

    Added _detect_model_from_archive(kb_path) to read the embedding model path from the archive's config.json. create_application now detects the model from .tar.gz/.tgz archives and overrides self.model_path when the detected model differs. Added tarfile and json imports and logging.
  • QA answer-presence check (src/txtai_mcp_server/tools/qa.py)

    Changed the SQL-based fallback logic to check results[0].get("answer") for presence/truthiness instead of just key existence and a non-empty list.
  • Search: graph edges & text fallback (src/txtai_mcp_server/tools/search.py)

    Graph-based connection counts now use results.edges(node_id) (falling back to empty) instead of results.neighbors(node_id). Document text retrieval first attempts get_document_from_cache(doc_id), then queries txtai SQL (select text from txtai where id = 'doc_id') with exception handling; it falls back to "No text available".

Sequence Diagram

sequenceDiagram
    participant User
    participant ConfigModule as Config<br/>(create_application)
    participant ArchiveReader as Archive<br/>Reader
    participant ModelDetector as Model<br/>Detector
    participant AppConstructor as Application<br/>Constructor

    User->>ConfigModule: call create_application(embeddings_path)
    ConfigModule->>ConfigModule: detect if embeddings_path is archive
    alt archive (.tar.gz/.tgz)
        ConfigModule->>ArchiveReader: open archive and read config.json
        ArchiveReader-->>ModelDetector: provide config content
        ModelDetector->>ModelDetector: parse config variants for model path
        ModelDetector-->>ConfigModule: return detected model_path
        ConfigModule->>ConfigModule: compare detected vs self.model_path
        alt differs
            ConfigModule->>ConfigModule: override self.model_path
            ConfigModule->>ConfigModule: log detection and override
        end
    end
    ConfigModule->>AppConstructor: construct Application with updated model_path
    AppConstructor-->>User: return Application instance

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I sniffed the tar and found a trace,
A hidden model in a boxed-up space,
Edges counted, text fetched with care,
Answers surfaced from cache and SQL there,
A tiny hop, a brighter trail to chase.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

  • Title check — ❓ Inconclusive: The title 'Fix/four bugs' is vague and generic. While it indicates that bugs are being fixed, it does not specify which bugs or what the main changes are, making it difficult for reviewers to grasp the primary focus at a glance. Resolution: consider a more descriptive title, such as 'Fix embedding model detection and QA/search handling', to better communicate the changeset's purpose.

✅ Passed checks (2 passed)

  • Description check — ✅ Passed: Check skipped - CodeRabbit's high-level summary is enabled.
  • Docstring coverage — ✅ Passed: Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces fixes for four distinct bugs, enhancing the stability and correctness of the server. The changes include auto-detecting the embedding model from knowledge base archives, improving answer validation in the QA tool, fixing a graph search method call, and adding a database fallback for the document cache. My review focuses on improving code style, readability, and addressing a critical security vulnerability. I've identified a potential SQL injection and suggested a fix, along with other recommendations to align the code with Python best practices.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
src/txtai_mcp_server/core/config.py (2)

97-100: Mutating self.model_path may have unintended side effects.

Assigning to self.model_path permanently modifies the settings instance. If the settings object is reused or inspected after create_application(), it will reflect the detected model rather than the originally configured value. This is likely intentional for consistency, but worth noting.

Consider using a local variable for the application creation if preserving the original setting value is important:

♻️ Alternative approach (optional)
                 detected_model = self._detect_model_from_archive(self.embeddings_path)
                 if detected_model != self.model_path:
                     logger.info(f"Overriding model_path {self.model_path} with KB-detected model: {detected_model}")
-                    self.model_path = detected_model
+                    # Use detected model for this application without mutating settings
                 return Application(f"path: {self.embeddings_path}")

The mutation may be intentional to keep settings in sync—if so, this comment can be disregarded.


55-64: File handle not explicitly closed after extractfile().

The file object f returned by tar.extractfile() is not explicitly closed. While Python's GC will eventually clean it up, it's better practice to use a context manager or explicit close, especially within a loop.

♻️ Suggested fix
                 for candidate in ["config.json", "./config.json", "embeddings/config.json"]:
                     try:
                         f = tar.extractfile(candidate)
                         if f:
-                            cfg = json.load(f)
-                            model = cfg.get("path")
-                            if model:
-                                logger.info(f"Detected embedding model from KB archive: {model}")
-                                return model
+                            try:
+                                cfg = json.load(f)
+                                model = cfg.get("path")
+                                if model:
+                                    logger.info(f"Detected embedding model from KB archive: {model}")
+                                    return model
+                            finally:
+                                f.close()
                     except KeyError:
                         continue
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/txtai_mcp_server/tools/search.py`:
- Around lines 115-122: The fallback code uses get_document_from_cache(doc_id),
then re-calls get_txtai_app() and builds a raw SQL string with doc_id. Fix by
reusing the existing app instance (do not call get_txtai_app() again) and
escaping doc_id via escape_sql_string(doc_id) before interpolating it into the
SQL passed to app.search (i.e., call app.search(f"select text from txtai where
id = '{escaped}'")). Keep the same fallback semantics (use "No text available"
when sql_result is empty or on exception) and only change the app usage and
SQL construction.
ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e5daf0b and 0ad02a0.

📒 Files selected for processing (3)
  • src/txtai_mcp_server/core/config.py
  • src/txtai_mcp_server/tools/qa.py
  • src/txtai_mcp_server/tools/search.py

- Move tarfile/json imports to top of config.py (PEP 8)
- Fix SQL injection via escape_sql_string in search.py
- Simplify redundant len(results) check in qa.py
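The escape_sql_string helper named in the commit above is not shown in this thread; the sketch below assumes it performs the standard SQLite single-quote doubling, which is the minimal escaping needed for a value embedded in a quoted SQL string literal.

```python
def escape_sql_string(value: str) -> str:
    """Escape a string for embedding inside a single-quoted SQL literal.

    SQLite represents an embedded single quote by doubling it, so an id
    like "o'brien" becomes "o''brien" inside the quoted literal.
    """
    return value.replace("'", "''")
```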

@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (1)
src/txtai_mcp_server/tools/search.py (1)

115-121: ⚠️ Potential issue | 🟠 Major

Cast doc_id to string before escaping to avoid silent fallback failures.

At line 119, escape_sql_string(doc_id) can throw when doc_id is non-string (e.g., numeric IDs). That exception is swallowed, so the fallback incorrectly returns "No text available" even when DB text exists.

💡 Proposed fix
-                                sql_result = app.search(f"select text from txtai where id = '{escape_sql_string(doc_id)}'")
+                                safe_id = escape_sql_string(str(doc_id))
+                                sql_result = app.search(f"select text from txtai where id = '{safe_id}'")
🧹 Nitpick comments (1)
src/txtai_mcp_server/core/config.py (1)

65-67: Narrow the exception type in archive model detection.

Line 65 catches all exceptions, which can mask unrelated coding errors. Prefer handling archive/JSON read errors explicitly and preserving unexpected exceptions.

♻️ Suggested refinement
-        except Exception as e:
+        except (tarfile.TarError, OSError, json.JSONDecodeError, KeyError, TypeError, ValueError) as e:
             logger.warning(f"Could not detect model from KB archive: {e}")

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0ad02a0 and b05b0fc.

📒 Files selected for processing (3)
  • src/txtai_mcp_server/core/config.py
  • src/txtai_mcp_server/tools/qa.py
  • src/txtai_mcp_server/tools/search.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/txtai_mcp_server/tools/qa.py

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
