Conversation
results[0]['answer'] can exist as a key but with value None in plain text KBs that don't have an extractive QA pipeline. Checking truthiness via .get() instead of 'in' prevents None from propagating as the return value. Fixes Pydantic validation error: Input should be a valid string [input_value=None]
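The check described above can be sketched as follows; the helper name `extract_answer` and the result shape are assumptions inferred from the commit message, not the project's actual code:

```python
# Sketch of the answer-presence fix, under assumed names and result shape.
def extract_answer(results):
    """Return the extractive answer text, or None when absent or null."""
    # Old check: "answer" in results[0] passes even when the value is None,
    # letting None propagate and trip Pydantic's string validation.
    # New check: .get() truthiness rejects a missing key AND a None value.
    if results and results[0].get("answer"):
        return results[0]["answer"]
    return None

print(extract_answer([{"answer": None}]))  # None, handled before Pydantic sees it
print(extract_answer([{"answer": "42"}]))  # 42
```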
The txtai NetworkX wrapper does not expose a neighbors() method. Calling it raises AttributeError on any semantic_search with graph=True. edges() returns equivalent adjacency data and IS defined on the wrapper.
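Since the wrapper's exact API is not shown in this PR, the following hedged sketch models `edges()` generically as `(source, target)` pairs and shows how the adjacency that `neighbors()` would have provided can be recovered from them:

```python
# Illustrative only: derives a node's neighbors from an edge list, as the
# fix does with edges() in place of the undefined neighbors() method.
def neighbors_from_edges(edges, node):
    """Collect every node that shares an edge with `node`."""
    adjacent = set()
    for a, b in edges:
        if a == node:
            adjacent.add(b)
        elif b == node:
            adjacent.add(a)
    return adjacent

edges = [("doc1", "doc2"), ("doc2", "doc3"), ("doc1", "doc3")]
print(sorted(neighbors_from_edges(edges, "doc1")))  # ['doc2', 'doc3']
```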
document_cache is only populated via add_documents() at runtime. On fresh server startup from a tar archive the cache is always empty, causing all semantic_search graph=False results to return 'No text available' despite content existing in the underlying SQLite DB.
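A minimal sketch of the cache-then-SQL fallback, assuming a txtai-style `txtai(id, text)` table; the cache and helper names are hypothetical. Plain `sqlite3` is used here so the query can be parameterized, whereas the real server passes a SQL string to `app.search()` and must escape instead:

```python
# Hypothetical names throughout; demonstrates the fallback pattern only.
import sqlite3

document_cache = {}  # populated only by add_documents() at runtime

def get_document_text(conn, doc_id):
    """Prefer the in-memory cache; fall back to the DB on a miss."""
    text = document_cache.get(doc_id)
    if text:
        return text
    # Cache miss (e.g. fresh startup from a tar archive): query SQLite directly.
    row = conn.execute(
        "SELECT text FROM txtai WHERE id = ?", (str(doc_id),)
    ).fetchone()
    return row[0] if row else "No text available"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE txtai (id TEXT PRIMARY KEY, text TEXT)")
conn.execute("INSERT INTO txtai VALUES ('doc1', 'hello world')")
print(get_document_text(conn, "doc1"))  # hello world, despite the empty cache
```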
Hardcoded all-MiniLM-L6-v2 (384-dim) causes silent dimension mismatch when KB was built with a different model (e.g. allenai-specter, 768-dim). Similarity scores become garbage -> answer_question returns None. Adds _detect_model_from_archive() which reads model path from config.json inside the KB tar at load time. Checks three candidate paths: config.json, ./config.json, embeddings/config.json. Safe fallback to current model_path if detection fails. Fixes Pydantic validation error: Input should be a valid string [input_value=None]
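A runnable sketch of that detection logic under the stated assumptions; the real `_detect_model_from_archive()` may differ in signature and logging:

```python
# Reads the embedding model path from config.json inside a KB tar archive,
# checking the three candidate paths named in the commit message.
import io
import json
import tarfile

def detect_model_from_archive(archive_bytes, default="all-MiniLM-L6-v2"):
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:*") as tar:
        for candidate in ["config.json", "./config.json", "embeddings/config.json"]:
            try:
                f = tar.extractfile(candidate)
            except KeyError:
                continue  # candidate not present in this archive
            if f:
                with f:
                    model = json.load(f).get("path")
                if model:
                    return model
    return default  # safe fallback when detection fails

# Build a tiny in-memory archive to exercise the detector.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    payload = json.dumps({"path": "allenai-specter"}).encode()
    info = tarfile.TarInfo("config.json")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))
print(detect_model_from_archive(buf.getvalue()))  # allenai-specter
```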
Summary of Changes

Hello @ToruOkadaOi, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request addresses four critical bugs affecting Knowledge Base (KB) loaded MCP server instances, significantly improving the robustness and reliability of the system. The changes ensure correct embedding model loading, prevent invalid answer propagation, fix graph traversal errors, and guarantee document text availability regardless of cache state, leading to more stable and predictable application behavior.
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro

📒 Files selected for processing (1)
📝 Walkthrough

Detects embedding model paths inside KB archive files during application creation, overrides the configured model_path when it differs, adjusts the QA answer-presence check to inspect the first result's "answer" value, and changes search to use graph edges and a cache-then-SQL fallback for document text retrieval.
Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant ConfigModule as Config<br/>(create_application)
    participant ArchiveReader as Archive<br/>Reader
    participant ModelDetector as Model<br/>Detector
    participant AppConstructor as Application<br/>Constructor
    User->>ConfigModule: call create_application(embeddings_path)
    ConfigModule->>ConfigModule: detect if embeddings_path is archive
    alt archive (.tar.gz/.tgz)
        ConfigModule->>ArchiveReader: open archive and read config.json
        ArchiveReader-->>ModelDetector: provide config content
        ModelDetector->>ModelDetector: parse config variants for model path
        ModelDetector-->>ConfigModule: return detected model_path
        ConfigModule->>ConfigModule: compare detected vs self.model_path
        alt differs
            ConfigModule->>ConfigModule: override self.model_path
            ConfigModule->>ConfigModule: log detection and override
        end
    end
    ConfigModule->>AppConstructor: construct Application with updated model_path
    AppConstructor-->>User: return Application instance
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)
✅ Passed checks (2 passed)
Code Review
This pull request introduces fixes for four distinct bugs, enhancing the stability and correctness of the server. The changes include auto-detecting the embedding model from knowledge base archives, improving answer validation in the QA tool, fixing a graph search method call, and adding a database fallback for the document cache. My review focuses on improving code style, readability, and addressing a critical security vulnerability. I've identified a potential SQL injection and suggested a fix, along with other recommendations to align the code with Python best practices.
Actionable comments posted: 1
🧹 Nitpick comments (2)
src/txtai_mcp_server/core/config.py (2)
97-100: Mutating `self.model_path` may have unintended side effects.

Assigning to `self.model_path` permanently modifies the settings instance. If the settings object is reused or inspected after `create_application()`, it will reflect the detected model rather than the originally configured value. This is likely intentional for consistency, but worth noting. Consider using a local variable for the application creation if preserving the original setting value is important:

♻️ Alternative approach (optional)

```diff
 detected_model = self._detect_model_from_archive(self.embeddings_path)
 if detected_model != self.model_path:
     logger.info(f"Overriding model_path {self.model_path} with KB-detected model: {detected_model}")
-    self.model_path = detected_model
+    # Use detected model for this application without mutating settings
 return Application(f"path: {self.embeddings_path}")
```

The mutation may be intentional to keep settings in sync; if so, this comment can be disregarded.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/txtai_mcp_server/core/config.py` around lines 97 - 100, The code mutates the settings instance by assigning detected_model to self.model_path inside create_application flow; to avoid unintended side effects, use a local variable (e.g., resolved_model) to hold the result of self._detect_model_from_archive(self.embeddings_path) and use that local variable where the application needs the model, leaving self.model_path unchanged; if you intentionally want settings updated, add a clear comment or an explicit opt-in (e.g., set_preserve_original=False) and update uses of self.model_path/detected_model accordingly.
55-64: File handle not explicitly closed after `extractfile()`.

The file object `f` returned by `tar.extractfile()` is not explicitly closed. While Python's GC will eventually clean it up, it's better practice to use a context manager or explicit close, especially within a loop.

♻️ Suggested fix

```diff
 for candidate in ["config.json", "./config.json", "embeddings/config.json"]:
     try:
         f = tar.extractfile(candidate)
         if f:
-            cfg = json.load(f)
-            model = cfg.get("path")
-            if model:
-                logger.info(f"Detected embedding model from KB archive: {model}")
-                return model
+            try:
+                cfg = json.load(f)
+                model = cfg.get("path")
+                if model:
+                    logger.info(f"Detected embedding model from KB archive: {model}")
+                    return model
+            finally:
+                f.close()
     except KeyError:
         continue
```
Verify each finding against the current code and only fix it if needed. In `@src/txtai_mcp_server/core/config.py` around lines 55 - 64, The tar.extractfile(candidate) file object `f` is not explicitly closed; update the block that reads the KB archive (the code using tar.extractfile, `f`, `cfg`, and `candidate`) to use a context manager (with statement) or ensure `f.close()` in a finally to guarantee the file handle is closed after json.load(f), while preserving the existing logic that reads `cfg.get("path")`, logs via `logger.info`, and returns the detected model.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/txtai_mcp_server/tools/search.py`:
- Around line 115-122: The fallback code uses get_document_from_cache(doc_id)
then re-calls get_txtai_app() and builds a raw SQL string with doc_id; fix by
reusing the existing app instance (do not call get_txtai_app() again) and escape
doc_id via escape_sql_string(doc_id) before interpolating into the SQL passed to
app.search (i.e., call app.search(f"select text from txtai where id =
'{escaped}'")). Keep the same fallback semantics (use "No text available" when
sql_result is empty or on exception) and only change the app usage and SQL
construction.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- src/txtai_mcp_server/core/config.py
- src/txtai_mcp_server/tools/qa.py
- src/txtai_mcp_server/tools/search.py
- Move tarfile/json imports to top of config.py (PEP 8)
- Fix SQL injection via escape_sql_string in search.py
- Simplify redundant len(results) check in qa.py
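The PR excerpt doesn't show `escape_sql_string` itself; a common minimal version for SQL string literals simply doubles embedded single quotes. This is a sketch, not the project's implementation:

```python
# Hypothetical helper: the project's escape_sql_string is not shown in this PR.
def escape_sql_string(value: str) -> str:
    """Escape a value for embedding in a single-quoted SQL string literal."""
    return value.replace("'", "''")

doc_id = "it's-a-doc"
query = f"select text from txtai where id = '{escape_sql_string(doc_id)}'"
print(query)  # select text from txtai where id = 'it''s-a-doc'
```

Parameterized queries are generally preferable when the driver supports them; escaping is the fallback here because txtai's `app.search()` accepts a full SQL string.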
♻️ Duplicate comments (1)
src/txtai_mcp_server/tools/search.py (1)
115-121: ⚠️ Potential issue | 🟠 Major

Cast `doc_id` to string before escaping to avoid silent fallback failures.

At Line 119, `escape_sql_string(doc_id)` can throw when `doc_id` is non-string (e.g., numeric IDs). That exception is swallowed and incorrectly returns `"No text available"` even if DB text exists.

💡 Proposed fix

```diff
-            sql_result = app.search(f"select text from txtai where id = '{escape_sql_string(doc_id)}'")
+            safe_id = escape_sql_string(str(doc_id))
+            sql_result = app.search(f"select text from txtai where id = '{safe_id}'")
```
Verify each finding against the current code and only fix it if needed. In `@src/txtai_mcp_server/tools/search.py` around lines 115 - 121, The code swallows exceptions when escaping non-string doc_id which causes valid DB rows to be hidden; before calling escape_sql_string(doc_id) in the get_document_from_cache fallback block, cast doc_id to a string (e.g., str(doc_id)) so escape_sql_string always receives a string, then call get_txtai_app() and app.search with the escaped string; additionally consider narrowing the except or logging the error from app.search to avoid silent failures. Ensure you update references in this block (doc_id, escape_sql_string, get_txtai_app, app.search, get_document_from_cache) accordingly.
🧹 Nitpick comments (1)
src/txtai_mcp_server/core/config.py (1)
65-67: Narrow the exception type in archive model detection.

Line 65 catches all exceptions, which can mask unrelated coding errors. Prefer handling archive/JSON read errors explicitly and preserving unexpected exceptions.

♻️ Suggested refinement

```diff
-    except Exception as e:
+    except (tarfile.TarError, OSError, json.JSONDecodeError, KeyError, TypeError, ValueError) as e:
         logger.warning(f"Could not detect model from KB archive: {e}")
```
Verify each finding against the current code and only fix it if needed. In `@src/txtai_mcp_server/core/config.py` around lines 65 - 67, The catch-all except in the archive model detection should be narrowed: replace "except Exception as e" with specific handlers for archive/JSON I/O errors (for example zipfile.BadZipFile, FileNotFoundError, json.JSONDecodeError, KeyError) so only expected archive/parse problems are logged via logger.warning(f"Could not detect model from KB archive: {e}"), and re-raise or let propagate any other unexpected exceptions; reference the same block that reads the KB archive which updates/returns self.model_path and uses logger.warning to ensure you only swallow known archive-related errors.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- src/txtai_mcp_server/core/config.py
- src/txtai_mcp_server/tools/qa.py
- src/txtai_mcp_server/tools/search.py
🚧 Files skipped from review as they are similar to previous changes (1)
- src/txtai_mcp_server/tools/qa.py
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Fix four bugs affecting KB-loaded MCP server instances
Bug 1 — fix(config): Auto-detect embedding model from KB archive

Hardcoded `all-MiniLM-L6-v2` caused silent dimension mismatch when the KB was built with a different model. Adds `_detect_model_from_archive()` to read the model path from `config.json` inside the tar at load time.

Bug 2 — fix(qa): Check answer field value, not just key presence

`results[0]["answer"]` exists as a key but with value `None` in plain-text KBs. A `.get("answer")` truthiness check prevents `None` propagating as the return value → fixes the Pydantic `Input should be a valid string` error.

Bug 3 — fix(search): Replace undefined `neighbors()` with `edges()`

The txtai NetworkX wrapper does not expose `neighbors()`. Calling it raises `AttributeError` on any `semantic_search` with `graph=True`.

Bug 4 — fix(search): SQL fallback when document cache empty on startup

`document_cache` is only populated via `add_documents()` at runtime. Fresh startup from a tar archive always has an empty cache → all results return `"No text available"` despite content existing in SQLite. A SQL fallback resolves text directly from the DB on cache miss.
Summary by CodeRabbit
New Features
Bug Fixes