1.3.0 rc #115
base: main
Conversation
…cation in WebSecurityConfig
…ted before alias creation
…rategy for direct collections
…esolution tracking
…oth incremental and full modes
- Updated JobService to use REQUIRES_NEW transaction propagation for deleting ignored jobs, ensuring fresh entity retrieval and preventing issues with the calling transaction.
- Removed token limitation from AI connection model and related DTOs, transitioning to project-level configuration for token limits.
- Adjusted AIConnectionDTO tests to reflect the removal of token limitation.
- Enhanced Bitbucket, GitHub, and GitLab AI client services to check token limits before analysis, throwing DiffTooLargeException when limits are exceeded.
- Updated command processors to utilize project-level token limits instead of AI connection-specific limits.
- Modified webhook processing to handle diff size issues gracefully, posting informative messages to VCS when analysis is skipped due to large diffs.
- Cleaned up integration tests to remove references to token limitation in AI connection creation and updates.
…sis processing. Project PR analysis max analysis token limit implementation
…Exception in webhook processors
… entities from async contexts
…ies without re-fetching in async contexts
… lazy loading of associations
…cy across transaction contexts
…oading of associations
…ansaction management in async context
…mproved job management
…ervice for direct deletion
…nt in webhook processing
…n for RAG context
- Added AST-based code splitter using Tree-sitter for accurate code parsing.
- Introduced TreeSitterParser for dynamic language loading and caching.
- Created scoring configuration for RAG query result reranking with configurable boost factors and priority patterns.
- Refactored RAGQueryService to utilize the new scoring configuration for enhanced result ranking.
- Improved metadata extraction and handling for better context in scoring.
…rove code referencing in prompts
… target branch results
…emental updates in RAG operations
…ssue reconciliation process
…model configurations
- Removed deprecated `_get_collection_name` method from RAGIndexManager.
- Updated imports in `models/__init__.py` to exclude DocumentMetadata.
- Changed default model in RAGConfig from "openai/text-embedding-3-small" to "qwen/qwen3-embedding-8b".
- Deleted DocumentMetadata class from RAGConfig.
- Removed WebhookIntegration service and its related methods.
- Cleaned up unused utility function `is_code_file` from utils.
- Updated tests to reflect the removal of DocumentMetadata and WebhookIntegration.
- Increased the queue capacity of the webhook executor to handle bursts of incoming requests.
- Updated the rejected execution handler to throw exceptions instead of blocking the caller, allowing for better error handling in webhook endpoints.
- Improved error logging in the WebhookAsyncProcessor for better traceability of job failures.
- Introduced an EntityManager to detach jobs from the persistence context, preventing overwriting of job statuses during async processing.
- Implemented a deduplication mechanism in the WebhookDeduplicationService to avoid processing duplicate commits within a defined time window.
- Enhanced the response parsing logic to ensure proper logging and handling of resolved issues.
- Streamlined the handling of PR and MR analysis in GitHub and GitLab webhook handlers, ensuring proper lock management and error handling.
- Added memory-efficient streaming methods for preserving and copying points in the BranchManager.
- Improved regex patterns in metadata extraction for various programming languages to account for leading whitespace.
- Updated scoring configuration to use word-boundary matching for file path patterns, enhancing accuracy in priority assignment.
- Replaced @PreAuthorize annotations with @IsWorkspaceMember and @HasOwnerOrAdminRights in JobController, AllowedCommandUserController, ProjectController, QualityGateController, and VCS controllers (BitbucketCloud, GitHub, GitLab).
- Improved readability and maintainability of security checks by utilizing custom annotations.
- Removed redundant security checks in various endpoints to streamline access control.
- Deleted unused _extract_archive function in web_server.py to clean up codebase.
- Updated import paths for IssueDTO in prompt_builder.py.
- Modified analysis_summary prompt in prompt_constants.py for clarity.
- Enhanced Dockerfile with CPU threading optimizations.
- Improved environment validation in main.py to support Ollama and OpenRouter embedding providers.
- Added new endpoints for parsing files and batch parsing in api.py, including AST metadata extraction.
- Introduced embedding_factory.py to manage embedding model creation for Ollama and OpenRouter.
- Implemented Ollama embedding wrapper in ollama_embedding.py for local model support.
- Updated index_manager to check vector dimensions before copying branches.
- Refactored RAGConfig to validate embedding provider configurations and auto-detect embedding dimensions.
- Adjusted query_service to utilize the new embedding factory for model instantiation.
…action management in JobService and WebhookAsyncProcessor
…s for CRUD operations. Cloud version preparations
…ersion for existing records; update database migration scripts for version handling and remove token limitation from ai_connection
Epic/ca 7 pr review process flow
/codecrow analyze
Important: Review skipped. Too many files: this PR contains 197 files, which is 47 over the limit of 150.
…ling
- Introduced MAX_CONCURRENT_REVIEWS to limit simultaneous review requests.
- Implemented asyncio.Semaphore to manage concurrent reviews.
- Added a timeout of 600 seconds for review processing to prevent long-running requests.
- Refactored LLM reranker initialization to be per-request, improving resource management.
- Ensured MCP sessions are closed after review processing to release resources.
- Enhanced error handling for timeouts and exceptions during review processing.
refactor: Simplify context builder and remove unused components
- Removed legacy context budget management and model context limits.
- Streamlined context builder utilities for RAG metrics and caching.
- Updated context fetching logic to align with new architecture.
fix: Update prompt templates for clarity and accuracy
- Revised Stage 2 cross-file prompt to focus on relevant aspects.
- Changed references from "Database Migrations" to "Migration Files" for consistency.
feat: Implement service-to-service authentication middleware
- Added ServiceSecretMiddleware to validate internal requests with a shared secret.
- Configured middleware to skip authentication for public endpoints.
enhance: Improve collection management with payload indexing
- Added functionality to create payload indexes for efficient filtering on common fields in Qdrant collections.
fix: Adjust query service to handle path prefix mismatches
- Updated fallback logic in RAGQueryService to improve handling of filename matches during queries.
feat: Enhance ReviewService with concurrency control and timeout hand…
/codecrow analyze
- Added DiffFingerprintUtil to compute a stable fingerprint for code changes in pull requests (an illustrative sketch of such a fingerprint utility follows the commit list below).
- Enhanced PullRequestAnalysisProcessor to utilize commit hash and diff fingerprint caches for reusing analysis results.
- Updated CodeAnalysis model to include a diff_fingerprint field for storage.
- Modified CodeAnalysisService to support retrieval and cloning of analyses based on diff fingerprints and commit hashes.
- Added database migrations to introduce the diff_fingerprint column and create necessary indexes.
- Improved error handling and logging in various components, including file existence checks in Bitbucket Cloud.
- Refactored tests to accommodate new functionality and ensure coverage for caching mechanisms.
…or header validation
…ysis-binding feat: Implement diff fingerprint caching for pull request analysis
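The diff-fingerprint commit above describes computing a stable fingerprint for a PR's code changes. A minimal sketch of that idea is shown below; it is illustrative only, and the class name, normalization rules, and hashing choice are assumptions rather than the project's actual DiffFingerprintUtil:
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

// Hypothetical sketch, not the real DiffFingerprintUtil.
public final class DiffFingerprintSketch {

    // Keep only added/removed lines so that context shifts and hunk-header
    // line numbers do not change the fingerprint.
    static String normalize(String rawDiff) {
        StringBuilder normalized = new StringBuilder();
        for (String line : rawDiff.split("\n")) {
            boolean added = line.startsWith("+") && !line.startsWith("+++");
            boolean removed = line.startsWith("-") && !line.startsWith("---");
            if (added || removed) {
                normalized.append(line.trim()).append('\n');
            }
        }
        return normalized.toString();
    }

    // SHA-256 over the normalized change content yields a compact, stable key
    // that can be reused to look up previously stored analyses.
    static String fingerprint(String rawDiff) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(normalize(rawDiff).getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(hash);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }
}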
| Status | FAIL |
| Risk Level | HIGH |
| Review Coverage | 23 files analyzed in depth |
| Confidence | HIGH |
Executive Summary
This PR (1.3.0 rc) introduces significant architectural changes to the analysis engine, including a new diff fingerprinting mechanism for issue deduplication and a major refactoring of the webhook processing pipeline. While these changes aim to improve system scalability and data integrity, the current implementation introduces several high-risk regressions across security, database migration stability, and core RAG extraction logic.
Recommendation
Decision: FAIL
The PR is not suitable for merge in its current state due to critical blockers, including a "fail-open" authentication vulnerability in the Python middleware and conflicting database migrations that will cause deployment failures. Additionally, the RAG pipeline contains runtime incompatibilities with the Tree-sitter API that must be resolved to maintain core functionality.
Issues Overview
| Severity | Count | Description |
|---|---|---|
| 🔴 High | 7 | Critical issues requiring immediate attention |
| 🟡 Medium | 27 | Issues that should be addressed |
| 🔵 Low | 14 | Minor issues and improvements |
| ℹ️ Info | 3 | Informational notes and suggestions |
Analysis completed on 2026-02-10 10:20:23 | View Full Report | Pull Request
📋 Detailed Issues (51)
🔴 High Severity Issues
Id on Platform: 1943
Category: 🔒 Security
File: .../api/api.py:30
Issue: The 'startswith' check on resolved paths is vulnerable to prefix attacks. For example, if '_ALLOWED_REPO_ROOT' is '/tmp', a path like '/tmp-secret/repo' would pass the validation because '/tmp-secret/repo'.startswith('/tmp') is true, even though it is outside the intended root.
💡 Suggested Fix
Use 'os.path.commonpath' to verify that the resolved path is actually within the allowed root directory.
--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/api/api.py
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/api/api.py
@@ -25,7 +25,8 @@
def _validate_repo_path(path: str) -> str:
"""Validate that a repo path is within the allowed root and contains no traversal."""
+ root = os.path.realpath(_ALLOWED_REPO_ROOT)
resolved = os.path.realpath(path)
- if not resolved.startswith(os.path.realpath(_ALLOWED_REPO_ROOT)):
+ if os.path.commonpath([root, resolved]) != root:
raise ValueError(
f"Path must be under {_ALLOWED_REPO_ROOT}, got: {path}"
)
Id on Platform: 1950
Category: ⚡ Performance
File: .../core/ollama_embedding.py:240
Issue: The implementation uses a synchronous httpx.Client for all requests, and the async methods (_aget_text_embeddings, etc.) simply wrap these sync calls. In an asynchronous RAG pipeline, this blocks the event loop for the duration of the embedding request, which can take several seconds, effectively negating the benefits of async and causing the entire service to hang.
💡 Suggested Fix
Initialize and use an httpx.AsyncClient for the aget* methods to ensure non-blocking I/O.
--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/core/ollama_embedding.py
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/core/ollama_embedding.py
@@ -68,6 +68,10 @@
base_url=base_url.rstrip('/'),
timeout=timeout
))
+ object.__setattr__(self, '_aclient', httpx.AsyncClient(
+ base_url=base_url.rstrip('/'),
+ timeout=timeout
+ ))
# Test connection
self._test_connection()
Id on Platform: 1959
Category: 🐛 Bug Risk
File: .../ai/AiAnalysisRequestImpl.java:1
Issue: The deduplication logic in 'withAllPrAnalysesData' fails to merge resolved status when the input list 'allPrAnalyses' is sorted DESC (newest first), as specified in the Javadoc. In DESC order, the first issue seen for a fingerprint is the newest. Subsequent (older) issues will have a lower 'currentVersion' than the 'existingVersion' in the map, causing the 'if (currentVersion > existingVersion)' block to be skipped. Consequently, if an older version was resolved but the newest is open, the resolved status is never propagated.
💡 Suggested Fix
Sort the input list in ascending order (oldest first) before processing, or adjust the logic to handle DESC order by checking if the older (current) issue is resolved when the existing (newer) one is not.
--- a/java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/dto/request/ai/AiAnalysisRequestImpl.java
+++ b/java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/dto/request/ai/AiAnalysisRequestImpl.java
@@ -310,7 +310,8 @@
}
// Convert all issues to DTOs
- List<AiRequestPreviousIssueDTO> allIssues = allPrAnalyses.stream()
+ List<AiRequestPreviousIssueDTO> allIssues = allPrAnalyses.stream().sorted(java.util.Comparator.comparingInt(CodeAnalysis::getPrVersion))
.flatMap(analysis -> analysis.getIssues().stream())
.map(AiRequestPreviousIssueDTO::fromEntity)
.toList();
Id on Platform: 1967
Category: 🐛 Bug Risk
File: .../index_manager/indexer.py:4
Issue: In 'index_repository', the loop processes batches and tracks 'failed_chunks'. However, if some chunks fail to index, the process continues and eventually performs an atomic swap, promoting an incomplete index to the production alias. This violates the integrity of the 'atomic swap' strategy which should ensure a complete and valid index is ready before swapping.
💡 Suggested Fix
Check the 'failed_chunks' count after the indexing loop. If failures occurred, raise an exception to trigger the cleanup logic and prevent the alias swap.
--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager/indexer.py
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager/indexer.py
@@ -215,6 +215,9 @@
f"Streaming indexing complete: {document_count} files, "
f"{successful_chunks}/{chunk_count} chunks indexed ({failed_chunks} failed)"
)
+
+ if failed_chunks > 0:
+ raise Exception(f"Indexing failed for {failed_chunks} chunks. Aborting swap.")
# Verify and perform atomic swap
temp_info = self.point_ops.client.get_collection(temp_collection_name)
Id on Platform: 1974
Category: 🐛 Bug Risk
File: .../splitter/query_runner.py:2
Issue: The code attempts to import and use 'QueryCursor', which was removed in 'tree-sitter' version 0.22.0. Additionally, it attempts to unpack matches as tuples, but in the new API, 'query.matches()' returns 'QueryMatch' objects. This will cause an 'ImportError' or 'TypeError' at runtime.
💡 Suggested Fix
Update the query execution logic to use the 0.22+ API: call 'query.matches(node)' directly and access the 'captures' attribute on the resulting match objects.
--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/query_runner.py
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/query_runner.py
@@ -193,11 +193,7 @@
- try:
- # Use QueryCursor.matches() for pattern-grouped results
- # Each match is (pattern_id, {capture_name: [nodes]})
- from tree_sitter import QueryCursor
- cursor = QueryCursor(query)
- raw_matches = list(cursor.matches(tree.root_node))
- except Exception as e:
- logger.warning(f"Query execution failed for {lang_name}: {e}")
- return []
+ try:
+ raw_matches = query.matches(tree.root_node)
+ except Exception as e:
+ logger.warning(f"Query execution failed for {lang_name}: {e}")
+ return []
results: List[QueryMatch] = []
- for pattern_id, captures_dict in raw_matches:
+ for match_obj in raw_matches:
+ captures_dict = match_obj.captures
Id on Platform: 1975
Category: 🔒 Security
File: .../service/WorkspaceService.java:1
Issue: The current logic prevents an ADMIN from promoting someone to ADMIN or OWNER, but it does not prevent an ADMIN from demoting an existing OWNER or another ADMIN. Since the check only looks at 'newRole', an ADMIN could change an OWNER's role to MEMBER.
💡 Suggested Fix
Update the security check to ensure that users with the ADMIN role cannot modify the roles of users who are currently ADMINs or OWNERs.
--- a/java-ecosystem/services/web-server/src/main/java/org/rostilos/codecrow/webserver/workspace/service/WorkspaceService.java
+++ b/java-ecosystem/services/web-server/src/main/java/org/rostilos/codecrow/webserver/workspace/service/WorkspaceService.java
@@ -123,10 +123,13 @@
throw new SecurityException("The current user does not have editing privileges.");
}
WorkspaceMember workspaceMemberActor = optionalWorkspaceMemberActor.get();
- if (workspaceMemberActor.getRole() == EWorkspaceRole.ADMIN
- && (newRole == EWorkspaceRole.ADMIN || newRole == EWorkspaceRole.OWNER)) {
- throw new SecurityException("The current user does not have editing privileges.");
- }
WorkspaceMember workspaceMember = workspaceMemberRepository
.findByWorkspaceIdAndUserId(workspaceId, target.getId())
.orElseThrow(() -> new NoSuchElementException("The user is not present in the workspace."));
+
+ if (workspaceMemberActor.getRole() == EWorkspaceRole.ADMIN) {
+ if (newRole == EWorkspaceRole.ADMIN || newRole == EWorkspaceRole.OWNER ||
+ workspaceMember.getRole() == EWorkspaceRole.ADMIN || workspaceMember.getRole() == EWorkspaceRole.OWNER) {
+ throw new SecurityException("The current user does not have editing privileges.");
+ }
+ }
Id on Platform: 1977
Category: 🐛 Bug Risk
File: .../1.3.0/V1.3.0__remove_token_limitation_from_ai_connection.sql:1
Issue: This migration script is identical to V1.4.0__remove_token_limitation_from_ai_connection.sql which is also included in this PR. Having two migrations performing the same schema change with different version numbers will cause conflicts in the schema history table and potentially break the deployment.
💡 Suggested Fix
Remove the duplicate migration file. Ensure the change is only present in the correct versioned directory (likely 1.4.0 based on the project's current progression).
No suggested fix provided
🟡 Medium Severity Issues
Id on Platform: 1934
Category: 🔒 Security
File: .../api/middleware.py:37
Issue: The secret comparison uses a standard equality operator ('=='), which is vulnerable to timing attacks. Additionally, the middleware 'fails open' if the SERVICE_SECRET environment variable is not set, allowing all requests to bypass authentication.
💡 Suggested Fix
Use 'secrets.compare_digest' for constant-time comparison to prevent timing attacks. Also, consider requiring the secret to be present in production environments to avoid accidental 'fail-open' scenarios.
--- a/python-ecosystem/mcp-client/api/middleware.py
+++ b/python-ecosystem/mcp-client/api/middleware.py
@@ -6,6 +6,7 @@
"""
import os
import logging
+import secrets
from starlette.middleware.base import BaseHTTPMiddleware
@@ -34,7 +35,7 @@
return await call_next(request)
provided = request.headers.get("x-service-secret", "")
- if provided != self.secret:
+ if not secrets.compare_digest(provided, self.secret):
logger.warning(
f"Unauthorized request to {request.url.path} from {request.client.host if request.client else 'unknown'}"
)
Id on Platform: 1935
Category: 🔒 Security
File: .../api/middleware.py:37
Issue: Vulnerable to timing attacks due to standard string comparison and 'fails open' if SERVICE_SECRET is missing.
💡 Suggested Fix
Use 'secrets.compare_digest' for secret validation.
--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/api/middleware.py
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/api/middleware.py
@@ -6,6 +6,7 @@
"""
import os
import logging
+import secrets
from starlette.middleware.base import BaseHTTPMiddleware
@@ -34,7 +35,7 @@
return await call_next(request)
provided = request.headers.get("x-service-secret", "")
- if provided != self.secret:
+ if not secrets.compare_digest(provided, self.secret):
logger.warning(
f"Unauthorized request to {request.url.path} from {request.client.host if request.client else 'unknown'}"
)
Id on Platform: 1936
Category: 🛡️ Error Handling
File: .../command/command_service.py
Issue: In 'process_summarize', the MCP client is instantiated, but the 'finally' block that ensures sessions are closed only covers the '_execute_summarize' call. If a timeout or exception occurs during '_fetch_rag_context_for_summarize' or prompt building, 'client.close_all_sessions()' will never be called, potentially leaking JVM subprocesses.
💡 Suggested Fix
Wrap the entire lifecycle of the MCP client in a try...finally block.
--- a/python-ecosystem/mcp-client/service/command/command_service.py
+++ b/python-ecosystem/mcp-client/service/command/command_service.py
@@ -84,13 +84,13 @@
})
client = self._create_mcp_client(config)
- llm = self._create_llm(request)
-
- # Fetch RAG context
- rag_context = await self._fetch_rag_context_for_summarize(request, event_callback)
-
- # Build prompt
- prompt = self._build_summarize_prompt(request, rag_context)
-
- self._emit_event(event_callback, {
- "type": "status",
- "state": "generating",
- "message": "Generating PR summary with AI"
- })
-
- # Execute with MCP agent
try:
+ llm = self._create_llm(request)
+ rag_context = await self._fetch_rag_context_for_summarize(request, event_callback)
+ prompt = self._build_summarize_prompt(request, rag_context)
+ self._emit_event(event_callback, {
+ "type": "status",
+ "state": "generating",
+ "message": "Generating PR summary with AI"
+ })
result = await self._execute_summarize(
Id on Platform: 1937
Category: 🛡️ Error Handling
File: .../command/command_service.py
Issue: In 'process_ask', the MCP client is instantiated, but the 'finally' block that ensures sessions are closed only covers the '_execute_ask' call. If a timeout or exception occurs during '_fetch_rag_context_for_ask' or prompt building, 'client.close_all_sessions()' will never be called.
💡 Suggested Fix
Wrap the entire lifecycle of the MCP client in a try...finally block.
--- a/python-ecosystem/mcp-client/service/command/command_service.py
+++ b/python-ecosystem/mcp-client/service/command/command_service.py
@@ -184,12 +184,12 @@
client = self._create_mcp_client(config)
- llm = self._create_llm(request)
-
- # Fetch RAG context for the question
- rag_context = await self._fetch_rag_context_for_ask(request, event_callback)
-
- # Build prompt with Platform MCP tools if available
- prompt = self._build_ask_prompt(request, rag_context, has_platform_mcp=include_platform)
-
- self._emit_event(event_callback, {
- "type": "status",
- "state": "generating",
- "message": "Generating answer with AI"
- })
-
- # Execute with MCP agent
try:
+ llm = self._create_llm(request)
+ rag_context = await self._fetch_rag_context_for_ask(request, event_callback)
+ prompt = self._build_ask_prompt(request, rag_context, has_platform_mcp=include_platform)
+ self._emit_event(event_callback, {
+ "type": "status",
+ "state": "generating",
+ "message": "Generating answer with AI"
+ })
result = await self._execute_ask(
Id on Platform: 1938
Category: 🛡️ Error Handling
File: .../review/review_service.py
Issue: The MCP client is created, but the cleanup logic ('close_all_sessions') is only triggered if the code reaches the inner 'try...finally' block. If a timeout occurs during 'self._fetch_rag_context' or orchestrator initialization, the sessions will remain open, leading to leaked JVM processes.
💡 Suggested Fix
Move the 'try...finally' block to immediately follow the MCP client creation to ensure cleanup regardless of where a timeout or error occurs within the request lifecycle.
--- a/python-ecosystem/mcp-client/service/review/review_service.py
+++ b/python-ecosystem/mcp-client/service/review/review_service.py
@@ -158,16 +158,16 @@
client = self._create_mcp_client(config)
-
- # Create LLM instance
- llm = self._create_llm(request)
-
- # Create a per-request reranker (not shared across concurrent requests)
- llm_reranker = LLMReranker(llm_client=llm)
-
- # Fetch RAG context if enabled
- rag_context = await self._fetch_rag_context(request, event_callback, llm_reranker=llm_reranker)
-
- # Build processed_diff if rawDiff is available to optimize Stage 1
- processed_diff = None
- if has_raw_diff:
- processed_diff = self._process_raw_diff(request.rawDiff)
-
- # Initialize Multi-Stage Orchestrator
- orchestrator = MultiStageReviewOrchestrator(
- llm=llm,
- mcp_client=client,
- project_id=request.projectId,
- event_callback=event_callback
- )
-
try:
+ llm = self._create_llm(request)
+ llm_reranker = LLMReranker(llm_client=llm)
+ rag_context = await self._fetch_rag_context(request, event_callback, llm_reranker=llm_reranker)
+ processed_diff = None
+ if has_raw_diff:
+ processed_diff = self._process_raw_diff(request.rawDiff)
+ orchestrator = MultiStageReviewOrchestrator(
+ llm=llm,
+ mcp_client=client,
+ project_id=request.projectId,
+ event_callback=event_callback
+ )
# Check for Branch Analysis / Reconciliation mode
Id on Platform: 1940
Category: 🐛 Bug Risk
File: .../client/RagPipelineClient.java:214
Issue: The 'branch' parameter in 'deleteIndex' is appended to the URL without encoding. Branch names frequently contain special characters like slashes (e.g., 'feature/fix'), which will break the URL structure. In contrast, 'deleteBranch' correctly encodes the branch name.
💡 Suggested Fix
Use URLEncoder to encode the branch name before formatting the URL, similar to the implementation in 'deleteBranch'.
--- a/java-ecosystem/libs/rag-engine/src/main/java/org/rostilos/codecrow/ragengine/client/RagPipelineClient.java
+++ b/java-ecosystem/libs/rag-engine/src/main/java/org/rostilos/codecrow/ragengine/client/RagPipelineClient.java
@@ -211,7 +211,8 @@
return;
}
- String url = String.format("%s/index/%s/%s/%s", ragApiUrl, workspace, project, branch);
+ String encodedBranch = java.net.URLEncoder.encode(branch, java.nio.charset.StandardCharsets.UTF_8);
+ String url = String.format("%s/index/%s/%s/%s", ragApiUrl, workspace, project, encodedBranch);
Request.Builder builder = new Request.Builder()
.url(url)
.delete();
Id on Platform: 1941
Category: 🛡️ Error Handling
File: .../actions/CheckFileExistsInBranchAction.java
Issue: The retry logic within the 'catch (IOException e)' block lacks a backoff delay (Thread.sleep). If an IOException is thrown that contains '429' in its message (e.g., by a custom interceptor or a specific client implementation), the code will immediately retry the request without waiting, which is counter-productive when dealing with rate limits.
💡 Suggested Fix
Add a Thread.sleep(backoffMs) call before 'continue' in the catch block to ensure the rate limit is respected during retries.
--- a/java-ecosystem/libs/vcs-client/src/main/java/org/rostilos/codecrow/vcsclient/bitbucket/cloud/actions/CheckFileExistsInBranchAction.java
+++ b/java-ecosystem/libs/vcs-client/src/main/java/org/rostilos/codecrow/vcsclient/bitbucket/cloud/actions/CheckFileExistsInBranchAction.java
@@ -94,6 +94,12 @@
} catch (IOException e) {
if (attempt < MAX_RETRIES && e.getMessage() != null && e.getMessage().contains("429")) {
+ try {
+ Thread.sleep(backoffMs);
+ } catch (InterruptedException ie) {
+ Thread.currentThread().interrupt();
+ throw new IOException("Interrupted during retry backoff", ie);
+ }
attempt++;
backoffMs *= 2;
continue;
Id on Platform: 1942
Category: 🐛 Bug Risk
File: .../github/GitHubClient.java:966
Issue: The GraphQL 'expression' string is constructed by only escaping double quotes. If a file path contains a backslash (\), it will act as an escape character in the GraphQL query string, potentially leading to malformed queries or incorrect path resolution (e.g., '\n' being interpreted as a newline).
💡 Suggested Fix
Escape backslashes in the expression string before inserting it into the GraphQL query.
--- a/java-ecosystem/libs/vcs-client/src/main/java/org/rostilos/codecrow/vcsclient/github/GitHubClient.java
+++ b/java-ecosystem/libs/vcs-client/src/main/java/org/rostilos/codecrow/vcsclient/github/GitHubClient.java
@@ -964,3 +964,3 @@
String expression = branchOrCommit + ":" + path;
queryBuilder.append(alias).append(": object(expression: \"")
- .append(expression.replace("\"", "\\\""))
+ .append(expression.replace("\\", "\\\\").replace("\"", "\\\""))
.append("\") { ... on Blob { text byteSize } } ");Id on Platform: 1945
Category: 🐛 Bug Risk
File: .../queries/rust.scm:12
Issue: The queries for impl_item use (type_identifier) for the type and trait fields. In Rust, these fields can contain complex types such as generic types (Vec<T>), scoped types (std::collections::HashMap), or references (&str). Using (type_identifier) restricts matches to simple, non-generic, local types only, causing the RAG pipeline to miss a significant portion of implementation logic.
💡 Suggested Fix
Use the wildcard (_) or remove the node type restriction to allow matching any valid type node within the impl block.
--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/queries/rust.scm
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/queries/rust.scm
@@ -21,8 +21,8 @@
; Implementation blocks (simple type impl)
(impl_item
- type: (type_identifier) @name) @definition.impl
+ type: (_) @name) @definition.impl
; Implementation blocks for trait (impl Trait for Type)
(impl_item
- trait: (type_identifier) @trait_name
- type: (type_identifier) @name) @definition.impl
+ trait: (_) @trait_name
+ type: (_) @name) @definition.impl
Id on Platform: 1949
Category: ⚡ Performance
File: .../core/ollama_embedding.py:73
Issue: Calling _test_connection() in the constructor performs a blocking network request. This can cause significant delays or timeouts during application startup or object instantiation if the Ollama server is slow or unreachable.
💡 Suggested Fix
Remove the connection test from the constructor. Connection health should be checked lazily or via a dedicated health check method.
--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/core/ollama_embedding.py
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/core/ollama_embedding.py
@@ -70,8 +70,6 @@
timeout=timeout
))
- # Test connection
- self._test_connection()
logger.info(f"Ollama embeddings initialized successfully")
def _test_connection(self):
Id on Platform: 1951
Category: 🐛 Bug Risk
File: .../model/dtos.py:21
Issue: The 'line' field is defined as Optional[int], but the corresponding 'CodeReviewIssue' in 'output_schemas.py' defines it as a string to support ranges (e.g., '42-45'). If a previous issue containing a range is passed back to the client, Pydantic validation will fail for IssueDTO.
💡 Suggested Fix
Change the type of 'line' to Optional[str] to maintain consistency with the output schema and support line ranges.
--- a/python-ecosystem/mcp-client/model/dtos.py
+++ b/python-ecosystem/mcp-client/model/dtos.py
@@ -19,3 +19,3 @@
suggestedFixDiff: Optional[str] = None # Diff for suggested fix (from Java)
file: Optional[str] = None
- line: Optional[int] = None
+ line: Optional[str] = None
branch: Optional[str] = None
Id on Platform: 1954
Category: 🐛 Bug Risk
File: .../splitter/languages.py:1
Issue: Inconsistency in language support: 'Language.COBOL' is included in 'AST_SUPPORTED_LANGUAGES' but missing from 'LANGUAGE_TO_TREESITTER'. Additionally, several languages (Kotlin, Scala, Lua, Perl, Swift, Haskell, Cobol) are missing from 'TREESITTER_MODULES', which will cause 'TreeSitterParser.get_language' to return None even when 'is_ast_supported' returns True.
💡 Suggested Fix
Ensure all languages in 'AST_SUPPORTED_LANGUAGES' have corresponding entries in 'LANGUAGE_TO_TREESITTER' and 'TREESITTER_MODULES'.
--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/languages.py
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/languages.py
@@ -101,4 +101,5 @@
Language.SWIFT: 'swift',
Language.HASKELL: 'haskell',
+ Language.COBOL: 'cobol',
}
Id on Platform: 1957
Category: ⚡ Performance
File: .../vcsclient/VcsClientProvider.java:213
Issue: The 'refreshToken' method is marked with '@Transactional' and performs network I/O (OAuth token refresh). Holding a database transaction open during network calls is a performance anti-pattern that can lead to connection pool exhaustion and deadlocks, especially with the 30-second timeout. If the network is slow, the DB connection is held unnecessarily.
💡 Suggested Fix
Refactor the method to perform the network call outside of the transaction. Only the final step of updating the database with the new tokens should be wrapped in a transaction.
No suggested fix provided
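No diff accompanies this finding, so here is a minimal sketch of the suggested split under stated assumptions (TokenStore and OAuthHttpClient are invented collaborators, not the project's API): the slow OAuth call runs with no transaction open, and only the short database write is transactional.
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

// Hypothetical sketch; TokenStore and OAuthHttpClient are assumed collaborators.
@Service
class TokenRefreshFlow {

    private final TokenStore tokenStore;       // short transactional write, separate bean
    private final OAuthHttpClient oauthClient; // plain HTTP client, no transaction involved

    TokenRefreshFlow(TokenStore tokenStore, OAuthHttpClient oauthClient) {
        this.tokenStore = tokenStore;
        this.oauthClient = oauthClient;
    }

    // Deliberately NOT @Transactional: the slow OAuth call holds no DB connection.
    public String refreshToken(long connectionId, String refreshToken) {
        String newAccessToken = oauthClient.refresh(refreshToken);
        tokenStore.saveAccessToken(connectionId, newAccessToken);
        return newAccessToken;
    }

    interface OAuthHttpClient {
        String refresh(String refreshToken);
    }

    @Service
    static class TokenStore {
        // Only this short write runs in a transaction; using a separate bean
        // avoids self-invocation, so the @Transactional proxy is actually applied.
        @Transactional
        public void saveAccessToken(long connectionId, String accessToken) {
            // e.g. repository.updateAccessToken(connectionId, accessToken);
        }
    }
}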
Id on Platform: 1960
Category: 🧹 Code Quality
File: .../ai/AiAnalysisRequestImpl.java:364
Issue: The line grouping logic 'issue.line() / 3' does not implement the '±3 tolerance' mentioned in the Javadoc. This creates fixed buckets (e.g., lines 1-2 in bucket 0, lines 3-5 in bucket 1). Adjacent lines like 2 and 3 will fall into different buckets and fail to match, while lines 3 and 5 will match despite being further apart. This makes issue tracking across commits jittery.
💡 Suggested Fix
Consider using a larger bucket or a more flexible matching algorithm if true tolerance is required. At minimum, update the Javadoc to reflect that it uses fixed buckets of size 3.
--- a/java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/dto/request/ai/AiAnalysisRequestImpl.java
+++ b/java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/dto/request/ai/AiAnalysisRequestImpl.java
@@ -361,7 +361,7 @@
private String computeIssueFingerprint(AiRequestPreviousIssueDTO issue) {
String file = issue.file() != null ? issue.file() : "";
- // Normalize line to nearest multiple of 3 for tolerance
- int lineGroup = issue.line() != null ? (issue.line() / 3) : 0;
+ // Normalize line to buckets of 5 for better tolerance across small shifts
+ int lineGroup = issue.line() != null ? (issue.line() / 5) : 0;
String severity = issue.severity() != null ? issue.severity() : "";
Id on Platform: 1963
Category: 🐛 Bug Risk
File: .../webhookhandler/GitHubPullRequestWebhookHandler.java:198
Issue: Calling postAnalysisResults with a null codeAnalysis parameter is likely to cause a NullPointerException in the reporting service. As seen in GitHubReportingService.java, the implementation attempts to generate an analysis summary from the provided analysis object without checking for null.
💡 Suggested Fix
Avoid passing null to postAnalysisResults. Instead, consider adding a dedicated error reporting method to the VcsReportingService or passing a minimal CodeAnalysis object that represents the error state.
No suggested fix provided
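No diff is provided here either; a non-authoritative sketch of the second suggestion builds a minimal error-state analysis instead of passing null (the setter names and AnalysisStatus enum below are assumptions about the CodeAnalysis API, not confirmed by the report):
// Hypothetical sketch; setter names and AnalysisStatus are assumptions.
private CodeAnalysis buildErrorAnalysis(Exception e) {
    CodeAnalysis errorAnalysis = new CodeAnalysis();
    errorAnalysis.setStatus(AnalysisStatus.FAILED);
    errorAnalysis.setSummary("Analysis failed: " + e.getMessage());
    return errorAnalysis;
}
// At the call site, replace the null argument:
// postAnalysisResults(project, prId, buildErrorAnalysis(e));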
Id on Platform: 1964
Category: 🧹 Code Quality
File: .../webhookhandler/GitLabMergeRequestWebhookHandler.java:189
Issue: The GitLab handler is inconsistent with the GitHub handler as it lacks specific catch blocks for DiffTooLargeException and AnalysisLockedException. This results in these recoverable exceptions being logged as errors and prevents them from being re-thrown for proper handling by the WebhookAsyncProcessor.
💡 Suggested Fix
Add a specific catch block for DiffTooLargeException and AnalysisLockedException that releases the lock and re-throws the exception, matching the behavior in GitHubPullRequestWebhookHandler. Ensure the exceptions are imported.
--- a/java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/gitlab/webhookhandler/GitLabMergeRequestWebhookHandler.java
+++ b/java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/gitlab/webhookhandler/GitLabMergeRequestWebhookHandler.java
@@ -187,6 +187,13 @@
return WebhookResult.success("MR analysis completed", result);
+ } catch (DiffTooLargeException | AnalysisLockedException e) {
+ if (acquiredLockKey != null) {
+ analysisLockService.releaseLock(acquiredLockKey);
+ }
+ log.warn("MR analysis failed with recoverable exception for project {}: {}", project.getId(), e.getMessage());
+ throw e;
} catch (Exception e) {
log.error("MR analysis failed for project {}", project.getId(), e);
// Release the lock since processor won't take ownership
Id on Platform: 1965
Category: 🐛 Bug Risk
File: .../webhookhandler/GitLabMergeRequestWebhookHandler.java:197
Issue: Calling postAnalysisResults with a null codeAnalysis parameter is likely to cause a NullPointerException in the reporting service, similar to the issue in the GitHub handler.
💡 Suggested Fix
Avoid passing null to postAnalysisResults. Use a dedicated error reporting mechanism or a fallback analysis object.
No suggested fix provided
Id on Platform: 1966
Category: ⚡ Performance
File: .../index_manager/branch_manager.py:1
Issue: The 'get_indexed_branches' method performs a full scroll of the collection to identify unique branch names. In collections with millions of points, this will result in thousands of network requests and significant latency, as it iterates through every point's payload.
💡 Suggested Fix
Increase the scroll limit to reduce round trips and consider caching the branch list or storing it in a dedicated metadata collection/point to avoid full scans.
--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager/branch_manager.py
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager/branch_manager.py
@@ -75,7 +75,7 @@
try:
branches: Set[str] = set()
offset = None
- limit = 100
+ limit = 1000
while True:
results = self.client.scroll(
Id on Platform: 1969
Category: 🛡️ Error Handling
File: .../index_manager/point_operations.py:1
Issue: The 'upsert_points' method catches all exceptions during batch upserts and simply logs them, continuing with the next batch. While this provides resilience, it leads to silent data loss in the index if the caller does not explicitly check the failure count.
💡 Suggested Fix
Consider re-raising the exception or ensuring the caller (like RepositoryIndexer) strictly validates the success of all batches before proceeding with alias swaps.
No suggested fix provided
Id on Platform: 1971
Category: 🧹 Code Quality
File: .../orchestrator/context_helpers.py:1
Issue: The regex patterns for symbol extraction are too restrictive. The CamelCase pattern [A-Z][a-z]+[A-Z][a-zA-Z]* requires at least two uppercase letters with lowercase in between (e.g., 'MyClass' matches, but 'User' or 'Data' do not). Similarly, the snake_case pattern requires at least one underscore, missing single-word variables like 'result' or 'items'.
💡 Suggested Fix
Update the regex patterns to capture single-word capitalized identifiers and single-word lowercase identifiers (with length filters).
--- a/python-ecosystem/mcp-client/service/review/orchestrator/context_helpers.py
+++ b/python-ecosystem/mcp-client/service/review/orchestrator/context_helpers.py
@@ -40,8 +40,8 @@
- camel_case = re.findall(r'\b([A-Z][a-z]+[A-Z][a-zA-Z]*)\b', diff_content)
- symbols.update(camel_case)
-
- # Match snake_case identifiers (variables, functions)
- snake_case = re.findall(r'\b([a-z][a-z0-9]*(?:_[a-z0-9]+)+)\b', diff_content)
+ # Match Capitalized identifiers (Classes, Components)
+ capitalized = re.findall(r'\b([A-Z][a-zA-Z0-9]+)\b', diff_content)
+ symbols.update(capitalized)
+
+ # Match lowercase identifiers (variables, functions)
+ identifiers = re.findall(r'\b([a-z][a-z0-9_]+)\b', diff_content)
+ symbols.update(s for s in identifiers if len(s) > 4)
Id on Platform: 1972
Category: 🐛 Bug Risk
File: .../orchestrator/json_utils.py:10
Issue: The use of text.find('{') and text.rfind('}') to extract JSON is fragile. If the LLM output contains multiple JSON objects (e.g., 'Config: { ... } Result: { ... }'), this logic will extract everything from the first '{' to the last '}', resulting in invalid JSON.
💡 Suggested Fix
Implement a balanced brace extraction algorithm to find the first complete JSON object, similar to the _extract_json_object method seen in python-ecosystem/mcp-client/service/command_service.py.
--- a/python-ecosystem/mcp-client/service/review/orchestrator/json_utils.py
+++ b/python-ecosystem/mcp-client/service/review/orchestrator/json_utils.py
@@ -130,2 +130,19 @@
- obj_start = text.find("{")
- obj_end = text.rfind("}")
+ def extract_balanced(t, start_char='{', end_char='}'):
+ start = t.find(start_char)
+ if start == -1: return -1, -1
+ depth, in_str, escape = 0, False, False
+ for i in range(start, len(t)):
+ char = t[i]
+ if escape: escape = False; continue
+ if char == '\\': escape = True; continue
+ if char == '"': in_str = not in_str; continue
+ if in_str: continue
+ if char == start_char: depth += 1
+ elif char == end_char:
+ depth -= 1
+ if depth == 0: return start, i
+ return -1, -1
+
+ obj_start, obj_end = extract_balanced(text, '{', '}')
Id on Platform: 1973
Category: 🐛 Bug Risk
File: .../splitter/metadata.py:120
Issue: In 'extract_signature', the code uses 'next()' with a generator to find the index of the current line in the 'lines' list. If the same line content (e.g., a common decorator or method name like 'def __init__') appears multiple times in the first 15 lines, 'idx' will always point to the first occurrence, potentially causing the signature to be built from the wrong context.
💡 Suggested Fix
Use 'enumerate' in the outer loop to get the current line index directly.
--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/metadata.py
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/metadata.py
@@ -112,2 +112,2 @@
- for line in lines[:15]:
- line = line.strip()
+ for idx, line in enumerate(lines[:15]):
+ line = line.strip()
@@ -125,3 +125,2 @@
- if ')' not in sig and ':' not in sig:
- idx = next((i for i, l in enumerate(lines) if l.strip() == line), -1)
- if idx >= 0:
- for next_line in lines[idx+1:idx+5]:
+ if ')' not in sig and ':' not in sig:
+ for next_line in lines[idx+1:idx+5]:
Id on Platform: 1976
Category: ⚡ Performance
File: .../service/JobService.java:383
Issue: Applying REQUIRES_NEW to all logging methods (info, warn, error, debug) will cause a new database transaction to be opened and committed for every single log entry. Under high load or within loops, this can lead to connection pool exhaustion and significant performance degradation.
💡 Suggested Fix
Consider using a non-blocking asynchronous logging mechanism or only applying REQUIRES_NEW to critical status changes and error logs, while keeping standard progress logs within the parent transaction or using a buffered approach.
No suggested fix provided
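As a rough sketch of the buffered approach suggested above (BufferedJobLogger and JobLogRepository are assumed names, not the project's real types), progress lines can be queued in memory and flushed in one transaction, while only error entries open their own REQUIRES_NEW transaction:
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Propagation;
import org.springframework.transaction.annotation.Transactional;

// Hypothetical sketch; JobLogRepository and its append method are assumptions.
// Simplified to a single job: a real version would key the buffer per job id.
@Service
class BufferedJobLogger {

    interface JobLogRepository {
        void append(long jobId, String level, String message);
    }

    private final Queue<String> buffer = new ConcurrentLinkedQueue<>();
    private final JobLogRepository repository;

    BufferedJobLogger(JobLogRepository repository) {
        this.repository = repository;
    }

    // Progress lines: no database transaction per call.
    public void info(long jobId, String message) {
        buffer.add(message);
    }

    // Critical entries still commit immediately in their own transaction.
    @Transactional(propagation = Propagation.REQUIRES_NEW)
    public void error(long jobId, String message) {
        repository.append(jobId, "ERROR", message);
    }

    // One transaction flushes the whole batch of buffered progress lines.
    @Transactional
    public void flush(long jobId) {
        String line;
        while ((line = buffer.poll()) != null) {
            repository.append(jobId, "INFO", line);
        }
    }
}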
Id on Platform: 1978
Category: 🐛 Bug Risk
File: .../service/PrFileEnrichmentService.java:1
Issue: The OkHttpClient is initialized in the constructor before @Value fields are injected. Consequently, the 'requestTimeoutSeconds' property is ignored, and the client always uses the hardcoded 60s timeout. Additionally, creating a new ObjectMapper instance manually bypasses Spring's auto-configuration (e.g., custom modules, naming strategies).
💡 Suggested Fix
Use constructor injection for the ObjectMapper and the timeout property to ensure they are correctly initialized and managed by Spring.
--- a/java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/service/PrFileEnrichmentService.java
+++ b/java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/service/PrFileEnrichmentService.java
@@ -42,16 +42,14 @@
@Value("${pr.enrichment.rag-pipeline-url:http://localhost:8006}")
private String ragPipelineUrl;
- @Value("${pr.enrichment.request-timeout-seconds:60}")
- private int requestTimeoutSeconds;
-
private final ObjectMapper objectMapper;
private final OkHttpClient httpClient;
- public PrFileEnrichmentService() {
- this.objectMapper = new ObjectMapper();
+ public PrFileEnrichmentService(
+ ObjectMapper objectMapper,
+ @Value("${pr.enrichment.request-timeout-seconds:60}") int requestTimeoutSeconds
+ ) {
+ this.objectMapper = objectMapper;
this.httpClient = new OkHttpClient.Builder()
.connectTimeout(30, TimeUnit.SECONDS)
- .readTimeout(60, TimeUnit.SECONDS)
+ .readTimeout(requestTimeoutSeconds, TimeUnit.SECONDS)
.writeTimeout(30, TimeUnit.SECONDS)
.build();
}
Id on Platform: 1979
Category: 🐛 Bug Risk
File: .../enrichment/FileContentDto.java:21
Issue: String.getBytes() uses the platform's default charset. This can cause inconsistent file size reporting if the analysis engine runs on systems with different default encodings (e.g., UTF-16 vs UTF-8). Since VCS content is typically UTF-8, it should be explicitly specified.
💡 Suggested Fix
Specify StandardCharsets.UTF_8 when calling getBytes().
--- a/java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/dto/request/ai/enrichment/FileContentDto.java
+++ b/java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/dto/request/ai/enrichment/FileContentDto.java
@@ -19,3 +19,3 @@
path,
content,
- content != null ? content.getBytes().length : 0,
+ content != null ? content.getBytes(java.nio.charset.StandardCharsets.UTF_8).length : 0,
false,
Id on Platform: 1983
Category: 🧹 Code Quality
File: .../routers/commands.py:1
Issue: The utility functions '_wants_streaming', '_json_event', and especially the complex '_drain_queue_until_final' logic are duplicated exactly in 'python-ecosystem/mcp-client/api/routers/review.py'.
💡 Suggested Fix
Extract shared streaming utilities and NDJSON helpers into a common utility module (e.g., 'utils/streaming.py') to improve maintainability and reduce code duplication.
No suggested fix provided
Id on Platform: 1984
Category: 🧹 Code Quality
File: .../routers/review.py:1
Issue: The utility functions '_wants_streaming', '_json_event', and '_drain_queue_until_final' are duplicated from 'python-ecosystem/mcp-client/api/routers/commands.py'.
💡 Suggested Fix
Extract shared streaming utilities into a common utility module.
No suggested fix provided
🔵 Low Severity Issues
Id on Platform: 1939
Category: ⚡ Performance
File: .../models/config.py:107
Issue: The 'chunk_size' has been increased from 800 to 8000 characters. While this keeps semantic units intact, it significantly increases the token count per retrieved context. If 'top_k' is high (e.g., 8 as seen in 'command_service.py'), this could lead to very large prompts that might exceed context limits or increase latency.
💡 Suggested Fix
Monitor token usage and consider if a slightly smaller chunk size (e.g., 4000) provides a better balance between semantic completeness and prompt efficiency.
No suggested fix provided
Id on Platform: 1944
Category: 🧹 Code Quality
File: .../model/multi_stage.py:84
Issue: The alias 'check_pass' is identical to the field name, making it redundant.
💡 Suggested Fix
Remove the redundant alias from the Field definition.
--- a/python-ecosystem/mcp-client/model/multi_stage.py
+++ b/python-ecosystem/mcp-client/model/multi_stage.py
@@ -83,3 +83,3 @@
class ImmutabilityCheck(BaseModel):
rule: str
- check_pass: bool = Field(alias="check_pass")
+ check_pass: bool
evidence: str
Id on Platform: 1946
Category: ✨ Best Practices
File: .../controller/ProjectController.java:592
Issue: The 'maxAnalysisTokenLimit' field in 'UpdateAnalysisSettingsRequest' lacks validation. A user could potentially submit a negative value or an excessively high value if not constrained by the service layer.
💡 Suggested Fix
Add validation annotations such as '@Min(0)' to the record field.
--- a/java-ecosystem/services/web-server/src/main/java/org/rostilos/codecrow/webserver/project/controller/ProjectController.java
+++ b/java-ecosystem/services/web-server/src/main/java/org/rostilos/codecrow/webserver/project/controller/ProjectController.java
@@ -591,7 +591,7 @@
Boolean prAnalysisEnabled,
Boolean branchAnalysisEnabled,
String installationMethod,
- Integer maxAnalysisTokenLimit
+ @jakarta.validation.constraints.Min(0) Integer maxAnalysisTokenLimit
) {}
Id on Platform: 1947
Category: 🐛 Bug Risk
File: .../queries/typescript.scm:58
Issue: The predicate (#not-match? @value "^arrow_function") matches against the literal source text of the node. Since an arrow function's source (e.g., '(x) => x') never starts with the string 'arrow_function', this filter does nothing. Consequently, arrow functions assigned to variables will be tagged as both @definition.function and @definition.variable.
💡 Suggested Fix
Remove the ineffective regex filter. To distinguish between functions and variables in lexical declarations, consider using more specific patterns or handling the overlap in the post-processing logic.
--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/queries/typescript.scm
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/queries/typescript.scm
@@ -55,7 +55,6 @@
(lexical_declaration
(variable_declarator
name: (identifier) @name
- value: (_) @value) @declarator
- (#not-match? @value "^arrow_function")) @definition.variable
+ value: (_) @value) @declarator) @definition.variable
; Ambient declarations
Id on Platform: 1948
Category: ✨ Best Practices
File: .../core/embedding_factory.py:35
Issue: Hardcoding 'timeout=120.0' (and 60.0 for OpenRouter) overrides the default logic in OllamaEmbedding/OpenRouterEmbedding which checks for environment variables like OLLAMA_TIMEOUT. This makes the system less configurable.
💡 Suggested Fix
Pass None for the timeout parameter to allow the embedding classes to use their internal default logic (which includes environment variable lookups).
--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/core/embedding_factory.py
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/core/embedding_factory.py
@@ -32,7 +32,7 @@
return OllamaEmbedding(
model=config.ollama_model,
base_url=config.ollama_base_url,
- timeout=120.0,
+ timeout=None,
expected_dim=config.embedding_dim
)
@@ -42,7 +42,7 @@
api_key=config.openrouter_api_key,
model=config.openrouter_model,
api_base=config.openrouter_base_url,
- timeout=60.0,
+ timeout=None,
max_retries=3,
expected_dim=config.embedding_dim
)
Id on Platform: 1952
Category: 🧹 Code Quality
File: .../model/output_schemas.py:1
Issue: The fields 'resolutionExplanation' and 'resolvedInCommit' are inconsistent with 'resolvedDescription' and 'resolvedByCommit' used in 'IssueDTO' (dtos.py). Aligning these names improves maintainability and reduces mapping errors during issue reconciliation.
💡 Suggested Fix
Rename the resolution tracking fields to match the naming convention used in IssueDTO.
--- a/python-ecosystem/mcp-client/model/output_schemas.py
+++ b/python-ecosystem/mcp-client/model/output_schemas.py
@@ -23,4 +23,4 @@
isResolved: bool = Field(default=False, description="Whether this issue from previous analysis is resolved")
# Resolution tracking fields
- resolutionExplanation: Optional[str] = Field(default=None, description="Explanation of how the issue was resolved (separate from original reason)")
- resolvedInCommit: Optional[str] = Field(default=None, description="Commit hash where the issue was resolved")
+ resolvedDescription: Optional[str] = Field(default=None, description="Explanation of how the issue was resolved (separate from original reason)")
+ resolvedByCommit: Optional[str] = Field(default=None, description="Commit hash where the issue was resolved")
Id on Platform: 1955
Category: ⚡ Performance
File: .../service/RagOperationsServiceImpl.java:1
Issue: When a branch has no differences compared to the base branch (e.g., a newly created branch), the method returns true without persisting a 'RagBranchIndex' record. Because 'ensureBranchIndexUpToDate' only skips indexing if this record exists, it will re-trigger the VCS diff fetch on every subsequent analysis request for this branch until a change is actually committed.
💡 Suggested Fix
Persist a RagBranchIndex record even when the diff is empty. This marks the branch as 'up-to-date' at the current commit hash, preventing redundant VCS calls in future checks.
--- a/java-ecosystem/libs/rag-engine/src/main/java/org/rostilos/codecrow/ragengine/service/RagOperationsServiceImpl.java
+++ b/java-ecosystem/libs/rag-engine/src/main/java/org/rostilos/codecrow/ragengine/service/RagOperationsServiceImpl.java
@@ -445,13 +445,22 @@
String rawDiff = vcsClient.getBranchDiff(workspaceSlug, repoSlug, baseBranch, targetBranch);
+ // Get latest commit hash on target branch
+ String targetCommit = vcsClient.getLatestCommitHash(workspaceSlug, repoSlug, targetBranch);
+
if (rawDiff == null || rawDiff.isEmpty()) {
log.info("No diff between '{}' and '{}' - branch has same content as base, using main index",
baseBranch, targetBranch);
+ // Persist record to avoid redundant diff checks
+ RagBranchIndex branchIndex = new RagBranchIndex();
+ branchIndex.setProjectId(project.getId());
+ branchIndex.setBranchName(targetBranch);
+ branchIndex.setCommitHash(targetCommit);
+ branchIndex.setUpdatedAt(OffsetDateTime.now());
+ ragBranchIndexRepository.save(branchIndex);
eventConsumer.accept(Map.of(
"type", "info",
"message", String.format("No changes between %s and %s - using main branch index", baseBranch, targetBranch)
));
return true;
}
-
- // Get latest commit hash on target branch
- String targetCommit = vcsClient.getLatestCommitHash(workspaceSlug, repoSlug, targetBranch);
Id on Platform: 1956
Category: 🐛 Bug Risk
File: .../vcsclient/VcsClient.java:247
Issue: The check 'content.length() <= maxFileSizeBytes' compares the number of UTF-16 characters to a byte limit. For non-ASCII characters, the byte size will exceed the character count, potentially allowing files larger than the intended limit. Furthermore, the file is fully fetched into memory before the check, which doesn't prevent memory pressure for large files.
💡 Suggested Fix
Use a more accurate byte-size check if the limit is strictly in bytes, and consider checking file size via metadata before fetching the full content if the provider supports it.
--- a/java-ecosystem/libs/vcs-client/src/main/java/org/rostilos/codecrow/vcsclient/VcsClient.java
+++ b/java-ecosystem/libs/vcs-client/src/main/java/org/rostilos/codecrow/vcsclient/VcsClient.java
@@ -238,5 +238,5 @@
String content = getFileContent(workspaceId, repoIdOrSlug, path, branchOrCommit);
- if (content != null && content.length() <= maxFileSizeBytes) {
+ if (content != null && content.getBytes(java.nio.charset.StandardCharsets.UTF_8).length <= maxFileSizeBytes) {
results.put(path, content);
}
Id on Platform: 1958
Category: ⚡ Performance
File: .../vcsclient/VcsClientProvider.java:419
Issue: A new 'OkHttpClient' instance is created for every GitLab token refresh call. This is inefficient as it prevents connection pooling and requires setting up new thread pools and connection structures for every request.
💡 Suggested Fix
Reuse a single 'OkHttpClient' instance (e.g., as a class field or a bean) instead of creating a new one in every method call.
No suggested fix provided
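A minimal sketch of the suggested reuse, assuming nothing about VcsClientProvider beyond what the finding states: one shared OkHttpClient is built once, and per-call variants derive from it so the connection pool and dispatcher are reused instead of rebuilt per request.
import java.util.concurrent.TimeUnit;
import okhttp3.OkHttpClient;

// Hypothetical sketch; field placement and naming are assumptions.
class HttpClientHolder {

    // A single shared instance keeps the connection pool and dispatcher warm.
    static final OkHttpClient SHARED = new OkHttpClient.Builder()
            .connectTimeout(30, TimeUnit.SECONDS)
            .readTimeout(30, TimeUnit.SECONDS)
            .build();

    // Per-call variations (e.g. a longer read timeout for token refresh) can
    // derive from the shared client without duplicating its pool.
    static OkHttpClient withReadTimeout(int seconds) {
        return SHARED.newBuilder().readTimeout(seconds, TimeUnit.SECONDS).build();
    }
}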
Id on Platform: 1961
Category: 🧹 Code Quality
File: .../webhookhandler/BitbucketCloudPullRequestWebhookHandler.java:1
Issue: The pull request ID is parsed from a string to a long multiple times within the same method. This is inefficient and redundant.
💡 Suggested Fix
Parse the pull request ID once at the beginning of the method and reuse the variable.
--- a/java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/bitbucket/webhookhandler/BitbucketCloudPullRequestWebhookHandler.java
+++ b/java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/bitbucket/webhookhandler/BitbucketCloudPullRequestWebhookHandler.java
@@ -102,8 +102,9 @@
private WebhookResult handlePullRequestEvent(WebhookPayload payload, Project project, Consumer<Map<String, Object>> eventConsumer) {
String placeholderCommentId = null;
String acquiredLockKey = null;
+ long prId = Long.parseLong(payload.pullRequestId());
try {
// Try to acquire lock atomically BEFORE posting placeholder
// This prevents race condition where multiple webhooks could post duplicate placeholders
String sourceBranch = payload.sourceBranch();
Optional<String> earlyLock = analysisLockService.acquireLock(
project, sourceBranch, AnalysisLockType.PR_ANALYSIS,
- payload.commitHash(), Long.parseLong(payload.pullRequestId()));
+ payload.commitHash(), prId);
if (earlyLock.isEmpty()) {
log.info("PR analysis already in progress for project={}, branch={}, PR={} - skipping duplicate webhook",
- project.getId(), sourceBranch, payload.pullRequestId());
+ project.getId(), sourceBranch, prId);
return WebhookResult.ignored("PR analysis already in progress for this branch");
}
@@ -120,7 +121,7 @@
// Lock acquired - placeholder posting is now protected from race conditions
// Post placeholder comment immediately to show analysis has started
- placeholderCommentId = postPlaceholderComment(project, Long.parseLong(payload.pullRequestId()));
+ placeholderCommentId = postPlaceholderComment(project, prId);
// Create PR analysis request
PrProcessRequest request = new PrProcessRequest();
@@ -128,7 +129,7 @@
request.sourceBranchName = sourceBranch;
request.targetBranchName = payload.targetBranch();
request.commitHash = payload.commitHash();
- request.pullRequestId = Long.parseLong(payload.pullRequestId());
+ request.pullRequestId = prId;
request.placeholderCommentId = placeholderCommentId;
request.prAuthorId = payload.prAuthorId();
request.prAuthorUsername = payload.prAuthorUsername();
Id on Platform: 1962
Category: 🧹 Code Quality
File: .../webhookhandler/GitHubPullRequestWebhookHandler.java:123
Issue: The pull request ID is parsed from a String to a Long multiple times (lines 132, 147, 155, and 196). This is redundant and slightly inefficient.
💡 Suggested Fix
Parse the pull request ID once at the beginning of the method and store it in a variable.
--- a/java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/github/webhookhandler/GitHubPullRequestWebhookHandler.java
+++ b/java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/github/webhookhandler/GitHubPullRequestWebhookHandler.java
@@ -122,12 +122,13 @@
) {
String placeholderCommentId = null;
String acquiredLockKey = null;
+ Long prId = Long.parseLong(payload.pullRequestId());
try {
// Try to acquire lock atomically BEFORE posting placeholder
// This prevents race condition where multiple webhooks could post duplicate placeholders
String sourceBranch = payload.sourceBranch();
Optional<String> earlyLock = analysisLockService.acquireLock(
project, sourceBranch, AnalysisLockType.PR_ANALYSIS,
- payload.commitHash(), Long.parseLong(payload.pullRequestId()));
+ payload.commitHash(), prId);
if (earlyLock.isEmpty()) {
log.info("PR analysis already in progress for project={}, branch={}, PR={} - skipping duplicate webhook",
- project.getId(), sourceBranch, payload.pullRequestId());
+ project.getId(), sourceBranch, prId);
return WebhookResult.ignored("PR analysis already in progress for this branch");
}
acquiredLockKey = earlyLock.get();
// Lock acquired - placeholder posting is now protected from race conditions
// Post placeholder comment immediately to show analysis has started
- placeholderCommentId = postPlaceholderComment(project, Long.parseLong(payload.pullRequestId()));
+ placeholderCommentId = postPlaceholderComment(project, prId);
PrProcessRequest request = new PrProcessRequest();
request.project = project;
- request.pullRequestId = Long.parseLong(payload.pullRequestId());
+ request.pullRequestId = prId;
request.sourceBranchName = sourceBranch;
request.targetBranchName = payload.targetBranch();
Id on Platform: 1980
Category: 🎨 Style
File: .../index_manager/collection_manager.py:15
Issue: The imports 'TextIndexParams' and 'TokenizerType' are added but not used anywhere in the file.
💡 Suggested Fix
Remove the unused imports to keep the code clean.
--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager/collection_manager.py
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager/collection_manager.py
@@ -12,3 +12,3 @@
from qdrant_client.models import (
- Distance, VectorParams,
- CreateAlias, DeleteAlias, CreateAliasOperation, DeleteAliasOperation,
- PayloadSchemaType, TextIndexParams, TokenizerType
+ Distance, VectorParams, CreateAlias, DeleteAlias,
+ CreateAliasOperation, DeleteAliasOperation, PayloadSchemaType
)
Id on Platform: 1981
Category: ⚡ Performance
File: .../orchestrator/orchestrator.py:74
Issue: In _index_pr_files, the code falls back to f.content if f.full_content is missing. In many diff processing contexts, f.content represents the raw diff string (with + and - markers). Indexing diff hunks into a RAG system instead of clean code significantly degrades embedding quality and semantic search accuracy.
💡 Suggested Fix
Ensure that only full file content is indexed, or strip diff markers before indexing if only the diff is available.
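The analysis attached no fix diff here (as noted below). A minimal sketch of the second option, assuming a hypothetical PR-file object exposing the full_content and content attributes referenced in the issue, could strip diff markers before indexing:
def _text_for_indexing(pr_file):
    """Return clean source text for RAG indexing, never raw diff hunks.

    Hypothetical helper; `full_content` / `content` mirror the attributes
    named in the finding and may differ in the real model.
    """
    if getattr(pr_file, "full_content", None):
        return pr_file.full_content  # prefer the full file snapshot
    diff_text = getattr(pr_file, "content", "") or ""
    kept = []
    for line in diff_text.splitlines():
        # Drop diff metadata and deleted lines entirely.
        if line.startswith(("@@", "+++", "---", "diff ", "index ", "-")):
            continue
        # Keep added and context lines, with the leading diff marker removed.
        kept.append(line[1:] if line.startswith(("+", " ")) else line)
    return "\n".join(kept)
If only diffs are ever available for some providers, an alternative is to skip indexing those files entirely rather than embed hunk text.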
No suggested fix provided
Id on Platform: 1982
Category: 🧹 Code Quality
File: .../orchestrator/reconciliation.py:43
Issue: The line grouping logic int(line) // 3 is unstable for issue tracking. An issue moving from line 5 to line 6 will change its fingerprint (group 1 to group 2), while an issue moving from line 3 to line 4 will not. This makes issue persistence tracking inconsistent across small code shifts.
💡 Suggested Fix
Consider using a more robust fuzzy matching for lines or rely more heavily on the issue reason/category for deduplication if the line number is within a certain threshold.
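No fix was attached here either (noted below). One hedged sketch, using hypothetical issue dictionaries with file, category, reason, and line keys, replaces the int(line) // 3 bucket with a line-distance threshold over otherwise-stable fields:
LINE_TOLERANCE = 5  # assumption: findings within 5 lines are treated as the same issue

def same_issue(existing: dict, candidate: dict, tolerance: int = LINE_TOLERANCE) -> bool:
    """Fuzzy-match two findings without bucketing line numbers."""
    return (
        existing["file"] == candidate["file"]
        and existing["category"] == candidate["category"]
        and existing["reason"] == candidate["reason"]
        and abs(int(existing["line"]) - int(candidate["line"])) <= tolerance
    )
Matching on the stable fields first keeps the fingerprint consistent even when surrounding edits shift the code by a few lines.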
No suggested fix provided
ℹ️ Informational Notes
Id on Platform: 1953
Category: 🎨 Style
File: .../splitter/languages.py:72
Issue: The extension '.xml' is mapped to 'Language.HTML'. While similar, LangChain's 'Language' enum contains a specific 'XML' member which would be more appropriate.
💡 Suggested Fix
Change the mapping for '.xml' to 'Language.XML'.
--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/languages.py
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/languages.py
@@ -73,1 +73,1 @@
- '.xml': Language.HTML,
+ '.xml': Language.XML,
Id on Platform: 1968
Category: 🧹 Code Quality
File: .../index_manager/manager.py:254
Issue: The call to 'list_all_indices' passes 'self._collection_manager.alias_exists' as an argument, but the receiving method in 'StatsManager' does not use this parameter.
💡 Suggested Fix
Remove the unused argument from the call to align with the suggested cleanup in 'StatsManager'.
--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager/manager.py
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager/manager.py
@@ -239,3 +239,1 @@
def list_indices(self) -> List[IndexStats]:
"""List all project indices with branch breakdown."""
- return self._stats_manager.list_all_indices(
- self._collection_manager.alias_exists
- )
+ return self._stats_manager.list_all_indices()Id on Platform: 1970
Category: 🧹 Code Quality
File: .../index_manager/stats_manager.py:102
Issue: The parameter 'alias_checker' is defined in 'list_all_indices' but never used within the function body.
💡 Suggested Fix
Remove the unused parameter from the function signature.
--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager/stats_manager.py
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager/stats_manager.py
@@ -100,1 +100,1 @@
- def list_all_indices(self, alias_checker) -> List[IndexStats]:
+ def list_all_indices(self) -> List[IndexStats]:
Files Affected
- .../splitter/languages.py: 2 issues
- .../command/command_service.py: 2 issues
- .../vcsclient/VcsClientProvider.java: 2 issues
- .../core/ollama_embedding.py: 2 issues
- .../webhookhandler/GitLabMergeRequestWebhookHandler.java: 2 issues
- .../webhookhandler/GitHubPullRequestWebhookHandler.java: 2 issues
- .../ai/AiAnalysisRequestImpl.java: 2 issues
- .../service/WorkspaceService.java: 1 issue
- .../queries/rust.scm: 1 issue
- .../routers/commands.py: 1 issue