
Conversation

@rostilos (Owner) commented Feb 8, 2026

No description provided.

- Updated JobService to use REQUIRES_NEW transaction propagation for deleting ignored jobs, ensuring fresh entity retrieval and preventing issues with the calling transaction.
- Removed token limitation from AI connection model and related DTOs, transitioning to project-level configuration for token limits.
- Adjusted AIConnectionDTO tests to reflect the removal of token limitation.
- Enhanced Bitbucket, GitHub, and GitLab AI client services to check token limits before analysis, throwing DiffTooLargeException when limits are exceeded.
- Updated command processors to utilize project-level token limits instead of AI connection-specific limits.
- Modified webhook processing to handle diff size issues gracefully, posting informative messages to VCS when analysis is skipped due to large diffs.
- Cleaned up integration tests to remove references to token limitation in AI connection creation and updates.
…sis processing. Project PR analysis max analysis token limit implementation
- Added AST-based code splitter using Tree-sitter for accurate code parsing.
- Introduced TreeSitterParser for dynamic language loading and caching.
- Created scoring configuration for RAG query result reranking with configurable boost factors and priority patterns.
- Refactored RAGQueryService to utilize the new scoring configuration for enhanced result ranking.
- Improved metadata extraction and handling for better context in scoring.
rostilos and others added 16 commits January 28, 2026 22:05
…model configurations

- Removed deprecated `_get_collection_name` method from RAGIndexManager.
- Updated imports in `models/__init__.py` to exclude DocumentMetadata.
- Changed default model in RAGConfig from "openai/text-embedding-3-small" to "qwen/qwen3-embedding-8b".
- Deleted DocumentMetadata class from RAGConfig.
- Removed WebhookIntegration service and its related methods.
- Cleaned up unused utility function `is_code_file` from utils.
- Updated tests to reflect the removal of DocumentMetadata and WebhookIntegration.
- Increased the queue capacity of the webhook executor to handle bursts of incoming requests.
- Updated the rejected execution handler to throw exceptions instead of blocking the caller, allowing for better error handling in webhook endpoints.
- Improved error logging in the WebhookAsyncProcessor for better traceability of job failures.
- Introduced an EntityManager to detach jobs from the persistence context, preventing overwriting of job statuses during async processing.
- Implemented a deduplication mechanism in the WebhookDeduplicationService to avoid processing duplicate commits within a defined time window.
- Enhanced the response parsing logic to ensure proper logging and handling of resolved issues.
- Streamlined the handling of PR and MR analysis in GitHub and GitLab webhook handlers, ensuring proper lock management and error handling.
- Added memory-efficient streaming methods for preserving and copying points in the BranchManager.
- Improved regex patterns in metadata extraction for various programming languages to account for leading whitespace.
- Updated scoring configuration to use word-boundary matching for file path patterns, enhancing accuracy in priority assignment.
- Replaced @PreAuthorize annotations with @IsWorkspaceMember and @HasOwnerOrAdminRights in JobController, AllowedCommandUserController, ProjectController, QualityGateController, and VCS controllers (BitbucketCloud, GitHub, GitLab).
- Improved readability and maintainability of security checks by utilizing custom annotations.
- Removed redundant security checks in various endpoints to streamline access control.
- Deleted unused _extract_archive function in web_server.py to clean up codebase.
- Updated import paths for IssueDTO in prompt_builder.py.
- Modified analysis_summary prompt in prompt_constants.py for clarity.
- Enhanced Dockerfile with CPU threading optimizations.
- Improved environment validation in main.py to support Ollama and OpenRouter embedding providers.
- Added new endpoints for parsing files and batch parsing in api.py, including AST metadata extraction.
- Introduced embedding_factory.py to manage embedding model creation for Ollama and OpenRouter.
- Implemented Ollama embedding wrapper in ollama_embedding.py for local model support.
- Updated index_manager to check vector dimensions before copying branches.
- Refactored RAGConfig to validate embedding provider configurations and auto-detect embedding dimensions.
- Adjusted query_service to utilize the new embedding factory for model instantiation.
…action management in JobService and WebhookAsyncProcessor
…s for CRUD operations. Cloud version preparations
…ersion for existing records; update database migration scripts for version handling and remove token limitation from ai_connection
@rostilos (Owner, Author) commented Feb 8, 2026

/codecrow analyze

@coderabbitai bot commented Feb 8, 2026

Important

Review skipped

Too many files!

This PR contains 197 files, which is 47 over the limit of 150.

You can disable this status message by setting reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch 1.3.0-rc

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

rostilos and others added 2 commits February 8, 2026 21:55
…ling

- Introduced MAX_CONCURRENT_REVIEWS to limit simultaneous review requests.
- Implemented asyncio.Semaphore to manage concurrent reviews.
- Added a timeout of 600 seconds for review processing to prevent long-running requests.
- Refactored LLM reranker initialization to be per-request, improving resource management.
- Ensured MCP sessions are closed after review processing to release resources.
- Enhanced error handling for timeouts and exceptions during review processing.

refactor: Simplify context builder and remove unused components

- Removed legacy context budget management and model context limits.
- Streamlined context builder utilities for RAG metrics and caching.
- Updated context fetching logic to align with new architecture.

fix: Update prompt templates for clarity and accuracy

- Revised Stage 2 cross-file prompt to focus on relevant aspects.
- Changed references from "Database Migrations" to "Migration Files" for consistency.

feat: Implement service-to-service authentication middleware

- Added ServiceSecretMiddleware to validate internal requests with a shared secret.
- Configured middleware to skip authentication for public endpoints.

enhance: Improve collection management with payload indexing

- Added functionality to create payload indexes for efficient filtering on common fields in Qdrant collections.

fix: Adjust query service to handle path prefix mismatches

- Updated fallback logic in RAGQueryService to improve handling of filename matches during queries.
feat: Enhance ReviewService with concurrency control and timeout hand…
@rostilos (Owner, Author) commented Feb 8, 2026

/codecrow analyze

@codecrow-ai bot deleted a comment from codecrow-local bot Feb 8, 2026
rostilos and others added 3 commits February 9, 2026 11:04
- Added DiffFingerprintUtil to compute a stable fingerprint for code changes in pull requests.
- Enhanced PullRequestAnalysisProcessor to utilize commit hash and diff fingerprint caches for reusing analysis results.
- Updated CodeAnalysis model to include a diff_fingerprint field for storage.
- Modified CodeAnalysisService to support retrieval and cloning of analyses based on diff fingerprints and commit hashes.
- Added database migrations to introduce the diff_fingerprint column and create necessary indexes.
- Improved error handling and logging in various components, including file existence checks in Bitbucket Cloud.
- Refactored tests to accommodate new functionality and ensure coverage for caching mechanisms.
…ysis-binding

feat: Implement diff fingerprint caching for pull request analysis
@codecrow-local bot deleted a comment from codecrow-ai bot Feb 10, 2026
@codecrow-local bot commented Feb 10, 2026

⚠️ Code Analysis Results

Summary

Pull Request Review: 1.3.0 rc

Status: FAIL
Risk Level: HIGH
Review Coverage: 23 files analyzed in depth
Confidence: HIGH

Executive Summary

This PR (1.3.0 rc) introduces significant architectural changes to the analysis engine, including a new diff fingerprinting mechanism for issue deduplication and a major refactoring of the webhook processing pipeline. While these changes aim to improve system scalability and data integrity, the current implementation introduces several high-risk regressions across security, database migration stability, and core RAG extraction logic.

Recommendation

Decision: FAIL

The PR is not suitable for merge in its current state due to critical blockers, including a "fail-open" authentication vulnerability in the Python middleware and conflicting database migrations that will cause deployment failures. Additionally, the RAG pipeline contains runtime incompatibilities with the Tree-sitter API that must be resolved to maintain core functionality.


Issues Overview

🔴 High: 7 (critical issues requiring immediate attention)
🟡 Medium: 27 (issues that should be addressed)
🔵 Low: 14 (minor issues and improvements)
ℹ️ Info: 3 (informational notes and suggestions)

Analysis completed on 2026-02-10 10:20:23 | View Full Report | Pull Request


📋 Detailed Issues (51)

🔴 High Severity Issues

Id on Platform: 1943

Category: 🔒 Security

File: .../api/api.py:30

Issue: The 'startswith' check on resolved paths is vulnerable to prefix attacks. For example, if '_ALLOWED_REPO_ROOT' is '/tmp', a path like '/tmp-secret/repo' would pass the validation because '/tmp-secret/repo'.startswith('/tmp') is true, even though it is outside the intended root.

💡 Suggested Fix

Use 'os.path.commonpath' to verify that the resolved path is actually within the allowed root directory.

--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/api/api.py
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/api/api.py
@@ -25,7 +25,8 @@
 def _validate_repo_path(path: str) -> str:
     """Validate that a repo path is within the allowed root and contains no traversal."""
+    root = os.path.realpath(_ALLOWED_REPO_ROOT)
     resolved = os.path.realpath(path)
-    if not resolved.startswith(os.path.realpath(_ALLOWED_REPO_ROOT)):
+    if os.path.commonpath([root, resolved]) != root:
         raise ValueError(
             f"Path must be under {_ALLOWED_REPO_ROOT}, got: {path}"
         )

View Issue Details


Id on Platform: 1950

Category: ⚡ Performance

File: .../core/ollama_embedding.py:240

Issue: The implementation uses a synchronous httpx.Client for all requests, and the async methods (_aget_text_embeddings, etc.) simply wrap these sync calls. In an asynchronous RAG pipeline, this blocks the event loop for the duration of the embedding request, which can take several seconds, effectively negating the benefits of async and causing the entire service to hang.

💡 Suggested Fix

Initialize and use an httpx.AsyncClient for the aget* methods to ensure non-blocking I/O.

--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/core/ollama_embedding.py
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/core/ollama_embedding.py
@@ -68,6 +68,10 @@
             base_url=base_url.rstrip('/'),
             timeout=timeout
         ))
+        object.__setattr__(self, '_aclient', httpx.AsyncClient(
+            base_url=base_url.rstrip('/'),
+            timeout=timeout
+        ))
 
         # Test connection
         self._test_connection()

View Issue Details


Id on Platform: 1959

Category: 🐛 Bug Risk

File: .../ai/AiAnalysisRequestImpl.java:1

Issue: The deduplication logic in 'withAllPrAnalysesData' fails to merge resolved status when the input list 'allPrAnalyses' is sorted DESC (newest first), as specified in the Javadoc. In DESC order, the first issue seen for a fingerprint is the newest. Subsequent (older) issues will have a lower 'currentVersion' than the 'existingVersion' in the map, causing the 'if (currentVersion > existingVersion)' block to be skipped. Consequently, if an older version was resolved but the newest is open, the resolved status is never propagated.

💡 Suggested Fix

Sort the input list in ascending order (oldest first) before processing, or adjust the logic to handle DESC order by checking if the older (current) issue is resolved when the existing (newer) one is not.

--- a/java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/dto/request/ai/AiAnalysisRequestImpl.java
+++ b/java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/dto/request/ai/AiAnalysisRequestImpl.java
@@ -310,7 +310,8 @@
             }
 
             // Convert all issues to DTOs
-            List<AiRequestPreviousIssueDTO> allIssues = allPrAnalyses.stream()
+            List<AiRequestPreviousIssueDTO> allIssues = allPrAnalyses.stream().sorted(java.util.Comparator.comparingInt(CodeAnalysis::getPrVersion))
                     .flatMap(analysis -> analysis.getIssues().stream())
                     .map(AiRequestPreviousIssueDTO::fromEntity)
                     .toList();

View Issue Details


Id on Platform: 1967

Category: 🐛 Bug Risk

File: .../index_manager/indexer.py:4

Issue: In 'index_repository', the loop processes batches and tracks 'failed_chunks'. However, if some chunks fail to index, the process continues and eventually performs an atomic swap, promoting an incomplete index to the production alias. This violates the integrity of the 'atomic swap' strategy which should ensure a complete and valid index is ready before swapping.

💡 Suggested Fix

Check the 'failed_chunks' count after the indexing loop. If failures occurred, raise an exception to trigger the cleanup logic and prevent the alias swap.

--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager/indexer.py
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager/indexer.py
@@ -215,6 +215,9 @@
                 f"Streaming indexing complete: {document_count} files, "
                 f"{successful_chunks}/{chunk_count} chunks indexed ({failed_chunks} failed)"
             )
+
+            if failed_chunks > 0:
+                raise Exception(f"Indexing failed for {failed_chunks} chunks. Aborting swap.")
 
             # Verify and perform atomic swap
             temp_info = self.point_ops.client.get_collection(temp_collection_name)

View Issue Details


Id on Platform: 1974

Category: 🐛 Bug Risk

File: .../splitter/query_runner.py:2

Issue: The code attempts to import and use 'QueryCursor', which was removed in 'tree-sitter' version 0.22.0. Additionally, it attempts to unpack matches as tuples, but in the new API, 'query.matches()' returns 'QueryMatch' objects. This will cause an 'ImportError' or 'TypeError' at runtime.

💡 Suggested Fix

Update the query execution logic to use the 0.22+ API: call 'query.matches(node)' directly and access the 'captures' attribute on the resulting match objects.

--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/query_runner.py
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/query_runner.py
@@ -193,11 +193,7 @@
-        try:
-            # Use QueryCursor.matches() for pattern-grouped results
-            # Each match is (pattern_id, {capture_name: [nodes]})
-            from tree_sitter import QueryCursor
-            cursor = QueryCursor(query)
-            raw_matches = list(cursor.matches(tree.root_node))
-        except Exception as e:
-            logger.warning(f"Query execution failed for {lang_name}: {e}")
-            return []
+        try:
+            raw_matches = query.matches(tree.root_node)
+        except Exception as e:
+            logger.warning(f"Query execution failed for {lang_name}: {e}")
+            return []
 
         results: List[QueryMatch] = []
 
-        for pattern_id, captures_dict in raw_matches:
+        for match_obj in raw_matches:
+            captures_dict = match_obj.captures

View Issue Details


Id on Platform: 1975

Category: 🔒 Security

File: .../service/WorkspaceService.java:1

Issue: The current logic prevents an ADMIN from promoting someone to ADMIN or OWNER, but it does not prevent an ADMIN from demoting an existing OWNER or another ADMIN. Since the check only looks at 'newRole', an ADMIN could change an OWNER's role to MEMBER.

💡 Suggested Fix

Update the security check to ensure that users with the ADMIN role cannot modify the roles of users who are currently ADMINs or OWNERs.

--- a/java-ecosystem/services/web-server/src/main/java/org/rostilos/codecrow/webserver/workspace/service/WorkspaceService.java
+++ b/java-ecosystem/services/web-server/src/main/java/org/rostilos/codecrow/webserver/workspace/service/WorkspaceService.java
@@ -123,10 +123,13 @@
             throw new SecurityException("The current user does not have editing privileges.");
         }
         WorkspaceMember workspaceMemberActor = optionalWorkspaceMemberActor.get();
-        if (workspaceMemberActor.getRole() == EWorkspaceRole.ADMIN
-                && (newRole == EWorkspaceRole.ADMIN || newRole == EWorkspaceRole.OWNER)) {
-            throw new SecurityException("The current user does not have editing privileges.");
-        }
 
         WorkspaceMember workspaceMember = workspaceMemberRepository
                 .findByWorkspaceIdAndUserId(workspaceId, target.getId())
                 .orElseThrow(() -> new NoSuchElementException("The user is not present in the workspace."));
+
+        if (workspaceMemberActor.getRole() == EWorkspaceRole.ADMIN) {
+            if (newRole == EWorkspaceRole.ADMIN || newRole == EWorkspaceRole.OWNER ||
+                workspaceMember.getRole() == EWorkspaceRole.ADMIN || workspaceMember.getRole() == EWorkspaceRole.OWNER) {
+                throw new SecurityException("The current user does not have editing privileges.");
+            }
+        }

View Issue Details


Id on Platform: 1977

Category: 🐛 Bug Risk

File: .../1.3.0/V1.3.0__remove_token_limitation_from_ai_connection.sql:1

Issue: This migration script is identical to V1.4.0__remove_token_limitation_from_ai_connection.sql which is also included in this PR. Having two migrations performing the same schema change with different version numbers will cause conflicts in the schema history table and potentially break the deployment.

💡 Suggested Fix

Remove the duplicate migration file. Ensure the change is only present in the correct versioned directory (likely 1.4.0 based on the project's current progression).

No suggested fix provided

View Issue Details


🟡 Medium Severity Issues

Id on Platform: 1934

Category: 🔒 Security

File: .../api/middleware.py:37

Issue: The secret comparison uses a standard equality operator ('=='), which is vulnerable to timing attacks. Additionally, the middleware 'fails open' if the SERVICE_SECRET environment variable is not set, allowing all requests to bypass authentication.

💡 Suggested Fix

Use 'secrets.compare_digest' for constant-time comparison to prevent timing attacks. Also, consider requiring the secret to be present in production environments to avoid accidental 'fail-open' scenarios.

--- a/python-ecosystem/mcp-client/api/middleware.py
+++ b/python-ecosystem/mcp-client/api/middleware.py
@@ -6,6 +6,7 @@
 """
 import os
 import logging
+import secrets
 
 from starlette.middleware.base import BaseHTTPMiddleware
@@ -34,7 +35,7 @@
             return await call_next(request)
 
         provided = request.headers.get("x-service-secret", "")
-        if provided != self.secret:
+        if not secrets.compare_digest(provided, self.secret):
             logger.warning(
                 f"Unauthorized request to {request.url.path} from {request.client.host if request.client else 'unknown'}"
             )

View Issue Details


Id on Platform: 1935

Category: 🔒 Security

File: .../api/middleware.py:37

Issue: Vulnerable to timing attacks due to standard string comparison and 'fails open' if SERVICE_SECRET is missing.

💡 Suggested Fix

Use 'secrets.compare_digest' for secret validation.

--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/api/middleware.py
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/api/middleware.py
@@ -6,6 +6,7 @@
 """
 import os
 import logging
+import secrets
 
 from starlette.middleware.base import BaseHTTPMiddleware
@@ -34,7 +35,7 @@
             return await call_next(request)
 
         provided = request.headers.get("x-service-secret", "")
-        if provided != self.secret:
+        if not secrets.compare_digest(provided, self.secret):
             logger.warning(
                 f"Unauthorized request to {request.url.path} from {request.client.host if request.client else 'unknown'}"
             )

View Issue Details


Id on Platform: 1936

Category: 🛡️ Error Handling

File: .../command/command_service.py

Issue: In 'process_summarize', the MCP client is instantiated, but the 'finally' block that ensures sessions are closed only covers the '_execute_summarize' call. If a timeout or exception occurs during '_fetch_rag_context_for_summarize' or prompt building, 'client.close_all_sessions()' will never be called, potentially leaking JVM subprocesses.

💡 Suggested Fix

Wrap the entire lifecycle of the MCP client in a try...finally block.

--- a/python-ecosystem/mcp-client/service/command/command_service.py
+++ b/python-ecosystem/mcp-client/service/command/command_service.py
@@ -84,13 +84,13 @@
             })
             client = self._create_mcp_client(config)
-            llm = self._create_llm(request)
-
-            # Fetch RAG context
-            rag_context = await self._fetch_rag_context_for_summarize(request, event_callback)
-
-            # Build prompt
-            prompt = self._build_summarize_prompt(request, rag_context)
-
-            self._emit_event(event_callback, {
-                "type": "status",
-                "state": "generating",
-                "message": "Generating PR summary with AI"
-            })
-
-            # Execute with MCP agent
             try:
+                llm = self._create_llm(request)
+                rag_context = await self._fetch_rag_context_for_summarize(request, event_callback)
+                prompt = self._build_summarize_prompt(request, rag_context)
+                self._emit_event(event_callback, {
+                    "type": "status",
+                    "state": "generating",
+                    "message": "Generating PR summary with AI"
+                })
                 result = await self._execute_summarize(

View Issue Details


Id on Platform: 1937

Category: 🛡️ Error Handling

File: .../command/command_service.py

Issue: In 'process_ask', the MCP client is instantiated, but the 'finally' block that ensures sessions are closed only covers the '_execute_ask' call. If a timeout or exception occurs during '_fetch_rag_context_for_ask' or prompt building, 'client.close_all_sessions()' will never be called.

💡 Suggested Fix

Wrap the entire lifecycle of the MCP client in a try...finally block.

--- a/python-ecosystem/mcp-client/service/command/command_service.py
+++ b/python-ecosystem/mcp-client/service/command/command_service.py
@@ -184,12 +184,12 @@
             client = self._create_mcp_client(config)
-            llm = self._create_llm(request)
-
-            # Fetch RAG context for the question
-            rag_context = await self._fetch_rag_context_for_ask(request, event_callback)
-
-            # Build prompt with Platform MCP tools if available
-            prompt = self._build_ask_prompt(request, rag_context, has_platform_mcp=include_platform)
-
-            self._emit_event(event_callback, {
-                "type": "status",
-                "state": "generating",
-                "message": "Generating answer with AI"
-            })
-
-            # Execute with MCP agent
             try:
+                llm = self._create_llm(request)
+                rag_context = await self._fetch_rag_context_for_ask(request, event_callback)
+                prompt = self._build_ask_prompt(request, rag_context, has_platform_mcp=include_platform)
+                self._emit_event(event_callback, {
+                    "type": "status",
+                    "state": "generating",
+                    "message": "Generating answer with AI"
+                })
                 result = await self._execute_ask(

View Issue Details


Id on Platform: 1938

Category: 🛡️ Error Handling

File: .../review/review_service.py

Issue: The MCP client is created, but the cleanup logic ('close_all_sessions') is only triggered if the code reaches the inner 'try...finally' block. If a timeout occurs during 'self._fetch_rag_context' or orchestrator initialization, the sessions will remain open, leading to leaked JVM processes.

💡 Suggested Fix

Move the 'try...finally' block to immediately follow the MCP client creation to ensure cleanup regardless of where a timeout or error occurs within the request lifecycle.

--- a/python-ecosystem/mcp-client/service/review/review_service.py
+++ b/python-ecosystem/mcp-client/service/review/review_service.py
@@ -158,16 +158,16 @@
             client = self._create_mcp_client(config)
-            
-            # Create LLM instance
-            llm = self._create_llm(request)
-            
-            # Create a per-request reranker (not shared across concurrent requests)
-            llm_reranker = LLMReranker(llm_client=llm)
-
-            # Fetch RAG context if enabled
-            rag_context = await self._fetch_rag_context(request, event_callback, llm_reranker=llm_reranker)
-
-            # Build processed_diff if rawDiff is available to optimize Stage 1
-            processed_diff = None
-            if has_raw_diff:
-                processed_diff = self._process_raw_diff(request.rawDiff)
-
-            # Initialize Multi-Stage Orchestrator
-            orchestrator = MultiStageReviewOrchestrator(
-                llm=llm,
-                mcp_client=client,
-                project_id=request.projectId,
-                event_callback=event_callback
-            )
-
             try:
+                llm = self._create_llm(request)
+                llm_reranker = LLMReranker(llm_client=llm)
+                rag_context = await self._fetch_rag_context(request, event_callback, llm_reranker=llm_reranker)
+                processed_diff = None
+                if has_raw_diff:
+                    processed_diff = self._process_raw_diff(request.rawDiff)
+                orchestrator = MultiStageReviewOrchestrator(
+                    llm=llm,
+                    mcp_client=client,
+                    project_id=request.projectId,
+                    event_callback=event_callback
+                )
                 # Check for Branch Analysis / Reconciliation mode

View Issue Details


Id on Platform: 1940

Category: 🐛 Bug Risk

File: .../client/RagPipelineClient.java:214

Issue: The 'branch' parameter in 'deleteIndex' is appended to the URL without encoding. Branch names frequently contain special characters like slashes (e.g., 'feature/fix'), which will break the URL structure. In contrast, 'deleteBranch' correctly encodes the branch name.

💡 Suggested Fix

Use URLEncoder to encode the branch name before formatting the URL, similar to the implementation in 'deleteBranch'.

--- a/java-ecosystem/libs/rag-engine/src/main/java/org/rostilos/codecrow/ragengine/client/RagPipelineClient.java
+++ b/java-ecosystem/libs/rag-engine/src/main/java/org/rostilos/codecrow/ragengine/client/RagPipelineClient.java
@@ -211,7 +211,8 @@
             return;
         }
 
-        String url = String.format("%s/index/%s/%s/%s", ragApiUrl, workspace, project, branch);
+        String encodedBranch = java.net.URLEncoder.encode(branch, java.nio.charset.StandardCharsets.UTF_8);
+        String url = String.format("%s/index/%s/%s/%s", ragApiUrl, workspace, project, encodedBranch);
         Request.Builder builder = new Request.Builder()
                 .url(url)
                 .delete();

View Issue Details


Id on Platform: 1941

Category: 🛡️ Error Handling

File: .../actions/CheckFileExistsInBranchAction.java

Issue: The retry logic within the 'catch (IOException e)' block lacks a backoff delay (Thread.sleep). If an IOException is thrown that contains '429' in its message (e.g., by a custom interceptor or a specific client implementation), the code will immediately retry the request without waiting, which is counter-productive when dealing with rate limits.

💡 Suggested Fix

Add a Thread.sleep(backoffMs) call before 'continue' in the catch block to ensure the rate limit is respected during retries.

--- a/java-ecosystem/libs/vcs-client/src/main/java/org/rostilos/codecrow/vcsclient/bitbucket/cloud/actions/CheckFileExistsInBranchAction.java
+++ b/java-ecosystem/libs/vcs-client/src/main/java/org/rostilos/codecrow/vcsclient/bitbucket/cloud/actions/CheckFileExistsInBranchAction.java
@@ -94,6 +94,12 @@
             } catch (IOException e) {
                 if (attempt < MAX_RETRIES && e.getMessage() != null && e.getMessage().contains("429")) {
+                    try {
+                        Thread.sleep(backoffMs);
+                    } catch (InterruptedException ie) {
+                        Thread.currentThread().interrupt();
+                        throw new IOException("Interrupted during retry backoff", ie);
+                    }
                     attempt++;
                     backoffMs *= 2;
                     continue;

View Issue Details


Id on Platform: 1942

Category: 🐛 Bug Risk

File: .../github/GitHubClient.java:966

Issue: The GraphQL 'expression' string is constructed by only escaping double quotes. If a file path contains a backslash (\), it will act as an escape character in the GraphQL query string, potentially leading to malformed queries or incorrect path resolution (e.g., '\n' being interpreted as a newline).

💡 Suggested Fix

Escape backslashes in the expression string before inserting it into the GraphQL query.

--- a/java-ecosystem/libs/vcs-client/src/main/java/org/rostilos/codecrow/vcsclient/github/GitHubClient.java
+++ b/java-ecosystem/libs/vcs-client/src/main/java/org/rostilos/codecrow/vcsclient/github/GitHubClient.java
@@ -964,3 +964,3 @@
                 String expression = branchOrCommit + ":" + path;
                 queryBuilder.append(alias).append(": object(expression: \"")
-                        .append(expression.replace("\"", "\\\""))
+                        .append(expression.replace("\\", "\\\\").replace("\"", "\\\""))
                         .append("\") { ... on Blob { text byteSize } } ");

View Issue Details


Id on Platform: 1945

Category: 🐛 Bug Risk

File: .../queries/rust.scm:12

Issue: The queries for impl_item use (type_identifier) for the type and trait fields. In Rust, these fields can contain complex types such as generic types (Vec<T>), scoped types (std::collections::HashMap), or references (&str). Using (type_identifier) restricts matches to simple, non-generic, local types only, causing the RAG pipeline to miss a significant portion of implementation logic.

💡 Suggested Fix

Use the wildcard (_) or remove the node type restriction to allow matching any valid type node within the impl block.

--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/queries/rust.scm
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/queries/rust.scm
@@ -21,8 +21,8 @@
 
 ; Implementation blocks (simple type impl)
 (impl_item
-  type: (type_identifier) @name) @definition.impl
+  type: (_) @name) @definition.impl
 
 ; Implementation blocks for trait (impl Trait for Type)
 (impl_item
-  trait: (type_identifier) @trait_name
-  type: (type_identifier) @name) @definition.impl
+  trait: (_) @trait_name
+  type: (_) @name) @definition.impl

View Issue Details


Id on Platform: 1949

Category: ⚡ Performance

File: .../core/ollama_embedding.py:73

Issue: Calling _test_connection() in the constructor performs a blocking network request. This can cause significant delays or timeouts during application startup or object instantiation if the Ollama server is slow or unreachable.

💡 Suggested Fix

Remove the connection test from the constructor. Connection health should be checked lazily or via a dedicated health check method.

--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/core/ollama_embedding.py
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/core/ollama_embedding.py
@@ -70,8 +70,6 @@
             timeout=timeout
         ))
 
-        # Test connection
-        self._test_connection()
         logger.info(f"Ollama embeddings initialized successfully")
 
     def _test_connection(self):

View Issue Details


Id on Platform: 1951

Category: 🐛 Bug Risk

File: .../model/dtos.py:21

Issue: The 'line' field is defined as Optional[int], but the corresponding 'CodeReviewIssue' in 'output_schemas.py' defines it as a string to support ranges (e.g., '42-45'). If a previous issue containing a range is passed back to the client, Pydantic validation will fail for IssueDTO.

💡 Suggested Fix

Change the type of 'line' to Optional[str] to maintain consistency with the output schema and support line ranges.

--- a/python-ecosystem/mcp-client/model/dtos.py
+++ b/python-ecosystem/mcp-client/model/dtos.py
@@ -19,3 +19,3 @@
     suggestedFixDiff: Optional[str] = None  # Diff for suggested fix (from Java)
     file: Optional[str] = None
-    line: Optional[int] = None
+    line: Optional[str] = None
     branch: Optional[str] = None

View Issue Details


Id on Platform: 1954

Category: 🐛 Bug Risk

File: .../splitter/languages.py:1

Issue: Inconsistency in language support: 'Language.COBOL' is included in 'AST_SUPPORTED_LANGUAGES' but missing from 'LANGUAGE_TO_TREESITTER'. Additionally, several languages (Kotlin, Scala, Lua, Perl, Swift, Haskell, Cobol) are missing from 'TREESITTER_MODULES', which will cause 'TreeSitterParser.get_language' to return None even when 'is_ast_supported' returns True.

💡 Suggested Fix

Ensure all languages in 'AST_SUPPORTED_LANGUAGES' have corresponding entries in 'LANGUAGE_TO_TREESITTER' and 'TREESITTER_MODULES'.

--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/languages.py
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/languages.py
@@ -101,4 +101,5 @@
     Language.SWIFT: 'swift',
     Language.HASKELL: 'haskell',
+    Language.COBOL: 'cobol',
 }

View Issue Details


Id on Platform: 1957

Category: ⚡ Performance

File: .../vcsclient/VcsClientProvider.java:213

Issue: The 'refreshToken' method is marked with '@Transactional' and performs network I/O (OAuth token refresh). Holding a database transaction open during network calls is a performance anti-pattern that can lead to connection pool exhaustion and deadlocks, especially with the 30-second timeout. If the network is slow, the DB connection is held unnecessarily.

💡 Suggested Fix

Refactor the method to perform the network call outside of the transaction. Only the final step of updating the database with the new tokens should be wrapped in a transaction.

No suggested fix provided
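
A rough sketch of that split, assuming Spring declarative transactions (the repository, OAuth client, and DTO names below are placeholders, not the actual VcsClientProvider members): the refresh HTTP call runs with no transaction open, and only the token update runs in a short REQUIRES_NEW transaction on a separate bean so the transactional proxy is actually applied.

import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Propagation;
import org.springframework.transaction.annotation.Transactional;

@Service
class VcsTokenRefresher {

    private final VcsConnectionRepository connections; // placeholder repository
    private final OAuthHttpClient oauthClient;         // placeholder HTTP client
    private final VcsTokenWriter tokenWriter;          // separate bean so @Transactional is proxied

    VcsTokenRefresher(VcsConnectionRepository connections,
                      OAuthHttpClient oauthClient,
                      VcsTokenWriter tokenWriter) {
        this.connections = connections;
        this.oauthClient = oauthClient;
        this.tokenWriter = tokenWriter;
    }

    // No @Transactional here: the slow OAuth call holds no DB connection.
    void refreshToken(long connectionId) {
        VcsConnection connection = connections.findById(connectionId).orElseThrow();
        OAuthTokens tokens = oauthClient.refresh(connection.getRefreshToken()); // network I/O
        tokenWriter.saveTokens(connectionId, tokens);                           // short DB write only
    }
}

@Service
class VcsTokenWriter {

    private final VcsConnectionRepository connections;

    VcsTokenWriter(VcsConnectionRepository connections) {
        this.connections = connections;
    }

    @Transactional(propagation = Propagation.REQUIRES_NEW)
    void saveTokens(long connectionId, OAuthTokens tokens) {
        VcsConnection connection = connections.findById(connectionId).orElseThrow();
        connection.setAccessToken(tokens.accessToken());
        connection.setRefreshToken(tokens.refreshToken());
        connections.save(connection);
    }
}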

View Issue Details


Id on Platform: 1960

Category: 🧹 Code Quality

File: .../ai/AiAnalysisRequestImpl.java:364

Issue: The line grouping logic 'issue.line() / 3' does not implement the '±3 tolerance' mentioned in the Javadoc. This creates fixed buckets (e.g., lines 1-2 in bucket 0, lines 3-5 in bucket 1). Adjacent lines like 2 and 3 will fall into different buckets and fail to match, while lines 3 and 5 will match despite being further apart. This makes issue tracking across commits jittery.

💡 Suggested Fix

Consider using a larger bucket or a more flexible matching algorithm if true tolerance is required. At minimum, update the Javadoc to reflect that it uses fixed buckets of size 3.

--- a/java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/dto/request/ai/AiAnalysisRequestImpl.java
+++ b/java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/dto/request/ai/AiAnalysisRequestImpl.java
@@ -361,7 +361,7 @@
         private String computeIssueFingerprint(AiRequestPreviousIssueDTO issue) {
             String file = issue.file() != null ? issue.file() : "";
-            // Normalize line to nearest multiple of 3 for tolerance
-            int lineGroup = issue.line() != null ? (issue.line() / 3) : 0;
+            // Normalize line to buckets of 5 for better tolerance across small shifts
+            int lineGroup = issue.line() != null ? (issue.line() / 5) : 0;
             String severity = issue.severity() != null ? issue.severity() : "";

View Issue Details


Id on Platform: 1963

Category: 🐛 Bug Risk

File: .../webhookhandler/GitHubPullRequestWebhookHandler.java:198

Issue: Calling postAnalysisResults with a null codeAnalysis parameter is likely to cause a NullPointerException in the reporting service. As seen in GitHubReportingService.java, the implementation attempts to generate an analysis summary from the provided analysis object without checking for null.

💡 Suggested Fix

Avoid passing null to postAnalysisResults. Instead, consider adding a dedicated error reporting method to the VcsReportingService or passing a minimal CodeAnalysis object that represents the error state.

No suggested fix provided
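
One hedged shape for this, assuming VcsReportingService is the shared interface the reporting services implement (the postAnalysisError method and the PullRequestInfo parameter type are illustrative, not existing API):

public interface VcsReportingService {

    // Existing happy path: requires a real analysis object.
    void postAnalysisResults(Project project, PullRequestInfo pullRequest, CodeAnalysis analysis);

    // Dedicated error path: posts a short failure comment to the PR without a CodeAnalysis,
    // so handlers never need to call postAnalysisResults(project, pr, null).
    void postAnalysisError(Project project, PullRequestInfo pullRequest, String errorMessage);
}

The catch blocks in the GitHub and GitLab handlers would then call postAnalysisError(...) with the exception message instead of passing a null analysis.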

View Issue Details


Id on Platform: 1964

Category: 🧹 Code Quality

File: .../webhookhandler/GitLabMergeRequestWebhookHandler.java:189

Issue: The GitLab handler is inconsistent with the GitHub handler as it lacks specific catch blocks for DiffTooLargeException and AnalysisLockedException. This results in these recoverable exceptions being logged as errors and prevents them from being re-thrown for proper handling by the WebhookAsyncProcessor.

💡 Suggested Fix

Add a specific catch block for DiffTooLargeException and AnalysisLockedException that releases the lock and re-throws the exception, matching the behavior in GitHubPullRequestWebhookHandler. Ensure the exceptions are imported.

--- a/java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/gitlab/webhookhandler/GitLabMergeRequestWebhookHandler.java
+++ b/java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/gitlab/webhookhandler/GitLabMergeRequestWebhookHandler.java
@@ -187,6 +187,13 @@
             
             return WebhookResult.success("MR analysis completed", result);
             
+        } catch (DiffTooLargeException | AnalysisLockedException e) {
+            if (acquiredLockKey != null) {
+                analysisLockService.releaseLock(acquiredLockKey);
+            }
+            log.warn("MR analysis failed with recoverable exception for project {}: {}", project.getId(), e.getMessage());
+            throw e;
         } catch (Exception e) {
             log.error("MR analysis failed for project {}", project.getId(), e);
             // Release the lock since processor won't take ownership

View Issue Details


Id on Platform: 1965

Category: 🐛 Bug Risk

File: .../webhookhandler/GitLabMergeRequestWebhookHandler.java:197

Issue: Calling postAnalysisResults with a null codeAnalysis parameter is likely to cause a NullPointerException in the reporting service, similar to the issue in the GitHub handler.

💡 Suggested Fix

Avoid passing null to postAnalysisResults. Use a dedicated error reporting mechanism or a fallback analysis object.

No suggested fix provided

View Issue Details


Id on Platform: 1966

Category: ⚡ Performance

File: .../index_manager/branch_manager.py:1

Issue: The 'get_indexed_branches' method performs a full scroll of the collection to identify unique branch names. In collections with millions of points, this will result in thousands of network requests and significant latency, as it iterates through every point's payload.

💡 Suggested Fix

Increase the scroll limit to reduce round trips and consider caching the branch list or storing it in a dedicated metadata collection/point to avoid full scans.

--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager/branch_manager.py
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager/branch_manager.py
@@ -75,7 +75,7 @@
         try:
             branches: Set[str] = set()
             offset = None
-            limit = 100
+            limit = 1000
             
             while True:
                 results = self.client.scroll(

View Issue Details


Id on Platform: 1969

Category: 🛡️ Error Handling

File: .../index_manager/point_operations.py:1

Issue: The 'upsert_points' method catches all exceptions during batch upserts and simply logs them, continuing with the next batch. While this provides resilience, it leads to silent data loss in the index if the caller does not explicitly check the failure count.

💡 Suggested Fix

Consider re-raising the exception or ensuring the caller (like RepositoryIndexer) strictly validates the success of all batches before proceeding with alias swaps.

No suggested fix provided

View Issue Details


Id on Platform: 1971

Category: 🧹 Code Quality

File: .../orchestrator/context_helpers.py:1

Issue: The regex patterns for symbol extraction are too restrictive. The CamelCase pattern [A-Z][a-z]+[A-Z][a-zA-Z]* requires at least two uppercase letters with lowercase in between (e.g., 'MyClass' matches, but 'User' or 'Data' do not). Similarly, the snake_case pattern requires at least one underscore, missing single-word variables like 'result' or 'items'.

💡 Suggested Fix

Update the regex patterns to capture single-word capitalized identifiers and single-word lowercase identifiers (with length filters).

--- a/python-ecosystem/mcp-client/service/review/orchestrator/context_helpers.py
+++ b/python-ecosystem/mcp-client/service/review/orchestrator/context_helpers.py
@@ -40,8 +40,8 @@
-    camel_case = re.findall(r'\b([A-Z][a-z]+[A-Z][a-zA-Z]*)\b', diff_content)
-    symbols.update(camel_case)
-    
-    # Match snake_case identifiers (variables, functions)
-    snake_case = re.findall(r'\b([a-z][a-z0-9]*(?:_[a-z0-9]+)+)\b', diff_content)
+    # Match Capitalized identifiers (Classes, Components)
+    capitalized = re.findall(r'\b([A-Z][a-zA-Z0-9]+)\b', diff_content)
+    symbols.update(capitalized)
+    
+    # Match lowercase identifiers (variables, functions)
+    identifiers = re.findall(r'\b([a-z][a-z0-9_]+)\b', diff_content)
+    symbols.update(s for s in identifiers if len(s) > 4)

View Issue Details


Id on Platform: 1972

Category: 🐛 Bug Risk

File: .../orchestrator/json_utils.py:10

Issue: The use of text.find('{') and text.rfind('}') to extract JSON is fragile. If the LLM output contains multiple JSON objects (e.g., 'Config: { ... } Result: { ... }'), this logic will extract everything from the first '{' to the last '}', resulting in invalid JSON.

💡 Suggested Fix

Implement a balanced brace extraction algorithm to find the first complete JSON object, similar to the _extract_json_object method seen in python-ecosystem/mcp-client/service/command_service.py.

--- a/python-ecosystem/mcp-client/service/review/orchestrator/json_utils.py
+++ b/python-ecosystem/mcp-client/service/review/orchestrator/json_utils.py
@@ -130,2 +130,19 @@
-    obj_start = text.find("{")
-    obj_end = text.rfind("}")
+    def extract_balanced(t, start_char='{', end_char='}'):
+        start = t.find(start_char)
+        if start == -1: return -1, -1
+        depth, in_str, escape = 0, False, False
+        for i in range(start, len(t)):
+            char = t[i]
+            if escape: escape = False; continue
+            if char == '\\': escape = True; continue
+            if char == '"': in_str = not in_str; continue
+            if in_str: continue
+            if char == start_char: depth += 1
+            elif char == end_char:
+                depth -= 1
+                if depth == 0: return start, i
+        return -1, -1
+
+    obj_start, obj_end = extract_balanced(text, '{', '}')

View Issue Details


Id on Platform: 1973

Category: 🐛 Bug Risk

File: .../splitter/metadata.py:120

Issue: In 'extract_signature', the code uses 'next()' with a generator to find the index of the current line in the 'lines' list. If the same line content (e.g., a common decorator or method name like 'def __init__') appears multiple times in the first 15 lines, 'idx' will always point to the first occurrence, potentially causing the signature to be built from the wrong context.

💡 Suggested Fix

Use 'enumerate' in the outer loop to get the current line index directly.

--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/metadata.py
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/metadata.py
@@ -112,2 +112,2 @@
-        for line in lines[:15]:
-            line = line.strip()
+        for idx, line in enumerate(lines[:15]):
+            line = line.strip()
@@ -125,3 +125,2 @@
-                    if ')' not in sig and ':' not in sig:
-                        idx = next((i for i, l in enumerate(lines) if l.strip() == line), -1)
-                        if idx >= 0:
-                            for next_line in lines[idx+1:idx+5]:
+                    if ')' not in sig and ':' not in sig:
+                        for next_line in lines[idx+1:idx+5]:

View Issue Details


Id on Platform: 1976

Category: ⚡ Performance

File: .../service/JobService.java:383

Issue: Applying REQUIRES_NEW to all logging methods (info, warn, error, debug) will cause a new database transaction to be opened and committed for every single log entry. Under high load or within loops, this can lead to connection pool exhaustion and significant performance degradation.

💡 Suggested Fix

Consider using a non-blocking asynchronous logging mechanism or only applying REQUIRES_NEW to critical status changes and error logs, while keeping standard progress logs within the parent transaction or using a buffered approach.

No suggested fix provided
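
A minimal sketch of the selective variant, shown as JobService fragments (the repository and log-entity names are assumptions): only error logs open their own REQUIRES_NEW transaction, while routine progress logs join the caller's transaction if one exists.

// Error logs must survive a rollback of the calling transaction.
@Transactional(propagation = Propagation.REQUIRES_NEW)
public void logError(Long jobId, String message) {
    jobLogRepository.save(new JobLog(jobId, JobLogLevel.ERROR, message));
}

// Progress logs piggyback on the caller's transaction (or run non-transactionally),
// so they do not take an extra connection from the pool per log line.
@Transactional(propagation = Propagation.SUPPORTS)
public void logInfo(Long jobId, String message) {
    jobLogRepository.save(new JobLog(jobId, JobLogLevel.INFO, message));
}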

View Issue Details


Id on Platform: 1978

Category: 🐛 Bug Risk

File: .../service/PrFileEnrichmentService.java:1

Issue: The OkHttpClient is initialized in the constructor before @Value fields are injected. Consequently, the 'requestTimeoutSeconds' property is ignored, and the client always uses the hardcoded 60s timeout. Additionally, creating a new ObjectMapper instance manually bypasses Spring's auto-configuration (e.g., custom modules, naming strategies).

💡 Suggested Fix

Use constructor injection for the ObjectMapper and the timeout property to ensure they are correctly initialized and managed by Spring.

--- a/java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/service/PrFileEnrichmentService.java
+++ b/java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/service/PrFileEnrichmentService.java
@@ -42,16 +42,14 @@
     @Value("${pr.enrichment.rag-pipeline-url:http://localhost:8006}")
     private String ragPipelineUrl;
     
-    @Value("${pr.enrichment.request-timeout-seconds:60}")
-    private int requestTimeoutSeconds;
-    
     private final ObjectMapper objectMapper;
     private final OkHttpClient httpClient;
     
-    public PrFileEnrichmentService() {
-        this.objectMapper = new ObjectMapper();
+    public PrFileEnrichmentService(
+            ObjectMapper objectMapper,
+            @Value("${pr.enrichment.request-timeout-seconds:60}") int requestTimeoutSeconds
+    ) {
+        this.objectMapper = objectMapper;
         this.httpClient = new OkHttpClient.Builder()
                 .connectTimeout(30, TimeUnit.SECONDS)
-                .readTimeout(60, TimeUnit.SECONDS)
+                .readTimeout(requestTimeoutSeconds, TimeUnit.SECONDS)
                 .writeTimeout(30, TimeUnit.SECONDS)
                 .build();
     }

View Issue Details


Id on Platform: 1979

Category: 🐛 Bug Risk

File: .../enrichment/FileContentDto.java:21

Issue: String.getBytes() uses the platform's default charset. This can cause inconsistent file size reporting if the analysis engine runs on systems with different default encodings (e.g., UTF-16 vs UTF-8). Since VCS content is typically UTF-8, it should be explicitly specified.

💡 Suggested Fix

Specify StandardCharsets.UTF_8 when calling getBytes().

--- a/java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/dto/request/ai/enrichment/FileContentDto.java
+++ b/java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/dto/request/ai/enrichment/FileContentDto.java
@@ -19,3 +19,3 @@
                 path,
                 content,
-                content != null ? content.getBytes().length : 0,
+                content != null ? content.getBytes(java.nio.charset.StandardCharsets.UTF_8).length : 0,
                 false,

View Issue Details


Id on Platform: 1983

Category: 🧹 Code Quality

File: .../routers/commands.py:1

Issue: The utility functions '_wants_streaming', '_json_event', and especially the complex '_drain_queue_until_final' logic are duplicated exactly in 'python-ecosystem/mcp-client/api/routers/review.py'.

💡 Suggested Fix

Extract shared streaming utilities and NDJSON helpers into a common utility module (e.g., 'utils/streaming.py') to improve maintainability and reduce code duplication.

No suggested fix provided

View Issue Details


Id on Platform: 1984

Category: 🧹 Code Quality

File: .../routers/review.py:1

Issue: The utility functions '_wants_streaming', '_json_event', and '_drain_queue_until_final' are duplicated from 'python-ecosystem/mcp-client/api/routers/commands.py'.

💡 Suggested Fix

Extract shared streaming utilities into a common utility module.

No suggested fix provided

View Issue Details


🔵 Low Severity Issues

Id on Platform: 1939

Category: ⚡ Performance

File: .../models/config.py:107

Issue: The 'chunk_size' has been increased from 800 to 8000 characters. While this keeps semantic units intact, it significantly increases the token count per retrieved context. If 'top_k' is high (e.g., 8 as seen in 'command_service.py'), this could lead to very large prompts that might exceed context limits or increase latency.

💡 Suggested Fix

Monitor token usage and consider if a slightly smaller chunk size (e.g., 4000) provides a better balance between semantic completeness and prompt efficiency.

No suggested fix provided

View Issue Details


Id on Platform: 1944

Category: 🧹 Code Quality

File: .../model/multi_stage.py:84

Issue: The alias 'check_pass' is identical to the field name, making it redundant.

💡 Suggested Fix

Remove the redundant alias from the Field definition.

--- a/python-ecosystem/mcp-client/model/multi_stage.py
+++ b/python-ecosystem/mcp-client/model/multi_stage.py
@@ -83,3 +83,3 @@
 class ImmutabilityCheck(BaseModel):
     rule: str
-    check_pass: bool = Field(alias="check_pass")
+    check_pass: bool
     evidence: str

View Issue Details


Id on Platform: 1946

Category: ✨ Best Practices

File: .../controller/ProjectController.java:592

Issue: The 'maxAnalysisTokenLimit' field in 'UpdateAnalysisSettingsRequest' lacks validation. A user could potentially submit a negative value or an excessively high value if not constrained by the service layer.

💡 Suggested Fix

Add validation annotations such as '@Min(0)' to the record field.

--- a/java-ecosystem/services/web-server/src/main/java/org/rostilos/codecrow/webserver/project/controller/ProjectController.java
+++ b/java-ecosystem/services/web-server/src/main/java/org/rostilos/codecrow/webserver/project/controller/ProjectController.java
@@ -591,7 +591,7 @@
             Boolean prAnalysisEnabled,
             Boolean branchAnalysisEnabled,
             String installationMethod,
-            Integer maxAnalysisTokenLimit
+            @jakarta.validation.constraints.Min(0) Integer maxAnalysisTokenLimit
     ) {}

View Issue Details


Id on Platform: 1947

Category: 🐛 Bug Risk

File: .../queries/typescript.scm:58

Issue: The predicate (#not-match? @value "^arrow_function") matches against the literal source text of the node. Since an arrow function's source (e.g., '(x) => x') never starts with the string 'arrow_function', this filter does nothing. Consequently, arrow functions assigned to variables will be tagged as both @definition.function and @definition.variable.

💡 Suggested Fix

Remove the ineffective regex filter. To distinguish between functions and variables in lexical declarations, consider using more specific patterns or handling the overlap in the post-processing logic.

--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/queries/typescript.scm
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/queries/typescript.scm
@@ -55,7 +55,6 @@
 (lexical_declaration
   (variable_declarator
     name: (identifier) @name
-    value: (_) @value) @declarator
-  (#not-match? @value "^arrow_function")) @definition.variable
+    value: (_) @value) @declarator) @definition.variable
 
 ; Ambient declarations

View Issue Details


Id on Platform: 1948

Category: ✨ Best Practices

File: .../core/embedding_factory.py:35

Issue: Hardcoding 'timeout=120.0' (and 60.0 for OpenRouter) overrides the default logic in OllamaEmbedding/OpenRouterEmbedding which checks for environment variables like OLLAMA_TIMEOUT. This makes the system less configurable.

💡 Suggested Fix

Pass None for the timeout parameter to allow the embedding classes to use their internal default logic (which includes environment variable lookups).

--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/core/embedding_factory.py
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/core/embedding_factory.py
@@ -32,7 +32,7 @@
         return OllamaEmbedding(
             model=config.ollama_model,
             base_url=config.ollama_base_url,
-            timeout=120.0,
+            timeout=None,
             expected_dim=config.embedding_dim
         )
 
@@ -42,7 +42,7 @@
             api_key=config.openrouter_api_key,
             model=config.openrouter_model,
             api_base=config.openrouter_base_url,
-            timeout=60.0,
+            timeout=None,
             max_retries=3,
             expected_dim=config.embedding_dim
         )

View Issue Details


Id on Platform: 1952

Category: 🧹 Code Quality

File: .../model/output_schemas.py:1

Issue: The fields 'resolutionExplanation' and 'resolvedInCommit' are inconsistent with 'resolvedDescription' and 'resolvedByCommit' used in 'IssueDTO' (dtos.py). Aligning these names improves maintainability and reduces mapping errors during issue reconciliation.

💡 Suggested Fix

Rename the resolution tracking fields to match the naming convention used in IssueDTO.

--- a/python-ecosystem/mcp-client/model/output_schemas.py
+++ b/python-ecosystem/mcp-client/model/output_schemas.py
@@ -23,4 +23,4 @@
     isResolved: bool = Field(default=False, description="Whether this issue from previous analysis is resolved")
     # Resolution tracking fields
-    resolutionExplanation: Optional[str] = Field(default=None, description="Explanation of how the issue was resolved (separate from original reason)")
-    resolvedInCommit: Optional[str] = Field(default=None, description="Commit hash where the issue was resolved")
+    resolvedDescription: Optional[str] = Field(default=None, description="Explanation of how the issue was resolved (separate from original reason)")
+    resolvedByCommit: Optional[str] = Field(default=None, description="Commit hash where the issue was resolved")

View Issue Details


Id on Platform: 1955

Category: ⚡ Performance

File: .../service/RagOperationsServiceImpl.java:1

Issue: When a branch has no differences compared to the base branch (e.g., a newly created branch), the method returns true without persisting a 'RagBranchIndex' record. Because 'ensureBranchIndexUpToDate' only skips indexing if this record exists, it will re-trigger the VCS diff fetch on every subsequent analysis request for this branch until a change is actually committed.

💡 Suggested Fix

Persist a RagBranchIndex record even when the diff is empty. This marks the branch as 'up-to-date' at the current commit hash, preventing redundant VCS calls in future checks.

--- a/java-ecosystem/libs/rag-engine/src/main/java/org/rostilos/codecrow/ragengine/service/RagOperationsServiceImpl.java
+++ b/java-ecosystem/libs/rag-engine/src/main/java/org/rostilos/codecrow/ragengine/service/RagOperationsServiceImpl.java
@@ -445,13 +445,22 @@
             String rawDiff = vcsClient.getBranchDiff(workspaceSlug, repoSlug, baseBranch, targetBranch);
             
+            // Get latest commit hash on target branch
+            String targetCommit = vcsClient.getLatestCommitHash(workspaceSlug, repoSlug, targetBranch);
+
             if (rawDiff == null || rawDiff.isEmpty()) {
                 log.info("No diff between '{}' and '{}' - branch has same content as base, using main index", 
                         baseBranch, targetBranch);
+                // Persist record to avoid redundant diff checks
+                RagBranchIndex branchIndex = new RagBranchIndex();
+                branchIndex.setProjectId(project.getId());
+                branchIndex.setBranchName(targetBranch);
+                branchIndex.setCommitHash(targetCommit);
+                branchIndex.setUpdatedAt(OffsetDateTime.now());
+                ragBranchIndexRepository.save(branchIndex);
                 eventConsumer.accept(Map.of(
                     "type", "info",
                     "message", String.format("No changes between %s and %s - using main branch index", baseBranch, targetBranch)
                 ));
                 return true;
             }
-            
-            // Get latest commit hash on target branch
-            String targetCommit = vcsClient.getLatestCommitHash(workspaceSlug, repoSlug, targetBranch);

View Issue Details


Id on Platform: 1956

Category: 🐛 Bug Risk

File: .../vcsclient/VcsClient.java:247

Issue: The check 'content.length() <= maxFileSizeBytes' compares the number of UTF-16 characters to a byte limit. For non-ASCII characters, the byte size will exceed the character count, potentially allowing files larger than the intended limit. Furthermore, the file is fully fetched into memory before the check, which doesn't prevent memory pressure for large files.

💡 Suggested Fix

Use a more accurate byte-size check if the limit is strictly in bytes, and consider checking file size via metadata before fetching the full content if the provider supports it.

--- a/java-ecosystem/libs/vcs-client/src/main/java/org/rostilos/codecrow/vcsclient/VcsClient.java
+++ b/java-ecosystem/libs/vcs-client/src/main/java/org/rostilos/codecrow/vcsclient/VcsClient.java
@@ -238,5 +238,5 @@
                 String content = getFileContent(workspaceId, repoIdOrSlug, path, branchOrCommit);
-                if (content != null && content.length() <= maxFileSizeBytes) {
+                if (content != null && content.getBytes(java.nio.charset.StandardCharsets.UTF_8).length <= maxFileSizeBytes) {
                     results.put(path, content);
                 }

View Issue Details


Id on Platform: 1958

Category: ⚡ Performance

File: .../vcsclient/VcsClientProvider.java:419

Issue: A new 'OkHttpClient' instance is created for every GitLab token refresh call. This is inefficient as it prevents connection pooling and requires setting up new thread pools and connection structures for every request.

💡 Suggested Fix

Reuse a single 'OkHttpClient' instance (e.g., as a class field or a bean) instead of creating a new one in every method call.

No suggested fix provided

View Issue Details


Id on Platform: 1961

Category: 🧹 Code Quality

File: .../webhookhandler/BitbucketCloudPullRequestWebhookHandler.java:1

Issue: The pull request ID is parsed from a string to a long multiple times within the same method. This is inefficient and redundant.

💡 Suggested Fix

Parse the pull request ID once at the beginning of the method and reuse the variable.

--- a/java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/bitbucket/webhookhandler/BitbucketCloudPullRequestWebhookHandler.java
+++ b/java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/bitbucket/webhookhandler/BitbucketCloudPullRequestWebhookHandler.java
@@ -102,8 +102,9 @@
     private WebhookResult handlePullRequestEvent(WebhookPayload payload, Project project, Consumer<Map<String, Object>> eventConsumer) {
         String placeholderCommentId = null;
         String acquiredLockKey = null;
+        long prId = Long.parseLong(payload.pullRequestId());
         
         try {
             // Try to acquire lock atomically BEFORE posting placeholder
             // This prevents race condition where multiple webhooks could post duplicate placeholders
             String sourceBranch = payload.sourceBranch();
             Optional<String> earlyLock = analysisLockService.acquireLock(
                     project, sourceBranch, AnalysisLockType.PR_ANALYSIS,
-                    payload.commitHash(), Long.parseLong(payload.pullRequestId()));
+                    payload.commitHash(), prId);
             
             if (earlyLock.isEmpty()) {
                 log.info("PR analysis already in progress for project={}, branch={}, PR={} - skipping duplicate webhook", 
-                        project.getId(), sourceBranch, payload.pullRequestId());
+                        project.getId(), sourceBranch, prId);
                 return WebhookResult.ignored("PR analysis already in progress for this branch");
             }
@@ -120,7 +121,7 @@
             // Lock acquired - placeholder posting is now protected from race conditions
             
             // Post placeholder comment immediately to show analysis has started
-            placeholderCommentId = postPlaceholderComment(project, Long.parseLong(payload.pullRequestId()));
+            placeholderCommentId = postPlaceholderComment(project, prId);
             
             // Create PR analysis request
             PrProcessRequest request = new PrProcessRequest();
@@ -128,7 +129,7 @@
             request.sourceBranchName = sourceBranch;
             request.targetBranchName = payload.targetBranch();
             request.commitHash = payload.commitHash();
-            request.pullRequestId = Long.parseLong(payload.pullRequestId());
+            request.pullRequestId = prId;
             request.placeholderCommentId = placeholderCommentId;
             request.prAuthorId = payload.prAuthorId();
             request.prAuthorUsername = payload.prAuthorUsername();

View Issue Details


Id on Platform: 1962

Category: 🧹 Code Quality

File: .../webhookhandler/GitHubPullRequestWebhookHandler.java:123

Issue: The pull request ID is parsed from a String to a Long multiple times (lines 132, 147, 155, and 196). This is redundant and slightly inefficient.

💡 Suggested Fix

Parse the pull request ID once at the beginning of the method and store it in a variable.

--- a/java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/github/webhookhandler/GitHubPullRequestWebhookHandler.java
+++ b/java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/github/webhookhandler/GitHubPullRequestWebhookHandler.java
@@ -122,12 +122,13 @@
     ) {
         String placeholderCommentId = null;
         String acquiredLockKey = null;
+        Long prId = Long.parseLong(payload.pullRequestId());
         
         try {
             // Try to acquire lock atomically BEFORE posting placeholder
             // This prevents race condition where multiple webhooks could post duplicate placeholders
             String sourceBranch = payload.sourceBranch();
             Optional<String> earlyLock = analysisLockService.acquireLock(
                     project, sourceBranch, AnalysisLockType.PR_ANALYSIS,
-                    payload.commitHash(), Long.parseLong(payload.pullRequestId()));
+                    payload.commitHash(), prId);
             
             if (earlyLock.isEmpty()) {
                 log.info("PR analysis already in progress for project={}, branch={}, PR={} - skipping duplicate webhook", 
-                        project.getId(), sourceBranch, payload.pullRequestId());
+                        project.getId(), sourceBranch, prId);
                 return WebhookResult.ignored("PR analysis already in progress for this branch");
             }
             
             acquiredLockKey = earlyLock.get();
             
             // Lock acquired - placeholder posting is now protected from race conditions
             
             // Post placeholder comment immediately to show analysis has started
-            placeholderCommentId = postPlaceholderComment(project, Long.parseLong(payload.pullRequestId()));
+            placeholderCommentId = postPlaceholderComment(project, prId);
             
             PrProcessRequest request = new PrProcessRequest();
             request.project = project;
-            request.pullRequestId = Long.parseLong(payload.pullRequestId());
+            request.pullRequestId = prId;
             request.sourceBranchName = sourceBranch;
             request.targetBranchName = payload.targetBranch();

View Issue Details


Id on Platform: 1980

Category: 🎨 Style

File: .../index_manager/collection_manager.py:15

Issue: The imports 'TextIndexParams' and 'TokenizerType' are added but not used anywhere in the file.

💡 Suggested Fix

Remove the unused imports to keep the code clean.

--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager/collection_manager.py
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager/collection_manager.py
@@ -12,3 +12,3 @@
 from qdrant_client.models import (
-    Distance, VectorParams,
-    CreateAlias, DeleteAlias, CreateAliasOperation, DeleteAliasOperation,
-    PayloadSchemaType, TextIndexParams, TokenizerType
+    Distance, VectorParams, CreateAlias, DeleteAlias, 
+    CreateAliasOperation, DeleteAliasOperation, PayloadSchemaType
 )

View Issue Details


Id on Platform: 1981

Category: ⚡ Performance

File: .../orchestrator/orchestrator.py:74

Issue: In _index_pr_files, the code falls back to f.content if f.full_content is missing. In many diff processing contexts, f.content represents the raw diff string (with + and - markers). Indexing diff hunks into a RAG system instead of clean code significantly degrades embedding quality and semantic search accuracy.

💡 Suggested Fix

Ensure that only full file content is indexed, or strip diff markers before indexing if only the diff is available.

No suggested fix provided
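
Since no fix was provided, here is a minimal Python sketch of the guard the suggestion describes, assuming hypothetical full_content and content attributes as referenced in the issue; the real PR file model in the orchestrator may differ:

from typing import Optional

def indexable_text(f) -> Optional[str]:
    """Prefer clean full file content; otherwise strip unified-diff markers before indexing."""
    full = getattr(f, "full_content", None)
    if full:
        return full
    content = getattr(f, "content", None)
    if not content:
        return None
    kept = []
    for line in content.splitlines():
        # Drop diff metadata and removed lines; keep added/context lines without markers.
        if line.startswith(("@@", "---", "+++", "-")):
            continue
        kept.append(line[1:] if line.startswith("+") else line)
    return "\n".join(kept) or None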

View Issue Details


Id on Platform: 1982

Category: 🧹 Code Quality

File: .../orchestrator/reconciliation.py:43

Issue: The line grouping logic int(line) // 3 is unstable for issue tracking. An issue moving from line 5 to line 6 changes its fingerprint (group 1 to group 2), while one moving from line 3 to line 4 does not. This makes issue persistence tracking inconsistent across small code shifts.

💡 Suggested Fix

Consider using a more robust fuzzy matching for lines or rely more heavily on the issue reason/category for deduplication if the line number is within a certain threshold.

No suggested fix provided
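
Since no fix was provided, a minimal Python sketch of the threshold-based matching the suggestion describes, assuming hypothetical file/category/line fields; the actual reconciliation data model may differ:

def is_same_issue(prev: dict, curr: dict, line_tolerance: int = 3) -> bool:
    """Match issues by file and category, treating nearby line numbers as the same issue."""
    return (
        prev["file"] == curr["file"]
        and prev["category"] == curr["category"]
        and abs(int(prev["line"]) - int(curr["line"])) <= line_tolerance
    )

# Unlike int(line) // 3 bucketing, a small shift (line 3 -> line 5) still matches:
assert is_same_issue(
    {"file": "a.py", "category": "bug", "line": 3},
    {"file": "a.py", "category": "bug", "line": 5},
)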

View Issue Details


ℹ️ Informational Notes

Id on Platform: 1953

Category: 🎨 Style

File: .../splitter/languages.py:72

Issue: The extension '.xml' is mapped to 'Language.HTML'. While similar, LangChain's 'Language' enum contains a specific 'XML' member which would be more appropriate.

💡 Suggested Fix

Change the mapping for '.xml' to 'Language.XML'.

--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/languages.py
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/languages.py
@@ -73,1 +73,1 @@
-    '.xml': Language.HTML,
+    '.xml': Language.XML,

View Issue Details


Id on Platform: 1968

Category: 🧹 Code Quality

File: .../index_manager/manager.py:254

Issue: The call to 'list_all_indices' passes 'self._collection_manager.alias_exists' as an argument, but the receiving method in 'StatsManager' does not use this parameter.

💡 Suggested Fix

Remove the unused argument from the call to align with the suggested cleanup in 'StatsManager'.

--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager/manager.py
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager/manager.py
@@ -239,3 +239,1 @@
     def list_indices(self) -> List[IndexStats]:
         """List all project indices with branch breakdown."""
-        return self._stats_manager.list_all_indices(
-            self._collection_manager.alias_exists
-        )
+        return self._stats_manager.list_all_indices()

View Issue Details


Id on Platform: 1970

Category: 🧹 Code Quality

File: .../index_manager/stats_manager.py:102

Issue: The parameter 'alias_checker' is defined in 'list_all_indices' but never used within the function body.

💡 Suggested Fix

Remove the unused parameter from the function signature.

--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager/stats_manager.py
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager/stats_manager.py
@@ -100,1 +100,1 @@
-    def list_all_indices(self, alias_checker) -> List[IndexStats]:
+    def list_all_indices(self) -> List[IndexStats]:

View Issue Details


Files Affected

  • .../splitter/languages.py: 2 issues
  • .../command/command_service.py: 2 issues
  • .../vcsclient/VcsClientProvider.java: 2 issues
  • .../core/ollama_embedding.py: 2 issues
  • .../webhookhandler/GitLabMergeRequestWebhookHandler.java: 2 issues
  • .../webhookhandler/GitHubPullRequestWebhookHandler.java: 2 issues
  • .../ai/AiAnalysisRequestImpl.java: 2 issues
  • .../service/WorkspaceService.java: 1 issue
  • .../queries/rust.scm: 1 issue
  • .../routers/commands.py: 1 issue
