Conversation
Walkthrough

This PR implements a comprehensive session scoring system for alert investigations. It adds database schema for storing scores, a scoring service with LLM-based evaluation, REST API endpoints, and frontend UI components to display and trigger scoring operations, with support for multi-turn LLM interactions and deterministic prompt versioning.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client as Client/Dashboard
    participant API as Scoring API
    participant Service as ScoringService
    participant DB as Database
    participant LLM as LLM Manager
    participant Background as Background Task
    Client->>API: POST /sessions/{id}/score
    API->>API: Extract user identity
    API->>Service: initiate_scoring(session_id, triggered_by, force_rescore)
    Service->>DB: Get session + check completion
    Service->>DB: Check active scores
    Service->>DB: Create score (status=PENDING)
    Service->>API: Return SessionScoreResponse (202 Accepted)
    API-->>Client: 202 + pending score
    Service->>Background: Launch async _execute_scoring task
    Background->>DB: Update status to IN_PROGRESS
    Background->>DB: Fetch session history & final analysis
    Background->>LLM: Send Turn 1: Score prompt
    LLM-->>Background: Response with score + analysis
    Background->>Background: Extract score from response
    Background->>LLM: Send Turn 2: Missing tools prompt
    LLM-->>Background: Response with missing tools analysis
    Background->>DB: Update score with results (status=COMPLETED)
    Client->>API: GET /sessions/{id}/score
    API->>Service: _get_latest_score(session_id)
    Service->>DB: Fetch latest score
    Service-->>API: SessionScore
    API-->>Client: 200 + SessionScoreResponse
```
```mermaid
sequenceDiagram
    participant UI as UI/SessionHeader
    participant API as APIClient
    participant Controller as Controller
    participant Service as ScoringService
    participant Dialog as Rescore Dialog
    UI->>UI: User clicks "Score Session"
    UI->>API: scoreSession(sessionId)
    API->>Controller: POST /sessions/{id}/score
    Controller->>Service: initiate_scoring()
    alt Score already exists and not force
        Service-->>Controller: Return existing score
        Controller-->>API: 200 OK (existing)
        API-->>UI: Existing score response
        UI->>Dialog: Show rescore confirmation?
    else Force rescore
        Service-->>Controller: Return new pending score
        Controller-->>API: 202 Accepted (new)
        API-->>UI: New pending score
        UI->>UI: Show loading state
    end
    UI->>UI: Poll GET /sessions/{id}/score (background)
    UI->>API: getSessionScore(sessionId)
    API-->>UI: Updated SessionScore
    UI->>UI: Render ScoreDetailView when completed
```
Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 13
🤖 Fix all issues with AI agents
In `@backend/tarsy/agents/prompts/judges.py`:
- Around line 320-335: get_current_prompt_hash currently only hashes
JUDGE_PROMPT_SCORE and JUDGE_PROMPT_FOLLOWUP_MISSING_TOOLS, so changes to other
prompt parts (system/reminder) won't update the version; update the function to
concatenate all prompt strings used in judge requests (include the system prompt
and reminder prompt constants used in this module — e.g., JUDGE_PROMPT_SYSTEM
and JUDGE_PROMPT_REMINDER or whatever names are defined here — along with
JUDGE_PROMPT_SCORE and JUDGE_PROMPT_FOLLOWUP_MISSING_TOOLS) before computing the
SHA256, or if you intentionally want a subset, update the function docstring to
explicitly list which prompts are included; ensure you reference
get_current_prompt_hash and the prompt constant names when making the change.
In `@backend/tarsy/controllers/scoring_controller.py`:
- Around line 152-165: The except handlers in scoring_controller.py currently
include str(e) in HTTPException details (see the RuntimeError except block and
the generic except Exception block around initiating score for session_id),
which can leak internal info; change both HTTPException detail strings to
generic messages like "Internal server error" or "Scoring service error" without
including str(e), while retaining the existing logger.error/logger.exception
calls to record the full error internally (use session_id in log context as
already done).
In `@backend/tarsy/repositories/history_repository.py`:
- Around line 658-690: The import and usage of the non-existent model
SessionScoreDB causes the except ImportError to swallow the failure and leave
score_data empty; update the import and all references to use the correct model
name SessionScore (replace SessionScoreDB -> SessionScore) in the score-fetching
block (the select/subquery and score_query that reference SessionScoreDB) and in
the other occurrence noted in the file so the query uses the actual model class
SessionScore and no longer silently fails.
In `@backend/tarsy/services/scoring_service.py`:
- Around line 648-651: Remove the debugging marker from the logger.debug call
that prints the conversation: locate the logger.debug(f"!!!!! Full conversation
after turn 1:\n{conversation}") line (the logger.debug invocation referencing
conversation) and either delete it or replace it with a cleaned-up message
without the "!!!!!" artifact (e.g., "Full conversation after turn 1:" +
conversation) so the log entry is production-ready.
- Around line 208-239: In _extract_score_from_response, handle empty/whitespace
responses and restrict parsed scores to 0–100: first strip the response and if
empty return (None, response) (avoid indexing lines[-1]); otherwise find the
last non-empty line (not just lines[-1]) and attempt to parse it using a
stricter pattern that only accepts integers 0–100 (e.g., match 100 or 0–99) or
parse then validate range; if the match fails or the numeric value is outside
0–100, log an error including the problematic last line and return (None,
response); otherwise convert to int, build score_analysis from all lines before
that last non-empty line, and return (total_score, score_analysis).
- Around line 567-571: The background task launched with
asyncio.create_task(self._execute_scoring(score_record.score_id, session_id)) is
not stored and may be garbage-collected or have unobserved exceptions; modify
the ScoringService (or the class containing _execute_scoring) to keep a
reference to the created Task (e.g., append the returned task to a tasks
set/list or a tracking dict keyed by score_id/session_id) and ensure you remove
completed tasks (or await/monitor them) to handle exceptions and lifecycle;
update the code that calls asyncio.create_task to assign its return value to
that tracker so tasks aren't silently cancelled.
In `@backend/tests/integration/test_scoring_integration.py`:
- Around line 94-151: In the async test helpers mock_generate_score and
mock_generate (and the other similar test helper around lines referenced),
rename the unused variadic parameter *args to _args (or remove it entirely) so
Ruff ARG001 is not raised; update function signatures where *args is declared
(e.g., async def mock_generate_score(*args, **kwargs) and async def
mock_generate(*args, **kwargs)) to use async def mock_generate_score(_args,
**kwargs) or async def mock_generate(_args, **kwargs) (or drop the variadic
parameter) while leaving usage of kwargs and the rest of the function body
unchanged.
- Around line 34-366: The tests and pytest fixtures lack return type hints; add
appropriate annotations for each fixture and test: annotate fixtures like
in_memory_engine -> Engine, db_manager -> DatabaseManager, history_service ->
HistoryService, scoring_service -> ScoringService, mock_llm_client -> Mock,
completed_session -> AlertSession (import types from sqlmodel/sqlalchemy/typing
or project types as needed), and annotate async test functions
(test_end_to_end_score_completed_session,
test_multiple_scoring_attempts_same_session, test_concurrent_scoring_prevention,
test_scoring_failure_updates_status, test_score_extraction_failure) with ->
None; ensure any generator fixtures use typing.Generator or Iterator when
applicable and add necessary imports for the type names referenced.
- Around line 48-60: Remove the duplicate import of Session inside the
db_manager function to satisfy Ruff F811: delete the line "from sqlmodel import
Session" in db_manager and rely on the existing import at the top of the module;
ensure db_manager still sets manager.session_factory using
sessionmaker(bind=in_memory_engine, class_=Session, expire_on_commit=False) and
references the DatabaseManager, in_memory_engine, and sessionmaker symbols
unchanged.
- Around line 285-291: The nested context managers around the test can be
collapsed into a single combined `with` statement to improve readability and
satisfy SIM117; replace the two nested `with` blocks that use
`patch.object(scoring_service, "_get_llm_client", return_value=mock_llm_client)`
and `pytest.raises(ValueError, match=...)` with a single `with` that joins both
context managers (e.g., `with patch.object(...), pytest.raises(...):`), keeping
the same `mock_llm_client`, the same `scoring_service._get_llm_client` patch
target, and the same exception match string.
In `@backend/tests/unit/agents/test_judges.py`:
- Around line 1-47: Add pytest markers and return type hints: import pytest at
top and decorate each test class or each test function with `@pytest.mark.unit`,
then add explicit return type hints (-> None) to every test method (e.g.,
test_judge_prompt_score_contains_placeholders(self) -> None). Also run Ruff
SIM300 and fix any yoda-condition warnings by ensuring comparisons place the
variable on the left side (e.g., keep len(CURRENT_PROMPT_HASH) == 64 rather than
64 == len(...)) and adjust any expressions flagged by SIM300 accordingly;
reference the test class and function names (TestJudgePrompts, TestPromptHashing
and all test_* methods) to locate where to apply these changes.
In `@backend/tests/unit/models/test_session_score_models.py`:
- Around line 11-28: Add the pytest unit marker and return type hints to the
test classes and functions in this file: decorate TestScoringStatusEnum (and the
other test classes in the file) with `@pytest.mark.unit`, and add explicit return
type hints (-> None) to the test methods such as test_active_values and
test_values (and all other test_* functions in the file) so each test function
signature follows the project convention and linters/mypy expectations.
In `@dashboard/src/components/ScoreBadge.tsx`:
- Around line 2-58: The import of Error from '@mui/icons-material' shadows the
global Error; rename the imported icon (e.g., Error as ErrorIcon) and update all
usages in the ScoreBadge component where the icon prop uses <Error /> (the Chip
for the failed state and any other Chip/Icon usage) to use the new alias
(<ErrorIcon />) and any corresponding JSX references so the global Error
identifier is no longer shadowed.
🧹 Nitpick comments (15)
backend/tarsy/models/history_models.py (1)
229-231: Consider constraining `score_status` to known values for validation. Using a `Literal` (or an enum if you already have one) avoids invalid statuses slipping into API responses.

♻️ Proposed refactor

```diff
-    score_status: Optional[str] = None  # pending|in_progress|completed|failed
+    score_status: Optional[Literal["pending", "in_progress", "completed", "failed"]] = None
```

Also applies to: 437-438
backend/tests/unit/repositories/test_session_score_repository.py (2)
1-11: Add `@pytest.mark.unit` marker to the test class. Per coding guidelines, unit tests should be marked with `@pytest.mark.unit`. Also, the `IntegrityError` import on line 4 appears unused in this test file since the repository handles it internally and raises `ValueError` instead.

Suggested fix

```diff
 import pytest
-from sqlalchemy.exc import IntegrityError
 from sqlmodel import Session, SQLModel, create_engine
 from tarsy.models.constants import ScoringStatus
 from tarsy.models.db_models import SessionScore, AlertSession
 from tarsy.repositories.session_score_repository import SessionScoreRepository
 from tarsy.utils.timestamp import now_us

+@pytest.mark.unit
 class TestSessionScoreRepository:
```

As per coding guidelines: "Use test markers appropriately: `@pytest.mark.unit` for unit tests".
134-159: Consider adding an explicit time ordering guarantee. The test relies on `now_us() - 1000000` to create an "earlier" timestamp, but if the test runs extremely fast, there's a theoretical (though unlikely) risk of timing issues. The test logic is correct, but you could make the ordering more explicit.

Optional: More explicit timestamp ordering

```diff
     def test_get_latest_score_for_session(self, repository, sample_alert_session):
         """Test getting most recent score."""
+        base_time = now_us()
         # Create two scores
         score1 = SessionScore(
             session_id=sample_alert_session.session_id,
             prompt_hash="abc123",
             score_triggered_by="user:test",
             status=ScoringStatus.COMPLETED.value,
-            scored_at_us=now_us() - 1000000,  # Earlier
+            scored_at_us=base_time - 1000000,  # Earlier
         )
         score2 = SessionScore(
             session_id=sample_alert_session.session_id,
             prompt_hash="def456",
             score_triggered_by="user:test",
             status=ScoringStatus.PENDING.value,
-            scored_at_us=now_us(),  # Later
+            scored_at_us=base_time,  # Later
         )
```

dashboard/src/components/SessionDetailWrapper.tsx (1)
22-27: Ternary chain is becoming complex; consider refactoring for readability. The nested ternary for `initialView` detection is functional but harder to read at a glance. Consider extracting to a helper function for clarity.

Optional: Extract to helper function

```diff
+  // Determine initial view from URL path
+  const getInitialView = (): 'conversation' | 'technical' | 'score' => {
+    if (location.pathname.includes('/technical')) return 'technical';
+    if (location.pathname.includes('/score')) return 'score';
+    return 'conversation';
+  };
+
   // Determine initial view from URL
-  const initialView = location.pathname.includes('/technical')
-    ? 'technical'
-    : location.pathname.includes('/score')
-      ? 'score'
-      : 'conversation';
-  const [currentView, setCurrentView] = useState<'conversation' | 'technical' | 'score'>(initialView);
+  const [currentView, setCurrentView] = useState<'conversation' | 'technical' | 'score'>(getInitialView);
```

backend/tarsy/models/db_models.py (1)
390-394: Consider adding an index on `scored_at_us` for time-based queries. The `scored_at_us` field uses `BIGINT` but doesn't have an index. If you anticipate queries filtering or ordering by scoring initiation time (e.g., for cleanup or analytics), consider adding an index.

backend/tests/unit/controllers/test_scoring_controller.py (1)
38-44: Tests mock a private method `_get_latest_score`, which couples tests to implementation details. The mock targets `_get_latest_score` (an underscore-prefixed private method). This creates tight coupling between tests and implementation: if the service refactors this internal method, tests will break even if the public contract remains unchanged. Consider whether the `ScoringService` should expose a public method for retrieving scores, or mock at a lower level (repository).

backend/tarsy/controllers/scoring_controller.py (2)
11-11: Unused import: `Body` from fastapi. The `Body` import is not used in this file.

🧹 Remove unused import

```diff
-from fastapi import APIRouter, Depends, HTTPException, Path, Request, Response, Body
+from fastapi import APIRouter, Depends, HTTPException, Path, Request, Response
```
213-220: Controller calls private method `_get_latest_score` directly. The GET endpoint calls `scoring_service._get_latest_score()`, which is a private method (underscore prefix). This breaks encapsulation and makes the controller dependent on implementation details. Consider exposing a public method like `get_latest_score()` or `get_score_for_session()` in the service layer.

♻️ Suggested refactor in scoring_service.py

Add a public method in `ScoringService`:

```python
async def get_latest_score(self, session_id: str) -> Optional[SessionScore]:
    """
    Get the latest score for a session (public API).

    Args:
        session_id: Session identifier

    Returns:
        Latest SessionScore or None if not found
    """
    return await self._get_latest_score(session_id)
```

Then update the controller to use the public method.
dashboard/src/components/SessionHeader.tsx (1)
573-583: Magic string `'completed'` should use a constant. The status check uses a string literal `'completed'` instead of referencing a shared constant. This could lead to bugs if the status value changes.

♻️ Use ScoringStatus constant

Import and use the `ScoringStatus` type or create a constant:

```diff
+import { SCORING_STATUS } from '../utils/statusConstants';

 // EP-0028: Handle score session button click
 const handleScoreSession = () => {
   // Check if session already has a completed score
-  if (session.score_status === 'completed') {
+  if (session.score_status === SCORING_STATUS.COMPLETED) {
     // Show confirmation dialog for rescoring
     setShowRescoreDialog(true);
   } else {
```

backend/tarsy/models/api_models.py (1)
14-14: Import of `CURRENT_PROMPT_HASH` creates coupling between API and agent modules. Importing from `tarsy.agents.prompts.judges` in the API models creates a dependency from the API layer on the agent/prompt layer. This could be moved to a shared constants module if the coupling becomes problematic.

backend/tests/unit/services/test_scoring_service.py (2)
307-319: Unused fixture parameter `mock_chat_conversation_history`. The `mock_chat_conversation_history` parameter in `test_build_score_prompt_with_chat_conversation_present` is declared, but the test actually uses `mock_final_analysis_response`, which already contains the chat conversation. This is a minor issue in test code, and the fixture dependency is technically correct (it ensures the data is available), but the explicit parameter suggests it might have been intended to be used directly.
571-598: Consider combining nested `with` statements for readability. The nested `with` statements can be combined into a single statement, as suggested by static analysis. However, the current structure is clear, and the nesting shows the mock dependency hierarchy.

♻️ Combined with statement (optional)

```diff
-        with patch.object(
-            scoring_service.history_service,
-            "get_session",
-            return_value=mock_session_completed,
-        ):
-            with patch.object(scoring_service, "_get_latest_score", return_value=None):
-                with patch.object(
-                    scoring_service,
-                    "_create_score_record",
-                    return_value=mock_score_record_pending,
-                ):
-                    with patch("asyncio.create_task") as mock_create_task:
+        with (
+            patch.object(
+                scoring_service.history_service,
+                "get_session",
+                return_value=mock_session_completed,
+            ),
+            patch.object(scoring_service, "_get_latest_score", return_value=None),
+            patch.object(
+                scoring_service,
+                "_create_score_record",
+                return_value=mock_score_record_pending,
+            ),
+            patch("asyncio.create_task") as mock_create_task,
+        ):
```

backend/tarsy/repositories/session_score_repository.py (1)
99-136: Truthy check for `completed_at_us` could skip a valid zero value. Line 121 uses `if completed_at_us:`, which would skip the assignment if the value is `0`. While `0` microseconds since epoch (1970-01-01 00:00:00.000000) is practically never valid, using `is not None` would be more defensive and consistent with the `total_score` check on line 127.

Suggested fix for consistency

```diff
-        if completed_at_us:
+        if completed_at_us is not None:
             score.completed_at_us = completed_at_us
```

backend/tarsy/services/scoring_service.py (2)
97-115: Repository context manager silently yields `None` on error. The `_get_repository` method yields `None` on exceptions instead of propagating them. While callers do check for `None`, this pattern can mask underlying database errors and make debugging harder. Consider logging more details or re-raising after logging to help diagnose issues:

```diff
         except Exception as e:
             logger.error(f"Failed to get scoring repository: {str(e)}")
-            yield None
+            raise  # Let caller handle the exception
```

Alternatively, if the silent failure is intentional for resilience, document this behavior in the docstring.
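To make the failure mode concrete, here is a small runnable sketch of the pattern being flagged; `get_repository` and its `ConnectionError` are illustrative stand-ins, not the service's actual code:

```python
from contextlib import contextmanager

@contextmanager
def get_repository(fail: bool):
    """Sketch of a context manager that swallows errors by yielding None."""
    try:
        if fail:
            raise ConnectionError("db down")
        yield "repo"
    except ConnectionError:
        # Swallowing the error and yielding None forces every caller to
        # remember the None check; the root cause never propagates.
        yield None

# Caller must defensively check for None instead of catching a real exception.
with get_repository(fail=True) as repo:
    result = "skipped" if repo is None else "used"

with get_repository(fail=False) as repo:
    assert repo == "repo"
```

Re-raising instead of yielding `None` would let callers see the original `ConnectionError` (and its traceback) rather than a generic "repository unavailable" symptom.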
142-200: Inline import should be at module top level. Line 161 has `import json` inside the method. Per PEP 8 and Ruff conventions, imports should be at the top of the file.

Move import to top of file

```diff
 import asyncio
+import json
 import random
```

And remove line 161.
```python
def get_current_prompt_hash() -> str:
    """
    Compute SHA256 hash of both judge prompts concatenated.

    This hash provides deterministic criteria versioning - when prompts change,
    the hash changes, allowing detection of scores using obsolete criteria.

    Returns:
        Hex string of SHA256 hash (64 characters)
    """
    # Concatenate both prompts in the order they're used
    combined_prompts = JUDGE_PROMPT_SCORE + JUDGE_PROMPT_FOLLOWUP_MISSING_TOOLS

    # Compute SHA256 hash
    hash_obj = hashlib.sha256(combined_prompts.encode("utf-8"))
```
Include all prompts in the version hash (system + reminder).
CURRENT_PROMPT_HASH won’t change if the system prompt or the reminder prompt changes, which breaks prompt-versioning guarantees. If those prompts are part of the judge request, they should be included in the hash (or the docstring/usage updated to make this explicit).
🐛 Proposed fix

```diff
-    # Concatenate both prompts in the order they're used
-    combined_prompts = JUDGE_PROMPT_SCORE + JUDGE_PROMPT_FOLLOWUP_MISSING_TOOLS
+    # Concatenate all prompts in the order they're used
+    combined_prompts = (
+        JUDGE_SYSTEM_PROMPT
+        + JUDGE_PROMPT_SCORE
+        + JUDGE_PROMPT_FOLLOWUP_MISSING_TOOLS
+        + JUDGE_PROMPT_SCORE_REMINDER
+    )
```

🤖 Prompt for AI Agents
In `@backend/tarsy/agents/prompts/judges.py` around lines 320 - 335,
get_current_prompt_hash currently only hashes JUDGE_PROMPT_SCORE and
JUDGE_PROMPT_FOLLOWUP_MISSING_TOOLS, so changes to other prompt parts
(system/reminder) won't update the version; update the function to concatenate
all prompt strings used in judge requests (include the system prompt and
reminder prompt constants used in this module — e.g., JUDGE_PROMPT_SYSTEM and
JUDGE_PROMPT_REMINDER or whatever names are defined here — along with
JUDGE_PROMPT_SCORE and JUDGE_PROMPT_FOLLOWUP_MISSING_TOOLS) before computing the
SHA256, or if you intentionally want a subset, update the function docstring to
explicitly list which prompts are included; ensure you reference
get_current_prompt_hash and the prompt constant names when making the change.
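As a sanity check on the versioning scheme, here is a minimal runnable sketch of the proposed fix. The prompt constants below are placeholders for the real texts in `judges.py`:

```python
import hashlib

# Placeholder prompt texts; the real constants live in tarsy/agents/prompts/judges.py.
JUDGE_SYSTEM_PROMPT = "You are a strict judge of alert investigations."
JUDGE_PROMPT_SCORE = "Score the session from 0 to 100."
JUDGE_PROMPT_FOLLOWUP_MISSING_TOOLS = "List any tools that were missing."
JUDGE_PROMPT_SCORE_REMINDER = "Output the score on the last line."

def get_current_prompt_hash() -> str:
    """Hash every prompt used in judge requests so any edit changes the version."""
    combined_prompts = (
        JUDGE_SYSTEM_PROMPT
        + JUDGE_PROMPT_SCORE
        + JUDGE_PROMPT_FOLLOWUP_MISSING_TOOLS
        + JUDGE_PROMPT_SCORE_REMINDER
    )
    return hashlib.sha256(combined_prompts.encode("utf-8")).hexdigest()
```

Because the system and reminder prompts now feed the hash, editing any one of the four strings produces a different 64-character hex digest, which is exactly what makes stale scores detectable.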
```python
        except RuntimeError as e:
            # Database errors
            logger.error(f"Scoring service error for session {session_id}: {str(e)}")
            raise HTTPException(
                status_code=500, detail=f"Scoring service error: {str(e)}"
            ) from e
        except Exception as e:
            # Unexpected errors
            logger.exception(
                f"Unexpected error initiating score for session {session_id}: {str(e)}"
            )
            raise HTTPException(
                status_code=500, detail=f"Internal server error: {str(e)}"
            ) from e
```
Error messages may expose internal details in 500 responses.
The 500 error responses include str(e) which could leak implementation details or stack traces in production. Consider using a generic message for unexpected errors.
🛡️ Suggested fix

```diff
 except Exception as e:
     # Unexpected errors
     logger.exception(
         f"Unexpected error initiating score for session {session_id}: {str(e)}"
     )
     raise HTTPException(
-        status_code=500, detail=f"Internal server error: {str(e)}"
+        status_code=500, detail="Internal server error"
     ) from e
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
    except RuntimeError as e:
        # Database errors
        logger.error(f"Scoring service error for session {session_id}: {str(e)}")
        raise HTTPException(
            status_code=500, detail=f"Scoring service error: {str(e)}"
        ) from e
    except Exception as e:
        # Unexpected errors
        logger.exception(
            f"Unexpected error initiating score for session {session_id}: {str(e)}"
        )
        raise HTTPException(
            status_code=500, detail="Internal server error"
        ) from e
```
🤖 Prompt for AI Agents
In `@backend/tarsy/controllers/scoring_controller.py` around lines 152 - 165, The
except handlers in scoring_controller.py currently include str(e) in
HTTPException details (see the RuntimeError except block and the generic except
Exception block around initiating score for session_id), which can leak internal
info; change both HTTPException detail strings to generic messages like
"Internal server error" or "Scoring service error" without including str(e),
while retaining the existing logger.error/logger.exception calls to record the
full error internally (use session_id in log context as already done).
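The shape of this fix can be sketched without FastAPI; `HTTPError` below is a stand-in for `HTTPException`, and `initiate_score` is a hypothetical wrapper, not the controller's actual code:

```python
import logging

logger = logging.getLogger("tarsy.scoring")

class HTTPError(Exception):
    """Stand-in for FastAPI's HTTPException so the sketch runs without FastAPI."""
    def __init__(self, status_code: int, detail: str) -> None:
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail

def initiate_score(session_id: str, do_work) -> None:
    try:
        do_work()
    except RuntimeError as e:
        # Full detail is logged server-side only; the client sees a generic message.
        logger.error("Scoring service error for session %s: %s", session_id, e)
        raise HTTPError(500, "Scoring service error") from e
    except Exception as e:
        logger.exception("Unexpected error initiating score for session %s", session_id)
        raise HTTPError(500, "Internal server error") from e
```

The key property: internal strings (connection URLs, stack traces) reach the log, never the HTTP response body.

```python
def boom() -> None:
    raise RuntimeError("postgres://user:secret@db failed")

try:
    initiate_score("sess-1", boom)
    leaked = True
except HTTPError as e:
    leaked = "secret" in e.detail
```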
```python
        # Fetch latest scores for sessions (most recent scoring attempt per session)
        score_data = {}
        try:
            from tarsy.models.db_models import SessionScoreDB

            # Get latest score for each session (most recent started_at_us)
            # Use subquery to get max started_at_us per session, then join to get full record
            subquery = select(
                SessionScoreDB.session_id,
                func.max(SessionScoreDB.started_at_us).label('max_started_at')
            ).where(
                SessionScoreDB.session_id.in_(session_ids)
            ).group_by(SessionScoreDB.session_id).subquery()

            score_query = select(SessionScoreDB).join(
                subquery,
                and_(
                    SessionScoreDB.session_id == subquery.c.session_id,
                    SessionScoreDB.started_at_us == subquery.c.max_started_at
                )
            )

            score_results = self.session.exec(score_query).all()
            score_data = {
                score.session_id: {
                    'total_score': score.total_score,
                    'status': score.status
                }
                for score in score_results
            }
        except ImportError:
            # SessionScoreDB not available (shouldn't happen but defensive)
            pass
```
🧩 Analysis chain
🏁 Script executed:

```shell
# Check for SessionScoreDB and SessionScore class definitions
rg -n "class SessionScoreDB|class SessionScore\b" backend/tarsy/models/
rg -n "SessionScoreDB|SessionScore" backend/tarsy/repositories/history_repository.py
```

Repository: codeready-toolchain/tarsy-bot

Length of output: 1171
Fix `SessionScoreDB` → `SessionScore` import; the model class name is incorrect and the failure is silent.

The model at `backend/tarsy/models/db_models.py:330` is defined as `SessionScore`, not `SessionScoreDB`. Lines 661 and 985 attempt to import `SessionScoreDB`, which does not exist. The `ImportError` is silently caught, leaving `score_data` empty and causing `score_total`/`score_status` to be `None` in history views.

Replace all occurrences of `SessionScoreDB` with `SessionScore` in both locations (lines 661-690 and 985-999).
🤖 Prompt for AI Agents
In `@backend/tarsy/repositories/history_repository.py` around lines 658 - 690, The
import and usage of the non-existent model SessionScoreDB causes the except
ImportError to swallow the failure and leave score_data empty; update the import
and all references to use the correct model name SessionScore (replace
SessionScoreDB -> SessionScore) in the score-fetching block (the select/subquery
and score_query that reference SessionScoreDB) and in the other occurrence noted
in the file so the query uses the actual model class SessionScore and no longer
silently fails.
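For reference, the greatest-per-group shape of this query (once the import is fixed to `SessionScore`) can be sketched in raw SQL via sqlite3; the table and column names mirror the snippet, but the rows are made up:

```python
import sqlite3

# In-memory table standing in for the SessionScore model.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE session_score "
    "(session_id TEXT, started_at_us INTEGER, total_score INTEGER, status TEXT)"
)
conn.executemany(
    "INSERT INTO session_score VALUES (?, ?, ?, ?)",
    [
        ("s1", 100, 50, "completed"),
        ("s1", 200, 73, "completed"),  # latest attempt for s1
        ("s2", 150, None, "pending"),  # only attempt for s2
    ],
)

# Subquery picks the max started_at_us per session; the join recovers the full row,
# matching the select()/func.max() construction in the repository.
rows = conn.execute(
    """
    SELECT s.session_id, s.total_score, s.status
    FROM session_score AS s
    JOIN (
        SELECT session_id, MAX(started_at_us) AS max_started_at
        FROM session_score
        WHERE session_id IN ('s1', 's2')
        GROUP BY session_id
    ) AS latest
      ON s.session_id = latest.session_id
     AND s.started_at_us = latest.max_started_at
    """
).fetchall()

score_data = {
    sid: {"total_score": total, "status": status} for sid, total, status in rows
}
```

With a correct import, `score_data` is populated with the most recent attempt per session instead of staying empty.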
```python
    def _extract_score_from_response(self, response: str) -> Tuple[Optional[int], str]:
        """
        Extract total_score and analysis from LLM response.

        Score is expected on the last line via regex: r'(\\d+)\\s*$'
        Analysis is everything before the score line.

        If the total score could not be extracted, None is returned instead, along with the full response
        as the score analysis.

        Args:
            response: LLM response text

        Returns:
            Tuple of (total_score, score_analysis)
        """
        # Find score on last line
        lines = response.splitlines()
        last_line = lines[-1]
        score_match = re.search(r"^((\+|-)?(\d+))\s*$", last_line)
        if not score_match:
            logger.error(
                f"No score found on the last line in the response: {last_line[:500]}..."
            )
            return None, response

        total_score = int(score_match.group(1))

        # Extract analysis (everything before score)
        score_analysis = "\n".join(response.splitlines()[0:-1])

        return total_score, score_analysis
```
Score extraction lacks range validation and empty response handling.

Two issues:

1. The regex `r"^((\+|-)?(\d+))\s*$"` allows negative numbers, but scores should be 0-100. A value like `-5` would be parsed as `-5`.
2. Line 226 `lines[-1]` will raise `IndexError` if the response is empty or contains only whitespace.
Proposed fix with validation

```diff
 def _extract_score_from_response(self, response: str) -> Tuple[Optional[int], str]:
     # Find score on last line
     lines = response.splitlines()
+    if not lines:
+        logger.error("Empty response received for score extraction")
+        return None, response
+
     last_line = lines[-1]
-    score_match = re.search(r"^((\+|-)?(\d+))\s*$", last_line)
+    score_match = re.search(r"^(\d+)\s*$", last_line)
     if not score_match:
         logger.error(
             f"No score found on the last line in the response: {last_line[:500]}..."
         )
         return None, response
     total_score = int(score_match.group(1))
+
+    # Validate score is in expected range
+    if not 0 <= total_score <= 100:
+        logger.error(f"Score {total_score} is outside valid range 0-100")
+        return None, response
     # Extract analysis (everything before score)
     score_analysis = "\n".join(response.splitlines()[0:-1])
     return total_score, score_analysis
```

🤖 Prompt for AI Agents
In `@backend/tarsy/services/scoring_service.py` around lines 208 - 239, In
_extract_score_from_response, handle empty/whitespace responses and restrict
parsed scores to 0–100: first strip the response and if empty return (None,
response) (avoid indexing lines[-1]); otherwise find the last non-empty line
(not just lines[-1]) and attempt to parse it using a stricter pattern that only
accepts integers 0–100 (e.g., match 100 or 0–99) or parse then validate range;
if the match fails or the numeric value is outside 0–100, log an error including
the problematic last line and return (None, response); otherwise convert to int,
build score_analysis from all lines before that last non-empty line, and return
(total_score, score_analysis).
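The hardened extractor described above can be sketched as a standalone function (a hypothetical free function for illustration, not the service method itself):

```python
import re
from typing import Optional, Tuple

def extract_score(response: str) -> Tuple[Optional[int], str]:
    """Parse a 0-100 score from the last non-empty line, else (None, full response)."""
    lines = [line for line in response.splitlines() if line.strip()]
    if not lines:
        # Empty or whitespace-only response: nothing to index, no score.
        return None, response
    last_line = lines[-1].strip()
    # Unsigned 1-3 digit integer only; rejects "-5", "+73", and trailing junk.
    match = re.fullmatch(r"\d{1,3}", last_line)
    if not match:
        return None, response
    total_score = int(match.group(0))
    if not 0 <= total_score <= 100:
        # Reject out-of-range values such as 250.
        return None, response
    # Analysis is everything before the score line.
    return total_score, "\n".join(lines[:-1])
```

Scanning the last non-empty line (rather than `lines[-1]`) also tolerates LLM responses that end with a trailing newline or blank lines after the score.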
```python
        # Launch background scoring task
        asyncio.create_task(self._execute_scoring(score_record.score_id, session_id))
        logger.debug(
            f"Launched background scoring task for score {score_record.score_id}"
        )
```
Background task created without reference may be garbage collected.
asyncio.create_task() returns a task object that should be stored to prevent potential garbage collection and to properly handle exceptions. If the task is GC'd before completion, it may be silently cancelled.
Store task reference for proper lifecycle management

```diff
-    asyncio.create_task(self._execute_scoring(score_record.score_id, session_id))
+    # Store reference to prevent GC and enable exception handling
+    task = asyncio.create_task(self._execute_scoring(score_record.score_id, session_id))
+    task.add_done_callback(lambda t: t.exception() if not t.cancelled() else None)
     logger.debug(
         f"Launched background scoring task for score {score_record.score_id}"
     )
```

Alternatively, consider using a task tracking mechanism if you need to monitor or cancel these tasks.
🤖 Prompt for AI Agents
In `@backend/tarsy/services/scoring_service.py` around lines 567 - 571, the background task launched with `asyncio.create_task(self._execute_scoring(score_record.score_id, session_id))` is not stored and may be garbage-collected or have unobserved exceptions. Modify the ScoringService (or the class containing `_execute_scoring`) to keep a reference to the created Task (e.g., append the returned task to a tasks set/list or a tracking dict keyed by score_id/session_id) and remove completed tasks (or await/monitor them) to handle exceptions and lifecycle; assign the return value of `asyncio.create_task` to that tracker so tasks aren't silently cancelled.
```python
async def mock_generate_score(*args, **kwargs):
    conversation = kwargs.get("conversation")
    # Add assistant message with score
    conversation.append_assistant_message(
        """## Evaluation

**Logical Flow: 18/25**
The investigation followed a reasonable pattern with good use of tools.

**Consistency: 20/25**
Conclusions are well-supported by evidence gathered.

**Tool Relevance: 17/25**
Good tool selection, though some opportunities were missed.

**Synthesis Quality: 18/25**
Final analysis is comprehensive and acknowledges limitations.

73"""
    )
    return conversation


# Mock missing tools response (Turn 2)
turn_count = {"count": 0}


async def mock_generate(*args, **kwargs):
    conversation = kwargs.get("conversation")
    turn_count["count"] += 1

    if turn_count["count"] == 1:
        # Turn 1: Score response
        conversation.append_assistant_message(
            """## Evaluation

**Logical Flow: 18/25**
The investigation followed a reasonable pattern.

**Consistency: 20/25**
Conclusions are well-supported.

**Tool Relevance: 17/25**
Good tool selection overall.

**Synthesis Quality: 18/25**
Comprehensive final analysis.

73"""
        )
    else:
        # Turn 2: Missing tools analysis
        conversation.append_assistant_message(
            """1. **kubectl-describe**: Would have provided more detailed pod information.

2. **read-file**: Could have inspected configuration files for root cause analysis."""
        )

    return conversation
```
Rename unused args parameters to avoid Ruff ARG001.
Lines 94, 119, and 338 define `*args` but never use it. Ruff flags this; rename it to `*_args` (or drop it) to keep lint clean.
Fix:

```diff
-async def mock_generate_score(*args, **kwargs):
+async def mock_generate_score(*_args, **kwargs):
 ...
-async def mock_generate(*args, **kwargs):
+async def mock_generate(*_args, **kwargs):
 ...
-async def mock_generate_bad_response(*args, **kwargs):
+async def mock_generate_bad_response(*_args, **kwargs):
```

Also applies to: 338-343
🧰 Tools
🪛 Ruff (0.14.14)
94-94: Unused function argument: args
(ARG001)
119-119: Unused function argument: args
(ARG001)
🤖 Prompt for AI Agents
In `@backend/tests/integration/test_scoring_integration.py` around lines 94 - 151, in the async test helpers `mock_generate_score` and `mock_generate` (and the similar helper around the other referenced lines), rename the unused variadic parameter `*args` to `*_args` (or remove it entirely) so Ruff ARG001 is not raised; update the function signatures where `*args` is declared (e.g., `async def mock_generate_score(*args, **kwargs)`) to use `*_args` (or drop the variadic parameter) while leaving the use of `kwargs` and the rest of each function body unchanged.
```python
with patch.object(
    scoring_service, "_get_llm_client", return_value=mock_llm_client
):
    with pytest.raises(
        ValueError,
        match="Cannot force rescore while an already existing scoring is pending",
    ):
```
Collapse nested with blocks for readability (SIM117).
This nested `with` can be a single context-manager tuple, which also satisfies Ruff's SIM117.
Refactor:

```diff
-with patch.object(
-    scoring_service, "_get_llm_client", return_value=mock_llm_client
-):
-    with pytest.raises(
-        ValueError,
-        match="Cannot force rescore while an already existing scoring is pending",
-    ):
-        await scoring_service.initiate_scoring(
-            session_id=session_id, triggered_by="test-user", force_rescore=True
-        )
+with (
+    patch.object(scoring_service, "_get_llm_client", return_value=mock_llm_client),
+    pytest.raises(
+        ValueError,
+        match="Cannot force rescore while an already existing scoring is pending",
+    ),
+):
+    await scoring_service.initiate_scoring(
+        session_id=session_id, triggered_by="test-user", force_rescore=True
+    )
```
🧰 Tools
🪛 Ruff (0.14.14)
285-291: Use a single with statement with multiple contexts instead of nested with statements
Combine with statements
(SIM117)
🤖 Prompt for AI Agents
In `@backend/tests/integration/test_scoring_integration.py` around lines 285 - 291, the nested context managers can be collapsed into a single combined `with` statement to improve readability and satisfy SIM117; replace the two nested `with` blocks that use `patch.object(scoring_service, "_get_llm_client", return_value=mock_llm_client)` and `pytest.raises(ValueError, match=...)` with a single `with` that joins both context managers (e.g., `with patch.object(...), pytest.raises(...):`), keeping the same `mock_llm_client`, the same `scoring_service._get_llm_client` patch target, and the same exception match string.
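For reference, the parenthesized multi-context `with` form that this refactor relies on (Python 3.10+) behaves exactly like the nested version; a tiny standalone illustration using stdlib `nullcontext`:

```python
from contextlib import nullcontext

# Python 3.10+ allows grouping several context managers in parentheses,
# entering them left to right and exiting in reverse order
with (
    nullcontext("first") as a,
    nullcontext("second") as b,
):
    combined = f"{a}-{b}"

print(combined)  # prints "first-second"
```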
```python
"""Tests for judge prompts and hashing."""

from tarsy.agents.prompts.judges import (
    JUDGE_PROMPT_SCORE,
    JUDGE_PROMPT_FOLLOWUP_MISSING_TOOLS,
    get_current_prompt_hash,
    CURRENT_PROMPT_HASH,
)


class TestJudgePrompts:
    """Test judge prompt constants."""

    def test_judge_prompt_score_contains_placeholders(self):
        """Test that JUDGE_PROMPT_SCORE contains required placeholders."""
        assert "{{ALERT_DATA}}" in JUDGE_PROMPT_SCORE
        assert "{{FINAL_ANALYSIS}}" in JUDGE_PROMPT_SCORE
        assert "{{LLM_CONVERSATION}}" in JUDGE_PROMPT_SCORE
        assert "{{CHAT_CONVERSATION}}" in JUDGE_PROMPT_SCORE
        assert "{{OUTPUT_SCHEMA}}" in JUDGE_PROMPT_SCORE

    def test_judge_prompt_followup_contains_content(self):
        """Test that JUDGE_PROMPT_FOLLOWUP_MISSING_TOOLS is non-empty."""
        assert len(JUDGE_PROMPT_FOLLOWUP_MISSING_TOOLS) > 0
        assert "missing tool" in JUDGE_PROMPT_FOLLOWUP_MISSING_TOOLS.lower()


class TestPromptHashing:
    """Test SHA256 hashing logic for prompt versioning."""

    def test_hash_determinism(self):
        """Test that hashing produces consistent results."""
        hash1 = get_current_prompt_hash()
        hash2 = get_current_prompt_hash()

        assert hash1 == hash2
        assert len(hash1) == 64  # SHA256 hex digest length

    def test_hash_matches_module_constant(self):
        """Test that module-level CURRENT_PROMPT_HASH matches function result."""
        computed_hash = get_current_prompt_hash()
        assert CURRENT_PROMPT_HASH == computed_hash

    def test_hash_is_hex_string(self):
        """Test that hash is a valid hexadecimal string."""
        assert all(c in "0123456789abcdef" for c in CURRENT_PROMPT_HASH)
        assert len(CURRENT_PROMPT_HASH) == 64
```
Add pytest unit markers + return type hints (and apply Ruff SIM300).
This keeps tests aligned with project conventions and addresses the reported yoda-condition warning.
✅ Suggested updates

```diff
+import pytest
+
+@pytest.mark.unit
 class TestJudgePrompts:
     """Test judge prompt constants."""

-    def test_judge_prompt_score_contains_placeholders(self):
+    def test_judge_prompt_score_contains_placeholders(self) -> None:
         """Test that JUDGE_PROMPT_SCORE contains required placeholders."""
         assert "{{ALERT_DATA}}" in JUDGE_PROMPT_SCORE
         assert "{{FINAL_ANALYSIS}}" in JUDGE_PROMPT_SCORE
         assert "{{LLM_CONVERSATION}}" in JUDGE_PROMPT_SCORE
         assert "{{CHAT_CONVERSATION}}" in JUDGE_PROMPT_SCORE
         assert "{{OUTPUT_SCHEMA}}" in JUDGE_PROMPT_SCORE

-    def test_judge_prompt_followup_contains_content(self):
+    def test_judge_prompt_followup_contains_content(self) -> None:
         """Test that JUDGE_PROMPT_FOLLOWUP_MISSING_TOOLS is non-empty."""
         assert len(JUDGE_PROMPT_FOLLOWUP_MISSING_TOOLS) > 0
         assert "missing tool" in JUDGE_PROMPT_FOLLOWUP_MISSING_TOOLS.lower()

+@pytest.mark.unit
 class TestPromptHashing:
     """Test SHA256 hashing logic for prompt versioning."""

-    def test_hash_determinism(self):
+    def test_hash_determinism(self) -> None:
         """Test that hashing produces consistent results."""
         hash1 = get_current_prompt_hash()
         hash2 = get_current_prompt_hash()

         assert hash1 == hash2
         assert len(hash1) == 64  # SHA256 hex digest length

-    def test_hash_matches_module_constant(self):
+    def test_hash_matches_module_constant(self) -> None:
         """Test that module-level CURRENT_PROMPT_HASH matches function result."""
         computed_hash = get_current_prompt_hash()
-        assert CURRENT_PROMPT_HASH == computed_hash
+        assert computed_hash == CURRENT_PROMPT_HASH

-    def test_hash_is_hex_string(self):
+    def test_hash_is_hex_string(self) -> None:
         """Test that hash is a valid hexadecimal string."""
         assert all(c in "0123456789abcdef" for c in CURRENT_PROMPT_HASH)
         assert len(CURRENT_PROMPT_HASH) == 64
```

🧰 Tools
🪛 Ruff (0.14.14)
42-42: Yoda condition detected
Rewrite as computed_hash == CURRENT_PROMPT_HASH
(SIM300)
🤖 Prompt for AI Agents
In `@backend/tests/unit/agents/test_judges.py` around lines 1 - 47, add pytest markers and return type hints: import pytest at the top and decorate each test class (TestJudgePrompts, TestPromptHashing) with `@pytest.mark.unit`, then add explicit `-> None` return type hints to every test method (e.g., `test_judge_prompt_score_contains_placeholders(self) -> None`). Also fix the Ruff SIM300 yoda-condition warning by placing the computed value on the left side of the comparison (`computed_hash == CURRENT_PROMPT_HASH`); assertions like `len(CURRENT_PROMPT_HASH) == 64` already have the variable on the left and need no change.
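For context, the deterministic prompt-versioning scheme these tests exercise can be sketched as follows (the prompt strings here are placeholders, not the real judge prompts):

```python
import hashlib

# Hypothetical stand-ins for the real judge prompt constants
JUDGE_PROMPT_SCORE = "Score the investigation: {{ALERT_DATA}} {{FINAL_ANALYSIS}}"
JUDGE_PROMPT_FOLLOWUP_MISSING_TOOLS = "List any missing tool that would have helped."


def get_current_prompt_hash() -> str:
    """SHA256 over all judge prompts, so any edit yields a new version hash."""
    combined = JUDGE_PROMPT_SCORE + JUDGE_PROMPT_FOLLOWUP_MISSING_TOOLS
    return hashlib.sha256(combined.encode("utf-8")).hexdigest()


# Computed once at import time, like the module-level constant the tests check
CURRENT_PROMPT_HASH = get_current_prompt_hash()
```

Because the hash is a pure function of the prompt text, scores stored with a given `prompt_hash` can always be traced back to the exact prompt version that produced them.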
```python
class TestScoringStatusEnum:
    """Test ScoringStatus enum."""

    def test_active_values(self):
        """Test active_values() returns pending and in_progress."""
        active = ScoringStatus.active_values()
        assert "pending" in active
        assert "in_progress" in active
        assert len(active) == 2

    def test_values(self):
        """Test values() returns all status strings."""
        all_values = ScoringStatus.values()
        assert len(all_values) == 4
        assert "pending" in all_values
        assert "in_progress" in all_values
        assert "completed" in all_values
        assert "failed" in all_values
```
Add unit markers and return type hints for tests.
Line 11 and the other test classes should be marked as unit tests, and test functions should include -> None return type hints to meet project test conventions. This keeps markers consistent and helps mypy/linters.
Suggested pattern:

```diff
+@pytest.mark.unit
 class TestScoringStatusEnum:
     """Test ScoringStatus enum."""

-    def test_active_values(self):
+    def test_active_values(self) -> None:
         """Test active_values() returns pending and in_progress."""
```

Apply the same pattern to the other classes and test functions in this file.
As per coding guidelines: “Use test markers appropriately: @pytest.mark.unit …” and “Include type hints in test functions following project standards.”
Also applies to: 31-177
🤖 Prompt for AI Agents
In `@backend/tests/unit/models/test_session_score_models.py` around lines 11 - 28, add the pytest unit marker and return type hints: decorate TestScoringStatusEnum (and the other test classes in the file) with `@pytest.mark.unit`, and add explicit `-> None` return type hints to test methods such as test_active_values and test_values (and all other test_* functions in the file) so each test signature follows project conventions and linter/mypy expectations.
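A minimal enum satisfying these tests might look like this (a sketch; the actual model may carry extra members, fields, or SQLModel wiring):

```python
from enum import Enum


class ScoringStatus(str, Enum):
    """Lifecycle of a session score."""

    PENDING = "pending"
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"
    FAILED = "failed"

    @classmethod
    def values(cls) -> list[str]:
        """All status strings, in definition order."""
        return [member.value for member in cls]

    @classmethod
    def active_values(cls) -> list[str]:
        # Scores still being worked on; used to block duplicate scoring runs
        return [cls.PENDING.value, cls.IN_PROGRESS.value]
```

Subclassing `str` keeps the values directly comparable to the raw strings stored in the database column.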
```tsx
import { Chip, Tooltip, CircularProgress } from '@mui/material';
import { CheckCircle, Error, Warning, HourglassEmpty } from '@mui/icons-material';
import type { ScoringStatus } from '../types';

interface ScoreBadgeProps {
  score: number | null;
  status?: ScoringStatus;
  onClick?: (event: React.MouseEvent) => void;
  size?: 'small' | 'medium';
  label?: string; // Optional label to show after the score (e.g., 'score', 'points')
}

/**
 * ScoreBadge component displays session quality score with 3-tier color coding
 *
 * Color scheme (aligned with judge prompt scoring philosophy):
 * - 0-49: Red (failed investigation)
 * - 50-74: Yellow (weak investigation)
 * - 75-100: Green (good investigation)
 *
 * Also handles pending/in_progress/failed states
 */
const ScoreBadge: React.FC<ScoreBadgeProps> = ({
  score,
  status,
  onClick,
  size = 'small',
  label: customLabel,
}) => {
  // Handle different scoring states
  if (status === 'pending' || status === 'in_progress') {
    return (
      <Tooltip title={status === 'pending' ? 'Score pending' : 'Scoring in progress'} arrow>
        <Chip
          icon={status === 'in_progress' ? <CircularProgress size={14} /> : <HourglassEmpty />}
          label={status === 'in_progress' ? 'Scoring...' : 'Pending'}
          size={size}
          variant="outlined"
          color="default"
          onClick={onClick}
          sx={{ cursor: onClick ? 'pointer' : 'default' }}
        />
      </Tooltip>
    );
  }

  if (status === 'failed') {
    return (
      <Tooltip title="Scoring failed" arrow>
        <Chip
          icon={<Error />}
          label="Error"
          size={size}
          color="error"
          variant="outlined"
          onClick={onClick}
          sx={{ cursor: onClick ? 'pointer' : 'default' }}
```
🧩 Analysis chain
🏁 Script executed:

```shell
cat -n dashboard/src/components/ScoreBadge.tsx
```

Repository: codeready-toolchain/tarsy-bot
Length of output: 4258
Avoid shadowing the global Error identifier.
Line 3 imports `Error` from `@mui/icons-material`, which shadows the global `Error` and triggers the linter. Alias it and update both usages.
Fix:

```diff
-import { CheckCircle, Error, Warning, HourglassEmpty } from '@mui/icons-material';
+import { CheckCircle, Error as ErrorIcon, Warning, HourglassEmpty } from '@mui/icons-material';
```

Update usages on lines 52 and 88:

```diff
-          icon={<Error />}
+          icon={<ErrorIcon />}
```

and

```diff
-    icon = <Error />;
+    icon = <ErrorIcon />;
```

🧰 Tools
🪛 Biome (2.1.2)
[error] 3-3: Do not shadow the global "Error" property.
Consider renaming this variable. It's easy to confuse the origin of variables when they're named after a known global.
(lint/suspicious/noShadowRestrictedNames)
🤖 Prompt for AI Agents
In `@dashboard/src/components/ScoreBadge.tsx` around lines 2 - 58, the import of Error from '@mui/icons-material' shadows the global Error; rename the imported icon (e.g., `Error as ErrorIcon`) and update all usages in the ScoreBadge component where the icon prop uses `<Error />` (the Chip for the failed state and any other Chip/icon usage) to the new alias (`<ErrorIcon />`), along with any corresponding JSX references, so the global Error identifier is no longer shadowed.
I need to close this because it doesn't stack on top of the previous PRs 🤦🏼
This PR stacks on top of #246 and adds UI for the scoring functionality.
Part of the PR stack:
Summary by CodeRabbit
New Features
Enhancements