This repository was archived by the owner on Feb 26, 2026. It is now read-only.

Scoring UI #247

Closed
metlos wants to merge 5 commits into master from metlos/scoring-ui

Conversation


@metlos metlos commented Jan 28, 2026

This PR stacks on top of #246 and adds UI for the scoring functionality.

  • The session history list contains a new (clickable) column for the score badge.
  • The session detailed view contains a new button to initiate the scoring.
  • The session score badge is placed next to the session status in the header.
  • There's a new "tab" that shows the score details, i.e. the reasoning behind the score as well as the list of missing tools identified in the particular session.

Part of the PR stack:

Summary by CodeRabbit

New Features

  • Session Scoring: Users can now score alert sessions to evaluate investigation quality. Scores include detailed analysis of the investigation and missing tools.
  • Score Visualization: Session scores are displayed as badges in alert lists with status indicators (pending, in progress, completed, or failed) and numerical scores.
  • Score Details View: New dedicated view shows score analysis, scoring status, timestamp, and missing tools analysis for completed assessments.

Enhancements

  • Dashboard Integration: Score information now appears alongside session summaries for quick quality assessment visibility.


coderabbitai bot commented Jan 28, 2026

Walkthrough

This PR implements a comprehensive session scoring system for alert investigations. It adds database schema for storing scores, a scoring service with LLM-based evaluation, REST API endpoints, and frontend UI components to display and trigger scoring operations with support for multi-turn LLM interactions and deterministic prompt versioning.

Changes

Cohort / File(s) Summary
Database Schema & Models
backend/alembic/versions/20260123_1721_776dca9d9a2f_add_session_scores_table.py, backend/tarsy/models/db_models.py, backend/tarsy/models/constants.py
Creates session_scores table with 13 columns, 6 indexes including partial unique index for active scores; adds ScoringStatus enum (PENDING, IN_PROGRESS, COMPLETED, FAILED); extends AlertSession and Chat with chain-related fields
Scoring Service & Orchestration
backend/tarsy/services/scoring_service.py, backend/tarsy/agents/prompts/judges.py
Implements async scoring workflow: prompt generation, multi-turn LLM evaluation, score extraction, and result persistence; defines judge prompts and SHA256 hashing for deterministic prompt versioning (CURRENT_PROMPT_HASH)
API Controller & Routing
backend/tarsy/controllers/scoring_controller.py, backend/tarsy/main.py
Exposes POST /api/v1/scoring/sessions/{session_id}/score (returns 202 for async) and GET /api/v1/scoring/sessions/{session_id}/score; handles SessionNotFoundError (404), SessionNotCompletedError (400), and conflicts (409)
Data Access & Repositories
backend/tarsy/repositories/session_score_repository.py, backend/tarsy/repositories/history_repository.py
SessionScoreRepository manages CRUD and status transitions with retry logic; history_repository augments session overviews with latest score metrics (score_total, score_status)
API Models & Integration
backend/tarsy/models/api_models.py, backend/tarsy/models/history_models.py, backend/tarsy/integrations/llm/manager.py
Adds SessionScoreResponse (with computed_field current_prompt_used), SessionScoreRequest, and score fields to SessionOverview/DetailedSession; updates LLMManager type hints (str | None)
Backend Tests
backend/tests/integration/test_scoring_integration.py, backend/tests/unit/agents/test_judges.py, backend/tests/unit/controllers/test_scoring_controller.py, backend/tests/unit/models/test_session_score_models.py, backend/tests/unit/repositories/test_session_score_repository.py, backend/tests/unit/services/test_scoring_service.py
Comprehensive integration and unit test coverage: prompt validation, score extraction, async workflow, API endpoints (202/200/404/400/409/500), repository operations, and service initialization
Frontend Components - Scoring UI
dashboard/src/components/ScoreBadge.tsx, dashboard/src/components/ScoreDetailView.tsx, dashboard/src/components/MarkdownRenderer.tsx
ScoreBadge displays status/score with color coding (red <50, yellow 50-74, green 75-100); ScoreDetailView fetches and renders detailed analysis with metadata; MarkdownRenderer provides centralized markdown rendering with copy buttons
Frontend Components - Session Integration
dashboard/src/components/SessionDetailPageBase.tsx, dashboard/src/components/SessionDetailWrapper.tsx, dashboard/src/components/SessionHeader.tsx, dashboard/src/components/FinalAnalysisCard.tsx
Adds "score" view option to session detail pages; SessionHeader adds ScoreSession button, rescore confirmation dialog, and scoring state management; FinalAnalysisCard replaces inline ReactMarkdown with MarkdownRenderer
Frontend Components - Dashboard Integration
dashboard/src/components/DashboardView.tsx, dashboard/src/components/DashboardLayout.tsx, dashboard/src/components/HistoricalAlertsList.tsx, dashboard/src/components/AlertListItem.tsx, dashboard/src/App.tsx
Threads onScoreClick handler through dashboard hierarchy; HistoricalAlertsList adds Score column; AlertListItem renders ScoreBadge; App adds /sessions/:id/score route
Frontend Services & Types
dashboard/src/services/api.ts, dashboard/src/types/index.ts
APIClient methods scoreSession() and getSessionScore(); new types: ScoringStatus, SessionScore, ScoreSessionRequest/Response, and WebSocket scoring events; extends Session/props with score fields and click handlers
Frontend Test
dashboard/src/components/test_history_controller.py
Updates expected session response to include optional score_total and score_status fields
Documentation
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
Marks all tasks in Phases 1–4 as completed; Phases 5–6 retained for future implementation
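
The deterministic prompt versioning mentioned in the table above (CURRENT_PROMPT_HASH) can be sketched roughly as follows; the prompt texts here are placeholders, not the real constants from backend/tarsy/agents/prompts/judges.py:

```python
import hashlib

# Placeholder judge prompts; the real constants live in the judges module.
JUDGE_PROMPT_SCORE = "Score this investigation from 0-100..."
JUDGE_PROMPT_FOLLOWUP_MISSING_TOOLS = "List any tools that were missing..."


def get_current_prompt_hash() -> str:
    """SHA256 over the concatenated prompts: the hash changes whenever
    any included prompt changes, so stored scores can be tied to the
    exact criteria version that produced them."""
    combined = JUDGE_PROMPT_SCORE + JUDGE_PROMPT_FOLLOWUP_MISSING_TOOLS
    return hashlib.sha256(combined.encode("utf-8")).hexdigest()


# Computed once at import time and stored alongside each score record.
CURRENT_PROMPT_HASH = get_current_prompt_hash()
```

A score whose stored hash differs from CURRENT_PROMPT_HASH was produced by obsolete criteria and can be flagged for rescoring.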

Sequence Diagram(s)

sequenceDiagram
    participant Client as Client/Dashboard
    participant API as Scoring API
    participant Service as ScoringService
    participant DB as Database
    participant LLM as LLM Manager
    participant Background as Background Task

    Client->>API: POST /sessions/{id}/score
    API->>API: Extract user identity
    API->>Service: initiate_scoring(session_id, triggered_by, force_rescore)
    Service->>DB: Get session + check completion
    Service->>DB: Check active scores
    Service->>DB: Create score (status=PENDING)
    Service->>API: Return SessionScoreResponse (202 Accepted)
    API-->>Client: 202 + pending score

    Service->>Background: Launch async _execute_scoring task
    Background->>DB: Update status to IN_PROGRESS
    Background->>DB: Fetch session history & final analysis
    Background->>LLM: Send Turn 1: Score prompt
    LLM-->>Background: Response with score + analysis
    Background->>Background: Extract score from response
    Background->>LLM: Send Turn 2: Missing tools prompt
    LLM-->>Background: Response with missing tools analysis
    Background->>DB: Update score with results (status=COMPLETED)
    
    Client->>API: GET /sessions/{id}/score
    API->>Service: _get_latest_score(session_id)
    Service->>DB: Fetch latest score
    Service-->>API: SessionScore
    API-->>Client: 200 + SessionScoreResponse
sequenceDiagram
    participant UI as UI/SessionHeader
    participant API as APIClient
    participant Controller as Controller
    participant Service as ScoringService
    participant Dialog as Rescore Dialog

    UI->>UI: User clicks "Score Session"
    UI->>API: scoreSession(sessionId)
    API->>Controller: POST /sessions/{id}/score
    Controller->>Service: initiate_scoring()
    
    alt Score already exists and not force
        Service-->>Controller: Return existing score
        Controller-->>API: 200 OK (existing)
        API-->>UI: Existing score response
        UI->>Dialog: Show rescore confirmation?
    else Force rescore
        Service-->>Controller: Return new pending score
        Controller-->>API: 202 Accepted (new)
        API-->>UI: New pending score
        UI->>UI: Show loading state
    end
    
    UI->>UI: Poll GET /sessions/{id}/score (background)
    UI->>API: getSessionScore(sessionId)
    API-->>UI: Updated SessionScore
    UI->>UI: Render ScoreDetailView when completed
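
The status-code branching in the diagrams above (202/200/404/400/409) can be sketched as plain Python; the session/score dict shapes and names here are illustrative assumptions, not the actual ScoringService API:

```python
# Hypothetical sketch of the initiate-scoring decisions shown in the
# sequence diagrams; return values mimic the HTTP status codes the
# controller maps them to.

ACTIVE = {"pending", "in_progress"}


def initiate_scoring(session, latest_score, force_rescore: bool):
    """Return (http_status, payload) for a score request."""
    if session is None:
        return 404, {"detail": "Session not found"}
    if session["status"] != "completed":
        return 400, {"detail": "Session not completed"}
    if latest_score is not None:
        if latest_score["status"] in ACTIVE:
            return 409, {"detail": "Scoring already in progress"}
        if latest_score["status"] == "completed" and not force_rescore:
            return 200, latest_score  # existing score, no new work started
    new_score = {"status": "pending"}
    # The real service persists the record and launches a background task here.
    return 202, new_score
```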

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • PR #180: Main scoring implementation (this PR) realizes the EP-0028 design and adds all core scoring components (judges, service, API, frontend).
  • PR #28: Both modify history models and history_repository with new session score fields, requiring careful coordination on schema and data retrieval logic.

Suggested reviewers

  • alexeykazakov

Poem

🐰 A scoring dream comes into play,
With judge prompts leading the way,
Badge colors tell the story true,
From red to green, investigations review,
Async tasks hop through the night,
Making evaluations bright! 🌟

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)
  • Title check ❓ Inconclusive — The title 'Scoring UI' is vague and does not clearly describe the main changes in this comprehensive pull request, which spans backend and frontend scoring functionality. Resolution: consider a more descriptive title such as 'Add session scoring with LLM-based evaluation and UI components' or 'Implement scoring service, API endpoints, and UI integration' to better reflect the full scope of changes.

✅ Passed checks (2 passed)
  • Description check ✅ Passed — Check skipped: CodeRabbit's high-level summary is enabled.
  • Docstring coverage ✅ Passed — Docstring coverage is 89.22%, which meets the required threshold of 80.00%.



@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 13

🤖 Fix all issues with AI agents
In `@backend/tarsy/agents/prompts/judges.py`:
- Around line 320-335: get_current_prompt_hash currently only hashes
JUDGE_PROMPT_SCORE and JUDGE_PROMPT_FOLLOWUP_MISSING_TOOLS, so changes to other
prompt parts (system/reminder) won't update the version; update the function to
concatenate all prompt strings used in judge requests (include the system prompt
and reminder prompt constants used in this module — e.g., JUDGE_PROMPT_SYSTEM
and JUDGE_PROMPT_REMINDER or whatever names are defined here — along with
JUDGE_PROMPT_SCORE and JUDGE_PROMPT_FOLLOWUP_MISSING_TOOLS) before computing the
SHA256, or if you intentionally want a subset, update the function docstring to
explicitly list which prompts are included; ensure you reference
get_current_prompt_hash and the prompt constant names when making the change.

In `@backend/tarsy/controllers/scoring_controller.py`:
- Around line 152-165: The except handlers in scoring_controller.py currently
include str(e) in HTTPException details (see the RuntimeError except block and
the generic except Exception block around initiating score for session_id),
which can leak internal info; change both HTTPException detail strings to
generic messages like "Internal server error" or "Scoring service error" without
including str(e), while retaining the existing logger.error/logger.exception
calls to record the full error internally (use session_id in log context as
already done).

In `@backend/tarsy/repositories/history_repository.py`:
- Around line 658-690: The import and usage of the non-existent model
SessionScoreDB causes the except ImportError to swallow the failure and leave
score_data empty; update the import and all references to use the correct model
name SessionScore (replace SessionScoreDB -> SessionScore) in the score-fetching
block (the select/subquery and score_query that reference SessionScoreDB) and in
the other occurrence noted in the file so the query uses the actual model class
SessionScore and no longer silently fails.

In `@backend/tarsy/services/scoring_service.py`:
- Around line 648-651: Remove the debugging marker from the logger.debug call
that prints the conversation: locate the logger.debug(f"!!!!! Full conversation
after turn 1:\n{conversation}") line (the logger.debug invocation referencing
conversation) and either delete it or replace it with a cleaned-up message
without the "!!!!!" artifact (e.g., "Full conversation after turn 1:" +
conversation) so the log entry is production-ready.
- Around line 208-239: In _extract_score_from_response, handle empty/whitespace
responses and restrict parsed scores to 0–100: first strip the response and if
empty return (None, response) (avoid indexing lines[-1]); otherwise find the
last non-empty line (not just lines[-1]) and attempt to parse it using a
stricter pattern that only accepts integers 0–100 (e.g., match 100 or 0–99) or
parse then validate range; if the match fails or the numeric value is outside
0–100, log an error including the problematic last line and return (None,
response); otherwise convert to int, build score_analysis from all lines before
that last non-empty line, and return (total_score, score_analysis).
- Around line 567-571: The background task launched with
asyncio.create_task(self._execute_scoring(score_record.score_id, session_id)) is
not stored and may be garbage-collected or have unobserved exceptions; modify
the ScoringService (or the class containing _execute_scoring) to keep a
reference to the created Task (e.g., append the returned task to a tasks
set/list or a tracking dict keyed by score_id/session_id) and ensure you remove
completed tasks (or await/monitor them) to handle exceptions and lifecycle;
update the code that calls asyncio.create_task to assign its return value to
that tracker so tasks aren't silently cancelled.
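
The task bookkeeping asked for here can follow the standard asyncio pattern of keeping strong references plus a done-callback; this is a generic sketch, not the actual ScoringService code:

```python
import asyncio


class ScoringTaskTracker:
    """Keeps strong references to background scoring tasks so they are not
    garbage-collected mid-flight, and surfaces exceptions when they finish."""

    def __init__(self) -> None:
        self._tasks: set[asyncio.Task] = set()

    def launch(self, coro) -> asyncio.Task:
        task = asyncio.create_task(coro)
        self._tasks.add(task)  # strong reference keeps the task alive
        task.add_done_callback(self._on_done)
        return task

    def _on_done(self, task: asyncio.Task) -> None:
        self._tasks.discard(task)  # drop the reference once finished
        if not task.cancelled() and task.exception() is not None:
            # In the real service this would go through the logger.
            print(f"scoring task failed: {task.exception()!r}")
```

The call site would then become `self._tracker.launch(self._execute_scoring(score_record.score_id, session_id))` instead of a bare `asyncio.create_task(...)`.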

In `@backend/tests/integration/test_scoring_integration.py`:
- Around line 94-151: In the async test helpers mock_generate_score and
mock_generate (and the other similar test helper around lines referenced),
rename the unused variadic parameter *args to _args (or remove it entirely) so
Ruff ARG001 is not raised; update function signatures where *args is declared
(e.g., async def mock_generate_score(*args, **kwargs) and async def
mock_generate(*args, **kwargs)) to use async def mock_generate_score(_args,
**kwargs) or async def mock_generate(_args, **kwargs) (or drop the variadic
parameter) while leaving usage of kwargs and the rest of the function body
unchanged.
- Around line 34-366: The tests and pytest fixtures lack return type hints; add
appropriate annotations for each fixture and test: annotate fixtures like
in_memory_engine -> Engine, db_manager -> DatabaseManager, history_service ->
HistoryService, scoring_service -> ScoringService, mock_llm_client -> Mock,
completed_session -> AlertSession (import types from sqlmodel/sqlalchemy/typing
or project types as needed), and annotate async test functions
(test_end_to_end_score_completed_session,
test_multiple_scoring_attempts_same_session, test_concurrent_scoring_prevention,
test_scoring_failure_updates_status, test_score_extraction_failure) with ->
None; ensure any generator fixtures use typing.Generator or Iterator when
applicable and add necessary imports for the type names referenced.
- Around line 48-60: Remove the duplicate import of Session inside the
db_manager function to satisfy Ruff F811: delete the line "from sqlmodel import
Session" in db_manager and rely on the existing import at the top of the module;
ensure db_manager still sets manager.session_factory using
sessionmaker(bind=in_memory_engine, class_=Session, expire_on_commit=False) and
references the DatabaseManager, in_memory_engine, and sessionmaker symbols
unchanged.
- Around line 285-291: The nested context managers around the test can be
collapsed into a single combined `with` statement to improve readability and
satisfy SIM117; replace the two nested `with` blocks that use
`patch.object(scoring_service, "_get_llm_client", return_value=mock_llm_client)`
and `pytest.raises(ValueError, match=...)` with a single `with` that joins both
context managers (e.g., `with patch.object(...), pytest.raises(...):`), keeping
the same `mock_llm_client`, the same `scoring_service._get_llm_client` patch
target, and the same exception match string.

In `@backend/tests/unit/agents/test_judges.py`:
- Around line 1-47: Add pytest markers and return type hints: import pytest at
top and decorate each test class or each test function with `@pytest.mark.unit`,
then add explicit return type hints (-> None) to every test method (e.g.,
test_judge_prompt_score_contains_placeholders(self) -> None). Also run Ruff
SIM300 and fix any yoda-condition warnings by ensuring comparisons place the
variable on the left side (e.g., keep len(CURRENT_PROMPT_HASH) == 64 rather than
64 == len(...)) and adjust any expressions flagged by SIM300 accordingly;
reference the test class and function names (TestJudgePrompts, TestPromptHashing
and all test_* methods) to locate where to apply these changes.

In `@backend/tests/unit/models/test_session_score_models.py`:
- Around line 11-28: Add the pytest unit marker and return type hints to the
test classes and functions in this file: decorate TestScoringStatusEnum (and the
other test classes in the file) with `@pytest.mark.unit`, and add explicit return
type hints (-> None) to the test methods such as test_active_values and
test_values (and all other test_* functions in the file) so each test function
signature follows the project convention and linters/mypy expectations.

In `@dashboard/src/components/ScoreBadge.tsx`:
- Around line 2-58: The import of Error from '@mui/icons-material' shadows the
global Error; rename the imported icon (e.g., Error as ErrorIcon) and update all
usages in the ScoreBadge component where the icon prop uses <Error /> (the Chip
for the failed state and any other Chip/Icon usage) to use the new alias
(<ErrorIcon />) and any corresponding JSX references so the global Error
identifier is no longer shadowed.
🧹 Nitpick comments (15)
backend/tarsy/models/history_models.py (1)

229-231: Consider constraining score_status to known values for validation.

Using a Literal (or an enum if you already have one) avoids invalid statuses slipping into API responses.

♻️ Proposed refactor
-    score_status: Optional[str] = None  # pending|in_progress|completed|failed
+    score_status: Optional[Literal["pending", "in_progress", "completed", "failed"]] = None
-    score_status: Optional[str] = None  # pending|in_progress|completed|failed
+    score_status: Optional[Literal["pending", "in_progress", "completed", "failed"]] = None

Also applies to: 437-438

backend/tests/unit/repositories/test_session_score_repository.py (2)

1-11: Add @pytest.mark.unit marker to the test class.

Per coding guidelines, unit tests should be marked with @pytest.mark.unit. Also, the IntegrityError import on line 4 appears unused in this test file since the repository handles it internally and raises ValueError instead.

Suggested fix
 import pytest
-from sqlalchemy.exc import IntegrityError
 from sqlmodel import Session, SQLModel, create_engine
 
 from tarsy.models.constants import ScoringStatus
 from tarsy.models.db_models import SessionScore, AlertSession
 from tarsy.repositories.session_score_repository import SessionScoreRepository
 from tarsy.utils.timestamp import now_us
 
 
+@pytest.mark.unit
 class TestSessionScoreRepository:

As per coding guidelines: "Use test markers appropriately: @pytest.mark.unit for unit tests".


134-159: Consider adding explicit time ordering guarantee.

The test relies on now_us() - 1000000 to create an "earlier" timestamp, but if the test runs extremely fast, there's a theoretical (though unlikely) risk of timing issues. The test logic is correct, but you could make the ordering more explicit.

Optional: More explicit timestamp ordering
     def test_get_latest_score_for_session(self, repository, sample_alert_session):
         """Test getting most recent score."""
+        base_time = now_us()
         # Create two scores
         score1 = SessionScore(
             session_id=sample_alert_session.session_id,
             prompt_hash="abc123",
             score_triggered_by="user:test",
             status=ScoringStatus.COMPLETED.value,
-            scored_at_us=now_us() - 1000000,  # Earlier
+            scored_at_us=base_time - 1000000,  # Earlier
         )
         score2 = SessionScore(
             session_id=sample_alert_session.session_id,
             prompt_hash="def456",
             score_triggered_by="user:test",
             status=ScoringStatus.PENDING.value,
-            scored_at_us=now_us(),  # Later
+            scored_at_us=base_time,  # Later
         )
dashboard/src/components/SessionDetailWrapper.tsx (1)

22-27: Ternary chain is becoming complex; consider refactoring for readability.

The nested ternary for initialView detection is functional but harder to read at a glance. Consider extracting to a helper function for clarity.

Optional: Extract to helper function
+  // Determine initial view from URL path
+  const getInitialView = (): 'conversation' | 'technical' | 'score' => {
+    if (location.pathname.includes('/technical')) return 'technical';
+    if (location.pathname.includes('/score')) return 'score';
+    return 'conversation';
+  };
+
   // Determine initial view from URL
-  const initialView = location.pathname.includes('/technical')
-    ? 'technical'
-    : location.pathname.includes('/score')
-    ? 'score'
-    : 'conversation';
-  const [currentView, setCurrentView] = useState<'conversation' | 'technical' | 'score'>(initialView);
+  const [currentView, setCurrentView] = useState<'conversation' | 'technical' | 'score'>(getInitialView);
backend/tarsy/models/db_models.py (1)

390-394: Consider adding an index on scored_at_us for time-based queries.

The scored_at_us field uses BIGINT but doesn't have an index. If you anticipate queries filtering or ordering by scoring initiation time (e.g., for cleanup or analytics), consider adding an index.

backend/tests/unit/controllers/test_scoring_controller.py (1)

38-44: Tests mock a private method _get_latest_score which couples tests to implementation details.

The mock targets _get_latest_score (underscore-prefixed private method). This creates tight coupling between tests and implementation. If the service refactors this internal method, tests will break even if the public contract remains unchanged.

Consider whether the ScoringService should expose a public method for retrieving scores, or mock at a lower level (repository).

backend/tarsy/controllers/scoring_controller.py (2)

11-11: Unused import: Body from fastapi.

The Body import is not used in this file.

🧹 Remove unused import
-from fastapi import APIRouter, Depends, HTTPException, Path, Request, Response, Body
+from fastapi import APIRouter, Depends, HTTPException, Path, Request, Response

213-220: Controller calls private method _get_latest_score directly.

The GET endpoint calls scoring_service._get_latest_score(), which is a private method (underscore prefix). This breaks encapsulation and makes the controller dependent on implementation details.

Consider exposing a public method like get_latest_score() or get_score_for_session() in the service layer.

♻️ Suggested refactor in scoring_service.py

Add a public method in ScoringService:

async def get_latest_score(self, session_id: str) -> Optional[SessionScore]:
    """
    Get the latest score for a session (public API).
    
    Args:
        session_id: Session identifier
        
    Returns:
        Latest SessionScore or None if not found
    """
    return await self._get_latest_score(session_id)

Then update the controller to use the public method.

dashboard/src/components/SessionHeader.tsx (1)

573-583: Magic string 'completed' should use a constant.

The status check uses a string literal 'completed' instead of referencing a shared constant. This could lead to bugs if the status value changes.

♻️ Use ScoringStatus constant

Import and use the ScoringStatus type or create a constant:

+import { SCORING_STATUS } from '../utils/statusConstants';

   // EP-0028: Handle score session button click
   const handleScoreSession = () => {
     // Check if session already has a completed score
-    if (session.score_status === 'completed') {
+    if (session.score_status === SCORING_STATUS.COMPLETED) {
       // Show confirmation dialog for rescoring
       setShowRescoreDialog(true);
     } else {
backend/tarsy/models/api_models.py (1)

14-14: Import of CURRENT_PROMPT_HASH creates coupling between API and agent modules.

Importing from tarsy.agents.prompts.judges in the API models creates a dependency from the API layer to the agent/prompt layer. This could be moved to a shared constants module if this coupling becomes problematic.

backend/tests/unit/services/test_scoring_service.py (2)

307-319: Unused fixture parameter mock_chat_conversation_history.

The mock_chat_conversation_history parameter in test_build_score_prompt_with_chat_conversation_present is declared but the test actually uses mock_final_analysis_response which already contains the chat conversation.

This is a minor issue in test code and the fixture dependency is technically correct (ensures the data is available), but the explicit parameter suggests it might have been intended to be used directly.


571-598: Consider combining nested with statements for readability.

The nested with statements can be combined into a single statement as suggested by static analysis. However, the current structure is clear and the nesting shows the mock dependency hierarchy.

♻️ Combined with statement (optional)
-        with patch.object(
-            scoring_service.history_service,
-            "get_session",
-            return_value=mock_session_completed,
-        ):
-            with patch.object(scoring_service, "_get_latest_score", return_value=None):
-                with patch.object(
-                    scoring_service,
-                    "_create_score_record",
-                    return_value=mock_score_record_pending,
-                ):
-                    with patch("asyncio.create_task") as mock_create_task:
+        with (
+            patch.object(
+                scoring_service.history_service,
+                "get_session",
+                return_value=mock_session_completed,
+            ),
+            patch.object(scoring_service, "_get_latest_score", return_value=None),
+            patch.object(
+                scoring_service,
+                "_create_score_record",
+                return_value=mock_score_record_pending,
+            ),
+            patch("asyncio.create_task") as mock_create_task,
+        ):
backend/tarsy/repositories/session_score_repository.py (1)

99-136: Truthy check for completed_at_us could skip valid zero value.

Line 121 uses if completed_at_us: which would skip if the value is 0. While 0 microseconds since epoch (1970-01-01 00:00:00.000000) is practically never valid, using is not None would be more defensive and consistent with the total_score check on line 127.

Suggested fix for consistency
-        if completed_at_us:
+        if completed_at_us is not None:
             score.completed_at_us = completed_at_us
backend/tarsy/services/scoring_service.py (2)

97-115: Repository context manager silently yields None on error.

The _get_repository method yields None on exceptions instead of propagating them. While callers do check for None, this pattern can mask underlying database errors and make debugging harder.

Consider logging more details or re-raising after logging to help diagnose issues:

         except Exception as e:
             logger.error(f"Failed to get scoring repository: {str(e)}")
-            yield None
+            raise  # Let caller handle the exception

Alternatively, if the silent failure is intentional for resilience, document this behavior in the docstring.


142-200: Inline import should be at module top level.

Line 161 has import json inside the method. Per PEP 8 and Ruff conventions, imports should be at the top of the file.

Move import to top of file
 import asyncio
+import json
 import random

And remove line 161.

Comment on lines +320 to +335
def get_current_prompt_hash() -> str:
    """
    Compute SHA256 hash of both judge prompts concatenated.

    This hash provides deterministic criteria versioning - when prompts change,
    the hash changes, allowing detection of scores using obsolete criteria.

    Returns:
        Hex string of SHA256 hash (64 characters)
    """
    # Concatenate both prompts in the order they're used
    combined_prompts = JUDGE_PROMPT_SCORE + JUDGE_PROMPT_FOLLOWUP_MISSING_TOOLS

    # Compute SHA256 hash
    hash_obj = hashlib.sha256(combined_prompts.encode("utf-8"))

    return hash_obj.hexdigest()

⚠️ Potential issue | 🟠 Major

Include all prompts in the version hash (system + reminder).

CURRENT_PROMPT_HASH won’t change if the system prompt or the reminder prompt changes, which breaks prompt-versioning guarantees. If those prompts are part of the judge request, they should be included in the hash (or the docstring/usage updated to make this explicit).

🐛 Proposed fix
-    # Concatenate both prompts in the order they're used
-    combined_prompts = JUDGE_PROMPT_SCORE + JUDGE_PROMPT_FOLLOWUP_MISSING_TOOLS
+    # Concatenate all prompts in the order they're used
+    combined_prompts = (
+        JUDGE_SYSTEM_PROMPT
+        + JUDGE_PROMPT_SCORE
+        + JUDGE_PROMPT_FOLLOWUP_MISSING_TOOLS
+        + JUDGE_PROMPT_SCORE_REMINDER
+    )
🤖 Prompt for AI Agents
In `@backend/tarsy/agents/prompts/judges.py` around lines 320 - 335,
get_current_prompt_hash currently only hashes JUDGE_PROMPT_SCORE and
JUDGE_PROMPT_FOLLOWUP_MISSING_TOOLS, so changes to other prompt parts
(system/reminder) won't update the version; update the function to concatenate
all prompt strings used in judge requests (include the system prompt and
reminder prompt constants used in this module — e.g., JUDGE_PROMPT_SYSTEM and
JUDGE_PROMPT_REMINDER or whatever names are defined here — along with
JUDGE_PROMPT_SCORE and JUDGE_PROMPT_FOLLOWUP_MISSING_TOOLS) before computing the
SHA256, or if you intentionally want a subset, update the function docstring to
explicitly list which prompts are included; ensure you reference
get_current_prompt_hash and the prompt constant names when making the change.

Comment on lines +152 to +165
except RuntimeError as e:
# Database errors
logger.error(f"Scoring service error for session {session_id}: {str(e)}")
raise HTTPException(
status_code=500, detail=f"Scoring service error: {str(e)}"
) from e
except Exception as e:
# Unexpected errors
logger.exception(
f"Unexpected error initiating score for session {session_id}: {str(e)}"
)
raise HTTPException(
status_code=500, detail=f"Internal server error: {str(e)}"
) from e
⚠️ Potential issue | 🟡 Minor

Error messages may expose internal details in 500 responses.

The 500 error responses include str(e) which could leak implementation details or stack traces in production. Consider using a generic message for unexpected errors.

🛡️ Suggested fix
     except Exception as e:
         # Unexpected errors
         logger.exception(
             f"Unexpected error initiating score for session {session_id}: {str(e)}"
         )
         raise HTTPException(
-            status_code=500, detail=f"Internal server error: {str(e)}"
+            status_code=500, detail="Internal server error"
         ) from e
🤖 Prompt for AI Agents
In `@backend/tarsy/controllers/scoring_controller.py` around lines 152 - 165, The
except handlers in scoring_controller.py currently include str(e) in
HTTPException details (see the RuntimeError except block and the generic except
Exception block around initiating score for session_id), which can leak internal
info; change both HTTPException detail strings to generic messages like
"Internal server error" or "Scoring service error" without including str(e),
while retaining the existing logger.error/logger.exception calls to record the
full error internally (use session_id in log context as already done).
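The pattern the reviewer suggests — log the full exception server-side, return only a generic detail to the client — can be sketched independently of FastAPI. The helper name and status mapping below are illustrative, not the PR's actual code:

```python
import logging

logger = logging.getLogger("scoring")


def sanitized_error_detail(exc: Exception, session_id: str) -> tuple[int, str]:
    """Pick an HTTP status and a client-safe detail for a scoring failure.

    The full exception is recorded in the server log; 500 responses carry
    only a generic message so internals never reach the client.
    """
    if isinstance(exc, ValueError):
        # Validation messages are written for users and safe to echo back.
        return 400, str(exc)
    if isinstance(exc, RuntimeError):
        logger.error("Scoring service error for session %s", session_id, exc_info=exc)
        return 500, "Scoring service error"
    logger.error("Unexpected error for session %s", session_id, exc_info=exc)
    return 500, "Internal server error"
```

A controller would then raise `HTTPException(status_code=status, detail=detail)` from the returned pair, keeping the split between logged and exposed information in one place.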

Comment on lines +658 to +690
# Fetch latest scores for sessions (most recent scoring attempt per session)
score_data = {}
try:
from tarsy.models.db_models import SessionScoreDB

# Get latest score for each session (most recent started_at_us)
# Use subquery to get max started_at_us per session, then join to get full record
subquery = select(
SessionScoreDB.session_id,
func.max(SessionScoreDB.started_at_us).label('max_started_at')
).where(
SessionScoreDB.session_id.in_(session_ids)
).group_by(SessionScoreDB.session_id).subquery()

score_query = select(SessionScoreDB).join(
subquery,
and_(
SessionScoreDB.session_id == subquery.c.session_id,
SessionScoreDB.started_at_us == subquery.c.max_started_at
)
)

score_results = self.session.exec(score_query).all()
score_data = {
score.session_id: {
'total_score': score.total_score,
'status': score.status
}
for score in score_results
}
except ImportError:
# SessionScoreDB not available (shouldn't happen but defensive)
pass
⚠️ Potential issue | 🟠 Major



Fix the `SessionScoreDB` → `SessionScore` import; the model class name is incorrect and the failure is silent.

The model at backend/tarsy/models/db_models.py:330 is defined as `SessionScore`, not `SessionScoreDB`. Lines 661 and 985 attempt to import `SessionScoreDB`, which does not exist. The `ImportError` is silently caught, leaving `score_data` empty and causing `score_total`/`score_status` to be `None` in history views.

Replace all occurrences of `SessionScoreDB` with `SessionScore` in both locations (lines 661–690 and 985–999).

🤖 Prompt for AI Agents
In `@backend/tarsy/repositories/history_repository.py` around lines 658 - 690, The
import and usage of the non-existent model SessionScoreDB causes the except
ImportError to swallow the failure and leave score_data empty; update the import
and all references to use the correct model name SessionScore (replace
SessionScoreDB -> SessionScore) in the score-fetching block (the select/subquery
and score_query that reference SessionScoreDB) and in the other occurrence noted
in the file so the query uses the actual model class SessionScore and no longer
silently fails.
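The greatest-n-per-group technique used in this block — a subquery for the MAX timestamp per key, joined back to fetch the full row — can be demonstrated with plain SQL. Table and column names mirror the PR, but the data below is made up:

```python
import sqlite3

# In-memory sketch of the "latest score per session" query.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE session_scores ("
    "session_id TEXT, started_at_us INTEGER, total_score INTEGER, status TEXT)"
)
conn.executemany(
    "INSERT INTO session_scores VALUES (?, ?, ?, ?)",
    [
        ("s1", 100, 60, "completed"),
        ("s1", 200, 73, "completed"),  # latest attempt for s1 wins
        ("s2", 150, None, "failed"),
    ],
)
rows = conn.execute(
    """
    SELECT s.session_id, s.total_score, s.status
    FROM session_scores s
    JOIN (
        SELECT session_id, MAX(started_at_us) AS max_started_at
        FROM session_scores
        GROUP BY session_id
    ) latest
      ON s.session_id = latest.session_id
     AND s.started_at_us = latest.max_started_at
    ORDER BY s.session_id
    """
).fetchall()
# rows -> [('s1', 73, 'completed'), ('s2', None, 'failed')]
```

The SQLModel code in the diff builds the same shape with `select(...).group_by(...).subquery()` and an `and_` join condition; only the model class name was wrong, not the query structure.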

Comment on lines +208 to +239
def _extract_score_from_response(self, response: str) -> Tuple[Optional[int], str]:
"""
Extract total_score and analysis from LLM response.

Score is expected on the last line via regex: r'(\\d+)\\s*$'
Analysis is everything before the score line.

If the total score could not be extracted, None is returned instead, along with the full response
as the score analysis.

Args:
response: LLM response text

Returns:
Tuple of (total_score, score_analysis)
"""
# Find score on last line
lines = response.splitlines()
last_line = lines[-1]
score_match = re.search(r"^((\+|-)?(\d+))\s*$", last_line)
if not score_match:
logger.error(
f"No score found on the last line in the response: {last_line[:500]}..."
)
return None, response

total_score = int(score_match.group(1))

# Extract analysis (everything before score)
score_analysis = "\n".join(response.splitlines()[0:-1])

return total_score, score_analysis
⚠️ Potential issue | 🟠 Major

Score extraction lacks range validation and empty response handling.

Two issues:

1. The regex `r"^((\+|-)?(\d+))\s*$"` allows negative numbers, but scores should be 0-100. A value like `-5` would be parsed as `-5`.

2. Line 226: `lines[-1]` will raise `IndexError` if the response is empty or contains only whitespace.

Proposed fix with validation
     def _extract_score_from_response(self, response: str) -> Tuple[Optional[int], str]:
         # Find score on last line
         lines = response.splitlines()
+        if not lines:
+            logger.error("Empty response received for score extraction")
+            return None, response
+        
         last_line = lines[-1]
-        score_match = re.search(r"^((\+|-)?(\d+))\s*$", last_line)
+        score_match = re.search(r"^(\d+)\s*$", last_line)
         if not score_match:
             logger.error(
                 f"No score found on the last line in the response: {last_line[:500]}..."
             )
             return None, response

         total_score = int(score_match.group(1))
+        
+        # Validate score is in expected range
+        if not 0 <= total_score <= 100:
+            logger.error(f"Score {total_score} is outside valid range 0-100")
+            return None, response

         # Extract analysis (everything before score)
         score_analysis = "\n".join(response.splitlines()[0:-1])

         return total_score, score_analysis
🤖 Prompt for AI Agents
In `@backend/tarsy/services/scoring_service.py` around lines 208 - 239, In
_extract_score_from_response, handle empty/whitespace responses and restrict
parsed scores to 0–100: first strip the response and if empty return (None,
response) (avoid indexing lines[-1]); otherwise find the last non-empty line
(not just lines[-1]) and attempt to parse it using a stricter pattern that only
accepts integers 0–100 (e.g., match 100 or 0–99) or parse then validate range;
if the match fails or the numeric value is outside 0–100, log an error including
the problematic last line and return (None, response); otherwise convert to int,
build score_analysis from all lines before that last non-empty line, and return
(total_score, score_analysis).
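A defensive version of the extraction incorporating both fixes might look like the sketch below; the function name is illustrative, and the real method lives on the service class:

```python
import re
from typing import Optional, Tuple


def extract_score(response: str) -> Tuple[Optional[int], str]:
    """Parse a 0-100 score from the last non-empty line of an LLM reply.

    Returns (None, response) when the reply is empty, the last non-empty
    line is not a bare non-negative integer, or the value exceeds 100.
    """
    all_lines = response.splitlines()
    non_empty = [i for i, line in enumerate(all_lines) if line.strip()]
    if not non_empty:
        return None, response
    idx = non_empty[-1]
    # \d{1,3} rejects signs outright, so "-5" can never parse as a score.
    match = re.fullmatch(r"\d{1,3}", all_lines[idx].strip())
    if match is None:
        return None, response
    score = int(match.group(0))
    if score > 100:
        return None, response
    # Everything above the score line is the analysis text.
    return score, "\n".join(all_lines[:idx])
```

Scanning for the last non-empty line (rather than `lines[-1]`) also tolerates trailing blank lines, which LLM outputs frequently include.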

Comment on lines +567 to +571
# Launch background scoring task
asyncio.create_task(self._execute_scoring(score_record.score_id, session_id))
logger.debug(
f"Launched background scoring task for score {score_record.score_id}"
)
⚠️ Potential issue | 🟠 Major

Background task created without reference may be garbage collected.

`asyncio.create_task()` returns a task object that should be stored to prevent potential garbage collection and to properly handle exceptions. If the task is GC'd before completion, it may be silently cancelled.

Store task reference for proper lifecycle management
+        # Store reference to prevent GC and enable exception handling
+        task = asyncio.create_task(self._execute_scoring(score_record.score_id, session_id))
+        task.add_done_callback(lambda t: t.exception() if not t.cancelled() else None)
-        asyncio.create_task(self._execute_scoring(score_record.score_id, session_id))
         logger.debug(
             f"Launched background scoring task for score {score_record.score_id}"
         )

Alternatively, consider using a task tracking mechanism if you need to monitor or cancel these tasks.

🤖 Prompt for AI Agents
In `@backend/tarsy/services/scoring_service.py` around lines 567 - 571, The
background task launched with
asyncio.create_task(self._execute_scoring(score_record.score_id, session_id)) is
not stored and may be garbage-collected or have unobserved exceptions; modify
the ScoringService (or the class containing _execute_scoring) to keep a
reference to the created Task (e.g., append the returned task to a tasks
set/list or a tracking dict keyed by score_id/session_id) and ensure you remove
completed tasks (or await/monitor them) to handle exceptions and lifecycle;
update the code that calls asyncio.create_task to assign its return value to
that tracker so tasks aren't silently cancelled.
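One way to implement the suggested tracking — a set of strong references plus a done-callback that surfaces exceptions and drops finished tasks — is sketched below. The class and method names are hypothetical:

```python
import asyncio


class ScoringTaskTracker:
    """Keep strong references to background scoring tasks.

    The event loop holds only weak references to tasks, so an untracked
    task can be garbage-collected mid-flight; the done-callback both
    retrieves exceptions (avoiding "exception was never retrieved"
    warnings) and removes the finished task from the set.
    """

    def __init__(self) -> None:
        self._tasks: set[asyncio.Task] = set()

    def launch(self, coro) -> asyncio.Task:
        task = asyncio.create_task(coro)
        self._tasks.add(task)
        task.add_done_callback(self._on_done)
        return task

    def _on_done(self, task: asyncio.Task) -> None:
        self._tasks.discard(task)
        if not task.cancelled() and task.exception() is not None:
            # Log rather than re-raise; a background failure must not
            # crash the event loop.
            print(f"Scoring task failed: {task.exception()!r}")


async def main() -> int:
    tracker = ScoringTaskTracker()

    async def score() -> int:
        await asyncio.sleep(0)  # stand-in for the real scoring work
        return 73

    task = tracker.launch(score())
    return await task


result = asyncio.run(main())
```

The service would call `tracker.launch(self._execute_scoring(score_id, session_id))` instead of a bare `asyncio.create_task(...)`, gaining both GC safety and a single place to observe failures.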

Comment on lines +94 to +151
async def mock_generate_score(*args, **kwargs):
conversation = kwargs.get("conversation")
# Add assistant message with score
conversation.append_assistant_message(
"""## Evaluation

**Logical Flow: 18/25**
The investigation followed a reasonable pattern with good use of tools.

**Consistency: 20/25**
Conclusions are well-supported by evidence gathered.

**Tool Relevance: 17/25**
Good tool selection, though some opportunities were missed.

**Synthesis Quality: 18/25**
Final analysis is comprehensive and acknowledges limitations.

73"""
)
return conversation

# Mock missing tools response (Turn 2)
turn_count = {"count": 0}

async def mock_generate(*args, **kwargs):
conversation = kwargs.get("conversation")
turn_count["count"] += 1

if turn_count["count"] == 1:
# Turn 1: Score response
conversation.append_assistant_message(
"""## Evaluation

**Logical Flow: 18/25**
The investigation followed a reasonable pattern.

**Consistency: 20/25**
Conclusions are well-supported.

**Tool Relevance: 17/25**
Good tool selection overall.

**Synthesis Quality: 18/25**
Comprehensive final analysis.

73"""
)
else:
# Turn 2: Missing tools analysis
conversation.append_assistant_message(
"""1. **kubectl-describe**: Would have provided more detailed pod information.

2. **read-file**: Could have inspected configuration files for root cause analysis."""
)

return conversation

⚠️ Potential issue | 🟠 Major

Rename unused `*args` parameters to avoid Ruff ARG001.

Lines 94, 119, and 338 define `*args` but never use it. Ruff flags this; rename to `*_args` (or drop the parameter) to keep lint clean.

Fix
-        async def mock_generate_score(*args, **kwargs):
+        async def mock_generate_score(*_args, **kwargs):
...
-        async def mock_generate(*args, **kwargs):
+        async def mock_generate(*_args, **kwargs):
...
-        async def mock_generate_bad_response(*args, **kwargs):
+        async def mock_generate_bad_response(*_args, **kwargs):

Also applies to: 338-343

🧰 Tools
🪛 Ruff (0.14.14)

94-94: Unused function argument: args

(ARG001)


119-119: Unused function argument: args

(ARG001)

🤖 Prompt for AI Agents
In `@backend/tests/integration/test_scoring_integration.py` around lines 94 - 151,
In the async test helpers mock_generate_score and mock_generate (and the other
similar test helper around lines referenced), rename the unused variadic
parameter *args to _args (or remove it entirely) so Ruff ARG001 is not raised;
update function signatures where *args is declared (e.g., async def
mock_generate_score(*args, **kwargs) and async def mock_generate(*args,
**kwargs)) to use async def mock_generate_score(_args, **kwargs) or async def
mock_generate(_args, **kwargs) (or drop the variadic parameter) while leaving
usage of kwargs and the rest of the function body unchanged.

Comment on lines +285 to +291
with patch.object(
scoring_service, "_get_llm_client", return_value=mock_llm_client
):
with pytest.raises(
ValueError,
match="Cannot force rescore while an already existing scoring is pending",
):
⚠️ Potential issue | 🟡 Minor

Collapse nested with blocks for readability (SIM117).

This nested `with` can be collapsed into a single statement with multiple context managers, which also satisfies Ruff's SIM117.

Refactor
-        with patch.object(
-            scoring_service, "_get_llm_client", return_value=mock_llm_client
-        ):
-            with pytest.raises(
-                ValueError,
-                match="Cannot force rescore while an already existing scoring is pending",
-            ):
-                await scoring_service.initiate_scoring(
-                    session_id=session_id, triggered_by="test-user", force_rescore=True
-                )
+        with (
+            patch.object(scoring_service, "_get_llm_client", return_value=mock_llm_client),
+            pytest.raises(
+                ValueError,
+                match="Cannot force rescore while an already existing scoring is pending",
+            ),
+        ):
+            await scoring_service.initiate_scoring(
+                session_id=session_id, triggered_by="test-user", force_rescore=True
+            )
🧰 Tools
🪛 Ruff (0.14.14)

285-291: Use a single with statement with multiple contexts instead of nested with statements

Combine with statements

(SIM117)

🤖 Prompt for AI Agents
In `@backend/tests/integration/test_scoring_integration.py` around lines 285 -
291, The nested context managers around the test can be collapsed into a single
combined `with` statement to improve readability and satisfy SIM117; replace the
two nested `with` blocks that use `patch.object(scoring_service,
"_get_llm_client", return_value=mock_llm_client)` and `pytest.raises(ValueError,
match=...)` with a single `with` that joins both context managers (e.g., `with
patch.object(...), pytest.raises(...):`), keeping the same `mock_llm_client`,
the same `scoring_service._get_llm_client` patch target, and the same exception
match string.

Comment on lines +1 to +47
"""Tests for judge prompts and hashing."""

from tarsy.agents.prompts.judges import (
JUDGE_PROMPT_SCORE,
JUDGE_PROMPT_FOLLOWUP_MISSING_TOOLS,
get_current_prompt_hash,
CURRENT_PROMPT_HASH,
)


class TestJudgePrompts:
"""Test judge prompt constants."""

def test_judge_prompt_score_contains_placeholders(self):
"""Test that JUDGE_PROMPT_SCORE contains required placeholders."""
assert "{{ALERT_DATA}}" in JUDGE_PROMPT_SCORE
assert "{{FINAL_ANALYSIS}}" in JUDGE_PROMPT_SCORE
assert "{{LLM_CONVERSATION}}" in JUDGE_PROMPT_SCORE
assert "{{CHAT_CONVERSATION}}" in JUDGE_PROMPT_SCORE
assert "{{OUTPUT_SCHEMA}}" in JUDGE_PROMPT_SCORE

def test_judge_prompt_followup_contains_content(self):
"""Test that JUDGE_PROMPT_FOLLOWUP_MISSING_TOOLS is non-empty."""
assert len(JUDGE_PROMPT_FOLLOWUP_MISSING_TOOLS) > 0
assert "missing tool" in JUDGE_PROMPT_FOLLOWUP_MISSING_TOOLS.lower()


class TestPromptHashing:
"""Test SHA256 hashing logic for prompt versioning."""

def test_hash_determinism(self):
"""Test that hashing produces consistent results."""
hash1 = get_current_prompt_hash()
hash2 = get_current_prompt_hash()

assert hash1 == hash2
assert len(hash1) == 64 # SHA256 hex digest length

def test_hash_matches_module_constant(self):
"""Test that module-level CURRENT_PROMPT_HASH matches function result."""
computed_hash = get_current_prompt_hash()
assert CURRENT_PROMPT_HASH == computed_hash

def test_hash_is_hex_string(self):
"""Test that hash is a valid hexadecimal string."""
assert all(c in "0123456789abcdef" for c in CURRENT_PROMPT_HASH)
assert len(CURRENT_PROMPT_HASH) == 64
⚠️ Potential issue | 🟠 Major

Add pytest unit markers + return type hints (and apply Ruff SIM300).

This keeps tests aligned with project conventions and addresses the reported yoda-condition warning.

✅ Suggested updates
+import pytest
+
 
+@pytest.mark.unit
 class TestJudgePrompts:
     """Test judge prompt constants."""
 
-    def test_judge_prompt_score_contains_placeholders(self):
+    def test_judge_prompt_score_contains_placeholders(self) -> None:
         """Test that JUDGE_PROMPT_SCORE contains required placeholders."""
         assert "{{ALERT_DATA}}" in JUDGE_PROMPT_SCORE
         assert "{{FINAL_ANALYSIS}}" in JUDGE_PROMPT_SCORE
         assert "{{LLM_CONVERSATION}}" in JUDGE_PROMPT_SCORE
         assert "{{CHAT_CONVERSATION}}" in JUDGE_PROMPT_SCORE
         assert "{{OUTPUT_SCHEMA}}" in JUDGE_PROMPT_SCORE
 
-    def test_judge_prompt_followup_contains_content(self):
+    def test_judge_prompt_followup_contains_content(self) -> None:
         """Test that JUDGE_PROMPT_FOLLOWUP_MISSING_TOOLS is non-empty."""
         assert len(JUDGE_PROMPT_FOLLOWUP_MISSING_TOOLS) > 0
         assert "missing tool" in JUDGE_PROMPT_FOLLOWUP_MISSING_TOOLS.lower()
 
 
+@pytest.mark.unit
 class TestPromptHashing:
     """Test SHA256 hashing logic for prompt versioning."""
 
-    def test_hash_determinism(self):
+    def test_hash_determinism(self) -> None:
         """Test that hashing produces consistent results."""
         hash1 = get_current_prompt_hash()
         hash2 = get_current_prompt_hash()
 
         assert hash1 == hash2
         assert len(hash1) == 64  # SHA256 hex digest length
 
-    def test_hash_matches_module_constant(self):
+    def test_hash_matches_module_constant(self) -> None:
         """Test that module-level CURRENT_PROMPT_HASH matches function result."""
         computed_hash = get_current_prompt_hash()
-        assert CURRENT_PROMPT_HASH == computed_hash
+        assert computed_hash == CURRENT_PROMPT_HASH
 
-    def test_hash_is_hex_string(self):
+    def test_hash_is_hex_string(self) -> None:
         """Test that hash is a valid hexadecimal string."""
         assert all(c in "0123456789abcdef" for c in CURRENT_PROMPT_HASH)
         assert len(CURRENT_PROMPT_HASH) == 64
As per coding guidelines: "Use test markers appropriately: `@pytest.mark.unit` for unit tests..." and "Include type hints in test functions following project standards."
🧰 Tools
🪛 Ruff (0.14.14)

42-42: Yoda condition detected

Rewrite as computed_hash == CURRENT_PROMPT_HASH

(SIM300)

🤖 Prompt for AI Agents
In `@backend/tests/unit/agents/test_judges.py` around lines 1 - 47, Add pytest
markers and return type hints: import pytest at top and decorate each test class
or each test function with `@pytest.mark.unit`, then add explicit return type
hints (-> None) to every test method (e.g.,
test_judge_prompt_score_contains_placeholders(self) -> None). Also run Ruff
SIM300 and fix any yoda-condition warnings by ensuring comparisons place the
variable on the left side (e.g., keep len(CURRENT_PROMPT_HASH) == 64 rather than
64 == len(...)) and adjust any expressions flagged by SIM300 accordingly;
reference the test class and function names (TestJudgePrompts, TestPromptHashing
and all test_* methods) to locate where to apply these changes.

Comment on lines +11 to +28
class TestScoringStatusEnum:
"""Test ScoringStatus enum."""

def test_active_values(self):
"""Test active_values() returns pending and in_progress."""
active = ScoringStatus.active_values()
assert "pending" in active
assert "in_progress" in active
assert len(active) == 2

def test_values(self):
"""Test values() returns all status strings."""
all_values = ScoringStatus.values()
assert len(all_values) == 4
assert "pending" in all_values
assert "in_progress" in all_values
assert "completed" in all_values
assert "failed" in all_values
⚠️ Potential issue | 🟡 Minor

Add unit markers and return type hints for tests.

Line 11 and the other test classes should be marked as unit tests, and test functions should include `-> None` return type hints to meet project test conventions. This keeps markers consistent and helps mypy/linters.

Suggested pattern
+@pytest.mark.unit
 class TestScoringStatusEnum:
     """Test ScoringStatus enum."""

-    def test_active_values(self):
+    def test_active_values(self) -> None:
         """Test active_values() returns pending and in_progress."""

Apply the same pattern to the other classes and test functions in this file.
As per coding guidelines: “Use test markers appropriately: @pytest.mark.unit …” and “Include type hints in test functions following project standards.”

Also applies to: 31-177

🤖 Prompt for AI Agents
In `@backend/tests/unit/models/test_session_score_models.py` around lines 11 - 28,
Add the pytest unit marker and return type hints to the test classes and
functions in this file: decorate TestScoringStatusEnum (and the other test
classes in the file) with `@pytest.mark.unit`, and add explicit return type hints
(-> None) to the test methods such as test_active_values and test_values (and
all other test_* functions in the file) so each test function signature follows
the project convention and linters/mypy expectations.

Comment on lines +2 to +58
import { Chip, Tooltip, CircularProgress } from '@mui/material';
import { CheckCircle, Error, Warning, HourglassEmpty } from '@mui/icons-material';
import type { ScoringStatus } from '../types';

interface ScoreBadgeProps {
score: number | null;
status?: ScoringStatus;
onClick?: (event: React.MouseEvent) => void;
size?: 'small' | 'medium';
label?: string; // Optional label to show after the score (e.g., 'score', 'points')
}

/**
* ScoreBadge component displays session quality score with 3-tier color coding
*
* Color scheme (aligned with judge prompt scoring philosophy):
* - 0-49: Red (failed investigation)
* - 50-74: Yellow (weak investigation)
* - 75-100: Green (good investigation)
*
* Also handles pending/in_progress/failed states
*/
const ScoreBadge: React.FC<ScoreBadgeProps> = ({
score,
status,
onClick,
size = 'small',
label: customLabel,
}) => {
// Handle different scoring states
if (status === 'pending' || status === 'in_progress') {
return (
<Tooltip title={status === 'pending' ? 'Score pending' : 'Scoring in progress'} arrow>
<Chip
icon={status === 'in_progress' ? <CircularProgress size={14} /> : <HourglassEmpty />}
label={status === 'in_progress' ? 'Scoring...' : 'Pending'}
size={size}
variant="outlined"
color="default"
onClick={onClick}
sx={{ cursor: onClick ? 'pointer' : 'default' }}
/>
</Tooltip>
);
}

if (status === 'failed') {
return (
<Tooltip title="Scoring failed" arrow>
<Chip
icon={<Error />}
label="Error"
size={size}
color="error"
variant="outlined"
onClick={onClick}
sx={{ cursor: onClick ? 'pointer' : 'default' }}
⚠️ Potential issue | 🟡 Minor



Avoid shadowing the global Error identifier.

Line 3 imports `Error` from `@mui/icons-material`, which shadows the global `Error` and triggers the linter. Alias it and update both usages.

Fix
-import { CheckCircle, Error, Warning, HourglassEmpty } from '@mui/icons-material';
+import { CheckCircle, Error as ErrorIcon, Warning, HourglassEmpty } from '@mui/icons-material';

Update usages on lines 52 and 88:

-          icon={<Error />}
+          icon={<ErrorIcon />}

and

-    icon = <Error />;
+    icon = <ErrorIcon />;
🧰 Tools
🪛 Biome (2.1.2)

[error] 3-3: Do not shadow the global "Error" property.

Consider renaming this variable. It's easy to confuse the origin of variables when they're named after a known global.

(lint/suspicious/noShadowRestrictedNames)

🤖 Prompt for AI Agents
In `@dashboard/src/components/ScoreBadge.tsx` around lines 2 - 58, The import of
Error from '@mui/icons-material' shadows the global Error; rename the imported
icon (e.g., Error as ErrorIcon) and update all usages in the ScoreBadge
component where the icon prop uses <Error /> (the Chip for the failed state and
any other Chip/Icon usage) to use the new alias (<ErrorIcon />) and any
corresponding JSX references so the global Error identifier is no longer
shadowed.

@metlos
Collaborator Author

metlos commented Jan 28, 2026

I need to close this because it doesn't stack on top of the previous PRs 🤦🏼

@metlos metlos closed this Jan 28, 2026
@metlos metlos mentioned this pull request Jan 28, 2026