Conversation
Walkthrough

This PR implements a comprehensive session scoring system for alert investigations. It adds database schema for storing scores, a scoring service with LLM-based evaluation, REST API endpoints, and frontend UI components to display and trigger scoring operations, with support for multi-turn LLM interactions and deterministic prompt versioning.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client as Client/Dashboard
    participant API as Scoring API
    participant Service as ScoringService
    participant DB as Database
    participant LLM as LLM Manager
    participant Background as Background Task
    Client->>API: POST /sessions/{id}/score
    API->>API: Extract user identity
    API->>Service: initiate_scoring(session_id, triggered_by, force_rescore)
    Service->>DB: Get session + check completion
    Service->>DB: Check active scores
    Service->>DB: Create score (status=PENDING)
    Service->>API: Return SessionScoreResponse (202 Accepted)
    API-->>Client: 202 + pending score
    Service->>Background: Launch async _execute_scoring task
    Background->>DB: Update status to IN_PROGRESS
    Background->>DB: Fetch session history & final analysis
    Background->>LLM: Send Turn 1: Score prompt
    LLM-->>Background: Response with score + analysis
    Background->>Background: Extract score from response
    Background->>LLM: Send Turn 2: Missing tools prompt
    LLM-->>Background: Response with missing tools analysis
    Background->>DB: Update score with results (status=COMPLETED)
    Client->>API: GET /sessions/{id}/score
    API->>Service: _get_latest_score(session_id)
    Service->>DB: Fetch latest score
    Service-->>API: SessionScore
    API-->>Client: 200 + SessionScoreResponse
```
```mermaid
sequenceDiagram
    participant UI as UI/SessionHeader
    participant API as APIClient
    participant Controller as Controller
    participant Service as ScoringService
    participant Dialog as Rescore Dialog
    UI->>UI: User clicks "Score Session"
    UI->>API: scoreSession(sessionId)
    API->>Controller: POST /sessions/{id}/score
    Controller->>Service: initiate_scoring()
    alt Score already exists and not force
        Service-->>Controller: Return existing score
        Controller-->>API: 200 OK (existing)
        API-->>UI: Existing score response
        UI->>Dialog: Show rescore confirmation?
    else Force rescore
        Service-->>Controller: Return new pending score
        Controller-->>API: 202 Accepted (new)
        API-->>UI: New pending score
        UI->>UI: Show loading state
    end
    UI->>UI: Poll GET /sessions/{id}/score (background)
    UI->>API: getSessionScore(sessionId)
    API-->>UI: Updated SessionScore
    UI->>UI: Render ScoreDetailView when completed
```
Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 13
🤖 Fix all issues with AI agents
In `@backend/tarsy/agents/prompts/judges.py`:
- Around line 320-335: get_current_prompt_hash currently only hashes
JUDGE_PROMPT_SCORE and JUDGE_PROMPT_FOLLOWUP_MISSING_TOOLS, so changes to other
prompt parts (system/reminder) won't update the version; update the function to
concatenate all prompt strings used in judge requests (include the system prompt
and reminder prompt constants used in this module — e.g., JUDGE_PROMPT_SYSTEM
and JUDGE_PROMPT_REMINDER or whatever names are defined here — along with
JUDGE_PROMPT_SCORE and JUDGE_PROMPT_FOLLOWUP_MISSING_TOOLS) before computing the
SHA256, or if you intentionally want a subset, update the function docstring to
explicitly list which prompts are included; ensure you reference
get_current_prompt_hash and the prompt constant names when making the change.
In `@backend/tarsy/controllers/scoring_controller.py`:
- Around line 152-165: The except handlers in scoring_controller.py currently
include str(e) in HTTPException details (see the RuntimeError except block and
the generic except Exception block around initiating score for session_id),
which can leak internal info; change both HTTPException detail strings to
generic messages like "Internal server error" or "Scoring service error" without
including str(e), while retaining the existing logger.error/logger.exception
calls to record the full error internally (use session_id in log context as
already done).
In `@backend/tarsy/repositories/history_repository.py`:
- Around line 658-690: The import and usage of the non-existent model
SessionScoreDB causes the except ImportError to swallow the failure and leave
score_data empty; update the import and all references to use the correct model
name SessionScore (replace SessionScoreDB -> SessionScore) in the score-fetching
block (the select/subquery and score_query that reference SessionScoreDB) and in
the other occurrence noted in the file so the query uses the actual model class
SessionScore and no longer silently fails.
In `@backend/tarsy/services/scoring_service.py`:
- Around line 648-651: Remove the debugging marker from the logger.debug call
that prints the conversation: locate the logger.debug(f"!!!!! Full conversation
after turn 1:\n{conversation}") line (the logger.debug invocation referencing
conversation) and either delete it or replace it with a cleaned-up message
without the "!!!!!" artifact (e.g., "Full conversation after turn 1:" +
conversation) so the log entry is production-ready.
- Around line 208-239: In _extract_score_from_response, handle empty/whitespace
responses and restrict parsed scores to 0–100: first strip the response and if
empty return (None, response) (avoid indexing lines[-1]); otherwise find the
last non-empty line (not just lines[-1]) and attempt to parse it using a
stricter pattern that only accepts integers 0–100 (e.g., match 100 or 0–99) or
parse then validate range; if the match fails or the numeric value is outside
0–100, log an error including the problematic last line and return (None,
response); otherwise convert to int, build score_analysis from all lines before
that last non-empty line, and return (total_score, score_analysis).
- Around line 567-571: The background task launched with
asyncio.create_task(self._execute_scoring(score_record.score_id, session_id)) is
not stored and may be garbage-collected or have unobserved exceptions; modify
the ScoringService (or the class containing _execute_scoring) to keep a
reference to the created Task (e.g., append the returned task to a tasks
set/list or a tracking dict keyed by score_id/session_id) and ensure you remove
completed tasks (or await/monitor them) to handle exceptions and lifecycle;
update the code that calls asyncio.create_task to assign its return value to
that tracker so tasks aren't silently cancelled.
In `@backend/tests/integration/test_scoring_integration.py`:
- Around line 94-151: In the async test helpers mock_generate_score and
mock_generate (and the other similar test helper around lines referenced),
rename the unused variadic parameter *args to _args (or remove it entirely) so
Ruff ARG001 is not raised; update function signatures where *args is declared
(e.g., async def mock_generate_score(*args, **kwargs) and async def
mock_generate(*args, **kwargs)) to use async def mock_generate_score(_args,
**kwargs) or async def mock_generate(_args, **kwargs) (or drop the variadic
parameter) while leaving usage of kwargs and the rest of the function body
unchanged.
- Around line 34-366: The tests and pytest fixtures lack return type hints; add
appropriate annotations for each fixture and test: annotate fixtures like
in_memory_engine -> Engine, db_manager -> DatabaseManager, history_service ->
HistoryService, scoring_service -> ScoringService, mock_llm_client -> Mock,
completed_session -> AlertSession (import types from sqlmodel/sqlalchemy/typing
or project types as needed), and annotate async test functions
(test_end_to_end_score_completed_session,
test_multiple_scoring_attempts_same_session, test_concurrent_scoring_prevention,
test_scoring_failure_updates_status, test_score_extraction_failure) with ->
None; ensure any generator fixtures use typing.Generator or Iterator when
applicable and add necessary imports for the type names referenced.
- Around line 48-60: Remove the duplicate import of Session inside the
db_manager function to satisfy Ruff F811: delete the line "from sqlmodel import
Session" in db_manager and rely on the existing import at the top of the module;
ensure db_manager still sets manager.session_factory using
sessionmaker(bind=in_memory_engine, class_=Session, expire_on_commit=False) and
references the DatabaseManager, in_memory_engine, and sessionmaker symbols
unchanged.
- Around line 285-291: The nested context managers around the test can be
collapsed into a single combined `with` statement to improve readability and
satisfy SIM117; replace the two nested `with` blocks that use
`patch.object(scoring_service, "_get_llm_client", return_value=mock_llm_client)`
and `pytest.raises(ValueError, match=...)` with a single `with` that joins both
context managers (e.g., `with patch.object(...), pytest.raises(...):`), keeping
the same `mock_llm_client`, the same `scoring_service._get_llm_client` patch
target, and the same exception match string.
In `@backend/tests/unit/agents/test_judges.py`:
- Around line 1-47: Add pytest markers and return type hints: import pytest at
top and decorate each test class or each test function with `@pytest.mark.unit`,
then add explicit return type hints (-> None) to every test method (e.g.,
test_judge_prompt_score_contains_placeholders(self) -> None). Also run Ruff
SIM300 and fix any yoda-condition warnings by ensuring comparisons place the
variable on the left side (e.g., keep len(CURRENT_PROMPT_HASH) == 64 rather than
64 == len(...)) and adjust any expressions flagged by SIM300 accordingly;
reference the test class and function names (TestJudgePrompts, TestPromptHashing
and all test_* methods) to locate where to apply these changes.
In `@backend/tests/unit/models/test_session_score_models.py`:
- Around line 11-28: Add the pytest unit marker and return type hints to the
test classes and functions in this file: decorate TestScoringStatusEnum (and the
other test classes in the file) with `@pytest.mark.unit`, and add explicit return
type hints (-> None) to the test methods such as test_active_values and
test_values (and all other test_* functions in the file) so each test function
signature follows the project convention and linters/mypy expectations.
In `@dashboard/src/components/ScoreBadge.tsx`:
- Around line 2-58: The import of Error from '@mui/icons-material' shadows the
global Error; rename the imported icon (e.g., Error as ErrorIcon) and update all
usages in the ScoreBadge component where the icon prop uses <Error /> (the Chip
for the failed state and any other Chip/Icon usage) to use the new alias
(<ErrorIcon />) and any corresponding JSX references so the global Error
identifier is no longer shadowed.
🧹 Nitpick comments (15)
backend/tarsy/models/history_models.py (1)
229-231: Consider constraining `score_status` to known values for validation. Using a `Literal` (or an enum if you already have one) avoids invalid statuses slipping into API responses.

♻️ Proposed refactor

```diff
-    score_status: Optional[str] = None  # pending|in_progress|completed|failed
+    score_status: Optional[Literal["pending", "in_progress", "completed", "failed"]] = None
```

Also applies to: 437-438
backend/tests/unit/repositories/test_session_score_repository.py (2)
1-11: Add `@pytest.mark.unit` marker to the test class. Per coding guidelines, unit tests should be marked with `@pytest.mark.unit`. Also, the `IntegrityError` import on line 4 appears unused in this test file since the repository handles it internally and raises `ValueError` instead.

Suggested fix

```diff
 import pytest
-from sqlalchemy.exc import IntegrityError
 from sqlmodel import Session, SQLModel, create_engine
 from tarsy.models.constants import ScoringStatus
 from tarsy.models.db_models import SessionScore, AlertSession
 from tarsy.repositories.session_score_repository import SessionScoreRepository
 from tarsy.utils.timestamp import now_us

+@pytest.mark.unit
 class TestSessionScoreRepository:
```

As per coding guidelines: "Use test markers appropriately: `@pytest.mark.unit` for unit tests".
134-159: Consider adding an explicit time ordering guarantee. The test relies on `now_us() - 1000000` to create an "earlier" timestamp, but if the test runs extremely fast, there's a theoretical (though unlikely) risk of timing issues. The test logic is correct, but you could make the ordering more explicit.

Optional: More explicit timestamp ordering

```diff
     def test_get_latest_score_for_session(self, repository, sample_alert_session):
         """Test getting most recent score."""
+        base_time = now_us()
         # Create two scores
         score1 = SessionScore(
             session_id=sample_alert_session.session_id,
             prompt_hash="abc123",
             score_triggered_by="user:test",
             status=ScoringStatus.COMPLETED.value,
-            scored_at_us=now_us() - 1000000,  # Earlier
+            scored_at_us=base_time - 1000000,  # Earlier
         )
         score2 = SessionScore(
             session_id=sample_alert_session.session_id,
             prompt_hash="def456",
             score_triggered_by="user:test",
             status=ScoringStatus.PENDING.value,
-            scored_at_us=now_us(),  # Later
+            scored_at_us=base_time,  # Later
         )
```

dashboard/src/components/SessionDetailWrapper.tsx (1)
22-27: Ternary chain is becoming complex; consider refactoring for readability. The nested ternary for `initialView` detection is functional but harder to read at a glance. Consider extracting to a helper function for clarity.

Optional: Extract to helper function

```diff
+  // Determine initial view from URL path
+  const getInitialView = (): 'conversation' | 'technical' | 'score' => {
+    if (location.pathname.includes('/technical')) return 'technical';
+    if (location.pathname.includes('/score')) return 'score';
+    return 'conversation';
+  };
+
   // Determine initial view from URL
-  const initialView = location.pathname.includes('/technical')
-    ? 'technical'
-    : location.pathname.includes('/score')
-      ? 'score'
-      : 'conversation';
-  const [currentView, setCurrentView] = useState<'conversation' | 'technical' | 'score'>(initialView);
+  const [currentView, setCurrentView] = useState<'conversation' | 'technical' | 'score'>(getInitialView);
```

backend/tarsy/models/db_models.py (1)
390-394: Consider adding an index on `scored_at_us` for time-based queries. The `scored_at_us` field uses `BIGINT` but doesn't have an index. If you anticipate queries filtering or ordering by scoring initiation time (e.g., for cleanup or analytics), consider adding an index.

backend/tests/unit/controllers/test_scoring_controller.py (1)
38-44: Tests mock a private method `_get_latest_score`, which couples tests to implementation details. The mock targets `_get_latest_score` (an underscore-prefixed private method). This creates tight coupling between tests and implementation: if the service refactors this internal method, tests will break even if the public contract remains unchanged. Consider whether the `ScoringService` should expose a public method for retrieving scores, or mock at a lower level (repository).

backend/tarsy/controllers/scoring_controller.py (2)
11-11: Unused import: `Body` from fastapi. The `Body` import is not used in this file.

🧹 Remove unused import

```diff
-from fastapi import APIRouter, Depends, HTTPException, Path, Request, Response, Body
+from fastapi import APIRouter, Depends, HTTPException, Path, Request, Response
```
213-220: Controller calls private method `_get_latest_score` directly. The GET endpoint calls `scoring_service._get_latest_score()`, which is a private method (underscore prefix). This breaks encapsulation and makes the controller dependent on implementation details. Consider exposing a public method like `get_latest_score()` or `get_score_for_session()` in the service layer.

♻️ Suggested refactor in scoring_service.py

Add a public method in `ScoringService`:

```python
async def get_latest_score(self, session_id: str) -> Optional[SessionScore]:
    """
    Get the latest score for a session (public API).

    Args:
        session_id: Session identifier

    Returns:
        Latest SessionScore or None if not found
    """
    return await self._get_latest_score(session_id)
```

Then update the controller to use the public method.
dashboard/src/components/SessionHeader.tsx (1)
573-583: Magic string `'completed'` should use a constant. The status check uses a string literal `'completed'` instead of referencing a shared constant. This could lead to bugs if the status value changes.

♻️ Use ScoringStatus constant

Import and use the `ScoringStatus` type or create a constant:

```diff
+import { SCORING_STATUS } from '../utils/statusConstants';

 // EP-0028: Handle score session button click
 const handleScoreSession = () => {
   // Check if session already has a completed score
-  if (session.score_status === 'completed') {
+  if (session.score_status === SCORING_STATUS.COMPLETED) {
     // Show confirmation dialog for rescoring
     setShowRescoreDialog(true);
   } else {
```

backend/tarsy/models/api_models.py (1)
14-14: Import of `CURRENT_PROMPT_HASH` creates coupling between API and agent modules. Importing from `tarsy.agents.prompts.judges` in the API models creates a dependency from the API layer on the agent/prompt layer. This could be moved to a shared constants module if the coupling becomes problematic.

backend/tests/unit/services/test_scoring_service.py (2)
307-319: Unused fixture parameter `mock_chat_conversation_history`. The `mock_chat_conversation_history` parameter in `test_build_score_prompt_with_chat_conversation_present` is declared, but the test actually uses `mock_final_analysis_response`, which already contains the chat conversation. This is a minor issue in test code, and the fixture dependency is technically correct (it ensures the data is available), but the explicit parameter suggests it might have been intended to be used directly.
571-598: Consider combining nested `with` statements for readability. The nested `with` statements can be combined into a single statement, as suggested by static analysis. However, the current structure is clear, and the nesting shows the mock dependency hierarchy.

♻️ Combined with statement (optional)

```diff
-        with patch.object(
-            scoring_service.history_service,
-            "get_session",
-            return_value=mock_session_completed,
-        ):
-            with patch.object(scoring_service, "_get_latest_score", return_value=None):
-                with patch.object(
-                    scoring_service,
-                    "_create_score_record",
-                    return_value=mock_score_record_pending,
-                ):
-                    with patch("asyncio.create_task") as mock_create_task:
+        with (
+            patch.object(
+                scoring_service.history_service,
+                "get_session",
+                return_value=mock_session_completed,
+            ),
+            patch.object(scoring_service, "_get_latest_score", return_value=None),
+            patch.object(
+                scoring_service,
+                "_create_score_record",
+                return_value=mock_score_record_pending,
+            ),
+            patch("asyncio.create_task") as mock_create_task,
+        ):
```

backend/tarsy/repositories/session_score_repository.py (1)
99-136: Truthy check for `completed_at_us` could skip a valid zero value. Line 121 uses `if completed_at_us:`, which would skip the assignment if the value is `0`. While `0` microseconds since epoch (1970-01-01 00:00:00.000000) is practically never valid, using `is not None` would be more defensive and consistent with the `total_score` check on line 127.

Suggested fix for consistency

```diff
-        if completed_at_us:
+        if completed_at_us is not None:
             score.completed_at_us = completed_at_us
```

backend/tarsy/services/scoring_service.py (2)
97-115: Repository context manager silently yields `None` on error. The `_get_repository` method yields `None` on exceptions instead of propagating them. While callers do check for `None`, this pattern can mask underlying database errors and make debugging harder. Consider logging more details or re-raising after logging to help diagnose issues:

```diff
         except Exception as e:
             logger.error(f"Failed to get scoring repository: {str(e)}")
-            yield None
+            raise  # Let caller handle the exception
```

Alternatively, if the silent failure is intentional for resilience, document this behavior in the docstring.
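To make the failure mode concrete, here is a small runnable sketch of the pattern being flagged; `get_repository` and its `ConnectionError` are illustrative stand-ins, not the service's actual code:

```python
from contextlib import contextmanager

@contextmanager
def get_repository(fail: bool):
    """Sketch of a context manager that swallows errors by yielding None."""
    try:
        if fail:
            raise ConnectionError("db down")
        yield "repo"
    except ConnectionError:
        # Swallowing the error and yielding None forces every caller to
        # remember the None check; the root cause never propagates.
        yield None

# Caller must defensively check for None instead of catching a real exception.
with get_repository(fail=True) as repo:
    result = "skipped" if repo is None else "used"

with get_repository(fail=False) as repo:
    assert repo == "repo"
```

Re-raising instead of yielding `None` would let callers see the original `ConnectionError` (and its traceback) rather than a generic "repository unavailable" symptom.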
142-200: Inline import should be at module top level. Line 161 has `import json` inside the method. Per PEP 8 and Ruff conventions, imports should be at the top of the file.

Move import to top of file

```diff
 import asyncio
+import json
 import random
```

And remove line 161.
```python
def get_current_prompt_hash() -> str:
    """
    Compute SHA256 hash of both judge prompts concatenated.

    This hash provides deterministic criteria versioning - when prompts change,
    the hash changes, allowing detection of scores using obsolete criteria.

    Returns:
        Hex string of SHA256 hash (64 characters)
    """
    # Concatenate both prompts in the order they're used
    combined_prompts = JUDGE_PROMPT_SCORE + JUDGE_PROMPT_FOLLOWUP_MISSING_TOOLS

    # Compute SHA256 hash
    hash_obj = hashlib.sha256(combined_prompts.encode("utf-8"))
```
Include all prompts in the version hash (system + reminder).
CURRENT_PROMPT_HASH won’t change if the system prompt or the reminder prompt changes, which breaks prompt-versioning guarantees. If those prompts are part of the judge request, they should be included in the hash (or the docstring/usage updated to make this explicit).
🐛 Proposed fix

```diff
-    # Concatenate both prompts in the order they're used
-    combined_prompts = JUDGE_PROMPT_SCORE + JUDGE_PROMPT_FOLLOWUP_MISSING_TOOLS
+    # Concatenate all prompts in the order they're used
+    combined_prompts = (
+        JUDGE_SYSTEM_PROMPT
+        + JUDGE_PROMPT_SCORE
+        + JUDGE_PROMPT_FOLLOWUP_MISSING_TOOLS
+        + JUDGE_PROMPT_SCORE_REMINDER
+    )
```

🤖 Prompt for AI Agents
In `@backend/tarsy/agents/prompts/judges.py` around lines 320 - 335,
get_current_prompt_hash currently only hashes JUDGE_PROMPT_SCORE and
JUDGE_PROMPT_FOLLOWUP_MISSING_TOOLS, so changes to other prompt parts
(system/reminder) won't update the version; update the function to concatenate
all prompt strings used in judge requests (include the system prompt and
reminder prompt constants used in this module — e.g., JUDGE_PROMPT_SYSTEM and
JUDGE_PROMPT_REMINDER or whatever names are defined here — along with
JUDGE_PROMPT_SCORE and JUDGE_PROMPT_FOLLOWUP_MISSING_TOOLS) before computing the
SHA256, or if you intentionally want a subset, update the function docstring to
explicitly list which prompts are included; ensure you reference
get_current_prompt_hash and the prompt constant names when making the change.
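As a sanity check on the versioning scheme, here is a minimal runnable sketch of the proposed fix. The prompt constants below are placeholders for the real texts in `judges.py`:

```python
import hashlib

# Placeholder prompt texts; the real constants live in tarsy/agents/prompts/judges.py.
JUDGE_SYSTEM_PROMPT = "You are a strict judge of alert investigations."
JUDGE_PROMPT_SCORE = "Score the session from 0 to 100."
JUDGE_PROMPT_FOLLOWUP_MISSING_TOOLS = "List any tools that were missing."
JUDGE_PROMPT_SCORE_REMINDER = "Output the score on the last line."

def get_current_prompt_hash() -> str:
    """Hash every prompt used in judge requests so any edit changes the version."""
    combined_prompts = (
        JUDGE_SYSTEM_PROMPT
        + JUDGE_PROMPT_SCORE
        + JUDGE_PROMPT_FOLLOWUP_MISSING_TOOLS
        + JUDGE_PROMPT_SCORE_REMINDER
    )
    return hashlib.sha256(combined_prompts.encode("utf-8")).hexdigest()
```

Because the system and reminder prompts now feed the hash, editing any one of the four strings produces a different 64-character hex digest, which is exactly what makes stale scores detectable.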
```python
        except RuntimeError as e:
            # Database errors
            logger.error(f"Scoring service error for session {session_id}: {str(e)}")
            raise HTTPException(
                status_code=500, detail=f"Scoring service error: {str(e)}"
            ) from e
        except Exception as e:
            # Unexpected errors
            logger.exception(
                f"Unexpected error initiating score for session {session_id}: {str(e)}"
            )
            raise HTTPException(
                status_code=500, detail=f"Internal server error: {str(e)}"
            ) from e
```
Error messages may expose internal details in 500 responses.
The 500 error responses include str(e) which could leak implementation details or stack traces in production. Consider using a generic message for unexpected errors.
🛡️ Suggested fix

```diff
 except Exception as e:
     # Unexpected errors
     logger.exception(
         f"Unexpected error initiating score for session {session_id}: {str(e)}"
     )
     raise HTTPException(
-        status_code=500, detail=f"Internal server error: {str(e)}"
+        status_code=500, detail="Internal server error"
     ) from e
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
    except RuntimeError as e:
        # Database errors
        logger.error(f"Scoring service error for session {session_id}: {str(e)}")
        raise HTTPException(
            status_code=500, detail=f"Scoring service error: {str(e)}"
        ) from e
    except Exception as e:
        # Unexpected errors
        logger.exception(
            f"Unexpected error initiating score for session {session_id}: {str(e)}"
        )
        raise HTTPException(
            status_code=500, detail="Internal server error"
        ) from e
```
🤖 Prompt for AI Agents
In `@backend/tarsy/controllers/scoring_controller.py` around lines 152 - 165, The
except handlers in scoring_controller.py currently include str(e) in
HTTPException details (see the RuntimeError except block and the generic except
Exception block around initiating score for session_id), which can leak internal
info; change both HTTPException detail strings to generic messages like
"Internal server error" or "Scoring service error" without including str(e),
while retaining the existing logger.error/logger.exception calls to record the
full error internally (use session_id in log context as already done).
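The shape of this fix can be sketched without FastAPI; `HTTPError` below is a stand-in for `HTTPException`, and `initiate_score` is a hypothetical wrapper, not the controller's actual code:

```python
import logging

logger = logging.getLogger("tarsy.scoring")

class HTTPError(Exception):
    """Stand-in for FastAPI's HTTPException so the sketch runs without FastAPI."""
    def __init__(self, status_code: int, detail: str) -> None:
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail

def initiate_score(session_id: str, do_work) -> None:
    try:
        do_work()
    except RuntimeError as e:
        # Full detail is logged server-side only; the client sees a generic message.
        logger.error("Scoring service error for session %s: %s", session_id, e)
        raise HTTPError(500, "Scoring service error") from e
    except Exception as e:
        logger.exception("Unexpected error initiating score for session %s", session_id)
        raise HTTPError(500, "Internal server error") from e
```

The key property: internal strings (connection URLs, stack traces) reach the log, never the HTTP response body.

```python
def boom() -> None:
    raise RuntimeError("postgres://user:secret@db failed")

try:
    initiate_score("sess-1", boom)
    leaked = True
except HTTPError as e:
    leaked = "secret" in e.detail
```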
```python
        # Fetch latest scores for sessions (most recent scoring attempt per session)
        score_data = {}
        try:
            from tarsy.models.db_models import SessionScoreDB

            # Get latest score for each session (most recent started_at_us)
            # Use subquery to get max started_at_us per session, then join to get full record
            subquery = select(
                SessionScoreDB.session_id,
                func.max(SessionScoreDB.started_at_us).label('max_started_at')
            ).where(
                SessionScoreDB.session_id.in_(session_ids)
            ).group_by(SessionScoreDB.session_id).subquery()

            score_query = select(SessionScoreDB).join(
                subquery,
                and_(
                    SessionScoreDB.session_id == subquery.c.session_id,
                    SessionScoreDB.started_at_us == subquery.c.max_started_at
                )
            )

            score_results = self.session.exec(score_query).all()
            score_data = {
                score.session_id: {
                    'total_score': score.total_score,
                    'status': score.status
                }
                for score in score_results
            }
        except ImportError:
            # SessionScoreDB not available (shouldn't happen but defensive)
            pass
```
🧩 Analysis chain
🏁 Script executed:

```shell
# Check for SessionScoreDB and SessionScore class definitions
rg -n "class SessionScoreDB|class SessionScore\b" backend/tarsy/models/
rg -n "SessionScoreDB|SessionScore" backend/tarsy/repositories/history_repository.py
```

Repository: codeready-toolchain/tarsy-bot

Length of output: 1171
Fix `SessionScoreDB` → `SessionScore` import; the model class name is incorrect and the failure is silent.

The model at `backend/tarsy/models/db_models.py:330` is defined as `SessionScore`, not `SessionScoreDB`. Lines 661 and 985 attempt to import `SessionScoreDB`, which does not exist. The `ImportError` is silently caught, leaving `score_data` empty and causing `score_total`/`score_status` to be `None` in history views.

Replace all occurrences of `SessionScoreDB` with `SessionScore` in both locations (lines 661-690 and 985-999).
🤖 Prompt for AI Agents
In `@backend/tarsy/repositories/history_repository.py` around lines 658 - 690, The
import and usage of the non-existent model SessionScoreDB causes the except
ImportError to swallow the failure and leave score_data empty; update the import
and all references to use the correct model name SessionScore (replace
SessionScoreDB -> SessionScore) in the score-fetching block (the select/subquery
and score_query that reference SessionScoreDB) and in the other occurrence noted
in the file so the query uses the actual model class SessionScore and no longer
silently fails.
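For reference, the greatest-per-group shape of this query (once the import is fixed to `SessionScore`) can be sketched in raw SQL via sqlite3; the table and column names mirror the snippet, but the rows are made up:

```python
import sqlite3

# In-memory table standing in for the SessionScore model.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE session_score "
    "(session_id TEXT, started_at_us INTEGER, total_score INTEGER, status TEXT)"
)
conn.executemany(
    "INSERT INTO session_score VALUES (?, ?, ?, ?)",
    [
        ("s1", 100, 50, "completed"),
        ("s1", 200, 73, "completed"),  # latest attempt for s1
        ("s2", 150, None, "pending"),  # only attempt for s2
    ],
)

# Subquery picks the max started_at_us per session; the join recovers the full row,
# matching the select()/func.max() construction in the repository.
rows = conn.execute(
    """
    SELECT s.session_id, s.total_score, s.status
    FROM session_score AS s
    JOIN (
        SELECT session_id, MAX(started_at_us) AS max_started_at
        FROM session_score
        WHERE session_id IN ('s1', 's2')
        GROUP BY session_id
    ) AS latest
      ON s.session_id = latest.session_id
     AND s.started_at_us = latest.max_started_at
    """
).fetchall()

score_data = {
    sid: {"total_score": total, "status": status} for sid, total, status in rows
}
```

With a correct import, `score_data` is populated with the most recent attempt per session instead of staying empty.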
```python
    def _extract_score_from_response(self, response: str) -> Tuple[Optional[int], str]:
        """
        Extract total_score and analysis from LLM response.

        Score is expected on the last line via regex: r'(\\d+)\\s*$'
        Analysis is everything before the score line.

        If the total score could not be extracted, None is returned instead, along with the full response
        as the score analysis.

        Args:
            response: LLM response text

        Returns:
            Tuple of (total_score, score_analysis)
        """
        # Find score on last line
        lines = response.splitlines()
        last_line = lines[-1]
        score_match = re.search(r"^((\+|-)?(\d+))\s*$", last_line)
        if not score_match:
            logger.error(
                f"No score found on the last line in the response: {last_line[:500]}..."
            )
            return None, response

        total_score = int(score_match.group(1))

        # Extract analysis (everything before score)
        score_analysis = "\n".join(response.splitlines()[0:-1])

        return total_score, score_analysis
```
Score extraction lacks range validation and empty response handling.

Two issues:

1. The regex `r"^((\+|-)?(\d+))\s*$"` allows negative numbers, but scores should be 0-100. A value like `-5` would be parsed as `-5`.
2. Line 226 `lines[-1]` will raise `IndexError` if the response is empty or contains only whitespace.
Proposed fix with validation

```diff
 def _extract_score_from_response(self, response: str) -> Tuple[Optional[int], str]:
     # Find score on last line
     lines = response.splitlines()
+    if not lines:
+        logger.error("Empty response received for score extraction")
+        return None, response
+
     last_line = lines[-1]
-    score_match = re.search(r"^((\+|-)?(\d+))\s*$", last_line)
+    score_match = re.search(r"^(\d+)\s*$", last_line)
     if not score_match:
         logger.error(
             f"No score found on the last line in the response: {last_line[:500]}..."
         )
         return None, response
     total_score = int(score_match.group(1))
+
+    # Validate score is in expected range
+    if not 0 <= total_score <= 100:
+        logger.error(f"Score {total_score} is outside valid range 0-100")
+        return None, response
     # Extract analysis (everything before score)
     score_analysis = "\n".join(response.splitlines()[0:-1])
     return total_score, score_analysis
```

🤖 Prompt for AI Agents
In `@backend/tarsy/services/scoring_service.py` around lines 208 - 239, In
_extract_score_from_response, handle empty/whitespace responses and restrict
parsed scores to 0–100: first strip the response and if empty return (None,
response) (avoid indexing lines[-1]); otherwise find the last non-empty line
(not just lines[-1]) and attempt to parse it using a stricter pattern that only
accepts integers 0–100 (e.g., match 100 or 0–99) or parse then validate range;
if the match fails or the numeric value is outside 0–100, log an error including
the problematic last line and return (None, response); otherwise convert to int,
build score_analysis from all lines before that last non-empty line, and return
(total_score, score_analysis).
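The hardened extractor described above can be sketched as a standalone function (a hypothetical free function for illustration, not the service method itself):

```python
import re
from typing import Optional, Tuple

def extract_score(response: str) -> Tuple[Optional[int], str]:
    """Parse a 0-100 score from the last non-empty line, else (None, full response)."""
    lines = [line for line in response.splitlines() if line.strip()]
    if not lines:
        # Empty or whitespace-only response: nothing to index, no score.
        return None, response
    last_line = lines[-1].strip()
    # Unsigned 1-3 digit integer only; rejects "-5", "+73", and trailing junk.
    match = re.fullmatch(r"\d{1,3}", last_line)
    if not match:
        return None, response
    total_score = int(match.group(0))
    if not 0 <= total_score <= 100:
        # Reject out-of-range values such as 250.
        return None, response
    # Analysis is everything before the score line.
    return total_score, "\n".join(lines[:-1])
```

Scanning the last non-empty line (rather than `lines[-1]`) also tolerates LLM responses that end with a trailing newline or blank lines after the score.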
```python
        # Launch background scoring task
        asyncio.create_task(self._execute_scoring(score_record.score_id, session_id))
        logger.debug(
            f"Launched background scoring task for score {score_record.score_id}"
        )
```
Background task created without reference may be garbage collected.
asyncio.create_task() returns a task object that should be stored to prevent potential garbage collection and to properly handle exceptions. If the task is GC'd before completion, it may be silently cancelled.
Store task reference for proper lifecycle management

```diff
-    asyncio.create_task(self._execute_scoring(score_record.score_id, session_id))
+    # Store reference to prevent GC and enable exception handling
+    task = asyncio.create_task(self._execute_scoring(score_record.score_id, session_id))
+    task.add_done_callback(lambda t: t.exception() if not t.cancelled() else None)
     logger.debug(
         f"Launched background scoring task for score {score_record.score_id}"
     )
```

Alternatively, consider using a task tracking mechanism if you need to monitor or cancel these tasks.
🤖 Prompt for AI Agents
In `@backend/tarsy/services/scoring_service.py` around lines 567 - 571, the background task launched with `asyncio.create_task(self._execute_scoring(score_record.score_id, session_id))` is not stored and may be garbage-collected or have unobserved exceptions. Modify the ScoringService (or the class containing `_execute_scoring`) to keep a reference to the created Task (e.g., append the returned task to a tasks set/list or a tracking dict keyed by score_id/session_id) and remove completed tasks (or await/monitor them) to handle exceptions and lifecycle; assign the return value of `asyncio.create_task` to that tracker so tasks aren't silently cancelled.
```python
async def mock_generate_score(*args, **kwargs):
    conversation = kwargs.get("conversation")
    # Add assistant message with score
    conversation.append_assistant_message(
        """## Evaluation

**Logical Flow: 18/25**
The investigation followed a reasonable pattern with good use of tools.

**Consistency: 20/25**
Conclusions are well-supported by evidence gathered.

**Tool Relevance: 17/25**
Good tool selection, though some opportunities were missed.

**Synthesis Quality: 18/25**
Final analysis is comprehensive and acknowledges limitations.

73"""
    )
    return conversation


# Mock missing tools response (Turn 2)
turn_count = {"count": 0}


async def mock_generate(*args, **kwargs):
    conversation = kwargs.get("conversation")
    turn_count["count"] += 1

    if turn_count["count"] == 1:
        # Turn 1: Score response
        conversation.append_assistant_message(
            """## Evaluation

**Logical Flow: 18/25**
The investigation followed a reasonable pattern.

**Consistency: 20/25**
Conclusions are well-supported.

**Tool Relevance: 17/25**
Good tool selection overall.

**Synthesis Quality: 18/25**
Comprehensive final analysis.

73"""
        )
    else:
        # Turn 2: Missing tools analysis
        conversation.append_assistant_message(
            """1. **kubectl-describe**: Would have provided more detailed pod information.

2. **read-file**: Could have inspected configuration files for root cause analysis."""
        )

    return conversation
```
Rename unused args parameters to avoid Ruff ARG001.
Lines 94, 119, and 338 define `*args` but never use it. Ruff flags this; rename it to `*_args` (or drop it) to keep lint clean.
Fix:

```diff
-async def mock_generate_score(*args, **kwargs):
+async def mock_generate_score(*_args, **kwargs):
 ...
-async def mock_generate(*args, **kwargs):
+async def mock_generate(*_args, **kwargs):
 ...
-async def mock_generate_bad_response(*args, **kwargs):
+async def mock_generate_bad_response(*_args, **kwargs):
```

Also applies to: 338-343
🧰 Tools
🪛 Ruff (0.14.14)
94-94: Unused function argument: args
(ARG001)
119-119: Unused function argument: args
(ARG001)
🤖 Prompt for AI Agents
In `@backend/tests/integration/test_scoring_integration.py` around lines 94 - 151, in the async test helpers `mock_generate_score` and `mock_generate` (and the similar helper around the other referenced lines), rename the unused variadic parameter `*args` to `*_args` (or remove it entirely) so Ruff ARG001 is not raised; update the function signatures where `*args` is declared (e.g., `async def mock_generate_score(*args, **kwargs)`) to use `*_args` (or drop the variadic parameter) while leaving the use of `kwargs` and the rest of each function body unchanged.
```python
with patch.object(
    scoring_service, "_get_llm_client", return_value=mock_llm_client
):
    with pytest.raises(
        ValueError,
        match="Cannot force rescore while an already existing scoring is pending",
    ):
```
Collapse nested with blocks for readability (SIM117).
This nested `with` can be a single context-manager tuple, which also satisfies Ruff's SIM117.
Refactor:

```diff
-with patch.object(
-    scoring_service, "_get_llm_client", return_value=mock_llm_client
-):
-    with pytest.raises(
-        ValueError,
-        match="Cannot force rescore while an already existing scoring is pending",
-    ):
-        await scoring_service.initiate_scoring(
-            session_id=session_id, triggered_by="test-user", force_rescore=True
-        )
+with (
+    patch.object(scoring_service, "_get_llm_client", return_value=mock_llm_client),
+    pytest.raises(
+        ValueError,
+        match="Cannot force rescore while an already existing scoring is pending",
+    ),
+):
+    await scoring_service.initiate_scoring(
+        session_id=session_id, triggered_by="test-user", force_rescore=True
+    )
```
🧰 Tools
🪛 Ruff (0.14.14)
285-291: Use a single with statement with multiple contexts instead of nested with statements
Combine with statements
(SIM117)
🤖 Prompt for AI Agents
In `@backend/tests/integration/test_scoring_integration.py` around lines 285 - 291, the nested context managers can be collapsed into a single combined `with` statement to improve readability and satisfy SIM117; replace the two nested `with` blocks that use `patch.object(scoring_service, "_get_llm_client", return_value=mock_llm_client)` and `pytest.raises(ValueError, match=...)` with a single `with` that joins both context managers (e.g., `with patch.object(...), pytest.raises(...):`), keeping the same `mock_llm_client`, the same `scoring_service._get_llm_client` patch target, and the same exception match string.
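For reference, the parenthesized multi-context `with` form that this refactor relies on (Python 3.10+) behaves exactly like the nested version; a tiny standalone illustration using stdlib `nullcontext`:

```python
from contextlib import nullcontext

# Python 3.10+ allows grouping several context managers in parentheses,
# entering them left to right and exiting in reverse order
with (
    nullcontext("first") as a,
    nullcontext("second") as b,
):
    combined = f"{a}-{b}"

print(combined)  # prints "first-second"
```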
```python
"""Tests for judge prompts and hashing."""

from tarsy.agents.prompts.judges import (
    JUDGE_PROMPT_SCORE,
    JUDGE_PROMPT_FOLLOWUP_MISSING_TOOLS,
    get_current_prompt_hash,
    CURRENT_PROMPT_HASH,
)


class TestJudgePrompts:
    """Test judge prompt constants."""

    def test_judge_prompt_score_contains_placeholders(self):
        """Test that JUDGE_PROMPT_SCORE contains required placeholders."""
        assert "{{ALERT_DATA}}" in JUDGE_PROMPT_SCORE
        assert "{{FINAL_ANALYSIS}}" in JUDGE_PROMPT_SCORE
        assert "{{LLM_CONVERSATION}}" in JUDGE_PROMPT_SCORE
        assert "{{CHAT_CONVERSATION}}" in JUDGE_PROMPT_SCORE
        assert "{{OUTPUT_SCHEMA}}" in JUDGE_PROMPT_SCORE

    def test_judge_prompt_followup_contains_content(self):
        """Test that JUDGE_PROMPT_FOLLOWUP_MISSING_TOOLS is non-empty."""
        assert len(JUDGE_PROMPT_FOLLOWUP_MISSING_TOOLS) > 0
        assert "missing tool" in JUDGE_PROMPT_FOLLOWUP_MISSING_TOOLS.lower()


class TestPromptHashing:
    """Test SHA256 hashing logic for prompt versioning."""

    def test_hash_determinism(self):
        """Test that hashing produces consistent results."""
        hash1 = get_current_prompt_hash()
        hash2 = get_current_prompt_hash()

        assert hash1 == hash2
        assert len(hash1) == 64  # SHA256 hex digest length

    def test_hash_matches_module_constant(self):
        """Test that module-level CURRENT_PROMPT_HASH matches function result."""
        computed_hash = get_current_prompt_hash()
        assert CURRENT_PROMPT_HASH == computed_hash

    def test_hash_is_hex_string(self):
        """Test that hash is a valid hexadecimal string."""
        assert all(c in "0123456789abcdef" for c in CURRENT_PROMPT_HASH)
        assert len(CURRENT_PROMPT_HASH) == 64
```
Add pytest unit markers + return type hints (and apply Ruff SIM300).
This keeps tests aligned with project conventions and addresses the reported yoda-condition warning.
✅ Suggested updates

```diff
+import pytest
+
+@pytest.mark.unit
 class TestJudgePrompts:
     """Test judge prompt constants."""

-    def test_judge_prompt_score_contains_placeholders(self):
+    def test_judge_prompt_score_contains_placeholders(self) -> None:
         """Test that JUDGE_PROMPT_SCORE contains required placeholders."""
         assert "{{ALERT_DATA}}" in JUDGE_PROMPT_SCORE
         assert "{{FINAL_ANALYSIS}}" in JUDGE_PROMPT_SCORE
         assert "{{LLM_CONVERSATION}}" in JUDGE_PROMPT_SCORE
         assert "{{CHAT_CONVERSATION}}" in JUDGE_PROMPT_SCORE
         assert "{{OUTPUT_SCHEMA}}" in JUDGE_PROMPT_SCORE

-    def test_judge_prompt_followup_contains_content(self):
+    def test_judge_prompt_followup_contains_content(self) -> None:
         """Test that JUDGE_PROMPT_FOLLOWUP_MISSING_TOOLS is non-empty."""
         assert len(JUDGE_PROMPT_FOLLOWUP_MISSING_TOOLS) > 0
         assert "missing tool" in JUDGE_PROMPT_FOLLOWUP_MISSING_TOOLS.lower()

+@pytest.mark.unit
 class TestPromptHashing:
     """Test SHA256 hashing logic for prompt versioning."""

-    def test_hash_determinism(self):
+    def test_hash_determinism(self) -> None:
         """Test that hashing produces consistent results."""
         hash1 = get_current_prompt_hash()
         hash2 = get_current_prompt_hash()

         assert hash1 == hash2
         assert len(hash1) == 64  # SHA256 hex digest length

-    def test_hash_matches_module_constant(self):
+    def test_hash_matches_module_constant(self) -> None:
         """Test that module-level CURRENT_PROMPT_HASH matches function result."""
         computed_hash = get_current_prompt_hash()
-        assert CURRENT_PROMPT_HASH == computed_hash
+        assert computed_hash == CURRENT_PROMPT_HASH

-    def test_hash_is_hex_string(self):
+    def test_hash_is_hex_string(self) -> None:
         """Test that hash is a valid hexadecimal string."""
         assert all(c in "0123456789abcdef" for c in CURRENT_PROMPT_HASH)
         assert len(CURRENT_PROMPT_HASH) == 64
```

🧰 Tools
🪛 Ruff (0.14.14)
42-42: Yoda condition detected
Rewrite as computed_hash == CURRENT_PROMPT_HASH
(SIM300)
🤖 Prompt for AI Agents
In `@backend/tests/unit/agents/test_judges.py` around lines 1 - 47, add pytest markers and return type hints: import pytest at the top and decorate each test class (TestJudgePrompts, TestPromptHashing) with `@pytest.mark.unit`, then add explicit `-> None` return type hints to every test method (e.g., `test_judge_prompt_score_contains_placeholders(self) -> None`). Also fix the Ruff SIM300 yoda-condition warning by placing the computed value on the left side of the comparison (`computed_hash == CURRENT_PROMPT_HASH`); assertions like `len(CURRENT_PROMPT_HASH) == 64` already have the variable on the left and need no change.
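For context, the deterministic prompt-versioning scheme these tests exercise can be sketched as follows (the prompt strings here are placeholders, not the real judge prompts):

```python
import hashlib

# Hypothetical stand-ins for the real judge prompt constants
JUDGE_PROMPT_SCORE = "Score the investigation: {{ALERT_DATA}} {{FINAL_ANALYSIS}}"
JUDGE_PROMPT_FOLLOWUP_MISSING_TOOLS = "List any missing tool that would have helped."


def get_current_prompt_hash() -> str:
    """SHA256 over all judge prompts, so any edit yields a new version hash."""
    combined = JUDGE_PROMPT_SCORE + JUDGE_PROMPT_FOLLOWUP_MISSING_TOOLS
    return hashlib.sha256(combined.encode("utf-8")).hexdigest()


# Computed once at import time, like the module-level constant the tests check
CURRENT_PROMPT_HASH = get_current_prompt_hash()
```

Because the hash is a pure function of the prompt text, scores stored with a given `prompt_hash` can always be traced back to the exact prompt version that produced them.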
```python
class TestScoringStatusEnum:
    """Test ScoringStatus enum."""

    def test_active_values(self):
        """Test active_values() returns pending and in_progress."""
        active = ScoringStatus.active_values()
        assert "pending" in active
        assert "in_progress" in active
        assert len(active) == 2

    def test_values(self):
        """Test values() returns all status strings."""
        all_values = ScoringStatus.values()
        assert len(all_values) == 4
        assert "pending" in all_values
        assert "in_progress" in all_values
        assert "completed" in all_values
        assert "failed" in all_values
```
Add unit markers and return type hints for tests.
Line 11 and the other test classes should be marked as unit tests, and test functions should include -> None return type hints to meet project test conventions. This keeps markers consistent and helps mypy/linters.
Suggested pattern:

```diff
+@pytest.mark.unit
 class TestScoringStatusEnum:
     """Test ScoringStatus enum."""

-    def test_active_values(self):
+    def test_active_values(self) -> None:
         """Test active_values() returns pending and in_progress."""
```

Apply the same pattern to the other classes and test functions in this file.
As per coding guidelines: “Use test markers appropriately: @pytest.mark.unit …” and “Include type hints in test functions following project standards.”
Also applies to: 31-177
🤖 Prompt for AI Agents
In `@backend/tests/unit/models/test_session_score_models.py` around lines 11 - 28, add the pytest unit marker and return type hints: decorate TestScoringStatusEnum (and the other test classes in the file) with `@pytest.mark.unit`, and add explicit `-> None` return type hints to test methods such as test_active_values and test_values (and all other test_* functions in the file) so each test signature follows project conventions and linter/mypy expectations.
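A minimal enum satisfying these tests might look like this (a sketch; the actual model may carry extra members, fields, or SQLModel wiring):

```python
from enum import Enum


class ScoringStatus(str, Enum):
    """Lifecycle of a session score."""

    PENDING = "pending"
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"
    FAILED = "failed"

    @classmethod
    def values(cls) -> list[str]:
        """All status strings, in definition order."""
        return [member.value for member in cls]

    @classmethod
    def active_values(cls) -> list[str]:
        # Scores still being worked on; used to block duplicate scoring runs
        return [cls.PENDING.value, cls.IN_PROGRESS.value]
```

Subclassing `str` keeps the values directly comparable to the raw strings stored in the database column.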
```tsx
import { Chip, Tooltip, CircularProgress } from '@mui/material';
import { CheckCircle, Error, Warning, HourglassEmpty } from '@mui/icons-material';
import type { ScoringStatus } from '../types';

interface ScoreBadgeProps {
  score: number | null;
  status?: ScoringStatus;
  onClick?: (event: React.MouseEvent) => void;
  size?: 'small' | 'medium';
  label?: string; // Optional label to show after the score (e.g., 'score', 'points')
}

/**
 * ScoreBadge component displays session quality score with 3-tier color coding
 *
 * Color scheme (aligned with judge prompt scoring philosophy):
 * - 0-49: Red (failed investigation)
 * - 50-74: Yellow (weak investigation)
 * - 75-100: Green (good investigation)
 *
 * Also handles pending/in_progress/failed states
 */
const ScoreBadge: React.FC<ScoreBadgeProps> = ({
  score,
  status,
  onClick,
  size = 'small',
  label: customLabel,
}) => {
  // Handle different scoring states
  if (status === 'pending' || status === 'in_progress') {
    return (
      <Tooltip title={status === 'pending' ? 'Score pending' : 'Scoring in progress'} arrow>
        <Chip
          icon={status === 'in_progress' ? <CircularProgress size={14} /> : <HourglassEmpty />}
          label={status === 'in_progress' ? 'Scoring...' : 'Pending'}
          size={size}
          variant="outlined"
          color="default"
          onClick={onClick}
          sx={{ cursor: onClick ? 'pointer' : 'default' }}
        />
      </Tooltip>
    );
  }

  if (status === 'failed') {
    return (
      <Tooltip title="Scoring failed" arrow>
        <Chip
          icon={<Error />}
          label="Error"
          size={size}
          color="error"
          variant="outlined"
          onClick={onClick}
          sx={{ cursor: onClick ? 'pointer' : 'default' }}
```
🧩 Analysis chain
🏁 Script executed:

```shell
cat -n dashboard/src/components/ScoreBadge.tsx
```

Repository: codeready-toolchain/tarsy-bot
Length of output: 4258
Avoid shadowing the global Error identifier.
Line 3 imports `Error` from `@mui/icons-material`, which shadows the global `Error` and triggers the linter. Alias it and update both usages.
Fix:

```diff
-import { CheckCircle, Error, Warning, HourglassEmpty } from '@mui/icons-material';
+import { CheckCircle, Error as ErrorIcon, Warning, HourglassEmpty } from '@mui/icons-material';
```

Update usages on lines 52 and 88:

```diff
-          icon={<Error />}
+          icon={<ErrorIcon />}
```

and

```diff
-    icon = <Error />;
+    icon = <ErrorIcon />;
```

🧰 Tools
🪛 Biome (2.1.2)
[error] 3-3: Do not shadow the global "Error" property.
Consider renaming this variable. It's easy to confuse the origin of variables when they're named after a known global.
(lint/suspicious/noShadowRestrictedNames)
🤖 Prompt for AI Agents
In `@dashboard/src/components/ScoreBadge.tsx` around lines 2 - 58, the import of Error from '@mui/icons-material' shadows the global Error; rename the imported icon (e.g., `Error as ErrorIcon`) and update all usages in the ScoreBadge component where the icon prop uses `<Error />` (the Chip for the failed state and any other Chip/icon usage) to the new alias (`<ErrorIcon />`), along with any corresponding JSX references, so the global Error identifier is no longer shadowed.
I need to close this because it doesn't stack on top of the previous PRs 🤦🏼
This PR stacks on top of #246 and adds UI for the scoring functionality.
Part of the PR stack:
Summary by CodeRabbit
New Features
Enhancements