[Feat] Implement RAG evaluation system (Gear/Whistle/Skill Lab) #73
Conversation
📝 Walkthrough

Adds a complete evaluation framework: datasets for gear/skill/whistle, metric utilities, per-agent runners (RAG and a no-RAG baseline for gear), a report generator, and a CLI that runs and aggregates evaluations into timestamped Markdown reports.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant CLI as "scripts/evaluate.py"
    participant Runner as "Gear Runner"
    participant Agent as "gear_agent_graph"
    participant LLM as "LLM (RAG / gpt-4o)"
    participant Data as "datasets (shoes/players/gear_cases)"
    CLI->>Runner: start gear evaluation
    Runner->>Data: load gear_cases.json, shoes.json, players.json
    alt RAG path
        Runner->>Agent: build initial chat state (case + retrieval)
        Agent->>LLM: query (RAG retrieval + prompt)
        LLM-->>Agent: JSON response (GearAdvisorResponse)
        Agent-->>Runner: parsed response
    else No-RAG baseline
        Runner->>LLM: send unified shoe/player prompt to gpt-4o
        LLM-->>Runner: JSON response (GearAdvisorResponse)
    end
    Runner->>Runner: resolve predicted shoe IDs, compute per-case metrics
    Runner-->>CLI: aggregated results
    CLI->>Report: generate markdown report
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45–60 minutes
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 6
🧹 Nitpick comments (6)
scripts/eval/report.py (1)
44-46: Escape markdown table cells before interpolating free-form text.
`description`, predicted IDs/articles, and step summaries come from datasets or model output. If any value contains `|` or a newline, the generated table will break and shift subsequent columns. A small shared cell-escaping helper would make all three sections much more robust.

Possible fix:

```diff
+def _md_cell(value: object) -> str:
+    return str(value).replace("|", r"\|").replace("\n", "<br>")
+
 ...
     lines.append(
-        f"| {d['id']} | {d['description']} | {expected} "
-        f"| {rag_pred} | {base_pred} | {rag_hit} | {base_hit} |"
+        f"| {_md_cell(d['id'])} | {_md_cell(d['description'])} | {_md_cell(expected)} "
+        f"| {_md_cell(rag_pred)} | {_md_cell(base_pred)} | {rag_hit} | {base_hit} |"
     )
```

🤖 Prompt for AI Agents: Verify each finding against the current code and only fix it if needed. In `scripts/eval/report.py` around lines 44-46: add a small helper like `escape_cell(text)` and call it when building each row so `d['description']`, `expected`, `rag_pred`, `base_pred`, `rag_hit`, and `base_hit` are sanitized before interpolation; the helper should escape pipe characters (`|` → `\|`) and convert or remove newlines (`\n` → `' '` or `<br>`) to keep Markdown table cells intact.

scripts/eval/runners/whistle_runner.py (2)
79-86: Broad exception handling is acceptable here but could mask unexpected errors.

The `except Exception` is flagged by static analysis (BLE001), but in evaluation contexts this pattern is reasonable: it ensures one failing case doesn't abort the entire evaluation run. Consider logging the exception type for better diagnostics.

Optional: log the exception type for better debugging:

```diff
 except Exception as e:
-    logger.error("Whistle failed for %s: %s", case_id, e)
+    logger.error("Whistle failed for %s: %s: %s", case_id, type(e).__name__, e)
     predicted_articles = []
     predicted_decision = "error"
```

🤖 Prompt for AI Agents: Verify each finding against the current code and only fix it if needed. In `scripts/eval/runners/whistle_runner.py` around lines 79-86: update the except block around the call to `_run_single_whistle(case)` so `logger.error` includes the exception type and details (append `type(e).__name__` and/or enable `exc_info=True`), while preserving the existing fallback assignment to `predicted_articles = []` and `predicted_decision = "error"` for the case.
23-29: Edge case: `_normalize_article` returns the original string if no digits are found.

If an article string contains no numbers (e.g., "Appendix A"), the function returns the original string. This may cause citation mismatches if expected articles are normalized differently. Ensure the dataset's `expected_articles` values are pre-normalized, or handle this case explicitly.

🤖 Prompt for AI Agents: Verify each finding against the current code and only fix it if needed. In `scripts/eval/runners/whistle_runner.py` around lines 23-29: update `_normalize_article` to explicitly handle non-numeric input (e.g., return a normalized fallback such as `article.strip().lower()` after normalizing whitespace and punctuation) and ensure the dataset's `expected_articles` values are pre-normalized to the same scheme before comparison, so both normalization paths match.
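One way to make both normalization paths match is a single shared helper applied to predicted and expected values alike. The sketch below is illustrative, not the module's actual implementation; the function name and the lowercase fallback scheme are assumptions:

```python
import re


def normalize_article(article: str) -> str:
    """Extract the numeric article reference, e.g. 'Art 25' -> '25', '33.9' -> '33.9'.

    Falls back to a lowercased, whitespace-collapsed token when no digits are
    present (e.g. 'Appendix  A' -> 'appendix a') so non-numeric citations
    still compare consistently on both sides.
    """
    match = re.search(r"\d+(?:\.\d+)?", article)
    if match:
        return match.group(0)
    return " ".join(article.strip().lower().split())
```

Applying the same function to the dataset's `expected_articles` and to the model's predicted articles guarantees the comparison uses one scheme.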
24-34: Module-level cache is not thread-safe.

The `_MODEL_TO_ID` dict is populated lazily on first access. While this works for single-threaded evaluation, if this module is ever used in a concurrent context, multiple threads could race to populate the cache simultaneously. For a single-threaded evaluation script this is acceptable; noting it in case future usage patterns change.

🤖 Prompt for AI Agents: Verify each finding against the current code and only fix it if needed. In `scripts/eval/runners/gear_runner.py` around lines 24-34: guard the initialization with a module-level `threading.Lock` (e.g., `_MODEL_TO_ID_LOCK`) acquired around the population block inside `_get_model_to_id_map` (re-check inside the lock to avoid double initialization), or alternatively use `functools.lru_cache` on a pure function that returns the mapping; ensure reads after initialization do not hold the lock, for performance.
37-45: Silent fallback may hide resolution failures.

If `shoe_id_or_name` doesn't start with `"shoe_"` and the model-name lookup fails, the function silently returns the original `shoe_id_or_name`. This could cause downstream metric calculation to fail silently or produce misleading results. Consider logging a warning when the lookup fails to aid debugging.

Add a warning log for failed lookups:

```diff
 def _resolve_shoe_id(shoe_id_or_name: str, model_name: str) -> str:
     """Resolve to canonical shoe_id. Falls back to model_name lookup."""
     if shoe_id_or_name.startswith("shoe_"):
         return shoe_id_or_name
     lookup = _get_model_to_id_map()
     resolved = lookup.get(model_name.lower())
     if resolved:
         return resolved
+    logger.warning(
+        "Could not resolve shoe_id for model_name='%s', returning '%s'",
+        model_name, shoe_id_or_name
+    )
     return shoe_id_or_name
```

🤖 Prompt for AI Agents: Verify each finding against the current code and only fix it if needed. In `scripts/eval/runners/gear_runner.py` around lines 37-45: when the lookup in `_resolve_shoe_id` fails, emit a `logger.warning` that includes both `shoe_id_or_name` and `model_name` before returning `shoe_id_or_name`; keep the return behavior intact, and make the message clearly state that the lookup failed and which keys were used so the log helps trace the lookup source.
109-120: Broad exception handling acknowledged.

The `except Exception` blocks (lines 111 and 118) are flagged by static analysis but are appropriate here for evaluation resilience. As with the whistle runner, consider including the exception type in logs for better diagnostics.

Optional: include the exception type in error logs:

```diff
 except Exception as e:
-    logger.error("RAG failed for %s: %s", case_id, e)
+    logger.error("RAG failed for %s: %s: %s", case_id, type(e).__name__, e)
     rag_predicted = []

 # Baseline
 try:
     baseline_predicted = _run_single_baseline(case)
 except Exception as e:
-    logger.error("Baseline failed for %s: %s", case_id, e)
+    logger.error("Baseline failed for %s: %s: %s", case_id, type(e).__name__, e)
     baseline_predicted = []
```

🤖 Prompt for AI Agents: Verify each finding against the current code and only fix it if needed. In `scripts/eval/runners/gear_runner.py` around lines 109-120: update the `logger.error` calls around `_run_single_rag(case)` and `_run_single_baseline(case)` to include the exception class name (`type(e).__name__`) alongside the message, while retaining the broad exception handling and the fallback to an empty prediction list.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/eval/datasets/gear_cases.json`:
- Around line 128-131: The expected_shoe_ids array for the gear-10 test case
includes shoe_034 (All-Pro NITRO 2 at 159K) which exceeds the 150K budget;
remove "shoe_034" from the expected_shoe_ids list and ensure the accompanying
"rationale" for gear-10 (which already notes All-Pro NITRO 2 exceeds 150K)
remains consistent with that change. Target the gear-10 object, updating the
expected_shoe_ids field and leaving the rationale field as-is (or adjust it only
if you intentionally want to keep shoe_034).
In `@scripts/eval/report.py`:
- Around line 69-80: The table currently shows a "Match" column derived from
decision_match inside the for d in details loop (variable match = "O" if
d["decision_match"] else "X"), which confuses citation auditing; change the
header lines (the two strings in lines[...] that define the table header and
separator) to either rename "Match" to "Decision Match" and/or add a new
"Citation Hit" column, then compute a citation_hit value (e.g., citation_hit =
"O" if set(d["predicted_articles"]) == set(d["expected_articles"]) else "X" or a
more appropriate comparison) alongside match and include citation_hit in the row
f-string (the same f-string that references d['id'], d['description'],
d['expected_decision'], d['predicted_decision'], match, exp_art, pred_art) so
each row shows both decision match and article/citation success.
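The citation-hit comparison suggested above can be sketched as a small helper; the function name and the exact set-equality criterion are assumptions from the review text, not the report module's actual code:

```python
def citation_hit(predicted_articles: list[str], expected_articles: list[str]) -> str:
    """'O' when the predicted citation set exactly matches the expected set.

    Order and duplicates are ignored, matching how citation success is
    typically audited per case.
    """
    return "O" if set(predicted_articles) == set(expected_articles) else "X"
```

Each table row would then carry both the decision match and this citation-hit flag, so decision correctness and citation correctness are audited separately.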
In `@scripts/eval/results/eval_report_20260318_151043.md`:
- Around line 1-72: This file is an obsolete pre-fix evaluation snapshot that
duplicates/conflicts with the corrected report; remove the file named
eval_report_20260318_151043.md from the PR (delete it from the repository and
commit the deletion) so only the corrected report eval_report_20260318_151753.md
remains; ensure the commit removes the file from git (git rm or your VCS
equivalent) and push the change so reviewers see only the final evaluation
output.
In `@scripts/eval/runners/gear_runner.py`:
- Around line 56-80: The initial_state built in _run_single_rag is missing the
required GearAgentState field context, causing downstream access errors; update
_run_single_rag to populate "context" (a List[Document]) with the retrieved RAG
documents before calling gear_agent_graph.invoke, or if retrieval lives
elsewhere, add a default empty list under the "context" key in initial_state;
reference GearAgentState, initial_state, and gear_agent_graph.invoke when
locating where to add the context value so the workflow that reads context (for
separating shoe and player data) always has a defined List[Document].
In `@scripts/eval/runners/skill_runner.py`:
- Around line 132-148: The current try/except around _run_single_skill masks all
failures (transport/auth/graph/runtime) by catching Exception; narrow the error
handling so only expected per-case response errors are converted to a FAIL:
update the block around
_run_single_skill/_check_equipment_constraint/_check_time_constraint to catch
only the specific per-case exception type(s) (or introduce a
PerCaseResponseError and have _run_single_skill raise it for normal bad
responses) and in that except set equip_pass=False, time_pass=False and
step_summary=[]. For any other unexpected Exception, log with logger.error
including case_id and full error info and re-raise the exception so execution
failures surface instead of being folded into metrics.
- Around line 55-58: The loop over EQUIPMENT_KEYWORDS incorrectly exempts "ball"
from equipment validation, causing ball-related steps to bypass the allowed
check; update the loop in scripts/eval/runners/skill_runner.py so that only
equipment present in the allowed set is skipped (remove the special-case equip
== "ball") and treat "ball" like any other equipment keyword when checking
allowed, ensuring the validation uses EQUIPMENT_KEYWORDS and allowed
consistently.
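The corrected check described above can be sketched as follows; `EQUIPMENT_KEYWORDS`, `allowed`, and the function name mirror identifiers mentioned in the review, and the keyword set itself is a hypothetical example, not the actual module contents:

```python
# Hypothetical keyword set for illustration only.
EQUIPMENT_KEYWORDS = {"ball", "cone", "ladder", "hurdle"}


def check_equipment_constraint(step_text: str, allowed: set[str]) -> bool:
    """A step passes only if every equipment keyword it mentions is allowed."""
    text = step_text.lower()
    for equip in EQUIPMENT_KEYWORDS:
        # No special case for "ball": it must be in `allowed` like the rest.
        if equip in text and equip not in allowed:
            return False
    return True
```

With the `ball` exemption removed, a drill step that mentions a ball fails the check unless the case actually allows one, so ball-related steps no longer bypass validation.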
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: d76642ca-e2f1-4e14-b6c8-14dcb424a81d
📒 Files selected for processing (14)
- scripts/eval/__init__.py
- scripts/eval/baseline.py
- scripts/eval/datasets/gear_cases.json
- scripts/eval/datasets/skill_cases.json
- scripts/eval/datasets/whistle_cases.json
- scripts/eval/metrics.py
- scripts/eval/report.py
- scripts/eval/results/eval_report_20260318_151043.md
- scripts/eval/results/eval_report_20260318_151753.md
- scripts/eval/runners/__init__.py
- scripts/eval/runners/gear_runner.py
- scripts/eval/runners/skill_runner.py
- scripts/eval/runners/whistle_runner.py
- scripts/evaluate.py
Actionable comments posted: 1
🧹 Nitpick comments (1)
scripts/eval/report.py (1)
141-147: Use a single captured timestamp for consistent report metadata.
`datetime.now()` is called twice (line 141 and line 146), which can produce mismatched filename/header timestamps near second boundaries.

♻️ Suggested refactor:

```diff
-    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+    now = datetime.now()
+    timestamp = now.strftime("%Y%m%d_%H%M%S")
@@
-    f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}",
+    f"Generated: {now.strftime('%Y-%m-%d %H:%M:%S')}",
```

🤖 Prompt for AI Agents: Verify each finding against the current code and only fix it if needed. In `scripts/eval/report.py` around lines 141-147: capture `datetime.now()` once into a single variable (e.g., `now`) before building the filename `timestamp` and the `lines` list, then use that captured value when formatting both the filename and the `Generated:` header line so both use the same instant.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/eval/report.py`:
- Around line 144-145: Remove the unnecessary f-string prefixes on the two
static string literals used when building the report (the entries currently
written as f"# RAG Evaluation Report" and f""); replace them with plain string
literals "# RAG Evaluation Report" and "" so Ruff F541 is satisfied. Locate
these literals in the report generation block in scripts/eval/report.py (the
list/sequence of report lines) and update only those two items.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: dbb229f2-f2eb-4ac2-8708-faa41aca072f
📒 Files selected for processing (3)
- .gitignore
- scripts/eval/datasets/gear_cases.json
- scripts/eval/report.py
✅ Files skipped from review due to trivial changes (1)
- .gitignore
🚧 Files skipped from review as they are similar to previous changes (1)
- scripts/eval/datasets/gear_cases.json
What does this PR change?

Work details
- Datasets: `gear_cases.json` (15 cases), `whistle_cases.json` (5 cases), `skill_cases.json` (5 cases)
- `metrics.py` — metric functions for Hit@3, MRR, Citation Hit Rate, Constraint Pass Rate
- `baseline.py` — No-RAG baseline (inserts all 59 entries from `shoes.json` into the prompt; same LLM/prompt)
- `gear_runner.py` — runs RAG + baseline, reverse-maps `model_name` → `shoe_id`
- `whistle_runner.py` — runs citation accuracy, normalizes article numbers
- `skill_runner.py` — validates equipment/time constraints
- `report.py` — generates the Markdown report (including a RAG vs. baseline comparison table)
- `evaluate.py` — CLI entry point (`--all`, `--gear`, `--whistle`, `--skill`)
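The retrieval metrics listed above can be sketched as follows; the function names and signatures are illustrative, not necessarily those in `metrics.py`:

```python
def hit_at_k(predicted: list[str], expected: set[str], k: int = 3) -> float:
    """1.0 if any of the top-k predictions is in the expected set, else 0.0."""
    return 1.0 if any(p in expected for p in predicted[:k]) else 0.0


def mrr(predicted: list[str], expected: set[str]) -> float:
    """Reciprocal rank of the first correct prediction (0.0 if none)."""
    for rank, p in enumerate(predicted, start=1):
        if p in expected:
            return 1.0 / rank
    return 0.0
```

Averaging these per-case values over the dataset yields the aggregate Hit@3 and MRR scores compared between the RAG pipeline and the baseline.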
Related issues

Review points
- `shoe_id` reverse mapping: the RAG pipeline does not include `shoe_id` in the context, so the LLM generated arbitrary IDs → solved by reverse-mapping `model_name` → `shoe_id` in the runner; the baseline adds a `Shoe ID:` field to the context.
- Article citations come back in varied formats such as "Art 25" and "33.9" → `_normalize_article()` extracts only the digits for comparison.
- False positives arose from simulation descriptions such as `defender` and `opponent` → excluded these from the `partner` keywords, improving Equipment Pass Rate from 0.4 to 1.0.

Notes and screenshots (optional)
File structure

Execution results