[Feat] RAG 평가 시스템 구현 (Gear/Whistle/Skill Lab)#73

Merged
zweadfx merged 11 commits into main from feat/rag-evaluation
Mar 18, 2026

Conversation

@zweadfx
Owner

@zweadfx zweadfx commented Mar 18, 2026

What does this change?

This PR implements a RAG evaluation system to quantitatively demonstrate, for the portfolio, why RAG was needed and how effective it was. Gear Advisor is compared head-to-head against a No-RAG baseline; Whistle and Skill Lab are evaluated on citation accuracy and constraint pass rate, respectively.

Work details

  • Wrote evaluation datasets (gear_cases.json: 15 cases, whistle_cases.json: 5 cases, skill_cases.json: 5 cases)
  • metrics.py — metric functions for Hit@3, MRR, Citation Hit Rate, Constraint Pass Rate
  • baseline.py — No-RAG baseline (inserts all 59 entries of shoes.json into the prompt; same LLM/prompt)
  • gear_runner.py — runs RAG + baseline; reverse-maps model_name → shoe_id
  • whistle_runner.py — runs the citation-accuracy evaluation; normalizes article numbers
  • skill_runner.py — validates equipment/time constraints
  • report.py — generates the Markdown report (including the RAG vs Baseline comparison table)
  • evaluate.py — CLI entry point (--all, --gear, --whistle, --skill)
  • Ran the full evaluation and verified the report
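The metric functions listed above can be sketched as follows. This is a minimal illustration of what Hit@K, MRR, and a pass rate compute; the actual signatures and aggregation in metrics.py may differ.

```python
def hit_at_k(expected_ids: list[str], predicted_ids: list[str], k: int = 3) -> float:
    """1.0 if any expected ID appears among the top-k predictions, else 0.0."""
    return 1.0 if set(expected_ids) & set(predicted_ids[:k]) else 0.0


def mrr(expected_ids: list[str], predicted_ids: list[str]) -> float:
    """Reciprocal rank of the first prediction that matches an expected ID."""
    for rank, pred in enumerate(predicted_ids, start=1):
        if pred in expected_ids:
            return 1.0 / rank
    return 0.0


def pass_rate(results: list[bool]) -> float:
    """Fraction of cases that passed (e.g., Constraint Pass Rate)."""
    return sum(results) / len(results) if results else 0.0
```

Per-case scores like these would then be averaged over the 15/5/5 cases to produce the aggregate numbers reported below.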

Checklist

  • Did you run a self-test?
  • Did you update related documentation or comments?
  • Did you follow the agreed coding conventions?

Related issues

Review points

  • Interpreting the Gear results: RAG Hit@3 (0.87) ties Baseline Hit@3 (0.87). At a dataset size of 59 shoes there is no accuracy difference; the value of RAG here lies in cost and scalability (5 vs 59 context entries, roughly 12x token savings).
  • shoe_id reverse mapping: the RAG pipeline does not include shoe_id in the context, so the LLM fabricated arbitrary IDs → resolved in the runner by reverse-mapping model_name → shoe_id. The baseline adds a Shoe ID: field to its context instead.
  • Article normalization: the LLM returns varied formats such as "Art 25" and "33.9" → _normalize_article() extracts only the digits before comparison.
  • Equipment keyword validation: words like defender and opponent in simulation descriptions caused false positives → excluded them from the partner keyword match, improving Equipment Pass Rate from 0.4 to 1.0.

Notes and screenshots (optional)

File structure

scripts/
├── evaluate.py
└── eval/
    ├── __init__.py
    ├── datasets/
    │   ├── gear_cases.json (15 cases)
    │   ├── whistle_cases.json (5 cases)
    │   └── skill_cases.json (5 cases)
    ├── runners/
    │   ├── __init__.py
    │   ├── gear_runner.py
    │   ├── whistle_runner.py
    │   └── skill_runner.py
    ├── baseline.py
    ├── metrics.py
    ├── report.py
    └── results/
        └── eval_report_YYYYMMDD_HHMMSS.md

Execution results

Evaluation metric results

  • Gear RAG Hit@3: 0.87
  • Gear Baseline Hit@3: 0.87
  • Whistle Citation Hit Rate: 0.80
  • Whistle Decision Accuracy: 1.00
  • Skill Lab Equipment Pass Rate: 1.00
  • Skill Lab Time Pass Rate: 1.00

Summary by CodeRabbit

  • New Features

    • Added an end-to-end evaluation framework for gear recommendations, skill coaching, and whistle rulings (RAG vs no‑RAG baseline), including runners, metrics, and markdown report generation.
    • Added a CLI to run evaluations individually or together.
  • Chores

    • Added datasets for gear, skill, and whistle test cases and updated ignore rules for evaluation results.
    • Added metrics utilities and report generation utilities.

@zweadfx zweadfx self-assigned this Mar 18, 2026
@zweadfx zweadfx added the feature label ("used when implementing new features") Mar 18, 2026
@zweadfx zweadfx linked an issue Mar 18, 2026 that may be closed by this pull request
@coderabbitai

coderabbitai bot commented Mar 18, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9cdcae5f-87be-4506-9b90-26f25581ad67

📥 Commits

Reviewing files that changed from the base of the PR and between e44637b and 5f1c077.

📒 Files selected for processing (1)
  • scripts/eval/report.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • scripts/eval/report.py

📝 Walkthrough

Walkthrough

Adds a complete evaluation framework: datasets for gear/skill/whistle, metric utilities, per-agent runners (RAG and no‑RAG baseline for gear), a report generator, and a CLI to run and aggregate evaluations into timestamped Markdown reports.

Changes

Cohort / File(s) Summary
CLI & Report
scripts/evaluate.py, scripts/eval/report.py
New CLI entrypoint to run evaluations (--all/--gear/--whistle/--skill) and a Markdown report generator that writes timestamped evaluation reports to scripts/eval/results/ (now gitignored).
Gear evaluation & baseline
scripts/eval/runners/gear_runner.py, scripts/eval/baseline.py, scripts/eval/datasets/gear_cases.json, scripts/eval/datasets/..., scripts/eval/metrics.py
New gear runner executing RAG and no‑RAG baseline flows, shoe ID resolution, aggregation of per-case predictions; baseline builds unified shoe/player prompt and calls gpt-4o; metrics utilities compute Hit@K, MRR and aggregate gear metrics.
Skill evaluation
scripts/eval/runners/skill_runner.py, scripts/eval/datasets/skill_cases.json
Skill runner that invokes coach agent graph per case, validates equipment/time constraints, aggregates per-case results and overall constraint pass rates.
Whistle evaluation
scripts/eval/runners/whistle_runner.py, scripts/eval/datasets/whistle_cases.json
Whistle runner that invokes judge agent graph, normalizes predicted article references, compares to expected articles, and computes citation hit rate.
Results ignore
.gitignore
Added scripts/eval/results/*.md to ignore generated evaluation reports.

Sequence Diagram(s)

sequenceDiagram
  participant CLI as "scripts/evaluate.py"
  participant Runner as "Gear Runner"
  participant Agent as "gear_agent_graph"
  participant LLM as "LLM (RAG / gpt-4o)"
  participant Data as "datasets (shoes/players/gear_cases)"

  CLI->>Runner: start gear evaluation
  Runner->>Data: load gear_cases.json, shoes.json, players.json
  alt RAG path
    Runner->>Agent: build initial chat state (case + retrieval)
    Agent->>LLM: query (RAG retrieval + prompt)
    LLM-->>Agent: JSON response (GearAdvisorResponse)
    Agent-->>Runner: parsed response
  else No-RAG baseline
    Runner->>LLM: send unified shoe/player prompt to gpt-4o
    LLM-->>Runner: JSON response (GearAdvisorResponse)
  end
  Runner->>Runner: resolve predicted shoe IDs, compute per-case metrics
  Runner-->>CLI: aggregated results
  CLI->>Report: generate markdown report

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45–60 minutes

Possibly related issues

Possibly related PRs

Poem

🐰 I hopped through data, prompt in paw,

Shoes, players, cases—lined in a row;
RAG and baseline weighed without a flaw,
Metrics counted where answers go;
Reports tucked safe, evaluation aglow.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly describes the main change: implementing a RAG evaluation system for three components (Gear/Whistle/Skill Lab) with quantitative metrics.
Docstring Coverage ✅ Passed Docstring coverage is 92.86% which is sufficient. The required threshold is 80.00%.




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🧹 Nitpick comments (6)
scripts/eval/report.py (1)

44-46: Escape markdown table cells before interpolating free-form text.

description, predicted IDs/articles, and step summaries come from datasets or model output. If any value contains | or a newline, the generated table will break and shift subsequent columns. A small shared cell-escaping helper would make all three sections much more robust.

Possible fix
+def _md_cell(value: object) -> str:
+    return str(value).replace("|", r"\|").replace("\n", "<br>")
+
 ...
         lines.append(
-            f"| {d['id']} | {d['description']} | {expected} "
-            f"| {rag_pred} | {base_pred} | {rag_hit} | {base_hit} |"
+            f"| {_md_cell(d['id'])} | {_md_cell(d['description'])} | {_md_cell(expected)} "
+            f"| {_md_cell(rag_pred)} | {_md_cell(base_pred)} | {rag_hit} | {base_hit} |"
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/eval/report.py` around lines 44 - 46, The table output can break if
free-form fields contain '|' or newlines; add a small helper like
escape_cell(text) and use it when building the row so d['description'],
expected, rag_pred, base_pred, rag_hit, and base_hit are sanitized before
interpolation; the helper should replace pipe characters (e.g. '|' -> '\|') and
convert or remove newlines (e.g. '\n' -> ' ' or '<br>') to keep Markdown table
cells intact and then call escape_cell(...) for each of those variables in the
lines.append row construction.
scripts/eval/runners/whistle_runner.py (2)

79-86: Broad exception handling is acceptable here but could mask unexpected errors.

The except Exception is flagged by static analysis (BLE001), but in evaluation contexts this pattern is reasonable to ensure one failing case doesn't abort the entire evaluation run. Consider logging the exception type for better diagnostics.

Optional: Log exception type for better debugging
         except Exception as e:
-            logger.error("Whistle failed for %s: %s", case_id, e)
+            logger.error("Whistle failed for %s: %s: %s", case_id, type(e).__name__, e)
             predicted_articles = []
             predicted_decision = "error"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/eval/runners/whistle_runner.py` around lines 79 - 86, The catch-all
except in the whistle runner currently swallows exception context; update the
except block around the call to _run_single_whistle(case) so the logger.error
includes the exception type and details (e.g., append type(e).__name__ and/or
enable exc_info=True in logger.error) while preserving the existing fallback
assignment to predicted_articles = [] and predicted_decision = "error" for
case_id; keep the broad except but enhance the log to show the exception type
and stack trace for better diagnostics.

23-29: Edge case: _normalize_article returns original string if no digits found.

If an article string contains no numbers (e.g., "Appendix A"), the function returns the original string. This may cause citation mismatches if expected articles are normalized differently. Ensure the dataset's expected_articles values are pre-normalized or handle this case explicitly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/eval/runners/whistle_runner.py` around lines 23 - 29, The
_normalize_article function currently returns the original article string when
no digits are found, which can cause mismatches; update _normalize_article to
explicitly handle non-numeric cases (e.g., return a normalized fallback such as
article.strip().lower() or a canonical token like "" or "appendix-a" after
normalizing whitespace and punctuation) and ensure the dataset's
expected_articles are pre-normalized to the same scheme before comparison;
reference the function _normalize_article and the dataset field
expected_articles when making the change so both normalization paths match.
scripts/eval/runners/gear_runner.py (3)

24-34: Module-level cache is not thread-safe.

The _MODEL_TO_ID dict is populated lazily on first access. While this works for single-threaded evaluation, if this module is ever used in a concurrent context, multiple threads could race to populate the cache simultaneously.

For a single-threaded evaluation script, this is acceptable. Just noting for awareness if future usage patterns change.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/eval/runners/gear_runner.py` around lines 24 - 34, The module-level
cache _MODEL_TO_ID is lazily populated in _get_model_to_id_map and is not
thread-safe; to fix, guard the initialization with a lock or switch to a
thread-safe approach: add a module-level threading.Lock (e.g.,
_MODEL_TO_ID_LOCK) and acquire it around the population block inside
_get_model_to_id_map (check again inside the lock to avoid double-init), or
alternate use functools.lru_cache on a pure function that returns the mapping;
reference _MODEL_TO_ID and _get_model_to_id_map when making the change and
ensure reads after initialization do not hold the lock for performance.

37-45: Silent fallback may hide resolution failures.

If shoe_id_or_name doesn't start with "shoe_" and the model name lookup fails, the function silently returns the original shoe_id_or_name. This could cause downstream metric calculation to fail silently or produce misleading results.

Consider logging a warning when the lookup fails to aid debugging.

Add warning log for failed lookups
 def _resolve_shoe_id(shoe_id_or_name: str, model_name: str) -> str:
     """Resolve to canonical shoe_id. Falls back to model_name lookup."""
     if shoe_id_or_name.startswith("shoe_"):
         return shoe_id_or_name
     lookup = _get_model_to_id_map()
     resolved = lookup.get(model_name.lower())
     if resolved:
         return resolved
+    logger.warning(
+        "Could not resolve shoe_id for model_name='%s', returning '%s'",
+        model_name, shoe_id_or_name
+    )
     return shoe_id_or_name
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/eval/runners/gear_runner.py` around lines 37 - 45, The function
_resolve_shoe_id currently silently returns the input when a model_name lookup
fails; update it to emit a warning so failures are visible: use the module
logger (e.g., import logging and getLogger(__name__) or use the existing logger
if present) and call logger.warning including both shoe_id_or_name and
model_name when resolved is None before returning shoe_id_or_name; keep the
return behavior intact but ensure the warning message clearly states the lookup
failed and which keys were used, referencing _resolve_shoe_id and
_get_model_to_id_map so the log helps trace the lookup source.

109-120: Broad exception handling acknowledged.

The except Exception blocks (lines 111, 118) are flagged by static analysis but are appropriate here for evaluation resilience. Similar to the whistle runner, consider including the exception type in logs for better diagnostics.

Optional: Include exception type in error logs
         except Exception as e:
-            logger.error("RAG failed for %s: %s", case_id, e)
+            logger.error("RAG failed for %s: %s: %s", case_id, type(e).__name__, e)
             rag_predicted = []

         # Baseline
         try:
             baseline_predicted = _run_single_baseline(case)
         except Exception as e:
-            logger.error("Baseline failed for %s: %s", case_id, e)
+            logger.error("Baseline failed for %s: %s: %s", case_id, type(e).__name__, e)
             baseline_predicted = []
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/eval/runners/gear_runner.py` around lines 109 - 120, The except
Exception blocks around _run_single_rag(case) and _run_single_baseline(case)
should log the exception type as well as the message for better diagnostics;
update the logger.error calls (for logger.error("RAG failed for %s: %s",
case_id, e) and logger.error("Baseline failed for %s: %s", case_id, e)) to
include the exception class name (e.__class__.__name__ or type(e)) alongside the
exception text so logs show both the exception type and message while retaining
the existing broad exception handling and fallback to an empty prediction list.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/eval/datasets/gear_cases.json`:
- Around line 128-131: The expected_shoe_ids array for the gear-10 test case
includes shoe_034 (All-Pro NITRO 2 at 159K) which exceeds the 150K budget;
remove "shoe_034" from the expected_shoe_ids list and ensure the accompanying
"rationale" for gear-10 (which already notes All-Pro NITRO 2 exceeds 150K)
remains consistent with that change. Target the gear-10 object, updating the
expected_shoe_ids field and leaving the rationale field as-is (or adjust it only
if you intentionally want to keep shoe_034).

In `@scripts/eval/report.py`:
- Around line 69-80: The table currently shows a "Match" column derived from
decision_match inside the for d in details loop (variable match = "O" if
d["decision_match"] else "X"), which confuses citation auditing; change the
header lines (the two strings in lines[...] that define the table header and
separator) to either rename "Match" to "Decision Match" and/or add a new
"Citation Hit" column, then compute a citation_hit value (e.g., citation_hit =
"O" if set(d["predicted_articles"]) == set(d["expected_articles"]) else "X" or a
more appropriate comparison) alongside match and include citation_hit in the row
f-string (the same f-string that references d['id'], d['description'],
d['expected_decision'], d['predicted_decision'], match, exp_art, pred_art) so
each row shows both decision match and article/citation success.

In `@scripts/eval/results/eval_report_20260318_151043.md`:
- Around line 1-72: This file is an obsolete pre-fix evaluation snapshot that
duplicates/conflicts with the corrected report; remove the file named
eval_report_20260318_151043.md from the PR (delete it from the repository and
commit the deletion) so only the corrected report eval_report_20260318_151753.md
remains; ensure the commit removes the file from git (git rm or your VCS
equivalent) and push the change so reviewers see only the final evaluation
output.

In `@scripts/eval/runners/gear_runner.py`:
- Around line 56-80: The initial_state built in _run_single_rag is missing the
required GearAgentState field context, causing downstream access errors; update
_run_single_rag to populate "context" (a List[Document]) with the retrieved RAG
documents before calling gear_agent_graph.invoke, or if retrieval lives
elsewhere, add a default empty list under the "context" key in initial_state;
reference GearAgentState, initial_state, and gear_agent_graph.invoke when
locating where to add the context value so the workflow that reads context (for
separating shoe and player data) always has a defined List[Document].

In `@scripts/eval/runners/skill_runner.py`:
- Around line 132-148: The current try/except around _run_single_skill masks all
failures (transport/auth/graph/runtime) by catching Exception; narrow the error
handling so only expected per-case response errors are converted to a FAIL:
update the block around
_run_single_skill/_check_equipment_constraint/_check_time_constraint to catch
only the specific per-case exception type(s) (or introduce a
PerCaseResponseError and have _run_single_skill raise it for normal bad
responses) and in that except set equip_pass=False, time_pass=False and
step_summary=[]. For any other unexpected Exception, log with logger.error
including case_id and full error info and re-raise the exception so execution
failures surface instead of being folded into metrics.
- Around line 55-58: The loop over EQUIPMENT_KEYWORDS incorrectly exempts "ball"
from equipment validation, causing ball-related steps to bypass the allowed
check; update the loop in scripts/eval/runners/skill_runner.py so that only
equipment present in the allowed set is skipped (remove the special-case equip
== "ball") and treat "ball" like any other equipment keyword when checking
allowed, ensuring the validation uses EQUIPMENT_KEYWORDS and allowed
consistently.

---

Nitpick comments:
In `@scripts/eval/report.py`:
- Around line 44-46: The table output can break if free-form fields contain '|'
or newlines; add a small helper like escape_cell(text) and use it when building
the row so d['description'], expected, rag_pred, base_pred, rag_hit, and
base_hit are sanitized before interpolation; the helper should replace pipe
characters (e.g. '|' -> '\|') and convert or remove newlines (e.g. '\n' -> ' '
or '<br>') to keep Markdown table cells intact and then call escape_cell(...)
for each of those variables in the lines.append row construction.

In `@scripts/eval/runners/gear_runner.py`:
- Around line 24-34: The module-level cache _MODEL_TO_ID is lazily populated in
_get_model_to_id_map and is not thread-safe; to fix, guard the initialization
with a lock or switch to a thread-safe approach: add a module-level
threading.Lock (e.g., _MODEL_TO_ID_LOCK) and acquire it around the population
block inside _get_model_to_id_map (check again inside the lock to avoid
double-init), or alternate use functools.lru_cache on a pure function that
returns the mapping; reference _MODEL_TO_ID and _get_model_to_id_map when making
the change and ensure reads after initialization do not hold the lock for
performance.
- Around line 37-45: The function _resolve_shoe_id currently silently returns
the input when a model_name lookup fails; update it to emit a warning so
failures are visible: use the module logger (e.g., import logging and
getLogger(__name__) or use the existing logger if present) and call
logger.warning including both shoe_id_or_name and model_name when resolved is
None before returning shoe_id_or_name; keep the return behavior intact but
ensure the warning message clearly states the lookup failed and which keys were
used, referencing _resolve_shoe_id and _get_model_to_id_map so the log helps
trace the lookup source.
- Around line 109-120: The except Exception blocks around _run_single_rag(case)
and _run_single_baseline(case) should log the exception type as well as the
message for better diagnostics; update the logger.error calls (for
logger.error("RAG failed for %s: %s", case_id, e) and logger.error("Baseline
failed for %s: %s", case_id, e)) to include the exception class name
(e.__class__.__name__ or type(e)) alongside the exception text so logs show both
the exception type and message while retaining the existing broad exception
handling and fallback to an empty prediction list.

In `@scripts/eval/runners/whistle_runner.py`:
- Around line 79-86: The catch-all except in the whistle runner currently
swallows exception context; update the except block around the call to
_run_single_whistle(case) so the logger.error includes the exception type and
details (e.g., append type(e).__name__ and/or enable exc_info=True in
logger.error) while preserving the existing fallback assignment to
predicted_articles = [] and predicted_decision = "error" for case_id; keep the
broad except but enhance the log to show the exception type and stack trace for
better diagnostics.
- Around line 23-29: The _normalize_article function currently returns the
original article string when no digits are found, which can cause mismatches;
update _normalize_article to explicitly handle non-numeric cases (e.g., return a
normalized fallback such as article.strip().lower() or a canonical token like ""
or "appendix-a" after normalizing whitespace and punctuation) and ensure the
dataset's expected_articles are pre-normalized to the same scheme before
comparison; reference the function _normalize_article and the dataset field
expected_articles when making the change so both normalization paths match.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d76642ca-e2f1-4e14-b6c8-14dcb424a81d

📥 Commits

Reviewing files that changed from the base of the PR and between 3b94eb8 and 5a48d51.

📒 Files selected for processing (14)
  • scripts/eval/__init__.py
  • scripts/eval/baseline.py
  • scripts/eval/datasets/gear_cases.json
  • scripts/eval/datasets/skill_cases.json
  • scripts/eval/datasets/whistle_cases.json
  • scripts/eval/metrics.py
  • scripts/eval/report.py
  • scripts/eval/results/eval_report_20260318_151043.md
  • scripts/eval/results/eval_report_20260318_151753.md
  • scripts/eval/runners/__init__.py
  • scripts/eval/runners/gear_runner.py
  • scripts/eval/runners/skill_runner.py
  • scripts/eval/runners/whistle_runner.py
  • scripts/evaluate.py


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
scripts/eval/report.py (1)

141-147: Use a single captured timestamp for consistent report metadata.

datetime.now() is called twice (Line 141 and Line 146), which can produce mismatched filename/header timestamps near second boundaries.

♻️ Suggested refactor
-    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+    now = datetime.now()
+    timestamp = now.strftime("%Y%m%d_%H%M%S")
@@
-        f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}",
+        f"Generated: {now.strftime('%Y-%m-%d %H:%M:%S')}",
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/eval/report.py` around lines 141 - 147, The timestamp is captured
twice causing potential mismatches; capture datetime.now() once into a single
variable (e.g., now or report_time) before building the filename variable
`timestamp` and the `lines` list, then use that single captured datetime when
formatting both the filename (`timestamp = ...`) and the header line
(`Generated: ...`) so both use the same instant (refer to the `timestamp`
variable and the `lines` list construction).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/eval/report.py`:
- Around line 144-145: Remove the unnecessary f-string prefixes on the two
static string literals used when building the report (the entries currently
written as f"# RAG Evaluation Report" and f""); replace them with plain string
literals "# RAG Evaluation Report" and "" so Ruff F541 is satisfied. Locate
these literals in the report generation block in scripts/eval/report.py (the
list/sequence of report lines) and update only those two items.

---

Nitpick comments:
In `@scripts/eval/report.py`:
- Around line 141-147: The timestamp is captured twice causing potential
mismatches; capture datetime.now() once into a single variable (e.g., now or
report_time) before building the filename variable `timestamp` and the `lines`
list, then use that single captured datetime when formatting both the filename
(`timestamp = ...`) and the header line (`Generated: ...`) so both use the same
instant (refer to the `timestamp` variable and the `lines` list construction).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: dbb229f2-f2eb-4ac2-8708-faa41aca072f

📥 Commits

Reviewing files that changed from the base of the PR and between 5a48d51 and e44637b.

📒 Files selected for processing (3)
  • .gitignore
  • scripts/eval/datasets/gear_cases.json
  • scripts/eval/report.py
✅ Files skipped from review due to trivial changes (1)
  • .gitignore
🚧 Files skipped from review as they are similar to previous changes (1)
  • scripts/eval/datasets/gear_cases.json

@zweadfx zweadfx merged commit 5e13a04 into main Mar 18, 2026
2 checks passed
@zweadfx zweadfx deleted the feat/rag-evaluation branch March 18, 2026 08:04


Successfully merging this pull request may close these issues.

[Feat] RAG 평가 시스템 구현 (Gear/Whistle/Skill Lab)
