Conversation
- Add benchwise/types.py with TypedDict, Protocol, and Literal definitions
- Add mypy.ini with strict type checking
- Add py.typed marker for PEP 561 compliance
- Improve type annotations in exceptions, config, and datasets modules
- Update ModelAdapter and all implementations with ModelConfig TypedDict
- Add PricingInfo TypedDict for pricing dictionaries
- Add proper type annotations to all __init__ and generate methods
- Fix type issues in cost estimation calculations
- Add proper Callable type signatures for evaluate, benchmark, and stress_test decorators
- Use ParamSpec and TypeVar for generic decorator typing
- Add Awaitable type annotations for async functions
- Type all **kwargs parameters as Any
- Update _run_evaluation with specific tuple[str, ...] type
- Add type ignore comments for dynamic _benchmark_metadata attributes
- All 124 tests passing (2 pre-existing benchmark test failures unchanged)
All mypy errors resolved (82 → 0). Fixed type narrowing, cast() calls for response.json(), and missing return type annotations. Tests: 124/126 passing.
Added GitHub Actions CI workflow, pre-commit hooks, and README development section with type checking info.
Walkthrough

Adds a centralized types module and stringent mypy config, applies pervasive static typing and return annotations across the package, introduces a GitHub Actions CI workflow and pre-commit mypy/ruff integration, updates lint dependencies and .gitignore, and performs widespread non-functional formatting and test refactors.

Changes
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60–75 minutes
Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
@coderabbitai review

✅ Actions performed: Review triggered.

@CodeRabbit full review

✅ Actions performed: Full review triggered.
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
benchwise/config.py (1)
451-459: Fix unused `anthropic_client` to satisfy Ruff

Ruff is right that `anthropic_client` is assigned but never used. Since you’re currently just assuming validity if the key exists (per the comment), the variable isn’t needed.

Minimal fix:

```diff
-    if os.getenv("ANTHROPIC_API_KEY"):
-        try:
-            import anthropic
-
-            anthropic_client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
-            # Note: Anthropic doesn't have a simple test endpoint
-            results["anthropic"] = True  # Assume valid if key exists
-        except Exception:
-            results["anthropic"] = False
+    if os.getenv("ANTHROPIC_API_KEY"):
+        try:
+            import anthropic  # noqa: F401
+            # Note: Anthropic doesn't have a simple test endpoint
+            results["anthropic"] = True  # Assume valid if key exists
+        except Exception:
+            results["anthropic"] = False
```

(or drop the `import anthropic` entirely if you’re comfortable treating any non-empty key as “valid”.)
🧹 Nitpick comments (8)
test_single_doc_file.py (2)
63-63: Remove redundant import.

The `re` module is already imported at the module level (line 29). This local import is unnecessary.

```diff
 def prepare_code_for_real_models(code: str) -> str:
     """
     Replace model names with real OpenAI and Google models.
     Ensures we use exactly 2 models: gpt-3.5-turbo and gemini-2.5-flash
     """
-    import re
-
     # Find all @evaluate decorators and replace models to ensure diversity
```
224-224: Add return type annotation.

Given the PR's focus on type annotations, this function should have an explicit `-> int` return type for consistency.

```diff
-def main():
+def main() -> int:
```

benchwise/metrics.py (3)
193-285: BLEU metrics: dynamic structure vs. strong result type

`bleu_score` now returns `Dict[str, Any]` while `BleuScores` in `benchwise/types.py` describes a more constrained schema. Given the richer keys here (`corpus_bleu`, `sentence_bleu`, `std_sentence_bleu`, `bleu_n`, etc.), keeping the broad `Dict[str, Any]` is reasonable, but the TypedDict can be misleading.

Consider either:

- Aligning `BleuScores` with this actual return structure, or
- Documenting that `BleuScores` is for a different, higher-level aggregation, not this function.
681-720: Perplexity: explicit float casting is good; consider similar consistency elsewhere

Casting `np.mean`/`np.median` to `float` for `mean_perplexity` and `median_perplexity` is good for JSON-serialization friendliness and type clarity.

You might consider doing the same in other metrics that still expose numpy scalars (`bert_score_metric`, `semantic_similarity`, `factual_correctness`, `coherence_score`, `safety_score`) for consistency.
899-969: Dead expression in keyword overlap helper

In `_calculate_enhanced_keyword_overlap`, this line is a no-op:

```python
" ".join(ref_words)
```

It doesn’t affect behavior and can be safely removed for clarity.

```diff
-    important_ref_words: set[str] = set()
-    " ".join(ref_words)
+    important_ref_words: set[str] = set()
```

benchwise/types.py (1)
42-139: Metric result TypedDicts vs. actual metric outputs

The metric result TypedDicts (`RougeScores`, `BleuScores`, `BertScoreResults`, `AccuracyResults`, `SemanticSimilarityResults`, etc.) don’t currently mirror the exact shapes returned by the implementations in `benchwise/metrics.py` (e.g., `bleu_score` uses `corpus_bleu`/`sentence_bleu` and `bleu_n` keys, `semantic_similarity` returns `mean_similarity`, `median_similarity`, etc.).

If these types are meant to describe:

- The raw function outputs → consider aligning the field names and optionality.
- Higher-level, normalized results → consider clarifying that in docstrings and possibly renaming to avoid confusion.
benchwise/datasets.py (1)
294-297: Unreachable code path.

This `raise ValueError` is unreachable because:

- `isinstance(source, dict)` → returns at line 221
- `isinstance(source, (str, Path))` → either returns or raises in all branches
- No other type satisfies the `Union[str, Path, Dict[str, Any]]` signature

The type checker may flag this as dead code. Consider removing it or adding a comment explaining it's a defensive fallback for type safety.

```diff
-    raise ValueError(
-        f"Unable to load dataset from source: {source}"
-    )
+    # This line should be unreachable given the type signature,
+    # but serves as a defensive fallback for runtime type errors
+    raise ValueError(f"Unable to load dataset from source: {source}")
```

benchwise/models.py (1)
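If you would rather have the checker prove the fallback is dead, an exhaustiveness guard is an alternative to the defensive comment. A sketch using a hand-rolled stand-in for `typing.assert_never` (Python 3.11+); `describe_source` and `_unreachable` are hypothetical names, not benchwise code:

```python
from pathlib import Path
from typing import Any, Dict, NoReturn, Union


def _unreachable(value: object) -> NoReturn:
    # Stand-in for typing.assert_never: when every Union member is handled,
    # mypy narrows `value` to Never here and flags any reachable call site.
    raise AssertionError(f"Unhandled source type: {type(value).__name__}")


def describe_source(source: Union[str, Path, Dict[str, Any]]) -> str:
    # Hypothetical helper mirroring load_dataset's dispatch structure
    if isinstance(source, dict):
        return "inline dict"
    if isinstance(source, (str, Path)):
        return "path or URL"
    return _unreachable(source)
```

This keeps the runtime fallback for untyped callers while letting mypy verify the dispatch is exhaustive.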
90-91: Redundant casts on `PricingInfo` values.

Since `PricingInfo` already defines `input` and `output` as `float`, the `float()` casts are unnecessary. However, they're harmless and provide runtime safety if the dict comes from an untrusted source.
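Worth spelling out why the casts are only "harmless": `TypedDict` annotations are erased at runtime, so nothing validates the values. A sketch of the failure mode the `float()` calls guard against (the dict literal is illustrative, not benchwise's pricing table):

```python
from typing import Any, Dict, TypedDict


class PricingInfo(TypedDict):
    input: float
    output: float


# TypedDict does no runtime checking: a stray string from JSON or user
# config would satisfy the annotation as far as the interpreter cares.
untrusted: Dict[str, Any] = {"input": "0.5", "output": 1.5}

# float(...) coerces the string and fails loudly on garbage, which is the
# runtime safety the review comment refers to.
price_per_input_token = float(untrusted["input"])
price_per_output_token = float(untrusted["output"])
```

So keeping the casts is a defensible choice whenever the dict's provenance is not fully controlled.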
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (18)
- .github/workflows/ci.yml (1 hunks)
- .pre-commit-config.yaml (1 hunks)
- GEMINI.md (1 hunks)
- README.md (5 hunks)
- benchwise/cli.py (9 hunks)
- benchwise/client.py (18 hunks)
- benchwise/config.py (14 hunks)
- benchwise/core.py (10 hunks)
- benchwise/datasets.py (14 hunks)
- benchwise/exceptions.py (2 hunks)
- benchwise/logging.py (1 hunks)
- benchwise/metrics.py (24 hunks)
- benchwise/models.py (11 hunks)
- benchwise/results.py (12 hunks)
- benchwise/types.py (1 hunks)
- mypy.ini (1 hunks)
- pyproject.toml (1 hunks)
- test_single_doc_file.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (7)
- benchwise/config.py (1)
  - benchwise/types.py (1)
    - `ConfigDict` (180-194)
- benchwise/models.py (2)
  - benchwise/types.py (2)
    - `ModelConfig` (24-33)
    - `PricingInfo` (36-39)
  - tests/conftest.py (1)
    - `generate` (180-186)
- benchwise/datasets.py (1)
  - benchwise/types.py (2)
    - `DatasetItem` (142-157)
    - `DatasetMetadata` (160-168)
- benchwise/cli.py (3)
  - benchwise/metrics.py (1)
    - `accuracy` (439-550)
  - benchwise/datasets.py (1)
    - `references` (67-81)
  - run_tests.py (1)
    - `main` (110-250)
- benchwise/client.py (3)
  - tests/test_rate_limits.py (1)
    - `json` (18-19)
  - benchwise/core.py (1)
    - `benchmark` (136-154)
  - benchwise/datasets.py (1)
    - `get` (420-421)
- benchwise/types.py (3)
  - benchwise/metrics.py (1)
    - `accuracy` (439-550)
  - benchwise/datasets.py (3)
    - `size` (46-48)
    - `prompts` (51-64)
    - `references` (67-81)
  - benchwise/results.py (1)
    - `success` (37-39)
- benchwise/core.py (4)
  - benchwise/metrics.py (1)
    - `evaluate` (1494-1510)
  - benchwise/datasets.py (1)
    - `Dataset` (13-200)
  - benchwise/results.py (1)
    - `EvaluationResult` (12-76)
  - benchwise/cli.py (1)
    - `run_evaluation` (114-285)
🪛 GitHub Actions: CI
benchwise/config.py
[error] 455-455: ruff: Local variable 'anthropic_client' assigned to but never used
benchwise/datasets.py
[error] 9-9: ruff: 'DatasetItem' imported but unused
[error] 9-9: ruff: 'DatasetMetadata' imported but unused
benchwise/cli.py
[error] 8-8: ruff: 'typing.Dict' imported but unused
benchwise/metrics.py
[error] 1-1: ruff: 'Union' imported but unused
[error] 3-3: ruff: 'NDArray' imported but unused
[error] 4-4: ruff: 'BleuScores' imported but unused
[error] 4-4: ruff: 'BertScoreResults' imported but unused
[error] 4-4: ruff: 'AccuracyResults' imported but unused
🔇 Additional comments (54)
GEMINI.md (1)
1-1: LGTM!

Simple cross-reference to consolidate AI assistant instructions in a single location (CLAUDE.md).
README.md (1)
181-227: Good documentation of the new type safety requirements.

The Development section clearly documents the type checking setup, test commands, and code quality tooling. This aligns well with the PR's goal of introducing strict type checking across the codebase.
benchwise/logging.py (1)
71-71: LGTM!

Adding the explicit `-> None` return type annotation improves type safety and aligns with the PR's objective of comprehensive type annotations.

benchwise/exceptions.py (1)
7-8: LGTM!

The addition of the `Optional` import and explicit type annotations on `RateLimitError.__init__` improves type safety without changing runtime behavior.

Also applies to: 22-25
benchwise/metrics.py (5)
64-162: ROUGE-L typing and casting look consistent

Using `RougeScores` as the return type and explicitly casting `np.mean`/`np.std` outputs to `float` avoids numpy-scalar surprises and matches the TypedDict shape (including the nested `scores: Dict[str, List[float]]`). No issues from a typing or behavior standpoint here.

287-303: Smoothing function typing is fine; ignores are appropriate

Returning `Optional[Callable[..., Any]]` and using `# type: ignore[no-any-return]` on bound methods is a pragmatic way to appease mypy given the NLTK stubs. The behavior remains unchanged and matches `sentence_bleu`’s expectations.

1012-1088: Coherence score casting and component typing look solid

Typing `component_scores` as `Dict[str, List[float]]` and casting `overall_coherence` to `float` keeps the result structure predictable and avoids numpy scalars leaking into top-level fields. This plays nicely with mypy and downstream JSON consumers.

1260-1441: Safety score: good use of optional parameters and typed structures

The added type annotations for `safety_categories`, `perspective_api_key`, `category_scores`, and `violation_details` are consistent with how the function is used. Logic remains unchanged and the result dict structure is clear.

1484-1511: MetricCollection typing matches usage

Using `Dict[str, Tuple[Callable[..., Any], Dict[str, Any]]]` for `self.metrics` and an optional `references` parameter in `evaluate` lines up with how you register and invoke the metric functions. This should make mypy happier without altering behavior.

benchwise/types.py (2)
141-217: Dataset and config TypedDicts look appropriate

`DatasetItem`, `DatasetMetadata`, `DatasetSchema`, and `ConfigDict` line up with how datasets and config are described in the rest of the codebase (e.g., `benchwise/datasets.py`, `BenchwiseConfig.to_dict`). Using `total=False` is sensible given the optional nature of many fields.

228-291: Protocols provide useful structural contracts

The `SupportsGenerate`, `SupportsCache`, and `SupportsMetrics` Protocols capture the expected interfaces cleanly and should integrate well with adapters and caching layers. Async `generate` and sync `evaluate`/`save`/`load` signatures look reasonable.
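For readers unfamiliar with structural typing, the value of these Protocols is that adapters conform by shape alone, with no inheritance required. A minimal sketch, assuming a `generate` signature modeled on the description above (not benchwise's exact one):

```python
import asyncio
from typing import Any, List, Protocol


class SupportsGenerate(Protocol):
    # Structural contract: any object whose generate method matches this
    # signature conforms. Signature is an assumption for illustration.
    async def generate(self, prompts: List[str], **kwargs: Any) -> List[str]: ...


class EchoAdapter:
    # Note: no base class -- conformance is purely structural.
    async def generate(self, prompts: List[str], **kwargs: Any) -> List[str]:
        return [f"echo: {p}" for p in prompts]


async def run(model: SupportsGenerate) -> List[str]:
    return await model.generate(["hello"])


result = asyncio.run(run(EchoAdapter()))
```

Because conformance is checked structurally, test doubles like `EchoAdapter` slot in without touching the adapter hierarchy.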
69-75: Typing stub additions align with mypy usage

Adding `pandas-stubs` and `types-requests` to the `lint` extra matches the new mypy/pre-commit configuration and avoids `ignore_missing_imports` for those libraries. No issues here.

.pre-commit-config.yaml (1)

20-28: Mypy pre-commit hook is wired correctly

Using `mirrors-mypy` with `additional_dependencies` matching your stub packages and `args: [--config-file=mypy.ini]` looks good. Restricting to `files: ^benchwise/` keeps checks focused on library code.

.github/workflows/ci.yml (1)

1-36: CI workflow matches tooling stack

The CI job (ruff lint/format, mypy with `mypy.ini`, and basic tests via `run_tests.py`) is consistent with the new typing and linting setup. Using Python 3.12 while configuring mypy for 3.11 is fine given 3.11 is your minimum supported version.

mypy.ini (2)

1-24: Strict mypy configuration is coherent

Enabling `strict` plus the additional flags, and limiting checks to `benchwise`, is a solid baseline. Showing error codes will help with targeted suppressions.

25-63: Third-party ignore sections are sensible

Ignoring or skipping imports for libraries without stubs (rouge_score, bert_score, nltk, transformers, etc.) while relying on typed stubs for pandas/requests matches the dependencies declared in `pyproject.toml`. This should keep noise down while still catching real type issues.

benchwise/cli.py (2)
205-235: Using `Any` for metric_result is pragmatic

Annotating `metric_result` as `Any` when indexing into the dict returned by `accuracy` avoids over-constraining the type and matches the current implementation of `benchwise.metrics.accuracy`. No behavior change, and it keeps mypy satisfied.

288-513: CLI helper annotations improve clarity

Adding explicit `-> None`/`-> BenchmarkResult` return types to `configure_api`, `sync_offline`, `show_status`, `list_resources`, `validate_dataset`, `compare_results`, and `main` makes the CLI surface clearer without altering behavior. The async/sync boundaries and `asyncio.run` usage remain consistent.

benchwise/config.py (5)
56-65: Config list fields and post_init typing look good

Typing `default_models` and `default_metrics` as `List[str]` with `field(default_factory=list)`, plus a `__post_init__` that delegates to the existing loaders/validators, is a clean improvement and lines up with `ConfigDict` in `benchwise.types`.

174-233: ConfigDict integration is consistent

`to_dict` returning `ConfigDict` (with a masked `api_key`) matches the TypedDict definition and keeps the shape of configuration data explicit for any consumers (printing, status commands, etc.).

274-317: configure_benchwise kwargs typing is appropriate

Typing `**kwargs: Any` and re-validating via `_validate_config()` after applying overrides is a good balance between flexibility and safety. This should work well under the new strict mypy settings.

414-419: validate_api_connection: clear boolean typing

Explicitly annotating the inner `check_connection() -> bool` and wrapping the status code comparison in `bool(...)` clarifies intent and aligns with the function’s declared `-> bool` return type. Behavior is unchanged.

487-516: print_configuration_status nicely composes existing helpers

Using `config.print_config()`, `validate_api_connection`, `validate_api_keys`, and `get_cache_dir` in `print_configuration_status` gives a good, centralized health view. The function’s `-> None` annotation and printed statuses match its intended use as a diagnostic helper.

benchwise/datasets.py (5)
29-38: LGTM!

The `-> None` return type annotation for `__post_init__` is correct and aligns with dataclass conventions.

83-91: LGTM!

The `Callable[[Dict[str, Any]], bool]` type hint accurately describes the filter condition, and the defensive `self.metadata or {}` pattern prevents potential `None` spread errors.

176-199: LGTM!

The explicit type annotations and type narrowing pattern (filtering with `isinstance` before numeric operations) satisfy strict type checkers while maintaining runtime safety.

299-408: LGTM!

The `**kwargs: Any` type annotations on `create_qa_dataset`, `create_summarization_dataset`, and `create_classification_dataset` are appropriate for their flexible metadata usage.

411-431: LGTM!

The `-> None` return type annotations on `__init__`, `register`, `remove`, and `clear` methods correctly indicate they perform mutations without returning values.

benchwise/client.py (7)
5-5: LGTM!

Adding `cast` from typing enables explicit type assertions for JSON response handling throughout the client.

22-26: LGTM!

The `Optional[int]` and `Optional[str]` type annotations correctly reflect that these exception parameters may be absent.

59-78: LGTM!

The type annotations for `offline_queue`, async context manager methods, and `close()` are correct. Using `Any` for `__aexit__` exception parameters is a common and acceptable pattern.

325-360: LGTM!

The `cast()` calls provide type safety for JSON responses. These assume a stable API contract; if the API response structure changes, runtime errors would occur at the cast sites, making debugging easier.
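Since `cast()` does no runtime checking, the trade-off described here is easy to demonstrate. A small sketch (the payload shape is illustrative, not the real API contract):

```python
from typing import Any, Dict, cast


def parse_payload(raw: Any) -> Dict[str, Any]:
    # response.json() is typed as Any; cast() tells the checker what shape
    # we expect without changing runtime behavior -- it is a pure no-op.
    return cast(Dict[str, Any], raw)


# Simulated response.json() output (illustrative, not the real schema)
data = parse_payload({"id": "run-1", "status": "ok"})

# Because cast() never validates, a mismatched payload "succeeds" here and
# only fails later, at whichever use site assumes dict semantics.
wrong = cast(Dict[str, Any], [1, 2, 3])
```

This is why a schema change surfaces as a runtime error at the cast's use sites rather than at the cast itself.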
467-504: LGTM!

The `Optional[Dict[str, Any]]` type for `metadata` and consistent use of `cast()` for response handling align with the file's type safety approach.

744-751: LGTM!

The `-> None` return type annotation correctly indicates this method performs a side effect (appending to queue) without returning a value.

824-828: LGTM!

The nested `cast()` calls for extracting `file_info["url"]` from the response provide type safety for the chained dictionary access.

benchwise/models.py (5)
4-5: LGTM!

Importing `ModelConfig` and `PricingInfo` from the new types module enables consistent typing across all model adapters.

10-17: LGTM!

The type annotations are correct. Since `ModelConfig` is a `TypedDict` with `total=False`, an empty dict `{}` is a valid instance, making `config or {}` type-safe.

47-52: LGTM!

The `Dict[str, PricingInfo]` annotation correctly describes the pricing structure, and all model entries conform to the `PricingInfo` TypedDict with required `input` and `output` float keys.

95-273: LGTM!

All model adapters (`AnthropicAdapter`, `GoogleAdapter`, `HuggingFaceAdapter`, `MockAdapter`) consistently implement the type annotations with `Optional[ModelConfig]` for config and `List[str]` for generate return types.

275-298: LGTM!

The factory function correctly accepts `Optional[ModelConfig]` and returns the `ModelAdapter` base type, allowing for polymorphic usage of different adapter implementations.

benchwise/results.py (6)
60-76: LGTM!

The `Optional[str]` type for `metric_name` correctly indicates that `None` means "return the main result," and the `Union[float, Any]` return type accommodates both numeric scores and arbitrary result types.

96-154: LGTM!

The `-> None` for `add_result` and `Optional[EvaluationResult]` returns for `get_best_model`/`get_worst_model` correctly handle empty results scenarios.

156-196: LGTM!

The `compare_models` method correctly uses `Optional[str]` for `metric_name` and returns a comprehensive comparison dictionary with consistent typing.

244-253: LGTM!

The `-> None` return type annotations on `save_to_json` and `save_to_csv` correctly indicate these are side-effect-only methods.

258-299: LGTM!

The explicit `Dict[str, Any]` type annotation on `comparison` and the conversion of `set()` to `list()` before return ensures type consistency.

547-549: LGTM!

The `-> None` return type correctly indicates this function performs file I/O without returning a value.

benchwise/core.py (8)
1-1: LGTM!

Adding `ParamSpec`, `TypeVar`, `Awaitable`, and `Union` enables sophisticated generic typing for decorators while maintaining runtime flexibility with `Any`.

13-16: LGTM!

`ParamSpec` and `TypeVar` definitions follow the standard pattern for preserving callable signatures in decorator type annotations.
20-60: LGTM!

The complex return type annotation accurately describes the decorator transformation: a test function becomes a wrapper that accepts a `Dataset` and returns `List[EvaluationResult]`. The `# type: ignore[attr-defined]` is appropriate for the dynamic `_benchmark_metadata` attribute.

136-154: LGTM!

The `benchmark` decorator correctly uses `Callable[[Callable[P, R]], Callable[P, R]]` to indicate it preserves the original function's signature. The `# type: ignore[attr-defined]` is appropriate for the dynamic metadata attribute.

157-194: LGTM!

The `stress_test` decorator correctly uses `ParamSpec` for signature preservation and `Union[R, BaseException]` to reflect `asyncio.gather(return_exceptions=True)` semantics.
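The `Union[R, BaseException]` element type falls directly out of how `asyncio.gather(return_exceptions=True)` behaves. A sketch (`task` and `run_all` are illustrative names, not benchwise code):

```python
import asyncio
from typing import List, Union


async def task(i: int) -> int:
    if i == 2:
        raise ValueError("boom")
    return i * 10


async def run_all() -> List[Union[int, BaseException]]:
    # return_exceptions=True folds failures into the result list instead of
    # raising, which is why each element widens to Union[R, BaseException].
    return await asyncio.gather(
        *(task(i) for i in range(4)), return_exceptions=True
    )


results = asyncio.run(run_all())
# e.g. [0, 10, ValueError('boom'), 30]
```

Callers of a stress test typed this way are forced to check `isinstance(r, BaseException)` before treating a result as a value, which is exactly the behavior `gather` imposes at runtime.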
200-272: LGTM!

The `EvaluationRunner` class has comprehensive type annotations: explicit field types in `__init__`, `Callable[..., Awaitable[Any]]` for flexible test functions, and `Optional[str]` for optional metric selection.

275-298: LGTM!

The `run_benchmark` and `quick_eval` functions have precise type annotations: `Callable[..., Awaitable[Any]]` for flexible async functions and `Callable[[str], float]` for the metric function that scores individual responses.
63-70: Likely an incorrect or invalid review comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
benchwise/datasets.py (1)
213-305: `load_dataset` URL handling is broken for HTTP(S) JSON sources

The code calls `source_path = Path(source)` before checking for HTTP URLs (line 234), causing strings like `"https://example.com/data.json"` to be treated as local paths with a `.json` suffix. The condition `if source_path.suffix == ".json"` (line 237) then executes first, attempting to open the URL as a local file with `open(source_path, "r")` (line 238), resulting in `FileNotFoundError`. The later `elif str(source).startswith(("http://", "https://"))` branch (line 271) is unreachable for any `.json` URL.

Detect HTTP(S) URLs before treating the source as a filesystem path:

```diff
-    elif isinstance(source, (str, Path)):
-        source_path = Path(source)
-
-        if source_path.suffix == ".json":
+    elif isinstance(source, (str, Path)):
+        # Handle HTTP(S) URLs first
+        if isinstance(source, str) and source.startswith(("http://", "https://")):
+            source_str = source
+            response = requests.get(source_str)
+            response.raise_for_status()
+
+            if source_str.endswith(".json"):
+                data = response.json()
+                if isinstance(data, dict) and "data" in data:
+                    return Dataset(
+                        name=data.get("name", "remote_dataset"),
+                        data=data["data"],
+                        metadata=data.get("metadata", {}),
+                        schema=data.get("schema"),
+                    )
+                elif isinstance(data, list):
+                    return Dataset(
+                        name=kwargs.get("name", "remote_dataset"),
+                        data=data,
+                        metadata=kwargs.get("metadata", {}),
+                    )
+                else:
+                    raise ValueError(
+                        f"Invalid JSON format from '{source_str}'. Expected a list or a dict with 'data' key."
+                    )
+            else:
+                raise ValueError(
+                    f"Unsupported URL format '{source_str}'. Only .json URLs are supported."
+                )
+
+        else:
+            source_path = Path(source)
+
+            if source_path.suffix == ".json":
                 ...
-        elif source_path.suffix == ".csv":
+            elif source_path.suffix == ".csv":
                 ...
-        elif str(source).startswith(("http://", "https://")):
-            # Convert to str for requests.get
-            source_str = str(source)
-            response = requests.get(source_str)
-            ...
-        else:
-            raise ValueError(
-                f"Unsupported file format '{source_path.suffix}'. Supported formats: .json, .csv"
-            )
+            else:
+                raise ValueError(
+                    f"Unsupported file format '{source_path.suffix}'. Supported formats: .json, .csv"
+                )
```
🧹 Nitpick comments (9)
benchwise/cli.py (2)
8-8: Tighten typing for `metric_result`/`results` instead of using `Any`

`accuracy` (and the other metric functions) return dictionaries, so declaring `metric_result: Any = ...` loses most of the benefit of the new type hints, and `results` is currently untyped as well.

If you want stricter static checking here, consider something along these lines:

```diff
-from typing import List, Optional, Any
+from typing import Any, Dict, List, Optional
 ...
-    results = {}
+    results: Dict[str, Any] = {}
     for metric_name in metrics:
-        try:
-            if metric_name == "accuracy":
-                metric_result: Any = accuracy(responses, references)
+        try:
+            metric_result: Dict[str, Any]
+            if metric_name == "accuracy":
+                metric_result = accuracy(responses, references)
                 results["accuracy"] = metric_result["accuracy"]
             elif metric_name == "rouge_l":
                 metric_result = rouge_l(responses, references)
                 results["rouge_l_f1"] = metric_result["f1"]
             elif metric_name == "semantic_similarity":
                 metric_result = semantic_similarity(responses, references)
```

This keeps runtime behavior identical while giving mypy/pyright a clearer view of the data flowing through this block.

Also applies to: 205-213
288-355: CLI helper type hints look consistent; consider narrowing `args: Any`

The new return annotations and parameter types for `list_resources`, `validate_dataset`, `compare_results`, and `main` look correct and match how these functions are used from the CLI.

For the async helpers that take `args`:

```python
async def configure_api(args: Any) -> None:
async def sync_offline(args: Any) -> None:
async def show_status(args: Any) -> None:
```

`args` is always produced by `parser.parse_args()`, so its runtime type is `argparse.Namespace`. If you want slightly stronger typing without major refactors, you could narrow these to:

```diff
 import argparse
 ...
-async def configure_api(args: Any) -> None:
+async def configure_api(args: argparse.Namespace) -> None:
 ...
-async def sync_offline(args: Any) -> None:
+async def sync_offline(args: argparse.Namespace) -> None:
 ...
-async def show_status(args: Any) -> None:
+async def show_status(args: argparse.Namespace) -> None:
```

This is optional but makes the intent clearer and avoids `Any` in a few more places while keeping the current CLI behavior unchanged.

Also applies to: 357-413, 415-479, 481-483, 514-571
benchwise/metrics.py (4)
25-44: Float casting in `_bootstrap_confidence_interval` stabilizes the return type

Wrapping `np.mean`/`np.percentile` results in `float(...)` ensures the helper always returns plain Python floats, which is friendlier for typing and JSON serialization. Behavior for empty `scores` (returning `(nan, nan)` when `scores == []`) is unchanged but a bit surprising; consider explicitly guarding against the empty-list case if you ever expect it.
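If you do add the guard, it can be as simple as an early return. An illustrative re-implementation of the idea using only the stdlib (not the actual helper, which uses `np.percentile`):

```python
import math
import random
from typing import List, Tuple


def bootstrap_ci(
    scores: List[float], n_resamples: int = 1000,
    alpha: float = 0.05, seed: int = 0,
) -> Tuple[float, float]:
    # Explicit empty-list guard: the (nan, nan) behavior is now overt
    # instead of falling out of numpy's empty-array semantics.
    if not scores:
        return (math.nan, math.nan)
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        sample = [rng.choice(scores) for _ in scores]
        means.append(sum(sample) / len(sample))
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return (float(lo), float(hi))


low, high = bootstrap_ci([0.7, 0.8, 0.9, 0.85])
empty = bootstrap_ci([])
```

The `float(...)` wrapping at the return mirrors the casting praised above: callers always get plain Python floats.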
69-69: `rouge_l` result now matches the structured `RougeScores` shape

The `RougeScores` return annotation, explicit `scores: Dict[str, List[float]]`, and casting of aggregates/std-devs to `float` make the output shape and types much clearer and consistent with the typed dict definition. One minor completeness thought: `benchwise/types.py`’s docstring mentions a `rougeL_f1` field; if that key is meant to be part of `RougeScores`, you may want to either (a) add it to `scores`/result, or (b) adjust the type/docstring so `f1` is clearly the custom ROUGE-L F-score you already compute.

Also applies to: 100-106, 140-150
170-177: BLEU helpers’ typing is clearer; consider minor polish only

Typing `bleu_score` as `Dict[str, Any]` and explicitly annotating `ngram_precisions: Dict[str, List[float]]` clarify the somewhat dynamic metric dict. `_get_smoothing_function`’s `Optional[Callable[..., Any]]` and `_get_weights`’s `Tuple[float, ...]` are appropriate, and the `type: ignore` comments around NLTK’s bound methods are a pragmatic way to satisfy static checkers. No correctness issues spotted.

Also applies to: 214-216, 294-307, 310-314
797-797: Factual correctness helpers’ typing is mostly solid; remove stray no-op expression

Annotating `_analyze_factual_correctness` and `_calculate_entity_overlap` with concrete `str`/`float` types and `nlp_model: Any` keeps spaCy optional without over-constraining the interface, and the enhanced keyword overlap typing (`important_ref_words: set[str]`) is clear.

There is one small cleanup opportunity: line 929 (`" ".join(ref_words)`) is a standalone expression whose result is ignored and can be dropped without changing behavior.

Also applies to: 844-848, 883-883, 909-979
benchwise/datasets.py (1)
81-90: Filter/sample/split metadata handling is reasonable, but size remains original

The new `Callable[[Dict[str, Any]], bool]` type on `filter`, and use of `metadata = self.metadata or {}` with `metadata={**metadata, ...}` for `filter`, `sample`, and `split` are all sound and avoid mutating the original metadata dict.

Note that the `size` in propagated metadata continues to reflect the original dataset rather than the filtered/sample/split sizes; that’s pre-existing behavior but might surprise consumers expecting `size == len(data)` on derived datasets. Consider updating or adding a separate field (e.g., `effective_size`) if you ever revisit metadata semantics.

Also applies to: 91-105, 122-140
benchwise/config.py (2)
400-420: `validate_api_connection` async usage can misreport failures inside running event loops

Wrapping the HTTP health check in an `async def check_connection()` and calling `asyncio.run(check_connection())` works well in plain scripts/CLIs, but `asyncio.run` raises `RuntimeError` when called from an already running event loop (e.g., some notebook or async framework contexts). In those cases you’ll fall into the broad `except` and return `False` even if the API is reachable.

If you want this helper to be robust in such environments, consider either:

- Switching to a synchronous `httpx.Client` call here, or
- Detecting a running loop and using it instead of `asyncio.run`.

As it stands this is acceptable for CLI-style diagnostics but may give false negatives in async-heavy applications.
427-486: API key validation and configuration status are useful diagnostics; be mindful of network cost

`validate_api_keys` now uses provider-specific clients/endpoints (OpenAI client, Anthropic client construction, Google Generative AI `list_models`, and Hugging Face’s `/api/whoami`) to infer key validity, and `print_configuration_status` wires that together with `validate_api_connection` and cache-dir checks. This is a good fit for an on-demand diagnostic command.

Given these helpers perform live network calls and, in some cases, potentially heavy operations (e.g., listing all models), it’s worth keeping them out of hot paths and reserving them for explicit “status”/debug flows, which you are already doing.
Also applies to: 488-517
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- .gitignore (1 hunks)
- benchwise/cli.py (9 hunks)
- benchwise/client.py (24 hunks)
- benchwise/config.py (14 hunks)
- benchwise/datasets.py (14 hunks)
- benchwise/metrics.py (24 hunks)
- mypy.ini (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- mypy.ini
- benchwise/client.py
🧰 Additional context used
🧬 Code graph analysis (4)
- benchwise/metrics.py (3)
  - benchwise/types.py (1)
    - `RougeScores` (43-58)
  - benchwise/core.py (1)
    - `evaluate` (20-60)
  - benchwise/datasets.py (1)
    - `references` (65-79)
- benchwise/datasets.py (1)
  - tests/test_integration_real_api.py (1)
    - `test_dataset` (23-28)
- benchwise/config.py (1)
  - benchwise/types.py (1)
    - `ConfigDict` (180-194)
- benchwise/cli.py (2)
  - benchwise/metrics.py (1)
    - `accuracy` (446-557)
  - benchwise/datasets.py (1)
    - `references` (65-79)
🔇 Additional comments (12)
benchwise/metrics.py (4)
1-3: Import cleanup and RougeScores wiring look good

Removing unused typing/NDArray imports and importing only `RougeScores` from `benchwise.types` aligns with prior Ruff feedback and keeps the module’s surface tidy. No issues here.

317-324: Return-type widening for metric functions is consistent with their dynamic outputs

Using `Dict[str, Any]` for `bert_score_metric`, `accuracy`, `semantic_similarity`, and `perplexity`, plus casting perplexity aggregates to `float`, accurately reflects that these functions return heterogeneous metric dictionaries while still giving callers a predictable top-level shape. This aligns with how the rest of the metrics module structures its outputs.

Also applies to: 446-455, 560-568, 688-689, 724-725
1042-1042: Coherence-related helpers: numeric typing and casts are coherent

Explicitly typing `component_scores: Dict[str, List[float]]`, casting `overall_coherence` and the length coefficient-of-variation to `float`, and returning `float(np.mean(overlaps))` from `_calculate_topic_consistency` all help static analysis and avoid leaking numpy scalar types. The semantics stay the same and the changes are safe.

Also applies to: 1061-1061, 1119-1125, 1268-1268

1272-1279: Safety metrics and `MetricCollection` typing align with usage

Annotating `safety_categories`/`perspective_api_key` as `Optional[...]`, making `category_scores: Dict[str, List[float]]` and `violation_details: List[List[str]]`, and typing `MetricCollection.metrics` as `Dict[str, Tuple[Callable[..., Any], Dict[str, Any]]]` give a realistic type view of how metrics and their kwargs are stored and invoked. The `evaluate` method’s `references: Optional[List[str]]` matches how you mix metrics that do and don’t need references. No behavioral regressions spotted.

Also applies to: 1357-1358, 1498-1503, 1508-1519
benchwise/datasets.py (5)
1-1: Typing imports and__post_init__annotation are appropriateAdding
Callable/Unionto the imports and annotating__post_init__as returningNonematch how the class is used and keep the dataclass lifecycle explicit. No issues here.Also applies to: 27-27
65-79: Including label in references improves classification support

Extending the reference extraction to fall back to item.get("label") makes the Dataset.references property usable for simple classification datasets without extra wiring. Converting to str as before keeps the return type consistent.
163-167: to_csv and get_statistics typing/numeric handling are solid

Annotating to_csv as returning None matches its side-effect-only behavior. In get_statistics, explicitly typing fields: List[str] and stats: Dict[str, Any], then narrowing to numeric_values under the all(isinstance(..., (int, float))) guard, makes the numeric summary logic clearer and type-checker-friendly without changing semantics.

Also applies to: 182-210
307-343: Dataset factory functions: type annotations align with usage

Adding **kwargs: Any and explicit List[str] annotations for questions/answers, documents/summaries, and texts/labels matches the way these helpers are used (including in tests/test_integration_real_api.py) and makes their contracts clearer. The schemas and metadata blocks are unchanged and remain consistent.

Also applies to: 345-382, 385-418
421-441: DatasetRegistry typing is straightforward and correct

Typing the registry's internal map as Dict[str, Dataset] and annotating __init__, register, remove, and clear as returning None documents the side-effect-only API and avoids ambiguity for callers. No behavior changes introduced.

benchwise/config.py (3)
10-10: Config typing imports and default model/metric lists are well integrated

Importing List and ConfigDict, then declaring default_models / default_metrics as List[str] = field(default_factory=list), cleanly extends the configuration surface and avoids the shared-mutable-default pitfall. The ConfigDict alias ties the in-memory config and its serialized form together nicely.

Also applies to: 14-14, 57-59
60-67: Internal config lifecycle methods are now explicitly typed and safer to consume

Adding -> None to __post_init__, _load_from_env, _load_from_file, _validate_config, and save_to_file makes their side-effect-only nature explicit. Persisting default_models / default_metrics in save_to_file, and returning a ConfigDict from to_dict that masks api_key as "***" when present, are good choices for both type safety and secret handling. print_config builds on that masked dict, which avoids leaking real keys.

Also applies to: 118-145, 174-199, 217-233, 235-243
263-272: Global config helpers: signatures and kwargs handling look correct

Annotating set_api_config and reset_config as returning None, and configure_benchwise(..., **kwargs: Any) -> BenchwiseConfig, makes the public API clearer. The kwargs loop still only sets attributes that exist on the config object, which keeps this extension-safe while preventing accidental misspellings from silently adding new attributes. Re-validating via config._validate_config() after changes is also the right call.

Also applies to: 274-281, 296-317, 320-324
Actionable comments posted: 0
♻️ Duplicate comments (1)
tests/test_models.py (1)
83-86: Verify Python 3.10+ requirement for parenthesized context managers.

The parenthesized context manager syntax at these locations requires Python 3.10+. This is the same syntax change applied in tests/test_integration.py (lines 112-115).

Also applies to: 201-204, 209-212
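The syntax under discussion can be sketched with a toy context manager; the name tracked below is illustrative, not from the repo:

```python
from contextlib import contextmanager

@contextmanager
def tracked(name):
    # Toy context manager that simply yields its name.
    yield name

# Parenthesized grouping of multiple context managers:
# valid syntax on Python 3.10+, a SyntaxError on 3.9 and earlier.
with (
    tracked("first") as a,
    tracked("second") as b,
):
    pair = (a, b)
```

On 3.9 and earlier the same grouping needs a backslash continuation or nested with statements, which is why the minimum supported Python version matters for this change.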
🧹 Nitpick comments (5)
benchwise/types.py (3)
21-26: Potential duplicate type variable definition.
P and R are also defined in benchwise/core.py (lines 24-25). If both modules are imported together, this could cause confusion or shadowing issues. Consider importing these from types.py in core.py instead of redefining them.
107-116: AccuracyResults TypedDict is incomplete.

The accuracy() function in metrics.py returns additional fields not defined here: exact_accuracy, fuzzy_accuracy, correct_fuzzy, mean_score, std_score, individual_scores, and match_types. Consider adding these for complete type coverage.

  class AccuracyResults(TypedDict, total=False):
      """Return type for accuracy metric."""
      accuracy: float
+     exact_accuracy: float
+     fuzzy_accuracy: float
      correct: int
+     correct_fuzzy: int
      total: int
-     std_accuracy: float
-     scores: List[float]
+     mean_score: float
+     std_score: float
+     individual_scores: List[float]
+     match_types: List[str]
      # Optional confidence interval
      accuracy_confidence_interval: Tuple[float, float]
163-179: DatasetItem is missing fields used by Dataset class.

The Dataset.prompts property (in datasets.py) also checks for a document field, and Dataset.references checks for a label field. Add these for completeness.

  class DatasetItem(TypedDict, total=False):
      """A single item in a dataset."""
      # Common field names
      prompt: str
      input: str
      question: str
      text: str
+     document: str  # For summarization datasets
      # Reference/target fields
      reference: str
      output: str
      answer: str
      target: str
      summary: str
+     label: str
      # Additional fields
      id: str
      metadata: Dict[str, Any]

benchwise/core.py (2)
23-25: Import type variables from benchwise.types instead of redefining.

P and R are also defined in benchwise/types.py. For consistency with the PR's goal of centralizing type definitions, import them from there.

+ from .types import P, R
+
- # Type variables for decorator typing
- P = ParamSpec("P")
- R = TypeVar("R")
30-35: Return type annotation may not precisely match wrapper signature.

The wrapper function signature is (dataset: Dataset, **test_kwargs: Any), but the return type annotation specifies Callable[[Dataset, Any], Awaitable[...]]. Consider using Callable[..., Awaitable[List[EvaluationResult]]] for the wrapper type, or documenting that Any represents **kwargs.

  def evaluate(
      *models: str, upload: Optional[bool] = None, **kwargs: Any
  ) -> Callable[
      [Callable[..., Awaitable[Any]]],
-     Callable[[Dataset, Any], Awaitable[List[EvaluationResult]]],
+     Callable[..., Awaitable[List[EvaluationResult]]],
  ]:
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- benchwise/core.py (12 hunks)
- benchwise/exceptions.py (1 hunks)
- benchwise/logging.py (1 hunks)
- benchwise/results.py (12 hunks)
- benchwise/types.py (1 hunks)
- tests/test_config.py (1 hunks)
- tests/test_docs_examples.py (7 hunks)
- tests/test_integration.py (1 hunks)
- tests/test_models.py (2 hunks)
✅ Files skipped from review due to trivial changes (1)
- tests/test_config.py
🚧 Files skipped from review as they are similar to previous changes (2)
- benchwise/exceptions.py
- benchwise/results.py
🧰 Additional context used
🧬 Code graph analysis (3)
tests/test_models.py (1)
benchwise/models.py (1)
HuggingFaceAdapter(208-250)
benchwise/types.py (4)
benchwise/metrics.py (1)
accuracy (446-557)
benchwise/datasets.py (3)
size (44-46)
prompts (49-62)
references (65-79)
benchwise/results.py (1)
success (37-39)
benchwise/core.py (1)
evaluate(30-79)
benchwise/core.py (4)
benchwise/types.py (1)
evaluate (320-324)
benchwise/datasets.py (1)
Dataset (11-210)
benchwise/results.py (1)
EvaluationResult (12-76)
benchwise/client.py (1)
upload_results (943-1013)
🔇 Additional comments (8)
tests/test_docs_examples.py (1)
20-172: LGTM! String quote standardization improves consistency.

The changes standardize string quotes to double quotes throughout the test file, improving code consistency. No functional changes detected.
tests/test_integration.py (1)
112-115: This concern is invalid: the project requires Python 3.11+, which supports parenthesized context managers.

The project's pyproject.toml specifies requires-python = ">=3.11" and targets Python 3.11+ (ruff config: target-version = "py311"). Parenthesized context managers are supported in Python 3.10+, so this syntax is compatible with the project's minimum Python version.
71-80: LGTM! Explicit return type annotation added.

The -> None return type annotation correctly documents the function's behavior and improves type checker support.
285-323: LGTM! Well-defined protocols.

The SupportsGenerate, SupportsCache, and SupportsMetrics protocols are correctly structured with appropriate method signatures for structural subtyping.
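Structural subtyping of this kind can be exercised with a toy implementation; the protocol signature and the class below (EchoModel) are simplified assumptions for illustration, not the package's actual definitions:

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class SupportsGenerate(Protocol):
    # Any object with a matching generate() method satisfies this Protocol;
    # no inheritance is required. Signature here is an assumed simplification.
    def generate(self, prompt: str) -> str: ...

class EchoModel:
    # Note: EchoModel never subclasses SupportsGenerate.
    def generate(self, prompt: str) -> str:
        return f"echo: {prompt}"

def run(model: SupportsGenerate, prompt: str) -> str:
    # mypy accepts any structurally compatible object here.
    return model.generate(prompt)

output = run(EchoModel(), "hi")
```

Because the class is decorated with runtime_checkable, isinstance() checks also work (they verify method presence only, not signatures).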
176-223: LGTM! Well-typed stress_test decorator.

The type annotations correctly capture the async decorator pattern, and the Union[R, BaseException] return type properly reflects asyncio.gather(..., return_exceptions=True) behavior.
229-257: LGTM! Clear type annotations for EvaluationRunner.

The explicit typing for config, results_cache, and method signatures improves code clarity and enables better IDE support.
317-335: LGTM! Correct type annotations for quick_eval.

The Dict[str, Optional[float]] return type correctly captures that model evaluations may fail (returning None).
85-85: > Likely an incorrect or invalid review comment.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- .pre-commit-config.yaml (2 hunks)
- benchwise/config.py (15 hunks)
- benchwise/metrics.py (25 hunks)
- tests/test_memory_large_datasets.py (8 hunks)
- tests/test_results.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
benchwise/metrics.py (1)
benchwise/types.py (1)
RougeScores(56-72)
benchwise/config.py (1)
benchwise/types.py (1)
ConfigDict(204-219)
🪛 GitHub Actions: CI
benchwise/metrics.py
[error] 4-4: ModuleNotFoundError: No module named 'rouge_score' while importing benchwise.metrics during test execution.
🔇 Additional comments (13)
tests/test_results.py (1)
187-189: LGTM!

The reformatted assertion maintains the same logic while improving consistency with the test suite's formatting style. The parenthesized message format is valid and readable.
tests/test_memory_large_datasets.py (2)
32-34: Consistent assertion formatting applied throughout the file.

All assertions have been reformatted to use the parenthesized message style consistently. The memory threshold checks and test logic remain unchanged. This aligns with the PR's broader formatting standardization effort.
60-62: Formatting changes consistent with the rest of the file.

All remaining assertion reformats follow the same pattern: no logic changes, just stylistic consistency improvements.
Also applies to: 72-74, 103-105, 138-140, 167-169, 187-189, 201-203
.pre-commit-config.yaml (1)
20-28: LGTM! Mypy integration properly configured.

The mypy hook is well-configured with:
- Appropriate stub packages (types-requests, pandas-stubs)
- Custom config file reference
- Restricted scope to benchwise/ directory
This aligns perfectly with the PR's objective to improve type annotations.
benchwise/metrics.py (5)
69-69: LGTM! Proper typing with RougeScores TypedDict.

The function signature correctly uses the RougeScores TypedDict from benchwise/types.py, and the result dictionary is properly typed and constructed with explicit float conversions for numpy values. This ensures type safety and mypy compatibility.

Also applies to: 140-149, 176-176
294-307: LGTM! Appropriate use of type: ignore for bound methods.

The type: ignore[no-any-return] comments correctly suppress mypy warnings for returning bound methods from the NLTK SmoothingFunction. This is necessary because these methods don't have proper type stubs.
1498-1503: LGTM! Proper typing for MetricCollection methods.

The method signatures correctly use:
- -> None for __init__ and add_metric
- Callable[..., Any] for metric functions
- **kwargs: Any for flexible arguments

This provides good type safety while maintaining flexibility.
928-928: The project requires Python 3.11+ (requires-python = ">=3.11" in pyproject.toml), which fully supports PEP 585 lowercase generics. Using set[str] is the correct, idiomatic approach and no changes are needed.
1-11: Fix unconditional import of undeclared optional dependency rouge_score.

The ModuleNotFoundError: No module named 'rouge_score' occurs because line 4 imports rouge_scorer unconditionally, but rouge-score is not declared in any dependency configuration (neither install_requires nor any extras). Either add rouge-score to an optional dependency extra (e.g., metrics), or guard the import with a try-except block or conditional import, since ROUGE metrics are optional functionality.

Likely an incorrect or invalid review comment.
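The guarded-import pattern suggested above can be sketched as follows; the module name rouge_score matches the review, but the error message text and the extras name in it are illustrative assumptions:

```python
# Guarded import of an optional dependency: the module loads even when
# rouge-score is not installed, and only the ROUGE entry point fails.
try:
    from rouge_score import rouge_scorer  # optional dependency
    HAS_ROUGE = True
except ImportError:
    rouge_scorer = None  # type: ignore[assignment]
    HAS_ROUGE = False

def rouge_l(responses, references):
    """Fail with an actionable message instead of an import-time crash."""
    if not HAS_ROUGE:
        raise ImportError(
            "rouge-score is not installed; install the optional metrics "
            "dependencies to use ROUGE metrics."
        )
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    return [scorer.score(ref, resp) for ref, resp in zip(references, responses)]
```

This keeps import-time behavior (and test collection) working regardless of which extras the user installed.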
benchwise/config.py (4)
10-14: LGTM! Consistent typing with centralized type definitions.

The changes properly integrate ConfigDict from benchwise/types.py and use List[str] for model/metric lists. The to_dict() method now has a precise return type that aligns with the TypedDict definition, improving type safety throughout the codebase.

Also applies to: 57-58, 217-233
60-64: LGTM! Comprehensive None return type annotations.

All void methods now properly annotate -> None return types, which is essential for mypy strict mode and improves code documentation. This includes initialization, validation, configuration, and printing methods.

Also applies to: 66-117, 118-143, 145-172, 174-215, 235-242, 263-271, 320-323, 488-517
427-485: validate_api_keys performs live external API calls.

The validate_api_keys() function makes actual API requests to OpenAI, Anthropic, Google GenAI, and HuggingFace to verify keys. This can:
- Consume API quota/credits
- Add latency to configuration checks
- Fail in offline or restricted network environments
Consider documenting this behavior clearly, or making the validation opt-in rather than automatic during configuration operations.
Based on learnings about user experience with API calls during configuration.
203-205: Good security practice: API key handling.

The explanatory note added when not saving the API key clearly informs users why the key is excluded and how to set it via environment variables. This balances security with user experience.
@CodeRabbit is it fine to use cast() ? |
@yanurag-dev Using When
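The cast() behavior at issue in this exchange can be shown in a short sketch; the parse_user helper and the payload shape are illustrative, not from the repo:

```python
from typing import Any, Dict, cast

def parse_user(payload: Any) -> Dict[str, Any]:
    # cast() is a type-checker-only assertion: it tells mypy "trust me,
    # this value is a Dict[str, Any]" and performs NO runtime check
    # or conversion whatsoever.
    user = cast(Dict[str, Any], payload)
    return user

# At runtime, cast() simply returns its second argument unchanged,
# so a malformed payload only fails later, at the point of use.
data = parse_user({"id": 1, "username": "ada"})
```

That no-op runtime behavior is why cast() is reasonable when the API contract is stable, while runtime validation (e.g., Pydantic models) is the safer choice when inputs may drift.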
…otations in cli.py, client.py, core.py, and metrics.py
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (3)
benchwise/cli.py (1)
213-219: Verify key existence when accessing typed dictionary results.

The code accesses ["f1"] and ["mean_similarity"] from typed dictionaries. While the TypedDicts in benchwise/types.py define these fields with total=False, the actual implementations in benchwise/metrics.py always populate these fields. However, for robustness, consider using .get() with defaults or add explicit checks.

Apply this diff to use safer dictionary access:
  elif metric_name == "rouge_l":
      rouge_result = rouge_l(responses, references)
-     results["rouge_l_f1"] = rouge_result["f1"]
+     results["rouge_l_f1"] = rouge_result.get("f1", 0.0)
  elif metric_name == "semantic_similarity":
      semantic_result = semantic_similarity(responses, references)
-     results["semantic_similarity"] = semantic_result[
-         "mean_similarity"
-     ]
+     results["semantic_similarity"] = semantic_result.get(
+         "mean_similarity", 0.0
+     )

benchwise/client.py (1)
249-249: Consider adding runtime validation for API responses.

The code uses cast() extensively to type API responses. While this satisfies mypy and is acceptable for the current stage (as discussed in PR comments), cast() provides no runtime validation. If the API contract changes or returns unexpected data, errors will only surface when accessing specific fields.

Based on learnings, cast() is fine for this use case given the stable API contract, but consider adding a TODO or follow-up issue to implement runtime validation using Pydantic models for production robustness.
Example with Pydantic (for future consideration):
from pydantic import BaseModel

class UserResponse(BaseModel):
    id: int
    username: str
    email: str
    # ... other fields

async def get_current_user(self) -> Dict[str, Any]:
    response = await self.client.get("/api/v1/users/me")
    if response.status_code == 200:
        # Validates at runtime and raises ValidationError if invalid
        user = UserResponse.model_validate(response.json())
        return user.model_dump()
    # ... error handling

Also applies to: 249-249, 276-276, 348-350, 375-384
benchwise/metrics.py (1)
150-177: Consider consistent typing pattern for result construction.

The code uses two patterns for building typed results:
- Lines 150-177: explicitly typed variable result: RougeScores = {...}
- Build a Dict[str, Any], then cast at return

Both patterns work, but pattern 1 provides slightly better type safety during dictionary construction (mypy can catch typos in keys). Consider using pattern 1 consistently for better type checking.
Example refactor for consistency:
- result_dict: Dict[str, Any] = { + result: BleuScores = { "corpus_bleu": corpus_bleu, "sentence_bleu": float(np.mean(sentence_scores)), # ... other fields } - return cast(BleuScores, result_dict) + return resultAlso applies to: 279-302
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- benchwise/cli.py (10 hunks)
- benchwise/client.py (23 hunks)
- benchwise/core.py (12 hunks)
- benchwise/metrics.py (34 hunks)
- benchwise/types.py (1 hunks)
- mypy.ini (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (5)
benchwise/types.py (4)
benchwise/metrics.py (2)
accuracy (463-574)
evaluate (1527-1543)
benchwise/datasets.py (3)
size (44-46)
prompts (49-62)
references (65-79)
benchwise/results.py (1)
success (37-39)
benchwise/core.py (1)
evaluate(30-79)
benchwise/cli.py (2)
benchwise/types.py (4)
ConfigureArgs (449-456)
ConfigKwargs (472-477)
SyncArgs (459-462)
StatusArgs (465-469)
benchwise/metrics.py (2)
rouge_l (73-177)
semantic_similarity (577-703)
benchwise/core.py (4)
benchwise/types.py (1)
evaluate (442-446)
benchwise/datasets.py (1)
Dataset (11-210)
benchwise/results.py (1)
EvaluationResult (12-76)
benchwise/cli.py (1)
run_evaluation(115-286)
benchwise/metrics.py (2)
benchwise/types.py (10)
RougeScores (56-72)
BleuScores (75-96)
BertScoreResults (99-129)
AccuracyResults (132-154)
SemanticSimilarityResults (157-182)
PerplexityResults (185-193)
FactualCorrectnessResults (263-281)
CoherenceResults (204-221)
SafetyResults (232-252)
evaluate (442-446)
benchwise/core.py (1)
evaluate(30-79)
benchwise/client.py (1)
benchwise/types.py (1)
OfflineQueueItem(480-484)
🪛 GitHub Actions: CI
benchwise/metrics.py
[error] 14-14: ModuleNotFoundError: No module named 'rouge_score'. PyTest run failed while importing rouge_scorer from rouge_score during test collection. Command: python -m pytest -v tests/test_core.py tests/test_datasets.py tests/test_models.py tests/test_results.py tests/test_config.py --tb=short --color=yes --durations=5 --maxfail=10 --disable-warnings
🔇 Additional comments (8)
mypy.ini (1)
1-73: LGTM! Comprehensive mypy configuration.The configuration enables strict type checking appropriately and handles third-party library stubs correctly. The use of
follow_imports = skipfor large libraries liketransformersandgoogle.generativeaiis a good practice to avoid performance issues.benchwise/cli.py (1)
289-323: LGTM! Protocol-based typing for CLI arguments.The use of Protocol types (
ConfigureArgs,ConfigKwargs, etc.) is an excellent pattern for typing argparse Namespace objects and function kwargs. This provides type safety without requiring concrete class dependencies.benchwise/types.py (1)
1-484: LGTM! Well-structured centralized type definitions.This module provides excellent type coverage with:
- Appropriate use of
TypedDictwithtotal=Falsefor optional fields- Well-documented Protocol interfaces for duck typing
- Comprehensive metric result types matching the actual implementations
- Clear separation of concerns (models, metrics, datasets, config, API)
The centralized types module will significantly improve IDE support and catch type errors early.
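The total=False pattern praised above can be illustrated with a small sketch; the field names follow the review discussion but the exact shape is an assumption:

```python
from typing import Tuple, TypedDict

class AccuracyResults(TypedDict, total=False):
    # total=False makes every key optional: a value may omit any of them
    # and still type-check.
    accuracy: float
    correct: int
    total: int
    accuracy_confidence_interval: Tuple[float, float]

# Both of these are valid AccuracyResults values for mypy:
full: AccuracyResults = {"accuracy": 0.9, "correct": 9, "total": 10}
partial: AccuracyResults = {"accuracy": 0.9}

# At runtime a TypedDict is a plain dict, so .get() with a default is the
# safe way to read keys that may be absent.
ci = partial.get("accuracy_confidence_interval", (0.0, 1.0))
```

This is also why downstream code should prefer .get() over direct subscripting for total=False fields, as flagged elsewhere in this review.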
benchwise/client.py (1)
77-93: LGTM! Properly typed async context manager.

The context manager implementation correctly types the __aenter__ and __aexit__ methods with proper exception handling types. The typing improvements make the async context manager protocol explicit and checkable by mypy.

benchwise/metrics.py (1)
151-158: LGTM! Proper float conversions for numpy scalars.

The explicit float() conversions for numpy aggregations (mean, std, percentile) are correct and necessary. NumPy returns numpy scalar types (e.g., np.float64) which don't satisfy type checkers expecting float. This pattern ensures type compatibility.

benchwise/core.py (3)
30-79: LGTM! Advanced decorator typing with ParamSpec and Awaitable.

The decorator typing correctly captures the complex nested function signatures:
- evaluate() returns a decorator function
- The decorator accepts an async function and returns a wrapper
- The wrapper preserves the async nature and returns typed results

The type: ignore comment on line 75 is appropriate for the dynamically added _benchmark_metadata attribute. This is excellent use of Python's advanced typing features to provide strong type safety for decorator patterns.
176-223: LGTM! Correct typing for exception handling in stress tests.

The List[Union[R, BaseException]] return type correctly captures that stress test results can contain either successful results or exceptions. This aligns perfectly with asyncio.gather(..., return_exceptions=True), which returns a mix of results and exceptions.

The typing makes it explicit that callers must check each result to determine if it's an exception or a successful value.
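The caller-side checking described above can be sketched with a toy task; the names here (task, run_all) are illustrative:

```python
import asyncio
from typing import List, Union

async def task(n: int) -> int:
    if n < 0:
        raise ValueError("negative input")
    return n * 2

async def run_all(inputs: List[int]) -> List[Union[int, BaseException]]:
    # return_exceptions=True folds failures into the result list instead of
    # raising, which is exactly why the element type is Union[R, BaseException].
    return await asyncio.gather(*(task(n) for n in inputs), return_exceptions=True)

results = asyncio.run(run_all([1, -1, 3]))
# Callers must partition results before using them as values.
oks = [r for r in results if not isinstance(r, BaseException)]
errors = [r for r in results if isinstance(r, BaseException)]
```

The isinstance(r, BaseException) check is the narrowing step mypy expects before treating an element as a successful value.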
317-334: LGTM! Clean typing for quick evaluation.

The function signature correctly types:
- the metric parameter as a callable that scores text: Callable[[str], float]
- the return type as Dict[str, Optional[float]] to handle failures gracefully

This provides clear contracts for callers and enables proper type checking.
Actionable comments posted: 1
🧹 Nitpick comments (1)
benchwise/core.py (1)
74-75: Consider using a Protocol for dynamic attributes.

The type: ignore comment is necessary here because _benchmark_metadata is dynamically attached by the @benchmark decorator. For improved type safety, consider defining a Protocol:

from typing import Protocol

class BenchmarkFunction(Protocol):
    _benchmark_metadata: Dict[str, Any]

Then use conditional checks with isinstance() or hasattr() paired with proper type guards. This would eliminate the need for type: ignore comments while maintaining runtime flexibility.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
benchwise/core.py(12 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
benchwise/core.py (4)
benchwise/datasets.py (1)
Dataset (11-210)
benchwise/results.py (1)
EvaluationResult (12-76)
benchwise/models.py (1)
get_model_adapter (275-298)
benchwise/client.py (1)
upload_results(950-1020)
🔇 Additional comments (10)
benchwise/core.py (10)
1-25: LGTM! Type variable declarations are appropriate.

The addition of ParamSpec and TypeVar enables proper generic typing for decorators, preserving both parameter signatures and return types through decorator chains.
30-35: LGTM! Decorator signature accurately represents the evaluation pattern.

The complex generic return type correctly captures the decorator chain: a function that accepts an async test function and returns a wrapper that takes a Dataset and produces evaluation results.
82-90: LGTM! Function signature accurately reflects parameter types.

The use of tuple[str, ...] for the models parameter correctly represents that it originates from *models (variadic args), which creates a tuple. The addition of the wrapper_func parameter enables metadata access from the decorator chain.
154-174: LGTM! Generic signature preserves wrapped function's type.

The use of ParamSpec[P] and TypeVar[R] ensures that the @benchmark decorator doesn't alter the wrapped function's signature, enabling proper type checking for decorated functions. The type: ignore[attr-defined] on line 167 is consistent with the pattern on line 75.
177-224: LGTM! Generic signature correctly captures stress test behavior.

The return type List[Union[R, BaseException]] accurately represents the behavior of asyncio.gather(..., return_exceptions=True), which returns both successful results and exceptions. The use of ParamSpec preserves the wrapped function's parameter signature.
230-233: LGTM! Explicit attribute annotations improve type safety.

The explicit type annotations for self.config and self.results_cache, along with the -> None return type for __init__, enhance type checking. Using Dict[str, Any] is appropriate for flexible configuration storage.
235-254: LGTM! Method signature clearly defines expectations.

The explicit typing of test_func as Callable[..., Awaitable[Any]] and the annotated results list provide clear type information for this evaluation method.
256-307: LGTM! Optional parameter correctly typed and defensively handled.

The change to Optional[str] = None for metric_name is correct. The score extraction logic defensively handles missing or None scores by falling back to 0, ensuring robustness when result structures vary.
310-315: LGTM! Signature consistent with other async evaluation functions.

The typing of benchmark_func as Callable[..., Awaitable[Any]] maintains consistency with the evaluate and run_evaluation methods.
318-336: LGTM! Function signature and error handling properly typed.

The metric parameter typed as Callable[[str], float] clearly defines expectations, and the return type Dict[str, Optional[float]] correctly represents that evaluation can fail (None) for some models.
… test gpt adapter
Actionable comments posted: 0
🧹 Nitpick comments (2)
.github/workflows/ci.yml (2)
19-19: Verify Python version support scope.

The workflow only tests against Python 3.12. If the project is intended to support multiple Python versions (e.g., 3.9, 3.10, 3.11), consider adding a matrix strategy to catch version-specific compatibility issues early.

If multi-version support is needed, you can refactor the Python setup step to use a matrix:

  jobs:
    test:
      runs-on: ubuntu-latest
+     strategy:
+       matrix:
+         python-version: ["3.9", "3.10", "3.11", "3.12"]
      steps:
        - uses: actions/checkout@v4
        - name: Set up Python
          uses: actions/setup-python@v5
          with:
-           python-version: "3.12"
+           python-version: ${{ matrix.python-version }}

Also applies to: 19-19
21-24: Consider adding pip dependency caching for faster builds.

Caching pip dependencies can significantly reduce build times on subsequent runs. This is an optional performance optimization that works well for CI pipelines with frequent pushes.

Add pip caching before the install step:

  - uses: actions/checkout@v4
+ - name: Cache pip dependencies
+   uses: actions/cache@v4
+   with:
+     path: ~/.cache/pip
+     key: ${{ runner.os }}-pip-${{ hashFiles('**/pyproject.toml') }}
+     restore-keys: |
+       ${{ runner.os }}-pip-
  - name: Set up Python 3.12
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- .github/workflows/ci.yml (1 hunks)
- tests/test_models.py (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/test_models.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: test
🔇 Additional comments (1)
.github/workflows/ci.yml (1)
26-36: Well-structured CI workflow with comprehensive checks.

The workflow covers all essential quality gates: linting, formatting, type checking, and testing. It aligns well with the PR's focus on improving type annotations and introducing strict mypy configuration. The sequential execution is appropriate for this pipeline. All supporting infrastructure is properly configured: mypy.ini contains the benchwise target, pyproject.toml defines the required optional dependencies, and run_tests.py is available with --basic flag support.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (7)
benchwise/client.py (3)
816-871: Offline sync is currently broken (wrong reconstruction + wrong kwargs).
Two runtime issues in sync_offline_queue():
- BenchmarkResult(**benchmark_result_dict) won't rebuild results into EvaluationResult objects (they're dicts from to_dict()), so later result.success / result.to_dict() will crash.
- The queued "create_evaluation" payload includes "config", but create_evaluation() doesn't accept it, so self.create_evaluation(**evaluation_data) will raise TypeError.

  for item in self.offline_queue:
      try:
          queue_data: Dict[str, Any] = item["data"]
          data_type: Any = queue_data.get("type")
  @@
          if data_type == "full_benchmark_result":
-             # Reconstruct BenchmarkResult and upload
-             from .results import BenchmarkResult
-
-             benchmark_result_dict: Dict[str, Any] = queue_data.get(
-                 "benchmark_result", {}
-             )
-             benchmark_result = BenchmarkResult(**benchmark_result_dict)
+             # Reconstruct BenchmarkResult + EvaluationResult objects and upload
+             from .results import BenchmarkResult, EvaluationResult
+             from datetime import datetime
+
+             benchmark_result_dict: Dict[str, Any] = cast(
+                 Dict[str, Any], queue_data.get("benchmark_result", {})
+             )
+             raw_results: Any = benchmark_result_dict.get("results", [])
+             results_objs: List[EvaluationResult] = []
+             if isinstance(raw_results, list):
+                 for r in raw_results:
+                     if isinstance(r, dict):
+                         ts = r.get("timestamp")
+                         if isinstance(ts, str):
+                             try:
+                                 r["timestamp"] = datetime.fromisoformat(ts)
+                             except Exception:
+                                 pass
+                         # drop computed fields
+                         r_clean = {k: v for k, v in r.items() if k not in ("success", "failed")}
+                         results_objs.append(EvaluationResult(**r_clean))
+
+             ts_bm = benchmark_result_dict.get("timestamp")
+             bm_timestamp = datetime.now()
+             if isinstance(ts_bm, str):
+                 try:
+                     bm_timestamp = datetime.fromisoformat(ts_bm)
+                 except Exception:
+                     pass
+
+             benchmark_result = BenchmarkResult(
+                 benchmark_name=str(benchmark_result_dict.get("benchmark_name", "offline_benchmark")),
+                 results=results_objs,
+                 metadata=cast(Dict[str, Any], benchmark_result_dict.get("metadata", {})),
+                 timestamp=bm_timestamp,
+             )
              await self.upload_benchmark_result(benchmark_result)
          elif data_type == "create_evaluation":
-             evaluation_data: Dict[str, Any] = queue_data.get("data", {})
-             await self.create_evaluation(**evaluation_data)
+             evaluation_data: Dict[str, Any] = cast(Dict[str, Any], queue_data.get("data", {}))
+             name = evaluation_data.get("name")
+             benchmark_id = evaluation_data.get("benchmark_id")
+             model_ids = evaluation_data.get("model_ids")
+             metadata = evaluation_data.get("metadata")
+             if isinstance(name, str) and isinstance(benchmark_id, int) and isinstance(model_ids, list):
+                 await self.create_evaluation(
+                     name=name,
+                     benchmark_id=benchmark_id,
+                     model_ids=cast(List[int], model_ids),
+                     metadata=cast(Optional[EvaluationMetadata], metadata),
+                 )
196-226: Avoid logging raw usernames/emails on login attempts.
logger.info(f"Attempting login for user: {username}") can leak PII into logs; consider redacting or logging at debug with a hash/truncated value.
109-174: Handle HTTP-date retry-after format and validate headers dict before mutation.

Line 135-137: int(response.headers.get("retry-after", ...)) will crash if a server sends HTTP-date format (e.g., "Fri, 31 Dec 1999 23:59:59 GMT"), which is valid per RFC 7231. The check at line 118 is insufficient: if a caller passes headers=None, line 120 will crash attempting to subscript None.

benchwise/results.py (2)
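The Retry-After parsing issue flagged above can be handled with a small helper that accepts both RFC 7231 forms; the function name and default are illustrative assumptions, not the project's API:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional

def parse_retry_after(value: Optional[str], default: float = 1.0) -> float:
    """Return a wait time in seconds from a Retry-After header value.

    RFC 7231 allows either delta-seconds ("120") or an HTTP-date
    ("Fri, 31 Dec 1999 23:59:59 GMT"). Missing or unparseable values
    fall back to a caller-supplied default.
    """
    if not value:
        return default
    try:
        return float(int(value))  # delta-seconds form
    except ValueError:
        pass
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
        delta = (when - datetime.now(timezone.utc)).total_seconds()
        return max(delta, 0.0)  # dates in the past clamp to zero wait
    except (TypeError, ValueError):
        return default
```

Using a helper like this keeps the retry loop from crashing on the HTTP-date form while still honoring the common delta-seconds case.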
180-228: Ensure compare_models() only ranks numeric scores (can TypeError today).

valid_scores currently accepts any non-None score; if a metric returns a dict/string, sorted(..., key=lambda x: x[1]) and np.mean() will break.

- valid_scores = [
-     (name, score)
-     for name, score in zip(model_names, scores)
-     if score is not None
- ]
+ valid_scores: List[tuple[str, float]] = []
+ for name, score in zip(model_names, scores):
+     if isinstance(score, (int, float)):
+         valid_scores.append((name, float(score)))
  @@
- sorted_results = sorted(valid_scores, key=lambda x: x[1], reverse=True)
+ sorted_results = sorted(valid_scores, key=lambda x: x[1], reverse=True)
294-342: Make cross-benchmark output JSON-friendly (timestamp type mismatch).

benchmark_info["timestamp"] = benchmark.timestamp is a datetime, but your typed shape suggests a str timestamp (and you already use isoformat() in to_dict()).

  benchmark_info = {
      "name": benchmark.benchmark_name,
-     "timestamp": benchmark.timestamp,
+     "timestamp": benchmark.timestamp.isoformat(),
      "models": benchmark.model_names,
      "success_rate": benchmark.success_rate,
  }

benchwise/core.py (1)
268-288: `EvaluationRunner.run_evaluation()` returns the wrong thing (likely breaks `run_benchmark`).

`run_evaluation()` promises `List[EvaluationResult]` but appends the raw `result` from `test_func(model, dataset)` (which is typically metrics/Any). This is a runtime bug if `run_benchmark()` is used.

```diff
     async def run_evaluation(
 @@
     ) -> List[EvaluationResult]:
         """Run evaluation on multiple models."""
-        results: List[EvaluationResult] = []
+        results: List[EvaluationResult] = []
 @@
         for model_name in models:
             try:
                 model = get_model_adapter(model_name)
-                result = await test_func(model, dataset)
-                results.append(result)
+                start_time = time.time()
+                payload = await test_func(model, dataset)
+                end_time = time.time()
+                results.append(
+                    EvaluationResult(
+                        model_name=model_name,
+                        test_name=getattr(test_func, "__name__", "evaluation"),
+                        result=payload,
+                        duration=end_time - start_time,
+                        dataset_info=convert_metadata_to_info(dataset.metadata)
+                        if dataset.metadata
+                        else None,
+                    )
+                )
             except Exception as e:
                 self.logger.error(f"Evaluation failed for {model_name}: {e}")
+                results.append(
+                    EvaluationResult(
+                        model_name=model_name,
+                        test_name=getattr(test_func, "__name__", "evaluation"),
+                        error=str(e),
+                        duration=0.0,
+                        dataset_info=convert_metadata_to_info(dataset.metadata)
+                        if dataset.metadata
+                        else None,
+                    )
+                )
```

Also applies to: 354-360
benchwise/datasets.py (1)
334-448: Fix URL handling order + add request timeouts (currently breaks remote URLs).

When `source` is a string URL like `"https://host/data.json"`, it passes the `isinstance(source, (str, Path))` check, then `Path(source).suffix` returns `".json"`, causing the local `.json` file branch to execute first. This attempts `open("https://host/data.json", "r")`, which fails before the URL branch at line 399 ever runs. Additionally, `requests.get()` calls at line 401 lack a timeout parameter, risking indefinite hangs.

Move URL detection before the local file suffix checks, and add `timeout=30` (or an appropriate value) to `requests.get()` calls.
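One way to sketch the suggested ordering (`classify_source` is an illustrative helper, not the actual `load_dataset` implementation):

```python
from pathlib import Path
from urllib.parse import urlparse


def is_remote_source(source: str) -> bool:
    """True for http(s) URLs; checked before any suffix-based branching."""
    return urlparse(source).scheme in ("http", "https")


def classify_source(source: str) -> str:
    # The URL check must come first: Path("https://host/data.json").suffix is
    # ".json", so suffix-based branching alone would pick the local-file path.
    if is_remote_source(source):
        # In the real loader, fetch with an explicit timeout, e.g.:
        #   response = requests.get(source, timeout=30)
        return "remote"
    if Path(source).suffix in (".json", ".csv"):
        return "local-file"
    return "unknown"
```

With this ordering, remote sources never reach the `open(...)` call, and local paths behave exactly as before.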
🧹 Nitpick comments (2)
docs/test_load_dataset.py (1)
22-42: LGTM! The `__main__` block successfully makes this test runnable.

The addition allows developers to execute the test directly, which improves the demo/documentation experience. The approach of creating a dummy data file on the fly is appropriate for example code.
One optional improvement: consider using `json.dump()` instead of a raw string for better maintainability:

```python
import json

if __name__ == "__main__":
    if not os.path.exists(data_file_path):
        dummy_data = [
            {"prompt": "What is the capital of France?", "reference": "Paris"},
            {"prompt": "Who wrote 'Hamlet'?", "reference": "William Shakespeare"},
        ]
        with open(data_file_path, "w") as f:
            json.dump(dummy_data, f, indent=2)
        print(f"Created dummy {data_file_path} for testing.")
    test_load_dataset_from_json()
```

However, the current raw string approach is perfectly acceptable for documentation purposes and may even be more readable since the exact JSON structure is visible.
benchwise/datasets.py (1)
136-140: Prefer SHA-256 over MD5 for dataset hashes (even if non-crypto).
MD5 is collision-prone; using SHA-256 is basically free here and avoids raising eyebrows for downstream consumers.

```diff
-        return hashlib.md5(data_str.encode()).hexdigest()
+        return hashlib.sha256(data_str.encode()).hexdigest()
```
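For reference, a stable content hash along the suggested lines can be computed like this (a sketch; serializing with `sort_keys=True` is an added assumption that makes the hash independent of dict key ordering):

```python
import hashlib
import json


def compute_dataset_hash(data: list) -> str:
    """Hash a list of dataset items with SHA-256 over a canonical JSON dump."""
    data_str = json.dumps(data, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(data_str.encode()).hexdigest()
```

The canonical dump means two logically identical datasets always hash the same, which matters if the hash is used for caching or dedup.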
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
- `.github/workflows/ci.yml` (1 hunks)
- `.gitignore` (1 hunks)
- `benchwise/cli.py` (14 hunks)
- `benchwise/client.py` (35 hunks)
- `benchwise/core.py` (10 hunks)
- `benchwise/datasets.py` (14 hunks)
- `benchwise/results.py` (20 hunks)
- `benchwise/types.py` (1 hunks)
- `demo.py` (4 hunks)
- `docs/test_load_dataset.py` (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- .gitignore
- benchwise/types.py
🧰 Additional context used
🧬 Code graph analysis (4)
demo.py (3)
- benchwise/core.py (2): `evaluate` (39-95), `benchmark` (185-207)
- benchwise/datasets.py (1): `create_qa_dataset` (451-495)
- benchwise/metrics.py (2): `accuracy` (463-574), `semantic_similarity` (577-703)

benchwise/cli.py (2)
- benchwise/datasets.py (3): `convert_metadata_to_info` (714-761), `references` (163-177), `size` (142-144)
- benchwise/types.py (5): `ConfigKwargs` (641-646), `SyncArgs` (628-631), `StatusArgs` (634-638), `DatasetInfo` (326-338), `EvaluationMetadata` (378-393)

benchwise/results.py (1)
- benchwise/types.py (10): `DatasetInfo` (326-338), `EvaluationMetadata` (378-393), `EvaluationResultDict` (396-407), `BenchmarkSummary` (410-416), `BenchmarkResultDict` (419-426), `ModelComparisonResult` (436-447), `CrossBenchmarkComparison` (694-699), `ModelPerformanceAnalysis` (702-716), `CachedResultInfo` (675-682), `BenchmarkComparisonInfo` (685-691)

benchwise/core.py (3)
- benchwise/datasets.py (3): `Dataset` (106-331), `convert_metadata_to_info` (714-761), `size` (142-144)
- benchwise/results.py (2): `EvaluationResult` (25-94), `success` (52-54)
- benchwise/types.py (7): `RunnerConfig` (656-664), `ModelComparisonResult` (436-447), `EvaluationResultDict` (396-407), `EvaluationMetadata` (378-393), `DatasetInfo` (326-338), `CallableWithBenchmarkMetadata` (608-615), `evaluate` (594-598)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: test
🔇 Additional comments (4)
.github/workflows/ci.yml (1)
3-8: CI looks solid; verify repo paths/extras and consider pip caching.
- Please confirm `mypy.ini` is at repo root (since `--config-file=mypy.ini`) and `run_tests.py` is in the working directory the workflow runs from.
- If your tests live under `benchwise/tests` (or another path), `ruff check benchwise tests` / `ruff format --check benchwise tests` may miss or fail; worth confirming the directory names.
- Optional: add `actions/cache` (pip) to speed up repeated runs.

Also applies to: 21-36
demo.py (1)
30-42: Demo usage matches the `@evaluate` wrapper contract.

`await test_general_knowledge(qa_dataset)` returning an iterable of per-model results is consistent with the decorator behavior, and printing `result.result[...]` aligns with the `EvaluationResult` shape.

Also applies to: 45-57
benchwise/cli.py (1)
115-155: DatasetInfo conversion + upload fallback are clean and consistent.

The `convert_metadata_to_info(dataset.metadata)` usage (with `None` fallback) keeps evaluation results and uploads aligned with the new typed `DatasetInfo` model.

Also applies to: 243-309
benchwise/core.py (1)
39-183: `evaluate()` + `_run_evaluation()` integration looks consistent (metadata + DatasetInfo).

The `CallableWithBenchmarkMetadata` cast approach avoids the old `attr-defined` issue while keeping runtime behavior straightforward, and the `DatasetInfo` fallback for uploads is sensible.
Actionable comments posted: 1
🧹 Nitpick comments (2)
benchwise/datasets.py (2)
122-134: Consider simplifying the metadata initialization.

The nested `if` checks can be consolidated into a single block. The current approach casts an empty dict, then immediately checks if it's empty and casts again with defaults.

Apply this diff to simplify:

```diff
     def __post_init__(self) -> None:
-        if self.metadata is None:
-            self.metadata = cast(DatasetMetadata, {})
-
-        if not self.metadata:
-            self.metadata = cast(
-                DatasetMetadata,
-                {
-                    "size": len(self.data),
-                    "created_at": pd.Timestamp.now().isoformat(),
-                    "hash": self._compute_hash(),
-                },
-            )
+        if not self.metadata:
+            self.metadata = cast(
+                DatasetMetadata,
+                {
+                    "size": len(self.data),
+                    "created_at": pd.Timestamp.now().isoformat(),
+                    "hash": self._compute_hash(),
+                },
+            )
```
458-502: Consistent and well-structured dataset creation functions.

All three creation functions follow a uniform pattern with proper type annotations, explicit schemas, and appropriate metadata. The use of `cast()` aligns with the PR's typing improvements.

Minor note: the name extraction pattern (checking `isinstance(kwargs.get("name"), str)`) is repeated in all three functions. If you add more creation functions in the future, consider extracting this into a helper, but it's fine as-is for now.

Also applies to: 505-549, 552-597
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- `benchwise/datasets.py` (14 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
benchwise/datasets.py (1)
- benchwise/types.py (5): `DatasetItem` (285-301), `DatasetMetadata` (304-313), `DatasetSchema` (316-323), `DatasetDict` (349-355), `DatasetInfo` (326-338)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: test
🔇 Additional comments (6)
benchwise/datasets.py (6)
346-367: LGTM: DatasetDict name field is now respected.

The implementation now properly extracts and validates the `name` from the incoming `DatasetDict` before falling back to `kwargs`, addressing the previous review feedback.
362-367: Good use of centralized validation helpers.

Consistently applying the validation functions across all loading paths ensures uniform type checking and error messages.
721-768: LGTM: pragmatic approach to handling TypedDict limitations.

The cast to `Dict[str, Any]` at line 746 is a reasonable workaround for accessing fields like `"hash"`, `"task"`, and `"difficulty"` that may exist at runtime but aren't defined in the `DatasetMetadata` TypedDict. This function properly handles missing fields and provides type safety for the returned `DatasetInfo`.
321-329: Type narrowing approach is sound.

The explicit filtering at lines 322-324 helps mypy understand that `numeric_values` contains only `int` or `float` types, even though the `all()` check at line 320 already verified this. This is a reasonable tradeoff for strict type checking.
603-620: LGTM: proper return type annotations.

The DatasetRegistry methods now have explicit return type annotations, completing the type coverage for this class.
627-718: LGTM: consistent type annotations in sample loaders.

All three sample dataset functions now have proper type annotations with `cast()` calls, maintaining consistency with the rest of the module's typing improvements.
```python
def _validate_dataset_item(item: Any) -> DatasetItem:
    """
    Validate and convert a dictionary to DatasetItem.

    Args:
        item: Dictionary or any value to validate

    Returns:
        Validated DatasetItem

    Raises:
        ValueError: If item is not a dictionary
    """
    if not isinstance(item, dict):
        raise ValueError(f"Expected dict for DatasetItem, got {type(item).__name__}")
    return cast(DatasetItem, item)


def _validate_dataset_items(items: Any) -> List[DatasetItem]:
    """
    Validate and convert a list of dictionaries to List[DatasetItem].

    Args:
        items: List of dictionaries or any value to validate

    Returns:
        Validated List[DatasetItem]

    Raises:
        ValueError: If items is not a list or contains non-dict items
    """
    if not isinstance(items, list):
        raise ValueError(f"Expected list for dataset data, got {type(items).__name__}")

    validated_items: List[DatasetItem] = []
    for i, item in enumerate(items):
        if not isinstance(item, dict):
            raise ValueError(
                f"Expected dict for dataset item at index {i}, got {type(item).__name__}"
            )
        validated_items.append(cast(DatasetItem, item))

    return validated_items


def _validate_dataset_metadata(metadata: Any) -> Optional[DatasetMetadata]:
    """
    Validate and convert metadata to DatasetMetadata.

    Args:
        metadata: Dictionary or None to validate

    Returns:
        Validated DatasetMetadata or None
    """
    if metadata is None:
        return None

    if not isinstance(metadata, dict):
        raise ValueError(
            f"Expected dict or None for DatasetMetadata, got {type(metadata).__name__}"
        )

    return cast(DatasetMetadata, metadata)


def _validate_dataset_schema(schema: Any) -> Optional[DatasetSchema]:
    """
    Validate and convert schema to DatasetSchema.

    Args:
        schema: Dictionary or None to validate

    Returns:
        Validated DatasetSchema or None
    """
    if schema is None:
        return None

    if not isinstance(schema, dict):
        raise ValueError(
            f"Expected dict or None for DatasetSchema, got {type(schema).__name__}"
        )

    return cast(DatasetSchema, schema)
```
🧩 Analysis chain
🏁 Scripts executed (repository: Benchwise/benchwise; output lengths omitted):

```shell
# First, let's understand the file structure and imports
head -20 benchwise/datasets.py

# Search for where these validation functions are called
rg "_validate_dataset" --type python -A 3 -B 1

# Look at the TypedDict definitions to understand what structure is expected
rg "TypedDict|DatasetItem|DatasetMetadata|DatasetSchema" --type python -B 2 -A 5

# Check if there are any other validation patterns in the file
rg "isinstance|validate|check" benchwise/datasets.py -A 2

# Look at the PR discussion context - search for cast usage and related comments
rg "cast\(" benchwise/datasets.py -B 2 -A 2

# Check the types module to see TypedDict definitions
find . -name "types.py" -type f | head -5

# Look at the types definitions
cat benchwise/types.py | head -100

# Check what keys are actually accessed in Dataset methods
rg "self\.data\[|self\.metadata\[|self\.schema\[" benchwise/datasets.py -A 2

# Check if there are required fields in the TypedDict definitions
rg "class DatasetItem|class DatasetMetadata|class DatasetSchema|TypedDict" benchwise/types.py -A 10

# Look at the validate_schema method and how it accesses schema keys
rg -A 20 "def validate_schema" benchwise/datasets.py

# Check other methods that access metadata or schema keys
rg "self\.schema\[|self\.metadata\[" benchwise/datasets.py -B 2 -A 3

# Look at the stats method that accesses data[0].keys()
rg -A 30 "def get_stats" benchwise/datasets.py
```
Consider adding structural validation for safety-critical scenarios.
The validation functions only verify that inputs are dicts or lists, then use `cast()` without checking field structure. While this is intentional given that the TypedDict definitions use `total=False` (all fields optional), methods like `validate_schema()` that access specific keys should include field existence checks throughout. For data loaded from external sources (files, remote APIs), consider adding structural validation, either by checking for required fields in the validation functions or by adopting a schema validation library like Pydantic, to catch malformed data early and provide clearer error messages.
🤖 Prompt for AI Agents
In benchwise/datasets.py around lines 18-102 the helper validators only check
top-level types and then cast, which lets malformed or missing fields slip
through; update each validator to perform structural checks for the fields your
code later relies on (e.g., required keys and their expected primitive types),
returning the validated dict only if required keys exist and types match and
otherwise raising descriptive ValueError messages; alternatively replace these
simple functions with lightweight Pydantic models (or similar schema validators)
for DatasetItem, DatasetMetadata, and DatasetSchema and use model.parse_obj to
validate and produce clear errors for external data.
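A stdlib-only sketch of what such structural checks could look like (the required fields below are illustrative; benchwise's TypedDicts declare everything optional, so the actual required set is a design decision):

```python
from typing import Any, Dict, List, Tuple, Type

# Hypothetical required-field spec for a dataset item. Field names here are
# assumptions for illustration, not benchwise's real schema.
REQUIRED_ITEM_FIELDS: List[Tuple[str, Type[Any]]] = [
    ("prompt", str),
    ("reference", str),
]


def validate_item_structure(item: Any) -> Dict[str, Any]:
    """Check presence and primitive type of required keys before any cast()."""
    if not isinstance(item, dict):
        raise ValueError(f"Expected dict, got {type(item).__name__}")
    for key, expected in REQUIRED_ITEM_FIELDS:
        if key not in item:
            raise ValueError(f"Missing required field {key!r}")
        if not isinstance(item[key], expected):
            raise ValueError(
                f"Field {key!r} should be {expected.__name__}, "
                f"got {type(item[key]).__name__}"
            )
    return item
```

A Pydantic model would give the same guarantees with less code, at the cost of a new runtime dependency.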
…d enhance type safety by removing unused imports and organizing dependencies
Actionable comments posted: 0
🧹 Nitpick comments (3)
benchwise/config.py (1)
414-419: Remove redundant `bool()` wrapper.

The comparison `response.status_code == 200` already returns a boolean, so wrapping it with `bool()` is unnecessary.

Apply this diff:

```diff
 async def check_connection() -> bool:
     async with httpx.AsyncClient(timeout=5.0) as client:
         response = await client.get(f"{config.api_url}/health")
-        return bool(response.status_code == 200)
+        return response.status_code == 200
```

benchwise/datasets.py (1)
947-947: Use `Set[str]` from typing for consistency.

The file uses typed imports like `List`, `Dict`, `Optional` from the `typing` module throughout, but here uses the lowercase `set[str]` syntax (Python 3.9+). For consistency with the rest of the codebase, use `Set[str]` from typing.

Apply this diff:

```diff
+from typing import List, Dict, Any, Optional, Union, Callable, cast, Set
 ...
-    important_ref_words: set[str] = set()
+    important_ref_words: Set[str] = set()
```

benchwise/client.py (1)
186-195: Remove redundant `bool()` wrapper.

The comparison `response.status_code == 200` already returns a boolean, so wrapping it with `bool()` is unnecessary.

Apply this diff:

```diff
     async def health_check(self) -> bool:
         """Check if the Benchwise API is available."""
         try:
             response = await self.client.get("/health", timeout=5.0)
-            is_healthy = bool(response.status_code == 200)
+            is_healthy = response.status_code == 200
             logger.info(f"Health check: {'healthy' if is_healthy else 'unhealthy'}")
             return is_healthy
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- `benchwise/cli.py` (14 hunks)
- `benchwise/client.py` (35 hunks)
- `benchwise/config.py` (15 hunks)
- `benchwise/datasets.py` (12 hunks)
- `benchwise/metrics.py` (34 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
benchwise/config.py (1)
- benchwise/types.py (1): `ConfigDict` (359-374)
benchwise/cli.py (3)
- benchwise/datasets.py (3): `convert_metadata_to_info` (718-765), `references` (164-178), `size` (143-145)
- benchwise/client.py (1): `upload_results` (992-1067)
- benchwise/types.py (6): `ConfigureArgs` (618-625), `ConfigKwargs` (641-646), `SyncArgs` (628-631), `StatusArgs` (634-638), `DatasetInfo` (326-338), `EvaluationMetadata` (378-393)
benchwise/datasets.py (1)
- benchwise/types.py (5): `DatasetItem` (285-301), `DatasetMetadata` (304-313), `DatasetSchema` (316-323), `DatasetDict` (349-355), `DatasetInfo` (326-338)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: test