Merged
…ges, and IO drift tracking
- Add _summarize_compare_diff() to extract compare_delta, step_summary, taxonomy_summary, budget_badges, io_step_count, and changed_step_count from compare_diff payload
- Add action_changed, result_changed, and has_io_drift flags to compare_step_summary entries
- Add compare_taxonomy_summary with failure_type and termination_reason comparison rows
- Add compare_budget_badges with steps/tool_calls/wall_clock_
…y, budget, and step data structures
- Replace direct compare_diff.taxonomy access with compare_taxonomy_summary loop for failure_type and termination_reason rows
- Replace compare_diff.budget_delta with compare_budget_badges loop using badge.kind, badge.label, badge.value, and badge.suffix
- Add compare_changed_step_count and compare_io_step_count metric cards above IO audit pills
- Add action_changed and has_io_drift columns to compare
…ixed options
- Add _filter_compare_step_summary() to filter compare steps by action_changed, result_changed, has_io_drift, or mixed (2+ flags)
- Add compare_drift form parameter with all/action/result/io/mixed dropdown options
- Add compare_filters dict to template context with drift selection
- Add compare_step_summary_total to show filtered vs total step counts
- Add filter status text above delta view table showing active filter
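The drift filter described in this commit can be sketched as follows. This is a minimal illustration, not the code from the PR: the flag names and the all/action/result/io/mixed options come from the commit message, while the dict-based step shape, the function name without the leading underscore, and the handling of unknown filter values are assumptions.

```python
from typing import Any

# Flag names taken from the commit message.
DRIFT_FLAGS = ("action_changed", "result_changed", "has_io_drift")

# Assumed mapping from dropdown option to the flag it selects.
SINGLE_FLAG_FILTERS = {
    "action": "action_changed",
    "result": "result_changed",
    "io": "has_io_drift",
}


def filter_compare_step_summary(
    steps: list[dict[str, Any]], drift: str = "all"
) -> list[dict[str, Any]]:
    """Return the steps that match the selected drift filter."""
    if drift in SINGLE_FLAG_FILTERS:
        flag = SINGLE_FLAG_FILTERS[drift]
        return [step for step in steps if step.get(flag)]
    if drift == "mixed":
        # "mixed" keeps steps where two or more drift flags are set.
        return [
            step
            for step in steps
            if sum(bool(step.get(flag)) for flag in DRIFT_FLAGS) >= 2
        ]
    # "all" (and, by assumption, any unrecognized value) applies no filter.
    return list(steps)


example_steps = [
    {"step": 1, "action_changed": True, "result_changed": False, "has_io_drift": False},
    {"step": 2, "action_changed": True, "result_changed": True, "has_io_drift": False},
    {"step": 3, "action_changed": False, "result_changed": False, "has_io_drift": True},
]
mixed_steps = filter_compare_step_summary(example_steps, "mixed")
```

Keeping the unfiltered list alongside the filtered one is what would make a "filtered vs total" count such as compare_step_summary_total cheap to display.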
…are form with workflow guidance
- Add compare-run-datalist with run_id, agent, task_ref, and seed options from recent_runs
- Add list="compare-run-datalist" attribute to run_a and run_b inputs for autocomplete
- Add "Use recent for A/B" button rows with first 3 recent runs for one-click population
- Add workflow tip text encouraging users to start from recent runs and refine with drift filter
- Add test_compare_route_renders_recent_
…late matching agent/task pairs from recent runs
- Add _suggest_compare_inputs() to find most recent run pair with matching agent and task_ref from recent_runs list
- Fall back to first two runs if no matching pair found, or empty strings if fewer than 2 runs available
- Add compare_suggestions to template context and pre-populate compare_inputs when user hasn't provided values
- Add suggestion hint text in compare form showing
…d compare tab to main dashboard
- Add seed matching to _suggest_compare_inputs() to prefer run pairs with identical agent/task_ref/seed before falling back to agent/task_ref-only matches
- Extract _load_compare_diff() helper to deduplicate diff loading logic between index and compare_runs routes
- Add compare_a, compare_b, compare_drift, and active_tab query params to index route
- Add compare_diff, compare_error, compare_inputs, compare
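The pair-suggestion order described in these commits (agent/task_ref/seed match first, then agent/task_ref, then the first two runs, then empty strings) might look like the sketch below. The run record shape, the newest-first ordering of recent_runs, the run_a/run_b return keys, and the rule for picking "the most recent pair" are all assumptions made for illustration.

```python
from typing import Any, Callable, Hashable, Optional

Run = dict[str, Any]


def _first_matching_pair(
    runs: list[Run], key_fn: Callable[[Run], Hashable]
) -> Optional[tuple[Run, Run]]:
    """Scan runs in order and return the first two that share the same key."""
    seen: dict[Hashable, Run] = {}
    for run in runs:
        key = key_fn(run)
        if key in seen:
            return seen[key], run
        seen[key] = run
    return None


def suggest_compare_inputs(recent_runs: list[Run]) -> dict[str, str]:
    """Suggest two run ids to compare; recent_runs is assumed newest first."""
    if len(recent_runs) < 2:
        return {"run_a": "", "run_b": ""}

    pair = (
        # Strictest match first: same agent, task_ref, and seed.
        _first_matching_pair(
            recent_runs, lambda r: (r.get("agent"), r.get("task_ref"), r.get("seed"))
        )
        # Then relax to agent and task_ref only.
        or _first_matching_pair(
            recent_runs, lambda r: (r.get("agent"), r.get("task_ref"))
        )
        # Finally fall back to the two most recent runs.
        or (recent_runs[0], recent_runs[1])
    )
    return {"run_a": pair[0]["run_id"], "run_b": pair[1]["run_id"]}


example_runs = [
    {"run_id": "r4", "agent": "a", "task_ref": "t1", "seed": 1},
    {"run_id": "r3", "agent": "a", "task_ref": "t1", "seed": 2},
    {"run_id": "r2", "agent": "b", "task_ref": "t1", "seed": 1},
    {"run_id": "r1", "agent": "a", "task_ref": "t1", "seed": 1},
]
suggestion = suggest_compare_inputs(example_runs)
```

Here r4 and r1 share agent, task_ref, and seed, so they are suggested ahead of the more recent but seed-mismatched r3.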
…re list
- Add `.pytest_tm` to ignored pytest temporary directories
- Add `assets/*` directory to gitignore
- Fix typo in comment: "notest" → "notes"
…th expanded system_info fields
- Add lineage object with parent_run_id and baseline_run_id to artifact schema for experiment ancestry tracking
- Expand system_info with machine, processor, and cpu_count fields beyond existing platform/python
- Add compare_runs.py CLI tool to diff two wrapper run artifacts with metric delta calculation
- Support explicit run_id args or auto-select two most recent runs from --runs-dir
- Display outcome
…, outcome filtering, and cleanup safety guard
- Add validate_artifacts.py CLI tool to check artifact.json schema compliance with required fields and type validation
- Add --baseline-metric flag to compare_runs.py to override artifact baseline values and show delta vs baseline for both runs
- Add --include-outcomes filter to summarize_runs.py to show only runs matching specified outcome values
- Add --allow-delete-newest safety
…ration and seed field to artifact schema
- Add report_runs.py to generate markdown tables of recent runs sorted by metric (asc) or time (desc) with --limit, --sort, and --output flags
- Add --seed argument to run_wrapper.py to record optional seed value in artifact.json for reproducibility tracking
- Add seed field to artifact schema alongside lineage object
- Fix cpu_count collection to use os.cpu_count() instead of platform.cpu_count()
- Add
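The cpu_count fix makes sense because the standard-library platform module has no cpu_count() function, while os.cpu_count() exists and may return None when the count cannot be determined. A minimal sketch of collecting the system_info fields named in these commits follows; the function name and the exact values stored under the platform and python keys are assumptions.

```python
import os
import platform
from typing import Any


def collect_system_info() -> dict[str, Any]:
    """Gather host details for an artifact's system_info block (assumed shape)."""
    return {
        "platform": platform.platform(),
        "python": platform.python_version(),
        "machine": platform.machine(),
        "processor": platform.processor(),
        # os.cpu_count() returns an int, or None if it cannot be determined.
        "cpu_count": os.cpu_count(),
    }


system_info = collect_system_info()
```

Consumers of the artifact should tolerate a null cpu_count, since the value is optional by construction.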
…with autonomous research loop benefits
- Add progress.md documenting wrapper completion with shim runs, artifact validation, comparison/report tooling, and pytest coverage
- Document 5 shim runs captured with val_bpb 1.05–1.30 range and best run showing success_improved vs 1.5 baseline
- List completed CLI tools: run_wrapper, summarize_runs, compare_runs, report_runs, cleanup_runs, validate_artifacts
- Add usage examples for
… commands to progress and usage docs
- Add CPU laptop example experiment section to progress.md with 5 PowerShell one-liners for running shim variants and generating reports
- Add matching example to USAGE.md with same commands and expected output shape
- Document expected deterministic metrics ~1.11 and 1.30 vs baseline 1.50 with deltas ≈ -0.39 and -0.20
- Add expected behavior notes for summarize_runs outcomes/deltas, compare_runs metric
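The documented deltas follow directly from subtracting the baseline from each run's metric. The sketch below reproduces that arithmetic; the helper name is hypothetical and the two-decimal rounding is an assumption, not taken from the wrapper tools.

```python
def delta_vs_baseline(metric: float, baseline: float) -> float:
    """Return metric minus baseline, rounded to two decimals (assumed precision)."""
    return round(metric - baseline, 2)


BASELINE = 1.50
# Negative deltas mean the run scored below the baseline.
deltas = {metric: delta_vs_baseline(metric, BASELINE) for metric in (1.11, 1.30)}
```

With the values from the commit message, 1.11 gives a delta of about -0.39 and 1.30 gives about -0.20.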
…im run guidance
- Add bullet point referencing CPU-only example workflow: run shim twice with different metrics vs baseline, then use summarize/compare/report tools
- Point to wrapper/USAGE.md for specific commands and expected delta values (1.11 and 1.30 vs 1.50 baseline)
Summary
Testing
python -m pytest
python -m ruff check agent_bench
Checklist