Agent filtering and cli unification #19

Open
sdubagun-amd wants to merge 76 commits into dev from agent-filtering-and-cli-unification

Conversation

@sdubagun-amd

Latest changes include unit test discovery and heterogeneous parallel agents with task creation and execution.

Umangatamd and others added 30 commits February 9, 2026 11:38
Decoupled additions (copied as-is from MSA):
- mcp_tools/: 6 MCP servers (automated-test-discovery, kernel-evolve,
  kernel-ercs, kernel-profiler, metrix-mcp, openevolve-mcp) + mcp-client
- Dockerfile, entrypoint.sh, scripts/run-docker.sh
- runtime_env.py (local/Docker auto-detection)
- optimizer/ (unified OpenEvolve + Autotune interface)
- benchmark.py (standardized benchmarking framework)
- kernel_profile.py (GPU profiling CLI)
- mcp_tools/metrix.py (AMD Metrix API tool)
- reference/ (50+ GPU optimization strategies database + state machine)
- test_suite/ (10-kernel AITER regression suite)
- examples/add_kernel/
- docs: DISCOVERY_PIPELINE.md, METRIX_TOOL.md, GETTING_STARTED.md,
  RUNTIME_ENV.md, RUNTIME_QUICKSTART.md

Integrated changes (best-of-both-worlds):
- Test discovery: MSA's content-based pipeline runs first (fast, free),
  results fed into v3's UnitTestAgent as context. Subagent always runs
  but starts informed rather than exploring from scratch.
- mini.py: Added --runtime, --docker-image, --workspace CLI flags
- pyproject.toml: Added geak/kernel-profile scripts, mcp[cli] dep
- README.md: Added MCP servers, Docker, architecture sections

All geakagent -> minisweagent import references fixed in ported files.
Analysis/comparison docs moved to ~/geak_analysis_docs/ (not needed in repo).
Cherry-picked 6 geak_v3_features commits (167fc13..18853fb):
- Model refactor: amd_base, amd_claude, amd_openai, amd_gemini
- Tool-call message protocol in default.py
- Trajectory saving, test_profiling_tool.py deletion
- Parallel worktree fixes, unit test prompt

Ported 15 MSA post-squash-merge commits (path-translated geakagent -> minisweagent):
- baseline_metrics.py, protected_files.py (new)
- resolve_kernel_url.py, test discovery injection (new)
- OpenEvolve COMMANDMENT-based evaluation refactor
- kernel_profile: remove --filter, always profile all kernels
- optimizer/core.py: new MCP API (gpu, output_dir, commandment_path)
- mini.py: --kernel-url flag, discovery injection, INSTRUCTIONS.md loading
- default.py: summary_on_cost_limit feature
- Dockerfile: OpenEvolve installation
- openevolve-mcp/server.py: refactored

Doc rejects (README, METRIX_TOOL, RUNTIME_QUICKSTART) deferred to cleanup phase.

Co-authored-by: Cursor <cursoragent@cursor.com>
Replace double quotes inside f-string expressions with single quotes.
Python 3.10 does not support reusing the outer quote character inside
f-string braces (PEP 701 landed in 3.12).

Co-authored-by: Cursor <cursoragent@cursor.com>
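The incompatibility can be reproduced in a few lines (the dict here is purely illustrative):

```python
# On Python 3.10/3.11, an f-string may not reuse the outer quote character
# inside its braces; PEP 701, which lifts the restriction, landed in 3.12.
d = {"key": "value"}

# Broken before 3.12:  f"{d["key"]}"  -> SyntaxError
ok = f"{d['key']}"  # single quotes inside the braces work on every version
print(ok)
```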
New MCP server `profiler-mcp` wraps both profiling backends behind a
single `profile_kernel` tool with a `backend` parameter:
- backend="metrix": AMD Metrix API (structured JSON, bottleneck classification)
- backend="rocprof-compute": rocprof-compute CLI (deep roofline, instruction mix)

Files:
- mcp_tools/profiler-mcp/src/profiler_mcp/server.py - unified MCP server
- mcp_tools/profiler-mcp/tests/test_profiler_unit.py - 14 mock-based tests
- mcp_tools/profiler-mcp/tests/test_profiler_integration.py - 4 GPU tests
- mcp_tools/profiler-mcp/examples/profile_kernel.py - Python API example

All tests pass (14 unit, 4 integration on MI300X). Ruff-clean.

Co-authored-by: Cursor <cursoragent@cursor.com>
…(Phase 4)

Co-authored-by: Cursor <cursoragent@cursor.com>
…Worker (Phase 5)

Co-authored-by: Cursor <cursoragent@cursor.com>
- Delete geak_agent/ legacy package; move resolve_kernel_url to src/minisweagent/tools/
- Deduplicate MetrixTool (delete src copy, keep mcp_tools/metrix-mcp canonical)
- Delete 66 inherited mini-swe-agent docs + mkdocs.yml + assets
- Delete 23 dead tests (missing upstream modules), fix 3 remaining
- Rename test_suite/ -> eval_suite/, reference/ -> knowledge_base/
- Consolidate examples under examples/
- Fix all stale geakagent/sdubagun/yueliu14 references
- Remove 16 ghost __pycache__-only directories
- Deprecate mcp_tools/kernel-profiler/ (superseded by profiler-mcp)
- Remove mcp_tools/kernel-from-url-mcp/ (dead, only __pycache__)
- Remove mkdocs dependencies from pyproject.toml
- Update git remote to AMD-AGI/GEAK.git

Tests: 96 passed, 58 skipped, 1 xfailed, 0 failures
Co-authored-by: Cursor <cursoragent@cursor.com>
…plumbing tests

- Refactor MCPToolBridge to use a single persistent asyncio event loop on a
  background daemon thread per instance, resolving "Future attached to a
  different loop" errors that broke MCP server calls during e2e runs.

- Fix discovery scoping for --kernel-url flows: when the kernel lives inside
  a .geak_resolved clone, scope both discovery calls (in mini.py and
  unit_test_agent.py) to the clone root instead of the entire workspace.
  Add a hard boundary in discovery._expand_workspace_for_file to prevent
  walking above .geak_resolved.

- Export RESOLVED_DIR_NAME constant and add find_resolved_clone_root() helper
  in resolve_kernel_url_impl.py to couple the directory convention cleanly.

- Fix mini.py second discovery call: prioritize _resolved_kernel_path (from
  --kernel-url resolution) when determining _kernel_path, instead of falling
  through to None when --task is not provided.

- Add new test files: test_discovery_scope.py (scope boundary + mini.py
  wiring), test_e2e_pipeline_smoke.py, test_mcp_server_smoke.py,
  test_plumbing_contracts.py, test_toolruntime_dispatch.py, and extend
  test_mcp_bridge.py with event loop lifecycle tests.

- Apply ruff formatting fixes across modified files.
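The persistent-loop pattern behind the first fix can be sketched as follows (the class name is hypothetical; the point is that every coroutine is submitted to the same long-lived loop, so no future is ever attached to a different one):

```python
import asyncio
import threading

class PersistentLoopBridge:
    """One long-lived event loop on a background daemon thread; all async
    calls from any thread are routed to it via run_coroutine_threadsafe."""

    def __init__(self):
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def call(self, coro, timeout=30):
        # Submit from any caller thread; block until the coroutine finishes.
        future = asyncio.run_coroutine_threadsafe(coro, self._loop)
        return future.result(timeout)

    def close(self):
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
```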
…nt instructions

- File-based MCP transport for large profiler results; bump StreamReader limit to 16MB
- Auto-detect num_parallel from gpu_ids when not explicitly set
- Externalize discovery patterns into discovery_defaults.toml with per-project overrides
- Detect Triton wrapper files; fall back to kernel_file stem for test/bench matching
- Strengthen test_perf mandatory usage in agent prompts
- Fix default task to defer to INSTRUCTIONS.md instead of banning OpenEvolve
- Add confirm_exit to base AgentConfig for --exit-immediately support
- Extend discovery to detect HIP/CK/ASM kernels, trace cross-language
  call chains (Python→torch.ops→pybind11→.cu), and build dependency
  graphs with fusion opportunity detection
- Add GPU pool scheduler: M tasks on N GPUs with dynamic slot assignment
  via ThreadPoolExecutor and thread-safe GPU queue
- Add dynamic task planner generating language-aware optimization tasks
  (OpenEvolve, CK template tuning, HIP launch config, fusion, etc.)
- Add COMMANDMENT.md validation (required sections, shell built-in
  detection) with auto-validation hook in str_replace_editor
- Fix baseline_metrics NaN/inf sanitization and post-write JSON roundtrip
- Update SelectPatchAgent to handle task_*/parallel_* directories and
  prefer per-kernel latency metrics
- Consolidate duplicated extension lists into shared constants
- Add 25 unit tests for validate_commandment and task_planner

Co-authored-by: Cursor <cursoragent@cursor.com>
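The GPU pool scheduler described above (M tasks on N GPUs with dynamic slot assignment) reduces to a small pattern; this sketch uses hypothetical names and stands in for the real implementation:

```python
import queue
from concurrent.futures import ThreadPoolExecutor

def run_on_gpu_pool(tasks, gpu_ids):
    """Schedule M tasks on N GPUs: each worker borrows a GPU slot from a
    thread-safe queue, runs its task pinned to that GPU, then returns it."""
    gpu_queue = queue.Queue()
    for gpu in gpu_ids:
        gpu_queue.put(gpu)

    def worker(task):
        gpu = gpu_queue.get()      # blocks until a GPU slot frees up
        try:
            return task(gpu)       # the task runs pinned to this GPU
        finally:
            gpu_queue.put(gpu)     # return the slot to the pool

    with ThreadPoolExecutor(max_workers=len(gpu_ids)) as pool:
        return list(pool.map(worker, tasks))
```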
- Add GPU/profiler rules to task planner and strategy yaml to prevent
  agents from using inline env vars or HIP_VISIBLE_DEVICES prefixes
- Detect inline env var prefixes (VAR=val cmd) in COMMANDMENT validator
- Add COMMANDMENT.md validation hook to bash tool (agents bypass editor)
- Update INSTRUCTIONS.md with anti-patterns and wrapper script template
- Add 4 new tests for inline env var detection

Co-authored-by: Cursor <cursoragent@cursor.com>
Discovery was broken when given a directory instead of a single file:
- MCP server crashed on read_text() for directories, _expand_workspace
  started from parent instead of the directory itself
- mini.py missed .git inside the directory (kp.parents excludes kp)
- DiscoveryPipeline.run() skipped workspace expansion for directories

Fixes:
- Add directory mode to MCP discover() with recursive kernel scanning
- Fix _expand_workspace to check the path itself when it's a directory
- Add _expand_workspace_for_dir() that scopes workspace to the given
  directory, preventing unrelated sibling files from polluting results
- Use parent directory name as kernel_name when file has a generic name
  (kernel.py, main.py) so test-name matching works properly
- Fix mini.py to check kp itself for .git before walking kp.parents

Co-authored-by: Cursor <cursoragent@cursor.com>
- Eliminate double discovery: mini.py reuses stashed _run_discovery._last_result
  instead of calling run_discovery_pipeline() a second time
- Enrich discovery context: new format_discovery_for_agent() includes kernel
  analysis, language-specific testing guidance (triton/hip/ck/asm), and
  extracted test patterns (tolerances, shapes, dtypes, imports)
- Extract test patterns: _extract_test_patterns() in discovery.py pulls
  atol/rtol, input shapes, dtypes, reference impls, and import patterns from
  top-confidence test files
- Upgrade UnitTestAgent to TestHarnessAgent: creates a fixed test harness
  with --correctness/--profile/--benchmark modes. Reads INSTRUCTIONS.md for
  harness rules. The harness is an immutable evaluation contract.
- Update INSTRUCTIONS.md: section 1a references pre-scanned discovery results
  (no re-discovery needed), section 1b notes pre-built harness from UTA

Co-authored-by: Cursor <cursoragent@cursor.com>
Umangatamd and others added 30 commits February 19, 2026 01:45
Fix task pipeline: agent cwd, config conflicts, and task execution
resolve_kernel_url stored local_repo_path as a relative path while
local_file_path was absolute. The parallel agent resolved the relative
path against the task file directory, producing a doubled nonsense path
that didn't exist. Now all three layers ensure absolute paths: the
source (resolve_kernel_url_impl), the orchestrator context loader, and
the dispatch batch runner.

Co-authored-by: Cursor <cursoragent@cursor.com>
The Full Pipeline Mode (preprocessor → orchestrator) was skipping the
UnitTestAgent, relying on a single-shot LLM finisher in the MCP
discovery server for harness creation. That approach consistently failed
because a single LLM call can't reliably generate correct test harnesses
(wrong tensor shapes, wrong tolerances, wrong imports).

The UnitTestAgent is a multi-turn agent with bash/editor tools that can
read the kernel, read existing tests, run them, see errors, and iterate
until the harness works. It was already built for this purpose but
wasn't wired into the new pipeline.

Changes:
- preprocessor.py: Add model/model_factory params to run_preprocessor().
  After MCP discovery (Step 2), run UnitTestAgent (Step 2b) with
  discovery context to create a validated harness. Extract absolute path
  to the harness script for the profiler. Fall back to raw discovery
  test command if UnitTestAgent fails.
- mini.py: Pass model and model_factory to run_preprocessor().

Tested on ROCm/aiter RoPE kernel: UnitTestAgent creates a working
harness, profiling succeeds (48.44 us baseline), orchestrator generates
tasks, optimization agent produces 18+ patches with ~13% speedup.

Co-authored-by: Cursor <cursoragent@cursor.com>
Wire UnitTestAgent into Full Pipeline Mode preprocessor
…ntext passing

Co-authored-by: Cursor <cursoragent@cursor.com>
- Fix GPU isolation: propagate HIP_VISIBLE_DEVICES through BashCommand,
  MCPToolBridge, ProfilingAnalyzer, and OpenEvolve subprocess env.
  Prevent shallow-copy race in ParallelAgent by creating new env dicts
  per thread. Add defensive copy in ToolRuntime.set_env().

- GPU-aware task generation: extend AgentTask with num_gpus, teach
  task-generator LLM to allocate GPUs per task, ParallelAgent acquires
  N GPU slots from pool for multi-GPU tasks (e.g. OpenEvolve).

- Docker: remove hardcoded HIP_VISIBLE_DEVICES=0 from Dockerfile,
  unset it in entrypoint.sh so geak --gpu-ids controls isolation.

- Fix profiler integration tests: add __main__ to examples/add_kernel
  so rocprofv3 captures GPU activity, fix MetrixTool empty
  HIP_VISIBLE_DEVICES handling, update test assertions to match
  add_kernel (not rope), mark rocprof-compute roofline as xfail.

- Add developer docs: gpu-isolation.md (invariants, how-to, pitfalls),
  update architecture/flow/tools diagrams with SweAgent, codebase
  context passing chain, multi-GPU dispatch, and --gpu-ids flags.
  Remove redundant diagrams.md.

Co-authored-by: Cursor <cursoragent@cursor.com>
- Fix 1: profiler-mcp no longer mutates os.environ; passes clean env
  via _env_override to ProfilingAnalyzer subprocess instead.
- Fix 2: Centralize agent-type ↔ class mappings into agent_spec.py
  (_agent_type_to_class / _agent_class_to_type) eliminating 4 duplicate
  definitions across dispatch, orchestrator, task_generator, task_runner.
- Fix 3: Replace silent `except Exception: pass` in
  OpenEvolveWorker._save_result_artifacts with logger.warning().
- Fix 4: Add public set_tools() to AmdLlmModelBase and AmdLlmModel
  router; SweAgent and task_generator use it instead of reaching into
  model._impl.
- Fix 5: Remove duplicate `cfg: dict` type annotation in dispatch.py
  else-branch.
- Fix 6: Harden _derive_test_command_from_commandment to support
  fenced code blocks, add fallback for raw .py commands, and log
  debug messages on parse outcomes.

Co-authored-by: Cursor <cursoragent@cursor.com>
The previous _env_override approach didn't actually remove the empty
key from the subprocess env (dict merge brings it back from os.environ).
Switch to save/restore of os.environ, which is safe here because
profiler-mcp runs as a dedicated single-threaded MCP server process.

Co-authored-by: Cursor <cursoragent@cursor.com>
SweAgent and OpenEvolve fixes, plus context passing
The test harness had no control over how many shapes were used for
profiling vs testing, causing OOM during GPU profiling.

Changes:
- Add select_shapes_uniform() utility in discovery.py for programmatic
  shape selection (dedup, sort by element count, uniform sampling)
- UnitTestAgent system prompt now instructs the LLM to read discovered
  test files, extract ALL shapes (variables, loops, configs — not just
  literal tuples), and build two lists:
  HARNESS_SHAPES (20-25) for correctness/benchmark
  PROFILE_SHAPES (5) for --profile mode only
- format_discovery_for_agent() cleaned up: passes all extracted patterns
  without truncation so the LLM has full shape context

Co-authored-by: Cursor <cursoragent@cursor.com>
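A minimal sketch of the selection utility described above (dedup, sort by element count, uniform sampling); it also folds in the later count == 1 fix, returning the median shape instead of dividing by zero:

```python
import math

def select_shapes_uniform(shapes, count):
    """Pick `count` shapes evenly spread across the deduplicated list,
    ordered by element count. Illustrative, not the repo's exact code."""
    unique = sorted(set(shapes), key=math.prod)
    if count <= 0 or not unique:
        return []
    if count == 1:
        return [unique[len(unique) // 2]]   # median shape, avoids /0 below
    if count >= len(unique):
        return unique
    step = (len(unique) - 1) / (count - 1)  # safe: count >= 2 here
    return [unique[round(i * step)] for i in range(count)]
```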
The harness now supports four CLI modes with distinct shape sets:
  --profile        → PROFILE_SHAPES (5)
  --benchmark      → HARNESS_SHAPES (20-25 sampled)
  --correctness    → HARNESS_SHAPES
  --full-benchmark → ALL_SHAPES (every discovered shape)

--full-benchmark runs all discovered shapes and is intended for use
only at the start and end of optimization to get the complete picture.
--benchmark uses the sampled subset for fast iteration loops.
If ALL_SHAPES has ≤25 entries, HARNESS_SHAPES = ALL_SHAPES and both
benchmark modes behave identically.

Updated INSTRUCTIONS.md and UTA system prompt accordingly.

Co-authored-by: Cursor <cursoragent@cursor.com>
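The four-mode CLI can be sketched with a mutually exclusive argument group; the flag names come from the commit message, but the parser itself is an assumption about the harness's shape:

```python
import argparse

def build_harness_parser():
    """Illustrative parser for the harness's four modes."""
    p = argparse.ArgumentParser(description="kernel test harness")
    mode = p.add_mutually_exclusive_group(required=True)
    mode.add_argument("--profile", action="store_true")         # PROFILE_SHAPES (5)
    mode.add_argument("--benchmark", action="store_true")       # HARNESS_SHAPES (20-25)
    mode.add_argument("--correctness", action="store_true")     # HARNESS_SHAPES
    mode.add_argument("--full-benchmark", action="store_true")  # ALL_SHAPES
    return p
```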
The baseline must record BOTH --benchmark (reduced, 20-25 shapes) and
--full-benchmark (all shapes) results. During iterations the agent
compares reduced vs reduced; at the end it compares full vs full.
Mixing modes in a comparison produces meaningless speedup numbers
because the shape sets differ.

Co-authored-by: Cursor <cursoragent@cursor.com>
Fix OOM in profiling: LLM-driven shape extraction from discovery
The uniform index calculation divides by (count-1), which crashes when
count=1. Add early returns for count<=0 (empty) and count==1 (median
shape).

Co-authored-by: Cursor <cursoragent@cursor.com>
Agent filtering:
- GEAK_ALLOWED_AGENTS / GEAK_EXCLUDED_AGENTS env vars with CLI flags
- Prompt-level enforcement via system prompt addendum in task_generator
- Parse-time safety-net filter in dispatch and task_generator
- Default fallback agent: swe_agent
- Accept **_extra kwargs in orchestrator tool functions
- Handle task_files arriving as JSON string in dispatch_tasks

CLI/pipeline unification:
- run-tasks delegates to dispatch.task_file_to_agent_task and run_task_batch
- Publicize dispatch.task_file_to_agent_task as canonical entry point
- Add codebase-context standalone CLI entry point
- Update tools.md diagram with agent type nodes
- Fix preprocessor CLI to pass model_factory for UnitTestAgent

Tests:
- test_agent_filtering: 17 cases for filtering logic and prompt injection
- test_tool_consistency: structural checks for CLI/pipeline alignment

Co-authored-by: Cursor <cursoragent@cursor.com>
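The safety-net filter can be sketched as below; the env var names and the swe_agent fallback come from this PR, but how allow/exclude combine is an assumption:

```python
import os

DEFAULT_AGENT = "swe_agent"

def filter_agents(candidates):
    """Keep agents in GEAK_ALLOWED_AGENTS (if set), drop those in
    GEAK_EXCLUDED_AGENTS, and fall back to swe_agent if nothing survives."""
    allowed = {a.strip() for a in os.environ.get("GEAK_ALLOWED_AGENTS", "").split(",") if a.strip()}
    excluded = {a.strip() for a in os.environ.get("GEAK_EXCLUDED_AGENTS", "").split(",") if a.strip()}
    kept = [a for a in candidates
            if (not allowed or a in allowed) and a not in excluded]
    return kept or [DEFAULT_AGENT]   # default fallback agent
```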
…code

- Add pipeline_helpers.py: centralize harness creation/validation, baseline
  profiling, context injection, model loading, and agent filtering across
  all CLI entry points (geak, geak-orchestrate, run-tasks, task-generator)
- Add discovery_types.py: shared DiscoveryResult/KernelInfo dataclasses
  with from_dict() factory and kernel language inference
- Flow per-kernel metrics (duration, pct_of_total, bottleneck) from
  baseline_metrics.json through inject_pipeline_context to all agents
- Fix config priority to CLI > Prompt > YAML (prevents LLM prompt from
  overriding explicit CLI flags like --gpu-ids)
- Fix COMMANDMENT path resolution in task file writing (relative_to)
- Add backend-agnostic warmup to profiler-mcp
- Add harness static validation for --profile/--correctness flags
- Integrate mini-swe-agent tools into orchestrator (bash, str_replace_editor,
  profile_kernel, strategy_manager)
- Remove deprecated kernel-profiler MCP, discovery built-in tool, and
  discovery_defaults.toml
- Update architecture, flow, and tools documentation

Co-authored-by: Cursor <cursoragent@cursor.com>
- Add _normalize_command() to profiler-mcp: wraps commands containing
  shell constructs (cd, $VAR, &&, |) in bash -c so rocprofv3's
  os.execvpe can execute them correctly
- Fix all MCP test suites (profiler-mcp, kernel-evolve, openevolve-mcp,
  kernel-ercs) broken by fastmcp API change: replace _tool_manager._tools
  with asyncio.run(mcp.list_tools()) and .fn() with direct function calls

Co-authored-by: Cursor <cursoragent@cursor.com>
…llocations

- Extend _normalize_command to detect VAR=value prefix patterns (e.g.
  HIP_VISIBLE_DEVICES=4 python3 ...) that crash rocprofv3's execvpe
- Add static check in validate_harness for torch.randn(..., device='cuda')
  inside run_profile, which pollutes profiler traces with RNG/memset kernels

Co-authored-by: Cursor <cursoragent@cursor.com>
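The normalization these two commits describe can be sketched as follows; the detection heuristics are assumptions, but the idea is the one stated above: wrap anything with shell syntax (pipes, &&, $VAR, cd, VAR=value prefixes) in `bash -c` so an execvpe-style launcher can run it:

```python
import re
import shlex

_ENV_PREFIX = re.compile(r"^\w+=\S+\s")          # e.g. HIP_VISIBLE_DEVICES=4 python3 ...
_SHELL_TOKENS = ("&&", "||", "|", ";", "$", "cd ")

def normalize_command(cmd: str) -> list[str]:
    """Return an argv list; shell-dependent commands get a bash -c wrapper."""
    if _ENV_PREFIX.match(cmd) or any(tok in cmd for tok in _SHELL_TOKENS):
        return ["bash", "-c", cmd]
    return shlex.split(cmd)
```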
…add per-round evaluation

- Add BENCHMARK and FULL_BENCHMARK sections to COMMANDMENT.md generation
  and validation (now requires all 5: SETUP, CORRECTNESS, PROFILE,
  BENCHMARK, FULL_BENCHMARK)
- Rename test_perf tool to save_and_test across all agents, configs,
  prompts, and tests
- Preprocessor now captures benchmark_baseline.txt and
  full_benchmark_baseline.txt from harness --benchmark/--full-benchmark
- dispatch.py includes BENCHMARK section in _geak_test_cmd.sh so agents
  get wall-clock latency feedback via save_and_test
- Orchestrator runs per-round evaluation: FULL_BENCHMARK + PROFILE on
  the best candidate from each round, feeding results into next-round
  task generation
- Update Dockerfile to use geak-oe branch with BENCHMARK-based evaluator
- Update all tests to match 5-section COMMANDMENT structure

Co-authored-by: Cursor <cursoragent@cursor.com>
The preprocessor writes the UnitTestAgent's generated harness path to
harness_path.txt. The orchestrator reads it instead of falling back to
discovery's focused_test_file, which lacks --benchmark/--profile support.

Co-authored-by: Cursor <cursoragent@cursor.com>
- Rewrite _evaluate_round_best to create a temporary worktree, apply the
  best patch, set GEAK_* env vars, and run SETUP+FULL_BENCHMARK+PROFILE
  against the patched kernel (was running against unpatched baseline).
- Add programmatic benchmark parsing (_parse_median_latency_ms,
  _parse_shape_count) for independent speedup verification.
- Auto-discover task_files and results_dir when LLM omits them.
- Catch LimitsExceeded from task-generation sub-agent gracefully.
- Add GEAK_ORCHESTRATOR_STEP_LIMIT safety net (default 200).
- Write benchmark_duration_us into baseline_metrics.json during
  preprocessing for consistent wall-clock comparisons.
- Increase task-generator limits (step: 75→200, cost: $10→$50).

Co-authored-by: Cursor <cursoragent@cursor.com>
- Fix eval worktree crash: resolve eval_dir to absolute path so
  subprocess.run(cwd=...) works regardless of the process CWD.
- Fix wrong winner selection: compare absolute TOTAL_KERNEL_TIME_MS
  across agents instead of self-reported speedup (which varies per
  agent baseline). Falls back to speedup when kernel times unavailable.
- Fix profiler call: replace unsupported env= kwarg with temporary
  os.environ PYTHONPATH, avoiding both the TypeError and rocprofv3
  nested-quote issues.
- Add --start-round flag to geak-orchestrate CLI so the orchestrator
  can resume from a given round, skipping exploration and loading
  prior round evaluations from disk.

Co-authored-by: Cursor <cursoragent@cursor.com>
- Validate fallback agent is in the allowed set before using it
- Strip fenced code block markers when reading COMMANDMENT sections
- Use unique temp filenames for test scripts to avoid concurrent races
- Gracefully handle test discovery failures in preprocessor
- Safely parse priority and num_gpus in task generator LLM responses
- Fix commandment auto-fix regexes to match both quote styles
- Fix workspace_path type annotation (Path -> Path | None)

Co-authored-by: Cursor <cursoragent@cursor.com>
A task with num_gpus > len(gpu_ids) would block forever in
gpu_queue.get() waiting for GPUs that don't exist. Cap the
request to the pool size to prevent the deadlock.

Co-authored-by: Cursor <cursoragent@cursor.com>
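The deadlock fix amounts to one cap before draining the queue; a minimal sketch (the helper name is hypothetical):

```python
import queue

def acquire_gpu_slots(gpu_queue, num_requested, pool_size):
    """Cap the request at the pool size so a task asking for more GPUs
    than exist cannot block forever on gpu_queue.get()."""
    n = min(num_requested, pool_size)
    return [gpu_queue.get() for _ in range(n)]
```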
- Exclude .rocprofv3/, __pycache__/, *.pyc, .pytest_cache/, *.egg-info/,
  *.so, and .geak_resolved/ from git diff when generating patches to
  prevent binary artifacts from breaking patch application.

- Add DEFAULT_EVAL_BENCHMARK_ITERATIONS (50) as a shared constant in
  pipeline_helpers.py. All benchmark invocations — preprocessing baselines,
  agent benchmarks, and orchestrator evaluations — now use this value via
  the GEAK_BENCHMARK_EXTRA_ARGS env var, ensuring apples-to-apples
  speedup comparisons.

- COMMANDMENT BENCHMARK/FULL_BENCHMARK sections now expand
  ${GEAK_BENCHMARK_EXTRA_ARGS:-} so iteration count is configurable.

- Preprocessor re-runs all harness modes with --iterations 50 after
  initial validation to collect high-quality baselines.

- geak --from-task and geak parallel mode now propagate
  GEAK_BENCHMARK_EXTRA_ARGS to agent environments.

- Harness template (mini_unit_test_agent.yaml) instructs agents to
  accept --iterations N CLI arg with GEAK_BENCHMARK_ITERATIONS env
  fallback.

- Updated INSTRUCTIONS.md, README.md, and docs/ with new env vars,
  patch exclusion list, and baseline benchmark re-run step.

Co-authored-by: Cursor <cursoragent@cursor.com>
- Add HIP/CK kernel language detection and COMMANDMENT generation
- Broaden orchestrator metric parsers to recognize BENCHMARK_LATENCY_MS
  and "Total median time:" output formats
- Fix preprocessor baseline enrichment to parse BENCHMARK_LATENCY_MS
  from benchmark_baseline.txt
- Update test discovery, task planner, and unit test agent for CK kernels

Co-authored-by: Cursor <cursoragent@cursor.com>
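A parser for the two output formats named above could look like this; the regexes are assumptions about the exact line shapes:

```python
import re

def parse_median_latency_ms(output: str):
    """Recognize BENCHMARK_LATENCY_MS and 'Total median time:' lines,
    returning the latency in milliseconds or None if neither is found."""
    match = re.search(r"BENCHMARK_LATENCY_MS[:=]\s*([0-9.]+)", output)
    if match is None:
        match = re.search(r"Total median time:\s*([0-9.]+)\s*ms", output)
    return float(match.group(1)) if match else None
```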
…, skip redundant baselines

Seven focused changes to reduce geak-orchestrate wall-clock time:

1. Build on prior round's best patch: round N+1 agents start from the
   globally best patch via create_worktree_with_patch, tracked in ctx.
2. Full round context for task generator: auto-inject ALL prior rounds'
   results, planned tasks, and orchestrator evaluations so the task
   generator avoids repeating strategies.
3. Remove baseline establishment: replace Phase 3 / Step 3 in YAML
   prompts with "Review Provided Baselines" and add skip-baseline
   instruction to inject_pipeline_context for dispatch-path agents.
4. Separate agent vs eval benchmark iterations: agents use 10 iterations
   (GEAK_AGENT_BENCHMARK_ITERATIONS) for fast feedback; eval keeps 50.
5. Deterministic patch selection first: try rewrite_best_results before
   falling back to LLM SelectPatchAgent, saving 8-76 LLM steps per task.
6. Default rounds to 2: change GEAK_MAX_ROUNDS default from 5 to 2.
7. Early stopping: break out of round loop when verified_speedup doesn't
   improve over prior best by GEAK_EARLY_STOP_THRESHOLD (default 0.5%).

Also includes bash tool cwd propagation and save_and_test git exclude
syntax fix (:(exclude) instead of :!).

Co-authored-by: Cursor <cursoragent@cursor.com>
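The early-stopping rule (change 7) can be sketched as a single comparison; reading GEAK_EARLY_STOP_THRESHOLD as a fraction with a 0.005 default is an assumption:

```python
import os

def should_stop_early(prior_best_speedup, current_speedup):
    """Stop the round loop when the verified speedup fails to improve on
    the prior best by more than GEAK_EARLY_STOP_THRESHOLD (default 0.5%)."""
    threshold = float(os.environ.get("GEAK_EARLY_STOP_THRESHOLD", "0.005"))
    return current_speedup <= prior_best_speedup * (1.0 + threshold)
```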
…ture

- Add benchmark_parsing.py to git (imported by orchestrator, default agent,
  and parallel agent but was untracked -- would break on fresh clone)
- Add previous_tasks_dir, round_evaluations, current_round params to
  generate_tasks_from_content to match generate_tasks signature

Co-authored-by: Cursor <cursoragent@cursor.com>
Resolve all conflicts by keeping PR branch versions:
- README.md, scripts/README.md: keep pipeline-focused docs
- mini.py, profiling_tools.py: keep PR's CLI flags and ruff formatting
- mkdocs.yml: delete (intentionally removed in cleanup)
- examples/test_scripts/: keep renamed files

Co-authored-by: Cursor <cursoragent@cursor.com>