
Add agent filtering and unify CLI/pipeline code paths#17

Open
sdubagun-amd wants to merge 33 commits into yab from
agent-filtering-and-cli-unification

Conversation

@sdubagun-amd

Agent filtering:

  • GEAK_ALLOWED_AGENTS / GEAK_EXCLUDED_AGENTS env vars with CLI flags
  • Prompt-level enforcement via system prompt addendum in task_generator
  • Parse-time safety-net filter in dispatch and task_generator
  • Default fallback agent: swe_agent
  • Accept **_extra kwargs in orchestrator tool functions
  • Handle task_files arriving as JSON string in dispatch_tasks
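The filtering rules above (allow-list, exclude-list, fallback to swe_agent) could be sketched roughly as follows. This is a hypothetical illustration, not the actual implementation: in the real code the allowed/excluded strings come from the GEAK_ALLOWED_AGENTS / GEAK_EXCLUDED_AGENTS env vars, and all function names here are assumptions.

```python
# Hypothetical sketch of the agent filter; real code reads the
# GEAK_ALLOWED_AGENTS / GEAK_EXCLUDED_AGENTS env vars.
DEFAULT_FALLBACK_AGENT = "swe_agent"

def _parse_agent_list(value: str) -> set:
    """Split a comma-separated env-var value into a set of agent names."""
    return {name.strip() for name in value.split(",") if name.strip()}

def filter_agent(requested: str, allowed_env: str = "", excluded_env: str = "") -> str:
    """Return the requested agent if permitted, else the fallback agent."""
    allowed = _parse_agent_list(allowed_env)
    excluded = _parse_agent_list(excluded_env)
    if requested in excluded:
        return DEFAULT_FALLBACK_AGENT
    if allowed and requested not in allowed:
        return DEFAULT_FALLBACK_AGENT
    return requested
```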

CLI/pipeline unification:

  • run-tasks delegates to dispatch.task_file_to_agent_task and run_task_batch
  • Make dispatch.task_file_to_agent_task public as the canonical entry point
  • Add codebase-context standalone CLI entry point
  • Update tools.md diagram with agent type nodes
  • Fix preprocessor CLI to pass model_factory for UnitTestAgent

Tests:

  • test_agent_filtering: 17 cases for filtering logic and prompt injection
  • test_tool_consistency: structural checks for CLI/pipeline alignment

Puyuan-Yang and others added 6 commits February 6, 2026 09:41
Co-authored-by: Cursor <cursoragent@cursor.com>
sdubagun-amd and others added 22 commits February 20, 2026 21:01
…code

- Add pipeline_helpers.py: centralize harness creation/validation, baseline
  profiling, context injection, model loading, and agent filtering across
  all CLI entry points (geak, geak-orchestrate, run-tasks, task-generator)
- Add discovery_types.py: shared DiscoveryResult/KernelInfo dataclasses
  with from_dict() factory and kernel language inference
- Flow per-kernel metrics (duration, pct_of_total, bottleneck) from
  baseline_metrics.json through inject_pipeline_context to all agents
- Fix config priority to CLI > Prompt > YAML (prevents LLM prompt from
  overriding explicit CLI flags like --gpu-ids)
- Fix COMMANDMENT path resolution in task file writing (relative_to)
- Add backend-agnostic warmup to profiler-mcp
- Add harness static validation for --profile/--correctness flags
- Integrate mini-swe-agent tools into orchestrator (bash, str_replace_editor,
  profile_kernel, strategy_manager)
- Remove deprecated kernel-profiler MCP, discovery built-in tool, and
  discovery_defaults.toml
- Update architecture, flow, and tools documentation

Co-authored-by: Cursor <cursoragent@cursor.com>
- Add _normalize_command() to profiler-mcp: wraps commands containing
  shell constructs (cd, $VAR, &&, |) in bash -c so rocprofv3's
  os.execvpe can execute them correctly
- Fix all MCP test suites (profiler-mcp, kernel-evolve, openevolve-mcp,
  kernel-ercs) broken by fastmcp API change: replace _tool_manager._tools
  with asyncio.run(mcp.list_tools()) and .fn() with direct function calls

Co-authored-by: Cursor <cursoragent@cursor.com>
…llocations

- Extend _normalize_command to detect VAR=value prefix patterns (e.g.
  HIP_VISIBLE_DEVICES=4 python3 ...) that crash rocprofv3's execvpe
- Add static check in validate_harness for torch.randn(..., device='cuda')
  inside run_profile, which pollutes profiler traces with RNG/memset kernels
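A rough sketch of a normalizer along these lines, combining the shell-construct check with the VAR=value prefix detection from this commit. The name matches the commit message, but the heuristics shown are illustrative assumptions, not the actual implementation.

```python
import re
import shlex

# Illustrative sketch of _normalize_command; the construct list and
# regex are assumptions based on the commit description.
_SHELL_CONSTRUCTS = ("&&", "||", "|", ";", "$", "cd ")
_ENV_PREFIX = re.compile(r"^\w+=\S+\s")  # e.g. HIP_VISIBLE_DEVICES=4 ...

def normalize_command(cmd: str) -> list:
    """Wrap commands that need a shell (pipes, env prefixes, cd, $VAR)
    in `bash -c` so execvpe-style launchers can run them."""
    needs_shell = (
        any(tok in cmd for tok in _SHELL_CONSTRUCTS)
        or _ENV_PREFIX.match(cmd) is not None
    )
    if needs_shell:
        return ["bash", "-c", cmd]
    return shlex.split(cmd)
```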

Co-authored-by: Cursor <cursoragent@cursor.com>
…add per-round evaluation

- Add BENCHMARK and FULL_BENCHMARK sections to COMMANDMENT.md generation
  and validation (now requires all 5: SETUP, CORRECTNESS, PROFILE,
  BENCHMARK, FULL_BENCHMARK)
- Rename test_perf tool to save_and_test across all agents, configs,
  prompts, and tests
- Preprocessor now captures benchmark_baseline.txt and
  full_benchmark_baseline.txt from harness --benchmark/--full-benchmark
- dispatch.py includes BENCHMARK section in _geak_test_cmd.sh so agents
  get wall-clock latency feedback via save_and_test
- Orchestrator runs per-round evaluation: FULL_BENCHMARK + PROFILE on
  the best candidate from each round, feeding results into next-round
  task generation
- Update Dockerfile to use geak-oe branch with BENCHMARK-based evaluator
- Update all tests to match 5-section COMMANDMENT structure

Co-authored-by: Cursor <cursoragent@cursor.com>
The preprocessor writes the UnitTestAgent's generated harness path to
harness_path.txt. The orchestrator reads it instead of falling back to
discovery's focused_test_file, which lacks --benchmark/--profile support.

Co-authored-by: Cursor <cursoragent@cursor.com>
- Rewrite _evaluate_round_best to create a temporary worktree, apply the
  best patch, set GEAK_* env vars, and run SETUP+FULL_BENCHMARK+PROFILE
  against the patched kernel (was running against unpatched baseline).
- Add programmatic benchmark parsing (_parse_median_latency_ms,
  _parse_shape_count) for independent speedup verification.
- Auto-discover task_files and results_dir when LLM omits them.
- Catch LimitsExceeded from task-generation sub-agent gracefully.
- Add GEAK_ORCHESTRATOR_STEP_LIMIT safety net (default 200).
- Write benchmark_duration_us into baseline_metrics.json during
  preprocessing for consistent wall-clock comparisons.
- Increase task-generator limits (step: 75→200, cost: $10→$50).

Co-authored-by: Cursor <cursoragent@cursor.com>
- Fix eval worktree crash: resolve eval_dir to absolute path so
  subprocess.run(cwd=...) works regardless of the process CWD.
- Fix wrong winner selection: compare absolute TOTAL_KERNEL_TIME_MS
  across agents instead of self-reported speedup (which varies per
  agent baseline). Falls back to speedup when kernel times unavailable.
- Fix profiler call: replace unsupported env= kwarg with temporary
  os.environ PYTHONPATH, avoiding both the TypeError and rocprofv3
  nested-quote issues.
- Add --start-round flag to geak-orchestrate CLI so the orchestrator
  can resume from a given round, skipping exploration and loading
  prior round evaluations from disk.

Co-authored-by: Cursor <cursoragent@cursor.com>
- Validate fallback agent is in the allowed set before using it
- Strip fenced code block markers when reading COMMANDMENT sections
- Use unique temp filenames for test scripts to avoid concurrent races
- Gracefully handle test discovery failures in preprocessor
- Safely parse priority and num_gpus in task generator LLM responses
- Fix commandment auto-fix regexes to match both quote styles
- Fix workspace_path type annotation (Path -> Path | None)
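The fence-stripping fix above could look roughly like this. A minimal sketch under the assumption that COMMANDMENT sections may arrive wrapped in markdown code fences; the function name is hypothetical.

```python
import re

def strip_code_fences(text: str) -> str:
    """Remove leading/trailing ``` fences from a section body
    (hypothetical sketch of the COMMANDMENT cleanup described above)."""
    text = text.strip()
    text = re.sub(r"^```[\w-]*\n", "", text)   # opening fence, e.g. ```bash
    text = re.sub(r"\n?```$", "", text)        # closing fence
    return text.strip()
```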

Co-authored-by: Cursor <cursoragent@cursor.com>
A task with num_gpus > len(gpu_ids) would block forever in
gpu_queue.get() waiting for GPUs that don't exist. Cap the
request to the pool size to prevent the deadlock.
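The cap described above is a one-liner; a minimal sketch (helper name is an assumption):

```python
# Sketch of the deadlock guard: a task requesting more GPUs than the
# pool holds would block forever on gpu_queue.get(), so cap the request.
def effective_num_gpus(requested: int, gpu_ids: list) -> int:
    """Cap a task's GPU request at the size of the available pool."""
    return min(requested, len(gpu_ids))
```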

Co-authored-by: Cursor <cursoragent@cursor.com>
- Exclude .rocprofv3/, __pycache__/, *.pyc, .pytest_cache/, *.egg-info/,
  *.so, and .geak_resolved/ from git diff when generating patches to
  prevent binary artifacts from breaking patch application.

- Add DEFAULT_EVAL_BENCHMARK_ITERATIONS (50) as a shared constant in
  pipeline_helpers.py. All benchmark invocations — preprocessing baselines,
  agent benchmarks, and orchestrator evaluations — now use this value via
  the GEAK_BENCHMARK_EXTRA_ARGS env var, ensuring apples-to-apples
  speedup comparisons.

- COMMANDMENT BENCHMARK/FULL_BENCHMARK sections now expand
  ${GEAK_BENCHMARK_EXTRA_ARGS:-} so iteration count is configurable.
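On the Python side, the shared-iteration plumbing might look like the following sketch. The constant and env-var names come from the commit text; the helper itself is hypothetical.

```python
import os
import shlex

# Shared constant from pipeline_helpers.py per the commit description.
DEFAULT_EVAL_BENCHMARK_ITERATIONS = 50

def benchmark_args() -> list:
    """Build benchmark CLI args, honoring the GEAK_BENCHMARK_EXTRA_ARGS
    override so all invocations use the same iteration count."""
    extra = os.environ.get(
        "GEAK_BENCHMARK_EXTRA_ARGS",
        f"--iterations {DEFAULT_EVAL_BENCHMARK_ITERATIONS}",
    )
    return shlex.split(extra)
```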

- Preprocessor re-runs all harness modes with --iterations 50 after
  initial validation to collect high-quality baselines.

- geak --from-task and geak parallel mode now propagate
  GEAK_BENCHMARK_EXTRA_ARGS to agent environments.

- Harness template (mini_unit_test_agent.yaml) instructs agents to
  accept --iterations N CLI arg with GEAK_BENCHMARK_ITERATIONS env
  fallback.

- Updated INSTRUCTIONS.md, README.md, and docs/ with new env vars,
  patch exclusion list, and baseline benchmark re-run step.

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
- Add MCP integration module (mcp_integration/) from geak_v3_db
- Add AMD/NVIDIA knowledge base (knowledge-base/) from geak_v3_db
- Add RAG scripts and examples from geak_v3_db
- Add --mcp / --debug flags to mini.py entry point
- Add TeeOutput for console log capture to trajectory
- Add langchain optional-dependencies group to pyproject.toml
- Apply output truncation logic to all agent config templates (geak_v3_db)
- Keep dev multi-agent architecture: ParallelAgent, StrategyInteractiveAgent, UnitTestAgent
- Keep dev tools_runtime, save_patch, config_editor, task_parser modules
- Merge mkdocs.yml nav: retain all pages from both branches
- save.py: use ensure_ascii=False for non-ASCII character support

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
- Add missing geak_v3_db content to README.md: detailed config table
  (mini_no_temp, mini_reverse_kl), RAG config parameter table,
  knowledge base document requirements
- Rewrite README_zh.md to match README.md structure: add GEAK-v3 intro,
  multi-agent architecture, parallel optimization, tools, output artifacts

Co-authored-by: Cursor <cursoragent@cursor.com>
- Add HIP/CK kernel language detection and COMMANDMENT generation
- Broaden orchestrator metric parsers to recognize BENCHMARK_LATENCY_MS
  and "Total median time:" output formats
- Fix preprocessor baseline enrichment to parse BENCHMARK_LATENCY_MS
  from benchmark_baseline.txt
- Update test discovery, task planner, and unit test agent for CK kernels
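A broadened metric parser of this kind might be sketched as below, recognizing both output formats named in the commit. The exact regexes and function name are assumptions.

```python
import re

# Hypothetical sketch of a broadened latency parser recognizing both
# the BENCHMARK_LATENCY_MS and "Total median time:" formats.
def parse_median_latency_ms(text: str):
    """Extract a median latency in ms from benchmark output, or None."""
    m = re.search(r"BENCHMARK_LATENCY_MS[:=]\s*([\d.]+)", text)
    if m:
        return float(m.group(1))
    m = re.search(r"Total median time:\s*([\d.]+)\s*ms", text)
    if m:
        return float(m.group(1))
    return None
```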

Co-authored-by: Cursor <cursoragent@cursor.com>
…llel

- Remove separate MCP early-return path in mini.py
- MCP environment now flows into the standard ParallelAgent pipeline
- env_factory creates MCPEnabledEnvironment for parallel sub-agents when --mcp
- MCP prompts (SYSTEM_TEMPLATE, INSTANCE_TEMPLATE) injected into agent config
- Fix profiling_tools.py f-string Python 3.10 compatibility (double→single quotes)

Co-authored-by: Cursor <cursoragent@cursor.com>
…, skip redundant baselines

Seven focused changes to reduce geak-orchestrate wall-clock time:

1. Build on prior round's best patch: round N+1 agents start from the
   globally best patch via create_worktree_with_patch, tracked in ctx.
2. Full round context for task generator: auto-inject ALL prior rounds'
   results, planned tasks, and orchestrator evaluations so the task
   generator avoids repeating strategies.
3. Remove baseline establishment: replace Phase 3 / Step 3 in YAML
   prompts with "Review Provided Baselines" and add skip-baseline
   instruction to inject_pipeline_context for dispatch-path agents.
4. Separate agent vs eval benchmark iterations: agents use 10 iterations
   (GEAK_AGENT_BENCHMARK_ITERATIONS) for fast feedback; eval keeps 50.
5. Deterministic patch selection first: try rewrite_best_results before
   falling back to LLM SelectPatchAgent, saving 8-76 LLM steps per task.
6. Default rounds to 2: change GEAK_MAX_ROUNDS default from 5 to 2.
7. Early stopping: break out of round loop when verified_speedup doesn't
   improve over prior best by GEAK_EARLY_STOP_THRESHOLD (default 0.5%).

Also includes bash tool cwd propagation and save_and_test git exclude
syntax fix (:(exclude) instead of :!).
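The early-stopping rule in item 7 can be sketched as follows; the env-var name and default come from the commit text, the helper name is an assumption.

```python
import os

def should_stop_early(best_speedup: float, new_speedup: float) -> bool:
    """Stop when the new round's verified speedup fails to beat the prior
    best by at least GEAK_EARLY_STOP_THRESHOLD (default 0.5%)."""
    threshold = float(os.environ.get("GEAK_EARLY_STOP_THRESHOLD", "0.005"))
    return new_speedup < best_speedup * (1.0 + threshold)
```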

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
merge: integrate geak_v3_db into dev (MCP + multi-agent coexistence)
sdubagun-amd and others added 5 commits February 25, 2026 04:04
…ture

- Add benchmark_parsing.py to git (imported by orchestrator, default agent,
  and parallel agent but was untracked -- would break on fresh clone)
- Add previous_tasks_dir, round_evaluations, current_round params to
  generate_tasks_from_content to match generate_tasks signature

Co-authored-by: Cursor <cursoragent@cursor.com>
Resolve all conflicts by keeping PR branch versions:
- README.md, scripts/README.md: keep pipeline-focused docs
- mini.py, profiling_tools.py: keep PR's CLI flags and ruff formatting
- mkdocs.yml: delete (intentionally removed in cleanup)
- examples/test_scripts/: keep renamed files

Co-authored-by: Cursor <cursoragent@cursor.com>


4 participants