Add agent filtering and unify CLI/pipeline code paths #17
Open
sdubagun-amd wants to merge 33 commits into yab from
Conversation
Co-authored-by: Cursor <cursoragent@cursor.com>
Agent filtering:
- GEAK_ALLOWED_AGENTS / GEAK_EXCLUDED_AGENTS env vars with CLI flags
- Prompt-level enforcement via system prompt addendum in task_generator
- Parse-time safety-net filter in dispatch and task_generator
- Default fallback agent: swe_agent
- Accept **_extra kwargs in orchestrator tool functions
- Handle task_files arriving as a JSON string in dispatch_tasks

CLI/pipeline unification:
- run-tasks delegates to dispatch.task_file_to_agent_task and run_task_batch
- Publicize dispatch.task_file_to_agent_task as the canonical entry point
- Add codebase-context standalone CLI entry point
- Update tools.md diagram with agent type nodes
- Fix preprocessor CLI to pass model_factory for UnitTestAgent

Tests:
- test_agent_filtering: 17 cases for filtering logic and prompt injection
- test_tool_consistency: structural checks for CLI/pipeline alignment

Co-authored-by: Cursor <cursoragent@cursor.com>
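The parse-time safety-net filter could look roughly like this, a minimal sketch assuming comma-separated env vars and the `swe_agent` fallback named above (helper names here are illustrative, not the repo's actual API):

```python
import os

DEFAULT_FALLBACK_AGENT = "swe_agent"  # default used when nothing survives filtering

def _csv_env(name: str) -> set[str]:
    """Parse a comma-separated env var into a set of agent names (hypothetical helper)."""
    raw = os.environ.get(name, "")
    return {item.strip() for item in raw.split(",") if item.strip()}

def filter_agents(requested: list[str]) -> list[str]:
    """Drop agents outside GEAK_ALLOWED_AGENTS or inside GEAK_EXCLUDED_AGENTS.

    An empty GEAK_ALLOWED_AGENTS means "allow everything"; the exclude list
    always wins. Falls back to the default agent rather than returning nothing.
    """
    allowed = _csv_env("GEAK_ALLOWED_AGENTS")
    excluded = _csv_env("GEAK_EXCLUDED_AGENTS")
    kept = [a for a in requested
            if (not allowed or a in allowed) and a not in excluded]
    return kept or [DEFAULT_FALLBACK_AGENT]
```

The same predicate can back both the CLI flags and the prompt addendum, so the LLM-facing rule and the parse-time enforcement cannot drift apart.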
…code

- Add pipeline_helpers.py: centralize harness creation/validation, baseline profiling, context injection, model loading, and agent filtering across all CLI entry points (geak, geak-orchestrate, run-tasks, task-generator)
- Add discovery_types.py: shared DiscoveryResult/KernelInfo dataclasses with from_dict() factory and kernel language inference
- Flow per-kernel metrics (duration, pct_of_total, bottleneck) from baseline_metrics.json through inject_pipeline_context to all agents
- Fix config priority to CLI > Prompt > YAML (prevents the LLM prompt from overriding explicit CLI flags like --gpu-ids)
- Fix COMMANDMENT path resolution in task file writing (relative_to)
- Add backend-agnostic warmup to profiler-mcp
- Add harness static validation for --profile/--correctness flags
- Integrate mini-swe-agent tools into the orchestrator (bash, str_replace_editor, profile_kernel, strategy_manager)
- Remove deprecated kernel-profiler MCP, discovery built-in tool, and discovery_defaults.toml
- Update architecture, flow, and tools documentation

Co-authored-by: Cursor <cursoragent@cursor.com>
- Add _normalize_command() to profiler-mcp: wraps commands containing shell constructs (cd, $VAR, &&, |) in bash -c so rocprofv3's os.execvpe can execute them correctly
- Fix all MCP test suites (profiler-mcp, kernel-evolve, openevolve-mcp, kernel-ercs) broken by a fastmcp API change: replace _tool_manager._tools with asyncio.run(mcp.list_tools()) and .fn() with direct function calls

Co-authored-by: Cursor <cursoragent@cursor.com>
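The core of the `_normalize_command()` fix can be sketched like this (the marker list and public name are assumptions; the real helper may detect more constructs):

```python
import shlex

# Substrings that only a shell can interpret; os.execvpe passes them through
# literally, which is why rocprofv3 choked on them.
SHELL_MARKERS = ("&&", "||", "|", ";", ">", "<", "$", "`")

def normalize_command(command: str) -> list[str]:
    """Split a command for execvpe-style execution, wrapping shell syntax in bash -c.

    Plain commands are tokenized with shlex; anything needing shell semantics
    (cd, variable expansion, pipes, conditionals) is delegated to bash.
    """
    tokens = shlex.split(command)
    needs_shell = (
        (tokens and tokens[0] in ("cd", "source"))
        or any(marker in command for marker in SHELL_MARKERS)
    )
    if needs_shell:
        return ["bash", "-c", command]
    return tokens
```

With this shape, `cd /workspace && python3 bench.py` becomes `["bash", "-c", ...]` while a plain `python3 bench.py` stays a direct argv list.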
…llocations

- Extend _normalize_command to detect VAR=value prefix patterns (e.g. HIP_VISIBLE_DEVICES=4 python3 ...) that crash rocprofv3's execvpe
- Add a static check in validate_harness for torch.randn(..., device='cuda') inside run_profile, which pollutes profiler traces with RNG/memset kernels

Co-authored-by: Cursor <cursoragent@cursor.com>
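Detecting the `VAR=value` prefix is a one-regex check; a minimal sketch (the function name is illustrative, and the real code folds this into `_normalize_command`):

```python
import re

# Matches a leading NAME=value environment assignment, which execvpe would try
# to exec as a program, e.g. "HIP_VISIBLE_DEVICES=4 python3 bench.py".
_ENV_PREFIX = re.compile(r"^\s*[A-Za-z_][A-Za-z0-9_]*=\S+")

def has_env_prefix(command: str) -> bool:
    """Return True when the command starts with a VAR=value assignment."""
    return bool(_ENV_PREFIX.match(command))
```

A command flagged here gets the same `bash -c` wrapping as other shell constructs, since only a shell knows how to apply the assignment to the child process.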
…add per-round evaluation

- Add BENCHMARK and FULL_BENCHMARK sections to COMMANDMENT.md generation and validation (now requires all 5: SETUP, CORRECTNESS, PROFILE, BENCHMARK, FULL_BENCHMARK)
- Rename the test_perf tool to save_and_test across all agents, configs, prompts, and tests
- Preprocessor now captures benchmark_baseline.txt and full_benchmark_baseline.txt from harness --benchmark/--full-benchmark
- dispatch.py includes the BENCHMARK section in _geak_test_cmd.sh so agents get wall-clock latency feedback via save_and_test
- Orchestrator runs per-round evaluation: FULL_BENCHMARK + PROFILE on the best candidate from each round, feeding results into next-round task generation
- Update the Dockerfile to use the geak-oe branch with the BENCHMARK-based evaluator
- Update all tests to match the 5-section COMMANDMENT structure

Co-authored-by: Cursor <cursoragent@cursor.com>
The preprocessor writes the UnitTestAgent's generated harness path to harness_path.txt. The orchestrator reads it instead of falling back to discovery's focused_test_file, which lacks --benchmark/--profile support. Co-authored-by: Cursor <cursoragent@cursor.com>
- Rewrite _evaluate_round_best to create a temporary worktree, apply the best patch, set GEAK_* env vars, and run SETUP + FULL_BENCHMARK + PROFILE against the patched kernel (it was running against the unpatched baseline).
- Add programmatic benchmark parsing (_parse_median_latency_ms, _parse_shape_count) for independent speedup verification.
- Auto-discover task_files and results_dir when the LLM omits them.
- Catch LimitsExceeded from the task-generation sub-agent gracefully.
- Add a GEAK_ORCHESTRATOR_STEP_LIMIT safety net (default 200).
- Write benchmark_duration_us into baseline_metrics.json during preprocessing for consistent wall-clock comparisons.
- Increase task-generator limits (step: 75 → 200, cost: $10 → $50).

Co-authored-by: Cursor <cursoragent@cursor.com>
- Fix eval worktree crash: resolve eval_dir to an absolute path so subprocess.run(cwd=...) works regardless of the process CWD.
- Fix wrong winner selection: compare absolute TOTAL_KERNEL_TIME_MS across agents instead of self-reported speedup (which varies per agent baseline); fall back to speedup when kernel times are unavailable.
- Fix the profiler call: replace the unsupported env= kwarg with a temporary os.environ PYTHONPATH, avoiding both the TypeError and rocprofv3 nested-quote issues.
- Add a --start-round flag to the geak-orchestrate CLI so the orchestrator can resume from a given round, skipping exploration and loading prior round evaluations from disk.

Co-authored-by: Cursor <cursoragent@cursor.com>
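The temporary-PYTHONPATH workaround can be written as a small context manager, a sketch of the pattern (the helper name is illustrative, not the repo's API):

```python
import os
from contextlib import contextmanager

@contextmanager
def temp_pythonpath(extra: str):
    """Temporarily prepend a path to PYTHONPATH via os.environ.

    Workaround for callees that reject an env= kwarg: mutate the process
    environment around the call, then restore the original value exactly.
    """
    old = os.environ.get("PYTHONPATH")
    os.environ["PYTHONPATH"] = extra if old is None else f"{extra}{os.pathsep}{old}"
    try:
        yield
    finally:
        if old is None:
            os.environ.pop("PYTHONPATH", None)
        else:
            os.environ["PYTHONPATH"] = old
```

Because the child process inherits `os.environ` at spawn time, this also sidesteps rocprofv3's nested-quote issues: no `VAR=value` string ever appears in the command line.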
- Validate that the fallback agent is in the allowed set before using it
- Strip fenced code block markers when reading COMMANDMENT sections
- Use unique temp filenames for test scripts to avoid concurrent races
- Gracefully handle test discovery failures in the preprocessor
- Safely parse priority and num_gpus in task generator LLM responses
- Fix commandment auto-fix regexes to match both quote styles
- Fix the workspace_path type annotation (Path -> Path | None)

Co-authored-by: Cursor <cursoragent@cursor.com>
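Stripping fence markers from a COMMANDMENT section is a small normalization step; a minimal sketch of one plausible implementation (the function name is hypothetical):

```python
def strip_code_fences(section: str) -> str:
    """Remove a leading ```lang line and trailing ``` line from a section body.

    LLM-written COMMANDMENT sections sometimes arrive wrapped in a fenced
    code block; the commands inside must be extracted before execution.
    """
    lines = section.strip().splitlines()
    if lines and lines[0].startswith("```"):
        lines = lines[1:]
    if lines and lines[-1].strip() == "```":
        lines = lines[:-1]
    return "\n".join(lines).strip()
```

Unfenced sections pass through unchanged, so the normalization is safe to apply unconditionally.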
A task with num_gpus > len(gpu_ids) would block forever in gpu_queue.get() waiting for GPUs that don't exist. Cap the request to the pool size to prevent the deadlock. Co-authored-by: Cursor <cursoragent@cursor.com>
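The deadlock fix reduces to clamping the request before touching the queue; a sketch (clamping the lower bound to 1 is an added assumption):

```python
def cap_gpu_request(num_gpus: int, pool_size: int) -> int:
    """Cap a task's GPU request to the pool size to avoid a queue deadlock.

    With a queue.Queue holding one token per physical GPU, asking for more
    tokens than exist blocks forever in gpu_queue.get(). Clamp to [1, pool_size].
    """
    return max(1, min(num_gpus, pool_size))
```

The cap runs before any `gpu_queue.get()` call, so an over-sized request degrades to "use every GPU in the pool" instead of hanging the dispatcher.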
- Exclude .rocprofv3/, __pycache__/, *.pyc, .pytest_cache/, *.egg-info/,
*.so, and .geak_resolved/ from git diff when generating patches to
prevent binary artifacts from breaking patch application.
- Add DEFAULT_EVAL_BENCHMARK_ITERATIONS (50) as a shared constant in
pipeline_helpers.py. All benchmark invocations — preprocessing baselines,
agent benchmarks, and orchestrator evaluations — now use this value via
the GEAK_BENCHMARK_EXTRA_ARGS env var, ensuring apples-to-apples
speedup comparisons.
- COMMANDMENT BENCHMARK/FULL_BENCHMARK sections now expand
${GEAK_BENCHMARK_EXTRA_ARGS:-} so iteration count is configurable.
- Preprocessor re-runs all harness modes with --iterations 50 after
initial validation to collect high-quality baselines.
- geak --from-task and geak parallel mode now propagate
GEAK_BENCHMARK_EXTRA_ARGS to agent environments.
- Harness template (mini_unit_test_agent.yaml) instructs agents to
accept --iterations N CLI arg with GEAK_BENCHMARK_ITERATIONS env
fallback.
- Updated INSTRUCTIONS.md, README.md, and docs/ with new env vars,
patch exclusion list, and baseline benchmark re-run step.
Co-authored-by: Cursor <cursoragent@cursor.com>
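The exclusion list above translates into `:(exclude)` pathspecs appended to the diff command; a sketch of how the argv might be assembled (the builder function is illustrative, the patterns are taken from the commit message):

```python
# Artifact patterns kept out of generated patches, so binary files cannot
# break patch application later.
PATCH_EXCLUDES = [
    ".rocprofv3/", "__pycache__/", "*.pyc",
    ".pytest_cache/", "*.egg-info/", "*.so", ".geak_resolved/",
]

def build_diff_cmd(excludes: list[str] = PATCH_EXCLUDES) -> list[str]:
    """Assemble a `git diff` argv using :(exclude) pathspec magic."""
    return ["git", "diff", "--", "."] + [f":(exclude){pat}" for pat in excludes]
```

Passing the pathspecs as separate argv entries (rather than one quoted shell string) avoids a second round of shell-quoting problems when the command is run via subprocess.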
Co-authored-by: Cursor <cursoragent@cursor.com>
- Add MCP integration module (mcp_integration/) from geak_v3_db
- Add AMD/NVIDIA knowledge base (knowledge-base/) from geak_v3_db
- Add RAG scripts and examples from geak_v3_db
- Add --mcp / --debug flags to the mini.py entry point
- Add TeeOutput for console log capture to the trajectory
- Add a langchain optional-dependencies group to pyproject.toml
- Apply output truncation logic to all agent config templates (geak_v3_db)
- Keep the dev multi-agent architecture: ParallelAgent, StrategyInteractiveAgent, UnitTestAgent
- Keep the dev tools_runtime, save_patch, config_editor, task_parser modules
- Merge mkdocs.yml nav: retain all pages from both branches
- save.py: use ensure_ascii=False for non-ASCII character support

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
- Add missing geak_v3_db content to README.md: detailed config table (mini_no_temp, mini_reverse_kl), RAG config parameter table, knowledge base document requirements
- Rewrite README_zh.md to match the README.md structure: add GEAK-v3 intro, multi-agent architecture, parallel optimization, tools, output artifacts

Co-authored-by: Cursor <cursoragent@cursor.com>
- Add HIP/CK kernel language detection and COMMANDMENT generation
- Broaden orchestrator metric parsers to recognize BENCHMARK_LATENCY_MS and "Total median time:" output formats
- Fix preprocessor baseline enrichment to parse BENCHMARK_LATENCY_MS from benchmark_baseline.txt
- Update test discovery, the task planner, and the unit test agent for CK kernels

Co-authored-by: Cursor <cursoragent@cursor.com>
…llel

- Remove the separate MCP early-return path in mini.py
- MCP environment now flows into the standard ParallelAgent pipeline
- env_factory creates MCPEnabledEnvironment for parallel sub-agents when --mcp is set
- MCP prompts (SYSTEM_TEMPLATE, INSTANCE_TEMPLATE) injected into agent config
- Fix profiling_tools.py f-string Python 3.10 compatibility (double → single quotes)

Co-authored-by: Cursor <cursoragent@cursor.com>
…, skip redundant baselines

Seven focused changes to reduce geak-orchestrate wall-clock time:

1. Build on the prior round's best patch: round N+1 agents start from the globally best patch via create_worktree_with_patch, tracked in ctx.
2. Full round context for the task generator: auto-inject ALL prior rounds' results, planned tasks, and orchestrator evaluations so the task generator avoids repeating strategies.
3. Remove baseline establishment: replace Phase 3 / Step 3 in the YAML prompts with "Review Provided Baselines" and add a skip-baseline instruction to inject_pipeline_context for dispatch-path agents.
4. Separate agent vs. eval benchmark iterations: agents use 10 iterations (GEAK_AGENT_BENCHMARK_ITERATIONS) for fast feedback; eval keeps 50.
5. Deterministic patch selection first: try rewrite_best_results before falling back to the LLM SelectPatchAgent, saving 8-76 LLM steps per task.
6. Default rounds to 2: change the GEAK_MAX_ROUNDS default from 5 to 2.
7. Early stopping: break out of the round loop when verified_speedup doesn't improve over the prior best by GEAK_EARLY_STOP_THRESHOLD (default 0.5%).

Also includes bash tool cwd propagation and a save_and_test git exclude syntax fix (:(exclude) instead of :!).

Co-authored-by: Cursor <cursoragent@cursor.com>
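The early-stopping check in item 7 might look like the following sketch; treating the threshold as a relative fraction read from the env var is an assumption about the exact semantics:

```python
import os

def should_stop_early(best_speedup: float, new_speedup: float) -> bool:
    """Decide whether the round loop should break after this round.

    Stops when the new verified speedup fails to beat the prior best by at
    least GEAK_EARLY_STOP_THRESHOLD (interpreted here as a relative fraction,
    default 0.5% — an assumption about the real implementation).
    """
    threshold = float(os.environ.get("GEAK_EARLY_STOP_THRESHOLD", "0.005"))
    return new_speedup <= best_speedup * (1.0 + threshold)
```

Combined with the default of 2 rounds, this means a flat second round ends the run immediately instead of burning eval iterations on diminishing returns.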
Co-authored-by: Cursor <cursoragent@cursor.com>
merge: integrate geak_v3_db into dev (MCP + multi-agent coexistence)
…ture

- Add benchmark_parsing.py to git (imported by the orchestrator, default agent, and parallel agent but previously untracked; it would break on a fresh clone)
- Add previous_tasks_dir, round_evaluations, and current_round params to generate_tasks_from_content to match the generate_tasks signature

Co-authored-by: Cursor <cursoragent@cursor.com>
Resolve all conflicts by keeping the PR branch versions:

- README.md, scripts/README.md: keep pipeline-focused docs
- mini.py, profiling_tools.py: keep the PR's CLI flags and ruff formatting
- mkdocs.yml: delete (intentionally removed in cleanup)
- examples/test_scripts/: keep the renamed files

Co-authored-by: Cursor <cursoragent@cursor.com>