msa -> yab -> geak_v3_features #10

Open
sdubagun-amd wants to merge 60 commits into geak_v3_features from yab
Conversation

@sdubagun-amd

Moving to a common repository.

Umangatamd and others added 30 commits February 9, 2026 11:38
Decoupled additions (copied as-is from MSA):
- mcp_tools/: 6 MCP servers (automated-test-discovery, kernel-evolve,
  kernel-ercs, kernel-profiler, metrix-mcp, openevolve-mcp) + mcp-client
- Dockerfile, entrypoint.sh, scripts/run-docker.sh
- runtime_env.py (local/Docker auto-detection)
- optimizer/ (unified OpenEvolve + Autotune interface)
- benchmark.py (standardized benchmarking framework)
- kernel_profile.py (GPU profiling CLI)
- mcp_tools/metrix.py (AMD Metrix API tool)
- reference/ (50+ GPU optimization strategies database + state machine)
- test_suite/ (10-kernel AITER regression suite)
- examples/add_kernel/
- docs: DISCOVERY_PIPELINE.md, METRIX_TOOL.md, GETTING_STARTED.md,
  RUNTIME_ENV.md, RUNTIME_QUICKSTART.md

Integrated changes (best-of-both-worlds):
- Test discovery: MSA's content-based pipeline runs first (fast, free),
  with its results fed into v3's UnitTestAgent as context. The subagent
  always runs, but starts informed rather than exploring from scratch.
- mini.py: Added --runtime, --docker-image, --workspace CLI flags
- pyproject.toml: Added geak/kernel-profile scripts, mcp[cli] dep
- README.md: Added MCP servers, Docker, architecture sections

All geakagent -> minisweagent import references fixed in ported files.
Analysis/comparison docs moved to ~/geak_analysis_docs/ (not needed in repo).
Cherry-picked 6 geak_v3_features commits (167fc13..18853fb):
- Model refactor: amd_base, amd_claude, amd_openai, amd_gemini
- Tool-call message protocol in default.py
- Trajectory saving, test_profiling_tool.py deletion
- Parallel worktree fixes, unit test prompt

Ported 15 msa post-squash-merge commits (path-translated geakagent->minisweagent):
- baseline_metrics.py, protected_files.py (new)
- resolve_kernel_url.py, test discovery injection (new)
- OpenEvolve COMMANDMENT-based evaluation refactor
- kernel_profile: remove --filter, always profile all kernels
- optimizer/core.py: new MCP API (gpu, output_dir, commandment_path)
- mini.py: --kernel-url flag, discovery injection, INSTRUCTIONS.md loading
- default.py: summary_on_cost_limit feature
- Dockerfile: OpenEvolve installation
- openevolve-mcp/server.py: refactored

Doc rejects (README, METRIX_TOOL, RUNTIME_QUICKSTART) deferred to cleanup phase.

Co-authored-by: Cursor <cursoragent@cursor.com>
Replace double quotes inside f-string expressions with single quotes.
Python 3.10 does not support reusing the outer quote character inside
f-string braces (PEP 701 landed in 3.12).
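The incompatibility is easy to reproduce; a minimal sketch of the failing pattern and the portable form:

```python
# On Python 3.10/3.11 the following line is a SyntaxError, because the
# outer double quote is reused inside the f-string braces (allowed only
# since PEP 701 landed in 3.12):
#   msg = f"value: {d["key"]}"
d = {"key": 42}
msg = f"value: {d['key']}"  # portable: single quotes inside the braces
print(msg)  # → value: 42
```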

Co-authored-by: Cursor <cursoragent@cursor.com>
New MCP server `profiler-mcp` wraps both profiling backends behind a
single `profile_kernel` tool with a `backend` parameter:
- backend="metrix": AMD Metrix API (structured JSON, bottleneck classification)
- backend="rocprof-compute": rocprof-compute CLI (deep roofline, instruction mix)

Files:
- mcp_tools/profiler-mcp/src/profiler_mcp/server.py - unified MCP server
- mcp_tools/profiler-mcp/tests/test_profiler_unit.py - 14 mock-based tests
- mcp_tools/profiler-mcp/tests/test_profiler_integration.py - 4 GPU tests
- mcp_tools/profiler-mcp/examples/profile_kernel.py - Python API example

All tests pass (14 unit, 4 integration on MI300X). Ruff-clean.
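A hypothetical sketch of the dispatch behind the single `profile_kernel` tool — function and field names are assumptions, and both backends are stubbed out rather than calling the real Metrix API or rocprof-compute CLI:

```python
def profile_kernel(kernel_path: str, backend: str = "metrix") -> dict:
    """Route one tool call to the selected profiling backend.

    Sketch only: the real server wraps the AMD Metrix API and the
    rocprof-compute CLI; here each backend returns a placeholder result.
    """
    backends = {
        "metrix": lambda p: {"backend": "metrix", "kernel": p,
                             "format": "structured-json"},
        "rocprof-compute": lambda p: {"backend": "rocprof-compute",
                                      "kernel": p, "format": "roofline"},
    }
    if backend not in backends:
        raise ValueError(f"unknown backend {backend!r}; "
                         f"choose one of {sorted(backends)}")
    return backends[backend](p=kernel_path)
```

Keeping one tool with a `backend` parameter (rather than two tools) means agents only need to learn a single profiling interface.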

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
…(Phase 4)

Co-authored-by: Cursor <cursoragent@cursor.com>
…Worker (Phase 5)

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
- Delete geak_agent/ legacy package; move resolve_kernel_url to src/minisweagent/tools/
- Deduplicate MetrixTool (delete src copy, keep mcp_tools/metrix-mcp canonical)
- Delete 66 inherited mini-swe-agent docs + mkdocs.yml + assets
- Delete 23 dead tests (missing upstream modules), fix 3 remaining
- Rename test_suite/ -> eval_suite/, reference/ -> knowledge_base/
- Consolidate examples under examples/
- Fix all stale geakagent/sdubagun/yueliu14 references
- Remove 16 ghost __pycache__-only directories
- Deprecate mcp_tools/kernel-profiler/ (superseded by profiler-mcp)
- Remove mcp_tools/kernel-from-url-mcp/ (dead, only __pycache__)
- Remove mkdocs dependencies from pyproject.toml
- Update git remote to AMD-AGI/GEAK.git

Tests: 96 passed, 58 skipped, 1 xfailed, 0 failures
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
…plumbing tests

- Refactor MCPToolBridge to use a single persistent asyncio event loop on a
  background daemon thread per instance, resolving "Future attached to a
  different loop" errors that broke MCP server calls during e2e runs.

- Fix discovery scoping for --kernel-url flows: when the kernel lives inside
  a .geak_resolved clone, scope both discovery calls (in mini.py and
  unit_test_agent.py) to the clone root instead of the entire workspace.
  Add a hard boundary in discovery._expand_workspace_for_file to prevent
  walking above .geak_resolved.

- Export RESOLVED_DIR_NAME constant and add find_resolved_clone_root() helper
  in resolve_kernel_url_impl.py so the directory convention is defined in
  one place.

- Fix mini.py second discovery call: prioritize _resolved_kernel_path (from
  --kernel-url resolution) when determining _kernel_path, instead of falling
  through to None when --task is not provided.

- Add new test files: test_discovery_scope.py (scope boundary + mini.py
  wiring), test_e2e_pipeline_smoke.py, test_mcp_server_smoke.py,
  test_plumbing_contracts.py, test_toolruntime_dispatch.py, and extend
  test_mcp_bridge.py with event loop lifecycle tests.

- Apply ruff formatting fixes across modified files.
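The persistent-loop pattern behind the MCPToolBridge fix can be sketched as follows (class and method names are illustrative, not the actual bridge API): all coroutines are submitted to one long-lived loop, so a future is never attached to a loop that has since been closed.

```python
import asyncio
import threading

class LoopBridge:
    """One persistent event loop on a background daemon thread."""

    def __init__(self):
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(
            target=self._loop.run_forever, daemon=True
        )
        self._thread.start()

    def call(self, coro, timeout=30):
        # Thread-safe handoff: the coroutine always runs on self._loop,
        # never on whatever loop the calling thread happens to have.
        future = asyncio.run_coroutine_threadsafe(coro, self._loop)
        return future.result(timeout)

    def close(self):
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
```

The alternative — creating a fresh loop per call — is exactly what produces "Future attached to a different loop" once any awaitable outlives its call.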
…nt instructions

- File-based MCP transport for large profiler results; bump StreamReader limit to 16MB
- Auto-detect num_parallel from gpu_ids when not explicitly set
- Externalize discovery patterns into discovery_defaults.toml with per-project overrides
- Detect Triton wrapper files; fall back to kernel_file stem for test/bench matching
- Strengthen test_perf mandatory usage in agent prompts
- Fix default task to defer to INSTRUCTIONS.md instead of banning OpenEvolve
- Add confirm_exit to base AgentConfig for --exit-immediately support
- Extend discovery to detect HIP/CK/ASM kernels, trace cross-language
  call chains (Python→torch.ops→pybind11→.cu), and build dependency
  graphs with fusion opportunity detection
- Add GPU pool scheduler: M tasks on N GPUs with dynamic slot assignment
  via ThreadPoolExecutor and thread-safe GPU queue
- Add dynamic task planner generating language-aware optimization tasks
  (OpenEvolve, CK template tuning, HIP launch config, fusion, etc.)
- Add COMMANDMENT.md validation (required sections, shell built-in
  detection) with auto-validation hook in str_replace_editor
- Fix baseline_metrics NaN/inf sanitization and post-write JSON roundtrip
- Update SelectPatchAgent to handle task_*/parallel_* directories and
  prefer per-kernel latency metrics
- Consolidate duplicated extension lists into shared constants
- Add 25 unit tests for validate_commandment and task_planner
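The GPU pool scheduler bullet above describes a classic slot-queue pattern; a minimal sketch (names assumed, each task a callable taking the assigned GPU id):

```python
import queue
from concurrent.futures import ThreadPoolExecutor

def run_on_gpu_pool(tasks, gpu_ids):
    """Run M tasks on N GPUs: a worker blocks until a slot is free,
    runs its task pinned to that GPU, then returns the slot."""
    slots = queue.Queue()
    for gid in gpu_ids:          # thread-safe pool of free GPU slots
        slots.put(gid)

    def worker(task):
        gid = slots.get()        # block until a GPU is available
        try:
            return task(gid)
        finally:
            slots.put(gid)       # release the slot for the next task

    with ThreadPoolExecutor(max_workers=len(gpu_ids)) as pool:
        return list(pool.map(worker, tasks))
```

Capping `max_workers` at the number of GPUs means the queue never needs a separate semaphore; the executor itself bounds concurrency.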

Co-authored-by: Cursor <cursoragent@cursor.com>
- Add GPU/profiler rules to task planner and strategy yaml to prevent
  agents from using inline env vars or HIP_VISIBLE_DEVICES prefixes
- Detect inline env var prefixes (VAR=val cmd) in COMMANDMENT validator
- Add COMMANDMENT.md validation hook to bash tool (agents bypass editor)
- Update INSTRUCTIONS.md with anti-patterns and wrapper script template
- Add 4 new tests for inline env var detection
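Detecting inline env var prefixes reduces to a small regex check; a hypothetical sketch of the validator rule (the real COMMANDMENT validator's pattern may differ):

```python
import re

# Flag commands like "HIP_VISIBLE_DEVICES=0 python bench.py": one or more
# UPPER_CASE=value prefixes followed by an actual command. Such prefixes
# bypass the GPU pool's env-based isolation.
_INLINE_ENV_RE = re.compile(r"^\s*(?:[A-Z_][A-Z0-9_]*=\S+\s+)+\S")

def has_inline_env_prefix(command: str) -> bool:
    return bool(_INLINE_ENV_RE.match(command))
```

Note the trailing `\S`: a bare assignment like `FOO=bar` with no command is not flagged, and lowercase words such as `export` never match the variable-name class.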

Co-authored-by: Cursor <cursoragent@cursor.com>
Discovery was broken when given a directory instead of a single file:
- MCP server crashed on read_text() for directories, _expand_workspace
  started from parent instead of the directory itself
- mini.py missed .git inside the directory (kp.parents excludes kp)
- DiscoveryPipeline.run() skipped workspace expansion for directories

Fixes:
- Add directory mode to MCP discover() with recursive kernel scanning
- Fix _expand_workspace to check the path itself when it's a directory
- Add _expand_workspace_for_dir() that scopes workspace to the given
  directory, preventing unrelated sibling files from polluting results
- Use parent directory name as kernel_name when file has a generic name
  (kernel.py, main.py) so test-name matching works properly
- Fix mini.py to check kp itself for .git before walking kp.parents

Co-authored-by: Cursor <cursoragent@cursor.com>
- Eliminate double discovery: mini.py reuses stashed _run_discovery._last_result
  instead of calling run_discovery_pipeline() a second time
- Enrich discovery context: new format_discovery_for_agent() includes kernel
  analysis, language-specific testing guidance (triton/hip/ck/asm), and
  extracted test patterns (tolerances, shapes, dtypes, imports)
- Extract test patterns: _extract_test_patterns() in discovery.py pulls
  atol/rtol, input shapes, dtypes, reference impls, and import patterns from
  top-confidence test files
- Upgrade UnitTestAgent to TestHarnessAgent: creates a fixed test harness
  with --correctness/--profile/--benchmark modes. Reads INSTRUCTIONS.md for
  harness rules. The harness is an immutable evaluation contract.
- Update INSTRUCTIONS.md: section 1a references pre-scanned discovery results
  (no re-discovery needed), section 1b notes pre-built harness from UTA

Co-authored-by: Cursor <cursoragent@cursor.com>
Umangatamd and others added 30 commits February 17, 2026 04:59
In large repos like aiter (161 test files), content-based scoring alone
gives all tests confidence=1.0, making results effectively random.
test_moe_dp_share_expert.py ranked above test_rope.py for a rope kernel.

Fixes:
- Add _relevance_score() that combines name matching + path proximity
- Kernel name in filename: +3.0, in path: +2.0, partial match: +0.5*n
- Path proximity: tests near the kernel in directory tree get +1.0
- Remove confidence cap at 1.0 so relevant tests visibly outrank generic
- Add generic stem detection (kernel.py -> parent dir name) to MCP server

Before: test_moe_sorting_mxfp4.py (conf=1.0) for rope kernel
After: test_rope.py (conf=5.85) for rope kernel
Co-authored-by: Cursor <cursoragent@cursor.com>
When multiple test files share the same name (e.g. test_rope.py in
op_tests/ and triton_tests/rope/), the summary now shows the full
path instead of just the filename.

Co-authored-by: Cursor <cursoragent@cursor.com>
Phase 1 (automated): Now accepts kernel_function param from
resolve_kernel_url (e.g. "_rope_fwd"). Uses function name to boost
test files that actually reference the target function in their source.
Also extracts @triton.jit and __global__ function names from kernel.

Phase 2 (LLM finisher): After Phase 1 ranking, calls the AMD LLM
gateway to validate whether the top test actually exercises the target
kernel functions. If it does, isolates relevant test functions. If not,
generates a focused test script from scratch that directly imports and
tests the kernel functions. Writes the script to output_dir.

Tested on aiter rope kernel:
- Phase 1: test_rope.py ranked #1 (correct)
- Phase 2: LLM correctly identified that test_rope.py tests high-level
  wrappers, not the low-level _rope_fwd kernel, and generated a focused
  test that directly exercises the Triton helper functions

Co-authored-by: Cursor <cursoragent@cursor.com>
When discover() is given a directory with multiple kernels, each kernel
now gets its own recommended_test and recommended_benchmark based on
per-kernel relevance scoring (using _relevance_score). Previously all
tests were scored globally with one flat "Recommended test" for the
whole directory.

Also fixes directory mode parity with single-file mode:
- Apply generic stem fix (kernel.py -> parent dir name)
- Remove confidence cap (min(score, 1.0)) so relevant tests visibly
  outrank generic ones
- Use _relevance_score for per-kernel matching

Tested on aiter (124 kernels) and geak_eval (8 kernels): each kernel
gets its own correct recommendation (e.g. rope -> test_rope.py,
topk -> test_moe_topk_sigmoid.py).

Co-authored-by: Cursor <cursoragent@cursor.com>
…ndling

28 tests covering all scenarios discovered during development:

Single-file mode (8 tests):
- Triton kernel finds matching test (not random)
- Generic kernel.py uses parent dir name
- kernel_function param boosts matching tests
- Function names extracted from kernel source
- Non-existent path returns error
- No confidence cap (relevant tests score > 1.0)
- Irrelevant tests don't rank above relevant ones
- Full path in summary (not ambiguous filename)

Directory mode (7 tests):
- Directory with .git used as workspace
- Directory without .git expands upward
- Per-kernel recommendations for multi-kernel dirs
- Generic kernel.py names resolved in directory mode
- Empty directory returns zero
- Single kernel collapses to dict
- Per-kernel summary in output

Relevance scoring (3 tests):
- Name in filename scores highest
- Name in path scores second
- Path proximity boosts nearby tests

Main pipeline directory handling (4 tests):
- _expand_workspace_for_dir uses dir itself
- Fallback uses dir (not parent)
- Kernel name uses parent dir for generic names
- Nested kernels found recursively

Pattern extraction (2 tests):
- Extracts atol/rtol tolerances
- Extracts torch dtype references

Enriched context formatter (4 tests):
- Includes kernel analysis
- Includes language-specific guidance
- None result returns empty
- Includes extracted patterns

Co-authored-by: Cursor <cursoragent@cursor.com>
For kernel gemm_a8w8, test_gemm_a8w8.py (exact match) now scores higher
than test_gemm_a8w8_blockscale.py (substring containment).

Scoring tiers:
- Exact stem match (test_<kernel>.py): +4.0
- Substring in filename: +2.5
- Name in path: +2.0
- Partial parts match: +0.5 * n

This prevents confusion between similar kernels like:
- gemm_a8w8 vs gemm_a8w8_blockscale
- gemm_a16w16 vs gemm_a16w16_gated
- fp8 vs fp16 kernel variants

Added unit test: test_exact_stem_beats_substring (29 tests total)
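The tiers above can be sketched as a small scoring function (a simplification: the real _relevance_score also includes the path-proximity terms from the earlier commit):

```python
from pathlib import Path

def relevance_score(kernel_name: str, test_path: str) -> float:
    """Tiered filename/path scoring so exact matches beat substrings."""
    stem = Path(test_path).stem              # e.g. "test_gemm_a8w8"
    score = 0.0
    if stem == f"test_{kernel_name}":
        score += 4.0                         # exact stem match
    elif kernel_name in stem:
        score += 2.5                         # substring in filename
    if kernel_name in str(Path(test_path).parent):
        score += 2.0                         # kernel name in the path
    matched = [p for p in kernel_name.split("_") if p in stem]
    score += 0.5 * len(matched)              # partial token matches
    return score
```

Because the exact-stem tier (+4.0) outweighs substring containment (+2.5), `test_gemm_a8w8.py` now outranks `test_gemm_a8w8_blockscale.py` even though both contain the kernel name.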

Co-authored-by: Cursor <cursoragent@cursor.com>
Make every optimization pipeline step independently callable from the CLI
with chainable --from-* flags and -o output options:

New modules:
- task_file.py: shared YAML-frontmatter task file I/O + git worktree helpers
  (extracted from ParallelAgent to eliminate duplication)
- task_generator.py: LLM-assisted task generation with -o DIR for Markdown
  task files, --from-results for iterative round-over-round refinement
- commandment.py: programmatic COMMANDMENT.md generation with built-in
  validation loop

CLI enhancements:
- openevolve-worker --from-task: reads task .md, auto-creates worktree
- geak --from-task: reads task .md, populates --task/--repo, auto-worktree
- All pipeline CLIs support --from-discovery/--from-resolved/--from-profile
  for seamless chaining via intermediate JSON files
- resolve-kernel-url --json/-o, test-discovery --from-resolved/-o,
  kernel-profile --from-discovery/--json/-o, baseline-metrics --from-profile

Infrastructure:
- pyproject.toml: new entry points for all modular CLI tools
- entrypoint.sh: health checks for new tools
- scripts/run-docker.sh: updated docs with full pipeline, iterative
  refinement, and --from-task examples
- tests/run/test_task_generator.py: unit tests for LLM fallback logic
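The YAML-frontmatter task file convention can be illustrated with a minimal splitter (a sketch only: task_file.py presumably uses a real YAML parser, while this handles just simple `key: value` lines):

```python
def split_task_file(text: str):
    """Split a task .md into (frontmatter dict, Markdown body)."""
    if not text.startswith("---\n"):
        return {}, text                      # no frontmatter block
    header, _, body = text[4:].partition("\n---\n")
    meta = {}
    for line in header.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
    return meta, body.lstrip("\n")
```

With this shape, `--from-task` consumers can read `kernel_file`, `test_command`, and repo metadata from the header while passing the body through as the agent task text.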

Co-authored-by: Cursor <cursoragent@cursor.com>
Modular task pipeline: task files, --from-task, iterative refinement
- Fix agent working directory: set env.config.cwd to repo/worktree path
  so bash commands and test_perf run in the correct location instead of
  the container root (/workspace)
- Fix env_factory lambda to propagate cwd for parallel agent respawns
- Set agent.base_repo_path for single-agent --from-task runs, enabling
  correct patch generation (diff between original repo and worktree)
- Suppress config auto-detection conflicts for --from-task by not
  passing task body to load_and_merge_configs
- Skip redundant baseline profiling: prepend note to task body when
  running from task files so agent skips re-profiling
- Inject task metadata (KERNEL FILE, TEST COMMAND, REPO ROOT) into
  agent prompt so it knows exactly which files to edit
- Print agent log path prominently with tail -f hint
- Skip test harness creation for --from-task runs
- Extract test_command from discovery JSON into task file metadata
- Guard Path() checks against long task body strings (OSError fix)
- Fix empty tools list causing litellm BadRequestError in amd_claude
- Add run-tasks CLI and task_runner module for batch task execution
- Add model registry utility
- Agent-based task generator with tool-calling (replaces monolithic prompt)
- Backend-neutral kernel profiler output (metrix + rocprof-compute)
- Streamlined README, docs, and code cleanup across the codebase

Co-authored-by: Cursor <cursoragent@cursor.com>
- Fix orchestrator and task_generator to prefer focused_test command
  from discovery.json instead of the original repo test, which was
  causing all test_perf validations to fail with argument parsing errors
- Add auto-finalization when orchestrator exhausts step limit: scans all
  rounds for best_results.json, picks highest speedup, writes final_report.json
- Inject COMMANDMENT, baseline metrics, profiling data, and explicit
  kernel/repo/test paths into sub-agent task bodies via dispatch.py
- Exclude traj.json and *.log from git diff patches in test_perf
- Add .ruff_cache to .dockerignore
- Run ruff check + format on entire src/
- Update README with high-level and low-level command reference,
  output directory structure, and architecture overview

Co-authored-by: Cursor <cursoragent@cursor.com>
- Fix dispatch.py: run_parallel returns (task_id, agent, exit_status,
  result) but was unpacked as (idx, result, patches, exit), causing
  wrong success checks and crashes on len(string). Now counts patches
  from disk instead.
- Fix select_patch_agent.py: log exceptions instead of silently
  swallowing them in run_select_patch().
- Fix orchestrator.py: copy original model tools list before mutation
  so nested callers (task_generator) don't corrupt the saved reference.
- Fix orchestrator.py: wrap _dispatch_tool_call in try/except so tool
  failures (e.g. API outage during task generation) return JSON error
  payloads to the LLM instead of crashing the orchestrator loop.

Co-authored-by: Cursor <cursoragent@cursor.com>
The MCP tool packages (automated-test-discovery, kernel-ercs, etc.) are
installed via pip in the Dockerfile but not declared in pyproject.toml.
This causes 18+ test failures when running outside Docker (local dev, CI).

Add sys.path entries for all mcp_tools/*/src/ directories in conftest.py
so tests can import these packages without requiring a separate pip install.

Co-authored-by: Cursor <cursoragent@cursor.com>
Tests were passing {"command": "..."} (a dict) to execute() which
expects a plain string. This caused TypeError in subprocess since the
dict was forwarded as the cwd argument. Fixed all 22 call sites to pass
the command string directly.
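The failure mode is a classic positional mix-up; a sketch under an assumed `execute()` signature (the real method's signature is not shown in this PR):

```python
import subprocess

def execute(command: str, cwd=None):
    """Assumed environment API: `command` must be a plain string."""
    return subprocess.run(command, shell=True, cwd=cwd,
                          capture_output=True, text=True)

# Buggy call-site pattern the commit fixed:
#   execute({"command": "echo hi"})
# The dict lands where subprocess expects a string/path and raises.
result = execute("echo hi")   # fixed: pass the command string directly
```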

Co-authored-by: Cursor <cursoragent@cursor.com>
kernel-ercs and kernel-evolve MCP servers require fastmcp>=2.0.0 to
start. Without it, the server subprocess crashes on import causing
7 test failures. Adding it to dev dependencies ensures MCP smoke tests
pass outside Docker.

Co-authored-by: Cursor <cursoragent@cursor.com>
The task-generation agent was hitting LimitsExceeded on complex kernels
(e.g. RoPE with 30+ functions). Defaults raised from 30/2.0 to 75/10.0
and made configurable via GEAK_TASKGEN_STEP_LIMIT / GEAK_TASKGEN_COST_LIMIT
environment variables.

Co-authored-by: Cursor <cursoragent@cursor.com>
Catch subprocess.TimeoutExpired and return a structured dict with
returncode=-1 and exception_info instead of letting it propagate as
an unhandled exception. Matches the contract expected by tests and
the pattern used by other environments.
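The contract described above, sketched with assumed field names:

```python
import subprocess

def run_with_timeout(command: str, timeout: float) -> dict:
    """Return a structured result instead of raising on timeout."""
    try:
        proc = subprocess.run(
            command, shell=True, capture_output=True, text=True,
            timeout=timeout,
        )
        return {"returncode": proc.returncode,
                "output": proc.stdout + proc.stderr}
    except subprocess.TimeoutExpired as exc:
        # Sentinel returncode + context, matching the test contract.
        return {
            "returncode": -1,
            "output": "",
            "exception_info": f"TimeoutExpired after {exc.timeout}s: {command}",
        }
```

Callers can then branch on `returncode == -1` uniformly instead of wrapping every environment call in its own try/except.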

Co-authored-by: Cursor <cursoragent@cursor.com>
Fix task pipeline: agent cwd, config conflicts, and task execution
resolve_kernel_url stored local_repo_path as a relative path while
local_file_path was absolute. The parallel agent resolved the relative
path against the task file directory, producing a doubled nonsense path
that didn't exist. Now all three layers ensure absolute paths: the
source (resolve_kernel_url_impl), the orchestrator context loader, and
the dispatch batch runner.
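The doubled-path failure comes from resolving a relative path against the wrong base directory; a minimal illustration (paths are made up):

```python
from pathlib import Path

recorded = Path(".geak_resolved/aiter")        # relative, as stored
task_dir = Path("/workspace/out/tasks")        # where the task file lives

# Bug: joining against the task-file directory fabricates a path that
# never existed.
wrong = task_dir / recorded   # /workspace/out/tasks/.geak_resolved/aiter

# Fix: make the path absolute at the source, so every downstream
# consumer (orchestrator, dispatch) sees the same location.
right = recorded.resolve()    # anchored to the cwd where it was recorded
```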

Co-authored-by: Cursor <cursoragent@cursor.com>
The Full Pipeline Mode (preprocessor → orchestrator) was skipping the
UnitTestAgent, relying on a single-shot LLM finisher in the MCP
discovery server for harness creation. That approach consistently failed
because a single LLM call can't reliably generate correct test harnesses
(wrong tensor shapes, wrong tolerances, wrong imports).

The UnitTestAgent is a multi-turn agent with bash/editor tools that can
read the kernel, read existing tests, run them, see errors, and iterate
until the harness works. It was already built for this purpose but
wasn't wired into the new pipeline.

Changes:
- preprocessor.py: Add model/model_factory params to run_preprocessor().
  After MCP discovery (Step 2), run UnitTestAgent (Step 2b) with
  discovery context to create a validated harness. Extract absolute path
  to the harness script for the profiler. Fall back to raw discovery
  test command if UnitTestAgent fails.
- mini.py: Pass model and model_factory to run_preprocessor().

Tested on ROCm/aiter RoPE kernel: UnitTestAgent creates a working
harness, profiling succeeds (48.44 us baseline), orchestrator generates
tasks, optimization agent produces 18+ patches with ~13% speedup.

Co-authored-by: Cursor <cursoragent@cursor.com>
Wire UnitTestAgent into Full Pipeline Mode preprocessor
…ntext passing

Co-authored-by: Cursor <cursoragent@cursor.com>
- Fix GPU isolation: propagate HIP_VISIBLE_DEVICES through BashCommand,
  MCPToolBridge, ProfilingAnalyzer, and OpenEvolve subprocess env.
  Prevent shallow-copy race in ParallelAgent by creating new env dicts
  per thread. Add defensive copy in ToolRuntime.set_env().

- GPU-aware task generation: extend AgentTask with num_gpus, teach
  task-generator LLM to allocate GPUs per task, ParallelAgent acquires
  N GPU slots from pool for multi-GPU tasks (e.g. OpenEvolve).

- Docker: remove hardcoded HIP_VISIBLE_DEVICES=0 from Dockerfile,
  unset it in entrypoint.sh so geak --gpu-ids controls isolation.

- Fix profiler integration tests: add __main__ to examples/add_kernel
  so rocprofv3 captures GPU activity, fix MetrixTool empty
  HIP_VISIBLE_DEVICES handling, update test assertions to match
  add_kernel (not rope), mark rocprof-compute roofline as xfail.

- Add developer docs: gpu-isolation.md (invariants, how-to, pitfalls),
  update architecture/flow/tools diagrams with SweAgent, codebase
  context passing chain, multi-GPU dispatch, and --gpu-ids flags.
  Remove redundant diagrams.md.
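The shallow-copy-race fix for GPU isolation boils down to giving each worker thread its own env dict; a sketch (helper name is an assumption):

```python
import threading

def launch_worker(base_env: dict, gpu_id: int, target):
    """Start a worker with a private env dict: mutating one thread's
    HIP_VISIBLE_DEVICES can no longer leak into a sibling thread's env."""
    env = dict(base_env)                        # defensive copy per thread
    env["HIP_VISIBLE_DEVICES"] = str(gpu_id)    # pin this worker's GPU
    t = threading.Thread(target=target, args=(env,))
    t.start()
    return t
```

A shared dict (or a shallow copy of a structure that still aliases the dict) is exactly how two parallel agents end up profiling on the same GPU.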

Co-authored-by: Cursor <cursoragent@cursor.com>
- Fix 1: profiler-mcp no longer mutates os.environ; passes clean env
  via _env_override to ProfilingAnalyzer subprocess instead.
- Fix 2: Centralize agent-type ↔ class mappings into agent_spec.py
  (_agent_type_to_class / _agent_class_to_type) eliminating 4 duplicate
  definitions across dispatch, orchestrator, task_generator, task_runner.
- Fix 3: Replace silent `except Exception: pass` in
  OpenEvolveWorker._save_result_artifacts with logger.warning().
- Fix 4: Add public set_tools() to AmdLlmModelBase and AmdLlmModel
  router; SweAgent and task_generator use it instead of reaching into
  model._impl.
- Fix 5: Remove duplicate `cfg: dict` type annotation in dispatch.py
  else-branch.
- Fix 6: Harden _derive_test_command_from_commandment to support
  fenced code blocks, add fallback for raw .py commands, and log
  debug messages on parse outcomes.

Co-authored-by: Cursor <cursoragent@cursor.com>
The previous _env_override approach didn't actually remove the empty
key from the subprocess env (dict merge brings it back from os.environ).
Switch to save/restore of os.environ, which is safe here because
profiler-mcp runs as a dedicated single-threaded MCP server process.

Co-authored-by: Cursor <cursoragent@cursor.com>
SweAgent, OpenEvolve fixes and context
The test harness had no control over how many shapes were used for
profiling vs testing, causing OOM during GPU profiling.

Changes:
- Add select_shapes_uniform() utility in discovery.py for programmatic
  shape selection (dedup, sort by element count, uniform sampling)
- UnitTestAgent system prompt now instructs the LLM to read discovered
  test files, extract ALL shapes (variables, loops, configs — not just
  literal tuples), and build two lists:
  HARNESS_SHAPES (20-25) for correctness/benchmark
  PROFILE_SHAPES (5) for --profile mode only
- format_discovery_for_agent() cleaned up: passes all extracted patterns
  without truncation so the LLM has full shape context
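A sketch of `select_shapes_uniform()` as described (dedup, sort by element count, uniform sampling), including the count<=1 guards added by a later fix in this PR:

```python
import math

def select_shapes_uniform(shapes, count):
    """Dedup shapes, sort by element count, pick `count` of them spread
    uniformly across the size range."""
    pool = sorted(set(shapes), key=math.prod)
    if count <= 0 or not pool:
        return []
    if count >= len(pool):
        return pool
    if count == 1:
        return [pool[len(pool) // 2]]   # median shape
    # Uniform indices across [0, len(pool)-1]; dividing by count-1 is
    # why count == 1 needs the early return above.
    step = (len(pool) - 1) / (count - 1)
    return [pool[round(i * step)] for i in range(count)]
```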

Co-authored-by: Cursor <cursoragent@cursor.com>
The harness now supports four CLI modes with distinct shape sets:
  --profile        → PROFILE_SHAPES (5)
  --benchmark      → HARNESS_SHAPES (20-25 sampled)
  --correctness    → HARNESS_SHAPES
  --full-benchmark → ALL_SHAPES (every discovered shape)

--full-benchmark runs all discovered shapes and is intended for use
only at the start and end of optimization to get the complete picture.
--benchmark uses the sampled subset for fast iteration loops.
If ALL_SHAPES has ≤25 entries, HARNESS_SHAPES = ALL_SHAPES and both
benchmark modes behave identically.

Updated INSTRUCTIONS.md and UTA system prompt accordingly.
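The four-mode mapping can be sketched with argparse (the shape lists here are illustrative placeholders, not the discovered sets):

```python
import argparse

ALL_SHAPES = [(2**i, 64) for i in range(12)]   # every discovered shape
# Per the rule above: if ALL_SHAPES has <= 25 entries, the sampled set
# equals the full set and both benchmark modes behave identically.
HARNESS_SHAPES = ALL_SHAPES if len(ALL_SHAPES) <= 25 else ALL_SHAPES[::2]
PROFILE_SHAPES = HARNESS_SHAPES[:5]            # small set to avoid OOM

def shapes_for(argv):
    p = argparse.ArgumentParser()
    mode = p.add_mutually_exclusive_group(required=True)
    for flag in ("--profile", "--benchmark",
                 "--correctness", "--full-benchmark"):
        mode.add_argument(flag, action="store_true")
    args = p.parse_args(argv)
    if args.profile:
        return PROFILE_SHAPES
    if args.full_benchmark:
        return ALL_SHAPES
    return HARNESS_SHAPES   # --benchmark and --correctness share one set
```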

Co-authored-by: Cursor <cursoragent@cursor.com>
The baseline must record BOTH --benchmark (reduced, 20-25 shapes) and
--full-benchmark (all shapes) results. During iterations the agent
compares reduced vs reduced; at the end it compares full vs full.
Mixing modes in a comparison produces meaningless speedup numbers
because the shape sets differ.

Co-authored-by: Cursor <cursoragent@cursor.com>
Fix OOM in profiling: LLM-driven shape extraction from discovery
The uniform index calculation divides by (count-1), which crashes when
count=1. Add early returns for count<=0 (empty) and count==1 (median
shape).

Co-authored-by: Cursor <cursoragent@cursor.com>