Commit 9f38d13

sjarmak and claude committed
feat: US-023 - Documentation and extensibility guide
- Created docs/MCP_UNIQUE_TASKS.md: full end-to-end guide for MCP-unique system
- Architecture ASCII diagram showing component relationships
- Suite structure table (10 suites, 6 active with task counts)
- Task authoring: generator usage + worked example (CCX-dep-trace-001)
- Evaluation framework: oracle check types, agent answer format, PRD spec
- Running tasks: selection-file, category filter, monitoring, report
- Retrieval metrics API with code example
- Deep Search tasks: 3 DS variants, rubric judge, hybrid scoring (60/40)
- Extensibility: new tasks, categories, sg-benchmarks mirrors, cross-host deferral
- Design decisions Q1-Q10 rationale
- Updated docs/EXTENSIBILITY.md: Section 7 for MCP-unique tasks with constraints
- Updated docs/CONFIGS.md: --selection-file and --use-case-category flags documented
- Updated docs/SCORING_SEMANTICS.md: oracle checks, composite score, hybrid scoring
- Updated CLAUDE.md: references MCP-unique extension + new doc entries

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1 parent eb34190 commit 9f38d13

7 files changed: +683 -5 lines changed

CLAUDE.md

Lines changed: 7 additions & 4 deletions
@@ -4,10 +4,11 @@ This file is the operational quick-reference for benchmark maintenance.
 `AGENTS.md` mirrors this file.
 
 ## Benchmark Overview
-8 SDLC phase suites, 157 tasks. Tasks are organized by development lifecycle
-phase: build, debug, design, document, fix, secure, test, understand.
+8 SDLC phase suites + 6 MCP-unique suites. SDLC tasks measure code quality
+across phases: build, debug, design, document, fix, secure, test, understand.
+MCP-unique tasks measure org-scale cross-repo discovery and retrieval.
 See `README.md` for the full suite table and `docs/TASK_CATALOG.md` for
-per-task details.
+per-task details. See `docs/MCP_UNIQUE_TASKS.md` for the MCP-unique extension.
 
 ## Canonical References
 - `README.md` - repo overview and quick start
@@ -16,7 +17,9 @@ per-task details.
 - `docs/ERROR_CATALOG.md` - known failures and remediation
 - `docs/TASK_SELECTION.md` - curation/difficulty policy
 - `docs/TASK_CATALOG.md` - current task inventory
-- `docs/SCORING_SEMANTICS.md` - reward/pass interpretation
+- `docs/SCORING_SEMANTICS.md` - reward/pass interpretation (incl. oracle checks + hybrid scoring)
+- `docs/MCP_UNIQUE_TASKS.md` - MCP-unique task system (suites, authoring, oracle, DS tasks)
+- `docs/MCP_UNIQUE_CALIBRATION.md` - oracle coverage analysis and threshold calibration data
 - `docs/WORKFLOW_METRICS.md` - timing/cost metric definitions
 - `docs/AGENT_INTERFACE.md` - runtime I/O contract
 - `docs/EXTENSIBILITY.md` - safe suite/task/config extension

docs/CONFIGS.md

Lines changed: 55 additions & 0 deletions
@@ -138,3 +138,58 @@ fail-fast if that model is unavailable in the configured Codex environment.
 
 For this rollout, Codex MCP policy is sourcegraph_full-only for MCP-enabled
 runs, with baseline comparisons using `none`. No other MCP modes are allowed.
+
+## Running MCP-Unique Tasks
+
+MCP-unique tasks use a separate selection file and support category filtering.
+
+### Selection File
+
+```bash
+# Run all 14 MCP-unique starter tasks (both configs)
+configs/run_selected_tasks.sh \
+  --selection-file configs/selected_mcp_unique_tasks.json
+
+# Dry run to preview
+configs/run_selected_tasks.sh \
+  --selection-file configs/selected_mcp_unique_tasks.json \
+  --dry-run
+```
+
+The `--selection-file` flag accepts any path to a selection JSON file. The
+file format is compatible with `configs/selected_benchmark_tasks.json` but
+uses `mcp_suite` instead of `benchmark` for suite identification.
+
+### Category Filter
+
+```bash
+# Run only category A (cross-repo tracing)
+configs/run_selected_tasks.sh \
+  --selection-file configs/selected_mcp_unique_tasks.json \
+  --use-case-category A
+
+# Run only Deep Search relevant tasks (E and J categories)
+configs/run_selected_tasks.sh \
+  --selection-file configs/selected_mcp_unique_tasks.json \
+  --use-case-category E
+configs/run_selected_tasks.sh \
+  --selection-file configs/selected_mcp_unique_tasks.json \
+  --use-case-category J
+```
+
+The `--use-case-category` flag filters tasks by the `use_case_category` field in
+the selection file (values: A through J, corresponding to the 10 ccb_mcp_* suites).
+This flag is only meaningful when used with `--selection-file`.
+
+### MCP-Unique vs Standard Suites
+
+| Feature | Standard suites | MCP-unique suites |
+|---------|----------------|-------------------|
+| Selection file | `selected_benchmark_tasks.json` | `selected_mcp_unique_tasks.json` |
+| Suite prefix | `ccb_<phase>` | `ccb_mcp_<category>` |
+| Verifier script | `tests/test.sh` | `tests/eval.sh` |
+| Oracle format | task-specific | `oracle_answer.json` + `oracle_checks.py` |
+| Local repo | full workspace | 1 local_checkout repo only |
+| MCP-Full behavior | truncated source | no source clone |
+
+See `docs/MCP_UNIQUE_TASKS.md` for full task authoring and evaluation details.
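
Reviewer note: the category filter documented in this hunk can be illustrated with a minimal Python sketch. It is not the logic in `configs/run_selected_tasks.sh`, which this diff does not show. The top-level `tasks` list, the `task_id` key, the suite names, and the helper name `filter_by_category` are all assumptions made for illustration; only the `mcp_suite` and `use_case_category` field names and the A through J value range come from the documentation above.

```python
from __future__ import annotations

import json
from pathlib import Path

# Categories named in the docs: A through J, one per ccb_mcp_* suite.
VALID_CATEGORIES = tuple("ABCDEFGHIJ")

# Hypothetical selection data. The "tasks" list, the "task_id" key and the
# suite names are assumptions; the real file layout is not shown in the diff.
DEMO_SELECTION = {
    "tasks": [
        {"task_id": "demo-task-1", "mcp_suite": "ccb_mcp_demo_a", "use_case_category": "A"},
        {"task_id": "demo-task-2", "mcp_suite": "ccb_mcp_demo_e", "use_case_category": "E"},
        {"task_id": "demo-task-3", "mcp_suite": "ccb_mcp_demo_j", "use_case_category": "J"},
    ]
}


def load_selection(path: str) -> dict:
    """Read a selection JSON file from disk."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def filter_by_category(selection: dict, category: str) -> list[dict]:
    """Return the task entries whose use_case_category equals `category`."""
    wanted = category.strip().upper()
    if wanted not in VALID_CATEGORIES:
        raise ValueError(f"use-case category must be one of A-J, got {category!r}")
    # ASSUMPTION: task entries live in a top-level "tasks" list.
    return [
        task
        for task in selection.get("tasks", [])
        if str(task.get("use_case_category", "")).upper() == wanted
    ]
```

Under these assumptions, `filter_by_category(DEMO_SELECTION, "E")` keeps only the entry tagged `E`, which mirrors running the script once per category as the Deep Search example in the hunk does.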

docs/EXTENSIBILITY.md

Lines changed: 47 additions & 0 deletions
@@ -80,3 +80,50 @@ When adding benchmark environment variants, keep canonical task definitions inta
 3. Document variant intent and caveats in a per-suite `VARIANTS.md`
    (for example under `benchmarks/ccb_document/`).
 4. Treat variant runs as separate studies in reporting and curation.
+
+## 7) Add MCP-Unique Tasks (ccb_mcp_* suites)
+
+MCP-unique tasks measure org-scale cross-repo discovery — what local-only agents
+cannot do. See `docs/MCP_UNIQUE_TASKS.md` for the full authoring guide.
+
+**Quick start:**
+
+```bash
+# 1. Generate from use case registry
+python3 scripts/generate_mcp_unique_tasks.py --use-case-ids <N> --curate-oracle --validate
+
+# 2. Register in selection file
+# configs/selected_mcp_unique_tasks.json
+
+# 3. Validate
+python3 scripts/validate_mcp_task_instance.py --task-dir benchmarks/ccb_mcp_<suite>/<task>
+python3 scripts/validate_tasks_preflight.py --suite ccb_mcp_<suite>
+```
+
+**Key constraints:**
+- `task.toml` verification type must be `"test"` (Harbor standard)
+- `tests/eval.sh` must be executable (`chmod +x`)
+- Use `/tests/` paths inside eval.sh (Harbor uploads `tests/` to `/tests/`)
+- All repos in fixtures must be indexed in Sourcegraph
+- `scripts/ccb_metrics/oracle_checks.py` must be stdlib-only Python
+
+**Directory structure:**
+```
+benchmarks/ccb_mcp_<suite>/<task>/
+├── task.toml
+├── instruction.md
+├── environment/
+│   ├── Dockerfile          (baseline: clones local_checkout_repo)
+│   └── Dockerfile.sg_only  (MCP-full: no clone, marks /tmp/.sg_only_mode)
+└── tests/
+    ├── eval.sh             (exit-code-first evaluator)
+    ├── task_spec.json      (PRD-centered spec)
+    ├── oracle_answer.json  (gold agent answer)
+    ├── oracle_checks.py    (stdlib eval library)
+    └── criteria.json       (optional: rubric for Deep Search tasks)
+```
+
+When adding a new ccb_mcp_* suite, add the prefix to `DIR_PREFIX_TO_SUITE` in:
+- `scripts/aggregate_status.py`
+- `scripts/generate_manifest.py`
+- `scripts/run_judge.py`
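
Reviewer note: the directory structure and the executable-bit constraint added in this hunk can be sanity-checked with a short Python sketch. This is a hypothetical helper, not the contents of `scripts/validate_mcp_task_instance.py`, which the diff does not show. It assumes every file in the tree except `tests/criteria.json` is required; the names `REQUIRED_FILES` and `check_task_layout` are invented for illustration.

```python
from __future__ import annotations

import os
from pathlib import Path

# Files listed in the documented task tree. ASSUMPTION: all are required
# except tests/criteria.json, which the docs mark as optional.
REQUIRED_FILES = (
    "task.toml",
    "instruction.md",
    "environment/Dockerfile",
    "environment/Dockerfile.sg_only",
    "tests/eval.sh",
    "tests/task_spec.json",
    "tests/oracle_answer.json",
    "tests/oracle_checks.py",
)


def check_task_layout(task_dir: str | os.PathLike) -> list[str]:
    """Return human-readable problems found in a task directory (empty if none)."""
    root = Path(task_dir)
    # Report every required file that is absent.
    problems = [
        f"missing: {rel}" for rel in REQUIRED_FILES if not (root / rel).is_file()
    ]
    # The docs require tests/eval.sh to be executable (chmod +x).
    eval_sh = root / "tests" / "eval.sh"
    if eval_sh.is_file() and not os.access(eval_sh, os.X_OK):
        problems.append("not executable: tests/eval.sh")
    return problems
```

A call such as `check_task_layout("benchmarks/ccb_mcp_<suite>/<task>")`, with the placeholders filled in, would return an empty list for a directory that matches the tree above. The sketch does not cover the other constraints (the `task.toml` verification type, `/tests/` paths inside `eval.sh`, Sourcegraph indexing).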
