sourcegraph
diff --git a/CLAUDE.md b/CLAUDE.md
Lines changed: 7 additions & 4 deletions
diff --git a/docs/CONFIGS.md b/docs/CONFIGS.md
Lines changed: 55 additions & 0 deletions
diff --git a/docs/EXTENSIBILITY.md b/docs/EXTENSIBILITY.md
Lines changed: 47 additions & 0 deletions
@@ -4,10 +4,11 @@ This file is the operational quick-reference for benchmark maintenance.
 `AGENTS.md` mirrors this file.
 
 ## Benchmark Overview
-8 SDLC phase suites, 157 tasks. Tasks are organized by development lifecycle
-phase: build, debug, design, document, fix, secure, test, understand.
+8 SDLC phase suites + 6 MCP-unique suites. SDLC tasks measure code quality
+across phases: build, debug, design, document, fix, secure, test, understand.
+MCP-unique tasks measure org-scale cross-repo discovery and retrieval.
 See `README.md` for the full suite table and `docs/TASK_CATALOG.md` for
-per-task details.
+per-task details. See `docs/MCP_UNIQUE_TASKS.md` for the MCP-unique extension.
 
 ## Canonical References
 - `README.md` - repo overview and quick start
@@ -16,7 +17,9 @@ per-task details.
 - `docs/ERROR_CATALOG.md` - known failures and remediation
 - `docs/TASK_SELECTION.md` - curation/difficulty policy
 - `docs/TASK_CATALOG.md` - current task inventory
-- `docs/SCORING_SEMANTICS.md` - reward/pass interpretation
+- `docs/SCORING_SEMANTICS.md` - reward/pass interpretation (incl. oracle checks + hybrid scoring)
+- `docs/MCP_UNIQUE_TASKS.md` - MCP-unique task system (suites, authoring, oracle, DS tasks)
+- `docs/MCP_UNIQUE_CALIBRATION.md` - oracle coverage analysis and threshold calibration data
 - `docs/WORKFLOW_METRICS.md` - timing/cost metric definitions
 - `docs/AGENT_INTERFACE.md` - runtime I/O contract
 - `docs/EXTENSIBILITY.md` - safe suite/task/config extension
 
@@ -138,3 +138,58 @@ fail-fast if that model is unavailable in the configured Codex environment.
 
 For this rollout, Codex MCP policy is sourcegraph_full-only for MCP-enabled
 runs, with baseline comparisons using `none`. No other MCP modes are allowed.
+
+## Running MCP-Unique Tasks
+
+MCP-unique tasks use a separate selection file and support category filtering.
+
+### Selection File
+
+```bash
+# Run all 14 MCP-unique starter tasks (both configs)
+configs/run_selected_tasks.sh \
+  --selection-file configs/selected_mcp_unique_tasks.json
+
+# Dry run to preview
+configs/run_selected_tasks.sh \
+  --selection-file configs/selected_mcp_unique_tasks.json \
+  --dry-run
+```
+
+The `--selection-file` flag accepts any path to a selection JSON file. The
+file format is compatible with `configs/selected_benchmark_tasks.json` but
+uses `mcp_suite` instead of `benchmark` for suite identification.
+
+### Category Filter
+
+```bash
+# Run only category A (cross-repo tracing)
+configs/run_selected_tasks.sh \
+  --selection-file configs/selected_mcp_unique_tasks.json \
+  --use-case-category A
+
+# Run only Deep Search-relevant tasks (categories E and J)
+configs/run_selected_tasks.sh \
+  --selection-file configs/selected_mcp_unique_tasks.json \
+  --use-case-category E
+configs/run_selected_tasks.sh \
+  --selection-file configs/selected_mcp_unique_tasks.json \
+  --use-case-category J
+```
+
+The `--use-case-category` flag filters tasks by the `use_case_category` field in
+the selection file (values: A through J, corresponding to the 10 ccb_mcp_* suites).
+This flag is only meaningful when used with `--selection-file`.
+
+### MCP-Unique vs Standard Suites
+
+| Feature | Standard suites | MCP-unique suites |
+|---------|----------------|-------------------|
+| Selection file | `selected_benchmark_tasks.json` | `selected_mcp_unique_tasks.json` |
+| Suite prefix | `ccb_<phase>` | `ccb_mcp_<category>` |
+| Verifier script | `tests/test.sh` | `tests/eval.sh` |
+| Oracle format | task-specific | `oracle_answer.json` + `oracle_checks.py` |
+| Local repo | full workspace | 1 local_checkout repo only |
+| MCP-Full behavior | truncated source | no source clone |
+
+See `docs/MCP_UNIQUE_TASKS.md` for full task authoring and evaluation details.
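Reviewer sketch, outside the patch: the `--use-case-category` filter described in the `docs/CONFIGS.md` hunk above can be pictured as a plain match on each entry's `use_case_category` field. The Python below is a minimal illustration under stated assumptions. The top-level `tasks` key, the `task_id` field, and the sample entries are hypothetical; only `use_case_category` and `mcp_suite` are named in the hunk. It is not a description of what `configs/run_selected_tasks.sh` actually does.

```python
import json


def filter_by_category(selection, category):
    """Return the entries whose use_case_category equals `category`."""
    # Assumption: entries live in a top-level "tasks" list (hypothetical key).
    return [
        task
        for task in selection.get("tasks", [])
        if task.get("use_case_category") == category
    ]


# Hypothetical sample data, not taken from the real selection file.
SAMPLE_SELECTION = json.loads("""
{
  "tasks": [
    {"task_id": "example-001", "mcp_suite": "ccb_mcp_example_a", "use_case_category": "A"},
    {"task_id": "example-002", "mcp_suite": "ccb_mcp_example_e", "use_case_category": "E"},
    {"task_id": "example-003", "mcp_suite": "ccb_mcp_example_j", "use_case_category": "J"}
  ]
}
""")

# Mirror the two separate runs for categories E and J shown in the hunk.
deep_search_ids = [
    task["task_id"]
    for category in ("E", "J")
    for task in filter_by_category(SAMPLE_SELECTION, category)
]
print(deep_search_ids)
```

Because the flag takes one category per invocation in the examples above, the sketch also filters once per category and concatenates the results instead of assuming a multi-value filter exists.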
@@ -80,3 +80,50 @@ When adding benchmark environment variants, keep canonical task definitions inta
 3. Document variant intent and caveats in a per-suite `VARIANTS.md`
    (for example under `benchmarks/ccb_document/`).
 4. Treat variant runs as separate studies in reporting and curation.
+
+## 7) Add MCP-Unique Tasks (ccb_mcp_* suites)
+
+MCP-unique tasks measure org-scale cross-repo discovery, which local-only agents
+cannot do. See `docs/MCP_UNIQUE_TASKS.md` for the full authoring guide.
+
+**Quick start:**
+
+```bash
+# 1. Generate from use case registry
+python3 scripts/generate_mcp_unique_tasks.py --use-case-ids <N> --curate-oracle --validate
+
+# 2. Register in selection file
+#    configs/selected_mcp_unique_tasks.json
+
+# 3. Validate
+python3 scripts/validate_mcp_task_instance.py --task-dir benchmarks/ccb_mcp_<suite>/<task>
+python3 scripts/validate_tasks_preflight.py --suite ccb_mcp_<suite>
+```
+
+**Key constraints:**
+- `task.toml` verification type must be `"test"` (Harbor standard)
+- `tests/eval.sh` must be executable (`chmod +x`)
+- Use `/tests/` paths inside eval.sh (Harbor uploads `tests/` to `/tests/`)
+- All repos in fixtures must be indexed in Sourcegraph
+- `scripts/ccb_metrics/oracle_checks.py` must be stdlib-only Python
+
+**Directory structure:**
+```
+benchmarks/ccb_mcp_<suite>/<task>/
+├── task.toml
+├── instruction.md
+├── environment/
+│   ├── Dockerfile           (baseline: clones local_checkout_repo)
+│   └── Dockerfile.sg_only   (MCP-full: no clone, marks /tmp/.sg_only_mode)
+└── tests/
+    ├── eval.sh              (exit-code-first evaluator)
+    ├── task_spec.json       (PRD-centered spec)
+    ├── oracle_answer.json   (gold agent answer)
+    ├── oracle_checks.py     (stdlib eval library)
+    └── criteria.json        (optional: rubric for Deep Search tasks)
+```
+
+When adding a new ccb_mcp_* suite, add the prefix to `DIR_PREFIX_TO_SUITE` in:
+- `scripts/aggregate_status.py`
+- `scripts/generate_manifest.py`
+- `scripts/run_judge.py`
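Reviewer sketch, outside the patch: the directory structure and the `chmod +x` constraint in the `docs/EXTENSIBILITY.md` hunk above lend themselves to a quick local check before running the real validators. The Python below is a minimal, stdlib-only illustration. The function name `check_task_layout` and the `REQUIRED_FILES` tuple are hypothetical, the required paths are copied from the directory tree in the hunk (with `criteria.json` left out because it is marked optional), and nothing here reflects how `scripts/validate_mcp_task_instance.py` is implemented.

```python
import os
from pathlib import Path

# Paths copied from the directory tree in the hunk; criteria.json is optional.
REQUIRED_FILES = (
    "task.toml",
    "instruction.md",
    "environment/Dockerfile",
    "environment/Dockerfile.sg_only",
    "tests/eval.sh",
    "tests/task_spec.json",
    "tests/oracle_answer.json",
    "tests/oracle_checks.py",
)


def check_task_layout(task_dir):
    """Return a list of human-readable problems; an empty list means none found."""
    task_dir = Path(task_dir)
    problems = [
        f"missing: {rel}"
        for rel in REQUIRED_FILES
        if not (task_dir / rel).is_file()
    ]
    # The hunk requires tests/eval.sh to be executable (chmod +x).
    eval_sh = task_dir / "tests" / "eval.sh"
    if eval_sh.is_file() and not os.access(eval_sh, os.X_OK):
        problems.append("not executable: tests/eval.sh")
    return problems
```

A call such as `check_task_layout("benchmarks/ccb_mcp_<suite>/<task>")`, with the placeholders filled in, returns the problems as strings, so the result can be printed or turned into a non-zero exit code by the caller.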