Commit 88bed78

sjarmak and claude committed
feat: expand MCP-unique suite from 20 to 40 tasks across 10 categories
Add 20 new MCP-unique tasks covering all 10 use case categories including two new categories (H: domain lineage, I: agentic correctness). Introduces three new repo sets (apache-kafka-ecosystem, envoy-service-mesh, rust-systems) for Java, C++, and Rust language diversity.

- Create configs/use_case_registry.json with 40 use case entries
- Create 4 repo set fixtures in fixtures/repo_sets/
- Generate 20 task skeletons with full file structure (11 files each)
- Register tasks in both selection files (210 total: 170 SDLC + 40 MCP)
- Add domain-lineage family to generate_mcp_unique_tasks.py
- New suites: ccb_mcp_domain (4 tasks), ccb_mcp_org (3 tasks)

Oracle curation (Phase 3) is pending — task_spec.json oracles are stubbed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 2adaa0c commit 88bed78

230 files changed: +24385 -50 lines changed

Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
FROM ubuntu:22.04

ENV DEBIAN_FRONTEND=noninteractive

# Base tools
RUN apt-get update && apt-get install -y --no-install-recommends \
    git \
    ca-certificates \
    curl \
    python3 \
    g++ make \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /workspace

# Clone local checkout repos (baseline config: agent has local access to these)
RUN git clone --depth 1 --branch v1.31.2 https://github.com/sg-benchmarks/envoy--v1.31.2 /workspace/envoy--v1.31.2
RUN git clone --depth 1 --branch 84e84367 https://github.com/sg-benchmarks/data-plane-api--84e84367 /workspace/data-plane-api--84e84367
RUN git clone --depth 1 --branch 71637ad6 https://github.com/sg-benchmarks/go-control-plane--71637ad6 /workspace/go-control-plane--71637ad6
RUN git clone --depth 1 --branch 957dba5e https://github.com/sg-benchmarks/grpc--957dba5e /workspace/grpc--957dba5e

# Initialize git identity for agent commits
RUN git config --global user.email "agent@example.com" && \
    git config --global user.name "Agent" && \
    git config --global safe.directory '*'

# Create log directories
RUN mkdir -p /logs/agent /logs/verifier

ENTRYPOINT []
Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
# ccx-compliance-052 — artifact_only variant
# No local repo clone — agent uses Sourcegraph MCP exclusively for code access.
# Agent produces answer.json artifact; verifier scores the artifact.

FROM ubuntu:22.04

ENV DEBIAN_FRONTEND=noninteractive
ENV SOURCEGRAPH_REPOS="sg-benchmarks/envoy--v1.31.2,sg-benchmarks/data-plane-api--84e84367,sg-benchmarks/go-control-plane--71637ad6,sg-benchmarks/grpc--957dba5e"

RUN apt-get update && apt-get install -y --no-install-recommends \
    git \
    ca-certificates \
    python3 \
    curl \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /workspace

# Empty workspace — agent discovers code via MCP tools only
RUN git init && \
    git config user.email "agent@example.com" && \
    git config user.name "Agent" && \
    git config --global safe.directory '*'

# Create log directories
RUN mkdir -p /logs/agent /logs/verifier

# Mark artifact-only mode — verifiers and eval scripts check this flag
RUN touch /tmp/.artifact_only_mode

ENTRYPOINT []
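
Reviewer note: the variant Dockerfiles in this commit signal their mode with marker files under `/tmp` (`.artifact_only_mode` here, `.sg_only_mode` in the sg_only variant). The scripts that read these flags are only partly visible in this diff, so the following is a minimal sketch of how a verifier might detect the mode. The function name `detect_mode` and the `"baseline"` label for images without a marker are assumptions, not code from the commit.

```python
from pathlib import Path

# Marker file names come from the Dockerfiles in this commit;
# the mode labels returned here are illustrative.
MARKERS = {
    ".artifact_only_mode": "artifact_only",
    ".sg_only_mode": "sg_only",
}


def detect_mode(tmp_dir="/tmp"):
    """Return the run mode implied by marker files found in tmp_dir."""
    for marker, mode in MARKERS.items():
        if (Path(tmp_dir) / marker).is_file():
            return mode
    # Assumed label for images that carry no marker file.
    return "baseline"


if __name__ == "__main__":
    print(detect_mode())
```

Checking for a file rather than an environment variable keeps the flag visible to any process in the container, including scripts that are sourced rather than executed.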
Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
# CCX-compliance-052 — sg_only variant
# No local repo clone — agent uses Sourcegraph MCP exclusively for code access.
# The verifier clones mirror repos at verification time (no /repo_full/ backup).

FROM ubuntu:22.04

ENV DEBIAN_FRONTEND=noninteractive
ENV SOURCEGRAPH_REPOS="sg-benchmarks/envoy--v1.31.2,sg-benchmarks/data-plane-api--84e84367,sg-benchmarks/go-control-plane--71637ad6,sg-benchmarks/grpc--957dba5e"

RUN apt-get update && apt-get install -y --no-install-recommends \
    git \
    ca-certificates \
    python3 \
    curl \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /workspace

# Empty workspace — agent discovers code via MCP tools only
RUN git init && \
    git config user.email "agent@example.com" && \
    git config user.name "Agent" && \
    git config --global safe.directory '*'

# Create log directories
RUN mkdir -p /logs/agent /logs/verifier

# Mark sg_only mode — verifiers and eval scripts check this flag
RUN touch /tmp/.sg_only_mode

RUN echo '{"workdir":"/workspace","repos":[{"mirror":"sg-benchmarks/envoy--v1.31.2","target_dir":"envoy"},{"mirror":"sg-benchmarks/data-plane-api--84e84367","target_dir":"data-plane-api"},{"mirror":"sg-benchmarks/go-control-plane--71637ad6","target_dir":"go-control-plane"},{"mirror":"sg-benchmarks/grpc--957dba5e","target_dir":"grpc"}]}' > /tmp/.sg_only_clone_manifest.json

ENTRYPOINT []
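
Reviewer note: the clone manifest written to `/tmp/.sg_only_clone_manifest.json` maps each mirror to a target directory under `workdir`. The wrapper that consumes it (`sgonly_verifier_wrapper.sh`) is not part of the visible diff, so this is only a sketch of how the manifest could be turned into clone commands. The function name `clone_commands`, the `https://github.com/` prefix, and the shallow-clone flags are assumptions.

```python
import json
from pathlib import PurePosixPath

# Manifest text copied from the sg_only Dockerfile above.
MANIFEST = (
    '{"workdir":"/workspace","repos":['
    '{"mirror":"sg-benchmarks/envoy--v1.31.2","target_dir":"envoy"},'
    '{"mirror":"sg-benchmarks/data-plane-api--84e84367","target_dir":"data-plane-api"},'
    '{"mirror":"sg-benchmarks/go-control-plane--71637ad6","target_dir":"go-control-plane"},'
    '{"mirror":"sg-benchmarks/grpc--957dba5e","target_dir":"grpc"}]}'
)


def clone_commands(manifest_text, base_url="https://github.com/"):
    """Build one `git clone` argument list per manifest entry (hypothetical helper)."""
    manifest = json.loads(manifest_text)
    workdir = PurePosixPath(manifest["workdir"])
    commands = []
    for repo in manifest["repos"]:
        target = workdir / repo["target_dir"]
        commands.append(
            ["git", "clone", "--depth", "1", base_url + repo["mirror"], str(target)]
        )
    return commands


if __name__ == "__main__":
    for cmd in clone_commands(MANIFEST):
        print(" ".join(cmd))
```

Note that the manifest's `target_dir` values (`envoy`, `grpc`, and so on) differ from the baseline image, which clones into directories named after the mirrors (for example `/workspace/envoy--v1.31.2`).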
Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
# Compliance Audit: Envoy Access Logging Configuration Coverage

## Your Task

Audit the access logging infrastructure in `envoyproxy/envoy`. Find all C++ source files under `source/extensions/access_loggers/` and `source/common/access_log/` that define access log filter implementations or access log sink implementations. Also find the corresponding `.proto` definitions in `envoyproxy/data-plane-api` under `envoy/extensions/access_loggers/`. For each implementation, report whether it supports structured (JSON) logging or only unstructured text logging.

## Context

You are working on a codebase task involving repos from the compliance domain.

## Available Resources

The local `/workspace/` directory contains: sg-benchmarks/envoy--v1.31.2, sg-benchmarks/data-plane-api--84e84367, sg-benchmarks/go-control-plane--71637ad6, sg-benchmarks/grpc--957dba5e.

**Note:** Additional repositories are accessible via Sourcegraph MCP tools:
- `sg-benchmarks/envoy--v1.31.2` (envoyproxy/envoy)
- `sg-benchmarks/data-plane-api--84e84367` (envoyproxy/data-plane-api)
- `sg-benchmarks/go-control-plane--71637ad6` (envoyproxy/go-control-plane)
- `sg-benchmarks/grpc--957dba5e` (grpc/grpc)

## Output Format

Create a file at `/workspace/answer.json` with your findings in the following structure:

```json
{
  "files": [
    {"repo": "org/repo-name", "path": "relative/path/to/file.cpp"}
  ],
  "symbols": [
    {"repo": "org/repo-name", "path": "relative/path/to/file.cpp", "symbol": "SymbolName"}
  ],
  "chain": [
    {"repo": "org/repo-name", "path": "relative/path/to/file.cpp", "symbol": "FunctionName"}
  ],
  "text": "Narrative explanation of your findings, citing repos and file paths."
}
```

Include only the fields relevant to this task. Your answer is evaluated against a closed-world oracle — completeness matters.

## Evaluation

Your answer will be scored on:
- **File recall and precision**: Did you find all relevant files?
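
Reviewer note: the instructions name file recall and precision as the scoring criterion, but `oracle_checks.py` is not shown in this diff and the commit message says the oracles are still stubbed. As a rough sketch of what such a metric could look like, assuming files are matched on exact `(repo, path)` pairs (the function name `file_precision_recall` and the sample entries are hypothetical):

```python
def file_precision_recall(answer_files, oracle_files):
    """Compare two lists of {"repo", "path"} dicts as sets of (repo, path) pairs."""
    found = {(f["repo"], f["path"]) for f in answer_files}
    expected = {(f["repo"], f["path"]) for f in oracle_files}
    hits = len(found & expected)
    # Guard against empty sets so a blank answer scores 0 rather than raising.
    precision = hits / len(found) if found else 0.0
    recall = hits / len(expected) if expected else 0.0
    return precision, recall


if __name__ == "__main__":
    # Placeholder entries, not real oracle data.
    answer = [
        {"repo": "org/repo-name", "path": "a.cc"},
        {"repo": "org/repo-name", "path": "b.cc"},
    ]
    oracle = [
        {"repo": "org/repo-name", "path": "a.cc"},
        {"repo": "org/repo-name", "path": "c.cc"},
    ]
    print(file_precision_recall(answer, oracle))
```

Under a closed-world oracle, every file reported outside the oracle set lowers precision and every oracle file left out lowers recall, which is why the instructions stress completeness.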
Lines changed: 120 additions & 0 deletions
@@ -0,0 +1,120 @@
# IMPORTANT: Source Code Access

**Local source files are not present.** Your workspace does not contain source code. You **MUST** use Sourcegraph MCP tools to discover, read, and understand code before making any changes.

**Target Repositories (version-pinned mirrors):**

- `github.com/sg-benchmarks/envoy--v1.31.2` — use `repo:^github.com/sg-benchmarks/envoy--v1.31.2$` filter
- `github.com/sg-benchmarks/data-plane-api--84e84367` — use `repo:^github.com/sg-benchmarks/data-plane-api--84e84367$` filter
- `github.com/sg-benchmarks/go-control-plane--71637ad6` — use `repo:^github.com/sg-benchmarks/go-control-plane--71637ad6$` filter
- `github.com/sg-benchmarks/grpc--957dba5e` — use `repo:^github.com/sg-benchmarks/grpc--957dba5e$` filter

Scope ALL keyword_search/nls_search queries to these repos.
Use the repo name as the `repo` parameter for read_file/go_to_definition/find_references.


## Required Workflow

1. **Search first** — Use MCP tools to find relevant files and understand existing patterns
2. **Read remotely** — Use `sg_read_file` to read full file contents from Sourcegraph
3. **Edit locally** — Use Edit, Write, and Bash to create or modify files in your working directory
4. **Verify locally** — Run tests with Bash to check your changes

## Tool Selection

| Goal | Tool |
|------|------|
| Exact symbol/string | `sg_keyword_search` |
| Concepts/semantic search | `sg_nls_search` |
| Trace usage/callers | `sg_find_references` |
| See implementation | `sg_go_to_definition` |
| Read full file | `sg_read_file` |
| Browse structure | `sg_list_files` |
| Find repos | `sg_list_repos` |
| Search commits | `sg_commit_search` |
| Track changes | `sg_diff_search` |
| Compare versions | `sg_compare_revisions` |

**Decision logic:**
1. Know the exact symbol? -> `sg_keyword_search`
2. Know the concept, not the name? -> `sg_nls_search`
3. Need definition of a symbol? -> `sg_go_to_definition`
4. Need all callers/references? -> `sg_find_references`
5. Need full file content? -> `sg_read_file`

## Scoping (Always Do This)

```
repo:^github.com/ORG/REPO$ # Exact repo (preferred)
repo:github.com/ORG/ # All repos in org
file:.*\.ts$ # TypeScript only
file:src/api/ # Specific directory
```

Start narrow. Expand only if results are empty.

## Efficiency Rules

- Chain searches logically: search -> read -> references -> definition
- Don't re-search for the same pattern; use results from prior calls
- Prefer `sg_keyword_search` over `sg_nls_search` when you have exact terms
- Read 2-3 related files before synthesising, rather than one at a time
- Don't read 20+ remote files without writing code — once you understand the pattern, start implementing

## If Stuck

If MCP search returns no results:
1. Broaden the search query (synonyms, partial identifiers)
2. Try `sg_nls_search` for semantic matching
3. Use `sg_list_files` to browse the directory structure
4. Use `sg_list_repos` to verify the repository name

---

**Sourcegraph Repositories:** `github.com/sg-benchmarks/envoy--v1.31.2`, `github.com/sg-benchmarks/data-plane-api--84e84367`, `github.com/sg-benchmarks/go-control-plane--71637ad6`, `github.com/sg-benchmarks/grpc--957dba5e`

# Compliance Audit: Envoy Access Logging Configuration Coverage

## Your Task

Audit the access logging infrastructure in `envoyproxy/envoy`. Find all C++ source files under `source/extensions/access_loggers/` and `source/common/access_log/` that define access log filter implementations or access log sink implementations. Also find the corresponding `.proto` definitions in `envoyproxy/data-plane-api` under `envoy/extensions/access_loggers/`. For each implementation, report whether it supports structured (JSON) logging or only unstructured text logging.

## Context

You are working on a codebase task involving repos from the compliance domain.

## Available Resources

The local `/workspace/` directory contains: sg-benchmarks/envoy--v1.31.2, sg-benchmarks/data-plane-api--84e84367, sg-benchmarks/go-control-plane--71637ad6, sg-benchmarks/grpc--957dba5e.

**Note:** Additional repositories are accessible via Sourcegraph MCP tools:
- `sg-benchmarks/envoy--v1.31.2` (envoyproxy/envoy)
- `sg-benchmarks/data-plane-api--84e84367` (envoyproxy/data-plane-api)
- `sg-benchmarks/go-control-plane--71637ad6` (envoyproxy/go-control-plane)
- `sg-benchmarks/grpc--957dba5e` (grpc/grpc)

## Output Format

Create a file at `/workspace/answer.json` with your findings in the following structure:

```json
{
  "files": [
    {"repo": "org/repo-name", "path": "relative/path/to/file.go"}
  ],
  "symbols": [
    {"repo": "org/repo-name", "path": "relative/path/to/file.go", "symbol": "SymbolName"}
  ],
  "chain": [
    {"repo": "org/repo-name", "path": "relative/path/to/file.go", "symbol": "FunctionName"}
  ],
  "text": "Narrative explanation of your findings, citing repos and file paths."
}
```

Include only the fields relevant to this task. Your answer is evaluated against a closed-world oracle — completeness matters.

## Evaluation

Your answer will be scored on:
- **File recall and precision**: Did you find all relevant files?
Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
version = "1.0"

[metadata]
name = "CCX-compliance-052"
description = "Compliance Audit: Envoy Access Logging Configuration Coverage"
license = "Apache-2.0"

[task]
id = "CCX-compliance-052"
repo = "sg-benchmarks/envoy--v1.31.2"
category = "compliance-audit"
language = "c++"
difficulty = "hard"
time_limit_sec = 900
mcp_suite = "ccb_mcp_compliance"
use_case_id = 52
repo_set_id = "envoy-service-mesh"
mcp_unique = true

[verification]
type = "test"
command = "bash /tests/eval.sh"

reward_type = "score"
description = "Compliance Audit: Envoy Access Logging Configuration Coverage"

[environment]
build_timeout_sec = 600.0
Lines changed: 68 additions & 0 deletions
@@ -0,0 +1,68 @@
#!/bin/bash
# eval.sh — MCP-unique benchmark evaluator for CCX-compliance-052
# Exit-code-first (SWE-Factory pattern):
# exit 0 — agent produced useful output (composite score > 0)
# exit 1 — total failure (composite score == 0 or missing answer)
#
# Writes /logs/verifier/reward.txt with the composite score [0.0, 1.0]

set -euo pipefail

TASK_ID="CCX-compliance-052"
ANSWER_PATH="/workspace/answer.json"
TASK_SPEC_PATH="/tests/task_spec.json"
ORACLE_CHECKS="/tests/oracle_checks.py"
REWARD_PATH="/logs/verifier/reward.txt"

mkdir -p /logs/verifier

echo "=== CCX-compliance-052 evaluator ==="
echo "Task spec: $TASK_SPEC_PATH"
echo "Answer: $ANSWER_PATH"
echo ""

# sg_only mode guard: restore full repo if verifier wrapper exists
if [ -f /tmp/.sg_only_mode ] && [ -f /tests/sgonly_verifier_wrapper.sh ]; then
    echo "sg_only mode: sourcing verifier wrapper..."
    source /tests/sgonly_verifier_wrapper.sh
fi

# Verify answer file exists
if [ ! -f "$ANSWER_PATH" ]; then
    echo "ERROR: answer.json not found at $ANSWER_PATH"
    echo "0.0" > "$REWARD_PATH"
    exit 1
fi

# Validate answer is valid JSON
if ! python3 -c "import json; json.load(open('$ANSWER_PATH'))" 2>/dev/null; then
    echo "ERROR: answer.json is not valid JSON"
    echo "0.0" > "$REWARD_PATH"
    exit 1
fi

echo "answer.json found and valid JSON"

# Run oracle checks
if [ ! -f "$ORACLE_CHECKS" ]; then
    echo "ERROR: oracle_checks.py not found at $ORACLE_CHECKS"
    echo "0.0" > "$REWARD_PATH"
    exit 1
fi

echo "Running oracle checks..."
SCORE=$(python3 "$ORACLE_CHECKS" --answer "$ANSWER_PATH" --spec "$TASK_SPEC_PATH" --verbose 2>&1 | tee /dev/stderr | tail -1)

# Validate score is a number
if ! echo "$SCORE" | python3 -c "import sys; float(sys.stdin.read().strip())" 2>/dev/null; then
    echo "ERROR: oracle_checks.py did not return a valid score: $SCORE"
    echo "0.0" > "$REWARD_PATH"
    exit 1
fi

echo ""
echo "Composite score: $SCORE"
echo "$SCORE" > "$REWARD_PATH"

# Exit based on score (SWE-Factory exit-code-first pattern)
python3 -c "import sys; sys.exit(0 if float('$SCORE') > 0 else 1)"
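
Reviewer note: the evaluator takes the last line of the oracle output as the composite score, writes it to the reward file, and exits 0 only when the score is positive. The same decision logic can be sketched in Python for readers who want to unit-test it outside the container. The function names `parse_score` and `exit_code_for` are hypothetical and do not appear in the commit.

```python
def parse_score(output):
    """Return the last line of oracle output as a float, or None if it is not numeric."""
    lines = output.strip().splitlines()
    if not lines:
        return None
    try:
        return float(lines[-1].strip())
    except ValueError:
        return None


def exit_code_for(score):
    """Exit-code-first rule: 0 for any positive score, 1 otherwise."""
    return 0 if score is not None and score > 0 else 1


if __name__ == "__main__":
    sample = "files: 3/4 matched\n0.75\n"  # illustrative oracle output
    score = parse_score(sample)
    print(score, exit_code_for(score))
```

One difference from the shell script is worth keeping in mind: under `set -euo pipefail`, a non-zero exit from `oracle_checks.py` inside the `SCORE=$(...)` pipeline would likely abort `eval.sh` before any reward is written, whereas this sketch treats unparseable output as a zero-reward failure.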
