Commit 05d0729

sjarmak and claude committed
feat: add ContextBench cross-validation pipeline for baseline vs MCP comparison
Adds 5 scripts to run Harbor's task-solving agent on ContextBench's 1,136 human-annotated SWE-bench tasks and evaluate retrieval quality:

- select_contextbench_pilot.py: stratified task selection + mirror manifest
- scaffold_contextbench_tasks.py: converts parquet → Harbor task dirs
- convert_harbor_to_contextbench.py: ATIF traces → ContextBench trajectory format
- compare_contextbench_results.py: side-by-side baseline vs MCP report
- contextbench_pilot_2config.sh: paired execution launcher

The converter handles both local tools (Read/Grep/Glob) and MCP tools (sg_read_file, sg_keyword_search with "file" key extraction and startLine/endLine span mapping).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
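The converter's tool handling described in the commit message can be illustrated with a minimal sketch. This is not the code from convert_harbor_to_contextbench.py, which is not shown here. The function name `spans_from_call`, the argument keys `file_path`, `offset`, `limit`, and `path`, and the assumption that search hits arrive pre-parsed as dicts are all hypothetical; only the tool names, the "file" key, and the startLine/endLine fields come from the commit message.

```python
from typing import Any


def spans_from_call(
    tool: str, args: dict[str, Any], results: list[dict[str, Any]]
) -> list[dict[str, Any]]:
    """Map one tool call to a list of {"file", "start", "end"} records."""
    if tool == "Read":
        # Assumed local schema: a path plus an optional offset/limit window.
        start = args.get("offset")
        limit = args.get("limit")
        end = start + limit - 1 if start is not None and limit is not None else None
        return [{"file": args["file_path"], "start": start, "end": end}]
    if tool == "sg_read_file":
        # Assumed MCP schema: startLine/endLine map directly onto the span.
        return [
            {
                "file": args["path"],
                "start": args.get("startLine"),
                "end": args.get("endLine"),
            }
        ]
    if tool in {"Grep", "Glob", "sg_keyword_search"}:
        # Search-style tools: keep one record per hit that carries a "file" key.
        return [
            {"file": hit["file"], "start": hit.get("startLine"), "end": hit.get("endLine")}
            for hit in results
            if "file" in hit
        ]
    # Any other tool contributes no retrieval evidence in this sketch.
    return []
```

Normalizing both configurations to the same record shape is what would let a single scorer compare baseline and MCP runs against the same annotations.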
1 parent 323c213 commit 05d0729

7 files changed: +1638 -1 lines changed

configs/contextbench_pilot_2config.sh

Lines changed: 49 additions & 0 deletions
@@ -0,0 +1,49 @@
+#!/bin/bash
+# ContextBench Cross-Validation Pilot: baseline + MCP (50 tasks)
+#
+# Runs Harbor task-solving agent on ContextBench SWE-bench tasks in both
+# baseline (full local code) and MCP (Sourcegraph) configurations.
+#
+# Prerequisites:
+#   1. Run scripts/select_contextbench_pilot.py to select tasks
+#   2. Run scripts/create_sg_mirrors.py to create mirrors
+#   3. Wait 24-48h for Sourcegraph indexing
+#   4. Run scripts/scaffold_contextbench_tasks.py to create task dirs
+#
+# Usage:
+#   source .env.local && export HARBOR_ENV=daytona && export DAYTONA_OVERRIDE_STORAGE=10240
+#   bash configs/contextbench_pilot_2config.sh
+
+set -euo pipefail
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+REPO_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
+
+source "$SCRIPT_DIR/_common.sh"
+load_credentials
+enforce_subscription_mode
+
+# Selection file produced by scaffold_contextbench_tasks.py
+export SELECTION_FILE="$REPO_ROOT/configs/contextbench_run_selection.json"
+export CATEGORY="staging"
+export MODEL="${MODEL:-anthropic/claude-haiku-4-5-20251001}"
+
+if [ ! -f "$SELECTION_FILE" ]; then
+    echo "ERROR: Selection file not found: $SELECTION_FILE"
+    echo "Run: python3 scripts/scaffold_contextbench_tasks.py first"
+    exit 1
+fi
+
+TASK_COUNT=$(python3 -c "import json; print(len(json.load(open('$SELECTION_FILE'))))")
+echo "=== ContextBench Cross-Validation Pilot ==="
+echo "Tasks: $TASK_COUNT"
+echo "Configs: baseline-local-direct + mcp-remote-direct"
+echo "Category: $CATEGORY"
+echo "Model: $MODEL"
+echo "Env: ${HARBOR_ENV:-local}"
+echo ""
+
+"$SCRIPT_DIR/run_selected_tasks.sh" \
+    --selection-file "$SELECTION_FILE" \
+    --benchmark ccb_contextbench \
+    --full-config mcp-remote-direct \
+    --category "$CATEGORY"
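The launcher counts tasks by interpolating `$SELECTION_FILE` into an inline Python one-liner, which would break if the path ever contained a single quote. The sketch below shows one way the same count could be taken with the path passed as a plain argument. It is an illustration only: `count_tasks` is a hypothetical helper, and the selection file's schema is not shown in this commit, so the function simply mirrors the script's `len()` on the top-level JSON value.

```python
import json
from pathlib import Path


def count_tasks(selection_file: str) -> int:
    """Return the number of top-level entries in the selection file."""
    path = Path(selection_file)
    if not path.is_file():
        # Same failure the launcher reports before invoking the runner.
        raise FileNotFoundError(f"Selection file not found: {path}")
    with path.open(encoding="utf-8") as fh:
        data = json.load(fh)
    if not isinstance(data, (list, dict)):
        # len() would be meaningless (or fail) on a scalar document.
        raise ValueError(f"Expected a JSON list or object in {path}")
    return len(data)
```

Because the path is never spliced into source code, quoting inside the file name cannot change what the interpreter runs.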

docs/ops/SCRIPT_INDEX.md

Lines changed: 4 additions & 0 deletions
@@ -167,9 +167,11 @@ Generated from `scripts/registry.json` by `scripts/generate_script_index.py`.
 - `scripts/backfill_size_metadata.py` [one_off] - Historical one-off script: backfill size metadata.
 - `scripts/backfill_triage_from_manifest.py` [one_off] - Historical one-off script: backfill triage from manifest.
 - `scripts/check_harness_readiness.py` - Utility script for check harness readiness.
+- `scripts/compare_contextbench_results.py` - Utility script for compare contextbench results.
 - `scripts/compute_bootstrap_cis.py` - Utility script for compute bootstrap cis.
 - `scripts/context_retrieval_agent.py` - Utility script for context retrieval agent.
 - `scripts/control_plane.py` - Utility script for control plane.
+- `scripts/convert_harbor_to_contextbench.py` - Utility script for convert harbor to contextbench.
 - `scripts/cross_validate_oracles.py` - Utility script for cross validate oracles.
 - `scripts/daytona_poc_runner.py` - Utility script for daytona poc runner.
 - `scripts/daytona_runner.py` - Utility script for daytona runner.
@@ -208,10 +210,12 @@ Generated from `scripts/registry.json` by `scripts/generate_script_index.py`.
 - `scripts/rerun_zero_mcp_tasks.sh` [one_off] - Historical one-off script: rerun zero mcp tasks.
 - `scripts/rescore_difficulty.py` - Utility script for rescore difficulty.
 - `scripts/run_judge.py` - Utility script for run judge.
+- `scripts/scaffold_contextbench_tasks.py` - Utility script for scaffold contextbench tasks.
 - `scripts/scaffold_feature_tasks.py` - Utility script for scaffold feature tasks.
 - `scripts/scaffold_refactor_tasks.py` - Utility script for scaffold refactor tasks.
 - `scripts/scan_swebench_errors.py` - Utility script for scan swebench errors.
 - `scripts/sdlc_anomaly_scan.py` - Utility script for sdlc anomaly scan.
+- `scripts/select_contextbench_pilot.py` - Utility script for select contextbench pilot.
 - `scripts/smoke_artifact_verifier.py` - Utility script for smoke artifact verifier.
 - `scripts/verify_retrieval_eval_smoke.py` - Utility script for verify retrieval eval smoke.
 