5 changes: 5 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -1,7 +1,10 @@
# env
.venv/
**/.venv/
venv/
**/venv/
env/
**/env/
.env
.env.local
uv.lock
@@ -42,3 +45,5 @@ scratch/
.vscode/
*.swp
.DS_Store
environments/mcp_fetch/.env
environments/mcp_fetch/tests/test_gpt5_tool_call.py
130 changes: 130 additions & 0 deletions environments/mcp_fetch/README.md
@@ -0,0 +1,130 @@
# `mcp-fetch`

Deterministic MCP environment that exposes a single `fetch` tool wired through the
shared `verifiers.envs.mcp_env.MCPEnv` wrapper. The tool talks to a local stdio MCP
server (`tools/fetch_mcp_server.py`) which in turn can only reach the fixtures
hosted by `utils/mini_httpd.py` unless explicitly configured for online hosts.

Key components:

- **Offline fixtures** – `utils/mini_httpd.py` serves HTML/JSON/text, redirects,
auth checks, query endpoints, and gzipped responses at `http://127.0.0.1:31415`.
- **MCP server** – `tools/fetch_mcp_server.py` doubles as a CLI (`--url ...`) and
an MCP stdio process (`--run-server`) returning structured JSON and plaintext.
- **Tasks** – `tasks/qa.jsonl` now includes 84 prompts covering direct lookups,
multi-step pointer puzzles, ledger math, and short-form judge summaries. The
latest gauntlet (IDs `fetch_065`–`fetch_084`) leans hard on planner/workflow
chains and poem character counts to separate small vs. strong models. Each
row includes metadata and a verifier definition used by scripts/tests.
- **Judge rubrics** – `tasks/judge_rubrics.yaml` defines four LLM-graded
summaries (poem, fruits, ledger, manifest).

The expanded fixture set introduces chained lookups (pointer → directive → HTML),
numeric reasoning over the ledger JSON, and rubric-graded summarization to ensure
frontier models have to plan multiple tool calls instead of memorising a single
endpoint.
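The pointer-chain logic above can be sketched in a few lines. This is an illustration only, with the two relevant fixture bodies inlined; in the live environment the model retrieves each of them via the `fetch` tool from `http://127.0.0.1:31415`:

```python
import json
import re

# Inlined excerpts of fixtures/json/pointers.json and fixtures/html/manifest.html,
# copied here so the sketch runs without the fixture server.
POINTERS_JSON = '{"extras": {"manifest": "/html/manifest.html"}}'
MANIFEST_HTML = '<code data-role="final">orbit-cascade</code>'

def resolve_chain(pointers_body: str, manifest_body: str) -> str:
    """Follow pointer -> manifest and report the embedded code in uppercase."""
    path = json.loads(pointers_body)["extras"]["manifest"]
    assert path.endswith(".html")  # the pointer names the page to fetch next
    match = re.search(r'<code data-role="final">([^<]+)</code>', manifest_body)
    return match.group(1).upper()

print(resolve_chain(POINTERS_JSON, MANIFEST_HTML))  # ORBIT-CASCADE
```

A model that memorises one endpoint cannot shortcut this: the second URL only exists inside the first response.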

## Installation

```bash
cd environments/mcp_fetch
uv pip install -e .
```

This registers two console scripts:

- `mcp-server-fetch` – stdio MCP server used by the environment.
- `mcp-fetch-mini-httpd` – helper for manually serving fixtures.

## Running the environment

```bash
uv run vf-eval mcp-fetch -n 5 -m gpt-4.1-mini
```

When invoking the Runner directly (outside `uv run`), make sure `PYTHONPATH`
includes both the repo root and `environments/` so `environments.mcp_fetch`
resolves correctly:

```bash
PYTHONPATH=.:environments vf-eval mcp-fetch -n 10 -m gpt-4.1-mini
```

Arguments exposed via `load_environment(...)`:

| Arg | Type | Default | Description |
| --- | ---- | ------- | ----------- |
| `allow_online` | bool | `False` | Allow a curated list of public hosts in addition to localhost fixtures. |
| `allow_any_host` | bool | `False` | Disable allowlist enforcement entirely (use with caution). |
| `allowed_hosts` | Iterable[str] | `None` | Custom host allowlist (`host` or `host:port`). Overrides both flags above. |
| `server_cmd` | str \| Sequence[str] | `None` | Override the `mcp-server-fetch` launch command. |
| `fixture_port` | int | `31415` | Port for the deterministic fixture host. |
| `auto_start_fixtures` | bool | `True` | Automatically launch the mini HTTP daemon (skipped when online mode is enabled). |
| `task_path` | str \| Path | internal | Path to the QA JSONL file (swap for ablations). |
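The `allowed_hosts` semantics (bare `host` matches any port, `host:port` matches exactly) can be sketched with a hypothetical helper; the real enforcement lives inside the environment and may differ in detail:

```python
from typing import Iterable

def host_allowed(netloc: str, allowed_hosts: Iterable[str]) -> bool:
    """Return True if `netloc` ("host" or "host:port") matches the allowlist.

    Hypothetical helper illustrating the `allowed_hosts` argument; not the
    environment's actual implementation.
    """
    host = netloc.split(":")[0]
    for entry in allowed_hosts:
        if entry == netloc:  # exact host:port match
            return True
        if ":" not in entry and entry == host:  # bare host matches any port
            return True
    return False

host_allowed("127.0.0.1:31415", ["127.0.0.1"])        # True: bare-host entry
host_allowed("example.com:443", ["127.0.0.1:31415"])  # False: not on the list
```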

## Why non-trivial?

The latest calibration run keeps smaller models honest while leaving headroom
for GPT-5–class evaluators. Scores below are correct answers over tasks
attempted in each model's calibration run:

| Model | Correct / Total | Accuracy |
| --- | --- | --- |
| `gpt-4.1-mini` | 46 / 84 | **54.8%** |
| `gpt-4.1` | 36 / 46 | **78.3%** |
| `gpt-5` | 68 / 84 | **81.0%** |

Mini models routinely stall on the planner/workflow gauntlet and poem
character-count checks, while GPT-4.1 clears most but not all retries and
rubric-graded prompts. GPT-5 still needs deliberate tool planning to stay above
80%, which is exactly the intended difficulty band for the MCP Agents bounty.

## Testing

Offline verification mirrors the PI requirements:

```bash
uv run pytest tests/environments/test_mcp_fetch.py -q
uv run ruff check environments/mcp_fetch
```

The pytest suite spins up the fixtures, drives the MCP server via the same helper
functions the environment uses, and asserts each canonical verifier passes. This
ensures regressions in the HTTP fixtures, hashing, or truncation metadata are
caught during CI.
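The shape of that round-trip (start a fixture server, fetch a page, assert on its content) can be sketched with the stdlib alone. This is a stand-in for illustration, not the project's actual `mini_httpd` or test code:

```python
import http.server
import threading
import urllib.request

class FixtureHandler(http.server.BaseHTTPRequestHandler):
    """Serve one deterministic HTML page, like a single mini_httpd fixture."""

    def do_GET(self):
        body = b"<h1>Mini Site</h1>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep test output quiet
        pass

# Port 0 lets the OS pick a free port, so tests never collide.
server = http.server.HTTPServer(("127.0.0.1", 0), FixtureHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_address[1]}/html/index.html"
page = urllib.request.urlopen(url).read().decode("utf-8")
server.shutdown()
assert "Mini Site" in page
```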

## API keys & GPT-5 diagnostic

The calibration script and the GPT-5 sanity-check test automatically load environment variables
from `environments/mcp_fetch/.env` (and the repo-level `.env`, if present) via
`python-dotenv`. Add your OpenAI key to the existing gitignored file:

```bash
echo "OPENAI_API_KEY=sk-..." >> environments/mcp_fetch/.env
```
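What the loader does can be approximated with a minimal stdlib parser. This is a sketch of the behaviour only; the project actually uses `python-dotenv`, which additionally handles quoting and interpolation:

```python
import os

def load_env_file(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values

# "sk-test" is a placeholder, not a real key.
env = load_env_file("# local secrets\nOPENAI_API_KEY=sk-test\n")
os.environ.setdefault("OPENAI_API_KEY", env["OPENAI_API_KEY"])
```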

Because the file never leaves your machine, secrets stay local while still being
picked up automatically by both scripts. Once populated you can run:

```bash
PYTHONPATH=.:environments .venv/bin/pytest environments/mcp_fetch/tests/test_gpt5_tool_call.py -s
PYTHONPATH=.:environments .venv/bin/python environments/mcp_fetch/scripts/calibrate_questions.py --model gpt-5 --max-turns 6
```

The pytest harness mirrors OpenAI’s Responses tool loop and fails fast if GPT-5
doesn’t return the expected “Mini Site” H1, which makes debugging empty-answer
cases easier.

## Calibration cadence

Run a single pass per reference model so token usage stays bounded while the new
gauntlet is still exercised end to end:

```bash
PYTHONPATH=.:environments .venv/bin/python environments/mcp_fetch/scripts/calibrate_questions.py --model gpt-4.1-mini --max-turns 6 --include-judge
PYTHONPATH=.:environments .venv/bin/python environments/mcp_fetch/scripts/calibrate_questions.py --model gpt-5 --max-turns 6 --include-judge
```

Review `environments/mcp_fetch/reports/question_quality_<model>.json` after each
run; planner/workflow and poem-char categories should show the widest gap. If
mini creeps above ~50% again, extend the gauntlet with additional planner or
character-count prompts before re-running these commands.
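The per-category gap check can be sketched as a small aggregation. The row shape below is a hypothetical stand-in for entries in the quality report; the real schema may differ:

```python
from collections import defaultdict

# Hypothetical rows mimicking per-question entries in the quality report.
results = [
    {"category": "planner", "correct": False},
    {"category": "planner", "correct": False},
    {"category": "poem-char", "correct": True},
    {"category": "poem-char", "correct": False},
]

def accuracy_by_category(rows):
    """Map each category to its fraction of correct answers."""
    totals = defaultdict(lambda: [0, 0])  # category -> [correct, total]
    for row in rows:
        bucket = totals[row["category"]]
        bucket[0] += int(row["correct"])
        bucket[1] += 1
    return {cat: correct / total for cat, (correct, total) in totals.items()}

accuracy_by_category(results)  # {"planner": 0.0, "poem-char": 0.5}
```

Comparing these per-category numbers between the mini and GPT-5 reports is what reveals whether the gauntlet still separates the models.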
3 changes: 3 additions & 0 deletions environments/mcp_fetch/__init__.py
@@ -0,0 +1,3 @@
from .mcp_env import FetchEnv, load_environment

__all__ = ["FetchEnv", "load_environment"]
8 changes: 8 additions & 0 deletions environments/mcp_fetch/fixtures/html/about.html
@@ -0,0 +1,8 @@
<!doctype html>
<html>
<head><meta charset="utf-8"><title>About</title></head>
<body>
<h1>About This Mini Site</h1>
<p>It is served by a tiny Python HTTP server for deterministic testing.</p>
</body>
</html>
8 changes: 8 additions & 0 deletions environments/mcp_fetch/fixtures/html/final.html
@@ -0,0 +1,8 @@
<!doctype html>
<html>
<head><meta charset="utf-8"><title>Final</title></head>
<body>
<h1>Redirect Target</h1>
<p>You have reached the final page after a redirect.</p>
</body>
</html>
9 changes: 9 additions & 0 deletions environments/mcp_fetch/fixtures/html/index.html
@@ -0,0 +1,9 @@
<!doctype html>
<html>
<head><meta charset="utf-8"><title>Fixtures Home</title></head>
<body>
<h1>Mini Site — Index</h1>
<p>Welcome to the deterministic mini site for MCP Fetch tasks.</p>
<a href="/html/about.html">About</a>
</body>
</html>
7 changes: 7 additions & 0 deletions environments/mcp_fetch/fixtures/html/latin1.html
@@ -0,0 +1,7 @@
<!doctype html>
<html>
<head><meta charset="ISO-8859-1"><title>Latin1</title></head>
<body>
<h1>Café au lait</h1>
</body>
</html>
14 changes: 14 additions & 0 deletions environments/mcp_fetch/fixtures/html/manifest.html
@@ -0,0 +1,14 @@
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8" />
<title>Manifest Beacon</title>
</head>
<body>
<section id="cargo">
<p>Stage: twilight</p>
<code data-role="final">orbit-cascade</code>
<p>Notes: keep the code uppercase when reporting.</p>
</section>
</body>
</html>
11 changes: 11 additions & 0 deletions environments/mcp_fetch/fixtures/json/checksum_hints.json
@@ -0,0 +1,11 @@
{
"segments": [
{"source": "north", "value": 14},
{"source": "west", "value": 6},
{"source": "north", "value": 9},
{"source": "south", "value": 4},
{"source": "east", "value": 7},
{"source": "north", "value": 5}
],
"sequence": [8, 1, 3, 2, 1]
}
18 changes: 18 additions & 0 deletions environments/mcp_fetch/fixtures/json/data.json
@@ -0,0 +1,18 @@
{
"title": "Deterministic dataset",
"count": 3,
"items": [
{
"id": 1,
"name": "alpha"
},
{
"id": 2,
"name": "beta"
},
{
"id": 3,
"name": "gamma"
}
]
}
20 changes: 20 additions & 0 deletions environments/mcp_fetch/fixtures/json/data_large.jsonl
@@ -0,0 +1,20 @@
{"i": 0, "v": 0}
{"i": 1, "v": 2}
{"i": 2, "v": 4}
{"i": 3, "v": 6}
{"i": 4, "v": 8}
{"i": 5, "v": 10}
{"i": 6, "v": 12}
{"i": 7, "v": 14}
{"i": 8, "v": 16}
{"i": 9, "v": 18}
{"i": 10, "v": 20}
{"i": 11, "v": 22}
{"i": 12, "v": 24}
{"i": 13, "v": 26}
{"i": 14, "v": 28}
{"i": 15, "v": 30}
{"i": 16, "v": 32}
{"i": 17, "v": 34}
{"i": 18, "v": 36}
{"i": 19, "v": 38}
12 changes: 12 additions & 0 deletions environments/mcp_fetch/fixtures/json/ledger.json
@@ -0,0 +1,12 @@
{
"entries": [
{"id": "A1", "type": "expense", "amount": 12, "tags": ["ops", "night"]},
{"id": "A2", "type": "expense", "amount": 9, "tags": ["ops"]},
{"id": "A3", "type": "refund", "amount": 4, "tags": ["ops", "delta"]},
{"id": "A4", "type": "expense", "amount": 15, "tags": ["delta"]},
{"id": "A5", "type": "income", "amount": 40, "tags": ["ops", "spot"]},
{"id": "A6", "type": "expense", "amount": 7, "tags": ["night"]},
{"id": "A7", "type": "expense", "amount": 3, "tags": ["ops", "delta"]},
{"id": "A8", "type": "income", "amount": 18, "tags": ["delta"]}
]
}
12 changes: 12 additions & 0 deletions environments/mcp_fetch/fixtures/json/pointers.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,12 @@
{
"paths": {
"ember": "/text/ember.txt",
"relay": "/text/relay.txt",
"sequence": "/text/sequence.txt",
"directive": "/text/directive.txt"
},
"extras": {
"manifest": "/html/manifest.html",
"checksum": "/json/checksum_hints.json"
}
}
4 changes: 4 additions & 0 deletions environments/mcp_fetch/fixtures/text/directive.txt
@@ -0,0 +1,4 @@
Directive 7b:
1. Fetch http://127.0.0.1:31415/json/pointers.json and look for extras.manifest.
2. Use that value to request http://127.0.0.1:31415/html/manifest.html and read <code data-role="final">.
3. Respond only with that code in uppercase letters and remind the reader it must stay uppercase.
4 changes: 4 additions & 0 deletions environments/mcp_fetch/fixtures/text/ember.txt
@@ -0,0 +1,4 @@
Ember log
Line 1: Sparks coil beneath the relay mesh.
Line 2: Guidance phrase = stay patient.
Line 3: Final token: glow-lantern.
4 changes: 4 additions & 0 deletions environments/mcp_fetch/fixtures/text/poem.txt
@@ -0,0 +1,4 @@
In the middle of the mini site,
Deterministic servers hum at night.
Signals fetch and journeys flow,
Answers bloom in ordered glow.
4 changes: 4 additions & 0 deletions environments/mcp_fetch/fixtures/text/relay.txt
@@ -0,0 +1,4 @@
Relay diagnostics
Paths: /json/ledger.json, /json/pointers.json
Password: BRIDGE-NODE-7
Checksum taps: 4
3 changes: 3 additions & 0 deletions environments/mcp_fetch/fixtures/text/sequence.txt
@@ -0,0 +1,3 @@
Sequence brief. Calibration sentences follow.
Second sentence states: Signals stack in ordered pairs again.
Third piece watches quietly.