5 changes: 5 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -1,7 +1,10 @@
# env
.venv/
**/.venv/
venv/
**/venv/
env/
**/env/
.env
.env.local
uv.lock
@@ -42,3 +45,5 @@ scratch/
.vscode/
*.swp
.DS_Store
environments/mcp_fetch/.env
environments/mcp_fetch/tests/test_gpt5_tool_call.py
130 changes: 130 additions & 0 deletions environments/mcp_fetch/README.md
@@ -0,0 +1,130 @@
# `mcp-fetch`

Deterministic MCP environment that exposes a single `fetch` tool wired through the
shared `verifiers.envs.mcp_env.MCPEnv` wrapper. The tool talks to a local stdio MCP
server (`tools/fetch_mcp_server.py`) which in turn can only reach the fixtures
hosted by `utils/mini_httpd.py` unless explicitly configured for online hosts.

Key components:

- **Offline fixtures** – `utils/mini_httpd.py` serves HTML/JSON/text, redirects,
auth checks, query endpoints, and gzipped responses at `http://127.0.0.1:31415`.
- **MCP server** – `tools/fetch_mcp_server.py` doubles as a CLI (`--url ...`) and
an MCP stdio process (`--run-server`) returning structured JSON and plaintext.
- **Tasks** – `tasks/qa.jsonl` now includes 84 prompts covering direct lookups,
multi-step pointer puzzles, ledger math, and short-form judge summaries. The
latest gauntlet (IDs `fetch_065`–`fetch_084`) leans hard on planner/workflow
chains and poem character counts to separate small vs. strong models. Each
row includes metadata and a verifier definition used by scripts/tests.
- **Judge rubrics** – `tasks/judge_rubrics.yaml` defines four LLM-graded
summaries (poem, fruits, ledger, manifest).

The expanded fixture set introduces chained lookups (pointer → directive → HTML),
numeric reasoning over the ledger JSON, and rubric-graded summarization to ensure
frontier models have to plan multiple tool calls instead of memorising a single
endpoint.
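The pointer-chain logic above can be sketched in a few lines. This is an illustration only, with the two relevant fixture bodies inlined; in the live environment the model retrieves each of them via the `fetch` tool from `http://127.0.0.1:31415`:

```python
import json
import re

# Inlined excerpts of fixtures/json/pointers.json and fixtures/html/manifest.html,
# copied here so the sketch runs without the fixture server.
POINTERS_JSON = '{"extras": {"manifest": "/html/manifest.html"}}'
MANIFEST_HTML = '<code data-role="final">orbit-cascade</code>'

def resolve_chain(pointers_body: str, manifest_body: str) -> str:
    """Follow pointer -> manifest and report the embedded code in uppercase."""
    path = json.loads(pointers_body)["extras"]["manifest"]
    assert path.endswith(".html")  # the pointer names the page to fetch next
    match = re.search(r'<code data-role="final">([^<]+)</code>', manifest_body)
    return match.group(1).upper()

print(resolve_chain(POINTERS_JSON, MANIFEST_HTML))  # ORBIT-CASCADE
```

A model that memorises one endpoint cannot shortcut this: the second URL only exists inside the first response.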

## Installation

```bash
cd environments/mcp_fetch
uv pip install -e .
```

This registers two console scripts:

- `mcp-server-fetch` – stdio MCP server used by the environment.
- `mcp-fetch-mini-httpd` – helper for manually serving fixtures.

## Running the environment

```bash
uv run vf-eval mcp-fetch -n 5 -m gpt-4.1-mini
```

When invoking the Runner directly (outside `uv run`), make sure `PYTHONPATH`
includes both the repo root and `environments/` so `environments.mcp_fetch`
resolves correctly:

```bash
PYTHONPATH=.:environments vf-eval mcp-fetch -n 10 -m gpt-4.1-mini
```

Arguments exposed via `load_environment(...)`:

| Arg | Type | Default | Description |
| --- | ---- | ------- | ----------- |
| `allow_online` | bool | `False` | Allow a curated list of public hosts in addition to localhost fixtures. |
| `allow_any_host` | bool | `False` | Disable allowlist enforcement entirely (use with caution). |
| `allowed_hosts` | Iterable[str] | `None` | Custom host allowlist (`host` or `host:port`). Overrides both flags above. |
| `server_cmd` | str \| Sequence[str] | `None` | Override the `mcp-server-fetch` launch command. |
| `fixture_port` | int | `31415` | Port for the deterministic fixture host. |
| `auto_start_fixtures` | bool | `True` | Automatically launch the mini HTTP daemon (skipped when online mode is enabled). |
| `task_path` | str \| Path | internal | Path to the QA JSONL file (swap for ablations). |
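The `allowed_hosts` semantics (bare `host` matches any port, `host:port` matches exactly) can be sketched with a hypothetical helper; the real enforcement lives inside the environment and may differ in detail:

```python
from typing import Iterable

def host_allowed(netloc: str, allowed_hosts: Iterable[str]) -> bool:
    """Return True if `netloc` ("host" or "host:port") matches the allowlist.

    Hypothetical helper illustrating the `allowed_hosts` argument; not the
    environment's actual implementation.
    """
    host = netloc.split(":")[0]
    for entry in allowed_hosts:
        if entry == netloc:  # exact host:port match
            return True
        if ":" not in entry and entry == host:  # bare host matches any port
            return True
    return False

host_allowed("127.0.0.1:31415", ["127.0.0.1"])        # True: bare-host entry
host_allowed("example.com:443", ["127.0.0.1:31415"])  # False: not on the list
```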

## Why non-trivial?

The latest calibration run keeps smaller models honest while leaving headroom
for GPT-5–class evaluators. Scores below are correct answers over tasks
attempted in each model's calibration run:

| Model | Correct / Total | Accuracy |
| --- | --- | --- |
| `gpt-4.1-mini` | 46 / 84 | **54.8%** |
| `gpt-4.1` | 36 / 46 | **78.3%** |
| `gpt-5` | 68 / 84 | **81.0%** |

Mini models routinely stall on the planner/workflow gauntlet and poem
character-count checks, while GPT-4.1 clears most but not all retries and
rubric-graded prompts. GPT-5 still needs deliberate tool planning to stay above
80%, which is exactly the intended difficulty band for the MCP Agents bounty.

## Testing

Offline verification mirrors the PI requirements:

```bash
uv run pytest tests/environments/test_mcp_fetch.py -q
uv run ruff check environments/mcp_fetch
```

The pytest suite spins up the fixtures, drives the MCP server via the same helper
functions the environment uses, and asserts each canonical verifier passes. This
ensures regressions in the HTTP fixtures, hashing, or truncation metadata are
caught during CI.
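The shape of that round-trip (start a fixture server, fetch a page, assert on its content) can be sketched with the stdlib alone. This is a stand-in for illustration, not the project's actual `mini_httpd` or test code:

```python
import http.server
import threading
import urllib.request

class FixtureHandler(http.server.BaseHTTPRequestHandler):
    """Serve one deterministic HTML page, like a single mini_httpd fixture."""

    def do_GET(self):
        body = b"<h1>Mini Site</h1>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep test output quiet
        pass

# Port 0 lets the OS pick a free port, so tests never collide.
server = http.server.HTTPServer(("127.0.0.1", 0), FixtureHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_address[1]}/html/index.html"
page = urllib.request.urlopen(url).read().decode("utf-8")
server.shutdown()
assert "Mini Site" in page
```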

## API keys & GPT-5 diagnostic

The calibration script and the GPT-5 sanity-check test automatically load environment variables
from `environments/mcp_fetch/.env` (and the repo-level `.env`, if present) via
`python-dotenv`. Add your OpenAI key to the existing gitignored file:

```bash
echo "OPENAI_API_KEY=sk-..." >> environments/mcp_fetch/.env
```
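What the loader does can be approximated with a minimal stdlib parser. This is a sketch of the behaviour only; the project actually uses `python-dotenv`, which additionally handles quoting and interpolation:

```python
import os

def load_env_file(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values

# "sk-test" is a placeholder, not a real key.
env = load_env_file("# local secrets\nOPENAI_API_KEY=sk-test\n")
os.environ.setdefault("OPENAI_API_KEY", env["OPENAI_API_KEY"])
```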

Because the file never leaves your machine, secrets stay local while still being
picked up automatically by both scripts. Once populated you can run:

```bash
PYTHONPATH=.:environments .venv/bin/pytest environments/mcp_fetch/tests/test_gpt5_tool_call.py -s
PYTHONPATH=.:environments .venv/bin/python environments/mcp_fetch/scripts/calibrate_questions.py --model gpt-5 --max-turns 6
```

The pytest harness mirrors OpenAI’s Responses tool loop and fails fast if GPT-5
doesn’t return the expected “Mini Site” H1, which makes debugging empty-answer
cases easier.

## Calibration cadence

Run a single pass per reference model so token usage stays bounded while the new
gauntlet is still exercised end to end:

```bash
PYTHONPATH=.:environments .venv/bin/python environments/mcp_fetch/scripts/calibrate_questions.py --model gpt-4.1-mini --max-turns 6 --include-judge
PYTHONPATH=.:environments .venv/bin/python environments/mcp_fetch/scripts/calibrate_questions.py --model gpt-5 --max-turns 6 --include-judge
```

Review `environments/mcp_fetch/reports/question_quality_<model>.json` after each
run; planner/workflow and poem-char categories should show the widest gap. If
mini creeps above ~50% again, extend the gauntlet with additional planner or
character-count prompts before re-running these commands.
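The per-category gap check can be sketched as a small aggregation. The row shape below is a hypothetical stand-in for entries in the quality report; the real schema may differ:

```python
from collections import defaultdict

# Hypothetical rows mimicking per-question entries in the quality report.
results = [
    {"category": "planner", "correct": False},
    {"category": "planner", "correct": False},
    {"category": "poem-char", "correct": True},
    {"category": "poem-char", "correct": False},
]

def accuracy_by_category(rows):
    """Map each category to its fraction of correct answers."""
    totals = defaultdict(lambda: [0, 0])  # category -> [correct, total]
    for row in rows:
        bucket = totals[row["category"]]
        bucket[0] += int(row["correct"])
        bucket[1] += 1
    return {cat: correct / total for cat, (correct, total) in totals.items()}

accuracy_by_category(results)  # {"planner": 0.0, "poem-char": 0.5}
```

Comparing these per-category numbers between the mini and GPT-5 reports is what reveals whether the gauntlet still separates the models.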
3 changes: 3 additions & 0 deletions environments/mcp_fetch/__init__.py
@@ -0,0 +1,3 @@
from .mcp_env import FetchEnv, load_environment

__all__ = ["FetchEnv", "load_environment"]
8 changes: 8 additions & 0 deletions environments/mcp_fetch/fixtures/html/about.html
@@ -0,0 +1,8 @@
<!doctype html>
<html>
<head><meta charset="utf-8"><title>About</title></head>
<body>
<h1>About This Mini Site</h1>
<p>It is served by a tiny Python HTTP server for deterministic testing.</p>
</body>
</html>
8 changes: 8 additions & 0 deletions environments/mcp_fetch/fixtures/html/final.html
@@ -0,0 +1,8 @@
<!doctype html>
<html>
<head><meta charset="utf-8"><title>Final</title></head>
<body>
<h1>Redirect Target</h1>
<p>You have reached the final page after a redirect.</p>
</body>
</html>
9 changes: 9 additions & 0 deletions environments/mcp_fetch/fixtures/html/index.html
@@ -0,0 +1,9 @@
<!doctype html>
<html>
<head><meta charset="utf-8"><title>Fixtures Home</title></head>
<body>
<h1>Mini Site — Index</h1>
<p>Welcome to the deterministic mini site for MCP Fetch tasks.</p>
<a href="/html/about.html">About</a>
</body>
</html>
7 changes: 7 additions & 0 deletions environments/mcp_fetch/fixtures/html/latin1.html
@@ -0,0 +1,7 @@
<!doctype html>
<html>
<head><meta charset="ISO-8859-1"><title>Latin1</title></head>
<body>
<h1>Café au lait</h1>
</body>
</html>
14 changes: 14 additions & 0 deletions environments/mcp_fetch/fixtures/html/manifest.html
@@ -0,0 +1,14 @@
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8" />
<title>Manifest Beacon</title>
</head>
<body>
<section id="cargo">
<p>Stage: twilight</p>
<code data-role="final">orbit-cascade</code>
<p>Notes: keep the code uppercase when reporting.</p>
</section>
</body>
</html>
11 changes: 11 additions & 0 deletions environments/mcp_fetch/fixtures/json/checksum_hints.json
@@ -0,0 +1,11 @@
{
"segments": [
{"source": "north", "value": 14},
{"source": "west", "value": 6},
{"source": "north", "value": 9},
{"source": "south", "value": 4},
{"source": "east", "value": 7},
{"source": "north", "value": 5}
],
"sequence": [8, 1, 3, 2, 1]
}
18 changes: 18 additions & 0 deletions environments/mcp_fetch/fixtures/json/data.json
@@ -0,0 +1,18 @@
{
"title": "Deterministic dataset",
"count": 3,
"items": [
{
"id": 1,
"name": "alpha"
},
{
"id": 2,
"name": "beta"
},
{
"id": 3,
"name": "gamma"
}
]
}
20 changes: 20 additions & 0 deletions environments/mcp_fetch/fixtures/json/data_large.jsonl
@@ -0,0 +1,20 @@
{"i": 0, "v": 0}
{"i": 1, "v": 2}
{"i": 2, "v": 4}
{"i": 3, "v": 6}
{"i": 4, "v": 8}
{"i": 5, "v": 10}
{"i": 6, "v": 12}
{"i": 7, "v": 14}
{"i": 8, "v": 16}
{"i": 9, "v": 18}
{"i": 10, "v": 20}
{"i": 11, "v": 22}
{"i": 12, "v": 24}
{"i": 13, "v": 26}
{"i": 14, "v": 28}
{"i": 15, "v": 30}
{"i": 16, "v": 32}
{"i": 17, "v": 34}
{"i": 18, "v": 36}
{"i": 19, "v": 38}
12 changes: 12 additions & 0 deletions environments/mcp_fetch/fixtures/json/ledger.json
@@ -0,0 +1,12 @@
{
"entries": [
{"id": "A1", "type": "expense", "amount": 12, "tags": ["ops", "night"]},
{"id": "A2", "type": "expense", "amount": 9, "tags": ["ops"]},
{"id": "A3", "type": "refund", "amount": 4, "tags": ["ops", "delta"]},
{"id": "A4", "type": "expense", "amount": 15, "tags": ["delta"]},
{"id": "A5", "type": "income", "amount": 40, "tags": ["ops", "spot"]},
{"id": "A6", "type": "expense", "amount": 7, "tags": ["night"]},
{"id": "A7", "type": "expense", "amount": 3, "tags": ["ops", "delta"]},
{"id": "A8", "type": "income", "amount": 18, "tags": ["delta"]}
]
}
12 changes: 12 additions & 0 deletions environments/mcp_fetch/fixtures/json/pointers.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,12 @@
{
"paths": {
"ember": "/text/ember.txt",
"relay": "/text/relay.txt",
"sequence": "/text/sequence.txt",
"directive": "/text/directive.txt"
},
"extras": {
"manifest": "/html/manifest.html",
"checksum": "/json/checksum_hints.json"
}
}
4 changes: 4 additions & 0 deletions environments/mcp_fetch/fixtures/text/directive.txt
@@ -0,0 +1,4 @@
Directive 7b:
1. Fetch http://127.0.0.1:31415/json/pointers.json and look for extras.manifest.
2. Use that value to request http://127.0.0.1:31415/html/manifest.html and read <code data-role="final">.
3. Respond only with that code in uppercase letters and remind the reader it must stay uppercase.
4 changes: 4 additions & 0 deletions environments/mcp_fetch/fixtures/text/ember.txt
@@ -0,0 +1,4 @@
Ember log
Line 1: Sparks coil beneath the relay mesh.
Line 2: Guidance phrase = stay patient.
Line 3: Final token: glow-lantern.
4 changes: 4 additions & 0 deletions environments/mcp_fetch/fixtures/text/poem.txt
@@ -0,0 +1,4 @@
In the middle of the mini site,
Deterministic servers hum at night.
Signals fetch and journeys flow,
Answers bloom in ordered glow.
4 changes: 4 additions & 0 deletions environments/mcp_fetch/fixtures/text/relay.txt
@@ -0,0 +1,4 @@
Relay diagnostics
Paths: /json/ledger.json, /json/pointers.json
Password: BRIDGE-NODE-7
Checksum taps: 4
3 changes: 3 additions & 0 deletions environments/mcp_fetch/fixtures/text/sequence.txt
@@ -0,0 +1,3 @@
Sequence brief. Calibration sentences follow.
Second sentence states: Signals stack in ordered pairs again.
Third piece watches quietly.