From c6e57e426fca51abe418039143246d0d835fff07 Mon Sep 17 00:00:00 2001 From: hazelian0619 Date: Mon, 30 Mar 2026 23:10:02 +0800 Subject: [PATCH 1/6] fix(prompts): add failure recovery protocol and scientific writing gate MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Problem A (partial): Add MANDATORY scientific writing gate to default.md — Leader must delegate to Researcher before writing any domain paper. Clarify Scientific Illustrator scope (schematic/pathway diagrams only, not data plots). Problem C: Add Failure Recovery section to delegation.md — three-tier ladder for file write failures (Two-Phase Write Protocol → format downgrade → inline) and sub-agent failures (narrow retry → self-execute → partial output). Hard rule: never terminate without producing at least one artifact. Validated by experiment (2026-03-30): - Case 3 (SSR1/GWAS): Leader called 3x parallel Researcher before any content; Researchers produced 978 lines across 3 reports using Two-Phase Write Protocol - Case 0 (EC论文): Leader called 2x parallel Researcher; BibTeX built to 397 lines via append_file batches (vs. 
previous silent truncation at char 88); PDF artifact (117KB) delivered despite E2BIG and relay-API update_file errors New bugs discovered (tracked separately): - Relay API truncates update_file tool call args mid-generation (high severity) - think tool infinite loop at ~90K token context (medium severity) Co-Authored-By: Claude Sonnet 4.6 --- .../factory/templates/prompts/delegation.md | 21 +++++++++++++++++++ pantheon/factory/templates/teams/default.md | 10 ++++++--- 2 files changed, 28 insertions(+), 3 deletions(-) diff --git a/pantheon/factory/templates/prompts/delegation.md b/pantheon/factory/templates/prompts/delegation.md index fd6d61a7..49ffa043 100644 --- a/pantheon/factory/templates/prompts/delegation.md +++ b/pantheon/factory/templates/prompts/delegation.md @@ -91,3 +91,24 @@ call_agent( ```python call_agent("researcher", "Do analysis fast.") ``` + +### Failure Recovery + +Tool failures and sub-agent errors are expected — **never terminate without producing output.** + +When a tool call fails, apply the following recovery ladder in order: + +**File write failures** (e.g. content too large, output truncation): +1. **Use Two-Phase Write Protocol**: `write_file` (skeleton only) → `update_file` (one section at a time) → `append_file` (BibTeX / list batches). Never retry `write_file` with the same large content. +2. **Downgrade format**: If `.tex` fails after protocol, write `.md`; if `.md` fails, write `.txt` +3. **Inline output**: If all file writes fail, output the full content as a code block in the chat + +**Sub-agent failures** (researcher or illustrator returns error or empty result): +1. **Retry with narrower scope**: Re-delegate with a smaller, more focused Task Brief +2. **Self-execute fallback**: Handle the task directly if sub-agent repeatedly fails +3. 
**Partial output**: Deliver what was completed and clearly state what is missing + +**Hard rule — no silent failures:** +- Always produce at least one artifact per session, even if degraded +- When falling back to a simpler format, tell the user explicitly: what you tried, why it failed, what you're delivering instead +- A partial result delivered is always better than a perfect result abandoned diff --git a/pantheon/factory/templates/teams/default.md b/pantheon/factory/templates/teams/default.md index ad780fcf..a45e03fe 100644 --- a/pantheon/factory/templates/teams/default.md +++ b/pantheon/factory/templates/teams/default.md @@ -89,10 +89,12 @@ call_agent("researcher", "Search the web for best practices on X. Gather informa - Data analysis, EDA, statistical analysis - Literature review and multi-source research +**Scientific writing gate (MANDATORY):** Before writing any report, paper, or document that requires domain knowledge or citations, you MUST first delegate a research task to `researcher`. Writing without a prior research delegation is not allowed for these task types. + #### Scientific Illustrator -**Delegate for:** Scientific diagrams, publication-quality visualizations, complex figures -**Execute directly:** Simple chart embedding, displaying existing charts +**Delegate for:** Schematic diagrams, pathway figures, cell structure illustrations, BioRender-style publication figures — tasks where the output is a conceptual diagram, not a data-driven chart. +**Execute directly (or via Researcher):** Data visualizations, statistical plots, matplotlib/seaborn charts derived from analysis results. ### Decision Summary @@ -101,9 +103,11 @@ call_agent("researcher", "Search the web for best practices on X. 
Gather informa | Explore/read/understand codebase | **MUST delegate** to researcher | | Web search or documentation lookup | **MUST delegate** to researcher | | Data analysis or research | **MUST delegate** to researcher | +| Scientific writing (report/paper) | **MUST delegate research first**, then write | | Multiple independent research tasks | **MUST parallelize** with multiple researchers | +| Schematic/pathway/cell diagrams | **Delegate** to scientific_illustrator | | Read 1 known file | Execute directly | -| Write/edit/create files | Execute directly | +| Write/edit/create files (post-research) | Execute directly | | Synthesize researcher results | Execute directly (your core role) | {{delegation}} From e3e073f478b4d2da38280ffc9c8d0c50d37b766c Mon Sep 17 00:00:00 2001 From: hazelian0619 Date: Tue, 31 Mar 2026 07:34:44 +0800 Subject: [PATCH 2/6] fix(file_manager): add output-token truncation guards and append_file tool MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit P0 bug: when LLM generates large files (LaTeX papers, BibTeX) in a single write_file/update_file call, the relay API truncates the output stream mid-JSON, causing 'Unterminated string' parse errors and silent data loss. Root cause: LLM output token limit is separate from context window. File content in tool call parameters must be generated as LLM output, hitting max_tokens before the JSON closes. LaTeX/BibTeX content with escape chars inflates token count ~1.5x. 
Changes: - write_file: hard reject content > 12,000 chars; docstring teaches Two-Phase Write Protocol (scaffold first, fill by section, append for lists/bib) - append_file: new tool for chunked appending; 6,000 char limit; requires file to exist first; primary use case is BibTeX batches (<=10 entries per call) - update_file: hard reject new_string > 8,000 chars with guidance to split section into smaller semantic units Validated against 20-case baseline (15% success rate before fix): - Case 1 (LaTeX review paper, previously FAIL): now generates full PDF with 44 references via append_file batches — confirmed in controlled re-run - Agent proactively adopted Two-Phase protocol after reading docstring (0 content_too_large rejections; protocol was followed before guard triggered) Co-Authored-By: Claude Sonnet 4.6 --- pantheon/toolsets/file/file_manager.py | 116 ++++++++++++++++++++++--- 1 file changed, 102 insertions(+), 14 deletions(-) diff --git a/pantheon/toolsets/file/file_manager.py b/pantheon/toolsets/file/file_manager.py index e420bedd..8075a92d 100644 --- a/pantheon/toolsets/file/file_manager.py +++ b/pantheon/toolsets/file/file_manager.py @@ -750,33 +750,50 @@ async def write_file( content: str = "", overwrite: bool = True, ) -> dict: - """Use this tool to CREATE NEW file. + """Use this tool to CREATE a NEW file with a skeleton or short content. - This tool writes content to a file, automatically creating parent - directories if they do not exist. + ⚠️ LARGE FILE PROTOCOL — MUST FOLLOW FOR PAPERS, REPORTS, LaTeX, BibTeX: + NEVER pass an entire document as `content` in one call. + Use the Two-Phase Write Protocol instead: - IMPORTANT: For EDITING existing file, use `update_file` instead. - DO NOT rewrite entire file when only small changes are needed, its is wasteful and error-prone. 
+ Phase 1 — Scaffold (this tool, once): + write_file(path, content=) - Use this tool when: - - Creating a brand new file - - Completely rewriting a file from scratch (rare) + Phase 2 — Fill (per semantic section): + update_file(path, old_string=
, new_string=) + → one call per semantic unit (Introduction, Methods, Results, etc.) - DO NOT use this tool when: - - Making partial modifications to an existing file - - Changing a few lines in a large file - - For these cases, use `update_file` instead + For lists / bibliographies (append_file, batched): + append_file(path, content=<10 BibTeX entries or 1 table block at a time>) + + This tool will REFUSE content longer than 12,000 characters. Writing large + content in one shot causes output-token truncation and silent data loss. Args: file_path: The path to the file to write. content: The content to write to the file. overwrite: When False, abort if the target file already exists. - Default is True, but consider using update_file for edits. Returns: dict: Success status or error message. """ - + _WRITE_FILE_MAX_CHARS = 12000 + if len(content) > _WRITE_FILE_MAX_CHARS: + return { + "success": False, + "reason": "content_too_large", + "error": ( + f"Content is {len(content):,} chars, exceeding the " + f"{_WRITE_FILE_MAX_CHARS:,}-char limit per write_file call. " + f"Use the Two-Phase Write Protocol:\n" + f" 1. write_file('{file_path}', content=)\n" + f" 2. update_file('{file_path}', old_string=, new_string=
) " + f"← one call per section\n" + f" 3. append_file('{file_path}', content=) " + f"← for BibTeX / lists (<=10 items per call)\n" + f"Do NOT retry write_file with the same large content." + ), + } target_path = self._resolve_path(file_path) if not overwrite and target_path.exists(): return { @@ -794,6 +811,65 @@ async def write_file( logger.error(f"write_file failed for {file_path}: {exc}") return {"success": False, "error": str(exc)} + @tool + async def append_file( + self, + file_path: str, + content: str, + ) -> dict: + """Append content to the end of an existing file without overwriting it. + + ## Primary use case: chunked writing for large documents + + When a single write_file or update_file call would be too large, split + the content and stream it in parts: + + write_file(path, skeleton) # 1. write header / skeleton + append_file(path, introduction) # 2. append Introduction section + append_file(path, methods) # 3. append Methods section + append_file(path, results) # 4. append Results + Discussion + append_file(path, bibliography) # 5. append Bibliography + + ## For BibTeX bibliographies: + Split into batches of <=10 @article / @inproceedings blocks per call. + + ## Limits: + Content must be <=6,000 characters per call. Split further if needed. + File must already exist (use write_file to create it first). + + Args: + file_path: Path to the file to append to (relative to workspace root). + content: Text to append. Include a leading newline if needed. + + Returns: + dict: {success: true, appended_chars: int} or {success: false, error: str} + """ + _APPEND_FILE_MAX_CHARS = 6000 + if len(content) > _APPEND_FILE_MAX_CHARS: + return { + "success": False, + "reason": "content_too_large", + "error": ( + f"Content is {len(content):,} chars, exceeding the " + f"{_APPEND_FILE_MAX_CHARS:,}-char limit per append_file call. " + f"Split into smaller batches (<=10 BibTeX entries or one section at a time)." 
+ ), + } + target_path = self._resolve_path(file_path) + if not target_path.exists(): + return { + "success": False, + "error": f"File '{file_path}' does not exist. Use write_file to create it first.", + "reason": "file_not_found", + } + try: + with open(target_path, "a", encoding="utf-8") as f: + f.write(content) + return {"success": True, "appended_chars": len(content)} + except Exception as exc: + logger.error(f"append_file failed for {file_path}: {exc}") + return {"success": False, "error": str(exc)} + @tool async def update_file( self, @@ -826,6 +902,18 @@ async def update_file( Returns: dict: {success: bool, replacements: int} or {success: False, error: str} """ + _UPDATE_FILE_MAX_CHARS = 8000 + if len(new_string) > _UPDATE_FILE_MAX_CHARS: + return { + "success": False, + "reason": "content_too_large", + "error": ( + f"new_string is {len(new_string):,} chars, exceeding the " + f"{_UPDATE_FILE_MAX_CHARS:,}-char limit per update_file call. " + f"Split this section into smaller semantic units and call " + f"update_file once per unit (e.g. one paragraph or subsection at a time)." + ), + } target_path = self._resolve_path(file_path) if not target_path.exists(): return {"success": False, "error": "File does not exist"} From c0b8e712947ad8ba86d4cea0e84a54e047727b76 Mon Sep 17 00:00:00 2001 From: Starlitnightly Date: Tue, 31 Mar 2026 14:40:51 -0700 Subject: [PATCH 3/6] fix(file_manager): add output-token truncation guards and append_file tool Cherry-picked from PR #52 (fix/file-manager-output-token-truncation). 
- write_file: hard-reject content > 12,000 chars with Two-Phase Write Protocol guidance - append_file: new tool for chunked appending with 6,000-char limit - update_file: hard-reject new_string > 8,000 chars - delegation.md: failure recovery ladder - default.md: scientific writing gate --- .../factory/templates/prompts/delegation.md | 21 ++++ pantheon/factory/templates/teams/default.md | 10 +- pantheon/toolsets/file/file_manager.py | 116 +++++++++++++++--- 3 files changed, 130 insertions(+), 17 deletions(-) diff --git a/pantheon/factory/templates/prompts/delegation.md b/pantheon/factory/templates/prompts/delegation.md index fd6d61a7..49ffa043 100644 --- a/pantheon/factory/templates/prompts/delegation.md +++ b/pantheon/factory/templates/prompts/delegation.md @@ -91,3 +91,24 @@ call_agent( ```python call_agent("researcher", "Do analysis fast.") ``` + +### Failure Recovery + +Tool failures and sub-agent errors are expected — **never terminate without producing output.** + +When a tool call fails, apply the following recovery ladder in order: + +**File write failures** (e.g. content too large, output truncation): +1. **Use Two-Phase Write Protocol**: `write_file` (skeleton only) → `update_file` (one section at a time) → `append_file` (BibTeX / list batches). Never retry `write_file` with the same large content. +2. **Downgrade format**: If `.tex` fails after protocol, write `.md`; if `.md` fails, write `.txt` +3. **Inline output**: If all file writes fail, output the full content as a code block in the chat + +**Sub-agent failures** (researcher or illustrator returns error or empty result): +1. **Retry with narrower scope**: Re-delegate with a smaller, more focused Task Brief +2. **Self-execute fallback**: Handle the task directly if sub-agent repeatedly fails +3. 
**Partial output**: Deliver what was completed and clearly state what is missing + +**Hard rule — no silent failures:** +- Always produce at least one artifact per session, even if degraded +- When falling back to a simpler format, tell the user explicitly: what you tried, why it failed, what you're delivering instead +- A partial result delivered is always better than a perfect result abandoned diff --git a/pantheon/factory/templates/teams/default.md b/pantheon/factory/templates/teams/default.md index 76aa2be6..403d0eae 100644 --- a/pantheon/factory/templates/teams/default.md +++ b/pantheon/factory/templates/teams/default.md @@ -88,10 +88,12 @@ call_agent("researcher", "Search the web for best practices on X. Gather informa - Data analysis, EDA, statistical analysis - Literature review and multi-source research +**Scientific writing gate (MANDATORY):** Before writing any report, paper, or document that requires domain knowledge or citations, you MUST first delegate a research task to `researcher`. Writing without a prior research delegation is not allowed for these task types. + #### Scientific Illustrator -**Delegate for:** Scientific diagrams, publication-quality visualizations, complex figures -**Execute directly:** Simple chart embedding, displaying existing charts +**Delegate for:** Schematic diagrams, pathway figures, cell structure illustrations, BioRender-style publication figures — tasks where the output is a conceptual diagram, not a data-driven chart. +**Execute directly (or via Researcher):** Data visualizations, statistical plots, matplotlib/seaborn charts derived from analysis results. ### Decision Summary @@ -100,9 +102,11 @@ call_agent("researcher", "Search the web for best practices on X. 
Gather informa | Explore/read/understand codebase | **MUST delegate** to researcher | | Web search or documentation lookup | **MUST delegate** to researcher | | Data analysis or research | **MUST delegate** to researcher | +| Scientific writing (report/paper) | **MUST delegate research first**, then write | | Multiple independent research tasks | **MUST parallelize** with multiple researchers | +| Schematic/pathway/cell diagrams | **Delegate** to scientific_illustrator | | Read 1 known file | Execute directly | -| Write/edit/create files | Execute directly | +| Write/edit/create files (post-research) | Execute directly | | Synthesize researcher results | Execute directly (your core role) | {{delegation}} diff --git a/pantheon/toolsets/file/file_manager.py b/pantheon/toolsets/file/file_manager.py index 24da1a3c..b91e6f87 100644 --- a/pantheon/toolsets/file/file_manager.py +++ b/pantheon/toolsets/file/file_manager.py @@ -817,33 +817,50 @@ async def write_file( content: str = "", overwrite: bool = True, ) -> dict: - """Use this tool to CREATE NEW file. + """Use this tool to CREATE a NEW file with a skeleton or short content. - This tool writes content to a file, automatically creating parent - directories if they do not exist. + ⚠️ LARGE FILE PROTOCOL — MUST FOLLOW FOR PAPERS, REPORTS, LaTeX, BibTeX: + NEVER pass an entire document as `content` in one call. + Use the Two-Phase Write Protocol instead: - IMPORTANT: For EDITING existing file, use `update_file` instead. - DO NOT rewrite entire file when only small changes are needed, its is wasteful and error-prone. + Phase 1 — Scaffold (this tool, once): + write_file(path, content=) - Use this tool when: - - Creating a brand new file - - Completely rewriting a file from scratch (rare) + Phase 2 — Fill (per semantic section): + update_file(path, old_string=
, new_string=) + → one call per semantic unit (Introduction, Methods, Results, etc.) - DO NOT use this tool when: - - Making partial modifications to an existing file - - Changing a few lines in a large file - - For these cases, use `update_file` instead + For lists / bibliographies (append_file, batched): + append_file(path, content=<10 BibTeX entries or 1 table block at a time>) + + This tool will REFUSE content longer than 12,000 characters. Writing large + content in one shot causes output-token truncation and silent data loss. Args: file_path: The path to the file to write. content: The content to write to the file. overwrite: When False, abort if the target file already exists. - Default is True, but consider using update_file for edits. Returns: dict: Success status or error message. """ - + _WRITE_FILE_MAX_CHARS = 12000 + if len(content) > _WRITE_FILE_MAX_CHARS: + return { + "success": False, + "reason": "content_too_large", + "error": ( + f"Content is {len(content):,} chars, exceeding the " + f"{_WRITE_FILE_MAX_CHARS:,}-char limit per write_file call. " + f"Use the Two-Phase Write Protocol:\n" + f" 1. write_file('{file_path}', content=)\n" + f" 2. update_file('{file_path}', old_string=, new_string=
) " + f"← one call per section\n" + f" 3. append_file('{file_path}', content=) " + f"← for BibTeX / lists (<=10 items per call)\n" + f"Do NOT retry write_file with the same large content." + ), + } target_path = self._resolve_path(file_path) if not overwrite and target_path.exists(): return { @@ -861,6 +878,65 @@ async def write_file( logger.error(f"write_file failed for {file_path}: {exc}") return {"success": False, "error": str(exc)} + @tool + async def append_file( + self, + file_path: str, + content: str, + ) -> dict: + """Append content to the end of an existing file without overwriting it. + + ## Primary use case: chunked writing for large documents + + When a single write_file or update_file call would be too large, split + the content and stream it in parts: + + write_file(path, skeleton) # 1. write header / skeleton + append_file(path, introduction) # 2. append Introduction section + append_file(path, methods) # 3. append Methods section + append_file(path, results) # 4. append Results + Discussion + append_file(path, bibliography) # 5. append Bibliography + + ## For BibTeX bibliographies: + Split into batches of <=10 @article / @inproceedings blocks per call. + + ## Limits: + Content must be <=6,000 characters per call. Split further if needed. + File must already exist (use write_file to create it first). + + Args: + file_path: Path to the file to append to (relative to workspace root). + content: Text to append. Include a leading newline if needed. + + Returns: + dict: {success: true, appended_chars: int} or {success: false, error: str} + """ + _APPEND_FILE_MAX_CHARS = 6000 + if len(content) > _APPEND_FILE_MAX_CHARS: + return { + "success": False, + "reason": "content_too_large", + "error": ( + f"Content is {len(content):,} chars, exceeding the " + f"{_APPEND_FILE_MAX_CHARS:,}-char limit per append_file call. " + f"Split into smaller batches (<=10 BibTeX entries or one section at a time)." 
+ ), + } + target_path = self._resolve_path(file_path) + if not target_path.exists(): + return { + "success": False, + "error": f"File '{file_path}' does not exist. Use write_file to create it first.", + "reason": "file_not_found", + } + try: + with open(target_path, "a", encoding="utf-8") as f: + f.write(content) + return {"success": True, "appended_chars": len(content)} + except Exception as exc: + logger.error(f"append_file failed for {file_path}: {exc}") + return {"success": False, "error": str(exc)} + @tool async def update_file( self, @@ -893,6 +969,18 @@ async def update_file( Returns: dict: {success: bool, replacements: int} or {success: False, error: str} """ + _UPDATE_FILE_MAX_CHARS = 8000 + if len(new_string) > _UPDATE_FILE_MAX_CHARS: + return { + "success": False, + "reason": "content_too_large", + "error": ( + f"new_string is {len(new_string):,} chars, exceeding the " + f"{_UPDATE_FILE_MAX_CHARS:,}-char limit per update_file call. " + f"Split this section into smaller semantic units and call " + f"update_file once per unit (e.g. one paragraph or subsection at a time)." + ), + } target_path = self._resolve_path(file_path) if not target_path.exists(): return {"success": False, "error": "File does not exist"} From c37a3c7d3c7c980e48910dabfb0eb9af7a66ab55 Mon Sep 17 00:00:00 2001 From: Starlitnightly Date: Tue, 31 Mar 2026 14:41:30 -0700 Subject: [PATCH 4/6] test: add comprehensive tests for output-token truncation guards MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Tests for PR #52 file manager changes: - write_file: reject >12K, accept at limit, file not created on reject - append_file: basic append, multi-batch (BibTeX pattern), reject nonexistent file, reject >6K, accept at limit - update_file: reject new_string >8K, accept at limit, original unchanged - Two-Phase Write Protocol end-to-end: scaffold → section fill → append 14/14 file manager tests passing. 
--- tests/test_file_manager.py | 147 +++++++++++++++++++++++++++++++++++++ 1 file changed, 147 insertions(+) diff --git a/tests/test_file_manager.py b/tests/test_file_manager.py index 1e73eda0..d877c883 100644 --- a/tests/test_file_manager.py +++ b/tests/test_file_manager.py @@ -510,3 +510,150 @@ async def test_manage_path_comprehensive(temp_toolset): result = await temp_toolset.manage_path("delete", "nonexistent.txt") assert result["success"] is False assert "does not exist" in result["error"] + + +# --------------------------------------------------------------------------- +# Output-token truncation guards (PR #52) +# --------------------------------------------------------------------------- + +async def test_write_file_rejects_large_content(temp_toolset): + """write_file must reject content > 12,000 chars.""" + big = "x" * 13_000 + res = await temp_toolset.write_file("big.txt", big) + assert not res["success"] + assert res["reason"] == "content_too_large" + assert "12,000" in res["error"] + # File must NOT exist on disk + assert not (temp_toolset.path / "big.txt").exists() + + +async def test_write_file_accepts_content_at_limit(temp_toolset): + """write_file must accept content exactly at 12,000 chars.""" + content = "a" * 12_000 + res = await temp_toolset.write_file("exact.txt", content) + assert res["success"] + assert (temp_toolset.path / "exact.txt").read_text() == content + + +async def test_append_file_basic(temp_toolset): + """append_file appends to existing file.""" + await temp_toolset.write_file("log.txt", "header\n") + res = await temp_toolset.append_file("log.txt", "line1\nline2\n") + assert res["success"] + assert res["appended_chars"] == len("line1\nline2\n") + content = (await temp_toolset.read_file("log.txt"))["content"] + assert content == "header\nline1\nline2\n" + + +async def test_append_file_multiple_batches(temp_toolset): + """append_file supports multiple sequential appends (BibTeX batch pattern).""" + await 
temp_toolset.write_file("refs.bib", "% Bibliography\n") + for i in range(5): + batch = f"@article{{ref{i},\n title={{Title {i}}},\n}}\n\n" + res = await temp_toolset.append_file("refs.bib", batch) + assert res["success"], f"Batch {i} failed: {res}" + content = (await temp_toolset.read_file("refs.bib"))["content"] + assert content.startswith("% Bibliography\n") + assert content.count("@article{") == 5 + + +async def test_append_file_rejects_nonexistent(temp_toolset): + """append_file must reject when target file does not exist.""" + res = await temp_toolset.append_file("missing.txt", "data") + assert not res["success"] + assert res["reason"] == "file_not_found" + + +async def test_append_file_rejects_large_content(temp_toolset): + """append_file must reject content > 6,000 chars.""" + await temp_toolset.write_file("base.txt", "ok\n") + big = "x" * 7_000 + res = await temp_toolset.append_file("base.txt", big) + assert not res["success"] + assert res["reason"] == "content_too_large" + assert "6,000" in res["error"] + # Original content must be unchanged + content = (await temp_toolset.read_file("base.txt"))["content"] + assert content == "ok\n" + + +async def test_append_file_accepts_content_at_limit(temp_toolset): + """append_file must accept content exactly at 6,000 chars.""" + await temp_toolset.write_file("base.txt", "start\n") + chunk = "b" * 6_000 + res = await temp_toolset.append_file("base.txt", chunk) + assert res["success"] + content = (await temp_toolset.read_file("base.txt"))["content"] + assert content == "start\n" + chunk + + +async def test_update_file_rejects_large_new_string(temp_toolset): + """update_file must reject new_string > 8,000 chars.""" + await temp_toolset.write_file("doc.txt", "PLACEHOLDER\n") + big = "y" * 9_000 + res = await temp_toolset.update_file("doc.txt", "PLACEHOLDER", big) + assert not res["success"] + assert res["reason"] == "content_too_large" + assert "8,000" in res["error"] + # Original content must be unchanged + content = 
(await temp_toolset.read_file("doc.txt"))["content"] + assert content == "PLACEHOLDER\n" + + +async def test_update_file_accepts_new_string_at_limit(temp_toolset): + """update_file must accept new_string exactly at 8,000 chars.""" + await temp_toolset.write_file("doc.txt", "STUB\n") + replacement = "c" * 8_000 + res = await temp_toolset.update_file("doc.txt", "STUB", replacement) + assert res["success"] + content = (await temp_toolset.read_file("doc.txt"))["content"] + assert replacement in content + + +async def test_two_phase_write_protocol(temp_toolset): + """End-to-end: scaffold → section fill → append (the protocol PR #52 teaches).""" + # Phase 1: scaffold + skeleton = ( + "\\documentclass{article}\n" + "\\begin{document}\n" + "\\section{Introduction}\n" + "% INTRO_PLACEHOLDER\n" + "\\section{Methods}\n" + "% METHODS_PLACEHOLDER\n" + "\\end{document}\n" + ) + res = await temp_toolset.write_file("paper.tex", skeleton) + assert res["success"] + + # Phase 2: fill sections via update_file + res = await temp_toolset.update_file( + "paper.tex", + "% INTRO_PLACEHOLDER", + "This paper presents a novel approach to analyzing single-cell data.", + ) + assert res["success"] + + res = await temp_toolset.update_file( + "paper.tex", + "% METHODS_PLACEHOLDER", + "We applied dimensionality reduction using UMAP.", + ) + assert res["success"] + + # Phase 3: append bibliography + bib_entries = "\\begin{thebibliography}{9}\n\\bibitem{ref1} Author, Title, 2024.\n\\end{thebibliography}\n" + # Insert before \end{document} via update_file + res = await temp_toolset.update_file( + "paper.tex", + "\\end{document}", + bib_entries + "\\end{document}", + ) + assert res["success"] + + # Verify final document + content = (await temp_toolset.read_file("paper.tex"))["content"] + assert "novel approach" in content + assert "UMAP" in content + assert "\\bibitem{ref1}" in content + assert "INTRO_PLACEHOLDER" not in content + assert "METHODS_PLACEHOLDER" not in content From 
7920a724d232e1436de8ae171b08f1390306d3fc Mon Sep 17 00:00:00 2001 From: Starlitnightly Date: Tue, 31 Mar 2026 15:06:51 -0700 Subject: [PATCH 5/6] fix(llm): set max_tokens to model's max output + raise tool guard thresholds MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Root cause fix: acompletion_litellm() never passed max_tokens (output) to litellm. Anthropic models default to 4096 output tokens, causing tool_use JSON to be truncated mid-generation when the model writes large file content. Fix: auto-detect model's max_output_tokens via litellm.get_model_info() and set it as kwargs["max_tokens"] when not already specified by model_params. With the root cause fixed, the tool-level size guards from PR #52 are now defense-in-depth (not the primary fix). Raised thresholds to match actual output capacity: - write_file: 12K → 40K chars - update_file: 8K → 30K chars - append_file: 6K → 20K chars Thresholds moved to class-level constants (WRITE_FILE_MAX_CHARS, etc.) for easy per-deployment tuning. Tests updated to reference constants instead of hardcoded values. 14/14 file manager tests passing. --- pantheon/toolsets/file/file_manager.py | 60 +++++++++----------------- pantheon/utils/llm.py | 15 +++++++ tests/test_file_manager.py | 33 +++++++------- 3 files changed, 54 insertions(+), 54 deletions(-) diff --git a/pantheon/toolsets/file/file_manager.py b/pantheon/toolsets/file/file_manager.py index b91e6f87..99f2b59b 100644 --- a/pantheon/toolsets/file/file_manager.py +++ b/pantheon/toolsets/file/file_manager.py @@ -810,6 +810,15 @@ async def view_file_outline(self, file_path: str) -> dict: except Exception as e: return {"success": False, "error": str(e)} + # Configurable size guards (defense-in-depth). + # With max_tokens properly set at the LLM call layer, these are safety + # nets — not the primary truncation fix. 
Defaults are generous enough + # for most single-section writes; the Two-Phase Protocol is recommended + # only for truly huge documents. + WRITE_FILE_MAX_CHARS = 40_000 + APPEND_FILE_MAX_CHARS = 20_000 + UPDATE_FILE_MAX_CHARS = 30_000 + @tool async def write_file( self, @@ -817,24 +826,13 @@ async def write_file( content: str = "", overwrite: bool = True, ) -> dict: - """Use this tool to CREATE a NEW file with a skeleton or short content. - - ⚠️ LARGE FILE PROTOCOL — MUST FOLLOW FOR PAPERS, REPORTS, LaTeX, BibTeX: - NEVER pass an entire document as `content` in one call. - Use the Two-Phase Write Protocol instead: - - Phase 1 — Scaffold (this tool, once): - write_file(path, content=) - - Phase 2 — Fill (per semantic section): - update_file(path, old_string=
, new_string=) - → one call per semantic unit (Introduction, Methods, Results, etc.) - - For lists / bibliographies (append_file, batched): - append_file(path, content=<10 BibTeX entries or 1 table block at a time>) + """Create or overwrite a file. - This tool will REFUSE content longer than 12,000 characters. Writing large - content in one shot causes output-token truncation and silent data loss. + For very large documents (papers, reports), prefer the Two-Phase + Write Protocol: + 1. write_file(path, skeleton) + 2. update_file(path, stub, full_section) — per section + 3. append_file(path, batch) — for BibTeX / lists Args: file_path: The path to the file to write. @@ -844,7 +842,7 @@ async def write_file( Returns: dict: Success status or error message. """ - _WRITE_FILE_MAX_CHARS = 12000 + _WRITE_FILE_MAX_CHARS = self.WRITE_FILE_MAX_CHARS if len(content) > _WRITE_FILE_MAX_CHARS: return { "success": False, @@ -884,34 +882,18 @@ async def append_file( file_path: str, content: str, ) -> dict: - """Append content to the end of an existing file without overwriting it. - - ## Primary use case: chunked writing for large documents - - When a single write_file or update_file call would be too large, split - the content and stream it in parts: - - write_file(path, skeleton) # 1. write header / skeleton - append_file(path, introduction) # 2. append Introduction section - append_file(path, methods) # 3. append Methods section - append_file(path, results) # 4. append Results + Discussion - append_file(path, bibliography) # 5. append Bibliography - - ## For BibTeX bibliographies: - Split into batches of <=10 @article / @inproceedings blocks per call. + """Append content to the end of an existing file. - ## Limits: - Content must be <=6,000 characters per call. Split further if needed. File must already exist (use write_file to create it first). Args: - file_path: Path to the file to append to (relative to workspace root). - content: Text to append. 
Include a leading newline if needed. + file_path: Path to the file to append to. + content: Text to append. Returns: dict: {success: true, appended_chars: int} or {success: false, error: str} """ - _APPEND_FILE_MAX_CHARS = 6000 + _APPEND_FILE_MAX_CHARS = self.APPEND_FILE_MAX_CHARS if len(content) > _APPEND_FILE_MAX_CHARS: return { "success": False, @@ -969,7 +951,7 @@ async def update_file( Returns: dict: {success: bool, replacements: int} or {success: False, error: str} """ - _UPDATE_FILE_MAX_CHARS = 8000 + _UPDATE_FILE_MAX_CHARS = self.UPDATE_FILE_MAX_CHARS if len(new_string) > _UPDATE_FILE_MAX_CHARS: return { "success": False, diff --git a/pantheon/utils/llm.py b/pantheon/utils/llm.py index 2d9c72c4..8e166018 100644 --- a/pantheon/utils/llm.py +++ b/pantheon/utils/llm.py @@ -447,6 +447,21 @@ async def acompletion_litellm( if model_params: kwargs.update(**model_params) + # ========== Ensure max_tokens (output) is set ========== + # Without explicit max_tokens, some providers (Anthropic) default to + # very low output limits (4096), causing tool_use JSON to be truncated + # mid-generation when the model writes large file content. + # Set to the model's declared max_output_tokens if not already specified. 
+ if "max_tokens" not in kwargs and "max_output_tokens" not in kwargs: + try: + from litellm.utils import get_model_info + _info = get_model_info(model) + _max_out = _info.get("max_output_tokens") + if _max_out and _max_out > 0: + kwargs["max_tokens"] = _max_out + except Exception: + pass # Fall through to provider default + # ========== Mode Detection & Configuration ========== proxy_kwargs = get_litellm_proxy_kwargs() if proxy_kwargs: diff --git a/tests/test_file_manager.py b/tests/test_file_manager.py index d877c883..7603bf61 100644 --- a/tests/test_file_manager.py +++ b/tests/test_file_manager.py @@ -517,19 +517,20 @@ async def test_manage_path_comprehensive(temp_toolset): # --------------------------------------------------------------------------- async def test_write_file_rejects_large_content(temp_toolset): - """write_file must reject content > 12,000 chars.""" - big = "x" * 13_000 + """write_file must reject content exceeding WRITE_FILE_MAX_CHARS.""" + limit = temp_toolset.WRITE_FILE_MAX_CHARS + big = "x" * (limit + 1000) res = await temp_toolset.write_file("big.txt", big) assert not res["success"] assert res["reason"] == "content_too_large" - assert "12,000" in res["error"] # File must NOT exist on disk assert not (temp_toolset.path / "big.txt").exists() async def test_write_file_accepts_content_at_limit(temp_toolset): - """write_file must accept content exactly at 12,000 chars.""" - content = "a" * 12_000 + """write_file must accept content exactly at WRITE_FILE_MAX_CHARS.""" + limit = temp_toolset.WRITE_FILE_MAX_CHARS + content = "a" * limit res = await temp_toolset.write_file("exact.txt", content) assert res["success"] assert (temp_toolset.path / "exact.txt").read_text() == content @@ -565,22 +566,23 @@ async def test_append_file_rejects_nonexistent(temp_toolset): async def test_append_file_rejects_large_content(temp_toolset): - """append_file must reject content > 6,000 chars.""" + """append_file must reject content exceeding APPEND_FILE_MAX_CHARS.""" 
await temp_toolset.write_file("base.txt", "ok\n") - big = "x" * 7_000 + limit = temp_toolset.APPEND_FILE_MAX_CHARS + big = "x" * (limit + 1000) res = await temp_toolset.append_file("base.txt", big) assert not res["success"] assert res["reason"] == "content_too_large" - assert "6,000" in res["error"] # Original content must be unchanged content = (await temp_toolset.read_file("base.txt"))["content"] assert content == "ok\n" async def test_append_file_accepts_content_at_limit(temp_toolset): - """append_file must accept content exactly at 6,000 chars.""" + """append_file must accept content exactly at APPEND_FILE_MAX_CHARS.""" await temp_toolset.write_file("base.txt", "start\n") - chunk = "b" * 6_000 + limit = temp_toolset.APPEND_FILE_MAX_CHARS + chunk = "b" * limit res = await temp_toolset.append_file("base.txt", chunk) assert res["success"] content = (await temp_toolset.read_file("base.txt"))["content"] @@ -588,22 +590,23 @@ async def test_append_file_accepts_content_at_limit(temp_toolset): async def test_update_file_rejects_large_new_string(temp_toolset): - """update_file must reject new_string > 8,000 chars.""" + """update_file must reject new_string exceeding UPDATE_FILE_MAX_CHARS.""" await temp_toolset.write_file("doc.txt", "PLACEHOLDER\n") - big = "y" * 9_000 + limit = temp_toolset.UPDATE_FILE_MAX_CHARS + big = "y" * (limit + 1000) res = await temp_toolset.update_file("doc.txt", "PLACEHOLDER", big) assert not res["success"] assert res["reason"] == "content_too_large" - assert "8,000" in res["error"] # Original content must be unchanged content = (await temp_toolset.read_file("doc.txt"))["content"] assert content == "PLACEHOLDER\n" async def test_update_file_accepts_new_string_at_limit(temp_toolset): - """update_file must accept new_string exactly at 8,000 chars.""" + """update_file must accept new_string exactly at UPDATE_FILE_MAX_CHARS.""" await temp_toolset.write_file("doc.txt", "STUB\n") - replacement = "c" * 8_000 + limit = 
temp_toolset.UPDATE_FILE_MAX_CHARS + replacement = "c" * limit res = await temp_toolset.update_file("doc.txt", "STUB", replacement) assert res["success"] content = (await temp_toolset.read_file("doc.txt"))["content"] From 376ad64d4607b6215db00abc4828b694ac37d7a1 Mon Sep 17 00:00:00 2001 From: Starlitnightly Date: Thu, 2 Apr 2026 19:55:37 -0700 Subject: [PATCH 6/6] Revert "Merge branch 'claw' into dev" This reverts commit df285664bbb5ffae6d468f86055c0a44c184188b, reversing changes made to 7920a724d232e1436de8ae171b08f1390306d3fc. --- pantheon/agent.py | 76 +- pantheon/factory/templates/settings.json | 8 +- pantheon/settings.py | 25 +- pantheon/team/pantheon.py | 199 +- .../toolsets/python/python_interpreter.py | 17 - pantheon/utils/llm.py | 34 +- pantheon/utils/token_optimization.py | 1776 ----------------- pantheon/utils/truncate.py | 67 +- scripts/benchmark_prompt_cache.py | 385 ---- scripts/benchmark_token_optimizations.py | 535 ----- tests/test_token_optimization.py | 1498 -------------- tests/test_truncate.py | 25 +- 12 files changed, 85 insertions(+), 4560 deletions(-) delete mode 100644 pantheon/utils/token_optimization.py delete mode 100644 scripts/benchmark_prompt_cache.py delete mode 100644 scripts/benchmark_token_optimizations.py delete mode 100644 tests/test_token_optimization.py diff --git a/pantheon/agent.py b/pantheon/agent.py index 1bf3d2b7..bd574b50 100644 --- a/pantheon/agent.py +++ b/pantheon/agent.py @@ -128,12 +128,8 @@ class AgentRunContext: agent: "Agent" memory: "Memory | None" - execution_context_id: str | None = None - process_step_message: Callable | None = None - process_chunk: Callable | None = None - cache_safe_runtime_params: Any | None = None - cache_safe_prompt_messages: list[dict] | None = None - cache_safe_tool_definitions: list[dict] | None = None + process_step_message: Callable | None + process_chunk: Callable | None _RUN_CONTEXT: ContextVar[AgentRunContext | None] = ContextVar( @@ -882,8 +878,7 @@ async def 
get_tools_for_llm(self) -> list[dict]: # Providers return ToolInfo with pre-generated inputSchema (the "function" part) logger.debug(f"get tools for llm: {self.providers} ") provider_tools = [] - for provider_name in sorted(self.providers): - provider = self.providers[provider_name] + for provider_name, provider in self.providers.items(): try: # Get tools from provider (uses cached list if available) tools = await provider.list_tools() @@ -953,9 +948,7 @@ async def get_tools_for_llm(self) -> list[dict]: if not self.force_litellm: func["parameters"].setdefault("required", []).append("_background") - from pantheon.utils.token_optimization import stabilize_tool_definitions - - return stabilize_tool_definitions(all_tools) + return all_tools def _should_inject_context_variables(self, prefixed_name: str) -> bool: """Determine if context_variables should be injected for a tool. @@ -1356,8 +1349,7 @@ async def _run_single_tool_call(call: dict) -> dict: # Process and truncate tool result in one step content = process_tool_result( result, - max_length=self.max_tool_content_length, - tool_name=func_name, + max_length=self.max_tool_content_length ) tool_message.update({ @@ -1421,39 +1413,7 @@ async def _acompletion( # Step 1: Process messages for the model async with tracker.measure("message_processing"): - from pantheon.utils.token_optimization import ( - build_llm_view_async, - inject_cache_control_markers, - is_anthropic_model, - ) - - run_context = get_current_run_context() - optimization_memory = run_context.memory if run_context else None - is_main_thread = ( - run_context.execution_context_id is None if run_context else True - ) - messages = await build_llm_view_async( - messages, - memory=optimization_memory, - is_main_thread=is_main_thread, - autocompact_model=model, - ) messages = process_messages_for_model(messages, model) - # Inject Anthropic prompt-cache markers so the server-side cache - # activates — mirrors Claude Code's getCacheControl() strategy. 
- if is_anthropic_model(model): - messages = inject_cache_control_markers(messages) - if run_context is not None: - # Selective copy: shallow for messages with string content, - # deepcopy only for messages with list content (Anthropic blocks - # from inject_cache_control_markers) to avoid mutation issues. - cached = [] - for m in messages: - if isinstance(m.get("content"), list): - cached.append(copy.deepcopy(m)) - else: - cached.append({**m}) - run_context.cache_safe_prompt_messages = cached # Step 2: Detect provider and get configuration provider_config = detect_provider(model, self.force_litellm) @@ -1489,8 +1449,6 @@ async def _acompletion( # Use get_tools_for_llm() for unified tool access # This includes both base_functions and provider tools tools = await self.get_tools_for_llm() or None - if run_context is not None and tools is not None: - run_context.cache_safe_tool_definitions = copy.deepcopy(tools) # For non-OpenAI providers or OpenAI-compatible providers, adjust tool format # OpenAI-compatible providers (e.g. 
minimax) have api_key set in config @@ -1533,15 +1491,6 @@ async def _acompletion( if context_variables and "model_params" in context_variables: # Runtime overrides instance defaults model_params = {**self.model_params, **context_variables["model_params"]} - - if run_context is not None: - from pantheon.utils.token_optimization import build_cache_safe_runtime_params - - run_context.cache_safe_runtime_params = build_cache_safe_runtime_params( - model=model, - model_params=model_params, - response_format=response_format, - ) # Step 8: Call LLM provider (unified interface) # logger.info(f"Raw messages: {messages}") @@ -2151,11 +2100,6 @@ async def _prepare_execution_context( # Determine whether to use memory should_use_memory = use_memory if use_memory is not None else self.use_memory memory_instance = memory or self.memory - working_context_variables = (context_variables or {}).copy() - fork_context_messages = working_context_variables.pop( - "_cache_safe_fork_context_messages", - None, - ) input_messages = None # Only set for normal user input, not AgentTransfer @@ -2185,21 +2129,16 @@ async def _prepare_execution_context( conversation_history = ( memory_instance.get_messages( execution_context_id=execution_context_id, - for_llm=False + for_llm=True ) if (should_use_memory and memory_instance) else [] ) - if isinstance(fork_context_messages, list) and fork_context_messages: - conversation_history = [ - *copy.deepcopy(fork_context_messages), - *conversation_history, - ] conversation_history += input_messages conversation_history = self._sanitize_messages(conversation_history) # preserve execution_context_id if tool need - context_variables = working_context_variables + context_variables = (context_variables or {}).copy() # Inject global context variables from settings from .settings import get_settings @@ -2339,7 +2278,6 @@ async def _process_chunk(chunk: dict): run_context = AgentRunContext( agent=self, memory=exec_context.memory_instance, - 
execution_context_id=exec_context.execution_context_id, process_step_message=_process_step_message, process_chunk=_process_chunk, ) diff --git a/pantheon/factory/templates/settings.json b/pantheon/factory/templates/settings.json index f19c84b4..9b85093b 100644 --- a/pantheon/factory/templates/settings.json +++ b/pantheon/factory/templates/settings.json @@ -25,12 +25,12 @@ // - "thread": Execute in separate thread (isolated, for heavy tasks) "local_toolset_execution_mode": "direct", // === Tool Output Limits === - // Maximum characters for tool output (fallback; per-tool thresholds take priority) - "max_tool_content_length": 50000, + // Maximum characters for tool output (used for smart truncation) + "max_tool_content_length": 10000, // Maximum lines for file read operations "max_file_read_lines": 1000, - // Maximum characters for file read operations (safety valve; per-tool thresholds handle LLM sizing) - "max_file_read_chars": 500000, + // Maximum characters for file read operations (prevents single-line overflow) + "max_file_read_chars": 100000, // Maximum results for glob/search operations "max_glob_results": 50, // Enable notebook execution logging (JSONL files in .pantheon/logs/notebook/) diff --git a/pantheon/settings.py b/pantheon/settings.py index 3e37d67f..2194155d 100644 --- a/pantheon/settings.py +++ b/pantheon/settings.py @@ -661,13 +661,11 @@ def tool_timeout(self) -> int: def max_tool_content_length(self) -> int: """ Maximum characters for tool output content. - Used as fallback for smart truncation at agent level. - Per-tool thresholds (from token_optimization.py) take priority - when available. - Defaults to 50000 (~12.5K tokens). + Used for smart truncation at agent level. + Defaults to 10000 (~5K tokens). 
""" self._ensure_loaded() - return self._settings.get("endpoint", {}).get("max_tool_content_length", 50000) + return self._settings.get("endpoint", {}).get("max_tool_content_length", 10000) @property def max_file_read_lines(self) -> int: @@ -681,16 +679,17 @@ def max_file_read_lines(self) -> int: @property def max_file_read_chars(self) -> int: """ - Maximum characters for read_file output (safety valve). - - Acts as an upper bound to prevent unbounded output from pathological - files. Per-tool thresholds (from token_optimization.py) handle the - actual LLM-context sizing at Layer 2. - - Defaults to 500000 characters. + Maximum characters for read_file output. + + Set higher than max_tool_content_length to allow reading larger files + while preventing unbounded output. When exceeded, read_file returns + truncated content with a 'truncated' flag to prevent infinite loops. + + Industry reference: Cursor uses 100K limit. + Defaults to 50000 characters. """ self._ensure_loaded() - return self._settings.get("endpoint", {}).get("max_file_read_chars", 500000) + return self._settings.get("endpoint", {}).get("max_file_read_chars", 50000) @property def max_glob_results(self) -> int: diff --git a/pantheon/team/pantheon.py b/pantheon/team/pantheon.py index 9b3217f5..a624666f 100644 --- a/pantheon/team/pantheon.py +++ b/pantheon/team/pantheon.py @@ -48,119 +48,6 @@ 'Expected Outcome: Report with UMAP visualization and marker genes.'""" -def _get_cache_safe_child_run_overrides( - run_context, - target_agent: Agent | RemoteAgent, - child_context_variables: dict, -) -> tuple[dict, dict]: - cache_params = getattr(run_context, "cache_safe_runtime_params", None) - caller_agent = getattr(run_context, "agent", None) - - if ( - cache_params is None - or not isinstance(caller_agent, Agent) - or not isinstance(target_agent, Agent) - ): - return {}, child_context_variables - - from pantheon.utils.token_optimization import normalize_cache_safe_value - - if list(target_agent.models) != 
list(caller_agent.models): - return {}, child_context_variables - - if normalize_cache_safe_value(target_agent.response_format) != cache_params.response_format_normalized: - return {}, child_context_variables - - overrides = { - "model": cache_params.model, - "response_format": cache_params.response_format_raw, - } - - updated_context_variables = dict(child_context_variables) - if ( - "model_params" not in updated_context_variables - and cache_params.model_params_raw - ): - updated_context_variables["model_params"] = copy.deepcopy( - cache_params.model_params_raw - ) - - return overrides, updated_context_variables - - -def _build_structured_fork_context(run_context) -> "list[dict] | None": - """Build a structured fork context from the parent's optimised history. - - Mirrors Claude Code's forkContextMessages: the child receives the parent's - already-budget+snipped message list (sans system message) as its initial - context, rather than a plain-text summary. This preserves tool-call - structure and lets the child reason over the actual conversation, not a - lossy narration of it. - - Returns None if there is no history worth forwarding. - """ - memory = getattr(run_context, "memory", None) - if memory is None: - return None - - # Use the already-computed cache_safe_prompt_messages if available — - # those have already been through build_llm_view (budget + microcompact). 
- cached = getattr(run_context, "cache_safe_prompt_messages", None) - if cached: - result = [ - copy.deepcopy(m) - for m in cached - if m.get("role") != "system" - ] - return result or None - - # Fallback: build fresh view from memory - from pantheon.utils.token_optimization import build_llm_view - - raw = memory.get_messages(None) - if not raw: - return None - projected = build_llm_view(raw, memory=memory, is_main_thread=True) - result = [m for m in projected if m.get("role") != "system"] - return result or None - - -async def _get_cache_safe_child_fork_context_messages( - run_context, - target_agent: Agent | RemoteAgent, -) -> list[dict] | None: - caller_agent = getattr(run_context, "agent", None) - parent_messages = getattr(run_context, "cache_safe_prompt_messages", None) - parent_tools = getattr(run_context, "cache_safe_tool_definitions", None) - - if ( - parent_messages is None - or parent_tools is None - or not isinstance(caller_agent, Agent) - or not isinstance(target_agent, Agent) - ): - return None - - if target_agent.instructions != caller_agent.instructions: - return None - - if list(target_agent.models) != list(caller_agent.models): - return None - - from pantheon.utils.token_optimization import normalize_cache_safe_value - - target_tools = await target_agent.get_tools_for_llm() - if normalize_cache_safe_value(target_tools) != normalize_cache_safe_value(parent_tools): - return None - - fork_context_messages = [ - copy.deepcopy(message) - for message in parent_messages - if message.get("role") != "system" - ] - return fork_context_messages or None - - def _slugify(name: str) -> str: slug = re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_") @@ -331,7 +218,7 @@ class PantheonTeam(Team): def __init__( self, agents: list[Agent | RemoteAgent], - use_summary: bool = True, + use_summary: bool = False, max_delegate_depth: int | None = 5, allow_transfer: bool = False, plugins: Optional[List["TeamPlugin"]] = None, @@ -340,9 +227,8 @@ def __init__( Args: agents: 
List of agents in the team. - use_summary: If True (default), generate summary + recent context - instead of full history for delegation. Set to False - to pass only the raw instruction. + use_summary: If True, generate and prepend context summary + when delegating tasks. max_delegate_depth: Maximum depth for nested call_agent calls. allow_transfer: If True, add transfer_to_agent tool to agents. plugins: Optional list of TeamPlugin instances for extending functionality. @@ -490,50 +376,14 @@ async def call_agent( child_context_variables["_metadata"] = child_metadata # P2: Set execution_context_id at top level for child agent child_context_variables["execution_context_id"] = execution_context_id - child_run_overrides, child_context_variables = ( - _get_cache_safe_child_run_overrides( - run_context, - target_agent, - child_context_variables, - ) - ) - # CC-style delegation: structured fork is PRIMARY path. - # Child receives parent's full optimized message history as - # structured messages (forkContextMessages), enabling prompt - # cache sharing. Summary is only a FALLBACK when no - # structured context is available. 
- use_summary_fallback = False - fork_context_messages = await _get_cache_safe_child_fork_context_messages( - run_context, - target_agent, - ) - if fork_context_messages: - # Path 1: Cache-compatible — share parent prefix byte-for-byte - child_context_variables["_cache_safe_fork_context_messages"] = ( - fork_context_messages - ) - elif run_context.memory: - # Path 2: Incompatible agents — pass optimized structured - # messages (CC's forkContextMessages for non-cache-sharing) - structured_fork = _build_structured_fork_context(run_context) - if structured_fork: - child_context_variables["_cache_safe_fork_context_messages"] = ( - structured_fork - ) - else: - # Path 3: No structured context available — fall back to - # summary (only when use_summary=True) - use_summary_fallback = self.use_summary - else: - use_summary_fallback = self.use_summary - - # Build task message — with or without summary + + # Build task message with optional history summary task_message = await create_delegation_task_message( history=run_context.memory.get_messages(None) if run_context.memory else [], instruction=instruction, - use_summary=use_summary_fallback, + use_summary=self.use_summary, ) if not task_message: return "" @@ -581,7 +431,6 @@ async def wrapped_chunk(chunk: dict): execution_context_id=execution_context_id, context_variables=child_context_variables, allow_transfer=False, - **child_run_overrides, ) # Submit sub_agent learning via plugin hooks @@ -769,34 +618,20 @@ async def force_compress(self, memory: "Memory") -> dict: -DELEGATION_RECENT_TAIL_SIZE = 20 - - async def create_delegation_task_message( history: list[dict], instruction: str, use_summary: bool = True, ) -> str | None: - """Create a delegated task message with summary-first, on-demand-detail strategy. - - When *use_summary* is True (default): - 1. Generate a compact LLM summary of the full history. - 2. 
Pass only the **recent tail** of the history to - ``build_delegation_context_message`` — this avoids embedding the entire - parent conversation in the child prompt. - 3. Append an on-demand hint so the child agent knows it can retrieve - full tool outputs from disk if needed. - - When *use_summary* is False (explicit opt-out): - Only the raw *instruction* is returned — no history or summary. - """ + """Create a delegated task message with optional summary context.""" if not instruction: return None + # If summary is disabled, the instruction is the entire content. if not use_summary: return instruction - # --- summary-first: generate compact summary from full history ----------- + # Default behavior: Summarize history and append the instruction. summary_text = None if history: try: @@ -807,14 +642,8 @@ async def create_delegation_task_message( except Exception as e: logger.warning(f"Failed to generate summary for delegation: {e}") - # --- only pass the recent tail to build_delegation_context_message -------- - # The summary covers older context; recent messages provide necessary detail. 
- recent_history = history[-DELEGATION_RECENT_TAIL_SIZE:] if history else [] - - from pantheon.utils.token_optimization import build_delegation_context_message - - return build_delegation_context_message( - history=recent_history, - instruction=instruction, - summary_text=summary_text, - ) + content_parts = [] + if summary_text: + content_parts.append(f"Context Summary:\n{summary_text}") + content_parts.append(f"Task: {instruction}") + return "\n\n".join(content_parts) diff --git a/pantheon/toolsets/python/python_interpreter.py b/pantheon/toolsets/python/python_interpreter.py index 3c51c2ad..0976e26c 100644 --- a/pantheon/toolsets/python/python_interpreter.py +++ b/pantheon/toolsets/python/python_interpreter.py @@ -1,7 +1,6 @@ import os import io import ast -import asyncio import base64 import json import traceback @@ -92,7 +91,6 @@ def __init__( workdir: str | None = None, engine: Engine | None = None, init_code: str | None = DEFAULT_INIT_CODE, - shared_executor=None, **kwargs, ): super().__init__(name, **kwargs) @@ -100,7 +98,6 @@ def __init__( self.jobs = {} self._engine = engine self.engine = None - self.shared_executor = shared_executor self.clientid_to_interpreterid = {} self.workdir = Path(workdir).expanduser().resolve() if workdir else Path.cwd() self.init_code = init_code @@ -162,20 +159,6 @@ async def run_python_code( "stderr": str, # Captured standard error } """ - if self.shared_executor is not None: - loop = asyncio.get_event_loop() - result = await loop.run_in_executor( - None, - lambda: self.shared_executor.execute(code), - ) - return { - "success": result.get("error") is None, - "result": result.get("result"), - "stdout": result.get("output", ""), - "stderr": result.get("error", "") or "", - } - # ───────────────────────────────────────────────────────────────── - if interpreter_id: p_id = interpreter_id else: diff --git a/pantheon/utils/llm.py b/pantheon/utils/llm.py index f6e5dfec..8e166018 100644 --- a/pantheon/utils/llm.py +++ 
b/pantheon/utils/llm.py @@ -877,49 +877,37 @@ def remove_hidden_fields(content: dict) -> dict: def process_tool_result( - result: Any, + result: Any, max_length: int | None = None, - tool_name: str | None = None, ) -> Any: """Process tool result with optional truncation. - + Args: result: Raw tool result - max_length: Global max length for truncation (fallback) - tool_name: Tool name for per-tool threshold lookup - + max_length: Optional max length for truncation + Returns: Processed result """ # Remove hidden fields result = remove_hidden_fields(result) - - # Determine effective limit: per-tool threshold takes priority over global - effective_limit = max_length + + # Apply smart truncation if max_length specified + # (includes base64 filtering for JSON tools) if max_length is not None: - try: - from pantheon.utils.token_optimization import ( - get_per_tool_limit, - ) - effective_limit = int(get_per_tool_limit(tool_name, max_length)) - except Exception as e: - logger.debug(f"get_per_tool_limit failed for {tool_name}: {e}") - - # Apply smart truncation if limit specified - if effective_limit is not None: try: from pantheon.utils.truncate import smart_truncate_result - return smart_truncate_result(result, effective_limit, filter_base64=True) + return smart_truncate_result(result, max_length, filter_base64=True) except Exception as e: # Fallback to simple string conversion if truncation fails logger.warning(f"Smart truncation failed: {e}, falling back to simple conversion") content = str(result) if not isinstance(result, str) else result - if len(content) > effective_limit: + if len(content) > max_length: # Simple truncation: head + tail - half = effective_limit // 2 + half = max_length // 2 return f"{content[:half]}\n...[truncated]...\n{content[-half:]}" return content - + return result diff --git a/pantheon/utils/token_optimization.py b/pantheon/utils/token_optimization.py deleted file mode 100644 index 3afae7a3..00000000 --- a/pantheon/utils/token_optimization.py 
+++ /dev/null @@ -1,1776 +0,0 @@ -from __future__ import annotations - -import json -import re -from dataclasses import dataclass -from pathlib import Path -from typing import Any - -from pantheon.utils.log import logger -from pantheon.utils.truncate import ( - PERSISTED_OUTPUT_TAG, - PERSISTED_OUTPUT_CLOSING_TAG, - PREVIEW_SIZE_BYTES, - _format_file_size, -) - -TIME_BASED_MC_CLEARED_MESSAGE = "[Old tool result content cleared]" -EMPTY_TOOL_RESULT_PLACEHOLDER = "[No output]" -BYTES_PER_TOKEN = 4 -MAX_TOOL_RESULTS_PER_MESSAGE_CHARS = 200_000 -# CC default: DEFAULT_MAX_RESULT_SIZE_CHARS = 50_000. Used as fallback when -# a tool has no explicit entry in PER_TOOL_RESULT_SIZE_CHARS. -DEFAULT_MAX_RESULT_SIZE_CHARS = 50_000 -TIME_BASED_MC_GAP_THRESHOLD_MINUTES = 60 -TIME_BASED_MC_KEEP_RECENT = 5 -STATE_KEY = "token_optimization" -COMPACTABLE_TOOL_SUFFIXES = { - "read_file", - "view_file", - "write_file", - "update_file", - "apply_patch", - "glob", - "grep", - "grep_search", - "find_by_name", - "shell", - "bash", - "web_fetch", - "web_search", - "web_crawl", -} - -# Per-tool result size limits (chars). Tools that produce large but -# structured output (web pages, full file reads) get a tighter cap than -# the default; tools whose output is always small are left out so the -# DEFAULT_MAX_RESULT_SIZE_CHARS (50K) applies. -# Mirrors Claude Code's per-tool maxResultSizeChars declarations. -PER_TOOL_RESULT_SIZE_CHARS: dict[str, int | float] = { - "read_file": 40_000, - "view_file": 40_000, - "web_fetch": 30_000, - "web_crawl": 30_000, - "web_search": 20_000, - "shell": 50_000, - "bash": 50_000, - "grep": 20_000, - "grep_search": 20_000, - "glob": 10_000, - "find_by_name": 10_000, -} - -# Tools that opt out of persistence entirely (CC: maxResultSizeChars = Infinity). -# Their results are never externalized regardless of size — model needs to see -# the full output for correct reasoning. 
PERSISTENCE_OPT_OUT_TOOLS: frozenset[str] = frozenset()

# Tools whose results are considered collapsible in contextCollapse.
# Matches CC's collapseReadSearch.ts getToolSearchOrReadInfo().
COLLAPSIBLE_SEARCH_TOOLS = frozenset({
    "grep", "grep_search", "glob", "find_by_name",
    "web_search",
})
COLLAPSIBLE_READ_TOOLS = frozenset({
    "read_file", "view_file", "web_fetch", "web_crawl",
})
COLLAPSIBLE_LIST_TOOLS = frozenset({
    "glob", "find_by_name",
})
ALL_COLLAPSIBLE_TOOLS = COLLAPSIBLE_SEARCH_TOOLS | COLLAPSIBLE_READ_TOOLS | frozenset({
    "shell", "bash",  # absorbed silently like CC's REPL
})


@dataclass
class ContentReplacementState:
    # IDs of tool results already evaluated; their decision is frozen.
    seen_ids: set[str]
    # tool_use_id -> externalized replacement text to re-apply.
    replacements: dict[str, str]


@dataclass
class ToolMessageCandidate:
    # A tool message eligible for externalization; size is len(content).
    tool_use_id: str
    content: str
    size: int


@dataclass(frozen=True)
class TimeBasedMicrocompactConfig:
    enabled: bool
    gap_threshold_minutes: int
    keep_recent: int


@dataclass(frozen=True)
class CacheSafeRuntimeParams:
    # Raw and deterministic-normalized forms of runtime model parameters.
    model: str
    model_params_raw: dict[str, Any]
    model_params_normalized: Any
    response_format_raw: Any | None
    response_format_normalized: Any | None


def create_content_replacement_state() -> ContentReplacementState:
    """Return a fresh, empty replacement state."""
    return ContentReplacementState(seen_ids=set(), replacements={})


def get_time_based_microcompact_config() -> TimeBasedMicrocompactConfig:
    """Return the module-default time-based microcompact configuration."""
    return TimeBasedMicrocompactConfig(
        enabled=True,
        gap_threshold_minutes=TIME_BASED_MC_GAP_THRESHOLD_MINUTES,
        keep_recent=TIME_BASED_MC_KEEP_RECENT,
    )


def normalize_cache_safe_value(value: Any) -> Any:
    """Recursively normalize *value* into a deterministic comparable form.

    Dicts are key-sorted, sets sorted, pydantic-like objects reduced to
    their JSON schema, and other named objects to ``module.qualname``.
    """
    if value is None:
        return None
    if isinstance(value, dict):
        return {
            str(key): normalize_cache_safe_value(value[key])
            for key in sorted(value, key=str)
        }
    if isinstance(value, (list, tuple)):
        return [normalize_cache_safe_value(item) for item in value]
    if isinstance(value, set):
        return sorted(normalize_cache_safe_value(item) for item in value)
    if hasattr(value, "model_json_schema"):
        try:
            return normalize_cache_safe_value(value.model_json_schema())
        except TypeError:
            pass
    if hasattr(value, "__qualname__") and hasattr(value, "__module__"):
        return f"{value.__module__}.{value.__qualname__}"
    return value


def build_cache_safe_runtime_params(
    model: str,
    model_params: dict[str, Any] | None,
    response_format: Any | None,
) -> CacheSafeRuntimeParams:
    """Bundle raw + normalized runtime params into one frozen record."""
    raw_model_params = dict(model_params or {})
    return CacheSafeRuntimeParams(
        model=model,
        model_params_raw=raw_model_params,
        model_params_normalized=normalize_cache_safe_value(raw_model_params),
        response_format_raw=response_format,
        response_format_normalized=normalize_cache_safe_value(response_format),
    )


def _normalize_state_payload(data: Any) -> dict[str, Any]:
    """Coerce a persisted payload into a dict, dropping anything malformed."""
    if not isinstance(data, dict):
        return {}
    return data


def load_content_replacement_state(memory: Any | None) -> ContentReplacementState:
    """Load replacement state from *memory*'s extra_data, tolerating bad data."""
    if memory is None:
        return create_content_replacement_state()
    payload = _normalize_state_payload(memory.extra_data.get(STATE_KEY))
    seen_ids = {
        str(item)
        for item in payload.get("seen_ids", [])
        if isinstance(item, str) and item
    }
    replacements = {
        str(tool_use_id): str(replacement)
        for tool_use_id, replacement in payload.get("replacements", {}).items()
        if isinstance(tool_use_id, str) and tool_use_id
    }
    return ContentReplacementState(seen_ids=seen_ids, replacements=replacements)


def reconstruct_content_replacement_state(
    messages: list[dict],
    memory: Any | None = None,
) -> ContentReplacementState:
    """Reconstruct replacement state from message history on session resume.

    Mirrors CC's ``reconstructContentReplacementState()``: scans messages for
    already-externalized tool results and rebuilds the seen_ids/replacements
    maps so the budget logic is consistent with prior decisions.
    """
    state = load_content_replacement_state(memory)

    # Walk all tool messages; if content is already externalized (persisted-output
    # or cleared), record the decision so we don't re-evaluate.
    for message in messages:
        if message.get("role") != "tool":
            continue
        tool_use_id = message.get("tool_call_id")
        if not isinstance(tool_use_id, str) or not tool_use_id:
            continue
        content = message.get("content")
        if not isinstance(content, str):
            continue
        if content.startswith(PERSISTED_OUTPUT_TAG):
            state.seen_ids.add(tool_use_id)
            state.replacements[tool_use_id] = content
        elif content == TIME_BASED_MC_CLEARED_MESSAGE:
            state.seen_ids.add(tool_use_id)
        elif content == EMPTY_TOOL_RESULT_PLACEHOLDER:
            state.seen_ids.add(tool_use_id)

    if memory is not None:
        save_content_replacement_state(memory, state)
    return state


def save_content_replacement_state(
    memory: Any | None,
    state: ContentReplacementState,
) -> None:
    """Persist *state* into memory.extra_data; no-op when unchanged."""
    if memory is None:
        return
    payload = {
        "seen_ids": sorted(state.seen_ids),
        "replacements": dict(sorted(state.replacements.items())),
    }
    # Skip the dirty-mark when nothing changed, avoiding needless writes.
    if memory.extra_data.get(STATE_KEY) == payload:
        return
    memory.extra_data[STATE_KEY] = payload
    memory.mark_dirty()


# _format_file_size is imported from pantheon.utils.truncate


def generate_preview(content: str, max_bytes: int) -> tuple[str, bool]:
    """Return (preview, truncated): at most *max_bytes* chars of *content*.

    Prefers cutting at the last newline, but only when that newline falls in
    the second half of the window (avoids tiny previews).
    """
    if len(content) <= max_bytes:
        return content, False
    truncated = content[:max_bytes]
    last_newline = truncated.rfind("\n")
    cut_point = last_newline if last_newline > max_bytes * 0.5 else max_bytes
    return content[:cut_point], True


def _get_tool_results_dir(memory: Any | None, base_dir: Path | None) -> Path:
    """Directory where externalized tool results are written, per memory id."""
    if base_dir is not None:
        root = Path(base_dir)
    else:
        from pantheon.settings import get_settings

        root = get_settings().tmp_dir / "tool-results"
    memory_id = getattr(memory, "id", None) or "default"
    return root / str(memory_id)


def _detect_json_content(content: str) -> bool:
    """Return True if *content* looks like a JSON array or object."""
    stripped = content.lstrip()
    if not stripped or stripped[0] not in ("[", "{"):
        return False
    try:
        json.loads(content)
        return True
    except (json.JSONDecodeError, ValueError):
        return False


def _get_tool_result_path(
    tool_use_id: str,
    memory: Any | None,
    base_dir: Path | None,
    *,
    is_json: bool = False,
) -> Path:
    """Path for one externalized tool result (.json when content is JSON)."""
    ext = ".json" if is_json else ".txt"
    return _get_tool_results_dir(memory, base_dir) / f"{tool_use_id}{ext}"


def persist_tool_result(
    content: str,
    tool_use_id: str,
    memory: Any | None = None,
    base_dir: Path | None = None,
) -> dict[str, Any]:
    """Write *content* to disk and return filepath/size/preview metadata."""
    directory = _get_tool_results_dir(memory, base_dir)
    directory.mkdir(parents=True, exist_ok=True)
    is_json = _detect_json_content(content)
    filepath = _get_tool_result_path(tool_use_id, memory, base_dir, is_json=is_json)
    # Atomic-ish write: skip if file already exists (mirrors CC's 'wx' flag)
    if not filepath.exists():
        filepath.write_text(content, encoding="utf-8")
    preview, has_more = generate_preview(content, PREVIEW_SIZE_BYTES)
    return {
        "filepath": str(filepath),
        "original_size": len(content),
        "preview": preview,
        "has_more": has_more,
    }


def build_large_tool_result_message(result: dict[str, Any]) -> str:
    """Format the in-context replacement text for an externalized result."""
    message = f"{PERSISTED_OUTPUT_TAG}\n"
    message += (
        f"Output too large ({_format_file_size(result['original_size'])}). "
        f"Full output saved to: {result['filepath']}\n\n"
    )
    message += f"Preview (first {_format_file_size(PREVIEW_SIZE_BYTES)}):\n"
    message += result["preview"]
    message += "\n...\n" if result["has_more"] else "\n"
    message += PERSISTED_OUTPUT_CLOSING_TAG
    return message


def _is_already_externalized(content: str) -> bool:
    """True when a tool result was previously persisted or cleared."""
    return (
        content.startswith(PERSISTED_OUTPUT_TAG)
        or content == TIME_BASED_MC_CLEARED_MESSAGE
        or "Full content saved to:" in content
    )


def build_tool_name_map(messages: list[dict]) -> dict[str, str]:
    """Map tool_call_id -> tool name from assistant messages' tool_calls."""
    result: dict[str, str] = {}
    for message in messages:
        if message.get("role") != "assistant":
            continue
        for tool_call in message.get("tool_calls") or []:
            if not isinstance(tool_call, dict):
                continue
            tool_call_id = tool_call.get("id")
            function = tool_call.get("function") or {}
            tool_name = function.get("name")
            if tool_call_id and tool_name:
                result[str(tool_call_id)] = str(tool_name)
    return result


def get_tool_name_for_message(message: dict, tool_name_map: dict[str, str]) -> str | None:
    """Resolve a tool message's tool name via the map, then its own field."""
    tool_use_id = message.get("tool_call_id")
    if isinstance(tool_use_id, str) and tool_use_id:
        mapped = tool_name_map.get(tool_use_id)
        if mapped:
            return mapped

    tool_name = message.get("tool_name")
    if isinstance(tool_name, str) and tool_name:
        return tool_name
    return None


def normalize_tool_name(tool_name: str | None) -> str:
    """Strip any namespacing prefix: 'server__tool' -> 'tool'."""
    if not tool_name:
        return ""
    if "__" in tool_name:
        return tool_name.rsplit("__", 1)[-1]
    return tool_name


def is_compactable_tool_name(tool_name: str | None) -> bool:
    """True when the (normalized) tool may have its result microcompacted."""
    return normalize_tool_name(tool_name) in COMPACTABLE_TOOL_SUFFIXES


def collect_candidates_from_message(message: dict) -> list[ToolMessageCandidate]:
    """Return 0 or 1 candidates: a non-empty, not-yet-externalized tool result."""
    if message.get("role") != "tool":
        return []
    tool_use_id = message.get("tool_call_id")
    content = message.get("content")
    if not isinstance(tool_use_id, str) or not tool_use_id:
        return []
    if not isinstance(content, str) or not content:
        return []
    if _is_already_externalized(content):
        return []
    return [
        ToolMessageCandidate(
            tool_use_id=tool_use_id,
            content=content,
            size=len(content),
        )
    ]


def guard_empty_tool_results(messages: list[dict]) -> list[dict]:
    """Inject a placeholder for empty tool results.

    Mirrors CC's emptiness guard: some models emit a stop-sequence when
    they see an empty tool result. Injecting ``[No output]`` prevents that.
    """
    result: list[dict] = []
    for message in messages:
        if message.get("role") != "tool":
            result.append(message)
            continue
        content = message.get("content")
        if isinstance(content, str) and content.strip():
            result.append(message)
            continue
        new_msg = dict(message)
        new_msg["content"] = EMPTY_TOOL_RESULT_PLACEHOLDER
        result.append(new_msg)
    return result


def collect_candidates_by_message(messages: list[dict]) -> list[list[ToolMessageCandidate]]:
    """Group tool-result candidates by the assistant turn they belong to."""
    groups: list[list[ToolMessageCandidate]] = []
    current: list[ToolMessageCandidate] = []
    seen_assistant_ids: set[str] = set()

    def flush() -> None:
        # Close the current group if it has any candidates.
        nonlocal current
        if current:
            groups.append(current)
            current = []

    for index, message in enumerate(messages):
        role = message.get("role")
        if role == "tool":
            current.extend(collect_candidates_from_message(message))
            continue
        if role == "assistant":
            # A new assistant id starts a new group; repeated ids (streamed
            # chunks) do not split the group.
            assistant_id = str(message.get("id") or f"assistant-{index}")
            if assistant_id not in seen_assistant_ids:
                flush()
                seen_assistant_ids.add(assistant_id)
            continue
        flush()
    flush()
    return groups


def partition_by_prior_decision(
    candidates: list[ToolMessageCandidate],
    state: ContentReplacementState,
) -> tuple[list[tuple[ToolMessageCandidate, str]], list[ToolMessageCandidate], list[ToolMessageCandidate]]:
    """Split candidates into (must_reapply, frozen, fresh) by prior state.

    must_reapply: have a stored replacement to re-apply;
    frozen: previously seen, decision fixed (kept inline);
    fresh: never evaluated before.
    """
    must_reapply: list[tuple[ToolMessageCandidate, str]] = []
    frozen: list[ToolMessageCandidate] = []
    fresh: list[ToolMessageCandidate] = []
    for candidate in candidates:
        replacement = state.replacements.get(candidate.tool_use_id)
        if replacement is not None:
            must_reapply.append((candidate, replacement))
        elif candidate.tool_use_id in state.seen_ids:
            frozen.append(candidate)
        else:
            fresh.append(candidate)
    return must_reapply, frozen, fresh


def select_fresh_to_replace(
    fresh: list[ToolMessageCandidate],
    frozen_size: int,
    limit: int,
) -> list[ToolMessageCandidate]:
    """Pick largest-first fresh candidates until total size fits *limit*."""
    sorted_candidates = sorted(fresh, key=lambda item: item.size, reverse=True)
    selected: list[ToolMessageCandidate] = []
    remaining = frozen_size + sum(item.size for item in fresh)
    for candidate in sorted_candidates:
        if remaining <= limit:
            break
        selected.append(candidate)
        remaining -= candidate.size
    return selected


def replace_tool_message_contents(
    messages: list[dict],
    replacement_map: dict[str, str],
) -> list[dict]:
    """Return a new message list with mapped tool contents swapped in."""
    result: list[dict] = []
    for message in messages:
        if message.get("role") != "tool":
            result.append(message)
            continue
        tool_use_id = message.get("tool_call_id")
        replacement = replacement_map.get(str(tool_use_id))
        if replacement is None:
            result.append(message)
            continue
        new_message = dict(message)
        new_message["content"] = replacement
        result.append(new_message)
    return result


def get_per_tool_limit(tool_name: str | None, global_limit: int) -> int | float:
    """Return the effective size limit for a single tool result.

    Mirrors Claude Code's ``getPersistenceThreshold()``:
    1. Infinity opt-out (never externalize)
    2. Per-tool override from PER_TOOL_RESULT_SIZE_CHARS
    3. Fallback: ``min(DEFAULT_MAX_RESULT_SIZE_CHARS, global_limit)``
    """
    normalized = normalize_tool_name(tool_name)
    if normalized in PERSISTENCE_OPT_OUT_TOOLS:
        return float("inf")
    explicit = PER_TOOL_RESULT_SIZE_CHARS.get(normalized)
    if explicit is not None:
        return explicit
    return min(DEFAULT_MAX_RESULT_SIZE_CHARS, global_limit)


# querySource values that should NOT write to disk (mirrors CC's logic
# in toolResultStorage.ts — agent_summary, fork calls, etc.
# only see the already-persisted preview, they never create new disk entries).
_PERSISTENCE_SKIP_QUERY_SOURCES = frozenset({
    "agent_summary",
    "memory_agent",
    "title_agent",
})


def _should_persist_to_disk(query_source: str | None) -> bool:
    """Return True if this query source is allowed to write new tool results
    to disk. Mirrors CC: only ``agent:*`` and ``repl_main_thread*`` persist;
    summaries and fork helpers do not."""
    if query_source is None:
        return True  # conservative default
    return query_source not in _PERSISTENCE_SKIP_QUERY_SOURCES


def apply_tool_result_budget(
    messages: list[dict],
    memory: Any | None = None,
    base_dir: Path | None = None,
    per_message_limit: int = MAX_TOOL_RESULTS_PER_MESSAGE_CHARS,
    skip_tool_names: set[str] | None = None,
    query_source: str | None = None,
) -> list[dict]:
    """Safety-net budget enforcement for tool results.

    Per-tool externalization is now handled at tool execution time by
    ``process_tool_result`` (using per-tool thresholds from
    ``get_per_tool_limit``). This function serves as a second pass:

    1. Replays prior externalization decisions (session resume).
    2. Guards empty tool results.
    3. Enforces the **per-message aggregate** limit — if a single turn
       contains many tool results whose combined size exceeds
       *per_message_limit*, the largest fresh results are externalized.
    """
    # Guard empty tool results first (CC emptiness guard)
    messages = guard_empty_tool_results(messages)
    state = load_content_replacement_state(memory)

    # querySource filtering: fork helpers only re-apply existing decisions,
    # they never create new disk entries.
    persist_allowed = _should_persist_to_disk(query_source)
    candidates_by_message = collect_candidates_by_message(messages)
    skip_tool_names = skip_tool_names or set()
    tool_name_map = build_tool_name_map(messages)
    replacement_map: dict[str, str] = {}

    for candidates in candidates_by_message:
        must_reapply, frozen, fresh = partition_by_prior_decision(candidates, state)

        for candidate, replacement in must_reapply:
            replacement_map[candidate.tool_use_id] = replacement

        if not fresh:
            # No new decisions needed for this turn; freeze everything.
            for candidate in candidates:
                state.seen_ids.add(candidate.tool_use_id)
            continue

        skipped = [
            candidate
            for candidate in fresh
            if tool_name_map.get(candidate.tool_use_id) in skip_tool_names
        ]
        for candidate in skipped:
            state.seen_ids.add(candidate.tool_use_id)

        eligible = [candidate for candidate in fresh if candidate not in skipped]

        # Separate opt-out tools (Infinity limit) — never externalize them
        checkable: list[ToolMessageCandidate] = []
        for candidate in eligible:
            tool_name = tool_name_map.get(candidate.tool_use_id)
            normalized = normalize_tool_name(tool_name)
            if normalized in PERSISTENCE_OPT_OUT_TOOLS:
                state.seen_ids.add(candidate.tool_use_id)
            else:
                checkable.append(candidate)

        # Per-message aggregate check: externalize largest results when the
        # combined size of all results in this turn exceeds per_message_limit.
        # Individual per-tool thresholds are already enforced at tool execution
        # time, so this only catches the aggregate-too-large case.
        frozen_size = sum(candidate.size for candidate in frozen)
        fresh_size = sum(candidate.size for candidate in checkable)
        aggregate_selected = (
            select_fresh_to_replace(checkable, frozen_size, per_message_limit)
            if frozen_size + fresh_size > per_message_limit
            else []
        )

        selected_ids = {candidate.tool_use_id for candidate in aggregate_selected}
        for candidate in candidates:
            if candidate.tool_use_id not in selected_ids:
                state.seen_ids.add(candidate.tool_use_id)

        for candidate in aggregate_selected:
            if persist_allowed:
                persisted = persist_tool_result(
                    candidate.content,
                    candidate.tool_use_id,
                    memory=memory,
                    base_dir=base_dir,
                )
                replacement = build_large_tool_result_message(persisted)
                state.seen_ids.add(candidate.tool_use_id)
                state.replacements[candidate.tool_use_id] = replacement
                replacement_map[candidate.tool_use_id] = replacement
            else:
                # Fork / summary agents: mark as seen but don't persist
                state.seen_ids.add(candidate.tool_use_id)

    save_content_replacement_state(memory, state)
    if not replacement_map:
        return messages
    return replace_tool_message_contents(messages, replacement_map)


def _parse_timestamp(timestamp: Any) -> float | None:
    """Best-effort parse of an epoch number or ISO-8601 string to epoch secs.

    Returns None when the value is absent or unparseable.
    """
    if isinstance(timestamp, (int, float)):
        return float(timestamp)
    if isinstance(timestamp, str):
        try:
            from datetime import datetime

            # 'Z' suffix is not accepted by fromisoformat on older Pythons.
            return datetime.fromisoformat(timestamp.replace("Z", "+00:00")).timestamp()
        except ValueError:
            return None
    return None


def _collect_compactable_tool_message_ids(messages: list[dict]) -> list[str]:
    """IDs (in order) of tool results eligible for microcompact clearing."""
    tool_name_map = build_tool_name_map(messages)
    ids: list[str] = []
    for message in messages:
        if message.get("role") != "tool":
            continue
        tool_use_id = message.get("tool_call_id")
        content = message.get("content")
        if not isinstance(tool_use_id, str) or not tool_use_id:
            continue
        if not isinstance(content, str) or not content:
            continue
        if _is_already_externalized(content):
            continue
        tool_name = get_tool_name_for_message(message, tool_name_map)
        if not is_compactable_tool_name(tool_name):
            continue
        ids.append(tool_use_id)
    return ids


def evaluate_time_based_trigger(
    messages: list[dict],
    *,
    is_main_thread: bool,
    config: TimeBasedMicrocompactConfig | None = None,
) -> tuple[float, TimeBasedMicrocompactConfig] | None:
    """Return (gap_minutes, config) when the idle-gap trigger fires, else None.

    Fires only on the main thread, when enabled, and when the wall-clock gap
    since the last timestamped assistant message reaches the threshold.
    """
    resolved_config = config or get_time_based_microcompact_config()
    if not resolved_config.enabled or not is_main_thread:
        return None

    last_assistant_timestamp: float | None = None
    for message in reversed(messages):
        if message.get("role") != "assistant":
            continue
        parsed = _parse_timestamp(message.get("timestamp"))
        if parsed is not None:
            last_assistant_timestamp = parsed
            break

    if last_assistant_timestamp is None:
        return None

    import time

    gap_minutes = (time.time() - last_assistant_timestamp) / 60.0
    if gap_minutes < resolved_config.gap_threshold_minutes:
        return None
    return gap_minutes, resolved_config


def microcompact_messages(
    messages: list[dict],
    *,
    is_main_thread: bool = True,
    config: TimeBasedMicrocompactConfig | None = None,
) -> list[dict]:
    """Clear old compactable tool results after a long idle gap.

    Keeps the most recent ``keep_recent`` compactable results intact and
    replaces older ones with TIME_BASED_MC_CLEARED_MESSAGE. Returns the
    original list unchanged when the trigger does not fire or nothing clears.
    """
    trigger = evaluate_time_based_trigger(
        messages,
        is_main_thread=is_main_thread,
        config=config,
    )
    if trigger is None:
        return messages
    gap_minutes, resolved_config = trigger

    compactable_ids = _collect_compactable_tool_message_ids(messages)
    if not compactable_ids:
        return messages

    keep_count = max(1, resolved_config.keep_recent)
    keep_set = set(compactable_ids[-keep_count:])
    clear_set = {tool_use_id for tool_use_id in compactable_ids if tool_use_id not in keep_set}
    if not clear_set:
        return messages

    changed = False
    tokens_saved = 0
    result: list[dict] = []
    for message in messages:
        if message.get("role") != "tool":
            result.append(message)
            continue
        tool_use_id = message.get("tool_call_id")
        if tool_use_id not in clear_set:
            result.append(message)
            continue
        content = message.get("content")
        if not isinstance(content, str) or _is_already_externalized(content):
            result.append(message)
            continue
        tokens_saved += max(1, len(content) // BYTES_PER_TOKEN)
        new_message = dict(message)
        new_message["content"] = TIME_BASED_MC_CLEARED_MESSAGE
        result.append(new_message)
        changed = True

    if changed:
        logger.info(
            "[token optimization] time-based microcompact cleared {} tool messages after {:.1f} minute gap (~{} tokens saved, kept last {})",
            len(clear_set),
            gap_minutes,
            tokens_saved,
            len(keep_set),
        )
    return result if changed else messages


DEFAULT_SNIP_TOKEN_BUDGET = 80_000
# Minimum messages to keep (system + last N) so the model always has
# immediate context even if it was over budget.
SNIP_KEEP_RECENT = 6


@dataclass(frozen=True)
class SnipConfig:
    enabled: bool
    token_budget: int
    keep_recent: int


def get_snip_config() -> SnipConfig:
    """Return the module-default history-snip configuration."""
    return SnipConfig(
        enabled=True,
        token_budget=DEFAULT_SNIP_TOKEN_BUDGET,
        keep_recent=SNIP_KEEP_RECENT,
    )


# CC-identical token estimation constants (from microCompact.ts)
IMAGE_MAX_TOKEN_SIZE = 2000  # CC: images/documents ≈ 2000 tokens
_TOKEN_ESTIMATE_PAD_FACTOR = 4 / 3  # CC: pad estimate by 4/3 to be conservative


def _rough_token_count(text: str) -> int:
    """Rough token estimation matching CC's roughTokenCountEstimation."""
    return max(1, len(text) // BYTES_PER_TOKEN)


def _calculate_tool_result_tokens(block: dict) -> int:
    """CC-identical per-block token calculation (microCompact.ts:137-160)."""
    content = block.get("content")
    if content is None:
        return 0
    if isinstance(content, str):
        return _rough_token_count(content)
    if isinstance(content, list):
        total = 0
        for item in content:
            if isinstance(item, dict):
                btype = item.get("type", "")
                if btype == "text":
                    total += _rough_token_count(item.get("text", ""))
                elif btype in ("image", "document"):
                    total += IMAGE_MAX_TOKEN_SIZE
        return total
    return 0


def _estimate_message_tokens(message: dict) -> int:
    """CC-identical message token estimation (microCompact.ts:164-205).

    Handles all block types: text, tool_result, tool_use, image, document,
    thinking, redacted_thinking. Pads estimate by 4/3 for conservatism.
    """
    content = message.get("content")
    if isinstance(content, str):
        return max(1, len(content) // BYTES_PER_TOKEN)
    if isinstance(content, list):
        total = 0
        for block in content:
            if not isinstance(block, dict):
                continue
            btype = block.get("type", "")
            if btype == "text":
                total += _rough_token_count(block.get("text", ""))
            elif btype == "tool_result":
                total += _calculate_tool_result_tokens(block)
            elif btype in ("image", "document"):
                total += IMAGE_MAX_TOKEN_SIZE
            elif btype == "thinking":
                total += _rough_token_count(block.get("thinking", ""))
            elif btype == "redacted_thinking":
                total += _rough_token_count(block.get("data", ""))
            elif btype == "tool_use":
                total += _rough_token_count(
                    block.get("name", "") + json.dumps(block.get("input", {}))
                )
            else:
                # Fallback for server_tool_use, web_search_tool_result, etc.
                total += _rough_token_count(json.dumps(block))
        # CC: pad estimate by 4/3 to be conservative
        return max(1, int(total * _TOKEN_ESTIMATE_PAD_FACTOR))
    return 1


def snip_messages_to_budget(
    messages: list[dict],
    *,
    config: SnipConfig | None = None,
) -> tuple[list[dict], int]:
    """Drop the oldest non-system messages until total tokens fit within budget.

    Mirrors Claude Code's HISTORY_SNIP: runs before microcompact so that
    urgent over-budget situations are handled first.

    Returns (new_messages, tokens_freed).
    Protected tail (last *keep_recent* messages) and the system message are
    never dropped.
    """
    resolved = config or get_snip_config()
    if not resolved.enabled:
        return messages, 0

    total = sum(_estimate_message_tokens(m) for m in messages)
    if total <= resolved.token_budget:
        return messages, 0

    # Split: system message (always keep), body, protected tail
    system_msgs = [m for m in messages if m.get("role") == "system"]
    non_system = [m for m in messages if m.get("role") != "system"]

    keep_recent = max(1, resolved.keep_recent)
    if len(non_system) <= keep_recent:
        return messages, 0

    tail = non_system[-keep_recent:]
    candidates = non_system[:-keep_recent]  # oldest messages, eligible to drop

    tokens_freed = 0
    kept_candidates: list[dict] = []
    # Drop from oldest first; stop as soon as we're under budget
    for msg in candidates:
        remaining_total = total - tokens_freed
        if remaining_total <= resolved.token_budget:
            kept_candidates.append(msg)
        else:
            tokens_freed += _estimate_message_tokens(msg)

    if tokens_freed == 0:
        return messages, 0

    result = system_msgs + kept_candidates + tail
    logger.info(
        "[token optimization] history snip freed ~{} tokens ({} messages dropped)",
        tokens_freed,
        len(candidates) - len(kept_candidates),
    )
    return result, tokens_freed


# ---------------------------------------------------------------------------
# Opt4 extension: contextCollapse (read/search group folding)
# Mirrors CC's collapseReadSearch.ts — fold consecutive collapsible tool uses
# into compact summary messages.
-# --------------------------------------------------------------------------- - -@dataclass -class CollapsedGroup: - """A group of consecutive collapsible tool-use / tool-result pairs.""" - start_index: int - end_index: int # exclusive - search_count: int - read_file_paths: list[str] - read_count: int - list_count: int - bash_count: int - tokens_before: int - - -def _is_collapsible_message(message: dict, tool_name_map: dict[str, str]) -> bool: - """Return True if this message is part of a collapsible tool-use chain.""" - role = message.get("role") - if role == "tool": - tool_name = get_tool_name_for_message(message, tool_name_map) - normalized = normalize_tool_name(tool_name) - return normalized in ALL_COLLAPSIBLE_TOOLS - if role == "assistant": - # An assistant message that ONLY has tool_calls (no text output) is - # absorbable into a collapse group (CC's "silent assistant" logic). - content = message.get("content") - has_text = isinstance(content, str) and content.strip() - has_tool_calls = bool(message.get("tool_calls")) - return has_tool_calls and not has_text - return False - - -def _extract_read_paths_from_tool_calls(message: dict) -> list[str]: - """Extract file paths from an assistant message's tool_call arguments. - - Looks for common path parameter names (path, file_path, filepath, url) - in the function arguments of tool_calls that target read/fetch tools. 
- """ - paths: list[str] = [] - for tc in message.get("tool_calls") or []: - if not isinstance(tc, dict): - continue - func = tc.get("function", {}) - name = normalize_tool_name(func.get("name")) - if name not in COLLAPSIBLE_READ_TOOLS: - continue - try: - args = json.loads(func.get("arguments", "{}")) - except (json.JSONDecodeError, TypeError): - continue - for key in ("path", "file_path", "filepath", "url"): - val = args.get(key) - if isinstance(val, str) and val: - paths.append(val) - break - return paths - - -def collapse_read_search_groups( - messages: list[dict], - *, - min_group_size: int = 3, -) -> tuple[list[dict], int]: - """Collapse consecutive read/search tool-use sequences into summaries. - - Mirrors CC's ``collapseReadSearchGroups()`` — identifies groups of - consecutive collapsible messages, summarizes each group into a single - compact user message, and returns the reduced list. - - Returns (new_messages, tokens_saved). - """ - tool_name_map = build_tool_name_map(messages) - - # Phase 1: identify collapsible groups - groups: list[CollapsedGroup] = [] - i = 0 - n = len(messages) - while i < n: - # Skip non-collapsible messages - if not _is_collapsible_message(messages[i], tool_name_map): - i += 1 - continue - # Start a new group - start = i - search_count = 0 - read_count = 0 - list_count = 0 - bash_count = 0 - read_paths: list[str] = [] - seen_paths: set[str] = set() - tokens = 0 - - while i < n and _is_collapsible_message(messages[i], tool_name_map): - msg = messages[i] - tokens += _estimate_message_tokens(msg) - if msg.get("role") == "assistant": - # Extract file paths from tool_call arguments - for path in _extract_read_paths_from_tool_calls(msg): - if path not in seen_paths: - seen_paths.add(path) - read_paths.append(path) - if msg.get("role") == "tool": - tool_name = normalize_tool_name( - get_tool_name_for_message(msg, tool_name_map) - ) - if tool_name in COLLAPSIBLE_SEARCH_TOOLS: - search_count += 1 - if tool_name in COLLAPSIBLE_READ_TOOLS: - 
read_count += 1 - if tool_name in COLLAPSIBLE_LIST_TOOLS: - list_count += 1 - if tool_name in ("shell", "bash"): - bash_count += 1 - i += 1 - - group_size = i - start - if group_size >= min_group_size: - groups.append(CollapsedGroup( - start_index=start, - end_index=i, - search_count=search_count, - read_file_paths=read_paths, - read_count=read_count, - list_count=list_count, - bash_count=bash_count, - tokens_before=tokens, - )) - - if not groups: - return messages, 0 - - # Phase 2: build collapsed messages - tokens_saved = 0 - result: list[dict] = [] - prev_end = 0 - - for group in groups: - # Keep messages before this group - result.extend(messages[prev_end:group.start_index]) - - # Build summary text (matches CC's createCollapsedGroup output) - parts: list[str] = [] - if group.search_count: - parts.append(f"searched {group.search_count} pattern{'s' if group.search_count > 1 else ''}") - if group.read_count: - parts.append(f"read {group.read_count} file{'s' if group.read_count > 1 else ''}") - if group.list_count: - parts.append(f"listed {group.list_count} dir{'s' if group.list_count > 1 else ''}") - if group.bash_count: - parts.append(f"ran {group.bash_count} command{'s' if group.bash_count > 1 else ''}") - - summary = ", ".join(parts) if parts else "performed tool operations" - - # Include file paths for context - file_lines = "" - if group.read_file_paths: - paths = group.read_file_paths[:8] - file_lines = "\nFiles: " + ", ".join(paths) - if len(group.read_file_paths) > 8: - file_lines += f" (+{len(group.read_file_paths) - 8} more)" - - collapsed_content = f"[Collapsed exploration: {summary}{file_lines}]" - collapsed_msg = { - "role": "assistant", - "content": collapsed_content, - "_collapsed": True, - "_collapsed_message_count": group.end_index - group.start_index, - } - result.append(collapsed_msg) - - collapsed_tokens = _estimate_message_tokens(collapsed_msg) - tokens_saved += group.tokens_before - collapsed_tokens - prev_end = group.end_index - - 
result.extend(messages[prev_end:]) - - if tokens_saved > 0: - logger.info( - "[token optimization] contextCollapse folded {} group(s), ~{} tokens saved", - len(groups), - tokens_saved, - ) - return result, tokens_saved - - -# --------------------------------------------------------------------------- -# Opt4 extension: autocompact (CC-identical LLM-based summarization) -# Mirrors CC's autoCompactIfNeeded() + compactConversation() — -# when context exceeds budget after all other optimizations, call an LLM -# to generate a structured summary of the older conversation. -# --------------------------------------------------------------------------- - -AUTOCOMPACT_TOKEN_BUDGET = 100_000 # trigger threshold -AUTOCOMPACT_KEEP_RECENT = 8 # messages to preserve verbatim -AUTOCOMPACT_MAX_OUTPUT_TOKENS = 20_000 # CC: MAX_OUTPUT_TOKENS_FOR_SUMMARY -MAX_CONSECUTIVE_AUTOCOMPACT_FAILURES = 3 # CC circuit breaker - -# CC-identical compact prompt (from compact/prompt.ts) -_AUTOCOMPACT_SYSTEM_PROMPT = ( - "You are a helpful AI assistant tasked with summarizing conversations." -) - -_AUTOCOMPACT_NO_TOOLS_PREAMBLE = """CRITICAL: Respond with TEXT ONLY. Do NOT call any tools. - -- Do NOT use Read, Bash, Grep, Glob, Edit, Write, or ANY other tool. -- You already have all the context you need in the conversation above. -- Tool calls will be REJECTED and will waste your only turn — you will fail the task. -- Your entire response must be plain text: an block followed by a block. - -""" - -_AUTOCOMPACT_DETAILED_ANALYSIS = """Before providing your final summary, wrap your analysis in tags to organize your thoughts and ensure you've covered all necessary points. In your analysis process: - -1. Chronologically analyze each message and section of the conversation. 
For each section thoroughly identify: - - The user's explicit requests and intents - - Your approach to addressing the user's requests - - Key decisions, technical concepts and code patterns - - Specific details like: - - file names - - full code snippets - - function signatures - - file edits - - Errors that you ran into and how you fixed them - - Pay special attention to specific user feedback that you received, especially if the user told you to do something differently. -2. Double-check for technical accuracy and completeness, addressing each required element thoroughly.""" - -_AUTOCOMPACT_USER_PROMPT = ( - _AUTOCOMPACT_NO_TOOLS_PREAMBLE - + """Your task is to create a detailed summary of the conversation so far, paying close attention to the user's explicit requests and your previous actions. -This summary should be thorough in capturing technical details, code patterns, and architectural decisions that would be essential for continuing development work without losing context. - -""" - + _AUTOCOMPACT_DETAILED_ANALYSIS - + """ - -Your summary should include the following sections: - -1. Primary Request and Intent: Capture all of the user's explicit requests and intents in detail -2. Key Technical Concepts: List all important technical concepts, technologies, and frameworks discussed. -3. Files and Code Sections: Enumerate specific files and code sections examined, modified, or created. Pay special attention to the most recent messages and include full code snippets where applicable and include a summary of why this file read or edit is important. -4. Errors and fixes: List all errors that you ran into, and how you fixed them. Pay special attention to specific user feedback that you received, especially if the user told you to do something differently. -5. Problem Solving: Document problems solved and any ongoing troubleshooting efforts. -6. All user messages: List ALL user messages that are not tool results. 
These are critical for understanding the users' feedback and changing intent. -7. Pending Tasks: Outline any pending tasks that you have explicitly been asked to work on. -8. Current Work: Describe in detail precisely what was being worked on immediately before this summary request, paying special attention to the most recent messages from both user and assistant. Include file names and code snippets where applicable. -9. Optional Next Step: List the next step that you will take that is related to the most recent work you were doing. IMPORTANT: ensure that this step is DIRECTLY in line with the user's most recent explicit requests, and the task you were working on immediately before this summary request. If your last task was concluded, then only list next steps if they are explicitly in line with the users request. Do not start on tangential requests or really old requests that were already completed without confirming with the user first. - If there is a next step, include direct quotes from the most recent conversation showing exactly what task you were working on and where you left off. This should be verbatim to ensure there's no drift in task interpretation. - -Please provide your summary based on the conversation so far, following this structure and ensuring precision and thoroughness in your response. - -REMINDER: Do NOT call any tools. Respond with plain text only — an block followed by a block. Tool calls will be rejected and you will fail the task.""" -) - -# CC-identical post-compact user message wrapper (from compact/prompt.ts) -_AUTOCOMPACT_SUMMARY_WRAPPER = """This session is being continued from a previous conversation that ran out of context. The summary below covers the earlier portion of the conversation. - -{summary} -Continue the conversation from where it left off without asking the user any further questions. Resume directly — do not acknowledge the summary, do not recap what was happening, do not preface with "I'll continue" or similar. 
Pick up the last task as if the break never happened.""" - - -@dataclass -class AutocompactTrackingState: - """CC-identical tracking state for autocompact circuit breaker.""" - compacted: bool = False - consecutive_failures: int = 0 - - -def _format_summary(raw_response: str) -> str: - """Extract block from LLM response, or use full text.""" - import re - match = re.search(r"(.*?)", raw_response, re.DOTALL) - if match: - return match.group(1).strip() - return raw_response.strip() - - -def should_autocompact( - messages: list[dict], - *, - token_budget: int = AUTOCOMPACT_TOKEN_BUDGET, - query_source: str | None = None, -) -> bool: - """CC-identical predicate: should autocompact fire? - - Guards: - - Recursion: ``compact`` and ``session_memory`` sources are rejected - - Budget: total tokens must exceed ``token_budget`` - """ - # Recursion guard (CC: querySource === 'session_memory' || 'compact') - if query_source in ("compact", "session_memory", "agent_summary"): - return False - total = sum(_estimate_message_tokens(m) for m in messages) - return total > token_budget - - -async def autocompact_messages( - messages: list[dict], - *, - model: str | None = None, - token_budget: int = AUTOCOMPACT_TOKEN_BUDGET, - keep_recent: int = AUTOCOMPACT_KEEP_RECENT, - tracking: AutocompactTrackingState | None = None, - query_source: str | None = None, - transcript_path: str | None = None, -) -> tuple[list[dict], int, AutocompactTrackingState]: - """CC-identical LLM-based autocompact. - - Mirrors CC's ``autoCompactIfNeeded()`` + ``compactConversation()``: - 1. Check budget threshold - 2. Circuit-breaker check (consecutive failures) - 3. Strip images, build compact prompt - 4. Call LLM with CC's exact prompt template - 5. Wrap summary in CC's post-compact user message format - 6. Return [system, summary_msg, ...preserved_tail], tokens_freed, tracking - - Falls back to heuristic summary if no model/LLM available. 
- """ - tracking = tracking or AutocompactTrackingState() - - # Budget check - if not should_autocompact(messages, token_budget=token_budget, query_source=query_source): - return messages, 0, tracking - - # Circuit breaker (CC: MAX_CONSECUTIVE_AUTOCOMPACT_FAILURES = 3) - if tracking.consecutive_failures >= MAX_CONSECUTIVE_AUTOCOMPACT_FAILURES: - logger.warning( - "[token optimization] autocompact circuit breaker tripped after {} consecutive failures", - tracking.consecutive_failures, - ) - return messages, 0, tracking - - system_msgs = [m for m in messages if m.get("role") == "system"] - non_system = [m for m in messages if m.get("role") != "system"] - - if len(non_system) <= keep_recent: - return messages, 0, tracking - - tail = non_system[-keep_recent:] - to_summarize = non_system[:-keep_recent] - - if not to_summarize: - return messages, 0, tracking - - tokens_before = sum(_estimate_message_tokens(m) for m in to_summarize) - - # ---- LLM-based summarization (CC path) ---- - summary_text: str | None = None - if model: - try: - # Build messages for compact LLM call (CC: streamCompactSummary) - # Strip images from messages to compress - compact_messages: list[dict] = [] - for msg in to_summarize: - role = msg.get("role", "user") - content = msg.get("content", "") - if isinstance(content, list): - # Strip image/document blocks → replace with [image]/[document] - text_parts = [] - for block in content: - if isinstance(block, dict): - btype = block.get("type", "") - if btype == "text": - text_parts.append(block.get("text", "")) - elif btype in ("image", "document"): - text_parts.append(f"[{btype}]") - else: - text_parts.append(block.get("text", "") or block.get("content", "")) - content = "\n".join(text_parts) - if not isinstance(content, str): - content = str(content) - if role == "tool": - # Embed tool results as assistant message for compact LLM - tool_name = msg.get("tool_name", "tool") - compact_messages.append({ - "role": "assistant", - "content": f"[Tool result 
({tool_name})]\n{content[:10_000]}", - }) - elif role in ("user", "assistant"): - compact_messages.append({"role": role, "content": content}) - elif role == "compression": - compact_messages.append({"role": "user", "content": content}) - - # Add the compact request as final user message - compact_messages.append({ - "role": "user", - "content": _AUTOCOMPACT_USER_PROMPT, - }) - - # Call LLM (CC: queryModelWithStreaming with querySource='compact') - from pantheon.utils.llm import acompletion_litellm - response = await acompletion_litellm( - model=model, - messages=[ - {"role": "system", "content": _AUTOCOMPACT_SYSTEM_PROMPT}, - *compact_messages, - ], - max_tokens=min(AUTOCOMPACT_MAX_OUTPUT_TOKENS, 20_000), - temperature=0, - stream=False, - ) - raw_summary = "" - if isinstance(response, dict): - raw_summary = response.get("content", "") - elif hasattr(response, "choices") and response.choices: - raw_summary = response.choices[0].message.content or "" - else: - raw_summary = str(response) - summary_text = _format_summary(raw_summary) - - except Exception as e: - logger.error("[token optimization] autocompact LLM call failed: {}", e) - tracking.consecutive_failures += 1 - # Fall through to heuristic fallback - - # ---- Heuristic fallback (when no model or LLM failed) ---- - if not summary_text: - summary_parts: list[str] = [] - for msg in to_summarize: - role = msg.get("role", "unknown") - content = msg.get("content", "") - if isinstance(content, list): - content = " ".join( - b.get("text", "") for b in content if isinstance(b, dict) - ) - if not isinstance(content, str): - content = str(content) - trimmed = content[:300].strip() - if trimmed: - summary_parts.append(f"[{role}] {trimmed}") - summary_text = "\n".join(summary_parts[:30]) - if len(summary_parts) > 30: - summary_text += f"\n... 
(+{len(summary_parts) - 30} more messages)" - - # ---- Build post-compact messages (CC: buildPostCompactMessages) ---- - wrapper = _AUTOCOMPACT_SUMMARY_WRAPPER.format(summary=summary_text) - if transcript_path: - wrapper += ( - f"\n\nIf you need specific details from before compaction " - f"(like exact code snippets, error messages, or content you " - f"generated), read the full transcript at: {transcript_path}" - ) - - compact_msg = { - "role": "user", - "content": wrapper, - "_autocompacted": True, - } - - tokens_after = _estimate_message_tokens(compact_msg) - tokens_freed = tokens_before - tokens_after - - result = system_msgs + [compact_msg] + tail - - # Reset failure count on success (CC: consecutiveFailures: 0) - tracking.consecutive_failures = 0 - tracking.compacted = True - - logger.info( - "[token optimization] autocompact summarized {} messages via {} (~{} tokens freed)", - len(to_summarize), - "LLM" if model else "heuristic", - tokens_freed, - ) - return result, tokens_freed, tracking - - -def apply_token_optimizations( - messages: list[dict], - memory: Any | None = None, - base_dir: Path | None = None, - *, - is_main_thread: bool = True, - snip_config: SnipConfig | None = None, - enable_context_collapse: bool = True, - enable_autocompact: bool = True, - query_source: str | None = None, -) -> list[dict]: - """Synchronous 4-stage optimization pipeline. - - For the full 5-stage pipeline including LLM-based autocompact, - use :func:`apply_token_optimizations_async`. - """ - # 1. Externalize large tool outputs (session-level budget) - optimized = apply_tool_result_budget( - messages, - memory=memory, - base_dir=base_dir, - query_source=query_source, - ) - # 2. Snip over-budget history (HISTORY_SNIP) — before microcompact - optimized, _ = snip_messages_to_budget(optimized, config=snip_config) - # 3. Clear old compactable tool results (time-based microcompact) - optimized = microcompact_messages( - optimized, - is_main_thread=is_main_thread, - ) - # 4. 
Context Collapse: fold consecutive read/search groups (CC-style) - if enable_context_collapse: - optimized, _ = collapse_read_search_groups(optimized) - # Note: autocompact (stage 5) is async — use apply_token_optimizations_async - return optimized - - -async def apply_token_optimizations_async( - messages: list[dict], - memory: Any | None = None, - base_dir: Path | None = None, - *, - is_main_thread: bool = True, - snip_config: SnipConfig | None = None, - enable_context_collapse: bool = True, - enable_autocompact: bool = True, - query_source: str | None = None, - autocompact_model: str | None = None, - autocompact_tracking: AutocompactTrackingState | None = None, - transcript_path: str | None = None, -) -> tuple[list[dict], AutocompactTrackingState | None]: - """Full 5-stage CC-identical optimization pipeline (async). - - Stages: - 1. Tool result budget (externalize large outputs) - 2. HISTORY_SNIP (token-budget truncation) - 3. Microcompact (time-based clearing) - 4. contextCollapse (read/search folding) - 5. Autocompact (LLM-based summarization — CC-identical) - - Returns (optimized_messages, tracking_state). 
- """ - # Stages 1-4 (sync) - optimized = apply_token_optimizations( - messages, - memory=memory, - base_dir=base_dir, - is_main_thread=is_main_thread, - snip_config=snip_config, - enable_context_collapse=enable_context_collapse, - enable_autocompact=False, # handled below - query_source=query_source, - ) - # Stage 5: Autocompact (async, LLM-based) - tracking = autocompact_tracking - if enable_autocompact: - optimized, _, tracking = await autocompact_messages( - optimized, - model=autocompact_model, - query_source=query_source, - tracking=tracking, - transcript_path=transcript_path, - ) - return optimized, tracking - - -def project_memory_messages_for_llm(messages: list[dict]) -> list[dict]: - """Project stored history into the LLM-facing view.""" - from copy import deepcopy - - filtered = [message for message in messages if message.get("role") != "system"] - - last_compression_idx = -1 - for index, message in enumerate(filtered): - if message.get("role") == "compression": - last_compression_idx = index - - if last_compression_idx >= 0: - filtered = filtered[last_compression_idx:] - - result: list[dict] = [] - for message in filtered: - msg = deepcopy(message) - if msg.get("role") == "compression": - msg["role"] = "user" - if not isinstance(msg.get("content"), str): - msg["content"] = str(msg.get("content", "")) - result.append(msg) - return result - - -def _prepare_llm_view_messages( - messages: list[dict], -) -> tuple[dict | None, list[dict]]: - """Shared projection logic for build_llm_view / build_llm_view_async. - - Returns (system_message_or_None, projected_non_system_messages). 
- """ - system_message = next( - (message for message in messages if message.get("role") == "system"), - None, - ) - non_system_messages = [ - message for message in messages if message.get("role") != "system" - ] - projected = project_memory_messages_for_llm(non_system_messages) - projected = [ - m for m in projected - if m.get("role") in ("user", "assistant", "tool") - ] - return system_message, projected - - -def _wrap_with_system( - system_message: dict | None, - optimized: list[dict], -) -> list[dict]: - if system_message is not None: - return [system_message, *optimized] - return optimized - - -def build_llm_view( - messages: list[dict], - memory: Any | None = None, - base_dir: Path | None = None, - *, - is_main_thread: bool = True, - snip_config: "SnipConfig | None" = None, -) -> list[dict]: - """Build the projected prompt view from raw history (sync, no autocompact).""" - if not messages: - return [] - system_message, projected = _prepare_llm_view_messages(messages) - optimized = apply_token_optimizations( - projected, - memory=memory, - base_dir=base_dir, - is_main_thread=is_main_thread, - snip_config=snip_config, - ) - return _wrap_with_system(system_message, optimized) - - -async def build_llm_view_async( - messages: list[dict], - memory: Any | None = None, - base_dir: Path | None = None, - *, - is_main_thread: bool = True, - snip_config: "SnipConfig | None" = None, - autocompact_model: str | None = None, -) -> list[dict]: - """Async variant of build_llm_view that enables LLM-based autocompact.""" - if not messages: - return [] - system_message, projected = _prepare_llm_view_messages(messages) - optimized, _ = await apply_token_optimizations_async( - projected, - memory=memory, - base_dir=base_dir, - is_main_thread=is_main_thread, - snip_config=snip_config, - autocompact_model=autocompact_model, - ) - return _wrap_with_system(system_message, optimized) - - -def stabilize_tool_definitions(tools: list[dict]) -> list[dict]: - """Return deterministic tool 
definitions for cache-stable prompts.""" - - def normalize(value: Any) -> Any: - if isinstance(value, dict): - normalized = {key: normalize(value[key]) for key in sorted(value)} - required = normalized.get("required") - if isinstance(required, list) and all( - isinstance(item, str) for item in required - ): - normalized["required"] = sorted(required) - return normalized - if isinstance(value, list): - return [normalize(item) for item in value] - return value - - stabilized = [normalize(tool) for tool in tools] - stabilized.sort( - key=lambda tool: ( - str(tool.get("function", {}).get("name", "")), - json.dumps(tool, ensure_ascii=False, sort_keys=True), - ) - ) - return stabilized - - -def extract_persisted_file_paths(messages: list[dict]) -> list[str]: - pattern = re.compile(r"Full output saved to:\s*(.+)") - paths: list[str] = [] - seen: set[str] = set() - for message in messages: - content = message.get("content") - if not isinstance(content, str): - continue - for match in pattern.findall(content): - path = match.strip() - if path and path not in seen: - seen.add(path) - paths.append(path) - return paths - - -def build_recent_context_block( - messages: list[dict], - max_messages: int = 6, - max_chars_per_message: int = 1200, -) -> str: - relevant = [ - message - for message in messages - if message.get("role") in {"user", "assistant", "tool"} - ] - if not relevant: - return "" - - def trim(text: str) -> str: - if len(text) <= max_chars_per_message: - return text - return text[:max_chars_per_message] + "\n...[truncated recent context]..." - - blocks: list[str] = [] - for message in relevant[-max_messages:]: - content = message.get("content") - if not isinstance(content, str) or not content.strip(): - continue - role = str(message.get("role", "unknown")).upper() - blocks.append(f"[{role}]\n{trim(content.strip())}") - return "\n\n".join(blocks) - - -ON_DEMAND_HINT = ( - "Note: Only a summary and recent context are provided above. 
" - "If you need the full content of any referenced file or tool output, " - "use read_file or the appropriate tool to retrieve it on demand." -) - - -def build_delegation_context_message( - history: list[dict], - instruction: str, - summary_text: str | None = None, -) -> str: - """Build a compact delegation prompt from summary + recent context + file refs. - - The *history* passed here should already be trimmed to a recent tail by the - caller (``create_delegation_task_message``). Older context is represented - by *summary_text*. - """ - projected = build_llm_view(history, is_main_thread=False) - parts: list[str] = [] - if summary_text: - parts.append(f"Context Summary:\n{summary_text}") - - recent_context = build_recent_context_block(projected) - if recent_context: - parts.append(f"Recent Context:\n{recent_context}") - - file_paths = extract_persisted_file_paths(projected) - if file_paths: - parts.append( - "Referenced Files (retrieve on demand if needed):\n" - + "\n".join(f"- {path}" for path in file_paths[:10]) - ) - - parts.append(f"Task: {instruction}") - - # Append on-demand retrieval hint when summary was used - if summary_text: - parts.append(ON_DEMAND_HINT) - - return "\n\n".join(parts) - - -def estimate_total_tokens_from_chars(messages: list[dict]) -> int: - total_chars = 0 - for message in messages: - content = message.get("content") - if isinstance(content, str): - total_chars += len(content) - return int(total_chars / BYTES_PER_TOKEN) - - -# --------------------------------------------------------------------------- -# Opt3: Prompt cache control markers (Anthropic API) -# --------------------------------------------------------------------------- - -_ANTHROPIC_MODEL_PREFIXES = ("claude", "anthropic/", "custom_anthropic/") - - -def is_anthropic_model(model: str) -> bool: - """Return True if *model* routes to the Anthropic API via litellm.""" - lower = model.lower() - return any(lower.startswith(p) for p in _ANTHROPIC_MODEL_PREFIXES) - - -def 
_make_text_block(text: str) -> dict[str, Any]: - return {"type": "text", "text": text} - - -def _last_text_block_index(blocks: list[dict]) -> int | None: - """Return the index of the last block whose type is 'text', or None.""" - for i in range(len(blocks) - 1, -1, -1): - if blocks[i].get("type") == "text": - return i - return None - - -def _ensure_block_content(message: dict) -> list[dict]: - """Return message content as a list of blocks, converting str if needed.""" - content = message.get("content") - if isinstance(content, str): - return [_make_text_block(content)] - if isinstance(content, list): - return list(content) - return [] - - -def inject_cache_control_markers( - messages: list[dict], - *, - skip_cache_write: bool = False, -) -> list[dict]: - """Inject Anthropic prompt-cache markers into a message list. - - Mirrors Claude Code's ``addCacheBreakpoints()`` strategy: - - System message: mark the last text block with cache_control. - - Conversation: mark the last text block of the last (or - second-to-last when *skip_cache_write*) user/assistant message - that has non-empty text content. - - *skip_cache_write* is used for fire-and-forget / fork queries: - the last message is a short delegation directive whose prefix will - never be reused, so placing the cache breakpoint one message earlier - preserves cache for the parent conversation prefix. - - Returns a *new* list; input messages are not mutated. - - Requires litellm >= 1.34.0 for cache_control pass-through to Anthropic. 
- """ - # Safety check: litellm must support cache_control field pass-through - try: - from importlib.metadata import version as pkg_version - litellm_version = pkg_version("litellm") - major, minor = (int(x) for x in litellm_version.split(".")[:2]) - if (major, minor) < (1, 34): - logger.debug( - "litellm {} < 1.34 — skipping cache_control injection", - litellm_version, - ) - return messages - except Exception: - pass # can't determine version, proceed optimistically - - from copy import deepcopy - - result = deepcopy(messages) - cache_marker: dict[str, Any] = {"type": "ephemeral"} - - # 1. Mark last text block of system message - for msg in result: - if msg.get("role") == "system": - blocks = _ensure_block_content(msg) - idx = _last_text_block_index(blocks) - if idx is not None: - blocks[idx] = {**blocks[idx], "cache_control": cache_marker} - msg["content"] = blocks - break - - # 2. Mark last text block of the Nth-from-last user/assistant message - # Normal: last message. skip_cache_write: second-to-last. 
- hits_needed = 2 if skip_cache_write else 1 - hits = 0 - for msg in reversed(result): - role = msg.get("role") - if role not in ("user", "assistant"): - continue - hits += 1 - if hits < hits_needed: - continue - blocks = _ensure_block_content(msg) - idx = _last_text_block_index(blocks) - if idx is not None and blocks[idx].get("text", "").strip(): - blocks[idx] = {**blocks[idx], "cache_control": cache_marker} - msg["content"] = blocks - break - - return result diff --git a/pantheon/utils/truncate.py b/pantheon/utils/truncate.py index f6cf7165..de1795a4 100644 --- a/pantheon/utils/truncate.py +++ b/pantheon/utils/truncate.py @@ -30,8 +30,8 @@ - Handles edge cases (circular refs, non-serializable objects) Configuration: - - max_tool_content_length: 50K (global fallback; per-tool thresholds take priority) - - max_file_read_chars: 50K-100K (read_file internal limit) + - max_tool_content_length: 10K (LLM context limit) + - max_file_read_chars: 50K (read_file internal limit) - Recursion depth: 2 layers (covers 99% cases) """ @@ -40,24 +40,6 @@ from typing import Any -def _format_file_size(num_bytes: int) -> str: - """Format byte count as human-readable size string.""" - value = float(num_bytes) - for unit in ["B", "KB", "MB", "GB"]: - if value < 1024 or unit == "GB": - if unit == "B": - return f"{int(value)}{unit}" - return f"{value:.1f}{unit}" - value /= 1024 - return f"{num_bytes}B" - - -# Unified externalization markers — shared with token_optimization.py -PERSISTED_OUTPUT_TAG = "" -PERSISTED_OUTPUT_CLOSING_TAG = "" -PREVIEW_SIZE_BYTES = 2000 - - def truncate_string(content: str, max_length: int) -> str: """Truncate string preserving head and tail with info. @@ -96,21 +78,24 @@ def _format_truncated_message( filepath: Path, preview_size: int | None = None, ) -> str: - """Format truncated content message using unified format. 
- - This format is recognized by token_optimization.py's _is_already_externalized() - so that the LLM-view pipeline can correctly detect already-externalized content. + """Format truncated content message. + + Args: + preview: Preview content to show + total_size: Total size of original content + filepath: Path where full content is saved + preview_size: Optional preview size (if None, uses len(preview)) + + Returns: + Formatted message with preview """ if preview_size is None: preview_size = len(preview) - + return ( - f"{PERSISTED_OUTPUT_TAG}\n" - f"Output too large ({_format_file_size(total_size)}). " - f"Full output saved to: {filepath}\n\n" - f"Preview (first {_format_file_size(preview_size)}):\n" - f"{preview}\n" - f"{PERSISTED_OUTPUT_CLOSING_TAG}" + f"[truncated {preview_size:,}/{total_size:,} chars]\n" + f"Full content saved to: {filepath}\n\n" + f"{preview}" ) @@ -203,7 +188,7 @@ def _truncate_non_dict( with open(filepath, 'w', encoding='utf-8') as f: f.write(content) - preview_size = min(PREVIEW_SIZE_BYTES, max_length // 2) + preview_size = min(2000, max_length // 2) preview = content[:preview_size] return _format_truncated_message( preview=preview, @@ -224,16 +209,14 @@ def _truncate_json_path( temp_dir: str, ) -> str: """Handle JSON tools truncation. - + Special handling for tools with 'truncated' field: - Skip base64 filtering (already processed by tool) - - Length limits are ALWAYS applied (per-tool thresholds are the - primary control; Layer 1's truncated flag only means the tool - did its own pre-processing, not that no further limits apply). 
+ - Skip length limits (trust tool's truncation) """ # Check if tool already handled truncation has_truncated_field = 'truncated' in result - + # Step 1: Base64 filter (skip for tools with truncated field) if filter_base64 and not has_truncated_field: try: @@ -245,7 +228,7 @@ def _truncate_json_path( except Exception as e: from pantheon.utils.log import logger logger.warning(f"Skipping base64 filter due to error: {e}") - + # Step 2: Format to JSON try: formatted = json.dumps(result, ensure_ascii=False) @@ -253,11 +236,11 @@ def _truncate_json_path( from pantheon.utils.log import logger logger.warning(f"JSON serialization failed: {e}, using repr") formatted = repr(result) - - # Step 3: Length check — always applied regardless of truncated field - if len(formatted) <= max_length: + + # Step 3: Length check (skip for tools with truncated field) + if has_truncated_field or len(formatted) <= max_length: return formatted - + # Step 4: Save and generate preview return _save_and_preview_json(result, formatted, max_length, temp_dir) diff --git a/scripts/benchmark_prompt_cache.py b/scripts/benchmark_prompt_cache.py deleted file mode 100644 index 3b7ebc58..00000000 --- a/scripts/benchmark_prompt_cache.py +++ /dev/null @@ -1,385 +0,0 @@ -from __future__ import annotations - -import argparse -import asyncio -import copy -import json -import os -from dataclasses import asdict, dataclass -from typing import Any - -from pantheon.agent import Agent, AgentRunContext -from pantheon.internal.memory import Memory -from pantheon.team.pantheon import ( - _get_cache_safe_child_fork_context_messages, - _get_cache_safe_child_run_overrides, -) -from pantheon.utils.llm import count_tokens_in_messages, process_messages_for_model -from pantheon.utils.token_optimization import ( - build_cache_safe_runtime_params, - build_delegation_context_message, - build_llm_view, -) - - -@dataclass -class LiveUsageMetrics: - prompt_tokens: int - cached_tokens: int - uncached_prompt_tokens: int - 
cache_hit_rate: float - - -def _build_agent(name: str, instructions: str, model: str) -> Agent: - agent = Agent( - name=name, - instructions=instructions, - model=model, - model_params={"temperature": 0.7}, - ) - - def alpha_tool(path: str) -> str: - return path - - def beta_tool(query: str) -> str: - return query - - agent.tool(beta_tool) - agent.tool(alpha_tool) - return agent - - -async def build_benchmark_state(model: str, prefix_repeat: int) -> dict[str, Any]: - instructions = "You are a software engineering agent." - caller = _build_agent("caller", instructions, model) - target = _build_agent("target", instructions, model) - - history = [ - {"role": "system", "content": instructions}, - { - "role": "user", - "content": "Production outage investigation context. " * prefix_repeat, - }, - {"role": "assistant", "content": "I will inspect logs and code."}, - { - "role": "tool", - "tool_call_id": "tool-1", - "tool_name": "shell", - "content": "ERROR " * 5000, - }, - { - "role": "assistant", - "content": "The failure involves the cache layer and delegation path.", - }, - ] - - parent_messages = build_llm_view( - history, - memory=Memory("prompt-cache-benchmark-parent"), - is_main_thread=True, - ) - parent_tools = await caller.get_tools_for_llm() - parent_runtime = build_cache_safe_runtime_params( - model=model, - model_params={"temperature": 0, "top_p": 1}, - response_format=None, - ) - parent_processed = process_messages_for_model( - copy.deepcopy(parent_messages), - model, - ) - run_context = AgentRunContext( - agent=caller, - memory=None, - execution_context_id=None, - process_step_message=None, - process_chunk=None, - cache_safe_runtime_params=parent_runtime, - cache_safe_prompt_messages=parent_processed, - cache_safe_tool_definitions=parent_tools, - ) - - task_message = build_delegation_context_message( - history=history, - instruction="Analyze the cache issue and propose a fix.", - summary_text=( - "Parent found a likely cache-layer bug and wants a focused " - 
"root-cause analysis." - ), - ) - - before_child_messages = [ - {"role": "system", "content": instructions}, - {"role": "user", "content": task_message}, - ] - before_runtime = build_cache_safe_runtime_params( - model=target.models[0], - model_params=target.model_params, - response_format=target.response_format, - ) - before_processed = process_messages_for_model( - copy.deepcopy(before_child_messages), - model, - ) - - child_run_overrides, child_context_variables = _get_cache_safe_child_run_overrides( - run_context, - target, - {}, - ) - fork_context_messages = await _get_cache_safe_child_fork_context_messages( - run_context, - target, - ) - - after_child_messages = [ - {"role": "system", "content": instructions}, - *(fork_context_messages or []), - {"role": "user", "content": task_message}, - ] - after_runtime = build_cache_safe_runtime_params( - model=child_run_overrides.get("model", target.models[0]), - model_params=child_context_variables.get( - "model_params", target.model_params - ), - response_format=child_run_overrides.get( - "response_format", target.response_format - ), - ) - after_processed = process_messages_for_model( - copy.deepcopy(after_child_messages), - model, - ) - - return { - "instructions": instructions, - "parent_messages": parent_processed, - "before_child_messages": before_processed, - "after_child_messages": after_processed, - "parent_tools": parent_tools, - "target_tools": await target.get_tools_for_llm(), - "parent_runtime": parent_runtime, - "before_runtime": before_runtime, - "after_runtime": after_runtime, - "child_context_variables": child_context_variables, - "fork_context_messages": fork_context_messages or [], - "compatibility": { - "same_instructions": target.instructions == caller.instructions, - "same_model_chain": list(target.models) == list(caller.models), - "same_tools": parent_tools == await target.get_tools_for_llm(), - "same_response_format": target.response_format == caller.response_format, - }, - } - - -def 
build_structural_metrics(state: dict[str, Any], model: str) -> dict[str, Any]: - parent_messages = state["parent_messages"] - before_child_messages = state["before_child_messages"] - after_child_messages = state["after_child_messages"] - target_tools = state["target_tools"] - - before_prefix_hit = ( - before_child_messages[: len(parent_messages)] == parent_messages - ) - after_prefix_hit = after_child_messages[: len(parent_messages)] == parent_messages - - before_tokens = count_tokens_in_messages( - before_child_messages, - model, - target_tools, - )["total"] - after_tokens = count_tokens_in_messages( - after_child_messages, - model, - target_tools, - )["total"] - - return { - "cache_prefix_hit_before": before_prefix_hit, - "cache_prefix_hit_after": after_prefix_hit, - "child_prompt_tokens_before": before_tokens, - "child_prompt_tokens_after": after_tokens, - "child_prompt_token_delta": after_tokens - before_tokens, - "fork_context_message_count": len(state["fork_context_messages"]), - } - - -def run_live_request( - client: Any, - *, - model: str, - messages: list[dict], - max_completion_tokens: int, -) -> LiveUsageMetrics: - resp = client.chat.completions.create( - model=model, - messages=messages, - max_completion_tokens=max_completion_tokens, - ) - usage = resp.usage - prompt_tokens = int(usage.prompt_tokens or 0) - prompt_details = getattr(usage, "prompt_tokens_details", None) - cached_tokens = int(getattr(prompt_details, "cached_tokens", 0) or 0) - uncached_prompt_tokens = prompt_tokens - cached_tokens - cache_hit_rate = round( - cached_tokens / prompt_tokens * 100, - 2, - ) if prompt_tokens else 0.0 - return LiveUsageMetrics( - prompt_tokens=prompt_tokens, - cached_tokens=cached_tokens, - uncached_prompt_tokens=uncached_prompt_tokens, - cache_hit_rate=cache_hit_rate, - ) - - -def sanitize_messages_for_live_chat(messages: list[dict]) -> list[dict]: - sanitized: list[dict] = [] - for message in messages: - role = message.get("role") - if role not in {"system", 
"user", "assistant"}: - continue - content = message.get("content") - if not isinstance(content, str): - continue - sanitized.append({"role": role, "content": content}) - return sanitized - - -def build_live_metrics( - *, - model: str, - parent_messages: list[dict], - before_child_messages: list[dict], - after_child_messages: list[dict], - max_completion_tokens: int, -) -> dict[str, Any]: - from openai import OpenAI - - client = OpenAI() - parent_messages = sanitize_messages_for_live_chat(parent_messages) - before_child_messages = sanitize_messages_for_live_chat(before_child_messages) - after_child_messages = sanitize_messages_for_live_chat(after_child_messages) - - parent_before = run_live_request( - client, - model=model, - messages=parent_messages, - max_completion_tokens=max_completion_tokens, - ) - child_before = run_live_request( - client, - model=model, - messages=before_child_messages, - max_completion_tokens=max_completion_tokens, - ) - parent_after = run_live_request( - client, - model=model, - messages=parent_messages, - max_completion_tokens=max_completion_tokens, - ) - child_after = run_live_request( - client, - model=model, - messages=after_child_messages, - max_completion_tokens=max_completion_tokens, - ) - - return { - "parent_before": asdict(parent_before), - "child_before": asdict(child_before), - "parent_after": asdict(parent_after), - "child_after": asdict(child_after), - "cache_hit_rate_before_pct": child_before.cache_hit_rate, - "cache_hit_rate_after_pct": child_after.cache_hit_rate, - "uncached_prompt_tokens_before": child_before.uncached_prompt_tokens, - "uncached_prompt_tokens_after": child_after.uncached_prompt_tokens, - "uncached_prompt_token_delta": ( - child_after.uncached_prompt_tokens - - child_before.uncached_prompt_tokens - ), - "warm_parent_vs_cached_child_uncached_delta": ( - parent_before.uncached_prompt_tokens - - child_after.uncached_prompt_tokens - ), - } - - -def parse_args() -> argparse.Namespace: - parser = 
argparse.ArgumentParser( - description=( - "Benchmark delegation prompt-cache behavior before/after " - "cache-safe prefix sharing." - ) - ) - parser.add_argument( - "--model", - default="gpt-4.1-mini", - help="OpenAI model for live benchmark. Default: gpt-4.1-mini", - ) - parser.add_argument( - "--prefix-repeat", - type=int, - default=260, - help="How many times to repeat the parent context sentence.", - ) - parser.add_argument( - "--max-completion-tokens", - type=int, - default=16, - help="max_completion_tokens for each live request.", - ) - parser.add_argument( - "--skip-live", - action="store_true", - help="Only compute structural metrics; skip live API requests.", - ) - return parser.parse_args() - - -def main() -> None: - args = parse_args() - state = asyncio.run( - build_benchmark_state( - model=args.model, - prefix_repeat=args.prefix_repeat, - ) - ) - - result: dict[str, Any] = { - "model": args.model, - "compatibility": state["compatibility"], - "structural": build_structural_metrics(state, args.model), - "notes": { - "compatible_agent_rule": ( - "Prefix sharing only activates when caller/target are local Agent " - "instances and the cache-critical surface matches: instructions, " - "model chain, tools, and response format. Runtime model/model_params " - "are then inherited from the parent request." 
- ), - }, - } - - api_key_present = bool(os.environ.get("OPENAI_API_KEY")) - if args.skip_live: - result["live"] = {"skipped": True, "reason": "--skip-live specified"} - elif not api_key_present: - result["live"] = { - "skipped": True, - "reason": "OPENAI_API_KEY not set", - } - else: - result["live"] = build_live_metrics( - model=args.model, - parent_messages=state["parent_messages"], - before_child_messages=state["before_child_messages"], - after_child_messages=state["after_child_messages"], - max_completion_tokens=args.max_completion_tokens, - ) - - print(json.dumps(result, ensure_ascii=False, indent=2)) - - -if __name__ == "__main__": - main() diff --git a/scripts/benchmark_token_optimizations.py b/scripts/benchmark_token_optimizations.py deleted file mode 100644 index e809febc..00000000 --- a/scripts/benchmark_token_optimizations.py +++ /dev/null @@ -1,535 +0,0 @@ -#!/usr/bin/env python3 -"""Benchmark all 5 token optimizations with real OpenAI API token counting. - -Measures actual prompt_tokens reported by the API for each optimization -on/off, across multiple conversation sizes. Uses gpt-4.1-mini for minimal cost. - -Usage: - OPENAI_API_KEY=sk-... python scripts/benchmark_token_optimizations.py - OPENAI_API_KEY=sk-... 
python scripts/benchmark_token_optimizations.py --skip-live -""" - -from __future__ import annotations - -import argparse -import asyncio -import copy -import json -import os -import sys -import time -from dataclasses import dataclass -from pathlib import Path -from typing import Any - -sys.path.insert(0, str(Path(__file__).resolve().parent.parent)) - -from pantheon.internal.memory import Memory -from pantheon.team.pantheon import DELEGATION_RECENT_TAIL_SIZE -from pantheon.utils.token_optimization import ( - ON_DEMAND_HINT, - SnipConfig, - TimeBasedMicrocompactConfig, - apply_token_optimizations, - apply_tool_result_budget, - autocompact_messages, - build_delegation_context_message, - build_llm_view, - collapse_read_search_groups, - microcompact_messages, - should_autocompact, - snip_messages_to_budget, - stabilize_tool_definitions, -) - -CHARS_PER_TOKEN = 4 - - -# --------------------------------------------------------------------------- -# Helpers -# --------------------------------------------------------------------------- - -def est_tokens(messages: list[dict]) -> int: - total = 0 - for m in messages: - c = m.get("content") - if isinstance(c, str): - total += len(c) - return total // CHARS_PER_TOKEN - - -def _make_log_output(size_kb: int) -> str: - line = "2026-03-31T06:41:47 INFO server.handler request_id=abc123 status=200 duration_ms=42 path=/api/v2/data user=admin@example.com\n" - return line * max(1, (size_kb * 1024) // len(line)) - - -def _tc(call_id: str, name: str) -> dict: - return {"id": call_id, "function": {"name": name, "arguments": "{}"}} - - -def _tool_msg(call_id: str, content: str, name: str = "shell") -> dict: - return {"role": "tool", "tool_call_id": call_id, "tool_name": name, "content": content} - - -def _asst(content: str, tool_calls=None, ts=None) -> dict: - m: dict = {"role": "assistant", "content": content} - if tool_calls: - m["tool_calls"] = tool_calls - if ts is not None: - m["timestamp"] = ts - return m - - -# 
--------------------------------------------------------------------------- -# Build test conversations -# --------------------------------------------------------------------------- - -def build_conversation( - num_rounds: int = 10, - output_kb: int = 50, - all_timestamps_old: bool = False, -) -> list[dict]: - msgs: list[dict] = [ - {"role": "system", "content": "You are a software engineering assistant. Investigate issues methodically."}, - {"role": "user", "content": "Investigate the production outage in the payment service. Check logs, code, and recent deployments."}, - ] - now = time.time() - tools = ["shell", "read_file", "grep", "web_fetch", "bash"] - - for i in range(num_rounds): - cid = f"call_{i:04d}" - name = tools[i % len(tools)] - - if all_timestamps_old: - ts = now - 7200 + i * 10 # all 2 hours ago - else: - ts = now - 7200 + i * 10 if i < num_rounds - 2 else now - 5 - - msgs.append(_asst(f"Checking {name} for issue #{i}.", tool_calls=[_tc(cid, name)], ts=ts)) - msgs.append(_tool_msg(cid, _make_log_output(output_kb), name=name)) - - msgs.append(_asst("Root cause: race condition in payment handler.")) - msgs.append({"role": "user", "content": "Please fix it."}) - return msgs - - -def build_collapsible_conversation( - num_rounds: int = 10, - output_kb: int = 50, -) -> list[dict]: - """Build a conversation where tool calls have silent assistant messages - (no text content), making them collapsible by contextCollapse.""" - msgs: list[dict] = [ - {"role": "system", "content": "You are a software engineering assistant."}, - {"role": "user", "content": "Investigate the bug."}, - ] - now = time.time() - tools = ["grep", "read_file", "glob", "shell", "web_fetch"] - - for i in range(num_rounds): - cid = f"call_{i:04d}" - name = tools[i % len(tools)] - ts = now - 7200 + i * 10 - # Silent assistant: only tool_calls, NO text — collapsible - msgs.append({ - "role": "assistant", - "tool_calls": [_tc(cid, name)], - "timestamp": ts, - }) - msgs.append(_tool_msg(cid, 
_make_log_output(output_kb), name=name)) - - msgs.append(_asst("I found the root cause.")) - msgs.append({"role": "user", "content": "Fix it."}) - return msgs - - -# --------------------------------------------------------------------------- -# Convert to OpenAI-compatible format for live API calls -# --------------------------------------------------------------------------- - -def flatten_for_api(messages: list[dict]) -> list[dict]: - """Convert tool messages to user messages for API token counting. - - OpenAI requires tool_calls + tool responses in matched pairs. To avoid - that complexity, we flatten: embed tool results as user messages so the - API counts all the tokens accurately. - """ - result = [] - for m in messages: - role = m.get("role", "") - content = m.get("content", "") - if not isinstance(content, str) or not content: - continue - if role == "tool": - # Embed as user message to preserve token count - result.append({"role": "user", "content": f"[Tool output ({m.get('tool_name','tool')})]\n{content}"}) - elif role in ("system", "user", "assistant"): - clean = {"role": role, "content": content} - result.append(clean) - elif role == "compression": - result.append({"role": "user", "content": content}) - return result - - -# --------------------------------------------------------------------------- -# Optimization benchmarks -# --------------------------------------------------------------------------- - -@dataclass -class Result: - name: str - before_tokens: int - after_tokens: int - - @property - def saved(self) -> int: - return self.before_tokens - self.after_tokens - - @property - def pct(self) -> float: - return round(self.saved / max(self.before_tokens, 1) * 100, 1) - - -def bench_opt1(msgs: list[dict], tmp: Path, budget: int = 30_000) -> Result: - """Opt1: Tool result budget — externalize large outputs to disk.""" - before = est_tokens(msgs) - mem = Memory("b-opt1") - out = apply_tool_result_budget(copy.deepcopy(msgs), memory=mem, base_dir=tmp / 
"o1", - per_message_limit=budget) - return Result("1. Tool Result Budget", before, est_tokens(out)) - - -def bench_opt2(msgs: list[dict]) -> Result: - """Opt2: Micro-compact — clear old compactable tool results.""" - before = est_tokens(msgs) - out = microcompact_messages( - copy.deepcopy(msgs), is_main_thread=True, - config=TimeBasedMicrocompactConfig(enabled=True, gap_threshold_minutes=1, keep_recent=2), - ) - return Result("2. Micro-Compact", before, est_tokens(out)) - - -def bench_opt3() -> Result: - """Opt3: Cache stability — local idempotency check only. - Real cache-hit measurement is done in bench_opt3_live(). - """ - tools = [ - {"function": {"name": "zeta", "parameters": {"type": "object", "required": ["b", "a"], - "properties": {"b": {"type": "str"}, "a": {"type": "str"}}}}}, - {"function": {"name": "alpha", "parameters": {"type": "object", "required": ["x"], - "properties": {"x": {"type": "str"}}}}}, - {"function": {"name": "mid", "parameters": {"type": "object", "required": ["q", "p"], - "properties": {"q": {"type": "str"}, "p": {"type": "str"}}}}}, - ] - stable1 = stabilize_tool_definitions(copy.deepcopy(tools)) - stable2 = stabilize_tool_definitions(copy.deepcopy(tools)) - is_stable = json.dumps(stable1) == json.dumps(stable2) - t = len(json.dumps(tools)) // CHARS_PER_TOKEN - return Result(f"3. Cache Stability (idempotent={is_stable}, see live)", t, t) - - -def bench_opt4_snip(msgs: list[dict]) -> Result: - """Opt4a: HISTORY_SNIP — token-budget truncation of oldest messages.""" - before = est_tokens(msgs) - out, freed = snip_messages_to_budget( - copy.deepcopy(msgs), - config=SnipConfig(enabled=True, token_budget=20_000, keep_recent=4), - ) - return Result("4a. 
HISTORY_SNIP", before, est_tokens(out)) - - -def bench_opt4_collapse(msgs: list[dict], collapse_msgs: list[dict] | None = None) -> Result: - """Opt4b: contextCollapse — fold consecutive read/search groups.""" - target = collapse_msgs or msgs - before = est_tokens(target) - out, saved = collapse_read_search_groups(copy.deepcopy(target), min_group_size=3) - return Result("4b. contextCollapse", before, est_tokens(out)) - - -def bench_opt4_autocompact(msgs: list[dict]) -> Result: - """Opt4c: autocompact — last-resort LLM-based summarization (heuristic fallback).""" - before = est_tokens(msgs) - out, freed, _ = asyncio.run(autocompact_messages( - copy.deepcopy(msgs), token_budget=20_000, keep_recent=4, - model=None, # heuristic fallback for benchmark (no API cost) - )) - return Result("4c. Autocompact (heuristic)", before, est_tokens(out)) - - -def bench_opt4(msgs: list[dict], tmp: Path) -> Result: - """Opt4: build_llm_view — full 5-stage projection pipeline.""" - before = est_tokens(msgs) - mem = Memory("b-opt4") - out = build_llm_view(copy.deepcopy(msgs), memory=mem, base_dir=tmp / "o4", is_main_thread=True) - return Result("4. LLM View Layer (all stages)", before, est_tokens(out)) - - -def bench_opt5(msgs: list[dict]) -> Result: - """Opt5: Delegation summary-first vs full history. - - Compares what a sub-agent actually receives: - BEFORE (old default use_summary=False): raw instruction + child gets the - parent's full history via memory (simulated by est_tokens on full msgs) - AFTER (new default use_summary=True): summary + recent tail context message - """ - # BEFORE: old behavior (use_summary=False) — sub-agent sees full parent history - # as its task_message was just the instruction, but `run_context.memory.get_messages` - # would feed the parent conversation. We estimate what the child *actually* processes. 
- system_tokens = 20 # system prompt overhead - before = est_tokens(msgs) + system_tokens # child processes full parent history - - # AFTER: new behavior — child only sees compact delegation context - tail = msgs[-DELEGATION_RECENT_TAIL_SIZE:] - ctx_after = build_delegation_context_message( - history=tail, instruction="Fix the race condition.", - summary_text="Parent investigated payment outage. Root cause: race condition in concurrent transaction handler. Logs and code examined.", - ) - after = len(ctx_after) // CHARS_PER_TOKEN + system_tokens - - return Result("5. Delegation Summary-First", before, after) - - -def bench_combined(msgs: list[dict], tmp: Path) -> Result: - """All optimizations stacked (opt1 + opt2 + opt4).""" - before = est_tokens(msgs) - mem = Memory("b-all") - out = build_llm_view(copy.deepcopy(msgs), memory=mem, base_dir=tmp / "all", is_main_thread=True) - out = microcompact_messages( - out, is_main_thread=True, - config=TimeBasedMicrocompactConfig(enabled=True, gap_threshold_minutes=1, keep_recent=2), - ) - return Result("** ALL COMBINED **", before, est_tokens(out)) - - -# --------------------------------------------------------------------------- -# Live API -# --------------------------------------------------------------------------- - -def api_prompt_tokens(client, model: str, messages: list[dict]) -> int: - flat = flatten_for_api(messages) - if not flat: - return 0 - resp = client.chat.completions.create(model=model, messages=flat, max_completion_tokens=1) - return int(resp.usage.prompt_tokens or 0) - - -def _make_cache_tools(order: list[str]) -> list[dict]: - """Build tool definitions in the given name order (to simulate unstable ordering).""" - all_tools = { - "read_file": {"type": "function", "function": {"name": "read_file", "description": "Read a file from disk", "parameters": {"type": "object", "required": ["path"], "properties": {"path": {"type": "string"}}}}}, - "shell": {"type": "function", "function": {"name": "shell", 
"description": "Run a shell command", "parameters": {"type": "object", "required": ["cmd"], "properties": {"cmd": {"type": "string"}}}}}, - "grep": {"type": "function", "function": {"name": "grep", "description": "Search file contents", "parameters": {"type": "object", "required": ["query"], "properties": {"query": {"type": "string"}}}}}, - "web_fetch": {"type": "function", "function": {"name": "web_fetch", "description": "Fetch a URL", "parameters": {"type": "object", "required": ["url"], "properties": {"url": {"type": "string"}}}}}, - "write_file": {"type": "function", "function": {"name": "write_file", "description": "Write content to a file", "parameters": {"type": "object", "required": ["path", "content"], "properties": {"path": {"type": "string"}, "content": {"type": "string"}}}}}, - } - return [all_tools[n] for n in order] - - -def bench_opt3_live(client, model: str) -> dict: - """Live cache-hit test for Opt3 (Cache Stability). - - OpenAI caches the prefix (system + tools + messages) when it is ≥1024 tokens - and the request is identical. We send the same long conversation twice: - - UNSTABLE: tools in different random orders each call → prefix mismatch → no cache - - STABLE: tools always in sorted order via stabilize_tool_definitions → cache hit - - We measure cached_tokens in the second request of each pair. - """ - # Build a long enough prefix (need >1024 tokens cacheable) - long_system = ( - "You are a software engineering assistant. " - "Your job is to investigate complex production incidents. " - "Always be methodical: check logs first, then code, then deployments. " - ) * 30 # ~750 chars × 4 ≈ well over 1024 tokens with tools added - - messages = [ - {"role": "system", "content": long_system}, - {"role": "user", "content": "Investigate the payment service outage. " * 20}, - {"role": "assistant", "content": "I will start by checking the error logs. 
" * 15}, - {"role": "user", "content": "What did you find in the logs?"}, - ] - - order_a = ["read_file", "shell", "grep", "web_fetch", "write_file"] - order_b = ["shell", "write_file", "read_file", "grep", "web_fetch"] # different order - - stable_tools = stabilize_tool_definitions(_make_cache_tools(order_a)) - - def call(tools_list, label): - resp = client.chat.completions.create( - model=model, - messages=messages, - tools=tools_list, - tool_choice="none", # don't invoke any tool, just count prefix tokens - max_completion_tokens=8, - ) - u = resp.usage - prompt = int(u.prompt_tokens or 0) - details = getattr(u, "prompt_tokens_details", None) - cached = int(getattr(details, "cached_tokens", 0) or 0) - hit_rate = round(cached / max(prompt, 1) * 100, 1) - print(f" {label}: prompt={prompt:,} cached={cached:,} hit_rate={hit_rate}%", flush=True) - return {"prompt_tokens": prompt, "cached_tokens": cached, "hit_rate_pct": hit_rate} - - print(" [live opt3] UNSTABLE order — call 1 (warm-up)...", flush=True) - unstable_call1 = call(_make_cache_tools(order_a), "unstable call1") - print(" [live opt3] UNSTABLE order — call 2 (different order, expect 0 cache)...", flush=True) - unstable_call2 = call(_make_cache_tools(order_b), "unstable call2") - - # Small delay to let cache warm - time.sleep(1) - - print(" [live opt3] STABLE order — call 1 (warm-up)...", flush=True) - stable_call1 = call(stable_tools, "stable call1") - print(" [live opt3] STABLE order — call 2 (same order, expect cache hit)...", flush=True) - stable_call2 = call(stable_tools, "stable call2") - - return { - "unstable": {"call1": unstable_call1, "call2": unstable_call2}, - "stable": {"call1": stable_call1, "call2": stable_call2}, - "cache_gain_pct": round( - stable_call2["hit_rate_pct"] - unstable_call2["hit_rate_pct"], 1 - ), - "uncached_tokens_saved": ( - unstable_call2["prompt_tokens"] - unstable_call2["cached_tokens"] - ) - ( - stable_call2["prompt_tokens"] - stable_call2["cached_tokens"] - ), - } - - -def 
run_live(model: str, scenarios: list[dict], tmp: Path) -> list[dict]: - from openai import OpenAI - client = OpenAI() - results = [] - for sc in scenarios: - label = sc["label"] - msgs = sc["messages"] - print(f" [live] {label} — measuring raw...", flush=True) - before = api_prompt_tokens(client, model, msgs) - - print(f" [live] {label} — measuring optimized...", flush=True) - mem = Memory(f"live-{label}") - opt = build_llm_view(copy.deepcopy(msgs), memory=mem, base_dir=tmp / f"l-{label}", is_main_thread=True) - opt = microcompact_messages( - opt, is_main_thread=True, - config=TimeBasedMicrocompactConfig(enabled=True, gap_threshold_minutes=1, keep_recent=2), - ) - after = api_prompt_tokens(client, model, opt) - saved = before - after - pct = round(saved / max(before, 1) * 100, 1) - results.append({"scenario": label, "before": before, "after": after, "saved": saved, "pct": pct}) - return results - - -# --------------------------------------------------------------------------- -# Display -# --------------------------------------------------------------------------- - -def print_table(title: str, rows: list[Result]): - print(f"\n{'─' * 70}") - print(f" {title}") - print(f"{'─' * 70}") - print(f" {'Optimization':<45} {'Before':>8} {'After':>8} {'Saved':>8} {'%':>7}") - print(f" {'─' * 45} {'─' * 8} {'─' * 8} {'─' * 8} {'─' * 7}") - for r in rows: - print(f" {r.name:<45} {r.before_tokens:>8,} {r.after_tokens:>8,} {r.saved:>8,} {r.pct:>6.1f}%") - - -# --------------------------------------------------------------------------- -# Main -# --------------------------------------------------------------------------- - -def main(): - parser = argparse.ArgumentParser() - parser.add_argument("--model", default="gpt-4.1-mini") - parser.add_argument("--skip-live", action="store_true") - args = parser.parse_args() - tmp = Path("/tmp/pantheon-token-bench") - tmp.mkdir(parents=True, exist_ok=True) - - print("=" * 70) - print(" PantheonOS Token Optimization Benchmark") - print("=" * 
70) - - scales = [ - {"label": "Small (5×10KB)", "rounds": 5, "kb": 10}, - {"label": "Medium (10×50KB)", "rounds": 10, "kb": 50}, - {"label": "Large (20×100KB)", "rounds": 20, "kb": 100}, - {"label": "XL (30×200KB)", "rounds": 30, "kb": 200}, - ] - - all_json: list[dict] = [] - - for scale in scales: - # Build with ALL timestamps old so microcompact triggers - msgs = build_conversation(num_rounds=scale["rounds"], output_kb=scale["kb"], - all_timestamps_old=True) - raw = est_tokens(msgs) - - # Build collapsible version (silent assistant messages) for contextCollapse - collapse_msgs = build_collapsible_conversation( - num_rounds=scale["rounds"], output_kb=scale["kb"], - ) - rows = [ - bench_opt1(msgs, tmp), - bench_opt2(msgs), - bench_opt3(), - bench_opt4_snip(msgs), - bench_opt4_collapse(msgs, collapse_msgs), - bench_opt4_autocompact(msgs), - bench_opt4(msgs, tmp), - bench_opt5(msgs), - bench_combined(msgs, tmp), - ] - - print_table(f"{scale['label']} (raw ≈ {raw:,} tokens)", rows) - all_json.append({ - "scale": scale["label"], "raw_tokens": raw, - "opts": [{"name": r.name, "before": r.before_tokens, "after": r.after_tokens, - "saved": r.saved, "pct": r.pct} for r in rows], - }) - - # Live API - if not args.skip_live and os.environ.get("OPENAI_API_KEY"): - from openai import OpenAI - client = OpenAI() - - # --- Opt 1/2/4/5 live token reduction --- - print(f"\n{'=' * 70}") - print(" Live API: Opt 1+2+4 combined token reduction (gpt-4.1-mini)") - print(f"{'=' * 70}") - live_scenarios = [ - {"label": "5×30KB", "messages": build_conversation(5, 30, all_timestamps_old=True)}, - {"label": "10×50KB", "messages": build_conversation(10, 50, all_timestamps_old=True)}, - {"label": "15×80KB", "messages": build_conversation(15, 80, all_timestamps_old=True)}, - ] - live = run_live(args.model, live_scenarios, tmp) - print(f"\n {'Scenario':<20} {'Before':>10} {'After':>10} {'Saved':>10} {'%':>7}") - print(f" {'─' * 20} {'─' * 10} {'─' * 10} {'─' * 10} {'─' * 7}") - for lr in live: 
- print(f" {lr['scenario']:<20} {lr['before']:>10,} {lr['after']:>10,} {lr['saved']:>10,} {lr['pct']:>6.1f}%") - - # --- Opt 3 live cache-hit test --- - print(f"\n{'=' * 70}") - print(" Live API: Opt 3 — Cache Stability (cached_tokens comparison)") - print(f"{'=' * 70}") - cache_result = bench_opt3_live(client, args.model) - print(f"\n Unstable order → call2 cache hit rate: {cache_result['unstable']['call2']['hit_rate_pct']}%") - print(f" Stable order → call2 cache hit rate: {cache_result['stable']['call2']['hit_rate_pct']}%") - print(f" Cache gain from stable ordering: +{cache_result['cache_gain_pct']}%") - print(f" Uncached tokens saved (per request): {cache_result['uncached_tokens_saved']:,}") - - all_json.append({"live_token_reduction": live, "live_cache_stability": cache_result}) - elif args.skip_live: - print("\n[Skipped live API benchmark (--skip-live)]") - else: - print("\n[Skipped live API benchmark (OPENAI_API_KEY not set)]") - - # Write JSON - out_path = tmp / "results.json" - out_path.write_text(json.dumps(all_json, indent=2, ensure_ascii=False)) - print(f"\nJSON results saved to: {out_path}") - - -if __name__ == "__main__": - main() diff --git a/tests/test_token_optimization.py b/tests/test_token_optimization.py deleted file mode 100644 index 210ec776..00000000 --- a/tests/test_token_optimization.py +++ /dev/null @@ -1,1498 +0,0 @@ -from __future__ import annotations - -from datetime import datetime, timedelta, timezone - -from pantheon.agent import Agent, AgentRunContext -from pantheon.internal.memory import Memory -from pantheon.team.pantheon import ( - PantheonTeam, - _get_cache_safe_child_fork_context_messages, - _get_cache_safe_child_run_overrides, - create_delegation_task_message, -) -from pantheon.utils.token_optimization import ( - PERSISTED_OUTPUT_TAG, - TIME_BASED_MC_CLEARED_MESSAGE, - TimeBasedMicrocompactConfig, - apply_token_optimizations, - apply_tool_result_budget, - build_cache_safe_runtime_params, - build_delegation_context_message, - 
build_llm_view, - evaluate_time_based_trigger, - estimate_total_tokens_from_chars, - inject_cache_control_markers, - is_anthropic_model, - microcompact_messages, -) - - -def _build_tool_message(tool_call_id: str, content: str) -> dict: - return { - "role": "tool", - "tool_call_id": tool_call_id, - "tool_name": "shell", - "content": content, - } - - -def test_apply_tool_result_budget_persists_large_parallel_tool_messages(tmp_path): - """Aggregate path: 3×90K = 270K exceeds the 200K per-message limit, - so the budget externalizes the largest fresh results until under limit. - Per-tool thresholds are now enforced at tool execution time (process_tool_result), - so apply_tool_result_budget only handles aggregate overflow.""" - memory = Memory("test-memory") - messages = [ - { - "role": "assistant", - "id": "assistant-1", - "tool_calls": [ - {"id": "tool-1", "function": {"name": "shell"}}, - {"id": "tool-2", "function": {"name": "shell"}}, - {"id": "tool-3", "function": {"name": "shell"}}, - ], - }, - _build_tool_message("tool-1", "A" * 90_000), - _build_tool_message("tool-2", "B" * 90_000), - _build_tool_message("tool-3", "C" * 90_000), - ] - - optimized = apply_tool_result_budget(messages, memory=memory, base_dir=tmp_path) - - optimized_tool_messages = [msg for msg in optimized if msg["role"] == "tool"] - persisted = [ - msg for msg in optimized_tool_messages if msg["content"].startswith(PERSISTED_OUTPUT_TAG) - ] - - # 270K > 200K aggregate limit — at least 1 result must be externalized - assert len(persisted) >= 1 - assert all("Full output saved to:" in m["content"] for m in persisted) - assert "token_optimization" in memory.extra_data - - rerun = apply_tool_result_budget(messages, memory=memory, base_dir=tmp_path) - assert rerun[1]["content"] == optimized[1]["content"] - assert rerun[2]["content"] == optimized[2]["content"] - assert rerun[3]["content"] == optimized[3]["content"] - - -def test_apply_tool_result_budget_aggregate_path_for_unknown_tool(tmp_path): - 
"""Unknown tools use global aggregate logic, not per-tool limit.""" - memory = Memory("test-aggregate") - # 3 × 90K = 270K > 200K global limit → only largest(s) get externalized - messages = [ - { - "role": "assistant", - "id": "assistant-1", - "tool_calls": [ - {"id": "t-1", "function": {"name": "my_custom_tool"}}, - {"id": "t-2", "function": {"name": "my_custom_tool"}}, - {"id": "t-3", "function": {"name": "my_custom_tool"}}, - ], - }, - _build_tool_message("t-1", "A" * 90_000), - _build_tool_message("t-2", "B" * 90_000), - _build_tool_message("t-3", "C" * 90_000), - ] - optimized = apply_tool_result_budget(messages, memory=memory, base_dir=tmp_path) - tool_msgs = [msg for msg in optimized if msg["role"] == "tool"] - persisted = [m for m in tool_msgs if m["content"].startswith(PERSISTED_OUTPUT_TAG)] - # Aggregate logic: 270K > 200K limit → some (≥1) get externalized - assert len(persisted) >= 1 - assert "Full output saved to:" in persisted[0]["content"] - - -def test_time_based_microcompact_clears_old_tool_messages(): - old_timestamp = (datetime.now(timezone.utc) - timedelta(minutes=90)).isoformat() - messages = [ - {"role": "assistant", "id": "assistant-1", "timestamp": old_timestamp}, - _build_tool_message("tool-1", "A" * 20_000), - _build_tool_message("tool-2", "B" * 20_000), - _build_tool_message("tool-3", "C" * 20_000), - _build_tool_message("tool-4", "D" * 20_000), - _build_tool_message("tool-5", "E" * 20_000), - _build_tool_message("tool-6", "F" * 20_000), - ] - - compacted = microcompact_messages( - messages, - is_main_thread=True, - config=TimeBasedMicrocompactConfig( - enabled=True, - gap_threshold_minutes=60, - keep_recent=2, - ), - ) - compacted_contents = [msg["content"] for msg in compacted if msg["role"] == "tool"] - - assert compacted_contents[:4] == [TIME_BASED_MC_CLEARED_MESSAGE] * 4 - assert compacted_contents[-2:] == ["E" * 20_000, "F" * 20_000] - - -def test_time_based_microcompact_only_clears_compactable_tools(): - old_timestamp = 
(datetime.now(timezone.utc) - timedelta(minutes=90)).isoformat() - messages = [ - { - "role": "assistant", - "id": "assistant-1", - "timestamp": old_timestamp, - "tool_calls": [ - {"id": "tool-1", "function": {"name": "shell"}}, - {"id": "tool-2", "function": {"name": "knowledge__search_knowledge"}}, - {"id": "tool-3", "function": {"name": "web_urllib__web_search"}}, - ], - }, - _build_tool_message("tool-1", "A" * 20_000), - { - "role": "tool", - "tool_call_id": "tool-2", - "tool_name": "knowledge__search_knowledge", - "content": "B" * 20_000, - }, - { - "role": "tool", - "tool_call_id": "tool-3", - "tool_name": "web_urllib__web_search", - "content": "C" * 20_000, - }, - ] - - compacted = microcompact_messages( - messages, - is_main_thread=True, - config=TimeBasedMicrocompactConfig( - enabled=True, - gap_threshold_minutes=60, - keep_recent=1, - ), - ) - compacted_contents = [msg["content"] for msg in compacted if msg["role"] == "tool"] - - assert compacted_contents[0] == TIME_BASED_MC_CLEARED_MESSAGE - assert compacted_contents[1] == "B" * 20_000 - assert compacted_contents[2] == "C" * 20_000 - - -def test_evaluate_time_based_trigger_requires_old_assistant_message(): - recent_timestamp = (datetime.now(timezone.utc) - timedelta(minutes=10)).isoformat() - old_timestamp = (datetime.now(timezone.utc) - timedelta(minutes=90)).isoformat() - - recent = [{"role": "assistant", "timestamp": recent_timestamp}] - old = [{"role": "assistant", "timestamp": old_timestamp}] - - config = TimeBasedMicrocompactConfig( - enabled=True, - gap_threshold_minutes=60, - keep_recent=5, - ) - - assert evaluate_time_based_trigger( - recent, - is_main_thread=True, - config=config, - ) is None - assert evaluate_time_based_trigger( - old, - is_main_thread=False, - config=config, - ) is None - assert evaluate_time_based_trigger( - old, - is_main_thread=True, - config=config, - ) is not None - - -def test_build_llm_view_skips_time_based_microcompact_for_subagents(): - memory = 
Memory("subagent-projection-memory") - old_timestamp = (datetime.now(timezone.utc) - timedelta(minutes=90)).isoformat() - messages = [ - {"role": "system", "content": "system"}, - { - "role": "assistant", - "id": "assistant-1", - "timestamp": old_timestamp, - "tool_calls": [ - {"id": "tool-1", "function": {"name": "shell"}}, - {"id": "tool-2", "function": {"name": "file_manager__read_file"}}, - {"id": "tool-3", "function": {"name": "file_manager__grep"}}, - {"id": "tool-4", "function": {"name": "web_urllib__web_search"}}, - {"id": "tool-5", "function": {"name": "web_urllib__web_fetch"}}, - {"id": "tool-6", "function": {"name": "file_manager__glob"}}, - ], - }, - _build_tool_message("tool-1", "A" * 20_000), - { - "role": "tool", - "tool_call_id": "tool-2", - "tool_name": "file_manager__read_file", - "content": "B" * 20_000, - }, - { - "role": "tool", - "tool_call_id": "tool-3", - "tool_name": "file_manager__grep", - "content": "C" * 20_000, - }, - { - "role": "tool", - "tool_call_id": "tool-4", - "tool_name": "web_urllib__web_search", - "content": "D" * 20_000, - }, - { - "role": "tool", - "tool_call_id": "tool-5", - "tool_name": "web_urllib__web_fetch", - "content": "E" * 20_000, - }, - { - "role": "tool", - "tool_call_id": "tool-6", - "tool_name": "file_manager__glob", - "content": "F" * 20_000, - }, - ] - - view = build_llm_view(messages, memory=memory, is_main_thread=False) - - tool_contents = [msg["content"] for msg in view if msg["role"] == "tool"] - assert TIME_BASED_MC_CLEARED_MESSAGE not in tool_contents - - -def test_build_cache_safe_runtime_params_normalizes_dict_order(): - class ResponseA: - @staticmethod - def model_json_schema(): - return { - "type": "object", - "properties": { - "b": {"type": "string"}, - "a": {"type": "string"}, - }, - "required": ["b", "a"], - } - - params_a = build_cache_safe_runtime_params( - model="openai/gpt-5.1-mini", - model_params={"top_p": 1, "temperature": 0}, - response_format=ResponseA, - ) - params_b = 
build_cache_safe_runtime_params( - model="openai/gpt-5.1-mini", - model_params={"temperature": 0, "top_p": 1}, - response_format=ResponseA, - ) - - assert params_a.model_params_normalized == params_b.model_params_normalized - assert params_a.response_format_normalized == params_b.response_format_normalized - - -def test_get_cache_safe_child_run_overrides_inherits_compatible_runtime_params(): - caller = Agent( - name="caller", - instructions="caller", - model="openai/gpt-5.1-mini", - model_params={"temperature": 0}, - ) - target = Agent( - name="target", - instructions="target", - model="openai/gpt-5.1-mini", - model_params={"temperature": 0}, - ) - run_context = AgentRunContext( - agent=caller, - memory=None, - execution_context_id=None, - process_step_message=None, - process_chunk=None, - ) - run_context.cache_safe_runtime_params = build_cache_safe_runtime_params( - model="openai/gpt-5.1-mini", - model_params={"temperature": 0, "top_p": 1}, - response_format=None, - ) - - overrides, child_context_variables = _get_cache_safe_child_run_overrides( - run_context, - target, - {}, - ) - - assert overrides == { - "model": "openai/gpt-5.1-mini", - "response_format": None, - } - assert child_context_variables["model_params"] == {"temperature": 0, "top_p": 1} - - -def test_prepare_execution_context_prepends_cache_safe_fork_messages(): - agent = Agent(name="child", instructions="child", model="openai/gpt-5.1-mini") - fork_context_messages = [ - {"role": "user", "content": "Parent prefix question"}, - {"role": "assistant", "content": "Parent prefix answer"}, - ] - - import asyncio - - exec_context = asyncio.run( - agent._prepare_execution_context( - msg="Delegated child task", - use_memory=False, - context_variables={ - "_cache_safe_fork_context_messages": fork_context_messages, - }, - ) - ) - - assert exec_context.conversation_history[0]["content"] == "Parent prefix question" - assert exec_context.conversation_history[1]["content"] == "Parent prefix answer" - assert 
exec_context.conversation_history[-1]["content"] == "Delegated child task" - assert "_cache_safe_fork_context_messages" not in exec_context.context_variables - - -def test_get_cache_safe_child_fork_context_messages_requires_compatible_agent(): - caller = Agent( - name="caller", - instructions="shared instructions", - model="openai/gpt-5.1-mini", - ) - target = Agent( - name="target", - instructions="shared instructions", - model="openai/gpt-5.1-mini", - ) - - def alpha_tool(path: str) -> str: - return path - - caller.tool(alpha_tool) - target.tool(alpha_tool) - - run_context = AgentRunContext( - agent=caller, - memory=None, - execution_context_id=None, - process_step_message=None, - process_chunk=None, - cache_safe_prompt_messages=[ - {"role": "system", "content": "shared instructions"}, - {"role": "user", "content": "Parent prefix question"}, - ], - ) - - import asyncio - - run_context.cache_safe_tool_definitions = asyncio.run(caller.get_tools_for_llm()) - fork_context_messages = asyncio.run( - _get_cache_safe_child_fork_context_messages(run_context, target) - ) - - assert fork_context_messages == [ - {"role": "user", "content": "Parent prefix question"}, - ] - - -def test_apply_token_optimizations_reduces_prompt_size(tmp_path): - memory = Memory("benchmark-memory") - old_timestamp = (datetime.now(timezone.utc) - timedelta(minutes=90)).isoformat() - messages = [ - {"role": "system", "content": "You are helpful."}, - { - "role": "assistant", - "id": "assistant-1", - "timestamp": old_timestamp, - "tool_calls": [ - {"id": "tool-1", "function": {"name": "shell"}}, - {"id": "tool-2", "function": {"name": "shell"}}, - {"id": "tool-3", "function": {"name": "shell"}}, - {"id": "tool-4", "function": {"name": "shell"}}, - {"id": "tool-5", "function": {"name": "shell"}}, - {"id": "tool-6", "function": {"name": "shell"}}, - ], - }, - _build_tool_message("tool-1", "A" * 90_000), - _build_tool_message("tool-2", "B" * 90_000), - _build_tool_message("tool-3", "C" * 90_000), - 
_build_tool_message("tool-4", "D" * 90_000), - _build_tool_message("tool-5", "E" * 90_000), - _build_tool_message("tool-6", "F" * 90_000), - {"role": "user", "content": "Please summarize the tool outputs."}, - ] - - before_tokens = estimate_total_tokens_from_chars(messages) - optimized = apply_token_optimizations( - messages, - memory=memory, - base_dir=tmp_path, - ) - after_tokens = estimate_total_tokens_from_chars(optimized) - - assert after_tokens < before_tokens - - -def test_build_llm_view_projects_compression_and_preserves_system(): - memory = Memory("projection-memory") - messages = [ - {"role": "system", "content": "system"}, - {"role": "user", "content": "first"}, - {"role": "compression", "content": "compressed"}, - {"role": "assistant", "content": "after compression"}, - ] - - view = build_llm_view(messages, memory=memory) - - assert view[0]["role"] == "system" - assert len(view) == 3 - assert view[1]["role"] == "user" - assert view[1]["content"] == "compressed" - - -def test_get_tools_for_llm_is_stably_sorted(): - agent = Agent(name="sorter", instructions="Sort tools") - - def zebra_tool() -> str: - return "z" - - def alpha_tool() -> str: - return "a" - - agent.tool(zebra_tool) - agent.tool(alpha_tool) - - import asyncio - - tools = asyncio.run(agent.get_tools_for_llm()) - tool_names = [tool["function"]["name"] for tool in tools] - - assert tool_names == sorted(tool_names) - - -def test_create_delegation_task_message_uses_recent_context_and_file_refs(monkeypatch): - class FakeSummaryGenerator: - async def generate_summary(self, history, max_tokens=1000): - return "short summary" - - monkeypatch.setattr( - "pantheon.chatroom.special_agents.get_summary_generator", - lambda: FakeSummaryGenerator(), - ) - - history = [ - {"role": "user", "content": "Investigate the failures."}, - { - "role": "tool", - "tool_call_id": "tool-1", - "tool_name": "shell", - "content": "\nOutput too large (10KB). 
Full output saved to: /tmp/tool-1.txt\n\nPreview (first 2KB):\nfoo\n", - }, - {"role": "assistant", "content": "I found two likely causes."}, - ] - - import asyncio - - task_message = asyncio.run( - create_delegation_task_message( - history, - "Find the root cause", - use_summary=True, - ) - ) - - assert "Context Summary:\nshort summary" in task_message - assert "Recent Context:" in task_message - assert "Referenced Files (retrieve on demand if needed):\n- /tmp/tool-1.txt" in task_message - assert "Task: Find the root cause" in task_message - # On-demand hint is appended when summary is present - assert "retrieve it on demand" in task_message - - -def test_create_delegation_task_message_use_summary_false_returns_raw_instruction(monkeypatch): - """When use_summary=False, only the raw instruction is returned.""" - import asyncio - - result = asyncio.run( - create_delegation_task_message( - history=[{"role": "user", "content": "hello"}], - instruction="Do something", - use_summary=False, - ) - ) - assert result == "Do something" - - -def test_create_delegation_task_message_trims_history_to_recent_tail(monkeypatch): - """Only the most recent messages are passed to build_delegation_context_message.""" - from pantheon.team.pantheon import DELEGATION_RECENT_TAIL_SIZE - - captured = {} - - original_build = build_delegation_context_message - - def spy_build(history, instruction, summary_text=None): - captured["history_len"] = len(history) - return original_build( - history=history, - instruction=instruction, - summary_text=summary_text, - ) - - monkeypatch.setattr( - "pantheon.utils.token_optimization.build_delegation_context_message", - spy_build, - ) - - class FakeSummaryGenerator: - async def generate_summary(self, history, max_tokens=1000): - captured["summary_input_len"] = len(history) - return "summary" - - monkeypatch.setattr( - "pantheon.chatroom.special_agents.get_summary_generator", - lambda: FakeSummaryGenerator(), - ) - - # Create a history larger than 
DELEGATION_RECENT_TAIL_SIZE - big_history = [ - {"role": "user", "content": f"message {i}"} - for i in range(DELEGATION_RECENT_TAIL_SIZE + 30) - ] - - import asyncio - - asyncio.run( - create_delegation_task_message( - history=big_history, - instruction="Analyze", - use_summary=True, - ) - ) - - # Summary generator sees full history - assert captured["summary_input_len"] == len(big_history) - # build_delegation_context_message only sees the recent tail - assert captured["history_len"] == DELEGATION_RECENT_TAIL_SIZE - - -def test_create_delegation_no_on_demand_hint_without_summary(monkeypatch): - """When summary generation fails, on-demand hint is not appended.""" - class FailingSummaryGenerator: - async def generate_summary(self, history, max_tokens=1000): - raise RuntimeError("LLM unavailable") - - monkeypatch.setattr( - "pantheon.chatroom.special_agents.get_summary_generator", - lambda: FailingSummaryGenerator(), - ) - - import asyncio - - result = asyncio.run( - create_delegation_task_message( - history=[{"role": "user", "content": "hello"}], - instruction="Do something", - use_summary=True, - ) - ) - # No summary means no on-demand hint - assert "retrieve it on demand" not in result - assert "Task: Do something" in result - - -def test_pantheon_team_use_summary_defaults_to_true(): - """PantheonTeam defaults to use_summary=True for summary-first delegation.""" - from unittest.mock import MagicMock - - agent = MagicMock() - agent.name = "test-agent" - agent.models = ["gpt-4"] - - team = PantheonTeam(agents=[agent]) - assert team.use_summary is True - - -# --------------------------------------------------------------------------- -# Opt3: cache_control injection tests -# --------------------------------------------------------------------------- - -def test_is_anthropic_model_detection(): - assert is_anthropic_model("claude-3-5-sonnet-20241022") is True - assert is_anthropic_model("anthropic/claude-3-haiku") is True - assert 
is_anthropic_model("custom_anthropic/claude-3-opus") is True - assert is_anthropic_model("gpt-4o") is False - assert is_anthropic_model("gpt-4.1-mini") is False - assert is_anthropic_model("openai/gpt-4") is False - - -def test_inject_cache_control_marks_system_and_last_user(): - messages = [ - {"role": "system", "content": "You are a helpful assistant."}, - {"role": "user", "content": "Hello"}, - {"role": "assistant", "content": "Hi there!"}, - {"role": "user", "content": "What is 2+2?"}, - ] - result = inject_cache_control_markers(messages) - - # System message last block should have cache_control - sys_content = result[0]["content"] - assert isinstance(sys_content, list) - assert sys_content[-1].get("cache_control") == {"type": "ephemeral"} - - # Last user message should have cache_control - last_user = result[-1]["content"] - assert isinstance(last_user, list) - assert last_user[-1].get("cache_control") == {"type": "ephemeral"} - - # Middle assistant message should NOT have cache_control - mid_asst = result[2]["content"] - if isinstance(mid_asst, list): - assert all("cache_control" not in b for b in mid_asst) - else: - assert "cache_control" not in str(mid_asst) - - -def test_inject_cache_control_converts_string_content_to_blocks(): - messages = [{"role": "user", "content": "plain string content"}] - result = inject_cache_control_markers(messages) - content = result[0]["content"] - assert isinstance(content, list) - assert content[0]["type"] == "text" - assert content[0]["text"] == "plain string content" - assert content[0]["cache_control"] == {"type": "ephemeral"} - - -def test_inject_cache_control_does_not_mutate_input(): - messages = [ - {"role": "system", "content": "sys"}, - {"role": "user", "content": "user msg"}, - ] - original_sys = messages[0]["content"] - inject_cache_control_markers(messages) - # Input should be unchanged - assert messages[0]["content"] == original_sys - assert isinstance(messages[0]["content"], str) - - -def 
test_inject_cache_control_skips_empty_assistant(): - """Last non-empty user/assistant gets the marker, not an empty trailing message.""" - messages = [ - {"role": "system", "content": "sys"}, - {"role": "user", "content": "question"}, - {"role": "assistant", "content": " "}, # whitespace-only, skip - ] - result = inject_cache_control_markers(messages) - # user message should get the marker since assistant is empty - user_content = result[1]["content"] - assert isinstance(user_content, list) - assert user_content[-1].get("cache_control") == {"type": "ephemeral"} - - -# --------------------------------------------------------------------------- -# Opt2 extension: HISTORY_SNIP tests -# --------------------------------------------------------------------------- - -def test_snip_messages_drops_oldest_when_over_budget(): - from pantheon.utils.token_optimization import SnipConfig, snip_messages_to_budget - - # 5 user messages of ~1000 tokens each = ~5000 tokens, budget = 3000 - def big_msg(i): - return {"role": "user" if i % 2 == 0 else "assistant", "content": "x" * 4000, "id": str(i)} - - messages = [{"role": "system", "content": "sys"}] + [big_msg(i) for i in range(5)] - config = SnipConfig(enabled=True, token_budget=3000, keep_recent=2) - - result, freed = snip_messages_to_budget(messages, config=config) - - # System message always kept - assert result[0]["role"] == "system" - # Last 2 messages always kept (protected tail) - assert result[-1]["id"] == "4" - assert result[-2]["id"] == "3" - # Some old messages dropped - assert freed > 0 - total_after = sum(len(m.get("content", "")) // 4 for m in result) - assert total_after <= 3000 - - -def test_snip_messages_noop_when_under_budget(): - from pantheon.utils.token_optimization import SnipConfig, snip_messages_to_budget - - messages = [ - {"role": "system", "content": "sys"}, - {"role": "user", "content": "short"}, - {"role": "assistant", "content": "reply"}, - ] - config = SnipConfig(enabled=True, token_budget=100_000, 
keep_recent=2) - result, freed = snip_messages_to_budget(messages, config=config) - assert result is messages # unchanged - assert freed == 0 - - -def test_snip_messages_respects_keep_recent(): - from pantheon.utils.token_optimization import SnipConfig, snip_messages_to_budget - - messages = [{"role": "user", "content": "x" * 4000, "id": str(i)} for i in range(10)] - config = SnipConfig(enabled=True, token_budget=1, keep_recent=4) - result, freed = snip_messages_to_budget(messages, config=config) - kept_ids = [m["id"] for m in result] - # Last 4 must be in result - assert "6" in kept_ids - assert "7" in kept_ids - assert "8" in kept_ids - assert "9" in kept_ids - - -def test_snip_disabled_is_noop(): - from pantheon.utils.token_optimization import SnipConfig, snip_messages_to_budget - - messages = [{"role": "user", "content": "x" * 100_000}] - config = SnipConfig(enabled=False, token_budget=1, keep_recent=1) - result, freed = snip_messages_to_budget(messages, config=config) - assert result is messages - assert freed == 0 - - -def test_apply_token_optimizations_runs_snip_before_microcompact(tmp_path): - import time - from pantheon.utils.token_optimization import SnipConfig - - # Build messages that are over snip budget, with old timestamps → both snip and microcompact fire - old_ts = time.time() - 7200 - messages = [] - for i in range(8): - messages.append({ - "role": "assistant", "content": f"turn {i}", - "tool_calls": [{"id": f"c{i}", "function": {"name": "shell", "arguments": "{}"}}], - "timestamp": old_ts + i, - }) - messages.append({ - "role": "tool", "tool_call_id": f"c{i}", "tool_name": "shell", - "content": "x" * 4000, - }) - - snip_cfg = SnipConfig(enabled=True, token_budget=2000, keep_recent=2) - mc_cfg = TimeBasedMicrocompactConfig(enabled=True, gap_threshold_minutes=1, keep_recent=1) - - result = apply_token_optimizations( - messages, snip_config=snip_cfg, is_main_thread=True - ) - # Result must be smaller than input - before = sum(len(m.get("content", 
"")) for m in messages) - after = sum(len(m.get("content", "")) for m in result) - assert after < before - - -# --------------------------------------------------------------------------- -# Opt1 extension: per-tool threshold tests -# --------------------------------------------------------------------------- - -def test_per_tool_threshold_externalizes_oversized_single_result(tmp_path): - """Per-tool threshold is now enforced at process_tool_result() time. - Here we verify that process_tool_result uses get_per_tool_limit to - apply the correct per-tool threshold (read_file = 40K).""" - from pantheon.utils.token_optimization import PER_TOOL_RESULT_SIZE_CHARS - from pantheon.utils.llm import process_tool_result - - read_file_limit = PER_TOOL_RESULT_SIZE_CHARS["read_file"] - big_content = "x" * (read_file_limit + 1000) - - # process_tool_result with tool_name="read_file" should apply 40K limit - result = process_tool_result(big_content, max_length=50_000, tool_name="read_file") - # Result should be truncated since content exceeds read_file's 40K limit - assert len(result) < len(big_content) - - -def test_per_tool_threshold_keeps_small_result_intact(tmp_path): - """A shell result under its 50K per-tool limit should NOT be externalized.""" - small_content = "x" * 100 # well under any limit - - messages = [ - { - "role": "assistant", - "content": "running shell", - "tool_calls": [{"id": "sh-1", "function": {"name": "shell", "arguments": "{}"}}], - }, - {"role": "tool", "tool_call_id": "sh-1", "tool_name": "shell", "content": small_content}, - ] - - mem = Memory("per-tool-small-test") - result = apply_tool_result_budget(messages, memory=mem, base_dir=tmp_path) - - tool_msg = next(m for m in result if m.get("role") == "tool") - assert tool_msg["content"] == small_content - - -def test_per_tool_threshold_unknown_tool_uses_global_limit(tmp_path): - """An unknown tool falls back to the global per_message_limit.""" - # Content below global limit → should not be externalized - 
content = "y" * 1000 - messages = [ - { - "role": "assistant", - "content": "custom tool call", - "tool_calls": [{"id": "ct-1", "function": {"name": "my_custom_tool", "arguments": "{}"}}], - }, - {"role": "tool", "tool_call_id": "ct-1", "tool_name": "my_custom_tool", "content": content}, - ] - - mem = Memory("per-tool-unknown-test") - result = apply_tool_result_budget( - messages, memory=mem, base_dir=tmp_path, per_message_limit=200_000 - ) - tool_msg = next(m for m in result if m.get("role") == "tool") - assert tool_msg["content"] == content - - -# --------------------------------------------------------------------------- -# Opt5: forkContextMessages structured delegation tests -# --------------------------------------------------------------------------- - -def test_build_structured_fork_context_uses_cache_safe_messages(): - from unittest.mock import MagicMock - from pantheon.team.pantheon import _build_structured_fork_context - - run_context = MagicMock() - run_context.cache_safe_prompt_messages = [ - {"role": "system", "content": "sys"}, - {"role": "user", "content": "hello"}, - {"role": "assistant", "content": "hi"}, - ] - - result = _build_structured_fork_context(run_context) - - # System message must be stripped - assert all(m["role"] != "system" for m in result) - assert len(result) == 2 - assert result[0]["content"] == "hello" - assert result[1]["content"] == "hi" - - -def test_build_structured_fork_context_returns_none_when_no_memory(): - from unittest.mock import MagicMock - from pantheon.team.pantheon import _build_structured_fork_context - - run_context = MagicMock() - run_context.memory = None - run_context.cache_safe_prompt_messages = None - - assert _build_structured_fork_context(run_context) is None - - -def test_build_structured_fork_context_returns_none_for_empty_messages(): - from unittest.mock import MagicMock - from pantheon.team.pantheon import _build_structured_fork_context - - run_context = MagicMock() - 
run_context.cache_safe_prompt_messages = [ - {"role": "system", "content": "sys only"}, - ] - - # Only system message → nothing to forward - assert _build_structured_fork_context(run_context) is None - - -# --------------------------------------------------------------------------- -# New: empty result guard tests -# --------------------------------------------------------------------------- - -def test_guard_empty_tool_results_injects_placeholder(): - from pantheon.utils.token_optimization import ( - EMPTY_TOOL_RESULT_PLACEHOLDER, - guard_empty_tool_results, - ) - messages = [ - {"role": "tool", "tool_call_id": "t1", "content": ""}, - {"role": "tool", "tool_call_id": "t2", "content": " "}, - {"role": "tool", "tool_call_id": "t3", "content": "real output"}, - ] - result = guard_empty_tool_results(messages) - assert result[0]["content"] == EMPTY_TOOL_RESULT_PLACEHOLDER - assert result[1]["content"] == EMPTY_TOOL_RESULT_PLACEHOLDER - assert result[2]["content"] == "real output" - - -def test_empty_result_guard_runs_inside_apply_tool_result_budget(tmp_path): - """Empty tool results get the placeholder even inside the full budget pipeline.""" - from pantheon.utils.token_optimization import EMPTY_TOOL_RESULT_PLACEHOLDER - from pantheon.internal.memory import Memory - memory = Memory("empty-guard-test") - messages = [ - { - "role": "assistant", - "tool_calls": [{"id": "e1", "function": {"name": "shell"}}], - }, - {"role": "tool", "tool_call_id": "e1", "tool_name": "shell", "content": ""}, - ] - result = apply_tool_result_budget(messages, memory=memory, base_dir=tmp_path) - tool_msg = next(m for m in result if m.get("role") == "tool") - assert tool_msg["content"] == EMPTY_TOOL_RESULT_PLACEHOLDER - - -# --------------------------------------------------------------------------- -# New: DEFAULT_MAX_RESULT_SIZE_CHARS = 50K fallback tests -# --------------------------------------------------------------------------- - -def 
test_default_per_tool_limit_is_50k_for_unknown_tools(tmp_path): - """Unknown tools use DEFAULT_MAX_RESULT_SIZE_CHARS (50K) as their - per-tool limit, enforced at process_tool_result() time.""" - from pantheon.utils.token_optimization import get_per_tool_limit, DEFAULT_MAX_RESULT_SIZE_CHARS - from pantheon.utils.llm import process_tool_result - - # Verify the limit value - limit = get_per_tool_limit("unknown_tool", 200_000) - assert limit == DEFAULT_MAX_RESULT_SIZE_CHARS - - # Verify process_tool_result applies it - content = "x" * 60_000 - result = process_tool_result(content, max_length=200_000, tool_name="unknown_tool") - # 60K > 50K default limit, so should be truncated - assert len(result) < len(content) - - -def test_persistence_opt_out_prevents_externalization(tmp_path): - """Tools in PERSISTENCE_OPT_OUT_TOOLS are never externalized.""" - from pantheon.utils import token_optimization as tok - from pantheon.internal.memory import Memory - - original = tok.PERSISTENCE_OPT_OUT_TOOLS - try: - tok.PERSISTENCE_OPT_OUT_TOOLS = frozenset({"my_special_tool"}) - memory = Memory("opt-out-test") - messages = [ - { - "role": "assistant", - "tool_calls": [{"id": "o1", "function": {"name": "my_special_tool"}}], - }, - {"role": "tool", "tool_call_id": "o1", "tool_name": "my_special_tool", "content": "x" * 500_000}, - ] - result = apply_tool_result_budget(messages, memory=memory, base_dir=tmp_path) - tool_msg = next(m for m in result if m.get("role") == "tool") - # Should NOT be externalized despite being 500K - assert PERSISTED_OUTPUT_TAG not in tool_msg["content"] - assert len(tool_msg["content"]) == 500_000 - finally: - tok.PERSISTENCE_OPT_OUT_TOOLS = original - - -# --------------------------------------------------------------------------- -# New: JSON detection tests -# --------------------------------------------------------------------------- - -def test_persist_json_content_uses_json_extension(tmp_path): - from pantheon.utils.token_optimization import 
persist_tool_result - json_content = '[{"key": "value"}, {"key2": 123}]' - result = persist_tool_result(json_content, "json-tool-1", base_dir=tmp_path) - assert result["filepath"].endswith(".json") - - -def test_persist_non_json_content_uses_txt_extension(tmp_path): - from pantheon.utils.token_optimization import persist_tool_result - result = persist_tool_result("plain text content", "txt-tool-1", base_dir=tmp_path) - assert result["filepath"].endswith(".txt") - - -# --------------------------------------------------------------------------- -# New: contextCollapse tests -# --------------------------------------------------------------------------- - -def test_collapse_read_search_groups_folds_consecutive_reads(): - from pantheon.utils.token_optimization import collapse_read_search_groups - messages = [ - {"role": "user", "content": "Find the bug"}, - # Collapsible group: assistant(tool_calls) + 3 tool results with substantial content - { - "role": "assistant", - "tool_calls": [ - {"id": "g1", "function": {"name": "grep"}}, - {"id": "g2", "function": {"name": "read_file"}}, - {"id": "g3", "function": {"name": "glob"}}, - ], - }, - {"role": "tool", "tool_call_id": "g1", "tool_name": "grep", "content": "match result " * 500}, - {"role": "tool", "tool_call_id": "g2", "tool_name": "read_file", "content": "/src/main.py\n" + "code " * 1000}, - {"role": "tool", "tool_call_id": "g3", "tool_name": "glob", "content": "file_entry\n" * 200}, - # Non-collapsible: assistant with text output - {"role": "assistant", "content": "I found the issue in main.py"}, - ] - result, tokens_saved = collapse_read_search_groups(messages, min_group_size=3) - # Group of 4 (assistant + 3 tools) → collapsed to 1 - assert len(result) < len(messages) - collapsed = [m for m in result if m.get("_collapsed")] - assert len(collapsed) == 1 - assert "searched" in collapsed[0]["content"] - assert "read" in collapsed[0]["content"] - assert tokens_saved > 0 - - -def 
test_collapse_preserves_non_collapsible_messages(): - from pantheon.utils.token_optimization import collapse_read_search_groups - messages = [ - {"role": "user", "content": "Hello"}, - {"role": "assistant", "content": "I'll help you."}, - {"role": "user", "content": "Fix the bug"}, - ] - result, tokens_saved = collapse_read_search_groups(messages) - assert result == messages - assert tokens_saved == 0 - - -def test_collapse_skips_small_groups(): - from pantheon.utils.token_optimization import collapse_read_search_groups - messages = [ - { - "role": "assistant", - "tool_calls": [{"id": "s1", "function": {"name": "grep"}}], - }, - {"role": "tool", "tool_call_id": "s1", "tool_name": "grep", "content": "x"}, - ] - # Only 2 messages — below min_group_size=3 - result, tokens_saved = collapse_read_search_groups(messages, min_group_size=3) - assert result == messages - assert tokens_saved == 0 - - -def test_collapse_breaks_on_assistant_text_output(): - from pantheon.utils.token_optimization import collapse_read_search_groups - messages = [ - { - "role": "assistant", - "tool_calls": [{"id": "a1", "function": {"name": "grep"}}], - }, - {"role": "tool", "tool_call_id": "a1", "tool_name": "grep", "content": "x" * 1000}, - # Assistant with text — breaks the group - {"role": "assistant", "content": "I see the results."}, - { - "role": "assistant", - "tool_calls": [{"id": "a2", "function": {"name": "read_file"}}], - }, - {"role": "tool", "tool_call_id": "a2", "tool_name": "read_file", "content": "y" * 1000}, - ] - result, tokens_saved = collapse_read_search_groups(messages, min_group_size=3) - # No group reaches min_group_size because the text output breaks them - assert tokens_saved == 0 - - -# --------------------------------------------------------------------------- -# New: autocompact tests -# --------------------------------------------------------------------------- - -def test_autocompact_summarizes_when_over_budget(): - import asyncio - from 
pantheon.utils.token_optimization import autocompact_messages - # 20 big messages → way over 1000 token budget, no model → heuristic fallback - messages = [{"role": "user", "content": f"msg {i}: " + "x" * 4000} for i in range(20)] - result, freed, tracking = asyncio.run( - autocompact_messages(messages, token_budget=1000, keep_recent=4) - ) - assert freed > 0 - assert tracking.compacted is True - assert tracking.consecutive_failures == 0 - # Last 4 preserved - assert result[-1]["content"] == messages[-1]["content"] - assert result[-4]["content"] == messages[-4]["content"] - # First message is the autocompact summary wrapper - assert "continued from a previous conversation" in result[0]["content"] - - -def test_autocompact_noop_when_under_budget(): - import asyncio - from pantheon.utils.token_optimization import autocompact_messages - messages = [ - {"role": "user", "content": "short"}, - {"role": "assistant", "content": "reply"}, - ] - result, freed, tracking = asyncio.run( - autocompact_messages(messages, token_budget=100_000, keep_recent=4) - ) - assert result is messages - assert freed == 0 - - -def test_autocompact_circuit_breaker(): - """After MAX_CONSECUTIVE_AUTOCOMPACT_FAILURES, autocompact stops trying.""" - import asyncio - from pantheon.utils.token_optimization import ( - AutocompactTrackingState, - MAX_CONSECUTIVE_AUTOCOMPACT_FAILURES, - autocompact_messages, - ) - messages = [{"role": "user", "content": "x" * 40_000} for _ in range(10)] - tracking = AutocompactTrackingState( - consecutive_failures=MAX_CONSECUTIVE_AUTOCOMPACT_FAILURES - ) - result, freed, new_tracking = asyncio.run( - autocompact_messages(messages, token_budget=1000, tracking=tracking) - ) - # Circuit breaker tripped — no compaction despite being over budget - assert result is messages - assert freed == 0 - - -def test_autocompact_recursion_guard(): - """Autocompact must not fire for compact/session_memory query sources.""" - import asyncio - from pantheon.utils.token_optimization import 
autocompact_messages - messages = [{"role": "user", "content": "x" * 40_000} for _ in range(10)] - for src in ("compact", "session_memory", "agent_summary"): - result, freed, _ = asyncio.run( - autocompact_messages(messages, token_budget=1000, query_source=src) - ) - assert result is messages, f"Should skip for query_source={src}" - - -# --------------------------------------------------------------------------- -# New: skip_cache_write tests -# --------------------------------------------------------------------------- - -# --------------------------------------------------------------------------- -# New: querySource filtering tests -# --------------------------------------------------------------------------- - -def test_query_source_agent_summary_does_not_persist(tmp_path): - """agent_summary query source must NOT write new disk entries.""" - from pantheon.internal.memory import Memory - memory = Memory("qs-test") - messages = [ - { - "role": "assistant", - "tool_calls": [{"id": "qs1", "function": {"name": "shell"}}], - }, - {"role": "tool", "tool_call_id": "qs1", "tool_name": "shell", "content": "x" * 100_000}, - ] - result = apply_tool_result_budget( - messages, memory=memory, base_dir=tmp_path, query_source="agent_summary" - ) - tool_msg = next(m for m in result if m.get("role") == "tool") - # Should NOT be externalized — agent_summary is a skip source - assert PERSISTED_OUTPUT_TAG not in tool_msg["content"] - - -def test_query_source_main_thread_persists(tmp_path): - """Main thread query source SHOULD persist to disk when aggregate limit - is exceeded. 
Per-tool threshold enforcement is now at process_tool_result().""" - from pantheon.internal.memory import Memory - memory = Memory("qs-main-test") - # Use 3 × 80K = 240K to exceed aggregate 200K limit - messages = [ - { - "role": "assistant", - "tool_calls": [ - {"id": "qs2", "function": {"name": "shell"}}, - {"id": "qs3", "function": {"name": "shell"}}, - {"id": "qs4", "function": {"name": "shell"}}, - ], - }, - {"role": "tool", "tool_call_id": "qs2", "tool_name": "shell", "content": "x" * 80_000}, - {"role": "tool", "tool_call_id": "qs3", "tool_name": "shell", "content": "y" * 80_000}, - {"role": "tool", "tool_call_id": "qs4", "tool_name": "shell", "content": "z" * 80_000}, - ] - result = apply_tool_result_budget( - messages, memory=memory, base_dir=tmp_path, query_source="repl_main_thread" - ) - tool_msgs = [m for m in result if m.get("role") == "tool"] - persisted = [m for m in tool_msgs if PERSISTED_OUTPUT_TAG in m["content"]] - assert len(persisted) >= 1 - - -# --------------------------------------------------------------------------- -# New: session resume state reconstruction tests -# --------------------------------------------------------------------------- - -def test_reconstruct_state_from_existing_messages(): - from pantheon.utils.token_optimization import reconstruct_content_replacement_state - messages = [ - {"role": "user", "content": "hello"}, - {"role": "tool", "tool_call_id": "r1", "content": "\nOutput too large (10KB). 
Full output saved to: /tmp/r1.txt\n\nPreview:\nxxx\n"}, - {"role": "tool", "tool_call_id": "r2", "content": "[Old tool result content cleared]"}, - {"role": "tool", "tool_call_id": "r3", "content": "normal content"}, - ] - state = reconstruct_content_replacement_state(messages) - assert "r1" in state.seen_ids - assert "r1" in state.replacements # persisted → replacement recorded - assert "r2" in state.seen_ids # cleared → seen - assert "r2" not in state.replacements # cleared → no replacement - assert "r3" not in state.seen_ids # normal → not tracked - - -# --------------------------------------------------------------------------- -# New: CC-identical token estimation tests -# --------------------------------------------------------------------------- - -def test_estimate_tokens_image_block(): - from pantheon.utils.token_optimization import _estimate_message_tokens, IMAGE_MAX_TOKEN_SIZE - msg = {"role": "user", "content": [ - {"type": "text", "text": "Look at this:"}, - {"type": "image", "source": {"data": "base64data"}}, - ]} - tokens = _estimate_message_tokens(msg) - # Text (~4 tokens) + IMAGE_MAX_TOKEN_SIZE (2000) + padding - assert tokens >= IMAGE_MAX_TOKEN_SIZE - - -def test_estimate_tokens_tool_result_block(): - from pantheon.utils.token_optimization import _estimate_message_tokens - msg = {"role": "user", "content": [ - {"type": "tool_result", "tool_use_id": "t1", "content": "x" * 4000}, - ]} - tokens = _estimate_message_tokens(msg) - # 4000 chars / 4 bytes = 1000 tokens * 4/3 padding = ~1333 - assert tokens >= 1000 - - -def test_estimate_tokens_tool_use_block(): - from pantheon.utils.token_optimization import _estimate_message_tokens - msg = {"role": "assistant", "content": [ - {"type": "tool_use", "name": "grep", "input": {"query": "test"}}, - ]} - tokens = _estimate_message_tokens(msg) - assert tokens > 0 - - -def test_inject_cache_control_skip_cache_write_marks_second_to_last(): - messages = [ - {"role": "system", "content": "sys"}, - {"role": "user", 
"content": "first question"}, - {"role": "assistant", "content": "first answer"}, - {"role": "user", "content": "fork directive"}, - ] - result = inject_cache_control_markers(messages, skip_cache_write=True) - # With skip_cache_write, the marker goes on the SECOND-to-last user/assistant - # which is the assistant "first answer" - asst_content = result[2]["content"] - assert isinstance(asst_content, list) - assert asst_content[-1].get("cache_control") == {"type": "ephemeral"} - # The last user message should NOT have cache_control - last_user = result[3]["content"] - if isinstance(last_user, list): - assert all("cache_control" not in b for b in last_user) - - -def test_inject_cache_control_normal_marks_last(): - messages = [ - {"role": "system", "content": "sys"}, - {"role": "user", "content": "question"}, - {"role": "assistant", "content": "answer"}, - {"role": "user", "content": "followup"}, - ] - result = inject_cache_control_markers(messages, skip_cache_write=False) - # Normal mode: last user/assistant gets the marker - last_user = result[3]["content"] - assert isinstance(last_user, list) - assert last_user[-1].get("cache_control") == {"type": "ephemeral"} - - -# --------------------------------------------------------------------------- -# New: full pipeline integration test with all 5 stages -# --------------------------------------------------------------------------- - -def test_apply_token_optimizations_runs_all_stages(tmp_path): - """Integration: budget → snip → microcompact → collapse (sync 4-stage).""" - import time - from pantheon.utils.token_optimization import SnipConfig - from pantheon.internal.memory import Memory - - memory = Memory("pipeline-test") - old_ts = time.time() - 7200 # 2 hours ago - - # Build large conversation with collapsible groups - messages = [] - for i in range(15): - messages.append({ - "role": "assistant", - "tool_calls": [ - {"id": f"c{i}a", "function": {"name": "grep"}}, - {"id": f"c{i}b", "function": {"name": "read_file"}}, - 
], - "timestamp": old_ts + i, - }) - messages.append({ - "role": "tool", "tool_call_id": f"c{i}a", - "tool_name": "grep", "content": "match " * 5000, - }) - messages.append({ - "role": "tool", "tool_call_id": f"c{i}b", - "tool_name": "read_file", "content": "file content " * 5000, - }) - - messages.append({"role": "user", "content": "What did you find?"}) - - before = estimate_total_tokens_from_chars(messages) - result = apply_token_optimizations( - messages, - memory=memory, - base_dir=tmp_path, - is_main_thread=True, - snip_config=SnipConfig(enabled=True, token_budget=10_000, keep_recent=4), - ) - after = estimate_total_tokens_from_chars(result) - - # Massive reduction from stages 1-4 combined - assert after < before - savings_pct = (1 - after / before) * 100 - assert savings_pct > 50, f"Expected >50% savings, got {savings_pct:.1f}%" - - -def test_apply_token_optimizations_async_runs_full_pipeline(tmp_path): - """Integration: full 5-stage async pipeline with heuristic autocompact.""" - import asyncio - import time - from pantheon.utils.token_optimization import ( - SnipConfig, - apply_token_optimizations_async, - ) - from pantheon.internal.memory import Memory - - memory = Memory("async-pipeline-test") - old_ts = time.time() - 7200 - - messages = [] - for i in range(20): - messages.append({ - "role": "assistant", - "tool_calls": [{"id": f"d{i}", "function": {"name": "shell"}}], - "timestamp": old_ts + i, - }) - messages.append({ - "role": "tool", "tool_call_id": f"d{i}", - "tool_name": "shell", "content": "output " * 5000, - }) - messages.append({"role": "user", "content": "done?"}) - - before = estimate_total_tokens_from_chars(messages) - result, tracking = asyncio.run( - apply_token_optimizations_async( - messages, - memory=memory, - base_dir=tmp_path, - is_main_thread=True, - snip_config=SnipConfig(enabled=True, token_budget=5_000, keep_recent=4), - autocompact_model=None, # heuristic fallback - ) - ) - after = estimate_total_tokens_from_chars(result) - assert 
after < before - - -# --------------------------------------------------------------------------- -# Adaptation tests: Layer 2 ↔ Layer 3 integration -# --------------------------------------------------------------------------- - - -def test_process_tool_result_uses_per_tool_limit_for_grep(): - """process_tool_result should apply grep's 20K limit, not the 50K global.""" - from pantheon.utils.llm import process_tool_result - - content = "x" * 30_000 # > 20K (grep limit) but < 50K (global) - result = process_tool_result(content, max_length=50_000, tool_name="grep") - assert len(result) < 30_000, "grep output above 20K should be truncated" - - -def test_process_tool_result_uses_global_for_unknown_tool(): - """Unknown tools fall back to min(DEFAULT_MAX_RESULT_SIZE_CHARS, global_limit).""" - from pantheon.utils.llm import process_tool_result - - content = "x" * 40_000 # < 50K - result = process_tool_result(content, max_length=50_000, tool_name="my_custom_tool") - assert len(result) == 40_000, "40K < 50K fallback, should pass through" - - -def test_process_tool_result_truncated_field_skips_base64_but_not_length(): - """Tools with 'truncated' field skip base64 filtering but per-tool - length limits are ALWAYS applied (P0 fix: no more bypass).""" - from pantheon.utils.llm import process_tool_result - - # read_file limit is 40K; content is 45K with 'truncated' flag. - # Old behavior: trusted → pass through. New behavior: still externalized. 
- result = {"content": "x" * 45_000, "truncated": True} - output = process_tool_result(result, max_length=50_000, tool_name="read_file") - # 45K > read_file's 40K limit → should be externalized - assert len(output) < 45_000, "truncated field should NOT bypass per-tool limits" - - -def test_unified_format_recognized_by_is_already_externalized(): - """Content produced by smart_truncate_result (Layer 2) should be - recognized by _is_already_externalized (Layer 3) to avoid double processing.""" - from pantheon.utils.token_optimization import _is_already_externalized - from pantheon.utils.truncate import PERSISTED_OUTPUT_TAG, PERSISTED_OUTPUT_CLOSING_TAG - - # Simulate Layer 2 output - layer2_output = ( - f"{PERSISTED_OUTPUT_TAG}\n" - f"Output too large (100.0KB). Full output saved to: /tmp/test.json\n\n" - f"Preview (first 2.0KB):\nsome preview content\n" - f"{PERSISTED_OUTPUT_CLOSING_TAG}" - ) - assert _is_already_externalized(layer2_output), ( - "Layer 2's format must be recognized by Layer 3" - ) - - -def test_stage1_skips_already_externalized_by_layer2(tmp_path): - """If Layer 2 already externalized content, Stage 1 should not re-process it.""" - from pantheon.utils.truncate import PERSISTED_OUTPUT_TAG, PERSISTED_OUTPUT_CLOSING_TAG - - externalized_content = ( - f"{PERSISTED_OUTPUT_TAG}\n" - f"Output too large (100.0KB). 
Full output saved to: /tmp/test.json\n\n" - f"Preview (first 2.0KB):\npreview\n" - f"{PERSISTED_OUTPUT_CLOSING_TAG}" - ) - messages = [ - { - "role": "assistant", - "tool_calls": [{"id": "t1", "function": {"name": "shell"}}], - }, - {"role": "tool", "tool_call_id": "t1", "tool_name": "shell", - "content": externalized_content}, - ] - mem = Memory("skip-test") - result = apply_tool_result_budget(messages, memory=mem, base_dir=tmp_path) - tool_msg = next(m for m in result if m.get("role") == "tool") - # Content should be unchanged — already externalized - assert tool_msg["content"] == externalized_content - - -def test_per_tool_limit_values(): - """Verify per-tool limits match the expected CC-aligned values.""" - from pantheon.utils.token_optimization import get_per_tool_limit - - assert get_per_tool_limit("grep", 200_000) == 20_000 - assert get_per_tool_limit("read_file", 200_000) == 40_000 - assert get_per_tool_limit("shell", 200_000) == 50_000 - assert get_per_tool_limit("bash", 200_000) == 50_000 - assert get_per_tool_limit("glob", 200_000) == 10_000 - assert get_per_tool_limit("web_fetch", 200_000) == 30_000 - # Unknown tool: min(DEFAULT_MAX_RESULT_SIZE_CHARS=50K, global_limit) - assert get_per_tool_limit("unknown", 200_000) == 50_000 - assert get_per_tool_limit("unknown", 30_000) == 30_000 - # MCP-prefixed tool name normalization - assert get_per_tool_limit("mcp__server__grep", 200_000) == 20_000 - - -def test_full_pipeline_layer2_then_layer3(tmp_path): - """End-to-end: Layer 2 externalizes large content, Layer 3 preserves it.""" - from pantheon.utils.llm import process_tool_result - - # Simulate Layer 2: large grep output gets externalized - big_output = "x" * 25_000 # > grep's 20K limit - layer2_result = process_tool_result(big_output, max_length=50_000, tool_name="grep") - - # Build message as agent.py would - messages = [ - {"role": "system", "content": "You are an assistant."}, - {"role": "user", "content": "search for pattern"}, - { - "role": "assistant", 
- "tool_calls": [{"id": "g1", "function": {"name": "grep"}}], - }, - {"role": "tool", "tool_call_id": "g1", "tool_name": "grep", - "content": layer2_result}, - ] - - # Layer 3: build_llm_view should preserve the externalized content - mem = Memory("e2e-test") - view = build_llm_view(messages, memory=mem, base_dir=tmp_path) - - tool_msg = next(m for m in view if m.get("role") == "tool") - # Should still be externalized (not re-expanded) - assert len(tool_msg["content"]) < 25_000 diff --git a/tests/test_truncate.py b/tests/test_truncate.py index b1f6ead8..88cfb22f 100644 --- a/tests/test_truncate.py +++ b/tests/test_truncate.py @@ -132,24 +132,23 @@ def test_format_consistency(): def test_smart_truncate_with_truncated_field(): - """Test tools with 'truncated' field: skip base64 filter, but - length limits are still applied (per-tool thresholds are always enforced).""" - # Simulate read_file/shell output (small enough to be under limit) + """Test tools with 'truncated' field skip base64 filter and length limits.""" + # Simulate read_file/shell output result = { "content": "file content here", "truncated": False, "path": "/some/path", } - + output = smart_truncate_result(result, max_length=100) - + # Should be JSON formatted (unified format) parsed = json.loads(output) assert parsed["content"] == "file content here" assert parsed["truncated"] == False assert parsed["path"] == "/some/path" - - # Small content under limit → passes through as-is + + # Should NOT apply length limits (trust tool's truncation) assert isinstance(output, str) print("✓ test_smart_truncate_with_truncated_field passed") @@ -185,9 +184,9 @@ def test_smart_truncate_path2_oversized(): temp_dir=tmpdir ) - # Should indicate truncation (unified format) - assert "Full output saved to:" in output - assert "" in output + # Should indicate truncation + assert "[truncated" in output + assert "Full content saved to:" in output # Should contain preview assert "small_field" in output @@ -210,10 +209,10 @@ def 
test_smart_truncate_non_dict(): result1 = smart_truncate_result("simple string", max_length=100) assert result1 == "simple string" - # Large string - when saved to file, uses format + # Large string - allow more tolerance result2 = smart_truncate_result("x" * 1000, max_length=100) - # May be truncated inline or saved to file with persisted-output wrapper - assert "[truncated" in result2 or "" in result2 + assert len(result2) <= 200 # More tolerance for suffix + assert "[truncated" in result2 print("✓ test_smart_truncate_non_dict passed")