You asked your LLM to check a web page. It saved a DOM dump that blows up its context window. 😱
```shell
❯ ls -alh kapture-dom-page-save.html
-rw-r--r--  1 yourname  staff  1.8M Mar  6 13:06 kapture-dom-page-save.html
```

😤 What now? Use the kapture-dom-erode skill: read the visible text from multi-megabyte Kapture snapshots without drowning in `<div>` soup.
Your agent says "I can see the page" but what it actually has is millions of characters of nested HTML.
Text that sits close together on screen is, to the LLM, a mile apart, buried inside 47 wrapper tags.
- 🤯 Kapture DOM dumps are massive -- 1-5 MB of framework-generated tag soup
- 🔍 Text is buried -- "Bake at 180°C" lives at `html[0].body[0].div[6].div[2].div[1].div[4].div[0].p[0]`
- 💸 Context windows cry -- feeding raw HTML to your LLM is not ideal
- ⚠️ Screenshots aren't parseable -- you can see it, but the agent can't read it
Four commands. Find what you need. Erode the tags. Done.
```shell
# Find WHERE text lives in the DOM
./tools.sh gron-grep -f page.html -q "Bake at 180"

# Auto-extract the page's main body text in one shot
./tools.sh main-text -f page.html

# Rank all block elements by size, auto-detect main content region
./tools.sh top-content -f page.html

# Strip tags below a known path, return visible text
./tools.sh extract-text -f page.html -p "html[0].body[0].div[3]"
```

top-content scans every block element, counts visible characters, and prints the top N by size with previews. It also auto-detects the most likely "real content" region by spotting the alternating heading + body pattern.
```shell
./tools.sh top-content -f page.html
```

```text
Path                                    Chars   Words   Preview
---------------------------------------------------------------
html[0].body[0].main[0]                 18432   2901    Roasted Vegetable Tart Serves 4 Ready in 55 minutes...
html[0].body[0].main[0].div[0]          18201   2870    Roasted Vegetable Tart Serves 4 Ready in 55 minutes...
html[0].body[0].main[0].div[0].div[1]    9823   1544    Ingredients 2 sheets shortcrust pastry 3 courgettes...

--- Content region detection ---
Best candidate: html[0].body[0].main[0].div[0].div[1]
Score: 14 (headings=4, blocks=6)
```

Options: `--top N` (default 10), `--no-detect` (skip auto-detection).
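The ranking idea behind top-content can be sketched in a few lines. This is a stdlib-only illustration, not the tool's implementation (the real tool uses BeautifulSoup via uv); the `BLOCKS` set and the `rank` function name are assumptions for the sketch.

```python
# Sketch: total the visible characters under each block element and sort.
# Assumes well-formed markup; BLOCKS is an assumed set of block-level tags.
from html.parser import HTMLParser

BLOCKS = {"div", "main", "section", "article", "p"}

class Ranker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.open = []   # [tag, char_count, preview] for each open block
        self.done = []   # (tag, chars, preview) for each closed block

    def handle_starttag(self, tag, attrs):
        if tag in BLOCKS:
            self.open.append([tag, 0, ""])

    def handle_endtag(self, tag):
        if tag in BLOCKS and self.open:
            t, n, preview = self.open.pop()
            self.done.append((t, n, preview[:40]))

    def handle_data(self, data):
        text = " ".join(data.split())
        if text:
            # credit the text to every open block ancestor
            for entry in self.open:
                entry[1] += len(text)
                entry[2] += text + " "

def rank(html, top=10):
    r = Ranker()
    r.feed(html)
    return sorted(r.done, key=lambda e: -e[1])[:top]
```

For example, `rank("<main><p>Roasted Vegetable Tart</p><p>Serves 4</p></main>")` puts `main` first, since it accumulates the character counts of both paragraphs beneath it.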
gron-grep searches for any text string and returns its exact coordinates in the HTML tree.

```shell
./tools.sh gron-grep -f page.html -q "On Sale"
# html[0].body[0].div[3].div[1].div[2].span[0] = "On Sale"
# html[0].body[0].div[3].div[5].div[2].span[0] = "On Sale"
```

Options: `-q <text>`, `-i` (case-insensitive).
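The core of gron-grep is tracking a gron-style path while walking the tree, then reporting the path of any text node that contains the query. A stdlib-only sketch of that idea, under the assumption of well-formed markup (the real tool uses BeautifulSoup, and the `gron_grep` function name here is invented for illustration):

```python
# Sketch: walk the DOM, maintain a gron-style path (tag[index] segments),
# and record the path of every text node containing the query.
from html.parser import HTMLParser

class PathFinder(HTMLParser):
    def __init__(self, query, ignore_case=False):
        super().__init__()
        self.query = query.lower() if ignore_case else query
        self.ignore_case = ignore_case
        self.stack = []       # open-tag path segments, e.g. "div[3]"
        self.children = [{}]  # per-level counters: nth occurrence of each tag
        self.hits = []        # (path, text) pairs

    def handle_starttag(self, tag, attrs):
        idx = self.children[-1].get(tag, 0)
        self.children[-1][tag] = idx + 1
        self.stack.append(f"{tag}[{idx}]")
        self.children.append({})

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()
            self.children.pop()

    def handle_data(self, data):
        hay = data.lower() if self.ignore_case else data
        if self.query in hay and data.strip():
            self.hits.append((".".join(self.stack), data.strip()))

def gron_grep(html, query, ignore_case=False):
    p = PathFinder(query, ignore_case)
    p.feed(html)
    return p.hits
```

Note how the per-level counters produce `div[0]`, `div[1]`, ... for sibling tags, which is exactly what makes two "On Sale" badges distinguishable by path.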
extract-text strips away all markup and returns the visible text of any subtree, with reading order preserved.

```shell
./tools.sh extract-text -f page.html -p "html[0].body[0].div[3]"
# On Sale
# Wireless Headphones -- Was $99, now $49
# USB-C Hub -- Was $45, now $29
```

main-text is the fastest path to clean page text. It finds the two largest content blocks, computes their common ancestor, and erodes everything from that ancestor. Because the two biggest text blocks are almost always in the main body, their common ancestor joins all the siblings together, capturing everything above, between, and below them while excluding nav chrome.
```shell
./tools.sh main-text -f page.html
```

Output (diagnostic lines + extracted text):

```text
Top-1: html[0].body[0].main[0].div[0].section[1]
Top-2: html[0].body[0].main[0].div[0].section[2]
Common ancestor: html[0].body[0].main[0].div[0]

Roasted Vegetable Tart
Serves 4 · Ready in 55 minutes

Ingredients
2 sheets shortcrust pastry
3 courgettes, sliced
...
```
Use `--quiet` / `-q` to suppress the diagnostic lines and get only the text.
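The common-ancestor step is just a longest-shared-prefix over gron-style path segments. A minimal sketch, using the paths from the output above (the `common_ancestor` function name is invented for illustration):

```python
def common_ancestor(path_a, path_b):
    """Longest shared prefix of two gron-style paths, segment by segment."""
    shared = []
    for a, b in zip(path_a.split("."), path_b.split(".")):
        if a != b:
            break
        shared.append(a)
    return ".".join(shared)

top1 = "html[0].body[0].main[0].div[0].section[1]"
top2 = "html[0].body[0].main[0].div[0].section[2]"
print(common_ancestor(top1, top2))
# → html[0].body[0].main[0].div[0]
```

Comparing whole segments (not characters) matters: a character-wise prefix of `section[1]` and `section[2]` would wrongly include the partial segment `section[`.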
You see two products with "Limited Time Deal" badges. You want all of them.
```shell
# Find both
./tools.sh gron-grep -f shop.html -q "Limited Time Deal"
# html[0].body[0].div[2].div[1].div[0].div[3].span[0]
# html[0].body[0].div[2].div[3].div[0].div[3].span[0]

# Both share ancestor div[2], so extract everything from there
./tools.sh extract-text -f shop.html -p "html[0].body[0].div[2]"
```

Or skip the manual steps entirely:
```shell
./tools.sh main-text -f shop.html
```

`-f` accepts either:

- The full JSON object returned by `kapture_dom` (recommended); the tool reads the `"html"` key by default
- A raw `.html` file

Override the JSON key with `-k` / `--content-key` if your dump uses a different field name.
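The input handling described above can be sketched as "try JSON first, fall back to raw HTML". This is an illustrative stdlib-only sketch, not the tool's code; the `load_dom` function name is invented, while the `"html"` default key matches the README.

```python
import json

def load_dom(text, content_key="html"):
    """Return DOM markup from either a kapture_dom JSON dump or raw HTML."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return text                      # not JSON: treat as a raw .html file
    if isinstance(obj, dict) and content_key in obj:
        return obj[content_key]          # JSON dump: pull the content key
    return text                          # JSON, but not the expected shape

print(load_dom('{"html": "<p>hi</p>"}'))   # → <p>hi</p>
print(load_dom("<p>hi</p>"))               # → <p>hi</p>
```

Passing `content_key="dom"` would mirror what `-k dom` does for a dump whose markup lives under a different field name.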
```text
SYNOPSIS
    ./tools.sh main-text     -f <file> [-q]
    ./tools.sh top-content   -f <file> [--top N] [--no-detect]
    ./tools.sh gron-grep     -f <file> -q <text> [-i]
    ./tools.sh extract-text  -f <file> -p <path>

OPTIONS (all commands)
    -f, --file           Input file (kapture_dom JSON or raw HTML)
    -k, --content-key    JSON key for DOM content (default: html)

main-text
    -q, --quiet          Output extracted text only, no diagnostics

top-content
    --top N              Show top N results (default: 10)
    --no-detect          Skip content region auto-detection

gron-grep
    -q, --query          Text to search for
    -i                   Case-insensitive search

extract-text
    -p, --path           Gron-style path to extract from
```
```shell
# OpenCode
cp -r kapture-dom-erode ~/.config/opencode/skills/

# Claude Code
cp -r kapture-dom-erode ~/.claude/skills/
```

Dependencies: Python + BeautifulSoup, handled via uv.
MIT — Copyright (c) 2026 Simon Massey