feat: Cleanup HTML page to reduce token usage #1073
Open
mguella wants to merge 2 commits into ItzCrazyKns:master
1 issue found across 3 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="src/lib/agents/search/researcher/actions/scrapeURL.ts">
<violation number="1" location="src/lib/agents/search/researcher/actions/scrapeURL.ts:53">
P2: Comment-stripping regex only matches whitespace/dot comments, so most HTML comments remain and the cleanup fails to remove common comments.</violation>
</file>
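The regex under review is not reproduced in this thread, so the following is only a sketch of one pattern that would behave as described, next to a broader alternative. Both patterns, the `stripComments` helper, and the sample string are hypothetical and not taken from `scrapeURL.ts`.

```typescript
// Hypothetical patterns for illustration; the actual regex in scrapeURL.ts
// is not shown in this thread.

// Inside a character class, "." is a literal dot, so this only matches
// comments whose body consists solely of whitespace and dots.
const narrowComment = /<!--[\s.]*-->/g;

// [\s\S] matches any character, including newlines; *? is lazy, so each
// match stops at the first closing "-->".
const anyComment = /<!--[\s\S]*?-->/g;

function stripComments(html: string, pattern: RegExp): string {
  return html.replace(pattern, "");
}

const sample = "<p>a</p><!-- nav start --><p>b</p><!-- ... -->";

// Only the dots-and-whitespace comment is removed.
const narrowResult = stripComments(sample, narrowComment);

// Both comments are removed.
const broadResult = stripComments(sample, anyComment);

console.log(narrowResult);
console.log(broadResult);
```

Since the PR already parses the page with `jsdom`, walking the DOM for comment nodes would be another option; the regex form is shown here only because it is what the review comment refers to.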
Issue
As reported in #1031, some pages consume too many tokens.
Cause
This happens because the fetched page is converted directly to Markdown, so the output includes data we don't care about (e.g. comments, styles, scripts).
Solution
Another PR, #1035, tries to reduce the number of tokens sent to the LLM by truncating the HTML content.
However, that approach risks deleting data we care about, because it truncates at a fixed point and HTML pages tend to place style and script before the page content. With that approach, the resulting text might contain only script and style and none of the body or main page content.
This PR takes a different approach: clean up the HTML by removing things the LLM doesn't need, such as comments, script tags and style tags, so we can limit token usage.
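The difference between the two approaches can be seen on a toy page. This is a minimal sketch: the sample page, the `LIMIT` budget, and the `cleanup` helper are all made up for illustration, and the regexes stand in for the DOM-based cleanup only to keep the example dependency-free.

```typescript
// Illustrative only: regexes are used here just to keep the sketch
// self-contained. They are not the PR's implementation.
const page =
  "<html><head><style>" + "a{color:red}".repeat(50) + "</style>" +
  "<script>" + "console.log(1);".repeat(50) + "</script></head>" +
  "<body><main>Main article text</main></body></html>";

const LIMIT = 200; // hypothetical character budget

// Approach 1: truncate the raw HTML at a fixed point.
const truncated = page.slice(0, LIMIT);

// Approach 2: drop <script>, <style> and comments first, then apply the
// same limit.
function cleanup(html: string): string {
  return html
    .replace(/<script\b[\s\S]*?<\/script>/gi, "")
    .replace(/<style\b[\s\S]*?<\/style>/gi, "")
    .replace(/<!--[\s\S]*?-->/g, "");
}
const cleanedThenTruncated = cleanup(page).slice(0, LIMIT);

const keptByTruncation = truncated.includes("Main article text");
const keptByCleanup = cleanedThenTruncated.includes("Main article text");

console.log(keptByTruncation); // false: the first 200 characters are all CSS
console.log(keptByCleanup);    // true: the body survives once style/script are gone
```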
Next steps
If the approach from this PR is not enough, we could parse the page with Mozilla's Readability.js to keep only the main page content.
If that is also not enough, we can combine both approaches above (HTML cleanup + Readability.js) with the truncation approach from #1035.
Summary by cubic
Clean up HTML pages before converting to Markdown to lower token usage while keeping the main content. We strip comments, scripts, styles, templates, and extra whitespace when the response is HTML.
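A check for "the response is HTML" could look like the sketch below. The `isHtmlContentType` helper is hypothetical (its name and signature are not from the PR); it assumes the header value may carry parameters such as a charset and may differ in letter case.

```typescript
// Hypothetical helper; the name and signature are not from the PR.
function isHtmlContentType(contentType: string | null): boolean {
  if (!contentType) return false;
  // Drop parameters such as "; charset=utf-8" and compare case-insensitively.
  const [mediaType = ""] = contentType.split(";");
  return mediaType.trim().toLowerCase() === "text/html";
}

console.log(isHtmlContentType("text/html; charset=utf-8"));
console.log(isHtmlContentType("application/json"));
```

With the standard `fetch` API, the argument would typically come from `response.headers.get("content-type")`, which returns `null` when the header is absent.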
New Features
- Cleans up HTML in scrapeURL.ts using jsdom when Content-Type is text/html.
- Removes script, style, and template tags before Markdown conversion.
Bug Fixes
Written for commit 5be0f5b. Summary will update on new commits.