Closed
Description
Why
Need clean text for search and reading. Raw HTML is noisy.
Description of Done
- Extractor returns a structured result: title, site name, byline if present, main text, language code guess, and cleaned HTML
- Boilerplate, navigation, and scripts are removed
- Relative links are resolved to absolute URLs against the page URL
- Text is whitespace-normalized
- Unit tests cover common layouts and malformed documents
- Fuzz tests run the extractor against random inputs without panics
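A minimal Python sketch of the result shape, whitespace normalization, and link resolution described above. The issue does not name an implementation language or any identifiers, so `ExtractResult`, its field names, `normalize_whitespace`, and `resolve_link` are all hypothetical; only the standard library is used.

```python
import re
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urljoin


@dataclass
class ExtractResult:
    # Hypothetical field names that mirror the list in this issue.
    title: str
    site_name: str
    byline: Optional[str]  # None when the page has no byline
    text: str              # main text, whitespace-normalized
    language: str          # language code guess, or "unknown"
    html: str              # cleaned HTML


def normalize_whitespace(text: str) -> str:
    """Collapse each run of whitespace into one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


def resolve_link(page_url: str, href: str) -> str:
    """Resolve a possibly relative href against the URL of the page."""
    return urljoin(page_url, href)
```

For example, `resolve_link("https://example.com/a/b.html", "../img/x.png")` gives `https://example.com/img/x.png`, while an href that is already absolute is returned unchanged.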
Tasks
- Implement a readable-content heuristic using a document parser
- Normalize nodes: remove scripts, styles, and tracking elements
- Resolve relative links and images to absolute forms
- Add language detection with a fallback to unknown
- Add a minimum content length check to reject empty pages
- Write unit and property tests; seed with sample fixtures
- Add a fuzz target for extractor inputs
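The node normalization and minimum-length tasks could be sketched as follows, using Python's standard `html.parser`. This is an illustration under stated assumptions, not the planned implementation: `TextCollector`, `extract_text`, `SKIPPED_TAGS`, and the 200-character threshold are hypothetical, and tracking elements and the readable-content heuristic are not covered.

```python
from html.parser import HTMLParser
from typing import List, Optional

# Assumption: tags whose contents never count as readable text.
SKIPPED_TAGS = {"script", "style", "noscript"}
# Assumption: pages with less text than this are treated as empty.
MIN_TEXT_CHARS = 200


class TextCollector(HTMLParser):
    """Collect text nodes, ignoring everything inside SKIPPED_TAGS."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def extract_text(html: str, min_chars: int = MIN_TEXT_CHARS) -> Optional[str]:
    """Return whitespace-normalized text, or None if the page is too short."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.chunks).split())
    return text if len(text) >= min_chars else None
```

Returning `None` for short pages keeps the "reject empty pages" decision in one place; a real extractor might prefer a dedicated error type so callers can tell an empty page from a parse failure.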
Metadata
Assignees
Labels
enhancement