Content Extractor and Normalizer #22

@charlieroth

Why

Need clean text for search and reading. Raw HTML is noisy.

Definition of Done

  • Extractor returns a structured result: title, site name, byline if present, main text, language code guess, and cleaned HTML
  • Boilerplate, navigation, and scripts are removed
  • Relative links are resolved to absolute URLs against the page URL
  • Text is whitespace-normalized
  • Unit tests cover common layouts and malformed documents
  • Fuzz tests run the extractor against random inputs without panics
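
To make the result shape and two of the criteria above concrete, here is a minimal sketch in Python. The issue does not say which language the project uses, so the language, the `ExtractResult` name, its field names, and both helper functions are assumptions made for illustration only.

```python
import re
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urljoin


@dataclass
class ExtractResult:
    """Hypothetical result type mirroring the fields listed above."""
    title: str
    site_name: str
    byline: Optional[str]  # None when the page has no byline
    text: str              # main text, whitespace-normalized
    lang: str              # language code guess, or "unknown"
    html: str              # cleaned HTML


_WHITESPACE_RUN = re.compile(r"\s+")


def normalize_whitespace(text: str) -> str:
    """Collapse each run of whitespace to one space and trim both ends."""
    return _WHITESPACE_RUN.sub(" ", text).strip()


def resolve_link(page_url: str, href: str) -> str:
    """Resolve href against the page URL; absolute hrefs pass through unchanged."""
    return urljoin(page_url, href)
```

For example, `resolve_link("https://example.com/blog/post.html", "../img/a.png")` yields `https://example.com/img/a.png`, and `normalize_whitespace("  Hello\n\t world  ")` yields `Hello world`.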

Tasks

  • Implement a readable-content heuristic using a document parser
  • Normalize nodes: remove scripts, styles, and tracking elements
  • Resolve relative links and images to absolute forms
  • Add language detection and fallback to unknown
  • Add checks for minimal content length to reject empty pages
  • Write unit and property tests; seed with sample fixtures
  • Add a fuzz target for extractor inputs
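
A rough sketch of three of these tasks (node stripping, the minimal-length check, and a random-input smoke loop) is shown below, using only Python's standard-library `html.parser`. It is not a readability heuristic, and a real fuzz target would use a proper fuzzing tool. The `SKIP_TAGS` set, the `MIN_TEXT_LEN` threshold, and all function names are hypothetical.

```python
import random
from html.parser import HTMLParser
from typing import Optional

# Hypothetical choices: which elements count as non-content, and how short is "empty".
# "head" is skipped here; a title would be read separately.
SKIP_TAGS = {"head", "script", "style", "noscript", "nav", "header", "footer", "aside"}
MIN_TEXT_LEN = 20


class _TextCollector(HTMLParser):
    """Collects text that is not nested inside any element in SKIP_TAGS."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chunks.append(data)


def extract_text(html: str) -> Optional[str]:
    """Return whitespace-normalized visible text, or None if it is too short."""
    parser = _TextCollector()
    try:
        parser.feed(html)
        parser.close()
    except Exception:
        # Defensive: treat any parser error as "no content" so that
        # malformed input cannot crash the caller.
        return None
    text = " ".join("".join(parser.chunks).split())
    return text if len(text) >= MIN_TEXT_LEN else None


def smoke_fuzz(rounds: int, seed: int = 0) -> int:
    """Feed random markup-like strings to extract_text; return the number of runs."""
    rng = random.Random(seed)
    alphabet = "<>/!-[]&;=\"' \nabcdiprstv"
    for _ in range(rounds):
        sample = "".join(rng.choice(alphabet) for _ in range(rng.randint(0, 64)))
        result = extract_text(sample)
        assert result is None or isinstance(result, str)
    return rounds
```

One known limitation of this sketch: an unclosed element from `SKIP_TAGS` (for example a `<nav>` with no end tag) suppresses all text after it, which is exactly the kind of malformed-document case the unit tests should pin down.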

Labels

enhancement: New feature or request
