Content Extractor and Normalizer #22

## Why

Search and reading views need clean text. Raw HTML is noisy: the content is mixed with boilerplate, navigation, and scripts.

## Description of Done

- Extractor returns a structured result: title, site name, byline (if present), main text, a language code guess, and cleaned HTML
- Boilerplate, navigation, and scripts are removed
- Relative links are resolved to absolute URLs against the page URL
- Text is whitespace-normalized
- Unit tests cover common layouts and malformed documents
- Fuzz tests run the extractor against random inputs without panics
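
As a rough illustration of the structured result and the whitespace rule above, here is a minimal sketch. It assumes a Rust implementation, which the issue does not state. The type `ExtractedContent`, its field names, and `normalize_whitespace` are hypothetical names chosen for this example, and treating site name and language as optional is also an assumption.

```rust
/// Hypothetical result type; names and optionality are assumptions,
/// not taken from the issue.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ExtractedContent {
    pub title: String,
    /// Assumed optional: not every page declares a site name.
    pub site_name: Option<String>,
    /// Present only when the page carries a byline.
    pub byline: Option<String>,
    /// Main text, already whitespace-normalized.
    pub text: String,
    /// Language code guess; `None` stands for "unknown" in this sketch.
    pub language: Option<String>,
    /// Markup left after boilerplate, navigation, and scripts are removed.
    pub cleaned_html: String,
}

/// Collapse every run of Unicode whitespace into one ASCII space
/// and trim both ends.
pub fn normalize_whitespace(input: &str) -> String {
    input.split_whitespace().collect::<Vec<_>>().join(" ")
}
```

`split_whitespace` splits on the Unicode `White_Space` property, so tabs, newlines, and non-breaking spaces are all collapsed. Whether paragraph breaks should survive normalization is a design choice this sketch does not address.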

## Tasks

- [ ] Implement a readable-content heuristic using a document parser
- [ ] Normalize nodes: remove scripts, styles, and tracking elements
- [ ] Resolve relative links and images to absolute forms
- [ ] Add language detection with a fallback to unknown
- [ ] Add a minimum content length check to reject empty pages
- [ ] Write unit and property tests; seed with sample fixtures
- [ ] Add a fuzz target for extractor inputs
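
Two of the tasks above, the minimum content length check and the language fallback, are small enough to sketch with the standard library alone. This is one possible shape, again assuming Rust. `ExtractError`, `check_min_length`, `Language`, and `language_or_unknown` are hypothetical names, and the threshold is left to the caller because the issue does not give a value. The detector itself is out of scope here: its guess is passed in as an `Option<&str>`.

```rust
/// Hypothetical error type for rejected pages.
#[derive(Debug, PartialEq, Eq)]
pub enum ExtractError {
    /// The main text had fewer non-whitespace characters than required.
    TooShort { found: usize, required: usize },
}

/// Reject text with fewer than `required` non-whitespace characters.
/// Counting `char`s rather than bytes keeps non-ASCII text from being
/// overcounted.
pub fn check_min_length(text: &str, required: usize) -> Result<(), ExtractError> {
    let found = text.chars().filter(|c| !c.is_whitespace()).count();
    if found < required {
        Err(ExtractError::TooShort { found, required })
    } else {
        Ok(())
    }
}

/// Language guess with an explicit "unknown" case.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Language {
    Code(String),
    Unknown,
}

/// Keep a detector's guess if it is non-empty after trimming,
/// otherwise fall back to `Language::Unknown`.
pub fn language_or_unknown(guess: Option<&str>) -> Language {
    match guess.map(str::trim) {
        Some(code) if !code.is_empty() => Language::Code(code.to_ascii_lowercase()),
        _ => Language::Unknown,
    }
}
```

Both functions return a value for every input instead of panicking, which makes them straightforward to call from property tests or a fuzz target.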
