Content Extractor and Normalizer #62

charlieroth · 2025-08-29T05:57:35Z

Summary

Implements a comprehensive content extraction and normalization system for processing web pages. This system extracts readable content from HTML, sanitizes it for security, detects language, and filters out low-quality content.

Fixes #22

Changes Made

Detailed Changes

New extractor module: Complete content extraction system with reader, cleaner, language detection, and content filtering
HTML sanitization: Added ammonia-based HTML cleaning with link resolution for security
Language detection: Integrated whatlang library for automatic language identification
Content filtering: Implemented rejection filters for low-quality content
Fuzzing infrastructure: Added cargo-fuzz setup for robustness testing
Dependency updates: Added readability, ammonia, whatlang, kuchiki, linkify, proptest
Auth middleware fix: Updated tests to use proper configuration loading
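
To illustrate the kind of check a rejection filter might perform, here is a minimal sketch using only the standard library. The `Rejection` variants, the `FilterConfig` thresholds, and the `check_content` function are hypothetical names chosen for this example; the PR description does not list the actual filter rules, so word count and link density are assumed here as two plausible signals.

```rust
/// Reasons a page might be rejected. These variants are illustrative
/// assumptions, not the filter set implemented in this PR.
#[derive(Debug, PartialEq)]
pub enum Rejection {
    TooShort { words: usize },
    TooManyLinks { link_ratio_percent: usize },
}

/// Hypothetical thresholds for the sketch.
pub struct FilterConfig {
    /// Minimum number of whitespace-separated words in the extracted text.
    pub min_words: usize,
    /// Maximum share of words that may sit inside links, in percent.
    pub max_link_ratio_percent: usize,
}

/// Checks extracted text against the configured thresholds.
/// `text` is the full extracted text; `link_text` is the portion of it
/// that appeared inside anchor elements.
pub fn check_content(
    text: &str,
    link_text: &str,
    cfg: &FilterConfig,
) -> Result<(), Rejection> {
    let words = text.split_whitespace().count();
    if words < cfg.min_words {
        return Err(Rejection::TooShort { words });
    }

    let link_words = link_text.split_whitespace().count();
    // `max(1)` avoids division by zero when `min_words` is 0 and the text is empty.
    let link_ratio_percent = link_words * 100 / words.max(1);
    if link_ratio_percent > cfg.max_link_ratio_percent {
        return Err(Rejection::TooManyLinks { link_ratio_percent });
    }

    Ok(())
}
```

Returning a typed `Rejection` rather than a bare `bool` lets a caller log or count why a page was dropped, which can help when tuning thresholds.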

Testing

All existing tests pass (make test)
New tests added for new functionality
Manual testing completed
Edge cases considered and tested

Test Commands Run

make test
cargo test extractor
cargo fuzz run extractor

Code Quality

Code follows project style guidelines (make fmt)
No linting errors (make lint)
Full check passes (make check)
Code is well-documented where necessary
No security vulnerabilities introduced

Database Changes

No database changes
Migration scripts included
make prepare run after schema changes
Backward compatibility maintained

Breaking Changes

No breaking changes
Breaking changes documented below

Breaking Changes Details

Deployment Notes

No special deployment considerations
Environment variables need to be updated
Dependencies need to be updated
Special deployment steps required (documented below)

Special Deployment Steps

New dependencies need to be installed: readability, ammonia, whatlang, kuchiki, linkify, proptest

Documentation

No documentation changes needed
README updated
API documentation updated
Contributing guidelines updated
Other documentation updated (specify below)

Reviewer Checklist

Additional Notes

The extractor module is designed to be modular and extensible
Fuzzing infrastructure ensures robustness against malformed HTML input
HTML sanitization prevents XSS attacks and ensures clean content
Language detection supports multilingual content processing
Content filtering helps maintain quality standards
Integration with existing fetcher system via PageResponse type
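
The stages named above (reader, cleaner, language detection, content filtering) could be composed roughly as follows. This is a sketch under stated assumptions: the `PageResponse` struct here is a hypothetical stand-in with guessed fields, since the real type is not shown in this description, and `Extracted`, `ExtractError`, and `run_pipeline` are names invented for the example. Each stage is passed in as a closure so the sketch stays independent of readability, ammonia, and whatlang.

```rust
/// Hypothetical stand-in for the fetcher's `PageResponse`.
/// The real type's fields are not shown in the PR description.
pub struct PageResponse {
    pub url: String,
    pub body: String,
}

/// Hypothetical output of the extractor.
pub struct Extracted {
    pub url: String,
    pub text: String,
    /// Detected language code, if detection succeeded.
    pub lang: Option<String>,
}

#[derive(Debug, PartialEq)]
pub enum ExtractError {
    /// The content filter rejected the page; the string holds the reason.
    Rejected(String),
}

/// Runs the four stages in the order the summary lists them:
/// read -> clean -> detect language -> filter.
pub fn run_pipeline<R, C, L, F>(
    page: &PageResponse,
    read: R,
    clean: C,
    detect: L,
    accept: F,
) -> Result<Extracted, ExtractError>
where
    R: Fn(&str) -> String,
    C: Fn(&str) -> String,
    L: Fn(&str) -> Option<String>,
    F: Fn(&str) -> Result<(), String>,
{
    let readable = read(&page.body);
    let cleaned = clean(&readable);
    let lang = detect(&cleaned);
    accept(&cleaned).map_err(ExtractError::Rejected)?;

    Ok(Extracted {
        url: page.url.clone(),
        text: cleaned,
        lang,
    })
}
```

Keeping each stage behind a plain function boundary is one way to get the modularity described above: a stage can be replaced or stubbed in tests without touching the others.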

- Add new extractor module with readable content extraction using readability
- Implement HTML sanitization with ammonia and link resolution
- Add language detection using whatlang
- Include content rejection filters for low-quality content
- Add comprehensive test coverage with fixtures
- Add fuzzing infrastructure to test robustness
- Update dependencies: readability, kuchiki, ammonia, whatlang, linkify
- Fix auth middleware test configuration to use proper config loading

Co-authored-by: Amp <amp@ampcode.com>
Amp-Thread-ID: https://ampcode.com/threads/T-ac62a9cb-5ac8-46b1-8c5f-85827e067240

charlieroth linked an issue Aug 29, 2025 that may be closed by this pull request

Content Extractor and Normalizer #22

Closed

7 tasks

charlieroth self-assigned this Aug 29, 2025

charlieroth changed the title ~~feat: implement comprehensive content extractor and normalizer~~ Content Extractor and Normalizer Aug 29, 2025

chore: formatting

b7bc70a

charlieroth merged commit 36bc316 into main Aug 29, 2025
1 check passed
