@charlieroth
Owner

Summary

Implements a content extraction and normalization system for web pages. It extracts readable content from HTML, sanitizes the markup, detects the content language, and rejects low-quality content.

Fixes #22

Changes Made

  • Feature/functionality changes
  • Bug fixes
  • Refactoring
  • Documentation updates
  • Test additions/improvements
  • Infrastructure/CI changes

Detailed Changes

  • New extractor module: Complete content extraction system with reader, cleaner, language detection, and content filtering
  • HTML sanitization: Added ammonia-based HTML cleaning for security, along with link resolution
  • Language detection: Integrated whatlang library for automatic language identification
  • Content filtering: Implemented rejection filters for low-quality content
  • Fuzzing infrastructure: Added cargo-fuzz setup for robustness testing
  • Dependency updates: Added readability, ammonia, whatlang, kuchiki, linkify, proptest
  • Auth middleware fix: Updated tests to use proper configuration loading
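
As a rough illustration of the content filtering item above, a rejection filter could look like the sketch below. This is not the code from this PR: `Rejection`, `FilterConfig`, `check_quality`, and both thresholds are hypothetical names and values chosen for the example, and it uses only the standard library.

```rust
/// Reasons a page could be rejected. Hypothetical variants, not taken from the PR.
#[derive(Debug, PartialEq, Eq)]
pub enum Rejection {
    /// The extracted text has fewer words than the configured minimum.
    TooShort { words: usize },
    /// Too much of the extracted text is link anchor text.
    LinkHeavy,
}

/// Hypothetical thresholds; the PR does not state the real ones.
pub struct FilterConfig {
    pub min_words: usize,
    pub max_link_ratio: f64,
}

/// Returns `Ok(())` if the text passes both checks, otherwise the first
/// rejection reason. `link_text_chars` is the number of characters that
/// came from anchor text, as counted by the caller.
pub fn check_quality(
    text: &str,
    link_text_chars: usize,
    cfg: &FilterConfig,
) -> Result<(), Rejection> {
    let words = text.split_whitespace().count();
    if words < cfg.min_words {
        return Err(Rejection::TooShort { words });
    }

    let total_chars = text.chars().count();
    // Guard against division by zero when `min_words` is 0 and the text is empty.
    if total_chars > 0 {
        let ratio = link_text_chars as f64 / total_chars as f64;
        if ratio > cfg.max_link_ratio {
            return Err(Rejection::LinkHeavy);
        }
    }
    Ok(())
}
```

Returning a typed reason rather than a bare `bool` lets callers log or count why a page was dropped.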

Testing

  • All existing tests pass (make test)
  • New tests added for new functionality
  • Manual testing completed
  • Edge cases considered and tested

Test Commands Run

make test
cargo test extractor
cargo fuzz run extractor

Code Quality

  • Code follows project style guidelines (make fmt)
  • No linting errors (make lint)
  • Full check passes (make check)
  • Code is well-documented where necessary
  • No security vulnerabilities introduced

Database Changes

  • No database changes
  • Migration scripts included
  • make prepare run after schema changes
  • Backward compatibility maintained

Breaking Changes

  • No breaking changes
  • Breaking changes documented below

Breaking Changes Details

Deployment Notes

  • No special deployment considerations
  • Environment variables need to be updated
  • Dependencies need to be updated
  • Special deployment steps required (documented below)

Special Deployment Steps

New crate dependencies are added: readability, ammonia, whatlang, kuchiki, linkify, proptest. Cargo fetches them on the next build.

Documentation

  • No documentation changes needed
  • README updated
  • API documentation updated
  • Contributing guidelines updated
  • Other documentation updated (specify below)

Reviewer Checklist

  • Code review completed
  • Architecture/design approved
  • Security considerations reviewed
  • Performance impact assessed
  • Documentation reviewed

Additional Notes

  • The extractor module is designed to be modular and extensible
  • Fuzzing infrastructure helps catch panics and crashes on malformed HTML input
  • HTML sanitization with ammonia strips scripts and other unsafe markup, reducing XSS risk
  • Language detection supports multilingual content processing
  • Content filtering helps maintain quality standards
  • Integration with existing fetcher system via PageResponse type
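
To make the "modular and extensible" note concrete, here is a minimal sketch of how stages might be composed over a fetched page. Everything in it is an assumption for illustration: the `PageResponse` fields shown are a stand-in (the real type lives in the fetcher and is not shown here), and `Extracted`, `Stage`, `CollapseWhitespace`, `RejectEmpty`, and `run_pipeline` are hypothetical names, not the module's actual API.

```rust
/// Hypothetical stand-in for the fetcher's `PageResponse`; real fields may differ.
pub struct PageResponse {
    pub url: String,
    pub body: String,
}

/// Hypothetical output of the extraction pipeline.
#[derive(Debug, PartialEq, Eq)]
pub struct Extracted {
    pub url: String,
    pub text: String,
}

/// One step of the pipeline. A stage either transforms the document
/// or rejects it with a reason.
pub trait Stage {
    fn apply(&self, doc: Extracted) -> Result<Extracted, String>;
}

/// Collapses runs of whitespace; stands in for a real cleaning stage.
pub struct CollapseWhitespace;

impl Stage for CollapseWhitespace {
    fn apply(&self, mut doc: Extracted) -> Result<Extracted, String> {
        doc.text = doc.text.split_whitespace().collect::<Vec<_>>().join(" ");
        Ok(doc)
    }
}

/// Rejects documents with no text left; stands in for a real content filter.
pub struct RejectEmpty;

impl Stage for RejectEmpty {
    fn apply(&self, doc: Extracted) -> Result<Extracted, String> {
        if doc.text.is_empty() {
            Err("empty content".to_string())
        } else {
            Ok(doc)
        }
    }
}

/// Runs the stages in order and stops at the first rejection.
pub fn run_pipeline(
    page: PageResponse,
    stages: &[Box<dyn Stage>],
) -> Result<Extracted, String> {
    let seed = Extracted { url: page.url, text: page.body };
    stages.iter().try_fold(seed, |doc, stage| stage.apply(doc))
}
```

With this shape, adding a stage such as language detection would mean adding one more `Stage` implementation to the list, without touching the existing ones.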

- Add new extractor module with readable content extraction using readability
- Implement HTML sanitization with ammonia and link resolution
- Add language detection using whatlang
- Include content rejection filters for low-quality content
- Add comprehensive test coverage with fixtures
- Add fuzzing infrastructure to test robustness
- Update dependencies: readability, kuchiki, ammonia, whatlang, linkify
- Fix auth middleware test configuration to use proper config loading

Co-authored-by: Amp <amp@ampcode.com>
Amp-Thread-ID: https://ampcode.com/threads/T-ac62a9cb-5ac8-46b1-8c5f-85827e067240
@charlieroth charlieroth linked an issue Aug 29, 2025 that may be closed by this pull request
7 tasks
@charlieroth charlieroth self-assigned this Aug 29, 2025
@charlieroth charlieroth changed the title from "feat: implement comprehensive content extractor and normalizer" to "Content Extractor and Normalizer" Aug 29, 2025
@charlieroth charlieroth merged commit 36bc316 into main Aug 29, 2025
1 check passed
Development

Successfully merging this pull request may close these issues.

Content Extractor and Normalizer

2 participants