@charlieroth
Owner

Summary

Implement content persistence with checksum-based deduplication to efficiently store and manage fetched page content in the database.

Fixes #23

Changes Made

  • Feature/functionality changes
  • Bug fixes
  • Refactoring
  • Documentation updates
  • Test additions/improvements
  • Infrastructure/CI changes

Detailed Changes

  • Database Schema: Extended contents table with clean_html and clean_text columns, renamed existing columns to raw_html/raw_text for clarity
  • Content Repository: Created new ContentRepository with efficient upsert functionality using MD5 checksum-based duplicate detection
  • Database Indexing: Added composite unique index on (item_id, checksum) for deduplication and GIN index on clean_text for future full-text search
  • Job Handler Updates: Updated fetch_page job handler to use new column names and repository patterns
  • Entity Updates: Modified entity definitions to reflect new database schema
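
The checksum-based upsert described in the list above can be sketched roughly as follows. This is a minimal in-memory illustration, not the ContentRepository from this PR: the class InMemoryContentStore, the UpsertResult enum, and the choice to hash raw_html and key rows by item_id are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass
from enum import Enum
from typing import Dict


class UpsertResult(Enum):
    INSERTED = "inserted"
    UPDATED = "updated"
    UNCHANGED = "unchanged"


@dataclass
class ContentRow:
    item_id: str
    raw_html: str
    raw_text: str
    checksum: str


def md5_checksum(raw_html: str) -> str:
    """Return the hex MD5 digest of the UTF-8 encoded HTML."""
    return hashlib.md5(raw_html.encode("utf-8")).hexdigest()


class InMemoryContentStore:
    """Hypothetical stand-in for the contents table, keyed by item_id."""

    def __init__(self) -> None:
        self._rows: Dict[str, ContentRow] = {}

    def upsert(self, item_id: str, raw_html: str, raw_text: str) -> UpsertResult:
        checksum = md5_checksum(raw_html)
        existing = self._rows.get(item_id)
        # Same checksum means the content has not changed: skip the write.
        if existing is not None and existing.checksum == checksum:
            return UpsertResult.UNCHANGED
        result = UpsertResult.INSERTED if existing is None else UpsertResult.UPDATED
        self._rows[item_id] = ContentRow(item_id, raw_html, raw_text, checksum)
        return result
```

Calling upsert twice with identical HTML for the same item returns INSERTED and then UNCHANGED; calling it again with different HTML returns UPDATED. These three outcomes correspond to the insert, update, and no-op scenarios mentioned under Testing.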

Testing

  • All existing tests pass (make test)
  • New tests added for new functionality
  • Manual testing completed
  • Edge cases considered and tested

Test Commands Run

make test

All 86 tests pass, including comprehensive unit tests for ContentRepository that cover the insert, update, and no-op scenarios.

Code Quality

  • Code follows project style guidelines (make fmt)
  • No linting errors (make lint)
  • Full check passes (make check)
  • Code is well-documented where necessary
  • No security vulnerabilities introduced

Database Changes

  • No database changes
  • Migration scripts included
  • make prepare run after schema changes
  • Backward compatibility maintained

Migration 20250829081421_extend_contents_table extends the existing schema without breaking existing functionality.
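
The shape of such a migration can be illustrated with a small runnable sketch. It uses an in-memory SQLite database as a stand-in, so it is not the migration file from this PR: the pre-migration table layout, the column types, and the index name are assumptions, and the GIN index on clean_text is omitted because SQLite does not support GIN indexes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical pre-migration table with the old html/text column names.
conn.execute(
    'CREATE TABLE contents ('
    'id INTEGER PRIMARY KEY, item_id TEXT NOT NULL, '
    'html TEXT, "text" TEXT, checksum TEXT)'
)
conn.execute(
    'INSERT INTO contents (item_id, html, "text", checksum) '
    "VALUES ('item-1', '<p>hi</p>', 'hi', 'abc')"
)

# Rename the existing columns, add the cleaned-content columns, and
# enforce uniqueness on (item_id, checksum). Index name is illustrative.
MIGRATION = [
    "ALTER TABLE contents RENAME COLUMN html TO raw_html",
    'ALTER TABLE contents RENAME COLUMN "text" TO raw_text',
    "ALTER TABLE contents ADD COLUMN clean_html TEXT",
    "ALTER TABLE contents ADD COLUMN clean_text TEXT",
    "CREATE UNIQUE INDEX contents_item_id_checksum_idx "
    "ON contents (item_id, checksum)",
]
for statement in MIGRATION:
    conn.execute(statement)

# Existing rows survive: old data is readable under the new names,
# and the new columns start out as NULL.
columns = [row[1] for row in conn.execute("PRAGMA table_info(contents)")]
migrated_row = conn.execute(
    "SELECT raw_html, raw_text, clean_html, clean_text FROM contents"
).fetchone()
```

Renaming columns keeps existing rows intact, but any query that still uses the old html/text names would fail after the migration, which is why the job handler and entity definitions are updated in the same change.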

Breaking Changes

  • No breaking changes
  • Breaking changes documented below

Deployment Notes

  • No special deployment considerations
  • Environment variables need to be updated
  • Dependencies need to be updated
  • Special deployment steps required (documented below)

Documentation

  • No documentation changes needed
  • README updated
  • API documentation updated
  • Contributing guidelines updated
  • Other documentation updated (specify below)

Additional Notes

This implementation deduplicates content with MD5 checksums, avoiding unnecessary database writes when content hasn't changed. The new schema separates raw content from cleaned content, and the GIN index on clean_text lays the groundwork for future full-text search.

- Add database migration to extend contents table with clean_html/clean_text columns
- Rename existing html/text columns to raw_html/raw_text for clarity
- Create ContentRepository with efficient upsert functionality
- Add composite unique index on (item_id, checksum) for deduplication
- Add GIN index on clean_text for future full-text search capabilities
- Implement MD5 checksum-based duplicate detection to avoid unnecessary writes
- Add comprehensive unit tests for insert, update, and no-op scenarios
- Update existing code to use new column names
- Handle large content payloads efficiently without excessive memory usage
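
The commit message does not say how large payloads are handled. One common way to keep memory usage bounded while computing an MD5 checksum is to hash the content in fixed-size chunks, sketched below. The helper names iter_chunks and md5_of_chunks and the 64 KiB chunk size are assumptions for illustration only.

```python
import hashlib
import io
from typing import BinaryIO, Iterable, Iterator


def iter_chunks(stream: BinaryIO, chunk_size: int = 64 * 1024) -> Iterator[bytes]:
    """Yield successive chunks from a binary stream until it is exhausted."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return
        yield chunk


def md5_of_chunks(chunks: Iterable[bytes]) -> str:
    """Feed chunks into one MD5 digest so only one chunk is held at a time."""
    digest = hashlib.md5()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


# Usage: hashing a stream in 1 KiB chunks gives the same digest as hashing
# the whole payload at once.
payload = b"<p>example</p>" * 10_000
streamed = md5_of_chunks(iter_chunks(io.BytesIO(payload), chunk_size=1024))
whole = hashlib.md5(payload).hexdigest()
```

Because hashlib digests accept incremental updates, the streamed and whole-payload checksums are identical, so chunked hashing can be swapped in without changing which rows are treated as duplicates.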

Closes #23

Co-authored-by: Amp <amp@ampcode.com>
Amp-Thread-ID: https://ampcode.com/threads/T-68ad5589-7b94-4be5-a8a4-306a28f44efb
@charlieroth charlieroth linked an issue Aug 29, 2025 that may be closed by this pull request
6 tasks
@charlieroth charlieroth self-assigned this Aug 29, 2025
@charlieroth charlieroth changed the title feat: implement content persistence with checksum-based deduplication Persist Content in Database Aug 29, 2025
@charlieroth charlieroth merged commit ebdb358 into main Aug 29, 2025
1 check passed