Persist Content in Database #63

charlieroth · 2025-08-29T06:26:43Z

Summary

Implement content persistence with checksum-based deduplication to efficiently store and manage fetched page content in the database.

Fixes #23

Changes Made

Detailed Changes

Database Schema: Extended contents table with clean_html and clean_text columns, renamed existing columns to raw_html/raw_text for clarity
Content Repository: Created new ContentRepository with efficient upsert functionality using MD5 checksum-based duplicate detection
Database Indexing: Added composite unique index on (item_id, checksum) for deduplication and GIN index on clean_text for future full-text search
Job Handler Updates: Updated fetch_page job handler to use new column names and repository patterns
Entity Updates: Modified entity definitions to reflect new database schema

Testing

All existing tests pass (make test)
New tests added for new functionality
Manual testing completed
Edge cases considered and tested

Test Commands Run

make test

All 86 tests pass including comprehensive unit tests for ContentRepository covering insert, update, and no-op scenarios.

Code Quality

Code follows project style guidelines (make fmt)
No linting errors (make lint)
Full check passes (make check)
Code is well-documented where necessary
No security vulnerabilities introduced

Database Changes

No database changes
Migration scripts included
make prepare run after schema changes
Backward compatibility maintained

Migration 20250829081421_extend_contents_table extends the existing schema without breaking existing functionality.

Breaking Changes

No breaking changes
Breaking changes documented below

Deployment Notes

No special deployment considerations
Environment variables need to be updated
Dependencies need to be updated
Special deployment steps required (documented below)

Documentation

No documentation changes needed
README updated
API documentation updated
Contributing guidelines updated
Other documentation updated (specify below)

Additional Notes

This implementation provides efficient content deduplication using MD5 checksums to avoid unnecessary database writes when content hasn't changed. The new schema separates raw content from cleaned content, setting up the foundation for future full-text search capabilities with the GIN index on clean_text.

- Add database migration to extend contents table with clean_html/clean_text columns - Rename existing html/text columns to raw_html/raw_text for clarity - Create ContentRepository with efficient upsert functionality - Add composite unique index on (item_id, checksum) for deduplication - Add GIN index on clean_text for future full-text search capabilities - Implement MD5 checksum-based duplicate detection to avoid unnecessary writes - Add comprehensive unit tests for insert, update, and no-op scenarios - Update existing code to use new column names - Handle large content payloads efficiently without excessive memory usage Closes #23 Co-authored-by: Amp <amp@ampcode.com> Amp-Thread-ID: https://ampcode.com/threads/T-68ad5589-7b94-4be5-a8a4-306a28f44efb

charlieroth linked an issue Aug 29, 2025 that may be closed by this pull request

Persist Content in Database #23

Closed

6 tasks

charlieroth self-assigned this Aug 29, 2025

charlieroth changed the title ~~feat: implement content persistence with checksum-based deduplication~~ Persist Content in Database Aug 29, 2025

charlieroth merged commit ebdb358 into main Aug 29, 2025
1 check passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Persist Content in Database #63

Persist Content in Database #63

Uh oh!

charlieroth commented Aug 29, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Persist Content in Database #63

Persist Content in Database #63

Uh oh!

Conversation

charlieroth commented Aug 29, 2025

Summary

Changes Made

Detailed Changes

Testing

Test Commands Run

Code Quality

Database Changes

Breaking Changes

Deployment Notes

Documentation

Additional Notes

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants