feat: [document-compression-updater]- Improvements: Auto-remove dummy compression field, drop tracker collection and more #182

Open
NestorRV wants to merge 1 commit into awslabs:master from NestorRV:compression-updater-improvements

NestorRV commented Apr 1, 2026

Summary

  • Add connection retry logic and startup validation
  • Fix UnboundLocalError on empty collection in setup()
  • Replace bare print() calls with printLog() throughout
  • Add per-batch ASCII progress bar, elapsed time, rate, and ETA logging
  • Auto-remove dummy compression field after each batch via $unset
  • Add --skip-cleanup and --append-log flags
  • Drop tracker collection on successful completion
  • Remove dead multiprocessing queue code and unused imports

Changes

Error Handling

  • Added get_mongo_client() helper with retry logic (up to 3 attempts, 5s delay) used everywhere a connection is needed
  • Added validate_connection() called at startup to verify the URI is reachable and the target database/collection exist before any work begins
  • Fixed an UnboundLocalError crash in setup() when the collection is empty
  • Wrapped all MongoDB operations in setup() and task_worker() with try/except, with reconnect logic in the worker's batch loop
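
The retry behaviour described above (up to 3 attempts, 5s delay) could look roughly like the sketch below. The signature is an assumption for illustration: the PR does not show how get_mongo_client() is declared, so this version takes a hypothetical `connect` callable (and an injectable `sleep`) instead of importing a driver, which keeps it runnable without a database.

```python
import time


def get_mongo_client(connect, attempts=3, delay_seconds=5.0, sleep=time.sleep):
    """Call connect() until it succeeds or the attempts are used up.

    `connect` is a hypothetical zero-argument callable that returns a
    ready-to-use client, or raises if the server cannot be reached.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except Exception as exc:  # broad on purpose: any failure triggers a retry
            last_error = exc
            if attempt < attempts:
                sleep(delay_seconds)  # wait before the next attempt
    raise ConnectionError(
        f"could not connect after {attempts} attempts"
    ) from last_error
```

With pymongo, `connect` would presumably build a `MongoClient` and issue a cheap command such as `ping`, since the client constructor alone does not contact the server.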

Code Quality

  • Removed unused imports: threading, string
  • Removed redundant variable assignments at the top of task_worker()
  • Added --append-log flag: when it is passed, the existing log file is kept and new output is appended to it; when it is omitted, the log file is still deleted on startup
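
As a rough illustration of the --append-log behaviour, the sketch below keeps or deletes an existing log file at startup. The function names `init_log` and `print_log` and their parameters are assumptions made for this example; they are not taken from the PR's code.

```python
import os


def init_log(path, append=False):
    """Prepare the log file at startup (hypothetical helper).

    Without append, any log left over from a previous run is deleted;
    with append, it is left in place so new lines are added to it.
    """
    if not append and os.path.exists(path):
        os.remove(path)


def print_log(path, message):
    """Write a message to stdout and to the log file (hypothetical helper)."""
    print(message)
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(message + "\n")
```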

Observability

  • Replaced all bare print() calls with printLog() for consistent output to both stdout and the log file
  • Added per-batch ASCII progress bar, elapsed time, docs/sec rate, and ETA after each batch
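
A per-batch progress line of this kind can be computed from three numbers: documents processed, total documents, and elapsed seconds. The sketch below is one possible formatting, not the PR's actual output; the function name `format_progress` and the bar width are assumptions.

```python
def format_progress(done, total, elapsed_seconds, width=30):
    """Return one progress line: ASCII bar, counts, elapsed time, rate, ETA."""
    # An empty collection (total == 0) is treated as already complete.
    fraction = min(done / total, 1.0) if total > 0 else 1.0
    filled = int(width * fraction)
    bar = "#" * filled + "-" * (width - filled)

    # Rate is the average since the start; ETA assumes the rate holds.
    rate = done / elapsed_seconds if elapsed_seconds > 0 else 0.0
    remaining = max(total - done, 0)
    eta = f"{remaining / rate:.0f}s" if rate > 0 else "unknown"

    return (
        f"[{bar}] {fraction:6.1%} | {done}/{total} docs | "
        f"elapsed {elapsed_seconds:.0f}s | {rate:.1f} docs/sec | ETA {eta}"
    )
```

Using the average rate since startup keeps the ETA stable across batches; a moving average over recent batches would react faster to slowdowns at the cost of a noisier estimate.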

Correctness

  • The dummy field used to trigger compression is now automatically removed from each document immediately after each batch via a second bulk_write with $unset
  • Added --skip-cleanup flag for cases where removing the field is not required
  • Each tracker entry now includes a cleanupComplete boolean
  • On successful completion the tracker collection is automatically dropped
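
The cleanup step amounts to one `$unset` update per document in the batch, sent as a second bulk write. A minimal sketch, assuming pymongo and hypothetical names (`build_unset_specs`, `remove_dummy_field`, and the `field_name` parameter are not from the PR):

```python
def build_unset_specs(doc_ids, field_name):
    """Return (filter, update) pairs that remove field_name from each document."""
    return [
        ({"_id": doc_id}, {"$unset": {field_name: ""}})
        for doc_id in doc_ids
    ]


def remove_dummy_field(collection, doc_ids, field_name):
    """Send the $unset updates in one bulk_write; return the modified count.

    `collection` is assumed to be a pymongo Collection.
    """
    from pymongo import UpdateOne  # imported lazily so the builder works without pymongo

    operations = [
        UpdateOne(filter_doc, update_doc)
        for filter_doc, update_doc in build_unset_specs(doc_ids, field_name)
    ]
    if not operations:
        return 0  # bulk_write rejects an empty operation list
    result = collection.bulk_write(operations, ordered=False)
    return result.modified_count
```

The unordered bulk write is a choice made for this sketch: the updates are independent, so one failure need not stop the rest.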

README

  • Updated to reflect all new flags: --append-log, --skip-cleanup
  • Added notes on automatic dummy field cleanup and tracker collection drop behaviour
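
For reference, the two new flags could be declared with argparse roughly as follows. This is a sketch only: the help strings are paraphrased from this description, and the tool's other arguments are omitted because they are not listed here.

```python
import argparse


def build_parser():
    """Build a parser containing only the two flags added in this PR."""
    parser = argparse.ArgumentParser(description="document-compression-updater")
    parser.add_argument(
        "--append-log",
        action="store_true",
        help="append to the existing log file instead of deleting it at startup",
    )
    parser.add_argument(
        "--skip-cleanup",
        action="store_true",
        help="leave the dummy compression field on documents after each batch",
    )
    return parser
```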

🤖 Generated with Claude Code

… compression field, drop tracker collection and more

- Add connection retry logic and startup validation
- Fix UnboundLocalError on empty collection in setup()
- Replace bare print() calls with printLog() throughout
- Add per-batch progress bar, elapsed time, rate, and ETA logging
- Auto-remove dummy compression field after each batch via $unset
- Add --skip-cleanup and --append-log flags
- Drop tracker collection on successful completion
- Remove dead multiprocessing queue code and unused imports
NestorRV changed the title from "Improve document-compression-updater: error handling, observability, and correctness" to "feat: [document-compression-updater]- Improvements: Auto-remove dummy compression field, drop tracker collection and more" on Apr 1, 2026