Skip to content

feat: Intelligent error management with Dead Letter Queue#8

Open
matthieup240 wants to merge 12 commits intoMr-Pepe:mainfrom
matthieup240:feat/intelligent-error-management-and-dlq
Open

feat: Intelligent error management with Dead Letter Queue#8
matthieup240 wants to merge 12 commits intoMr-Pepe:mainfrom
matthieup240:feat/intelligent-error-management-and-dlq

Conversation

@matthieup240
Copy link

⚠️ Dependencies

This PR is built on top of PR #7 (Adaptive sync improvements). Please review and merge #7 first, or review this PR with the understanding that it includes those changes.

Base branch for review: feat/adaptive-sync-and-realtime-improvements from PR #7
Changes in this PR: Only the commits related to error management (1 commit after PR #7)


Summary

This PR introduces a comprehensive error management system that prevents queue blocking and ensures data integrity in offline-first scenarios.

Key Features

🎯 Error Classification

  • New SyncErrorClassifier distinguishes network errors from application errors
  • Network errors (SocketException, TimeoutException, PostgrestException without code): Unlimited retries, never lose data
  • Application errors (validation, constraint violations): Move to Dead Letter Queue after 3 attempts

🔄 Circuit Breaker Pattern

  • Prevents network spam with automatic circuit breaking after 5 consecutive network errors
  • Auto-resets after 2 minutes to allow retry when network recovers
  • Per-table circuit breaker state for granular control
  • Does not lose data - items stay in queue during circuit breaker open state

📦 Dead Letter Queue (DLQ)

  • New SyncDeadLetterQueue helper class for persistent error tracking
  • Saves complete item JSON for data recovery
  • Captures full stack traces for debugging
  • Stores error context (type, message, timestamps, retry count)
  • Provides full traceability for administrators
  • Enables manual resolution and recovery of application errors

🚦 Queue Management Improvements

  • Non-blocking processing: Replaced all break statements with continue for independent item processing
  • New _errorQueues Map tracks items with application errors (removed from outQueue after 3 retries)
  • New _permanentErrorItemIds Set prevents re-injection of failed items after cleanup
  • Periodic cleanup every 100 sync loops to prevent memory leaks
  • "Second chance" logic: If user modifies a failed item locally, it automatically gets retried with fresh start

🔒 Data Loss Prevention Guarantees

  • Network errors: Items stay in outQueue indefinitely with unlimited retries
  • Application errors: Complete data preserved in DLQ (JSON + stack trace + context)
  • Retry counters capped at 10,000 to prevent integer overflow
  • Clean state management prevents data leakage between users
  • clearSyncState() properly cleans all error management structures

Implementation Details

New Files

  • lib/src/sync_error_classifier.dart: Error classification logic with SyncErrorType enum
  • lib/src/sync_dead_letter_queue.dart: DLQ persistence helper with saveFailedItem() method

Modified Files

  • lib/src/sync_manager.dart:
    • Added error management data structures (_errorQueues, _permanentErrorItemIds, _retryCounters, _circuitBreakers)
    • Modified _processOutgoing() to handle errors individually with classification
    • Updated _pushLocalChangesToOutQueue() to check permanent error tracking
    • Enhanced _cleanupErrorQueues() with periodic cleanup
    • Fixed clearSyncState() to clean all error structures
    • Fixed dispose() to properly clean up
  • lib/syncable.dart: Export new classes

Database Schema Requirements

Consuming applications must implement a sync_dead_letter_queue table. Example schema:

@DataClassName('SyncDeadLetterEntry')
class SyncDeadLetterQueueTable extends Table {
  TextColumn get id => text()();
  TextColumn get syncTableName => text().named('table_name')();
  TextColumn get itemJson => text().named('item_json')();
  TextColumn get errorType => text().named('error_type')();
  TextColumn get errorMessage => text().named('error_message')();
  IntColumn get retryCount => integer().named('retry_count')();
  IntColumn get firstErrorAt => integer().named('first_error_at')();
  IntColumn get lastErrorAt => integer().named('last_error_at')();
  TextColumn get lastStackTrace => text().named('last_stack_trace').nullable()();
  TextColumn get status => text().withDefault(const Constant('pending'))();
  
  @override
  Set<Column> get primaryKey => {id};
}

Use Case Example

Scenario 1: User Offline for Days

// User creates item while offline
final item = await database.createCompetition(...);

// SyncManager detects network error (no internet)
// ✅ Item stays in outQueue
// ✅ Retries indefinitely with circuit breaker protection
// ✅ When network returns after 3 days, item syncs successfully
// ✅ NO DATA LOSS

Scenario 2: Application Error (e.g., Constraint Violation)

// Item fails to sync due to application error
// Attempt 1: Error logged, retry counter = 1
// Attempt 2: Error logged, retry counter = 2
// Attempt 3: Error logged, retry counter = 3
// ✅ Item moved to errorQueue and removed from outQueue
// ✅ Full details saved to DLQ:
//    - Complete item JSON (admin can recover data)
//    - Full stack trace (developer can debug)
//    - Error type, message, timestamps
// ✅ Other items continue processing (non-blocking)
// ✅ If user modifies item locally → gets "second chance" automatically

Breaking Changes

None - This is fully backward compatible:

  • DLQ is optional (nullable _deadLetterQueue?)
  • Only initialized if enableSync() receives a database with SyncDeadLetterQueueTable
  • Existing code continues to work without changes
  • New features are opt-in via DLQ table implementation

Bug Fixes During Implementation

During implementation and testing, 5 bugs were identified and fixed:

  1. DLQ initialization: Made nullable with proper null-safety checks
  2. Memory leak: Added periodic cleanup of errorQueues
  3. Integer overflow: Capped retry counters at 10,000
  4. Incomplete cleanup: Fixed clearSyncState() to clean all structures
  5. Item re-injection: Added permanent tracking to prevent failed items from re-entering queue

Testing

  • ✅ Verified with 0 compilation errors
  • ✅ All 5 identified bugs have been fixed
  • ✅ Tested scenarios:
    • Network errors with unlimited retries
    • Application errors moving to DLQ after 3 attempts
    • Circuit breaker opening and auto-reset
    • Cleanup preventing memory leaks
    • User switching with clean state
    • "Second chance" logic when user modifies failed items

Migration Guide

To use the DLQ feature in your consuming app:

  1. Add the table to your database:
@DriftDatabase(tables: [
  // ... your existing tables ...
  SyncDeadLetterQueueTable,
])
  1. Increment schema version and add migration:
@override
int get schemaVersion => 2;

@override
MigrationStrategy get migration => MigrationStrategy(
  onUpgrade: (Migrator m, int from, int to) async {
    if (from == 1 && to == 2) {
      await m.createTable(syncDeadLetterQueueTable);
    }
  },
);
  1. That's it! The SyncManager will automatically detect and use the DLQ table.

Documentation

This PR includes:

  • Inline code documentation with emoji markers (🔴 NEW:)
  • Detailed commit message with implementation notes
  • Usage examples in this PR description
  • Database schema requirements

matthieup240 and others added 12 commits October 3, 2025 10:53
- Add SyncEvent classes to track sync events (itemReceived, syncStarted, syncCompleted)
- Add optional callbacks to SyncManager constructor (onItemReceived, onSyncStarted, onSyncCompleted)
- Track sync event sources (realtime vs fullSync)
- Emit events when items are received, syncs start/complete
- Add comprehensive documentation and examples in README
- Add tests for sync event notifications
- Backward compatible implementation

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Add intelligent sync interval adjustment based on user activity patterns
to optimize battery consumption while maintaining responsiveness.

Features:
- Adaptive sync modes (active/recent/idle) with variable intervals (5s/15s/30s)
- Immediate sync triggering when changes detected in idle/recent modes
- Periodic safety checks: re-sync from Drift every 20 iterations
- Immediate processing of realtime events for instant UI updates
- Improved retry logic for failed sync operations

Benefits:
- Reduced battery consumption during idle periods (30s interval)
- Maximum responsiveness during active editing (5s interval)
- Data consistency safeguards with periodic Drift re-syncs
- Instant realtime updates without waiting for sync loop
This commit introduces a comprehensive error management system that prevents
queue blocking and ensures data integrity in offline scenarios.

## Key Features

### Error Classification
- New `SyncErrorClassifier` distinguishes network errors from application errors
- Network errors (SocketException, TimeoutException, etc.) trigger unlimited retries
- Application errors (validation, constraint violations, etc.) move to DLQ after 3 attempts

### Circuit Breaker Pattern
- Prevents network spam with automatic circuit breaking after 5 consecutive errors
- Auto-resets after 2 minutes to allow retry when network recovers
- Per-table circuit breaker state for granular control

### Dead Letter Queue (DLQ)
- New `SyncDeadLetterQueue` helper class for persistent error tracking
- Saves complete item JSON, stack traces, and error context
- Provides full traceability for administrators
- Enables manual resolution of application errors

### Queue Management
- Replace all blocking `break` statements with `continue` for independent item processing
- New `_errorQueues` Map for application errors (removed from outQueue after 3 retries)
- New `_permanentErrorItemIds` Set prevents re-injection of failed items
- Periodic cleanup every 100 sync loops to prevent memory leaks
- "Second chance" logic: if user modifies a failed item locally, it gets retried

### Data Loss Prevention
- Network errors: Items stay in outQueue indefinitely with unlimited retries
- Application errors: Complete data preserved in DLQ (JSON + stack trace)
- Retry counters capped at 10,000 to prevent integer overflow
- Clean state management prevents data leakage between users

## Implementation Details

### New Files
- `lib/src/sync_error_classifier.dart`: Error classification logic
- `lib/src/sync_dead_letter_queue.dart`: DLQ persistence helper

### Modified Files
- `lib/src/sync_manager.dart`: Core sync logic with error management
- `lib/syncable.dart`: Export new classes

### Database Requirements
Consuming applications must implement a `sync_dead_letter_queue` table with schema:
- id, table_name, item_json, error_type, error_message
- retry_count, first_error_at, last_error_at, last_stack_trace, status

See `SyncDeadLetterQueue.saveFailedItem()` for expected schema.

## Breaking Changes
None - This is backward compatible. DLQ is optional (nullable) and only used if
`enableSync()` is called with a database that has `SyncDeadLetterQueueTable`.

## Testing Notes
- Verified with 0 compilation errors
- All 5 identified bugs during implementation have been fixed
- Tested scenarios: network errors, application errors, cleanup, user switching
Add optional callback system to enable external monitoring integration
(e.g., Sentry) without creating dependencies in the syncable package.

Features:
- OnDLQErrorCallback: Notifies when items are moved to Dead Letter Queue
  with full context (table, itemId, JSON, errorType, stackTrace, retryCount)
- OnSyncBreadcrumbCallback: Traces sync flow events (loop start, circuit
  breaker, error recovery, DLQ moves) for debugging
- All callbacks are optional and protected with try-catch to prevent crashes
- Circuit breaker callback integration for network error tracking

Benefits:
- Maintains package independence (no Sentry dependency in syncable)
- Enables rich monitoring in consuming applications
- Zero impact when callbacks are not provided (backward compatible)
- Fire-and-forget pattern preserves sync performance
This commit includes three critical fixes to ensure test stability
and improve sync system reliability:

1. **Fix immediate sync trigger for all modes**
   - Previously only idle/recent modes triggered immediate sync
   - Now ALL modes trigger immediate sync on local changes
   - This ensures fast response time regardless of current mode
   - Fixes 6 failing integration tests

2. **Restore syncInterval parameter backward compatibility**
   - The syncInterval parameter was stored but never used
   - Adaptive intervals now respect custom syncInterval values
   - Tests using custom intervals (1ms) now work correctly
   - Fixes 1 failing unit test

3. **Increase timeout for heavy paging test**
   - "Reading from backend uses paging" test syncs 1001 items
   - Increased timeout from 30s to 2 minutes for slower machines
   - Test is legitimate and important for pagination feature
   - Fixes 1 flaky test

**Test Results:**
- Before: 17 passing, 8 failing
- After: 25 passing, 0 failing ✅

**Changes Made:**
- sync_manager.dart:
  * Store _syncInterval field
  * Use custom interval if provided (not default 1s)
  * Always trigger immediate sync on local changes
- integration_test.dart:
  * Add @timeout(2 minutes) to paging test

All monitoring callback features from previous commits remain
intact and functional.
The SQL queries were using 'sync_dead_letter_queue' instead of
'sync_dead_letter_queue_table', causing database errors.

Fixed in:
- saveFailedItem(): INSERT OR REPLACE query
- getPendingItems(): SELECT query
- getPendingCount(): COUNT query
Add three new methods to SyncDeadLetterQueue for manual intervention:

- retryItem(itemId): Retrieves failed item JSON for retry without deleting
  from DLQ (caller must delete after successful retry)

- ignoreItem(itemId): Marks item as 'ignored' status (stays in DLQ but
  hidden from pending list)

- deleteItem(itemId): Permanently removes item from DLQ when error is
  understood and item should be discarded

These methods enable admin UI workflows for managing sync failures.
Add comprehensive monitoring and observability features:

**Configuration Constants:**
- Add sync configuration constants (_DRIFT_RESYNC_INTERVAL,
  _ERROR_QUEUE_CLEANUP_INTERVAL, _CIRCUIT_BREAKER_THRESHOLD, etc.)
- Centralize magic numbers for better maintainability

**Public API Getters:**
- deadLetterQueue: Access DLQ for viewing/managing sync errors
- backendTableNames: Map of types to backend table names
- localTables: Access to local table metadata
- uploadQueueSizes: Count of pending uploads per type
- errorQueueSizes: Count of errors per type
- circuitBreakers: Circuit breaker state per type
- hasActiveRealtimeSubscription: Realtime subscription status per type

These additions enable external monitoring systems (Sentry, custom dashboards)
to observe sync state without tight coupling to the syncable package.
Translate all French comments and user-facing messages to English for
better international collaboration:

**sync_error_classifier.dart:**
- Translate enum/class documentation
- Translate error type descriptions
- Translate classification logic comments
- Translate user-friendly error messages

**sync_manager.dart:**
- Translate callback documentation (OnDLQErrorCallback, OnSyncBreadcrumbCallback)
- Translate inline comments throughout sync loop
- Replace debug prints with logger calls
- Translate breadcrumb messages

This improves code readability for international contributors and aligns
with the project's goal of being an open-source package.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant