-
Notifications
You must be signed in to change notification settings - Fork 12
feat: add failure metrics to kinesis outbox event processing #59
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…r failed kinesis events with error code metadata
| emit_metric('journaled.worker.batch_process', value: total_events, worker_id:) | ||
| emit_metric('journaled.worker.batch_sent', value: stats[:succeeded], worker_id:) | ||
| emit_metric('journaled.worker.batch_failed_permanently', value: stats[:failed_permanently], worker_id:) | ||
| emit_metric('journaled.worker.batch_failed_transiently', value: stats[:failed_transiently], worker_id:) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| emit_metric('journaled.worker.batch_process', value: total_events, worker_id:) | |
| emit_metric('journaled.worker.batch_sent', value: stats[:succeeded], worker_id:) | |
| emit_metric('journaled.worker.batch_failed_permanently', value: stats[:failed_permanently], worker_id:) | |
| emit_metric('journaled.worker.batch_failed_transiently', value: stats[:failed_transiently], worker_id:) | |
| emit_metric('journaled.worker.batch.processed', value: total_events, worker_id:) | |
| emit_metric('journaled.worker.batch.sent', value: stats[:succeeded], worker_id:) | |
| emit_metric('journaled.worker.batch.failed_permanently', value: stats[:failed_permanently], worker_id:) | |
| emit_metric('journaled.worker.batch.failed_transiently', value: stats[:failed_transiently], worker_id:) |
🤷 nothing says we can't have more than 3 . in a stat name -- but mainly i renamed process to processed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Though I'm wondering if we need batch_ at all. We're emitting stats in batches, but the stats themselves are counts of individual events:
| emit_metric('journaled.worker.batch_process', value: total_events, worker_id:) | |
| emit_metric('journaled.worker.batch_sent', value: stats[:succeeded], worker_id:) | |
| emit_metric('journaled.worker.batch_failed_permanently', value: stats[:failed_permanently], worker_id:) | |
| emit_metric('journaled.worker.batch_failed_transiently', value: stats[:failed_transiently], worker_id:) | |
| emit_metric('journaled.event.processed', value: total_events, worker_id:) | |
| emit_metric('journaled.event.sent', value: stats[:succeeded], worker_id:) | |
| emit_metric('journaled.event.failed_permanently', value: stats[:failed_permanently], worker_id:) | |
| emit_metric('journaled.event.failed_transiently', value: stats[:failed_transiently], worker_id:) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also thoughts on event.failed_permanently and event.failed_transiently becoming event.failed and event.errored? (Aligning with DJ terminology)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah all sounds good! will update
| metrics = calculate_queue_metrics | ||
| emit_metric('journaled.worker.queue_total_count', value: metrics[:total_count], worker_id:) | ||
| emit_metric('journaled.worker.queue_workable_count', value: metrics[:workable_count], worker_id:) | ||
| emit_metric('journaled.worker.queue_erroring_count', value: metrics[:erroring_count], worker_id:) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| emit_metric('journaled.worker.queue_erroring_count', value: metrics[:erroring_count], worker_id:) | |
| emit_metric('journaled.worker.queue_failed_count', value: metrics[:erroring_count], worker_id:) |
Should it be "failed count"? If it's a batch query it can really only count the number of hard-failed rows, not the number of transiently-erroring rows, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, updated!
smudge
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TAFN - two points on naming conventions.
smudge
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
domain LGTM && platform LGTM
Summary
This adds more visibility into both batch and individual kinesis errors by emitting metrics with event information and error codes. This will allow us to track rate limiting errors more closely, and have insight into other errors that occur.
/no-platform