Contributor

@effron effron commented Nov 25, 2025

Summary

This adds visibility into both batch-level and per-event Kinesis errors by emitting metrics tagged with event information and error codes. This lets us track rate-limiting errors more closely and gives us insight into any other errors that occur.
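For illustration only, here is a minimal sketch of what emitting per-event error metrics could look like. Everything beyond the `emit_metric(name, value:, worker_id:)` call shape is an assumption: the `FailedRecord` struct, the `emit_failure_metrics` helper, the metric name `journaled.kinesis.record_error`, and the extra tag keywords are hypothetical and not taken from this PR. The two error codes are the per-record codes Kinesis documents for `PutRecords`.

```ruby
# Hypothetical in-memory sink standing in for the real metrics backend.
ERROR_METRICS = []

# Stand-in for the emit_metric helper seen in the diff. Accepting extra
# tags beyond worker_id is an assumption made for this sketch.
def emit_metric(name, value:, **tags)
  ERROR_METRICS << { name: name, value: value, tags: tags }
end

# Hypothetical shape for one failed record from a Kinesis batch put.
FailedRecord = Struct.new(:event_type, :error_code, keyword_init: true)

# Group failures by event type and error code so that each combination
# becomes one data point. Throttling can then be charted on its own.
def emit_failure_metrics(failures, worker_id:)
  grouped = failures.group_by { |f| [f.event_type, f.error_code] }
  grouped.each do |(event_type, error_code), group|
    emit_metric(
      'journaled.kinesis.record_error', # hypothetical metric name
      value: group.size,
      worker_id: worker_id,
      event_type: event_type,
      error_code: error_code
    )
  end
end

failures = [
  FailedRecord.new(event_type: 'user_created', error_code: 'ProvisionedThroughputExceededException'),
  FailedRecord.new(event_type: 'user_created', error_code: 'ProvisionedThroughputExceededException'),
  FailedRecord.new(event_type: 'order_placed', error_code: 'InternalFailure')
]
emit_failure_metrics(failures, worker_id: 'worker-1')
```

Tagging by error code rather than encoding it in the metric name keeps the number of metric names fixed while still allowing a filter on the throttling code.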

/no-platform

@effron effron requested a review from a team as a code owner November 25, 2025 16:38
@effron effron requested a review from smudge November 25, 2025 16:38
Comment on lines 17 to 20
emit_metric('journaled.worker.batch_process', value: total_events, worker_id:)
emit_metric('journaled.worker.batch_sent', value: stats[:succeeded], worker_id:)
emit_metric('journaled.worker.batch_failed_permanently', value: stats[:failed_permanently], worker_id:)
emit_metric('journaled.worker.batch_failed_transiently', value: stats[:failed_transiently], worker_id:)
Member

Suggested change
- emit_metric('journaled.worker.batch_process', value: total_events, worker_id:)
- emit_metric('journaled.worker.batch_sent', value: stats[:succeeded], worker_id:)
- emit_metric('journaled.worker.batch_failed_permanently', value: stats[:failed_permanently], worker_id:)
- emit_metric('journaled.worker.batch_failed_transiently', value: stats[:failed_transiently], worker_id:)
+ emit_metric('journaled.worker.batch.processed', value: total_events, worker_id:)
+ emit_metric('journaled.worker.batch.sent', value: stats[:succeeded], worker_id:)
+ emit_metric('journaled.worker.batch.failed_permanently', value: stats[:failed_permanently], worker_id:)
+ emit_metric('journaled.worker.batch.failed_transiently', value: stats[:failed_transiently], worker_id:)

🤷 Nothing says we can't have more than 3 "." in a stat name, but mainly I renamed process to processed.

Member

Though I'm wondering if we need batch_ at all. We're emitting stats in batches, but the stats themselves are counts of individual events:

Suggested change
- emit_metric('journaled.worker.batch_process', value: total_events, worker_id:)
- emit_metric('journaled.worker.batch_sent', value: stats[:succeeded], worker_id:)
- emit_metric('journaled.worker.batch_failed_permanently', value: stats[:failed_permanently], worker_id:)
- emit_metric('journaled.worker.batch_failed_transiently', value: stats[:failed_transiently], worker_id:)
+ emit_metric('journaled.event.processed', value: total_events, worker_id:)
+ emit_metric('journaled.event.sent', value: stats[:succeeded], worker_id:)
+ emit_metric('journaled.event.failed_permanently', value: stats[:failed_permanently], worker_id:)
+ emit_metric('journaled.event.failed_transiently', value: stats[:failed_transiently], worker_id:)

Member

Also, thoughts on event.failed_permanently and event.failed_transiently becoming event.failed and event.errored? (Aligning with DJ, i.e. Delayed Job, terminology.)

Contributor Author

Yeah, all sounds good! Will update.
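Putting the suggestions in this thread together, the emitted names might end up looking roughly like the sketch below. This is an assumption about how the suggestions combine, not the merged code, which is not shown here. The `emit_batch_metrics` wrapper and the recording stub for `emit_metric` are hypothetical; the `stats` keys come from the diff under review.

```ruby
# Hypothetical recorder so the sketch runs on its own.
NAMED_METRICS = {}

# Stub with the same call shape as the emit_metric calls in the diff.
def emit_metric(name, value:, worker_id:)
  NAMED_METRICS[name] = { value: value, worker_id: worker_id }
end

# The stats keys are unchanged from the diff; only the metric names differ.
def emit_batch_metrics(stats, total_events:, worker_id:)
  emit_metric('journaled.event.processed', value: total_events, worker_id: worker_id)
  emit_metric('journaled.event.sent', value: stats[:succeeded], worker_id: worker_id)
  # "failed" means permanently given up on; "errored" means it will be retried.
  emit_metric('journaled.event.failed', value: stats[:failed_permanently], worker_id: worker_id)
  emit_metric('journaled.event.errored', value: stats[:failed_transiently], worker_id: worker_id)
end

stats = { succeeded: 7, failed_permanently: 1, failed_transiently: 2 }
emit_batch_metrics(stats, total_events: 10, worker_id: 'worker-1')
```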

metrics = calculate_queue_metrics
emit_metric('journaled.worker.queue_total_count', value: metrics[:total_count], worker_id:)
emit_metric('journaled.worker.queue_workable_count', value: metrics[:workable_count], worker_id:)
emit_metric('journaled.worker.queue_erroring_count', value: metrics[:erroring_count], worker_id:)
Member

Suggested change
- emit_metric('journaled.worker.queue_erroring_count', value: metrics[:erroring_count], worker_id:)
+ emit_metric('journaled.worker.queue_failed_count', value: metrics[:erroring_count], worker_id:)

Should it be "failed count"? If it's a batch query it can really only count the number of hard-failed rows, not the number of transiently-erroring rows, right?

Contributor Author

Yes, updated!
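To make the distinction concrete, here is a rough sketch of a queue snapshot that can only see hard failures. The real `calculate_queue_metrics` takes no arguments and its query is not shown in this PR, so the `QueuedEvent` struct, its `failed_at` and `run_at` fields, and the `rows`/`now` parameters are all assumptions made so the example runs in memory.

```ruby
# Hypothetical row shape; the real table schema is not shown in this PR.
QueuedEvent = Struct.new(:failed_at, :run_at, keyword_init: true)

# Counts over a snapshot of rows. Assuming a hard failure is the only state
# persisted on the row (failed_at), a row that errored and is waiting for a
# retry looks the same as any other pending row.
def calculate_queue_metrics(rows, now: Time.now)
  failed = rows.count { |row| !row.failed_at.nil? }
  workable = rows.count { |row| row.failed_at.nil? && row.run_at <= now }
  { total_count: rows.size, workable_count: workable, failed_count: failed }
end

now = Time.at(1_000)
rows = [
  QueuedEvent.new(failed_at: nil, run_at: Time.at(900)),        # ready to run
  QueuedEvent.new(failed_at: nil, run_at: Time.at(2_000)),      # scheduled later
  QueuedEvent.new(failed_at: Time.at(500), run_at: Time.at(400)) # hard-failed
]
queue_metrics = calculate_queue_metrics(rows, now: now)
```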

Member

@smudge smudge left a comment

TAFN - two points on naming conventions.

@effron effron requested a review from smudge November 25, 2025 18:08
Member

@smudge smudge left a comment

domain LGTM && platform LGTM

@effron effron merged commit 8a68afb into master Nov 25, 2025
29 checks passed
@effron effron deleted the effron/main/add-failure-metrics branch November 25, 2025 19:47