-
Notifications
You must be signed in to change notification settings - Fork 243
Introduce RetryHeap and RetryHeapProcessor #1972
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: poison-pill
Are you sure you want to change the base?
Conversation
| ) | ||
|
|
||
| // retryHeapImpl implements heap.Interface for logEventBatch sorted by nextRetryTime | ||
| type retryHeapImpl []*logEventBatch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Add var _ heap.Interface = (*retryHeapImpl)(nil)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack will update as follow up
| } | ||
|
|
||
| // RetryHeap manages failed batches during their retry wait periods | ||
| type RetryHeap interface { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Add var _ RetryHeap = (*retryHeap)(nil)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack will update as follow up
| for _, batch := range readyBatches { | ||
| // Check if batch has expired | ||
| if batch.isExpired(p.maxRetryDuration) { | ||
| p.logger.Debugf("Dropping expired batch for %s/%s", batch.Group, batch.Stream) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Make it match https://github.com/aws/amazon-cloudwatch-agent/pull/1970/changes#diff-d38a5f3264bf20ef00aa1acd29a08512a7d92c63384dcacfd3737b70cdf7311cR119 for consistency with our logging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack will update as follow up
| ) | ||
|
|
||
| // retryHeapImpl implements heap.Interface for logEventBatch sorted by nextRetryTime | ||
| type retryHeapImpl []*logEventBatch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Why do we have a retryHeapImpl and a retryHeap? Is there ever a case where we'd use one without the other? Is it better to operate on the []*logEventBatch directly instead of accessing it as a field of retryHeap?
| func (h *retryHeapImpl) Pop() interface{} { | ||
| old := *h | ||
| n := len(old) | ||
| item := old[n-1] | ||
| *h = old[0 : n-1] | ||
| return item | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like this is missing a step from the PriorityQueue example
https://pkg.go.dev/container/heap#Interface
func (pq *PriorityQueue) Pop() any {
...
old[n-1] = nil // don't stop the GC from reclaiming the item eventually
...
}
Is there a reason we decided not to include it?
| batch3 := newLogEventBatch(target, nil) | ||
| batch3.nextRetryTime = time.Now().Add(time.Hour) // Future time, won't be popped | ||
| heap.Push(batch3) // This should block | ||
| pushCompleted = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a data race since you're accessing the same variable on two separate threads/goroutines. Consider using an atomic or a channel.
| } | ||
|
|
||
| // RetryHeap manages failed batches during their retry wait periods | ||
| type RetryHeap interface { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So is the idea that each sender will have a reference to the RetryHeap to push failed batches onto and then there's a single RetryHeapProcessor that also has a reference to the RetryHeap?
| // NewRetryHeap creates a new retry heap with the specified maximum size | ||
| func NewRetryHeap(maxSize int) RetryHeap { | ||
| rh := &retryHeap{ | ||
| heap: make(retryHeapImpl, 0), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Would it make sense to initialize it with a capacity of maxSize? That way it doesn't need to reallocate on each push?
| } | ||
|
|
||
| // Stop stops the retry heap | ||
| func (rh *retryHeap) Stop() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is going to stop the retry heap? Does it stop when the RetryHeapProcessor stops?
| } | ||
|
|
||
| // Wait for batches to become ready, then pop to release semaphore | ||
| time.Sleep(4 * time.Second) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know we use sleeps in other unit tests, but is there a way to make this less flaky? Is there a reason to wait since this isn't using the processor, so doesn't rely on the ticker? The initial batches could be expired when added like in the other tests.
|
|
||
| // Wait for batches to become ready, then pop to release semaphore | ||
| time.Sleep(4 * time.Second) | ||
| heap.PopReady() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we validate that the correct batches were taken off the heap?
Description of the issue
Introduce RetryHeap and RetryHeapProcessor implementations for cloudwatchlogs log publishing
Description of changes
RetryHeap is implemented as a min-heap sorted on retryTimestamp for a given batch PutLogEvent request that has already been attempted once and failed.
RetryHeapProcessor will periodically scan for batches that are ready to be retried based on retryTimestamp and move them to the SenderPool tasks queue for processing by the sender pool.
Assumptions: The RetryHeap and RetryHeapProcessor will only be instantiated and used when concurrency is enabled.
No functionality changes are introduced in this PR as this sets up RetryHeap and related components, but does not instantiate them in the code yet.
License
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
Tests
New unit tests are passing
Requirements
Before commiting your code, please do the following steps.
make fmtandmake fmt-shmake lintIntegration Tests
To run integration tests against this PR, add the
ready for testinglabel.