Skip to content

Fix race in requeue#349

Merged
ChrisBr merged 1 commit intomainfrom
cbruckmayer/fix-race-requeue
Jul 2, 2025
Merged

Fix race in requeue#349
ChrisBr merged 1 commit intomainfrom
cbruckmayer/fix-race-requeue

Conversation

@ChrisBr
Copy link
Contributor

@ChrisBr ChrisBr commented Jul 1, 2025

When a test gets requeued, we put it first back into the list and then later remove the error report in the reported. However, a race can occour when the test is picked up by another worker in the meantime and fails. The worker which requeud the test would then remove the error report again.

@ChrisBr ChrisBr force-pushed the cbruckmayer/fix-race-requeue branch from 45819b9 to f8d3efc Compare July 1, 2025 13:12
@ChrisBr ChrisBr marked this pull request as ready for review July 1, 2025 13:15
redis.call('hincrby', requeues_count_key, '___total___', 1)
redis.call('hincrby', requeues_count_key, test, 1)

redis.call('hdel', error_reports_key, test)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're doing hdel here, the transaction.hdel(key('error-reports'), id) becomes redundant.

But also since they were in a transaction, I don't see what this changes.

Copy link
Contributor Author

@ChrisBr ChrisBr Jul 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is requeue which is not in a transaction. It's called from the queue.

  1. We call requeue from queue
  2. We go through the reporters
  3. BuildStatusRecorder will call record_success which will remove the error report

This opens a window for a race because as soon as it's requeued a new worker can run it and report a new error report which could be deleted again by the worker which did the requeue.

It's also not redundant because we now call record_requeue (instead of record_success) which will only report the status. We still need to do the delete in record_success which is in a transaction with acknowledge already.

if failed && CI::Queue.requeueable?(result) && queue.requeue(example)
result.requeue!
end

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's also not redundant

In record_success it very much seem redundant:

            @queue.acknowledge(id, pipeline: transaction)
            transaction.hdel(key('error-reports'), id)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we call requeue we don't call record_success anymore but record_requeue which is only setting the stats.

        def record_requeue(id, stats: nil)
          redis.pipelined do |pipeline|
           record_stats(stats, pipeline: pipeline)
          end
        end

Copy link
Contributor Author

@ChrisBr ChrisBr Jul 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before we were calling record_success on requeues and were

  1. deleting the error
  2. setting the stats

As we now delete the error report from inside the requeue lua script I've created a dedicated method.

The alternative would be to alter record_success to something like this

        def record_success(id, stats: nil, skip_flaky_record: false, acknowledge: true)
          _, error_reports_deleted_count, requeued_count, _ = redis.multi do |transaction|
            if acknowledge
              @queue.acknowledge(id, pipeline: transaction)
              transaction.hdel(key('error-reports'), id)
            end
            transaction.hget(key('requeues-count'), id)
            record_stats(stats, pipeline: transaction)
          end
          record_flaky(id) if !skip_flaky_record && (error_reports_deleted_count.to_i > 0 || requeued_count.to_i > 0)
          nil
        end

But it's difficult because the return values would be incorrect in that case?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well. error_reports_deleted_count is now already always 0 because it's deleted inside acknowledge.

Copy link
Contributor Author

@ChrisBr ChrisBr Jul 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's deleted inside acknowledge.

Not acknowledge - we delete it in requeue. I think you might be mixing up acknowledge and requeue scripts?

We don't call acknowledge on a requeue, we only call requeue. Hence, we either do

  • success: call acknowledge and hdel the error-reports
  • requeue: call requeue (and don't call hdel in that case because it's inside requeue already)

Well. error_reports_deleted_count is now already always 0 because it's deleted inside acknowledge.

Right, but the assignments would be messed up, no?

Anyway, error_reports_deleted_count and requeued_count are really only relevant on success to identify a flaky test. It doesn't matter on a requeue.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

think you might be mixing up acknowledge and requeue scripts?

Ah indeed...

@ChrisBr
Copy link
Contributor Author

ChrisBr commented Jul 2, 2025

/shipit

@ChrisBr ChrisBr merged commit 6322337 into main Jul 2, 2025
13 of 23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants