
scheduler: Stop scheduling from the imperative queue when asked to quit #2005

Merged

gtristan merged 4 commits into master from abderrahim/quit-build on Jun 3, 2025
Conversation

@abderrahim (Contributor)

For instance, we shouldn't schedule new build jobs when asked to quit. However, not considering the build queue at all would miss pushing newly built elements.

What this commit does is:

  • Pull elements forward through all queues (including the ones we no longer need)
  • Only schedule jobs after the imperative queue

This means that elements in the build queue (or any imperative queue) will still get passed to the following queues, but no new build jobs get scheduled.

Fixes #1787

```diff
  # Pull elements forward through queues
  elements = []
- for queue in queues:
+ for queue in self.queues:
```
Contributor

This line looks very dubious.

The first part of the function's job is to select the queues to process in the latter half of the function, which seems to make sense. And except for this line you've changed, the rest of the function only operates on the selected queues.

I wonder if this change is really needed, or if only the previous change is needed.

Given this is such a sensitive code block, I think this change needs further clarity.

Contributor Author

> The first part of the function's job is to select the queues to process in the latter half of the function, which seems to make sense. And except for this line you've changed, the rest of the function only operates on the selected queues.

Correct, and that's the root cause of the issue I'm trying to fix.

There are two operations in this function ("pull elements forward through queues" and "kick off whatever jobs can run at this time"), and they shouldn't operate on the same list of queues.

Let's consider a simple example where we have three queues: fetch, build and push. Once we receive the quit signal, we only want to:

  • pull elements forward from build to push
  • kick off new jobs in push

What the current code does is:

  • pull elements forward from build to push
  • kick off new jobs in build (<- not desired, and is the issue I'm trying to fix)
  • kick off new jobs in push

What the new code does is:

  • pull elements forward from fetch to build (<- undesired but harmless; this is the issue you're pointing out)
  • pull elements forward from build to push
  • kick off new jobs in push

I hope this explains it. So while this line is indeed "dubious", it needs to be different from the other one. And since it's harmless to pull elements forward into the next queues even when asked to quit, I think it's fine to do that unconditionally.
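
To make the two phases concrete, here is a minimal sketch of the structure being described. This is not BuildStream's actual scheduler code; the names `quit_requested`, `queue.imperative`, `harvest_jobs()` and `_start_job()` are assumptions for illustration:

```python
# Illustrative sketch only, not BuildStream's scheduler; quit_requested,
# queue.imperative, harvest_jobs() and _start_job() are assumed names.
def schedule_queue_jobs(self):
    # Select the queues that may still start new jobs. Once a quit has
    # been requested, drop everything up to and including the imperative
    # (e.g. build) queue, so that no new build jobs get scheduled.
    queues_to_process = self.queues
    if self.quit_requested:
        imperative = [i for i, q in enumerate(self.queues) if q.imperative]
        if imperative:
            queues_to_process = self.queues[imperative[-1] + 1:]

    # Pull elements forward through *all* queues, regardless of whether
    # we are processing those queues, so that already-built elements
    # still reach the queues after the imperative one (e.g. push).
    elements = []
    for queue in self.queues:
        queue.enqueue(elements)
        elements = list(queue.dequeue())

    # Kick off new jobs only on the selected queues.
    for queue in queues_to_process:
        for job in queue.harvest_jobs():
            self._start_job(job)
```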

Contributor

Ok... I've done some mental gymnastics and I agree with your statement :)

What I would like to see here is:

  • Rename the local queues variable to something more explicit, perhaps queues_to_process
  • At least slightly elaborate on the comment "Pull elements forward through queues"
    • Perhaps something like "Pull elements forward through all queues, regardless of whether we are processing those queues"
    • Maybe even a little explanation, like "We need to propagate elements through all queues even if we are not processing post-imperative queues, so that we do end up processing the jobs we need to"
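
Concretely, the suggested rename and comment might look something like this (a sketch only, not the final code):

```python
# Pull elements forward through all queues, regardless of whether we are
# processing those queues: elements must keep propagating through the
# full pipeline even when only post-imperative queues are scheduling
# jobs, so that the jobs we still need do end up being processed.
elements = []
for queue in self.queues:
    queue.enqueue(elements)
    elements = list(queue.dequeue())

# Kick off new jobs only on the queues selected for processing.
for queue in queues_to_process:
    ...
```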

@gtristan (Contributor)

Can we add a test here?

I think that we can potentially test correct behavior, as described in the comments of #1787, by constructing a test which:

  • Has an artifactshare
  • Has a build which fails
  • Test that the failed build exists in the artifact share after the failed bst build call
    • And/or use the cli.get_pushed_elements() checker to also verify the build log

I think we already have a similar test in tests/integration/cachedfail.py::test_push_cached_fail(); perhaps this requires constructing a dependency chain which fails the existing test but passes with this change applied?
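
Sketched out, such a test might look roughly like this. The helpers (`create_artifact_share`, `cli.configure`, `get_element_state`, the artifact lookup) are patterned on BuildStream's test suite but their import paths and signatures here are assumptions, and "element.bst" is a placeholder:

```python
# Hedged sketch of the suggested test shape; helper locations and
# signatures are assumptions, and "element.bst" is a placeholder.
import os

from buildstream.exceptions import ErrorDomain
from tests.testutils import create_artifact_share  # assumed location


def test_push_cached_fail_sketch(cli, tmpdir, datafiles):
    project = str(datafiles)

    with create_artifact_share(os.path.join(str(tmpdir), "artifactshare")) as share:
        # Point the project at the local test share, with pushing enabled.
        cli.configure({"artifacts": {"url": share.repo, "push": True}})

        # Run a build that is known to fail.
        result = cli.run(project=project, args=["build", "element.bst"])
        result.assert_main_error(ErrorDomain.STREAM, None)

        # The element should be a cached failure locally...
        assert cli.get_element_state(project, "element.bst") == "failed"

        # ...and the failed artifact should also have been pushed to the
        # share (lookup helpers assumed; the real suite has equivalents).
        assert share.get_artifact(cli.get_artifact_name(project, "test", "element.bst"))
```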

@abderrahim (Contributor Author)

> I think we already have a similar test in tests/integration/cachedfail.py::test_push_cached_fail(); perhaps this requires constructing a dependency chain which fails the existing test but passes with this change applied?

I think there is a misunderstanding here. What this test does (and what makes me confident that my changes don't break anything) is test for correctness ("buildstream pushes the artifacts it has built before quitting"), while my change is about interactivity ("buildstream quits ASAP when asked to").

I don't think I can reliably test for the interactivity part as buildstream isn't deterministic (or at least it isn't guaranteed to be deterministic) in the order in which it schedules ready jobs. How I would test this manually is to have two build jobs that would run in parallel, set --builders to 1, then interrupt the build and request quit during the build of the first element that gets scheduled.

Talking this through, I think we might be able to do the same thing with failed builds: have two elements that fail, launch the build with --on-error quit, and assert that only one of two elements gets built and pushed. I'll give it a go.
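
A rough sketch of that idea follows; the element names are taken from the pipeline output later in this thread, while the import paths and helpers are assumptions modelled on BuildStream's testing utilities, not the test that eventually landed:

```python
# Rough sketch, not the test that landed; import paths and helper
# signatures are assumptions modelled on BuildStream's testing utilities.
import os

import pytest

from buildstream.exceptions import ErrorDomain
from buildstream.testing import cli  # noqa: F401  (pytest fixture, assumed path)

DATA_DIR = os.path.join(os.path.dirname(os.path.realpath(__file__)), "project")


@pytest.mark.datafiles(DATA_DIR)
def test_stop_building_after_failed(cli, datafiles):
    project = str(datafiles)

    # With --builders 1 the two failing elements cannot build in parallel,
    # so as soon as the first one fails, --on-error quit should prevent
    # the second from ever being scheduled.
    result = cli.run(
        project=project,
        args=["--on-error", "quit", "--builders", "1", "build", "depends-on-two-failures.bst"],
    )
    result.assert_main_error(ErrorDomain.STREAM, None)

    # Exactly one of the two elements should have been built (and failed);
    # the other should never have been scheduled at all.
    states = [
        cli.get_element_state(project, element)
        for element in ("base-fail.bst", "base-also-fail.bst")
    ]
    assert states.count("failed") == 1
```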

@gtristan (Contributor)

gtristan commented May 18, 2025

> Talking this through, I think we might be able to do the same thing with failed builds: have two elements that fail, launch the build with --on-error quit, and assert that only one of two elements gets built and pushed. I'll give it a go.

That's what I was thinking (at least the --on-error quit part). Also, to control the behavior, we probably want an explicit --builders (possibly --builders 1?) to constrain how many elements buildstream builds in parallel.

Sorry, I read the first and last part of what you wrote, and then reading the middle part I can see you thought of all of that already ;-)

@gtristan force-pushed the abderrahim/quit-build branch from 5ec3901 to d15a8be on May 23, 2025 05:58
@gtristan (Contributor)

> [...]
>
> Talking this through, I think we might be able to do the same thing with failed builds: have two elements that fail, launch the build with --on-error quit, and assert that only one of two elements gets built and pushed. I'll give it a go.

I've rebased your branch, and added some new tests.

I've verified that these new tests indeed fail with your changes reverted.

@gtristan force-pushed the abderrahim/quit-build branch from d15a8be to 70c163e on May 29, 2025 14:57
@abderrahim force-pushed the abderrahim/quit-build branch from 70c163e to b6c34dc on May 29, 2025 15:15
@gtristan (Contributor)

gtristan commented May 31, 2025

So the update here is that this fails in CI, and we're confident that the tests are correct.

The original patch to the scheduler appears to work with Python 3.13 but not with the other Python versions covered in CI. The patch needs to be better understood, clarified, and made to work.

Note: one thing we know has changed in Python 3.13 is the default list sorting algorithm; this has affected stage ordering in buildstream, which we have worked around. It is unclear how this could have affected this scheduler patch, if indeed it is related at all.

@abderrahim (Contributor Author)

I don't think it's fair to say that this is dependent on the Python version. It's not fair to say it's because of this patch either: it could be a race condition somewhere else in the scheduler that we only see now with this test (and God knows how many race conditions there are in the scheduler; I remember two manifestations of such race conditions in everyday usage of buildstream).

As a data point, I just did 10 back-to-back runs of just these two tests (test_stop_building_after_failed and test_push_but_stop_building_after_failed) and all of them succeeded on all supported python versions.

I know this failure exists, and could reproduce it on my laptop at least once, but this is definitely some non-deterministic behaviour rather than a real issue with the patch.

I would recommend landing this patch now: it's definitely better than what we currently have, and we can keep investigating the test failure and land the test later (along with any fix to make it always pass).

@abderrahim (Contributor Author)

abderrahim commented Jun 1, 2025

I just reproduced the failure locally (by running the whole file rather than just the two tests), and looking at the log I can see that the issue is that the cache isn't correctly cleaned up between the tests.

In the pipeline summary printed when buildstream starts, one of the two elements is already "failed" (and base is already "cached").

Pipeline
      cached 32de3a73cff66516cefb7922fe38627bd66eb08de11bc3de55f717e9abd06e01 base.bst 
   buildable 326d9b9f2c93d81bd2fd53a19bd4fb310f4223f8ab8fb07c1e14dc386b06cad7 base-also-fail.bst 
      failed af38817f3fa7ade48339e6891a33972dedaadcabc73510925a53b5670ace64f2 base-fail.bst 
     waiting ab71767c47a2b0e12eafbf28594a73e5a9ad66c05554c88ac1b5cb5aeacb2240 depends-on-two-failures.bst 

BuildStream then builds the other one and has two failures at the end.

In the second test, both elements were already cached failures before the build started.

Pipeline
      cached 32de3a73cff66516cefb7922fe38627bd66eb08de11bc3de55f717e9abd06e01 base.bst 
      failed 326d9b9f2c93d81bd2fd53a19bd4fb310f4223f8ab8fb07c1e14dc386b06cad7 base-also-fail.bst 
      failed af38817f3fa7ade48339e6891a33972dedaadcabc73510925a53b5670ace64f2 base-fail.bst 
     waiting ab71767c47a2b0e12eafbf28594a73e5a9ad66c05554c88ac1b5cb5aeacb2240 depends-on-two-failures.bst 

@abderrahim (Contributor Author)

You can also see this in CI (with python 3.13, no less): https://github.com/apache/buildstream/actions/runs/15327164427/job/43124550203?pr=2005#step:3:8187

@gtristan force-pushed the abderrahim/quit-build branch from b6c34dc to 7f91257 on June 3, 2025 09:22
gtristan added 3 commits June 3, 2025 18:56
We use remove_artifact_from_cache() in integration tests where the cache is
shared and we just want to clean up some artifacts regardless of whether they
have already been built or not.
… project

This approach has been applied mostly elsewhere but not yet in the
integration tests.

Since we are modifying integration tests, it is our responsibility to
remodel the tests so that multiple tests do not share the same project
directory data.
… builds

Test that builds are no longer scheduled after a build fails with `--on-error quit`

Also test that failed builds are *still* pushed after a build fails, even
if no other builds are scheduled.

This is a regression test for #1787
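
As a sketch of the cleanup pattern the first of these commits describes: the helper name comes from the commit message, but the import location and signature used here are assumptions, not copied from the repository.

```python
# Rough sketch: remove_artifact_from_cache is named in the commit message
# above, but its import location and signature here are assumed.
from tests.testutils import remove_artifact_from_cache  # assumed location


def clean_leftover_artifacts(cache_dir):
    # The integration-test cache is shared between tests for speed, so
    # artifacts left behind by one test (including cached failures) must
    # be removed before another test rebuilds the same elements.
    for element in ("base-fail.bst", "base-also-fail.bst"):
        remove_artifact_from_cache(cache_dir, element)  # assumed signature
```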
@gtristan force-pushed the abderrahim/quit-build branch from 7f91257 to 8078dd7 on June 3, 2025 09:56
@gtristan merged commit 9d698d2 into master on Jun 3, 2025
17 checks passed
@gtristan deleted the abderrahim/quit-build branch on June 3, 2025 10:52
Closes #1787: BuildStream doesn't quit on ctrl+c then quit