@roberth
Member

@roberth roberth commented Jan 11, 2026

Motivation

When !keepGoing (the default), and a derivation fails, other derivations are cancelled.
Due to two mistakes, cancelled derivations would be reported as failed, with an empty error message.

  • amDone was called without setting the BuildResult in the trampoline goal.
  • Cancelled goals weren't filtered out by the error reporting logic.

See commit messages for more detailed explanations and substantiation.

Context

  • This issue is probably underreported because it makes users believe both their own build and Nix itself are buggy, and it's not an easy report to figure out.
  • This only affects buildPathsWithResults use cases, which is not the whole CLI, and notably not nix build. goal->ex worked fine.
  • More relevant since "Track attributes in nix flake check" (#14321) makes nix flake check use buildPathsWithResults.

Add 👍 to pull requests you find important.

The Nix maintainer team uses a GitHub project board to schedule and track reviews.

Add helpers to the base Goal class that set buildResult and call amDone,
ensuring buildResult is always populated when a goal terminates.

Derived class helpers now call the base class versions. This reorders
operations: previously buildResult was set before bookkeeping (counter
resets, worker stats), now it's set after. This is safe because the
bookkeeping code (mcExpectedBuilds.reset(), worker.doneBuilds++,
worker.updateProgress(), etc.) only accesses worker counters, not
buildResult.
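The pattern described in this commit message can be sketched roughly as follows. This is a simplified, self-contained model and not the actual Nix source: the `Goal`, `BuildResult`, `Worker`, and `DerivationGoal` types are stand-ins, `failWith` is a hypothetical name, and the signatures of `amDone`, `doneSuccess`, and `doneFailure` are assumptions made for illustration.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Simplified stand-ins for the real Nix types; fields and signatures are assumptions.
enum class ExitCode { Busy, Success, Failed };

struct BuildResult {
    bool success = false;
    std::string errorMsg; // empty by default, which is what led to the blank error message
};

struct Worker {
    int doneBuilds = 0;
};

struct Goal {
    ExitCode exitCode = ExitCode::Busy;
    BuildResult buildResult;

    virtual ~Goal() = default;

protected:
    // Low-level termination: records only the exit code.
    void amDone(ExitCode result) { exitCode = result; }

    // Base-class helpers: always populate buildResult, then terminate.
    void doneSuccess() {
        buildResult = BuildResult{true, ""};
        amDone(ExitCode::Success);
    }

    void doneFailure(std::string msg) {
        buildResult = BuildResult{false, std::move(msg)};
        amDone(ExitCode::Failed);
    }
};

struct DerivationGoal : Goal {
    Worker & worker;
    explicit DerivationGoal(Worker & w) : worker(w) {}

    // Derived helper (hypothetical name): bookkeeping first, which touches
    // only worker counters, then the base helper, which sets buildResult.
    void failWith(std::string msg) {
        worker.doneBuilds++;
        doneFailure(std::move(msg));
    }
};
```

In this model, calling `failWith("hash mismatch")` on a `DerivationGoal` leaves both `exitCode` and `buildResult.errorMsg` populated, so a caller that reads only `buildResult` never sees an empty failure.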
@github-actions github-actions bot added the with-tests Issues related to testing. PRs with tests have some priority label Jan 11, 2026
@roberth roberth changed the title Fix concurrent failure bug Fix concurrent builder failure empty message bug Jan 11, 2026
Member

@Eveeifyeve Eveeifyeve left a comment

I saw no issues in the code itself. However, I am wondering how the tests look, and whether they can reproduce this issue to confirm that it's fixed.

@roberth roberth force-pushed the fix-concurrent-failure-bug branch 2 times, most recently from 0c9b296 to febab73 Compare January 11, 2026 19:37
@roberth
Member Author

roberth commented Jan 11, 2026

Ok, I've added more intentional regression tests.
The DerivationTrampolineGoal bug was tricky to trigger, because I initially got the conditions under which it triggers wrong. It's somewhat of a separate issue with the same symptom, as it can be triggered with nix build, not just nix flake check.
^ Just a breadcrumb for future explorers.

The regression tests now cover both fixes.

@Ericson2314 Ericson2314 force-pushed the fix-concurrent-failure-bug branch from febab73 to 7b53a13 Compare January 11, 2026 22:01
@roberth roberth changed the title Fix concurrent builder failure empty message bug Fix concurrent builder failure empty message bugs Jan 12, 2026
@xokdvium
Contributor

@Ericson2314, what's left to do here?

@roberth roberth added the backport 2.33-maintenance Automatically creates a PR against the branch label Jan 14, 2026
@Ericson2314
Member

@roberth was going to squash to 2 commits, and then I was going to take a stab at removing the optional exception argument, since the BuildResult::Failure already has a message. (And indeed BuildError and BuildResult::Failure are the same thing; I should fix that too.)

DerivationTrampolineGoal is the top-level goal whose buildResult is
returned by buildPathsWithResults. When it failed without setting
buildResult.inner, buildPathsWithResults would return failures with
empty errorMsg, producing error messages like:

  error: failed to build attribute 'checks.x86_64-linux.foo',
  build of '/nix/store/...drv^*' failed:

(note the empty message after "failed:")

Use the new doneFailure helper to ensure buildResult is populated
with meaningful error information.

When keepGoing=false and a build fails, other goals are cancelled.
Previously, these cancelled goals were reported in the "build of ...
failed" error message alongside actual failures. This was misleading
since cancelled goals didn't actually fail - they were never tried.

Update the test to expect only the actual failure (hash mismatch) to
be reported, not the cancelled goals.

When !keepGoing and a goal fails, other goals are cancelled and
remain with exitCode == ecBusy. These cancelled goals have a default
BuildResult::Failure{} with empty errorMsg.

Previously, buildPathsWithResults would return these cancelled goals,
and throwBuildErrors would report them as failures. When only one such
cancelled goal was present, it would throw an error with an empty
message like:

    error: build of '/nix/store/...drv^*' failed:

Now we skip goals with ecBusy since their state is indeterminate.
Cancelled goals could be reported, but this keeps the output relevant.
Other indeterminate goal states were already not being reported, for
instance: derivations that weren't started for being blocked on a
concurrency limit, or blocked on a currently building dependency.
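The filtering described above might look roughly like the sketch below. It is a simplified model, not the actual `buildPathsWithResults` or `throwBuildErrors` code: `GoalOutcome`, `reportableOutcomes`, and `failureMessages` are hypothetical names, and `ExitCode::Busy` stands in for `ecBusy`.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Simplified stand-ins; the real Nix types carry much more state.
enum class ExitCode { Busy, Success, Failed };

struct GoalOutcome {
    std::string path;     // e.g. a derivation path (hypothetical field)
    ExitCode exitCode;
    std::string errorMsg; // stays empty for goals that never ran
};

// Keep only goals that reached a definite state. Goals still marked Busy
// were cancelled or never started, so they are neither successes nor failures.
std::vector<GoalOutcome> reportableOutcomes(const std::vector<GoalOutcome> & goals)
{
    std::vector<GoalOutcome> out;
    for (const auto & g : goals) {
        if (g.exitCode == ExitCode::Busy)
            continue; // indeterminate: skip instead of reporting an empty failure
        out.push_back(g);
    }
    return out;
}

// Collect one message per goal that actually failed.
std::vector<std::string> failureMessages(const std::vector<GoalOutcome> & goals)
{
    std::vector<std::string> msgs;
    for (const auto & g : reportableOutcomes(goals))
        if (g.exitCode == ExitCode::Failed)
            msgs.push_back("build of '" + g.path + "' failed: " + g.errorMsg);
    return msgs;
}
```

Given one failed goal, one cancelled goal, and one successful goal, `failureMessages` in this sketch yields a single message for the failed goal and nothing for the cancelled one.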

Change "cannot build missing derivation" to "failed to obtain derivation of"
since the path (e.g. '...drv^out') is a derivation output, not a derivation.

The message could be improved further to resolve ambiguity when multiple
outputOf links are involved, but for now we err on the side of brevity
since this message is already merged into larger error messages with
other context from the Worker and CLI.
@roberth roberth force-pushed the fix-concurrent-failure-bug branch from 7b53a13 to 3c3ceb1 Compare January 14, 2026 19:43
@roberth
Member Author

roberth commented Jan 14, 2026

2 commits

The fixes are each in their own commit now, but I have still kept separate:

  • The pure refactor, 1st commit
  • Test suite change (adjustment of requirements; not just a fix)
  • Error message rewording is also a separate change

then I was going to take a stab at removing the optional exception argument, since the BuildResult::Failure already has a message. (And indeed BuildError and BuildResult::Failure are the same thing; I should fix that too.)

Should we block the fix on that?

@xokdvium
Contributor

Should we block the fix on that?

Don't think so. It's a pretty annoying bug.

@Ericson2314
Member

OK, I'll just do the thing I want to do afterwards.

@Ericson2314 Ericson2314 added this pull request to the merge queue Jan 23, 2026
Merged via the queue into NixOS:master with commit 83360cd Jan 23, 2026
14 checks passed
@internal-nix-ci

Successfully created backport PR for 2.33-maintenance:

@roberth roberth added the backport 2.32-maintenance Automatically creates a PR against the branch label Jan 23, 2026
@internal-nix-ci

Backport failed for 2.32-maintenance, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin 2.32-maintenance
git worktree add -d .worktree/backport-14972-to-2.32-maintenance origin/2.32-maintenance
cd .worktree/backport-14972-to-2.32-maintenance
git switch --create backport-14972-to-2.32-maintenance
git cherry-pick -x cb2ade20d4b45ae1f1838d453c5484287516308f 25eb07a91b377f62322600d45d520453111a79eb 3fd85c7d64327df3b145f239211ebf7519a69c2a 68f549def46586fc5aee8b5ce62e7e8bb5a16703 3c3ceb18e9a3c421ab5990e381dfbe40e3dc3cec

@internal-nix-ci

Backport failed for 2.33-maintenance, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin 2.33-maintenance
git worktree add -d .worktree/backport-14972-to-2.33-maintenance origin/2.33-maintenance
cd .worktree/backport-14972-to-2.33-maintenance
git switch --create backport-14972-to-2.33-maintenance
git cherry-pick -x cb2ade20d4b45ae1f1838d453c5484287516308f 25eb07a91b377f62322600d45d520453111a79eb 3fd85c7d64327df3b145f239211ebf7519a69c2a 68f549def46586fc5aee8b5ce62e7e8bb5a16703 3c3ceb18e9a3c421ab5990e381dfbe40e3dc3cec

@roberth
Member Author

roberth commented Jan 23, 2026

Slated for release in 2.33.2 and 2.34.0.
No further backports needed.

@roberth roberth added backports created Does not require attention and can be filtered away and removed backport 2.32-maintenance Automatically creates a PR against the branch labels Jan 23, 2026
@xokdvium
Contributor

This fails for functional_root:

vm-test-run-functional-tests-on-nixos_root> machine # running 4 flake checks...
vm-test-run-functional-tests-on-nixos_root> machine # error:
vm-test-run-functional-tests-on-nixos_root> machine #        … while setting up the build environment
vm-test-run-functional-tests-on-nixos_root> machine #
vm-test-run-functional-tests-on-nixos_root> machine #        error: opening file '\''/nix/store/ii4fsd6ws13ysgp75yzcqmxl4zrgsip1-fast-fail.drv.chroot/root/nix/store/90cp3kgrbrmll2wkb9fsmd2rwm3asqqv-builder-fast-fail.sh'\'': Permission denied'
vm-test-run-functional-tests-on-nixos_root> machine # + status=1
vm-test-run-functional-tests-on-nixos_root> machine # + rm -rf /tmp/nix-test/main/build/cancelled-builds-fifo
vm-test-run-functional-tests-on-nixos_root> machine # + test 1 = 1
vm-test-run-functional-tests-on-nixos_root> machine # + grepQuiet -E 'Cannot build.*fast-fail'
vm-test-run-functional-tests-on-nixos_root> machine # + checkGrepArgs -E 'Cannot build.*fast-fail'
vm-test-run-functional-tests-on-nixos_root> machine # + local arg
vm-test-run-functional-tests-on-nixos_root> machine # + for arg in "$@"
vm-test-run-functional-tests-on-nixos_root> machine # + [[ -E != \-\E ]]
vm-test-run-functional-tests-on-nixos_root> machine # + for arg in "$@"
vm-test-run-functional-tests-on-nixos_root> machine # + [[ Cannot build.*fast-fail != \C\a\n\n\o\t\ \b\u\i\l\d\.\*\f\a\s\t\-\f\a\i\l ]]
vm-test-run-functional-tests-on-nixos_root> machine # + command grep -E 'Cannot build.*fast-fail'

@edolstra
Member

edolstra commented Feb 2, 2026

FWIW, I had to revert 68f549d in Determinate Nix because we rely on buildPathsWithResults() returning cancelled builds so that nix flake check and nix build can show them (see DeterminateSystems#281).

@xokdvium
Contributor

xokdvium commented Feb 2, 2026

The proper fix would have to be in #14559 so that we don't misreport those as failures. The only question would be the back-compat with older clients not understanding the cancelled status code. That might be a use-case for fine grained features?
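One way the back-compat concern could be handled is sketched below, purely as an illustration. Everything in it is hypothetical: the `Status` values, the feature name `cancelled-build-status`, the `statusForClient` function, and the choice of `MiscFailure` as the fallback are assumptions, not the actual daemon protocol or the approach taken in #14559.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical status codes; the real wire protocol values may differ.
enum class Status { Built, MiscFailure, Cancelled };

// Hypothetical feature name that a client would advertise during negotiation.
static const char * const cancelledStatusFeature = "cancelled-build-status";

// Pick the status to send to a client, given the features it advertised.
// Clients that did not advertise the feature get a status they already understand.
Status statusForClient(Status actual, const std::set<std::string> & clientFeatures)
{
    if (actual == Status::Cancelled && clientFeatures.count(cancelledStatusFeature) == 0)
        return Status::MiscFailure; // fallback choice is an assumption
    return actual;
}
```

Under these assumptions, a client that advertises the feature receives `Status::Cancelled` unchanged, while an older client still sees a generic failure, which is the behaviour it had before.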

@Ericson2314
Member

@amaanq and I have some code for daemon protocol compat that will help with that, FYI.

Labels

backport 2.33-maintenance Automatically creates a PR against the branch backports created Does not require attention and can be filtered away bug scheduling with-tests Issues related to testing. PRs with tests have some priority

5 participants