Draft

nix flake check: free nixosConfigurations values after checking #15142

illustris wants to merge 1 commit into NixOS:master from illustris:flake-mem

Conversation

@illustris (Contributor) commented Feb 4, 2026

Save each nixosConfiguration's thunk state before checking, then restore
it immediately after. This makes the evaluated configuration tree
unreachable, allowing GC_gcollect() to reclaim memory before processing
the next config. This keeps only one configuration's evaluation tree in
memory at a time, rather than holding all evaluated configurations
simultaneously.
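
In rough pseudocode, the per-configuration loop looks something like this (an illustrative sketch only; the helper name and exact mechanics are placeholders, not the actual code in src/nix/flake.cc):

    // Sketch only: save the shallow Value (thunk) state, run the check,
    // then restore the saved state so the evaluated tree becomes unreachable.
    for (auto & attr : *vNixosConfigurations->attrs()) {
        Value saved = *attr.value;          // remember the unevaluated thunk state
        checkNixOSConfiguration(attr);      // placeholder for the existing per-config check
        *attr.value = saved;                // restore the thunk; the forced tree is now unreachable
        GC_gcollect();                      // let the Boehm GC reclaim it before the next config
    }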

Motivation

github:illustris/flake-check-mem-poc has 20 minimal NixOS configurations for PVE VMs. Without this patch, nix flake check uses about 5 GB of memory; bumping that to 100 nodes pushes memory usage to about 18 GB. Output of time -v for the 100-node flake without the patch:

        User time (seconds): 275.68
        System time (seconds): 174.27
        Percent of CPU this job got: 140%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 5:19.39
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 18579372
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 4482275
        Minor (reclaiming a frame) page faults: 5513432
        Voluntary context switches: 1069755
        Involuntary context switches: 65697
        Swaps: 0
        File system inputs: 35700044
        File system outputs: 201
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

With the patch, on the same 100-node flake:

        User time (seconds): 200.35
        System time (seconds): 3.96
        Percent of CPU this job got: 77%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 4:23.93
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 1479644
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 1807
        Minor (reclaiming a frame) page faults: 378344
        Voluntary context switches: 1070729
        Involuntary context switches: 31566
        Swaps: 0
        File system inputs: 435937
        File system outputs: 104
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0


@github-actions bot added the new-cli (Relating to the "nix" command) label on Feb 4, 2026
Review comment on src/nix/flake.cc, lines 729 to 731 (outdated):
auto * mutableAttr = const_cast<Attr *>(&attr);
mutableAttr->value = state->allocValue();
mutableAttr->value->mkNull();
Contributor:

This doesn't seem sound to me at all. What if there's another thunk that ends up referring to it in another flake output (like a select expression)? There's a reason that attrs() returns a readonly view - it's not generally safe to modify an existing Bindings (or any non-thunk Value).

@illustris (Contributor, Author) replied:

You're right. I tested nixosConfigurations cross-referencing each other, but did not test other flake outputs referencing nixosConfigurations:

$ /run/current-system/sw/bin/time -v /nix/store/916a1jk389d54qmcsn1rjbakklhdq6k7-nix-2.34.0pre20260204_b61d150/bin/nix flake check /tmp/test-flake
warning: Git tree '/tmp/test-flake' is dirty
error:
       … while checking flake output 'packages'
         at /tmp/test-flake/flake.nix:30:3:
           29|          );
           30|                packages.x86_64-linux.default = self.nixosConfigurations."1".config.system.build.toplevel;
             |   ^
           31|      };

       … while checking the derivation 'packages.x86_64-linux.default'
         at /tmp/test-flake/flake.nix:30:3:
           29|               );
           30|                packages.x86_64-linux.default = self.nixosConfigurations."1".config.system.build.toplevel;
             |   ^
           31|      };

       (stack trace truncated; use '--show-trace' to show the full, detailed trace)

       error: expected a set but found null: null

@xokdvium (Contributor) commented Feb 4, 2026

Generally this issue seems like a tradeoff between sharing and doing redundant work. Until we've evaluated everything, we don't know what will need to be forced, so we can't do something similar to this ahead of time. Maybe the best we could do is somehow demote forced values back to thunks based on some heuristic (if we are reasonably sure that we don't incur an extra high cost by having to redo the work of forcing them), but we don't have such a mechanism yet.
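
Purely as an illustration of that idea (none of these helpers exist in Nix today; the names are made up):

    // Hypothetical: demote a forced value back to its original thunk only when
    // re-forcing it later is expected to be cheap, trading a little CPU for memory.
    void maybeDemote(Value & v, const Value & originalThunk)
    {
        if (estimateReforceCost(v) < costThreshold)   // made-up heuristic
            v = originalThunk;   // the forced result becomes unreachable and can be GC'd
    }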

@xokdvium (Contributor) commented Feb 4, 2026

Also, are you sure that you are benchmarking with the eval cache and the fetcher cache prewarmed? This difference seems very suspicious to me:

> File system inputs: 35700044
>
> File system inputs: 435937

@illustris (Contributor, Author) replied:

> Generally this issue seems like a tradeoff between sharing and doing redundant work. Until we've evaluated everything, we don't know what will need to be forced, so we can't do something similar to this ahead of time. Maybe the best we could do is somehow demote forced values back to thunks based on some heuristic (if we are reasonably sure that we don't incur an extra high cost by having to redo the work of forcing them), but we don't have such a mechanism yet.

For nixosConfigurations at least, this would make sense. A fully evaluated NixOS system takes up far too much memory, and the relatively small additional compute needed to re-evaluate config values is a good tradeoff for the memory savings. For example:

$ /run/current-system/sw/bin/time -v /nix/store/916a1jk389d54qmcsn1rjbakklhdq6k7-nix-2.34.0pre20260204_b61d150/bin/nix eval /tmp/test-flake#nixosConfigurations.1.config.networking.hostId
warning: Git tree '/tmp/test-flake' is dirty
"11111111"
        Command being timed: "/nix/store/916a1jk389d54qmcsn1rjbakklhdq6k7-nix-2.34.0pre20260204_b61d150/bin/nix eval /tmp/test-flake#nixosConfigurations.1.config.networking.hostId"
        User time (seconds): 0.68
        System time (seconds): 0.14
        Percent of CPU this job got: 70%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.16
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 178936
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 16
        Minor (reclaiming a frame) page faults: 42786
        Voluntary context switches: 2971
        Involuntary context switches: 40
        Swaps: 0
        File system inputs: 31842
        File system outputs: 192
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

> Also, are you sure that you are benchmarking with the eval cache and the fetcher cache prewarmed? This difference seems very suspicious to me:
>
> File system inputs: 35700044
>
> File system inputs: 435937

The cache was not pre-warmed, and the 100-node flake was causing a lot of swapping, but that doesn't make much of a difference for memory utilization. I reran the tests with 20 nodes and 10 iterations; after warmup, the filesystem numbers were fairly consistent.

baseline:

        Command being timed: "/nix/store/614dzfxcahl6q6dhz9ysjfsrb948sqkh-nix-2.34.0pre20260203_27435e0/bin/nix flake check /tmp/test-flake"
        User time (seconds): 56.57
        System time (seconds): 2.39
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:59.01
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 4959664
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 1246096
        Voluntary context switches: 215181
        Involuntary context switches: 3295
        Swaps: 0
        File system inputs: 64634
        File system outputs: 98
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

patched:

        Command being timed: "/nix/store/916a1jk389d54qmcsn1rjbakklhdq6k7-nix-2.34.0pre20260204_b61d150/bin/nix flake check /tmp/test-flake"
        User time (seconds): 45.52
        System time (seconds): 1.22
        Percent of CPU this job got: 80%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:57.99
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 1286680
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 327877
        Voluntary context switches: 217838
        Involuntary context switches: 4249
        Swaps: 0
        File system inputs: 64345
        File system outputs: 42
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

I'll update the patch to revert evaluated nixosConfigurations back to thunks and test it. I think it will improve nix flake check for most use cases, with the exception of a small set of scenarios like
packages.x86_64-linux.default = self.nixosConfigurations."1".config.system.build.toplevel;, where it will have to re-evaluate the whole system configuration.

@illustris force-pushed the flake-mem branch 2 times, most recently from 0d24aa3 to f282c79 on February 5, 2026 at 06:26
@illustris (Contributor, Author) commented:

Test Case                             Variant    Wall Time   User Time   Max RSS (MB)
NixOS configs only                    baseline   0:59.70     58.03s      ~4,847
NixOS configs only                    patch      0:58.96     56.37s      ~701
Configs + pkg accessing config attr   baseline   1:00.31     57.85s      ~4,847
Configs + pkg accessing config attr   patch      1:03.55     64.54s      ~1,605
Configs + pkg referencing toplevel    baseline   1:38.52     87.99s      ~8,484
Configs + pkg referencing toplevel    patch      2:37.73     147.28s     ~8,809
  • Case 1: Patch reduces peak RSS by ~85% (4.8 GB → 701 MB) with no time increase.
  • Case 2: Patch reduces peak RSS by ~67% (4.8 GB → 1.6 GB) with a minor (~5%) time increase.
  • Case 3: Peak RSS is roughly the same and time goes up by ~60% (1:38 → 2:37) when toplevel is forced.

Test case flake code and full output of time -v:
https://gist.github.com/illustris/391bd5562499aea1df12133c1d04ff23

In my opinion, trading off memory for extra compute in a few edge cases makes sense here. Flakes with many NixOS configurations are common, but flakes that also expose other outputs forcing evaluation of every config.system.build.toplevel are rare.

