Skip to content

Use initially stored environment to determine GPU info for test reports#5088

Open
Thyre wants to merge 2 commits intoeasybuilders:developfrom
Thyre:restore-env-after-final-build
Open

Use initially stored environment to determine GPU info for test reports#5088
Thyre wants to merge 2 commits intoeasybuilders:developfrom
Thyre:restore-env-after-final-build

Conversation

@Thyre
Copy link
Collaborator

@Thyre Thyre commented Jan 9, 2026

Tools like amd-smi and rocm-smi rely on having a clean environment to correctly pick up information. However, the last build in build_and_install_software leaves a tampered environment, potentially breaking both commands.
To still correctly determine the GPU information, introduce a variable allowing to use the initial environment instead. Also use this for nvidia-smi, as its, by now, typically installed on the system level and not as a module.

Ref. #5086

@Thyre Thyre added the bug fix label Jan 9, 2026
@Thyre Thyre added this to the next release (5.2.1?) milestone Jan 9, 2026
@Thyre
Copy link
Collaborator Author

Thyre commented Jan 9, 2026

Successful test report with this PR e.g. here: easybuilders/easybuild-easyconfigs#24982 (comment)

@Thyre
Copy link
Collaborator Author

Thyre commented Jan 9, 2026

 ======================================================================
ERROR: test_compiler_cache (test.framework.toolchain.ToolchainTest)
Test ccache
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/runner/93c7052ffb0c1767f98f65b7279b2ab1bf2439b4/lib/python3.8/site-packages/test/framework/toolchain.py", line 2467, in test_compiler_cache
    self.assertTrue(os.path.samefile(os.environ['CCACHE_DIR'], ccache_dir))
  File "/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/os.py", line 675, in __getitem__
    raise KeyError(key) from None
KeyError: 'CCACHE_DIR'

This is the same failure encountered with EasyStacks back in #4213 (comment).

The main reasoning was

we assume that the build environment does not get reset after the installation of easyconfigs

which, from my perspective, is a horrible assumption. Anything that runs after an installation, until EasyBuild has finished its execution, runs in a tampered environment. Loaded modules from a prior build might break the run_shell_cmd depending on the loaded dependencies, causing issues like seen in #5086. Environment variables might not match expected values anymore. The worst thing is that the effects are not consistent. They depend on the last built EasyConfig and its EasyBlock.

The other reason

The failing test is actually telling us that we're making a somewhat intrusive change.
It's not impossible that some people are actually relying on the build environment not being reset, for example when using EasyBuild as a library.

doesn't justify the side-effect of not having a clean environment between exiting build_and_install_software and entering it the next time for further builds. It's even worse with using EasyBuild as a library, as earlier hooks (such as a parse hook) could be affected as well, if those hooks are executed before build_and_install_software. Though it's understandable that introducing a potentially breaking change wasn't desirable at the time.

Documentation for this behavior is also lacking I'd say. run_shell_cmd mentions that the command runs in the current processes environment by default, but build_and_install_software neither documents that it resets the environment between builds, nor that it doesn't reset the environment after the final one (see doc).

For eb, this seems, as far as I can tell, only affect get_gpu_info() and the end hook. The remaining system info is pulled from init_session_state.

There are a few solutions for handling this:

  1. Move the GPU info detection to be done earlier, which would also allow to display them in the Gist, as this solely relies on init_session_state at the moment
  2. Add a parameter to build_and_install_software to optionally reset the environment at the end, with eb / process_eb_args doing that by default
  3. Use the "clean" environment from init_session_state to run all run_shell_cmds in get_gpu_info(). This would require adding a parameter to get_gpu_info, which is a breaking change. So we'd need to set a default (probably keep inheriting the build environment) and only pass the init_session_state when called from post_pr_test_report.

@Thyre Thyre force-pushed the restore-env-after-final-build branch from 9c0e19c to c94ce7f Compare January 9, 2026 22:30
@Thyre
Copy link
Collaborator Author

Thyre commented Jan 9, 2026

Switched to option three, as this is generally the most flexible option for now. If necessary, one could pass a different environment. The behavior only changed for post_pr_test_report, where we pass the initial environment.

Also added documentation for the behavior of build_and_install_software.

@Thyre Thyre changed the title Restore environment after final build Use initially stored environment to determine GPU info for test reports Jan 9, 2026
Crivella
Crivella previously approved these changes Jan 12, 2026
Copy link
Contributor

@Crivella Crivella left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@boegel similar comment for the testing of get_gpu_info specifically as in #5087 which i think would warrant its own PR

@boegel
Copy link
Member

boegel commented Jan 14, 2026

@Thyre merge conflict to resolve after merge of #5087

Thyre added 2 commits January 14, 2026 15:42
Tools like `amd-smi` and `rocm-smi` rely on having a clean environment to
correctly pick up information. However, the last build in
`build_and_install_software` leaves a tampered environment from the last
build, breaking both commands.
To still correctly determine the GPU information, introduce a variable
allowing to use the initial environment instead. Also use this for
`nvidia-smi`, as its, by now, typically installed on the system level and
not as a module.

Signed-off-by: Jan André Reuter <j.reuter@fz-juelich.de>
Signed-off-by: Jan André Reuter <j.reuter@fz-juelich.de>
@Thyre
Copy link
Collaborator Author

Thyre commented Jan 14, 2026

@Thyre merge conflict to resolve after merge of #5087

Rebased, thanks for the ping 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants