CI: Various Improvements to CI#1209
Conversation
markbrown314
commented
Jun 24, 2025
- XPMEM build failure due to use of deprecated paging macros XPMEM incompatible with latest Ubuntu 24.04 Kernel #1206
- pud_large() and pmd_large() are not supported in newer Linux kernels
- Stubbed out pud_large() and pmd_large() referenced in xpmem_vaddr_to_pte_offset() via CI build script
- xpmem_vaddr_to_pte_offset() not used by SOS, so tests are not impacted
- Added timeout for SOS tests
- Added core dump artifact uploading and logging support
- Updated OFI testing to version 2.1.x
- Disabled failing Portals 4 tests Portals 4 Tests Fail on Ubuntu 24.04 #1208
| exit 2 | ||
| fi | ||
|
|
||
| sudo bash -c 'echo '"${CORE_DIR}"'/%E.%p.core > /proc/sys/kernel/core_pattern' |
There was a problem hiding this comment.
Just checking - Any issues with sudo being used in the CI script?
There was a problem hiding this comment.
Yes, sudo is required to modify sysfs. Sudo is okay in github runners
.github/scripts/scan_core.sh
Outdated
| echo "notice: no core dump files found" | ||
| exit 0 | ||
| fi | ||
|
|
There was a problem hiding this comment.
Should this check be done before the loop?
There was a problem hiding this comment.
There is no difference the loop will just break out if core_list is an empty string. Good idea! Moving it up looks cleaner.
There was a problem hiding this comment.
Your call. I would move it to make it clear.
| - config_name: Without Non-fetch AMO | ||
| sos_config: --disable-nonfetch-amo --enable-pmi-simple | ||
| libfabric_version: v1.13.x | ||
| libfabric_version: v2.1.x |
There was a problem hiding this comment.
can a github variable or secret be used for version info instead?
There was a problem hiding this comment.
Perhaps. I do plan to refactor this to be array based and if there is a way we configure this externally as an override. That would be good.
.github/workflows/ci.yml
Outdated
| SHMEM_DEBUG=1 SHMEM_INFO=1 SHMEM_OFI_PROVIDER=sockets make VERBOSE=1 TEST_RUNNER="${SOS_PM} -np 2" check | ||
| ulimit -c unlimited | ||
| ulimit -a | ||
| SHMEM_DEBUG=1 SHMEM_INFO=1 SHMEM_OFI_PROVIDER=sockets make VERBOSE=1 TEST_RUNNER="${SOS_PM} -np 2 timeout --signal=ABRT 15m" check |
There was a problem hiding this comment.
You might want to add FI_PROVIDER=sockets also.
| - name: Install dependencies | ||
| run: | | ||
| sudo apt-get install -y gfortran rpm mpich libmpich-dev libhwloc-dev | ||
| sudo apt-get install -y gfortran rpm mpich libmpich-dev libhwloc-dev gdb |
There was a problem hiding this comment.
Comment: Given the rampant use of sudo in this file, I guess it is assumed that CI has privs. Can ignore my first comment in that case.
6654e7d to
6f23dee
Compare
|
There was a failure in the XPMEM_Only test with mt_membar. It is not caused by this change it is being monitored here #1191. |
.github/workflows/ci.yml
Outdated
| sudo apt-get install linux-headers-`uname -r` | ||
| ./autogen.sh | ||
| sed -i 's/^#define pmd_is_huge(p) pmd_large(p)/#define pmd_is_huge(p) (0)/' kernel/xpmem_pfn.c | ||
| sed -i 's/^#define pud_is_huge(p) pud_large(p)/#define pud_is_huge(p) (0)/' kernel/xpmem_pfn.c |
There was a problem hiding this comment.
I'm not sure I love this... have you considered using github.com:openucx/xpmem? It looks like they've forked and seem to do better w.r.t. pmd_is_huge...
There was a problem hiding this comment.
I will check it out. The function that references these macros is not used by SOS, but it would be nice to use a better maintained xpmem.
6f23dee to
feaa6a5
Compare
* XPMEM build failure due to use of deprecated paging macros Sandia-OpenSHMEM#1206 * pud_large() and pmd_large() are not supported in newer Linux kernels * Moved to OpenUCX version of XPMEM which is better supported * Added timeout for SOS tests * Added core dump artifact uploading and logging support * Updated OFI testing to version 2.1.x * Disabled failing Portals 4 tests Sandia-OpenSHMEM#1208 Signed-off-by: Mark F. Brown <mark.f.brown@intel.com>
feaa6a5 to
97c6627
Compare
markbrown314
left a comment
There was a problem hiding this comment.
Fixed commit message and switch to using OpenUCX XPMEM fork