plugin/amdgpu: Support for open dmabuf handles#1
Open
cfreeamd wants to merge 7 commits into fdavid-amd:dmabuf-post from
amdgpu represents allocated device memory as a memory mapping of the device file. This is a non-standard VMA that must be handled by the plugin, not by the normal VMA code. Ignore all VMAs on device files.

Signed-off-by: David Francis <David.Francis@amd.com>
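One way to recognise such a VMA is to check whether the mapped file is a character device. The sketch below is only an illustration: `vma_file_is_device()` is a hypothetical helper, not the function this commit adds, and it assumes the VMA's backing path is already known.

```c
#include <stdbool.h>
#include <sys/stat.h>

/*
 * Hypothetical helper: returns true if the file backing a VMA is a
 * character device (for example a DRM render node). Such mappings are
 * left to the device plugin instead of the generic VMA dump code.
 */
static bool vma_file_is_device(const char *path)
{
	struct stat st;

	if (stat(path, &st) != 0)
		return false; /* cannot tell; let the generic code decide */
	return S_ISCHR(st.st_mode);
}
```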
During restore, the amdgpu plugin must hold onto fds for dmabufs as they are transferred from one process to another. These fds must be chosen so that they do not conflict with other fds used by restore. Extend the service_fd system, which already finds unused fds, to allow requesting an unused fd.

Signed-off-by: David Francis <David.Francis@amd.com>
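To illustrate what "requesting an unused fd" involves, the sketch below scans down from a chosen ceiling for a free descriptor number and moves an fd there with `dup2()`. It is not CRIU's service_fd code: `find_unused_fd()` and `park_fd()` are hypothetical names, and scanning downward from a fixed `top` is an assumption about where such fds might be placed.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical sketch: scan downward from `top` for a descriptor number
 * that is not currently open. Returns that number, or -1 if none is free.
 */
static int find_unused_fd(int top)
{
	for (int fd = top; fd >= 0; fd--) {
		if (fcntl(fd, F_GETFD) == -1 && errno == EBADF)
			return fd;
	}
	return -1;
}

/*
 * Hypothetical: move `fd` into a free slot at or below `top` so it will not
 * clash with fds installed later. Returns the new fd, or -1 on failure.
 */
static int park_fd(int fd, int top)
{
	int target = find_unused_fd(top);

	if (target < 0)
		return -1;
	if (dup2(fd, target) < 0)
		return -1;
	close(fd); /* target != fd, because fd was open and target was not */
	return target;
}
```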
amdgpu dmabuf CRIU requires the amdgpu plugin to be able to retry. Change files_ext.c to interpret a return value of 1 from a plugin restore function as a request to retry.

Signed-off-by: David Francis <David.Francis@amd.com>
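A minimal sketch of the caller side of this convention, assuming a hypothetical `pending_file` list swept within a single process: 0 marks a file as restored, a negative value aborts, and 1 leaves it pending for the next sweep. The no-progress check is a single-process simplification. In a real multi-process restore, another process could make the progress that unblocks a retried file. The two example callbacks model an importer that has to wait for an exporter.

```c
#include <stddef.h>

/* Hypothetical convention: 0 = restored, <0 = error, 1 = retry later. */
typedef int (*restore_fn)(void *arg);

struct pending_file {
	restore_fn restore;
	void *arg;
	int done;
};

/*
 * Sweep the pending list until every file is restored. Returns 0 on
 * success, the first negative error, or -1 if a full sweep made no
 * progress (every remaining file asked to retry).
 */
static int restore_with_retry(struct pending_file *files, size_t n)
{
	size_t remaining = n;

	while (remaining > 0) {
		size_t progressed = 0;

		for (size_t i = 0; i < n; i++) {
			int ret;

			if (files[i].done)
				continue;
			ret = files[i].restore(files[i].arg);
			if (ret < 0)
				return ret;
			if (ret == 0) {
				files[i].done = 1;
				progressed++;
			}
			/* ret == 1: leave it pending for the next sweep */
		}
		if (progressed == 0)
			return -1; /* nothing could make progress */
		remaining -= progressed;
	}
	return 0;
}

/* Example callbacks: the importer cannot finish until the exporter has run. */
static int bo_exported;

static int restore_exporter(void *arg)
{
	(void)arg;
	bo_exported = 1;
	return 0;
}

static int restore_importer(void *arg)
{
	(void)arg;
	return bo_exported ? 0 : 1;
}
```

With the importer listed first, the first sweep asks it to retry while the exporter completes, and the second sweep restores the importer.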
For the amdgpu plugin to call the new amdgpu drm CRIU ioctls, it needs the amdgpu drm header file, copied from the kernel's includes.

Signed-off-by: David Francis <David.Francis@amd.com>
Buffer objects held by the amdgpu drm driver are restored with the new DRM_IOCTL_AMDGPU_CRIU_OP ioctl. Handling for this ioctl is in amdgpu_plugin_drm.h.

Handling of imported buffer objects may require dmabuf fds to be transferred between processes. These transfers occur over sockets created by the amdgpu plugin. There are two new plugin callbacks: COLLECT_FILE, to identify the processes that have amdgpu files and therefore need a socket, and RESUME_DEVICES_EARLY, to create the sockets before any files are restored. Before each amdgpu file restore, check the socket and record the received dmabuf_fds.

During checkpoint, track shared buffer objects so that buffer objects shared across processes can be identified. During restore, track which buffer objects have been restored. Retry restore of a drm file if a buffer object is imported and the original has not been exported yet. Skip buffer objects that have already been completed or cannot be completed in the current restore.

sdma_copy_bo no longer requires kfd bo structs, so that the drm code can use it.

Update the protobuf messages with the new amdgpu drm information.

Signed-off-by: David Francis <David.Francis@amd.com>
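The commit message does not say how the fds cross these sockets. The standard Linux mechanism for passing an open fd between processes is an SCM_RIGHTS control message on an AF_UNIX socket, so the sketch below assumes that. `send_fd()` and `recv_fd()` are hypothetical helpers; a real implementation would probably send a buffer-object identifier alongside the fd rather than a single placeholder byte.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Hypothetical: send one fd over a connected AF_UNIX socket via SCM_RIGHTS. */
static int send_fd(int sock, int fd)
{
	char byte = 0; /* at least one byte of real data must accompany the fd */
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align; /* ensures correct alignment of buf */
	} u;
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Hypothetical: receive one fd sent by send_fd(); returns it, or -1. */
static int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;
	int fd;

	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;
	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET || cmsg->cmsg_type != SCM_RIGHTS)
		return -1;
	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd; /* a new descriptor referring to the same open file */
}
```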
Previously, the amdgpu plugin determined when to call its UNPAUSE ioctl by counting the files that had been restored. This was not reliable: there may be more or fewer device files than expected, and other processes may still be checkpointing when UNPAUSE is called. Add a new plugin callback, DUMP_DEVICE_LATE, which is called after all processes have finished checkpointing their files.

Signed-off-by: David Francis <David.Francis@amd.com>
Modifications to handle dump/restore of open dmabuf file handles.
fdavid-amd pushed a commit that referenced this pull request on Feb 3, 2026
Running the zdtm/static/unlink_regular00 test on Ubuntu 24.04 on aarch64
results in the following error:
# ./zdtm.py run -t zdtm/static/unlink_regular00 -k always
userns is supported
=== Run 1/1 ================ zdtm/static/unlink_regular00
==================== Run zdtm/static/unlink_regular00 in ns ====================
Skipping rtc at root
Start test
Test is SUID
./unlink_regular00 --pidfile=unlink_regular00.pid --outfile=unlink_regular00.out --dirname=unlink_regular00.test
Run criu dump
*** buffer overflow detected ***: terminated
############# Test zdtm/static/unlink_regular00 FAIL at CRIU dump ##############
Test output: ================================
<<< ================================
Send the 9 signal to 47
Wait for zdtm/static/unlink_regular00(47) to die for 0.100000
##################################### FAIL #####################################
According to the backtrace:
#0 __pthread_kill_implementation (threadid=281473158467616, signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
#1 0x0000ffff93477690 in __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2 0x0000ffff9342cb3c in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3 0x0000ffff93417e00 in __GI_abort () at ./stdlib/abort.c:79
#4 0x0000ffff9346abf0 in __libc_message_impl (fmt=fmt@entry=0xffff93552a78 "*** %s ***: terminated\n") at ../sysdeps/posix/libc_fatal.c:132
#5 0x0000ffff934e81a8 in __GI___fortify_fail (msg=msg@entry=0xffff93552a28 "buffer overflow detected") at ./debug/fortify_fail.c:24
#6 0x0000ffff934e79e4 in __GI___chk_fail () at ./debug/chk_fail.c:28
#7 0x0000ffff934e9070 in ___snprintf_chk (s=s@entry=0xffffc6ed04a3 "testfile", maxlen=maxlen@entry=4056, flag=flag@entry=2, slen=slen@entry=4053,
format=format@entry=0xaaaacffe3888 "link_remap.%d") at ./debug/snprintf_chk.c:29
#8 0x0000aaaacff4b8b8 in snprintf (__fmt=0xaaaacffe3888 "link_remap.%d", __n=4056, __s=0xffffc6ed04a3 "testfile")
at /usr/include/aarch64-linux-gnu/bits/stdio2.h:54
#9 create_link_remap (path=path@entry=0xffffc6ed2901 "/zdtm/static/unlink_regular00.test/subdir/testfile", len=len@entry=60, lfd=lfd@entry=20,
idp=idp@entry=0xffffc6ed14ec, nsid=nsid@entry=0xaaaada2bac00, parms=parms@entry=0xffffc6ed2808, fallback=0xaaaacff4c6c0 <dump_linked_remap+96>,
fallback@entry=0xffffc6ed2797) at criu/files-reg.c:1164
#10 0x0000aaaacff4c6c0 in dump_linked_remap (path=path@entry=0xffffc6ed2901 "/zdtm/static/unlink_regular00.test/subdir/testfile", len=len@entry=60,
parms=parms@entry=0xffffc6ed2808, lfd=lfd@entry=20, id=id@entry=12, nsid=nsid@entry=0xaaaada2bac00, fallback=fallback@entry=0xffffc6ed2797)
at criu/files-reg.c:1198
#11 0x0000aaaacff4d8b0 in check_path_remap (nsid=0xaaaada2bac00, id=12, lfd=20, parms=0xffffc6ed2808, link=<optimized out>) at criu/files-reg.c:1426
#12 dump_one_reg_file (lfd=20, id=12, p=0xffffc6ed2808) at criu/files-reg.c:1827
#13 0x0000aaaacff51078 in dump_one_file (pid=<optimized out>, fd=4, lfd=20, opts=opts@entry=0xaaaada2ba2c0, ctl=ctl@entry=0xaaaada2c4d50,
e=e@entry=0xffffc6ed39c8, dfds=dfds@entry=0xaaaada2c3d40) at criu/files.c:581
#14 0x0000aaaacff5176c in dump_task_files_seized (ctl=ctl@entry=0xaaaada2c4d50, item=item@entry=0xaaaada2b8f80, dfds=dfds@entry=0xaaaada2c3d40)
at criu/files.c:657
#15 0x0000aaaacff3d3c0 in dump_one_task (parent_ie=0x0, item=0xaaaada2b8f80) at criu/cr-dump.c:1679
#16 cr_dump_tasks (pid=<optimized out>) at criu/cr-dump.c:2224
#17 0x0000aaaacff163a0 in main (argc=<optimized out>, argv=0xffffc6ed40e8, envp=<optimized out>) at criu/crtools.c:293
This line is the problem:
snprintf(tmp + 1, sizeof(link_name) - (size_t)(tmp - link_name - 1), "link_remap.%d", rfe.id);
The problem was that the `-1` was inside the parentheses instead of
outside them. As a result, the destination size was increased by 1 instead
of being decreased by 1, which triggered the buffer overflow detection.
Signed-off-by: Adrian Reber <areber@redhat.com>
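A minimal sketch of the size arithmetic this commit message describes, assuming a hypothetical helper `make_link_remap_name()` that takes the buffer size as a parameter (in criu/files-reg.c `link_name` is an array, so `sizeof(link_name)` is used there instead). It is an illustration of the corrected expression, not the committed code.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: replace the last path component of `link_name`
 * with "link_remap.<id>". The destination starts at tmp + 1, so the space
 * left is size - (tmp - link_name) - 1; the "- 1" belongs outside the
 * parentheses. Returns snprintf()'s result, or -1 if there is no '/'.
 */
static int make_link_remap_name(char *link_name, size_t size, int id)
{
	char *tmp = strrchr(link_name, '/');

	if (!tmp)
		return -1;
	return snprintf(tmp + 1, size - (size_t)(tmp - link_name) - 1,
			"link_remap.%d", id);
}
```

With a 16-byte buffer holding "/a/b", the computed size is 16 - 2 - 1 = 13, exactly the bytes from tmp + 1 to the end of the buffer, so "link_remap.7" and its terminator fit with no room to spare.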
A friendly reminder that this PR had no activity for 30 days.