plugin/amdgpu: Support for open dmabuf handles #1

Open
cfreeamd wants to merge 7 commits into fdavid-amd:dmabuf-post from cfreeamd:dmabuf-post

cfreeamd commented May 1, 2025

Modifications to handle dump/restore of open dmabuf file handles.

fdavid-amd and others added 7 commits March 10, 2025 15:44
amdgpu represents allocated device memory as a memory mapping
of the device file. This is a non-standard VMA that must
be handled by the plugin, not the normal VMA code.

Ignore all VMAs on device files.

Signed-off-by: David Francis <David.Francis@amd.com>
During restore, the amdgpu plugin must hold onto fds for
dmabufs as they are transferred from one process to another.

These fds must be chosen not to conflict with other fds used by
restore.

Extend the service_fd system, which already finds unused fds,
to allow request of an unused fd.

Signed-off-by: David Francis <David.Francis@amd.com>
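
To illustrate what "finding an unused fd" involves, here is a minimal sketch. It is not CRIU's service_fd code: `find_unused_fd` and its `start`/`limit` parameters are hypothetical names, and the probe relies only on the standard behaviour that `fcntl(fd, F_GETFD)` fails with `EBADF` when the descriptor is not open.

```c
#include <errno.h>
#include <fcntl.h>

/*
 * Hypothetical helper (not part of CRIU): return the lowest descriptor
 * number in [start, limit) that is not open in this process, or -1 if
 * every descriptor in that range is in use.
 */
static int find_unused_fd(int start, int limit)
{
	for (int fd = start; fd < limit; fd++) {
		/* F_GETFD fails with EBADF when fd is not an open descriptor */
		if (fcntl(fd, F_GETFD) == -1 && errno == EBADF)
			return fd;
	}
	return -1;
}
```

A caller would typically `dup2()` the real descriptor onto the returned number right away, since nothing reserves the slot between the probe and its use.
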
amdgpu dmabuf CRIU requires the ability of the amdgpu plugin
to retry.

Change files_ext.c to read a response of 1 from a plugin restore
function to mean retry.

Signed-off-by: David Francis <David.Francis@amd.com>
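
The retry convention can be sketched as a loop that sweeps over the pending files until none asks for a retry. This is only an illustration of the "return 1 means retry" idea, not the code in files_ext.c: `restore_fn`, `restore_all` and `demo_restore` are hypothetical names, and bailing out when a full sweep makes no progress is an assumption of this sketch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback type: 0 = restored, 1 = retry later, < 0 = error. */
typedef int (*restore_fn)(int id, void *ctx);

/*
 * Sweep over the files until all are restored. A full sweep that makes
 * no progress is reported as an error, because retrying again could
 * never succeed.
 */
static int restore_all(const int *ids, bool *done, size_t n, restore_fn fn, void *ctx)
{
	size_t remaining = n;

	while (remaining > 0) {
		size_t progressed = 0;

		for (size_t i = 0; i < n; i++) {
			int ret;

			if (done[i])
				continue;
			ret = fn(ids[i], ctx);
			if (ret < 0)
				return ret;	/* hard failure */
			if (ret == 1)
				continue;	/* dependency not ready, try again later */
			done[i] = true;
			progressed++;
		}
		if (progressed == 0)
			return -1;		/* every pending file asked to retry */
		remaining -= progressed;
	}
	return 0;
}

/* Demo callback: file 2 imports a buffer that file 1 must export first. */
static int demo_restore(int id, void *ctx)
{
	bool *exported = ctx;

	if (id == 1) {
		*exported = true;
		return 0;
	}
	return *exported ? 0 : 1;
}
```

With the files ordered as `{2, 1}`, the first sweep defers file 2 and restores file 1, and the second sweep completes file 2.
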
For amdgpu plugin to call the new amdgpu drm CRIU ioctls, it needs
the amdgpu drm header file, copied from the kernel's includes.

Signed-off-by: David Francis <David.Francis@amd.com>
Buffer objects held by the amdgpu drm driver are restored with the new
DRM_IOCTL_AMDGPU_CRIU_OP ioctl. Handling for this ioctl is in
amdgpu_plugin_drm.h

Handling of imported buffer objects may require dmabuf fds to be
transferred between processes. These transfers occur over sockets
created by the amdgpu plugin. There are two new plugin callbacks:
COLLECT_FILE to identify the processes that have amdgpu files and so
need a socket, and RESUME_DEVICES_EARLY to create the sockets before
any files are restored.

Before each amdgpu file restore, check the socket and record the
received dmabuf_fds.
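
For readers unfamiliar with passing file descriptors between processes, the usual mechanism on Linux is an `SCM_RIGHTS` control message on a Unix domain socket. The sketch below shows that general technique only. `send_fd` and `recv_fd` are hypothetical helpers, and whether the plugin structures its socket code this way is an assumption.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Send one descriptor over a connected Unix socket. Returns 0 or -1. */
static int send_fd(int sock, int fd)
{
	char byte = 0;	/* at least one byte of ordinary data must be sent */
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;	/* forces correct alignment of buf */
	} u;
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor. Returns the new fd in this process, or -1. */
static int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;
	int fd;

	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

The receiver gets a new descriptor number that refers to the same open file, which is why the received numbers have to be recorded rather than assumed.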

During checkpoint, track shared buffer objects, so that buffer objects
that are shared across processes can be identified.

During restore, track which buffer objects have been restored. Retry
restore of a drm file if a buffer object is imported and the
original has not been exported yet. Skip buffer objects that have
already been completed or cannot be completed in the current restore.

So that drm code can use sdma_copy_bo, that function no longer
requires kfd bo structs.

Update the protobuf messages with new amdgpu drm information.

Signed-off-by: David Francis <David.Francis@amd.com>
Previously, amdgpu plugin was determining when to call its
UNPAUSE ioctl by counting the files that have been restored.

This was not reliable: there may be more or fewer device files
than expected, and other processes may still be checkpointing
when unpause is called.

Add a new plugin callback DUMP_DEVICE_LATE which is called after
files are finished checkpointing for all processes.

Signed-off-by: David Francis <David.Francis@amd.com>
Modifications to handle dump/restore of open dmabuf file
handles.
fdavid-amd force-pushed the dmabuf-post branch 5 times, most recently from b057779 to 456978a (June 18, 2025 18:12)
fdavid-amd force-pushed the dmabuf-post branch 3 times, most recently from 45f796c to 878a313 (July 8, 2025 14:05)
fdavid-amd force-pushed the dmabuf-post branch 2 times, most recently from 3bace7e to 84fb396 (August 7, 2025 13:59)
fdavid-amd force-pushed the dmabuf-post branch 2 times, most recently from 1a5f191 to 87059a8 (September 22, 2025 19:17)
fdavid-amd force-pushed the dmabuf-post branch 2 times, most recently from 01792f8 to 2097325 (October 31, 2025 14:56)
fdavid-amd pushed a commit that referenced this pull request Feb 3, 2026
Running the zdtm/static/unlink_regular00 test on Ubuntu 24.04 on aarch64
results in the following error:

    # ./zdtm.py run -t zdtm/static/unlink_regular00 -k always
    userns is supported
    === Run 1/1 ================ zdtm/static/unlink_regular00
    ==================== Run zdtm/static/unlink_regular00 in ns ====================
    Skipping rtc at root
    Start test
    Test is SUID
    ./unlink_regular00 --pidfile=unlink_regular00.pid --outfile=unlink_regular00.out --dirname=unlink_regular00.test
    Run criu dump
    *** buffer overflow detected ***: terminated
    ############# Test zdtm/static/unlink_regular00 FAIL at CRIU dump ##############
    Test output: ================================

     <<< ================================
    Send the 9 signal to  47
    Wait for zdtm/static/unlink_regular00(47) to die for 0.100000
    ##################################### FAIL #####################################

According to the backtrace:

    #0  __pthread_kill_implementation (threadid=281473158467616, signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
    #1  0x0000ffff93477690 in __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
    #2  0x0000ffff9342cb3c in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
    #3  0x0000ffff93417e00 in __GI_abort () at ./stdlib/abort.c:79
    #4  0x0000ffff9346abf0 in __libc_message_impl (fmt=fmt@entry=0xffff93552a78 "*** %s ***: terminated\n") at ../sysdeps/posix/libc_fatal.c:132
    #5  0x0000ffff934e81a8 in __GI___fortify_fail (msg=msg@entry=0xffff93552a28 "buffer overflow detected") at ./debug/fortify_fail.c:24
    #6  0x0000ffff934e79e4 in __GI___chk_fail () at ./debug/chk_fail.c:28
    #7  0x0000ffff934e9070 in ___snprintf_chk (s=s@entry=0xffffc6ed04a3 "testfile", maxlen=maxlen@entry=4056, flag=flag@entry=2, slen=slen@entry=4053,
        format=format@entry=0xaaaacffe3888 "link_remap.%d") at ./debug/snprintf_chk.c:29
    #8  0x0000aaaacff4b8b8 in snprintf (__fmt=0xaaaacffe3888 "link_remap.%d", __n=4056, __s=0xffffc6ed04a3 "testfile")
        at /usr/include/aarch64-linux-gnu/bits/stdio2.h:54
    #9  create_link_remap (path=path@entry=0xffffc6ed2901 "/zdtm/static/unlink_regular00.test/subdir/testfile", len=len@entry=60, lfd=lfd@entry=20,
        idp=idp@entry=0xffffc6ed14ec, nsid=nsid@entry=0xaaaada2bac00, parms=parms@entry=0xffffc6ed2808, fallback=0xaaaacff4c6c0 <dump_linked_remap+96>,
        fallback@entry=0xffffc6ed2797) at criu/files-reg.c:1164
    #10 0x0000aaaacff4c6c0 in dump_linked_remap (path=path@entry=0xffffc6ed2901 "/zdtm/static/unlink_regular00.test/subdir/testfile", len=len@entry=60,
        parms=parms@entry=0xffffc6ed2808, lfd=lfd@entry=20, id=id@entry=12, nsid=nsid@entry=0xaaaada2bac00, fallback=fallback@entry=0xffffc6ed2797)
        at criu/files-reg.c:1198
    #11 0x0000aaaacff4d8b0 in check_path_remap (nsid=0xaaaada2bac00, id=12, lfd=20, parms=0xffffc6ed2808, link=<optimized out>) at criu/files-reg.c:1426
    #12 dump_one_reg_file (lfd=20, id=12, p=0xffffc6ed2808) at criu/files-reg.c:1827
    #13 0x0000aaaacff51078 in dump_one_file (pid=<optimized out>, fd=4, lfd=20, opts=opts@entry=0xaaaada2ba2c0, ctl=ctl@entry=0xaaaada2c4d50,
        e=e@entry=0xffffc6ed39c8, dfds=dfds@entry=0xaaaada2c3d40) at criu/files.c:581
    #14 0x0000aaaacff5176c in dump_task_files_seized (ctl=ctl@entry=0xaaaada2c4d50, item=item@entry=0xaaaada2b8f80, dfds=dfds@entry=0xaaaada2c3d40)
        at criu/files.c:657
    #15 0x0000aaaacff3d3c0 in dump_one_task (parent_ie=0x0, item=0xaaaada2b8f80) at criu/cr-dump.c:1679
    #16 cr_dump_tasks (pid=<optimized out>) at criu/cr-dump.c:2224
    #17 0x0000aaaacff163a0 in main (argc=<optimized out>, argv=0xffffc6ed40e8, envp=<optimized out>) at criu/crtools.c:293

This line is the problem:

    snprintf(tmp + 1, sizeof(link_name) - (size_t)(tmp - link_name - 1), "link_remap.%d", rfe.id);

The problem was that the `-1` was inside the parentheses instead of
outside them. As a result, the destination size was increased by 1
instead of being decreased by 1, which triggered the buffer overflow
detection.

Signed-off-by: Adrian Reber <areber@redhat.com>
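
The size arithmetic can be checked in isolation. When writing starts at `tmp + 1`, the space left in the buffer is `size - (tmp - buf) - 1`, whereas the buggy expression `size - (tmp - buf - 1)` is two bytes larger than that. The sketch below is a simplified stand-in, not the code in criu/files-reg.c: `space_after` and `make_link_remap_name` are hypothetical names, and only the placement of the `-1` mirrors the fix described above.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Bytes available in buf (of total size bufsize) when writing at tmp + 1. */
static size_t space_after(const char *buf, size_t bufsize, const char *tmp)
{
	return bufsize - (size_t)(tmp - buf) - 1;	/* the -1 sits outside */
}

/*
 * Hypothetical helper: copy path into link_name and replace its last
 * component with "link_remap.<id>". Returns 0 on success, -1 if the
 * path has no '/' or the result does not fit.
 */
static int make_link_remap_name(char *link_name, size_t size, const char *path, int id)
{
	char *tmp;
	size_t room;
	int n;

	if (strlen(path) >= size)
		return -1;
	strcpy(link_name, path);

	tmp = strrchr(link_name, '/');
	if (!tmp)
		return -1;

	room = space_after(link_name, size, tmp);
	n = snprintf(tmp + 1, room, "link_remap.%d", id);
	if (n < 0 || (size_t)n >= room)
		return -1;	/* output was truncated */
	return 0;
}
```

For a 64-byte buffer holding "/a/b/testfile", the last slash is at offset 4, so 59 bytes are available from `tmp + 1`, while the buggy expression would have claimed 61.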

github-actions bot commented Feb 4, 2026

A friendly reminder that this PR had no activity for 30 days.
