drm/amdgpu: fix non-x86 GPU VCRAT parsing #205

fitzsim · 2026-01-18T20:51:15Z

On ppc64le, an IO link entry in the GPU VCRAT causes a parsing failure which results in the device not being added to the kfd topology:

amdgpu [...]: amdgpu: Error parsing VCRAT
kfd kfd: amdgpu: Error adding device to topology
kfd kfd: amdgpu: Error initializing KFD node
kfd kfd: amdgpu: device [...]:[...] NOT added due to errors

In kfd_create_vcrat_image_gpu, skip IO link entry creation on non-x86 platforms, matching kfd_create_vcrat_image_cpu's behaviour.

With this change, the device is successfully added to the kfd topology on ppc64le.
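
The guard described above can be pictured with a small userspace sketch. It is not the patch itself: the struct, the function `build_gpu_vcrat_sketch`, and the use of the compiler-defined `__x86_64__` macro as a stand-in for whichever kernel config symbol the patch tests are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for the kernel config test; not the real symbol. */
#if defined(__x86_64__)
#define SKETCH_HAVE_IOLINK 1
#else
#define SKETCH_HAVE_IOLINK 0
#endif

/* Hypothetical summary of which subtype entries a GPU VCRAT would carry. */
struct vcrat_sketch {
	int cu_entries;
	int mem_entries;
	int iolink_entries;
};

/* Fill in the entry counts, skipping the IO link entry off x86. */
void build_gpu_vcrat_sketch(struct vcrat_sketch *out)
{
	assert(out != NULL);

	out->cu_entries = 1;
	out->mem_entries = 1;
#if SKETCH_HAVE_IOLINK
	out->iolink_entries = 1;
#else
	out->iolink_entries = 0;
	printf("Not creating IO link entry on non x86 platform\n");
#endif
}
```

On a non-x86 build the sketch leaves the IO link count at zero, so a later parsing step never sees an IO link entry to resolve.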

(Perhaps a more proper solution would be to add IO link support to ppc64le. I don’t know if there is an equivalent POWER9 capability, hardware-wise. For my purposes, I have not yet needed an IO link.)

Motivation

This patch was required to get an AMD Radeon AI Pro R9700 GPU working on a POWER9 machine.

Technical Details

rocminfo did not find the GPU's entry in the kfd sysfs tree. dmesg showed:

[...] amdgpu 0033:03:00.0: amdgpu: Error parsing VCRAT
[...] kfd kfd: amdgpu: Error adding device to topology
[...] kfd kfd: amdgpu: Error initializing KFD node
[...] kfd kfd: amdgpu: device 1002:7551 NOT added due to errors

Test Plan

Reboot with this patch applied and try rocminfo.

Test Result

dmesg reports:

[...] amdgpu: Not creating IO link entry on non x86 platform
[...] amdgpu: Virtual CRAT table created for GPU
[...] amdgpu: Topology: Add dGPU node [0x7551:0x1002]
[...] kfd kfd: amdgpu: added device 1002:7551

rocminfo displays the GPU's information, and ROCm works.

(Some userspace porting of TheRock subprojects was required to get ROCm working on ppc64le.)

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Submitting to the master branch (not develop per the guidelines) because the pull request tool defaulted to master. I hope that's correct.

Some distribution kernels include drm/drm_suballoc.h header but have CONFIG_DRM_SUBALLOC_HELPER disabled. This causes build failures as the suballoc symbols are unavailable from the main DRM subsystem. Move the #endif directive for HAVE_DRM_DRM_SUBALLOC_H to properly expose KCL suballoc fallback implementations when the main DRM suballoc helper is not available. This ensures amdgpu can use kcl_drm_suballoc functions as fallback and resolves DKMS build failures on such configurations. Signed-off-by: Perry Yuan <perry.yuan@amd.com> Reviewed-by: Bob Zhou <Bob.Zhou@amd.com> (cherry picked from commit f72c017) Change-Id: I67fd4b4ba50674791d1e07dd7fc1a900886af253

BIT_ULL(n) already sets the nth bit, so remove the explicit shift and pass the bit position directly. Fixes: e30383fce4cb ("drm/amdgpu: fix shift-out-of-bounds in amdgpu_debugfs_jpeg_sched_mask_set") Signed-off-by: Sathishkumar S <sathishkumar.sundararaju@amd.com> Reviewed-by: Leo Liu <leo.liu@amd.com>
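
As a minimal illustration of the point about `BIT_ULL`, the sketch below defines the macro locally so that it is equivalent to the kernel's (`1ULL << n`) and shows the pattern the message appears to describe, where an already-shifted value is passed and the shift happens twice. The helper names are hypothetical and the code is not taken from the fixed function.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-in equivalent to the kernel macro: sets bit number nr. */
#define BIT_ULL(nr) (1ULL << (nr))

/* Correct: pass the bit position itself. */
uint64_t mask_for_bit(unsigned int pos)
{
	return BIT_ULL(pos);
}

/*
 * Buggy pattern: the argument is already shifted, so the bit lands at
 * position (1 << pos) instead of pos. Only call this with pos <= 5;
 * larger values would shift by 64 or more, which is undefined.
 */
uint64_t mask_for_bit_double_shift(unsigned int pos)
{
	return BIT_ULL(1U << pos);
}
```

For position 3, `mask_for_bit` yields 0x8 while the double-shift variant yields 0x100, and for larger positions the second shift runs past the width of the type.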

Currently SRIOV runtime will use kiq to write HDP_MEM_FLUSH_CNTL for hdp flush. This register needs to be written from the CPU for nbif to be aware of it, otherwise the flush will not work. Implement amdgpu_kiq_hdp_flush and use kiq to do gpu hdp flush during sriov runtime. v2: - fall back to amdgpu_asic_flush_hdp when amdgpu_kiq_hdp_flush fails - add function amdgpu_mes_hdp_flush v3: - changed returned error Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>

Otherwise accessing them can cause a crash. Signed-off-by: Christian König <christian.koenig@amd.com> Tested-by: Mangesh Gadre <Mangesh.Gadre@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com> (cherry picked from commit 02fdc6b) Change-Id: I1f8d0e98d509f7154ad5e60cef4269777e08dd62

Certain kernels may have integrated peer_mem into their base Linux module, while others may have it as a separate module. Link against the ofa_kernel as a last resort in case that's the only option available. Some older NICs can't support dma_buf, so this is required in order to get PeerDirect to work on NICs that don't support dma_buf Signed-off-by: Kent Russell <kent.russell@amd.com> Reviewed-by: Perry Yuan <perry.yuan@amd.com>

…h _V1 suffix - This change prepares the later patches to intro _v2 suffix to SRIOV critical regions Signed-off-by: Ellen Pan <yunru.pan@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 2a3fd0a20754e3cb8d68daa7449d378b4482ca25) Change-Id: Ic0c69049913b47e72a019285b442931eb981a44b

1. Added enum amd_sriov_crit_region_version to support multiple versions 2. Added logic in SRIOV mailbox to recognize crit_region version during req_gpu_init_data Signed-off-by: Ellen Pan <yunru.pan@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 176de1520d0d050c755a7d89771d0d52403de49c) Change-Id: I736feee621a5c4a0fbc5244181791236dfc7b9d6

1. Introduced amdgpu_virt_init_critical_region during VF init. - VFs use init_data_header_offset and init_data_header_size_kb transmitted via PF2VF mailbox to fetch the offset of critical regions' offsets/sizes in VRAM and save to adev->virt.crit_region_offsets and adev->virt.crit_region_sizes_kb. Signed-off-by: Ellen Pan <yunru.pan@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>

- During guest driver init, as VFs receive PF msg to init dynamic critical region(v2), VFs reuse fw_vram_usage_* from ttm to store critical region tables in a 5MB chunk. Signed-off-by: Ellen Pan <yunru.pan@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com>

…t_region offsets 1. Added VF logic in amdgpu_virt to init IP discovery using the offsets from dynamic(v2) critical regions; 2. Added VF logic in amdgpu_virt to init bios image using the offsets from dynamic(v2) critical regions; Signed-off-by: Ellen Pan <yunru.pan@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Change-Id: I35cc87edffdecf3715646b268cceccbe2d816607

…c crit_region offsets 1. Added VF logic to init data exchange region using the offsets from dynamic(v2) critical regions; Signed-off-by: Ellen Pan <yunru.pan@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com>

When passed around internally the upper 8 bits of power limit include the limit type. This is non-obvious without digging into the nuances of each function. Instead pass the limit type as an argument to all applicable layers. Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
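
The encoding described here can be sketched as follows. The 8-bit/24-bit split follows the message's "upper 8 bits"; the function names, the struct, and the example values are hypothetical and are not the driver's.

```c
#include <assert.h>
#include <stdint.h>

#define LIMIT_TYPE_SHIFT 24
#define LIMIT_VALUE_MASK ((1U << LIMIT_TYPE_SHIFT) - 1)

/* Old style: the limit type travels in the upper 8 bits of the value. */
uint32_t pack_limit(uint32_t limit_type, uint32_t limit)
{
	return (limit_type << LIMIT_TYPE_SHIFT) | (limit & LIMIT_VALUE_MASK);
}

/* Every callee has to remember to split the word again. */
void unpack_limit(uint32_t packed, uint32_t *limit_type, uint32_t *limit)
{
	*limit_type = packed >> LIMIT_TYPE_SHIFT;
	*limit = packed & LIMIT_VALUE_MASK;
}

/* New style: the type is an explicit field, the value is passed as-is. */
struct limit_request {
	uint32_t limit_type;
	uint32_t limit;
};

struct limit_request make_limit_request(uint32_t limit_type, uint32_t limit)
{
	struct limit_request req = { limit_type, limit };
	return req;
}
```

Passing the type explicitly removes the need for each layer to know about the packing, which is the non-obvious part the message calls out.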

User requested power limits and clock settings are already restored as part of smu_restore_dpm_user_profile(). It's unnecessary to call the same restore as part of smu_resume(). Revert the following commits to drop that extra restore: commit ed4efe426a49 ("drm/amd: Restore cached power limit during resume") commit 796ff8a7e01b ("drm/amd: Restore cached manual clock settings during resume") commit f9b80514a722 ("drm/amd: Only restore cached manual clock settings in restore if OD enabled") Suggested-by: Lijo Lazar <Lijo.Lazar@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>

…faces
Certain multi-GPU configurations (especially GFX12) may hit data corruption when a DCC-compressed VRAM surface is shared across GPUs using peer-to-peer (P2P) DMA transfers. Such surfaces rely on device-local metadata and cannot be safely accessed through a remote GPU’s page tables. Attempting to import a DCC-enabled surface through P2P leads to incorrect rendering or GPU faults.
This change disables P2P for DCC-enabled VRAM buffers that are contiguous and allocated on GFX12+ hardware. In these cases, the importer falls back to the standard system-memory path, avoiding invalid access to compressed surfaces.
Future work could consider optional migration (VRAM→System→VRAM) if a performance regression is observed when `attach->peer2peer = false`.
Tested on:
- Dual RX 9700 XT (Navi4x) setup
- GNOME and Wayland compositor scenarios
- Confirmed no corruption after disabling P2P under these conditions
Suggested-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Change-Id: Idd16caa2a2d8021c4642b065a20a16327ae7312f

If the process is killed, the vm entity is stopped. Submitting a pt update job will then trigger the error message "*ERROR* Trying to push to a killed entity", and the job will not execute. Suggested-by: Christian König <christian.koenig@amd.com> Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com>

PMFW will manage RAS eeprom data by itself, add new interface to read eeprom data via PMFW, we can read part of records by setting index. v2: use IPID parse interface. pa is not used and set it to a fixed value. v3: optimize the null pointer check for IPID parse interface. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Yang Wang <kevinyang.wang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>

Check if bad page threshold is reached and take actions accordingly. v2: remove rma message sent to smu when pmfw manages eeprom. v3: add null pointer check for con. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>

…_num The busy status returned by ras_eeprom_update_record_num may not be an error, increase timeout to exclude false busy status. Also add more comments to make the code readable. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>

Query the sub-revision field in the IP Discovery table for the VFs to obtain their revision ID. Meanwhile, read the revision ID from the strap register for the PF. Signed-off-by: Perry Yuan <perry.yuan@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>

Introduce a Peak Tops Limiter (PTL) driver that dynamically caps engine frequency to ensure delivered TOPS never exceeds a defined TOPS_limit. This initial implementation provides core data structures and kernel-space interfaces (set/get, enable/disable) to manage PTL state. PTL performs a firmware handshake to initialize its state and update predefined format types. It supports updating these format types at runtime while user-space tools automatically switch PTL state, and also allows explicitly switching PTL state via newly added commands. Signed-off-by: Perry Yuan <perry.yuan@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com>

Introduce hardware detection, runtime state tracking and a kgd->ptl_ctrl() callback to enable/disable/query PTL via the PSP performance-monitor interface (commands 0xA0000000/1). The driver now exposes PTL capability to KFD and keeps the software state in sync with the hardware. Signed-off-by: Perry Yuan <perry.yuan@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com>

Add kgd->ptl_ctrl() callback so KFD can query/enable/disable PTL state through the PSP performance monitor interface. Signed-off-by: Perry Yuan <perry.yuan@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com>

Combine PTL hardware control with the existing PMC device locking mechanism to ensure proper synchronization and hardware state management during profiling operations. Signed-off-by: Perry Yuan <perry.yuan@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com>

Revert submission 1291049 Reason for revert: RM needs merge this Reverted Changes: I63bec83a0:drm/amdgpu: integrate PTL control with PMC device ... I31b7df2a3:drm/amdgpu: add PTL enable/query gfx control suppo... I6682872fe:drm/amdkfd: add kgd control interface for ptl Ide92c0668:drm/amdgpu: add new performance monitor PSP interf... I2d019792a:drm/amdgpu: add psp interfaces for peak tops limit... Change-Id: Ic3d2cd5a72afcd04b3cc0d2c3d9d898cf708552e

Revert submission 1291049 Reason for revert: RM needs merge this Reverted Changes: I63bec83a0:drm/amdgpu: integrate PTL control with PMC device ... I31b7df2a3:drm/amdgpu: add PTL enable/query gfx control suppo... I6682872fe:drm/amdkfd: add kgd control interface for ptl Ide92c0668:drm/amdgpu: add new performance monitor PSP interf... I2d019792a:drm/amdgpu: add psp interfaces for peak tops limit... Change-Id: I7a96c6b934864773cbe08995280100a76451f104

Revert submission 1291049 Reason for revert: RM needs merge this Reverted Changes: I63bec83a0:drm/amdgpu: integrate PTL control with PMC device ... I31b7df2a3:drm/amdgpu: add PTL enable/query gfx control suppo... I6682872fe:drm/amdkfd: add kgd control interface for ptl Ide92c0668:drm/amdgpu: add new performance monitor PSP interf... I2d019792a:drm/amdgpu: add psp interfaces for peak tops limit... Change-Id: I8189f2da53b916b3fc874ac686742c8737949a05

Revert submission 1291049 Reason for revert: RM needs merge this Reverted Changes: I63bec83a0:drm/amdgpu: integrate PTL control with PMC device ... I31b7df2a3:drm/amdgpu: add PTL enable/query gfx control suppo... I6682872fe:drm/amdkfd: add kgd control interface for ptl Ide92c0668:drm/amdgpu: add new performance monitor PSP interf... I2d019792a:drm/amdgpu: add psp interfaces for peak tops limit... Change-Id: Id08776d4d52ff9427791c54bc7cf45d7960661cb

Revert submission 1291049 Reason for revert: RM needs merge this Reverted Changes: I63bec83a0:drm/amdgpu: integrate PTL control with PMC device ... I31b7df2a3:drm/amdgpu: add PTL enable/query gfx control suppo... I6682872fe:drm/amdkfd: add kgd control interface for ptl Ide92c0668:drm/amdgpu: add new performance monitor PSP interf... I2d019792a:drm/amdgpu: add psp interfaces for peak tops limit... Change-Id: I4df792d6cf22c16ae0272673894dfc4ca28a859f

Revert submission 1291049 Reason for revert: <not ready for merge, let`s revert this> Reverted Changes: I63bec83a0:drm/amdgpu: integrate PTL control with PMC device ... I6682872fe:drm/amdkfd: add kgd control interface for ptl I31b7df2a3:drm/amdgpu: add PTL enable/query gfx control suppo... I2d019792a:drm/amdgpu: add psp interfaces for peak tops limit... Ide92c0668:drm/amdgpu: add new performance monitor PSP interf... Change-Id: I94ea76ebec3c0b78e78a7605e5032ac449dd96dd

…faces
Certain multi-GPU configurations (especially GFX12) may hit data corruption when a DCC-compressed VRAM surface is shared across GPUs using peer-to-peer (P2P) DMA transfers. Such surfaces rely on device-local metadata and cannot be safely accessed through a remote GPU’s page tables. Attempting to import a DCC-enabled surface through P2P leads to incorrect rendering or GPU faults.
This change disables P2P for DCC-enabled VRAM buffers that are contiguous and allocated on GFX12+ hardware. In these cases, the importer falls back to the standard system-memory path, avoiding invalid access to compressed surfaces.
Future work could consider optional migration (VRAM→System→VRAM) if a performance regression is observed when `attach->peer2peer = false`.
Tested on:
- Dual RX 9700 XT (Navi4x) setup
- GNOME and Wayland compositor scenarios
- Confirmed no corruption after disabling P2P under these conditions
Suggested-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Change-Id: Ia0dd9eb39ab7a8c2226717307f6cc81083ce8773

If the process is killed, the vm entity is stopped. Submitting a pt update job will then trigger the error message "*ERROR* Trying to push to a killed entity", and the job will not execute. Suggested-by: Christian König <christian.koenig@amd.com> Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com>

Device manager releases device-specific resources when a driver disconnects from a device, so devm_memunmap_pages is redundant. It causes the warning trace below when the module is removed:
Call Trace:
 <TASK>
 dump_stack_lvl+0x76/0xa0
 dump_stack+0x10/0x20
 bad_page+0x76/0x120
 free_page_is_bad_report+0x86/0xa0
 free_unref_page_prepare+0x279/0x3d0
 free_unref_page+0x34/0x180
 __free_pages+0x112/0x130
 free_pages+0x3d/0x60
 free_pagetable+0xc4/0xe0
 remove_pud_table+0x1c3/0x270
 remove_p4d_table+0xf8/0x1b0
 remove_pagetable+0xd7/0x160
 arch_remove_memory+0x3d/0x50
 memunmap_pages+0xbe/0x300
 devm_memremap_pages_release+0xe/0x20
 devm_action_release+0x15/0x30
 release_nodes+0x45/0xd0
 devres_release_all+0x97/0xe0
 device_unbind_cleanup+0x12/0x80
 device_release_driver_internal+0x230/0x270
 driver_detach+0x4a/0xa0
 bus_remove_driver+0x83/0x110
 driver_unregister+0x2f/0x60
 pci_unregister_driver+0x40/0x90
 amdgpu_exit+0x15/0x3b [amdgpu]
 __do_sys_delete_module.constprop.0+0x1a3/0x300
 __x64_sys_delete_module+0x12/0x20
 x64_sys_call+0x14e9/0x24b0
 do_syscall_64+0x81/0x170
 entry_SYSCALL_64_after_hwframe+0x78/0x80
Fixes: b70e506 ("drm/amdkfd: Add AMD Infinity Storage (AIS) support")
Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>

Commit 502360c ("drm/amdkfd: Fix AIS deinit warnings") removed devm_memunmap_pages from kfd_ais_deinit(). kfd_ais_init() gets called again when compute or memory partitions are changed. Don't remap P2P range again if P2P range has already been initialized. Fixes: 502360c ("drm/amdkfd: Fix AIS deinit warnings") Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>

fitzsim · 2026-01-19T12:37:35Z

It might be safer to make the ifdef I added ppc64-specific since I don't have other non-x86 platforms to test.

I should also show the pr_debug messages I captured around the failure:

amdgpu: Virtual CRAT table created for GPU
amdgpu: Parsing CRAT table with 1 nodes
amdgpu: Found CU entry in CRAT table with proximity_domain=2 caps=0
amdgpu: CU GPU: id_base=-2147479552
amdgpu: Found memory entry in CRAT table with proximity_domain=2
amdgpu: Found IO link entry in CRAT table with id_from=2, id_to 8
amdgpu 0033:03:00.0: amdgpu: Error parsing VCRAT
kfd kfd: amdgpu: Error adding device to topology
amdgpu: Free mem_obj = 00000000cbed9333, range_start = 0, range_end = 0
kfd kfd: amdgpu: Error initializing KFD node

I suspect id_to being 8 results in this code returning:

		to_dev = kfd_topology_device_by_proximity_domain_no_lock(id_to);
		if (!to_dev)
			return -ENODEV;

(I could confirm that with extra instrumentation if necessary.)
I have CONFIG_NUMA enabled and I am not using xgmi, so I suspect the id_to is coming from numa_node here:

#ifdef CONFIG_NUMA
	if (kdev->adev->pdev->dev.numa_node == NUMA_NO_NODE)
		sub_type_hdr->proximity_domain_to = 0;
	else
		sub_type_hdr->proximity_domain_to = kdev->adev->pdev->dev.numa_node;
#else
	sub_type_hdr->proximity_domain_to = 0;
#endif
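
To make the suspected failure path concrete, here is a small userspace sketch of a proximity-domain lookup. It assumes, purely for illustration, that the topology holds devices with proximity domains 0, 1 and 2 while the IO link's id_to is 8, as in the log above. The array contents and the helper names `find_by_proximity_domain` and `parse_iolink_sketch` are hypothetical and do not come from the driver.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical topology: the proximity domains that are registered (assumed). */
static const uint32_t known_domains[] = { 0, 1, 2 };

/* Return a pointer to the matching domain entry, or NULL if none exists. */
const uint32_t *find_by_proximity_domain(uint32_t id)
{
	size_t i;

	for (i = 0; i < sizeof(known_domains) / sizeof(known_domains[0]); i++) {
		if (known_domains[i] == id)
			return &known_domains[i];
	}
	return NULL;
}

/* Mirror the quoted check: a missing endpoint node fails the parse. */
int parse_iolink_sketch(uint32_t id_from, uint32_t id_to)
{
	if (!find_by_proximity_domain(id_from))
		return -ENODEV;
	if (!find_by_proximity_domain(id_to))
		return -ENODEV;
	return 0;
}
```

With these assumed values, `parse_iolink_sketch(2, 8)` returns `-ENODEV` while a destination of 0 succeeds, which would be consistent with the suspicion above if the NUMA node number does not correspond to a registered proximity domain.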

All that said, I am new to these concepts; the patch is a workaround that seems to work for me; I'd appreciate guidance to make it into a correct fix.

Jie1zhang and others added 30 commits October 7, 2025 14:00

drm/amd/pm: Disable VCN queue reset on SMU v13.0.6 due to regression

858f5b2

Disable VCN reset capability for the program 4 as it's causing regressions. Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Jesse Zhang <jesse.zhang@amd.com>

drm/amdgpu: Skip SDMA suspend during mode-2 reset

6ff5abe

For SDMA IP versions >= v4.4.2, firmware will take care of quiescing SDMA before mode-2 reset. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Asad Kamal <asad.kamal@amd.com>

drm/amdgpu: Add kiq hdp flush callbacks

e85ca18

Add kiq hdp flush callbacks for gfx ips to support gpu hdp flush when no ring is present. Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>

amdkfd: Do nto wait for queue op response during reset

7c33556

This patch adds the condition to not wait for the queue response for unmap, if the gpu is in reset. Signed-off-by: Ahmad Rehman <Ahmad.Rehman@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>

drm/amd: Remove second call to set_power_limit()

3d19f4a

The min/max limits only make sense for default PPT. Restructure smu_set_power_limit() to only use them in that case. Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>

drm/amd: Save and restore all limit types

3dcd53b

Vangogh has separate limits for default PPT and fast PPT. Add infrastructure to save both of these limits and restore both of them. Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>

Revert "drm/amdgpu: disable peer-to-peer access for DCC-enabled GC12 …

895f8d4

…VRAM surfaces" This reverts commit f376e7b. Reason for revert: We need to discuss this with the ROCm team to decide whether we want this. Change-Id: If7d309ae6388cf87c4d76e80f404671e9e90a137

Revert "drm/amdkfd: Don't clear PT after process killed"

b518f60

This reverts commit 234b4a9. Reason for revert: test required Change-Id: I3224e51ab956b880902f5aef86d32f0a9e731d7c

drm/amd/pm: add new message definitions for pmfw eeprom interface

7b4bec6

Add new message definitions for pmfw eeprom interface Signed-off-by: Gangliang Xie <ganglxie@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com>

drm/amd/pm: implement ras_smu_drv interface for smu v13.0.12

6547b58

implement ras_smu_drv interface for smu v13.0.12 Signed-off-by: Gangliang Xie <ganglxie@amd.com> Signed-off-by: Yang Wang <kevinyang.wang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com>

drm/amd/pm: add smu ras driver framework

cd47484

add functions to get smu ras driver Signed-off-by: Gangliang Xie <ganglxie@amd.com> Signed-off-by: Yang Wang <kevinyang.wang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com>

drm/amdgpu: add function to check if pmfw eeprom is supported

2d9cc37

add function to check if pmfw is supported, skip eeprom check and recover when pmfw eeprom is supported Signed-off-by: Gangliang Xie <ganglxie@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com>

drm/amdgpu: add wrapper functions for pmfw eeprom interface

d2a666a

add wrapper functions for pmfw eeprom interface, for these interfaces to be easily and safely called Signed-off-by: Gangliang Xie <ganglxie@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com>

drm/amdgpu: adapt reset function for pmfw eeprom

20dc791

adapt reset function for pmfw eeprom Signed-off-by: Gangliang Xie <ganglxie@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com>

drm/amdgpu: add initialization function for pmfw eeprom

17d5be1

add initialization function for pmfw eeprom Signed-off-by: Gangliang Xie <ganglxie@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com>

Tao Zhou and others added 28 commits November 8, 2025 00:10

drm/amdgpu: support to load RAS bad pages from PMFW

d8d3fac

PMFW manages eeprom bad page records, update bad page loading accordingly. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>

drm/amdgpu: skip writing eeprom when PMFW manages RAS data

202da73

Only update bad page number in legacy eeprom write path. v2: add null pointer check for con. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>

drm/amdgpu: load RAS bad page from PMFW in page retirement

8b72f51

In the legacy path, the bad page is queried from MCA registers; switch to getting it from PMFW when PMFW manages eeprom data. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>

drm/amd/pm: remove unnecessary prints for smu busy

be00768

smu busy is a normal case when calling SMU_MSG_GetBadPageCount, so there is no need to print the error status each time. Instead, only print the error status when the timeout given by the user is reached. Signed-off-by: Gangliang Xie <ganglxie@amd.com>

drm/amdgpu: get RAS bad page address from MCA address

5d38c29

Instead of from physical address. v2: add comment to make the code more readable Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>

drm/amdgpu: try for more times if RAS bad page number is not updated

c7775b9

Updating RAS info in PMFW takes time, so wait for it. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>

drm/amdgpu: add new performance monitor PSP interfaces

b64b87c

Introduce new psp interfaces and structures for performance monitoring hardware control. Signed-off-by: Perry Yuan <perry.yuan@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdkfd: Disable AIS on virtualized environment

6fbeb59

AIS is not supported on virtualization yet. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>

drm/amdgpu: Add sriov vf check for VCN per queue reset support.

01cee31

Add SRIOV check when setting VCN ring's supported reset mask. Signed-off-by: Shikang Fan <shikang.fan@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>

kentrussell force-pushed the master branch from 01cee31 to 33970e1 on January 21, 2026 23:02
