Nvidia GH200 / ARM64: SIGSEGV in XPU and Vector, but not Scalar modes #180

@cwpenhale

Hi Team,

I'm successfully using OpenMoonRay 1.7 in Gentoo on an AMD EPYC 9654 workstation (Ebuild here, patches to OMR here). I'm working on building out a render farm, and I hope to use the well-priced NVIDIA GH200 platform on VULTR as an on-demand Arras compute node.

I've built an OMR Docker image for ARM64 Neoverse V2, the CPU cores used in the NVIDIA GH200, with OptiX and CUDA (-march=armv9-a -mcpu=neoverse-v2 -mtune=neoverse-v2). The patches I've made against OMR's source are here and the ebuild, slightly modified from the previous example, is here.

I'm launching my docker container like so:
docker run -it -v /root:/root --runtime=nvidia --gpus=all -e NVIDIA_DRIVER_CAPABILITIES=graphics,compute,utility openmoonray-arm64

My bash environment looks like this:

NVIDIA_VISIBLE_DEVICES=all
REZ_MOONRAY_ROOT=/opt/openmoonray
PWD=/root/example_scenes/pbrt_scenes/country_kitchen
NVIDIA_DRIVER_CAPABILITIES=graphics,compute,utility
HOME=/root
LS_COLORS=<trimmed>
RDL2_DSO_PATH=/opt/openmoonray/rdl2dso
MOONRAY_ROOT=/opt/openmoonray
MOONRAY_CLASS_PATH=/opt/openmoonray/shader_json
TERM=xterm
SHLVL=1
ARRAS_SESSION_PATH=/opt/openmoonray/sessions
PATH=/opt/openmoonray/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
OLDPWD=/root/example_scenes
_=/usr/sbin/env

The processor looks like this in /proc/cpuinfo:

processor       : 71
BogoMIPS        : 2000.00
Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh bti
CPU implementer : 0x41
CPU architecture: 8
CPU variant     : 0x0
CPU part        : 0xd4f
CPU revision    : 0
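
When chasing SIMD-related crashes, it can help to confirm programmatically which flags the kernel reports. The following is a minimal sketch, not part of OpenMoonRay: `parseCpuFeatures` and `hasCpuFeature` are hypothetical helper names, and the input is assumed to be the value of the `Features` line shown above. Flags are matched as whole tokens, because `sve` is also a prefix of `sve2`, `sveaes`, and similar entries.

```cpp
#include <set>
#include <sstream>
#include <string>

// Hypothetical helper: split the value of a /proc/cpuinfo "Features"
// line into individual whitespace-separated flag tokens.
std::set<std::string> parseCpuFeatures(const std::string& featuresValue) {
    std::set<std::string> flags;
    std::istringstream in(featuresValue);
    std::string token;
    while (in >> token) {
        flags.insert(token);
    }
    return flags;
}

// Hypothetical helper: whole-token lookup, so "sve" does not
// accidentally match "sve2" or "sveaes".
bool hasCpuFeature(const std::set<std::string>& flags, const std::string& name) {
    return flags.count(name) != 0;
}
```

Called with the `Features` value from the listing above, `hasCpuFeature(flags, "asimd")`, `hasCpuFeature(flags, "sve")`, and `hasCpuFeature(flags, "sve2")` would all be expected to return true.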

I'm running the test render on the country kitchen scene with:
moonray -debug -exec_mode xpu -in scene.rdla -in scene.rdlb -out arm64.exr

And the output is in the attached file
kitchen.log

I'd love to contribute a coherent patch once I get this working. Most of the changes I've made so far replace __APPLE__ with __ARM_NEON__ in the appropriate places and separate the concerns between ARM on Darwin and ARM on Linux. It's been a whirlwind trying to get this far, and compiling on qemu has made this process slower than usual :)

Where is a good place to start debugging this? Since scalar mode works, I imagine I made some mistakes in my patches to the vector and XPU code paths. After reading the code, I also assume that the Apple ARM path has never been tested with OptiX, so we're in uncharted waters.

Looking forward to working with everyone! Thanks!
