Fix batched Cholesky OOM on ROCm by bypassing hipSolver's external hipMalloc by FlemingH · Pull Request #717 · ROCm/jax

FlemingH · 2026-02-26T10:19:56Z

jax.vmap(jnp.linalg.cholesky) with batch >= 2 crashes with OOM on ROCm.
Root cause: JAX calls hipsolverDnXpotrfBatched (dense API, no workspace parameter), which internally allocates workspace via hipMalloc. This bypasses XLA's BFC allocator, and since XLA preallocates ~75% of GPU VRAM by default, the external hipMalloc fails.
Fix: Switch to hipsolverXpotrfBatched (standard API), which accepts an external workspace buffer. The workspace is allocated through XLA's scratch allocator, keeping all GPU memory within XLA's control. CUDA path is unchanged.
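
A minimal reproduction sketch of the call pattern described above. The batch size, matrix size, seed, and the helper name `make_spd_batch` are illustrative assumptions and do not come from this PR. On CPU or CUDA the script simply runs; on an affected ROCm build, the `jax.vmap` call is where the OOM would be expected.

```python
import jax
import jax.numpy as jnp


def make_spd_batch(key, batch, n):
    """Build a batch of symmetric positive-definite matrices (hypothetical helper)."""
    a = jax.random.normal(key, (batch, n, n), dtype=jnp.float32)
    # A @ A^T is positive semi-definite; adding n * I makes it safely positive definite.
    return a @ jnp.swapaxes(a, -1, -2) + n * jnp.eye(n, dtype=jnp.float32)


# batch >= 2 is the condition named in the PR description; 4 and 8 are arbitrary.
mats = make_spd_batch(jax.random.PRNGKey(0), batch=4, n=8)

# The call pattern from the PR description: vmap over jnp.linalg.cholesky.
chol = jax.vmap(jnp.linalg.cholesky)(mats)

# Sanity check: L @ L^T should reconstruct each input matrix.
recon = chol @ jnp.swapaxes(chol, -1, -2)
max_err = float(jnp.max(jnp.abs(recon - mats)))
print("result shape:", chol.shape, "max reconstruction error:", max_err)
```

As a diagnostic only (an assumption, not something this PR proposes), launching with JAX's `XLA_PYTHON_CLIENT_PREALLOCATE=false` leaves more VRAM outside XLA's pool and may help confirm that an out-of-pool allocation is what fails.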

…tignore (ROCm#563) When jaxlib was built in debug mode, an assertion in LLVM code that lazy-loads the VHLO dialect could fire, since the code path could execute in a multi-threaded environment and LLVM dialect repositories aren't thread-safe to modify. This patch applies the same changes that upstream made to fix this: jax-ml@48c8762 (this includes disabling a call to `jax_mlir_ext.enter_multi_threaded_execution(context)` in `mlir.py`). Presumably, the functionality related to the `enter_multi_threaded_execution()` multithreaded checks isn't ready yet and was prematurely rolled into the production code. Manual testing

hawkinsp · 2026-02-27T09:05:22Z

jaxlib/gpu/solver_kernels_ffi.cc

  return ffi::Error::Success();
 }

+#ifdef JAX_GPU_HIP


Is it possible to just reuse the CUDA case with the appropriate defines? That's usually how it works since the APIs are almost identical.

FlemingH · Mar 2, 2026

Hi @hawkinsp, thanks for your suggestion.
Updated. The platform difference is now absorbed in vendor.h and solver_interface, so solver_kernels_ffi.cc has no #ifdef.

charleshofer and others added 30 commits February 24, 2026 09:38

Remove nvidia_wheel_versions

097ff78

Make jaxlib targets visible

19a87c5

hipblas typedef fix

35b2368

No GPU fail

9bf2dbf

Wrap HIP inline functions in anonymous namespaces in vendor.h

7d1708e

SWDEV-512768 - Replace hipGetLastError with hipExtGetLastError

30d7f94

Add shared utility function get_rocm_version to test_util.py

a5377e5

Fix hipSparse CSR algorithm mappings for ROCm 7

db30afa

Fix v_pages quantization and adjust test params for ROCm compatibilit… (ROCm#560)

a44f942

Add skip of test_is_finite() on Cuda (ROCm#565)

f555563

(forgot this skip in the previous PR)

Add rocm test requirements file (ROCm#570)

8cf787a

Let the unit tests use build.py for setting up Bazel commands for unit tests (ROCm#582)

17e6022

adding abort logic to rocm/jax (ROCm#590)

b600136

Skip is_finite tests on ROCm (not in Triton lowering for jax 0.8.0) (ROCm#597)

02399d0

Fix shared memory limit check for ROCm in test_dot (ROCm#596)

0959b0f

Fix Numpy signatures test (ROCm#598)

b43ca18

Co-authored-by: Daniel Suo <danielsuo@gmail.com> Co-authored-by: Jake VanderPlas <jakevdp@google.com>

fix merge arts

cdb5bcb

Enable RngShardingTests (ROCm#644)

de1ef41

Enable test_variadic_reduce_window on ROCm (ROCm#647)

d8179cd

Skip sparse tests on ROCm due to hipSPARSE issue (ROCm#652)

c5016ef

Update sparse test skip messages in v0.8.2 (ROCm#653)

4e6626e

Skip sparse tests on ROCm due to hipSPARSE issue (ROCm#652)

694e861

Update sparse test skip messages in v0.8.2 (ROCm#653)

12e07fb

Skip sparse tests on ROCm due to hipSPARSE issue (ROCm#652)

76e576f

Update sparse test skip messages in v0.8.2 (ROCm#653)

237e5ad

Enable testMultivariateNormalSingularCovariance on ROCm (ROCm#666)

da3a3cc

Skip test_tridiagonal_solve on ROCm due to hipSPARSE numerical errors (ROCm#668)

06d459e

Update Skip Reason Outputs (ROCm#663)

c30a449

Skip sparse tests on ROCm due to hipSPARSE issue (ROCm#652)

58ce4e1

magaonka-amd and others added 12 commits February 24, 2026 09:38

Update sparse test skip messages in v0.8.2 (ROCm#653)

e8307d2

Skip testCudaArrayInterfaceOnNonCudaFails on ROCm platform (ROCm#677)

64ee74e

Skip sparse tests on ROCm due to hipSPARSE issue (ROCm#652)

f144132

Update sparse test skip messages in v0.8.2 (ROCm#653)

9dd1698

Skip sparse tests on ROCm due to hipSPARSE issue (ROCm#652)

fd1195e

Update sparse test skip messages in v0.8.2 (ROCm#653)

4af5327

Add ROCm encoding for test_struct_encoding_determinism (ROCm#683)

bfe0208

Remove 'mean' from unsupported params for jnp.var (ROCm#689)

44b8a6c

Implement approx_tanh for ROCm using OCML tanh function (ROCm#691)

7bd4a13

Skipping testEighTinyNorm due to hipSolver issues (ROCm#697)

95ae9fa

Abort detection CI workflow (ROCm#688)

e355fcd

Abort-Detection: Fix halt-for-connection input (ROCm#712)

d36ebc2

This was referenced Feb 26, 2026

Batched Cholesky (potrf) OOM crash due to hipSolver allocating outside XLA memory pool #718

Open

[ROCm] Batched Cholesky (potrf) OOM crash due to hipSolver allocating outside XLA memory pool jax-ml/jax#35455

Open

hawkinsp reviewed Feb 27, 2026

FlemingH closed this Mar 2, 2026

FlemingH force-pushed the amd-main branch from 2ccdebd to d36ebc2 on March 2, 2026 02:39

Fix batched Cholesky OOM on ROCm by routing hipSOLVER workspace through XLA allocator

5b647ae

FlemingH reopened this Mar 2, 2026

mminutoli force-pushed the amd-main branch from d36ebc2 to 7d684aa on March 2, 2026 18:20

lucbruni-amd linked an issue Mar 2, 2026 that may be closed by this pull request

Batched Cholesky (potrf) OOM crash due to hipSolver allocating outside XLA memory pool #718

Open

FlemingH wants to merge 43 commits into ROCm:amd-main from FlemingH:amd-main

11 participants
