
Release Prep JAX 0.9.1#735

Open
gulsumgudukbay wants to merge 4 commits into rocm-jaxlib-v0.9.1 from release-prep-091

Conversation

@gulsumgudukbay

PR to prepare for 0.9.1 release.

gulsumgudukbay and others added 4 commits March 6, 2026 22:52
AMD CDNA3 (MI300X/gfx942) does not have a hardware tanh instruction like
NVIDIA's PTX tanh.approx. This implements approx_tanh for ROCm using:

- For f32 (and f16/bf16 via casting): Triton's __triton_hip_fast_tanhf
  which uses a fast exp-based formula: tanh(x) = (exp(2x) - 1) / (exp(2x) + 1)
- For f64: OCML's __ocml_tanh_f64 (AMD's Open Compute Math Library)
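The exp-based identity used for the f32 path can be checked numerically. This is an illustrative sketch of the formula only, not the Triton intrinsic itself (`__triton_hip_fast_tanhf` is a device-side builtin):

```python
import math

def fast_tanh(x: float) -> float:
    # Same identity the ROCm f32 lowering relies on:
    #   tanh(x) = (exp(2x) - 1) / (exp(2x) + 1)
    # Note: exp(2x) overflows for large |x|; the device intrinsic is
    # expected to handle saturation, this sketch does not.
    e2x = math.exp(2.0 * x)
    return (e2x - 1.0) / (e2x + 1.0)

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    assert abs(fast_tanh(x) - math.tanh(x)) < 1e-12
```

For moderate inputs the formula agrees with `math.tanh` to full double precision, which is why a single fast `exp` suffices in place of a dedicated hardware tanh instruction.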

Changes:
- Add f64 support to approx_tanh function
- Add ROCm platform detection in _elementwise_inline_asm_lowering
- Add _approx_tanh_rocm_lowering function for ROCm-specific lowering
- Add test_approx_tanh test with f16/bf16/f32/f64 support

See: triton-lang/triton#7780
(cherry picked from commit 39ceb95)
- Remove verbose comment in _elementwise_inline_asm_lowering
- Inline dtype_to_ir_type helper, use mlir.dtype_to_ir_type directly
- Move ir and arith_dialect imports to top-level
- Add TypeError for float64 on non-ROCm platforms
- Simplify _approx_tanh_rocm_lowering with needs_cast pattern
- Move test_approx_tanh from ops_test.py to triton_pallas_test.py
- Fix triton_pallas_test setUp to allow ROCm devices
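The "needs_cast" pattern mentioned above can be sketched outside the MLIR lowering. This is a hypothetical NumPy illustration of the dtype-widening idea, not the actual `_approx_tanh_rocm_lowering` code: narrow floats are widened to f32, the fast f32 tanh is applied, and the result is cast back (bf16 is omitted since NumPy has no native bfloat16):

```python
import numpy as np

def approx_tanh_with_cast(x: np.ndarray) -> np.ndarray:
    # Hypothetical sketch of the needs_cast pattern: widen f16 to f32,
    # run the fast exp-based tanh, then cast back to the input dtype.
    needs_cast = x.dtype == np.float16
    orig_dtype = x.dtype
    y = x.astype(np.float32) if needs_cast else x
    e2y = np.exp(2.0 * y)              # tanh(y) = (e^{2y}-1)/(e^{2y}+1)
    out = (e2y - 1.0) / (e2y + 1.0)
    return out.astype(orig_dtype) if needs_cast else out
```

Doing the arithmetic in f32 avoids the precision and overflow problems of evaluating `exp(2x)` directly in half precision, while callers still see their original dtype.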

(cherry picked from commit 600fbd3)
The test is already skipped on CUDA (b/442353988) due to HLO debug
metadata (source column numbers) being embedded in compiled output,
causing semantically identical compilations to produce different
as_text() results. The same issue occurs on ROCm.

(cherry picked from commit 70b2b99)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
(cherry picked from commit 5ec8419)

3 participants