
Forward-merge release/26.04 into main#2310

Merged
bdice merged 7 commits into rapidsai:main from bdice:main-merge-release/26.04
Mar 16, 2026

Conversation


@bdice bdice commented Mar 16, 2026

Manual forward merge from release/26.04 to main. This PR should not be squashed.

jameslamb and others added 6 commits March 12, 2026 22:04
Fixes these `pre-commit` errors blocking CI:

```text
verify-hardcoded-version.................................................Failed
- hook id: verify-hardcoded-version
- exit code: 1

In file RAPIDS_BRANCH:1:9:
 release/26.04
warning: do not hard-code version, read from VERSION file instead

In file RAPIDS_BRANCH:1:9:
 release/26.04

In file cpp/examples/versions.cmake:8:21:
 set(RMM_TAG release/26.04)
warning: do not hard-code version, read from VERSION file instead

In file cpp/examples/versions.cmake:8:21:
 set(RMM_TAG release/26.04)
```

This is fixed by updating the `verify-hardcoded-version` configuration and by updating the C++ examples to read `RMM_TAG` from the `RAPIDS_BRANCH` file.

See rapidsai/pre-commit-hooks#121 for details

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - Bradley Dice (https://github.com/bdice)

URL: rapidsai#2293
Contributes to rapidsai/build-planning#256

Broken out from rapidsai#2270 

Proposes a stricter pattern for installing `torch` wheels, to prevent bugs of the form "accidentally used a CPU-only `torch` from pypi.org". This should help us to catch compatibility issues, improving release confidence.

Other small changes:

* splits torch wheel testing into "oldest" (PyTorch 2.9) and "latest" (PyTorch 2.10)
* introduces a `require_gpu_pytorch` matrix filter so conda jobs can explicitly request `pytorch-gpu` (to similarly ensure solvers don't fall back to the CPU-only variant)
* appends `rapids-generate-pip-constraint` output to the file that `PIP_CONSTRAINT` points to
  - *(to reduce duplication and the risk of failing to apply constraints)*

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - Bradley Dice (https://github.com/bdice)

URL: rapidsai#2279
…adaptor (rapidsai#2304)

For the tracking resource adaptor to be thread safe, modification of the tracked allocations should be sandwiched inside the "acquire-release" pair upstream.allocate / upstream.deallocate. Previously this was not the case: the upstream allocation occurred before updating the tracked allocations, but the deallocation did not occur after. In multi-threaded use this could lead to a spurious logged error that a deallocated pointer was not tracked.

To solve this, actually use the correct pattern. Moreover, ensure that we don't observe ABA issues by using try_emplace when tracking an allocation.

- Closes rapidsai#2303

Authors:
  - Lawrence Mitchell (https://github.com/wence-)
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Bradley Dice (https://github.com/bdice)

URL: rapidsai#2304
…E 754 -0.0 (rapidsai#2302)

## Description

`device_uvector::set_element_async` had a zero-value optimization that
used `cudaMemsetAsync` when `value == value_type{0}`. For IEEE 754
floating-point types, `-0.0 == 0.0` is `true` per the standard, so
`-0.0` was incorrectly routed through `cudaMemsetAsync(..., 0, ...)`
which clears all bits — including the sign bit — normalizing `-0.0` to
`+0.0`.

This corrupts the in-memory representation of `-0.0` for any downstream
library that creates scalars through RMM
(`cudf::fixed_width_scalar::set_value` →
`rmm::device_scalar::set_value_async` →
`device_uvector::set_element_async`), causing observable behavioral
divergence in spark-rapids (e.g., `cast(-0.0 as string)` returns `"0.0"`
on GPU instead of `"-0.0"`).

### Fix

Per the discussion in rapidsai#2298, remove all `constexpr` special casing in
`set_element_async` — both the `bool` `cudaMemsetAsync` path and the
`is_fundamental_v` zero-detection path — and always use
`cudaMemcpyAsync`. This preserves exact bit-level representations for
all types, which is the correct contract for a memory management library
that sits below cuDF, cuML, and cuGraph.

`set_element_to_zero_async` is unchanged — its explicit "set to zero"
semantics make `cudaMemsetAsync` the correct implementation.

### Testing

Added `NegativeZeroTest.PreservesFloatNegativeZero` and
`NegativeZeroTest.PreservesDoubleNegativeZero` regression tests that
verify the sign bit of `-0.0f` / `-0.0` survives a round-trip through
`set_element_async` → `element`. All 122 tests pass locally (CUDA 13.0,
RTX 5880).

Closes rapidsai#2298

## Checklist
- [x] I am familiar with the [Contributing
Guidelines](https://github.com/rapidsai/rmm/blob/HEAD/CONTRIBUTING.md).
- [x] New or existing tests cover these changes.
- [x] The documentation is up to date with these changes.

Made with [Cursor](https://cursor.com)

---------

Signed-off-by: Allen Xu <allxu@nvidia.com>
## Description
I found that the `ulimit` settings for CUDA 13.1 devcontainers were
missing. This fixes it.

## Checklist
- [x] I am familiar with the [Contributing
Guidelines](https://github.com/rapidsai/rmm/blob/HEAD/CONTRIBUTING.md).
- [x] New or existing tests cover these changes.
- [x] The documentation is up to date with these changes.
This PR sets an upper bound on the `numba-cuda` dependency to `<0.29.0`

Authors:
  - https://github.com/brandon-b-miller

Approvers:
  - Bradley Dice (https://github.com/bdice)

URL: rapidsai#2306
@bdice bdice requested review from a team as code owners March 16, 2026 22:10
@bdice bdice added labels non-breaking (Non-breaking change) and improvement (Improvement / enhancement to an existing function) Mar 16, 2026
@bdice bdice requested review from gforsyth, lamarrr and vyasr March 16, 2026 22:10
@bdice bdice force-pushed the main-merge-release/26.04 branch from 7661d92 to 7ddf10f March 16, 2026 22:12
@bdice bdice moved this to In Progress in RMM Project Board Mar 16, 2026

coderabbitai bot commented Mar 16, 2026

📝 Walkthrough

Summary by CodeRabbit

Release Notes

  • New Features

    • Enhanced PyTorch wheel support with improved CUDA compatibility handling and dynamic installation workflow.
  • Bug Fixes

    • Improved thread-safety in memory resource handling with additional validation checks.
  • Tests

    • Added comprehensive multithreaded concurrency tests for memory resource adaptors and statistics tracking.
  • Chores

    • Updated dependency constraints for CUDA compatibility.
    • Enhanced development environment configuration with improved resource limits.

Walkthrough

This PR makes multi-domain improvements to RMM: adds container resource limits, updates CI infrastructure for PyTorch wheel handling, tightens numba-cuda version constraints, enables dynamic CMake branch selection, removes memset optimizations from async operations, hardens resource adaptors with duplicate-tracking assertions, and introduces test utilities for concurrency validation.

Changes

Cohort / File(s) Summary
Devcontainer Configuration
.devcontainer/cuda13.1-conda/devcontainer.json, .devcontainer/cuda13.1-pip/devcontainer.json
Added file descriptor limit (--ulimit nofile=500000) to container runtime arguments.
Pre-commit and CI Foundation
.pre-commit-config.yaml, ci/release/update-version.sh
Bumped pre-commit-hooks from v1.3.3 to v1.4.2 with new exclude rules; updated copyright year and removed RMM_TAG sed commands from release script.
PyTorch Wheel Management
ci/download-torch-wheels.sh, ci/test_wheel.sh, ci/test_wheel_integrations.sh, ci/test_python_integrations.sh
Added new download script for CUDA-aware PyTorch wheels; refactored test scripts to use PIP_CONSTRAINT environment variable, dynamic wheel downloads, and CUDA version gating (12.9–13.0).
Dependency Constraints
conda/environments/all_cuda-129_arch-aarch64.yaml, conda/environments/all_cuda-129_arch-x86_64.yaml, conda/environments/all_cuda-131_arch-aarch64.yaml, conda/environments/all_cuda-131_arch-x86_64.yaml, python/rmm/pyproject.toml, dependencies.yaml
Tightened numba-cuda constraint from >=0.22.1 to >=0.22.1,<0.29.0 across multiple CUDA variant environments and test matrices.
CMake and Version Management
cpp/examples/versions.cmake
Made RMM_TAG dynamic by reading from ${_rapids_branch} variable and adding include directive to rapids_config.cmake.
Device Operations
cpp/include/rmm/device_scalar.hpp, cpp/include/rmm/device_uvector.hpp
Removed memset optimization documentation and zero-value fast-path for set_element_async, consolidating to single cudaMemcpyAsync path.
Resource Adaptors
cpp/include/rmm/aligned_resource_adaptor.hpp, cpp/include/rmm/statistics_resource_adaptor.hpp, cpp/include/rmm/tracking_resource_adaptor.hpp
Added duplicate-pointer detection via try_emplace assertions; reordered deallocation calls to occur after counter updates to ensure proper error detection and logging.
Testing Infrastructure
cpp/tests/mr/delayed_memory_resource.hpp, cpp/tests/device_uvector_tests.cpp, cpp/tests/mr/statistics_mr_tests.cpp, cpp/tests/mr/tracking_mr_tests.cpp
Introduced delayed_memory_resource test utility to inject post-deallocation delays; added negative-zero preservation tests for floating-point types and multithreaded concurrency tests for resource adaptors.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Suggested labels

3 - Ready for review

Suggested reviewers

  • gforsyth
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): docstring coverage is 20.00%, which is below the required 80.00% threshold. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
  • Title check (✅ Passed): the PR title 'Forward-merge release/26.04 into main' clearly and concisely describes the primary purpose of this changeset, merging changes from a release branch into the main branch.
  • Description check (✅ Passed): the PR description accurately explains this is a manual forward merge from release/26.04 to main with a note about not squashing, which directly relates to the changeset's purpose.




@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
cpp/tests/mr/tracking_mr_tests.cpp (1)

36-80: Good multi-threaded test for ABA race detection.

The test correctly sets up the interleaving scenario documented in the comments (lines 43-60). The use of delayed_memory_resource with 300ms delay combined with Thread 1's 100ms initial sleep creates the overlapping deallocation window needed to expose ABA issues.

One observation: unlike the StatisticsTest::MultiThreaded in statistics_mr_tests.cpp, this test doesn't assert final counter/tracking state after threads join. Consider adding assertions to verify mr.get_outstanding_allocations().size() == 0 and mr.get_allocated_bytes() == 0 to confirm correct bookkeeping under concurrency.

💡 Optional: Add final state assertions
   for (auto& t : threads) {
     t.join();
   }
+  EXPECT_EQ(mr.get_outstanding_allocations().size(), 0);
+  EXPECT_EQ(mr.get_allocated_bytes(), 0);
 }

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 73e0669c-448b-467f-b00c-a97943ba8d02

📥 Commits

Reviewing files that changed from the base of the PR and between 22f4680 and 7ddf10f.

📒 Files selected for processing (24)
  • .devcontainer/cuda13.1-conda/devcontainer.json
  • .devcontainer/cuda13.1-pip/devcontainer.json
  • .pre-commit-config.yaml
  • ci/download-torch-wheels.sh
  • ci/release/update-version.sh
  • ci/test_python_integrations.sh
  • ci/test_wheel.sh
  • ci/test_wheel_integrations.sh
  • conda/environments/all_cuda-129_arch-aarch64.yaml
  • conda/environments/all_cuda-129_arch-x86_64.yaml
  • conda/environments/all_cuda-131_arch-aarch64.yaml
  • conda/environments/all_cuda-131_arch-x86_64.yaml
  • cpp/examples/versions.cmake
  • cpp/include/rmm/device_scalar.hpp
  • cpp/include/rmm/device_uvector.hpp
  • cpp/include/rmm/mr/aligned_resource_adaptor.hpp
  • cpp/include/rmm/mr/statistics_resource_adaptor.hpp
  • cpp/include/rmm/mr/tracking_resource_adaptor.hpp
  • cpp/tests/device_uvector_tests.cpp
  • cpp/tests/mr/delayed_memory_resource.hpp
  • cpp/tests/mr/statistics_mr_tests.cpp
  • cpp/tests/mr/tracking_mr_tests.cpp
  • dependencies.yaml
  • python/rmm/pyproject.toml
💤 Files with no reviewable changes (1)
  • cpp/include/rmm/device_uvector.hpp

@bdice bdice merged commit 485d79a into rapidsai:main Mar 16, 2026
82 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to Done in RMM Project Board Mar 16, 2026

Labels

improvement (Improvement / enhancement to an existing function), non-breaking (Non-breaking change)

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

5 participants