Forward-merge release/26.04 into main (#2310)
Conversation
Fixes these `pre-commit` errors blocking CI:

```text
verify-hardcoded-version.................................................Failed
- hook id: verify-hardcoded-version
- exit code: 1

In file RAPIDS_BRANCH:1:9: release/26.04
warning: do not hard-code version, read from VERSION file instead
In file RAPIDS_BRANCH:1:9: release/26.04

In file cpp/examples/versions.cmake:8:21: set(RMM_TAG release/26.04)
warning: do not hard-code version, read from VERSION file instead
In file cpp/examples/versions.cmake:8:21: set(RMM_TAG release/26.04)
```

It does this by updating the `verify-hardcoded-version` configuration and by updating the C++ examples to read `RMM_TAG` from the `RAPIDS_BRANCH` file. See rapidsai/pre-commit-hooks#121 for details.

Authors:
- James Lamb (https://github.com/jameslamb)

Approvers:
- Bradley Dice (https://github.com/bdice)

URL: rapidsai#2293
Contributes to rapidsai/build-planning#256. Broken out from rapidsai#2270.

Proposes a stricter pattern for installing `torch` wheels, to prevent bugs of the form "accidentally used a CPU-only `torch` from pypi.org". This should help us catch compatibility issues, improving release confidence.

Other small changes:

* splits torch wheel testing into "oldest" (PyTorch 2.9) and "latest" (PyTorch 2.10)
* introduces a `require_gpu_pytorch` matrix filter so conda jobs can explicitly request `pytorch-gpu` (to similarly ensure solvers don't fall back to the CPU-only variant)
* appends `rapids-generate-pip-constraint` output to the file that `PIP_CONSTRAINT` points to *(to reduce duplication and the risk of failing to apply constraints)*

Authors:
- James Lamb (https://github.com/jameslamb)

Approvers:
- Bradley Dice (https://github.com/bdice)

URL: rapidsai#2279
…adaptor (rapidsai#2304)

For the tracking resource adaptor to be thread safe, modification of the tracked allocations must be sandwiched inside the `upstream.allocate`/`upstream.deallocate` pair: track *after* allocating upstream, and untrack *before* deallocating upstream. Previously this was only half true: the upstream allocation did occur before the tracked allocations were updated, but the upstream deallocation did not occur after they were removed. In multi-threaded use this could produce a logged error claiming that a deallocated pointer was not tracked. This change applies the correct ordering, and additionally guards against ABA issues by using `try_emplace` when tracking an allocation.

- Closes rapidsai#2303

Authors:
- Lawrence Mitchell (https://github.com/wence-)
- Bradley Dice (https://github.com/bdice)

Approvers:
- Bradley Dice (https://github.com/bdice)

URL: rapidsai#2304
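The ordering can be sketched in plain C++. This is a hypothetical stand-in, not RMM's actual adaptor: `std::malloc`/`std::free` play the role of the upstream resource, and the class name and error handling are illustrative only. Tracking happens strictly inside the upstream allocate/deallocate pair, and `try_emplace` refuses to overwrite an entry for a recycled (ABA) address.

```cpp
#include <cstddef>
#include <cstdlib>
#include <mutex>
#include <new>
#include <stdexcept>
#include <unordered_map>

// Hypothetical stand-in for the pattern described above (not RMM's real code).
class tracking_adaptor {
 public:
  void* allocate(std::size_t bytes)
  {
    void* ptr = std::malloc(bytes);  // 1. upstream allocation first
    if (ptr == nullptr) { throw std::bad_alloc{}; }
    std::lock_guard<std::mutex> lock{mtx_};
    // 2. ...then track; try_emplace leaves an existing entry untouched, so a
    // recycled address that is somehow still tracked is detected, not clobbered
    bool inserted = allocations_.try_emplace(ptr, bytes).second;
    if (!inserted) { throw std::logic_error{"address already tracked (ABA)"}; }
    return ptr;
  }

  void deallocate(void* ptr)
  {
    {
      std::lock_guard<std::mutex> lock{mtx_};
      allocations_.erase(ptr);  // 1. untrack first...
    }
    std::free(ptr);  // 2. ...then upstream deallocation last
  }

  std::size_t tracked_count() const
  {
    std::lock_guard<std::mutex> lock{mtx_};
    return allocations_.size();
  }

 private:
  mutable std::mutex mtx_;
  std::unordered_map<void*, std::size_t> allocations_;
};
```

With this ordering, another thread can never observe a live upstream pointer that is absent from the tracking map, which is exactly the window that produced the spurious "pointer was not tracked" log.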
…E 754 -0.0 (rapidsai#2302)

## Description

`device_uvector::set_element_async` had a zero-value optimization that used `cudaMemsetAsync` when `value == value_type{0}`. For IEEE 754 floating-point types, `-0.0 == 0.0` is `true` per the standard, so `-0.0` was incorrectly routed through `cudaMemsetAsync(..., 0, ...)`, which clears all bits, including the sign bit, normalizing `-0.0` to `+0.0`.

This corrupts the in-memory representation of `-0.0` for any downstream library that creates scalars through RMM (`cudf::fixed_width_scalar::set_value` → `rmm::device_scalar::set_value_async` → `device_uvector::set_element_async`), causing observable behavioral divergence in spark-rapids (e.g., `cast(-0.0 as string)` returns `"0.0"` on GPU instead of `"-0.0"`).

### Fix

Per the discussion in rapidsai#2298, remove all `constexpr` special casing in `set_element_async`, both the `bool` `cudaMemsetAsync` path and the `is_fundamental_v` zero-detection path, and always use `cudaMemcpyAsync`. This preserves exact bit-level representations for all types, which is the correct contract for a memory management library that sits below cuDF, cuML, and cuGraph.

`set_element_to_zero_async` is unchanged: its explicit "set to zero" semantics make `cudaMemsetAsync` the correct implementation.

### Testing

Added `NegativeZeroTest.PreservesFloatNegativeZero` and `NegativeZeroTest.PreservesDoubleNegativeZero` regression tests that verify the sign bit of `-0.0f` / `-0.0` survives a round-trip through `set_element_async` → `element`. All 122 tests pass locally (CUDA 13.0, RTX 5880).

Closes rapidsai#2298

## Checklist

- [x] I am familiar with the [Contributing Guidelines](https://github.com/rapidsai/rmm/blob/HEAD/CONTRIBUTING.md).
- [x] New or existing tests cover these changes.
- [x] The documentation is up to date with these changes.

Made with [Cursor](https://cursor.com)

Signed-off-by: Allen Xu <allxu@nvidia.com>
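The bug is reproducible on the host without CUDA at all. In this illustrative sketch, `std::memset` and `std::memcpy` stand in for `cudaMemsetAsync` and `cudaMemcpyAsync` (the helper names are hypothetical, not RMM API): zero-filling the destination clears every byte, including the sign bit of `-0.0`, while a byte-wise copy preserves the exact bit pattern.

```cpp
#include <cmath>
#include <cstring>

// Old fast path (stand-in for cudaMemsetAsync): writes all-zero bytes,
// which is +0.0 regardless of the requested value's sign bit.
double set_via_memset(double /*value*/)
{
  double out;
  std::memset(&out, 0, sizeof out);
  return out;
}

// New path (stand-in for cudaMemcpyAsync): exact bit-for-bit copy.
double set_via_memcpy(double value)
{
  double out;
  std::memcpy(&out, &value, sizeof out);
  return out;
}
```

Because `-0.0 == 0.0` holds under IEEE 754, the old zero-detection branch selected the memset path for `-0.0`; `std::signbit(set_via_memset(-0.0))` is `false` while `std::signbit(set_via_memcpy(-0.0))` is `true`, which is the sign-bit loss the regression tests now guard against.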
## Description

I found that the `ulimit` settings for the CUDA 13.1 devcontainers were missing. This fixes that.

## Checklist

- [x] I am familiar with the [Contributing Guidelines](https://github.com/rapidsai/rmm/blob/HEAD/CONTRIBUTING.md).
- [x] New or existing tests cover these changes.
- [x] The documentation is up to date with these changes.
This PR sets an upper bound on the `numba-cuda` dependency to `<0.29.0`.

Authors:
- https://github.com/brandon-b-miller

Approvers:
- Bradley Dice (https://github.com/bdice)

URL: rapidsai#2306
📝 Walkthrough

This PR makes multi-domain improvements to RMM: it adds container resource limits, updates CI infrastructure for PyTorch wheel handling, tightens numba-cuda version constraints, enables dynamic CMake branch selection, removes memset optimizations from async operations, hardens resource adaptors with duplicate-tracking assertions, and introduces test utilities for concurrency validation.
🧹 Nitpick comments (1)

cpp/tests/mr/tracking_mr_tests.cpp (1)

Lines 36-80: Good multi-threaded test for ABA race detection. The test correctly sets up the interleaving scenario documented in the comments (lines 43-60). The use of `delayed_memory_resource` with a 300 ms delay, combined with Thread 1's 100 ms initial sleep, creates the overlapping deallocation window needed to expose ABA issues.

One observation: unlike `StatisticsTest::MultiThreaded` in `statistics_mr_tests.cpp`, this test doesn't assert final counter/tracking state after the threads join. Consider adding assertions to verify `mr.get_outstanding_allocations().size() == 0` and `mr.get_allocated_bytes() == 0` to confirm correct bookkeeping under concurrency.

💡 Optional: Add final state assertions

```diff
   for (auto& t : threads) {
     t.join();
   }
+
+  EXPECT_EQ(mr.get_outstanding_allocations().size(), 0);
+  EXPECT_EQ(mr.get_allocated_bytes(), 0);
 }
```
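The shape of such a final-state check can be sketched without gtest or RMM. Everything here is illustrative (`toy_tracker`, `run_overlapping_workers`, the 256-byte allocations, and the sleep values are all made up for the sketch): several threads do overlapping allocate/deallocate work against a mutex-protected tracker, and only after every thread has joined is the tracker's state asserted to be empty.

```cpp
#include <chrono>
#include <cstddef>
#include <mutex>
#include <thread>
#include <unordered_map>
#include <vector>

// Illustrative tracker: outstanding allocations plus a running byte count.
struct toy_tracker {
  std::mutex mtx;
  std::unordered_map<void*, std::size_t> outstanding;
  std::size_t allocated_bytes = 0;
};

// Each worker allocates, holds the allocation briefly so the windows overlap
// across threads, then deallocates; staggered start delays force interleaving.
void run_overlapping_workers(toy_tracker& trk, int num_threads)
{
  auto worker = [&trk](int delay_ms) {
    std::this_thread::sleep_for(std::chrono::milliseconds(delay_ms));
    void* p = ::operator new(256);
    {
      std::lock_guard<std::mutex> lock{trk.mtx};
      trk.outstanding.try_emplace(p, 256);
      trk.allocated_bytes += 256;
    }
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
    {
      std::lock_guard<std::mutex> lock{trk.mtx};
      trk.outstanding.erase(p);
      trk.allocated_bytes -= 256;
    }
    ::operator delete(p);
  };

  std::vector<std::thread> threads;
  for (int i = 0; i < num_threads; ++i) { threads.emplace_back(worker, i * 5); }
  for (auto& t : threads) { t.join(); }  // final-state checks go after this
}
```

After `run_overlapping_workers` returns, the final-state assertions reduce to `outstanding.empty()` and `allocated_bytes == 0`; any leaked tracking entry or miscounted byte under concurrency fails the check.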
Verify each finding against the current code and only fix it if needed. In `@cpp/tests/mr/tracking_mr_tests.cpp` around lines 36 - 80, Add final-state assertions after the thread joins in the TrackingTest::MultiThreaded test to ensure the tracking_resource_adaptor cleaned up correctly: call mr.get_outstanding_allocations().size() and mr.get_allocated_bytes() and assert they are zero (e.g., EXPECT_EQ or ASSERT_EQ) to verify no leaked allocation entries and zero tracked bytes after concurrent allocate/deallocate; place these checks immediately after the for-loop that joins threads and reference mr (the tracking_resource_adaptor<delayed_memory_resource> instance) to locate the code.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@cpp/tests/mr/tracking_mr_tests.cpp`:
- Around line 36-80: Add final-state assertions after the thread joins in the
TrackingTest::MultiThreaded test to ensure the tracking_resource_adaptor cleaned
up correctly: call mr.get_outstanding_allocations().size() and
mr.get_allocated_bytes() and assert they are zero (e.g., EXPECT_EQ or ASSERT_EQ)
to verify no leaked allocation entries and zero tracked bytes after concurrent
allocate/deallocate; place these checks immediately after the for-loop that
joins threads and reference mr (the
tracking_resource_adaptor<delayed_memory_resource> instance) to locate the code.
📒 Files selected for processing (24)

- .devcontainer/cuda13.1-conda/devcontainer.json
- .devcontainer/cuda13.1-pip/devcontainer.json
- .pre-commit-config.yaml
- ci/download-torch-wheels.sh
- ci/release/update-version.sh
- ci/test_python_integrations.sh
- ci/test_wheel.sh
- ci/test_wheel_integrations.sh
- conda/environments/all_cuda-129_arch-aarch64.yaml
- conda/environments/all_cuda-129_arch-x86_64.yaml
- conda/environments/all_cuda-131_arch-aarch64.yaml
- conda/environments/all_cuda-131_arch-x86_64.yaml
- cpp/examples/versions.cmake
- cpp/include/rmm/device_scalar.hpp
- cpp/include/rmm/device_uvector.hpp
- cpp/include/rmm/mr/aligned_resource_adaptor.hpp
- cpp/include/rmm/mr/statistics_resource_adaptor.hpp
- cpp/include/rmm/mr/tracking_resource_adaptor.hpp
- cpp/tests/device_uvector_tests.cpp
- cpp/tests/mr/delayed_memory_resource.hpp
- cpp/tests/mr/statistics_mr_tests.cpp
- cpp/tests/mr/tracking_mr_tests.cpp
- dependencies.yaml
- python/rmm/pyproject.toml
💤 Files with no reviewable changes (1)
- cpp/include/rmm/device_uvector.hpp
Manual forward merge from release/26.04 to main. This PR should not be squashed.