🔄 daily merge: master → main 2026-01-16 by antfin-oss · Pull Request #747 · antgroup/ant-ray

antfin-oss · 2026-01-16T03:20:12Z

This Pull Request was created automatically to merge the latest changes from master into main branch.

📅 Created: 2026-01-16
🔀 Merge direction: master → main
🤖 Triggered by: Scheduled

Please review and merge if everything looks good.

not used by rllib any more Signed-off-by: Lonnie Liu <lonnie@anyscale.com>

) Signed-off-by: ryanaoleary <ryanaoleary@google.com> Signed-off-by: Ryan O'Leary <113500783+ryanaoleary@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

ray-project#59669) `test_resource_manager.py` is 1300+ LOC. If we want to add a new resource allocator implementation, it'll be even larger. To keep test files small, and make it easier to introduce a new allocator, this PR separates the `ReservationOpResourceAllocator` into its own test module. --------- Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu> Co-authored-by: Alexey Kudinkin <alexey.kudinkin@gmail.com>

…-project#59637)

## Description

### [Data] Add BackpressurePolicy to streaming executor progress bar

Add Backpressure policy information to streaming executor progress bar

---------

Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>

To make it easier to extend Ray Data and add new allocator implementations, this PR adds a `create_resource_allocator` function. You can configure its output with the `RAY_DATA_USE_OP_RESOURCE_ALLOCATOR_VERSION` environment variable (currently, it only supports the "V1" value). --------- Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu> Co-authored-by: Alexey Kudinkin <alexey.kudinkin@gmail.com>
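A minimal sketch of the pattern this commit describes: a factory function that picks an allocator implementation from an environment variable. Only the names `create_resource_allocator`, `RAY_DATA_USE_OP_RESOURCE_ALLOCATOR_VERSION`, `ReservationOpResourceAllocator`, and the "V1" value come from the commit messages in this PR. The stub classes, the registry dict, the mapping of "V1" to the reservation allocator, the `env` parameter, and the error handling are assumptions for illustration, not Ray Data's actual code.

```python
import os


class OpResourceAllocator:
    """Stand-in base class; the real Ray Data interface is much larger."""


class ReservationOpResourceAllocator(OpResourceAllocator):
    """Stand-in for the reservation-based allocator."""


# Hypothetical registry mapping version strings to implementations.
# Adding a new allocator would mean adding one entry here.
_ALLOCATORS = {"V1": ReservationOpResourceAllocator}


def create_resource_allocator(env=None):
    """Return an allocator chosen by RAY_DATA_USE_OP_RESOURCE_ALLOCATOR_VERSION.

    `env` is a hypothetical parameter that lets callers pass a mapping
    instead of reading os.environ, which keeps the function easy to test.
    """
    env = os.environ if env is None else env
    version = env.get("RAY_DATA_USE_OP_RESOURCE_ALLOCATOR_VERSION", "V1")
    if version not in _ALLOCATORS:
        raise ValueError(f"Unsupported resource allocator version: {version!r}")
    return _ALLOCATORS[version]()
```

Keeping the version lookup in one place means callers never name a concrete allocator class, which is what makes a later "V2" a local change.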

…ay-project#59672) ray-project#57788 added `task_resource_usage` and `output_object_store_usage` to `OpResourceAllocator.max_task_output_bytes_to_read`. But, the parameters aren't actually used anywhere, so this PR removes them. --------- Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu> Co-authored-by: Alexey Kudinkin <alexey.kudinkin@gmail.com>

) ray-project#59412 updated the reservation resource allocator to cap resource allocations based on the resources returned by `min_max_resource_usage`. The problem is that running tasks can produce any amounts of data, so it doesn't make sense to cap by `obj_store_mem_max_pending_output_per_task * concurrency`. This PR fixes that issue by setting the max object store memory usage to 'inf'. Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
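The effect of replacing a finite cap with 'inf' can be illustrated with a small sketch. The function name, the byte figures, and the use of `min` are assumptions made for illustration; the real allocator logic is more involved and is not reproduced here.

```python
import math


def cap_allocation(budget_bytes, max_usage_bytes):
    """Clamp an operator's object store budget to a maximum usage."""
    return min(budget_bytes, max_usage_bytes)


# Illustrative numbers only.
per_task_pending_output = 128 * 1024**2  # 128 MiB per task
concurrency = 4
budget = 2 * 1024**3  # 2 GiB available to the operator

# Old behavior: the cap assumes each task produces a bounded amount of data.
old_cap = per_task_pending_output * concurrency
capped = cap_allocation(budget, old_cap)

# New behavior: an infinite maximum leaves the budget untouched.
uncapped = cap_allocation(budget, math.inf)
```

With these numbers, the old cap shrinks the 2 GiB budget to 512 MiB, while the infinite maximum returns the full budget.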

…59746) shortcut to JobFileManager directly the original `upload()` and `download()` methods are not used anywhere any more Signed-off-by: Lonnie Liu <lonnie@anyscale.com>

) ## Why are these changes needed? With this change the `@udf` decorator can be added to callable classes. Allows for expressions to be used in conjunction with callable classes. ## Related issue number Closes ray-project#56529 ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [x] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Goutam V. <goutam@anyscale.com> Signed-off-by: Goutam <goutam@anyscale.com>

… encapsulation (ray-project#59754) ## Description `get_ineligible_op_usage` depends on `ResourceManager`, and the notion of operator eligibility is defined by the `ResourceManager`, so it might be easier to understand the abstractions if `get_ineligible_op_usage` is also a `ResourceManager` method. Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>

…ct#59713) ## Description Fix @task_consumer decorator's __del__ method calling shutdown() which broadcasts to all Celery workers instead of just the local one. This kills newly started workers during rolling updates. ## Related issues None ## Additional information Removed self._adapter.shutdown() from __del__ - only stop_consumer() should be called since it targets the specific worker hostname. Also removed shutdown() implementation & from interface given it is not used anywhere --------- Signed-off-by: krisselberg <kselberg@princeton.edu>

to be consistent with setup.py; also unpins the version, which allows library users to use newer versions. also relaxes ormsgpack's version so that library users can upgrade it. Signed-off-by: Lonnie Liu <lonnie@anyscale.com>

…mation (ray-project#59741) Adding Buildkite Environment variables to help identify the build clearly. Release test configurations are being added to narrow down on the team and environment for any specific test. Signed-off-by: Rajesh G <rajesh@anyscale.com>

old flag that is not used anywhere now Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>

…allelization (ray-project#59736) ## Description This PR created two new files - test_push_based_shuffle.py and test_shuffle_diagnostics.py and moved some tests originally in test_sort.py there. It is meant to enhance the parallelization during CI. ## Related issues Related to ray-project#59729 --------- Signed-off-by: Rob12312368 <rob12312368@gmail.com>

…59676) ## Description Fixed a minor typo in the comment for execution_id resource tracking logic. Remove the redundant "the" before "its expiration timestamp" - From: `# For the same execution_id, we track the latest resource request and the its expiration timestamp.` - To: `# For the same execution_id, we track the latest resource request and its expiration timestamp.` ## Impact - No functional changes, just improves comment clarity Signed-off-by: will <zzchun8@gmail.com>

Fixes ray-project#59300 Co-authored-by: Balaji Veeramani <balaji@anyscale.com>

@bveeramani

…y-project#59234) ## Description This is a follow-up PR to ray-project#58915 that renames the method `completed()` to `has_completed()` in the class `PhysicalOperator`. This is in line with our discussion with @bveeramani here: ray-project#58915 (review). The rename makes it clear that the method doesn't modify state and has no side effects. ## Related issues ray-project#58884 ## Additional information Just updated `completed()` and all its references in other classes and tests. --------- Signed-off-by: Simeet Nayan <simeetnayan.8100@gmail.com> Co-authored-by: Balaji Veeramani <balaji@anyscale.com>

not used anywhere anymore; pure dead code. Signed-off-by: Lonnie Liu <lonnie@anyscale.com>

not used any more. cluster lifecycle in release tests are managed by Anyscale or kuberayportal today also removes unused exceptions and error codes. Signed-off-by: Lonnie Liu <lonnie@anyscale.com>

- upgrade docker plugin - use buildkite commands, rather than plugin commands - add set shell to bash - move `python:3.10` image as default image Signed-off-by: Lonnie Liu <lonnie@anyscale.com>

…ray-project#59752) stop testing simple error handling with mocking Signed-off-by: Lonnie Liu <lonnie@anyscale.com>

…y-project#59766) they are never used, do not apply to kuberay, and will not work with new anyscale SDK/CLI. Signed-off-by: Lonnie Liu <lonnie@anyscale.com>

…ect#59753)

## Description

### [Data] Enable and Tune DownstreamCapacityBackpressurePolicy

- To backpressure a given Op, use the ratio of queue size build-up to downstream capacity. This ratio represents the upper limit of buffering in the object store between pipeline stages to optimize for throughput.
- Wait until the Op's utilization reaches OBJECT_STORE_BUDGET_UTIL_THRESHOLD before this backpressure policy can kick in, so that steady state is reached.
- Skip this backpressure policy if the current Op or the downstream Op is materializing.

---------

Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com> Signed-off-by: Srinath Krishnamachari <68668616+srinathk10@users.noreply.github.com>

## Description

Completing the Expr Arithmetic operations: negate, sign, power, abs

```python
import ray
from ray.data import from_items
from ray.data.expressions import col

# Build a small dataset and print the original rows.
ds = from_items([{"x": 5}, {"x": 2}, {"x": 0}])
for row in ds.iter_rows():
    print(row)

# Add one column per new arithmetic expression.
x_expr = col("x")
ds = ds.with_column("x_negate", x_expr.negate())
ds = ds.with_column("x_sign", x_expr.sign())
ds = ds.with_column("x_power", x_expr.power(2))
ds = ds.with_column("x_abs", x_expr.abs())
for row in ds.iter_rows():
    print(row)
```

## Related issues

Related to ray-project#58674

---------

Signed-off-by: will <zzchun8@gmail.com> Co-authored-by: Goutam <goutam@anyscale.com>

## Description Similar to ray-project#59422, the `Write` op already has a `_compute` attribute for specifying compute strategy. So we do not need to store `concurrency` separately.. This will also make it easier to implement actor-based data sink (with `ActorPoolStrategy`) in the future. ## Additional information * No change to public APIs * Cleaned up some imports in `dataset.py` Signed-off-by: Kit Lee <7000003+wingkitlee0@users.noreply.github.com>

) python 3.9 clean up: removing **base test 3.9** and **base gpu 3.9** wanda files and buildkite jobs --------- Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>

test_controller was timing out ``` [2025-12-30T19:04:59Z] //python/ray/serve/tests:test_controller TIMEOUT in 3 out of 3 in 60.1s ``` Signed-off-by: abrar <abrar@anyscale.com>

to 0.26.74, the last release before ray summit 2025 Signed-off-by: Lonnie Liu <lonnie@anyscale.com>

…ect#59772) kicking off the anyscale sdk migration process Signed-off-by: Lonnie Liu <lonnie@anyscale.com>

…project#60175) Was recently debugging something related to this behavior and found these logs useful. --------- Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>

## Why are these changes needed? This PR adds gRPC-based inter-deployment communication for Ray Serve, allowing deployments to communicate with each other using gRPC transport instead of Ray actor calls. This can provide performance benefits in certain scenarios. ### Key Changes 1. **gRPC Server on Replicas**: Each replica now starts a gRPC server that can handle requests from other deployments. 2. **gRPC Replica Wrapper**: A new `gRPCReplicaWrapper` class handles sending requests via gRPC and processing responses. 3. **Handle Options**: The `_by_reference` option on handles controls whether to use Ray actor calls (`True`) or gRPC transport (`False`). 4. **New Environment Variables**: - `RAY_SERVE_USE_GRPC_BY_DEFAULT`: Master flag to enable gRPC transport by default for all inter-deployment communication - `RAY_SERVE_PROXY_USE_GRPC`: Controls whether the proxy uses gRPC transport (defaults to the master flag value) - `RAY_SERVE_GRPC_MAX_MESSAGE_SIZE`: Configures the maximum gRPC message size (default: 2GB-1) ## Related issue number N/A ## Checks - [x] I've signed all my commits - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a temporary testing hook, I've added it under the API Reference (Experimental) page. - [x] I've made sure the tests are passing. Note that there might be a few flaky tests. ## Test Plan - `python/ray/serve/tests/test_grpc_e2e.py` - `python/ray/serve/tests/test_grpc_replica_wrapper.py` - `python/ray/serve/tests/unit/test_grpc_replica_result.py` ## Benchmarks Script available [here](https://gist.github.com/eicherseiji/02808c32d0e377803888671da64524d1) Results show throughput/latency improvements w/ gRPC for message size < ~1MB. 
<img width="2229" height="740" alt="benchmark_plot" src="https://github.com/user-attachments/assets/e7e25f94-00b4-434d-9eff-10cd36047356" />

```
==============================================================================
gRPC vs Plasma Benchmark Results
==============================================================================
Payload   Metric         Plasma     gRPC       Δ        Winner
----------------------------------------------------------------------
1 KB      Latency p50    2.63ms     1.89ms     +28%     gRPC
          Chain p50      4.11ms     3.02ms     +26%     gRPC
          Throughput     160/s      190/s      +16%     gRPC
----------------------------------------------------------------------
10 KB     Latency p50    2.68ms     1.68ms     +37%     gRPC
          Chain p50      3.91ms     2.94ms     +25%     gRPC
          Throughput     167/s      185/s      +10%     gRPC
----------------------------------------------------------------------
100 KB    Latency p50    2.74ms     2.02ms     +26%     gRPC
          Chain p50      4.28ms     3.06ms     +28%     gRPC
          Throughput     157/s      182/s      +13%     gRPC
----------------------------------------------------------------------
500 KB    Latency p50    5.78ms     3.52ms     +39%     gRPC
          Chain p50      5.65ms     4.82ms     +15%     gRPC
          Throughput     114/s      144/s      +21%     gRPC
----------------------------------------------------------------------
1 MB      Latency p50    6.31ms     5.18ms     +18%     gRPC
          Chain p50      5.96ms     6.20ms     -4%      Plasma
          Throughput     130/s      165/s      +21%     gRPC
----------------------------------------------------------------------
2 MB      Latency p50    8.82ms     9.57ms     -9%      Plasma
          Chain p50      7.20ms     10.69ms    -48%     Plasma
          Throughput     123/s      106/s      -16%     Plasma
----------------------------------------------------------------------
5 MB      Latency p50    15.20ms    23.72ms    -56%     Plasma
          Chain p50      8.90ms     23.25ms    -161%    Plasma
          Throughput     78/s       49/s       -58%     Plasma
----------------------------------------------------------------------
10 MB     Latency p50    25.02ms    34.34ms    -37%     Plasma
          Chain p50      9.72ms     34.71ms    -257%    Plasma
          Throughput     38/s       31/s       -24%     Plasma
----------------------------------------------------------------------
```

Compared to parity implementation:

```
==============================================================================
gRPC Transport: OSS 3.0.0.dev0 vs Parity 2.53.0
==============================================================================
Payload   Metric         OSS 3.0.0.dev0   Parity 2.53.0
----------------------------------------------------------------------
1 KB      Latency p50    1.82ms           2.27ms
          Chain p50      2.95ms           2.99ms
          Throughput     268/s            272/s
----------------------------------------------------------------------
10 KB     Latency p50    1.82ms           2.05ms
          Chain p50      2.85ms           2.80ms
          Throughput     246/s            293/s
----------------------------------------------------------------------
100 KB    Latency p50    2.04ms           2.35ms
          Chain p50      3.27ms           3.12ms
          Throughput     262/s            257/s
----------------------------------------------------------------------
500 KB    Latency p50    3.67ms           3.78ms
          Chain p50      5.77ms           4.91ms
          Throughput     186/s            192/s
----------------------------------------------------------------------
1 MB      Latency p50    4.99ms           5.39ms
          Chain p50      5.95ms           6.56ms
          Throughput     177/s            156/s
----------------------------------------------------------------------
2 MB      Latency p50    7.91ms           7.37ms
          Chain p50      8.26ms           12.16ms
          Throughput     117/s            129/s
----------------------------------------------------------------------
5 MB      Latency p50    17.86ms          19.53ms
          Chain p50      22.65ms          23.85ms
          Throughput     87/s             54/s
----------------------------------------------------------------------
10 MB     Latency p50    23.79ms          27.78ms
          Chain p50      35.67ms          31.06ms
          Throughput     48/s             27/s
----------------------------------------------------------------------
Cluster: 2 worker nodes (48 CPU, 4 GPU, 192GB RAM, 54.5GB object store each)
3-trial average
```

Note: OSS 3.0.0.dev0 includes a token auth optimization (ray-project#59500) that reduces per-RPC overhead by caching auth tokens and avoiding object construction on each call. This likely explains the improved latency and throughput at larger payload sizes.

---------

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

…ject#59788)

## Description

### [Data] Disable ConcurrencyCap Backpressure policy by default

- With DownstreamCapacityBackpressurePolicy now enabled by default, disable ConcurrencyCapBackpressurePolicy by default.

Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>

…tage (ray-project#59395) Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>

…roject#60179) `ray-docs` has become a bottleneck for review. No longer requiring their approval for library documentation changes, but leaving it as a catch-all for other docs changes. Flyby: removing code ownership for removed Ray Workflows library directories. Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>

…stage config refactor (ray-project#59214) Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com> Signed-off-by: Nikhil G <nrghosh@users.noreply.github.com>

## Description Introduced in ray-project#59544, we added many optimisations for IMPALA / APPO learner. One of these optimisations is to minimise the time thread locked from the queue. Using `queue.get_nowait()` will raise an exception if the queue has no data. Therefore, we wrap the request in a try/except. Currently, when this exception occurs then we log a warning but reviewing the training logs this actually causes a massive amount of spam. This PR, therefore, removes the warning with an associated comment to explain why we use a `pass` instead. --------- Signed-off-by: Mark Towers <mark@anyscale.com> Co-authored-by: Mark Towers <mark@anyscale.com>
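The non-blocking read described above can be sketched with the standard library alone. The function name `try_get_batch` and the `None` return value are assumptions for illustration; the actual IMPALA / APPO learner code is not shown in this PR description.

```python
import queue


def try_get_batch(q):
    """Return the next item without blocking, or None if the queue is empty."""
    try:
        return q.get_nowait()
    except queue.Empty:
        # An empty queue is an expected condition on this hot path, so we
        # deliberately do not log here; a warning per miss would flood the logs.
        pass
    return None
```

A caller would typically poll in a loop and treat `None` as "nothing to do yet", for example `batch = try_get_batch(q)` followed by a check for `None`.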

Before: <img width="616" height="210" alt="Screenshot 2026-01-07 at 11 03 26 AM" src="https://github.com/user-attachments/assets/139496f4-136d-4ade-9ec5-ad788cc4f8f9" /> After: <img width="5000" height="2812" alt="ray-job-diagram" src="https://github.com/user-attachments/assets/e9c653be-3191-458d-9fa1-30329c395496" /> --------- Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>

Signed-off-by: akshay-anyscale <122416226+akshay-anyscale@users.noreply.github.com>

…ay-project#60176) No behavior changes, pure refactoring to retain my sanity. - No longer inherit from `SchedulingQueue` in `NormalTaskExecutionQueue` - Remove unnecessary methods from `SchedulingQueue` interface - `NormalSchedulingQueue` -> `NormalTaskExecutionQueue` - `ActorSchedulingQueue` -> `OrderedActorTaskExecutionQueue` - `OutOfOrderActorSchedulingQueue` -> `UnorderedActorTaskExecutionQueue` - `SchedulingQueue` -> `ActorTaskExecutionQueueInterface` - A few method/field renamings. --------- Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com> Co-authored-by: Ibrahim Rabbani <israbbani@gmail.com>

…re flag, GCS, and Raylet code. (ray-project#59979) This is 1/N in a series of PRs to remove Centralized Actor Scheduling by the GCS (introduced in ray-project#15943). The feature is off by default and no longer in use or supported. In this PR, I remove the feature flag to turn the feature on and remove related code and tests in the GCS and the Raylet. --------- Signed-off-by: irabbani <irabbani@anyscale.com> Signed-off-by: Ibrahim Rabbani <irabbani@anyscale.com>

…registry example (ray-project#60071)

Noticed ray-project#59917, but it didn't fix it.

- Linking to notebook instead of README.md
- Removing notebook from exclude_patterns

The notebook should be the single source of truth since it's what's tested and validated, so we should link to it rather than the README.md. The README.md is generated from the notebook (jupyter nbconvert) and exists only for display in the console when converting the example into an Anyscale template.

Also fixing the 404 error of the mlflow registry example by linking to the proper doc.

---------

Signed-off-by: Aydin Abiar <aydin@anyscale.com> Signed-off-by: Aydin Abiar <62435714+Aydin-ab@users.noreply.github.com> Co-authored-by: Aydin Abiar <aydin@anyscale.com>

…y-project#60161) so that it is clear that these functions are not meant to be used by other files Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>

not used anywhere any more Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>

…oject#57555)

## Why are these changes needed?

Since `0.26.0`, `uvicorn` [changed](https://uvicorn.dev/release-notes/?utm_source=chatgpt.com#0260-january-16-2024) how it processes `root_path`. To support all `uvicorn` versions, this PR injects `root_path` into the ASGI app instead of passing it to `uvicorn.Config`, starting from version `0.26.0`.

Before the change:

```
# uvicorn==0.22.0
pytest -s -v python/ray/serve/tests/test_standalone.py::test_http_root_path - pass
# uvicorn==0.40.0 - latest
pytest -s -v python/ray/serve/tests/test_standalone.py::test_http_root_path - failed
# FAILED python/ray/serve/tests/test_standalone.py::test_http_root_path - assert 404 == 200
```

After the change:

```
# uvicorn==0.22.0
pytest -s -v python/ray/serve/tests/test_standalone.py::test_http_root_path - pass
# uvicorn==0.40.0 - latest
pytest -s -v python/ray/serve/tests/test_standalone.py::test_http_root_path - pass
```

## Related issue number

Closes ray-project#55776.

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR.
- [x] I've run pre-commit jobs to lint the changes in this PR. ([pre-commit setup](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting))
- [x] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - [x] Unit tests
  - [ ] Release tests
  - [ ] This PR is not tested :(

---------

Signed-off-by: axreldable <aleksei.starikov.ax@gmail.com>
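One way to "inject `root_path` into the ASGI app" is a thin wrapper that sets the `root_path` key of the ASGI scope before delegating to the wrapped application. The sketch below uses only the ASGI calling convention and the standard library. The class `RootPathMiddleware` and the helpers `echo_root_path` and `call_once` are hypothetical names, and this is not necessarily how Ray Serve implements the fix.

```python
import asyncio


class RootPathMiddleware:
    """Hypothetical ASGI wrapper that sets scope["root_path"] for each request."""

    def __init__(self, app, root_path):
        self.app = app
        self.root_path = root_path

    async def __call__(self, scope, receive, send):
        if scope["type"] in ("http", "websocket"):
            # Copy the scope so the server's original dict is left untouched.
            scope = dict(scope, root_path=self.root_path)
        await self.app(scope, receive, send)


async def echo_root_path(scope, receive, send):
    """Tiny demo app that replies with the root_path it was given."""
    await send({"type": "http.response.start", "status": 200, "headers": []})
    await send(
        {"type": "http.response.body", "body": scope.get("root_path", "").encode()}
    )


async def call_once(app, path):
    """Drive a single fake HTTP request through an ASGI app and collect messages."""
    sent = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    scope = {"type": "http", "method": "GET", "path": path, "root_path": "", "headers": []}
    await app(scope, receive, send)
    return sent


messages = asyncio.run(call_once(RootPathMiddleware(echo_root_path, "/api"), "/hello"))
```

Because the wrapper sits inside the application rather than in server configuration, it behaves the same regardless of how a particular server version interprets its own `root_path` setting.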

@Aydin-ab

…nd Model composition for recsys examples (ray-project#59166) ## Description Adding two examples for Ray Serve as part of our workload based series: - Model multiplexing with forecasting models ✅ - Model composition for recsys (recommendation systems) ✅ Will later be published as templates in the anyscale console Lots of added/modified files but the contents to review are under the `content/` folder, everything else is related to the publishing workflow in ray docs + setting up testing in the CI author: @Aydin-ab --------- Signed-off-by: Aydin Abiar <aydin@anyscale.com> Signed-off-by: Aydin Abiar <62435714+Aydin-ab@users.noreply.github.com> Co-authored-by: Aydin Abiar <aydin@anyscale.com>

Minor cleanups on the task execution path as I muddle through it. No behavior changes. - `DependencyWaiter` -> `ActorTaskExecutionArgWaiter`. Previously, I found myself continually confused if (1) this was only for actor tasks or also normal tasks and (2) if it was on the submission or execution path. - Added `ActorTaskExecutionArgWaiterInterface` instead of having an `Impl`. - Added header comments for what the `ActorTaskExecutionArgWaiter` is doing. - `HandleTask` -> `QueueTaskForExecution`. --------- Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>

…ect#60149)

## Description

The `python redis test` is failing with high probability in testing because `test_network_partial_failures` catches a "subprocess is still running" resource warning when it expects no warnings to be thrown. Investigation showed that the existing logic that kills the redis server during test cleanup does not wait for the redis server process to die before moving on to the next test, which causes the resource warning described above.

This PR addresses the issue by adding a wait to the redis test cleanup step, ensuring that the redis server is fully cleaned up before moving on to the next test.

## Related issues

Fixes failing `test_network_partial_failures` in CI automated tests.

## Additional information

Example of the fix passing `test_network_partial_failures` in post merge: https://buildkite.com/ray-project/postmerge/builds/15416#_

---------

Signed-off-by: davik <davik@anyscale.com> Co-authored-by: davik <davik@anyscale.com>
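The kill-then-wait pattern can be sketched with `subprocess` alone. The helper name `kill_and_wait` and the timeout value are assumptions; the child process here is a sleeping Python interpreter standing in for the redis server.

```python
import subprocess
import sys


def kill_and_wait(proc, timeout=10.0):
    """Kill a child process and block until it has actually exited.

    Calling kill() alone only sends the signal. Without wait(), the Popen
    object can be garbage collected while the child is still running, and
    CPython then emits a "subprocess ... is still running" ResourceWarning.
    """
    proc.kill()
    return proc.wait(timeout=timeout)


# Stand-in for a long-running server process.
server = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
exit_code = kill_and_wait(server)
```

After `kill_and_wait` returns, `server.poll()` reports the exit code instead of `None`, so the process is known to be gone before the next test starts.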

…project#60173)

## Description

Due to Python's operator precedence (+ binds tighter than if-else):

```python
result = [1, 2, 3] + [4] if False else []
print(f"result = {result}")
# result = []
```

`PSUTIL_PROCESS_ATTRS` on Windows actually returns an empty list, but our intention is to only exclude `num_fds`. This PR fixes it.

Signed-off-by: yicheng <yicheng@anyscale.com> Co-authored-by: yicheng <yicheng@anyscale.com>
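The precedence issue and one way to fix it can be shown side by side. The attribute names in `BASE_ATTRS` and both function names are illustrative assumptions; only `num_fds` and the Windows condition come from the description above.

```python
# Illustrative subset of attribute names, not Ray's real list.
BASE_ATTRS = ["pid", "name", "cpu_percent"]


def attrs_buggy(is_windows):
    # Parsed as: (BASE_ATTRS + ["num_fds"]) if not is_windows else []
    # so on Windows the whole list collapses to [].
    return BASE_ATTRS + ["num_fds"] if not is_windows else []


def attrs_fixed(is_windows):
    # Parentheses restrict the conditional to the optional suffix only.
    return BASE_ATTRS + (["num_fds"] if not is_windows else [])
```

On non-Windows platforms both versions agree; they differ only when `is_windows` is true, which is why the bug is easy to miss on Linux CI.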

…c callback during shutdown (ray-project#60048)

## Description

When a Ray worker process shuts down (e.g., during `ray.shutdown()` or node termination), the OpenTelemetry `PeriodicExportingMetricReader`'s background thread may still be invoking the gauge callback (`_DoubleGaugeCallback`), which then accesses already-destroyed member data, resulting in a use-after-free crash.

The error message:

```
(bundle_reservation_check_func pid=1543823) pure virtual method called
(bundle_reservation_check_func pid=1543823) __cxa_deleted_virtual
```

I looked further into this, and ideally, at the OpenTelemetry code level, shutdown should be handled correctly. [PeriodicExportingMetricReader's shutdown](https://github.com/open-telemetry/opentelemetry-cpp/blob/f33dcc07c56c7e3b18fd18e13986f0eda965d116/sdk/src/metrics/export/periodic_exporting_metric_reader.cc#L292-L299) waits for `worker_thread_` to finish.

```cpp
bool PeriodicExportingMetricReader::OnShutDown(std::chrono::microseconds timeout) noexcept
{
  if (worker_thread_.joinable())
  {
    cv_.notify_all();
    worker_thread_.join();
  }
  return exporter_->Shutdown(timeout);
}
```

The callback (`worker_thread_`) runs in a [while (IsShutdown() != true)](https://github.com/open-telemetry/opentelemetry-cpp/blob/main/sdk/src/metrics/export/periodic_exporting_metric_reader.cc#L147) loop. Therefore, there should be no use-after-free race condition at the OpenTelemetry code level, and it should be safe to call `meter_provider_->Shutdown()`.

However, the issue is that the last callback appears to access member data that has already been destroyed during ForceFlush, which is called before Shutdown. This member data belongs to the OpenTelemetry SDK itself.

The more I look into it, the more it feels like this is actually a bug in the OpenTelemetry SDK. And even further, I found this: [SDK] Use shared_ptr internally for AttributesProcessor to prevent use-after-free (open-telemetry/opentelemetry-cpp#3457), which is exactly the issue I encountered!

This PR upgrades the OpenTelemetry C++ SDK version to include this fix.

## Additional information

It is quite easy to reproduce. For example, manually run `test_placement_group_reschedule_node_dead` in `python/ray/autoscaler/v2/tests/test_e2e.py`:

```
(docs) ubuntu@devbox:~/ray$ pkill -9 -f raylet 2>/dev/null || true; pkill -9 -f gcs_server 2>/dev/null || true; ray stop --force 2>/dev/null || true; sleep 2
Did not find any active Ray processes.
(docs) ubuntu@devbox:~/ray$ timeout 180 python -m pytest python/ray/autoscaler/v2/tests/test_e2e.py::test_placement_group_reschedule_node_dead -xvs 2>&1 | tee /tmp/test_otel.txt; echo "EXIT CODE: $?"
............
__cxa_deleted_virtual
opentelemetry::v1::sdk::metrics::FilteredOrderedAttributeMap::FilteredOrderedAttributeMap()::{lambda()#1}::operator()()
opentelemetry::v1::nostd::function_ref<>::BindTo<>()::{lambda()#1}::operator()()
opentelemetry::v1::sdk::metrics::ObserverResultT<>::Observe()
opentelemetry::v1::metrics::ObserverResultT<>::Observe<>()
ray::observability::OpenTelemetryMetricRecorder::CollectGaugeMetricValues()
(anonymous namespace)::_DoubleGaugeCallback()
opentelemetry::v1::sdk::metrics::ObservableRegistry::Observe()
opentelemetry::v1::sdk::metrics::Meter::Collect()
opentelemetry::v1::sdk::metrics::MetricCollector::Produce()
opentelemetry::v1::sdk::metrics::MetricReader::Collect()
opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()
std::thread::_State_impl<>::_M_run()
............
```

After this PR, the error message no longer appears:

```
(docs) ubuntu@devbox:~/ray$ timeout 180 python -m pytest python/ray/autoscaler/v2/tests/test_e2e.py::test_placement_group_reschedule_node_dead -xvs 2>&1 | tee /tmp/test_otel.txt; echo "EXIT CODE: $?"
============================= test session starts ============================== platform linux -- Python 3.12.12, pytest-9.0.2, pluggy-1.6.0 -- /home/ubuntu/.conda/envs/docs/bin/python cachedir: .pytest_cache rootdir: /home/ubuntu/ray configfile: pytest.ini plugins: asyncio-1.3.0, anyio-4.11.0 asyncio: mode=Mode.STRICT, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function collecting ... collected 2 items python/ray/autoscaler/v2/tests/test_e2e.py::test_placement_group_reschedule_node_dead[v1] Did not find any active Ray processes. Usage stats collection is enabled. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details. Local node IP: 172.31.5.171 -------------------- Ray runtime started. -------------------- Next steps To add another node to this Ray cluster, run ray start --address='172.31.5.171:6379' To connect to this Ray cluster: import ray ray.init() To submit a Ray job using the Ray Jobs CLI: RAY_API_SERVER_ADDRESS='http://127.0.0.1:8265' ray job submit --working-dir . -- python my_script.py See https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html for more information on submitting Ray jobs to the Ray cluster. To terminate the Ray runtime, run ray stop To view the status of the cluster, use ray status To monitor and debug Ray, view the dashboard at 127.0.0.1:8265 If connection to the dashboard fails, check your firewall settings and network configuration. 2026-01-12 12:30:00,347 INFO worker.py:1826 -- Connecting to existing Ray cluster at address: 172.31.5.171:6379... 2026-01-12 12:30:00,385 INFO worker.py:2006 -- Connected to Ray cluster. View the dashboard at http://127.0.0.1:8265 (autoscaler +11s) Tip: use `ray status` to view detailed cluster status. 
To disable these messages, set RAY_SCHEDULER_EVENTS=0. (autoscaler +11s) Resized to 0 CPUs. (autoscaler +12s) Resized to 0 CPUs. (autoscaler +14s) Resized to 0 CPUs. (autoscaler +15s) Resized to 0 CPUs. (autoscaler +15s) Adding 1 node(s) of type type-1. (autoscaler +15s) Adding 1 node(s) of type type-2. (autoscaler +15s) Adding 1 node(s) of type type-3. (autoscaler +15s) Adding 1 node(s) of type type-1. (autoscaler +15s) Adding 1 node(s) of type type-2. (autoscaler +15s) Adding 1 node(s) of type type-3. (autoscaler +15s) Adding 1 node(s) of type type-1. (autoscaler +15s) Adding 1 node(s) of type type-2. (autoscaler +15s) Adding 1 node(s) of type type-3. (autoscaler +15s) Adding 1 node(s) of type type-1. (autoscaler +15s) Adding 1 node(s) of type type-2. (autoscaler +15s) Adding 1 node(s) of type type-3. (autoscaler +16s) Resized to 0 CPUs. (autoscaler +16s) Adding 1 node(s) of type type-1. (autoscaler +16s) Adding 1 node(s) of type type-2. (autoscaler +16s) Adding 1 node(s) of type type-3. Killing pids 1566233 (raylet) Raylet is terminated. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access from Raylet causing SIGSEGV or SIGBUS, (3) Other termination signals. 
Last 20 lines of the Raylet logs: [state-dump] ray::rpc::InternalKVGcsService.grpc_client.GetInternalConfig.OnReplyReceived - 1 total (0 active), Execution time: mean = 880.39ms, total = 880.39ms, Queueing time: mean = 0.06ms, max = 0.06ms, min = 0.06ms, total = 0.06ms [state-dump] ClusterResourceManager.ResetRemoteNodeView - 1 total (1 active), Execution time: mean = 0.00ms, total = 0.00ms, Queueing time: mean = 0.00ms, max = -0.00ms, min = 9223372036854.78ms, total = 0.00ms [state-dump] DebugString() time ms: 1 [state-dump] [state-dump] [2026-01-12 12:29:59,875 I 1565894 1565894] (raylet) accessor.cc:436: Received address and liveness notification for node, IsAlive = 1 node_id=fffffffffffffffffffffffffffffffffffffffffffffffffff00000 [2026-01-12 12:30:00,447 I 1565894 1565917] (raylet) object_store.cc:37: Object store current usage 8e-09 / 27.3914 GB. [2026-01-12 12:30:00,453 I 1565894 1565894] (raylet) worker_pool.cc:733: Job 01000000 already started in worker pool. [2026-01-12 12:30:02,834 I 1565894 1565894] (raylet) accessor.cc:436: Received address and liveness notification for node, IsAlive = 1 node_id=fffffffffffffffffffffffffffffffffffffffffffffffffff00001 [2026-01-12 12:30:02,851 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 0, dropped message version: 0 [2026-01-12 12:30:03,995 I 1565894 1565894] (raylet) accessor.cc:436: Received address and liveness notification for node, IsAlive = 1 node_id=fffffffffffffffffffffffffffffffffffffffffffffffffff00002 [2026-01-12 12:30:04,012 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. 
latest version: 0, dropped message version: 0 [2026-01-12 12:30:05,178 I 1565894 1565894] (raylet) accessor.cc:436: Received address and liveness notification for node, IsAlive = 1 node_id=fffffffffffffffffffffffffffffffffffffffffffffffffff00003 [2026-01-12 12:30:05,197 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 1, dropped message version: 1 [2026-01-12 12:30:05,215 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 5, dropped message version: 5 [2026-01-12 12:30:05,254 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 5, dropped message version: 5 [2026-01-12 12:30:05,297 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 5, dropped message version: 5 [2026-01-12 12:30:05,315 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 6, dropped message version: 6 [2026-01-12 12:30:05,716 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 7, dropped message version: 7 [2026-01-12 12:30:05,817 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 9, dropped message version: 9 (autoscaler +17s) Adding 1 node(s) of type type-3. (autoscaler +17s) Adding 1 node(s) of type type-3. (autoscaler +17s) Adding 1 node(s) of type type-3. (autoscaler +17s) Adding 1 node(s) of type type-3. (autoscaler +17s) Adding 1 node(s) of type type-3. (autoscaler +24s) Removing 1 nodes of type type-3 (idle). (autoscaler +24s) Removing 1 nodes of type type-3 (idle). (autoscaler +24s) Removing 1 nodes of type type-3 (idle). (autoscaler +24s) Removing 1 nodes of type type-3 (idle). 
(raylet) The node with node id: fffffffffffffffffffffffffffffffffffffffffffffffffff00001 and address: 172.31.5.171 and node name: 172.31.5.171 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a (1) raylet crashes unexpectedly (OOM, etc.) (2) raylet has lagging heartbeats due to slow network or busy workload. (raylet) Raylet is terminated. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access from Raylet causing SIGSEGV or SIGBUS, (3) Other termination signals. Last 20 lines of the Raylet logs: [state-dump] ray::rpc::InternalKVGcsService.grpc_client.GetInternalConfig.OnReplyReceived - 1 total (0 active), Execution time: mean = 880.39ms, total = 880.39ms, Queueing time: mean = 0.06ms, max = 0.06ms, min = 0.06ms, total = 0.06ms [state-dump] ClusterResourceManager.ResetRemoteNodeView - 1 total (1 active), Execution time: mean = 0.00ms, total = 0.00ms, Queueing time: mean = 0.00ms, max = -0.00ms, min = 9223372036854.78ms, total = 0.00ms [state-dump] DebugString() time ms: 1 [state-dump] [state-dump] [2026-01-12 12:29:59,875 I 1565894 1565894] (raylet) accessor.cc:436: Received address and liveness notification for node, IsAlive = 1 node_id=fffffffffffffffffffffffffffffffffffffffffffffffffff00000 [2026-01-12 12:30:00,447 I 1565894 1565917] (raylet) object_store.cc:37: Object store current usage 8e-09 / 27.3914 GB. [2026-01-12 12:30:00,453 I 1565894 1565894] (raylet) worker_pool.cc:733: Job 01000000 already started in worker pool. [2026-01-12 12:30:02,834 I 1565894 1565894] (raylet) accessor.cc:436: Received address and liveness notification for node, IsAlive = 1 node_id=fffffffffffffffffffffffffffffffffffffffffffffffffff00001 [2026-01-12 12:30:02,851 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. 
latest version: 0, dropped message version: 0 [2026-01-12 12:30:03,995 I 1565894 1565894] (raylet) accessor.cc:436: Received address and liveness notification for node, IsAlive = 1 node_id=fffffffffffffffffffffffffffffffffffffffffffffffffff00002 [2026-01-12 12:30:04,012 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 0, dropped message version: 0 [2026-01-12 12:30:05,178 I 1565894 1565894] (raylet) accessor.cc:436: Received address and liveness notification for node, IsAlive = 1 node_id=fffffffffffffffffffffffffffffffffffffffffffffffffff00003 [2026-01-12 12:30:05,197 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 1, dropped message version: 1 [2026-01-12 12:30:05,215 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 5, dropped message version: 5 [2026-01-12 12:30:05,254 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 5, dropped message version: 5 [2026-01-12 12:30:05,297 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 5, dropped message version: 5 [2026-01-12 12:30:05,315 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 6, dropped message version: 6 [2026-01-12 12:30:05,716 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 7, dropped message version: 7 [2026-01-12 12:30:05,817 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 9, dropped message version: 9 Stopped all 10 Ray processes. (autoscaler +32s) Resized to 0 CPUs. (autoscaler +32s) Adding 1 node(s) of type type-1. 
(autoscaler +32s) Adding 1 node(s) of type type-2. (autoscaler +32s) Adding 1 node(s) of type type-3. (autoscaler +32s) Adding 1 node(s) of type type-3. (autoscaler +32s) Removing 1 nodes of type type-3 (idle). PASSED python/ray/autoscaler/v2/tests/test_e2e.py::test_placement_group_reschedule_node_dead[v2] Did not find any active Ray processes. Usage stats collection is enabled. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details. Local node IP: 172.31.5.171 -------------------- Ray runtime started. -------------------- Next steps To add another node to this Ray cluster, run ray start --address='172.31.5.171:6379' To connect to this Ray cluster: import ray ray.init() To submit a Ray job using the Ray Jobs CLI: RAY_API_SERVER_ADDRESS='http://127.0.0.1:8265' ray job submit --working-dir . -- python my_script.py See https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html for more information on submitting Ray jobs to the Ray cluster. To terminate the Ray runtime, run ray stop To view the status of the cluster, use ray status To monitor and debug Ray, view the dashboard at 127.0.0.1:8265 If connection to the dashboard fails, check your firewall settings and network configuration. 2026-01-12 12:30:40,170 INFO worker.py:1826 -- Connecting to existing Ray cluster at address: 172.31.5.171:6379... 2026-01-12 12:30:40,202 INFO worker.py:2006 -- Connected to Ray cluster. View the dashboard at http://127.0.0.1:8265 Stopped only 9 out of 12 Ray processes within the grace period 16 seconds. Set `-v` to see more details. 
Remaining processes [psutil.Process(pid=1569612, name='raylet', status='terminated'), psutil.Process(pid=1569160, name='raylet', status='terminated'), psutil.Process(pid=1568952, name='raylet', status='terminated')] will be forcefully terminated. You can also use `--force` to forcefully terminate processes or set higher `--grace-period` to wait longer time for proper termination. Killing pids 1568744 (raylet) Raylet is terminated. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access from Raylet causing SIGSEGV or SIGBUS, (3) Other termination signals. Last 20 lines of the Raylet logs: [state-dump] NodeManager.deadline_timer.spill_objects_when_over_threshold - 1 total (1 active), Execution time: mean = 0.00ms, total = 0.00ms, Queueing time: mean = 0.00ms, max = -0.00ms, min = 9223372036854.78ms, total = 0.00ms [state-dump] DebugString() time ms: 0 [state-dump] [state-dump] [2026-01-12 12:30:39,701 I 1568506 1568506] (raylet) accessor.cc:436: Received address and liveness notification for node, IsAlive = 1 node_id=fffffffffffffffffffffffffffffffffffffffffffffffffff00000 [2026-01-12 12:30:40,257 I 1568506 1568529] (raylet) object_store.cc:37: Object store current usage 8e-09 / 27.3852 GB. [2026-01-12 12:30:40,262 I 1568506 1568506] (raylet) worker_pool.cc:733: Job 01000000 already started in worker pool. [2026-01-12 12:30:41,697 I 1568506 1568506] (raylet) accessor.cc:436: Received address and liveness notification for node, IsAlive = 1 node_id=fffffffffffffffffffffffffffffffffffffffffffffffffff00001 [2026-01-12 12:30:41,714 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. 
latest version: 0, dropped message version: 0 [2026-01-12 12:30:42,858 I 1568506 1568506] (raylet) accessor.cc:436: Received address and liveness notification for node, IsAlive = 1 node_id=fffffffffffffffffffffffffffffffffffffffffffffffffff00002 [2026-01-12 12:30:42,876 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 0, dropped message version: 0 [2026-01-12 12:30:44,050 I 1568506 1568506] (raylet) accessor.cc:436: Received address and liveness notification for node, IsAlive = 1 node_id=fffffffffffffffffffffffffffffffffffffffffffffffffff00003 [2026-01-12 12:30:44,073 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 0, dropped message version: 0 [2026-01-12 12:30:45,018 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 1, dropped message version: 1 [2026-01-12 12:30:45,076 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 5, dropped message version: 5 [2026-01-12 12:30:45,079 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 5, dropped message version: 5 [2026-01-12 12:30:45,119 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 5, dropped message version: 5 [2026-01-12 12:30:45,177 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 6, dropped message version: 6 [2026-01-12 12:30:45,578 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 7, dropped message version: 7 [2026-01-12 12:30:45,679 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. 
latest version: 9, dropped message version: 9 (raylet) The node with node id: fffffffffffffffffffffffffffffffffffffffffffffffffff00001 and address: 172.31.5.171 and node name: 172.31.5.171 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a (1) raylet crashes unexpectedly (OOM, etc.) (2) raylet has lagging heartbeats due to slow network or busy workload. (raylet) Raylet is terminated. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access from Raylet causing SIGSEGV or SIGBUS, (3) Other termination signals. Last 20 lines of the Raylet logs: [state-dump] NodeManager.deadline_timer.spill_objects_when_over_threshold - 1 total (1 active), Execution time: mean = 0.00ms, total = 0.00ms, Queueing time: mean = 0.00ms, max = -0.00ms, min = 9223372036854.78ms, total = 0.00ms [state-dump] DebugString() time ms: 0 [state-dump] [state-dump] [2026-01-12 12:30:39,701 I 1568506 1568506] (raylet) accessor.cc:436: Received address and liveness notification for node, IsAlive = 1 node_id=fffffffffffffffffffffffffffffffffffffffffffffffffff00000 [2026-01-12 12:30:40,257 I 1568506 1568529] (raylet) object_store.cc:37: Object store current usage 8e-09 / 27.3852 GB. [2026-01-12 12:30:40,262 I 1568506 1568506] (raylet) worker_pool.cc:733: Job 01000000 already started in worker pool. [2026-01-12 12:30:41,697 I 1568506 1568506] (raylet) accessor.cc:436: Received address and liveness notification for node, IsAlive = 1 node_id=fffffffffffffffffffffffffffffffffffffffffffffffffff00001 [2026-01-12 12:30:41,714 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. 
latest version: 0, dropped message version: 0 [2026-01-12 12:30:42,858 I 1568506 1568506] (raylet) accessor.cc:436: Received address and liveness notification for node, IsAlive = 1 node_id=fffffffffffffffffffffffffffffffffffffffffffffffffff00002 [2026-01-12 12:30:42,876 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 0, dropped message version: 0 [2026-01-12 12:30:44,050 I 1568506 1568506] (raylet) accessor.cc:436: Received address and liveness notification for node, IsAlive = 1 node_id=fffffffffffffffffffffffffffffffffffffffffffffffffff00003 [2026-01-12 12:30:44,073 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 0, dropped message version: 0 [2026-01-12 12:30:45,018 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 1, dropped message version: 1 [2026-01-12 12:30:45,076 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 5, dropped message version: 5 [2026-01-12 12:30:45,079 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 5, dropped message version: 5 [2026-01-12 12:30:45,119 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 5, dropped message version: 5 [2026-01-12 12:30:45,177 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 6, dropped message version: 6 [2026-01-12 12:30:45,578 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 7, dropped message version: 7 [2026-01-12 12:30:45,679 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. 
latest version: 9, dropped message version: 9
PASSED
========================= 2 passed in 80.90s (0:01:20) =========================
EXIT CODE: 0
(docs) ubuntu@devbox:~/ray$
```

Signed-off-by: yicheng <yicheng@anyscale.com>
Co-authored-by: yicheng <yicheng@anyscale.com>

gemini-code-assist

Code Review

This pull request is a large automated merge from master to main, encompassing a wide range of changes. The most significant updates are a major refactoring of the CI/CD pipelines, particularly for wheel building and testing, and extensive improvements to the documentation. Key changes include the removal of Python 3.9 support from CI, the introduction of a new rayci rules system, and updates to many documentation pages for clarity, correctness, and new features. Overall, these changes appear to be positive, improving maintainability and user-facing documentation. I have one suggestion regarding a new CI rule that could impact performance.

gemini-code-assist · 2026-01-16T03:23:37Z

.buildkite/test.rules.txt

+*
+@ ml tune train data serve
+@ core_cpp cpp java python doc
+@ linux_wheels macos_wheels dashboard tools release_tests
+;


This wildcard rule * at the end of the file applies a very broad set of tags to any file change. This appears to defeat the purpose of the conditional testing rules defined earlier, as it may trigger a large number of unrelated tests for any modification, potentially increasing CI time and costs. If this catch-all is not intentional, I recommend removing it to restore more targeted test execution.
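Whether the catch-all is a problem depends on how the rules file is evaluated, which this thread does not state. The sketch below is a minimal illustration under assumptions, not the actual rayci implementation: it assumes each rule is a block of glob patterns followed by `@ tag` lines and closed by `;` (as in the snippet above), and it compares two hypothetical evaluation modes. The `python/ray/data/*` rule and the helper names `parse_rules` and `tags_for` are invented for the example.

```python
from fnmatch import fnmatch


def parse_rules(text):
    """Parse blocks of glob patterns and '@ tag ...' lines, each closed by ';'."""
    rules, patterns, tags = [], [], []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == ";":
            rules.append((patterns, tags))
            patterns, tags = [], []
        elif line.startswith("@"):
            tags.extend(line[1:].split())
        else:
            patterns.append(line)
    return rules


def tags_for(path, rules, first_match_only):
    """Collect tags for a changed file under one of two assumed semantics."""
    selected = set()
    for patterns, tags in rules:
        if any(fnmatch(path, pattern) for pattern in patterns):
            selected.update(tags)
            if first_match_only:
                break  # assumed mode: later rules, including '*', are skipped
    return selected


# Hypothetical rules file: one specific rule, then a trailing catch-all.
RULES = """
python/ray/data/*
@ data
;
*
@ ml tune train data serve
@ core_cpp cpp java python doc
;
"""

rules = parse_rules(RULES)
changed = "python/ray/data/dataset.py"
fallback_only = tags_for(changed, rules, first_match_only=True)
cumulative = tags_for(changed, rules, first_match_only=False)
print(sorted(fallback_only))
print(len(cumulative))
```

Under first-match evaluation, the trailing `*` acts only as a fallback for files that no earlier rule covers, so targeted testing is preserved. Under cumulative evaluation, every change picks up the full tag set, which is the cost concern raised in the review.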

github-actions · 2026-01-30T13:30:24Z

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

github-actions · 2026-02-13T13:36:39Z

This pull request has been automatically closed because there has been no more activity in the 14 days
since being marked stale.

Please feel free to reopen or open a new pull request if you'd still like this to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public slack channel.

Thanks again for your contribution!

aslonnie and others added 30 commits December 29, 2025 13:58

[deps] remove scikit-image (ray-project#59743)

24c08d5

not used by rllib any more Signed-off-by: Lonnie Liu <lonnie@anyscale.com>

[release test] remove unused FileManager abstract class (ray-project#…

4d9938e

…59746) shortcut to JobFileManager directly the original `upload()` and `download()` methods are not used anywhere any more Signed-off-by: Lonnie Liu <lonnie@anyscale.com>

[release test] remove unused --no-terminate flag (ray-project#59758)

c381d05

old flag that is not used anywhere now Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>

[Data] Remove obsolete _DatasetWrapper (ray-project#59310)

875132e

Fixes ray-project#59300 Co-authored-by: Balaji Veeramani <balaji@anyscale.com>

[release test] remove unused old JobManager (ray-project#59761)

bc99f28

not used anywhere anymore; pure dead code. Signed-off-by: Lonnie Liu <lonnie@anyscale.com>

[release test] remove FullClusterManager (ray-project#59759)

34e68ad

not used any more. cluster lifecycle in release tests are managed by Anyscale or kuberayportal today also removes unused exceptions and error codes. Signed-off-by: Lonnie Liu <lonnie@anyscale.com>

[release test] refactor release test step generation (ray-project#59755)

84d219d

- upgrade docker plugin - use buildkite commands, rather than plugin commands - add set shell to bash - move `python:3.10` image as default image Signed-off-by: Lonnie Liu <lonnie@anyscale.com>

[release test] move testing of test result struct into test_result.py (…

f5440d9

…ray-project#59752) stop testing simple error handling with mocking Signed-off-by: Lonnie Liu <lonnie@anyscale.com>

[release test] remove --cluster-id and --cluster-env-id flags (ra…

f2c6825

…y-project#59766) they are never used, do not apply to kuberay, and will not work with new anyscale SDK/CLI. Signed-off-by: Lonnie Liu <lonnie@anyscale.com>

[ci] removing python 3.9 buildkite jobs & wanda files (ray-project#59769

5acd1ea

) python 3.9 clean up: removing **base test 3.9** and **base gpu 3.9** wanda files and buildkite jobs --------- Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>

[serve] deflake windows test (ray-project#59771)

4aa061c

test_controller was timing out ``` [2025-12-30T19:04:59Z] //python/ray/serve/tests:test_controller TIMEOUT in 3 out of 3 in 60.1s ``` Signed-off-by: abrar <abrar@anyscale.com>

[deps] update anyscale CLI/SDK (ray-project#59770)

b774927

to 0.26.74, the last release before ray summit 2025 Signed-off-by: Lonnie Liu <lonnie@anyscale.com>

[release test] migrate lookup functions to new anyscale sdk (ray-proj…

c4d50df

…ect#59772) kicking off the anyscale sdk migration process Signed-off-by: Lonnie Liu <lonnie@anyscale.com>

edoakes and others added 20 commits January 15, 2026 09:13

[core] Add debugging logs related to pinned argument size limit (ray-…

72b8951

…project#60175) Was recently debugging something related to this behavior and found these logs useful. --------- Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>

[Data][LLM] Add should_continue_on_error support for ServeDeploymentS…

b3a50c2

…tage (ray-project#59395) Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>

[docs][data][llm] Batch inference docs reorg + update to reflect per-…

29af75c

…stage config refactor (ray-project#59214) Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com> Signed-off-by: Nikhil G <nrghosh@users.noreply.github.com>

[release test] add underscore in private functions in template.py (ra…

40960a9

…y-project#60161) so that it is clear that these functions are not meant to be used by other files Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>

[release test] remove old release test init (ray-project#60156)

54a2a87

not used anywhere any more Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>

antfin-oss requested review from SongGuyang and kfstorm as code owners January 16, 2026 03:20

antfin-oss added auto-generated daily-merge labels Jan 16, 2026

antfin-oss assigned ffbin Jan 16, 2026

gemini-code-assist bot reviewed Jan 16, 2026

github-actions bot added the stale label Jan 30, 2026

github-actions bot closed this Feb 13, 2026

antfin-oss wants to merge 357 commits into main from create-pull-request/patch-c9ff1647c8

20 participants
