🔄 daily merge: master → main 2026-02-06 #763

Open

antfin-oss wants to merge 583 commits into main from create-pull-request/patch-42e69df4d0

Conversation


antfin-oss commented Feb 4, 2026

This Pull Request was created automatically to merge the latest changes from master into the main branch.

📅 Created: 2026-02-06
🔀 Merge direction: master → main
🤖 Triggered by: Scheduled

Please review and merge if everything looks good.

aslonnie and others added 30 commits January 14, 2026 16:57
…ct#60146)

so that generic changes to ray python code will not trigger expensive
GPU tests

also separates out min-build as its own tag.

Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Uses existing auth, allows sourcing+testing crane via existing Bazel
libraries

Signed-off-by: andrew <andrew@anyscale.com>
Co-authored-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
…#60155)

updating incorrect path for ray llm requirements file

Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
## Description
Add redundant shuffle fusion rules by dropping the 1st shuffle
- Repartition -> Aggregate
- StreamingRepartition -> Repartition
- Repartition -> StreamingRepartition
- Sort -> Sort

## Related issues
None

## Additional information
None

---------

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
…project#60175)

Was recently debugging something related to this behavior and found
these logs useful.

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
## Why are these changes needed?

This PR adds gRPC-based inter-deployment communication for Ray Serve,
allowing deployments to communicate with each other using gRPC transport
instead of Ray actor calls. This can provide performance benefits in
certain scenarios.

### Key Changes

1. **gRPC Server on Replicas**: Each replica now starts a gRPC server
that can handle requests from other deployments.

2. **gRPC Replica Wrapper**: A new `gRPCReplicaWrapper` class handles
sending requests via gRPC and processing responses.

3. **Handle Options**: The `_by_reference` option on handles controls
whether to use Ray actor calls (`True`) or gRPC transport (`False`).

4. **New Environment Variables**:
- `RAY_SERVE_USE_GRPC_BY_DEFAULT`: Master flag to enable gRPC transport
by default for all inter-deployment communication
- `RAY_SERVE_PROXY_USE_GRPC`: Controls whether the proxy uses gRPC
transport (defaults to the master flag value)
- `RAY_SERVE_GRPC_MAX_MESSAGE_SIZE`: Configures the maximum gRPC message
size (default: 2GB-1)
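
A minimal usage sketch, assuming the `_by_reference` handle option and the
environment variables behave as described above (the deployments themselves are
illustrative):

```python
import os

# Master flag from this PR: route inter-deployment calls over gRPC by default.
os.environ["RAY_SERVE_USE_GRPC_BY_DEFAULT"] = "1"

from ray import serve


@serve.deployment
class Downstream:
    def __call__(self, x: int) -> int:
        return x * 2


@serve.deployment
class Upstream:
    def __init__(self, downstream):
        # Per-handle override: False -> gRPC transport, True -> Ray actor calls.
        self._handle = downstream.options(_by_reference=False)

    async def __call__(self, x: int) -> int:
        return await self._handle.remote(x)


app = Upstream.bind(Downstream.bind())
```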

## Related issue number

N/A

## Checks

- [x] I've signed all my commits
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
        temporary testing hook, I've added it under the API Reference
        (Experimental) page.
- [x] I've made sure the tests are passing. Note that there might be a
few flaky tests.

## Test Plan

- `python/ray/serve/tests/test_grpc_e2e.py`
- `python/ray/serve/tests/test_grpc_replica_wrapper.py`
- `python/ray/serve/tests/unit/test_grpc_replica_result.py`

## Benchmarks
Script available
[here](https://gist.github.com/eicherseiji/02808c32d0e377803888671da64524d1)

Results show throughput/latency improvements w/ gRPC for message size <
~1MB.

<img width="2229" height="740" alt="benchmark_plot"
src="https://github.com/user-attachments/assets/e7e25f94-00b4-434d-9eff-10cd36047356"
/>

```
  ⎿  ==============================================================================
       gRPC vs Plasma Benchmark Results
     ==============================================================================

       Payload  Metric               Plasma         gRPC          Ξ”     Winner
       ----------------------------------------------------------------------
       1 KB     Latency p50          2.63ms       1.89ms       +28%       gRPC
                Chain p50            4.11ms       3.02ms       +26%       gRPC
                Throughput            160/s        190/s       +16%       gRPC
       ----------------------------------------------------------------------
       10 KB    Latency p50          2.68ms       1.68ms       +37%       gRPC
                Chain p50            3.91ms       2.94ms       +25%       gRPC
                Throughput            167/s        185/s       +10%       gRPC
       ----------------------------------------------------------------------
       100 KB   Latency p50          2.74ms       2.02ms       +26%       gRPC
                Chain p50            4.28ms       3.06ms       +28%       gRPC
                Throughput            157/s        182/s       +13%       gRPC
       ----------------------------------------------------------------------
       500 KB   Latency p50          5.78ms       3.52ms       +39%       gRPC
                Chain p50            5.65ms       4.82ms       +15%       gRPC
                Throughput            114/s        144/s       +21%       gRPC
       ----------------------------------------------------------------------
       1 MB     Latency p50          6.31ms       5.18ms       +18%       gRPC
                Chain p50            5.96ms       6.20ms        -4%     Plasma
                Throughput            130/s        165/s       +21%       gRPC
       ----------------------------------------------------------------------
       2 MB     Latency p50          8.82ms       9.57ms        -9%     Plasma
                Chain p50            7.20ms      10.69ms       -48%     Plasma
                Throughput            123/s        106/s       -16%     Plasma
       ----------------------------------------------------------------------
       5 MB     Latency p50         15.20ms      23.72ms       -56%     Plasma
                Chain p50            8.90ms      23.25ms      -161%     Plasma
                Throughput             78/s         49/s       -58%     Plasma
       ----------------------------------------------------------------------
       10 MB    Latency p50         25.02ms      34.34ms       -37%     Plasma
                Chain p50            9.72ms      34.71ms      -257%     Plasma
                Throughput             38/s         31/s       -24%     Plasma
       ----------------------------------------------------------------------
```

Compared to parity implementation:
```
==============================================================================
  gRPC Transport: OSS 3.0.0.dev0 vs Parity 2.53.0
==============================================================================

  Payload  Metric              OSS 3.0.0.dev0  Parity 2.53.0
  ----------------------------------------------------------------------
  1 KB     Latency p50              1.82ms           2.27ms
           Chain p50                2.95ms           2.99ms
           Throughput                268/s            272/s
  ----------------------------------------------------------------------
  10 KB    Latency p50              1.82ms           2.05ms
           Chain p50                2.85ms           2.80ms
           Throughput                246/s            293/s
  ----------------------------------------------------------------------
  100 KB   Latency p50              2.04ms           2.35ms
           Chain p50                3.27ms           3.12ms
           Throughput                262/s            257/s
  ----------------------------------------------------------------------
  500 KB   Latency p50              3.67ms           3.78ms
           Chain p50                5.77ms           4.91ms
           Throughput                186/s            192/s
  ----------------------------------------------------------------------
  1 MB     Latency p50              4.99ms           5.39ms
           Chain p50                5.95ms           6.56ms
           Throughput                177/s            156/s
  ----------------------------------------------------------------------
  2 MB     Latency p50              7.91ms           7.37ms
           Chain p50                8.26ms          12.16ms
           Throughput                117/s            129/s
  ----------------------------------------------------------------------
  5 MB     Latency p50             17.86ms          19.53ms
           Chain p50               22.65ms          23.85ms
           Throughput                 87/s             54/s
  ----------------------------------------------------------------------
  10 MB    Latency p50             23.79ms          27.78ms
           Chain p50               35.67ms          31.06ms
           Throughput                 48/s             27/s
  ----------------------------------------------------------------------

  Cluster: 2 worker nodes (48 CPU, 4 GPU, 192GB RAM, 54.5GB object store each)
  3-trial average
```
Note: OSS 3.0.0.dev0 includes a token auth optimization (ray-project#59500) that
reduces per-RPC overhead by caching auth tokens and avoiding object
construction on each call. This likely explains the improved latency and
throughput at larger payload sizes.

---------

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
…ject#59788)



### [Data] Disable ConcurrencyCap Backpressure policy by default

- With DownstreamCapacityBackpressurePolicy now enabled by default,
disable ConcurrencyCapBackpressurePolicy by default.


Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
…tage (ray-project#59395)

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
…roject#60179)

`ray-docs` has become a bottleneck for review. No longer requiring their
approval for library documentation changes, but leaving it as a
catch-all for other docs changes.

Flyby: removing code ownership for removed Ray Workflows library
directories.

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…stage config refactor (ray-project#59214)

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
Signed-off-by: Nikhil G <nrghosh@users.noreply.github.com>
## Description
In ray-project#59544, we added many optimisations for the IMPALA / APPO learner.
One of these optimisations is to minimise the time a thread holds the queue lock.
Using `queue.get_nowait()` raises an exception if the queue has no data, so we
wrap the call in a try/except.
Currently, when this exception occurs we log a warning, but reviewing the
training logs shows that this causes a massive amount of spam.

This PR therefore removes the warning and adds a comment explaining why we use a
`pass` instead.
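
For reference, a minimal sketch of the pattern (not the exact RLlib code;
`learner_queue` is illustrative):

```python
import queue

learner_queue: queue.Queue = queue.Queue()

try:
    batch = learner_queue.get_nowait()
except queue.Empty:
    # An empty queue is expected between producer batches; logging a warning
    # here floods the training logs, so we silently skip this iteration.
    pass
```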

---------

Signed-off-by: Mark Towers <mark@anyscale.com>
Co-authored-by: Mark Towers <mark@anyscale.com>
Before:

<img width="616" height="210" alt="Screenshot 2026-01-07 at 11 03 26β€―AM"
src="https://github.com/user-attachments/assets/139496f4-136d-4ade-9ec5-ad788cc4f8f9"
/>

After:

<img width="5000" height="2812" alt="ray-job-diagram"
src="https://github.com/user-attachments/assets/e9c653be-3191-458d-9fa1-30329c395496"
/>

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>

---------

Signed-off-by: akshay-anyscale <122416226+akshay-anyscale@users.noreply.github.com>
…ay-project#60176)

No behavior changes, pure refactoring to retain my sanity.

- No longer inherit from `SchedulingQueue` in `NormalTaskExecutionQueue`
- Remove unnecessary methods from `SchedulingQueue` interface
- `NormalSchedulingQueue` -> `NormalTaskExecutionQueue`
- `ActorSchedulingQueue` -> `OrderedActorTaskExecutionQueue`
- `OutOfOrderActorSchedulingQueue` -> `UnorderedActorTaskExecutionQueue`
- `SchedulingQueue` -> `ActorTaskExecutionQueueInterface`
- A few method/field renamings.

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: Ibrahim Rabbani <israbbani@gmail.com>
…re flag, GCS, and Raylet code. (ray-project#59979)

This is 1/N in a series of PRs to remove Centralized Actor Scheduling by
the GCS (introduced in ray-project#15943). The feature is off by default and no
longer in use or supported.

In this PR, I remove the feature flag to turn the feature on and remove
related code and tests in the GCS and the Raylet.

---------

Signed-off-by: irabbani <irabbani@anyscale.com>
Signed-off-by: Ibrahim Rabbani <irabbani@anyscale.com>
…registry example (ray-project#60071)

Noticed ray-project#59917 but it didn't fix
it
- Linking to notebook instead of README.md
- Removing notebook from exclude_patterns

The notebook should be the single source of truth since it’s what's
tested and validated, so we should link to it rather than the README.md.
The README.md is generated from the notebook (jupyter nbconvert) and
exists only for display in the console when converting the example into
an Anyscale template

Also fixing the 404 error in the MLflow registry example by linking to
the proper doc.

---------

Signed-off-by: Aydin Abiar <aydin@anyscale.com>
Signed-off-by: Aydin Abiar <62435714+Aydin-ab@users.noreply.github.com>
Co-authored-by: Aydin Abiar <aydin@anyscale.com>
…y-project#60161)

so that it is clear that these functions are not meant to be used by
other files

Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
not used anywhere any more

Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
…oject#57555)


## Why are these changes needed?
Since `0.26.0`, `uvicorn`
[changed](https://uvicorn.dev/release-notes/#0260-january-16-2024)
how it processes `root_path`. To support all `uvicorn` versions, we now
inject `root_path` into the ASGI app instead of passing it to
`uvicorn.Config` for versions `0.26.0` and later.
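
Conceptually, the change amounts to setting `root_path` on the ASGI scope rather
than relying on `uvicorn.Config(root_path=...)`; a generic illustration (not Ray
Serve's actual code):

```python
class RootPathMiddleware:
    """Inject root_path into the ASGI scope (works across uvicorn versions)."""

    def __init__(self, app, root_path: str):
        self.app = app
        self.root_path = root_path

    async def __call__(self, scope, receive, send):
        if scope["type"] in ("http", "websocket"):
            # Copy the scope so we don't mutate the server's original dict.
            scope = dict(scope)
            scope["root_path"] = self.root_path
        await self.app(scope, receive, send)
```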

Before the change:
```
# uvicorn==0.22.0
pytest -s -v python/ray/serve/tests/test_standalone.py::test_http_root_path - pass
# uvicorn==0.40.0 - latest
pytest -s -v python/ray/serve/tests/test_standalone.py::test_http_root_path - failed
# FAILED python/ray/serve/tests/test_standalone.py::test_http_root_path - assert 404 == 200
```

After the change:
```
# uvicorn==0.22.0
pytest -s -v python/ray/serve/tests/test_standalone.py::test_http_root_path - pass
# uvicorn==0.40.0 - latest
pytest -s -v python/ray/serve/tests/test_standalone.py::test_http_root_path - pass
```


## Related issue number


Closes ray-project#55776.

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run pre-commit jobs to lint the changes in this PR.
([pre-commit
setup](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting))
- [x] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: axreldable <aleksei.starikov.ax@gmail.com>
…nd Model composition for recsys examples (ray-project#59166)

## Description
Adding two examples for Ray Serve as part of our workload based series:
- Model multiplexing with forecasting models ✅
- Model composition for recsys (recommendation systems) ✅

Will later be published as templates in the anyscale console

Lots of added/modified files but the contents to review are under the
`content/` folder, everything else is related to the publishing workflow
in ray docs + setting up testing in the CI

author: @Aydin-ab

---------

Signed-off-by: Aydin Abiar <aydin@anyscale.com>
Signed-off-by: Aydin Abiar <62435714+Aydin-ab@users.noreply.github.com>
Co-authored-by: Aydin Abiar <aydin@anyscale.com>
Minor cleanups on the task execution path as I muddle through it. No
behavior changes.

- `DependencyWaiter` -> `ActorTaskExecutionArgWaiter`. Previously, I
found myself continually confused about whether (1) this was only for actor
tasks or also normal tasks and (2) whether it was on the submission or
execution path.
- Added `ActorTaskExecutionArgWaiterInterface` instead of having an
`Impl`.
- Added header comments for what the `ActorTaskExecutionArgWaiter` is
doing.
- `HandleTask` -> `QueueTaskForExecution`.

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…ect#60149)

## Description
The `python redis test` is failing with high probability due to
`test_network_partial_failures` catching a "subprocess is still
running" resource warning when it expects no warnings to be thrown.

Investigation showed that the existing kill-redis-server logic in the test
cleanup does not wait for the redis server process to die before moving
on to the next test, causing the resource warning observed above. This
PR adds a wait to ensure that the redis server is fully cleaned up before
the next test starts.
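
The cleanup pattern, roughly (an illustrative sketch; names and the timeout are
assumptions rather than the test harness's actual code):

```python
import subprocess


def stop_redis_server(proc: subprocess.Popen, timeout_s: float = 10.0) -> None:
    proc.terminate()
    try:
        # Block until the redis-server process has actually exited, so the next
        # test does not observe a "subprocess is still running" ResourceWarning.
        proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.wait()
```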

## Related issues
Fixes failing `test_network_partial_failures` in CI automated tests.

## Additional information
Example of the fix passing `test_network_partial_failures` in post
merge: https://buildkite.com/ray-project/postmerge/builds/15416#_

---------

Signed-off-by: davik <davik@anyscale.com>
Co-authored-by: davik <davik@anyscale.com>
…project#60173)

## Description


Due to Python's operator precedence (+ binds tighter than if-else):
```python
result = [1, 2, 3] + [4] if False else []
print(f"result = {result}")
#result = []
```

`PSUTIL_PROCESS_ATTRS` on Windows therefore ends up as an empty list, but
the intention is only to exclude `num_fds`. This PR fixes it.
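
The usual fix is to parenthesize the conditional so that `+` applies to its
result (a generic illustration, not the exact diff):

```python
result = [1, 2, 3] + ([4] if False else [])
print(f"result = {result}")
# result = [1, 2, 3]
```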





Signed-off-by: yicheng <yicheng@anyscale.com>
Co-authored-by: yicheng <yicheng@anyscale.com>
…c callback during shutdown (ray-project#60048)

## Description
When a Ray worker process shuts down (e.g., during `ray.shutdown()` or
node termination), the OpenTelemetry `PeriodicExportingMetricReader`'s
background thread may still be invoking the gauge callback
(`_DoubleGaugeCallback`), which then accesses already-destroyed member
data, resulting in a use-after-free crash.

The error message:
```
(bundle_reservation_check_func pid=1543823) pure virtual method called
(bundle_reservation_check_func pid=1543823) __cxa_deleted_virtual
```


I looked further into this, and ideally, at the OpenTelemetry code
level, shutdown should be handled correctly.

[PeriodicExportingMetricReader's
shutdown](https://github.com/open-telemetry/opentelemetry-cpp/blob/f33dcc07c56c7e3b18fd18e13986f0eda965d116/sdk/src/metrics/export/periodic_exporting_metric_reader.cc#L292-L299)
waits for `worker_thread_` to finish.
```cpp
bool PeriodicExportingMetricReader::OnShutDown(std::chrono::microseconds timeout) noexcept
{
  if (worker_thread_.joinable())
  {
    cv_.notify_all();
    worker_thread_.join();
  }
  return exporter_->Shutdown(timeout);
}
```

And the callback (`worker_thread_`) runs in a [while (IsShutdown() !=
true)](https://github.com/open-telemetry/opentelemetry-cpp/blob/main/sdk/src/metrics/export/periodic_exporting_metric_reader.cc#L147)
loop.

Therefore, there should be no use-after-free race condition at the
OpenTelemetry code level, and it should be safe to call
`meter_provider_->Shutdown()`.

However, the issue is that the last callback appears to access member
data that has already been destroyed during ForceFlush, which is called
before Shutdown. This member data belongs to the OpenTelemetry SDK
itself.

The more I look into it, the more it feels like this is actually a bug
in the OpenTelemetry SDK.

Digging further, I found this: [[SDK] Use shared_ptr internally for
AttributesProcessor to prevent use-after-free
](open-telemetry/opentelemetry-cpp#3457)

Which is exactly the issue I encountered!

This PR upgrades the OpenTelemetry C++ SDK version to include this fix.




## Additional information
It is quite easy to reproduce. For example, manually run
`test_placement_group_reschedule_node_dead` in
`python/ray/autoscaler/v2/tests/test_e2e.py`:
```
(docs) ubuntu@devbox:~/ray$ pkill -9 -f raylet 2>/dev/null || true; pkill -9 -f gcs_server 2>/dev/null || true; ray stop --force 2>/dev/null || true; sleep 2
Did not find any active Ray processes.
(docs) ubuntu@devbox:~/ray$ timeout 180 python -m pytest python/ray/autoscaler/v2/tests/test_e2e.py::test_placement_group_reschedule_node_dead -xvs 2>&1 | tee /tmp/test_otel.txt; echo "EXIT CODE: $?"

............

__cxa_deleted_virtual
opentelemetry::v1::sdk::metrics::FilteredOrderedAttributeMap::FilteredOrderedAttributeMap()::{lambda()#1}::operator()()
opentelemetry::v1::nostd::function_ref<>::BindTo<>()::{lambda()#1}::operator()()
opentelemetry::v1::sdk::metrics::ObserverResultT<>::Observe()
opentelemetry::v1::metrics::ObserverResultT<>::Observe<>()
ray::observability::OpenTelemetryMetricRecorder::CollectGaugeMetricValues()
(anonymous namespace)::_DoubleGaugeCallback()
opentelemetry::v1::sdk::metrics::ObservableRegistry::Observe()
opentelemetry::v1::sdk::metrics::Meter::Collect()
opentelemetry::v1::sdk::metrics::MetricCollector::Produce()
opentelemetry::v1::sdk::metrics::MetricReader::Collect()
opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()
std::thread::_State_impl<>::_M_run()


............
```

After this PR, there is no such error message:
```
(docs) ubuntu@devbox:~/ray$ timeout 180 python -m pytest python/ray/autoscaler/v2/tests/test_e2e.py::test_placement_group_reschedule_node_dead -xvs 2>&1 | tee /tmp/test_otel.txt; echo "EXIT CODE: $?"
============================= test session starts ==============================
platform linux -- Python 3.12.12, pytest-9.0.2, pluggy-1.6.0 -- /home/ubuntu/.conda/envs/docs/bin/python
cachedir: .pytest_cache
rootdir: /home/ubuntu/ray
configfile: pytest.ini
plugins: asyncio-1.3.0, anyio-4.11.0
asyncio: mode=Mode.STRICT, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collecting ... collected 2 items

python/ray/autoscaler/v2/tests/test_e2e.py::test_placement_group_reschedule_node_dead[v1] Did not find any active Ray processes.
Usage stats collection is enabled. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.

Local node IP: 172.31.5.171

--------------------
Ray runtime started.
--------------------

Next steps
  To add another node to this Ray cluster, run
    ray start --address='172.31.5.171:6379'
  
  To connect to this Ray cluster:
    import ray
    ray.init()
  
  To submit a Ray job using the Ray Jobs CLI:
    RAY_API_SERVER_ADDRESS='http://127.0.0.1:8265' ray job submit --working-dir . -- python my_script.py
  
  See https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html 
  for more information on submitting Ray jobs to the Ray cluster.
  
  To terminate the Ray runtime, run
    ray stop
  
  To view the status of the cluster, use
    ray status
  
  To monitor and debug Ray, view the dashboard at 
    127.0.0.1:8265
  
  If connection to the dashboard fails, check your firewall settings and network configuration.
2026-01-12 12:30:00,347 INFO worker.py:1826 -- Connecting to existing Ray cluster at address: 172.31.5.171:6379...
2026-01-12 12:30:00,385 INFO worker.py:2006 -- Connected to Ray cluster. View the dashboard at http://127.0.0.1:8265 
(autoscaler +11s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(autoscaler +11s) Resized to 0 CPUs.
(autoscaler +12s) Resized to 0 CPUs.
(autoscaler +14s) Resized to 0 CPUs.
(autoscaler +15s) Resized to 0 CPUs.
(autoscaler +15s) Adding 1 node(s) of type type-1.
(autoscaler +15s) Adding 1 node(s) of type type-2.
(autoscaler +15s) Adding 1 node(s) of type type-3.
(autoscaler +15s) Adding 1 node(s) of type type-1.
(autoscaler +15s) Adding 1 node(s) of type type-2.
(autoscaler +15s) Adding 1 node(s) of type type-3.
(autoscaler +15s) Adding 1 node(s) of type type-1.
(autoscaler +15s) Adding 1 node(s) of type type-2.
(autoscaler +15s) Adding 1 node(s) of type type-3.
(autoscaler +15s) Adding 1 node(s) of type type-1.
(autoscaler +15s) Adding 1 node(s) of type type-2.
(autoscaler +15s) Adding 1 node(s) of type type-3.
(autoscaler +16s) Resized to 0 CPUs.
(autoscaler +16s) Adding 1 node(s) of type type-1.
(autoscaler +16s) Adding 1 node(s) of type type-2.
(autoscaler +16s) Adding 1 node(s) of type type-3.
Killing pids 1566233
(raylet) Raylet is terminated. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access from Raylet causing SIGSEGV or SIGBUS, (3) Other termination signals. Last 20 lines of the Raylet logs:
    [state-dump]        ray::rpc::InternalKVGcsService.grpc_client.GetInternalConfig.OnReplyReceived - 1 total (0 active), Execution time: mean = 880.39ms, total = 880.39ms, Queueing time: mean = 0.06ms, max = 0.06ms, min = 0.06ms, total = 0.06ms
    [state-dump]        ClusterResourceManager.ResetRemoteNodeView - 1 total (1 active), Execution time: mean = 0.00ms, total = 0.00ms, Queueing time: mean = 0.00ms, max = -0.00ms, min = 9223372036854.78ms, total = 0.00ms
    [state-dump] DebugString() time ms: 1
    [state-dump] 
    [state-dump] 
    [2026-01-12 12:29:59,875 I 1565894 1565894] (raylet) accessor.cc:436: Received address and liveness notification for node, IsAlive = 1 node_id=fffffffffffffffffffffffffffffffffffffffffffffffffff00000
    [2026-01-12 12:30:00,447 I 1565894 1565917] (raylet) object_store.cc:37: Object store current usage 8e-09 / 27.3914 GB.
    [2026-01-12 12:30:00,453 I 1565894 1565894] (raylet) worker_pool.cc:733: Job 01000000 already started in worker pool.
    [2026-01-12 12:30:02,834 I 1565894 1565894] (raylet) accessor.cc:436: Received address and liveness notification for node, IsAlive = 1 node_id=fffffffffffffffffffffffffffffffffffffffffffffffffff00001
    [2026-01-12 12:30:02,851 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 0, dropped message version: 0
    [2026-01-12 12:30:03,995 I 1565894 1565894] (raylet) accessor.cc:436: Received address and liveness notification for node, IsAlive = 1 node_id=fffffffffffffffffffffffffffffffffffffffffffffffffff00002
    [2026-01-12 12:30:04,012 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 0, dropped message version: 0
    [2026-01-12 12:30:05,178 I 1565894 1565894] (raylet) accessor.cc:436: Received address and liveness notification for node, IsAlive = 1 node_id=fffffffffffffffffffffffffffffffffffffffffffffffffff00003
    [2026-01-12 12:30:05,197 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 1, dropped message version: 1
    [2026-01-12 12:30:05,215 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 5, dropped message version: 5
    [2026-01-12 12:30:05,254 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 5, dropped message version: 5
    [2026-01-12 12:30:05,297 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 5, dropped message version: 5
    [2026-01-12 12:30:05,315 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 6, dropped message version: 6
    [2026-01-12 12:30:05,716 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 7, dropped message version: 7
    [2026-01-12 12:30:05,817 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 9, dropped message version: 9

(autoscaler +17s) Adding 1 node(s) of type type-3.
(autoscaler +17s) Adding 1 node(s) of type type-3.
(autoscaler +17s) Adding 1 node(s) of type type-3.
(autoscaler +17s) Adding 1 node(s) of type type-3.
(autoscaler +17s) Adding 1 node(s) of type type-3.
(autoscaler +24s) Removing 1 nodes of type type-3 (idle).
(autoscaler +24s) Removing 1 nodes of type type-3 (idle).
(autoscaler +24s) Removing 1 nodes of type type-3 (idle).
(autoscaler +24s) Removing 1 nodes of type type-3 (idle).
(raylet) The node with node id: fffffffffffffffffffffffffffffffffffffffffffffffffff00001 and address: 172.31.5.171 and node name: 172.31.5.171 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a     (1) raylet crashes unexpectedly (OOM, etc.) 
        (2) raylet has lagging heartbeats due to slow network or busy workload.
(raylet) Raylet is terminated. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access from Raylet causing SIGSEGV or SIGBUS, (3) Other termination signals. Last 20 lines of the Raylet logs:
    [state-dump]        ray::rpc::InternalKVGcsService.grpc_client.GetInternalConfig.OnReplyReceived - 1 total (0 active), Execution time: mean = 880.39ms, total = 880.39ms, Queueing time: mean = 0.06ms, max = 0.06ms, min = 0.06ms, total = 0.06ms
    [state-dump]        ClusterResourceManager.ResetRemoteNodeView - 1 total (1 active), Execution time: mean = 0.00ms, total = 0.00ms, Queueing time: mean = 0.00ms, max = -0.00ms, min = 9223372036854.78ms, total = 0.00ms
    [state-dump] DebugString() time ms: 1
    [state-dump] 
    [state-dump] 
    [2026-01-12 12:29:59,875 I 1565894 1565894] (raylet) accessor.cc:436: Received address and liveness notification for node, IsAlive = 1 node_id=fffffffffffffffffffffffffffffffffffffffffffffffffff00000
    [2026-01-12 12:30:00,447 I 1565894 1565917] (raylet) object_store.cc:37: Object store current usage 8e-09 / 27.3914 GB.
    [2026-01-12 12:30:00,453 I 1565894 1565894] (raylet) worker_pool.cc:733: Job 01000000 already started in worker pool.
    [2026-01-12 12:30:02,834 I 1565894 1565894] (raylet) accessor.cc:436: Received address and liveness notification for node, IsAlive = 1 node_id=fffffffffffffffffffffffffffffffffffffffffffffffffff00001
    [2026-01-12 12:30:02,851 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 0, dropped message version: 0
    [2026-01-12 12:30:03,995 I 1565894 1565894] (raylet) accessor.cc:436: Received address and liveness notification for node, IsAlive = 1 node_id=fffffffffffffffffffffffffffffffffffffffffffffffffff00002
    [2026-01-12 12:30:04,012 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 0, dropped message version: 0
    [2026-01-12 12:30:05,178 I 1565894 1565894] (raylet) accessor.cc:436: Received address and liveness notification for node, IsAlive = 1 node_id=fffffffffffffffffffffffffffffffffffffffffffffffffff00003
    [2026-01-12 12:30:05,197 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 1, dropped message version: 1
    [2026-01-12 12:30:05,215 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 5, dropped message version: 5
    [2026-01-12 12:30:05,254 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 5, dropped message version: 5
    [2026-01-12 12:30:05,297 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 5, dropped message version: 5
    [2026-01-12 12:30:05,315 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 6, dropped message version: 6
    [2026-01-12 12:30:05,716 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 7, dropped message version: 7
    [2026-01-12 12:30:05,817 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 9, dropped message version: 9
Stopped all 10 Ray processes.

(autoscaler +32s) Resized to 0 CPUs.
(autoscaler +32s) Adding 1 node(s) of type type-1.
(autoscaler +32s) Adding 1 node(s) of type type-2.
(autoscaler +32s) Adding 1 node(s) of type type-3.
(autoscaler +32s) Adding 1 node(s) of type type-3.
(autoscaler +32s) Removing 1 nodes of type type-3 (idle).
PASSED
python/ray/autoscaler/v2/tests/test_e2e.py::test_placement_group_reschedule_node_dead[v2] Did not find any active Ray processes.
Usage stats collection is enabled. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.

Local node IP: 172.31.5.171

--------------------
Ray runtime started.
--------------------

Next steps
  To add another node to this Ray cluster, run
    ray start --address='172.31.5.171:6379'
  
  To connect to this Ray cluster:
    import ray
    ray.init()
  
  To submit a Ray job using the Ray Jobs CLI:
    RAY_API_SERVER_ADDRESS='http://127.0.0.1:8265' ray job submit --working-dir . -- python my_script.py
  
  See https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html 
  for more information on submitting Ray jobs to the Ray cluster.
  
  To terminate the Ray runtime, run
    ray stop
  
  To view the status of the cluster, use
    ray status
  
  To monitor and debug Ray, view the dashboard at 
    127.0.0.1:8265
  
  If connection to the dashboard fails, check your firewall settings and network configuration.
2026-01-12 12:30:40,170 INFO worker.py:1826 -- Connecting to existing Ray cluster at address: 172.31.5.171:6379...
2026-01-12 12:30:40,202 INFO worker.py:2006 -- Connected to Ray cluster. View the dashboard at http://127.0.0.1:8265 
Stopped only 9 out of 12 Ray processes within the grace period 16 seconds. Set `-v` to see more details. Remaining processes [psutil.Process(pid=1569612, name='raylet', status='terminated'), psutil.Process(pid=1569160, name='raylet', status='terminated'), psutil.Process(pid=1568952, name='raylet', status='terminated')] will be forcefully terminated.
You can also use `--force` to forcefully terminate processes or set higher `--grace-period` to wait longer time for proper termination.
Killing pids 1568744
(raylet) Raylet is terminated. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access from Raylet causing SIGSEGV or SIGBUS, (3) Other termination signals. Last 20 lines of the Raylet logs:
    [state-dump]        NodeManager.deadline_timer.spill_objects_when_over_threshold - 1 total (1 active), Execution time: mean = 0.00ms, total = 0.00ms, Queueing time: mean = 0.00ms, max = -0.00ms, min = 9223372036854.78ms, total = 0.00ms
    [state-dump] DebugString() time ms: 0
    [state-dump] 
    [state-dump] 
    [2026-01-12 12:30:39,701 I 1568506 1568506] (raylet) accessor.cc:436: Received address and liveness notification for node, IsAlive = 1 node_id=fffffffffffffffffffffffffffffffffffffffffffffffffff00000
    [2026-01-12 12:30:40,257 I 1568506 1568529] (raylet) object_store.cc:37: Object store current usage 8e-09 / 27.3852 GB.
    [2026-01-12 12:30:40,262 I 1568506 1568506] (raylet) worker_pool.cc:733: Job 01000000 already started in worker pool.
    [2026-01-12 12:30:41,697 I 1568506 1568506] (raylet) accessor.cc:436: Received address and liveness notification for node, IsAlive = 1 node_id=fffffffffffffffffffffffffffffffffffffffffffffffffff00001
    [2026-01-12 12:30:41,714 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 0, dropped message version: 0
    [2026-01-12 12:30:42,858 I 1568506 1568506] (raylet) accessor.cc:436: Received address and liveness notification for node, IsAlive = 1 node_id=fffffffffffffffffffffffffffffffffffffffffffffffffff00002
    [2026-01-12 12:30:42,876 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 0, dropped message version: 0
    [2026-01-12 12:30:44,050 I 1568506 1568506] (raylet) accessor.cc:436: Received address and liveness notification for node, IsAlive = 1 node_id=fffffffffffffffffffffffffffffffffffffffffffffffffff00003
    [2026-01-12 12:30:44,073 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 0, dropped message version: 0
    [2026-01-12 12:30:45,018 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 1, dropped message version: 1
    [2026-01-12 12:30:45,076 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 5, dropped message version: 5
    [2026-01-12 12:30:45,079 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 5, dropped message version: 5
    [2026-01-12 12:30:45,119 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 5, dropped message version: 5
    [2026-01-12 12:30:45,177 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 6, dropped message version: 6
    [2026-01-12 12:30:45,578 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 7, dropped message version: 7
    [2026-01-12 12:30:45,679 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 9, dropped message version: 9

(raylet) The node with node id: fffffffffffffffffffffffffffffffffffffffffffffffffff00001 and address: 172.31.5.171 and node name: 172.31.5.171 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a  (1) raylet crashes unexpectedly (OOM, etc.) 
        (2) raylet has lagging heartbeats due to slow network or busy workload.
(raylet) Raylet is terminated. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access from Raylet causing SIGSEGV or SIGBUS, (3) Other termination signals. Last 20 lines of the Raylet logs:
    [state-dump]        NodeManager.deadline_timer.spill_objects_when_over_threshold - 1 total (1 active), Execution time: mean = 0.00ms, total = 0.00ms, Queueing time: mean = 0.00ms, max = -0.00ms, min = 9223372036854.78ms, total = 0.00ms
    [state-dump] DebugString() time ms: 0
    [state-dump] 
    [state-dump] 
    [2026-01-12 12:30:39,701 I 1568506 1568506] (raylet) accessor.cc:436: Received address and liveness notification for node, IsAlive = 1 node_id=fffffffffffffffffffffffffffffffffffffffffffffffffff00000
    [2026-01-12 12:30:40,257 I 1568506 1568529] (raylet) object_store.cc:37: Object store current usage 8e-09 / 27.3852 GB.
    [2026-01-12 12:30:40,262 I 1568506 1568506] (raylet) worker_pool.cc:733: Job 01000000 already started in worker pool.
    [2026-01-12 12:30:41,697 I 1568506 1568506] (raylet) accessor.cc:436: Received address and liveness notification for node, IsAlive = 1 node_id=fffffffffffffffffffffffffffffffffffffffffffffffffff00001
    [2026-01-12 12:30:41,714 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 0, dropped message version: 0
    [2026-01-12 12:30:42,858 I 1568506 1568506] (raylet) accessor.cc:436: Received address and liveness notification for node, IsAlive = 1 node_id=fffffffffffffffffffffffffffffffffffffffffffffffffff00002
    [2026-01-12 12:30:42,876 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 0, dropped message version: 0
    [2026-01-12 12:30:44,050 I 1568506 1568506] (raylet) accessor.cc:436: Received address and liveness notification for node, IsAlive = 1 node_id=fffffffffffffffffffffffffffffffffffffffffffffffffff00003
    [2026-01-12 12:30:44,073 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 0, dropped message version: 0
    [2026-01-12 12:30:45,018 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 1, dropped message version: 1
    [2026-01-12 12:30:45,076 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 5, dropped message version: 5
    [2026-01-12 12:30:45,079 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 5, dropped message version: 5
    [2026-01-12 12:30:45,119 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 5, dropped message version: 5
    [2026-01-12 12:30:45,177 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 6, dropped message version: 6
    [2026-01-12 12:30:45,578 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 7, dropped message version: 7
    [2026-01-12 12:30:45,679 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 9, dropped message version: 9

PASSED

========================= 2 passed in 80.90s (0:01:20) =========================
EXIT CODE: 0
(docs) ubuntu@devbox:~/ray$ 
```

Signed-off-by: yicheng <yicheng@anyscale.com>
Co-authored-by: yicheng <yicheng@anyscale.com>
…eManager from GCS (ray-project#60121)

This PR stacks on ray-project#60019.

This is 3/N in a series of PRs to remove Centralized Actor Scheduling by
the GCS (introduced in ray-project#15943).
The feature is off by default and no longer in use or supported.

In this PR, I've removed the GCS's dependency on the LocalLeaseManager.
I've also moved LocalLeaseManager to the raylet/scheduling package and
made its visibility private to the package. Also deleted the
NoopLocalLeaseManager.

The LocalLeaseManager is used by the ClusterLeaseManager to see if a
task can be scheduled locally by a Raylet. The GCS used only the Noop
implementation.

---------

Signed-off-by: irabbani <irabbani@anyscale.com>
Signed-off-by: Ibrahim Rabbani <irabbani@anyscale.com>
…ject#60145)

## Why are these changes needed?

When `EngineDeadError` occurs, the vLLM engine subprocess is dead but
the Ray actor process is still alive. Previously, we re-raised the
exception, but this causes **task retries to go to the SAME actor**
(actor methods are bound to specific instances), creating an infinite
retry loop on the broken actor.

### The Problem

```
vLLM engine subprocess crashes
       ↓
EngineDeadError raised
       ↓
Exception re-raised (actor stays ALIVE)
       ↓
Ray: actor_task_retry_on_errors triggers retry
       ↓
Retry goes to SAME actor (actor methods are bound)
       ↓
Same actor, engine still dead β†’ EngineDeadError
       ↓
Infinite loop (with max_task_retries=-1)
```

### The Fix

Call `os._exit(1)` to exit the actor. This triggers Ray to:
1. Mark the actor as `RESTARTING`
2. Create a replacement actor with a fresh vLLM engine
3. Route task retries to healthy actors (Ray Data excludes `RESTARTING`
actors from dispatch)

This leverages Ray Data's existing fault tolerance infrastructure:
- `max_restarts=-1` (default) enables actor replacement
- `max_task_retries=-1` (default) enables task retry
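
A minimal sketch of the resulting error handling (illustrative only;
`EngineDeadError` here is a stand-in for vLLM's engine-dead exception, and the
real logic lives inside the Ray Data vLLM stage):

```python
import os


class EngineDeadError(RuntimeError):
    """Stand-in for vLLM's engine-dead exception (illustrative)."""


async def generate_with_fault_handling(engine, request):
    try:
        return await engine.generate(request)
    except EngineDeadError:
        # The engine subprocess is dead but this Ray actor is still alive.
        # Exit hard so Ray marks the actor RESTARTING and routes task retries
        # to a healthy replacement instead of looping on this broken actor.
        os._exit(1)
```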

### Why `os._exit(1)` instead of `ray.actor.exit_actor()`?

We must use `os._exit(1)` rather than `ray.actor.exit_actor()` because
they produce different exit types with different retry behavior:

| Exit Method | Exit Type | Exception Raised | Retried? |
|-------------|-----------|------------------|----------|
| `os._exit(1)` | `SYSTEM_ERROR` | `RaySystemError` | Yes |
| `ray.actor.exit_actor()` | `INTENDED_USER_EXIT` | `ActorDiedError` |
No |

The root cause is that Ray Data only adds `RaySystemError` to its
`retry_exceptions` list (in `_add_system_error_to_retry_exceptions()`).
Since `ActorDiedError` is NOT a subclass of `RaySystemError` (they're
siblings in the exception hierarchy), tasks that fail due to
`ray.actor.exit_actor()` are not retried.

**The semantic gap**: Ray currently lacks a "fatal application error"
concept - an error where the actor should be restarted AND pending tasks
retried. The available options are:
- Clean exit (`exit_actor`) = "I'm intentionally done" β†’ no retry
- Crash (`os._exit`) = "Something broke unexpectedly" β†’ retry

We need the "crash" semantics even though this is a deliberate decision,
so `os._exit(1)` is the correct workaround until Ray Core adds explicit
support for fatal application errors.

See: ray-project#60150

cc @goutamvenkat-anyscale 

### Validation

We created a minimal reproduction script demonstrating:
1. **The problem**: All retries go to the same broken actor (same PID)
2. **The fix**: Actor exits β†’ replacement created β†’ job succeeds with
multiple PIDs

```python
# Demo output showing fix works:
[SUCCESS] Processed 20 rows!
PIDs that processed batches: {708593, 708820}
-> Multiple PIDs = replacement actor joined and processed work
```

Full reproduction:
https://gist.github.com/nrghosh/c18e514a975144a238511012774bab8b

## Related issue number

Fixes ray-project#59522

## Checks

- [x] I've signed off every commit
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
temporary file handling method, I've added it in
`doc/source/ray-core/api/doc/ray.util.temp_files.rst`.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests.

---------

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
we already know the cloud provider / type via test definition, so there
is no need to fetch it via sdk.

Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
not used anywhere in JobFileManager; all files are transferred via shared
blob storage access.

Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
no test definition is using this field

---------

Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
**Problem**
All versions of setproctitle after 1.2.3 seem to introduce significant
expense when forking. This in turn results in very slow or crashing runs
of Ray jobs on macOS, particularly when spawning many jobs.

**Solution**
Looking at the code, it seems that versions after 1.2.3 make quite a
lot of changes on Darwin in order to make process renames work
properly with Activity Monitor and other macOS utilities. This... is not
'especially' important to Ray, so downgrading doesn't cost us too much.
It would of course have been vastly preferable to rely on a
non-vendored latest version, but the latest versions have this issue.
So we can downgrade for now and revisit later.
## Related issues
fixes ray-project#59663

**Historic Context**
ray-project#53471 vendored the dependency
and made a slight logic tweak in the cython binding in setproctitle.pxi.
This had the benefit of fixing the cmdline parse issue described in that
PR but had the downside of upgrading the library version (which now
included a set of Darwin tweaks that lead to the slowdown). After this
PR, the vendored version is old enough not to contain the Activity
Monitor tweaks for Darwin, while still having the changes in
setproctitle.pxi.

Signed-off-by: ZacAttack <zac@anyscale.com>
goutamvenkat-anyscale and others added 22 commits January 29, 2026 18:03
…to to 5.x.x (ray-project#59489)

## Description

PyArrow 22 uses a newer AWS SDK that sends S3 requests with HTTP chunked
transfer encoding and trailer checksums (x-amz-checksum-crc64nvme). Our
old moto version (4.2.12) doesn't properly parse this protocol, causing
raw HTTP wire format to leak into test responses:

```
Expected: b'spam'
Got: b'4\r\nspam\r\n0\r\nx-amz-checksum-crc64nvme:...\r\n\r\n'
```

Related issue from moto: getmoto/moto#7198



---------

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: Goutam <goutam@anyscale.com>
Co-authored-by: Alexey Kudinkin <ak@anyscale.com>
upgrading pycurl and kiwisolver, especially for py313 dependencies

kiwisolver==1.4.5 -> 1.4.7
pycurl==7.45.3 -> 7.45.4
Wheels don't exist for py313 on original versions

---------

Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
While migrating the wheel build+publish to Wanda, I missed that
ci/build/copy_build_artifacts is also the mechanism by which the
Buildkite Artifacts tab is populated (instead of a buildkite-agent
upload flow).
ci/build/copy_build_artifacts has a guard clause against publishing in
non-postmerge builds, so let's run this step on PRs so that devs can
access wheels via the Artifacts tab as needed.

I've also added the same tag set to the upload step as the build step, to
ensure a build is always accompanied by an upload.

Guard clause:
https://github.com/ray-project/ray/blob/f685a50dc5f1757f4c9b6d431c6c1697701dae87/ci/build/copy_build_artifacts.sh#L49-L57

Signed-off-by: andrew <andrew@anyscale.com>
ray-project#60143 shifted the library to use
exceptions instead of error codes. That landed while push_ray_image was in
development, and testing did not sufficiently catch this.

Updated the call and added a new test to catch regressions.

Signed-off-by: andrew <andrew@anyscale.com>
This PR fully deprecates `local_mode` and raises an error if the option
is specified by the user. It also removes the implementation from core
C++ & Python code, as well as lingering library logic & tests.
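
A hedged illustration of the user-facing change (the exact error type and
message may differ):

```python
import ray

try:
    ray.init(local_mode=True)
except Exception as exc:  # specifying local_mode now raises instead of being honored
    print(f"local_mode is no longer supported: {exc}")
```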

Originally, I attempted to split this into two PRs: one removing the
Python changes and one removing C++ changes, but that turned out to be
challenging because we have enforcement in CI that all of the proto
options are used in Python. I couldn't remove the error proto related to
local mode without also removing its C++ implementation.

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…y-project#60609)

This test has been periodically timing out on Windows. New test cases
were added in December, so I think we're just bumping up against the 60s
limit.

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
)

## Description
This PR deprecates the V1 cluster autoscaler and describes how to verify
the deprecation.

1. Added `@Deprecated` annotation to `DefaultClusterAutoscaler` in
`default_cluster_autoscaler.py`
2. Changed default from `V1` to `V2` in `__init__.py`

## Related issues
Closes ray-project#60459

## Additional information
### Testing
**Test V1 via env var emits deprecation warning:**
```bash
RAY_DATA_CLUSTER_AUTOSCALER=V1 python -c "
import warnings
warnings.simplefilter('always')
from unittest.mock import MagicMock
from ray.data._internal.cluster_autoscaler import create_cluster_autoscaler

mock_data_context = MagicMock()
mock_data_context.execution_options.resource_limits = None
autoscaler = create_cluster_autoscaler(MagicMock(), MagicMock(), mock_data_context, execution_id='test')
print(f'Created: {type(autoscaler).__name__}')
"
```

```
...RayDeprecationWarning: This API is deprecated and may be removed in future Ray releases...
DefaultClusterAutoscaler (V1) is deprecated. Use DefaultClusterAutoscalerV2 instead by setting RAY_DATA_CLUSTER_AUTOSCALER=V2 or using the default.
Created: DefaultClusterAutoscaler
```


**Test V2 (default) has no deprecation warning:**
```bash
python -c "
import warnings
warnings.filterwarnings('always', category=DeprecationWarning)
from unittest.mock import MagicMock
from ray.data._internal.cluster_autoscaler import create_cluster_autoscaler

mock_data_context = MagicMock()
mock_data_context.execution_options.resource_limits = None
autoscaler = create_cluster_autoscaler(MagicMock(), MagicMock(), mock_data_context, execution_id='test')
print(f'Created: {type(autoscaler).__name__}')
"
```

```
Created: DefaultClusterAutoscalerV2
```

---------

Signed-off-by: Ryan Huang <ryankert01@gmail.com>
…ions (ray-project#60595)

This PR addresses intermittent test failures in the execution optimizer
integration suite. The test `test_from_pandas_refs_e2e` previously assumed
a deterministic row ordering when reading from multiple pandas
references, which is not guaranteed by the Ray Data interface in a
distributed execution environment.

Changes:
- Replaced manual tuple-list comparisons with the `rows_same` utility from
`ray.data._internal.util`.
- Refactored assertions to use `ds.to_pandas()` for robust, order-agnostic
data validation.
- Applied the fix to all three affected assertions within the test
function.
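
An order-agnostic comparison along these lines (illustrative; the test itself
uses the `rows_same` helper from `ray.data._internal.util`):

```python
import pandas as pd


def frames_have_same_rows(a: pd.DataFrame, b: pd.DataFrame) -> bool:
    # Sort both frames by all columns so row order doesn't affect equality.
    cols = sorted(a.columns)
    left = a[cols].sort_values(by=cols).reset_index(drop=True)
    right = b[cols].sort_values(by=cols).reset_index(drop=True)
    return left.equals(right)
```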

Fixes [60553](ray-project#60553)

Signed-off-by: Parth Ghayal <parthmghayal@gmail.com>
…tdown([ray-project#59573](https://github.com/muyihao/ray/issues/59573)) (ray-project#60258)

When using Ray actors, especially detached actors (lifetime="detached"),
they can outlive the driver process that created them. In production or
long-running workflows, sometimes these actors get stuck, leak resources
(e.g., GPU memory), or hang due to exceptions/bugs.
## Related issues
Related to ray-project#59573
## Additional information
As discussed in ray-project#28407, Ray still lacks a public way to obtain an actor
handle from its ID, like `ray.get(actor_id)`, so this patch only
supports killing by name.

Once ray-project#28407 (or an equivalent public API) lands, we can extend `ray
kill-actor --actor-id <id>` in a follow-up patch without breaking
existing usage.
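
A rough programmatic equivalent using existing public APIs (the actor and namespace names are illustrative, and whether the new CLI uses exactly this path is an assumption):

```python
import ray

# Connect to the running cluster; the namespace must match the one the
# detached actor was created in.
ray.init(address="auto", namespace="jobs")

handle = ray.get_actor("stuck_worker")  # look up the detached actor by name
ray.kill(handle, no_restart=True)       # force-terminate it and free its resources
```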

---------

Signed-off-by: muyihao <515294355@qq.com>
…ray-project#60252)

## Description
Issue ray-project#58964 reported that it's
possible for the GCS publisher to receive an invalid channel, causing the
cluster to be taken down. From investigation, we suspect this most likely
arose from a misconfiguration in the local environment of the subscriber
(e.g., a Ray version mismatch).

This PR modifies the publisher to reject messages with bad arguments and
reply with an error. The subscriber will fail upon receiving a bad-argument
reply, since pubsub is internal and bad arguments indicate potential bugs
in the system. However, we do not take down the cluster in this case, to
guard against a local environment misconfiguration.

## Related issues
ray-project#58964

## Additional information

---------

Signed-off-by: davik <davik@anyscale.com>
Signed-off-by: Kunchen (David) Dai <54918178+Kunchd@users.noreply.github.com>
Co-authored-by: davik <davik@anyscale.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Fixes ray-project#60218

During shutdown, `TaskEventBufferImpl::Stop()` and `RayEventRecorder`
were losing buffered events because the io_service was stopped
immediately after calling async gRPC flush methods, without waiting for
the gRPC calls to complete.

This PR:
- Adds a synchronous flush with configurable timeout in
`TaskEventBuffer::Stop()` - waits up to 5 seconds (configurable via
`task_events_shutdown_flush_timeout_ms`) for in-flight gRPC calls to
complete
- Adds `StopExportingEvents()` method to `RayEventRecorder` for graceful
shutdown
- Calls `StopExportingEvents()` from `GcsServer::Stop()` before stopping
io_service
- Adds new config option `task_events_shutdown_flush_timeout_ms`
(default 5000ms)
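
A sketch of overriding the new timeout, assuming it is surfaced through Ray's internal `_system_config` mechanism like other `RAY_CONFIG` options:

```python
import ray

# Assumption: task_events_shutdown_flush_timeout_ms is a regular internal
# config option, so it can be overridden on the head node at startup.
ray.init(
    _system_config={
        # Allow up to 10s (instead of the 5000ms default) for in-flight
        # task-event gRPC flushes to complete during shutdown.
        "task_events_shutdown_flush_timeout_ms": 10_000,
    }
)
```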

## Test plan

- [x] Added unit test `TestStopFlushesEvents` for `TaskEventBuffer` that
verifies events are flushed during `Stop()`
- [x] Added unit test `TestStopFlushesEvents` for `RayEventRecorder`
that verifies events are exported during `StopExportingEvents()`
- [ ] Ray CI tests

πŸ€– Generated with [Claude Code](https://claude.com/claude-code)

---------

Signed-off-by: ruo <ruoliu.dev@gmail.com>
Signed-off-by: rlizzy <liuruo20021124@gmail.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: rlizzy <liuruo20021124@gmail.com>
…ct#58857)

## Description
Currently, custom autoscaling policies bypass all standard autoscaling
configuration parameters - they must manually implement delay logic,
scaling factors, and bounds checking themselves. This PR adds an
`apply_autoscaling_config` decorator that enables custom autoscaling
policies to automatically benefit from Ray Serve's standard autoscaling
parameters that are embedded in the default policy (`upscale_delay_s`,
`downscale_delay_s`, `downscale_to_zero_delay_s`, `upscaling_factor`,
`downscaling_factor`, `min_replicas`, `max_replicas`).
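
A hypothetical usage sketch (the decorator name comes from this PR, but the policy signature and the context attributes below are assumptions, not the documented API):

```python
from ray.serve.autoscaling_policy import apply_autoscaling_config


@apply_autoscaling_config
def queue_based_policy(ctx) -> int:
    """Return a raw replica target; the decorator is expected to layer the
    configured delays, scaling factors, and min/max bounds on top."""
    # The `ctx` attributes used here are illustrative assumptions.
    if ctx.avg_num_queued_requests > 10:
        return ctx.current_num_replicas + 1
    return ctx.current_num_replicas
```
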
## Related issues
Fixes ray-project#58622 

## Implementation Details
- Core implementation (`python/ray/serve/autoscaling_policy.py`):
  - Added the `apply_autoscaling_config` decorator
  - Refactored delay logic into the `_apply_delay_logic()` helper function
  - Added scaling factor logic for custom policies via the `_apply_scaling_factors()` helper function
  - Refactored bounds checking into the `_apply_bounds()` helper function
  - Updated `replica_queue_length_autoscaling_policy` to use the `_apply_delay_logic` function
- Tests (`python/ray/serve/tests/test_autoscaling_policy.py` and
  `python/ray/tests/unit/test_autoscaling_policy.py`):
  - End-to-end tests verifying delay enforcement for decorated custom policies
  - Tests for scaling factor moderation (upscaling and downscaling)
  - Unit tests for each helper function
- Added documentation with a usage example

---------

Signed-off-by: Vaishnavi Panchavati <vaishdho10@gmail.com>
Co-authored-by: harshit-anyscale <harshit@anyscale.com>
…ce (ray-project#60457)

Introduce a credential provider pattern for Databricks authentication,
enabling custom credential sources while maintaining backward
compatibility.

  Changes:
- Add DatabricksCredentialProvider base class with
StaticCredentialProvider and EnvironmentCredentialProvider
implementations
- Add credential_provider parameter to DatabricksUCDatasource and
read_databricks_tables()
- Add UnityCatalogConnector class for reading Unity Catalog tables
directly (supports Delta/Parquet formats with AWS, Azure, and GCP
credential handoff)
- Add retry on 401 with credential invalidation via shared
request_with_401_retry() helper
- Centralize common code (build_headers, request_with_401_retry) in
databricks_credentials.py
- Move Databricks tests to dedicated test files with shared test
utilities

The credential provider abstraction allows users to implement custom
credential sources that support token refresh and other authentication
patterns beyond static tokens.
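
A sketch of a custom provider under stated assumptions: the base class and the `credential_provider` parameter come from this PR, while the module path, method names, and the `fetch_oauth_token` helper are illustrative.

```python
from ray.data.datasource.databricks_credentials import (  # assumed module path
    DatabricksCredentialProvider,
)


class OAuthRefreshingProvider(DatabricksCredentialProvider):
    """Returns a short-lived OAuth token instead of a static DATABRICKS_TOKEN."""

    def __init__(self, host: str, client_id: str, client_secret: str):
        self._host = host
        self._client_id = client_id
        self._client_secret = client_secret

    def get_host(self) -> str:  # assumed interface method
        return self._host

    def get_token(self) -> str:  # assumed interface method
        # Exchange client credentials for a fresh workspace token, so a 401
        # retry picks up a new token rather than the expired one.
        return fetch_oauth_token(self._host, self._client_id, self._client_secret)  # hypothetical helper
```

Such a provider would then be passed via the new `credential_provider` argument to `read_databricks_tables()`.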

Backward compatibility: Existing code using the DATABRICKS_TOKEN and
DATABRICKS_HOST environment variables continues to work unchanged.

---------

Signed-off-by: ankur <ankur@anyscale.com>
…ject#60086)

## Description
In this PR, we export the output schema of dataset operators so that we
can check the output field names and their data types for better
observability.

If `DataContext.enforce_schemas` is set to False, the schema will only be
exported once for each operator; if it is set to True, the schema will be
exported whenever the fields get updated.
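
For reference, a minimal sketch of toggling that flag (the `enforce_schemas` attribute is the one referenced above; the accessor is Ray Data's standard `DataContext` API):

```python
import ray

ctx = ray.data.DataContext.get_current()
# Re-export the operator schema whenever its fields change, instead of
# only once per operator.
ctx.enforce_schemas = True
```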

Example export event:
```
{
  "event_id": "83b3A80eAa283CFFBf",
  "timestamp": 1769111404,
  "source_type": "EXPORT_DATASET_OPERATOR_SCHEMA",
  "event_data": {
    "operator_uuid": "a76411d1-5d28-4027-ad19-56cdcb073410",
    "schema_fields": {
      "int_field": "int64",
      "bool_field": "bool",
      "bytes_field": "binary",
      "string_field": "string",
      "date_field": "date32[day]",
      "datetime_field": "timestamp[us]",
      "numpy_int_field": "int32",
      "numpy_float_field": "double",
      "numpy_array_field": "ArrowTensorType(shape=(3,), dtype=int64)",
      "list_float_field": "list<item: double>",
      "list_list_field": "list<item: list<item: double>>",
      "nested_dict_field": "struct<a: struct<b: string>>",
      "none_field": "null",
    }
  }
}
```

Signed-off-by: cong.qian <cong.qian@anyscale.com>

---------

Signed-off-by: akyang-anyscale <alexyang@anyscale.com>
…tor` (ray-project#60273)

## Description

Cleaning up `ResourceManager`:
- Cleaning up method duplication
- Fixing `_should_unblock_streaming_output_backpressure` semantics
- Abstracting a common `_is_blocking_materializing_op` util to determine
  whether an operator is a blocking, materializing op

Cleaning up `ReservationOpResourceAllocator`:
- Adjusting `can_submit_new_task` to check for available object store
  memory when launching tasks

---------

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
rather than exit early. This makes sure that the job URL and job IDs are
all assigned even for release test jobs that failed.

Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
- Add a `_should_upload()` guard to prevent accidental pushes from feature
  branches (mirrors `RayDockerContainer` behavior)
- Add a `--pipeline-id` option to validate the postmerge pipeline

Signed-off-by: andrew <andrew@anyscale.com>
Support multiple `--platform` flags for consolidated push jobs
* This mirrors the current workflow and prevents CUDA builders from
spinning up 40 separate push jobs.

Signed-off-by: andrew <andrew@anyscale.com>
## Changes
1. Modified Ray Core's generator handling sequence to inject back object
creation & serialization durations
2. Updated `task_completion_time_excl_backpressure_s` to track both UDF
block generation time AND block serialization overhead

---------

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
* [Data] Add schema inference capabilities to logical operators

- Add infer_schema method to Count, Map, Filter, Project, and Repartition operators
- Implement expression type inference for various expression types including BinaryExpr, UnaryExpr, LiteralExpr
- Add projection schema computation logic with proper field type derivation
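
Illustrative only: the kind of dtype propagation this inference performs for a binary expression, written as a standalone sketch (the real Ray Data expression classes and promotion rules differ).

```python
import pyarrow as pa


def infer_binary_expr_type(op: str, left: pa.DataType, right: pa.DataType) -> pa.DataType:
    """Derive the output dtype of `left <op> right` (simplified rules)."""
    if op in (">", ">=", "<", "<=", "==", "!="):
        return pa.bool_()  # comparisons always produce booleans
    if pa.types.is_floating(left) or pa.types.is_floating(right):
        return pa.float64()  # promote mixed int/float arithmetic to float
    return pa.int64()        # integer arithmetic stays integral


assert infer_binary_expr_type("+", pa.int64(), pa.float64()) == pa.float64()
assert infer_binary_expr_type(">", pa.int64(), pa.int64()) == pa.bool_()
```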

Signed-off-by: will <zzchun8@gmail.com>

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request is a large automated merge that introduces a wide array of changes, including a major refactoring of the CI/CD pipeline, extensive documentation updates, and the introduction of several new features. The move towards a more modular and cached build system using wanda is a significant improvement that should enhance CI performance and maintainability. The documentation has been substantially improved with better explanations, new examples, and coverage of new features like graceful actor cancellation and advanced Ray Serve configurations. Overall, these changes are very positive. I have one minor suggestion for code simplification in the crane_lib.py file.

Comment on lines +84 to +85
except CraneError:
    raise

medium

This except CraneError: raise block appears to be redundant. Any CraneError raised within the try block would propagate up naturally without being caught and re-raised here. Removing this block would simplify the code without changing its behavior.

@antfin-oss antfin-oss changed the title πŸ”„ daily merge: master β†’ main 2026-02-04 πŸ”„ daily merge: master β†’ main 2026-02-05 Feb 5, 2026
@antfin-oss antfin-oss changed the title πŸ”„ daily merge: master β†’ main 2026-02-05 πŸ”„ daily merge: master β†’ main 2026-02-06 Feb 6, 2026