daily merge: master → main 2026-01-14 #744
Open
antfin-oss wants to merge 304 commits into main from master
Conversation
## Description This pull request removes Unity3D-based environments (`mlagents` and `mlagents_envs`) from RLlib, including dependencies, code, documentation, and related test requirements. The main goal is to clean up the requirements dependencies. --------- Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
```
REGRESSION 29.30%: client__1_1_actor_calls_async (THROUGHPUT) regresses from 1152.0966726586987 to 814.5335887693782 in microbenchmark.json
REGRESSION 27.74%: client__1_1_actor_calls_concurrent (THROUGHPUT) regresses from 1146.7134222243185 to 828.6299560282166 in microbenchmark.json
REGRESSION 27.16%: multi_client_put_calls_Plasma_Store (THROUGHPUT) regresses from 13260.224066162647 to 9658.981678535481 in microbenchmark.json
REGRESSION 25.88%: multi_client_put_gigabytes (THROUGHPUT) regresses from 47.62336463265461 to 35.29689743165927 in microbenchmark.json
REGRESSION 25.52%: client__tasks_and_get_batch (THROUGHPUT) regresses from 1.0755792867557323 to 0.8011125804259877 in microbenchmark.json
REGRESSION 21.24%: client__1_1_actor_calls_sync (THROUGHPUT) regresses from 576.2018967799997 to 453.8059915017394 in microbenchmark.json
REGRESSION 20.91%: client__tasks_and_put_batch (THROUGHPUT) regresses from 11657.102874288967 to 9220.111790372692 in microbenchmark.json
REGRESSION 12.79%: single_client_tasks_and_get_batch (THROUGHPUT) regresses from 6.5168926025589275 to 5.683136512909751 in microbenchmark.json
REGRESSION 12.54%: 1_n_actor_calls_async (THROUGHPUT) regresses from 7818.847120700663 to 6838.2845805526595 in microbenchmark.json
REGRESSION 12.17%: single_client_put_calls_Plasma_Store (THROUGHPUT) regresses from 4686.60144219099 to 4116.404938052882 in microbenchmark.json
REGRESSION 11.79%: 1_1_actor_calls_concurrent (THROUGHPUT) regresses from 5629.888924437268 to 4965.99522007048 in microbenchmark.json
REGRESSION 10.73%: client__put_calls (THROUGHPUT) regresses from 821.8214713340072 to 733.6703843739219 in microbenchmark.json
REGRESSION 10.69%: 1_1_async_actor_calls_async (THROUGHPUT) regresses from 4314.570035703319 to 3853.261228971964 in microbenchmark.json
REGRESSION 10.46%: client__get_calls (THROUGHPUT) regresses from 1033.7763022350296 to 925.594265020844 in microbenchmark.json
REGRESSION 9.84%: 1_1_async_actor_calls_with_args_async (THROUGHPUT) regresses from 2762.9385297368535 to 2490.991223668351 in microbenchmark.json
REGRESSION 9.16%: 1_n_async_actor_calls_async (THROUGHPUT) regresses from 6913.550938819563 to 6280.583274671035 in microbenchmark.json
REGRESSION 8.78%: n_n_async_actor_calls_async (THROUGHPUT) regresses from 21866.061040938854 to 19945.253372184772 in microbenchmark.json
REGRESSION 7.90%: n_n_actor_calls_async (THROUGHPUT) regresses from 24531.521409406632 to 22593.67022851302 in microbenchmark.json
REGRESSION 3.15%: single_client_tasks_sync (THROUGHPUT) regresses from 872.2036137608502 to 844.7209532677355 in microbenchmark.json
REGRESSION 2.79%: tasks_per_second (THROUGHPUT) regresses from 390.190063861316 to 379.30168512953065 in benchmarks/many_nodes.json
REGRESSION 2.75%: single_client_tasks_async (THROUGHPUT) regresses from 6961.354217387221 to 6769.634231009387 in microbenchmark.json
REGRESSION 2.70%: n_n_actor_calls_with_arg_async (THROUGHPUT) regresses from 3353.6340010226468 to 3263.220674257469 in microbenchmark.json
REGRESSION 2.21%: multi_client_tasks_async (THROUGHPUT) regresses from 20569.559125979922 to 20114.199533908533 in microbenchmark.json
REGRESSION 1.89%: single_client_get_calls_Plasma_Store (THROUGHPUT) regresses from 9541.420239681218 to 9361.068161075398 in microbenchmark.json
REGRESSION 1.82%: single_client_wait_1k_refs (THROUGHPUT) regresses from 4.803898199921876 to 4.716418799247922 in microbenchmark.json
REGRESSION 1.39%: single_client_get_object_containing_10k_refs (THROUGHPUT) regresses from 12.697911312818526 to 12.521640061929554 in microbenchmark.json
REGRESSION 1.02%: placement_group_create/removal (THROUGHPUT) regresses from 685.9076055741489 to 678.9244842339416 in microbenchmark.json
REGRESSION 0.03%: client__put_gigabytes (THROUGHPUT) regresses from 0.10220457395600176 to 0.10217222369611438 in microbenchmark.json
REGRESSION 157.48%: dashboard_p99_latency_ms (LATENCY) regresses from 382.069 to 983.739 in benchmarks/many_pgs.json
REGRESSION 108.23%: dashboard_p95_latency_ms (LATENCY) regresses from 17.033 to 35.467 in benchmarks/many_pgs.json
REGRESSION 68.74%: stage_4_spread (LATENCY) regresses from 0.26078188348514014 to 0.4400494907723027 in stress_tests/stress_test_many_tasks.json
REGRESSION 49.97%: dashboard_p95_latency_ms (LATENCY) regresses from 17.15 to 25.72 in benchmarks/many_nodes.json
REGRESSION 49.08%: stage_3_time (LATENCY) regresses from 1911.0371930599213 to 2849.039920568466 in stress_tests/stress_test_many_tasks.json
REGRESSION 47.13%: dashboard_p50_latency_ms (LATENCY) regresses from 20.494 to 30.152 in benchmarks/many_actors.json
REGRESSION 39.86%: time_to_broadcast_1073741824_bytes_to_50_nodes (LATENCY) regresses from 12.110535204000001 to 16.937953781000004 in scalability/object_store.json
REGRESSION 24.17%: 1000000_queued_time (LATENCY) regresses from 177.25064926800002 to 220.095988608 in scalability/single_node.json
REGRESSION 22.86%: dashboard_p50_latency_ms (LATENCY) regresses from 6.037 to 7.417 in benchmarks/many_nodes.json
REGRESSION 14.23%: stage_2_avg_iteration_time (LATENCY) regresses from 36.306496143341064 to 41.472771883010864 in stress_tests/stress_test_many_tasks.json
REGRESSION 11.81%: dashboard_p99_latency_ms (LATENCY) regresses from 49.411 to 55.245 in benchmarks/many_nodes.json
REGRESSION 4.39%: stage_1_avg_iteration_time (LATENCY) regresses from 14.045441269874573 to 14.662270617485046 in stress_tests/stress_test_many_tasks.json
REGRESSION 1.20%: avg_pg_remove_time_ms (LATENCY) regresses from 1.4351014099100747 to 1.452357480480116 in stress_tests/stress_test_placement_group.json
REGRESSION 1.20%: 10000_args_time (LATENCY) regresses from 17.502108547999995 to 17.71234498399999 in scalability/single_node.json
REGRESSION 0.66%: 3000_returns_time (LATENCY) regresses from 5.539246789999993 to 5.576066161 in scalability/single_node.json
```
Signed-off-by: Lonnie Liu <lonnie@anyscale.com> Co-authored-by: Lonnie Liu <lonnie@anyscale.com>
…roject#59572) We can remove the Python version constraint after Windows CI is migrated to Python 3.9. Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com> Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
Otherwise the wheel upload fails with older versions. Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
…ification loop (ray-project#59574) ## Description In test_cancel_recursive_tree, the concurrent test case: 1. Creates 10 ChildActor instances 2. Submits 10 Actor.run tasks, each spawning child tasks on a ChildActor 3. Cancels 5 tasks with recursive=True and 5 with recursive=False 4. Expects that for recursive=True, both the parent task and its child tasks are cancelled; for recursive=False, only the parent task is cancelled. The issue is in the verification loop: when checking whether the parent tasks are cancelled, the test uses `run_ref` (a stale loop variable from the previous loop) instead of `run_refs[i]`. This causes the test to verify the same task (`run_refs[9]`) ten times, rather than verifying all 10 tasks. This PR fixes the issue by using `run_refs[i]` to correctly verify each task, as illustrated in the sketch below. Signed-off-by: yicheng <yicheng@anyscale.com> Co-authored-by: yicheng <yicheng@anyscale.com>
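A purely illustrative sketch of the stale-loop-variable pattern (stand-in strings, not the actual test code or Ray refs):

```python
# Illustrative sketch: why reusing a stale loop variable verifies only the
# last ref instead of each one.
run_refs = [f"task-{i}" for i in range(10)]

# A first loop leaves `run_ref` bound to the final element.
for run_ref in run_refs:
    pass

# Buggy verification: checks run_refs[9] ten times.
buggy_checks = [run_ref for _ in range(len(run_refs))]

# Fixed verification: checks each task's own ref.
fixed_checks = [run_refs[i] for i in range(len(run_refs))]

assert buggy_checks == ["task-9"] * 10
assert fixed_checks == run_refs
```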
This is being done as part of cataloging the ray serve env var as per the doc: https://docs.google.com/spreadsheets/d/1mU_ds6_hI39dK-7zZEFr4SBgHJoASpTvqW5t0yi596A/edit?usp=sharing This PR removes support for several environment variables that were used to override Ray Serve HTTP and gRPC configuration settings. These settings should now be configured exclusively through the Serve config API (http_options). Additionally, this PR adds documentation for the RAY_SERVE_GRPC_MAX_MESSAGE_SIZE environment variable.

### Removed Environment Variables

| Environment Variable | Default Value | Alternative |
|---------------------|---------------|-------------|
| `RAY_SERVE_DEFAULT_HTTP_HOST` | `127.0.0.1` | Use `http_options.host` in config |
| `RAY_SERVE_DEFAULT_HTTP_PORT` | `8000` | Use `http_options.port` in config |
| `RAY_SERVE_DEFAULT_GRPC_PORT` | `9000` | Use `grpc_options.port` in config |
| `RAY_SERVE_HTTP_KEEP_ALIVE_TIMEOUT_S` | `0` (disabled) | Use `http_options.keep_alive_timeout_s` in config |
| `RAY_SERVE_REQUEST_PROCESSING_TIMEOUT_S` | `0.0` (disabled) | Use `http_options.request_timeout_s` in config |
| `SERVE_REQUEST_PROCESSING_TIMEOUT_S` | `0.0` (disabled) | Use `http_options.request_timeout_s` in config |

### Changes

- `python/ray/serve/_private/constants.py` - Replaced environment variable lookups with hardcoded default values
- `doc/source/serve/http-guide.md` - Removed documentation for RAY_SERVE_HTTP_KEEP_ALIVE_TIMEOUT_S environment variable
- `doc/source/serve/advanced-guides/grpc-guide.md` - Added new section "Configure gRPC message size limits" documenting the RAY_SERVE_GRPC_MAX_MESSAGE_SIZE environment variable - Updated introduction to include the new topic
- `python/ray/serve/tests/test_proxy.py` - Removed test_set_keep_alive_timeout_in_env test - Removed test_set_timeout_keep_alive_in_both_config_and_env test
- `python/ray/serve/tests/unit/test_http_util.py` - Removed mock_env_constants fixture - Simplified test_basic_configuration (formerly test_basic_configuration_with_mock_env) - Removed test_keep_alive_timeout_override_from_env test - Removed test_request_timeout_preserved_when_already_set test
- `python/ray/serve/tests/test_request_timeout.py` - Updated all tests to use serve.start(http_options={"request_timeout_s": ...}) instead of environment variable parametrization

---------

Signed-off-by: harshit <harshit@anyscale.com>
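A minimal sketch of the replacement path (the numeric values are illustrative, mirroring the former defaults in the table above):

```python
from ray import serve

# Sketch: configure HTTP/gRPC settings through the Serve API instead of the
# removed environment variables. Values shown are illustrative only.
serve.start(
    http_options={
        "host": "127.0.0.1",            # was RAY_SERVE_DEFAULT_HTTP_HOST
        "port": 8000,                   # was RAY_SERVE_DEFAULT_HTTP_PORT
        "keep_alive_timeout_s": 5,      # was RAY_SERVE_HTTP_KEEP_ALIVE_TIMEOUT_S
        "request_timeout_s": 30.0,      # was *_REQUEST_PROCESSING_TIMEOUT_S
    },
    grpc_options={"port": 9000},        # was RAY_SERVE_DEFAULT_GRPC_PORT
)
```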
- adding template for ray serve async inference feature --------- Signed-off-by: harshit <harshit@anyscale.com> Co-authored-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Only keeping `requirements_py310.*`; the `requirements_buildkite.*` files are for Python 3.9. Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
## Description Minor cleanup of the _concurrency of the `Read` op, since we already have _compute (strategy) under `AbstractMap`. This does not change the public `read_*` APIs, which keep their `concurrency` args. Signed-off-by: Kit Lee <7000003+wingkitlee0@users.noreply.github.com>
## Description Upgrading msgpack for the Python 3.13 upgrade. --------- Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
…y-project#58366) Signed-off-by: Colin Wang <conlinwang@ntu.edu.tw> Co-authored-by: Seiji Eicher <seiji@anyscale.com>
ray-project#59218: emitting target replicas on every update cycle so that we can compare them with actual replicas on a time series. Signed-off-by: abrar <abrar@anyscale.com>
…ject#59490) Signed-off-by: Timothy Seah <tseah@anyscale.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
## Description I created a custom aggregator by creating a child class overriding `accumulate_block`, `combine`, and `_finalize` as instructed in the documentation/examples. That sort of worked, but I always got an extra nested list back... Adding debug messages to the `_finalize` method made me realize that this function is never executed. Reading the source code at https://github.com/ray-project/ray/blob/master/python/ray/data/aggregate.py#L282 made me realize that `_finalize` should be `finalize`. PR attached. Signed-off-by: Achim Gädke <135793393+AchimGaedkeLynker@users.noreply.github.com> Co-authored-by: Praveen <praveeng@anyscale.com>
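A self-contained sketch of why the override name matters (stand-in classes, not the real `ray.data.aggregate` API): the framework invokes `finalize`, so a subclass that overrides `_finalize` is silently ignored.

```python
# Stand-in sketch: the framework calls `finalize`, so overriding a
# differently named `_finalize` hook never takes effect.
class AggregatorBase:
    def finalize(self, accumulator):
        return accumulator          # default: identity

    def run(self, blocks):
        acc = sum(blocks)           # stand-in for accumulate_block/combine
        return self.finalize(acc)

class WrongHook(AggregatorBase):
    def _finalize(self, accumulator):   # wrong name: never invoked
        return [accumulator]

class RightHook(AggregatorBase):
    def finalize(self, accumulator):    # correct name: invoked by run()
        return [accumulator]

print(WrongHook().run([1, 2, 3]))   # 6   -> custom logic skipped
print(RightHook().run([1, 2, 3]))   # [6] -> custom logic applied
```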
…ation (ray-project#59529) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Currently, there's no public API to retrieve the Ray session name. Users need to use private APIs like `ray._private.worker.global_worker.node.session_name` or query the dashboard REST API. This makes it difficult to filter Prometheus metrics by cluster when multiple clusters run the same application name, since Ray metrics use the `SessionName` label (which contains the session_name value). --------- Signed-off-by: Marwan Sarieddine <sarieddine.marwan@gmail.com>
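For reference, the private workaround mentioned above looks like the following sketch (the new public API added by this PR is the preferred path):

```python
import ray

ray.init()

# Private workaround described above: read the session name that appears as
# the SessionName label on Ray's Prometheus metrics.
session_name = ray._private.worker.global_worker.node.session_name
print(session_name)
```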
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
…oject#59605) Created by release automation bot. Update with commit 0ddb7ee Signed-off-by: Lonnie Liu <lonnie@anyscale.com> Co-authored-by: Lonnie Liu <lonnie@anyscale.com>
## Description Before this PR, BC and MARWIL would not pick up LR schedules because we'd never send timesteps to the learner. After this PR, we do send timesteps there to update LR after gradient updates. --------- Signed-off-by: Artur Niederfahrenhorst <attaismyname@googlemail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Kamil Kaczmarek <kaczmarek.poczta@gmail.com>
…um (ray-project#59468) Signed-off-by: dayshah <dhyey2019@gmail.com>
…tive solution (ray-project#56082) ## Why are these changes needed? When we retry EnvRunner._sample, we call EnvRunner._sample again. This makes for a recursive solution, leaving already-done episodes in done_episodes_to_return. This PR makes it so that ... - We stay true to the promise of the docstring of the EnvRunner.sample() methods that we return AT LEAST n timesteps, even if envs are restarted. Before, we would not reset how many timesteps had been sampled when resetting the environment (and thus starting a new episode to collect chunks from). - ... `done_episodes_to_return` does not start from zero if we retry in the new recursive call. - ... we can make an arbitrary number of calls to env_runner.sample(), sampling episodes and timesteps as we like. Before, we'd break if the number of episodes or timesteps was reached, thus leaving other episodes in a dangling state even if they were finished (ran the included test on the old version of the code to confirm). - ... we don't recurse, thereby avoiding risk of future memory leaks. See the iterative sketch below. --------- Co-authored-by: Kamil Kaczmarek <kaczmarek.poczta@gmail.com>
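A minimal sketch of the iterative shape (hypothetical helper names, not RLlib's actual code): keep sampling in a loop and accumulate finished episodes across env restarts until at least `num_timesteps` are collected, instead of recursing on failure.

```python
import random

# Sketch (hypothetical names): accumulate done episodes across env restarts
# instead of recursing into the sampling method.
def sample_at_least(env_step_fn, num_timesteps):
    done_episodes_to_return = []
    ts_collected = 0
    while ts_collected < num_timesteps:
        try:
            episode = env_step_fn()        # may raise if the env crashes
        except RuntimeError:
            continue                       # restart; keep episodes already done
        done_episodes_to_return.append(episode)
        ts_collected += len(episode)
    return done_episodes_to_return

def fake_env_step():
    if random.random() < 0.2:
        raise RuntimeError("env crashed")
    return [0] * random.randint(5, 20)     # a finished episode of N timesteps

episodes = sample_at_least(fake_env_step, num_timesteps=100)
assert sum(len(e) for e in episodes) >= 100
```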
## Description Fix: the ingress deployment name could be modified if a child deployment has the same name. ## Related issues Fixes ray-project#53295 ## Additional information Because there is a child app with the same name as the ingress deployment app, the ingress deployment app name was modified during the _build_app_recursive function. Therefore we should use the modified name instead. Another solution is changing the child_app name instead of the ingress deployment app name. --------- Signed-off-by: Le Duc Manh <naruto12308@gmail.com>
…er initialization (ray-project#59611) ## Description Fixes a race condition in `MetricsAgentClientImpl::WaitForServerReadyWithRetry` where concurrent HealthCheck callbacks could both attempt to initialize the exporter, causing GCS to crash with:
```
Check failed: !exporting_started_ RayEventRecorder::StartExportingEvents() should be called only once.
```
The `exporter_initialized_` flag was a non-atomic bool. When multiple HealthCheck RPCs completed simultaneously, their callbacks could both read false before either set it to true, leading to `init_exporter_fn` being called twice. Changed the flag to `std::atomic<bool>` to ensure only one callback wins the race. Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
…STOP_REQUESTED in autoscaler v2 (ray-project#59550) ## Description When the autoscaler attempts to terminate QUEUED instances to enforce the `max_num_nodes_per_type` limit, the reconciler crashes with an assertion error. This happens because QUEUED instances are selected for termination, but the state machine doesn't allow transitioning them to a terminated state. The reconciler assumes all non-ALLOCATED instances have Ray running and attempts to transition QUEUED → RAY_STOP_REQUESTED, which is invalid. https://github.com/ray-project/ray/blob/ba727da47a1a4af1f58c1642839deb0defd82d7a/python/ray/autoscaler/v2/instance_manager/reconciler.py#L1178-L1197 This occurs when the `max_workers` configuration is dynamically reduced or when instances exceed the limit.
```
2025-12-04 06:21:55,298 INFO event_logger.py:77 -- Removing 167 nodes of type elser-v2-ingest (max number of worker nodes per type reached).
2025-12-04 06:21:55,307 - INFO - Update instance QUEUED->RAY_STOP_REQUESTED (id=e13a1528-ffd9-403b-9fd1-b54e9c2698a0, type=elser-v2-ingest, cloud_instance_id=, ray_id=): draining ray: Terminating node due to MAX_NUM_NODE_PER_TYPE: max_num_nodes=None, max_num_nodes_per_type=220
2025-12-04 06:21:55,307 INFO instance_manager.py:263 -- Update instance QUEUED->RAY_STOP_REQUESTED (id=e13a1528-ffd9-403b-9fd1-b54e9c2698a0, type=elser-v2-ingest, cloud_instance_id=, ray_id=): draining ray: Terminating node due to MAX_NUM_NODE_PER_TYPE: max_num_nodes=None, max_num_nodes_per_type=220
2025-12-04 06:21:55,307 - ERROR - Invalid status transition from QUEUED to RAY_STOP_REQUESTED
```
This PR adds a valid transition `QUEUED -> TERMINATED` to allow canceling queued instances. ## Related issues Closes ray-project#59219 Signed-off-by: win5923 <ken89@kimo.com>
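A small sketch of the shape of the change (hypothetical transition table, not the real reconciler code): `QUEUED -> TERMINATED` is added to the set of valid transitions so queued instances can be cancelled directly.

```python
# Sketch (hypothetical states/table): allow cancelling QUEUED instances
# directly instead of routing them through RAY_STOP_REQUESTED.
VALID_TRANSITIONS = {
    "QUEUED": {"REQUESTED", "TERMINATED"},          # TERMINATED newly allowed
    "ALLOCATED": {"RAY_STOP_REQUESTED", "TERMINATED"},
    "RAY_STOP_REQUESTED": {"TERMINATED"},
}

def transition(status, new_status):
    if new_status not in VALID_TRANSITIONS.get(status, set()):
        raise ValueError(f"Invalid status transition from {status} to {new_status}")
    return new_status

# Enforcing max_num_nodes_per_type on a queued instance no longer crashes:
print(transition("QUEUED", "TERMINATED"))
```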
## Description When running the code below:
```
from ray import ActorID
ActorID.nil().job_id
```
or
```
from ray import TaskID
TaskID.nil().job_id()
```
the error below is shown: <img width="1912" height="331" alt="Screenshot 2025-12-18 18:49:18" src="https://github.com/user-attachments/assets/b4200ef8-10df-4c91-83ff-f96f7874b0ce" /> The program should throw an error instead of crashing, and this PR fixes it by adding a helper function to do a nil check. ## Related issues Closes [ray-project#53872](ray-project#53872) ## Additional information After the fix, it now throws a `ValueError`: <img width="334" height="52" alt="Screenshot 2025-12-20 20:47:30" src="https://github.com/user-attachments/assets/00228923-2d26-4cb4-bf53-615945d2ce6c" /> <img width="668" height="103" alt="Screenshot 2025-12-20 20:47:49" src="https://github.com/user-attachments/assets/ee68213a-681a-4499-bef2-2e13533e3ffd" /> --------- Signed-off-by: Alex Wu <c.alexwu@gmail.com>
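A sketch of the post-fix behavior described above (the `ValueError` is what the PR reports after the fix):

```python
from ray import ActorID, TaskID

# After the fix, accessing job_id on a nil ID raises ValueError instead of
# crashing the process.
try:
    ActorID.nil().job_id
except ValueError as e:
    print("ActorID:", e)

try:
    TaskID.nil().job_id()
except ValueError as e:
    print("TaskID:", e)
```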
…0 GPUs on CPU-only cluster (ray-project#59514) If you request zero GPUs from the autoscaling coordinator but GPUs don't exist on the cluster, the autoscaling coordinator crashes. This PR fixes that bug. Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
…object construction (ray-project#59500) ## Description Reduce overhead added by token authentication: - Return shared_ptr from AuthenticationTokenLoader::GetToken() instead of constructing a new AuthenticationToken object copy every time (which would also add object destruction overhead) - Cache the token in the client interceptor at construction (previously GetToken() was called for every RPC) - Use CompareWithMetadata() to validate tokens directly from string_view without constructing new AuthenticationToken objects - Pass shared_ptr through ServerCallFactory to avoid per-call copies Release tests: without this change, the microbenchmark `multi_client_put_gigabytes` was in the 25-30 range, e.g. run: https://buildkite.com/ray-project/release/builds/70658; with this change it is in the 40-45 range: https://buildkite.com/ray-project/release/builds/72070 --------- Signed-off-by: sampan <sampan@anyscale.com> Co-authored-by: sampan <sampan@anyscale.com>
…project#60050) Add support for authenticating HTTPS downloads in runtime environments using bearer tokens via the RAY_RUNTIME_ENV_BEARER_TOKEN environment variable. Fixes [ray-project#46833](ray-project#46833) Signed-off-by: Denis Khachyan <khachyanda@gmail.com>
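A hedged usage sketch (the URL and archive name are placeholders; only the environment variable name comes from this PR):

```python
import os
import ray

# Sketch: the token is read from the environment and used when downloading
# the HTTPS-hosted runtime-env package. The URL below is a placeholder.
os.environ["RAY_RUNTIME_ENV_BEARER_TOKEN"] = "<token>"

ray.init(
    runtime_env={
        "working_dir": "https://example.com/private/my_working_dir.zip",
    }
)
```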
…ect#60014) ## Description This PR fixes a critical deadlock issue in Ray Client that occurs when garbage collection triggers `ClientObjectRef.__del__()` while the DataClient lock is held. When using Ray Client, a deadlock can occur in the following scenario: 1. Main thread acquires DataClient.lock (e.g., in _async_send()) 2. Garbage collection is triggered while holding the lock 3. GC calls `ClientObjectRef.__del__()` 4. `__del__()` attempts to call call_release() → _release_server() → DataClient.ReleaseObject() 5. ReleaseObject() tries to acquire the same DataClient.lock 6. Deadlock: The same thread tries to acquire a non-reentrant lock it already holds ## Related issues Fixes ray-project#59643 ## Additional information This PR implements a deferred release pattern that completely avoids the deadlock: 1. Deferred Release Queue: Introduces _release_queue (a thread-safe queue.SimpleQueue) to collect object IDs that need to be released 2. Background Release Thread: Adds _release_thread that processes the release queue asynchronously 3. Non-blocking `__del__`: `ClientObjectRef.__del__()` now only puts IDs into the queue (no lock acquisition) --------- Signed-off-by: redgrey1993 <ulyer555@hotmail.com> Co-authored-by: redgrey1993 <ulyer555@hotmail.com>
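A self-contained sketch of the deferred-release pattern (simplified stand-in class, not the actual Ray Client code):

```python
import queue
import threading

# Sketch: __del__ only enqueues the ID; a background thread performs the
# lock-taking release call, so the GC path never touches the lock.
class DataClientSketch:
    def __init__(self):
        self.lock = threading.Lock()
        self._release_queue = queue.SimpleQueue()
        self._release_thread = threading.Thread(
            target=self._process_releases, daemon=True
        )
        self._release_thread.start()

    def _process_releases(self):
        while True:
            object_id = self._release_queue.get()
            if object_id is None:            # shutdown sentinel
                return
            with self.lock:                  # safe: runs on its own thread
                print(f"released {object_id}")

    def schedule_release(self, object_id):
        # Called from __del__: never acquires self.lock directly.
        self._release_queue.put(object_id)


client = DataClientSketch()
client.schedule_release("obj-1")
client._release_queue.put(None)
client._release_thread.join()
```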
Context --- This change aims at revisiting the `HashShuffleAggregator` protocol by - Removing the global lock (per aggregator) - Making the shard-accepting flow lock-free - Relocating all state from `ShuffleAggregation` into the Aggregator itself - Adding dynamic compaction (exponentially increasing compaction period) to amortize compaction costs - Adding debugging state dumps --------- Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
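A small sketch of the dynamic-compaction idea (illustrative numbers and a stand-in buffer, not the actual aggregator): the compaction period grows exponentially so compaction cost is amortized as more shards arrive.

```python
# Sketch of exponentially growing compaction periods (illustrative only):
# compact after 8 shards, then 16, then 32, ... to amortize compaction cost.
class CompactingBuffer:
    def __init__(self, initial_period=8, growth_factor=2):
        self.shards = []
        self.compaction_period = initial_period
        self.growth_factor = growth_factor

    def add(self, shard):
        self.shards.append(shard)
        if len(self.shards) >= self.compaction_period:
            self.shards = [sum(self.shards)]              # stand-in "compaction"
            self.compaction_period *= self.growth_factor  # next compaction later

buf = CompactingBuffer()
for shard in range(100):
    buf.add(shard)
print(buf.shards, buf.compaction_period)
```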
Adding CONSTRAINTS_FILE docker arg for the ray base-deps image. Release test run: https://buildkite.com/ray-project/release/builds/74879 --------- Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
## Description 1. The jax dependency was introduced in ray-project#58322. 2. The current test environment is for CUDA 12.1, which limits the jax version to below 0.4.14. 3. jax <= 0.4.14 does not support Python 3.12. 4. Skip the jax tests if they run against Python 3.12+. --------- Signed-off-by: Lehui Liu <lehui@anyscale.com>
…se (ray-project#60080) Signed-off-by: Future-Outlier <eric901201@gmail.com>
## Description Add a model inference release test that closely reflects user workloads. Release test run: https://console.anyscale-staging.com/cld_vy7xqacrvddvbuy95auinvuqmt/prj_xqmpk8ps6civt438u1hp5pi88g/jobs/prodjob_glehkcquv9k26ta69f8lkc94nl?job-logs-section-tabs=application_logs&job-tab=overview&metrics-tab=data --------- Signed-off-by: Goutam <goutam@anyscale.com>
Reverts ray-project#59983. The symlink does not work with the newer version of wanda, where the newer version of wanda is doing the right thing.
…ct#59987) - Bump .rayciversion from 0.21.0 to 0.25.0 - Move rules files to .buildkite/ with *.rules.txt naming convention - Add always.rules.txt for always-run lint rules - Add test.rules.test.txt with test cases - Add test-rules CI step in cicd.rayci.yml (auto-discovery) - Update macOS config to use new rules file paths Topic: update-rayci-latest Signed-off-by: andrew <andrew@anyscale.com>
…ay-project#60057) ## Summary When running prefill-decode disaggregation with NixlConnector and data parallelism, both prefill and decode deployments were using the same port base for their ZMQ side channel. This caused "Address already in use" errors when both deployments had workers on the same node:
```
zmq.error.ZMQError: Address already in use (addr='tcp://10.0.75.118:40009')
Exception in thread nixl_handshake_listener
```
## Changes Fix by setting different `NIXL_SIDE_CHANNEL_PORT_BASE` values for prefill (40000) and decode (41000) configs to ensure port isolation. ## Test plan - Run `test_llm_serve_prefill_decode_with_data_parallelism` - should complete without timeout - The test previously hung forever waiting for "READY message from DP Coordinator" Signed-off-by: Seiji Eicher <seiji@anyscale.com>
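A sketch of the shape of the fix (the dict layout is hypothetical and heavily abbreviated; the variable name and port values come from this PR): give prefill and decode different side-channel port bases so co-located workers don't collide.

```python
# Sketch (abbreviated, hypothetical config shape): isolate the NIXL ZMQ side
# channel per role by using distinct port bases.
prefill_config = {
    "runtime_env": {"env_vars": {"NIXL_SIDE_CHANNEL_PORT_BASE": "40000"}},
}
decode_config = {
    "runtime_env": {"env_vars": {"NIXL_SIDE_CHANNEL_PORT_BASE": "41000"}},
}

assert (
    prefill_config["runtime_env"]["env_vars"]["NIXL_SIDE_CHANNEL_PORT_BASE"]
    != decode_config["runtime_env"]["env_vars"]["NIXL_SIDE_CHANNEL_PORT_BASE"]
)
```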
…se (ray-project#60092) Signed-off-by: Future-Outlier <eric901201@gmail.com>
- Fix ProgressBar to honor `use_ray_tqdm` in `DataContext`. - Note that `tqdm_ray` is designed to work in non-interactive contexts (workers/actors) by sending JSON progress updates to the driver. --------- Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
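A minimal sketch of toggling the flag that the progress bar now honors (assuming the `use_ray_tqdm` attribute on `DataContext` referenced above):

```python
import ray

# Sketch: the flag ProgressBar should now honor lives on DataContext.
ctx = ray.data.DataContext.get_current()
ctx.use_ray_tqdm = False   # fall back to plain tqdm instead of tqdm_ray

ds = ray.data.range(1000).map(lambda row: row)
ds.materialize()
```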
…ray-project#59933) ## Description The `DefaultAutoscaler2` implementation needs an `AutoscalingCoordinator` and a way to get all of the `_NodeResourceSpec`. Currently, we can't explicitly inject fake implementations of either dependency. This is problematic because the tests need to assume what the implementation of each dependency looks like and use brittle mocks. To solve this: - Add the `FakeAutoscalingCoordinator` implementation to a new `fake_autoscaling_coordinator.py` module (you can use the code below) - `DefaultClusterAutoscalerV2` has two new parameters `autoscaling_coordinator: Optional[AutoscalingCoordinator] = None` and `get_node_counts: Callable[[], Dict[_NodeResourceSpec, int]] = get_node_resource_spec_and_count`. If `autoscaling_coordinator` is None, you can use the default implementation. - Update `test_try_scale_up_cluster` to use the explicit seams rather than mocks. Where possible, assert against the public interface rather than implementation details ## Related issues Closes ray-project#59683 --------- Signed-off-by: 400Ping <fourhundredping@gmail.com> Signed-off-by: Ping <fourhundredping@gmail.com>
## Description RLlib's rayci.yml [file](https://github.com/ray-project/ray/blob/master/.buildkite/rllib.rayci.yml) and the BUILD.bazel [file](https://github.com/ray-project/ray/blob/master/rllib/BUILD.bazel) are disconnected, such that there are old tags in the BUILD file that are not in the rayci file, and vice versa. This PR attempts to clean up both files without modifying which tests are or aren't run currently. --------- Signed-off-by: Mark Towers <mark@anyscale.com> Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com> Co-authored-by: Mark Towers <mark@anyscale.com> Co-authored-by: Kamil Kaczmarek <kamil@anyscale.com>
…g tracer file handles (ray-project#60078) This fix resolves Serve's Windows test failure:
```
[2026-01-12T22:52:13Z] =================================== ERRORS ====================================
[2026-01-12T22:52:13Z] _______ ERROR at teardown of test_deployment_remote_calls_with_tracing ________
[2026-01-12T22:52:13Z]
[2026-01-12T22:52:13Z] @pytest.fixture
[2026-01-12T22:52:13Z] def cleanup_spans():
[2026-01-12T22:52:13Z] """Cleanup temporary spans_dir folder at beginning and end of test."""
[2026-01-12T22:52:13Z] if os.path.exists(spans_dir):
[2026-01-12T22:52:13Z] shutil.rmtree(spans_dir)
[2026-01-12T22:52:13Z] os.makedirs(spans_dir, exist_ok=True)
[2026-01-12T22:52:13Z] yield
[2026-01-12T22:52:13Z] # Enable tracing only sets up tracing once per driver process.
[2026-01-12T22:52:13Z] # We set ray.__traced__ to False here so that each
[2026-01-12T22:52:13Z] # test will re-set up tracing.
[2026-01-12T22:52:13Z] ray.__traced__ = False
[2026-01-12T22:52:13Z] if os.path.exists(spans_dir):
[2026-01-12T22:52:13Z] > shutil.rmtree(spans_dir)
[2026-01-12T22:52:13Z]
[2026-01-12T22:52:13Z] python\ray\serve\tests\test_serve_with_tracing.py:30:
[2026-01-12T22:52:13Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2026-01-12T22:52:13Z] C:\Miniconda3\lib\shutil.py:750: in rmtree
[2026-01-12T22:52:13Z] return _rmtree_unsafe(path, onerror)
[2026-01-12T22:52:13Z] C:\Miniconda3\lib\shutil.py:620: in _rmtree_unsafe
[2026-01-12T22:52:13Z] onerror(os.unlink, fullname, sys.exc_info())
[2026-01-12T22:52:13Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2026-01-12T22:52:13Z]
[2026-01-12T22:52:13Z] path = '/tmp/spans/'
[2026-01-12T22:52:13Z] onerror = <function rmtree.<locals>.onerror at 0x000002C0FFBBDA20>
[2026-01-12T22:52:13Z]
[2026-01-12T22:52:13Z] def _rmtree_unsafe(path, onerror):
[2026-01-12T22:52:13Z] try:
[2026-01-12T22:52:13Z] with os.scandir(path) as scandir_it:
[2026-01-12T22:52:13Z] entries = list(scandir_it)
[2026-01-12T22:52:13Z] except OSError:
[2026-01-12T22:52:13Z] onerror(os.scandir, path, sys.exc_info())
[2026-01-12T22:52:13Z] entries = []
[2026-01-12T22:52:13Z] for entry in entries:
[2026-01-12T22:52:13Z] fullname = entry.path
[2026-01-12T22:52:13Z] if _rmtree_isdir(entry):
[2026-01-12T22:52:13Z] try:
[2026-01-12T22:52:13Z] if entry.is_symlink():
[2026-01-12T22:52:13Z] # This can only happen if someone replaces
[2026-01-12T22:52:13Z] # a directory with a symlink after the call to
[2026-01-12T22:52:13Z] # os.scandir or entry.is_dir above.
[2026-01-12T22:52:13Z] raise OSError("Cannot call rmtree on a symbolic link")
[2026-01-12T22:52:13Z] except OSError:
[2026-01-12T22:52:13Z] onerror(os.path.islink, fullname, sys.exc_info())
[2026-01-12T22:52:13Z] continue
[2026-01-12T22:52:13Z] _rmtree_unsafe(fullname, onerror)
[2026-01-12T22:52:13Z] else:
[2026-01-12T22:52:13Z] try:
[2026-01-12T22:52:13Z] > os.unlink(fullname)
[2026-01-12T22:52:13Z] E PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: '/tmp/spans/15464.txt'
[2026-01-12T22:52:13Z]
[2026-01-12T22:52:13Z] C:\Miniconda3\lib\shutil.py:618: PermissionError
```
**Cause:** The `setup_local_tmp_tracing.py` module opens a file handle for the `ConsoleSpanExporter` that is never explicitly closed. On Windows, files cannot be deleted while they're open, causing `shutil.rmtree` to fail with `PermissionError: [WinError 32]` during the `cleanup_spans` fixture teardown.
**Fix:** Added `trace.get_tracer_provider().shutdown()` in the `ray_serve_with_tracing` fixture teardown to properly flush and close the span exporter's file handles before the cleanup fixture attempts to delete the spans directory. --------- Signed-off-by: doyoung <doyoung@anyscale.com>
### Why are these changes needed?
When `fit()` is called multiple times on a `Preprocessor`, the `stats_`
dict was not being reset before computing new stats. This caused stale
stats from previous `fit()` calls to persist when stat keys are
data-dependent (e.g., when columns are auto-detected from the data).
This violates the documented behavior in the `fit()` docstring:
> Calling it more than once will overwrite all previously fitted state:
> `preprocessor.fit(A).fit(B)` is equivalent to `preprocessor.fit(B)`.
**Example of the bug:**
```python
# Preprocessor that auto-detects columns from data
preprocessor = DataDependentPreprocessor()
# Dataset A has columns: a, b
dataset_a = ray.data.from_items([{"a": 1.0, "b": 10.0}, ...])
preprocessor.fit(dataset_a)
# stats_ = {"mean(a)": 2.0, "mean(b)": 20.0}
# Dataset B has columns: b, c (no "a" column)
dataset_b = ray.data.from_items([{"b": 100.0, "c": 1000.0}, ...])
preprocessor.fit(dataset_b)
# BUG: stats_ = {"mean(a)": 2.0, "mean(b)": 200.0, "mean(c)": 2000.0}
# "mean(a)" is STALE - it should not exist!
# EXPECTED: stats_ = {"mean(b)": 200.0, "mean(c)": 2000.0}
```
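A minimal sketch of the fix under the assumption that fitted state lives in a `stats_` dict (stand-in preprocessor, not the actual `ray.data` preprocessor code): clear the previous state at the start of `fit()` so stale, data-dependent keys cannot survive.

```python
# Sketch (stand-in class): reset stats_ at the start of fit() so stale keys
# from a previous fit() cannot persist.
class PreprocessorSketch:
    def __init__(self):
        self.stats_ = {}

    def fit(self, dataset):
        self.stats_ = {}                 # the fix: drop previously fitted state
        for col, values in dataset.items():
            self.stats_[f"mean({col})"] = sum(values) / len(values)
        return self

p = PreprocessorSketch()
p.fit({"a": [1.0, 3.0], "b": [10.0, 30.0]})
p.fit({"b": [100.0, 300.0], "c": [1000.0, 3000.0]})
print(p.stats_)   # {"mean(b)": 200.0, "mean(c)": 2000.0} -- no stale "mean(a)"
```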
---------
Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
Signed-off-by: xgui <xgui@anyscale.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: sampan <sampan@anyscale.com> Co-authored-by: sampan <sampan@anyscale.com>
…project#60072) Signed-off-by: dayshah <dhyey2019@gmail.com> Signed-off-by: Dhyey Shah <dhyey2019@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…ay-project#60037) ## Description As mentioned in ray-project#59740 (comment), add explicit args in `_AutoscalingCoordinatorActor` constructor to improve maintainability. ## Related issues Follow-up: ray-project#59740 ## Additional information - Pass in mock function in testing as args rather than using `patch` --------- Signed-off-by: machichima <nary12321@gmail.com> Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
…ay-project#60028) Capture the install script content in BuildContext digest by inlining it as a constant and adding install_python_deps_script_digest field. This ensures build reproducibility when the script changes. Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com> Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
the "test-rules" test job was missing the forge dependency Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
This migrates ray wheel builds from CLI-based approach to wanda-based container builds for x86_64. Changes: - Add ray-wheel.wanda.yaml and Dockerfile for wheel builds - Update build.rayci.yml wheel steps to use wanda - Add wheel upload steps that extract from wanda cache Topic: ray-wheel Signed-off-by: andrew <andrew@anyscale.com>
Note: The number of changes in this pull request is too large for Gemini Code Assist to generate a review.
This pull request has been automatically marked as stale because it has not had recent activity. You can always ask for help on our discussion forum or Ray's public Slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed.
This Pull Request was created automatically to merge the latest changes from master into main branch.
Created: 2026-01-14
Merge direction: master → main
Triggered by: Scheduled
Please review and merge if everything looks good.