daily merge: master → main 2026-01-15#746
Conversation
…er initialization (ray-project#59611) ## Description Fixes a race condition in `MetricsAgentClientImpl::WaitForServerReadyWithRetry` where concurrent HealthCheck callbacks could both attempt to initialize the exporter, causing GCS to crash with: ``` Check failed: !exporting_started_ RayEventRecorder::StartExportingEvents() should be called only once. ``` The `exporter_initialized_` flag was a non-atomic bool. When multiple HealthCheck RPCs completed simultaneously, their callbacks could both read false before either set it to true, leading to `init_exporter_fn` being called twice. Changed the flag to `std::atomic<bool>` to ensure only one callback wins the race. Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
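For illustration, a minimal Python analogue of the pattern at play (the actual fix uses `std::atomic<bool>` in C++; the class and method names here are purely hypothetical): a plain check-then-set lets two callbacks both observe the flag as unset, while a test-and-set with a single winner mirrors what the atomic exchange gives the C++ callback.

```python
import threading

class ExporterInitializer:
    """Toy model of the HealthCheck callback path; not Ray's actual code."""

    def __init__(self):
        self._initialized = False
        self._lock = threading.Lock()
        self.init_calls = 0  # how many times init_exporter_fn effectively ran

    def on_health_check_ok_racy(self):
        # Plain check-then-set: two threads can both read False here before
        # either writes True, so the exporter may be started twice.
        if not self._initialized:
            self._initialized = True
            self.init_calls += 1

    def on_health_check_ok_safe(self):
        # Test-and-set with a single winner, mirroring what
        # std::atomic<bool>::exchange(true) gives the C++ callback.
        with self._lock:
            already_set, self._initialized = self._initialized, True
        if not already_set:
            self.init_calls += 1

if __name__ == "__main__":
    init = ExporterInitializer()
    threads = [threading.Thread(target=init.on_health_check_ok_safe) for _ in range(8)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("init_exporter_fn calls:", init.init_calls)  # always 1 on the safe path
```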
…STOP_REQUESTED in autoscaler v2 (ray-project#59550) ## Description When the autoscaler attempts to terminate QUEUED instances to enforce the `max_num_nodes_per_type` limit, the reconciler crashes with an assertion error. This happens because QUEUED instances are selected for termination, but the state machine doesn't allow transitioning them to a terminated state. The reconciler assumes all non-ALLOCATED instances have Ray running and attempts to transition QUEUED → RAY_STOP_REQUESTED, which is invalid. https://github.com/ray-project/ray/blob/ba727da47a1a4af1f58c1642839deb0defd82d7a/python/ray/autoscaler/v2/instance_manager/reconciler.py#L1178-L1197 This occurs when the `max_workers` configuration is dynamically reduced or when instances exceed the limit. ``` 2025-12-04 06:21:55,298 INFO event_logger.py:77 -- Removing 167 nodes of type elser-v2-ingest (max number of worker nodes per type reached). 2025-12-04 06:21:55,307 - INFO - Update instance QUEUED->RAY_STOP_REQUESTED (id=e13a1528-ffd9-403b-9fd1-b54e9c2698a0, type=elser-v2-ingest, cloud_instance_id=, ray_id=): draining ray: Terminating node due to MAX_NUM_NODE_PER_TYPE: max_num_nodes=None, max_num_nodes_per_type=220 2025-12-04 06:21:55,307 INFO instance_manager.py:263 -- Update instance QUEUED->RAY_STOP_REQUESTED (id=e13a1528-ffd9-403b-9fd1-b54e9c2698a0, type=elser-v2-ingest, cloud_instance_id=, ray_id=): draining ray: Terminating node due to MAX_NUM_NODE_PER_TYPE: max_num_nodes=None, max_num_nodes_per_type=220 2025-12-04 06:21:55,307 - ERROR - Invalid status transition from QUEUED to RAY_STOP_REQUESTED ``` This PR adds a valid transition `QUEUED -> TERMINATED` to allow canceling queued instances. ## Related issues Closes ray-project#59219 ## Additional information Signed-off-by: win5923 <ken89@kimo.com>
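A minimal sketch of the transition check this enables, with hypothetical enum and table names (the real state machine lives in the v2 instance manager): a queued instance that exceeds the per-type limit is cancelled directly instead of being routed through the Ray-stop path.

```python
from enum import Enum, auto

class InstanceStatus(Enum):
    QUEUED = auto()
    REQUESTED = auto()
    ALLOCATED = auto()
    RAY_STOP_REQUESTED = auto()
    TERMINATED = auto()

# Hypothetical transition table; QUEUED -> TERMINATED is the newly allowed edge.
VALID_TRANSITIONS = {
    InstanceStatus.QUEUED: {InstanceStatus.REQUESTED, InstanceStatus.TERMINATED},
    InstanceStatus.ALLOCATED: {InstanceStatus.RAY_STOP_REQUESTED},
    InstanceStatus.RAY_STOP_REQUESTED: {InstanceStatus.TERMINATED},
}

def set_status(current: InstanceStatus, new: InstanceStatus) -> InstanceStatus:
    """Reject invalid transitions up front instead of asserting deep in the reconciler."""
    if new not in VALID_TRANSITIONS.get(current, set()):
        raise ValueError(f"Invalid status transition from {current.name} to {new.name}")
    return new

# Enforcing max_num_nodes_per_type on a QUEUED instance now cancels it directly.
print(set_status(InstanceStatus.QUEUED, InstanceStatus.TERMINATED).name)  # TERMINATED
```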
## Description When running the code below: ``` from ray import ActorID ActorID.nil().job_id ``` or ``` from ray import TaskID TaskID.nil().job_id() ``` the error below appears: <img width="1912" height="331" alt="screenshot 2025-12-18" src="https://github.com/user-attachments/assets/b4200ef8-10df-4c91-83ff-f96f7874b0ce" /> The program should throw an error instead of crashing, and this PR fixes it by adding a helper function to do a nil check. ## Related issues Closes [ray-project#53872](ray-project#53872) ## Additional information After the fix, it now throws a `ValueError`: <img width="334" height="52" alt="screenshot 2025-12-20" src="https://github.com/user-attachments/assets/00228923-2d26-4cb4-bf53-615945d2ce6c" /> <img width="668" height="103" alt="screenshot 2025-12-20" src="https://github.com/user-attachments/assets/ee68213a-681a-4499-bef2-2e13533e3ffd" /> --------- Signed-off-by: Alex Wu <c.alexwu@gmail.com>
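A self-contained sketch of the guard idea, using a toy stand-in class (the real helper lives in Ray's Cython ID bindings; names below are illustrative only): check for the nil ID first and raise a Python-level `ValueError` rather than letting native code crash the process.

```python
class ToyID:
    """Stand-in for Ray's ActorID/TaskID Cython classes; illustrative only."""

    NIL_BYTES = b"\xff" * 16

    def __init__(self, binary: bytes):
        self._binary = binary

    @classmethod
    def nil(cls) -> "ToyID":
        return cls(cls.NIL_BYTES)

    def is_nil(self) -> bool:
        return self._binary == self.NIL_BYTES

    @property
    def job_id(self) -> bytes:
        # Guard first: raise a Python ValueError instead of dereferencing a
        # nil ID in the native layer and crashing the interpreter.
        if self.is_nil():
            raise ValueError("Cannot get job_id of a nil ID.")
        return self._binary[:4]  # placeholder for the real derivation

print(ToyID.nil().is_nil())  # True
try:
    ToyID.nil().job_id
except ValueError as e:
    print(e)  # clean error instead of a hard crash
```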
…0 GPUs on CPU-only cluster (ray-project#59514) If you request zero GPUs from the autoscaling coordinator but GPUs don't exist on the cluster, the autoscaling coordinator crashes. This PR fixes that bug. Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
…object construction (ray-project#59500) ## Description Reduce overhead added by token authentication: - Return shared_ptr from AuthenticationTokenLoader::GetToken() instead of constructing a new AuthenticationToken object copy every time (which would also add object destruction overhead) - Cache token in client interceptor at construction (previously called GetToken() for every RPC) - Use CompareWithMetadata() to validate tokens directly from string_view without constructing new AuthenticationToken objects - Pass shared_ptr through ServerCallFactory to avoid per-call copies Release tests: without this change, the microbenchmark `multi_client_put_gigabytes` was in the 25-30 range, e.g. run: https://buildkite.com/ray-project/release/builds/70658; with this change it is in the 40-45 range: https://buildkite.com/ray-project/release/builds/72070 --------- Signed-off-by: sampan <sampan@anyscale.com> Co-authored-by: sampan <sampan@anyscale.com>
## Description This is a series of PRs to refactor/consolidate progress reporting and to decouple it from the executor, opstates, etc. ## Related issues Split from ray-project#58173 ## Additional information N/A --------- Signed-off-by: kyuds <kyuseung1016@gmail.com> Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>
## Description Move extension types to ray.data: ray.air.util.object_extensions -> ray.data._internal.object_extensions, ray.air.util.tensor_extensions -> ray.data._internal.tensor_extensions ## Related issues Closes ray-project#59418 Signed-off-by: will <zzchun8@gmail.com>
…ovements. (ray-project#59544) ## Description APPO targets high-throughput RL and relies on asynchrony - via actor parallelism and multi-threading. In our tests, the local-learner setup (num_learners=0) underperformed, prompting a deeper investigation into root causes and improvements as follows. ### Causes - __Producer-driven + weak backpressure:__ The data-flow was producer-driven. There was no bounded, blocking handoff to pace producers. The in-queue for the `LearnerThread` was a `deque`, which is non-blocking. As a result, the `Learner` wasn't setting the pace (sawtooth throughput and stalls at the 20-update barrier, i.e. learner starved; furthermore: CPU waste and a hot GIL when polling from empty queues). - __Thread contention at multiple places:__ `Learner` and `_LearnerThread` shared the `_num_updates_lock` and `metrics._threading_lock` (`RLock`). On every update, both threads contended on the same lock. Every 20 updates, both threads occasionally contended on the other shared lock. - __Global timesteps, race condition:__ `update()` wrote a global `_CURRENT_GLOBAL_TIMESTEPS` and the `_LearnerThread` had to read it later. Rapid calls to `update()` could overwrite this before the thread consumed it, i.e. timesteps could mismatch the batch actually trained. - __Aggregation race + spurious reduces:__ The "every 20 updates" path checked a copy of `_num_updates` without synchronization, then reset it at `>=20` - but without any threading event/condition to align the producer with the consumer. This often returned `{}` or reduced at odd times. - __No pinned memory or stream handoff:__ GPU copies used `pin_memory=False` and there was no explicit stream handoff; any implicit sync may land inside the learner update timing. - __Reference resolving on the producer's hot path:__ `TrainingData.solve_refs()` was called synchronously inside `LearnerGroup.update()` before queuing, which sometimes cost around ~25% of the time in some calls and extends the window where producer and learner can drift. - __Mixed queue semantics + missing task semantics:__ The code mixed `queue.Queue` and `deque` (and optionally `CircularBuffer`). `task_done`/`join()` semantics don't exist for `deque`, so correctness relied on polling and manual drops. There was no bounded, blocking handoff to pace producers. This was brittle under load. - __No clean stop path:__ The thread used a `stop` flag but no sentinel was enqueued; if it was blocked/polling, the shutdown could hang or increment counters after a stop. - __Complete multi-learner stalling:__ In multi-learner setups with __multi-agent__ policies, asynchronous batches (i.e. batches with different combinations of policies) led to stalls in Torch's `DistributedDataParallel` asynchronous gradient synchronization. One rank computed gradients for a policy not present on the other rank(s) and waited indefinitely for synced gradients. ### Improvements by this PR - __Consumer-driven:__ `Learner` dictates the pace through blocking queues (`Learner` does `get()`, producer does `put`). That avoids busy polling (no CPU burn). Faster reloading through quick returns from `Learner.update()` when no results are ready. Avoids learner starvation - bigger queues absorb frequent producer burstiness. - __Edge-triggered aggregation:__ Only `_LearnerThread` increments a private counter and on __exactly__ the `broadcast_interval`th update fires an __event__ (`_agg_event`). The producer simply `wait`s for the event and `clear`s it (no lock fight).
Furthermore, the `_LearnerThread` now reduces metrics and returns them through an out-queue from which the `Learner` picks them up and returns them to the main process. All of these measures reduce thread contention to an absolute minimum. - __Pass metadata with the batch:__ The `Learner` enqueues a tuple `(batch, timesteps)` so the `_LearnerThread` consumes the correct timesteps atomically. This also reduces communication and boilerplate. - __(Optional) Deferral of reference resolving:__ Post-solve references in `Learner.update()` to return faster in asynchronous calls. - __Clean stop + consistent semantics:__ Use a `_STOP_SENTINEL` through the learner queue; no longer rely on a boolean flag alone. And call `task_done()` on a real `queue.Queue` (if not using `CircularBuffer`). Furthermore, the buffer/queue API inside `Learner` is unified. (A minimal sketch of this queue handoff appears after this description.) - __Safe-guarding multi-agent multi-learner training:__ Manual synchronization of gradients replaces Torch's `DistributedDataParallel` hooks-based synchronization for multi-learner multi-agent setups. Gradients on each rank are zero-padded and synced after all gradients have been computed. ## Related issues ## Additional information Because this PR reshapes the data flow, a few tuning tips are useful: - Circular buffer vs. simple queue. The old CircularBuffer prevented learner starvation but its push/pop are slower. The new consumer-driven pipeline is generally more efficient - assuming producers are reasonably fast and the learner queue isn't tiny. Use `use_circular_buffer=True` only when producing is expensive/irregular (it lets the learner keep iterating over buffered data, similar to `num_epochs > 1` but in cycles). Otherwise, prefer the simple queue. Recommended defaults: `simple_queue_size=32` for `APPO`; `IMPALA` keeps a smaller `learner_queue_size=3`. - Unified interval: broadcast & metrics reduction. Previously, weights were synced by `broadcast_interval` while metrics were reduced every fixed 20 updates. The new design unifies these: `broadcast_interval` now controls both weight sync and metrics reduction. In practice, ~10 balances steady flow with acceptable off-policy lag. - Scale producers to match a fast learner. The `_LearnerThread` applies updates quickly, so overall throughput is often producer-bound. To feed it well, increase `num_env_runners` and/or `num_envs_per_env_runner`. ### Next steps This PR improves dataflow focused on the learner(s). The next steps are: - To increase throughput in `AggregatorActors` - To improve dataflow in IMPALA's main thread. - To boost performance in loss calculation. - To check asynchronous calls to `EnvRunner`s and `Learner`s. - To test resolving references in either `_GPULoaderThreads` or `_LearnerThread` instead of the `Learner`'s main thread. ### Tests APPO in this PR was tested on the following (multi-agent) environments: - `CartPole-v1` - `ALE:Pong-v5` - `Footsies` (see https://github.com/chasemcd/FootsiesGym) #### `CartPole-v1` This PR improves performance significantly for high-producer scenarios like `CartPole-v1`.
All tests used: - `broadcast_interval=10` - `use_circular_buffer=False` - `num_aggregator_actors_per_learner=3` - `num_env_runners=num_learners x 32` - `episodes_to_numpy=False` - `num_gpus_per_learner=1` <img width="757" height="404" alt="image (3)" src="https://github.com/user-attachments/assets/3beee428-d4c0-42f4-811d-61d81de484c2" /> #### `ALE:Pong-v5` All tests used: - `broadcast_interval=10` - `use_circular_buffer=False` - `num_aggregator_actors_per_learner=6` - `num_env_runners=num_learners x 32` - `episodes_to_numpy=True` (`FrameStack` connector with 4 frames) - `num_gpus_per_learner=1` <img width="676" height="366" alt="image" src="https://github.com/user-attachments/assets/43d08a87-0cc1-4902-8150-adc1c3203be6" /> --------- Signed-off-by: simonsays1980 <simon.zehnder@gmail.com>
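A minimal sketch of the consumer-driven handoff described above, with hypothetical names (the real implementation lives in RLlib's `Learner`/`_LearnerThread`): a bounded `queue.Queue` paces the producer, batches travel together with their timesteps, metrics come back through an out-queue on the broadcast interval, and a sentinel gives the thread a clean stop path.

```python
import queue
import threading

_STOP_SENTINEL = object()

class LearnerThread(threading.Thread):
    """Consumer: blocks on the in-queue, emits reduced metrics on the out-queue."""

    def __init__(self, in_queue, out_queue, broadcast_interval=10):
        super().__init__(daemon=True)
        self.in_queue = in_queue
        self.out_queue = out_queue
        self.broadcast_interval = broadcast_interval
        self._num_updates = 0  # private counter: no shared lock needed

    def run(self):
        while True:
            item = self.in_queue.get()  # blocking get: the learner sets the pace
            if item is _STOP_SENTINEL:
                self.in_queue.task_done()
                break
            batch, timesteps = item  # metadata travels with the batch
            self._apply_update(batch, timesteps)
            self._num_updates += 1
            if self._num_updates % self.broadcast_interval == 0:
                # Edge-triggered: reduce metrics exactly every Nth update.
                self.out_queue.put({"num_updates": self._num_updates, "timesteps": timesteps})
            self.in_queue.task_done()

    def _apply_update(self, batch, timesteps):
        pass  # the gradient step would go here

# Producer side: a bounded put() blocks when the learner falls behind.
in_q, out_q = queue.Queue(maxsize=32), queue.Queue()
thread = LearnerThread(in_q, out_q)
thread.start()
for step in range(25):
    in_q.put(([step], {"env_steps": step}))  # (batch, timesteps) tuple
in_q.put(_STOP_SENTINEL)                     # clean shutdown, no flag polling
thread.join()
print(out_q.get())  # metrics reduced at the broadcast interval
```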
## Description This PR improves typehinting in `ray._common.retry`. It contained a lot of `Any` or unspecified generics before and now should be fully specific. --------- Signed-off-by: Jonas Dedden <university@jonas-dedden.de>
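As an illustration of the kind of tightening this enables (not the actual `ray._common.retry` signature), a retry decorator can preserve the wrapped callable's parameter and return types with `ParamSpec`/`TypeVar` instead of falling back to `Callable[..., Any]`:

```python
import time
from typing import Callable, ParamSpec, TypeVar

P = ParamSpec("P")
R = TypeVar("R")

def retry(times: int, delay_s: float = 0.1) -> Callable[[Callable[P, R]], Callable[P, R]]:
    """Decorator whose typed signature flows through to the wrapped function."""
    def decorator(fn: Callable[P, R]) -> Callable[P, R]:
        def wrapper(*args: P.args, **kwargs: P.kwargs) -> R:
            last_exc = None
            for _ in range(times):
                try:
                    return fn(*args, **kwargs)
                except Exception as exc:  # real code would narrow the exception types
                    last_exc = exc
                    time.sleep(delay_s)
            assert last_exc is not None
            raise last_exc
        return wrapper
    return decorator

@retry(times=3)
def flaky_add(a: int, b: int) -> int:
    return a + b

result = flaky_add(1, 2)  # type checkers infer int here, not Any
print(result)
```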
## Description When using `runtime_env.working_dir` with a remote zip archive URL (for example, `https://gitee.com/whaozi/kuberay/repository/archive/master.zip`), Ray downloads an HTML page instead of the actual zip file. This causes the Ray job to fail when accessing files from the working directory. Downloading the same URL with standard tools such as `wget` works as expected and returns the correct zip archive. This PR addresses the inconsistency in how `runtime_env.working_dir` handles remote archive downloads. #### For example ``` import ray ray.init(include_dashboard=False, ignore_reinit_error=True) @ray.remote( runtime_env={"working_dir": "https://gitee.com/whaozi/kuberay/repository/archive/master.zip"} ) def list_repo_files(): import pathlib return sorted(p.name for p in pathlib.Path(".").iterdir()) print(ray.get(list_repo_files.remote())) ray.shutdown() ``` https_gitee_com_whaozi_kuberay_repository_archive_master is empty, and https_gitee_com_whaozi_kuberay_repository_archive_master.zip is an HTML file <img width="1438" height="550" alt="image" src="https://github.com/user-attachments/assets/ec330c99-3bf7-431a-8f3e-6c1789e257ab" /> #### We tested ``` wget https://gitee.com/whaozi/kuberay/repository/archive/master.zip --2025-08-05 14:28:52-- https://gitee.com/whaozi/kuberay/repository/archive/master.zip Resolving gitee.com (gitee.com)... 180.76.198.77, 180.76.199.13, 180.76.198.225 Connecting to gitee.com (gitee.com)|180.76.198.77|:443... connected. HTTP request sent, awaiting response... 302 Found Location: https://gitee.com/whaozi/kuberay/repository/blazearchive/master.zip?Expires=1754430533&Signature=8EMfEVLuEJRLPHsJPQnqkwoSfWYTon6sdYUD7VrHZcM%3D [following] --2025-08-05 14:28:54-- https://gitee.com/whaozi/kuberay/repository/blazearchive/master.zip?Expires=1754430533&Signature=8EMfEVLuEJRLPHsJPQnqkwoSfWYTon6sdYUD7VrHZcM%3D Reusing existing connection to gitee.com:443. HTTP request sent, awaiting response... 200 OK Length: unspecified [application/zip] Saving to: 'master.zip' master.zip [ <=> ] 10.37M 1.23MB/s in 13s ``` At first this looked like unhandled HTTP redirection; if I directly use the redirected URL, it works ``` from smart_open import open as open_file with open_file("https://gitee.com/whaozi/kuberay/repository/blazearchive/master.zip?Expires=1754430533&Signature=8EMfEVLuEJRLPHsJPQnqkwoSfWYTon6sdYUD7VrHZcM%3D", "rb") as fin: with open_file("/tmp/jjyao_test.zip", "wb") as fout: fout.write(fin.read()) ``` #### The problem is: When using runtime_env.working_dir with a remote zip URL (e.g. gitee archives), Ray's HTTPS downloader uses the default Python-urllib user-agent, and some hosts respond with HTML rather than the archive. The working directory then contains HTML and the Ray job fails, while wget succeeds because it presents a curl-like user-agent. #### Solution _download_https_uri() now sets curl-like headers (ray-runtime-env-curl/1.0 UA + Accept: */*, configurable via RAY_RUNTIME_ENV_HTTP_USER_AGENT). This keeps Ray's behavior consistent with curl/wget, allowing gitee and similar hosts to return the proper zip file. A regression test verifies the headers are set. ## Related issues Fixes ray-project#52233 ## Additional information --------- Signed-off-by: yaommen <myanstu@163.com>
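A minimal standalone sketch of the header change described above (the real code path is Ray's `_download_https_uri`; the user-agent string and env var name are taken from this description, and the helper below is illustrative):

```python
import os
import urllib.request

# Hosts like gitee sniff the client; a curl-like UA keeps them returning the zip.
DEFAULT_UA = os.environ.get("RAY_RUNTIME_ENV_HTTP_USER_AGENT", "ray-runtime-env-curl/1.0")

def download_https_uri(url: str, dest_path: str) -> None:
    """Download a remote archive while presenting curl-like headers."""
    request = urllib.request.Request(
        url,
        headers={"User-Agent": DEFAULT_UA, "Accept": "*/*"},
    )
    # urlopen follows 302 redirects by default, so the redirected blazearchive
    # URL is fetched transparently.
    with urllib.request.urlopen(request) as response, open(dest_path, "wb") as out:
        out.write(response.read())

# download_https_uri("https://gitee.com/whaozi/kuberay/repository/archive/master.zip",
#                    "/tmp/master.zip")
```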
Previously, if the user did not specify them, Ray preassigned the GCS port, dashboard agent port, runtime environment port, etc., and passed them to each component at startup. This created a race condition: Ray might believe a port is free, but by the time the port information is propagated to each component, another process may have already bound to that port. This can cause user-facing issues, for example when Raylet heartbeat messages are missed frequently enough that the GCS considers the node unhealthy and removes it. We originally did this because there was no standard local service discovery, so components had no way to know each other's serving ports unless they were preassigned. The final port discovery design is here: <img width="2106" height="1492" alt="image" src="https://github.com/user-attachments/assets/eaac8190-99d8-404b-8a8d-283a4f2f0f33" /> This PR addresses port discovery for: - GCS reporting back to the startup script (driver) - The runtime env agent reporting back to the raylet - The dashboard agent reporting back to the raylet - The raylet blocking registration with the GCS until it has collected port information from all agents - GCS adding InitMetricsExporter to node_added_listeners_ so it starts the MetricsExporter as soon as the raylet registers with the GCS with complete port information - The Ray client server obtaining the runtime env agent port from GCS - Ensuring that both a connected-only driver (e.g., `ray.init()`) and a startup driver still receive all port information from the GCS - Ensuring GCS FT works: using the same GCS port as before - Ensuring no metric loss - Cleaning up the old cached-port code (Note that this PR is a clean-up version of ray-project#59065) ## Consideration **GCS Fault tolerance:** GCS fault tolerance requires GCS to restart using exactly the same port, even if it initially starts with a dynamically assigned port (0). Before this PR, GCS cached the port in a file, and this PR preserves the same behavior (although ideally, the port should only be read from the file by the Raylet and its agent). This can be further improved by storing the GCS port in Redis, but that should be addressed in a separate PR. **GCS start sequence related:** OpenCensus Exporter and the Event Aggregator Client are now constructed without connecting to the agent port; instead, they defer the actual connection until the head Raylet registers via a callback. At that point, the actual metrics_agent_port is known from the node information. The OpenTelemetry Exporter is now also initialized at head Raylet registration time. **Ray nodes that share the same file system:** There are cases where people run multiple Ray nodes from the same or different Ray clusters, so the port file name is based on a fixed prefix plus the node ID. (A minimal sketch of this file-based port handoff appears after this description.) ## Related issues Closes ray-project#54321 ## Test For GCS-related work, here is a detailed test I wrote that covers seven starting/connecting cases: - https://github.com/Yicheng-Lu-llll/ray/blob/port-self-discovery-test-file/python/ray/tests/test_gcs_port_reporting.py - ray.init starts a head node and exposes a dynamic GCS port.
- Connect a driver via address="auto" using the address file - Connect a driver via an explicit address - CLI starts head with dynamic GCS port - CLI starts worker connecting to the head via GCS address - CLI starts head with an explicit GCS port - CLI starts head with default GCS port For runtime env agent: - https://github.com/Yicheng-Lu-llll/ray/blob/port-self-discovery-test-file/test_agent_port.py - ray start --head (auto port discovery) - ray start --head with fixed runtime-env-agent-port - ray.init() local cluster (auto port discovery) - (we don't have ray.init() with fixed _runtime_env_agent_port) Test that ray_client_server works correctly with dynamic runtime env agent port: - https://github.com/Yicheng-Lu-llll/ray/blob/port-self-discovery-test-file/test_ray_client_with_runtime_env.py For dashboard agent ports, the existing tests already cover this quite well. ## Follow up - The dashboard agent reporting back to the raylet - The dashboard agent now also writes to GCS, but we should allow only the raylet to write to GCS ## Performance Before this PR: ```shell [0.000s] Starting ray.init()... [0.000s] Session dir created [0.070s] Process: gcs_server [6.885s] Process: runtime_env_agent [6.955s] Process: raylet [6.955s] Process: dashboard_agent 2025-12-12 04:47:34,391 INFO worker.py:2014 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265 /home/ubuntu/ray/python/ray/_private/worker.py:2062: FutureWarning: Tip: In future versions of Ray, Ray will no longer override accelerator visible devices env var if num_gpus=0 or num_gpus=None (default). To enable this behavior and turn off this error message, set RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=0 warnings.warn( [9.061s] ray.init() completed ``` After this PR: ```shell [0.000s] Starting ray.init()... [0.075s] Process: gcs_server [0.075s] Session dir created [0.075s] File: gcs_server_port.json = 39451 [6.976s] Process: raylet [6.976s] Process: dashboard_agent [6.976s] Process: runtime_env_agent [7.576s] File: runtime_env_agent_port.json = 38747 [7.640s] File: metrics_agent_port.json = 40005 [8.083s] File: metrics_export_port.json = 44515 [8.083s] File: dashboard_agent_listen_port.json = 52365 2025-12-12 02:02:54,925 INFO worker.py:1998 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265 /home/ubuntu/ray/python/ray/_private/worker.py:2046: FutureWarning: Tip: In future versions of Ray, Ray will no longer override accelerator visible devices env var if num_gpus=0 or num_gpus=None (default). To enable this behavior and turn off this error message, set RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=0 warnings.warn( [10.035s] ray.init() completed ``` We can see that the dominant time is actually at the start of GCS. We wait for GCS to be ready and write the cluster info. The port reporting speed is quite fast (file appearance time - raylet start time). https://github.com/ray-project/ray/blob/863ae9fd573b13a05dcae63b483e9b1eb0175571/python/ray/_private/node.py#L1365-L367 --------- Signed-off-by: yicheng <yicheng@anyscale.com> Co-authored-by: yicheng <yicheng@anyscale.com>
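A minimal sketch of the file-based port handoff described above, with hypothetical helper and file names (the real file names follow a fixed prefix plus the node ID, as noted in the description): the component binds to port 0, publishes the actual port to a file atomically, and the raylet blocks until the file appears.

```python
import json
import os
import tempfile
import time

SESSION_DIR = tempfile.mkdtemp()  # stand-in for the Ray session directory

def report_port(component: str, node_id: str, port: int) -> None:
    """Component side: after binding to port 0, write the actual port to a file."""
    path = os.path.join(SESSION_DIR, f"{component}_port_{node_id}.json")
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"port": port}, f)
    os.rename(tmp, path)  # atomic publish so readers never see a partial file

def wait_for_port(component: str, node_id: str, timeout_s: float = 30.0) -> int:
    """Raylet side: block registration with GCS until the agent reported its port."""
    path = os.path.join(SESSION_DIR, f"{component}_port_{node_id}.json")
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if os.path.exists(path):
            with open(path) as f:
                return json.load(f)["port"]
        time.sleep(0.05)
    raise TimeoutError(f"{component} never reported its port")

report_port("runtime_env_agent", "node-abc", 38747)
print(wait_for_port("runtime_env_agent", "node-abc"))  # 38747
```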
Clean Up Deprecated Env Var and Document Undocumented Env Vars ### Summary Remove deprecated `RAY_SERVE_ENABLE_JSON_LOGGING` and add documentation for undocumented Ray Serve environment variables. ### Changes | File | Description | |------|-------------| | `python/ray/serve/_private/constants.py` | Removed `RAY_SERVE_ENABLE_JSON_LOGGING`, renamed `SERVE_ROOT_URL_ENV_KEY` → `RAY_SERVE_ROOT_URL`, removed deprecated `CONTROLLER_MAX_CONCURRENCY` fallback | | `python/ray/serve/_private/logging_utils.py` | Removed deprecated JSON logging logic and warning | | `python/ray/serve/_private/controller.py` | Updated to use `RAY_SERVE_ROOT_URL` constant | | `doc/source/serve/monitoring.md` | Removed deprecation note, added `RAY_SERVE_CONTROLLER_CALLBACK_IMPORT_PATH` docs | | `doc/source/serve/advanced-guides/performance.md` | Added `RAY_SERVE_CONTROLLER_MAX_CONCURRENCY` docs | | `doc/source/serve/production-guide/config.md` | Added `RAY_SERVE_ROOT_URL` docs | ### New Documentation | Environment Variable | Description | |---------------------|-------------| | `RAY_SERVE_CONTROLLER_MAX_CONCURRENCY` | Max concurrent requests for Controller (default: 15000) | | `RAY_SERVE_CONTROLLER_CALLBACK_IMPORT_PATH` | Callback for custom Controller initialization | | `RAY_SERVE_ROOT_URL` | Override root URL (useful behind load balancers) | ### Migration Users using `RAY_SERVE_ENABLE_JSON_LOGGING=1` should migrate to `LoggingConfig` with `encoding="JSON"`. --------- Signed-off-by: harshit <harshit@anyscale.com>
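A short migration sketch for the removed flag, assuming the documented `LoggingConfig` encoding API as the replacement (the deployment itself is a placeholder):

```python
from ray import serve
from ray.serve.schema import LoggingConfig

# Per-deployment JSON logging replaces the removed RAY_SERVE_ENABLE_JSON_LOGGING=1.
@serve.deployment(logging_config=LoggingConfig(encoding="JSON"))
class Echo:
    def __call__(self, request):
        return "ok"

# Alternatively, apply JSON logging to the whole application at run time.
serve.run(Echo.bind(), logging_config={"encoding": "JSON"})
```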
ray-project#59659) Signed-off-by: yicheng <yicheng@anyscale.com> Co-authored-by: yicheng <yicheng@anyscale.com>
## Description Follow up to ray-project#59350 - motivation: better abstraction for progress bars and type checking in general. ## Related issues N/A ## Additional information N/A --------- Signed-off-by: kyuds <kyuseung1016@gmail.com> Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>
…_usage` a utility function (ray-project#59674) This PR makes the `ReservationOpResourceAllocator._get_ineligible_ops_with_usage` method a utility function named `get_ineligible_op_usage`. The motivation is so that the logic can be reused by other allocator implementations. --------- Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…ay-project#59668) To avoid circular dependencies, this PR updates `ranker.py` to only import `ResourceManager` while type checking. Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu> Co-authored-by: Alexey Kudinkin <alexey.kudinkin@gmail.com>
…y_accounting` (ray-project#59671) `PhysicalOperator` has an `implements_accurate_memory_accounting` method. Subclasses override it and return `True` if they properly call the `OpRuntimeMetrics` hooks like `on_input_queued`. We previously needed this method because operators like `AllToAllOperator` didn't update `OpRuntimeMetrics`, and that would cause issues for resource allocation. However, now that all operators implement accurate memory accounting, this method isn't necessary anymore. (Sanity check from Claude below) ``` > Based on my search, no concrete PhysicalOperator subclass returns False for implements_accurate_memory_accounting. Here's the breakdown: Base class default: PhysicalOperator returns False at physical_operator.py:773 All concrete operators return True: | Operator | Source | |-------------------------|----------------------------------------------------------------| | MapOperator | Overrides at map_operator.py:704 | | TaskPoolMapOperator | Inherits from MapOperator | | ActorPoolMapOperator | Inherits from MapOperator | | LimitOperator | Overrides at limit_operator.py:135 | | InputDataBuffer | Overrides at input_data_buffer.py:98 | | OutputSplitter | Overrides at output_splitter.py:285 | | AggregateNumRows | Overrides at aggregate_num_rows.py:63 | | UnionOperator | Overrides at union_operator.py:138 | | ZipOperator | Overrides at zip_operator.py:152 | | HashShuffleOperator | Inherits from HashShufflingOperatorBase (hash_shuffle.py:1005) | | JoinOperator | Inherits from HashShufflingOperatorBase | | HashAggregateOperator | Inherits from HashShufflingOperatorBase | | OneToOneOperator (base) | Overrides at base_physical_operator.py:231 | The only class that returns False (by inheritance) is the abstract NAryOperator base class, but its two concrete subclasses (UnionOperator and ZipOperator) both override it to return True. ``` Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
## Description Remove dead code in `python/ray/data/util/data_batch_conversion.py` and `python/ray/data/util/torch_utils.py`, which is related to PR: ray-project#59420 ## Related issues Related to ray-project#59420. Signed-off-by: will <zzchun8@gmail.com> Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
## Description ## Related issues Closes ray-project#59652 ## Additional information --------- Signed-off-by: You-Cheng Lin <mses010108@gmail.com> Signed-off-by: You-Cheng Lin <106612301+owenowenisme@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
## Description After removing the deprecated `read_parquet_bulk` API, `FastFileMetadataProvider` became dead code with no remaining usage in the codebase. This commit removes: - FastFileMetadataProvider class implementation - All imports and exports of FastFileMetadataProvider - Tests that specifically tested FastFileMetadataProvider - Documentation references to FastFileMetadataProvider - Code comments mentioning FastFileMetadataProvider ## Related issues > Fixes ray-project#59010 --------- Signed-off-by: rushikesh.adhav <adhavrushikesh6@gmail.com> Signed-off-by: Rushikesh Adhav <adhavrushikesh6@gmail.com>
…ect#59733) ## Description The constructor in the ClusterAutoscaler base class is not necessary, and it adds complexity because it requires all ClusterAutoscaler implementations to accept the same dependencies. This PR removes the constructor from ClusterAutoscaler. Subclasses can avoid taking dependencies they don't use. ## Related issues Closes ray-project#59684 ## Additional information --------- Signed-off-by: machichima <nary12321@gmail.com>
otherwise, it fails to build with a missing header when grpc is upgraded. Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
## Description bump gymnasium to 1.2.2 in byod-rllib follow up on: * ray-project#59530 related: * ray-project#59572 Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
## Description Previously, GPU allocation was special-cased in `ReservationOpResourceAllocator`: 1. GPU operators got all available GPUs (limits.gpu - op_usage.gpu) regardless of their max_resource_usage 2. The check `max_resource_usage != inf() and max_resource_usage.gpu > 0` failed for unbounded actor pools (max_size=None), causing them to get zero GPU budget 3. GPU was stripped from remaining shared resources (.copy(gpu=0)) This caused a bug where ActorPoolStrategy(min_size=1, max_size=None) with GPU actors couldn't autoscale beyond the initial actor. **Changes** _task_pool_map_operator.py_ - Added min_max_resource_requirements() method for consistency with ActorPoolMapOperator - Returns (min=1 task resources, max=max_concurrency * task resources or inf) _resource_manager.py_ - Removed GPU special-casing entirely - GPU now flows through the same allocation path as CPU and memory - Operators are capped by their max_resource_usage for all resource types uniformly - Remaining shared resources (including GPU) go to unbounded downstream operators ## Related issues ## Additional information --------- Signed-off-by: Goutam <goutam@anyscale.com>
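A usage sketch of the configuration this unblocks (the class and column names are placeholders; nothing is materialized here, so it runs without a GPU):

```python
import numpy as np
import ray

class Predictor:
    """Placeholder GPU inference actor; a real UDF would load a model here."""

    def __call__(self, batch):
        batch["pred"] = np.zeros(len(batch["id"]))
        return batch

ds = ray.data.range(10_000).map_batches(
    Predictor,
    # Unbounded actor pool with GPU actors: with the fix, GPU budget flows
    # through the same allocation path as CPU/memory, so the pool can scale
    # past the single initial actor.
    compute=ray.data.ActorPoolStrategy(min_size=1, max_size=None),
    num_gpus=1,
    batch_size=256,
)
# ds.take_batch(1)  # materialize only on a cluster that actually has GPUs
```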
…9742) the original dataset is being fetched from the Internet and it is returning 403 Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
CI bazel usage should all be running on Python 3.10 Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
not used by rllib any more Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
added a notebook that demonstrates a Serve application that takes a reference to a video as input and returns scene changes, tags, and a video description (from the corpus). https://anyscale-ray--59859.com.readthedocs.build/en/59859/serve/tutorials/video-analysis/README.html --------- Signed-off-by: abrar <abrar@anyscale.com>
…ject#60109) Follow up to ray-project#52573 (comment) --------- Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu> Co-authored-by: iamjustinhsu <140442892+iamjustinhsu@users.noreply.github.com>
ray-project#60076) Signed-off-by: dayshah <dhyey2019@gmail.com>
they are not required for test orchestration, as rayci can properly track dependencies now. Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Opening draft PR for Ray technical charter. Planning to add GitHub usernames before merging. --------- Signed-off-by: Robert Nishihara <rkn@anyscale.com> Signed-off-by: Philipp Moritz <pcmoritz@gmail.com> Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
There isn't really a need to have a ray check on the event exporter, as there isn't an important correctness invariant here. One call will succeed. We already take some measure of caution here with a mutex in the event recorder, but ray-checking right after the mutex is just asking for trouble. --------- Signed-off-by: zac <zac@anyscale.com>
Currently seeing issues where crane is not available in the uploading environment. Default to Docker if crane is not available. https://buildkite.com/ray-project/postmerge/builds/15375/steps/canvas?jid=019bb99d-6f9e-45fa-92e3-a5a1d9373e8d#019bb99d-6f9e-45fa-92e3-a5a1d9373e8d/L198 Topic: crane-fix Signed-off-by: andrew <andrew@anyscale.com>
…ay-project#59991) updating lock files for images and using relative paths in buildkite configs; moving base extra test deps lock files from the ray_release path to python/deplocks/base_extra_testdeps Release test run: https://buildkite.com/ray-project/release/builds/74936 --------- Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
…ct#59969) The ray-cpp wheel contains only C++ headers, libraries, and executables with no Python-specific code. Previously we built 4 identical wheels (one per Python version: cp310, cp311, cp312, cp313), wasting CI time and storage. This change produces a single wheel tagged py3-none-manylinux2014_* that works with any Python 3.x version. Changes: - Add ray-cpp-core.wanda.yaml and Dockerfile for cpp core - Add ray-cpp-wheel.wanda.yaml for cpp wheel builds - Add ci/build/build-ray-cpp-wheel.sh for Python-agnostic wheel builds - Add RayCppBdistWheel class to setup.py that forces py3-none tags (necessary because BinaryDistribution.has_ext_modules() causes bdist_wheel to use interpreter-specific ABI tags by default) - Update ray-cpp-wheel.wanda.yaml to build single wheel per architecture - Update .buildkite/build.rayci.yml to remove Python version matrix for cpp wheel build/upload steps Topic: ray-cpp-wheel Relative: ray-wheel Signed-off-by: andrew <andrew@anyscale.com> --------- Signed-off-by: andrew <andrew@anyscale.com>
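A sketch of the tag-forcing idea under the assumptions above (the real class is `RayCppBdistWheel` in Ray's setup.py; this standalone version only shows the override points and is not that implementation):

```python
from setuptools import setup, Distribution

try:
    from setuptools.command.bdist_wheel import bdist_wheel  # setuptools >= 70.1
except ImportError:
    from wheel.bdist_wheel import bdist_wheel  # older toolchains

class BinaryDistribution(Distribution):
    # has_ext_modules() == True normally makes bdist_wheel emit cpXY/ABI tags,
    # which is exactly what we want to override for a binary-but-Python-agnostic wheel.
    def has_ext_modules(self):
        return True

class PurePythonTagWheel(bdist_wheel):
    """Force a py3-none-<platform> tag even though binaries are bundled."""

    def get_tag(self):
        _, _, plat = super().get_tag()
        # Keep the platform tag (e.g. manylinux2014_*), drop interpreter/ABI tags.
        return "py3", "none", plat

# setup(
#     name="ray-cpp",
#     distclass=BinaryDistribution,
#     cmdclass={"bdist_wheel": PurePythonTagWheel},
# )
```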
Signed-off-by: cristianjd <cristian.j.derr@gmail.com>
The governance information is now integrated into the contributor documentation at doc/source/ray-contribute/getting-involved.rst:399-437, making it easily discoverable for community members interested in advancing their involvement in the Ray project. --------- Signed-off-by: Richard Liaw <rliaw@berkeley.edu> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Robert Nishihara <robertnishihara@gmail.com>
## Description Completing the fixed-size array namespace operations ## Related issues Related to ray-project#58674 ## Additional information --------- Signed-off-by: 400Ping <fourhundredping@gmail.com> Signed-off-by: Ping <fourhundredping@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Builds running into errors since we were not logged in. https://buildkite.com/ray-project/postmerge/builds/15387/steps/canvas?jid=019bbaad-93e8-40e9-aeef-3592bd609f2d#019bbaad-93e8-40e[…]3592bd609f2d/L198 Topic: login-ecr-fix Signed-off-by: andrew <andrew@anyscale.com>
These checks were being done manually, but click can handle this check itself. Removing this block in favor of click. Signed-off-by: andrew <andrew@anyscale.com>
Support arrow format in OneHotEncoder. Benchmark: TPC-H SF10 The improvement for `OneHotEncoder`: Before: | Batch Size | Throughput (rows/sec) | Avg Latency (ms) | P50 (ms) | P95 (ms) | P99 (ms) | Min (ms) | Max (ms) | |:----------:|----------------------:|:----------------:|:--------:|:--------:|:--------:|:--------:|:--------:| | 1 | 374 | 2.67 | 2.66 | 2.86 | 2.93 | 2.51 | 3.33 | | 5 | 1,867 | 2.68 | 2.67 | 2.82 | 2.85 | 2.52 | 3.48 | | 10 | 3,590 | 2.79 | 2.77 | 2.91 | 3.33 | 2.61 | 3.47 | | 20 | 6,911 | 2.89 | 2.88 | 3.08 | 3.41 | 2.72 | 3.62 | | 50 | 15,841 | 3.16 | 3.14 | 3.28 | 3.58 | 3.00 | 3.76 | | 100 | 27,019 | 3.70 | 3.60 | 3.86 | 4.66 | 3.40 | 9.73 | After: | Batch Size | Throughput (rows/sec) | Avg Latency (ms) | P50 (ms) | P95 (ms) | P99 (ms) | Min (ms) | Max (ms) | |:----------:|----------------------:|:----------------:|:--------:|:--------:|:--------:|:--------:|:--------:| | 1 | 1,187 | 0.84 | 0.82 | 0.92 | 1.21 | 0.80 | 1.32 | | 5 | 5,888 | 0.85 | 0.82 | 0.99 | 1.13 | 0.80 | 1.17 | | 10 | 12,144 | 0.82 | 0.81 | 0.86 | 0.94 | 0.79 | 1.19 | | 20 | 24,385 | 0.82 | 0.81 | 0.86 | 0.88 | 0.79 | 1.17 | | 50 | 59,987 | 0.83 | 0.82 | 0.87 | 0.91 | 0.81 | 1.18 | | 100 | 118,853 | 0.84 | 0.83 | 0.88 | 0.97 | 0.81 | 1.18 | ### Null behavior We keep the null behaviors the same as the old pandas implementations. | Encoder | Path | Null Input Behavior | Unseen Category Behavior | |---------|------|---------------------|--------------------------| | OrdinalEncoder | Pandas | **ValueError** | NaN | | OrdinalEncoder | Arrow | **ValueError** | null (NaN when converted) | | OneHotEncoder | Pandas | **ValueError** | all-zeros vector | | OneHotEncoder | Arrow | **ValueError** | all-zeros vector | --------- Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com> Signed-off-by: xgui <xgui@anyscale.com>
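A brief usage sketch with Arrow-backed data (column names are illustrative); after this change the encoder operates on Arrow blocks directly instead of converting to pandas:

```python
import pyarrow as pa
import ray
from ray.data.preprocessors import OneHotEncoder

# Arrow-format blocks stay in Arrow through the encoder after this change.
table = pa.table({"color": ["red", "green", "red", "blue"]})
ds = ray.data.from_arrow(table)

encoder = OneHotEncoder(columns=["color"])
encoded = encoder.fit_transform(ds)
print(encoded.take_batch(4))
```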
No need to be as verbose Signed-off-by: andrew <andrew@anyscale.com>
generating base-slim depset; installing lock file in base-slim image; updating paths for constraints files and requirements files --------- Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
security upgrade. fixes ray-project#60079 Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
bumping requests to minimum ver 2.32.5 due to security vulnerabilities with requests<2.32.4 https://security.snyk.io/package/pip/requests Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com> Co-authored-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Increase timeout for several flaky train tests. --------- Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com> Signed-off-by: xgui <xgui@anyscale.com>
…ay-project#60142) Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
While looking at the UpdateObjectLocationsBatch path and the pubsub that the owner uses to communicate object location updates, I noticed that we send over both object locations and the primary node id. However, the primary node id is not used anywhere in the OBOD. What's interesting is that there seem to be a couple of paths: - Each raylet initially subscribes to the owner for object location updates. After the raylet is done (note it unsubscribes once done pulling), it sends over its object location updates via a UpdateObjectLocationsBatch RPC to the owner. The owner then publishes this update to all the other currently subscribed raylets via the same pubsub previously mentioned. - We also have a pinned_node_id that is contained in the reference counter and set by PushTaskReply. Actually, in PushTaskReply we still publish an object locations update in UpdateObjectPendingCreationInternal, but we send over an empty list of location updates (as that's only populated by UpdateObjectLocationsBatch) and set a pending_creation boolean flag that signals when the object is done being created. It's possible that UpdateObjectLocationsBatch is received before PushTaskReply, which is why we have this guard. From looking at ray-project#25004 it seems this is intentional and we only want to use UpdateObjectLocationsBatch to modify the locations set. We send over this locations set to any raylet that's trying to decide where to pull from (makes sense), but we also send over the primary_node_id, which is copied from pinned_node_id. The primary_node_id isn't used at all, so there's no point in sending it over. I removed it from the proto and a couple of other places where it doesn't seem necessary. --------- Signed-off-by: joshlee <joshlee@anyscale.com>
Code Review
This pull request is a large automated merge from master to main, incorporating a wide range of changes across the Ray ecosystem. The updates include significant CI/CD pipeline refactoring, new features in Ray Serve, Data, and Core, extensive documentation improvements, and internal changes to Java worker initialization. Key highlights are the modularization of the build process, the introduction of asynchronous inference and advanced multimodal capabilities, and enhanced monitoring and fault tolerance in Ray Serve. The codebase is also moving away from Python 3.9. Overall, these changes represent a substantial step forward in functionality, maintainability, and user experience. My review focuses on a minor point in the new CI configurations.
RAYCI_DISABLE_JAVA: "false"
RAYCI_WANDA_ALWAYS_REBUILD: "true"
JDK_SUFFIX: "-jdk"
ARCH_SUFFIX: "aarch64"
The environment variable ARCH_SUFFIX is defined here for the manylinux-cibase-jdk-aarch64 step. However, looking at the corresponding Wanda configuration file (ci/docker/manylinux-cibase.wanda.yaml), this variable doesn't appear to be used as a build argument or within the associated Dockerfile. This makes it redundant. To improve maintainability and avoid confusion, it's best to remove unused environment variables.
This pull request has been automatically marked as stale because it has not had recent activity. You can always ask for help on our discussion forum or Ray's public slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed.
This Pull Request was created automatically to merge the latest changes from `master` into the `main` branch.
Created: 2026-01-15
Merge direction: `master` → `main`
Triggered by: Scheduled
Please review and merge if everything looks good.