daily merge: master → main 2026-01-26#757
### Summary
This PR is part 1 of 3 for adding queue-based autoscaling support for
Ray Serve TaskConsumer deployments.
### Background
TaskConsumers are workloads that consume tasks from message queues
(Redis, RabbitMQ), and their scaling needs are fundamentally different
from HTTP-based deployments. Instead of scaling based on HTTP request
load, TaskConsumers should scale based on the number of pending tasks in
the message queue.
### Overall Architecture (Full Feature)
```
┌─────────────────┐      ┌──────────────────┐      ┌─────────────────┐
│  Message Queue  │◄─────│   QueueMonitor   │      │ ServeController │
│   (Redis/RMQ)   │      │      Actor       │◄─────│   Autoscaler    │
└─────────────────┘      └──────────────────┘      └─────────────────┘
         │                        │
         │  get_queue_length()    │
         └────────────┬───────────┘
                      │
                      ▼
        ┌───────────────────────────┐
        │  queue_based_autoscaling  │
        │        _policy()          │
        │ desired = ceil(len/target)│
        └───────────────────────────┘
```
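The policy box above reduces to a ceiling division of queue length by the per-replica target. A minimal sketch of that calculation (the policy itself lands in PR 2; the function and parameter names here are illustrative, not the PR's API):

```python
import math

def queue_based_desired_replicas(
    queue_len: int,
    target_tasks_per_replica: int,
    min_replicas: int = 1,
    max_replicas: int = 100,
) -> int:
    # desired = ceil(len / target), clamped to the deployment's replica bounds.
    desired = math.ceil(queue_len / target_tasks_per_replica)
    return max(min_replicas, min(desired, max_replicas))

queue_based_desired_replicas(10, 4)  # 10 pending tasks, 4 per replica -> 3
queue_based_desired_replicas(0, 4)   # empty queue -> clamped to min_replicas
```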
The full implementation consists of three PRs:
| PR             | Description                                         | Status   |
|----------------|-----------------------------------------------------|----------|
| PR 1 (This PR) | QueueMonitor actor for querying broker queue length | Current  |
| PR 2           | Introduce default queue-based autoscaling policy    | Upcoming |
| PR 3           | Integration with TaskConsumer deployments           | Upcoming |
### This PR: QueueMonitor Actor
This PR introduces the QueueMonitor Ray actor that queries message
brokers to get queue length for autoscaling decisions.
### Key Features
- Multi-broker support: Redis and RabbitMQ
- Lightweight Ray actor: runs with `num_cpus=0`, with `pika` and `redis` supplied via the runtime env
- Fault tolerance: caches the last known queue length and returns it on query failures
- Named actor pattern: `QUEUE_MONITOR::<deployment_name>` for easy lookup
### Queue Length Calculation
For accurate autoscaling, QueueMonitor returns total workload (pending
tasks):
| Broker | Pending Tasks |
|----------|----------------|
| Redis | `LLEN <queue>` |
| RabbitMQ | messages_ready |
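A hedged sketch of what those two queries look like at the client level (function names are mine; the PR's actual QueueMonitor internals may differ). The client objects are the usual `redis-py` and `pika` handles:

```python
def redis_queue_length(client, queue: str) -> int:
    # redis-py: LLEN on the list backing the queue = number of pending tasks.
    return client.llen(queue)

def rabbitmq_queue_length(channel, queue: str) -> int:
    # pika: a passive queue_declare doesn't create or modify the queue; its
    # reply carries message_count, which corresponds to messages_ready.
    result = channel.queue_declare(queue=queue, passive=True)
    return result.method.message_count
```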
### Components
1. QueueMonitorConfig - Configuration dataclass with broker URL and
queue name
2. QueueMonitor - Core class that initializes broker connections and
queries queue length
3. QueueMonitorActor - Ray actor wrapper for remote access
4. Helper functions:
- create_queue_monitor_actor() - Create named actor
- get_queue_monitor_actor() - Lookup existing actor
- delete_queue_monitor_actor() - Cleanup on deployment deletion
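A sketch of how the named-actor lookup could work. Only the `QUEUE_MONITOR::<deployment_name>` pattern comes from the PR; the rest is illustrative:

```python
QUEUE_MONITOR_PREFIX = "QUEUE_MONITOR::"

def queue_monitor_actor_name(deployment_name: str) -> str:
    # Well-known name so any process can find the monitor for a deployment.
    return QUEUE_MONITOR_PREFIX + deployment_name

def get_queue_monitor_actor(deployment_name: str):
    # Illustrative lookup: ray.get_actor raises ValueError when no actor
    # with that name exists, so we translate that to None.
    import ray  # assumed available in the Serve environment
    try:
        return ray.get_actor(queue_monitor_actor_name(deployment_name))
    except ValueError:
        return None
```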
### Test Plan
- Unit tests for QueueMonitorConfig (7 tests)
- Broker type detection (Redis, RabbitMQ, SQS, unknown)
- Config value storage
- Unit tests for QueueMonitor (4 tests)
- Redis queue length retrieval (pending)
- RabbitMQ queue length retrieval
- Error handling with cached value fallback
---------
Signed-off-by: harshit <harshit@anyscale.com>
… Checkpoint Report Time" (ray-project#58470)

The existing Checkpoint Report Time metric represents the total time workers spend reporting metrics across multiple checkpoints. However, the current title does not clearly indicate that this time is cumulative. This PR addresses the issue by renaming the panel to "Cumulative Checkpoint Report Time" and updating the description to reflect that it sums reporting time across multiple checkpoints.

Current State:

<img width="1730" height="988" alt="image" src="https://github.com/user-attachments/assets/5d74eb17-6b6a-4902-a9f3-c25749ec8e5d" />

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
…ct#59963)

### Summary

This PR removes the `RAY_SERVE_ALWAYS_RUN_PROXY_ON_HEAD_NODE` environment variable as it has no practical effect in production.

### Why this env var is redundant

The env var was intended to "always run a proxy on the head node even if it has no replicas." However, the controller already unconditionally ensures the head node is included in the proxy nodes set, making this flag ineffective.

### Code flow analysis

#### In `controller.py` `_update_proxy_nodes()` (lines 419-430):

```python
def _update_proxy_nodes(self):
    new_proxy_nodes = self.deployment_state_manager.get_active_node_ids()
    new_proxy_nodes = new_proxy_nodes - set(
        self.cluster_node_info_cache.get_draining_nodes()
    )
    new_proxy_nodes.add(self._controller_node_id)  # <-- UNCONDITIONAL
    self._proxy_nodes = new_proxy_nodes
```

The head node (`_controller_node_id`) is **always** added to `_proxy_nodes` unconditionally.

#### In `proxy_state.py` `update()` (where the env var was checked):

```python
if RAY_SERVE_ALWAYS_RUN_PROXY_ON_HEAD_NODE:
    proxy_nodes.add(self._head_node_id)
```

This check happens **after** the controller already added the head node. Since `proxy_nodes` is a set, adding the same node again is a no-op.

### Consequence

- Setting `RAY_SERVE_ALWAYS_RUN_PROXY_ON_HEAD_NODE=0` had no effect because the controller adds the head node regardless.
- Setting `RAY_SERVE_ALWAYS_RUN_PROXY_ON_HEAD_NODE=1` (the default) was also a no-op since the head node was already present.

### Changes

- Removed `RAY_SERVE_ALWAYS_RUN_PROXY_ON_HEAD_NODE` definition from `constants.py`
- Removed the import and usage in `proxy_state.py`

---------

Signed-off-by: harshit <harshit@anyscale.com>
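The set-semantics argument above can be checked in two lines (the node IDs here are made up):

```python
# The controller's update always includes the head node...
proxy_nodes = {"worker-1", "head-node"}

# ...so the env-var branch re-adding it cannot change the set.
proxy_nodes.add("head-node")
print(proxy_nodes == {"worker-1", "head-node"})  # True: add() was a no-op
```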
## Description

Reviewing our testing logs, they can often be incredibly long. This PR aims to reduce them by changing three things:

1. By default, the CLIReporter in `run_rllib_example_script_experiment` reports an algorithm's training results at least every 5 seconds. This PR adds a `tune-max-report-freq` argument that we keep at 5 for end users, while in tests we change it to 30 seconds.
2. Change the verbosity of the Tune results from 2 to 1 when testing.
3. Removed `WARNING impala_learner.py:576 -- No old learner state to remove from the queue.` warnings.

---------

Signed-off-by: Mark Towers <mark@anyscale.com>
Co-authored-by: Mark Towers <mark@anyscale.com>
## Description

Changes:

- Added `AsList` aggregation, allowing aggregation of a given column's values into a single list element
- Added `AsListVectorized`
- Added modulo op to `Expr`

---------

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
…multi-agent (ray-project#59928)

## Description

For the FlattenObservation connector in the multi-agent case, if any of the agents had a nested space that was ignored, we used the `base_struct` version of the `observation_spaces` rather than the actual spaces, which is a problem for nested spaces.

FYI, I believe that `get_base_struct_from_space` can be removed, as Gymnasium has added the ability to treat `Dict` and `Tuple` spaces like a dict or tuple for iterating and calling `.keys()`, `.values()` and `.items()`. Possibly for a future PR.

## Related issues

Fixes ray-project#59849

---------

Signed-off-by: Mark Towers <mark@anyscale.com>
Signed-off-by: Artur Niederfahrenhorst <attaismyname@googlemail.com>
Co-authored-by: Mark Towers <mark@anyscale.com>
Co-authored-by: Artur Niederfahrenhorst <attaismyname@googlemail.com>
Co-authored-by: Artur Niederfahrenhorst <artur@anyscale.com>
…E.md (ray-project#59980)

If publishers follow the example publishing pattern, they should end up with a structure like:

```
doc/source/{path-to-examples}/my-example/
└── content/
    ├── notebook.ipynb
    ├── README.md
    └── …
```

with `path-to-examples` any of "serve/tutorials/" | "data/examples/" | "ray-overview/examples/" | "train/examples/" | "tune/examples/".

In the toctree (or `examples.yml` for data/train/serve), publishers *should* link to `notebook.ipynb`. The `README.md` exists only for display in the console once the example is converted into an Anyscale template.

### Issue

* `README.md` is a copy of the notebook and doesn't belong to any toctree because it is **not used by Ray Docs** -> Sphinx emits *orphan warnings* for these files, which fail CI because ReadTheDocs is configured to fail on warnings
* To silence the warning, publishers must manually add the file to `exclude_patterns` in `conf.py` -> This adds unnecessary overhead for publishers unfamiliar with Sphinx
* Over time, it also clutters `exclude_patterns`

### Solution

This PR automatically adds example `README.md` files to `exclude_patterns`, **only** when they match specific glob patterns. These patterns ensure the files:

* Live under a Ray docs examples directory, and
* Are contained within a `content/` folder

This minimizes the risk of accidentally excluding unrelated files.

---------

Signed-off-by: Aydin Abiar <aydin@anyscale.com>
Co-authored-by: Aydin Abiar <aydin@anyscale.com>
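A minimal sketch of what such glob-based exclusion might look like in `conf.py`. The pattern list below is an assumption derived from the directory layout described above, not the PR's exact code:

```python
from fnmatch import fnmatch

# Hypothetical glob patterns: README.md under a content/ folder of a Ray
# docs examples directory. The real PR's pattern list may differ.
EXAMPLE_README_GLOBS = [
    "serve/tutorials/*/content/README.md",
    "data/examples/*/content/README.md",
    "ray-overview/examples/*/content/README.md",
    "train/examples/*/content/README.md",
    "tune/examples/*/content/README.md",
]

exclude_patterns = []  # the Sphinx setting in conf.py
exclude_patterns.extend(EXAMPLE_README_GLOBS)

# An example README matches; an unrelated README at another level does not.
print(fnmatch("serve/tutorials/my-example/content/README.md", EXAMPLE_README_GLOBS[0]))  # True
print(fnmatch("serve/tutorials/README.md", EXAMPLE_README_GLOBS[0]))  # False
```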
## Description
Currently, when displaying hanging tasks, we show the Ray Data level task
index, which is useless for Ray Core debugging. This PR adds more info
to long-running tasks, namely:
- node_id
- pid
- attempt #
I considered adding this to the high-memory detector as well, but avoided
it for two reasons:
- it requires more refactoring of `RunningTaskInfo`
- afaik, it is not helpful in debugging, since high memory is reported
_after the task completes_
## Example script to trigger hanging issues
```python
import ray
import time
from ray.data._internal.issue_detection.detectors import HangingExecutionIssueDetectorConfig

# Lower the detection interval so the hanging-task report fires quickly.
ctx = ray.data.DataContext.get_current()
ctx.issue_detectors_config.hanging_detector_config = HangingExecutionIssueDetectorConfig(
    detection_time_interval_s=1.0,
)

def sleep(x):
    # With override_num_blocks=100 over 100 rows, each batch holds one row,
    # so this blocks exactly one task for a long time to simulate a hang.
    if x["id"] == 0:
        time.sleep(100)
    return x

ray.data.range(100, override_num_blocks=100).map_batches(sleep).materialize()
```
## Related issues
None
## Additional information
None
---------
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Signed-off-by: iamjustinhsu <140442892+iamjustinhsu@users.noreply.github.com>
Co-authored-by: Alexey Kudinkin <alexey.kudinkin@gmail.com>
…ay-project#59907)

## Description

The grid is not being rendered correctly in some logs. Instead, this change opts for a simpler representation from tabulate.

Signed-off-by: Goutam <goutam@anyscale.com>
## Description

### [Data] Fixes to DownstreamCapacityBackpressuePolicy

### Fixes

In `DownstreamCapacityBackpressuePolicy`, when calculating the available/utilized object store ratio per op, also include the downstream ineligible ops.

Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
## Description

`test_import_in_subprocess` was failing on Windows because `return subprocess.run(["python", "-c", "import pip_install_test"])` would use the system Python instead of the virtualenv's Python.

This pitfall is highlighted in the Python subprocess docs:

> Warning: For maximum reliability, use a fully qualified path for the executable. To search for an unqualified name on PATH, use [shutil.which()](https://docs.python.org/3/library/shutil.html#shutil.which). On all platforms, passing [sys.executable](https://docs.python.org/3/library/sys.html#sys.executable) is the recommended way to launch the current Python interpreter again, and use the -m command-line format to launch an installed module.

https://docs.python.org/3/library/subprocess.html

Test passed (but it looks like some other unrelated tests failed):
https://buildkite.com/organizations/ray-project/pipelines/premerge/builds/57162/jobs/019b9c44-0ec0-457b-8b8a-dac06d5ea818/log#13851-14985
https://buildkite.com/ray-project/premerge/builds/57254#019ba0e5-9bf8-4157-b997-2a2e4e01358b/L11484

---------

Signed-off-by: sampan <sampan@anyscale.com>
Signed-off-by: Sampan S Nayak <sampansnayak2@gmail.com>
Co-authored-by: sampan <sampan@anyscale.com>
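The recommended pattern can be sketched as follows (the helper name is mine, not the test's):

```python
import subprocess
import sys

def import_in_subprocess(module_name: str) -> int:
    # sys.executable is the full path of the running interpreter, so the
    # child uses the same (virtualenv) Python instead of whatever "python"
    # happens to resolve to on PATH.
    proc = subprocess.run([sys.executable, "-c", f"import {module_name}"])
    return proc.returncode

import_in_subprocess("json")  # returns 0: the import succeeded
```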
… Data APIs (ray-project#59918)

This PR splits the Input/Output API reference page into two separate pages to improve organization and mirror the structure of the user guides.

## Changes

- Renamed `input_output.rst` to `loading_data.rst`
- Created `saving_data.rst` with all saving/writing APIs
- Updated `api.rst` to reference both new files
- Updated all references from `input-output` to `loading-data-api`/`saving-data-api`
- Standardized section header formatting with dashes matching title length

Fixes ray-project#59301

---------

Signed-off-by: mgchoi239 <mg.choi.239@gmail.com>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Co-authored-by: mgchoi239 <mg.choi.239@gmail.com>
Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu>
…y-project#59959)

## Description

The [test_operators.py](https://github.com/ray-project/ray/blob/f85c5255669d8a682673401b54a86612a79058e3/python/ray/data/tests/test_operators.py) file (~1,670 lines, 28 test functions) occasionally times out during CI runs. We should split it into smaller, logically grouped test modules to improve test reliability and allow better parallel execution. @tianyi-ge

## Related issues

Fixes ray-project#59881

Signed-off-by: Haichuan <kaisennhu@gmail.com>
…ay-project#59955)

All PRs that are submitted to Ray must have clear titles and descriptions that give the reviewer adequate context. To help catch PRs that violate this rule, I've added a Bugbot rule to Cursor that automates this check, in turn taking load off the review process.

---------

Signed-off-by: joshlee <joshlee@anyscale.com>
…ject#59957)

# Summary

Before this PR, the training-failed error was buried in the `exc_text` part of the log. After this PR it should also appear in the `message` part of the log.

# Testing

Unit tests

---------

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
Dead code is a maintenance burden. Removing unused mocks. This came up as I was working on removing the ClusterLeaseManager from the GCS in ray-project#60008. Signed-off-by: irabbani <israbbani@gmail.com>
ray-project#60012)

## Description

Reverting ray-project#59852 as it causes release test infra to fail. We need to update the infra to work properly with the new port discovery settings.

## Additional information

Example failing build: https://buildkite.com/ray-project/release/builds/74681#
…orphan (ray-project#59982)

If publishers follow the example publishing pattern for ray data/serve/train, they should end up with a structure like:

```
doc/source/{path-to-examples}/my-example/
└── content/
    ├── notebook.ipynb
    ├── README.md
    └── …
```

with `path-to-examples` any of "serve/tutorials/" | "data/examples/" | "train/examples/".

In the `examples.yml`, publishers link to `notebook.ipynb`.

### Issue

* `examples.rst` is dynamically created by custom scripts in `custom_directives.py`. The script reads each `examples.yml` file and creates the HTML for the examples index page in Ray docs.
* Because `examples.rst` is initially empty, Sphinx considers all notebooks as "orphan" documents and emits warnings, which fail CI because ReadTheDocs is configured to fail on warnings
* To silence the warning, publishers must manually add `orphan: True` to the metadata of the notebook
* This adds unnecessary overhead for publishers unfamiliar with Sphinx. They should only worry about their content, not what an "orphan" document is

### Solution

This PR automatically adds the `orphan: True` metadata to any files listed in any of the `examples.yml` for ray data/serve/train. This ensures:

* `examples.yml` is used as the source of truth, so only the listed files are affected. No side effects on unrelated files.
* Publishers can focus on their content and don't have to worry about Sphinx.

---------

Signed-off-by: Aydin Abiar <aydin@anyscale.com>
Signed-off-by: Aydin Abiar <62435714+Aydin-ab@users.noreply.github.com>
Co-authored-by: Aydin Abiar <aydin@anyscale.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
… different processes (ray-project#59634)

**Problem:** fixes ray-project#57803

When tracing is enabled, calling an actor method from a different process than the one that created the actor fails with:

```
TypeError: got an unexpected keyword argument '_ray_trace_ctx'
```

This commonly occurs with Ray Serve, where:

- `serve start` creates the controller actor (process A)
- Dashboard calls `ray.get_actor()` to interact with it (process B)

## Repro

Simplest way to repro is to run the following:

```bash
ray start --head --tracing-startup-hook="ray.util.tracing.setup_local_tmp_tracing:setup_tracing"
serve start
```

But here is a core-specific repro script.

`repro_actor_module.py`

```python
class MyActor:
    """A simple actor class that will be decorated dynamically."""

    def __init__(self):
        self.value = 0

    def my_method(self, x):
        """A simple method."""
        return x * 2

    def check_alive(self):
        """Health check method."""
        return True

    def increment(self, amount=1):
        """Method with a default parameter."""
        self.value += amount
        return self.value
```

`repro_tracing_issue.py`

```python
import multiprocessing
import subprocess
import sys

NAMESPACE = "test_ns"


def creator_process(ready_event, done_event):
    import ray
    from ray.util.tracing.tracing_helper import _is_tracing_enabled

    # Import the actor class from module (NOT decorated yet)
    from repro_actor_module import MyActor

    setup_tracing_path = "ray.util.tracing.setup_local_tmp_tracing:setup_tracing"
    ray.init(_tracing_startup_hook=setup_tracing_path, namespace=NAMESPACE)
    print(f"[CREATOR] Tracing enabled: {_is_tracing_enabled()}")

    # Dynamically decorate and create the test actor (like Serve does)
    MyActorRemote = ray.remote(
        name="my_test_actor",
        namespace=NAMESPACE,
        num_cpus=0,
        lifetime="detached",
    )(MyActor)
    actor = MyActorRemote.remote()

    # Print signatures from creator's handle
    print(f"[CREATOR] Signatures in handle from creation:")
    for method_name, sig in actor._ray_method_signatures.items():
        param_names = [p.name for p in sig]
        print(f"  {method_name}: {param_names}")

    my_method_sig = actor._ray_method_signatures.get("my_method", [])
    has_trace = "_ray_trace_ctx" in [p.name for p in my_method_sig]
    print(f"[CREATOR] my_method has _ray_trace_ctx: {has_trace}")

    # Verify the method works from creator
    result = ray.get(actor.my_method.remote(5))
    print(f"[CREATOR] Test call result: {result}")

    # Signal that actor is ready
    print("[CREATOR] Actor created, signaling getter...")
    sys.stdout.flush()
    ready_event.set()

    # Wait for getter to finish
    done_event.wait(timeout=30)
    print("[CREATOR] Getter finished, shutting down...")

    # Cleanup
    ray.kill(actor)
    ray.shutdown()


def getter_process(ready_event, done_event):
    import ray
    from ray.util.tracing.tracing_helper import _is_tracing_enabled

    # Wait for creator to signal ready
    print("[GETTER] Waiting for creator to set up actor...")
    if not ready_event.wait(timeout=30):
        print("[GETTER] Timeout waiting for creator!")
        done_event.set()
        return

    # Connect to the existing cluster (this will also enable tracing from GCS hook)
    ray.init(address="auto", namespace=NAMESPACE)
    print(f"\n[GETTER] Tracing enabled: {_is_tracing_enabled()}")

    # Get the actor by name - this will RELOAD the class fresh in this process
    # The class loaded here was NEVER processed by _inject_tracing_into_class
    actor = ray.get_actor("my_test_actor", namespace=NAMESPACE)

    # Print signatures from getter's handle
    print(f"[GETTER] Signatures in handle from get_actor():")
    for method_name, sig in actor._ray_method_signatures.items():
        param_names = [p.name for p in sig]
        print(f"  {method_name}: {param_names}")

    my_method_sig = actor._ray_method_signatures.get("my_method", [])
    has_trace = "_ray_trace_ctx" in [p.name for p in my_method_sig]
    print(f"[GETTER] my_method has _ray_trace_ctx: {has_trace}")

    # Try calling a method
    print(f"\n[GETTER] Attempting to call my_method.remote(5)...")
    sys.stdout.flush()
    try:
        result = ray.get(actor.my_method.remote(5))
        print(f"[GETTER] Method call SUCCEEDED! Result: {result}")
    except TypeError as e:
        print(f"[GETTER] Method call FAILED with TypeError: {e}")

    # Signal done
    done_event.set()
    ray.shutdown()


def main():
    # Stop any existing Ray cluster
    print("Stopping any existing Ray cluster...")
    subprocess.run(["ray", "stop", "--force"], capture_output=True)

    # Create synchronization events
    ready_event = multiprocessing.Event()
    done_event = multiprocessing.Event()

    # Start creator process
    creator = multiprocessing.Process(target=creator_process, args=(ready_event, done_event))
    creator.start()

    # Start getter process (will connect to existing cluster)
    getter = multiprocessing.Process(target=getter_process, args=(ready_event, done_event))
    getter.start()

    # Wait for both to complete
    getter.join(timeout=60)
    creator.join(timeout=10)

    # Clean up any hung processes
    if creator.is_alive():
        creator.terminate()
        creator.join(timeout=5)
    if getter.is_alive():
        getter.terminate()
        getter.join(timeout=5)

    # Cleanup Ray
    print("\nCleaning up...")
    subprocess.run(["ray", "stop", "--force"], capture_output=True)
    print("Done.")


if __name__ == "__main__":
    main()
```

<details>
<summary>output from master</summary>

```bash
❯ python repro_tracing_issue.py
Stopping any existing Ray cluster...
[GETTER] Waiting for creator to set up actor...
2025-12-24 07:05:02,215 INFO worker.py:1991 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
/home/ubuntu/ray/python/ray/_private/worker.py:2039: FutureWarning: Tip: In future versions of Ray, Ray will no longer override accelerator visible devices env var if num_gpus=0 or num_gpus=None (default). To enable this behavior and turn off this error message, set RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=0
  warnings.warn(
[CREATOR] Tracing enabled: True
[CREATOR] Signatures in handle from creation:
  __init__: ['_ray_trace_ctx']
  __ray_call__: ['fn', 'args', '_ray_trace_ctx', 'kwargs']
  __ray_ready__: ['_ray_trace_ctx']
  __ray_terminate__: ['_ray_trace_ctx']
  check_alive: ['_ray_trace_ctx']
  increment: ['amount', '_ray_trace_ctx']
  my_method: ['x', '_ray_trace_ctx']
[CREATOR] my_method has _ray_trace_ctx: True
[CREATOR] Test call result: 10
[CREATOR] Actor created, signaling getter...
2025-12-24 07:05:02,953 INFO worker.py:1811 -- Connecting to existing Ray cluster at address: 172.31.7.228:38871...
2025-12-24 07:05:02,984 INFO worker.py:1991 -- Connected to Ray cluster. View the dashboard at 127.0.0.1:8265
/home/ubuntu/ray/python/ray/_private/worker.py:2039: FutureWarning: Tip: In future versions of Ray, Ray will no longer override accelerator visible devices env var if num_gpus=0 or num_gpus=None (default). To enable this behavior and turn off this error message, set RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=0
  warnings.warn(
[GETTER] Tracing enabled: True
[GETTER] Signatures in handle from get_actor():
  __init__: []
  __ray_call__: ['fn', 'args', '_ray_trace_ctx', 'kwargs']
  __ray_ready__: ['_ray_trace_ctx']
  __ray_terminate__: ['_ray_trace_ctx']
  check_alive: []
  increment: ['amount']
  my_method: ['x']
[GETTER] my_method has _ray_trace_ctx: False
[GETTER] Attempting to call my_method.remote(5)...
[GETTER] Method call FAILED with TypeError: got an unexpected keyword argument '_ray_trace_ctx'
[CREATOR] Getter finished, shutting down...

Cleaning up...
Done.
```

</details>

<details>
<summary>output from this PR</summary>

```bash
❯ python repro_tracing_issue.py
Stopping any existing Ray cluster...
[GETTER] Waiting for creator to set up actor...
2025-12-24 07:04:03,758 INFO worker.py:1991 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
/home/ubuntu/ray/python/ray/_private/worker.py:2039: FutureWarning: Tip: In future versions of Ray, Ray will no longer override accelerator visible devices env var if num_gpus=0 or num_gpus=None (default). To enable this behavior and turn off this error message, set RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=0
  warnings.warn(
[CREATOR] Tracing enabled: True
[CREATOR] Signatures in handle from creation:
  __init__: ['_ray_trace_ctx']
  __ray_call__: ['fn', 'args', '_ray_trace_ctx', 'kwargs']
  __ray_ready__: ['_ray_trace_ctx']
  __ray_terminate__: ['_ray_trace_ctx']
  check_alive: ['_ray_trace_ctx']
  increment: ['amount', '_ray_trace_ctx']
  my_method: ['x', '_ray_trace_ctx']
[CREATOR] my_method has _ray_trace_ctx: True
[CREATOR] Test call result: 10
[CREATOR] Actor created, signaling getter...
2025-12-24 07:04:04,476 INFO worker.py:1811 -- Connecting to existing Ray cluster at address: 172.31.7.228:37231...
2025-12-24 07:04:04,504 INFO worker.py:1991 -- Connected to Ray cluster. View the dashboard at 127.0.0.1:8265
/home/ubuntu/ray/python/ray/_private/worker.py:2039: FutureWarning: Tip: In future versions of Ray, Ray will no longer override accelerator visible devices env var if num_gpus=0 or num_gpus=None (default). To enable this behavior and turn off this error message, set RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=0
  warnings.warn(
[GETTER] Tracing enabled: True
[GETTER] Signatures in handle from get_actor():
  __init__: ['_ray_trace_ctx']
  __ray_call__: ['fn', 'args', '_ray_trace_ctx', 'kwargs']
  __ray_ready__: ['_ray_trace_ctx']
  __ray_terminate__: ['_ray_trace_ctx']
  check_alive: ['_ray_trace_ctx']
  increment: ['amount', '_ray_trace_ctx']
  my_method: ['x', '_ray_trace_ctx']
[GETTER] my_method has _ray_trace_ctx: True
[GETTER] Attempting to call my_method.remote(5)...
[GETTER] Method call SUCCEEDED! Result: 10
[CREATOR] Getter finished, shutting down...

Cleaning up...
Done.
```

</details>

**Root Cause:**

`_inject_tracing_into_class` sets `__signature__` (including `_ray_trace_ctx`) on the method object during actor creation. However:

1. When the actor class is serialized (cloudpickle) and loaded in another process, `__signature__` is **not preserved** on module-level functions. See the repro script at the end of the PR description as proof.
2. `_ActorClassMethodMetadata.create()` uses `inspect.unwrap()` which follows the `__wrapped__` chain to the **deeply unwrapped original method**.
3. The original method's `__signature__` was lost during serialization → signatures extracted **without** `_ray_trace_ctx`.
4. When calling the method, `_tracing_actor_method_invocation` adds `_ray_trace_ctx` to kwargs → **signature validation fails**.

**Fix:**

1. In `_inject_tracing_into_class`: Set `__signature__` on the **deeply unwrapped** method (via `inspect.unwrap`) rather than the immediate method. This ensures `_ActorClassMethodMetadata.create()` finds it after unwrapping.
2. In `load_actor_class`: Call `_inject_tracing_into_class` after loading to re-inject the lost `__signature__` attributes.

**Testing:**

- Added reproduction script demonstrating cross-process actor method calls with tracing
- All existing tracing tests pass
- Add a new test for serve with tracing

`repro_cloudpickle_signature.py`

```python
import inspect
import cloudpickle
import pickle
import multiprocessing


def check_signature_in_subprocess(pickled_func_bytes):
    func = pickle.loads(pickled_func_bytes)
    print(f"[SUBPROCESS] Unpickled function: {func}")
    print(f"[SUBPROCESS] Module: {func.__module__}")
    sig = getattr(func, '__signature__', None)
    if sig is not None:
        params = list(sig.parameters.keys())
        print(f"[SUBPROCESS] __signature__: {sig}")
        if '_ray_trace_ctx' in params:
            print(f"[SUBPROCESS] __signature__ WAS preserved")
            return True
        else:
            print(f"[SUBPROCESS] __signature__ NOT preserved (missing _ray_trace_ctx)")
            return False
    else:
        print(f"[SUBPROCESS] __signature__ NOT preserved (attribute missing)")
        return False


def main():
    from repro_actor_module import MyActor

    func = MyActor.my_method
    print(f"\n[MAIN] Function: {func}")
    print(f"[MAIN] Module: {func.__module__}")
    print(f"[MAIN] __signature__ before: {getattr(func, '__signature__', 'NOT SET')}")

    # Set a custom __signature__ with _ray_trace_ctx
    custom_sig = inspect.signature(func)
    new_params = list(custom_sig.parameters.values()) + [
        inspect.Parameter("_ray_trace_ctx", inspect.Parameter.KEYWORD_ONLY, default=None)
    ]
    func.__signature__ = custom_sig.replace(parameters=new_params)
    print(f"[MAIN] __signature__ after: {func.__signature__}")
    print(f"[MAIN] Parameters: {list(func.__signature__.parameters.keys())}")

    # Pickle
    print(f"\n[MAIN] Pickling with cloudpickle...")
    pickled = cloudpickle.dumps(func)

    # Test 1: Same process
    print(f"\n{'='*70}")
    print("TEST 1: Unpickle in SAME process")
    print(f"{'='*70}")
    same_func = pickle.loads(pickled)
    same_sig = getattr(same_func, '__signature__', None)
    if same_sig and '_ray_trace_ctx' in list(same_sig.parameters.keys()):
        print(f"Same process: __signature__ preserved")
    else:
        print(f"Same process: __signature__ NOT preserved")

    # Test 2: Different process
    print(f"\n{'='*70}")
    print("TEST 2: Unpickle in DIFFERENT process")
    print(f"{'='*70}")
    ctx = multiprocessing.get_context('spawn')
    with ctx.Pool(1) as pool:
        result = pool.apply(check_signature_in_subprocess, (pickled,))
    if result:
        print("__signature__ IS preserved (unexpected)")
    else:
        print("__signature__ is NOT preserved for functions from imported modules!")


if __name__ == "__main__":
    main()
```

---------

Signed-off-by: abrar <abrar@anyscale.com>
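Fix step 1 (setting `__signature__` on the deeply unwrapped function) can be sketched in isolation. `inject_trace_param` here is an illustrative stand-in for what `_inject_tracing_into_class` does per method, not the actual Ray code:

```python
import functools
import inspect

def inject_trace_param(func):
    # Follow the __wrapped__ chain to the original function, mirroring what
    # _ActorClassMethodMetadata.create() sees after inspect.unwrap().
    target = inspect.unwrap(func)
    sig = inspect.signature(target)
    params = list(sig.parameters.values()) + [
        inspect.Parameter("_ray_trace_ctx", inspect.Parameter.KEYWORD_ONLY, default=None)
    ]
    # Setting __signature__ on the deeply unwrapped object is the point of
    # the fix: unwrapping later still finds the extended signature.
    target.__signature__ = sig.replace(parameters=params)
    return func

def my_method(self, x):
    return x * 2

# A wrapper chain like the tracing decorator would create.
wrapped = functools.wraps(my_method)(lambda *a, **k: my_method(*a, **k))
inject_trace_param(wrapped)
# inspect.unwrap(wrapped) now exposes _ray_trace_ctx in its signature.
```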
Signed-off-by: abrar <abrar@anyscale.com>
… output (ray-project#60034)

Signed-off-by: yicheng <yicheng@anyscale.com>
Co-authored-by: yicheng <yicheng@anyscale.com>
…ct#59255)

Signed-off-by: dayshah <dhyey2019@gmail.com>
## Description

Third PR for isolating progress managers / making them easier to work with.

- Fully unify interfaces for all progress managers
- Introduce the `Noop` progress manager, basically for no-op situations
- Separate out the function for determining which progress manager to use
- Add the `verbose_progress` setting to `ExecutionOptions`; this was missing from previous versions, my bad

## Related issues

N/A

## Additional information

N/A

---------

Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>
## Description
Add `sql_params` support to `read_sql` so callers can pass [DB-API 2](https://peps.python.org/pep-0249/#id20) parameter bindings instead of string formatting. This enables safe, parameterized queries and is propagated through all SQL execution paths (count, sharding checks, and reads). Also adds a sqlite parameterized query test and updates the docstring.

## Related issues
Related to ray-project#54098.

## Additional information
Design/implementation notes:
- API: add optional `sql_params` to `read_sql`, matching DB-API 2 `cursor.execute(operation[, parameters])`.
- Call chain: `read_sql(...)` → `SQLDatasource(sql_params=...)` → `get_read_tasks(...)` → `supports_sharding`/`_get_num_rows`/fallback read/per-shard read → `_execute(cursor, sql, sql_params)`.
- No paramstyle parsing: Ray doesn't interpret placeholders; it passes `sql_params` through to the driver as-is.
- Behavior: if `sql_params` is None, `_execute` falls back to `cursor.execute(sql)`, preserving existing behavior.

Tests:
- `pytest python/ray/data/tests/test_sql.py`
- Local quick check (abridged REPL session; Ray log output trimmed):

```python
>>> import sqlite3
>>> import ray
>>> db = "example.db"
>>> conn = sqlite3.connect(db)
>>> conn.execute("DROP TABLE IF EXISTS movie")
>>> conn.execute("CREATE TABLE movie(title, year, score)")
>>> conn.executemany(
...     "INSERT INTO movie VALUES (?, ?, ?)",
...     [
...         ("Monty Python and the Holy Grail", 1975, 8.2),
...         ("And Now for Something Completely Different", 1971, 7.5),
...         ("Monty Python's Life of Brian", 1979, 8.0),
...     ],
... )
>>> conn.commit()
>>> conn.close()
>>> def create_connection():
...     return sqlite3.connect(db)
...
>>> # tuple
>>> ds_tuple = ray.data.read_sql(
...     "SELECT * FROM movie WHERE year >= ?",
...     create_connection,
...     sql_params=(1975,),
... )
>>> print("tuple:", ds_tuple.take_all())
tuple: [{'title': 'Monty Python and the Holy Grail', 'year': 1975, 'score': 8.2}, {'title': "Monty Python's Life of Brian", 'year': 1979, 'score': 8.0}]
>>> # list
>>> ds_list = ray.data.read_sql(
...     "SELECT * FROM movie WHERE year >= ?",
...     create_connection,
...     sql_params=[1975],
... )
>>> print("list:", ds_list.take_all())
list: [{'title': 'Monty Python and the Holy Grail', 'year': 1975, 'score': 8.2}, {'title': "Monty Python's Life of Brian", 'year': 1979, 'score': 8.0}]
>>> # dict
>>> ds_dict = ray.data.read_sql(
...     "SELECT * FROM movie WHERE year >= :year",
...     create_connection,
...     sql_params={"year": 1975},
... )
>>> print("dict:", ds_dict.take_all())
dict: [{'title': 'Monty Python and the Holy Grail', 'year': 1975, 'score': 8.2}, {'title': "Monty Python's Life of Brian", 'year': 1979, 'score': 8.0}]
```
---------
Signed-off-by: yaommen <myanstu@163.com>
Fixes missing space in the warning message. --------- Signed-off-by: You-Cheng Lin <mses010108@gmail.com>
Leaking Ray Train actors have been observed occupying GPU memory following Train run termination, causing training failures/OOMs in subsequent train runs. Despite the train actors being marked DEAD by Ray Core, we find that upon ssh-ing into nodes, that the actor processes are still alive and occupying valuable GPU memory. This PR: - Replaces `__ray_terminate__` with `ray.kill` in Train run shutdown and abort paths to guarantee the termination of train actors --------- Signed-off-by: JasonLi1909 <jasli1909@gmail.com> Signed-off-by: Jason Li <57246540+JasonLi1909@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
This works across Python versions and is much easier to use than a conda-managed Python env.

Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
- The async-inf readme wasn't included in the toctree and was in the exclude pattern list; fixed it.

Signed-off-by: harshit <harshit@anyscale.com>
Stop using the large OSS CI test base.

Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
…#59896)
## Description
Addresses a critical issue in the `DefaultAutoscalerV2` where nodes were not being properly scaled from zero. With this update, clusters managed by Ray will automatically provision additional nodes when there is workload demand, even when starting from an idle (zero-node) state.

## Related issues
Closes ray-project#59682

---------
Signed-off-by: Hsien-Cheng Huang <ryankert01@gmail.com>
Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
β¦ docs (ray-project#60390) Signed-off-by: Timothy Seah <tseah@anyscale.com>
## Description
### [Data] Add training ingest benchmark release test
Benchmark script for training data ingest with Ray Data. This script benchmarks different approaches for loading and preprocessing images:
- Loads images from S3 (parquet or JPEG format)
- Applies image transforms (crop, scale, flip)
- Iterates through batches with configurable batch sizes and prefetch settings
- Tests all hyperparameter combinations

Supported data formats:
- images_with_read_parquet: Uses ray.data.read_parquet() with embedded image bytes
- images_with_map_batches: Lists JPEG files via torchdata, downloads with map_batches
- images_with_read_images: Uses ray.data.read_images() with Partitioning

---------
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
Signed-off-by: Srinath Krishnamachari <68668616+srinathk10@users.noreply.github.com>
## Description ### Goal Make ray.data._internal.logical.rules a proper package entry point with short imports and alphabetized `__all__`. ### Changes Add/complete `__all__` in rule modules and re-export via `__init__.py`. Update imports to `from ray.data._internal.logical.rules import ...`. Keep helper imports module-scoped to avoid circular dependencies. ## Related issues Related to ray-project#60204 ## Additional information --------- Signed-off-by: 400Ping <jiekai.chang326@gmail.com> Signed-off-by: 400Ping <fourhundredping@gmail.com> Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
β¦roject#58936) Adding ray train workload templates as a general Ray example + anyscale template author: Ian Jordan --------- Signed-off-by: Aydin Abiar <aydin@anyscale.com> Signed-off-by: Aydin Abiar <62435714+Aydin-ab@users.noreply.github.com> Co-authored-by: Aydin Abiar <aydin@anyscale.com>
β¦jobs (ray-project#60447) ## Description Removed the warning that disables operator-level progress by default in Ray Jobs and sets `DataContext.enable_operator_progress_bars = True`. Edited related docstring. ## Related issues Closes ray-project#60419 ## Additional information Deleted the warning. Such a change will remove a dependency on a private Ray Core API. Now op_progress_bars is True by default unless changed to False manually by `ray.data.DataContext.get_current().enable_operator_progress_bars = False` as described in the warning message. --------- Signed-off-by: Hyunoh-Yeo <hyunoh.yeo@gmail.com>
## Description Checkpointable methods `restore_from_path` and `save_to_path` should make use of PyArrow filesystem (especially for the cloud storages use cases) to make sure that RLlib components are correctly saved or restored. --------- Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
## Description This PR adds a HPO example to RLlib using HyperOpt and APPO + Cartpole. Do we want this tested in nightly or premerge? --------- Signed-off-by: Mark Towers <mark@anyscale.com> Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com> Co-authored-by: Mark Towers <mark@anyscale.com> Co-authored-by: Kamil Kaczmarek <kamil@anyscale.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
β¦roject#60242) ## Why are these changes needed? This PR replaces SHA-1 with SHA-256 across Ray's internal hash operations to address security concerns in ray-project#53435 with a conservative approach for runtime environment paths. SHA-1 has known collision vulnerabilities and is deprecated - switching to SHA-256 improves security and FIPS compliance. ## Related issue number Fixes ray-project#53435 ### SHA-256 Migration (15 files) - Function collision detection - SSH control path hashing - Configuration cache keys - CloudWatch config hashing (with method renames and UTF-8 encoding fix) - File lock paths - Deployment code versioning - Feature hashing - Test files updated ### SHA-1 Retained (5 files) Runtime environment path generation kept SHA-1 to avoid: - Path length issues on Windows and network filesystems - Breaking existing cached environments - Unnecessary environment rebuilds For content-addressable hashing (non-cryptographic), SHA-1 provides sufficient collision resistance. ### Testing - All linter checks pass - No new errors introduced - Zero breaking changes to runtime environments --------- Signed-off-by: irabbani <irabbani@anyscale.com> Signed-off-by: Purushotham Pushpavanth <pushpavanthar@gmail.com> Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com> Co-authored-by: Ibrahim Rabbani <irabbani@anyscale.com> Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com> Co-authored-by: Zac Policzer <zacattackftw@gmail.com> Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
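The flavor of the migration can be sketched with stdlib `hashlib` (illustrative only; the actual call sites vary per file):

```python
import hashlib

def content_key_legacy(data: bytes) -> str:
    # Old: SHA-1, retained only for runtime-env path generation, where
    # short paths matter and the hash is content-addressing, not security.
    return hashlib.sha1(data).hexdigest()

def content_key(data: bytes) -> str:
    # New: SHA-256 for security-relevant keys (collision-resistant,
    # FIPS-friendly). Note the explicit UTF-8 encode for str inputs.
    return hashlib.sha256(data).hexdigest()

config = "max_workers: 4"
key = content_key(config.encode("utf-8"))
print(len(key))  # SHA-256 hex digest is 64 characters; SHA-1's is 40
```

One visible cost of the switch is path length: cache paths derived from the digest grow by 24 characters, which is why the runtime-env paths were left on SHA-1.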
…oject#60428) Seems like gcs_client_test is breaking on Windows due to the same issue as ray-project#60213, and I think any test that uses gcs_server_lib probably has a high chance of running into this issue due to its lengthy list of dependencies. Adds the compiler_param_file flag to any bazel target that depends on it.

---------
Signed-off-by: joshlee <joshlee@anyscale.com>
β¦to RAY_STOP_REQUESTED (ray-project#60412) ## Description When the autoscaler attempts to terminate instances in `RAY_INSTALLING` state (e.g., due to `max_num_nodes_per_type` limits being reduced), it crashes with an assertion error: ``` AssertionError: Invalid status transition from RAY_INSTALLING to RAY_STOP_REQUESTED ``` <details> <summary>Full stacktrace</summary> ``` 2026-01-22 17:49:33,055 ERROR autoscaler.py:222 -- Invalid status transition from RAY_INSTALLING to RAY_STOP_REQUESTED Traceback (most recent call last): File "/virtualenv/lib/python3.12/site-packages/ray/autoscaler/v2/autoscaler.py", line 206, in update_autoscaling_state return Reconciler.reconcile( ^^^^^^^^^^^^^^^^^^^^^ File "/virtualenv/lib/python3.12/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 126, in reconcile Reconciler._step_next( File "/virtualenv/lib/python3.12/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 289, in _step_next Reconciler._scale_cluster( File "/virtualenv/lib/python3.12/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 1223, in _scale_cluster Reconciler._update_instance_manager(instance_manager, version, updates) File "/virtualenv/lib/python3.12/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 635, in _update_instance_manager reply = instance_manager.update_instance_manager_state( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/virtualenv/lib/python3.12/site-packages/ray/autoscaler/v2/instance_manager/instance_manager.py", line 94, in update_instance_manager_state instance = self._update_instance( ^^^^^^^^^^^^^^^^^^^^^^ File "/virtualenv/lib/python3.12/site-packages/ray/autoscaler/v2/instance_manager/instance_manager.py", line 264, in _update_instance assert InstanceUtil.set_status(instance, update.new_instance_status), ( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ AssertionError: Invalid status transition from RAY_INSTALLING to RAY_STOP_REQUESTED ``` 
</details> The reconciler incorrectly tries to transition `RAY_INSTALLING` to `RAY_STOP_REQUESTED`, but the state machine doesn't allow this. Ray isn't running yet, so there's nothing to stop/drain. This fix adds `RAY_INSTALLING` to the same condition as `ALLOCATED`. Both states have cloud instances allocated but Ray not yet running, so they should transition directly to `TERMINATING`. ## Related issues Related to ray-project#59219 and ray-project#59550 (which fixed the same issue for `QUEUED` instances) ## Additional information The valid transitions from `RAY_INSTALLING` are: `RAY_RUNNING`, `RAY_INSTALL_FAILED`, `RAY_STOPPED`, `TERMINATING`, `TERMINATED`. `RAY_STOP_REQUESTED` is not in this set. Signed-off-by: Johanna Reiml <johanna@reiml.dev> Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
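The fix amounts to routing states that have no running Ray process straight to `TERMINATING`. A minimal sketch of the check (status names are from the traceback above; the transition table here is abridged and illustrative, not the full instance manager):

```python
# Abridged allowed-transitions table; only the states discussed above.
# The RAY_INSTALLING row matches the valid set listed in the PR.
VALID_TRANSITIONS = {
    "RAY_INSTALLING": {
        "RAY_RUNNING", "RAY_INSTALL_FAILED", "RAY_STOPPED",
        "TERMINATING", "TERMINATED",
    },
    "RAY_RUNNING": {"RAY_STOP_REQUESTED", "RAY_STOPPED", "TERMINATING"},
}

# States with a cloud instance allocated but Ray not yet running:
# there is nothing to stop or drain, so terminate directly.
NO_RAY_RUNNING = {"ALLOCATED", "RAY_INSTALLING"}

def next_status_for_termination(current: str) -> str:
    # Before the fix, RAY_INSTALLING fell through to RAY_STOP_REQUESTED
    # and tripped the assertion; now it terminates directly.
    return "TERMINATING" if current in NO_RAY_RUNNING else "RAY_STOP_REQUESTED"

def set_status(current: str, new: str) -> bool:
    return new in VALID_TRANSITIONS.get(current, set())
```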
Removes a bunch of unnecessary tests being run on both the min and max python versions. Run only Train v2 and Tune tests on the max python version. --------- Signed-off-by: Justin Yu <justinvyu@anyscale.com>
β¦roject#60414) This PR changes the bool parsing from `store_true` to parsing an input `True/False` string. This fixes an issue where default bool values are always set to `False` unless they're explicitly included as a CLI flag. * **The problem:** `action="store_true"` always defaults to False and ignores any `default=` parameter. Fields with `default=True` (like `enable_operator_progress_bars`, `actor_locality_enabled`, `enable_shard_locality`) would incorrectly default to False when parsed through the CLI, completely ignoring the Pydantic model's default. Also updates the release test commands from `--bool_flag` -> `--bool_flag=True` --------- Signed-off-by: Justin Yu <justinvyu@anyscale.com>
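The difference can be shown with a self-contained argparse sketch (generic flag names, not the actual CLI):

```python
import argparse

def str2bool(value: str) -> bool:
    # Parse an explicit "True"/"False" string so a field's default
    # survives when the flag is omitted on the command line.
    if value.lower() in ("true", "1", "yes"):
        return True
    if value.lower() in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")

# Before: store_true can only turn a flag on; an omitted flag is always
# False, so a field whose model default is True can never round-trip.
before = argparse.ArgumentParser()
before.add_argument("--bool_flag", action="store_true")

# After: the value is parsed explicitly, so omitting the flag keeps the
# default, and `--bool_flag=False` can turn a default-True field off.
after = argparse.ArgumentParser()
after.add_argument("--bool_flag", type=str2bool, default=True)
```

This is also why the release test commands had to change from `--bool_flag` to `--bool_flag=True`: the flag now takes a value.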
…ors (ray-project#60429)
When the autoscaler tries to launch a Ray cluster on GCP, it puts a new SSH key into the project metadata if necessary. The update may result in an HTTP 412 precondition failure if there are concurrent attempts to update the metadata. The error looks like this:

```python
googleapiclient.errors.HttpError: <HttpError 412 when requesting https://compute.googleapis.com/compute/v1/projects/my_gcp_project/setCommonInstanceMetadata?alt=json returned "Supplied fingerprint does not match current metadata fingerprint.". Details: "[{'message': 'Supplied fingerprint does not match current metadata fingerprint.', 'domain': 'global', 'reason': 'conditionNotMet', 'location': 'If-Match', 'locationType': 'header'}]">
```

The error can only be resolved by retrying. Therefore, to provide a better user experience, this PR does the retry for the user automatically:
1. Catch the error.
2. Reload the metadata and update it again.

---------
Signed-off-by: Rueian Huang <rueiancsie@gmail.com>
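The catch-reload-retry pattern can be sketched generically (a stub exception and injected callables; the real code uses googleapiclient and refetches the metadata fingerprint):

```python
import time

class PreconditionFailed(Exception):
    """Stands in for a googleapiclient HttpError with HTTP status 412."""

def set_metadata_with_retry(fetch_metadata, commit_metadata, max_attempts=5):
    # On a 412, another writer changed the metadata fingerprint between
    # our read and our write; reload the metadata and try again.
    for attempt in range(max_attempts):
        metadata = fetch_metadata()  # reload, picking up the new fingerprint
        try:
            return commit_metadata(metadata)
        except PreconditionFailed:
            if attempt == max_attempts - 1:
                raise
            time.sleep(2 ** attempt * 0.01)  # small backoff between attempts
```

Because the fetch happens inside the loop, every retry commits against fresh state, which is exactly what resolves the fingerprint mismatch.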
Signed-off-by: Aydin Abiar <aydin@anyscale.com> Co-authored-by: Aydin Abiar <aydin@anyscale.com>
β¦ resources when not scaling up (ray-project#60321) ## Description When the DefaultAutoscalerV2 decides not to scale up (e.g., utilization is low), it currently sends an empty resource request ([]) to the autoscaling coordinator. This should be changed to request the previous allocation instead, preserving the dataset's current resource footprint. ## Related issues Closes ray-project#60191 ## Additional information - Send current resources rather than [] when not scale up - Add test --------- Signed-off-by: machichima <nary12321@gmail.com> Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu> Co-authored-by: Balaji Veeramani <balaji@anyscale.com> Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu>
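A sketch of the before/after decision (hypothetical function and shapes; the actual coordinator API differs):

```python
def resource_request(should_scale_up, current_resources, proposed_resources):
    """Resources to report to the autoscaling coordinator."""
    if should_scale_up:
        return proposed_resources
    # Before the fix: this branch returned [], which told the coordinator
    # the dataset needed nothing and let its allocation be reclaimed.
    # After: re-request the current allocation so the footprint is kept.
    return current_resources

current = [{"CPU": 4}, {"GPU": 1}]
keep = resource_request(False, current, [{"CPU": 8}])
```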
# Summary
This PR changes the async validation API from
```python
ray.train.report(
    validate_fn=...,
    validate_config=...,
)
```
to
```python
trainer = TorchTrainer(
    validation_config=ValidationConfig(
        fn=...,
        task_config=ValidationTaskConfig(
            fn_kwargs=...,
        ),
    ),
    ...
)
ray.train.report(
    validation=ValidationTaskConfig(
        fn_kwargs=...,
    ),
)
```
These changes are backwards incompatible, but this is ok because async
validation is an
[alpha](https://docs.ray.io/en/latest/ray-contribute/stability.html#alpha)
feature.
We are moving `validation_fn` from `ray.train.report` to `TorchTrainer`
because:
1) If a Ray Train run dies while there are pending validations, Ray
doesn't currently have a good way to [serialize those Ray
tasks](https://docs.ray.io/en/latest/ray-core/objects/serialization.html#serializing-objectrefs)
and requeue them on a new run.
2) Requiring users to register a `validation_fn` up front enables better
type checking and prevents users from shooting themselves in the foot.
3) This API should still be sufficiently flexible. If users want
different reports to perform different validations, their
`validation_fn` can toggle between different modes based on a parameter.
We are changing from a `validate_config` dict to a
`ValidationConfig`/`ValidationTaskConfig` because:
1) A `validation_fn` that takes keyword arguments instead of a config
dict is less brittle.
2) We would like to encapsulate the entire async validation feature in a
single object to avoid user confusion from having too many
`ray.train.report` args. We plan to add more arguments to
`ValidationConfig`/`ValidationTaskConfig`, such as ray remote task args,
in future PR's.
This PR does everything needed to fully migrate this feature:
1) Code and unit test changes
2) Documentation changes
3) Release test changes
# Testing
Unit tests + release test
(https://buildkite.com/ray-project/release/builds/76396 - commits after
this don't change behavior)
---------
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Move away from the Go-style error-returning approach to a Pythonic exception-raising approach.

Topic: refactor-crane-lib
Relative: port-extract-wheel

Signed-off-by: andrew <andrew@anyscale.com>
Adds Dockerfile and Wanda files for Ray images, pulling artifacts from previous steps. CI additions will follow in a subsequent PR, planned for after the next release.

Topic: ray-image

Signed-off-by: andrew <andrew@anyscale.com>
…60287)
## Description
Adds the missing fields called out in ray-project#60129 to `ActorDefinitionEvent`, `ActorLifecycleEvent`, and `TaskLifecycleEvent`. The following fields were not added:
- ip address -> can be retrieved by looking at `NodeDefinitionEvent`
- repr_name -> cannot be added to definition events, as this field is only available after the actor creation event (refer: https://github.com/ray-project/ray/blob/da9dabf65d7c189ea4d1ad77e13b6f8c3be8936f/src/ray/protobuf/gcs.proto#L152)

## Related issues
related to ray-project#60129

---------
Signed-off-by: sampan <sampan@anyscale.com>
Signed-off-by: Sampan S Nayak <sampansnayak2@gmail.com>
Co-authored-by: sampan <sampan@anyscale.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Mengjin Yan <mengjinyan3@gmail.com>
β¦0314) Add four new fields to the NodeDefinitionEvent proto schema: - hostname: The hostname of the node manager - node_name: The user-provided identifier for this node - instance_id: The cloud provider instance ID - instance_type_name: The instance type (e.g., m5.xlarge) fields are populated in ray-project#60314 --------- Signed-off-by: sampan <sampan@anyscale.com> Signed-off-by: Sampan S Nayak <sampansnayak2@gmail.com> Co-authored-by: sampan <sampan@anyscale.com> Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com> Co-authored-by: Mengjin Yan <mengjinyan3@gmail.com>
## Description Moves OrderedBundleQueue to `bundle_queue/` directory, and rebases off of the `BaseBundleQueue` class. Creates a `FIFOBundleQueue` for simple ref bundle queues. Old PR: ray-project#59093 There are about ~400 lines of tests, so not actually that big ## Related issues ## Additional information --------- Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
β¦instead of self-referential remote calls (ray-project#60320) Refactor `_AutoscalingCoordinatorActor` to use direct threading with a `threading.Lock` instead of self-referential remote calls. This eliminates potential Ray Core timeout errors for internal operations and makes the code easier to debug. Fixes ray-project#60190 Contribution by Gittensor, see my contribution statistics at https://gittensor.io/miners/details?githubId=171929553 --------- Signed-off-by: DeborahOlaboye <deboraholaboye@gmail.com> Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
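The pattern can be sketched with a plain class (toy state, not the coordinator's real logic): guarding shared state with a `threading.Lock` instead of funneling every mutation through `self_handle.method.remote()` avoids a Ray round trip, and its timeout risk, per internal call.

```python
import threading

class Coordinator:
    """Actor-style class whose internal methods mutate state directly under a lock."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._allocations: dict[str, int] = {}

    def allocate(self, dataset_id: str, cpus: int) -> None:
        # Direct call under a lock; previously this would have been a
        # self-referential .remote() call subject to Ray Core timeouts.
        with self._lock:
            self._allocations[dataset_id] = cpus

    def total(self) -> int:
        with self._lock:
            return sum(self._allocations.values())

coord = Coordinator()
threads = [
    threading.Thread(target=coord.allocate, args=(f"ds{i}", 2)) for i in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Inside an actor this is safe because the lock serializes the concurrent callers that previously serialized through the actor's own mailbox.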
Allow users to configure ray remote args (such as number of retries) on the validation Ray task itself. Repurposed from ray-project#59165 since we decided to support all ray remote args as oppose to just retry/retry_exception. --------- Signed-off-by: Timothy Seah <tseah@anyscale.com>
β¦=1 (ray-project#60403) Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Code Review
This is a massive pull request, primarily an automated merge from master to main. The changes encompass a major refactoring of the CI/CD pipeline, including the wheel building process, test organization, and dependency management. Key changes include the introduction of a new rule-based system for test selection, the addition of Python 3.13 support, and the removal of Python 3.9 from some builds. There are also extensive documentation updates, new examples, and API changes. While most of the changes appear to be part of a coherent and positive refactoring, I've identified a couple of potential issues in the new CI configuration that could impact performance and maintainability.
```
*
@ ml tune train data serve
@ core_cpp cpp java python doc
@ linux_wheels macos_wheels dashboard tools release_tests
;
```
This wildcard rule at the end of the file seems overly broad. It will match any file not matched by a preceding rule and assign a large number of tags (ml, tune, train, data, serve, core_cpp, cpp, java, python, doc, linux_wheels, macos_wheels, dashboard, tools, release_tests). This could cause a huge number of tests to run for almost any change, potentially slowing down CI significantly and making it difficult to reason about which tests will run. Is this intentional? If this is a fallback, perhaps it should assign a smaller, more general set of tags. If it's a mistake, it should be removed or corrected.
```yaml
RAYCI_DISABLE_JAVA: "false"
RAYCI_WANDA_ALWAYS_REBUILD: "true"
JDK_SUFFIX: "-jdk"
ARCH_SUFFIX: "aarch64"
```
The ARCH_SUFFIX: "aarch64" environment variable appears to be unused in this step. The associated Wanda file (ci/docker/manylinux-cibase.wanda.yaml) and its Dockerfile do not seem to reference this variable. This is also inconsistent with the manylinux-cibase-aarch64 step which doesn't define it. To improve clarity and maintainability, consider removing this unused variable.
This Pull Request was created automatically to merge the latest changes from `master` into the `main` branch.

Created: 2026-01-26
Merge direction: `master` → `main`
Triggered by: Scheduled

Please review and merge if everything looks good.