
πŸ”„ daily merge: master β†’ main 2026-01-27 #758

Open
antfin-oss wants to merge 504 commits into main from
create-pull-request/patch-6bf585f4ae

Conversation

@antfin-oss

This Pull Request was created automatically to merge the latest changes from master into the main branch.

πŸ“… Created: 2026-01-27
πŸ”€ Merge direction: master β†’ main
πŸ€– Triggered by: Scheduled

Please review and merge if everything looks good.

Aydin-ab and others added 30 commits January 9, 2026 04:41
…E.md (ray-project#59980)

If publishers follow the example publishing pattern, they should end up
with a structure like:

```
doc/source/{path-to-examples}/my-example/
  └─ content/
     β”œβ”€ notebook.ipynb
     β”œβ”€ README.md
     └─ …
```
where `path-to-examples` is any of "serve/tutorials/", "data/examples/",
"ray-overview/examples/", "train/examples/", or "tune/examples/".

In the toctree (or `examples.yml` for data/train/serve), publishers
*should* link to `notebook.ipynb`. The `README.md` exists only for
display in the console once the example is converted into an Anyscale
template.

### Issue

* `README.md` is a copy of the notebook and doesn't belong to any
toctree because it is **not used by Ray Docs** -> Sphinx emits *orphan
warnings* for these files, which fail CI because ReadTheDocs is
configured to fail on warnings
* To silence the warning, publishers must manually add the file to
`exclude_patterns` in `conf.py` -> This adds unnecessary overhead for
publishers unfamiliar with Sphinx
* Over time, it also clutters `exclude_patterns`

### Solution

This PR automatically adds example `README.md` files to
`exclude_patterns`, **only** when they match specific glob patterns.
These patterns ensure the files:

* Live under a Ray docs examples directory, and
* Are contained within a `content/` folder

This minimizes the risk of accidentally excluding unrelated files.
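
For illustration, the effect is roughly equivalent to appending globs like these in `conf.py` (a sketch; the exact patterns in this PR may differ):

```python
# Sketch only: assumes exclude_patterns is already defined earlier in conf.py.
exclude_patterns += [
    "serve/tutorials/*/content/README.md",
    "data/examples/*/content/README.md",
    "ray-overview/examples/*/content/README.md",
    "train/examples/*/content/README.md",
    "tune/examples/*/content/README.md",
]
```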

---------

Signed-off-by: Aydin Abiar <aydin@anyscale.com>
Co-authored-by: Aydin Abiar <aydin@anyscale.com>
## Description
Currently, when displaying hanging tasks, we show the Ray Data-level task
index, which isn't useful for Ray Core debugging. This PR adds more info
to long-running tasks, namely:
- node_id
- pid
- attempt #

I considered adding this to the high-memory detector, but avoided it for two
reasons:
- it requires more refactoring of `RunningTaskInfo`
- as far as I know, it's not helpful for debugging since high memory is reported _after the task
completes_

## Example script to trigger hanging issues
```python
import ray
import time
from ray.data._internal.issue_detection.detectors import HangingExecutionIssueDetectorConfig


ctx = ray.data.DataContext.get_current()
ctx.issue_detectors_config.hanging_detector_config = HangingExecutionIssueDetectorConfig(
    detection_time_interval_s=1.0,
)

def sleep(x):
    if x['id'] == 0:
        time.sleep(100)
    return x
ray.data.range(100, override_num_blocks=100).map_batches(sleep).materialize()
```




## Related issues
None

## Additional information
None

---------

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Signed-off-by: iamjustinhsu <140442892+iamjustinhsu@users.noreply.github.com>
Co-authored-by: Alexey Kudinkin <alexey.kudinkin@gmail.com>
…ay-project#59907)

## Description
The grid is not being rendered correctly in some logs. Instead, this
change opts for a simpler representation from tabulate.


Signed-off-by: Goutam <goutam@anyscale.com>

## Description

### [Data] Fixes to DownstreamCapacityBackpressurePolicy

### Fixes

In DownstreamCapacityBackpressurePolicy, when calculating the
available/utilized object store ratio per Op, also include the
downstream ineligible Ops.



Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
## Description
test_import_in_subprocess was failing on Windows because `return
subprocess.run(["python", "-c", "import pip_install_test"])` would use the
system Python instead of the virtualenv's Python.

The warning highlighted in the Python subprocess docs:
> Warning For maximum reliability, use a fully qualified path for the
executable. To search for an unqualified name on PATH, use
[shutil.which()](https://docs.python.org/3/library/shutil.html#shutil.which).
On all platforms, passing
[sys.executable](https://docs.python.org/3/library/sys.html#sys.executable)
is the recommended way to launch the current Python interpreter again,
and use the -m command-line format to launch an installed module.

https://docs.python.org/3/library/subprocess.html
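
A minimal sketch of the fix, following that recommendation (the actual test helper differs slightly):

```python
import subprocess
import sys

# Use the current interpreter (the virtualenv's Python) rather than whatever
# "python" resolves to on PATH, which is the system Python on Windows.
result = subprocess.run([sys.executable, "-c", "import pip_install_test"])
```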

test passed (but looks like some other unrelated tests failed): 

https://buildkite.com/organizations/ray-project/pipelines/premerge/builds/57162/jobs/019b9c44-0ec0-457b-8b8a-dac06d5ea818/log#13851-14985

https://buildkite.com/ray-project/premerge/builds/57254#019ba0e5-9bf8-4157-b997-2a2e4e01358b/L11484

## Related issues


## Additional information

---------

Signed-off-by: sampan <sampan@anyscale.com>
Signed-off-by: Sampan S Nayak <sampansnayak2@gmail.com>
Co-authored-by: sampan <sampan@anyscale.com>
… Data APIs (ray-project#59918)

This PR splits the Input/Output API reference page into two separate
pages to improve organization and mirror the structure of the user
guides.

## Changes
- Renamed `input_output.rst` to `loading_data.rst`
- Created `saving_data.rst` with all saving/writing APIs
- Updated `api.rst` to reference both new files
- Updated all references from `input-output` to
`loading-data-api`/`saving-data-api`
- Standardized section header formatting with dashes matching title
length

Fixes ray-project#59301

---------

Signed-off-by: mgchoi239 <mg.choi.239@gmail.com>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Co-authored-by: mgchoi239 <mg.choi.239@gmail.com>
Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu>
…y-project#59959)

## Description
The
[test_operators.py](https://github.com/ray-project/ray/blob/f85c5255669d8a682673401b54a86612a79058e3/python/ray/data/tests/test_operators.py)
file (~1,670 lines, 28 test functions) occasionally times out during CI
runs. We should split it into smaller, logically grouped test modules to
improve test reliability and allow better parallel execution.
@tianyi-ge 
## Related issues
Fixes ray-project#59881

Signed-off-by: Haichuan <kaisennhu@gmail.com>
…ay-project#59955)

All PRs that are submitted to Ray must have clear titles and
descriptions that give the reviewer adequate context. To help catch PRs
that violate this rule, I've added a bugbot rule to Cursor that
automates this check, in turn taking load off of the review process.

---------

Signed-off-by: joshlee <joshlee@anyscale.com>
…ject#59957)

# Summary

Before this PR, the training failed error was buried in the `exc_text`
part of the log. After this PR it should also appear in the `message`
part of the log.

# Testing

Unit tests

---------

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
Dead code is a maintenance burden. Removing unused mocks. 

This came up as I was working on removing the ClusterLeaseManager from
the GCS in ray-project#60008.

Signed-off-by: irabbani <israbbani@gmail.com>
ray-project#60012)

## Description
Reverting ray-project#59852 as it causes the
release test infra to fail. We need to update the infra to work properly
with the new port discovery settings.


## Additional information
Example failing build:
https://buildkite.com/ray-project/release/builds/74681#
…orphan (ray-project#59982)

If publishers follow the example publishing pattern for ray
data/serve/train, they should end up with a structure like:

```
doc/source/{path-to-examples}/my-example/
  └─ content/
     β”œβ”€ notebook.ipynb
     β”œβ”€ README.md
     └─ …
```
where `path-to-examples` is any of "serve/tutorials/", "data/examples/", or
"train/examples/".

In the `examples.yml`, publishers link to `notebook.ipynb`.

### Issue

* `examples.rst` is dynamically created by custom scripts in
`custom_directives.py`. The script reads each `examples.yml` file and
creates the HTML for the examples index page in the Ray docs.
* Because `examples.rst` is initially empty, Sphinx considers all
notebooks as "orphan" documents and emits warnings, which fail CI
because ReadTheDocs is configured to fail on warnings.
* To silence the warning, publishers must manually add an `orphan: True`
field to the metadata of the notebook.
* This adds unnecessary overhead for publishers unfamiliar with Sphinx.
They should only worry about their content, not what an "orphan"
document is.

### Solution

This PR automatically adds the `orphan: True` metadata to any files
listed in any of the `examples.yml` for ray data/serve/train. This
ensures:

* `examples.yml` is used as the source of truth, so only the listed files are
affected. There are no side effects on unrelated files.
* Publishers can focus on their content and don't have to worry about Sphinx.
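
A rough sketch of the mechanism, with a hypothetical YAML layout and helper name; the real logic lives in the docs build scripts:

```python
import nbformat
import yaml

def mark_examples_as_orphans(examples_yml_path: str) -> None:
    """Mark every notebook listed in examples.yml as a Sphinx orphan."""
    with open(examples_yml_path) as f:
        examples = yaml.safe_load(f)
    for example in examples.get("examples", []):  # assumed YAML structure
        nb_path = example["path"]                  # assumed key name
        nb = nbformat.read(nb_path, as_version=4)
        nb.metadata["orphan"] = True               # silences the orphan warning
        nbformat.write(nb, nb_path)
```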

---------

Signed-off-by: Aydin Abiar <aydin@anyscale.com>
Signed-off-by: Aydin Abiar <62435714+Aydin-ab@users.noreply.github.com>
Co-authored-by: Aydin Abiar <aydin@anyscale.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
… different processes (ray-project#59634)

**Problem:**

fixes ray-project#57803

When tracing is enabled, calling an actor method from a different
process than the one that created the actor fails with:
```
TypeError: got an unexpected keyword argument '_ray_trace_ctx'
```

This commonly occurs with Ray Serve, where:
- `serve start` creates the controller actor (process A)
- Dashboard calls `ray.get_actor()` to interact with it (process B)

## Repro
The simplest way to reproduce this is to run the following:
```bash
ray start --head --tracing-startup-hook="ray.util.tracing.setup_local_tmp_tracing:setup_tracing"
serve start
```

But here is a core specific repro script

`repro_actor_module.py`
```python
class MyActor:
    """A simple actor class that will be decorated dynamically."""
    
    def __init__(self):
        self.value = 0
    
    def my_method(self, x):
        """A simple method."""
        return x * 2
    
    def check_alive(self):
        """Health check method."""
        return True
    
    def increment(self, amount=1):
        """Method with a default parameter."""
        self.value += amount
        return self.value
```

`repro_tracing_issue.py`
```python
import multiprocessing
import subprocess
import sys


NAMESPACE = "test_ns"


def creator_process(ready_event, done_event):
    import ray
    from ray.util.tracing.tracing_helper import _is_tracing_enabled
    
    # Import the actor class from module (NOT decorated yet)
    from repro_actor_module import MyActor
    
    setup_tracing_path = "ray.util.tracing.setup_local_tmp_tracing:setup_tracing"
    ray.init(_tracing_startup_hook=setup_tracing_path, namespace=NAMESPACE)
    
    print(f"[CREATOR] Tracing enabled: {_is_tracing_enabled()}")
    
    # Dynamically decorate and create the test actor (like Serve does)
    MyActorRemote = ray.remote(
        name="my_test_actor",
        namespace=NAMESPACE,
        num_cpus=0,
        lifetime="detached",
    )(MyActor)
    
    actor = MyActorRemote.remote()
    
    # Print signatures from creator's handle
    print(f"[CREATOR] Signatures in handle from creation:")
    for method_name, sig in actor._ray_method_signatures.items():
        param_names = [p.name for p in sig]
        print(f"  {method_name}: {param_names}")
    
    my_method_sig = actor._ray_method_signatures.get("my_method", [])
    has_trace = "_ray_trace_ctx" in [p.name for p in my_method_sig]
    print(f"[CREATOR] my_method has _ray_trace_ctx: {has_trace}")
    
    # Verify the method works from creator
    result = ray.get(actor.my_method.remote(5))
    print(f"[CREATOR] Test call result: {result}")
    
    # Signal that actor is ready
    print("[CREATOR] Actor created, signaling getter...")
    sys.stdout.flush()
    ready_event.set()
    
    # Wait for getter to finish
    done_event.wait(timeout=30)
    print("[CREATOR] Getter finished, shutting down...")
    
    # Cleanup
    ray.kill(actor)
    ray.shutdown()


def getter_process(ready_event, done_event):
    import ray
    from ray.util.tracing.tracing_helper import _is_tracing_enabled
    
    # Wait for creator to signal ready
    print("[GETTER] Waiting for creator to set up actor...")
    if not ready_event.wait(timeout=30):
        print("[GETTER] Timeout waiting for creator!")
        done_event.set()
        return
    
    # Connect to the existing cluster (this will also enable tracing from GCS hook)
    ray.init(address="auto", namespace=NAMESPACE)
    
    print(f"\n[GETTER] Tracing enabled: {_is_tracing_enabled()}")
    
    # Get the actor by name - this will RELOAD the class fresh in this process
    # The class loaded here was NEVER processed by _inject_tracing_into_class
    actor = ray.get_actor("my_test_actor", namespace=NAMESPACE)
    
    # Print signatures from getter's handle
    print(f"[GETTER] Signatures in handle from get_actor():")
    for method_name, sig in actor._ray_method_signatures.items():
        param_names = [p.name for p in sig]
        print(f"  {method_name}: {param_names}")
    
    my_method_sig = actor._ray_method_signatures.get("my_method", [])
    has_trace = "_ray_trace_ctx" in [p.name for p in my_method_sig]
    print(f"[GETTER] my_method has _ray_trace_ctx: {has_trace}")
    
    # Try calling a method
    print(f"\n[GETTER] Attempting to call my_method.remote(5)...")
    sys.stdout.flush()
    try:
        result = ray.get(actor.my_method.remote(5))
        print(f"[GETTER] Method call SUCCEEDED! Result: {result}")
    except TypeError as e:
        print(f"[GETTER] Method call FAILED with TypeError: {e}")
    
    # Signal done
    done_event.set()
    ray.shutdown()


def main():
    # Stop any existing Ray cluster
    print("Stopping any existing Ray cluster...")
    subprocess.run(["ray", "stop", "--force"], capture_output=True)
    
    # Create synchronization events
    ready_event = multiprocessing.Event()
    done_event = multiprocessing.Event()
    
    # Start creator process
    creator = multiprocessing.Process(target=creator_process, args=(ready_event, done_event))
    creator.start()
    
    # Start getter process (will connect to existing cluster)
    getter = multiprocessing.Process(target=getter_process, args=(ready_event, done_event))
    getter.start()
    
    # Wait for both to complete
    getter.join(timeout=60)
    creator.join(timeout=10)
    
    # Clean up any hung processes
    if creator.is_alive():
        creator.terminate()
        creator.join(timeout=5)
    if getter.is_alive():
        getter.terminate()
        getter.join(timeout=5)
    
    # Cleanup Ray
    print("\nCleaning up...")
    subprocess.run(["ray", "stop", "--force"], capture_output=True)
    print("Done.")


if __name__ == "__main__":
    main()
```

<details>

<summary> output from master </summary>

```bash
❯ python repro_tracing_issue.py
Stopping any existing Ray cluster...
[GETTER] Waiting for creator to set up actor...
2025-12-24 07:05:02,215 INFO worker.py:1991 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265 
/home/ubuntu/ray/python/ray/_private/worker.py:2039: FutureWarning: Tip: In future versions of Ray, Ray will no longer override accelerator visible devices env var if num_gpus=0 or num_gpus=None (default). To enable this behavior and turn off this error message, set RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=0
  warnings.warn(
[CREATOR] Tracing enabled: True
[CREATOR] Signatures in handle from creation:
  __init__: ['_ray_trace_ctx']
  __ray_call__: ['fn', 'args', '_ray_trace_ctx', 'kwargs']
  __ray_ready__: ['_ray_trace_ctx']
  __ray_terminate__: ['_ray_trace_ctx']
  check_alive: ['_ray_trace_ctx']
  increment: ['amount', '_ray_trace_ctx']
  my_method: ['x', '_ray_trace_ctx']
[CREATOR] my_method has _ray_trace_ctx: True
[CREATOR] Test call result: 10
[CREATOR] Actor created, signaling getter...
2025-12-24 07:05:02,953 INFO worker.py:1811 -- Connecting to existing Ray cluster at address: 172.31.7.228:38871...
2025-12-24 07:05:02,984 INFO worker.py:1991 -- Connected to Ray cluster. View the dashboard at 127.0.0.1:8265 
/home/ubuntu/ray/python/ray/_private/worker.py:2039: FutureWarning: Tip: In future versions of Ray, Ray will no longer override accelerator visible devices env var if num_gpus=0 or num_gpus=None (default). To enable this behavior and turn off this error message, set RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=0
  warnings.warn(

[GETTER] Tracing enabled: True
[GETTER] Signatures in handle from get_actor():
  __init__: []
  __ray_call__: ['fn', 'args', '_ray_trace_ctx', 'kwargs']
  __ray_ready__: ['_ray_trace_ctx']
  __ray_terminate__: ['_ray_trace_ctx']
  check_alive: []
  increment: ['amount']
  my_method: ['x']
[GETTER] my_method has _ray_trace_ctx: False

[GETTER] Attempting to call my_method.remote(5)...
[GETTER] Method call FAILED with TypeError: got an unexpected keyword argument '_ray_trace_ctx'
[CREATOR] Getter finished, shutting down...

Cleaning up...
Done.
```

</details>

<details>

<summary>output from this PR</summary>

```bash
❯ python repro_tracing_issue.py
Stopping any existing Ray cluster...
[GETTER] Waiting for creator to set up actor...
2025-12-24 07:04:03,758 INFO worker.py:1991 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265 
/home/ubuntu/ray/python/ray/_private/worker.py:2039: FutureWarning: Tip: In future versions of Ray, Ray will no longer override accelerator visible devices env var if num_gpus=0 or num_gpus=None (default). To enable this behavior and turn off this error message, set RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=0
  warnings.warn(
[CREATOR] Tracing enabled: True
[CREATOR] Signatures in handle from creation:
  __init__: ['_ray_trace_ctx']
  __ray_call__: ['fn', 'args', '_ray_trace_ctx', 'kwargs']
  __ray_ready__: ['_ray_trace_ctx']
  __ray_terminate__: ['_ray_trace_ctx']
  check_alive: ['_ray_trace_ctx']
  increment: ['amount', '_ray_trace_ctx']
  my_method: ['x', '_ray_trace_ctx']
[CREATOR] my_method has _ray_trace_ctx: True
[CREATOR] Test call result: 10
[CREATOR] Actor created, signaling getter...
2025-12-24 07:04:04,476 INFO worker.py:1811 -- Connecting to existing Ray cluster at address: 172.31.7.228:37231...
2025-12-24 07:04:04,504 INFO worker.py:1991 -- Connected to Ray cluster. View the dashboard at 127.0.0.1:8265 
/home/ubuntu/ray/python/ray/_private/worker.py:2039: FutureWarning: Tip: In future versions of Ray, Ray will no longer override accelerator visible devices env var if num_gpus=0 or num_gpus=None (default). To enable this behavior and turn off this error message, set RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=0
  warnings.warn(

[GETTER] Tracing enabled: True
[GETTER] Signatures in handle from get_actor():
  __init__: ['_ray_trace_ctx']
  __ray_call__: ['fn', 'args', '_ray_trace_ctx', 'kwargs']
  __ray_ready__: ['_ray_trace_ctx']
  __ray_terminate__: ['_ray_trace_ctx']
  check_alive: ['_ray_trace_ctx']
  increment: ['amount', '_ray_trace_ctx']
  my_method: ['x', '_ray_trace_ctx']
[GETTER] my_method has _ray_trace_ctx: True

[GETTER] Attempting to call my_method.remote(5)...
[GETTER] Method call SUCCEEDED! Result: 10
[CREATOR] Getter finished, shutting down...

Cleaning up...
Done.
```

</details>

**Root Cause:**

`_inject_tracing_into_class` sets `__signature__` (including
`_ray_trace_ctx`) on the method object during actor creation. However:

1. When the actor class is serialized (cloudpickle) and loaded in
another process, `__signature__` is **not preserved** on module-level
functions. See repro script at the end of PR description as proof
2. `_ActorClassMethodMetadata.create()` uses `inspect.unwrap()` which
follows the `__wrapped__` chain to the **deeply unwrapped original
method**
3. The original method's `__signature__` was lost during serialization β†’
signatures extracted **without** `_ray_trace_ctx`
4. When calling the method, `_tracing_actor_method_invocation` adds
`_ray_trace_ctx` to kwargs β†’ **signature validation fails**

**Fix:**

1. In `_inject_tracing_into_class`: Set `__signature__` on the **deeply
unwrapped** method (via `inspect.unwrap`) rather than the immediate
method. This ensures `_ActorClassMethodMetadata.create()` finds it after
unwrapping.

2. In `load_actor_class`: Call `_inject_tracing_into_class` after
loading to re-inject the lost `__signature__` attributes.
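
A rough sketch of the idea behind fix (1), using a hypothetical helper name: attach the augmented signature to the deeply unwrapped function so that later signature extraction, which also unwraps, can find it:

```python
import inspect

def _add_trace_ctx_to_signature(method) -> None:
    # Follow the __wrapped__ chain to the original function, the same object
    # that _ActorClassMethodMetadata.create() ends up inspecting.
    target = inspect.unwrap(method)
    sig = inspect.signature(target)
    params = list(sig.parameters.values()) + [
        inspect.Parameter(
            "_ray_trace_ctx", inspect.Parameter.KEYWORD_ONLY, default=None
        )
    ]
    target.__signature__ = sig.replace(parameters=params)
```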

**Testing:**
- Added reproduction script demonstrating cross-process actor method
calls with tracing
- All existing tracing tests pass
- Added a new test for Serve with tracing

`repro_cloudpickle_signature.py`
```python
import inspect
import cloudpickle
import pickle
import multiprocessing


def check_signature_in_subprocess(pickled_func_bytes):
    func = pickle.loads(pickled_func_bytes)
    
    print(f"[SUBPROCESS] Unpickled function: {func}")
    print(f"[SUBPROCESS] Module: {func.__module__}")
    
    sig = getattr(func, '__signature__', None)
    if sig is not None:
        params = list(sig.parameters.keys())
        print(f"[SUBPROCESS] __signature__: {sig}")
        if '_ray_trace_ctx' in params:
            print(f"[SUBPROCESS] __signature__ WAS preserved")
            return True
        else:
            print(f"[SUBPROCESS] __signature__ NOT preserved (missing _ray_trace_ctx)")
            return False
    else:
        print(f"[SUBPROCESS] __signature__ NOT preserved (attribute missing)")
        return False


def main():
    from repro_actor_module import MyActor
    func = MyActor.my_method
    
    print(f"\n[MAIN] Function: {func}")
    print(f"[MAIN] Module: {func.__module__}")
    print(f"[MAIN] __signature__ before: {getattr(func, '__signature__', 'NOT SET')}")
    
    # Set a custom __signature__ with _ray_trace_ctx
    custom_sig = inspect.signature(func)
    new_params = list(custom_sig.parameters.values()) + [
        inspect.Parameter("_ray_trace_ctx", inspect.Parameter.KEYWORD_ONLY, default=None)
    ]
    func.__signature__ = custom_sig.replace(parameters=new_params)
    
    print(f"[MAIN] __signature__ after: {func.__signature__}")
    print(f"[MAIN] Parameters: {list(func.__signature__.parameters.keys())}")
    
    # Pickle
    print(f"\n[MAIN] Pickling with cloudpickle...")
    pickled = cloudpickle.dumps(func)
    
    # Test 1: Same process
    print(f"\n{'='*70}")
    print("TEST 1: Unpickle in SAME process")
    print(f"{'='*70}")
    same_func = pickle.loads(pickled)
    same_sig = getattr(same_func, '__signature__', None)
    if same_sig and '_ray_trace_ctx' in list(same_sig.parameters.keys()):
        print(f"Same process: __signature__ preserved")
    else:
        print(f"Same process: __signature__ NOT preserved")
    
    # Test 2: Different process
    print(f"\n{'='*70}")
    print("TEST 2: Unpickle in DIFFERENT process")
    print(f"{'='*70}")
    
    ctx = multiprocessing.get_context('spawn')
    with ctx.Pool(1) as pool:
        result = pool.apply(check_signature_in_subprocess, (pickled,))
    
    if result:
        print("__signature__ IS preserved (unexpected)")
    else:
        print("__signature__ is NOT preserved for functions from imported modules!")


if __name__ == "__main__":
    main()

```

---------

Signed-off-by: abrar <abrar@anyscale.com>
Signed-off-by: abrar <abrar@anyscale.com>
… output (ray-project#60034)

Signed-off-by: yicheng <yicheng@anyscale.com>
Co-authored-by: yicheng <yicheng@anyscale.com>
## Description
Third PR for isolating progress managers / making them easier to work
with.

- Fully unify interfaces for all progress managers
- Introduce the `Noop` progress --> basically to use for no-op
situations
- Separate out function for determining which progress manager to use
- Add the `verbose_progress` setting to `ExecutionOptions`. This was
missing from previous versions.

## Related issues
N/A

## Additional information
N/A

---------

Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>
## Description

Add sql_params support to read_sql so callers can pass [DB‑API
2](https://peps.python.org/pep-0249/#id20) parameter bindings instead of
string formatting. This enables safe, parameterized queries and is
propagated through all SQL execution paths (count, sharding checks, and
reads). Also adds a sqlite parameterized query test and updates the
docstring.

## Related issues

Related to ray-project#54098.

## Additional information

Design/implementation notes:

- API: add optional sql_params to read_sql, matching DB‑API 2
cursor.execute(operation[, parameters]).
- Call chain: 
   read_sql(...) 
   β†’ SQLDatasource(sql_params=...) 
   β†’ get_read_tasks(...)
   β†’ supports_sharding/_get_num_rows/fallback read/per‑shard read 
   β†’ _execute(cursor, sql, sql_params).
- No paramstyle parsing: Ray doesn’t interpret placeholders; it passes
sql_params through to the driver as‑is.
- Behavior: if sql_params is None, _execute falls back to
cursor.execute(sql), preserving existing behavior.

Tests:

- pytest python/ray/data/tests/test_sql.py

- Local quick check (example):

```python
Python 3.10.19 | packaged by conda-forge | (main, Oct 22 2025, 22:46:49) [Clang 19.1.7 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sqlite3
>>> import ray
>>> 
>>> db = "example.db"
>>> conn = sqlite3.connect(db)
>>> conn.execute("DROP TABLE IF EXISTS movie")
<sqlite3.Cursor object at 0x1035af6c0>
>>> conn.execute("CREATE TABLE movie(title, year, score)")
<sqlite3.Cursor object at 0x1055c7040>
>>> conn.executemany(
...     "INSERT INTO movie VALUES (?, ?, ?)",
...     [
...         ("Monty Python and the Holy Grail", 1975, 8.2),
...         ("And Now for Something Completely Different", 1971, 7.5),
...         ("Monty Python's Life of Brian", 1979, 8.0),
...     ],
... )
<sqlite3.Cursor object at 0x1035af6c0>
>>> conn.commit()
>>> conn.close()
>>> 
>>> def create_connection():
...     return sqlite3.connect(db)
... 
>>> # tuple 
>>> ds_tuple = ray.data.read_sql(
...     "SELECT * FROM movie WHERE year >= ?",
...     create_connection,
...     sql_params=(1975,),
... )
2026-01-11 00:26:54,103 INFO worker.py:2007 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265 
/Users/XXX/miniforge3/envs/clion-ray-ce/lib/python3.10/site-packages/ray/_private/worker.py:2055: FutureWarning: Tip: In future versions of Ray, Ray will no longer override accelerator visible devices env var if num_gpus=0 or num_gpus=None (default). To enable this behavior and turn off this error message, set RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=0
  warnings.warn(
2026-01-11 00:26:54,700 INFO sql_datasource.py:153 -- Sharding is not supported. Falling back to reading all data in a single task.
>>> print("tuple:", ds_tuple.take_all())
2026-01-11 00:26:56,226 INFO logging.py:397 -- Registered dataset logger for dataset dataset_0_0
2026-01-11 00:26:56,240 INFO sql_datasource.py:153 -- Sharding is not supported. Falling back to reading all data in a single task.
2026-01-11 00:26:56,241 INFO sql_datasource.py:153 -- Sharding is not supported. Falling back to reading all data in a single task.
2026-01-11 00:26:56,242 INFO streaming_executor.py:182 -- Starting execution of Dataset dataset_0_0. Full logs are in /tmp/ray/session_2026-01-11_00-26-50_843083_56953/logs/ray-data
2026-01-11 00:26:56,242 INFO streaming_executor.py:183 -- Execution plan of Dataset dataset_0_0: InputDataBuffer[Input] -> TaskPoolMapOperator[ReadSQL]
2026-01-11 00:26:56,242 INFO sql_datasource.py:153 -- Sharding is not supported. Falling back to reading all data in a single task.
2026-01-11 00:26:56,246 WARNING resource_manager.py:134 -- ⚠️  Ray's object store is configured to use only 25.2% of available memory (2.0GiB out of 7.9GiB total). For optimal Ray Data performance, we recommend setting the object store to at least 50% of available memory. You can do this by setting the 'object_store_memory' parameter when calling ray.init() or by setting the RAY_DEFAULT_OBJECT_STORE_MEMORY_PROPORTION environment variable.
2026-01-11 00:26:56,246 INFO streaming_executor.py:661 -- [dataset]: A new progress UI is available. To enable, set `ray.data.DataContext.get_current().enable_rich_progress_bars = True` and `ray.data.DataContext.get_current().use_ray_tqdm = False`.
Running Dataset dataset_0_0.: 0.00 row [00:00, ? row/s]
2026-01-11 00:26:56,2672WARNING resource_manager.py:791 -- Cluster resources are not enough to run any task from TaskPoolMapOperator[ReadSQL]. The job may hang forever unless the cluster scales up.
βœ”οΈ  Dataset dataset_0_0 execution finished in 0.46 seconds: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2.00/2.00 [00:00<00:00, 4.48 row/s]
- ReadSQL->SplitBlocks(200): Tasks: 0; Actors: 0; Queued blocks: 0 (0.0B); Resources: 0.0 CPU, 99.0B object store: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2.00/2.00 [00:00<00:00, 4.48 row/s]
2026-01-11 00:26:56,702 INFO streaming_executor.py:302 -- βœ”οΈ  Dataset dataset_0_0 execution finished in 0.46 seconds                                                   
tuple: [{'title': 'Monty Python and the Holy Grail', 'year': 1975, 'score': 8.2}, {'title': "Monty Python's Life of Brian", 'year': 1979, 'score': 8.0}]
>>> # list 
>>> ds_list = ray.data.read_sql(
...     "SELECT * FROM movie WHERE year >= ?",
...     create_connection,
...     sql_params=[1975],
... )
2026-01-11 00:27:07,304 INFO sql_datasource.py:153 -- Sharding is not supported. Falling back to reading all data in a single task.
>>> print("list:", ds_list.take_all())
2026-01-11 00:27:08,867 INFO logging.py:397 -- Registered dataset logger for dataset dataset_1_0
2026-01-11 00:27:08,871 INFO sql_datasource.py:153 -- Sharding is not supported. Falling back to reading all data in a single task.
2026-01-11 00:27:08,872 INFO sql_datasource.py:153 -- Sharding is not supported. Falling back to reading all data in a single task.
2026-01-11 00:27:08,873 INFO streaming_executor.py:182 -- Starting execution of Dataset dataset_1_0. Full logs are in /tmp/ray/session_2026-01-11_00-26-50_843083_56953/logs/ray-data
2026-01-11 00:27:08,873 INFO streaming_executor.py:183 -- Execution plan of Dataset dataset_1_0: InputDataBuffer[Input] -> TaskPoolMapOperator[ReadSQL]
2026-01-11 00:27:08,874 INFO sql_datasource.py:153 -- Sharding is not supported. Falling back to reading all data in a single task.
Running Dataset dataset_1_0.: 0.00 row [00:00, ? row/s]
2026-01-11 00:27:08,8812WARNING resource_manager.py:791 -- Cluster resources are not enough to run any task from TaskPoolMapOperator[ReadSQL]. The job may hang forever unless the cluster scales up.
βœ”οΈ  Dataset dataset_1_0 execution finished in 0.06 seconds: : 2.00 row [00:00, 38.9 row/s]
- ReadSQL->SplitBlocks(200): Tasks: 0; Actors: 0; Queued blocks: 0 (0.0B); Resources: 0.0 CPU, 0.0B object store: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2.00/2.00 [00:00<00:00, 37.6 row/s]
2026-01-11 00:27:08,932 INFO streaming_executor.py:302 -- βœ”οΈ  Dataset dataset_1_0 execution finished in 0.06 seconds                                                   
list: [{'title': 'Monty Python and the Holy Grail', 'year': 1975, 'score': 8.2}, {'title': "Monty Python's Life of Brian", 'year': 1979, 'score': 8.0}]
>>> # dict 
>>> ds_dict = ray.data.read_sql(
...     "SELECT * FROM movie WHERE year >= :year",
...     create_connection,
...     sql_params={"year": 1975},
... )
2026-01-11 00:27:19,155 INFO sql_datasource.py:153 -- Sharding is not supported. Falling back to reading all data in a single task.
>>> print("dict:", ds_dict.take_all())
2026-01-11 00:27:19,807 INFO logging.py:397 -- Registered dataset logger for dataset dataset_2_0
2026-01-11 00:27:19,811 INFO sql_datasource.py:153 -- Sharding is not supported. Falling back to reading all data in a single task.
2026-01-11 00:27:19,812 INFO sql_datasource.py:153 -- Sharding is not supported. Falling back to reading all data in a single task.
2026-01-11 00:27:19,813 INFO streaming_executor.py:182 -- Starting execution of Dataset dataset_2_0. Full logs are in /tmp/ray/session_2026-01-11_00-26-50_843083_56953/logs/ray-data
2026-01-11 00:27:19,813 INFO streaming_executor.py:183 -- Execution plan of Dataset dataset_2_0: InputDataBuffer[Input] -> TaskPoolMapOperator[ReadSQL]
2026-01-11 00:27:19,814 INFO sql_datasource.py:153 -- Sharding is not supported. Falling back to reading all data in a single task.
Running Dataset dataset_2_0.: 0.00 row [00:00, ? row/s]
2026-01-11 00:27:19,8212WARNING resource_manager.py:791 -- Cluster resources are not enough to run any task from TaskPoolMapOperator[ReadSQL]. The job may hang forever unless the cluster scales up.
βœ”οΈ  Dataset dataset_2_0 execution finished in 0.04 seconds: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2.00/2.00 [00:00<00:00, 51.6 row/s]
- ReadSQL->SplitBlocks(200): Tasks: 0; Actors: 0; Queued blocks: 0 (0.0B); Resources: 0.0 CPU, 99.0B object store: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2.00/2.00 [00:00<00:00, 49.0 row/s]
2026-01-11 00:27:19,859 INFO streaming_executor.py:302 -- βœ”οΈ  Dataset dataset_2_0 execution finished in 0.04 seconds                                                   
dict: [{'title': 'Monty Python and the Holy Grail', 'year': 1975, 'score': 8.2}, {'title': "Monty Python's Life of Brian", 'year': 1979, 'score': 8.0}]
>>> 
>>> 
```

---------

Signed-off-by: yaommen <myanstu@163.com>
Fixes missing space in the warning message.

---------

Signed-off-by: You-Cheng Lin <mses010108@gmail.com>
Leaking Ray Train actors have been observed occupying GPU memory
following Train run termination, causing training failures/OOMs in
subsequent train runs. Despite the train actors being marked DEAD by Ray
Core, we found upon SSHing into nodes that the actor processes are
still alive and occupying valuable GPU memory.

This PR: 
- Replaces `__ray_terminate__` with `ray.kill` in Train run shutdown and
abort paths to guarantee the termination of train actors
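
A small sketch of the behavioral difference (the `Worker` actor here is illustrative, not the actual Train worker class):

```python
import ray

@ray.remote
class Worker:
    def ping(self):
        return "ok"

ray.init()
worker = Worker.remote()

# Before: cooperative shutdown, which a wedged worker process can ignore.
#   worker.__ray_terminate__.remote()
# After: force-kill through Ray Core, guaranteeing the process (and its GPU
# memory) is released.
ray.kill(worker, no_restart=True)
```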

---------

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
Signed-off-by: Jason Li <57246540+JasonLi1909@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
This works for different Python versions and is much easier to use
than a conda-managed Python env.

Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
- The async-inf README wasn't included in the toctree and was included in the
exclude pattern list; fixed it.

Signed-off-by: harshit <harshit@anyscale.com>
stop using the large oss ci test base

Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
…#59896)

## Description

Addresses a critical issue in the `DefaultAutoscalerV2`, where nodes
were not being properly scaled from zero. With this update, clusters
managed by Ray will now automatically provision additional nodes when
there is workload demand, even when starting from an idle (zero-node)
state.

## Related issues
Closes ray-project#59682



---------

Signed-off-by: Hsien-Cheng Huang <ryankert01@gmail.com>
Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: sampan <sampan@anyscale.com>
Co-authored-by: sampan <sampan@anyscale.com>
…#59616)

## Description
We observed that raylet frequently emits log messages of the form
β€œDropping sync message with stale version”, which can become quite noisy
in practice.

This behavior occurs because raylet does not update the message version
for sync messages received from the GCS, and stale-version broadcast
messages are expected to be skipped by default. As a result, these log
entries are generated repeatedly even though this is normal and
non-actionable behavior.

Given that this does not indicate an error or unexpected state, logging
it at the INFO level significantly increases log noise and makes it
harder to identify genuinely important events.

We propose demoting this log from INFO to DEBUG in
RaySyncerBidiReactorBase to keep raylet logs cleaner while still
preserving the information for debugging purposes when needed.


![img_v3_02t7_be5071d6-99d2-4b3c-b189-66aa77476d3g](https://github.com/user-attachments/assets/ed91c317-3a86-441c-a2bf-b317ac0af618)

## Related issues
Closes ray-project#59615

## Additional information
- Change log level from INFO to DEBUG for β€œDropping sync message with
stale version” in RaySyncerBidiReactorBase.

Signed-off-by: Mao Yancan <yancan.mao@bytedance.com>
Co-authored-by: Mao Yancan <yancan.mao@bytedance.com>
## Description
Runs linkcheck on docs, in particular for RLlib where we've moved
tuned-examples to examples/algorithms.
Further, updated GitHub links that were automatically redirected.

Some of the RLlib examples are missing, but I'm going
to fix these in the algorithm premerge PRs, i.e.,
ray-project#59007

---------

Signed-off-by: Mark Towers <mark@anyscale.com>
Co-authored-by: Mark Towers <mark@anyscale.com>
…project#60050)

Add support for authenticating HTTPS downloads in runtime environments
using
bearer tokens via the RAY_RUNTIME_ENV_BEARER_TOKEN environment variable.

Fixes [ray-project#46833](ray-project#46833)
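
A hedged usage sketch: the environment variable name comes from this PR, while the URL and the exact placement of the variable (driver vs. cluster environment) are illustrative:

```python
import os
import ray

# Token used as the bearer credential for runtime env HTTPS downloads.
os.environ["RAY_RUNTIME_ENV_BEARER_TOKEN"] = "<token>"

ray.init(
    runtime_env={
        # Hypothetical private package behind bearer-token auth.
        "working_dir": "https://example.com/private/job.zip",
    }
)
```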

Signed-off-by: Denis Khachyan <khachyanda@gmail.com>
…ect#60014)

## Description
This PR fixes a critical deadlock issue in Ray Client that occurs when
garbage collection triggers `ClientObjectRef.__del__()` while the
DataClient lock is held.

When using Ray Client, a deadlock can occur in the following scenario:

  1. Main thread acquires DataClient.lock (e.g., in _async_send())
  2. Garbage collection is triggered while holding the lock
  3. GC calls `ClientObjectRef.__del__()`
  4. `__del__()` attempts to call call_release() β†’ _release_server() β†’ DataClient.ReleaseObject()
  5. ReleaseObject() tries to acquire the same DataClient.lock
  6. Deadlock: the same thread tries to acquire a non-reentrant lock it already holds

## Related issues
Fixes ray-project#59643

## Additional information
This PR implements a deferred release pattern that completely avoids the
deadlock:

1. Deferred Release Queue: Introduces _release_queue (a thread-safe
queue.SimpleQueue) to collect object IDs that need to be released
2. Background Release Thread: Adds _release_thread that processes the
release queue asynchronously
3. Non-blocking `__del__`: `ClientObjectRef.__del__()` now only puts IDs
into the queue (no lock acquisition)
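
A minimal sketch of this deferred-release pattern, with hypothetical names; the actual Ray Client implementation differs in detail:

```python
import queue
import threading

class DeferredReleaser:
    def __init__(self, release_fn):
        self._release_fn = release_fn      # e.g. a call into DataClient.ReleaseObject
        self._queue = queue.SimpleQueue()  # thread-safe; no lock needed on the GC path
        self._thread = threading.Thread(target=self._worker, daemon=True)
        self._thread.start()

    def defer_release(self, object_id):
        # Called from __del__: only enqueues, never touches DataClient.lock.
        self._queue.put(object_id)

    def _worker(self):
        while True:
            object_id = self._queue.get()
            self._release_fn(object_id)    # lock acquired here, off the GC path
```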

---------

Signed-off-by: redgrey1993 <ulyer555@hotmail.com>
Co-authored-by: redgrey1993 <ulyer555@hotmail.com>
pseudo-rnd-thoughts and others added 24 commits January 23, 2026 18:14
## Description
This PR adds an HPO example to RLlib using HyperOpt and APPO + Cartpole.

Do we want this tested in nightly or premerge?

---------

Signed-off-by: Mark Towers <mark@anyscale.com>
Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
Co-authored-by: Mark Towers <mark@anyscale.com>
Co-authored-by: Kamil Kaczmarek <kamil@anyscale.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…roject#60242)

## Why are these changes needed?

This PR replaces SHA-1 with SHA-256 across Ray's internal hash
operations to address security concerns in ray-project#53435 with a conservative
approach for runtime environment paths. SHA-1 has known collision
vulnerabilities and is deprecated - switching to SHA-256 improves
security and FIPS compliance.

## Related issue number

Fixes ray-project#53435

### SHA-256 Migration (15 files)
- Function collision detection
- SSH control path hashing
- Configuration cache keys
- CloudWatch config hashing (with method renames and UTF-8 encoding fix)
- File lock paths
- Deployment code versioning
- Feature hashing
- Test files updated

### SHA-1 Retained (5 files)
Runtime environment path generation kept SHA-1 to avoid:
- Path length issues on Windows and network filesystems
- Breaking existing cached environments
- Unnecessary environment rebuilds

For content-addressable hashing (non-cryptographic), SHA-1 provides
sufficient collision resistance.
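
The general shape of the change in the migrated call sites looks roughly like this (illustrative, not an exact diff from the PR):

```python
import hashlib

def config_cache_key(contents: str) -> str:
    # Before: hashlib.sha1(contents.encode("utf-8")).hexdigest()
    return hashlib.sha256(contents.encode("utf-8")).hexdigest()
```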

### Testing
- All linter checks pass
- No new errors introduced
- Zero breaking changes to runtime environments

---------

Signed-off-by: irabbani <irabbani@anyscale.com>
Signed-off-by: Purushotham Pushpavanth <pushpavanthar@gmail.com>
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Co-authored-by: Ibrahim Rabbani <irabbani@anyscale.com>
Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>
Co-authored-by: Zac Policzer <zacattackftw@gmail.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…oject#60428)

Seems like gcs_client_test is breaking on Windows due to the same issue
as ray-project#60213, and I think any test that uses gcs_server_lib probably has a
high chance of running into this issue due to its lengthy list of
dependencies. Adding the compiler_param_file flag to any Bazel target
that depends on it.

---------

Signed-off-by: joshlee <joshlee@anyscale.com>
…to RAY_STOP_REQUESTED (ray-project#60412)

## Description

When the autoscaler attempts to terminate instances in `RAY_INSTALLING`
state (e.g., due to `max_num_nodes_per_type` limits being reduced), it
crashes with an assertion error:

```
AssertionError: Invalid status transition from RAY_INSTALLING to RAY_STOP_REQUESTED
```

<details>
<summary>Full stacktrace</summary>

```
2026-01-22 17:49:33,055    ERROR autoscaler.py:222 -- Invalid status transition from RAY_INSTALLING to RAY_STOP_REQUESTED
Traceback (most recent call last):
  File "/virtualenv/lib/python3.12/site-packages/ray/autoscaler/v2/autoscaler.py", line 206, in update_autoscaling_state
    return Reconciler.reconcile(
           ^^^^^^^^^^^^^^^^^^^^^
  File "/virtualenv/lib/python3.12/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 126, in reconcile
    Reconciler._step_next(
  File "/virtualenv/lib/python3.12/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 289, in _step_next
    Reconciler._scale_cluster(
  File "/virtualenv/lib/python3.12/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 1223, in _scale_cluster
    Reconciler._update_instance_manager(instance_manager, version, updates)
  File "/virtualenv/lib/python3.12/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 635, in _update_instance_manager
    reply = instance_manager.update_instance_manager_state(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/virtualenv/lib/python3.12/site-packages/ray/autoscaler/v2/instance_manager/instance_manager.py", line 94, in update_instance_manager_state
    instance = self._update_instance(
               ^^^^^^^^^^^^^^^^^^^^^^
  File "/virtualenv/lib/python3.12/site-packages/ray/autoscaler/v2/instance_manager/instance_manager.py", line 264, in _update_instance
    assert InstanceUtil.set_status(instance, update.new_instance_status), (
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: Invalid status transition from RAY_INSTALLING to RAY_STOP_REQUESTED
```

</details>

The reconciler incorrectly tries to transition `RAY_INSTALLING` to
`RAY_STOP_REQUESTED`, but the state machine doesn't allow this. Ray
isn't running yet, so there's nothing to stop/drain.

This fix adds `RAY_INSTALLING` to the same condition as `ALLOCATED`.
Both states have cloud instances allocated but Ray not yet running, so
they should transition directly to `TERMINATING`.
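
A hypothetical sketch of the adjusted logic, using a stand-in enum rather than the real instance-manager types:

```python
from enum import Enum, auto

class Status(Enum):
    ALLOCATED = auto()
    RAY_INSTALLING = auto()
    RAY_STOP_REQUESTED = auto()
    TERMINATING = auto()

def termination_target(current: Status) -> Status:
    # ALLOCATED and RAY_INSTALLING both have a cloud instance but no running
    # Ray process, so there is nothing to drain: terminate directly.
    if current in (Status.ALLOCATED, Status.RAY_INSTALLING):
        return Status.TERMINATING
    return Status.RAY_STOP_REQUESTED
```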

## Related issues

Related to ray-project#59219 and ray-project#59550 (which fixed the same issue for `QUEUED`
instances)

## Additional information

The valid transitions from `RAY_INSTALLING` are: `RAY_RUNNING`,
`RAY_INSTALL_FAILED`, `RAY_STOPPED`, `TERMINATING`, `TERMINATED`.
`RAY_STOP_REQUESTED` is not in this set.

Signed-off-by: Johanna Reiml <johanna@reiml.dev>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Removes a bunch of unnecessary tests being run on both the min and max
Python versions. Run only Train v2 and Tune tests on the max Python
version.

---------

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
…roject#60414)

This PR changes the bool parsing from `store_true` to parsing an input
`True/False` string.

This fixes an issue where default bool values are always set to `False`
unless they're explicitly included as a CLI flag.
* **The problem:** `action="store_true"` always defaults to False and
ignores any `default=` parameter. Fields with `default=True` (like
`enable_operator_progress_bars`, `actor_locality_enabled`,
`enable_shard_locality`) would incorrectly default to False when parsed
through the CLI, completely ignoring the Pydantic model's default.

Also updates the release test commands from `--bool_flag` ->
`--bool_flag=True`
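
A rough sketch of the parsing change (the argument name below is one of the fields mentioned above; the converter is illustrative):

```python
import argparse

def str2bool(value: str) -> bool:
    if value.lower() in ("true", "1", "yes"):
        return True
    if value.lower() in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected True/False, got {value!r}")

parser = argparse.ArgumentParser()
# Before: parser.add_argument("--actor_locality_enabled", action="store_true")
#         which always defaults to False and ignores default=True.
parser.add_argument("--actor_locality_enabled", type=str2bool, default=True)

args = parser.parse_args(["--actor_locality_enabled=False"])
print(args.actor_locality_enabled)  # False; omitting the flag keeps default=True
```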

---------

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
…ors (ray-project#60429)

When the autoscaler tries to launch a Ray cluster on GCP, it puts a new
SSH key into the project metadata if necessary. The update may result
in an HTTP 412 precondition failure if there are concurrent attempts to
update the metadata. The error will look like this:

```python
googleapiclient.errors.HttpError: <HttpError 412 when requesting https://compute.googleapis.com/compute/v1/projects/my_gcp_project/setCommonInstanceMetadata?alt=json returned "Supplied fingerprint does not match current metadata fingerprint.". Details: "[{'message': 'Supplied fingerprint does not match current metadata fingerprint.', 'domain': 'global', 'reason': 'conditionNotMet', 'location': 'If-Match', 'locationType': 'header'}]">
```

The error can only be resolved by retrying. Therefore, to provide a
better user experience, this PR does the retry for the users
automatically:
1. Catch the error.
2. Reload the metadata and update it again.
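
An illustrative retry loop (not the exact code in Ray's GCP node provider; the callables are placeholders):

```python
from googleapiclient.errors import HttpError

def set_metadata_with_retry(reload_metadata, apply_update, max_attempts=5):
    for attempt in range(max_attempts):
        metadata = reload_metadata()       # fetch current metadata + fingerprint
        try:
            return apply_update(metadata)  # setCommonInstanceMetadata call
        except HttpError as e:
            # Only retry the fingerprint-mismatch (HTTP 412) failure.
            if e.resp.status != 412 or attempt == max_attempts - 1:
                raise
```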

---------

Signed-off-by: Rueian Huang <rueiancsie@gmail.com>
Signed-off-by: Aydin Abiar <aydin@anyscale.com>
Co-authored-by: Aydin Abiar <aydin@anyscale.com>
… resources when not scaling up (ray-project#60321)

## Description

When the DefaultAutoscalerV2 decides not to scale up (e.g., utilization
is low), it currently sends an empty resource request ([]) to the
autoscaling coordinator. This should be changed to request the previous
allocation instead, preserving the dataset's current resource footprint.

## Related issues
Closes ray-project#60191

## Additional information
- Send current resources rather than [] when not scaling up
- Add test

---------

Signed-off-by: machichima <nary12321@gmail.com>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu>
# Summary

This PR changes the async validation API from

```
ray.train.report(
  validate_fn=...,
  validate_config=...
)
```

to 

```
trainer = TorchTrainer(
  validation_config=ValidationConfig(
    fn=...,
    task_config=ValidationTaskConfig(
        fn_kwargs=...,
    )
  )
  ...
)

ray.train.report(
  validation=ValidationTaskConfig(
    fn_kwargs=...
  )
)
```

These changes are backwards incompatible, but this is ok because async
validation is an
[alpha](https://docs.ray.io/en/latest/ray-contribute/stability.html#alpha)
feature.

We are moving `validation_fn` from `ray.train.report` to `TorchTrainer`
because:
1) If a Ray Train run dies while there are pending validations, Ray
doesn't currently have a good way to [serialize those Ray
tasks](https://docs.ray.io/en/latest/ray-core/objects/serialization.html#serializing-objectrefs)
and requeue them on a new run.
2) Requiring users to register a `validation_fn` up front enables better
type checking and prevents users from shooting themselves in the foot.
3) This API should still be sufficiently flexible. If users want
different reports to perform different validations, their
`validation_fn` can toggle between different modes based on a parameter.

We are changing from a `validate_config` dict to a
`ValidationConfig`/`ValidationTaskConfig` because:
1) A `validation_fn` that takes keyword arguments instead of a config
dict is less brittle.
2) We would like to encapsulate the entire async validation feature in a
single object to avoid user confusion from having too many
`ray.train.report` args. We plan to add more arguments to
`ValidationConfig`/`ValidationTaskConfig`, such as ray remote task args,
in future PR's.

This PR does everything needed to fully migrate this feature:
1) Code and unit test changes
2) Documentation changes
3) Release test changes

# Testing

Unit tests + release test
(https://buildkite.com/ray-project/release/builds/76396 - commits after
this don't change behavior)

---------

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Move away from the Go-style approach to a Pythonic
exception-raising approach.

Topic: refactor-crane-lib
Relative: port-extract-wheel

Signed-off-by: andrew <andrew@anyscale.com>

Signed-off-by: andrew <andrew@anyscale.com>
Adds Dockerfile and Wanda files for Ray images, pulling artifacts from
previous steps.

CI additions added in subsequent PR, planned for after next release.

Topic: ray-image

Signed-off-by: andrew <andrew@anyscale.com>

Signed-off-by: andrew <andrew@anyscale.com>
…60287)

## Description
Adds the missing fields called out in
ray-project#60129 to
`ActorDefinitionEvent`, `ActorLifecycleEvent`, and `TaskLifecycleEvent`.

The following fields were not added:
- ip address -> can be retrieved by looking at `NodeDefinitionEvent`
- repr_name -> cannot be added to definition events as this field is
only available after the actor creation event (refer:
https://github.com/ray-project/ray/blob/da9dabf65d7c189ea4d1ad77e13b6f8c3be8936f/src/ray/protobuf/gcs.proto#L152)

## Related issues
related to ray-project#60129

---------

Signed-off-by: sampan <sampan@anyscale.com>
Signed-off-by: Sampan S Nayak <sampansnayak2@gmail.com>
Co-authored-by: sampan <sampan@anyscale.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Mengjin Yan <mengjinyan3@gmail.com>
…0314)

Add four new fields to the NodeDefinitionEvent proto schema:
- hostname: The hostname of the node manager
- node_name: The user-provided identifier for this node
- instance_id: The cloud provider instance ID
- instance_type_name: The instance type (e.g., m5.xlarge)

The fields are populated in ray-project#60314.

---------

Signed-off-by: sampan <sampan@anyscale.com>
Signed-off-by: Sampan S Nayak <sampansnayak2@gmail.com>
Co-authored-by: sampan <sampan@anyscale.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: Mengjin Yan <mengjinyan3@gmail.com>
## Description
Moves OrderedBundleQueue to `bundle_queue/` directory, and rebases off
of the `BaseBundleQueue` class. Creates a `FIFOBundleQueue` for simple
ref bundle queues.

Old PR: ray-project#59093

There are about ~400 lines of tests, so not actually that big

## Related issues

## Additional information

---------

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
…instead of self-referential remote calls (ray-project#60320)

Refactor `_AutoscalingCoordinatorActor` to use direct threading with a
`threading.Lock` instead of self-referential
remote calls. This eliminates potential Ray Core timeout errors for
internal operations and makes the code easier to debug.
  Fixes ray-project#60190 
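
A sketch of the before/after shape of the coordinator (method and field names are illustrative):

```python
import threading

class AutoscalingCoordinator:
    def __init__(self):
        self._lock = threading.Lock()
        self._requests = {}

    def update_request(self, key, resources):
        # Before: the actor re-entered itself via a remote call
        # (self_handle.update.remote(...)), which could hit Ray Core timeouts.
        # Now it's a plain method call guarded by a threading.Lock.
        with self._lock:
            self._requests[key] = resources
```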





Contribution by Gittensor, see my contribution statistics at
https://gittensor.io/miners/details?githubId=171929553

---------

Signed-off-by: DeborahOlaboye <deboraholaboye@gmail.com>
Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
Allow users to configure ray remote args (such as number of retries) on
the validation Ray task itself. Repurposed from
ray-project#59165 since we decided to
support all ray remote args as opposed to just retry/retry_exception.

---------

Signed-off-by: Timothy Seah <tseah@anyscale.com>
…=1 (ray-project#60403)

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
this issue is likely a direct result of
ray-project#59820

### The Problem
A segmentation fault occurs when importing `pyarrow` (via `tensorflow` β†’
`pandas`) **after** `fastapi` and `torch` have been imported in our
CI-specific environment.

### Minimal Reproducer
```python
# CRASH
from fastapi import FastAPI
import torch
import tensorflow  # loads pyarrow β†’ SIGSEGV

# WORKS
import pyarrow  # or pandas
from fastapi import FastAPI
import torch
import tensorflow
```
Root Cause: unknown

### The Fix for `distilbert.py`
```python
# Before (crashes)
from fastapi import FastAPI
from transformers import pipeline

# After (works)
from transformers import pipeline  # loads pyarrow early via tensorflow
from fastapi import FastAPI
```

### Environment
- PyArrow 19.0.1, TensorFlow 2.15.1, torch 2.3.0, FastAPI 0.121.0, anyio
4.12.0
- Linux x86_64, glibc 2.31, Python 3.10

<details>

<summary>Other information collected</summary>

`repro.sh`

```bash
#!/bin/bash
set -e

echo "=============================================="
echo "PyArrow jemalloc Background Thread Segfault"
echo "=============================================="
echo ""
echo "Bug: PyArrow's jemalloc background_thread_entry() crashes"
echo "     when pyarrow is imported after (fastapi + torch + tensorflow)"
echo ""

# Minimal reproducer
echo "=== CRASH CASE ==="
echo "Import order: fastapi -> torch -> tensorflow (loads pyarrow internally)"
echo ""
python3 -c "
from fastapi import FastAPI  # Sets up anyio threading
import torch                  # Initializes CUDA/threading
import tensorflow             # Loads pyarrow via pandas - CRASH HERE
print('OK - no crash')
" 2>&1 | tail -5
echo ""

echo "=== WORKAROUND 1: Pre-import pyarrow ==="
echo "Import order: pyarrow -> fastapi -> torch -> tensorflow"
echo ""
python3 -c "
import pyarrow               # Initialize jemalloc in clean state
from fastapi import FastAPI
import torch
import tensorflow
print('OK - workaround works!')
" 2>&1 | tail -3
echo ""

echo "=== WORKAROUND 2: Pre-import pandas (loads pyarrow) ==="
echo "Import order: pandas -> fastapi -> torch -> tensorflow"  
echo ""
python3 -c "
import pandas                # Loads pyarrow early
from fastapi import FastAPI
import torch
import tensorflow
print('OK - workaround works!')
" 2>&1 | tail -3
echo ""

echo "=== PROOF: No crash without fastapi ==="
echo "Import order: torch -> tensorflow (no fastapi)"
echo ""
python3 -c "
import torch
import tensorflow
print('OK - no crash without fastapi')
" 2>&1 | tail -3
echo ""

echo "=== PROOF: No crash without torch ==="
echo "Import order: fastapi -> tensorflow (no torch)"
echo ""
python3 -c "
from fastapi import FastAPI
import tensorflow
print('OK - no crash without torch')
" 2>&1 | tail -3
echo ""

echo "=== PROOF: Blocking pyarrow prevents crash ==="
echo ""
python3 -c "
import sys
sys.modules['pyarrow'] = None  # Block pyarrow
from fastapi import FastAPI
import torch
import tensorflow
print('OK - no crash when pyarrow blocked')
" 2>&1 | tail -3
echo ""

```

```bash
(base) root@8107ea5bb295:/rayci# bash repro.sh
==============================================
PyArrow jemalloc Background Thread Segfault
==============================================

Bug: PyArrow's jemalloc background_thread_entry() crashes
     when pyarrow is imported after (fastapi + torch + tensorflow)

=== CRASH CASE ===
Import order: fastapi -> torch -> tensorflow (loads pyarrow internally)

2026-01-25 16:10:51.037099: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2026-01-25 16:10:51.038345: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2026-01-25 16:10:51.045147: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2026-01-25 16:10:51.937264: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT

=== WORKAROUND 1: Pre-import pyarrow ===
Import order: pyarrow -> fastapi -> torch -> tensorflow

To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2026-01-25 16:10:56.232090: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
OK - workaround works!

=== WORKAROUND 2: Pre-import pandas (loads pyarrow) ===
Import order: pandas -> fastapi -> torch -> tensorflow

To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2026-01-25 16:11:01.770416: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
OK - workaround works!

=== PROOF: No crash without fastapi ===
Import order: torch -> tensorflow (no fastapi)

To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2026-01-25 16:11:06.337292: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
OK - no crash without fastapi

=== PROOF: No crash without torch ===
Import order: fastapi -> tensorflow (no torch)

To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2026-01-25 16:11:10.451576: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
OK - no crash without torch

=== PROOF: Blocking pyarrow prevents crash ===

To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2026-01-25 16:11:15.321192: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
OK - no crash when pyarrow blocked

==============================================
SUMMARY
==============================================

Crash occurs when:
  1. fastapi is imported (sets up anyio threading)
  2. torch is imported (CUDA/threading init)
  3. tensorflow imports pyarrow (via pandas)
  4. PyArrow's jemalloc starts background thread -> SIGSEGV

GDB backtrace shows crash in:
  #0 background_thread_entry() from libarrow.so
  Thread: jemalloc_bg_thd

Workaround: Import pyarrow BEFORE fastapi+torch

```

```bash
(base) root@8107ea5bb295:/rayci# nvidia-smi
Sun Jan 25 16:05:03 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       Off |   00000000:00:1E.0 Off |                    0 |
| N/A   33C    P0             26W /   70W |       1MiB /  15360MiB |      5%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
```

```bash
=== System Info ===
Python: 3.10.19 (conda-forge, GCC 14.3.0)
Platform: Linux x86_64, glibc 2.31
Kernel: 6.1.87-99.174.amzn2023

=== Library Versions ===
pyarrow: 19.0.1
tensorflow: 2.15.1
torch: 2.3.0+cu121
fastapi: 0.121.0
anyio: 4.12.0
starlette: 0.49.1
pydantic: 2.12.4
pandas: 1.5.3
numpy: 1.26.4

=== PyArrow Memory Pool ===
Default pool: mimalloc
jemalloc available: True
mimalloc available: True
```
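
A minimal sketch of how such a report can be gathered (the actual collection script isn't included here, and the exact pyarrow helpers used are my best guess):

```python
# Sketch of an environment-collection script; prints roughly the fields above.
import platform
import sys

import pyarrow  # imported first on purpose, per the workaround above

import anyio
import fastapi
import numpy
import pandas
import pydantic
import starlette
import tensorflow
import torch

print("=== System Info ===")
print(f"Python: {sys.version.split()[0]}")
print(f"Platform: {platform.system()} {platform.machine()}, glibc {platform.libc_ver()[1]}")
print(f"Kernel: {platform.release()}")

print("\n=== Library Versions ===")
for mod in (pyarrow, tensorflow, torch, fastapi, anyio, starlette, pydantic, pandas, numpy):
    print(f"{mod.__name__}: {mod.__version__}")

print("\n=== PyArrow Memory Pool ===")
backends = pyarrow.supported_memory_backends()
print(f"Default pool: {pyarrow.default_memory_pool().backend_name}")
print(f"jemalloc available: {'jemalloc' in backends}")
print(f"mimalloc available: {'mimalloc' in backends}")
```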

</details>

---------

Signed-off-by: abrar <abrar@anyscale.com>
…ngest benchmarks (ray-project#60458)

- Add heterogeneous cluster configurations with dedicated CPU worker
nodes for Ray Data operations alongside GPU worker nodes for training
- Include both fixed-size and autoscaling variants for benchmarking
different scaling behaviors

---------

Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Justin Yu <justinvyu@anyscale.com>
## Description

Update the `test_root_path` unit test in `test_fastapi.py`.

The new implementation is more explicit about the expected request URLs, ASGI
`scope["root_path"]`, and `scope["path"]` values for the different `root_path`
configurations provided via the app and Serve layers.
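
As a rough illustration of the pattern (plain FastAPI only; the real test also exercises `root_path` set at the Serve layer, which isn't reproduced here):

```python
# Illustrative sketch: inspect the ASGI scope seen by an endpoint when
# root_path is configured on the FastAPI app itself.
from fastapi import FastAPI, Request
from fastapi.testclient import TestClient

app = FastAPI(root_path="/api")  # root_path set at the app layer


@app.get("/hello")
def hello(request: Request) -> dict:
    return {"root_path": request.scope["root_path"], "path": request.scope["path"]}


client = TestClient(app)
resp = client.get("/hello")
assert resp.status_code == 200
assert resp.json()["root_path"] == "/api"
```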

 ## Related issues

Relates to the PR ray-project#57555

---------

Signed-off-by: axreldable <aleksei.starikov.ax@gmail.com>
…ay-project#60468)

What I observe:
1. test_metrics is timing out at 900s.
2. It sometimes passes (1 out of 3 times).
3. It does not consistently fail on one specific test, so whatever the problem
is, it's exogenous to any individual test.

I am speculating, but am trying these two changes:
1. Health-check the Serve metrics endpoint before starting the Serve metrics
tests (see the sketch right after this list).
2. Split the metrics tests into two files. This would help in the event that
one large metrics file takes more than 900s to run; when I run it locally, it
takes about 500s, so this is plausible.
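
A rough sketch of the kind of health check meant in point 1 (the URL, port, and timeout are placeholders, not the actual test code):

```python
# Hypothetical pre-test health check: poll the metrics endpoint until it
# responds, so the metrics tests don't start against a half-initialized stack.
import time

import requests

METRICS_URL = "http://127.0.0.1:9999/metrics"  # placeholder; the real port depends on the setup


def wait_for_metrics(url: str = METRICS_URL, timeout_s: float = 60.0) -> None:
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            if requests.get(url, timeout=1).status_code == 200:
                return
        except requests.RequestException:
            pass
        time.sleep(1)
    raise TimeoutError(f"metrics endpoint {url} not healthy after {timeout_s}s")
```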



https://buildkite.com/ray-project/postmerge/builds/15633#019becb0-fe66-42d2-85e7-2e90d74fba17/L11134

https://buildkite.com/ray-project/postmerge/builds/15633#019becb0-fe64-480d-99d1-68846a93f0f1/L11107

https://buildkite.com/ray-project/postmerge/builds/15633#019becb0-fe61-4c22-bcb2-4909d6ddb6f8/L4432

---------

Signed-off-by: abrar <abrar@anyscale.com>
## Description
Add a test for the `MapWorker` repr function to ensure that the string is
always produced even if the args aren't recoverable.

This adds the test related to PR ray-project#58731.
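
A sketch of the intent, using a hypothetical stand-in class rather than the actual `MapWorker` API: `repr` should degrade gracefully instead of raising when the args can't be rendered.

```python
# Hypothetical stand-in for the behavior under test, not MapWorker itself.
class Unreprable:
    def __repr__(self) -> str:
        raise RuntimeError("args aren't recoverable")


class Worker:
    def __init__(self, args) -> None:
        self.args = args

    def __repr__(self) -> str:
        try:
            args_str = repr(self.args)
        except Exception:
            args_str = "<unavailable>"
        return f"Worker(args={args_str})"


def test_repr_never_raises():
    # The string must always be produced, even when repr(args) blows up.
    assert "unavailable" in repr(Worker(Unreprable()))
```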


Signed-off-by: Goutam <goutam@anyscale.com>

@gemini-code-assist bot left a comment


Code Review

This pull request is a large daily merge from master to main, introducing substantial changes across the repository. Key updates include a major refactoring of the CI/CD pipelines to improve modularity and support for multiple architectures, extensive documentation enhancements with new examples and better organization, and several API modernizations. While the changes are largely positive, I've identified a critical issue in the new CI test rules that could severely impact performance, a high-severity issue in C++ error handling, and a medium-severity concern about a change in monitoring metrics. Please review these points carefully.

Comment on lines +262 to +266
```
*
@ ml tune train data serve
@ core_cpp cpp java python doc
@ linux_wheels macos_wheels dashboard tools release_tests
;
```


critical

The new wildcard rule * at the end of this file seems overly broad. It applies a large number of test tags to any file change, which will likely cause almost all tests to run for every modification. This could significantly increase CI execution time and costs. Was this intentional? If the goal is broad coverage, more granular rules might be a better approach to keep CI efficient. If this is a mistake, this rule should be removed or its scope reduced.

Comment on lines +81 to 111
```cpp
} else if (meta_str == std::to_string(ray::rpc::ErrorType::OBJECT_LOST) ||
meta_str == std::to_string(ray::rpc::ErrorType::OWNER_DIED) ||
meta_str == std::to_string(ray::rpc::ErrorType::OBJECT_DELETED) ||
meta_str ==
std::to_string(ray::rpc::ErrorType::OBJECT_UNRECONSTRUCTABLE_PUT) ||
meta_str ==
std::to_string(
ray::rpc::ErrorType::OBJECT_UNRECONSTRUCTABLE_RETRIES_DISABLED) ||
meta_str ==
std::to_string(
ray::rpc::ErrorType::OBJECT_UNRECONSTRUCTABLE_LINEAGE_EVICTED) ||
meta_str ==
std::to_string(ray::rpc::ErrorType::
OBJECT_UNRECONSTRUCTABLE_MAX_ATTEMPTS_EXCEEDED) ||
meta_str ==
std::to_string(ray::rpc::ErrorType::OBJECT_UNRECONSTRUCTABLE_BORROWED) ||
meta_str == std::to_string(
ray::rpc::ErrorType::OBJECT_UNRECONSTRUCTABLE_LOCAL_MODE) ||
meta_str ==
std::to_string(
ray::rpc::ErrorType::OBJECT_UNRECONSTRUCTABLE_REF_NOT_FOUND) ||
meta_str ==
std::to_string(
ray::rpc::ErrorType::OBJECT_UNRECONSTRUCTABLE_TASK_CANCELLED) ||
meta_str ==
std::to_string(
ray::rpc::ErrorType::OBJECT_UNRECONSTRUCTABLE_LINEAGE_DISABLED)) {
// TODO: Differentiate object error
throw UnreconstructableException(std::move(data_str));
}
}
```


high

The updated if condition for handling object-related errors appears to be missing the base ray::rpc::ErrorType::OBJECT_UNRECONSTRUCTABLE case, which was present in the original code. While many new OBJECT_UNRECONSTRUCTABLE_* subtypes have been added, omitting the base enum could lead to uncaught errors if it's ever used directly. It would be safer to include it to ensure all unreconstructable object errors are handled.

```cpp
  } else if (meta_str == std::to_string(ray::rpc::ErrorType::OBJECT_UNRECONSTRUCTABLE) ||
             meta_str == std::to_string(ray::rpc::ErrorType::OBJECT_LOST) ||
             meta_str == std::to_string(ray::rpc::ErrorType::OWNER_DIED) ||
             meta_str == std::to_string(ray::rpc::ErrorType::OBJECT_DELETED) ||
             meta_str ==
                 std::to_string(ray::rpc::ErrorType::OBJECT_UNRECONSTRUCTABLE_PUT) ||
             meta_str ==
                 std::to_string(
                     ray::rpc::ErrorType::OBJECT_UNRECONSTRUCTABLE_RETRIES_DISABLED) ||
             meta_str ==
                 std::to_string(
                     ray::rpc::ErrorType::OBJECT_UNRECONSTRUCTABLE_LINEAGE_EVICTED) ||
             meta_str ==
                 std::to_string(ray::rpc::ErrorType::
                                    OBJECT_UNRECONSTRUCTABLE_MAX_ATTEMPTS_EXCEEDED) ||
             meta_str ==
                 std::to_string(ray::rpc::ErrorType::OBJECT_UNRECONSTRUCTABLE_BORROWED) ||
             meta_str == std::to_string(
                             ray::rpc::ErrorType::OBJECT_UNRECONSTRUCTABLE_LOCAL_MODE) ||
             meta_str ==
                 std::to_string(
                     ray::rpc::ErrorType::OBJECT_UNRECONSTRUCTABLE_REF_NOT_FOUND) ||
             meta_str ==
                 std::to_string(
                     ray::rpc::ErrorType::OBJECT_UNRECONSTRUCTABLE_TASK_CANCELLED) ||
             meta_str ==
                 std::to_string(
                     ray::rpc::ErrorType::OBJECT_UNRECONSTRUCTABLE_LINEAGE_DISABLED)) {
    // TODO: Differentiate object error
    throw UnreconstructableException(std::move(data_str));
  }
```

Comment on lines +715 to +717
| Metric | Type | Fields | Description |
| --- | --- | --- | --- |
| `ray_serve_autoscaling_policy_execution_time_ms` | Gauge | `deployment`, `application`, `policy_scope` | Time taken to execute the autoscaling policy in milliseconds. `policy_scope` is `deployment` or `application`. |
| `ray_serve_autoscaling_replica_metrics_delay_ms` | Gauge | `deployment`, `application`, `replica` | Time taken for replica metrics to reach the controller in milliseconds. High values may indicate controller overload. |
| `ray_serve_autoscaling_handle_metrics_delay_ms` | Gauge | `deployment`, `application`, `handle` | Time taken for handle metrics to reach the controller in milliseconds. High values may indicate controller overload. |


medium

The metric types for ray_serve_autoscaling_policy_execution_time_ms, ray_serve_autoscaling_replica_metrics_delay_ms, and ray_serve_autoscaling_handle_metrics_delay_ms have been changed from Histogram to Gauge. This is a significant change that reduces the observability of these latencies. Histograms provide distribution information (like p50, p95, p99), which is crucial for understanding performance, while gauges only show the last value. Was this change intentional? If so, it might be worth explaining the rationale, as it could be seen as a regression by users who rely on these metrics for performance monitoring.
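
For illustration, generic `prometheus_client` usage (not Serve's internal metrics code) showing what is lost: a histogram keeps the distribution of observations across buckets, while a gauge keeps only the most recent value.

```python
# Generic illustration of the information lost when a Histogram becomes a Gauge.
from prometheus_client import Gauge, Histogram

policy_exec_gauge = Gauge(
    "policy_execution_time_ms_last", "Last observed policy execution time (ms)"
)
policy_exec_hist = Histogram(
    "policy_execution_time_ms",
    "Distribution of policy execution times (ms)",
    buckets=(1, 5, 10, 50, 100, 500),
)

for latency_ms in (2, 3, 480, 4, 2):
    policy_exec_gauge.set(latency_ms)     # only the last value (2) survives; the 480 ms spike is gone
    policy_exec_hist.observe(latency_ms)  # the spike stays visible in the bucket counts / p99
```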

