
🔄 daily merge: master → main 2026-01-29 #760

Open
antfin-oss wants to merge 538 commits into main from
create-pull-request/patch-2532941b59

Conversation

@antfin-oss

This Pull Request was created automatically to merge the latest changes from master into main branch.

📅 Created: 2026-01-29
🔀 Merge direction: master → main
🤖 Triggered by: Scheduled

Please review and merge if everything looks good.

goutamvenkat-anyscale and others added 30 commits January 12, 2026 19:36
## Description
Add model inference release test that closely reflects user workloads.

Release test run:
https://console.anyscale-staging.com/cld_vy7xqacrvddvbuy95auinvuqmt/prj_xqmpk8ps6civt438u1hp5pi88g/jobs/prodjob_glehkcquv9k26ta69f8lkc94nl?job-logs-section-tabs=application_logs&job-tab=overview&metrics-tab=data


---------

Signed-off-by: Goutam <goutam@anyscale.com>
Reverts ray-project#59983

The symlink does not work with the newer version of wanda, and the newer wanda behavior is the correct one.
…ct#59987)

- Bump .rayciversion from 0.21.0 to 0.25.0
- Move rules files to .buildkite/ with *.rules.txt naming convention
- Add always.rules.txt for always-run lint rules
- Add test.rules.test.txt with test cases
- Add test-rules CI step in cicd.rayci.yml (auto-discovery)
- Update macOS config to use new rules file paths

Topic: update-rayci-latest

Signed-off-by: andrew <andrew@anyscale.com>
…ay-project#60057)

## Summary

When running prefill-decode disaggregation with NixlConnector and data
parallelism, both prefill and decode deployments were using the same
port base for their ZMQ side channel. This caused "Address already in
use" errors when both deployments had workers on the same node:

```
zmq.error.ZMQError: Address already in use (addr='tcp://10.0.75.118:40009')
Exception in thread nixl_handshake_listener
```

## Changes

Fix by setting different `NIXL_SIDE_CHANNEL_PORT_BASE` values for
prefill (40000) and decode (41000) configs to ensure port isolation.
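The port-isolation idea can be sketched as follows (the config shapes and helper are hypothetical; only the distinct base values mirror the fix): each deployment gets its own port base, so workers of both deployments on the same node never bind the same side-channel address.

```python
# Hypothetical sketch: distinct ZMQ side-channel port bases per deployment.
PREFILL_PORT_BASE = 40000
DECODE_PORT_BASE = 41000

def side_channel_port(port_base: int, dp_rank: int) -> int:
    """Each data-parallel rank offsets from its deployment's base port."""
    return port_base + dp_rank

# With 8 DP ranks per deployment, the two port ranges are disjoint,
# so no "Address already in use" collision is possible on a shared node.
prefill_ports = {side_channel_port(PREFILL_PORT_BASE, r) for r in range(8)}
decode_ports = {side_channel_port(DECODE_PORT_BASE, r) for r in range(8)}
assert prefill_ports.isdisjoint(decode_ports)
```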

## Test plan

- Run `test_llm_serve_prefill_decode_with_data_parallelism` - should
complete without timeout
- The test previously hung forever waiting for "READY message from DP
Coordinator"

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
…se (ray-project#60092)

Signed-off-by: Future-Outlier <eric901201@gmail.com>
- Fix ProgressBar to honor `use_ray_tqdm` in `DataContext`. 
- Note that `tqdm_ray` is designed to work in non-interactive contexts
(workers/actors) by sending JSON progress updates to the driver.
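The intended selection logic can be sketched like this (the context class and helper names are hypothetical stand-ins, not the actual `ray.data` internals):

```python
# Hypothetical sketch of honoring a use_ray_tqdm flag when picking a
# progress-bar implementation.
from dataclasses import dataclass

@dataclass
class FakeDataContext:  # stand-in for ray.data.DataContext
    use_ray_tqdm: bool = True

def pick_progress_impl(ctx: FakeDataContext) -> str:
    # tqdm_ray sends JSON progress updates to the driver, so it works in
    # non-interactive contexts (workers/actors); plain tqdm writes to stderr.
    return "tqdm_ray" if ctx.use_ray_tqdm else "tqdm"

assert pick_progress_impl(FakeDataContext(use_ray_tqdm=True)) == "tqdm_ray"
assert pick_progress_impl(FakeDataContext(use_ray_tqdm=False)) == "tqdm"
```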

---------

Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
…ray-project#59933)

## Description
The `DefaultAutoscaler2` implementation needs an
`AutoscalingCoordinator` and a way to get all of the
`_NodeResourceSpec`.

Currently, we can't explicitly inject fake implementations of either
dependency. This is problematic because the tests need to assume what
the implementation of each dependency looks like and use brittle mocks.

To solve this:
- Add the `FakeAutoscalingCoordinator` implementation to a new
`fake_autoscaling_coordinator.py` module
- `DefaultClusterAutoscalerV2` has two new parameters
`autoscaling_coordinator: Optional[AutoscalingCoordinator] = None` and
`get_node_counts: Callable[[], Dict[_NodeResourceSpec, int]] =
get_node_resource_spec_and_count`. If `autoscaling_coordinator` is None,
you can use the default implementation.
- Update `test_try_scale_up_cluster` to use the explicit seams rather
than mocks. Where possible, assert against the public interface rather
than implementation details
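The "explicit seam" pattern described above can be sketched as follows (class and function names are simplified hypothetical stand-ins): optional constructor parameters with production defaults, so tests inject fakes instead of patching internals.

```python
# Hypothetical sketch of dependency-injection seams for testability.
from typing import Callable, Dict, Optional

class FakeAutoscalingCoordinator:
    """Test double that records the last resource request."""
    def request_resources(self, bundles):
        self.last_request = bundles

def default_node_counts() -> Dict[str, int]:
    # Stand-in for the production get_node_resource_spec_and_count.
    return {"head": 1}

class ClusterAutoscaler:
    def __init__(
        self,
        autoscaling_coordinator: Optional[FakeAutoscalingCoordinator] = None,
        get_node_counts: Callable[[], Dict[str, int]] = default_node_counts,
    ):
        # Fall back to the default implementation only when nothing is injected.
        self._coordinator = autoscaling_coordinator or FakeAutoscalingCoordinator()
        self._get_node_counts = get_node_counts

    def node_counts(self) -> Dict[str, int]:
        # Public surface that tests can assert against.
        return self._get_node_counts()

# A test injects both seams explicitly, no mocks or patching needed.
autoscaler = ClusterAutoscaler(get_node_counts=lambda: {"worker": 3})
assert autoscaler.node_counts() == {"worker": 3}
```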


## Related issues
Closes ray-project#59683

---------

Signed-off-by: 400Ping <fourhundredping@gmail.com>
Signed-off-by: Ping <fourhundredping@gmail.com>
## Description
RLlib's rayci.yml
[file](https://github.com/ray-project/ray/blob/master/.buildkite/rllib.rayci.yml)
and the BUILD.bazel
[file](https://github.com/ray-project/ray/blob/master/rllib/BUILD.bazel)
are disconnected: some stale tags exist in the BUILD file but not in
the rayci config, and vice versa.
This PR cleans up both files without changing which tests are currently
run.

---------

Signed-off-by: Mark Towers <mark@anyscale.com>
Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
Co-authored-by: Mark Towers <mark@anyscale.com>
Co-authored-by: Kamil Kaczmarek <kamil@anyscale.com>
…g tracer file handles (ray-project#60078)

This fix resolves serve's window test failure:
```
[2026-01-12T22:52:13Z] =================================== ERRORS ====================================
--
[2026-01-12T22:52:13Z] _______ ERROR at teardown of test_deployment_remote_calls_with_tracing ________
[2026-01-12T22:52:13Z]
[2026-01-12T22:52:13Z]     @pytest.fixture
[2026-01-12T22:52:13Z]     def cleanup_spans():
[2026-01-12T22:52:13Z]         """Cleanup temporary spans_dir folder at beginning and end of test."""
[2026-01-12T22:52:13Z]         if os.path.exists(spans_dir):
[2026-01-12T22:52:13Z]             shutil.rmtree(spans_dir)
[2026-01-12T22:52:13Z]         os.makedirs(spans_dir, exist_ok=True)
[2026-01-12T22:52:13Z]         yield
[2026-01-12T22:52:13Z]         # Enable tracing only sets up tracing once per driver process.
[2026-01-12T22:52:13Z]         # We set ray.__traced__ to False here so that each
[2026-01-12T22:52:13Z]         # test will re-set up tracing.
[2026-01-12T22:52:13Z]         ray.__traced__ = False
[2026-01-12T22:52:13Z]         if os.path.exists(spans_dir):
[2026-01-12T22:52:13Z] >           shutil.rmtree(spans_dir)
[2026-01-12T22:52:13Z]
[2026-01-12T22:52:13Z] python\ray\serve\tests\test_serve_with_tracing.py:30:
[2026-01-12T22:52:13Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2026-01-12T22:52:13Z] C:\Miniconda3\lib\shutil.py:750: in rmtree
[2026-01-12T22:52:13Z]     return _rmtree_unsafe(path, onerror)
[2026-01-12T22:52:13Z] C:\Miniconda3\lib\shutil.py:620: in _rmtree_unsafe
[2026-01-12T22:52:13Z]     onerror(os.unlink, fullname, sys.exc_info())
[2026-01-12T22:52:13Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2026-01-12T22:52:13Z]
[2026-01-12T22:52:13Z] path = '/tmp/spans/'
[2026-01-12T22:52:13Z] onerror = <function rmtree.<locals>.onerror at 0x000002C0FFBBDA20>
[2026-01-12T22:52:13Z]
[2026-01-12T22:52:13Z]     def _rmtree_unsafe(path, onerror):
[2026-01-12T22:52:13Z]         try:
[2026-01-12T22:52:13Z]             with os.scandir(path) as scandir_it:
[2026-01-12T22:52:13Z]                 entries = list(scandir_it)
[2026-01-12T22:52:13Z]         except OSError:
[2026-01-12T22:52:13Z]             onerror(os.scandir, path, sys.exc_info())
[2026-01-12T22:52:13Z]             entries = []
[2026-01-12T22:52:13Z]         for entry in entries:
[2026-01-12T22:52:13Z]             fullname = entry.path
[2026-01-12T22:52:13Z]             if _rmtree_isdir(entry):
[2026-01-12T22:52:13Z]                 try:
[2026-01-12T22:52:13Z]                     if entry.is_symlink():
[2026-01-12T22:52:13Z]                         # This can only happen if someone replaces
[2026-01-12T22:52:13Z]                         # a directory with a symlink after the call to
[2026-01-12T22:52:13Z]                         # os.scandir or entry.is_dir above.
[2026-01-12T22:52:13Z]                         raise OSError("Cannot call rmtree on a symbolic link")
[2026-01-12T22:52:13Z]                 except OSError:
[2026-01-12T22:52:13Z]                     onerror(os.path.islink, fullname, sys.exc_info())
[2026-01-12T22:52:13Z]                     continue
[2026-01-12T22:52:13Z]                 _rmtree_unsafe(fullname, onerror)
[2026-01-12T22:52:13Z]             else:
[2026-01-12T22:52:13Z]                 try:
[2026-01-12T22:52:13Z] >                   os.unlink(fullname)
[2026-01-12T22:52:13Z] E                   PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: '/tmp/spans/15464.txt'
[2026-01-12T22:52:13Z]
[2026-01-12T22:52:13Z] C:\Miniconda3\lib\shutil.py:618: PermissionError
```

**Cause:** The `setup_local_tmp_tracing.py` module opens a file handle
for the `ConsoleSpanExporter` that is never explicitly closed. On
Windows, files cannot be deleted while they're open, causing
`shutil.rmtree` to fail with `PermissionError: [WinError 32]` during the
`cleanup_spans` fixture teardown.

**Fix:** Added `trace.get_tracer_provider().shutdown()` in the
`ray_serve_with_tracing` fixture teardown to properly flush and close
the span exporter's file handles before the cleanup fixture attempts to
delete the spans directory.
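The shape of the fix can be sketched with fakes standing in for the OpenTelemetry objects (names below are hypothetical; the real fixture calls `trace.get_tracer_provider().shutdown()`): shutting down the provider closes the exporter's file handle, so the subsequent `rmtree` succeeds on Windows.

```python
# Hypothetical sketch: provider shutdown releases exporter file handles.
class FakeSpanExporter:
    """Stand-in for ConsoleSpanExporter holding an open file handle."""
    def __init__(self):
        self.closed = False
    def shutdown(self):
        self.closed = True

class FakeTracerProvider:
    """Stand-in for the OpenTelemetry TracerProvider."""
    def __init__(self, exporter):
        self.exporter = exporter
    def shutdown(self):
        # Flushes pending spans and closes exporter resources.
        self.exporter.shutdown()

exporter = FakeSpanExporter()
provider = FakeTracerProvider(exporter)
# ... test body would run here ...
provider.shutdown()  # teardown step added by the fix
assert exporter.closed  # rmtree on the spans dir can now succeed
```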

---------

Signed-off-by: doyoung <doyoung@anyscale.com>
### Why are these changes needed?

When `fit()` is called multiple times on a `Preprocessor`, the `stats_`
dict was not being reset before computing new stats. This caused stale
stats from previous `fit()` calls to persist when stat keys are
data-dependent (e.g., when columns are auto-detected from the data).

This violates the documented behavior in the `fit()` docstring:

> Calling it more than once will overwrite all previously fitted state:
> `preprocessor.fit(A).fit(B)` is equivalent to `preprocessor.fit(B)`.

**Example of the bug:**

```python
# Preprocessor that auto-detects columns from data
preprocessor = DataDependentPreprocessor()

# Dataset A has columns: a, b
dataset_a = ray.data.from_items([{"a": 1.0, "b": 10.0}, ...])
preprocessor.fit(dataset_a)
# stats_ = {"mean(a)": 2.0, "mean(b)": 20.0}

# Dataset B has columns: b, c (no "a" column)
dataset_b = ray.data.from_items([{"b": 100.0, "c": 1000.0}, ...])
preprocessor.fit(dataset_b)

# BUG: stats_ = {"mean(a)": 2.0, "mean(b)": 200.0, "mean(c)": 2000.0}
#      "mean(a)" is STALE - it should not exist!
# EXPECTED: stats_ = {"mean(b)": 200.0, "mean(c)": 2000.0}
```
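A minimal sketch of the fix, using a plain-dict stand-in for the real `Preprocessor` and datasets: reset `stats_` at the start of `fit()` so `fit(A).fit(B)` is equivalent to `fit(B)`.

```python
# Hypothetical stand-in for a data-dependent Preprocessor.
class Preprocessor:
    def __init__(self):
        self.stats_ = {}

    def fit(self, dataset: dict):
        self.stats_ = {}  # the fix: drop stale keys from previous fits
        for col, values in dataset.items():
            self.stats_[f"mean({col})"] = sum(values) / len(values)
        return self

p = Preprocessor()
p.fit({"a": [1.0, 3.0], "b": [10.0, 30.0]})
p.fit({"b": [100.0, 300.0], "c": [1000.0, 3000.0]})
# "mean(a)" no longer leaks into the second fit's stats.
assert p.stats_ == {"mean(b)": 200.0, "mean(c)": 2000.0}
```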

---------

Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
Signed-off-by: xgui <xgui@anyscale.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: sampan <sampan@anyscale.com>
Co-authored-by: sampan <sampan@anyscale.com>
…project#60072)

Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: Dhyey Shah <dhyey2019@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…ay-project#60037)

## Description

As mentioned in
ray-project#59740 (comment),
add explicit args in `_AutoscalingCoordinatorActor` constructor to
improve maintainability.

## Related issues

Follow-up: ray-project#59740

## Additional information
- Pass in mock function in testing as args rather than using `patch`

---------

Signed-off-by: machichima <nary12321@gmail.com>
Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
…ay-project#60028)

Capture the install script content in BuildContext digest by inlining it
as a constant and adding install_python_deps_script_digest field. This
ensures build reproducibility when the script changes.

Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
the "test-rules" test job was missing the forge dependency

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
This migrates ray wheel builds from CLI-based approach to wanda-based
container builds for x86_64.

Changes:
- Add ray-wheel.wanda.yaml and Dockerfile for wheel builds
- Update build.rayci.yml wheel steps to use wanda
- Add wheel upload steps that extract from wanda cache

Topic: ray-wheel

Signed-off-by: andrew <andrew@anyscale.com>
ray-project#60114)

…eed up iter_batches (ray-project#58467)"

This reverts commit 2a042d4.

## Description
Reverts #58467


Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
…ctors (ray-project#59850)

Signed-off-by: dragongu <andrewgu@vip.qq.com>
Added a notebook that demonstrates a Serve application that takes a
reference to a video as input and returns scene changes, tags, and a
video description (from the corpus).


https://anyscale-ray--59859.com.readthedocs.build/en/59859/serve/tutorials/video-analysis/README.html

---------

Signed-off-by: abrar <abrar@anyscale.com>
…ject#60109)

Follow up to
ray-project#52573 (comment)

---------

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Co-authored-by: iamjustinhsu <140442892+iamjustinhsu@users.noreply.github.com>
they are not required for test orchestration, as rayci can properly
track dependencies now.

Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Opening draft PR for Ray technical charter.

Planning to add GitHub usernames before merging.

---------

Signed-off-by: Robert Nishihara <rkn@anyscale.com>
Signed-off-by: Philipp Moritz <pcmoritz@gmail.com>
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
There isn't really a need for a `RAY_CHECK` on the event exporter, as
there isn't an important correctness invariant here: one call will
succeed. We already take some measure of caution with a mutex in the
event recorder, but a `RAY_CHECK` right after the mutex is just asking
for trouble.

---------

Signed-off-by: zac <zac@anyscale.com>
We are currently seeing issues where crane is not available in the
upload environment. Default to Docker when crane is unavailable.


https://buildkite.com/ray-project/postmerge/builds/15375/steps/canvas?jid=019bb99d-6f9e-45fa-92e3-a5a1d9373e8d#019bb99d-6f9e-45fa-92e3-a5a1d9373e8d/L198

Topic: crane-fix

Signed-off-by: andrew <andrew@anyscale.com>

Signed-off-by: andrew <andrew@anyscale.com>
…ay-project#59991)

Updates lock files for images and uses relative paths in buildkite
configs. Moves the base extra test-deps lock files from the ray_release
path to python/deplocks/base_extra_testdeps.
Release test run: https://buildkite.com/ray-project/release/builds/74936

---------

Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
…ct#59969)

The ray-cpp wheel contains only C++ headers, libraries, and executables
with no Python-specific code. Previously we built 4 identical wheels
(one per Python version: cp310, cp311, cp312, cp313), wasting CI time
and storage.

This change produces a single wheel tagged py3-none-manylinux2014_*
that works with any Python 3.x version.

Changes:
- Add ray-cpp-core.wanda.yaml and Dockerfile for cpp core
- Add ray-cpp-wheel.wanda.yaml for cpp wheel builds
- Add ci/build/build-ray-cpp-wheel.sh for Python-agnostic wheel builds
- Add RayCppBdistWheel class to setup.py that forces py3-none tags
  (necessary because BinaryDistribution.has_ext_modules() causes
  bdist_wheel to use interpreter-specific ABI tags by default)
- Update ray-cpp-wheel.wanda.yaml to build single wheel per architecture
- Update .buildkite/build.rayci.yml to remove Python version matrix
  for cpp wheel build/upload steps
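The tag-forcing idea can be sketched like this (the base class below is a stub standing in for `wheel.bdist_wheel.bdist_wheel`, and the exact tag values are illustrative): override `get_tag()` to keep the platform tag while dropping the interpreter/ABI specificity.

```python
# Hypothetical sketch of forcing py3-none tags on a binary wheel.
class BdistWheelBase:  # stand-in for wheel.bdist_wheel.bdist_wheel
    def get_tag(self):
        # By default, has_ext_modules() makes bdist_wheel emit
        # interpreter-specific tags like these.
        return ("cp312", "cp312", "manylinux2014_x86_64")

class RayCppBdistWheel(BdistWheelBase):
    def get_tag(self):
        _, _, plat = super().get_tag()
        # Keep the platform tag (native binaries are still
        # architecture-specific) but drop the Python/ABI specificity,
        # since the wheel contains no Python-version-dependent code.
        return ("py3", "none", plat)

assert RayCppBdistWheel().get_tag() == ("py3", "none", "manylinux2014_x86_64")
```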

Topic: ray-cpp-wheel
Relative: ray-wheel

Signed-off-by: andrew <andrew@anyscale.com>

---------

Signed-off-by: andrew <andrew@anyscale.com>
Signed-off-by: cristianjd <cristian.j.derr@gmail.com>
The governance information is now integrated into the contributor
documentation at doc/source/ray-contribute/getting-involved.rst:399-437,
making it easily discoverable for community members interested in
advancing their involvement in the Ray project.

---------

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Robert Nishihara <robertnishihara@gmail.com>
## Description
Completing the fixed-size array namespace operations

## Related issues
Related to ray-project#58674 


---------

Signed-off-by: 400Ping <fourhundredping@gmail.com>
Signed-off-by: Ping <fourhundredping@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
iamjustinhsu and others added 24 commits January 27, 2026 09:31
## Description
This should say `False`

---------

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Signed-off-by: iamjustinhsu <140442892+iamjustinhsu@users.noreply.github.com>
Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu>
…ct#60222)

The actor repr name is _only_ used in task receiver when replying to the
`PushTask` RPC for an actor creation task. Making it one of the task
execution outputs instead of a stateful field. I've opted to make it an
outparam for the core worker task execution callback as well, rather
than adding a custom method for it.

My meta goal is to make the logic that handles a task execution result
in the task receiver fully stateless.

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…eue` (ray-project#60538)

ray-project#60017 and
ray-project#60228 refactored the
`FIFOBundleQueue` interface and renamed `FIFOBundleQueue.popleft` to
`FIFOBundleQueue.get_next`. However, this name change wasn't reflected
in the `UnionOperator` implementation, and as a result the operator can
error when it clears its output queue.

This change also fixes the flaky `test_union.py`.

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
…alues (ray-project#60488)

## Description

This PR improves numerical stability in preprocessor scalers
(`StandardScaler` and `MinMaxScaler`) by extending division-by-zero
handling to also cover near-zero values.

**Current behavior:**  
The scalers only check for exact zero values (e.g., `std == 0` or `diff
== 0`), which can lead to numerical instability when dealing with
near-zero values (e.g., `std = 1e-10`). This is a common edge case in
real-world data preprocessing where columns have extremely small
variance or range.

**Changes made:**
- Added `_EPSILON = 1e-8` constant to define near-zero threshold
(following sklearn's approach)
- Updated `StandardScaler._transform_pandas()` and `_scale_column()` to
use `< _EPSILON` instead of `== 0`
- Updated `MinMaxScaler._transform_pandas()` similarly
- Added comprehensive test cases covering near-zero and exact-zero edge
cases

**Impact:**  
This change prevents numerical instability (NaN/inf values) when scaling
columns with very small but non-zero variance/range, while maintaining
backward compatibility for normal use cases.
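The guard can be sketched as follows (the helper below is a simplified hypothetical, mirroring the described change rather than the actual scaler code): any scale below `_EPSILON` is treated as degenerate, not just exact zero.

```python
# Hypothetical sketch of the near-zero guard in a standard scaler.
_EPSILON = 1e-8

def scale_column(values, mean, std):
    # The old check was `std == 0`; a std like 1e-10 then slipped
    # through and produced huge or inf/NaN outputs. The epsilon check
    # covers both the exact-zero and near-zero cases.
    if std < _EPSILON:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

assert scale_column([1.0, 1.0], 1.0, 0.0) == [0.0, 0.0]    # exact zero
assert scale_column([1.0, 1.0], 1.0, 1e-10) == [0.0, 0.0]  # near zero
assert scale_column([0.0, 2.0], 1.0, 1.0) == [-1.0, 1.0]   # normal path
```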

## Related issues

Addresses TODO comments in `python/ray/data/preprocessors/scaler.py`:
- Line 117: `# TODO: extend this to handle near-zero values.`
- Line 271: `# TODO: extend this to handle near-zero values.`

## Additional information

### Implementation Details

**Epsilon Value Selection:**  
The threshold `_EPSILON = 1e-8` was chosen to align with
industry-standard practices (e.g., sklearn, numpy). This value
effectively handles floating-point precision issues without incorrectly
treating legitimate small variances as zero.

**Modified Methods:**
1. `StandardScaler._transform_pandas()` - Pandas transformation path
2. `StandardScaler._scale_column()` - PyArrow transformation path
3. `MinMaxScaler._transform_pandas()` - Pandas transformation path

**Backward Compatibility:**  
✅ For normal data (variance/range > 1e-8), behavior is **identical** to
before
✅ Only triggers new logic for extreme edge cases (variance/range < 1e-8)
✅ All existing tests pass without modification

### Test Coverage

Added three new test cases:
1. `test_standard_scaler_near_zero_std()` - Tests data with std ≈
4.7e-11
2. `test_min_max_scaler_near_zero_range()` - Tests data with range ≈
1e-10
3. `test_standard_scaler_exact_zero_std()` - Regression test for exact
zero case

Signed-off-by: slfan1989 <slfan1989@apache.org>
…ject#60479)

## Description
Add type annotations to Ray's annotation decorators so type checkers can
properly infer return types through decorated functions.

Before this change, decorators like `@PublicAPI` caused type checkers to
lose function signature information. After this change, decorated
functions retain their full type signatures.

## Related issues
Related to ray-project#59303

## Additional information
Running pyrefly on Ray complained when calling take_all(), which led me
down this rabbit hole. I tried to add annotations to all the
public-facing decorators I could find that had reasonably clear fixes.
I also did some drive-by type fixes in annotations.py to make it fully
pass.

---------

Signed-off-by: Julian Meyers <Julian@MeyersWorld.com>
## Description
Moved arrow_utils.py to a direct subpackage of `ray.data.util`.

## Related issues
Closes ray-project#60420 

## Additional information
Moved the file to a `ray.data` subpackage and updated the import paths;
also fixed a minor readability issue.

---------

Signed-off-by: Hyunoh-Yeo <hyunoh.yeo@gmail.com>
Signed-off-by: Hyunoh Yeo <113647638+Hyunoh-Yeo@users.noreply.github.com>
… behavior (ray-project#60394)

## Summary

This PR fixes a startup crash when running `ray start --head
--no-redirect-output` (and the same flag in KubeRay-generated `ray
start` commands). The CLI previously routed this option through a
deprecated `RayParams.redirect_output` parameter, which raises a
`DeprecationWarning` as an exception and prevents Ray from starting. The
PR also corrects the effective behavior of `--no-redirect-output` by
using the supported mechanism (`RAY_LOG_TO_STDERR=1`) to disable log
redirection.

## Description

### What happened 

- The CLI option `--no-redirect-output` was mapped to
`RayParams.redirect_output`.
- `RayParams._check_usage()` raises `DeprecationWarning("The
redirect_output argument is deprecated.")` whenever `redirect_output` is
not `None`, which terminates `ray start`.
- Additionally, the previous mapping effectively inverted intent by
setting `redirect_output=True` when `--no-redirect-output` was provided.

### What was expected to happen

- `ray start --no-redirect-output` should **not crash**.
- It should disable redirecting non-worker stdout/stderr into
`.out/.err` files (i.e., logs should go to stderr/console), consistent
with the flag name and help text.

### What this PR changes

- Stop passing the deprecated `redirect_output` argument into
`RayParams` from the `ray start` CLI.
- When `--no-redirect-output` is set, configure the supported behavior
by setting:`RAY_LOG_TO_STDERR=1`
- This leverages the existing fallback logic in
`Node.should_redirect_logs()` which checks `RAY_LOG_TO_STDERR` when
`RayParams.redirect_output` is `None`.
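The CLI-side mapping can be sketched as follows (the helper function is hypothetical; the real change lives inside the `ray start` command, but the env var is the supported mechanism named above):

```python
# Hypothetical sketch: route --no-redirect-output through the supported
# env var instead of the deprecated RayParams.redirect_output field.
def apply_redirect_flag(redirect_output: bool, env: dict) -> None:
    if not redirect_output:
        # Node.should_redirect_logs() consults RAY_LOG_TO_STDERR when
        # RayParams.redirect_output is None, so this disables log
        # redirection without touching the deprecated parameter.
        env["RAY_LOG_TO_STDERR"] = "1"

env = {}
apply_redirect_flag(redirect_output=False, env=env)
assert env["RAY_LOG_TO_STDERR"] == "1"

env = {}
apply_redirect_flag(redirect_output=True, env=env)
assert "RAY_LOG_TO_STDERR" not in env  # default behavior unchanged
```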

### Testing
<img width="1280" height="468" alt="image"
src="https://github.com/user-attachments/assets/6eb32b2e-80fa-4c05-b308-1700e92b1efb"
/>


## Related issues
Closes ray-project#60367

---------

Signed-off-by: daiping8 <dai.ping88@zte.com.cn>
…ray-project#60526)

## Description
Currently we use `get_browsers_no_post_put_middleware` to block PUT/POST
requests from browsers since these endpoints are not intended to be
called from a browser context (e.g., via DNS rebinding or CSRF).
However, DELETE methods were not blocked, allowing browser-based
requests to delete jobs or shut down Serve applications.

This PR switches from a blocklist (POST/PUT) to an allowlist
(GET/HEAD/OPTIONS) approach, ensuring only explicitly safe methods are
permitted from browsers. This also covers PATCH and any future HTTP
methods by default.
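The allowlist check can be sketched like this (the helper name and constant are hypothetical; the real change lives in the dashboard middleware):

```python
# Hypothetical sketch of the method allowlist for browser requests.
BROWSER_SAFE_METHODS = frozenset({"GET", "HEAD", "OPTIONS"})

def is_browser_request_allowed(method: str) -> bool:
    # An allowlist rejects DELETE, PATCH, and any future HTTP methods
    # by default, unlike the old POST/PUT blocklist.
    return method.upper() in BROWSER_SAFE_METHODS

assert is_browser_request_allowed("GET")
assert not is_browser_request_allowed("DELETE")  # previously unblocked
assert not is_browser_request_allowed("PATCH")
```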


---------

Signed-off-by: sampan <sampan@anyscale.com>
Co-authored-by: sampan <sampan@anyscale.com>
…ct#60536)

Used to consolidate MANYLINUX_VERSION. In the future, rayci.env will
also be used to consolidate RAY_VERSION and other related fields.

Signed-off-by: andrew <andrew@anyscale.com>
…kerfiles (ray-project#60386)

- Add --mount=type=cache to ray-core and ray-java Dockerfiles
- Update ray-cpp-core to use shared cache ID
(ray-bazel-cache-${HOSTTYPE})
- Configure Bazel repository cache inside the mount for faster
dependency resolution
- Auto-disable remote cache uploads when BUILDKITE_BAZEL_CACHE_URL is
empty, preventing 403 errors on local builds without AWS credentials

All python-agnostic images now share the same Bazel cache per
architecture, maximizing cache reuse while preventing cross-architecture
toolchain conflicts.

Signed-off-by: andrew <andrew@anyscale.com>
Signed-off-by: abrar <abrar@anyscale.com>
Agent-fixes some misc typos and grammar in docs (plus a batch-llm
variable name). Feel free to ignore if too trivial.

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
…tric name (ray-project#60481)

## Description
Fixes the broken **"Task Completion Time Without Backpressure"** metrics
chart in the Ray Data Grafana dashboard.

The panel was querying
`ray_data_task_completion_time_without_backpressure`, which no longer
exists. PR ray-project#57788 renamed the underlying metric to
`task_completion_time_excl_backpressure_s` in `op_runtime_metrics.py`,
but the Grafana panel in `data_dashboard_panels.py` was not updated.
This PR updates the panel's Prometheus `expr` to use
`ray_data_task_completion_time_excl_backpressure_s` so the chart
displays data again.

**Change:** Single-line fix in `data_dashboard_panels.py`: replace the
old metric name with the correct one in the panel's `expr`. The formula
(average task completion time excluding backpressure over a 5-minute
window) is unchanged.
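The rename can be illustrated with a sketch (the panel dict shape and exact PromQL below are hypothetical; only the metric names mirror the PR):

```python
# Hypothetical sketch of the dashboard panel after the metric rename.
OLD_METRIC = "ray_data_task_completion_time_without_backpressure"
NEW_METRIC = "ray_data_task_completion_time_excl_backpressure_s"

panel = {
    "title": "Task Completion Time Without Backpressure",
    # Same 5-minute-average formula as before; only the metric name changed.
    "expr": f"avg(rate({NEW_METRIC}[5m]))",
}

assert OLD_METRIC not in panel["expr"]
assert NEW_METRIC in panel["expr"]
```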

## Related issues
Fixes the regression from ray-project#57788 (metric rename). Related to Ray Data
monitoring / dashboard.
Closes: ray-project#60163 

## Additional information
- **Metric flow:**
`op_runtime_metrics.task_completion_time_excl_backpressure_s` → Stats
uses `data_{name}` → Metrics agent adds `ray_` namespace →
**`ray_data_task_completion_time_excl_backpressure_s`**
- **Manual verification:** Run a Ray Data job with Grafana + Prometheus
(see [cluster
metrics](https://docs.ray.io/en/latest/cluster/metrics.html)), then
confirm the "Task Completion Time Without Backpressure" panel shows
data.

Signed-off-by: kriyanshii <kriyanshishah06@gmail.com>
…ces (ray-project#60470)

## Description

This PR revisits `ReorderingBundleQueue` to move pointer advancements
from `get_next_inner` and `finalize` into `has_next` method to guarantee
that the queue will not get stuck with any operations sequence.

Currently, `ReorderingBundleQueue` could still get stuck in case of the
sequence captured in `test_ordered_queue_getting_stuck`.

The queue is guaranteed to traverse all bundles as long as all keys are
finalized (i.e., their tasks have finished).
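A heavily simplified sketch of the invariant (the tiny queue below is hypothetical and omits the actual reordering logic): advancing state in `has_next()` rather than in `get_next`/`finalize` means no operation ordering can strand items, so a drain loop always terminates once every key is finalized.

```python
# Hypothetical toy queue illustrating the drain invariant.
class TinyOrderedQueue:
    def __init__(self):
        self._items = []
        self._finalized = set()
        self._pos = 0

    def add(self, key, bundle):
        self._items.append((key, bundle))

    def finalize(self, key):
        self._finalized.add(key)

    def has_next(self) -> bool:
        # Pointer checks/advancement live here, not in get_next or
        # finalize, so any call sequence can make progress.
        return self._pos < len(self._items)

    def get_next(self):
        _, bundle = self._items[self._pos]
        self._pos += 1
        return bundle

q = TinyOrderedQueue()
q.add("a", 1)
q.add("b", 2)
q.finalize("a")
q.finalize("b")
out = []
while q.has_next():
    out.append(q.get_next())
assert out == [1, 2]  # the queue fully drains, never getting stuck
```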


---------

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
## Description

This removes the requirement for pipelines having `Sort` operations to
actually require `preserve_order=True`.

This is an unnecessarily strict requirement with adverse side effects,
and it is not actually required because no global ordering is
established between the blocks.


---------

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
…ject#60544)

Follow up from:
ray-project#60526 (comment)

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…ct#60529)

## Description
This PR updates logical operators and logical rules to consistently
access input dependencies via the input_dependencies property instead of
the internal _input_dependencies field. This is the first step in the
split plan for Issue ray-project#60312 and keeps physical operators out of scope.

To keep reviews small, we're splitting the work into stacked PRs:

1. PR that just replaces references to _input_dependencies with
input_dependencies (this PR)
2. PR that just renames operator attributes so they don't have a leading
underscore
3. PR that just removes LogicalOperator.output_dependencies (physical is
out of scope)
4. PR that converts operators to frozen dataclasses (ideally avoiding
object.__setattr__ / super().__init__)

This PR implements the first step in the planned four-PR split.

## Related issues
Related to ray-project#60312.

## Additional information
- Scope: logical operators + logical rules only (no physical operator
changes).
  - Updated operator classes: AllToAll, NAry, OneToOne.
  - Updated rules: limit_pushdown, operator_fusion, predicate_pushdown.
- No behavior changes intended; this is a refactor to unify access
through the public property.

Signed-off-by: yaommen <myanstu@163.com>
…st_stage (ray-project#60299)

Signed-off-by: Yu Chen <yuchen.ecnu@gmail.com>
…#60558)

## Description

This PR fixes pydoclint documentation linting errors (DOC101 and DOC103)
in `python/ray/data/read_api.py`. These errors occur when function
signatures and docstrings are inconsistent, which can confuse users
reading the API documentation.

The fixes ensure that:
- All `**kwargs` parameters are properly documented with `**` prefix
- Missing parameters are documented in docstrings
- Parameter names match exactly between function signatures and
docstrings
- Typos in parameter names are corrected
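The docstring convention being enforced can be shown with a small example (the function below is hypothetical, purely to illustrate the DOC101/DOC103 rules):

```python
# Hypothetical example of a pydoclint-clean signature/docstring pair.
def read_example(path: str, shuffle: bool = False, **arrow_args):
    """Read a hypothetical datasource.

    Args:
        path: Path to read from.
        shuffle: Whether to shuffle output rows (omitting this would
            trigger DOC101, missing argument).
        **arrow_args: Extra args forwarded to Arrow (documenting this
            without the ** prefix would trigger DOC103, argument
            mismatch).
    """
    return {"path": path, "shuffle": shuffle, "arrow_args": arrow_args}

result = read_example("s3://bucket/data", shuffle=True, block_size=1024)
assert result["arrow_args"] == {"block_size": 1024}
```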

## Related issues
Fixes ray-project#60545.

Fixes pydoclint DOC101 (missing arguments) and DOC103 (argument
mismatch) violations in read_api.py.

## Additional information

### Changes made:

**1. Fixed `**kwargs` parameter documentation format:**
- `read_datasource`: `read_args` → `**read_args`
- `read_mongo`: `mongo_args` → `**mongo_args`
- `read_parquet`: `arrow_parquet_args` → `**arrow_parquet_args`
- `read_json`: `arrow_json_args` → `**arrow_json_args`
- `read_csv`: `arrow_csv_args` → `**arrow_csv_args`
- `read_numpy`: `numpy_load_args` → `**numpy_load_args`

**2. Added missing parameter documentation:**
- `read_audio`: Added `shuffle` parameter documentation
- `read_videos`: Added `shuffle` and `override_num_blocks` parameter
documentation
- `read_bigquery`: Added `query` parameter documentation
- `read_text`: Added `drop_empty_lines` parameter documentation

**3. Fixed typos:**
- `read_videos`: Fixed `include_timestmaps` → `include_timestamps`

---------

Signed-off-by: slfan1989 <slfan1989@apache.org>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…ray-project#57694)

<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?

This PR adds the `label_selector` option to the supported list of Actor
options for a Serve deployment. Additionally, we add
`bundle_label_selector` to specify label selectors for bundles when
`placement_group_bundles` are specified for the deployment. These two
options are already supported for Tasks/Actors and placement groups
respectively.

Example use case:
```
llm_config = LLMConfig(
    model_loading_config={
        "model_id": "meta-llama/Meta-Llama-3-70B-Instruct",
        "model_source": "huggingface",
    },
    engine_kwargs=tpu_engine_config,
    resources_per_bundle={"TPU": 4},
    runtime_env={"env_vars": {"VLLM_USE_V1": "1"}},
    deployment_config={
        "num_replicas": 4,
        "ray_actor_options": {
            # In a GKE cluster with multiple TPU node-pools, schedule
            # only to the desired slice.
            "label_selector": {
                "ray.io/tpu-topology": "4x4" # added by default by Ray
            }
        }
    }
)
```

The expected behavior of these new fields is as follows:

**Pack scheduling enabled**
----------------------------------------
**PACK/STRICT_PACK PG strategy:**
- Standard PG without `bundle_label_selector` or fallback:
  - Sorts replicas by resource size (descending), attempts to find the "best fit" node (minimizing fragmentation) with available resources, and creates a placement group on that target node.
- PG node label selector provided:
  - Same behavior as a regular placement group, but filters the list of candidate nodes to only those matching the label selector before finding the best fit.
- PG node label selector and fallback:
  - Same as above, but scheduling proceeds as follows:
    1. Try to find a node matching the primary `placement_group_bundles` and `bundle_label_selector`.
    2. If no node fits, iterate through the `placement_group_fallback_strategy`; for each fallback entry, try to find a node matching that entry's bundles and labels.
    3. If a node is found, create a PG on it.

**SPREAD/STRICT_SPREAD PG strategy:**
- If any deployment uses these strategies, the global logic falls back to "Spread Scheduling" (see below).

**Spread scheduling enabled**
----------------------------------------
- Standard PG without `bundle_label_selector` or fallback:
  - Creates a placement group via Ray Core without specifying a `target_node_id`; Ray Core decides placement based on the strategy.
- PG node label selector provided:
  - Serve passes the `bundle_label_selector` to the `CreatePlacementGroupRequest`; Ray Core handles the soft/hard constraint logic during PG creation.
- PG node label selector and fallback:
  - Serve passes the `bundle_label_selector` to the `CreatePlacementGroupRequest`. `fallback_strategy` isn't supported in the placement group options yet, so that field isn't passed or considered; it's only used in the "best fit" node selection logic, which is skipped for spread scheduling.
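As a rough illustration of the pack-scheduling path described above (the helper name, node representation, and resource model are all hypothetical, not Serve's actual code), the key point is that the label filter runs before the best-fit selection:

```python
def best_fit_node(nodes, required, label_selector=None):
    """Pick the node that minimizes leftover capacity (fragmentation).

    nodes: list of dicts with "labels" and "available" resource maps.
    required: resource name -> amount needed by the placement group.
    label_selector: optional exact-match label constraints.
    """
    # Filter candidates by label selector first, as described above.
    candidates = [
        n for n in nodes
        if label_selector is None
        or all(n["labels"].get(k) == v for k, v in label_selector.items())
    ]
    # Keep only nodes with enough available resources.
    feasible = [
        n for n in candidates
        if all(n["available"].get(r, 0) >= amt for r, amt in required.items())
    ]
    if not feasible:
        return None  # caller would then try the fallback strategy
    # "Best fit": least total leftover capacity after placement.
    return min(
        feasible,
        key=lambda n: sum(
            n["available"].get(r, 0) - amt for r, amt in required.items()
        ),
    )
```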

## Related issue number

ray-project#51564

## Checks

- [x] I've signed off every commit (by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run pre-commit jobs to lint the changes in this PR.
([pre-commit
setup](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting))
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Signed-off-by: ryanaoleary <ryanaoleary@google.com>
Signed-off-by: Ryan O'Leary <113500783+ryanaoleary@users.noreply.github.com>
Co-authored-by: Cindy Zhang <cindyzyx9@gmail.com>
Co-authored-by: Abrar Sheikh <abrar2002as@gmail.com>
Co-authored-by: Mengjin Yan <mengjinyan3@gmail.com>
ray-project#60482)

Fixes ray-project#58851

### Changes

1. **New `gRPCStatusError` exception class** - Wraps exceptions with
user-set gRPC status codes so they flow through Ray's error handling
path.
2. **Exception wrapping in replica methods** - `handle_request`,
`handle_request_streaming`, and `handle_request_with_rejection` now wrap
exceptions with `gRPCStatusError` when the user has set a status code on
the gRPC context.
3. **Status code preservation in proxy** - `get_grpc_response_status()`
now detects `gRPCStatusError` and returns the user's intended status
code instead of `INTERNAL`.
4. **Message truncation** - Added `_truncate_message()` to limit error
details to 4KB, avoiding HTTP/2 trailer size limits.
5. **Documentation updates** - Updated the gRPC guide to document the
new behavior.

---------

Signed-off-by: abrar <abrar@anyscale.com>
…ay-project#60569)

## Description
The autoscaling validation warning was incorrectly raised for fixed-size
actor pools (`min_size == max_size`). These pools don't scale up, so the
warning doesn't apply.

## Related issues
Context:
ray-project#60477 (comment)

## Additional information
After this change, when we run `python -m pytest -v -s
test_vllm_engine_proc.py::test_generation_model`, we no longer observe
autoscaling warnings in the log.

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
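A minimal sketch of the fixed check (the function and parameter names are assumptions, not Ray Data's actual API):

```python
import warnings

def validate_actor_pool(min_size: int, max_size: int, max_tasks_in_flight: int):
    """Warn about autoscaling limitations, but only for pools that can scale."""
    if min_size == max_size:
        # Fixed-size pool: it never scales up, so the warning doesn't apply.
        return
    if max_tasks_in_flight <= 1:
        warnings.warn(
            "Autoscaling may be limited: increase max_tasks_in_flight "
            "so queued tasks can trigger scale-up."
        )
```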
## Description
- Remove the outdated `ray.air` library code from Ray Data
- Update old usages from `ray.air` to `ray.data`


---------

Signed-off-by: You-Cheng Lin <mses010108@gmail.com>
@gemini-code-assist gemini-code-assist bot left a comment
Code Review

This pull request is an automated daily merge from master to main, containing a wide variety of changes. The most significant changes include a major refactoring of the CI/CD pipeline to a more modular, wanda-based system, updates to Python and library versions (e.g., dropping Python 3.9 support in some areas), and extensive documentation improvements. The CI refactoring appears to enhance caching and multi-architecture support. The documentation has been significantly improved for clarity, accuracy, and completeness across many components. I've identified one potential issue with a new CI rule that seems overly broad and could lead to CI inefficiency. Overall, the changes are positive and well-structured.

Comment on lines +263 to +267
```
*
@ ml tune train data serve
@ core_cpp cpp java python doc
@ linux_wheels macos_wheels dashboard tools release_tests
;
```


medium

The new wildcard rule at the end of this file seems overly broad. It applies a large number of tags (ml, tune, train, data, serve, core_cpp, cpp, java, python, doc, linux_wheels, macos_wheels, dashboard, tools, release_tests) to any file that doesn't match a more specific rule. This could lead to a significant number of unnecessary CI jobs being triggered for minor or unrelated changes (e.g., a typo fix in a non-code file).

While this might be intended as a conservative fallback, it could also be a source of CI inefficiency. Consider making this default rule more restrictive, or splitting it into smaller, more targeted fallback rules if possible.
