
πŸ”„ daily merge: master β†’ main 2026-01-23#756

Open
antfin-oss wants to merge 467 commits into main from
create-pull-request/patch-542fd29717

Conversation

@antfin-oss

This Pull Request was created automatically to merge the latest changes from the master branch into the main branch.

πŸ“… Created: 2026-01-23
πŸ”€ Merge direction: master β†’ main
πŸ€– Triggered by: Scheduled

Please review and merge if everything looks good.

harshit-anyscale and others added 30 commits January 7, 2026 09:01
- adding anyscale template configs for async inf template

Signed-off-by: harshit <harshit@anyscale.com>
As-is, this script installs the arm binary regardless of the actual
machine type. Also bumping the version to unblock an issue when running with a
newer OpenSSL version:
```
[ERROR 2026-01-07 03:46:50,067] crane_lib.py: 70  Crane command `/home/forge/.cache/bazel/_bazel_forge/5fe90af4e7d1ed9fcf52f59e39e126f5/external/crane_linux_x86_64/crane copy 029272617770.dkr.ecr.us-west-2.amazonaws.com/rayproject/citemp:d656a31a-ray-anyscale-py3.10-cpu us-west1-docker.pkg.dev/anyscale-oss-ci/anyscale/ray:pr-59902.3702b2-py310-cpu` failed with stderr:
--
2026/01/07 03:46:49 Copying from 029272617770.dkr.ecr.us-west-2.amazonaws.com/rayproject/citemp:d656a31a-ray-anyscale-py3.10-cpu to us-west1-docker.pkg.dev/anyscale-oss-ci/anyscale/ray:pr-59902.3702b2-py310-cpu
ERROR: gcloud failed to load: module 'lib' has no attribute 'X509_V_FLAG_NOTIFY_POLICY'
gcloud_main = _import_gcloud_main()
import googlecloudsdk.gcloud_main
from googlecloudsdk.calliope import cli
from googlecloudsdk.calliope import backend
from googlecloudsdk.calliope import parser_extensions
from googlecloudsdk.core.updater import update_manager
from googlecloudsdk.core.updater import installers
from googlecloudsdk.core.credentials import store
from googlecloudsdk.api_lib.auth import util as auth_util
from googlecloudsdk.core.credentials import google_auth_credentials as c_google_auth
from oauth2client import client as oauth2client_client
from oauth2client import crypt
from oauth2client import _openssl_crypt
from OpenSSL import crypto
from OpenSSL import SSL, crypto
from OpenSSL.crypto import (
class X509StoreFlags:
NOTIFY_POLICY: int = _lib.X509_V_FLAG_NOTIFY_POLICY
This usually indicates corruption in your gcloud installation or problems with your Python interpreter.

```

---------

Signed-off-by: andrew <andrew@anyscale.com>
Signed-off-by: Andrew Pollack-Gray <andrew@anyscale.com>
…ect#58435)

- Fix memory safety for core_worker in the shutdown executor -- use
`weak_ptr` instead of raw pointer.
- Ensure shutdown completes before core worker destructs.

---------

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
No longer relevant.

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Add documentation to 20 functions in ci/raydepsets/cli.py that were
missing docstrings, improving code readability and maintainability.

πŸ€– Generated with [Claude Code]

Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
…9745)

## Description
Fixed a broken link in the read_unity_catalog doc string. Previous URL
was outdated.

## Related issues
None 

## Additional information
N/A

---------

Signed-off-by: Jess <jessica.jy.kong@gmail.com>
Signed-off-by: Jessica Kong <jessica.jy.kong@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
## Description

`CountDistinct` allows users to compute the number of distinct values in
a column, similar to SQL's `COUNT(DISTINCT ...)`.
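
A minimal usage sketch, assuming the aggregation is exposed as `ray.data.aggregate.CountDistinct` (only the class name comes from this description; the import path is an assumption):

```python
# Sketch only: assumes CountDistinct sits alongside the other built-in
# aggregations in ray.data.aggregate.
import ray
from ray.data.aggregate import CountDistinct

ds = ray.data.from_items(
    [{"category": c} for c in ["a", "b", "a", "c", "b"]]
)
# Analogous to SQL's COUNT(DISTINCT category): 3 unique values here.
print(ds.aggregate(CountDistinct("category")))
```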

## Related issues

close ray-project#58252

## Additional information

---------

Signed-off-by: my-vegetable-has-exploded <wy1109468038@gmail.com>
Co-authored-by: Goutam <goutam@anyscale.com>
…ct#59942)

Updating to reflect an issue that I debugged recently.

Recommendation is to use `overlayfs` instead of the default `vfs` for
faster container startup.

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…ialization overhead (ray-project#59919)

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Fix typos in docs and docstrings. If any are too trivial, just let me know.
Agent assisted.

---------

Signed-off-by: Seiji Eicher <58963096+eicherseiji@users.noreply.github.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
## Description
This was used early in the development of the Ray Dashboard and is not
used anymore, so we should remove it (I recently came across this).

---------

Signed-off-by: Philipp Moritz <pcmoritz@gmail.com>
…7735)

There have been asks for enabling the --temp-dir flag on a per-node
basis, in contrast to the current implementation, which only allows
all nodes' temp dirs to be configured from the head node's temp dir
configuration.

This PR introduces the capability for the Ray temp directory
to be specified on a per-node basis, eliminating the restriction
that the --temp-dir flag can only be used in conjunction with the --head
flag. get_user_temp_dir and get_ray_temp_dir have been marked as
deprecated and replaced with the resolve_user_ray_temp_dir function
to ensure that the temp dir is consistent across the system.

## New Behaviors
**Temp dir**

|  | head node temp_dir NOT specified | head node temp_dir specified |
|---|---|---|
| worker node temp_dir NOT specified | Worker & head node use `/tmp/ray` | Worker uses head node's temp_dir |
| worker node temp_dir specified | Worker uses its own specified temp_dir; head node uses default | Each node uses its own specified temp_dir |

**Object spilling directory**

| | head node spilling dir NOT specified | head node spilling dir specified |
|---|---|---|
| worker node spilling dir NOT specified | Each node uses its own temp_dir as spilling dir | Worker uses head node's spilling dir |
| worker node spilling dir specified | Worker uses its own specified spilling dir; head node uses its temp_dir | Each node uses its own specified spilling dir |
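
The tables above boil down to a simple fallback order. A pure-Python sketch of that resolution logic from a worker node's point of view (the function names below are hypothetical illustrations, not the actual Ray helpers):

```python
DEFAULT_TEMP_DIR = "/tmp/ray"

def resolve_temp_dir(worker_temp_dir, head_temp_dir):
    # A worker-specified value wins; otherwise fall back to the head node's
    # setting; otherwise use the default.
    return worker_temp_dir or head_temp_dir or DEFAULT_TEMP_DIR

def resolve_spill_dir(worker_spill_dir, head_spill_dir, worker_temp_dir, head_temp_dir):
    # Spilling prefers an explicitly configured spilling dir (worker first,
    # then head) and otherwise falls back to the node's own resolved temp dir.
    return (
        worker_spill_dir
        or head_spill_dir
        or resolve_temp_dir(worker_temp_dir, head_temp_dir)
    )
```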

## Testing
We tested the expected behaviors on a local multi-node kuberay cluster
by verifying that:
1. nodes default to `/tmp/ray` when no node temp_dir is specified
2. non-head nodes picked up the head node's temp_dir specification when
only the head node's temp_dir was specified
3. non-head nodes can take independent temp_dir regardless of head node
temp_dir when specified
4. nodes default to their own temp dir as spilling directory for all
three cases above
5. nodes default to head node's spilling directory when only head node
spilling directory is specified
6. nodes can have their spilling directory specified independent of the
head node's spilling directory

Behaviors were verified by checking that the directories were created
and that the right information was fetched from the head node.

## Related issues

ray-project#47262
ray-project#51218
ray-project#40628
ray-project#32962
ray-project#40628

## Types of change

- [ ] Bug fix πŸ›
- [ ] New feature ✨
- [x] Enhancement πŸš€
- [ ] Code refactoring πŸ”§
- [ ] Documentation update πŸ“–
- [ ] Chore 🧹
- [ ] Style 🎨

## Checklist

**Does this PR introduce breaking changes?**
- [ ] Yes ⚠️
- [x] No
This PR should not introduce any breaking changes just yet. However,
this PR deprecates `get_user_temp_dir` and `get_ray_temp_dir`. The two
functions will be marked as errors in the next version update.

---------

Signed-off-by: davik <davik@anyscale.com>
Co-authored-by: davik <davik@anyscale.com>
…9949)


## Description

### [Train] Train Benchmark to include time to first batch

- In train benchmarks, include time to first batch when reporting
throughput.
- Without this, the results are misleading because throughput with preserve-order
looks better than without it (see the sketch below).
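
For illustration, with hypothetical numbers, ignoring time to first batch inflates the reported throughput:

```python
rows = 1_000_000
steady_state_s = 50.0         # time spent once batches are flowing
time_to_first_batch_s = 10.0  # startup cost that the old report ignored

naive_throughput = rows / steady_state_s                              # 20,000 rows/s
reported_throughput = rows / (steady_state_s + time_to_first_batch_s)  # ~16,667 rows/s
print(naive_throughput, reported_throughput)
```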

## Related issues

## Additional information

Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
…59953)

updating `build-ray-docker.sh` 

- removing CONSTRAINTS_FILE build arg
- copying constraint file to "${CPU_TMP}/requirements_compiled.txt"

Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
## Description
Since a number of expressions have been added, this seems like a good
time to reorganize the expression tests so that it's clear what is
covered by tests.

## Related issues

## Additional information

---------

Signed-off-by: Goutam <goutam@anyscale.com>
… ActorPool (ray-project#59645)

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
…dir (ray-project#59941)

Rename build_dir parameter to context_dir and move it to the last
argument position for better API consistency.

πŸ€– Generated with [Claude Code](https://claude.com/claude-code)

Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
## Description
Adding py313 dependency sets (not in use yet):

- Adding requirements_compiled_py3.13.txt
- Adding requirements_compiled_py3.10,11,12.txt symlinked to
requirements_compiled.txt
- updated script to remove header from requirements_compiled_py* files
- parameterizing requirements_compiled_py{PYTHON_VERSION} in raydepsets
config
- Generating py313 dependency sets
- Moving the ray_dev.in requirements file to the deplocks directory

---------

Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
…oject#59954)

There were a couple of missing updates to RAY_testing_rpc_failure that I
noticed when we moved to a JSON format in ray-project#58886.

---------

Signed-off-by: joshlee <joshlee@anyscale.com>
…y-project#59895)

## Description
For ray-project#59508 and
ray-project#59581, using TicTacToe would
cause the following error:

```
ray::DQN.train() (pid=88183, ip=127.0.0.1, actor_id=2b775f13e808cc4aaaa23bde01000000, repr=DQN(env=<class 'ray.rllib.examples.envs.classes.multi_agent.tic_tac_toe.TicTacToe'>; env-runners=0; learners=0; multi-agent=True))
  File "ray/python/ray/tune/trainable/trainable.py", line 331, in train
    raise skipped from exception_cause(skipped)
  File "ray/python/ray/tune/trainable/trainable.py", line 328, in train
    result = self.step()
  File "ray/python/ray/rllib/algorithms/algorithm.py", line 1242, in step
    train_results, train_iter_ctx = self._run_one_training_iteration()
  File "ray/python/ray/rllib/algorithms/algorithm.py", line 3666, in _run_one_training_iteration
    training_step_return_value = self.training_step()
  File "ray/python/ray/rllib/algorithms/dqn/dqn.py", line 646, in training_step
    return self._training_step_new_api_stack()
  File "ray/python/ray/rllib/algorithms/dqn/dqn.py", line 668, in _training_step_new_api_stack
    self.local_replay_buffer.add(episodes)
  File "ray/python/ray/rllib/utils/replay_buffers/prioritized_episode_buffer.py", line 314, in add
    existing_eps.concat_episode(eps)
  File "ray/python/ray/rllib/env/multi_agent_episode.py", line 862, in concat_episode
    sa_episode.concat_episode(other.agent_episodes[agent_id])
  File "ray/python/ray/rllib/env/single_agent_episode.py", line 618, in concat_episode
    assert self.t == other.t_started
AssertionError
```

In the multi-agent episode's `concat_episode`, we check whether any agent
has not yet received the next observation for an observation/action pair. This
results in a hanging action, where one episode contains the observation and
action and the next contains the resulting observation, reward, etc. This
[code](https://github.com/ray-project/ray/blob/22cf6ef6af2cddc233bca7ce59668ed8f4bbb17e/rllib/env/multi_agent_episode.py#L848)
checks whether this has happened and then adds an extra step at the beginning to
include this hanging data.

However, in testing, the multi-agent episode `cut` method already
implements this (using `slice` instead causes a hidden bug), meaning
that an extra, unnecessary step's data is being added, resulting in the
episode beginnings not lining up.
Therefore, this PR removes this code and replaces it with a simple check that
assumes the hanging action is equivalent to the initial action in
the next episode.

For testing, I found that the `concat_episode` test was using `slice`,
which doesn't account for hanging data, while `cut`, which is used in the
env-runner, does. I modified the test to be more functional: I
created a custom environment whose agents take actions at different
frequencies and return the agent's timestep as the observation. This
allows us to test by concatenating all episodes with the same ID and
checking that the observations increase 0, 1, 2, ..., ensuring that no
data goes missing for users.

---------

Signed-off-by: Mark Towers <mark@anyscale.com>
Co-authored-by: Mark Towers <mark@anyscale.com>
)

## Description

Before this PR, users could not tell why objects could not be
reconstructed, since the response only contained a generic error
message.
```
All copies of 8774b2e5680a48cdffffffffffffffffffffffff0200000003000000 have been lost due to node failure. Check cluster logs (`/tmp/ray/session_latest/logs`) for more information about the failure."
```


### Object Recovery Flow (After This PR)

#### Step 1: Reference & Ownership Check
- `if (!ref_exists)` β†’ `OBJECT_UNRECONSTRUCTABLE_REF_NOT_FOUND`
```
[OBJECT_UNRECONSTRUCTABLE_REF_NOT_FOUND] The object cannot be reconstructed because its reference was not found in the reference counter. Please file an issue at https://github.com/ray-project/ray/issues.
```
- `if (!owned_by_us)` β†’ `OBJECT_UNRECONSTRUCTABLE_BORROWED`
```
[OBJECT_UNRECONSTRUCTABLE_BORROWED] The object cannot be reconstructed because it crossed an ownership boundary. Only the owner of an object can trigger reconstruction, but this worker borrowed the object from another worker.
```

#### Step 2: Try to Pin Existing Copies
- Look up object locations via `object_lookup_`
- If copies exist on other nodes β†’ `PinExistingObjectCopy`
   - If all locations fail β†’ proceed to Step 3
- If no copies exist β†’ proceed to Step 3

#### Step 3: Lineage Eligibility Check
- I define eligibility as: we don't need to actually rerun the task, and
we already know whether it is eligible for reconstruction.
- `INELIGIBLE_PUT` β†’ `OBJECT_UNRECONSTRUCTABLE_PUT`
```
[OBJECT_UNRECONSTRUCTABLE_PUT] The object cannot be reconstructed because it was created by ray.put(), which has no task lineage. To prevent this error, return the value from a task instead.

```
- `INELIGIBLE_NO_RETRIES` β†’ `OBJECT_UNRECONSTRUCTABLE_RETRIES_DISABLED`
```
[OBJECT_UNRECONSTRUCTABLE_RETRIES_DISABLED] The object cannot be reconstructed because the task was created with max_retries=0. Consider enabling retries using `@ray.remote(max_retries=N)`.
```
- `INELIGIBLE_LINEAGE_EVICTED` β†’
`OBJECT_UNRECONSTRUCTABLE_LINEAGE_EVICTED`
```
[OBJECT_UNRECONSTRUCTABLE_LINEAGE_EVICTED] The object cannot be reconstructed because its lineage has been evicted to reduce memory pressure. To prevent this error, set the environment variable RAY_max_lineage_bytes=<bytes> (default 1GB) during `ray start`.
```
- `INELIGIBLE_LOCAL_MODE` β†’ `OBJECT_UNRECONSTRUCTABLE_LOCAL_MODE`
```
[OBJECT_UNRECONSTRUCTABLE_LOCAL_MODE] The object cannot be reconstructed because Ray is running in local mode. Local mode does not support object reconstruction.
```
- `INELIGIBLE_LINEAGE_DISABLED` ->
`OBJECT_UNRECONSTRUCTABLE_LINEAGE_DISABLED`
```
[OBJECT_UNRECONSTRUCTABLE_LINEAGE_DISABLED] The object cannot be reconstructed because lineage reconstruction is disabled system-wide (object_reconstruction_enabled=False).
```
- `ELIGIBLE` β†’ proceed to Step 4

#### Step 4: Task Resubmission
- `OBJECT_UNRECONSTRUCTABLE_TASK_CANCELLED`
```
[OBJECT_UNRECONSTRUCTABLE_TASK_CANCELLED] The object cannot be reconstructed because the task that would produce it was cancelled.
```
- `OBJECT_UNRECONSTRUCTABLE_MAX_ATTEMPTS_EXCEEDED`
```
[OBJECT_UNRECONSTRUCTABLE_MAX_ATTEMPTS_EXCEEDED] The object cannot be reconstructed because the maximum number of task retries has been exceeded. Consider increasing the number of retries using `@ray.remote(max_retries=N)`.
```

#### Step 5: Dependency Recovery (Recursive)
```cpp
for (const auto &dep : task_deps) {
    auto error = RecoverObject(dep);  // Recursive call to Step 1
    if (error.has_value()) {
        recovery_failure_callback_(dep, *error, true);
    }
}
// Dependencies can fail with any error from Steps 1–4
```
This PR also:
- Adds appropriate log messages at each step
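
As a user-facing illustration of the `OBJECT_UNRECONSTRUCTABLE_PUT` category in Step 3 (a sketch; which exception class carries the new message is not stated here), the remedy in the error text is to return values from tasks so the owner has lineage to replay:

```python
import ray

@ray.remote
def produce():
    # Reconstructable: if every copy is lost, Ray can re-run this task.
    return list(range(1000))

reconstructable_ref = produce.remote()

# Not reconstructable: ray.put() has no task lineage, so losing all copies
# would surface the OBJECT_UNRECONSTRUCTABLE_PUT error described above.
non_reconstructable_ref = ray.put(list(range(1000)))
```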

## Related issues
Closes ray-project#59562

## Additional information

---------

Signed-off-by: yicheng <yicheng@anyscale.com>
Co-authored-by: yicheng <yicheng@anyscale.com>
…-project#59425)

Fixes the issue where the interpreter would crash instead of providing a
useful error message.

Previously, calling ray.kill() on an ActorHandle from a previous Ray
session (after ray.shutdown() and ray.init()) would crash the Python
interpreter with a C++ assertion failure.

This fix:
1. Prevents the crash by only calling OnActorKilled() in C++ when the
kill operation succeeds
2. Catches the error in Python and converts it to a helpful ValueError
explaining that ActorHandle objects are not valid across sessions
3. Adds a test to verify the fix

The error message now clearly explains:
- ActorHandle objects are not valid across Ray sessions
- When this typically happens (after shutdown/restart)
- What the user should do (create a new actor handle)
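
A minimal reproduction of the scenario this fix covers (sketch; the exact ValueError wording is paraphrased from the description above):

```python
import ray

@ray.remote
class Counter:
    def ping(self):
        return "pong"

ray.init()
handle = Counter.remote()
ray.shutdown()

ray.init()  # new session; `handle` still points at the old session's actor
try:
    ray.kill(handle)
except ValueError as err:
    # Before this fix: a C++ assertion crashed the interpreter.
    # After this fix: a ValueError explains that actor handles are not
    # valid across Ray sessions and a new handle must be created.
    print(err)
```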

Fixes ray-project#59340


---------
Signed-off-by: kriyanshii <kriyanshishah06@gmail.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
## Description

ray-project@7198193
made a backwards-incompatible change to an env variable name, leading to a
regression in the `scheduling_test_many_0s_tasks_many_nodes` release test
(the env var is used by the Anyscale cluster that runs the
release tests). Reverting this change to fix the problem.

The release test is now passing:
https://buildkite.com/ray-project/release/builds/74397
## Related issues

## Additional information

---------

Signed-off-by: sampan <sampan@anyscale.com>
Co-authored-by: sampan <sampan@anyscale.com>
…server bindings (ray-project#59852)

## Description
This PR adds IPv6 localhost support and improves server binding security
by eliminating 0.0.0.0 bindings.

### goal
- Avoid hardcoding 127.0.0.1, which breaks IPv6 support.
- Avoid proactively using 0.0.0.0, which is insecure.

##### Server side
- For local-only servers, bind to localhost (resolved via
GetLocalhostIP()/ get_localhost_ip(); IPv4/IPv6).
- For servers that need cross-node communication, bind to the node IP
instead of 0.0.0.0.
- If the user explicitly configures a bind address, always respect the
user setting.

##### Client side
- Use localhost when connecting to local-only servers (resolved via
get_localhost_ip()).
- Use the node IP when connecting to servers that require cross-node
communication.
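
The idea of resolving localhost rather than hardcoding 127.0.0.1 can be sketched with the standard library (illustration only; `get_localhost_ip()`/`get_all_interfaces_ip()` are internal Ray helpers and are not shown here):

```python
import socket

def resolve_localhost_ip() -> str:
    # Returns whichever loopback address the system resolves for "localhost",
    # e.g. "::1" in IPv6 environments and "127.0.0.1" otherwise, instead of
    # assuming the IPv4 loopback address.
    infos = socket.getaddrinfo("localhost", None)
    return infos[0][4][0]

print(resolve_localhost_ip())
```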



#### Note:
`0.0.0.0 β†’ node_ip` related changes this PR made:
- GCS Server:  `0.0.0.0 β†’ node_ip`
- Raylet gRPC:  `0.0.0.0 β†’ node_ip`
- Core Worker gRPC:  `0.0.0.0 β†’ node_ip`
- Object Manager:  `0.0.0.0 β†’ node_ip`
- Remote Python Debugger: `0.0.0.0 β†’ node_ip`
- Ray Client Server already passed the node IP before this PR, but its
default `--host` was 0.0.0.0. This PR changed it to localhost.
- Dashboard Server by default binds to localhost. This PR just updated
the documentation to suggest using node IP instead of 0.0.0.0 for remote
access.
- Non-Ray-core components (e.g., Serve): This PR keeps them binding to
all interfaces as before, but replaced hardcoded 0.0.0.0 with
`get_all_interfaces_ip()` to handle IPv6 (returns :: for IPv6
environments).





## Related issues

## Additional information

---------

Signed-off-by: yicheng <yicheng@anyscale.com>
Co-authored-by: yicheng <yicheng@anyscale.com>
…-project#59113)

## Description
This PR adds three test cases that validate the interoperability between
the Intel GPU software stack and Ray in various deployment scenarios.
These tests assume that the environment is already configured as
required. They serve as smoke tests that users can run to confirm a
correct deployment.

Covered deployment scenarios:

- Single GPU on a single node (sanity check)
- Multiple GPUs on a single node (scale-up)
- Multiple GPUs across multiple nodes (scale-out)

## Additional information
- Tests will automatically skip if `dpctl` (an Intel GPU dependency) is
not installed or if the environment is not properly configured for the
given test.
- Tests require the `RAY_PYTEST_USE_GPU` flag to be set, for consistency
with other Ray GPU tests.


## Motivation
I understand that Ray currently does not include Intel GPUs in its CI
infrastructure, so these tests will be skipped during CI runs and may
not provide immediate value to the Ray development team. However, they
can serve as a useful verification and troubleshooting tool for Ray
users deploying on Intel GPUs, making them worth upstreaming. They also
require very little maintanence as they'll simply skip gracefully and
provide future readiness in case Intel GPU support in Ray will expand.

---------

Signed-off-by: Jakub Zimny <jakub.zimny@intel.com>
…#59478)

## Description
I'm hoping to make the example publishing process smoother when setting
up the CI for testing (release tests).

Currently, when publishing examples and setting up release tests in CI,
we have to manually tweak multiple BUILD.bazel files to make our `ci/aws.yaml`
and `ci/gce.yaml` discoverable to the CI release package
(release/BUILD.bazel). This adds overhead and confusion for the
writer and clutters the files over time.

Solution:
Consolidate all `ci/aws.yaml` and `ci/gce.yaml` configs into a single
filegroup under doc/BUILD.bazel, and discover that unique filegroup from
release/BUILD.bazel. The only requirement is to match the `doc/source/**/ci/aws.yaml` or
`doc/source/**/ci/gce.yaml` pattern, which matches our standard way of
publishing examples (see the sketch below).
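
A sketch of what the consolidated filegroup in doc/BUILD.bazel could look like (the target name matches the bazel query in the Tests section below; the visibility line is an assumption):

```
filegroup(
    name = "all_examples_ci_configs",
    srcs = glob([
        "source/**/ci/aws.yaml",
        "source/**/ci/gce.yaml",
    ]),
    # Assumed: release/BUILD.bazel needs to see this target.
    visibility = ["//release:__pkg__"],
)
```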

### Changes
- **Updated** `doc/BUILD.bazel` to define one single filegroup for all
of `ci/` configs. Use glob patterns to catch all `aws.yaml` and
`gce.yaml` under a `ci/` folder
- **Updated** `release/BUILD.bazel` to reference that filegroup
- **Updated** all inner doc/** BUILD.bazel accordingly with their own
local filegroups

### Tests


<details>
<summary>Manual review with bazel query 'kind("source file",
deps(//doc:all_examples_ci_configs))'</summary>

```
(repo_ray_docs) aydin@aydin-JCDF7JJD9H doc % bazel query 'kind("source file", deps(//doc:all_examples_ci_configs))'
//doc:source/ray-overview/examples/e2e-audio/ci/aws.yaml
//doc:source/ray-overview/examples/e2e-audio/ci/gce.yaml
//doc:source/ray-overview/examples/e2e-multimodal-ai-workloads/ci/aws.yaml
//doc:source/ray-overview/examples/e2e-multimodal-ai-workloads/ci/gce.yaml
//doc:source/ray-overview/examples/e2e-rag/ci/aws.yaml
//doc:source/ray-overview/examples/e2e-rag/ci/gce.yaml
//doc:source/ray-overview/examples/e2e-timeseries/ci/aws.yaml
//doc:source/ray-overview/examples/e2e-timeseries/ci/gce.yaml
//doc:source/ray-overview/examples/e2e-xgboost/ci/aws.yaml
//doc:source/ray-overview/examples/e2e-xgboost/ci/gce.yaml
//doc:source/ray-overview/examples/entity-recognition-with-llms/ci/aws.yaml
//doc:source/ray-overview/examples/entity-recognition-with-llms/ci/gce.yaml
//doc:source/ray-overview/examples/langchain_agent_ray_serve/ci/aws.yaml
//doc:source/ray-overview/examples/langchain_agent_ray_serve/ci/gce.yaml
//doc:source/ray-overview/examples/llamafactory-llm-fine-tune/ci/aws.yaml
//doc:source/ray-overview/examples/llamafactory-llm-fine-tune/ci/gce.yaml
//doc:source/ray-overview/examples/mcp-ray-serve/ci/aws.yaml
//doc:source/ray-overview/examples/mcp-ray-serve/ci/gce.yaml
//doc:source/ray-overview/examples/object-detection/ci/aws.yaml
//doc:source/ray-overview/examples/object-detection/ci/gce.yaml
//doc:source/serve/tutorials/asynchronous-inference/ci/aws.yaml
//doc:source/serve/tutorials/asynchronous-inference/ci/gce.yaml
//doc:source/serve/tutorials/deployment-serve-llm/ci/aws.yaml
//doc:source/serve/tutorials/deployment-serve-llm/ci/gce.yaml
//doc/source/data/examples:unstructured_data_ingestion/ci/aws.yaml
//doc/source/data/examples:unstructured_data_ingestion/ci/gce.yaml
//doc/source/train/examples/pytorch:deepspeed_finetune/ci/aws.yaml
//doc/source/train/examples/pytorch:deepspeed_finetune/ci/gce.yaml
//doc/source/train/examples/pytorch:distributing-pytorch/ci/aws.yaml
//doc/source/train/examples/pytorch:distributing-pytorch/ci/gce.yaml
//doc/source/train/examples/pytorch:pytorch-fsdp/ci/aws.yaml
//doc/source/train/examples/pytorch:pytorch-fsdp/ci/gce.yaml
//doc/source/train/examples/pytorch:pytorch-profiling/ci/aws.yaml
//doc/source/train/examples/pytorch:pytorch-profiling/ci/gce.yaml
Loading: 1 packages loaded
```

</details>

I also ran all release tests whose ci/ configs are affected by this
change and verified that their ci/ configuration is still being read
correctly (i.e., the Anyscale job is launched as expected).
https://buildkite.com/ray-project/release/builds/73954#
(The test failures are due to application-level errors after launching the
Anyscale job, not because of this change.)

---------

Signed-off-by: Aydin Abiar <aydin@anyscale.com>
Signed-off-by: Aydin Abiar <62435714+Aydin-ab@users.noreply.github.com>
Co-authored-by: Aydin Abiar <aydin@anyscale.com>
## Description
Automatically exclude common directories (.git, .venv, venv,
__pycache__) when uploading working_dir in runtime environment packages.

At a minimum we need to exclude `.git/` because unlike the others,
nobody includes .git/ in `.gitignore`. This causes Ray to throw a
`ray.exceptions.RuntimeEnvSetupError` if your `.git` dir is larger than
512 MiB.

I also updated the documentation in handling-dependencies.rst and
improved the error message if the env exceeds the GCS_STORAGE_MAX_SIZE
limit.
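
Before this change, the workaround was to list the directories manually via the existing `excludes` field of the runtime environment, shown below as a sketch; after this change they are excluded by default:

```python
import ray

ray.init(
    runtime_env={
        "working_dir": ".",
        # Previously had to be listed by hand to keep a large .git/ out of the
        # uploaded package; now excluded automatically.
        "excludes": [".git", ".venv", "venv", "__pycache__"],
    }
)
```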

## Related issues
N/A

## Additional information
PR pytorch/tutorials#3709 was failing to
run because the PyTorch tutorials' .git/ folder is huge.

---------

Signed-off-by: Ricardo Decal <public@ricardodecal.com>
Signed-off-by: Ricardo Decal <crypdick@users.noreply.github.com>
Signed-off-by: Ricardo Decal <rdecal@anyscale.com>
Co-authored-by: Ricardo Decal <public@ricardodecal.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…roject#59956)


## Description


**[Data] Fix test_execution_optimizer_limit_pushdown determinism**

Fix by adding `override_num_blocks=1`

```

[2026-01-08T00:20:42Z] =================================== FAILURES ===================================
--
[2026-01-08T00:20:42Z] ____________________ test_limit_pushdown_basic_limit_fusion ____________________
[2026-01-08T00:20:42Z]
[2026-01-08T00:20:42Z] ray_start_regular_shared_2_cpus = RayContext(dashboard_url='127.0.0.1:8265', python_version='3.10.19', ray_version='3.0.0.dev0', ray_commit='{{RAY_COMMIT_SHA}}')
[2026-01-08T00:20:42Z]
[2026-01-08T00:20:42Z]     def test_limit_pushdown_basic_limit_fusion(ray_start_regular_shared_2_cpus):
[2026-01-08T00:20:42Z]         """Test basic Limit -> Limit fusion."""
[2026-01-08T00:20:42Z]         ds = ray.data.range(100).limit(5).limit(100)
[2026-01-08T00:20:42Z] >       _check_valid_plan_and_result(
[2026-01-08T00:20:42Z]             ds,
[2026-01-08T00:20:42Z]             "Read[ReadRange] -> Limit[limit=5]",
[2026-01-08T00:20:42Z]             [{"id": i} for i in range(5)],
[2026-01-08T00:20:42Z]             check_ordering=False,
[2026-01-08T00:20:42Z]         )
[2026-01-08T00:20:42Z]
[2026-01-08T00:20:42Z] python/ray/data/tests/test_execution_optimizer_limit_pushdown.py:40:
[2026-01-08T00:20:42Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2026-01-08T00:20:42Z]
[2026-01-08T00:20:42Z] ds = limit=5
[2026-01-08T00:20:42Z] +- Dataset(num_rows=5, schema={id: int64})
[2026-01-08T00:20:42Z] expected_plan = 'Read[ReadRange] -> Limit[limit=5]'
[2026-01-08T00:20:42Z] expected_result = [{'id': 0}, {'id': 1}, {'id': 2}, {'id': 3}, {'id': 4}]
[2026-01-08T00:20:42Z] expected_physical_plan_ops = None, check_ordering = False
[2026-01-08T00:20:42Z]
[2026-01-08T00:20:42Z]     def _check_valid_plan_and_result(
[2026-01-08T00:20:42Z]         ds: Dataset,
[2026-01-08T00:20:42Z]         expected_plan: Plan,
[2026-01-08T00:20:42Z]         expected_result: List[Dict[str, Any]],
[2026-01-08T00:20:42Z]         expected_physical_plan_ops=None,
[2026-01-08T00:20:42Z]         check_ordering=True,
[2026-01-08T00:20:42Z]     ):
[2026-01-08T00:20:42Z]         actual_result = ds.take_all()
[2026-01-08T00:20:42Z]         if check_ordering:
[2026-01-08T00:20:42Z]             assert actual_result == expected_result
[2026-01-08T00:20:42Z]         else:
[2026-01-08T00:20:42Z] >           assert rows_same(pd.DataFrame(actual_result), pd.DataFrame(expected_result))
[2026-01-08T00:20:42Z] E           AssertionError: assert False
[2026-01-08T00:20:42Z] E            +  where False = rows_same(   id\n0  25\n1  26\n2  27\n3  28\n4  29,    id\n0   0\n1   1\n2   2\n3   3\n4   4)
[2026-01-08T00:20:42Z] E            +    where    id\n0  25\n1  26\n2  27\n3  28\n4  29 = <class 'pandas.core.frame.DataFrame'>([{'id': 25}, {'id': 26}, {'id': 27}, {'id': 28}, {'id': 29}])
[2026-01-08T00:20:42Z] E            +      where <class 'pandas.core.frame.DataFrame'> = pd.DataFrame
[2026-01-08T00:20:42Z] E            +    and      id\n0   0\n1   1\n2   2\n3   3\n4   4 = <class 'pandas.core.frame.DataFrame'>([{'id': 0}, {'id': 1}, {'id': 2}, {'id': 3}, {'id': 4}])
[2026-01-08T00:20:42Z] E            +      where <class 'pandas.core.frame.DataFrame'> = pd.DataFrame
Β 
```



## Related issues

## Additional information

Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
iamjustinhsu and others added 22 commits January 21, 2026 11:33
## Description
After ray-project#60017 got merged, I forgot to update the `test_bundle_queue` test
suite. This PR adds more tests for `num_blocks`, `num_rows`,
`estimate_size_bytes`, and `len(queue)`
## Related issues

## Additional information

---------

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
…ject#60338)

## Description
This PR adds support for Google Cloud's 7th generation TPU (Ironwood).

The TPU 7x generation introduces a change in the accelerator type naming
convention reported by the environment. Unlike previous generations
(v6e-16, v5p-8, etc.), 7x instances report types starting with tpu (e.g.
tpu7x-16).

This PR accounts for the new format and enables Ray to auto-detect the
TPU 7x hardware (users don't have to manually configure env
vars). This is critical for libraries like Ray Train and for vLLM
support, where the automatic device discovery is used during JAX
initialization.
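
A sketch of the naming difference the PR has to handle (for illustration only; the real detection logic lives in Ray's TPU accelerator utilities):

```python
def is_tpu_7x(accelerator_type: str) -> bool:
    # Earlier generations report types like "v6e-16" or "v5p-8";
    # 7th-generation (Ironwood) instances report types like "tpu7x-16".
    return accelerator_type.startswith("tpu")

assert is_tpu_7x("tpu7x-16")
assert not is_tpu_7x("v6e-16")
```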

## Related issues
Fixes ray-project#59964

## Additional information
For more info about TPU v7x:
https://docs.cloud.google.com/tpu/docs/tpu7x.

---------

Signed-off-by: ryanaoleary <ryanaoleary@google.com>
## Description

1. The flakiness in test_flush_worker_result_queue is that, when
queue_backlog_length is 0, after `wg._start()` we immediately call
wg.poll_status() and assert it is finished, but sometimes rank 0's training
thread is still running at that instant.
This leads to the error below:
```
where False = WorkerGroupPollStatus(worker_statuses={0: WorkerStatus(running=True, error=None, training_report=None), 1: WorkerStatus(running=False, error=None, training_report=None), 2: WorkerStatus(running=False, error=None, training_report=None), 3: WorkerStatus(running=False, error=None, training_report=None)}).finished
```
2. Use the same pattern in `test_poll_status_finished` in the same file
to address this flakiness.
3. Increase `test_placement_group_handle` to medium size to avoid the timeout:
```

python/ray/train/v2/tests/test_placement_group_handle.py::test_slice_handle_shutdown -- Test timed out at 2026-01-20 18:12:46 UTC --
--
[2026-01-20T18:15:17Z] ERROR [100%]
[2026-01-20T18:15:17Z]
[2026-01-20T18:15:17Z] ==================================== ERRORS ====================================
[2026-01-20T18:15:17Z] _________________ ERROR at setup of test_slice_handle_shutdown _________________
[2026-01-20T18:15:17Z]
[2026-01-20T18:15:17Z]     @pytest.fixture(autouse=True)
[2026-01-20T18:15:17Z]     def ray_start():
[2026-01-20T18:15:17Z] >       ray.init(num_cpus=4)
[2026-01-20T18:15:17Z]
[2026-01-20T18:15:17Z] python/ray/train/v2/tests/test_placement_group_handle.py:16:
[2026-01-20T18:15:17Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2026-01-20T18:15:17Z] /rayci/python/ray/_private/client_mode_hook.py:104: in wrapper
[2026-01-20T18:15:17Z]     return func(*args, **kwargs)
[2026-01-20T18:15:17Z] /rayci/python/ray/_private/worker.py:1910: in init
[2026-01-20T18:15:17Z]     _global_node = ray._private.node.Node(
[2026-01-20T18:15:17Z] /rayci/python/ray/_private/node.py:402: in __init__
[2026-01-20T18:15:17Z]     time.sleep(0.1)
[2026-01-20T18:15:17Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2026-01-20T18:15:17Z]
[2026-01-20T18:15:17Z] signum = 15
[2026-01-20T18:15:17Z] frame = <frame at 0x55cf6cb749f0, file '/rayci/python/ray/_private/node.py', line 402, code __init__>
[2026-01-20T18:15:17Z]
[2026-01-20T18:15:17Z]     def sigterm_handler(signum, frame):
[2026-01-20T18:15:17Z] >       sys.exit(signum)
[2026-01-20T18:15:17Z] E       SystemExit: 15
[2026-01-20T18:15:17Z]
[2026-01-20T18:15:17Z] /rayci/python/ray/_private/worker.py:1670: SystemExit


```

4. Add a `manual` tag to the `test_jax_gpu` bazel target to temporarily
disable CI for this unit test, given that the PyPI jax version now requires
at least CUDA 12.2 while our CI runs on CUDA 12.1.

---------

Signed-off-by: Lehui Liu <lehui@anyscale.com>
…d of the highest available version. (ray-project#60378)

Signed-off-by: irabbani <israbbani@gmail.com>
ray-project#60384)

This reverts commit c9ff164.
After investigations by the core team, we were able to determine that this
minor OTEL version upgrade dropped task/actor creation throughput by
around 25%, from 600ms to 450ms. Buildkite run that verifies this fix:
https://buildkite.com/ray-project/release/builds/76576#019be280-c80f-40c6-9907-904ff5f93d4b

---------

Signed-off-by: joshlee <joshlee@anyscale.com>
…t files (ray-project#60236)

## Description
Returning `None` when you don't have partition_columns selects all the
partitions, which is not the right behavior. Now return `[]` when no
partition columns are selected.

## Related issues
Closes ray-project#60215 

## Additional information

---------

Signed-off-by: Goutam <goutam@anyscale.com>
)

## Description
This PR reduces CI time for Data-only PRs by ensuring that changes to
`python/ray/data/` no longer trigger all ML/train tests unnecessarily.

## Related issues
Closes ray-project#59780 


Contribution by Gittensor, learn more at https://gittensor.io/

---------

Signed-off-by: DeborahOlaboye <deboraholaboye@gmail.com>
Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
Co-authored-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
…ray-project#60224)

There is no need for this callback to be nested so deeply inside of the
`TaskReceiver`. We can instead call it from `CoreWorker::ExecuteTask`
prior to returning.

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…ect#60352)

Use more conventional methods, so that it is clearer how the job status
info gets used.

This is in preparation for the anyscale cli/sdk migration.

Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
…ct#60376)

1. `ray.get(pg_handle.ready(),
timeout=self._worker_group_start_timeout_s)` covers both starting the
placement group and installing the runtime env; if the installation takes
longer than 30s, it will go into a scheduling/rescheduling phase (see the
sketch below).
2. This change raises the default timeout to 60s instead, to
mitigate the fixedScalingPolicy experience when packages are
installed via the runtime environment.
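
The call pattern in question, as a standalone sketch (standard Ray placement-group API; the 60s value mirrors the new default described above):

```python
import ray
from ray.util.placement_group import placement_group

ray.init()
pg = placement_group([{"CPU": 1}])
# Waits for the bundle to be scheduled; raises GetTimeoutError if it takes
# longer than the timeout (runtime env installation can push this past 30s).
ray.get(pg.ready(), timeout=60)
```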

Signed-off-by: Lehui Liu <lehui@anyscale.com>
…r restart (ray-project#58877)

Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
Co-authored-by: Dhyey Shah <dhyey2019@gmail.com>
…roject#57735)" (ray-project#60388)

## Description
The PR that introduced node-specific `temp-dir` specification capabilities
introduced a number of tests that fail on the Windows environment in
post-merge. To prevent these tests from blocking other PRs, we will be
reverting the PR until the tests have been fixed.

This reverts PR "[Core] Introduce node specific temp-dir specification.
(ray-project#57735)".

## Related issues
N/A

## Additional information
Temp dir PR: ray-project#57735

Signed-off-by: davik <davik@anyscale.com>
Co-authored-by: davik <davik@anyscale.com>
## Description
Remove the user-facing `meta_provider` parameter from all read APIs, its
docstrings, and related tests while keeping the metadata provider
implementations and logic.

## Related issues
Closes ray-project#60310

## Additional information
Deleted the `meta_provider` parameter from all read APIs, its deprecation
warnings, and the tests that explicitly test the parameter. I kept all
metadata provider implementations (`DefaultFileMetadataProvider,
BaseFileMetadataProvider, FileMetadataProvider`) and the internal uses of
`meta_provider`, such as in subclasses of
`ray.data.datasource.Datasource`. Tested the remaining read API tests.

---------

Signed-off-by: Hyunoh-Yeo <hyunoh.yeo@gmail.com>
…n. (ray-project#60245)

The GcsActorManager has public methods that are only used in the class
or in testing. This is a clear violation of encapsulation.

I've made these methods private. For tests that use them, I've made the
tests explicit friends of the GcsActorManager class. I don't love this, but
it's better than the status quo.

---------

Signed-off-by: irabbani <irabbani@anyscale.com>
Signed-off-by: Ibrahim Rabbani <irabbani@anyscale.com>
Co-authored-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
…y-project#60415)

This was from 7 years ago. We really truly don't support Python 2
anymore.

Signed-off-by: irabbani <israbbani@gmail.com>
`test_ray_intentional_errors` has been flaky, as there is a race between
the FINISHED task event reaching the GCS and the worker-dead timeout firing
(killing the actor in the test triggers the worker-dead callback). This has
been fixed by increasing `gcs_mark_task_failed_on_worker_dead_delay_ms`
(this affects both Linux and Windows but seems to be more frequent
on Windows).

We can consider increasing this config by default, but I feel this is an
edge case which may not be worth our time.
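
For reference, a GCS system config like this can be raised for a test cluster via `_system_config` (a sketch; the config name is taken from the description above, and whether the test uses exactly this mechanism is an assumption):

```python
import ray

ray.init(
    _system_config={
        # Give the FINISHED task event more time to reach the GCS before the
        # worker-dead timeout marks the task as failed.
        "gcs_mark_task_failed_on_worker_dead_delay_ms": 10_000,
    }
)
```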

successful run:
https://buildkite.com/ray-project/premerge/builds/57999#019bccb1-b365-4d5c-8f1c-e473e95959da/L11240

---------

Signed-off-by: sampan <sampan@anyscale.com>
Co-authored-by: sampan <sampan@anyscale.com>
…e cgroups (ray-project#59051)

## Description

Add a user guide for enabling Ray resource isolation on Kubernetes using
writable cgroups

## Related issues

## Additional information

---------

Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
## Description

EMA stats can become noisy if we process a bunch of NaN values.
This PR makes these warnings quiet, as they are to be
expected in our setting.

This change is important because the notifications from numpy are logged
without a stack trace or a hint as to which component they come from. So if
some of your metrics (which you may not even know of) are NaN, you'll see
this error all the time and have no idea how to fix it.
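
One standard way to silence such numpy notifications locally is shown below (a sketch of the general approach; whether the PR uses exactly this mechanism inside the EMA/stats code path is an assumption):

```python
import numpy as np

# Example of an operation that emits "RuntimeWarning: invalid value encountered":
counts = np.array([0.0, 2.0])
totals = np.array([0.0, 4.0])

# Silence the warning for this block only; 0/0 produces NaN, which is expected here.
with np.errstate(invalid="ignore"):
    rates = counts / totals  # -> [nan, 0.5]

print(rates)
```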
…ob_logs (ray-project#60346)

When using `JobSubmissionClient.tail_job_logs()` with authenticated Ray
clusters (e.g., clusters behind an authentication proxy or with
token-based auth), the WebSocket connection fails silently because
authentication headers are not passed to the WebSocket upgrade request.

### Current Behavior (Bug)
- `ray job submit` hangs indefinitely when trying to tail logs on
authenticated clusters
- SDK users cannot tail logs from authenticated clusters via WebSocket
- The connection closes silently without proper error reporting

### Root Cause

The bug exists in `python/ray/dashboard/modules/job/sdk.py` lines
497-502:

```python
async with aiohttp.ClientSession(
    cookies=self._cookies, headers=self._headers  # ← Headers set on session
) as session:
    ws = await session.ws_connect(
        f"{self._address}/api/jobs/{job_id}/logs/tail",
        ssl=self._ssl_context
    )  # ← But NOT passed to ws_connect()!
```

**Why this is a problem:** Unlike HTTP requests, aiohttp's
`ClientSession` does NOT automatically include session-level headers in
WebSocket upgrade requests. Per aiohttp's design, `ws_connect()` creates
fresh headers with only WebSocket protocol headers. Session headers must
be explicitly passed via the `headers` parameter.

**Evidence from aiohttp source:**
- HTTP requests call `_prepare_headers()` which merges session defaults
- `ws_connect()` creates a new `CIMultiDict()` without merging session
headers
- See: https://github.com/aio-libs/aiohttp/blob/master/aiohttp/client.py


## Changes Made

### 1. Fix in `sdk.py` (1 line changed)

**File:** `python/ray/dashboard/modules/job/sdk.py`
**Lines:** 500-502

```diff
  async with aiohttp.ClientSession(
      cookies=self._cookies, headers=self._headers
  ) as session:
      ws = await session.ws_connect(
          f"{self._address}/api/jobs/{job_id}/logs/tail",
+         headers=self._headers,
          ssl=self._ssl_context
      )
```

### 2. New Test in `test_sdk.py` (59 lines added)

**File:** `python/ray/dashboard/modules/job/tests/test_sdk.py`

Added `test_tail_job_logs_passes_headers_to_websocket()` which:
- Creates a `JobSubmissionClient` with authentication headers
- Mocks aiohttp's `ClientSession` and WebSocket connection
- Verifies that headers are explicitly passed to `ws_connect()`
- Ensures authentication headers reach the WebSocket upgrade request

## Testing

### Automated Testing

The new test `test_tail_job_logs_passes_headers_to_websocket` verifies
the fix by:
- Mocking the aiohttp WebSocket connection
- Checking that `ws_connect()` receives the `headers` parameter
- Asserting the headers match what was passed to `JobSubmissionClient`

### Manual Testing

To manually test this fix with an authenticated Ray cluster:

```python
from ray.job_submission import JobSubmissionClient

# Connect to authenticated cluster
client = JobSubmissionClient(
    address="https://your-ray-cluster/",
    headers={"Authorization": "Bearer your-token"}
)

# Submit job
job_id = client.submit_job(entrypoint="echo 'Hello, Ray!'")

# Tail logs (this should now work instead of hanging)
async for lines in client.tail_job_logs(job_id):
    print(lines, end="")
```

**Before this fix:** The above code would hang indefinitely or fail
silently.

**After this fix:** Logs stream correctly via WebSocket with
authentication.

## Impact

This one-line fix enables:

1. **`ray job submit` to work with authenticated clusters**
   - Previously would hang indefinitely when auto-tailing logs
   - Now streams logs correctly

2. **SDK users can tail logs from authenticated clusters**
   - `tail_job_logs()` now works with any authentication mechanism
   - Enables real-time log streaming for production deployments

3. **Proxied Ray clusters work correctly**
   - Ray clusters behind authentication proxies (common in production)
   - Multi-tenant Ray deployments with auth

4. **No breaking changes**
   - Backward compatible with non-authenticated clusters
   - Headers parameter is optional (None is valid)
   - Existing tests continue to pass

---------

Signed-off-by: Tri Lam <trilamsr@gmail.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
ray-project#60392)

## Description

Autoscaler v2 uses a 5-minute timeout for a VM instance to be up and
running. This timeout is often insufficient for rare instance types,
such as GPU and TPU instances, which can take much longer to provision.

Multiple users have encountered failures caused by this short timeout,
for example:

https://ray.slack.com/archives/C02GFQ82JPM/p1763416720349559


https://ray.slack.com/archives/C02GFQ82JPM/p1767966803946519?thread_ts=1767348863.118229&cid=C02GFQ82JPM


https://ray.slack.com/archives/C02GFQ82JPM/p1763533559614779?thread_ts=1763488820.585659&cid=C02GFQ82JPM


https://ray.slack.com/archives/C02GFQ82JPM/p1765205118511629?thread_ts=1764860436.694129&cid=C02GFQ82JPM

This PR increases the timeout from 5 minutes to 1 hour, making
Autoscaler v2 more robust for slow-provisioning instance types.

---------

Signed-off-by: Rueian Huang <rueiancsie@gmail.com>
Signed-off-by: Rueian <rueiancsie@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request is an automated daily merge from master to main, containing a wide range of changes across the repository. The most significant updates include a major refactoring of the CI/CD pipelines, moving towards a more unified and parameterized build system using wanda. This involves changes to Buildkite configurations, Dockerfiles, and build scripts. Additionally, there are substantial improvements to documentation, including new guides, better organization, and more realistic examples. The RLlib examples have been restructured, and several new features and APIs have been introduced in Ray Core and Ray Serve, such as token authentication, locality-aware routing, and improved observability. My review focuses on ensuring the consistency and correctness of these large-scale changes.

@github-actions

github-actions bot commented Feb 6, 2026

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the stale label Feb 6, 2026