daily merge: master → main 2026-01-23#756
Conversation
- Adding Anyscale template configs for the async inference template. Signed-off-by: harshit <harshit@anyscale.com>
As-is, this script installs the arm-architecture binary regardless of the actual machine type. Also bumping the version to unblock an issue when running with a newer OpenSSL version:
```
[ERROR 2026-01-07 03:46:50,067] crane_lib.py: 70 Crane command `/home/forge/.cache/bazel/_bazel_forge/5fe90af4e7d1ed9fcf52f59e39e126f5/external/crane_linux_x86_64/crane copy 029272617770.dkr.ecr.us-west-2.amazonaws.com/rayproject/citemp:d656a31a-ray-anyscale-py3.10-cpu us-west1-docker.pkg.dev/anyscale-oss-ci/anyscale/ray:pr-59902.3702b2-py310-cpu` failed with stderr:
2026/01/07 03:46:49 Copying from 029272617770.dkr.ecr.us-west-2.amazonaws.com/rayproject/citemp:d656a31a-ray-anyscale-py3.10-cpu to us-west1-docker.pkg.dev/anyscale-oss-ci/anyscale/ray:pr-59902.3702b2-py310-cpu
ERROR: gcloud failed to load: module 'lib' has no attribute 'X509_V_FLAG_NOTIFY_POLICY'
    gcloud_main = _import_gcloud_main()
    import googlecloudsdk.gcloud_main
    from googlecloudsdk.calliope import cli
    from googlecloudsdk.calliope import backend
    from googlecloudsdk.calliope import parser_extensions
    from googlecloudsdk.core.updater import update_manager
    from googlecloudsdk.core.updater import installers
    from googlecloudsdk.core.credentials import store
    from googlecloudsdk.api_lib.auth import util as auth_util
    from googlecloudsdk.core.credentials import google_auth_credentials as c_google_auth
    from oauth2client import client as oauth2client_client
    from oauth2client import crypt
    from oauth2client import _openssl_crypt
    from OpenSSL import crypto
    from OpenSSL import SSL, crypto
    from OpenSSL.crypto import (
    class X509StoreFlags:
        NOTIFY_POLICY: int = _lib.X509_V_FLAG_NOTIFY_POLICY

This usually indicates corruption in your gcloud installation or problems with your Python interpreter.
```
---------
Signed-off-by: andrew <andrew@anyscale.com> Signed-off-by: Andrew Pollack-Gray <andrew@anyscale.com>
…ect#58435)
- Fix memory safety for `core_worker` in the shutdown executor: use `weak_ptr` instead of a raw pointer.
- Ensure shutdown completes before the core worker destructs.
---------
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
No longer relevant. --------- Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Add documentation to 20 functions in ci/raydepsets/cli.py that were missing docstrings, improving code readability and maintainability.

Generated with [Claude Code]

Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com> Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
…9745)
## Description
Fixed a broken link in the `read_unity_catalog` docstring. The previous URL was outdated.
## Related issues
None
## Additional information
N/A
---------
Signed-off-by: Jess <jessica.jy.kong@gmail.com> Signed-off-by: Jessica Kong <jessica.jy.kong@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
## Description
`CountDistinct` allows users to compute the number of distinct values in a column, similar to SQL's `COUNT(DISTINCT ...)`.
## Related issues
Closes ray-project#58252
---------
Signed-off-by: my-vegetable-has-exploded <wy1109468038@gmail.com> Co-authored-by: Goutam <goutam@anyscale.com>
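A minimal usage sketch of the new aggregation; the import path and the exact result key are assumptions, not confirmed by this PR:

```python
import ray
from ray.data.aggregate import CountDistinct  # assumed import path for the new aggregation

ds = ray.data.from_items([{"category": c} for c in ["a", "b", "a", "c", "b", "a"]])

# Roughly equivalent to SQL's COUNT(DISTINCT category).
result = ds.aggregate(CountDistinct("category"))
print(result)  # expected to report 3 distinct values; the exact result key may differ
```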
…ct#59942)
Updating to reflect an issue that I debugged recently. The recommendation is to use `overlayfs` instead of the default `vfs` for faster container startup.

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…ialization overhead (ray-project#59919) Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Fix typos in docs and docstrings. If any are too trivial, just lmk. Agent assisted --------- Signed-off-by: Seiji Eicher <58963096+eicherseiji@users.noreply.github.com> Signed-off-by: Seiji Eicher <seiji@anyscale.com>
## Description
This was used early in the development of the Ray Dashboard and is not used any more, so we should remove it (I recently came across this).
---------
Signed-off-by: Philipp Moritz <pcmoritz@gmail.com>
…7735)
There have been asks for enabling the `--temp-dir` flag on a per-node basis, in contrast to the current implementation that only allows every node's temp dir to follow the head node's configuration. This PR introduces the capability for the Ray temp directory to be specified on a per-node basis, eliminating the restriction that the `--temp-dir` flag can only be used in conjunction with the `--head` flag. `get_user_temp_dir` and `get_ray_temp_dir` have been marked as deprecated and replaced with the `resolve_user_ray_temp_dir` function to ensure that the temp dir is consistent across the system.

## New Behaviors

**Temp dir**

| | head node temp_dir NOT specified | head node temp_dir specified |
|---|---|---|
| worker node temp_dir NOT specified | Worker & head node use `/tmp/ray` | Worker uses head node's temp_dir |
| worker node temp_dir specified | Worker uses its own specified temp_dir. Head node uses default | Each node uses its own specified temp_dir |

**Object spilling directory**

| | head node spilling dir NOT specified | head node spilling dir specified |
|---|---|---|
| worker node spilling dir NOT specified | Each node uses its own temp_dir as spilling dir | Worker uses head node's spilling dir |
| worker node spilling dir specified | Worker uses its own specified spilling dir. Head node uses its temp_dir | Each node uses its own specified spilling dir |

## Testing
We tested the expected behaviors on a local multi-node KubeRay cluster by verifying that:
1. nodes default to `/tmp/ray` when no node temp_dir is specified
2. non-head nodes pick up the head node's temp_dir when only the head node temp_dir is specified
3. non-head nodes can take an independent temp_dir regardless of the head node's temp_dir when specified
4. nodes default to their own temp dir as the spilling directory for all three cases above
5. nodes default to the head node's spilling directory when only the head node's spilling directory is specified
6. nodes can have their spilling directory specified independently of the head node's spilling directory

Behaviors were verified by checking that the directories were created and that the right information is fetched from the head node.

## Related issues
ray-project#47262 ray-project#51218 ray-project#40628 ray-project#32962

## Types of change
- [x] Enhancement

## Checklist
**Does this PR introduce breaking changes?**
- [x] No

This PR should not introduce any breaking changes just yet. However, it deprecates `get_user_temp_dir` and `get_ray_temp_dir`. The two functions will be turned into errors in the next version update.
---------
Signed-off-by: davik <davik@anyscale.com> Co-authored-by: davik <davik@anyscale.com>
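A minimal sketch of the new behavior, assuming the flag semantics described above (addresses and paths are placeholders); the commands are wrapped in `subprocess` purely so the example is a single runnable script:

```python
import subprocess

# Head node keeps the default /tmp/ray.
subprocess.run(["ray", "start", "--head", "--port=6379"], check=True)

# Worker node opts into its own temp dir; before this change, --temp-dir
# was only accepted together with --head.
subprocess.run(
    ["ray", "start", "--address=127.0.0.1:6379", "--temp-dir=/mnt/scratch/ray"],
    check=True,
)
```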
…9949)
## Description
### [Train] Train benchmark to include time to first batch
- In Train benchmarks, include the time to first batch when reporting throughput.
- Without this, results are misleading because throughput with preserve-order looks better than without it.

Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
…59953)
Updating `build-ray-docker.sh`:
- removing the `CONSTRAINTS_FILE` build arg
- copying the constraints file to `"${CPU_TMP}/requirements_compiled.txt"`

Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
## Description
Since a number of expressions have been added, this seems like a good time to reorganize the expression tests so that it's clear what is covered by tests.
---------
Signed-off-by: Goutam <goutam@anyscale.com>
… ActorPool (ray-project#59645) Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
…dir (ray-project#59941)
Rename the `build_dir` parameter to `context_dir` and move it to the last argument position for better API consistency.

Generated with [Claude Code](https://claude.com/claude-code)

Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com> Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
## Description
Adding py313 dependency sets (not in use yet):
- Adding requirements_compiled_py3.13.txt
- Adding requirements_compiled_py3.10,11,12.txt symlinked to
requirements_compiled.txt
- Updated the script to remove the header from requirements_compiled_py* files
- Parameterizing requirements_compiled_py{PYTHON_VERSION} in the raydepsets
config
- Generating py313 dependency sets
- Moving the ray_dev.in requirements file to the deplocks directory
---------
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
…oject#59954)
There were a couple of missing updates to `RAY_testing_rpc_failure` that I noticed from when we moved to a JSON format in ray-project#58886.
---------
Signed-off-by: joshlee <joshlee@anyscale.com>
…y-project#59895)
## Description
For ray-project#59508 and ray-project#59581, using TicTacToe would cause the following error:
```
ray::DQN.train() (pid=88183, ip=127.0.0.1, actor_id=2b775f13e808cc4aaaa23bde01000000, repr=DQN(env=<class 'ray.rllib.examples.envs.classes.multi_agent.tic_tac_toe.TicTacToe'>; env-runners=0; learners=0; multi-agent=True))
  File "ray/python/ray/tune/trainable/trainable.py", line 331, in train
    raise skipped from exception_cause(skipped)
  File "ray/python/ray/tune/trainable/trainable.py", line 328, in train
    result = self.step()
  File "ray/python/ray/rllib/algorithms/algorithm.py", line 1242, in step
    train_results, train_iter_ctx = self._run_one_training_iteration()
  File "ray/python/ray/rllib/algorithms/algorithm.py", line 3666, in _run_one_training_iteration
    training_step_return_value = self.training_step()
  File "ray/python/ray/rllib/algorithms/dqn/dqn.py", line 646, in training_step
    return self._training_step_new_api_stack()
  File "ray/python/ray/rllib/algorithms/dqn/dqn.py", line 668, in _training_step_new_api_stack
    self.local_replay_buffer.add(episodes)
  File "ray/python/ray/rllib/utils/replay_buffers/prioritized_episode_buffer.py", line 314, in add
    existing_eps.concat_episode(eps)
  File "ray/python/ray/rllib/env/multi_agent_episode.py", line 862, in concat_episode
    sa_episode.concat_episode(other.agent_episodes[agent_id])
  File "ray/python/ray/rllib/env/single_agent_episode.py", line 618, in concat_episode
    assert self.t == other.t_started
AssertionError
```
In the multi-agent episode's `concat_episode`, we check whether any agent hasn't received its next observation after an (observation, action) pair. This results in a hanging action, where one episode ends with the observation and action and the next episode contains the resulting observation, reward, etc. This [code](https://github.com/ray-project/ray/blob/22cf6ef6af2cddc233bca7ce59668ed8f4bbb17e/rllib/env/multi_agent_episode.py#L848) checks whether this has happened and then adds an extra step at the beginning to include the hanging data. However, in testing, the multi-agent episode's `cut` method already implements this (if using `slice`, this will cause a hidden bug), meaning that an extra, unnecessary step's data is being added, resulting in the episode beginnings not lining up. Therefore, this PR removes that code and replaces it with a simple check that assumes the hanging action is equivalent to the initial action in the next episode.

For testing, I found that the `concat_episode` test was using `slice`, which doesn't account for hanging data, while `cut`, which is used in the env-runner, does. I modified the test to be more functional: I created a custom environment in which agents take actions at different frequencies and return the agent's timestep as the observation. This lets us test by concatenating all episodes with the same ID and checking that the observations increase 0, 1, 2, ..., ensuring that no data goes missing for users.
---------
Signed-off-by: Mark Towers <mark@anyscale.com> Co-authored-by: Mark Towers <mark@anyscale.com>
…ay-project#59801) Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
)
## Description
Before this PR, users could not tell why objects could not be reconstructed, since the response only contained a generic error message:
```
All copies of 8774b2e5680a48cdffffffffffffffffffffffff0200000003000000 have been lost due to node failure. Check cluster logs (`/tmp/ray/session_latest/logs`) for more information about the failure.
```

### Object Recovery Flow (After This PR)

#### Step 1: Reference & Ownership Check
- `if (!ref_exists)` → `OBJECT_UNRECONSTRUCTABLE_REF_NOT_FOUND`
```
[OBJECT_UNRECONSTRUCTABLE_REF_NOT_FOUND] The object cannot be reconstructed because its reference was not found in the reference counter. Please file an issue at https://github.com/ray-project/ray/issues.
```
- `if (!owned_by_us)` → `OBJECT_UNRECONSTRUCTABLE_BORROWED`
```
[OBJECT_UNRECONSTRUCTABLE_BORROWED] The object cannot be reconstructed because it crossed an ownership boundary. Only the owner of an object can trigger reconstruction, but this worker borrowed the object from another worker.
```

#### Step 2: Try to Pin Existing Copies
- Look up object locations via `object_lookup_`
- If copies exist on other nodes → `PinExistingObjectCopy`
- If all locations fail → proceed to Step 3
- If no copies exist → proceed to Step 3

#### Step 3: Lineage Eligibility Check
- I define eligibility as: we don't need to actually rerun the task, and we already know whether it is eligible for reconstruction.
- `INELIGIBLE_PUT` → `OBJECT_UNRECONSTRUCTABLE_PUT`
```
[OBJECT_UNRECONSTRUCTABLE_PUT] The object cannot be reconstructed because it was created by ray.put(), which has no task lineage. To prevent this error, return the value from a task instead.
```
- `INELIGIBLE_NO_RETRIES` → `OBJECT_UNRECONSTRUCTABLE_RETRIES_DISABLED`
```
[OBJECT_UNRECONSTRUCTABLE_RETRIES_DISABLED] The object cannot be reconstructed because the task was created with max_retries=0. Consider enabling retries using `@ray.remote(max_retries=N)`.
```
- `INELIGIBLE_LINEAGE_EVICTED` → `OBJECT_UNRECONSTRUCTABLE_LINEAGE_EVICTED`
```
[OBJECT_UNRECONSTRUCTABLE_LINEAGE_EVICTED] The object cannot be reconstructed because its lineage has been evicted to reduce memory pressure. To prevent this error, set the environment variable RAY_max_lineage_bytes=<bytes> (default 1GB) during `ray start`.
```
- `INELIGIBLE_LOCAL_MODE` → `OBJECT_UNRECONSTRUCTABLE_LOCAL_MODE`
```
[OBJECT_UNRECONSTRUCTABLE_LOCAL_MODE] The object cannot be reconstructed because Ray is running in local mode. Local mode does not support object reconstruction.
```
- `INELIGIBLE_LINEAGE_DISABLED` → `OBJECT_UNRECONSTRUCTABLE_LINEAGE_DISABLED`
```
[OBJECT_UNRECONSTRUCTABLE_LINEAGE_DISABLED] The object cannot be reconstructed because lineage reconstruction is disabled system-wide (object_reconstruction_enabled=False).
```
- `ELIGIBLE` → proceed to Step 4

#### Step 4: Task Resubmission
- `OBJECT_UNRECONSTRUCTABLE_TASK_CANCELLED`
```
[OBJECT_UNRECONSTRUCTABLE_TASK_CANCELLED] The object cannot be reconstructed because the task that would produce it was cancelled.
```
- `OBJECT_UNRECONSTRUCTABLE_MAX_ATTEMPTS_EXCEEDED`
```
[OBJECT_UNRECONSTRUCTABLE_MAX_ATTEMPTS_EXCEEDED] The object cannot be reconstructed because the maximum number of task retries has been exceeded. Consider increasing the number of retries using `@ray.remote(max_retries=N)`.
```

#### Step 5: Dependency Recovery (Recursive)
```cpp
for (const auto &dep : task_deps) {
  auto error = RecoverObject(dep);  // Recursive call to Step 1
  if (error.has_value()) {
    recovery_failure_callback_(dep, *error, true);
  }
}
// Dependencies can fail with any error from Steps 1-4
```

This PR also:
- Adds appropriate log messages at each step

## Related issues
Closes ray-project#59562
---------
Signed-off-by: yicheng <yicheng@anyscale.com> Co-authored-by: yicheng <yicheng@anyscale.com>
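Several of the messages above point users at `max_retries`; a small sketch of what that looks like on the producing task (the workload itself is arbitrary):

```python
import ray

ray.init()

# Allowing retries on the producing task is what makes lineage-based
# reconstruction possible when a copy of the returned object is lost.
@ray.remote(max_retries=3)
def produce():
    return list(range(1_000_000))

ref = produce.remote()
print(len(ray.get(ref)))  # if the object is lost later, `produce` can be re-executed
```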
…-project#59425)
Fixes the issue where the interpreter would crash instead of providing a useful error message. Previously, calling `ray.kill()` on an ActorHandle from a previous Ray session (after `ray.shutdown()` and `ray.init()`) would crash the Python interpreter with a C++ assertion failure.

This fix:
1. Prevents the crash by only calling `OnActorKilled()` in C++ when the kill operation succeeds
2. Catches the error in Python and converts it to a helpful ValueError explaining that ActorHandle objects are not valid across sessions
3. Adds a test to verify the fix

The error message now clearly explains:
- ActorHandle objects are not valid across Ray sessions
- When this typically happens (after shutdown/restart)
- What the user should do (create a new actor handle)

Fixes ray-project#59340
---------
Signed-off-by: kriyanshii <kriyanshishah06@gmail.com> Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
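A rough repro of the scenario described, sketched from the PR text (exact exception message aside):

```python
import ray

@ray.remote
class Counter:
    def ping(self):
        return "pong"

ray.init()
handle = Counter.remote()
ray.shutdown()

ray.init()  # new session; `handle` still refers to the actor from the old session
try:
    ray.kill(handle)
except ValueError as e:
    # With this fix, the stale handle raises a ValueError instead of crashing
    # the interpreter with a C++ assertion failure.
    print(e)
```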
## Description
ray-project@7198193 made a backwards-incompatible change to an env variable name, leading to a regression in the `scheduling_test_many_0s_tasks_many_nodes` release test (the env var is used by the Anyscale cluster that runs the release tests). Reverting this change to fix the problem.

The release test is now passing: https://buildkite.com/ray-project/release/builds/74397
---------
Signed-off-by: sampan <sampan@anyscale.com> Co-authored-by: sampan <sampan@anyscale.com>
…server bindings (ray-project#59852)
## Description
This PR adds IPv6 localhost support and improves server binding security by eliminating 0.0.0.0 bindings.

### Goal
- Avoid hardcoding 127.0.0.1, which breaks IPv6 support.
- Avoid proactively using 0.0.0.0, which is insecure.

##### Server side
- For local-only servers, bind to localhost (resolved via `GetLocalhostIP()`/`get_localhost_ip()`; IPv4/IPv6).
- For servers that need cross-node communication, bind to the node IP instead of 0.0.0.0.
- If the user explicitly configures a bind address, always respect the user setting.

##### Client side
- Use localhost when connecting to local-only servers (resolved via `get_localhost_ip()`).
- Use the node IP when connecting to servers that require cross-node communication.

#### Note: `0.0.0.0 → node_ip` related changes this PR made
- GCS Server: `0.0.0.0 → node_ip`
- Raylet gRPC: `0.0.0.0 → node_ip`
- Core Worker gRPC: `0.0.0.0 → node_ip`
- Object Manager: `0.0.0.0 → node_ip`
- Remote Python Debugger: `0.0.0.0 → node_ip`
- Ray Client Server already passed the node IP before this PR, but its default `--host` was 0.0.0.0. This PR changed it to localhost.
- Dashboard Server binds to localhost by default. This PR just updated the documentation to suggest using the node IP instead of 0.0.0.0 for remote access.
- Non-Ray-core components (e.g., Serve): this PR keeps them binding to all interfaces as before, but replaced the hardcoded 0.0.0.0 with `get_all_interfaces_ip()` to handle IPv6 (returns `::` in IPv6 environments).
---------
Signed-off-by: yicheng <yicheng@anyscale.com> Co-authored-by: yicheng <yicheng@anyscale.com>
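For illustration only (this is not Ray's actual `get_localhost_ip()` implementation), the kind of resolution the description refers to looks like asking the resolver for `localhost` instead of hardcoding 127.0.0.1, so IPv6-only hosts get `::1`:

```python
import socket

def resolve_localhost_ip() -> str:
    # getaddrinfo returns (family, type, proto, canonname, sockaddr) tuples;
    # sockaddr[0] is the textual address, e.g. "127.0.0.1" or "::1".
    infos = socket.getaddrinfo("localhost", None, proto=socket.IPPROTO_TCP)
    return infos[0][4][0]

print(resolve_localhost_ip())
```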
…-project#59113)
## Description
This PR adds three test cases that validate the interoperability between the Intel GPU software stack and Ray in various deployment scenarios. These tests assume that the environment is already configured as required. They serve as smoke tests that users can run to confirm a correct deployment.

Covered deployment scenarios:
- Single GPU on a single node (sanity check)
- Multiple GPUs on a single node (scale-up)
- Multiple GPUs across multiple nodes (scale-out)

## Additional information
- Tests will automatically skip if `dpctl` (an Intel GPU dependency) is not installed or if the environment is not properly configured for the given test.
- Tests require the `RAY_PYTEST_USE_GPU` flag to be set, for consistency with other Ray GPU tests.

## Motivation
I understand that Ray currently does not include Intel GPUs in its CI infrastructure, so these tests will be skipped during CI runs and may not provide immediate value to the Ray development team. However, they can serve as a useful verification and troubleshooting tool for Ray users deploying on Intel GPUs, making them worth upstreaming. They also require very little maintenance, as they simply skip gracefully, and they provide future readiness in case Intel GPU support in Ray expands.
---------
Signed-off-by: Jakub Zimny <jakub.zimny@intel.com>
…#59478)
## Description
I'm hoping to make the example publishing process smoother when setting up the CI for testing (release tests).

Currently, when publishing examples and setting up release tests in CI, we have to manually tweak multiple BUILD.bazel files to make our `ci/aws.yaml` and `ci/gce.yaml` discoverable to the CI release package (release/BUILD.bazel). This adds overhead and confusion for the writer and clutters the files over time.

Solution: consolidate all `ci/aws.yaml` and `ci/gce.yaml` configs into a single filegroup under doc/BUILD.bazel, and discover that unique filegroup from release/BUILD.bazel. The only requirement is to match the `doc/source/**/ci/aws.yaml` or `doc/source/**/ci/gce.yaml` pattern, which matches our standard way to publish examples.

### Changes
- **Updated** `doc/BUILD.bazel` to define one single filegroup for all of the `ci/` configs, using glob patterns to catch all `aws.yaml` and `gce.yaml` files under a `ci/` folder
- **Updated** `release/BUILD.bazel` to reference that filegroup
- **Updated** all inner doc/** BUILD.bazel files accordingly with their own local filegroups

### Tests
<details>
<summary>Manual review with bazel query 'kind("source file", deps(//doc:all_examples_ci_configs))'</summary>

```
(repo_ray_docs) aydin@aydin-JCDF7JJD9H doc % bazel query 'kind("source file", deps(//doc:all_examples_ci_configs))'
//doc:source/ray-overview/examples/e2e-audio/ci/aws.yaml
//doc:source/ray-overview/examples/e2e-audio/ci/gce.yaml
//doc:source/ray-overview/examples/e2e-multimodal-ai-workloads/ci/aws.yaml
//doc:source/ray-overview/examples/e2e-multimodal-ai-workloads/ci/gce.yaml
//doc:source/ray-overview/examples/e2e-rag/ci/aws.yaml
//doc:source/ray-overview/examples/e2e-rag/ci/gce.yaml
//doc:source/ray-overview/examples/e2e-timeseries/ci/aws.yaml
//doc:source/ray-overview/examples/e2e-timeseries/ci/gce.yaml
//doc:source/ray-overview/examples/e2e-xgboost/ci/aws.yaml
//doc:source/ray-overview/examples/e2e-xgboost/ci/gce.yaml
//doc:source/ray-overview/examples/entity-recognition-with-llms/ci/aws.yaml
//doc:source/ray-overview/examples/entity-recognition-with-llms/ci/gce.yaml
//doc:source/ray-overview/examples/langchain_agent_ray_serve/ci/aws.yaml
//doc:source/ray-overview/examples/langchain_agent_ray_serve/ci/gce.yaml
//doc:source/ray-overview/examples/llamafactory-llm-fine-tune/ci/aws.yaml
//doc:source/ray-overview/examples/llamafactory-llm-fine-tune/ci/gce.yaml
//doc:source/ray-overview/examples/mcp-ray-serve/ci/aws.yaml
//doc:source/ray-overview/examples/mcp-ray-serve/ci/gce.yaml
//doc:source/ray-overview/examples/object-detection/ci/aws.yaml
//doc:source/ray-overview/examples/object-detection/ci/gce.yaml
//doc:source/serve/tutorials/asynchronous-inference/ci/aws.yaml
//doc:source/serve/tutorials/asynchronous-inference/ci/gce.yaml
//doc:source/serve/tutorials/deployment-serve-llm/ci/aws.yaml
//doc:source/serve/tutorials/deployment-serve-llm/ci/gce.yaml
//doc/source/data/examples:unstructured_data_ingestion/ci/aws.yaml
//doc/source/data/examples:unstructured_data_ingestion/ci/gce.yaml
//doc/source/train/examples/pytorch:deepspeed_finetune/ci/aws.yaml
//doc/source/train/examples/pytorch:deepspeed_finetune/ci/gce.yaml
//doc/source/train/examples/pytorch:distributing-pytorch/ci/aws.yaml
//doc/source/train/examples/pytorch:distributing-pytorch/ci/gce.yaml
//doc/source/train/examples/pytorch:pytorch-fsdp/ci/aws.yaml
//doc/source/train/examples/pytorch:pytorch-fsdp/ci/gce.yaml
//doc/source/train/examples/pytorch:pytorch-profiling/ci/aws.yaml
//doc/source/train/examples/pytorch:pytorch-profiling/ci/gce.yaml
Loading: 1 packages loaded
```
</details>

I also ran all release tests whose ci/ configs are affected by this change and verified that their ci/ configuration is still being read correctly (i.e., the Anyscale job is launched as expected): https://buildkite.com/ray-project/release/builds/73954# (test failures are due to application-level errors after launching the Anyscale job, not because of this change).
---------
Signed-off-by: Aydin Abiar <aydin@anyscale.com> Signed-off-by: Aydin Abiar <62435714+Aydin-ab@users.noreply.github.com> Co-authored-by: Aydin Abiar <aydin@anyscale.com>
## Description
Automatically exclude common directories (.git, .venv, venv, __pycache__) when uploading `working_dir` in runtime environment packages.

At a minimum we need to exclude `.git/` because, unlike the others, nobody includes `.git/` in `.gitignore`. This causes Ray to throw a `ray.exceptions.RuntimeEnvSetupError` if your `.git` dir is larger than 512 MiB.

I also updated the documentation in handling-dependencies.rst and improved the error message if the env exceeds the GCS_STORAGE_MAX_SIZE limit.

## Related issues
N/A

## Additional information
This PR pytorch/tutorials#3709 was failing to run because the PyTorch tutorials `.git/` folder is huge.
---------
Signed-off-by: Ricardo Decal <public@ricardodecal.com> Signed-off-by: Ricardo Decal <crypdick@users.noreply.github.com> Signed-off-by: Ricardo Decal <rdecal@anyscale.com> Co-authored-by: Ricardo Decal <public@ricardodecal.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
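The same effect can already be opted into manually through runtime_env's existing `excludes` field; a short sketch (the task body is arbitrary):

```python
import ray

ray.init(
    runtime_env={
        "working_dir": ".",
        # gitignore-style patterns skipped during the working_dir upload
        "excludes": [".git", ".venv", "venv", "__pycache__"],
    }
)

@ray.remote
def f():
    return "working_dir uploaded without the excluded directories"

print(ray.get(f.remote()))
```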
…roject#59956)
## Description
**[Data] Fix test_execution_optimizer_limit_pushdown determinism**

Fix by adding `override_num_blocks=1`.
```
[2026-01-08T00:20:42Z] =================================== FAILURES ===================================
[2026-01-08T00:20:42Z] ____________________ test_limit_pushdown_basic_limit_fusion ____________________
[2026-01-08T00:20:42Z]
[2026-01-08T00:20:42Z] ray_start_regular_shared_2_cpus = RayContext(dashboard_url='127.0.0.1:8265', python_version='3.10.19', ray_version='3.0.0.dev0', ray_commit='{{RAY_COMMIT_SHA}}')
[2026-01-08T00:20:42Z]
[2026-01-08T00:20:42Z] def test_limit_pushdown_basic_limit_fusion(ray_start_regular_shared_2_cpus):
[2026-01-08T00:20:42Z]     """Test basic Limit -> Limit fusion."""
[2026-01-08T00:20:42Z]     ds = ray.data.range(100).limit(5).limit(100)
[2026-01-08T00:20:42Z] >   _check_valid_plan_and_result(
[2026-01-08T00:20:42Z]         ds,
[2026-01-08T00:20:42Z]         "Read[ReadRange] -> Limit[limit=5]",
[2026-01-08T00:20:42Z]         [{"id": i} for i in range(5)],
[2026-01-08T00:20:42Z]         check_ordering=False,
[2026-01-08T00:20:42Z]     )
[2026-01-08T00:20:42Z]
[2026-01-08T00:20:42Z] python/ray/data/tests/test_execution_optimizer_limit_pushdown.py:40:
[2026-01-08T00:20:42Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2026-01-08T00:20:42Z]
[2026-01-08T00:20:42Z] ds = limit=5
[2026-01-08T00:20:42Z] +- Dataset(num_rows=5, schema={id: int64})
[2026-01-08T00:20:42Z] expected_plan = 'Read[ReadRange] -> Limit[limit=5]'
[2026-01-08T00:20:42Z] expected_result = [{'id': 0}, {'id': 1}, {'id': 2}, {'id': 3}, {'id': 4}]
[2026-01-08T00:20:42Z] expected_physical_plan_ops = None, check_ordering = False
[2026-01-08T00:20:42Z]
[2026-01-08T00:20:42Z] def _check_valid_plan_and_result(
[2026-01-08T00:20:42Z]     ds: Dataset,
[2026-01-08T00:20:42Z]     expected_plan: Plan,
[2026-01-08T00:20:42Z]     expected_result: List[Dict[str, Any]],
[2026-01-08T00:20:42Z]     expected_physical_plan_ops=None,
[2026-01-08T00:20:42Z]     check_ordering=True,
[2026-01-08T00:20:42Z] ):
[2026-01-08T00:20:42Z]     actual_result = ds.take_all()
[2026-01-08T00:20:42Z]     if check_ordering:
[2026-01-08T00:20:42Z]         assert actual_result == expected_result
[2026-01-08T00:20:42Z]     else:
[2026-01-08T00:20:42Z] >       assert rows_same(pd.DataFrame(actual_result), pd.DataFrame(expected_result))
[2026-01-08T00:20:42Z] E       AssertionError: assert False
[2026-01-08T00:20:42Z] E        + where False = rows_same( id\n0 25\n1 26\n2 27\n3 28\n4 29, id\n0 0\n1 1\n2 2\n3 3\n4 4)
[2026-01-08T00:20:42Z] E        + where id\n0 25\n1 26\n2 27\n3 28\n4 29 = <class 'pandas.core.frame.DataFrame'>([{'id': 25}, {'id': 26}, {'id': 27}, {'id': 28}, {'id': 29}])
[2026-01-08T00:20:42Z] E        + where <class 'pandas.core.frame.DataFrame'> = pd.DataFrame
[2026-01-08T00:20:42Z] E        + and id\n0 0\n1 1\n2 2\n3 3\n4 4 = <class 'pandas.core.frame.DataFrame'>([{'id': 0}, {'id': 1}, {'id': 2}, {'id': 3}, {'id': 4}])
[2026-01-08T00:20:42Z] E        + where <class 'pandas.core.frame.DataFrame'> = pd.DataFrame
```
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
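A condensed sketch of the fix: with a single block, `limit(5)` deterministically yields the first five rows, so the unordered comparison no longer depends on which block arrives first.

```python
import ray

ds = ray.data.range(100, override_num_blocks=1).limit(5).limit(100)
assert sorted(row["id"] for row in ds.take_all()) == [0, 1, 2, 3, 4]
```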
## Description
After ray-project#60017 got merged, I forgot to update the `test_bundle_queue` test suite. This PR adds more tests for `num_blocks`, `num_rows`, `estimate_size_bytes`, and `len(queue)`.
---------
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
…ject#60338)
## Description
This PR adds support for Google Cloud's 7th-generation TPU (Ironwood). The TPU 7x generation introduces a change in the accelerator type naming convention reported by the environment. Unlike previous generations (v6e-16, v5p-8, etc.), 7x instances report types starting with `tpu` (e.g. tpu7x-16). This PR accounts for the new format and enables Ray to auto-detect the v7x hardware automatically (users don't have to manually configure env vars). This is critical for libraries like Ray Train and for vLLM support, where automatic device discovery is used during JAX initialization.

## Related issues
Fixes ray-project#59964

## Additional information
For more info about TPU v7x: https://docs.cloud.google.com/tpu/docs/tpu7x.
---------
Signed-off-by: ryanaoleary <ryanaoleary@google.com>
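A hypothetical helper, purely to illustrate the naming-convention change (this is not the actual detection code in Ray):

```python
def parse_tpu_generation(accelerator_type: str) -> str:
    # Older generations report types like "v6e-16" or "v5p-8"; 7x instances
    # report types starting with "tpu", e.g. "tpu7x-16".
    name = accelerator_type.split("-")[0]
    if name.startswith("tpu"):
        return name[len("tpu"):]   # "tpu7x-16" -> "7x"
    return name.lstrip("v")        # "v6e-16"   -> "6e"

assert parse_tpu_generation("tpu7x-16") == "7x"
assert parse_tpu_generation("v6e-16") == "6e"
```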
## Description
1. The flakiness in `test_flush_worker_result_queue` is that, when
`queue_backlog_length` is 0, we call `wg.poll_status()` immediately after
`wg._start()` and assert that it has finished, but sometimes rank 0's training
thread is still running at that instant, which leads to the error below:
```
where False = WorkerGroupPollStatus(worker_statuses={0: WorkerStatus(running=True, error=None, training_report=None), 1: WorkerStatus(running=False, error=None, training_report=None), 2: WorkerStatus(running=False, error=None, training_report=None), 3: WorkerStatus(running=False, error=None, training_report=None)}).finished
```
2. Use the same pattern as in `test_poll_status_finished` (in the same file)
to address this flakiness; see the sketch after this list.
3. Increase `test_placement_group_handle` to size medium to avoid timeouts:
```
python/ray/train/v2/tests/test_placement_group_handle.py::test_slice_handle_shutdown -- Test timed out at 2026-01-20 18:12:46 UTC --
--
[2026-01-20T18:15:17Z] ERROR [100%]
[2026-01-20T18:15:17Z]
[2026-01-20T18:15:17Z] ==================================== ERRORS ====================================
[2026-01-20T18:15:17Z] _________________ ERROR at setup of test_slice_handle_shutdown _________________
[2026-01-20T18:15:17Z]
[2026-01-20T18:15:17Z] @pytest.fixture(autouse=True)
[2026-01-20T18:15:17Z] def ray_start():
[2026-01-20T18:15:17Z] > ray.init(num_cpus=4)
[2026-01-20T18:15:17Z]
[2026-01-20T18:15:17Z] python/ray/train/v2/tests/test_placement_group_handle.py:16:
[2026-01-20T18:15:17Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2026-01-20T18:15:17Z] /rayci/python/ray/_private/client_mode_hook.py:104: in wrapper
[2026-01-20T18:15:17Z] return func(*args, **kwargs)
[2026-01-20T18:15:17Z] /rayci/python/ray/_private/worker.py:1910: in init
[2026-01-20T18:15:17Z] _global_node = ray._private.node.Node(
[2026-01-20T18:15:17Z] /rayci/python/ray/_private/node.py:402: in __init__
[2026-01-20T18:15:17Z] time.sleep(0.1)
[2026-01-20T18:15:17Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2026-01-20T18:15:17Z]
[2026-01-20T18:15:17Z] signum = 15
[2026-01-20T18:15:17Z] frame = <frame at 0x55cf6cb749f0, file '/rayci/python/ray/_private/node.py', line 402, code __init__>
[2026-01-20T18:15:17Z]
[2026-01-20T18:15:17Z] def sigterm_handler(signum, frame):
[2026-01-20T18:15:17Z] > sys.exit(signum)
[2026-01-20T18:15:17Z] E SystemExit: 15
[2026-01-20T18:15:17Z]
[2026-01-20T18:15:17Z] /rayci/python/ray/_private/worker.py:1670: SystemExit
```
4. Add a `manual` tag to the `test_jax_gpu` bazel target to temporarily
disable CI for this unit test, given that the PyPI jax version now requires
at least CUDA 12.2 while our CI runs on CUDA 12.1.
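A minimal sketch of the polling pattern referenced in item 2; `wait_until_finished` is a hypothetical helper, and only `wg.poll_status()` and the `finished` field come from the test itself:

```python
import time

def wait_until_finished(wg, timeout_s: float = 10.0, interval_s: float = 0.1):
    # Poll instead of asserting `finished` on the very first poll, since rank 0's
    # training thread may still be running at that instant.
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = wg.poll_status()
        if status.finished:
            return status
        time.sleep(interval_s)
    raise TimeoutError("worker group did not finish within the timeout")
```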
---------
Signed-off-by: Lehui Liu <lehui@anyscale.com>
…d of the highest available version. (ray-project#60378) Signed-off-by: irabbani <israbbani@gmail.com>
ray-project#60384)
This reverts commit c9ff164. After investigations by the core team, we were able to determine that this minor OTEL version upgrade dropped task/actor creation throughput by around 25%, from 600ms to 450ms.

Buildkite run that verifies this fix: https://buildkite.com/ray-project/release/builds/76576#019be280-c80f-40c6-9907-904ff5f93d4b
---------
Signed-off-by: joshlee <joshlee@anyscale.com>
…t files (ray-project#60236)
## Description
Returning `None` when there are no `partition_columns` selects all the partitions, which is not the right behavior. Now return `[]` when no partition columns are selected.
## Related issues
Closes ray-project#60215
---------
Signed-off-by: Goutam <goutam@anyscale.com>
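A toy illustration of why the distinction matters downstream (the names here are hypothetical, not the actual Ray Data internals):

```python
from typing import List, Optional

def select_partition_columns(available: List[str], requested: Optional[List[str]]) -> List[str]:
    # `None` means "no selection made", which downstream code treats as "keep all";
    # `[]` explicitly selects no partition columns, which is the intended behavior here.
    if requested is None:
        return available
    return [c for c in available if c in requested]

assert select_partition_columns(["year", "month"], None) == ["year", "month"]
assert select_partition_columns(["year", "month"], []) == []
```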
…60185)" (ray-project#60361) Signed-off-by: joshlee <joshlee@anyscale.com>
)
## Description
This PR reduces CI time for Data-only PRs by ensuring that changes to `python/ray/data/` no longer trigger all ML/train tests unnecessarily.
## Related issues
Closes ray-project#59780

Contribution by Gittensor; learn more at https://gittensor.io/
---------
Signed-off-by: DeborahOlaboye <deboraholaboye@gmail.com> Co-authored-by: Balaji Veeramani <balaji@anyscale.com> Co-authored-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
…ray-project#60224)
There is no need for this callback to be nested so deeply inside of the `TaskReceiver`. We can instead call it from `CoreWorker::ExecuteTask` prior to returning.
---------
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…ect#60352)
Use more conventional methods, so that it is clearer how the job status info gets used. This is in preparation for the Anyscale CLI/SDK migration.

Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
…ct#60376)
1. `ray.get(pg_handle.ready(), timeout=self._worker_group_start_timeout_s)` covers both starting the placement group and installing the runtime env; if the installation takes longer than 30s, the worker group goes into a scheduling/rescheduling phase.
2. This change bumps the default timeout to 60s instead, to mitigate the fixedScalingPolicy experience when packages are installed via the runtime environment.

Signed-off-by: Lehui Liu <lehui@anyscale.com>
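For context, the call in question is essentially the standard placement-group readiness wait; a standalone sketch with an explicit 60s timeout:

```python
import ray
from ray.util.placement_group import placement_group

ray.init()
pg = placement_group([{"CPU": 1}])

# Waiting on pg.ready() covers scheduling; in the worker-group start path described
# above, runtime_env installation also has to fit inside this same timeout.
ray.get(pg.ready(), timeout=60)
```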
…r restart (ray-project#58877) Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com> Co-authored-by: Dhyey Shah <dhyey2019@gmail.com>
…roject#57735)" (ray-project#60388)
## Description
The PR that introduced node-specific `temp-dir` specification capabilities added a number of tests that fail on the Windows environment in post-merge. To prevent these tests from blocking other PRs, we are reverting the PR until the tests have been fixed. This reverts PR "[Core] Introduce node specific temp-dir specification. (ray-project#57735)".
## Related issues
N/A
## Additional information
Temp dir PR: ray-project#57735

Signed-off-by: davik <davik@anyscale.com> Co-authored-by: davik <davik@anyscale.com>
## Description
Remove the user-facing `meta_provider` parameter from all read APIs, its docstrings, and related tests, while keeping the metadata provider implementations and logic.
## Related issues
Closes ray-project#60310
## Additional information
Deleted the `meta_provider` parameter from all read APIs along with its deprecation warnings, and deleted tests that explicitly test the parameter. I kept all metadata provider implementations (`DefaultFileMetadataProvider`, `BaseFileMetadataProvider`, `FileMetadataProvider`) and the internal uses of `meta_provider`, such as in subclasses of `ray.data.datasource.Datasource`. Ran the remaining read API tests.
---------
Signed-off-by: Hyunoh-Yeo <hyunoh.yeo@gmail.com>
…n. (ray-project#60245)
The GcsActorManager has public methods that are only used in the class or in testing. This is a clear violation of encapsulation, so I've made these methods private. For tests that use them, I've made the tests explicit friends of the GcsActorManager class. I don't love this, but it's better than the status quo.
---------
Signed-off-by: irabbani <irabbani@anyscale.com> Signed-off-by: Ibrahim Rabbani <irabbani@anyscale.com> Co-authored-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
…y-project#60415)
This was from 7 years ago. We really, truly don't support Python 2 anymore.

Signed-off-by: irabbani <israbbani@gmail.com>
`test_ray_intentional_errors` has been flaky because there is a race between the FINISHED task event reaching the GCS and the worker-dead timeout firing (killing the actor in the test triggers the worker-dead callback). This has been fixed by increasing `gcs_mark_task_failed_on_worker_dead_delay_ms` (this affects both Linux and Windows but seems to be more frequent on Windows). We could consider increasing this config by default, but I feel this is an edge case that may not be worth our time.

Successful run: https://buildkite.com/ray-project/premerge/builds/57999#019bccb1-b365-4d5c-8f1c-e473e95959da/L11240
---------
Signed-off-by: sampan <sampan@anyscale.com> Co-authored-by: sampan <sampan@anyscale.com>
…e cgroups (ray-project#59051)
## Description
Add a user guide for enabling Ray resource isolation on Kubernetes using writable cgroups.
---------
Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
## Description
EMA stats can become noisy if we process a bunch of NaN values. This PR silences these warnings, since they are to be expected in our setting. This change is important because the notifications from numpy are logged without a stack trace or any hint about which component they come from, so if some of your metrics (that you may not even know of) are NaN, you'll see this warning all the time and have no idea how to fix it.
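A minimal sketch of the kind of silencing described (not RLlib's actual implementation): an all-NaN metric window makes numpy log a context-free RuntimeWarning, which can be suppressed around the reduction.

```python
import warnings
import numpy as np

window = np.array([np.nan, np.nan, np.nan])  # e.g. a metric that was never reported

with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=RuntimeWarning)
    ema_input = np.nanmean(window)  # still NaN, but without the noisy warning being logged

print(ema_input)
```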
…ob_logs (ray-project#60346)
When using `JobSubmissionClient.tail_job_logs()` with authenticated Ray clusters (e.g., clusters behind an authentication proxy or with token-based auth), the WebSocket connection fails silently because authentication headers are not passed to the WebSocket upgrade request.

### Current Behavior (Bug)
- `ray job submit` hangs indefinitely when trying to tail logs on authenticated clusters
- SDK users cannot tail logs from authenticated clusters via WebSocket
- The connection closes silently without proper error reporting

### Root Cause
The bug exists in `python/ray/dashboard/modules/job/sdk.py` lines 497-502:
```python
async with aiohttp.ClientSession(
    cookies=self._cookies, headers=self._headers  # Headers set on session
) as session:
    ws = await session.ws_connect(
        f"{self._address}/api/jobs/{job_id}/logs/tail", ssl=self._ssl_context
    )  # But NOT passed to ws_connect()!
```
**Why this is a problem:** Unlike HTTP requests, aiohttp's `ClientSession` does NOT automatically include session-level headers in WebSocket upgrade requests. Per aiohttp's design, `ws_connect()` creates fresh headers with only WebSocket protocol headers. Session headers must be explicitly passed via the `headers` parameter.

**Evidence from aiohttp source:**
- HTTP requests call `_prepare_headers()`, which merges session defaults
- `ws_connect()` creates a new `CIMultiDict()` without merging session headers
- See: https://github.com/aio-libs/aiohttp/blob/master/aiohttp/client.py

## Changes Made

### 1. Fix in `sdk.py` (1 line changed)
**File:** `python/ray/dashboard/modules/job/sdk.py`
**Lines:** 500-502
```diff
 async with aiohttp.ClientSession(
     cookies=self._cookies, headers=self._headers
 ) as session:
     ws = await session.ws_connect(
         f"{self._address}/api/jobs/{job_id}/logs/tail",
+        headers=self._headers,
         ssl=self._ssl_context
     )
```

### 2. New Test in `test_sdk.py` (59 lines added)
**File:** `python/ray/dashboard/modules/job/tests/test_sdk.py`
Added `test_tail_job_logs_passes_headers_to_websocket()`, which:
- Creates a `JobSubmissionClient` with authentication headers
- Mocks aiohttp's `ClientSession` and WebSocket connection
- Verifies that headers are explicitly passed to `ws_connect()`
- Ensures authentication headers reach the WebSocket upgrade request

## Testing

### Automated Testing
The new test `test_tail_job_logs_passes_headers_to_websocket` verifies the fix by:
- Mocking the aiohttp WebSocket connection
- Checking that `ws_connect()` receives the `headers` parameter
- Asserting the headers match what was passed to `JobSubmissionClient`

### Manual Testing
To manually test this fix with an authenticated Ray cluster:
```python
from ray.job_submission import JobSubmissionClient

# Connect to authenticated cluster
client = JobSubmissionClient(
    address="https://your-ray-cluster/",
    headers={"Authorization": "Bearer your-token"},
)

# Submit job
job_id = client.submit_job(entrypoint="echo 'Hello, Ray!'")

# Tail logs (this should now work instead of hanging)
async for lines in client.tail_job_logs(job_id):
    print(lines, end="")
```
**Before this fix:** The above code would hang indefinitely or fail silently.
**After this fix:** Logs stream correctly via WebSocket with authentication.

## Impact
This one-line fix enables:
1. **`ray job submit` to work with authenticated clusters**
   - Previously would hang indefinitely when auto-tailing logs
   - Now streams logs correctly
2. **SDK users can tail logs from authenticated clusters**
   - `tail_job_logs()` now works with any authentication mechanism
   - Enables real-time log streaming for production deployments
3. **Proxied Ray clusters work correctly**
   - Ray clusters behind authentication proxies (common in production)
   - Multi-tenant Ray deployments with auth
4. **No breaking changes**
   - Backward compatible with non-authenticated clusters
   - Headers parameter is optional (None is valid)
   - Existing tests continue to pass
---------
Signed-off-by: Tri Lam <trilamsr@gmail.com> Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
ray-project#60392)
## Description
Autoscaler v2 uses a 5-minute timeout for a VM instance to be up and running. This timeout is often insufficient for rare instance types, such as GPU and TPU instances, which can take much longer to provision. Multiple users have encountered failures caused by this short timeout, for example:
- https://ray.slack.com/archives/C02GFQ82JPM/p1763416720349559
- https://ray.slack.com/archives/C02GFQ82JPM/p1767966803946519?thread_ts=1767348863.118229&cid=C02GFQ82JPM
- https://ray.slack.com/archives/C02GFQ82JPM/p1763533559614779?thread_ts=1763488820.585659&cid=C02GFQ82JPM
- https://ray.slack.com/archives/C02GFQ82JPM/p1765205118511629?thread_ts=1764860436.694129&cid=C02GFQ82JPM

This PR increases the timeout from 5 minutes to 1 hour, making Autoscaler v2 more robust for slow-provisioning instance types.
---------
Signed-off-by: Rueian Huang <rueiancsie@gmail.com> Signed-off-by: Rueian <rueiancsie@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Code Review
This pull request is an automated daily merge from master to main, containing a wide range of changes across the repository. The most significant updates include a major refactoring of the CI/CD pipelines, moving towards a more unified and parameterized build system using wanda. This involves changes to Buildkite configurations, Dockerfiles, and build scripts. Additionally, there are substantial improvements to documentation, including new guides, better organization, and more realistic examples. The RLlib examples have been restructured, and several new features and APIs have been introduced in Ray Core and Ray Serve, such as token authentication, locality-aware routing, and improved observability. My review focuses on ensuring the consistency and correctness of these large-scale changes.
This pull request has been automatically marked as stale because it has not had recent activity. You can always ask for help on our discussion forum or Ray's public Slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed.
This Pull Request was created automatically to merge the latest changes from `master` into the `main` branch.

Created: 2026-01-23
Merge direction: `master` → `main`
Triggered by: Scheduled
Please review and merge if everything looks good.