daily merge: master → main 2026-01-29#760
## Description Add model inference release test that closely reflects user workloads. Release test run: https://console.anyscale-staging.com/cld_vy7xqacrvddvbuy95auinvuqmt/prj_xqmpk8ps6civt438u1hp5pi88g/jobs/prodjob_glehkcquv9k26ta69f8lkc94nl?job-logs-section-tabs=application_logs&job-tab=overview&metrics-tab=data --------- Signed-off-by: Goutam <goutam@anyscale.com>
Reverts ray-project#59983. The symlink does not work with the newer version of wanda, and the newer version of wanda is doing the right thing.
…ct#59987) - Bump .rayciversion from 0.21.0 to 0.25.0 - Move rules files to .buildkite/ with *.rules.txt naming convention - Add always.rules.txt for always-run lint rules - Add test.rules.test.txt with test cases - Add test-rules CI step in cicd.rayci.yml (auto-discovery) - Update macOS config to use new rules file paths Topic: update-rayci-latest Signed-off-by: andrew <andrew@anyscale.com>
…ay-project#60057) ## Summary When running prefill-decode disaggregation with NixlConnector and data parallelism, both prefill and decode deployments were using the same port base for their ZMQ side channel. This caused "Address already in use" errors when both deployments had workers on the same node: ``` zmq.error.ZMQError: Address already in use (addr='tcp://10.0.75.118:40009') Exception in thread nixl_handshake_listener ``` ## Changes Fix by setting different `NIXL_SIDE_CHANNEL_PORT_BASE` values for prefill (40000) and decode (41000) configs to ensure port isolation. ## Test plan - Run `test_llm_serve_prefill_decode_with_data_parallelism` - should complete without timeout - The test previously hung forever waiting for "READY message from DP Coordinator" Signed-off-by: Seiji Eicher <seiji@anyscale.com>
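The port-isolation idea above can be sketched in a few lines. This is an illustrative model, not Ray's or NixlConnector's actual code: each deployment derives its workers' side-channel ports from its own base, so disjoint bases guarantee no collision on a shared node.

```python
def side_channel_port(base: int, dp_rank: int) -> int:
    """Derive a worker's side-channel port from its deployment's port base."""
    return base + dp_rank

# The fix gives each deployment a disjoint range (values from the PR above).
PREFILL_PORT_BASE = 40000
DECODE_PORT_BASE = 41000

prefill_ports = {side_channel_port(PREFILL_PORT_BASE, r) for r in range(8)}
decode_ports = {side_channel_port(DECODE_PORT_BASE, r) for r in range(8)}

# Disjoint ranges mean no "Address already in use" on nodes hosting both.
assert prefill_ports.isdisjoint(decode_ports)
```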
…se (ray-project#60092) Signed-off-by: Future-Outlier <eric901201@gmail.com>
- Fix ProgressBar to honor `use_ray_tqdm` in `DataContext`. - Note that `tqdm_ray` is designed to work in non-interactive contexts (workers/actors) by sending JSON progress updates to the driver. --------- Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
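A minimal sketch of the selection logic described above, with a stand-in context class (not Ray's actual `DataContext` or `ProgressBar` internals): the progress bar consults the flag instead of unconditionally picking one backend.

```python
from dataclasses import dataclass


@dataclass
class DataContextStub:
    """Stand-in for the relevant flag on ray.data.DataContext."""
    use_ray_tqdm: bool = True


def pick_progress_backend(ctx: DataContextStub) -> str:
    # tqdm_ray sends JSON progress updates from workers/actors to the
    # driver, so it is the right backend in non-interactive contexts.
    return "tqdm_ray" if ctx.use_ray_tqdm else "tqdm"
```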
…ray-project#59933) ## Description The `DefaultAutoscaler2` implementation needs an `AutoscalingCoordinator` and a way to get all of the `_NodeResourceSpec`. Currently, we can't explicitly inject fake implementations of either dependency. This is problematic because the tests need to assume what the implementation of each dependency looks like and use brittle mocks. To solve this: - Add the `FakeAutoscalingCoordinator` implementation to a new `fake_autoscaling_coordinator.py` module (you can use the code below) - `DefaultClusterAutoscalerV2` has two new parameters `autoscaling_coordinator: Optional[AutoscalingCoordinator] = None` and `get_node_counts: Callable[[], Dict[_NodeResourceSpec, int]] = get_node_resource_spec_and_count`. If `autoscaling_coordinator` is None, you can use the default implementation. - Update `test_try_scale_up_cluster` to use the explicit seams rather than mocks. Where possible, assert against the public interface rather than implementation details ## Related issues Closes ray-project#59683 --------- Signed-off-by: 400Ping <fourhundredping@gmail.com> Signed-off-by: Ping <fourhundredping@gmail.com>
## Description RLlib's rayci.yml [file](https://github.com/ray-project/ray/blob/master/.buildkite/rllib.rayci.yml) and the BUILD.bazel [file](https://github.com/ray-project/ray/blob/master/rllib/BUILD.bazel) are disconnected, such that there are old tags in the BUILD file that are not in the rayci file and vice versa. This PR attempts to clean up both files without changing which tests are or aren't currently run. --------- Signed-off-by: Mark Towers <mark@anyscale.com> Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com> Co-authored-by: Mark Towers <mark@anyscale.com> Co-authored-by: Kamil Kaczmarek <kamil@anyscale.com>
…g tracer file handles (ray-project#60078) This fix resolves Serve's Windows test failure: ``` [2026-01-12T22:52:13Z] =================================== ERRORS ==================================== -- [2026-01-12T22:52:13Z] _______ ERROR at teardown of test_deployment_remote_calls_with_tracing ________ [2026-01-12T22:52:13Z] [2026-01-12T22:52:13Z] @pytest.fixture [2026-01-12T22:52:13Z] def cleanup_spans(): [2026-01-12T22:52:13Z] """Cleanup temporary spans_dir folder at beginning and end of test.""" [2026-01-12T22:52:13Z] if os.path.exists(spans_dir): [2026-01-12T22:52:13Z] shutil.rmtree(spans_dir) [2026-01-12T22:52:13Z] os.makedirs(spans_dir, exist_ok=True) [2026-01-12T22:52:13Z] yield [2026-01-12T22:52:13Z] # Enable tracing only sets up tracing once per driver process. [2026-01-12T22:52:13Z] # We set ray.__traced__ to False here so that each [2026-01-12T22:52:13Z] # test will re-set up tracing. [2026-01-12T22:52:13Z] ray.__traced__ = False [2026-01-12T22:52:13Z] if os.path.exists(spans_dir): [2026-01-12T22:52:13Z] > shutil.rmtree(spans_dir) [2026-01-12T22:52:13Z] [2026-01-12T22:52:13Z] python\ray\serve\tests\test_serve_with_tracing.py:30: [2026-01-12T22:52:13Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ [2026-01-12T22:52:13Z] C:\Miniconda3\lib\shutil.py:750: in rmtree [2026-01-12T22:52:13Z] return _rmtree_unsafe(path, onerror) [2026-01-12T22:52:13Z] C:\Miniconda3\lib\shutil.py:620: in _rmtree_unsafe [2026-01-12T22:52:13Z] onerror(os.unlink, fullname, sys.exc_info()) [2026-01-12T22:52:13Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ [2026-01-12T22:52:13Z] [2026-01-12T22:52:13Z] path = '/tmp/spans/' [2026-01-12T22:52:13Z] onerror = <function rmtree.<locals>.onerror at 0x000002C0FFBBDA20> [2026-01-12T22:52:13Z] [2026-01-12T22:52:13Z] def _rmtree_unsafe(path, onerror): [2026-01-12T22:52:13Z] try: [2026-01-12T22:52:13Z] with os.scandir(path) as scandir_it: [2026-01-12T22:52:13Z] entries = 
list(scandir_it) [2026-01-12T22:52:13Z] except OSError: [2026-01-12T22:52:13Z] onerror(os.scandir, path, sys.exc_info()) [2026-01-12T22:52:13Z] entries = [] [2026-01-12T22:52:13Z] for entry in entries: [2026-01-12T22:52:13Z] fullname = entry.path [2026-01-12T22:52:13Z] if _rmtree_isdir(entry): [2026-01-12T22:52:13Z] try: [2026-01-12T22:52:13Z] if entry.is_symlink(): [2026-01-12T22:52:13Z] # This can only happen if someone replaces [2026-01-12T22:52:13Z] # a directory with a symlink after the call to [2026-01-12T22:52:13Z] # os.scandir or entry.is_dir above. [2026-01-12T22:52:13Z] raise OSError("Cannot call rmtree on a symbolic link") [2026-01-12T22:52:13Z] except OSError: [2026-01-12T22:52:13Z] onerror(os.path.islink, fullname, sys.exc_info()) [2026-01-12T22:52:13Z] continue [2026-01-12T22:52:13Z] _rmtree_unsafe(fullname, onerror) [2026-01-12T22:52:13Z] else: [2026-01-12T22:52:13Z] try: [2026-01-12T22:52:13Z] > os.unlink(fullname) [2026-01-12T22:52:13Z] E PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: '/tmp/spans/15464.txt' [2026-01-12T22:52:13Z] [2026-01-12T22:52:13Z] C:\Miniconda3\lib\shutil.py:618: PermissionError ``` **Cause:** The `setup_local_tmp_tracing.py` module opens a file handle for the `ConsoleSpanExporter` that is never explicitly closed. On Windows, files cannot be deleted while they're open, causing `shutil.rmtree` to fail with `PermissionError: [WinError 32]` during the `cleanup_spans` fixture teardown. **Fix:** Added `trace.get_tracer_provider().shutdown()` in the `ray_serve_with_tracing` fixture teardown to properly flush and close the span exporter's file handles before the cleanup fixture attempts to delete the spans directory. --------- Signed-off-by: doyoung <doyoung@anyscale.com>
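The fix described above boils down to closing the exporter's file handle before the cleanup fixture deletes the directory. Here is a self-contained sketch with a stand-in exporter class (the real fix calls `trace.get_tracer_provider().shutdown()` on the OpenTelemetry provider):

```python
import os
import shutil
import tempfile


class FileSpanExporter:
    """Stand-in for a span exporter that writes to an open file handle."""

    def __init__(self, path: str):
        self._f = open(path, "w")

    def export(self, span: str) -> None:
        self._f.write(span + "\n")

    def shutdown(self) -> None:
        # What the provider's shutdown() ultimately triggers: flush + close.
        self._f.close()


spans_dir = tempfile.mkdtemp()
exporter = FileSpanExporter(os.path.join(spans_dir, "spans.txt"))
exporter.export("demo-span")

exporter.shutdown()       # close handles BEFORE cleanup...
shutil.rmtree(spans_dir)  # ...so rmtree cannot hit WinError 32 on Windows
```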
### Why are these changes needed?
When `fit()` is called multiple times on a `Preprocessor`, the `stats_`
dict was not being reset before computing new stats. This caused stale
stats from previous `fit()` calls to persist when stat keys are
data-dependent (e.g., when columns are auto-detected from the data).
This violates the documented behavior in the `fit()` docstring:
> Calling it more than once will overwrite all previously fitted state:
> `preprocessor.fit(A).fit(B)` is equivalent to `preprocessor.fit(B)`.
**Example of the bug:**
```python
# Preprocessor that auto-detects columns from data
preprocessor = DataDependentPreprocessor()
# Dataset A has columns: a, b
dataset_a = ray.data.from_items([{"a": 1.0, "b": 10.0}, ...])
preprocessor.fit(dataset_a)
# stats_ = {"mean(a)": 2.0, "mean(b)": 20.0}
# Dataset B has columns: b, c (no "a" column)
dataset_b = ray.data.from_items([{"b": 100.0, "c": 1000.0}, ...])
preprocessor.fit(dataset_b)
# BUG: stats_ = {"mean(a)": 2.0, "mean(b)": 200.0, "mean(c)": 2000.0}
# "mean(a)" is STALE - it should not exist!
# EXPECTED: stats_ = {"mean(b)": 200.0, "mean(c)": 2000.0}
```
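The fix can be sketched with a toy preprocessor (illustrative, not Ray's `Preprocessor` class): clearing `stats_` at the start of every `fit()` makes `fit(A).fit(B)` equivalent to `fit(B)`, as the docstring promises.

```python
class MeanPreprocessor:
    """Toy stand-in for a preprocessor whose stat keys depend on the data."""

    def __init__(self):
        self.stats_ = {}

    def fit(self, dataset: dict) -> "MeanPreprocessor":
        self.stats_ = {}  # the fix: reset before computing new stats
        for col, vals in dataset.items():
            self.stats_[f"mean({col})"] = sum(vals) / len(vals)
        return self


p = MeanPreprocessor()
p.fit({"a": [1.0, 3.0], "b": [10.0, 30.0]})
p.fit({"b": [100.0, 300.0], "c": [1000.0, 3000.0]})

# No stale "mean(a)" survives the second fit.
assert p.stats_ == {"mean(b)": 200.0, "mean(c)": 2000.0}
```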
---------
Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
Signed-off-by: xgui <xgui@anyscale.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: sampan <sampan@anyscale.com> Co-authored-by: sampan <sampan@anyscale.com>
…project#60072) Signed-off-by: dayshah <dhyey2019@gmail.com> Signed-off-by: Dhyey Shah <dhyey2019@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…ay-project#60037) ## Description As mentioned in ray-project#59740 (comment), add explicit args in `_AutoscalingCoordinatorActor` constructor to improve maintainability. ## Related issues Follow-up: ray-project#59740 ## Additional information - Pass in mock function in testing as args rather than using `patch` --------- Signed-off-by: machichima <nary12321@gmail.com> Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
…ay-project#60028) Capture the install script content in BuildContext digest by inlining it as a constant and adding install_python_deps_script_digest field. This ensures build reproducibility when the script changes. Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com> Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
the "test-rules" test job was missing the forge dependency Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
This migrates ray wheel builds from a CLI-based approach to wanda-based container builds for x86_64. Changes: - Add ray-wheel.wanda.yaml and Dockerfile for wheel builds - Update build.rayci.yml wheel steps to use wanda - Add wheel upload steps that extract from wanda cache Topic: ray-wheel Signed-off-by: andrew <andrew@anyscale.com>
ray-project#60114) …eed up iter_batches (ray-project#58467)" This reverts commit 2a042d4. ## Description Reverts #58467 Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
…ctors (ray-project#59850) Signed-off-by: dragongu <andrewgu@vip.qq.com>
Added a notebook that demonstrates a Serve application that takes a reference to a video as input and returns scene changes, tags, and a video description (from the corpus). https://anyscale-ray--59859.com.readthedocs.build/en/59859/serve/tutorials/video-analysis/README.html --------- Signed-off-by: abrar <abrar@anyscale.com>
…ject#60109) Follow up to ray-project#52573 (comment) --------- Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu> Co-authored-by: iamjustinhsu <140442892+iamjustinhsu@users.noreply.github.com>
ray-project#60076) Signed-off-by: dayshah <dhyey2019@gmail.com>
they are not required for test orchestration, as rayci can properly track dependencies now. Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Opening draft PR for Ray technical charter. Planning to add GitHub usernames before merging. --------- Signed-off-by: Robert Nishihara <rkn@anyscale.com> Signed-off-by: Philipp Moritz <pcmoritz@gmail.com> Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
There isn't really a need to have a RAY_CHECK on the event exporter, as there isn't an important correctness invariant here. One call will succeed. We already take some measure of caution here with a mutex in the event recorder, but RAY_CHECK-ing right after the mutex is just asking for trouble. --------- Signed-off-by: zac <zac@anyscale.com>
Currently seeing issues where crane is not available in the uploading environment. Default to Docker if crane is not available. https://buildkite.com/ray-project/postmerge/builds/15375/steps/canvas?jid=019bb99d-6f9e-45fa-92e3-a5a1d9373e8d#019bb99d-6f9e-45fa-92e3-a5a1d9373e8d/L198 Topic: crane-fix Signed-off-by: andrew <andrew@anyscale.com>
…ay-project#59991) Updating lock files for images and using relative paths in buildkite configs; moving base extra test deps lock files from the ray_release path to python/deplocks/base_extra_testdeps. Release test run: https://buildkite.com/ray-project/release/builds/74936 --------- Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
…ct#59969) The ray-cpp wheel contains only C++ headers, libraries, and executables with no Python-specific code. Previously we built 4 identical wheels (one per Python version: cp310, cp311, cp312, cp313), wasting CI time and storage. This change produces a single wheel tagged py3-none-manylinux2014_* that works with any Python 3.x version. Changes: - Add ray-cpp-core.wanda.yaml and Dockerfile for cpp core - Add ray-cpp-wheel.wanda.yaml for cpp wheel builds - Add ci/build/build-ray-cpp-wheel.sh for Python-agnostic wheel builds - Add RayCppBdistWheel class to setup.py that forces py3-none tags (necessary because BinaryDistribution.has_ext_modules() causes bdist_wheel to use interpreter-specific ABI tags by default) - Update ray-cpp-wheel.wanda.yaml to build single wheel per architecture - Update .buildkite/build.rayci.yml to remove Python version matrix for cpp wheel build/upload steps Topic: ray-cpp-wheel Relative: ray-wheel --------- Signed-off-by: andrew <andrew@anyscale.com>
Signed-off-by: cristianjd <cristian.j.derr@gmail.com>
The governance information is now integrated into the contributor documentation at doc/source/ray-contribute/getting-involved.rst:399-437, making it easily discoverable for community members interested in advancing their involvement in the Ray project. --------- Signed-off-by: Richard Liaw <rliaw@berkeley.edu> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Robert Nishihara <robertnishihara@gmail.com>
## Description Completing the fixed-size array namespace operations ## Related issues Related to ray-project#58674 ## Additional information --------- Signed-off-by: 400Ping <fourhundredping@gmail.com> Signed-off-by: Ping <fourhundredping@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
## Description This should say `False` ## Related issues ## Additional information --------- Signed-off-by: iamjustinhsu <jhsu@anyscale.com> Signed-off-by: iamjustinhsu <140442892+iamjustinhsu@users.noreply.github.com> Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu>
…ct#60222) The actor repr name is _only_ used in the task receiver when replying to the `PushTask` RPC for an actor creation task. This makes it one of the task execution outputs instead of a stateful field. I've opted to make it an outparam for the core worker task execution callback as well, rather than adding a custom method for it. My meta goal is to make the logic that handles a task execution result in the task receiver fully stateless. --------- Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…eue` (ray-project#60538) ray-project#60017 and ray-project#60228 refactored the `FIFOBundleQueue` interface and renamed `FIFOBundleQueue.popleft` to `FIFOBundleQueue.get_next`. However, this name change wasn't reflected in the `UnionOperator` implementation, and as a result the operator can error when it clears its output queue. This change also fixes the flaky `test_union.py`. Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
…alues (ray-project#60488) ## Description This PR improves numerical stability in preprocessor scalers (`StandardScaler` and `MinMaxScaler`) by extending division-by-zero handling to also cover near-zero values. **Current behavior:** The scalers only check for exact zero values (e.g., `std == 0` or `diff == 0`), which can lead to numerical instability when dealing with near-zero values (e.g., `std = 1e-10`). This is a common edge case in real-world data preprocessing where columns have extremely small variance or range. **Changes made:** - Added `_EPSILON = 1e-8` constant to define near-zero threshold (following sklearn's approach) - Updated `StandardScaler._transform_pandas()` and `_scale_column()` to use `< _EPSILON` instead of `== 0` - Updated `MinMaxScaler._transform_pandas()` similarly - Added comprehensive test cases covering near-zero and exact-zero edge cases **Impact:** This change prevents numerical instability (NaN/inf values) when scaling columns with very small but non-zero variance/range, while maintaining backward compatibility for normal use cases. ## Related issues Addresses TODO comments in `python/ray/data/preprocessors/scaler.py`: - Line 117: `# TODO: extend this to handle near-zero values.` - Line 271: `# TODO: extend this to handle near-zero values.` ## Additional information ### Implementation Details **Epsilon Value Selection:** The threshold `_EPSILON = 1e-8` was chosen to align with industry-standard practices (e.g., sklearn, numpy). This value effectively handles floating-point precision issues without incorrectly treating legitimate small variances as zero. **Modified Methods:** 1. `StandardScaler._transform_pandas()` - Pandas transformation path 2. `StandardScaler._scale_column()` - PyArrow transformation path 3.
`MinMaxScaler._transform_pandas()` - Pandas transformation path **Backward Compatibility:** ✓ For normal data (variance/range > 1e-8), behavior is **identical** to before ✓ Only triggers new logic for extreme edge cases (variance/range < 1e-8) ✓ All existing tests pass without modification ### Test Coverage Added three new test cases: 1. `test_standard_scaler_near_zero_std()` - Tests data with std ≈ 4.7e-11 2. `test_min_max_scaler_near_zero_range()` - Tests data with range ≈ 1e-10 3. `test_standard_scaler_exact_zero_std()` - Regression test for exact zero case Signed-off-by: slfan1989 <slfan1989@apache.org>
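The guard described above can be sketched in plain Python (the helper name is illustrative, not Ray's internal method): a standard deviation below epsilon is replaced with 1.0 so scaling never divides by a (near-)zero value.

```python
_EPSILON = 1e-8  # near-zero threshold, as in the PR above


def standardize(values, mean, std):
    """Scale values to zero mean, guarding against near-zero std."""
    divisor = std if std >= _EPSILON else 1.0
    return [(v - mean) / divisor for v in values]


# std ~ 1e-10: values collapse to 0.0 instead of blowing up to inf/NaN.
assert standardize([2.0, 2.0, 2.0], 2.0, 1e-10) == [0.0, 0.0, 0.0]

# Normal data (std >= epsilon) is scaled exactly as before.
assert standardize([1.0, 3.0], 2.0, 1.0) == [-1.0, 1.0]
```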
…ject#60479) ## Description Add type annotations to Ray's annotation decorators so type checkers can properly infer return types through decorated functions. Before this change, decorators like `@PublicAPI` caused type checkers to lose function signature information. After this change, decorated functions retain their full type signatures. ## Related issues Related to ray-project#59303 ## Additional information Running pyrefly with ray was complaining when calling take_all() which led me down this rabbit hole. I tried to add annotations to all the public facing decorators I could find that had reasonably clear fixes. I did some drive-by type fixes in annotations.py to make it fully pass --------- Signed-off-by: Julian Meyers <Julian@MeyersWorld.com>
## Description Moved arrow_utils.py to a direct subpackage of `ray.data.util`. ## Related issues Closes ray-project#60420 ## Additional information Moved the file to the `ray.data` subpackage and modified import paths. A minor readability issue. --------- Signed-off-by: Hyunoh-Yeo <hyunoh.yeo@gmail.com> Signed-off-by: Hyunoh Yeo <113647638+Hyunoh-Yeo@users.noreply.github.com>
… behavior (ray-project#60394) ## Summary This PR fixes a startup crash when running `ray start --head --no-redirect-output` (and the same flag in KubeRay-generated `ray start` commands). The CLI previously routed this option through a deprecated `RayParams.redirect_output` parameter, which raises a `DeprecationWarning` as an exception and prevents Ray from starting. The PR also corrects the effective behavior of `--no-redirect-output` by using the supported mechanism (`RAY_LOG_TO_STDERR=1`) to disable log redirection. ## Description ### What happened - The CLI option `--no-redirect-output` was mapped to `RayParams.redirect_output`. - `RayParams._check_usage()` raises `DeprecationWarning("The redirect_output argument is deprecated.")` whenever `redirect_output` is not `None`, which terminates `ray start`. - Additionally, the previous mapping effectively inverted intent by setting `redirect_output=True` when `--no-redirect-output` was provided. ### What was expected to happen - `ray start --no-redirect-output` should **not crash**. - It should disable redirecting non-worker stdout/stderr into `.out/.err` files (i.e., logs should go to stderr/console), consistent with the flag name and help text. ### What this PR changes - Stop passing the deprecated `redirect_output` argument into `RayParams` from the `ray start` CLI. - When `--no-redirect-output` is set, configure the supported behavior by setting `RAY_LOG_TO_STDERR=1`. - This leverages the existing fallback logic in `Node.should_redirect_logs()` which checks `RAY_LOG_TO_STDERR` when `RayParams.redirect_output` is `None`. ### Testing <img width="1280" height="468" alt="image" src="https://github.com/user-attachments/assets/6eb32b2e-80fa-4c05-b308-1700e92b1efb" /> ## Related issues Closes ray-project#60367 --------- Signed-off-by: daiping8 <dai.ping88@zte.com.cn>
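The new mapping can be sketched as follows (the helper name is illustrative; the real CLI wires this through click options, but the env var `RAY_LOG_TO_STDERR` is the supported mechanism named above):

```python
def configure_log_redirection(no_redirect_output: bool, env: dict) -> None:
    """Map the CLI flag to the supported env var instead of the
    deprecated RayParams.redirect_output argument."""
    if no_redirect_output:
        # Node.should_redirect_logs() falls back to this env var when
        # RayParams.redirect_output is None.
        env["RAY_LOG_TO_STDERR"] = "1"


env = {}
configure_log_redirection(True, env)
assert env == {"RAY_LOG_TO_STDERR": "1"}

# Without the flag, nothing is set and the default redirection applies.
env2 = {}
configure_log_redirection(False, env2)
assert env2 == {}
```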
…ray-project#60526) ## Description Currently we use `get_browsers_no_post_put_middleware` to block PUT/POST requests from browsers since these endpoints are not intended to be called from a browser context (e.g., via DNS rebinding or CSRF). However, DELETE methods were not blocked, allowing browser-based requests to delete jobs or shut down Serve applications. This PR switches from a blocklist (POST/PUT) to an allowlist (GET/HEAD/OPTIONS) approach, ensuring only explicitly safe methods are permitted from browsers. This also covers PATCH and any future HTTP methods by default. --------- Signed-off-by: sampan <sampan@anyscale.com> Co-authored-by: sampan <sampan@anyscale.com>
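The allowlist idea reduces to a simple predicate (names here are illustrative, not the dashboard middleware's API): anything not explicitly safe is rejected, which covers DELETE, PATCH, and any future verbs without further changes.

```python
# Methods considered safe to accept from a browser context.
SAFE_BROWSER_METHODS = frozenset({"GET", "HEAD", "OPTIONS"})


def is_browser_method_allowed(method: str) -> bool:
    """Allowlist check: reject-by-default instead of blocklisting."""
    return method.upper() in SAFE_BROWSER_METHODS


assert is_browser_method_allowed("GET")
assert not is_browser_method_allowed("DELETE")  # previously slipped through
assert not is_browser_method_allowed("PATCH")   # covered by default now
```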
…ject#60502) Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
…ct#60536) Used to consolidate MANYLINUX_VERSION. In the future, rayci.env will also be used to consolidate RAY_VERSION and other related fields. Signed-off-by: andrew <andrew@anyscale.com>
…kerfiles (ray-project#60386) - Add --mount=type=cache to ray-core and ray-java Dockerfiles - Update ray-cpp-core to use shared cache ID (ray-bazel-cache-${HOSTTYPE}) - Configure Bazel repository cache inside the mount for faster dependency resolution - Auto-disable remote cache uploads when BUILDKITE_BAZEL_CACHE_URL is empty, preventing 403 errors on local builds without AWS credentials All python-agnostic images now share the same Bazel cache per architecture, maximizing cache reuse while preventing cross-architecture toolchain conflicts. Signed-off-by: andrew <andrew@anyscale.com>
Signed-off-by: abrar <abrar@anyscale.com>
agent-fix: some misc typos and grammar in docs (+ a batch-llm var name). Feel free to ignore if too trivial. Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
…tric name (ray-project#60481) ## Description Fixes the broken **"Task Completion Time Without Backpressure"** metrics chart in the Ray Data Grafana dashboard. The panel was querying `ray_data_task_completion_time_without_backpressure`, which no longer exists. PR ray-project#57788 renamed the underlying metric to `task_completion_time_excl_backpressure_s` in `op_runtime_metrics.py`, but the Grafana panel in `data_dashboard_panels.py` was not updated. This PR updates the panel's Prometheus `expr` to use `ray_data_task_completion_time_excl_backpressure_s` so the chart displays data again. **Change:** Single-line fix in `data_dashboard_panels.py`: replace the old metric name with the correct one in the panel's `expr`. The formula (average task completion time excluding backpressure over a 5-minute window) is unchanged. ## Related issues Fixes the regression from ray-project#57788 (metric rename). Related to Ray Data monitoring / dashboard. Closes: ray-project#60163 ## Additional information - **Metric flow:** `op_runtime_metrics.task_completion_time_excl_backpressure_s` → Stats uses `data_{name}` → Metrics agent adds `ray_` namespace → **`ray_data_task_completion_time_excl_backpressure_s`** - **Manual verification:** Run a Ray Data job with Grafana + Prometheus (see [cluster metrics](https://docs.ray.io/en/latest/cluster/metrics.html)), then confirm the "Task Completion Time Without Backpressure" panel shows data. Signed-off-by: kriyanshii <kriyanshishah06@gmail.com>
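The fix amounts to a metric-name substitution inside the panel's `expr` string. The query shape below is a hypothetical stand-in, not the dashboard's exact expression; only the two metric names come from the PR above.

```python
# Stale name (pre ray-project#57788) and the current name it became.
OLD = "ray_data_task_completion_time_without_backpressure"
NEW = "ray_data_task_completion_time_excl_backpressure_s"

# Hypothetical panel expr using the stale metric name.
panel_expr = f"avg({OLD})"

# The one-line fix: swap in the renamed metric.
fixed_expr = panel_expr.replace(OLD, NEW)

assert NEW in fixed_expr
assert OLD not in fixed_expr
```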
…ces (ray-project#60470) ## Description This PR revisits `ReorderingBundleQueue` to move pointer advancements from `get_next_inner` and `finalize` into the `has_next` method to guarantee that the queue will not get stuck with any sequence of operations. Currently, `ReorderingBundleQueue` could still get stuck in the case of the sequence captured in `test_ordered_queue_getting_stuck`. The queue is guaranteed to traverse all bundles so long as all keys are finalized (i.e., tasks finished). --------- Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
## Description This removes the requirement that pipelines containing `Sort` operations set `preserve_order=True`. This is an unnecessarily strict requirement that has adverse side-effects, and it is strictly not required because no global ordering between the blocks is established. --------- Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
…ject#60544) Follow up from: ray-project#60526 (comment) --------- Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…ct#60529) ## Description This PR updates logical operators and logical rules to consistently access input dependencies via the input_dependencies property instead of the internal _input_dependencies field. This is the first step in the split plan for Issue ray-project#60312 and keeps physical operators out of scope. To keep reviews small, we're splitting the work into stacked PRs: 1. PR that just replaces references to _input_dependencies with input_dependencies (this PR) 2. PR that just renames operator attributes so they don't have a leading underscore 3. PR that just removes LogicalOperator.output_dependencies (physical is out of scope) 4. PR that converts operators to frozen dataclasses (ideally avoiding object.__setattr__ / super().__init__) This PR implements the first step in the planned four-PR split. ## Related issues Related to ray-project#60312. ## Additional information - Scope: logical operators + logical rules only (no physical operator changes). - Updated operator classes: AllToAll, NAry, OneToOne. - Updated rules: limit_pushdown, operator_fusion, predicate_pushdown. - No behavior changes intended; this is a refactor to unify access through the public property. Signed-off-by: yaommen <myanstu@163.com>
…st_stage (ray-project#60299) Signed-off-by: Yu Chen <yuchen.ecnu@gmail.com>
…#60558) ## Description This PR fixes pydoclint documentation linting errors (DOC101 and DOC103) in `python/ray/data/read_api.py`. These errors occur when function signatures and docstrings are inconsistent, which can confuse users reading the API documentation. The fixes ensure that: - All `**kwargs` parameters are properly documented with `**` prefix - Missing parameters are documented in docstrings - Parameter names match exactly between function signatures and docstrings - Typos in parameter names are corrected ## Related issues Fixes ray-project#60545. Fixes pydoclint DOC101 (missing arguments) and DOC103 (argument mismatch) violations in read_api.py. ## Additional information ### Changes made: **1. Fixed `**kwargs` parameter documentation format:** - `read_datasource`: `read_args` → `**read_args` - `read_mongo`: `mongo_args` → `**mongo_args` - `read_parquet`: `arrow_parquet_args` → `**arrow_parquet_args` - `read_json`: `arrow_json_args` → `**arrow_json_args` - `read_csv`: `arrow_csv_args` → `**arrow_csv_args` - `read_numpy`: `numpy_load_args` → `**numpy_load_args` **2. Added missing parameter documentation:** - `read_audio`: Added `shuffle` parameter documentation - `read_videos`: Added `shuffle` and `override_num_blocks` parameter documentation - `read_bigquery`: Added `query` parameter documentation - `read_text`: Added `drop_empty_lines` parameter documentation **3. Fixed typos:** - `read_videos`: Fixed `include_timestmaps` → `include_timestamps` --------- Signed-off-by: slfan1989 <slfan1989@apache.org> Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu> Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
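For illustration, the `**kwargs` documentation pattern looks like this (a simplified stand-in for the real `read_datasource` signature, not Ray's actual function body): pydoclint requires the docstring entry to carry the same `**` prefix as the signature.

```python
def read_datasource(datasource, **read_args):
    """Read a dataset from a datasource.

    Args:
        datasource: The datasource to read from.
        **read_args: Keyword arguments forwarded to the datasource. Note
            the leading ``**``: without it, pydoclint reports a
            DOC101/DOC103 signature-docstring mismatch.
    """
    return datasource, dict(read_args)
```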
…ray-project#57694) ## Why are these changes needed? This PR adds the `label_selector` option to the supported list of Actor options for a Serve deployment. Additionally, we add `bundle_label_selector` to specify label selectors for bundles when `placement_group_bundles` are specified for the deployment. These two options are already supported for Tasks/Actors and placement groups respectively. Example use case: ``` llm_config = LLMConfig( model_loading_config={ "model_id": "meta-llama/Meta-Llama-3-70B-Instruct", "model_source": "huggingface", }, engine_kwargs=tpu_engine_config, resources_per_bundle={"TPU": 4}, runtime_env={"env_vars": {"VLLM_USE_V1": "1"}}, deployment_config={ "num_replicas": 4, "ray_actor_options": { # In a GKE cluster with multiple TPU node-pools, schedule # only to the desired slice. "label_selector": { "ray.io/tpu-topology": "4x4" # added by default by Ray } } } ) ``` The expected behaviors of these new fields are as follows: **Pack scheduling enabled** ---------------------------------------- **PACK/STRICT_PACK PG strategy:** - Standard PG without bundle_label_selector or fallback: - Sorts replicas by resource size (descending). Attempts to find the "best fit" node (minimizing fragmentation) that has available resources. Creates a Placement Group on that target node. - PG node label selector provided: - Same behavior as regular placement group but filters the list of candidate nodes to only those matching the label selector before finding the best fit - PG node label selector and fallback: Same as above but when scheduling tries the following: 1.
Tries to find a node matching the primary placement_group_bundles and bundle_label_selector. 2. If no node fits, iterates through the placement_group_fallback_strategy. For each fallback entry, tries to find a node matching that entry's bundles and labels. 3. If a node is found, creates a PG on it. **SPREAD/STRICT_SPREAD PG strategy:** - If any deployment uses these strategies, the global logic falls back to "Spread Scheduling" (see below) **Spread scheduling enabled** ---------------------------------------- - Standard PG without bundle_label_selector or fallback: - Creates a Placement Group via Ray Core without specifying a target_node_id. Ray Core decides placement based on the strategy. - PG node label selector provided: - Serve passes the bundle_label_selector to the CreatePlacementGroupRequest. Ray Core handles the soft/hard constraint logic during PG creation. - PG node label selector and fallback: - Serve passes the bundle_label_selector to the CreatePlacementGroupRequest, fallback_strategy is not yet supported in the placement group options so this field isn't passed / considered. It's only used in the "best fit" node selection logic which is skipped for Spread scheduling. ## Related issue number ray-project#51564 ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run pre-commit jobs to lint the changes in this PR. ([pre-commit setup](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting)) - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [x] I've made sure the tests are passing. 
Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [x] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Ryan O'Leary <ryanaoleary@google.com> Signed-off-by: ryanaoleary <ryanaoleary@google.com> Signed-off-by: Ryan O'Leary <113500783+ryanaoleary@users.noreply.github.com> Co-authored-by: Cindy Zhang <cindyzyx9@gmail.com> Co-authored-by: Abrar Sheikh <abrar2002as@gmail.com> Co-authored-by: Mengjin Yan <mengjinyan3@gmail.com>
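The pack-scheduling flow described above (filter candidate nodes by label selector, then pick the best-fit node, then try fallbacks) can be sketched in plain Python. This is an illustrative model, not Serve's actual scheduler: the names `select_target_node` and `matches_labels` and the data shapes are assumptions made for the example.

```python
def matches_labels(node_labels, selector):
    # A node is a candidate only if every selector entry matches its labels.
    return all(node_labels.get(k) == v for k, v in selector.items())


def select_target_node(nodes, required, label_selector=None, fallbacks=()):
    """Pick the node that fits `required` with the least leftover capacity.

    `nodes` maps node_id -> {"labels": {...}, "available": {"CPU": ...}}.
    Tries the primary label selector first, then each fallback in order.
    """
    for selector in (label_selector, *fallbacks):
        # Step 1: filter candidates to nodes matching the current selector.
        candidates = [
            (node_id, info) for node_id, info in nodes.items()
            if selector is None or matches_labels(info["labels"], selector)
        ]
        # Step 2: keep only nodes with enough available resources.
        fitting = [
            (node_id, info) for node_id, info in candidates
            if all(info["available"].get(r, 0) >= amt for r, amt in required.items())
        ]
        if fitting:
            # "Best fit": minimize total leftover resources (fragmentation).
            return min(
                fitting,
                key=lambda kv: sum(
                    kv[1]["available"].get(r, 0) - amt
                    for r, amt in required.items()
                ),
            )[0]
    return None  # no candidate found; caller fails or falls back further
```

With the example above, a replica requesting `{"TPU": 4}` under selector `{"ray.io/tpu-topology": "4x4"}` would land on the matching node with the least spare TPU capacity.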
ray-project#60482) Fixes ray-project#58851

### Changes

1. **New `gRPCStatusError` exception class** - Wraps exceptions with user-set gRPC status codes so they flow through Ray's error handling path.
2. **Exception wrapping in replica methods** - `handle_request`, `handle_request_streaming`, and `handle_request_with_rejection` now wrap exceptions with `gRPCStatusError` when the user has set a status code on the gRPC context.
3. **Status code preservation in proxy** - `get_grpc_response_status()` now detects `gRPCStatusError` and returns the user's intended status code instead of `INTERNAL`.
4. **Message truncation** - Added `_truncate_message()` to limit error details to 4KB, avoiding HTTP/2 trailer size limits.
5. **Documentation updates** - Updated the gRPC guide to document the new behavior.

---------

Signed-off-by: abrar <abrar@anyscale.com>
β¦ay-project#60569)

## Description

The autoscaling validation warning was incorrectly raised for fixed-size actor pools (`min_size == max_size`). These pools don't scale up, so the warning doesn't apply.

## Related issues

Context: ray-project#60477 (comment)

## Additional information

After this change, when we run `python -m pytest -v -s test_vllm_engine_proc.py::test_generation_model`, we no longer observe autoscaling warnings in the log.

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
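The shape of the fix can be sketched as a guard at the top of the validation. This is a simplified stand-in, not the actual Ray Data code: the function name and the specific warning condition shown here are invented for illustration; only the `min_size == max_size` early return reflects the described change.

```python
import warnings


def validate_pool_autoscaling(min_size, max_size, max_tasks_in_flight):
    """Warn about autoscaling limits only when the pool can actually scale."""
    if min_size == max_size:
        # Fixed-size pool: it never scales up, so the warning is just noise.
        return
    # Hypothetical condition standing in for the real validation logic.
    if max_tasks_in_flight is not None and max_tasks_in_flight <= 1:
        warnings.warn(
            "Low max_tasks_in_flight may prevent the actor pool from "
            "scaling up effectively."
        )
```

The key design point is checking pool elasticity before any autoscaling heuristics, so fixed-size pools exit early.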
## Description

- Remove the outdated AIR library code from Ray Data
- Update some old usage from `ray.air` to `ray.data`

---------

Signed-off-by: You-Cheng Lin <mses010108@gmail.com>
Code Review
This pull request is an automated daily merge from master to main, containing a wide variety of changes. The most significant changes include a major refactoring of the CI/CD pipeline to a more modular, wanda-based system, updates to Python and library versions (e.g., dropping Python 3.9 support in some areas), and extensive documentation improvements. The CI refactoring appears to enhance caching and multi-architecture support. The documentation has been significantly improved for clarity, accuracy, and completeness across many components. I've identified one potential issue with a new CI rule that seems overly broad and could lead to CI inefficiency. Overall, the changes are positive and well-structured.
```
*
@ ml tune train data serve
@ core_cpp cpp java python doc
@ linux_wheels macos_wheels dashboard tools release_tests
;
```
The new wildcard rule at the end of this file seems overly broad. It applies a large number of tags (ml, tune, train, data, serve, core_cpp, cpp, java, python, doc, linux_wheels, macos_wheels, dashboard, tools, release_tests) to any file that doesn't match a more specific rule. This could lead to a significant number of unnecessary CI jobs being triggered for minor or unrelated changes (e.g., a typo fix in a non-code file).
While this might be intended as a conservative fallback, it could also be a source of CI inefficiency. Consider making this default rule more restrictive, or splitting it into smaller, more targeted fallback rules if possible.
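To see why the fallback is broad, consider a minimal first-match model of rule evaluation. This is not rayci's actual matcher; the rule set below is invented to mirror the snippet above, and `tags_for` simply returns the tags of the first glob that matches a changed file's path.

```python
from fnmatch import fnmatch

# Illustrative rules modeled on the snippet above: (glob pattern, tags).
# The final "*" entry is the broad fallback under discussion.
RULES = [
    ("python/ray/data/*", {"data"}),
    ("doc/*", {"doc"}),
    ("*", {"ml", "tune", "train", "data", "serve", "core_cpp", "cpp",
           "java", "python", "doc", "linux_wheels", "macos_wheels",
           "dashboard", "tools", "release_tests"}),
]


def tags_for(path, rules=RULES):
    """Return the tag set of the first rule whose glob matches `path`."""
    for pattern, tags in rules:
        if fnmatch(path, pattern):
            return tags
    return set()
```

Under this model, a change to `python/ray/data/dataset.py` triggers only `data` jobs, while an unmatched file such as a top-level `README.md` falls through to the wildcard and triggers every tag, which is the inefficiency flagged above.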
This Pull Request was created automatically to merge the latest changes from the `master` branch into the `main` branch.

- Created: 2026-01-29
- Merge direction: `master` → `main`
- Triggered by: Scheduled
Please review and merge if everything looks good.