Daily merge: master → main 2026-01-21 #753
Conversation
…t#58695) ## Description This PR adds a new documentation page, Head Node Memory Management, under the Ray Core advanced topics section. ## Related issues Closes ray-project#58621 ## Additional information <img width="2048" height="1358" alt="image" src="https://github.com/user-attachments/assets/3b98150d-05e6-4d15-9cd3-7e05e82ff516" /> <img width="2048" height="498" alt="image" src="https://github.com/user-attachments/assets/4ec8fe43-e3a5-4df4-bca7-376ae407c77b" /> --------- Signed-off-by: Dongjun Na <kmu5544616@gmail.com> Co-authored-by: Ibrahim Rabbani <israbbani@gmail.com> Co-authored-by: Rueian <rueiancsie@gmail.com> Co-authored-by: Ibrahim Rabbani <irabbani@anyscale.com>
…y-project#59845) - [x] Update the docstring for `ray.shutdown()` in `python/ray/_private/worker.py` to clarify: - When connecting to a remote cluster via `ray.init(address="xxx")`, `ray.shutdown()` only disconnects the client and does NOT terminate the remote cluster - Only local clusters started by `ray.init()` will have their processes terminated by `ray.shutdown()` - Clarified that `ray.init()` without address argument will auto-detect existing clusters - [x] Add documentation note to `doc/source/ray-core/starting-ray.rst` explaining the same behavior difference - [x] Review the changes via code_review - [x] Run codeql_checker for security scan (no code changes requiring analysis) --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
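A minimal sketch of the behavior difference the updated docstring describes (the address value is illustrative):

```python
import ray

# Connecting to an existing cluster: ray.shutdown() only disconnects this
# driver; the remote cluster keeps running.
ray.init(address="auto")
ray.shutdown()

# No address: ray.init() auto-detects an existing cluster or starts a local
# one; a local cluster started this way is terminated by ray.shutdown().
ray.init()
ray.shutdown()
```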
## Description Upgrade the CUDA base GPU image from 11.8 to 12.8.1. This is required for future py3.13 dependency upgrades. --------- Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com> Co-authored-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
…roject#59735) ## Description ### Problem Using `--entrypoint-resources '{"fragile_node":"!1"}'` with the Job API raises an error saying only numeric values are allowed. ### Expected behavior `--entrypoint-resources` should accept label selectors just like ray.remote/PlacementGroups, so entrypoints can target or avoid nodes with specific labels. ## Related issues Fixes ray-project#58662. ## Additional information ### Implementation approach - Relax `JobSubmitRequest.entrypoint_resources` validation to allow string values (`python/ray/dashboard/modules/job/common.py`). - Add `_split_entrypoint_resources()` to separate numeric requests from selector strings and run them through `validate_label_selector` (`python/ray/dashboard/modules/job/job_manager.py`). - Pass numeric resources via the existing `resources` option and selector dict via `label_selector` when spawning the job supervisor, leaving the field unset if only resources were provided (`python/ray/dashboard/modules/job/job_manager.py`). - Extend CLI parsing/tests to cover string-valued resources and assert selector plumbing through the job manager (`python/ray/dashboard/modules/job/tests/test_cli.py`, `python/ray/dashboard/modules/job/tests/test_common.py`, `python/ray/dashboard/modules/job/tests/test_job_manager.py`). Signed-off-by: yaommen <myanstu@163.com>
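For illustration, a hedged sketch of how a submission with a selector string might look after this change; the selector value comes from the issue, while the endpoint and entrypoint are placeholders:

```python
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://127.0.0.1:8265")  # placeholder address
client.submit_job(
    entrypoint="python my_script.py",  # placeholder entrypoint
    # "!1" is a label-selector string; numeric values are still treated as
    # plain resource requests, per the split described above.
    entrypoint_resources={"fragile_node": "!1"},
)
```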
update with more up-to-date information, and format the markdown file a bit Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
…-project#59848)

# Fix StreamingRepartition hang with empty upstream results

## Summary
Fix a bug where `StreamingRepartitionRefBundler` would hang when processing empty datasets (0 rows).

## Problem
When upstream operations (e.g., `filter`, `map`, etc.) produce an empty result (0 rows), the resulting empty `RefBundle` gets added to `_pending_bundles` but never gets flushed because:
1. `add_bundle()` adds empty bundles (0 rows) to `_pending_bundles`
2. `_total_pending_rows` remains 0
3. `done_adding_bundles()` checks `len(_pending_bundles) > 0` and calls `flush_remaining=True`
4. `_try_build_ready_bundle(flush_remaining=True)` checks `_total_pending_rows > 0` → False, so no flush happens
5. Empty bundles remain in `_pending_bundles` forever (memory leak)

## Reproduction
```python
import ray

ray.init()
ds = ray.data.range(5).filter(lambda row: row['id'] > 100)
ds = ds.repartition(target_num_rows_per_block=8)
ds.count()
```

## Solution
Changed the flush condition in `_try_build_ready_bundle()` from checking `_total_pending_rows > 0` to `len(self._pending_bundles) > 0`:
```python
# Before:
if flush_remaining and self._total_pending_rows > 0:
# After:
if flush_remaining and len(self._pending_bundles) > 0:
```
This ensures empty bundles do not remain stuck in the bundler state, preventing both hangs and memory leaks.

Signed-off-by: dragongu <andrewgu@vip.qq.com>
…uts (ray-project#59883) If you have a pipeline like `read --> [some cpu transformation] --> [gpu transformation init_concurrency=N] --> write`, the `gpu transformation` might downscale to 0 actors if the CPU transformation is slow. This basically nullifies `init_concurrency` and can cause cold-start delays. --------- Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu> Signed-off-by: You-Cheng Lin <106612301+owenowenisme@users.noreply.github.com> Co-authored-by: You-Cheng Lin <106612301+owenowenisme@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
stop using python 3.9 Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
…butes (ray-project#59894) ## Description The `StatelessCartPole` example from APPO is timing out. This could be due to the latest changes in the APPO data pipeline. This PR modifies the setup of the example by using the new APPO attributes. ## Related issues Fixes https://buildkite.com/ray-project/postmerge/builds/15188#019b8f6e-2850-465e-a98c-63c29fbf98f7/L4702 --------- Signed-off-by: simonsays1980 <simon.zehnder@gmail.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>
this avoids the need to put in the dummy no-op files, and also allows us to add env vars in the future. Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
## Description ray: handle dual task errors with read-only args - avoid writing to user-defined args when building RayTaskError hybrids - fall back to RayTaskError-only with warning if subclassing fails - add regression test covering read-only args user exceptions ## Related issues Fixes ray-project#59437
ray-project#59846) ## Description dashboard agent services such as the reporter agent and event aggregator agent do not run in minimal ray installs (`pip install ray`). this pr skips client creation (and adds an info log to guide users) when using minimal installs. ## Related issues Fixes ray-project#59665 --------- Signed-off-by: sampan <sampan@anyscale.com> Co-authored-by: sampan <sampan@anyscale.com>
## Description this pr adds auth middleware to the dashboard http agent service and configures clients to include token headers in their requests. the pr also covers passing auth headers in the state_manager runtime env agent api call, which was previously missed. --------- Signed-off-by: sampan <sampan@anyscale.com> Co-authored-by: sampan <sampan@anyscale.com>
…prising (ray-project#59390) Signed-off-by: Seiji Eicher <seiji@anyscale.com>
## Description This PR adds support in the `JaxTrainer` to schedule across multiple TPU slices using the `ray.util.tpu` public utilities. To support this, this PR adds new `AcceleratorConfig`s to the V2 scaling config, which consolidate the accelerator-related fields for TPU and GPU. When `TPUAcceleratorConfig` is specified, the JaxTrainer utilizes a `SlicePlacementGroup` to atomically reserve `num_slices` TPU slices of the desired topology, auto-detecting the required values for `num_workers` and `resources_per_worker` when unspecified. TODO: I'll add some manual testing and usage examples in the comments. ## Related issues ray-project#55162 --------- Signed-off-by: ryanaoleary <ryanaoleary@google.com> Signed-off-by: Ryan O'Leary <113500783+ryanaoleary@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…e policy (ray-project#59803) ## Description Given a typical scenario of a fast producing operator followed by a slow producing operator, how do the backpressure policy and resource allocator behave? This change just adds tests to cement the expected behavior. ## Related issues DATA-1712 --------- Signed-off-by: Goutam <goutam@anyscale.com>
This PR adds documentation for several Ray Serve environment variables that were defined in `constants.py` but missing from the documentation, and also cleans up deprecated legacy environment variable names. ### Changes Made #### Documentation additions **`doc/source/serve/production-guide/config.md`** (Proxy config section): - `RAY_SERVE_ALWAYS_RUN_PROXY_ON_HEAD_NODE` - Control whether to always run a proxy on the head node - `RAY_SERVE_PROXY_HEALTH_CHECK_TIMEOUT_S` - Proxy health check timeout - `RAY_SERVE_PROXY_HEALTH_CHECK_PERIOD_S` - Proxy health check period - `RAY_SERVE_PROXY_READY_CHECK_TIMEOUT_S` - Proxy ready check timeout - `RAY_SERVE_PROXY_MIN_DRAINING_PERIOD_S` - Minimum proxy draining period **`doc/source/serve/production-guide/fault-tolerance.md`** (New "Replica constructor retries" section): - `RAY_SERVE_MAX_PER_REPLICA_RETRY_COUNT` - Max constructor retries per replica - `RAY_SERVE_MAX_DEPLOYMENT_CONSTRUCTOR_RETRY_COUNT` - Max constructor retries per deployment **`doc/source/serve/advanced-guides/performance.md`**: - `RAY_SERVE_PROXY_PREFER_LOCAL_NODE_ROUTING` - Proxy node locality routing preference - `RAY_SERVE_PROXY_PREFER_LOCAL_AZ_ROUTING` - Proxy AZ locality routing preference - `RAY_SERVE_MAX_CACHED_HANDLES` - Max cached deployment handles (controller debugging section) **`doc/source/serve/monitoring.md`**: - `RAY_SERVE_HTTP_PROXY_CALLBACK_IMPORT_PATH` - HTTP proxy initialization callback - `SERVE_SLOW_STARTUP_WARNING_S` - Slow startup warning threshold - `SERVE_SLOW_STARTUP_WARNING_PERIOD_S` - Slow startup warning interval #### Code cleanup **`python/ray/serve/_private/constants.py`**: - Removed legacy fallback for `MAX_DEPLOYMENT_CONSTRUCTOR_RETRY_COUNT` (now only `RAY_SERVE_MAX_DEPLOYMENT_CONSTRUCTOR_RETRY_COUNT`) - Removed legacy fallback for `MAX_PER_REPLICA_RETRY_COUNT` (now only `RAY_SERVE_MAX_PER_REPLICA_RETRY_COUNT`) - Removed legacy fallback for `MAX_CACHED_HANDLES` (now only `RAY_SERVE_MAX_CACHED_HANDLES`) **`python/ray/serve/_private/constants_utils.py`**: - Removed `MAX_DEPLOYMENT_CONSTRUCTOR_RETRY_COUNT` and `MAX_PER_REPLICA_RETRY_COUNT` from the deprecated names whitelist --------- Signed-off-by: harshit <harshit@anyscale.com>
…reating (ray-project#59610) Signed-off-by: dayshah <dhyey2019@gmail.com>
## Description allow `RAY_DASHBOARD_AGGREGATOR_AGENT_PUBLISHER_HTTP_ENDPOINT_EXPOSABLE_EVENT_TYPES` to accept `ALL` so that all events are exported. This will be used by the history server. (Without this config, KubeRay needs to explicitly list each event type, which is tedious as this list may grow in the future.) --------- Signed-off-by: sampan <sampan@anyscale.com> Signed-off-by: Sampan S Nayak <sampansnayak2@gmail.com> Co-authored-by: sampan <sampan@anyscale.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
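A sketch of the usage described above, shown programmatically for illustration (the variable must be set in the aggregator agent's environment before it starts):

```python
import os

# With "ALL", every event type is exported over the HTTP endpoint instead of
# having to enumerate each type explicitly.
os.environ[
    "RAY_DASHBOARD_AGGREGATOR_AGENT_PUBLISHER_HTTP_ENDPOINT_EXPOSABLE_EVENT_TYPES"
] = "ALL"
```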
…project#59784) ## Description run state api and task event unit tests with both the default (task_event -> gcs) and aggregator (task_event -> aggregator -> gcs) flows to smooth the transition from the default to the aggregator flow --------- Signed-off-by: sampan <sampan@anyscale.com> Signed-off-by: Sampan S Nayak <sampansnayak2@gmail.com> Co-authored-by: sampan <sampan@anyscale.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com> Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com> Co-authored-by: elliot-barn <elliot.barnwell@anyscale.com>
AnyscaleJobRunner is the only implementation/child class of CommandRunner right now. There is no need to use inheritance. Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
) Add BuildContext TypedDict to capture post_build_script, python_depset, their SHA256 digests, and environment variables for custom BYOD image builds. Changes: - Add build_context.py with BuildContext TypedDict and helper functions: - make_build_context: constructs BuildContext with computed file digests - encode_build_context: deterministic minified JSON serialization - decode_build_context: JSON deserialization - build_context_digest: SHA256 digest of encoded context - Refactor build_anyscale_custom_byod_image to accept BuildContext instead of individual post_build_script and python_depset arguments - Update callers: custom_byod_build.py, ray_bisect.py - Add comprehensive unit tests Generated with [Claude Code](https://claude.com/claude-code) Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com> Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
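A rough sketch of the shape such a TypedDict and its helpers might take; the field names here are assumptions, not the actual `build_context.py` contents:

```python
import hashlib
import json
from typing import Dict, TypedDict


class BuildContext(TypedDict):
    # Field names are illustrative assumptions.
    post_build_script: str
    post_build_script_digest: str
    python_depset: str
    python_depset_digest: str
    env: Dict[str, str]


def encode_build_context(ctx: BuildContext) -> str:
    # Deterministic, minified JSON: sorted keys, no extra whitespace.
    return json.dumps(ctx, sort_keys=True, separators=(",", ":"))


def build_context_digest(ctx: BuildContext) -> str:
    # SHA256 digest of the encoded context, usable as a cache key.
    return hashlib.sha256(encode_build_context(ctx).encode("utf-8")).hexdigest()
```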
…project#59839)

# Fix `ArrowInvalid` error in checkpoint filter when converting PyArrow chunks to NumPy arrays

## Issue
Fixes `ArrowInvalid` error when checkpoint filtering converts PyArrow chunks to NumPy arrays with `zero_copy_only=True`:
```
File "/usr/local/lib/python3.10/dist-packages/ray/data/checkpoint/checkpoint_filter.py", line 249, in filter_rows_for_block
    masks = list(executor.map(filter_with_ckpt_chunk, ckpt_chunks))
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 621, in result_iterator
    yield _result_or_cancel(fs.pop())
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 319, in _result_or_cancel
    return fut.result(timeout)
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/data/checkpoint/checkpoint_filter.py", line 229, in filter_with_ckpt_chunk
    ckpt_ids = ckpt_chunk.to_numpy(zero_copy_only=True)
  File "pyarrow/array.pxi", line 1789, in pyarrow.lib.Array.to_numpy
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Needed to copy 1 chunks with 0 nulls, but zero_copy_only was True
```
This error occurs when checkpoint data is loaded from Ray's object store, where PyArrow buffers may reside in shared memory and cannot be zero-copied to NumPy.

## Reproduction
```python
#!/usr/bin/env python3
import ray
from ray.data import DataContext
from ray.data.checkpoint import CheckpointConfig
import tempfile

ray.init()

with tempfile.TemporaryDirectory() as ckpt_dir, \
     tempfile.TemporaryDirectory() as data_dir, \
     tempfile.TemporaryDirectory() as output_dir:
    # Step 1: Create data
    ray.data.range(10).map(lambda x: {"id": f"id_{x['id']}"}).write_parquet(data_dir)

    # Step 2: Enable checkpoint and write
    ctx = DataContext.get_current()
    ctx.checkpoint_config = CheckpointConfig(
        checkpoint_path=ckpt_dir,
        id_column="id",
        delete_checkpoint_on_success=False,
    )
    ray.data.read_parquet(data_dir).filter(lambda x: x["id"] != 'id_0').write_parquet(output_dir)

    # Step 3: Second write triggers checkpoint filtering
    ray.data.read_parquet(data_dir).write_parquet(output_dir)

ray.shutdown()
```

## Solution
Change `to_numpy(zero_copy_only=True)` to `to_numpy(zero_copy_only=False)` in `BatchBasedCheckpointFilter.filter_rows_for_block()`. This allows PyArrow to copy data when necessary.

### Changes
**File**: `ray/python/ray/data/checkpoint/checkpoint_filter.py`
- Line 229: Changed `ckpt_chunk.to_numpy(zero_copy_only=True)` to `ckpt_chunk.to_numpy(zero_copy_only=False)`

### Performance Impact
No performance regression expected. PyArrow will only perform a copy when zero-copy is not possible.

Signed-off-by: dragongu <andrewgu@vip.qq.com>
## Description Adds a `repr_name` field to the `actor_lifecycle_event` schema and populates it when available. ## Related issues Closes ray-project#59813 --------- Signed-off-by: sampan <sampan@anyscale.com> Signed-off-by: Sampan S Nayak <sampansnayak2@gmail.com> Co-authored-by: sampan <sampan@anyscale.com>
…y-project#59893) ## Description Fix inconsistent task name in metrics between RUNNING and FINISHED states. When a Ray task is defined with a custom name via `.options(name="custom_name")`, the `ray_tasks` metrics show inconsistent names: - **RUNNING** state: shows the original function name (e.g., `RemoteFn`) - **FINISHED/FAILED** state: shows the custom name (e.g., `test`) **Root cause:** The RUNNING task counter in `CoreWorker` uses `FunctionDescriptor()->CallString()` to get the task name, while finished task events correctly use `TaskSpecification::GetName()`. **Fix:** Changed both `HandlePushTask` and `ExecuteTask` in `core_worker.cc` to use `task_spec.GetName()` consistently, which properly returns the custom name when set. ## Related issues None - this PR addresses a newly discovered bug. ## Additional information **Files changed:** - `src/ray/core_worker/core_worker.cc` - Use `GetName()` instead of `FunctionDescriptor()->CallString()` for metrics - `python/ray/tests/test_task_metrics.py` - Added test `test_task_custom_name_metrics` to verify custom names appear correctly in metrics Signed-off-by: Yuan Jiewei <jieweihh.yuan@gmail.com> Co-authored-by: Yuan Jiewei <jieweihh.yuan@gmail.com>
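For reference, a minimal snippet that exercises the scenario described above (the custom name is arbitrary):

```python
import ray

@ray.remote
def remote_fn():
    return 1

# Before the fix, the ray_tasks metric reported "remote_fn" while RUNNING but
# "my_custom_name" once FINISHED; after the fix both states use the custom name.
ray.get(remote_fn.options(name="my_custom_name").remote())
```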
## Description update metrics export docs based on changes in ray-project#59337 --------- Signed-off-by: sampan <sampan@anyscale.com> Co-authored-by: sampan <sampan@anyscale.com>
…ray-project#59808) Adds a new RLlib algorithm TQC, which extends SAC with distributional critics using quantile regression to control Q-function overestimation bias. Key components: - TQC algorithm configuration and implementation - Default TQC RLModule with multiple quantile critics - TQC catalog for building network components - Comprehensive test suite covering compilation, simple environments, and parameter validation - Documentation --------- Signed-off-by: tk42 <nsplat@gmail.com> Co-authored-by: simonsays1980 <simon.zehnder@gmail.com>
…60304) ## Description We had a separate field in `OpState` to keep track of outputted rows. `OpRuntimeMetrics` exists per `PhysicalOperator` and also has a field to keep track of outputted rows, so there is no need to keep a duplicate in `OpState`. ## Related issues N/A ## Additional information N/A Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>
## Description This PR removes an obsolete HalfCheetah release test. ## Related issues See also: ray-project#59007 Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
## Description Currently `ray attach` only allows opening an SSH session on the head node. It could be useful to allow attaching to worker nodes to check what state the execution environment and file system are in (e.g. running `conda list`, examining config files such as `~/.keras/keras.json`). ## Related issues Closes ray-project#7064 ## Additional information This PR adds a `--node-ip` argument to `ray attach` to specify the node IP to attach to. Usage: `ray attach cluster.yaml --node-ip <node ip>`. Defaults to the head node if `--node-ip` is not provided. Added a unit test and tested on GCP (see ray-project#59931 (comment)) --------- Signed-off-by: machichima <nary12321@gmail.com>
…oject#60276) so that we are not pretending that we are fetching results or terminating jobs. Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
…in homepage (ray-project#60229) ## Summary Replaced the Ray Tune example in the homepage (`index.html`) to show vanilla Ray Tune usage instead of V1 tune+train integration. **Changes:** - Removed `ScalingConfig` and `LightGBMTrainer` imports (Ray Train components) - Added a pure Ray Tune example demonstrating: - An objective function that trains a model with hyperparameters and reports metrics - Hyperparameter search space using common Tune methods (`loguniform`, `choice`, `randint`) - Running 1000 trials with the `Tuner` API - Retrieving the best result This makes the example clearer for users who want to learn Ray Tune's hyperparameter optimization capabilities without the complexity of Ray Train integration. Signed-off-by: xgui <xgui@anyscale.com>
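A sketch along these lines, not necessarily the exact snippet added to `index.html` (the objective is a placeholder for real training):

```python
from ray import tune

def objective(config):
    # Placeholder for real training: compute a score from the sampled
    # hyperparameters and return it so Tune can compare trials.
    score = config["lr"] * config["num_layers"]
    return {"score": score}

search_space = {
    "lr": tune.loguniform(1e-4, 1e-1),
    "num_layers": tune.randint(1, 8),
    "activation": tune.choice(["relu", "tanh"]),
}

tuner = tune.Tuner(
    objective,
    param_space=search_space,
    tune_config=tune.TuneConfig(num_samples=1000),
)
results = tuner.fit()
print(results.get_best_result(metric="score", mode="max").config)
```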
if a test is not stable, it should be on manual frequency. we will no longer treat unstable tests differently. Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
…ect#60264) the alias is not used anywhere. this clears all the `__init__.py` under the `ray_release/` directory, making it consistent with other files, and easier to convert everything to idiomatic bazel Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
we have always been using a constant. if one needs more logs, they can go to anyscale's UI and view logs there. Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
…#60277) just save the sdk as a private member instead Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
…oject#60272) so that it is not going back and forth between the implementation and the abstract class, and not implemented as a property. Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
## Description Deprecate the Predictor API and its concrete subclasses: DLPredictor(Predictor), LightGBMPredictor(Predictor), TensorflowPredictor(DLPredictor), TorchPredictor(DLPredictor), XGBoostPredictor(Predictor), TorchDetectionPredictor(TorchPredictor). ## Related issues Closes ray-project#60266 ## Additional information Added `@Deprecated` annotations to the corresponding classes; a `DeprecationWarning` is emitted when the superclass constructor is called. --------- Signed-off-by: Hyunoh-Yeo <hyunoh.yeo@gmail.com> Signed-off-by: Hyunoh-Yeo <113647638+Hyunoh-Yeo@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
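A minimal sketch of the warning pattern described above (stand-in class names, not the actual Ray Train code):

```python
import warnings

class Predictor:  # stand-in for the deprecated base Predictor class
    def __init__(self, preprocessor=None):
        warnings.warn(
            "Predictor and its subclasses are deprecated.",
            DeprecationWarning,
            stacklevel=2,
        )
        self.preprocessor = preprocessor

class XGBoostPredictor(Predictor):  # concrete subclasses inherit the warning
    def __init__(self, model=None, preprocessor=None):
        super().__init__(preprocessor)  # emits the DeprecationWarning
        self.model = model
```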
…dule IDs (ray-project#60234) ### Description This PR fixes a bug in the RLlib `MultiAgentEnvRunner` where module episode return metrics were incorrectly calculated when multiple agents share the same module ID. Previously, the code was overwriting returns instead of accumulating them, leading to incorrect metrics. - Fixed module return calculation logic in `MultiAgentEnvRunner` to properly accumulate returns when multiple agents use the same module ID - Added a test case to verify that module metric returns equal the sum of agent returns assigned to that module ### Related issues Fixes ray-project#59860 ### Files modified: - `rllib/env/multi_agent_env_runner.py`: Core bug fix - `rllib/env/tests/test_multi_agent_env_runner.py`: New test case called `test_module_metrics_returns_equal_sum_of_agent_returns()` --------- Signed-off-by: Adam Kelloway <kelloway@amazon.com> Co-authored-by: Adam Kelloway <kelloway@amazon.com>
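The gist of the fix above, as a hypothetical standalone sketch (names are illustrative, not the actual RLlib code):

```python
from collections import defaultdict

agent_returns = {"agent_0": 1.5, "agent_1": 2.0}              # per-agent episode returns
agent_to_module = {"agent_0": "shared", "agent_1": "shared"}  # both map to one module

# Accumulate instead of overwrite: the shared module's return is the sum of
# the returns of all agents mapped to it.
module_returns = defaultdict(float)
for agent_id, ret in agent_returns.items():
    module_returns[agent_to_module[agent_id]] += ret

assert module_returns["shared"] == 3.5
```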
install from tarball from the official source, rather than deb. Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
…oject#60151) removing requirement file and constraint file build args from the following images: base-deps, base-extra, base-extra-test-deps, base-slim (defaulting constraints file as a build arg); defaulting PYTHON_DEPSET & CONSTRAINTS_FILE args in the dockerfile; renaming ray-llm, ray-gpu & ray base extra testdeps lock files. IMAGE_TYPE defined on the BK jobs will determine which lock file to copy to the image. hello world release test run: https://buildkite.com/ray-project/release/builds/76001# --------- Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
ray-project#59897) Signed-off-by: Aydin Abiar <aydin@anyscale.com> Signed-off-by: Aydin Abiar <62435714+Aydin-ab@users.noreply.github.com> Co-authored-by: Aydin Abiar <aydin@anyscale.com>
…netes token authentication (ray-project#59621) ## Description Per discussion from REP PR (ray-project/enhancements#63), this PR adds a server-side config `RAY_ENABLE_K8S_TOKEN_RBAC=true` to enable Kubernetes-based token authentication. This must be set in addition to `RAY_AUTH_MODE=token`. The main benefit of this change is that the server-side authentication flow becomes opaque to clients, and all clients only need to set `RAY_AUTH_MODE=token` along with their token. --------- Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
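A sketch of the resulting configuration split described above (environment variables shown programmatically; where exactly you set them depends on your deployment):

```python
import os

# Server side (e.g., the Ray head): both flags are required to turn on
# Kubernetes-backed token authentication.
os.environ["RAY_AUTH_MODE"] = "token"
os.environ["RAY_ENABLE_K8S_TOKEN_RBAC"] = "true"

# Client side: only the auth mode (plus the token itself) is needed; the
# Kubernetes-specific flow stays opaque to clients.
os.environ["RAY_AUTH_MODE"] = "token"
```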
…-project#60283)

## Summary
- Fix Ray Data's cluster autoscalers (V1 and V2) to respect user-configured `resource_limits` set via `ExecutionOptions`
- Cap autoscaling resource requests to not exceed user-specified CPU and GPU limits
- Update `get_total_resources()` to return the minimum of cluster resources and user limits

## Why are these changes needed?
Previously, Ray Data's cluster autoscalers did not respect user-configured resource limits. When a user set explicit limits like:
```python
ctx = ray.data.DataContext.get_current()
ctx.execution_options.resource_limits = ExecutionResources(cpu=8)
```
the autoscaler would ignore these limits and continue to request more cluster resources from Ray's autoscaler, causing unnecessary node upscaling even when the executor couldn't use the additional resources. This was problematic because:
1. Users explicitly setting resource limits expect Ray Data to stay within those bounds
2. Unnecessary cluster scaling wastes cloud resources and money
3. The `ResourceManager.get_global_limits()` already respects user limits, but the autoscaler bypassed this by requesting resources directly

## Test Plan
Added comprehensive unit tests for both autoscaler implementations

## Related issue number
Fixes ray-project#60085

## Checks
- [x] I've signed off every commit
- [x] I've run `scripts/format.sh` to lint the changes in this PR
- [x] I've included any doc changes needed
- [x] I've added any new tests if needed

--------- Signed-off-by: Marwan Sarieddine <sarieddine.marwan@gmail.com> Signed-off-by: Balaji Veeramani <balaji@anyscale.com> Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
…t#60267) it is always an instance of AnyscaleJobRunner. Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
…0278) and saves the job ID in `_job_id`. this makes the information flow clearer and simpler. this is preparation for refactoring the job sdk usage. Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
per anyscale#727 Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
… from Ray Data (ray-project#60292) ## Description Remove all top-level imports of `ray.data` from the `ray.train` module. Imports needed only for type annotations should be guarded behind `if TYPE_CHECKING:`. Imports needed at runtime should be moved inline (lazy imports within functions/methods). ## Related issues Fixes ray-project#60152. --------- Signed-off-by: Haichuan Hu <kaisennhu@gmail.com> Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu> Co-authored-by: Balaji Veeramani <balaji@anyscale.com> Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu>
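The two patterns described above, as a small illustrative sketch (the helper name is hypothetical):

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Only evaluated by type checkers; ray.data is not imported at runtime.
    from ray.data import Dataset


def _materialize(ds: "Dataset"):
    # Lazy runtime import: ray.data is loaded only when this function runs.
    import ray.data  # noqa: F401

    return ds.materialize()
```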
Code Review
This pull request primarily focuses on updating and refactoring the CI/CD pipeline, removing Python 3.9 support, and introducing new build steps for C++ wheels. Several documentation files have also been updated to reflect these changes and improve clarity. The removal of the oss tag from various build steps across different platforms might impact how these jobs are categorized or filtered in the CI system. Additionally, the refactoring of Bazel sharding logic and dependency management indicates a significant overhaul of the build infrastructure.
```cpp
if (ConfigInternal::Instance().worker_type != WorkerType::DRIVER) {
  options.worker_id = WorkerID::FromHex(ConfigInternal::Instance().worker_id);
}
```
```cpp
  head_args.insert(head_args.end(), args.begin(), args.end());
}
startup_token = absl::GetFlag<int64_t>(FLAGS_startup_token);
worker_id = absl::GetFlag<std::string>(FLAGS_ray_worker_id);
```
The assignment `startup_token = absl::GetFlag<int64_t>(FLAGS_startup_token);` has been replaced with `worker_id = absl::GetFlag<std::string>(FLAGS_ray_worker_id);`. This change must be carefully reviewed to ensure that the new `worker_id` is correctly retrieved and used in all relevant parts of the system, especially considering the type change from `int64_t` to `std::string`.
```cpp
ABSL_FLAG(std::string,
          ray_worker_id,
          "",
          "The worker ID assigned to this worker process by the raylet (hex string).");
```
```python
# Correct example of ray.get(), using the object store to fetch the RDT object because the caller
# is not part of the collective group.
print(ray.get(tensor, _use_object_store=True))
```
The `_tensor_transport="object_store"` parameter has been updated to `_use_object_store=True`. This is a breaking API change that needs to be clearly communicated to users, along with migration instructions.
```diff
 The :func:`ray.get <ray.get>` function can also be used as usual to retrieve the result of an RDT object. However, :func:`ray.get <ray.get>` will by default use the same tensor transport as the one specified in the :func:`@ray.method <ray.method>` decorator. For collective-based transports, this will not work if the caller is not part of the collective group.

-Therefore, users need to specify the Ray object store as the tensor transport explicitly by setting ``_tensor_transport`` in :func:`ray.get <ray.get>`.
+Therefore, users need to specify the Ray object store as the tensor transport explicitly by setting ``_use_object_store`` in :func:`ray.get <ray.get>`.
```
```yaml
tags:
  - python
  - macos_wheels
  - oss
job_env: MACOS
```
Note that Ray decouples the lifetime option and the name option. If you only specify the name without specifying ``lifetime="detached"``, then you can only retrieve the placement group while the driver where you created the placement group is still running. It's recommended to always specify the name when creating the detached placement group. If you don't, there is no way to retrieve the placement group from another process, and there is no way to kill it once you exit the driver script that created the placement group.
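A short illustration of the pattern this note is about, with an illustrative group name (a named, detached placement group retrieved from a separate driver):

```python
import ray
from ray.util.placement_group import placement_group, remove_placement_group

ray.init()
# Named AND detached: the group outlives this driver and can be looked up by name.
pg = placement_group([{"CPU": 1}], name="my_pg", lifetime="detached")
ray.get(pg.ready())

# Later, from another driver connected to the same cluster:
pg = ray.util.get_placement_group("my_pg")
remove_placement_group(pg)
```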
This pull request has been automatically marked as stale because it has not had recent activity. You can always ask for help on our discussion forum or Ray's public Slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed.
This Pull Request was created automatically to merge the latest changes from `master` into the `main` branch.

Created: 2026-01-21
Merge direction: `master` → `main`
Triggered by: Scheduled

Please review and merge if everything looks good.