Daily merge: master → main 2026-01-21 #753
Conversation
…t#58695) ## Description This PR adds a new documentation page, Head Node Memory Management, under the Ray Core advanced topics section. ## Related issues Closes ray-project#58621 ## Additional information <img width="2048" height="1358" alt="image" src="https://github.com/user-attachments/assets/3b98150d-05e6-4d15-9cd3-7e05e82ff516" /> <img width="2048" height="498" alt="image" src="https://github.com/user-attachments/assets/4ec8fe43-e3a5-4df4-bca7-376ae407c77b" /> --------- Signed-off-by: Dongjun Na <kmu5544616@gmail.com> Co-authored-by: Ibrahim Rabbani <israbbani@gmail.com> Co-authored-by: Rueian <rueiancsie@gmail.com> Co-authored-by: Ibrahim Rabbani <irabbani@anyscale.com>
…y-project#59845) - [x] Update the docstring for `ray.shutdown()` in `python/ray/_private/worker.py` to clarify: - When connecting to a remote cluster via `ray.init(address="xxx")`, `ray.shutdown()` only disconnects the client and does NOT terminate the remote cluster - Only local clusters started by `ray.init()` will have their processes terminated by `ray.shutdown()` - Clarified that `ray.init()` without address argument will auto-detect existing clusters - [x] Add documentation note to `doc/source/ray-core/starting-ray.rst` explaining the same behavior difference - [x] Review the changes via code_review - [x] Run codeql_checker for security scan (no code changes requiring analysis) --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
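A minimal sketch of the behavior difference the updated docstring describes (the address value is illustrative):

```python
import ray

# Connecting to an existing cluster: ray.shutdown() only disconnects this
# driver; the remote cluster keeps running.
ray.init(address="auto")
ray.shutdown()

# No address: ray.init() auto-detects an existing cluster or starts a local
# one; a local cluster started this way is terminated by ray.shutdown().
ray.init()
ray.shutdown()
```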
## Description Upgrade the CUDA base GPU image from 11.8 to 12.8.1. This is required for future py3.13 dependency upgrades. --------- Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com> Co-authored-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
…roject#59735) ## Description ### Problem Using `--entrypoint-resources '{"fragile_node":"!1"}'` with the Job API raises an error saying only numeric values are allowed. ### Expected behavior `--entrypoint-resources` should accept label selectors just like ray.remote/PlacementGroups, so entrypoints can target or avoid nodes with specific labels. ## Related issues Fixes ray-project#58662. ## Additional information ### Implementation approach - Relax `JobSubmitRequest.entrypoint_resources` validation to allow string values (`python/ray/dashboard/modules/job/common.py`). - Add `_split_entrypoint_resources()` to separate numeric requests from selector strings and run them through `validate_label_selector` (`python/ray/dashboard/modules/job/job_manager.py`). - Pass numeric resources via the existing `resources` option and selector dict via `label_selector` when spawning the job supervisor, leaving the field unset if only resources were provided (`python/ray/dashboard/modules/job/job_manager.py`). - Extend CLI parsing/tests to cover string-valued resources and assert selector plumbing through the job manager (`python/ray/dashboard/modules/job/tests/test_cli.py`, `python/ray/dashboard/modules/job/tests/test_common.py`, `python/ray/dashboard/modules/job/tests/test_job_manager.py`). Signed-off-by: yaommen <myanstu@163.com>
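For illustration, a hedged sketch of how a submission with a selector string might look after this change; the selector value comes from the issue, while the endpoint and entrypoint are placeholders:

```python
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://127.0.0.1:8265")  # placeholder address
client.submit_job(
    entrypoint="python my_script.py",  # placeholder entrypoint
    # "!1" is a label-selector string; numeric values are still treated as
    # plain resource requests, per the split described above.
    entrypoint_resources={"fragile_node": "!1"},
)
```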
update with more up-to-date information, and format the markdown file a bit Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
…-project#59848)

# Fix StreamingRepartition hang with empty upstream results

## Summary
Fix a bug where `StreamingRepartitionRefBundler` would hang when processing empty datasets (0 rows).

## Problem
When upstream operations (e.g., `filter`, `map`, etc.) produce an empty result (0 rows), the resulting empty `RefBundle` gets added to `_pending_bundles` but never gets flushed because:
1. `add_bundle()` adds empty bundles (0 rows) to `_pending_bundles`
2. `_total_pending_rows` remains 0
3. `done_adding_bundles()` checks `len(_pending_bundles) > 0` and calls `flush_remaining=True`
4. `_try_build_ready_bundle(flush_remaining=True)` checks `_total_pending_rows > 0` → False, so no flush happens
5. Empty bundles remain in `_pending_bundles` forever (memory leak)

## Reproduction
```python
import ray

ray.init()
ds = ray.data.range(5).filter(lambda row: row['id'] > 100)
ds = ds.repartition(target_num_rows_per_block=8)
ds.count()
```

## Solution
Changed the flush condition in `_try_build_ready_bundle()` from checking `_total_pending_rows > 0` to `len(self._pending_bundles) > 0`:
```python
# Before:
if flush_remaining and self._total_pending_rows > 0:
# After:
if flush_remaining and len(self._pending_bundles) > 0:
```
This ensures empty bundles do not remain stuck in the bundler state, preventing both hangs and memory leaks.

Signed-off-by: dragongu <andrewgu@vip.qq.com>
…uts (ray-project#59883) If you have a pipeline like `read --> [some cpu transformation] --> [gpu transformation init_concurrency=N] --> write`, the `gpu transformation` might downscale to 0 actors if the CPU transformation is slow. This basically nullifies `init_concurrency` and can cause cold-start delays. --------- Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu> Signed-off-by: You-Cheng Lin <106612301+owenowenisme@users.noreply.github.com> Co-authored-by: You-Cheng Lin <106612301+owenowenisme@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
stop using python 3.9 Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
…butes (ray-project#59894) ## Description The `StatelessCartPole` example from APPO is timing out. This could be due to the latest changes in the APPO data pipeline. This PR modifies the setup of the example by using the new APPO attributes. ## Related issues Fixes https://buildkite.com/ray-project/postmerge/builds/15188#019b8f6e-2850-465e-a98c-63c29fbf98f7/L4702 --------- Signed-off-by: simonsays1980 <simon.zehnder@gmail.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>
this avoids the need to put in the dummy no-op files, and also allows us to add env vars in the future. Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
## Description ray: handle dual task errors with read-only args - avoid writing to user-defined args when building RayTaskError hybrids - fall back to RayTaskError-only with warning if subclassing fails - add regression test covering read-only args user exceptions ## Related issues Fixes ray-project#59437
ray-project#59846) ## Description dashboard agent services such as the reporter agent and event aggregator agent do not run in minimal ray installs (`pip install ray`). this pr skips client creation (and adds an info log to guide users) when using minimal installs. ## Related issues Fixes ray-project#59665 --------- Signed-off-by: sampan <sampan@anyscale.com> Co-authored-by: sampan <sampan@anyscale.com>
## Description this pr adds auth middleware to the dashboard http agent service and configures clients to include token headers in their requests. the pr also covers passing auth headers in the state_manager runtime env agent api call, which was previously missed. --------- Signed-off-by: sampan <sampan@anyscale.com> Co-authored-by: sampan <sampan@anyscale.com>
…prising (ray-project#59390) Signed-off-by: Seiji Eicher <seiji@anyscale.com>
## Description This PR adds support in the `JaxTrainer` to schedule across multiple TPU slices using the `ray.util.tpu` public utilities. To support this, this PR adds new `AcceleratorConfig`s to the V2 scaling config, which consolidate the accelerator-related fields for TPU and GPU. When `TPUAcceleratorConfig` is specified, the JaxTrainer utilizes a `SlicePlacementGroup` to atomically reserve `num_slices` TPU slices of the desired topology, auto-detecting the required values for `num_workers` and `resources_per_worker` when unspecified. TODO: I'll add some manual testing and usage examples in the comments. ## Related issues ray-project#55162 --------- Signed-off-by: ryanaoleary <ryanaoleary@google.com> Signed-off-by: Ryan O'Leary <113500783+ryanaoleary@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…e policy (ray-project#59803) ## Description Given a typical scenario of a fast producing operator followed by a slow producing operator, how do the backpressure policy and resource allocator behave? This change just adds tests to cement the expected behavior. ## Related issues DATA-1712 --------- Signed-off-by: Goutam <goutam@anyscale.com>
This PR adds documentation for several Ray Serve environment variables that were defined in `constants.py` but missing from the documentation, and also cleans up deprecated legacy environment variable names. ### Changes Made #### Documentation additions **`doc/source/serve/production-guide/config.md`** (Proxy config section): - `RAY_SERVE_ALWAYS_RUN_PROXY_ON_HEAD_NODE` - Control whether to always run a proxy on the head node - `RAY_SERVE_PROXY_HEALTH_CHECK_TIMEOUT_S` - Proxy health check timeout - `RAY_SERVE_PROXY_HEALTH_CHECK_PERIOD_S` - Proxy health check period - `RAY_SERVE_PROXY_READY_CHECK_TIMEOUT_S` - Proxy ready check timeout - `RAY_SERVE_PROXY_MIN_DRAINING_PERIOD_S` - Minimum proxy draining period **`doc/source/serve/production-guide/fault-tolerance.md`** (New "Replica constructor retries" section): - `RAY_SERVE_MAX_PER_REPLICA_RETRY_COUNT` - Max constructor retries per replica - `RAY_SERVE_MAX_DEPLOYMENT_CONSTRUCTOR_RETRY_COUNT` - Max constructor retries per deployment **`doc/source/serve/advanced-guides/performance.md`**: - `RAY_SERVE_PROXY_PREFER_LOCAL_NODE_ROUTING` - Proxy node locality routing preference - `RAY_SERVE_PROXY_PREFER_LOCAL_AZ_ROUTING` - Proxy AZ locality routing preference - `RAY_SERVE_MAX_CACHED_HANDLES` - Max cached deployment handles (controller debugging section) **`doc/source/serve/monitoring.md`**: - `RAY_SERVE_HTTP_PROXY_CALLBACK_IMPORT_PATH` - HTTP proxy initialization callback - `SERVE_SLOW_STARTUP_WARNING_S` - Slow startup warning threshold - `SERVE_SLOW_STARTUP_WARNING_PERIOD_S` - Slow startup warning interval #### Code cleanup **`python/ray/serve/_private/constants.py`**: - Removed legacy fallback for `MAX_DEPLOYMENT_CONSTRUCTOR_RETRY_COUNT` (now only `RAY_SERVE_MAX_DEPLOYMENT_CONSTRUCTOR_RETRY_COUNT`) - Removed legacy fallback for `MAX_PER_REPLICA_RETRY_COUNT` (now only `RAY_SERVE_MAX_PER_REPLICA_RETRY_COUNT`) - Removed legacy fallback for `MAX_CACHED_HANDLES` (now only `RAY_SERVE_MAX_CACHED_HANDLES`) **`python/ray/serve/_private/constants_utils.py`**: - Removed `MAX_DEPLOYMENT_CONSTRUCTOR_RETRY_COUNT` and `MAX_PER_REPLICA_RETRY_COUNT` from the deprecated names whitelist --------- Signed-off-by: harshit <harshit@anyscale.com>
…reating (ray-project#59610) Signed-off-by: dayshah <dhyey2019@gmail.com>
## Description allow `RAY_DASHBOARD_AGGREGATOR_AGENT_PUBLISHER_HTTP_ENDPOINT_EXPOSABLE_EVENT_TYPES` to accept `ALL` so that all events are exported. This will be used by the history server. (Without this config, KubeRay needs to explicitly list each event type, which is tedious as this list may grow in the future.) --------- Signed-off-by: sampan <sampan@anyscale.com> Signed-off-by: Sampan S Nayak <sampansnayak2@gmail.com> Co-authored-by: sampan <sampan@anyscale.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
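A sketch of the usage described above, shown programmatically for illustration (the variable must be set in the aggregator agent's environment before it starts):

```python
import os

# With "ALL", every event type is exported over the HTTP endpoint instead of
# having to enumerate each type explicitly.
os.environ[
    "RAY_DASHBOARD_AGGREGATOR_AGENT_PUBLISHER_HTTP_ENDPOINT_EXPOSABLE_EVENT_TYPES"
] = "ALL"
```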
…project#59784) ## Description run state api and task event unit tests with both the default (task_event -> gcs) and aggregator (task_event -> aggregator -> gcs) flows to smooth the transition from the default to the aggregator flow --------- Signed-off-by: sampan <sampan@anyscale.com> Signed-off-by: Sampan S Nayak <sampansnayak2@gmail.com> Co-authored-by: sampan <sampan@anyscale.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com> Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com> Co-authored-by: elliot-barn <elliot.barnwell@anyscale.com>
AnyscaleJobRunner is the only implementation/child class of CommandRunner right now. There is no need to use inheritance. Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
) Add BuildContext TypedDict to capture post_build_script, python_depset, their SHA256 digests, and environment variables for custom BYOD image builds. Changes: - Add build_context.py with BuildContext TypedDict and helper functions: - make_build_context: constructs BuildContext with computed file digests - encode_build_context: deterministic minified JSON serialization - decode_build_context: JSON deserialization - build_context_digest: SHA256 digest of encoded context - Refactor build_anyscale_custom_byod_image to accept BuildContext instead of individual post_build_script and python_depset arguments - Update callers: custom_byod_build.py, ray_bisect.py - Add comprehensive unit tests Generated with [Claude Code](https://claude.com/claude-code) Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com> Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
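A rough sketch of the shape such a TypedDict and its helpers might take; the field names here are assumptions, not the actual `build_context.py` contents:

```python
import hashlib
import json
from typing import Dict, TypedDict


class BuildContext(TypedDict):
    # Field names are illustrative assumptions.
    post_build_script: str
    post_build_script_digest: str
    python_depset: str
    python_depset_digest: str
    env: Dict[str, str]


def encode_build_context(ctx: BuildContext) -> str:
    # Deterministic, minified JSON: sorted keys, no extra whitespace.
    return json.dumps(ctx, sort_keys=True, separators=(",", ":"))


def build_context_digest(ctx: BuildContext) -> str:
    # SHA256 digest of the encoded context, usable as a cache key.
    return hashlib.sha256(encode_build_context(ctx).encode("utf-8")).hexdigest()
```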
…project#59839)

# Fix `ArrowInvalid` error in checkpoint filter when converting PyArrow chunks to NumPy arrays

## Issue
Fixes `ArrowInvalid` error when checkpoint filtering converts PyArrow chunks to NumPy arrays with `zero_copy_only=True`:
```
File "/usr/local/lib/python3.10/dist-packages/ray/data/checkpoint/checkpoint_filter.py", line 249, in filter_rows_for_block
    masks = list(executor.map(filter_with_ckpt_chunk, ckpt_chunks))
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 621, in result_iterator
    yield _result_or_cancel(fs.pop())
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 319, in _result_or_cancel
    return fut.result(timeout)
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/data/checkpoint/checkpoint_filter.py", line 229, in filter_with_ckpt_chunk
    ckpt_ids = ckpt_chunk.to_numpy(zero_copy_only=True)
  File "pyarrow/array.pxi", line 1789, in pyarrow.lib.Array.to_numpy
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Needed to copy 1 chunks with 0 nulls, but zero_copy_only was True
```
This error occurs when checkpoint data is loaded from Ray's object store, where PyArrow buffers may reside in shared memory and cannot be zero-copied to NumPy.

## Reproduction
```python
#!/usr/bin/env python3
import ray
from ray.data import DataContext
from ray.data.checkpoint import CheckpointConfig
import tempfile

ray.init()

with tempfile.TemporaryDirectory() as ckpt_dir, \
     tempfile.TemporaryDirectory() as data_dir, \
     tempfile.TemporaryDirectory() as output_dir:
    # Step 1: Create data
    ray.data.range(10).map(lambda x: {"id": f"id_{x['id']}"}).write_parquet(data_dir)

    # Step 2: Enable checkpoint and write
    ctx = DataContext.get_current()
    ctx.checkpoint_config = CheckpointConfig(
        checkpoint_path=ckpt_dir,
        id_column="id",
        delete_checkpoint_on_success=False,
    )
    ray.data.read_parquet(data_dir).filter(lambda x: x["id"] != 'id_0').write_parquet(output_dir)

    # Step 3: Second write triggers checkpoint filtering
    ray.data.read_parquet(data_dir).write_parquet(output_dir)

ray.shutdown()
```

## Solution
Change `to_numpy(zero_copy_only=True)` to `to_numpy(zero_copy_only=False)` in `BatchBasedCheckpointFilter.filter_rows_for_block()`. This allows PyArrow to copy data when necessary.

### Changes
**File**: `ray/python/ray/data/checkpoint/checkpoint_filter.py`
- Line 229: Changed `ckpt_chunk.to_numpy(zero_copy_only=True)` to `ckpt_chunk.to_numpy(zero_copy_only=False)`

### Performance Impact
No performance regression expected. PyArrow will only perform a copy when zero-copy is not possible.

Signed-off-by: dragongu <andrewgu@vip.qq.com>
## Description Adds a `repr_name` field to the `actor_lifecycle_event` schema and populates it when available. ## Related issues Closes ray-project#59813 --------- Signed-off-by: sampan <sampan@anyscale.com> Signed-off-by: Sampan S Nayak <sampansnayak2@gmail.com> Co-authored-by: sampan <sampan@anyscale.com>
…y-project#59893) ## Description Fix inconsistent task name in metrics between RUNNING and FINISHED states. When a Ray task is defined with a custom name via `.options(name="custom_name")`, the `ray_tasks` metrics show inconsistent names: - **RUNNING** state: shows the original function name (e.g., `RemoteFn`) - **FINISHED/FAILED** state: shows the custom name (e.g., `test`) **Root cause:** The RUNNING task counter in `CoreWorker` uses `FunctionDescriptor()->CallString()` to get the task name, while finished task events correctly use `TaskSpecification::GetName()`. **Fix:** Changed both `HandlePushTask` and `ExecuteTask` in `core_worker.cc` to use `task_spec.GetName()` consistently, which properly returns the custom name when set. ## Related issues None - this PR addresses a newly discovered bug. ## Additional information **Files changed:** - `src/ray/core_worker/core_worker.cc` - Use `GetName()` instead of `FunctionDescriptor()->CallString()` for metrics - `python/ray/tests/test_task_metrics.py` - Added test `test_task_custom_name_metrics` to verify custom names appear correctly in metrics Signed-off-by: Yuan Jiewei <jieweihh.yuan@gmail.com> Co-authored-by: Yuan Jiewei <jieweihh.yuan@gmail.com>
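For reference, a minimal snippet that exercises the scenario described above (the custom name is arbitrary):

```python
import ray

@ray.remote
def remote_fn():
    return 1

# Before the fix, the ray_tasks metric reported "remote_fn" while RUNNING but
# "my_custom_name" once FINISHED; after the fix both states use the custom name.
ray.get(remote_fn.options(name="my_custom_name").remote())
```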
## Description update metrics export docs based on changes in ray-project#59337 --------- Signed-off-by: sampan <sampan@anyscale.com> Co-authored-by: sampan <sampan@anyscale.com>
…ray-project#59808) Adds a new RLlib algorithm TQC, which extends SAC with distributional critics using quantile regression to control Q-function overestimation bias. Key components: - TQC algorithm configuration and implementation - Default TQC RLModule with multiple quantile critics - TQC catalog for building network components - Comprehensive test suite covering compilation, simple environments, and parameter validation - Documentation --------- Signed-off-by: tk42 <nsplat@gmail.com> Co-authored-by: simonsays1980 <simon.zehnder@gmail.com>
…60304) ## Description We had a separate field in `OpState` to keep track of outputted rows. `OpRuntimeMetrics` exists per `PhysicalOperator` and also has a field to keep track of outputted rows, so there is no need to keep a duplicate in `OpState`. ## Related issues N/A ## Additional information N/A Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>
## Description This PR removes an obsolete HalfCheetah release test. ## Related issues See also: ray-project#59007 Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
## Description Currently `ray attach` only allows opening an SSH session on the head node. It could be useful to allow attaching to worker nodes to check what state the execution environment and file system are in (e.g. running `conda list`, examining config files such as `~/.keras/keras.json`). ## Related issues Closes ray-project#7064 ## Additional information This PR adds a `--node-ip` argument to `ray attach` to specify the node IP to attach to. Usage: `ray attach cluster.yaml --node-ip <node ip>`. Defaults to the head node if `--node-ip` is not provided. Added a unit test and tested on GCP (see ray-project#59931 (comment)) --------- Signed-off-by: machichima <nary12321@gmail.com>
…oject#60276) so that we are not pretending that we are fetching results or terminating jobs. Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
…in homepage (ray-project#60229) ## Summary Replaced the Ray Tune example in the homepage (`index.html`) to show vanilla Ray Tune usage instead of V1 tune+train integration. **Changes:** - Removed `ScalingConfig` and `LightGBMTrainer` imports (Ray Train components) - Added a pure Ray Tune example demonstrating: - An objective function that trains a model with hyperparameters and reports metrics - Hyperparameter search space using common Tune methods (`loguniform`, `choice`, `randint`) - Running 1000 trials with the `Tuner` API - Retrieving the best result This makes the example clearer for users who want to learn Ray Tune's hyperparameter optimization capabilities without the complexity of Ray Train integration. Signed-off-by: xgui <xgui@anyscale.com>
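A sketch along these lines, not necessarily the exact snippet added to `index.html` (the objective is a placeholder for real training):

```python
from ray import tune

def objective(config):
    # Placeholder for real training: compute a score from the sampled
    # hyperparameters and return it so Tune can compare trials.
    score = config["lr"] * config["num_layers"]
    return {"score": score}

search_space = {
    "lr": tune.loguniform(1e-4, 1e-1),
    "num_layers": tune.randint(1, 8),
    "activation": tune.choice(["relu", "tanh"]),
}

tuner = tune.Tuner(
    objective,
    param_space=search_space,
    tune_config=tune.TuneConfig(num_samples=1000),
)
results = tuner.fit()
print(results.get_best_result(metric="score", mode="max").config)
```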
if a test is not stable, it should be on manual frequency. we will no longer treat unstable tests differently. Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
…ect#60264) the alias is not used anywhere. this clears all the `__init__.py` under the `ray_release/` directory, making it consistent with other files, and easier to convert everything to idiomatic bazel Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
we have always been using a constant. if one needs more logs, they can go to anyscale's UI and view logs there. Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
…#60277) just save the sdk as a private member instead Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
…oject#60272) so that it is not going back and forth between the implementation and the abstract class, and not implemented as a property. Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
## Description Deprecate the Predictor API and its concrete subclasses: DLPredictor(Predictor), LightGBMPredictor(Predictor), TensorflowPredictor(DLPredictor), TorchPredictor(DLPredictor), XGBoostPredictor(Predictor), TorchDetectionPredictor(TorchPredictor). ## Related issues Closes ray-project#60266 ## Additional information Added `@Deprecated` annotations to the corresponding classes; a `DeprecationWarning` is emitted when the superclass constructor is called. --------- Signed-off-by: Hyunoh-Yeo <hyunoh.yeo@gmail.com> Signed-off-by: Hyunoh-Yeo <113647638+Hyunoh-Yeo@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
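A minimal sketch of the warning pattern described above (stand-in class names, not the actual Ray Train code):

```python
import warnings

class Predictor:  # stand-in for the deprecated base Predictor class
    def __init__(self, preprocessor=None):
        warnings.warn(
            "Predictor and its subclasses are deprecated.",
            DeprecationWarning,
            stacklevel=2,
        )
        self.preprocessor = preprocessor

class XGBoostPredictor(Predictor):  # concrete subclasses inherit the warning
    def __init__(self, model=None, preprocessor=None):
        super().__init__(preprocessor)  # emits the DeprecationWarning
        self.model = model
```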
…dule IDs (ray-project#60234) ### Description This PR fixes a bug in the RLlib `MultiAgentEnvRunner` where module episode return metrics were incorrectly calculated when multiple agents share the same module ID. Previously, the code was overwriting returns instead of accumulating them, leading to incorrect metrics. - Fixed module return calculation logic in `MultiAgentEnvRunner` to properly accumulate returns when multiple agents use the same module ID - Added a test case to verify that module metric returns equal the sum of agent returns assigned to that module ### Related issues Fixes ray-project#59860 ### Files modified: - `rllib/env/multi_agent_env_runner.py`: Core bug fix - `rllib/env/tests/test_multi_agent_env_runner.py`: New test case called `test_module_metrics_returns_equal_sum_of_agent_returns()` --------- Signed-off-by: Adam Kelloway <kelloway@amazon.com> Co-authored-by: Adam Kelloway <kelloway@amazon.com>
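The gist of the fix above, as a hypothetical standalone sketch (names are illustrative, not the actual RLlib code):

```python
from collections import defaultdict

agent_returns = {"agent_0": 1.5, "agent_1": 2.0}              # per-agent episode returns
agent_to_module = {"agent_0": "shared", "agent_1": "shared"}  # both map to one module

# Accumulate instead of overwrite: the shared module's return is the sum of
# the returns of all agents mapped to it.
module_returns = defaultdict(float)
for agent_id, ret in agent_returns.items():
    module_returns[agent_to_module[agent_id]] += ret

assert module_returns["shared"] == 3.5
```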
install from tarball from the official source, rather than deb. Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
…oject#60151) removing requirement file and constraint file build args from the following images: base-deps, base-extra, base-extra-test-deps, base-slim (defaulting constraints file as a build arg); defaulting PYTHON_DEPSET & CONSTRAINTS_FILE args in the dockerfile; renaming ray-llm, ray-gpu & ray base extra testdeps lock files. IMAGE_TYPE defined on the BK jobs will determine which lock file to copy to the image. hello world release test run: https://buildkite.com/ray-project/release/builds/76001# --------- Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
ray-project#59897) Signed-off-by: Aydin Abiar <aydin@anyscale.com> Signed-off-by: Aydin Abiar <62435714+Aydin-ab@users.noreply.github.com> Co-authored-by: Aydin Abiar <aydin@anyscale.com>
…netes token authentication (ray-project#59621) ## Description Per discussion from REP PR (ray-project/enhancements#63), this PR adds a server-side config `RAY_ENABLE_K8S_TOKEN_RBAC=true` to enable Kubernetes-based token authentication. This must be set in addition to `RAY_AUTH_MODE=token`. The main benefit of this change is that the server-side authentication flow becomes opaque to clients, and all clients only need to set `RAY_AUTH_MODE=token` along with their token. --------- Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
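A sketch of the resulting configuration split described above (environment variables shown programmatically; where exactly you set them depends on your deployment):

```python
import os

# Server side (e.g., the Ray head): both flags are required to turn on
# Kubernetes-backed token authentication.
os.environ["RAY_AUTH_MODE"] = "token"
os.environ["RAY_ENABLE_K8S_TOKEN_RBAC"] = "true"

# Client side: only the auth mode (plus the token itself) is needed; the
# Kubernetes-specific flow stays opaque to clients.
os.environ["RAY_AUTH_MODE"] = "token"
```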
…-project#60283)

## Summary
- Fix Ray Data's cluster autoscalers (V1 and V2) to respect user-configured `resource_limits` set via `ExecutionOptions`
- Cap autoscaling resource requests to not exceed user-specified CPU and GPU limits
- Update `get_total_resources()` to return the minimum of cluster resources and user limits

## Why are these changes needed?
Previously, Ray Data's cluster autoscalers did not respect user-configured resource limits. When a user set explicit limits like:
```python
ctx = ray.data.DataContext.get_current()
ctx.execution_options.resource_limits = ExecutionResources(cpu=8)
```
the autoscaler would ignore these limits and continue to request more cluster resources from Ray's autoscaler, causing unnecessary node upscaling even when the executor couldn't use the additional resources. This was problematic because:
1. Users explicitly setting resource limits expect Ray Data to stay within those bounds
2. Unnecessary cluster scaling wastes cloud resources and money
3. The `ResourceManager.get_global_limits()` already respects user limits, but the autoscaler bypassed this by requesting resources directly

## Test Plan
Added comprehensive unit tests for both autoscaler implementations

## Related issue number
Fixes ray-project#60085

## Checks
- [x] I've signed off every commit
- [x] I've run `scripts/format.sh` to lint the changes in this PR
- [x] I've included any doc changes needed
- [x] I've added any new tests if needed

--------- Signed-off-by: Marwan Sarieddine <sarieddine.marwan@gmail.com> Signed-off-by: Balaji Veeramani <balaji@anyscale.com> Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
…t#60267) it is always an instance of AnyscaleJobRunner. Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
…0278) and saves the job ID in `_job_id`. this makes the information flow clearer and simpler. this is preparation for refactoring the job sdk usage. Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
per anyscale#727 Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
… from Ray Data (ray-project#60292) ## Description Remove all top-level imports of `ray.data` from the `ray.train` module. Imports needed only for type annotations should be guarded behind `if TYPE_CHECKING:`. Imports needed at runtime should be moved inline (lazy imports within functions/methods). ## Related issues Fixes ray-project#60152. --------- Signed-off-by: Haichuan Hu <kaisennhu@gmail.com> Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu> Co-authored-by: Balaji Veeramani <balaji@anyscale.com> Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu>
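The two patterns described above, as a small illustrative sketch (the helper name is hypothetical):

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Only evaluated by type checkers; ray.data is not imported at runtime.
    from ray.data import Dataset


def _materialize(ds: "Dataset"):
    # Lazy runtime import: ray.data is loaded only when this function runs.
    import ray.data  # noqa: F401

    return ds.materialize()
```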
Code Review
This pull request primarily focuses on updating and refactoring the CI/CD pipeline, removing Python 3.9 support, and introducing new build steps for C++ wheels. Several documentation files have also been updated to reflect these changes and improve clarity. The removal of the oss tag from various build steps across different platforms might impact how these jobs are categorized or filtered in the CI system. Additionally, the refactoring of Bazel sharding logic and dependency management indicates a significant overhaul of the build infrastructure.
```cpp
if (ConfigInternal::Instance().worker_type != WorkerType::DRIVER) {
  options.worker_id = WorkerID::FromHex(ConfigInternal::Instance().worker_id);
}
```
```cpp
  head_args.insert(head_args.end(), args.begin(), args.end());
}
startup_token = absl::GetFlag<int64_t>(FLAGS_startup_token);
worker_id = absl::GetFlag<std::string>(FLAGS_ray_worker_id);
```
The assignment `startup_token = absl::GetFlag<int64_t>(FLAGS_startup_token);` has been replaced with `worker_id = absl::GetFlag<std::string>(FLAGS_ray_worker_id);`. This change must be carefully reviewed to ensure that the new `worker_id` is correctly retrieved and used in all relevant parts of the system, especially considering the type change from `int64_t` to `std::string`.
```cpp
ABSL_FLAG(std::string,
          ray_worker_id,
          "",
          "The worker ID assigned to this worker process by the raylet (hex string).");
```
```python
# Correct example of ray.get(), using the object store to fetch the RDT object because the caller
# is not part of the collective group.
print(ray.get(tensor, _use_object_store=True))
```
The `_tensor_transport="object_store"` parameter has been updated to `_use_object_store=True`. This is a breaking API change that needs to be clearly communicated to users, along with migration instructions.
```diff
 The :func:`ray.get <ray.get>` function can also be used as usual to retrieve the result of an RDT object. However, :func:`ray.get <ray.get>` will by default use the same tensor transport as the one specified in the :func:`@ray.method <ray.method>` decorator. For collective-based transports, this will not work if the caller is not part of the collective group.

-Therefore, users need to specify the Ray object store as the tensor transport explicitly by setting ``_tensor_transport`` in :func:`ray.get <ray.get>`.
+Therefore, users need to specify the Ray object store as the tensor transport explicitly by setting ``_use_object_store`` in :func:`ray.get <ray.get>`.
```
```yaml
tags:
  - python
  - macos_wheels
  - oss
job_env: MACOS
```
Note that Ray decouples the lifetime option and the name option. If you only specify the name without specifying ``lifetime="detached"``, then you can only retrieve the placement group while the driver where you created the placement group is still running. It's recommended to always specify the name when creating the detached placement group. If you don't, there is no way to retrieve the placement group from another process, and there is no way to kill it once you exit the driver script that created the placement group.
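A short illustration of the pattern this note is about, with an illustrative group name (a named, detached placement group retrieved from a separate driver):

```python
import ray
from ray.util.placement_group import placement_group, remove_placement_group

ray.init()
# Named AND detached: the group outlives this driver and can be looked up by name.
pg = placement_group([{"CPU": 1}], name="my_pg", lifetime="detached")
ray.get(pg.ready())

# Later, from another driver connected to the same cluster:
pg = ray.util.get_placement_group("my_pg")
remove_placement_group(pg)
```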
This pull request has been automatically marked as stale because it has not had recent activity. You can always ask for help on our discussion forum or Ray's public Slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed.
This Pull Request was created automatically to merge the latest changes from `master` into the `main` branch.

Created: 2026-01-21
Merge direction: `master` → `main`
Triggered by: Scheduled

Please review and merge if everything looks good.