Daily merge: master → main 2026-01-20 #750
Conversation
not used anywhere anymore; the min install tests use Dockerfiles to set up the test environments now. Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
and removes Python 3.9-related tests. Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
## Description
When the Ray process terminates unexpectedly and the node ID changes, we
cannot tell the previous and current node IDs apart from the events, because
`node_id` was not included in the base event.
This feature is needed by the history server for the Ray cluster: when the
Ray process terminates unexpectedly, we have to flush the events generated by
the previous node, and if the Ray process restarts quickly, it is difficult to
know which events were generated by the previous node.
This PR adds `node_id` to the base event to show where each event is emitted.
### Main changes
- `src/ray/protobuf/public/events_base_event.proto`
  - Add `node_id` to the base event proto (`RayEvent`)

For GCS:
- `src/ray/gcs/gcs_server_main.cc`
  - Add `--node_id` as a CLI argument
- `src/ray/observability/` and `src/ray/gcs/` (some files)
  - Add `node_id` as an argument and pass it to `RayEvent`

For CoreWorker:
- `src/ray/core_worker/`
  - Pass `node_id` to `RayEvent`

Python side:
- `python/ray/_private/node.py`
  - Pass `node_id` when starting the GCS server
## Related issues
Closes ray-project#58879
## Additional information
### Testing process
1. Export environment variables:
   - `RAY_enable_core_worker_ray_event_to_aggregator=1`
   - `RAY_DASHBOARD_AGGREGATOR_AGENT_EVENTS_EXPORT_ADDR="http://localhost:8000"`
2. `ray start --head --system-config='{"enable_ray_event":true}'`
3. Submit a simple job with `ray job submit -- python rayjob.py`, e.g.
```py
import ray


@ray.remote
def hello_world():
    return "hello world"


ray.init()
print(ray.get(hello_world.remote()))
```
4. Run the event listener (script below) to start listening on the event
export address: `python event_listener.py`
```py
import http.server
import socketserver
import json

PORT = 8000


class EventReceiver(http.server.SimpleHTTPRequestHandler):
    def do_POST(self):
        content_length = int(self.headers['Content-Length'])
        post_data = self.rfile.read(content_length)
        print(json.loads(post_data.decode('utf-8')))
        self.send_response(200)
        self.send_header('Content-type', 'application/json')
        self.end_headers()
        self.wfile.write(json.dumps({"status": "success", "message": "Event received"}).encode('utf-8'))


if __name__ == "__main__":
    with socketserver.TCPServer(("", PORT), EventReceiver) as httpd:
        print(f"Serving event listener on http://localhost:{PORT}")
        print("Set RAY_DASHBOARD_AGGREGATOR_AGENT_EVENTS_EXPORT_ADDR to this address.")
        httpd.serve_forever()
```
You will get events like the following:
- GCS event:
```json
[
{
"eventId":"A+yrzknbyDjQALBvaimTPcyZWll9Te3Tw+FEnQ==",
"sourceType":"GCS",
"eventType":"DRIVER_JOB_LIFECYCLE_EVENT",
"timestamp":"2025-12-07T10: 54: 12.621560Z",
"severity":"INFO",
"sessionName":"session_2025-12-07_17-33-33_853835_27993",
"driverJobLifecycleEvent":{
"jobId":"BAAAAA==",
"stateTransitions":[
{
"state":"FINISHED",
"timestamp":"2025-12-07T10: 54: 12.621562Z"
}
]
},
"nodeId":"k4hj3FDLYStB38nSSRZQRwOjEV32EoAjQe3KPw==", // <- nodeId set
"message":""
}
]
```
- CoreWorker event:
```json
[
{
"eventId":"TIAp8D4NwN/ne3VhPHQ0QnsBCYSkZmOUWoe6zQ==",
"sourceType":"CORE_WORKER",
"eventType":"TASK_DEFINITION_EVENT",
"timestamp":"2025-12-07T10:54:12.025967Z",
"severity":"INFO",
"sessionName":"session_2025-12-07_17-33-33_853835_27993",
"taskDefinitionEvent":{
"taskId":"yoDzqOi6LlD///////////////8EAAAA",
"taskFunc":{
"pythonFunctionDescriptor":{
"moduleName":"rayjob",
"functionName":"hello_world",
"functionHash":"a37aacc3b7884c2da4aec32db6151d65",
"className":""
}
},
"taskName":"hello_world",
"requiredResources":{
"CPU":1.0
},
"jobId":"BAAAAA==",
"parentTaskId":"//////////////////////////8EAAAA",
"placementGroupId":"////////////////////////",
"serializedRuntimeEnv":"{}",
"taskAttempt":0,
"taskType":"NORMAL_TASK",
"language":"PYTHON",
"refIds":{
}
},
"nodeId":"k4hj3FDLYStB38nSSRZQRwOjEV32EoAjQe3KPw==", // <- nodeId set here
"message":""
}
]
```
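With `nodeId` present on every event, a consumer such as the history server can tell which buffered events came from a node that has since restarted. The snippet below is a minimal sketch of that idea (not part of this PR); it only assumes events shaped like the JSON payloads above.
```py
from collections import defaultdict


def partition_events_by_node(events):
    """Group aggregator events by the node that emitted them.

    `events` is assumed to be a list of dicts shaped like the JSON
    payloads above, each carrying a top-level "nodeId" field.
    """
    by_node = defaultdict(list)
    for event in events:
        # Events emitted before this change carry no nodeId; bucket them separately.
        by_node[event.get("nodeId", "<unknown>")].append(event)
    return by_node


def events_from_previous_nodes(events, current_node_id):
    """Everything not emitted by the current node, e.g. what the history
    server would flush after an unexpected restart."""
    by_node = partition_events_by_node(events)
    return [e for node, batch in by_node.items() if node != current_node_id for e in batch]
```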
---------
Signed-off-by: machichima <nary12321@gmail.com>
Signed-off-by: Future-Outlier <eric901201@gmail.com>
Co-authored-by: Future-Outlier <eric901201@gmail.com>
Co-authored-by: Mengjin Yan <mengjinyan3@gmail.com>
Signed-off-by: win5923 <ken89@kimo.com>
from Python 3.9 to Python 3.10; we are practically already using Python 3.10 everywhere. Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
and removes Python dependency requirements for Python 3.9 or below. Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
…ded_decoding… (ray-project#59421) Signed-off-by: Sathyanarayanaa-T <tsathyanarayanaa@gmail.com> Co-authored-by: Jeffrey Wang <jeffreywang@anyscale.com>
…oject#59806) Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
## Description Move gcs_callback_types.h to rpc_callback_types.h ## Related issues Closes ray-project#58597 --------- Signed-off-by: tianyi-ge <tianyig@outlook.com> Co-authored-by: tianyi-ge <tianyig@outlook.com>
β¦ay-project#59345) ## Description Using stateful models in Offline RL is an important feature and the major prerequisites for this feature are already implemented in RLlib's stack. However, some minor changes are needed to indeed train such models in the new stack. This PR implements these minor but important changes at different locations in the code: 1. It introduces the `STATE_OUT` key in the outputs of `BC`'s `forward` function to make the next hidden state available to the connectors and loss function. 2. It adds in the `AddStatesFromEpisodesToBatch` the initial state to the batch for offline data. 3. It adds in `MARWIL` a burn-in for the state that can be controlled via `burnin_len`. 4. It generates sequence sampling in the `OfflinePreLearner` dependent on the `max_seq_len`, `lookback_len` and `burnin_len`. 5. It fixes multiple smaller bugs in `OfflineEnvRunner` when recording from class-referenced environments, in `offline_rl_with_image_data.py` and `cartpole_recording.py` examples when loading the `RLModule` from checkpoint. 6. It fixes the use of `explore=True` in evaluation of Offline RL tests and examples. 7. It adds recorded expert data from `StatelessCartPole` to the `s3://ray-example-data/rllib/offline-data/statelesscartpole` 8. It adds a test for learning on a single episode and a single batch from recorded stateful expert data. Adds also a test to use instead of recorded states the initial states for sequences. 9. Adds a new config parameter `prelearner_use_recorded_module_states` to either use recorded states from the data (`True`) or use the initial state from the `RLModule` (`False`). ## Related issues ## Additional information The only API change is the introduction of a `burnin_len` to the `MARWIL/BC` config. >__Note:__ Stateful model training is only possible in `BC` and `MARWIL` so far for Offline RL. For `IQL` and `CQL` these changes have to be initiated through the off-policy algorithms (`DQN/SAC`) and for these all the buffers need to provide sequence sampling which is implemented right now solely in the `EpisodeReplayBuffer`. Therefore a couple of follow-up PRs need to be produced: 1. Introduce sequence sampling to `PrioritizedEpisodeReplayBuffer`. 2. Introduce sequence sampling to `MultiAgentEpisodeReplayBuffer`. 3. Introduce sequence sampling to `MultiAgentPrioritizedEpisodeReplayBuffer`. 4. Introduce stateful model training in `SAC`. 5. Introduce stateful model training to `IQL/CQL`. --------- Signed-off-by: simonsays1980 <simon.zehnder@gmail.com>
) ## Description ## Related issues > Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to ray-project#1234". ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. Signed-off-by: You-Cheng Lin <mses010108@gmail.com>
β¦59774) ## Description Instead of rendering a large json blob for Operator metrics, render the log in a tabular form for better readability. ## Related issues > Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to ray-project#1234". ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: Goutam <goutam@anyscale.com>
β¦ing row count (ray-project#59513) ## Description ```python def process_single_row(x): time.sleep(.3) x['id2'] = x['id'] return x class PrepareImageUdf: def __call__(self, x): time.sleep(5) return x class ChatUdf: def __call__(self, x): time.sleep(5) return x def preprocess(x): time.sleep(.3) return x def task2(x): time.sleep(.3) return x def task3(x): time.sleep(.3) return x def filterfn(x): return True ds = ( ray.data.range(1024, override_num_blocks=1024) .map_batches(task3, compute=ray.data.TaskPoolStrategy(size=1)) .drop_columns(cols="id") .map(process_single_row) .filter(filterfn) .map(preprocess) .map_batches(PrepareImageUdf, zero_copy_batch=True, batch_size=64, compute=ray.data.ActorPoolStrategy(min_size=1, max_size=12)) ) ds.explain() ``` Here's a fun question: what should this return as the optimized physical plan: Ans: ```python -------- Physical Plan (Optimized) -------- ActorPoolMapOperator[MapBatches(drop_columns)->Map(process_single_row)->Filter(filterfn)->Map(preprocess)->MapBatches(PrepareImageUdf)] +- TaskPoolMapOperator[MapBatches(task3)] +- TaskPoolMapOperator[ReadRange] +- InputDataBuffer[Input] ``` Cool. It looks like it fused mostly everything from `drop_columns` to `PrepareImageUDF` Ok what if I added these lines: what happens now? ```python ds = ( ds .map_batches(ChatUdf, zero_copy_batch=True, batch_size=64, compute=ray.data.ActorPoolStrategy(min_size=1, max_size=12)) ) ds.explain() ``` Ans: ```python -------- Physical Plan (Optimized) -------- ActorPoolMapOperator[MapBatches(ChatUdf)] +- ActorPoolMapOperator[Map(preprocess)->MapBatches(PrepareImageUdf)] +- TaskPoolMapOperator[MapBatches(drop_columns)->Map(process_single_row)->Filter(filterfn)] +- TaskPoolMapOperator[MapBatches(task3)] +- TaskPoolMapOperator[ReadRange] +- InputDataBuffer[Input] ``` HuH?? Why did `preprocess->PrepareImageUDF` get defused?? The issue is that operator map fusion does not preserve whether or not the row counts can be modified. This PR addresses that. ## Related issues None ## Additional information None --------- Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
## Description
- Add conda clean --all -y to ci/env/install-miniforge.sh to reduce CI
image size.
- Local Mac arm64 build comparison:
- baseline: 4.95GB (compressed), 4,954,968,534 bytes (uncompressed)
- clean: 4.3GB (compressed), 4,300,122,206 bytes (uncompressed)
- delta: ~0.61 GiB (~0.65 GB)
- Note: - I only have a Mac arm64 environment, so the size measurements
were taken on Mac (aarch64) builds;
## Related issues
Fixes ray-project#59727
## Additional information
Test
- docker build -f ci/docker/base.test.Dockerfile -t
ray-base-test:baseline .
- docker build -f ci/docker/base.test.Dockerfile -t ray-base-test:clean
.
- docker images --format "{{.Repository}}:{{.Tag}}\t{{.Size}}" | rg
"ray-base-test"
- docker image inspect -f '{{.Size}}' ray-base-test:baseline
- docker image inspect -f '{{.Size}}' ray-base-test:clean
---------
Signed-off-by: yaommen <myanstu@163.com>
Co-authored-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
β¦ay-project#57279) ## Why are these changes needed? In `StandardAutoscaler`, `self.provider.internal_ip()` can raise exceptions if the node is not found by the provider, such as if too much time has passed since node was preempted and the provider has "forgotten" about the node. Any exceptions raised by `self.provider.internal_ip()` will cause `StandardAutoscaler` updates and node termination to fail. This change wraps most calls to `self.provider.internal_ip()` within try-catch blocks and provides reasonable fallback behavior. This should allow `StandardAutoscaler` updates and node termination to keep functioning in the presence of "forgotten" nodes. ## Related issue number Addresses ray-project#29698 ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run pre-commit jobs to lint the changes in this PR. ([pre-commit setup](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting)) - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [x] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Yifan Mai <yifan@cs.stanford.edu> Co-authored-by: Rueian <rueiancsie@gmail.com>
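The defensive pattern described above boils down to wrapping the provider call and falling back instead of propagating the exception. A minimal sketch under that assumption (the names here are illustrative, not the actual `StandardAutoscaler` code):
```py
def internal_ip_or_none(provider, node_id):
    """Return the node's internal IP, or None if the provider has
    'forgotten' the node (e.g. a long-gone preempted instance)."""
    try:
        return provider.internal_ip(node_id)
    except Exception:
        # Fall back instead of letting the autoscaler update or
        # node-termination path fail outright.
        return None
```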
## Description When token auth is enabled, the Ray log APIs need to pass the auth token in their request headers (the `get_log()` and `list_logs()` functions bypass the StateApiClient and use raw `requests.get()`). Signed-off-by: sampan <sampan@anyscale.com> Co-authored-by: sampan <sampan@anyscale.com>
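As a rough illustration of what attaching the token to those raw requests looks like (the header scheme and the `auth_token` plumbing here are assumptions, not the exact Ray implementation):
```py
import requests


def fetch_log_text(api_server_url, path, auth_token=None):
    # Hypothetical helper: attach the auth token on the raw request,
    # mirroring what StateApiClient-based calls already do.
    headers = {"Authorization": f"Bearer {auth_token}"} if auth_token else {}
    resp = requests.get(f"{api_server_url}{path}", headers=headers)
    resp.raise_for_status()
    return resp.text
```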
…de loop (ray-project#59190) The `parse_resource_demands()` and `pending_placement_groups` computations were being called inside the node iteration loop, causing redundant computation for each node. Since resource demands and placement groups are global (not per-node), these should be computed once before the loop. This reduces time complexity from O(N × M) to O(N + M), where N is the number of nodes and M is the number of resource demands. For a cluster with 100 nodes, this eliminates ~99% of redundant computation and reduces GIL hold time in the main thread. --------- Signed-off-by: mingfei <mingfei@mds-trading.com> Co-authored-by: mingfei <mingfei@mds-trading.com> Co-authored-by: Rueian <rueiancsie@gmail.com>
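The optimization is plain loop hoisting of cluster-global computations; a self-contained sketch with placeholder names (not the real autoscaler symbols):
```py
# Illustrative only: names and data shapes are placeholders.
def parse_resource_demands(summary):
    return [d for d in summary["demands"] if d["quantity"] > 0]


def pending_placement_groups(summary):
    return summary.get("pending_pgs", [])


def build_report(nodes, summary):
    # Hoisted out of the loop: these values are cluster-global, so
    # computing them per node was O(N * M) of redundant work.
    demands = parse_resource_demands(summary)
    pgs = pending_placement_groups(summary)
    return {node: {"demands": demands, "pending_pgs": pgs} for node in nodes}


summary = {"demands": [{"quantity": 2}, {"quantity": 0}], "pending_pgs": ["pg1"]}
print(build_report(["node-a", "node-b"], summary))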
…project#58701) ## Description > Support viewing PIDs for the Dashboard and the Runtime Env Agent ## Related issues > Related to ray-project#58700 --------- Signed-off-by: yang <yanghang233@126.com> Signed-off-by: Hang Yang <yanghang233@126.com> Signed-off-by: tianyi-ge <tianyig@outlook.com> Co-authored-by: tianyi-ge <tianyig@outlook.com>
## Description Before you would get a message that looks like: ``` Task Actor.f failed. There are 0 retries remaining, so the task will be retried. Error: The actor is temporarily unavailable: IOError: The actor was restarted ``` when a task with 1 retry gets retried. Doesn't make sense to get retried when there's "0 retries". This is because we decrement before pushing this message. Fixing this with a +1 in the msg str. Also we'd print the same thing for preemptions which could possibly not make sense, e.g. 0 retries remaining but still retrying when node is preempted because it doesn't count against retries. So now it explicitly says that this retry is because of preemption and it won't count against retries. Also removing an unused ray config - `raylet_fetch_timeout_milliseconds`. --------- Signed-off-by: dayshah <dhyey2019@gmail.com>
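To make the off-by-one concrete, here is an illustrative reproduction of the message construction (simplified; not the actual code path):
```py
# The counter is decremented before the message is built, so the last real
# retry used to report "0 retries remaining" even though it still retried.
def retry_message(retries_left_after_decrement, preempted=False):
    if preempted:
        return ("Task failed because the node was preempted; retrying "
                "(this does not count against retries).")
    # The +1 restores the value the user actually has left for this retry.
    return (f"Task failed. There are {retries_left_after_decrement + 1} "
            f"retries remaining, so the task will be retried.")


print(retry_message(0))        # reports 1 retry remaining, not 0
print(retry_message(0, True))  # preemption-specific message
```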
…ject#59506) ## Description Currently, when we get an error in `tail_job_logs`, we do not raise it because of the backward-compatibility issue mentioned in ray-project#57037 (comment). This causes inconvenience: when `tail_job_logs` completes, we cannot tell whether it completed successfully or hit an error. This PR raises the `tail_job_logs` error in newer Ray versions only, to keep backward compatibility. ## Related issues Related to: ray-project/kuberay#4285 --------- Signed-off-by: machichima <nary12321@gmail.com> Co-authored-by: Rueian <rueiancsie@gmail.com>
- Fix parent_task_id to use SubmitterTaskId for concurrent actors - Add missing fields: call_site, label_selector, is_debugger_paused, actor_repr_name - Fix func_or_class_name to use CallString() for consistency with event_buffer -> GCS path Signed-off-by: sampan <sampan@anyscale.com> Signed-off-by: Sampan S Nayak <sampansnayak2@gmail.com> Co-authored-by: sampan <sampan@anyscale.com>
β¦59434) ## Description i am using ray with the following code: ```python import logging import pandas as pd import ray logger = logging.getLogger(__name__) class QwenPredictor: def __init__(self): logger.info("download ckpt...") async def __call__(self, x): logger.info("start predict...") return x if __name__ == "__main__": ray.init( ignore_reinit_error=True, logging_config=ray.LoggingConfig( encoding="TEXT", log_level="INFO", additional_log_standard_attrs=["asctime"] ) ) context = ray.data.DataContext.get_current() context.enable_progress_bars = False input = ray.data.from_pandas( pd.DataFrame({ "id": [i for i in range(10)], }) ) output = input.map_batches( fn=QwenPredictor, batch_size=10, num_cpus=1, concurrency=1 ) output.count() ``` execute this code, and i get ``` 2025-12-15 20:24:37,628 INFO 1.py:11 -- download ckpt... asctime=2025-12-15 20:24:37,628 job_id=01000000 worker_id=3762350aee1ab375c12794dfb65aaaeac9bca9877b29e41d05b5fb03 node_id=e23283ea1db083d7c27ed7be410e3de6f9c2c3cbaf863799bc7faf3a actor_id=20004433089e747379beb49001000000 task_id=ffffffffffffffff20004433089e747379beb49001000000 task_name=MapWorker(MapBatches(Qwen3ASRPredictor)).__init__ task_func_name=ray.data._internal.execution.operators.actor_pool_map_operator.MapWorker(MapBatches(Qwen3ASRPredictor)).__init__ actor_name= timestamp_ns=1765801477628095000 2025-12-15 20:24:37,656 INFO 1.py:14 -- start predict... asctime=2025-12-15 20:24:37,656 job_id=01000000 worker_id=3762350aee1ab375c12794dfb65aaaeac9bca9877b29e41d05b5fb03 node_id=e23283ea1db083d7c27ed7be410e3de6f9c2c3cbaf863799bc7faf3a actor_id=20004433089e747379beb49001000000 task_id=bc18b04d6a8770425e09011be724738669b2826e01000000 task_name= task_func_name= actor_name= timestamp_ns=1765801477656984000 ``` It can be seen that when the value of log value is an empty string, Ray outputs ${log key}= (for example, task_name= task_func_name= actor_name= ) I futhur figure out why these values were empty, and found out that in ray[data], a thread pool/thread are created to execute the `__call__` method. [sync](https://github.com/ray-project/ray/blob/fb2c7b2a50bebb702893605886b838ecd89a75ee/python/ray/data/_internal/execution/util.py#L79) [async](https://github.com/ray-project/ray/blob/fb2c7b2a50bebb702893605886b838ecd89a75ee/python/ray/data/_internal/planner/plan_udf_map_op.py#L102), and this new thread does not set the task's spec. [link](https://github.com/ray-project/ray/blob/fb2c7b2a50bebb702893605886b838ecd89a75ee/src/ray/core_worker/core_worker.cc#L2852) Therefore, the task name is an empty string. [link](https://github.com/ray-project/ray/blob/fb2c7b2a50bebb702893605886b838ecd89a75ee/src/ray/core_worker/core_worker.h#L263) Since setting the task spec is an internal behavior, I think it's not easy to modify. Therefore, in this PR, I made a small modification: when the log value is an empty string, the log key will not be displayed. --------- Co-authored-by: xiaowen.wxw <wxw403883@alibaba-inc.com>
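The fix amounts to skipping `key=value` pairs whose value is an empty string when the log line is rendered. A standalone sketch of that behavior using the standard `logging` module (illustrative only, not Ray's actual formatter):
```py
import logging


class DropEmptyContextFormatter(logging.Formatter):
    """Append context key=value pairs, skipping keys whose value is an
    empty string (mirrors the behavior described in the PR)."""

    CONTEXT_KEYS = ["job_id", "worker_id", "node_id", "task_name", "task_func_name", "actor_name"]

    def format(self, record):
        base = super().format(record)
        parts = []
        for key in self.CONTEXT_KEYS:
            value = getattr(record, key, "")
            if value != "":  # the fix: empty values are omitted entirely
                parts.append(f"{key}={value}")
        return base + (" " + " ".join(parts) if parts else "")


handler = logging.StreamHandler()
handler.setFormatter(DropEmptyContextFormatter("%(asctime)s %(levelname)s -- %(message)s"))
logger = logging.getLogger("demo")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
# task_name is empty, so "task_name=" is no longer printed.
logger.info("start predict...", extra={"job_id": "01000000", "task_name": ""})
```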
β¦cking (ray-project#59278) ## Description ## Problem Statement Currently, the schedulerβs node feasibility and availability checks are inconsistent with the actual resource allocation logic. The scheduler reasons only about aggregated GPU capacity per node, while the allocator(local) enforces constraints based on the per-GPU topology. For example, consider a node with two GPUs, each with 0.2 GPU remaining. The scheduler observes 0.4 GPU available in total and concludes that a actor requesting 0.4 GPU can be placed on this node. However, the allocator(local) rejects the request because no single GPU has 0.4 GPU available. ## what this PR does The high-level goal of this PR is to make node feasibility and availability checks consistent between the scheduler and the resource allocator. Although the detailed design is still a work in progress and need big refactor, the first step is to make the schedulerβs node feasibility and availability checks itself consistent and centralized. Right now, Ray has three scheduling paths: - Normal task scheduling - Normal actor scheduling - Placement group - Placement Group reservation(scheduling bundle) - Task/Actor with Placement Group Tasks and actors essentially share the same scheduling path and use the same node feasibility and availability check function. Placement group scheduling, however, implements its own logic in certain path, even though it is conceptually the same. Since we may override or extend the node feasibility and availability checks in later PRs, it is better to first ensure that all scheduling paths use a single, shared implementation of this logic. This PR addresses that problem. ## Related issues Related to ray-project#52133 ray-project#54729 ## Additional information Here I list all the cases that make sure we are relying on the same node feasibility and availability checking func. 
Later we can just focusing on changing the func and underlying data structure: **Normal task/actor schedulingοΌ** - HybridSchedulingPolicyοΌ - https://github.com/ray-project/ray/blob/555fab350c1c3179195889c437fe6213b416114c/src/ray/raylet/scheduling/policy/hybrid_scheduling_policy.cc#L41 - https://github.com/ray-project/ray/blob/555fab350c1c3179195889c437fe6213b416114c/src/ray/raylet/scheduling/policy/hybrid_scheduling_policy.cc#L137 - SpreadSchedulingPolicy: - https://github.com/ray-project/ray/blob/555fab350c1c3179195889c437fe6213b416114c/src/ray/raylet/scheduling/policy/spread_scheduling_policy.cc#L49 - https://github.com/ray-project/ray/blob/555fab350c1c3179195889c437fe6213b416114c/src/ray/raylet/scheduling/policy/spread_scheduling_policy.cc#L54 - RandomSchedulingPolicy - https://github.com/ray-project/ray/blob/456d1903277668c1f79f3eb230b908a6e6c403a8/src/ray/raylet/scheduling/policy/random_scheduling_policy.cc#L47-L48 - NodeAffinitySchedulingPolicy - Don't care, just schedule to user specified node by default - Having fallback option that checks: https://github.com/ray-project/ray/blob/1180868dd4472b444aaffb83a72779adc0dbe1e8/src/ray/raylet/scheduling/policy/node_affinity_scheduling_policy.cc#L26-L30 - NodeLabelSchedulingPolicy - https://github.com/ray-project/ray/blob/1180868dd4472b444aaffb83a72779adc0dbe1e8/src/ray/raylet/scheduling/policy/node_label_scheduling_policy.cc#L171 - https://github.com/ray-project/ray/blob/1180868dd4472b444aaffb83a72779adc0dbe1e8/src/ray/raylet/scheduling/policy/node_label_scheduling_policy.cc#L186 **Placement Group reservation(scheduling bundle)οΌ** - PACK/SPREAD/STRICT_SPREAD - https://github.com/ray-project/ray/blob/1180868dd4472b444aaffb83a72779adc0dbe1e8/src/ray/raylet/scheduling/policy/scorer.cc#L58 - Note, after this PR, it will also be IsAvailable - STRICT_SPREAD - https://github.com/ray-project/ray/blob/1180868dd4472b444aaffb83a72779adc0dbe1e8/src/ray/raylet/scheduling/policy/scorer.cc#L58 - Note, after this PR, it will also be IsAvailable - https://github.com/ray-project/ray/blob/1180868dd4472b444aaffb83a72779adc0dbe1e8/src/ray/raylet/scheduling/policy/bundle_scheduling_policy.cc#L185 **Task/Actor with Placement GroupοΌ** - AffinityWithBundleSchedulingPolicy - https://github.com/ray-project/ray/blob/1180868dd4472b444aaffb83a72779adc0dbe1e8/src/ray/raylet/scheduling/policy/affinity_with_bundle_scheduling_policy.cc#L25-L26 --------- Signed-off-by: yicheng <yicheng@anyscale.com> Co-authored-by: yicheng <yicheng@anyscale.com>
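The two-GPU example from the description can be reduced to a few lines that show why the aggregate check and the per-GPU allocator disagree (illustrative arithmetic only):
```py
# Node with two GPUs, each with 0.2 GPU remaining; request asks for 0.4 GPU.
per_gpu_available = [0.2, 0.2]
request = 0.4

aggregate_check = sum(per_gpu_available) >= request             # True: 0.4 >= 0.4
allocator_check = any(g >= request for g in per_gpu_available)  # False: no single GPU fits

print(aggregate_check, allocator_check)  # True False -> scheduler and allocator disagree
```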
…project#59830) Signed-off-by: dayshah <dhyey2019@gmail.com> Signed-off-by: Dhyey Shah <dhyey2019@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
β¦t#58695) ## Description This PR adds a new documentation page, Head Node Memory Management, under the Ray Core advanced topics section. ## Related issues Closes ray-project#58621 ## Additional information <img width="2048" height="1358" alt="image" src="https://github.com/user-attachments/assets/3b98150d-05e6-4d15-9cd3-7e05e82ff516" /> <img width="2048" height="498" alt="image" src="https://github.com/user-attachments/assets/4ec8fe43-e3a5-4df4-bca7-376ae407c77b" /> --------- Signed-off-by: Dongjun Na <kmu5544616@gmail.com> Co-authored-by: Ibrahim Rabbani <israbbani@gmail.com> Co-authored-by: Rueian <rueiancsie@gmail.com> Co-authored-by: Ibrahim Rabbani <irabbani@anyscale.com>
β¦y-project#59845) - [x] Update the docstring for `ray.shutdown()` in `python/ray/_private/worker.py` to clarify: - When connecting to a remote cluster via `ray.init(address="xxx")`, `ray.shutdown()` only disconnects the client and does NOT terminate the remote cluster - Only local clusters started by `ray.init()` will have their processes terminated by `ray.shutdown()` - Clarified that `ray.init()` without address argument will auto-detect existing clusters - [x] Add documentation note to `doc/source/ray-core/starting-ray.rst` explaining the same behavior difference - [x] Review the changes via code_review - [x] Run codeql_checker for security scan (no code changes requiring analysis) --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
## Description Upgrading the CUDA base GPU image from 11.8 to 12.8.1. This is required for future py3.13 dependency upgrades. --------- Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com> Co-authored-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
…roject#59735) ## Description ### Problem Using `--entrypoint-resources '{"fragile_node":"!1"}'` with the Job API raises an error saying only numeric values are allowed. ### Expected behavior `--entrypoint-resources` should accept label selectors just like `ray.remote`/placement groups, so entrypoints can target or avoid nodes with specific labels. ## Related issues Fixes ray-project#58662 ## Additional information ### Implementation approach (a sketch of the split is shown below) - Relax `JobSubmitRequest.entrypoint_resources` validation to allow string values (`python/ray/dashboard/modules/job/common.py`). - Add `_split_entrypoint_resources()` to separate numeric requests from selector strings and run them through `validate_label_selector` (`python/ray/dashboard/modules/job/job_manager.py`). - Pass numeric resources via the existing `resources` option and the selector dict via `label_selector` when spawning the job supervisor, leaving the field unset if only resources were provided (`python/ray/dashboard/modules/job/job_manager.py`). - Extend CLI parsing/tests to cover string-valued resources and assert selector plumbing through the job manager (`python/ray/dashboard/modules/job/tests/test_cli.py`, `python/ray/dashboard/modules/job/tests/test_common.py`, `python/ray/dashboard/modules/job/tests/test_job_manager.py`). Signed-off-by: yaommen <myanstu@163.com>
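A minimal sketch of the numeric-vs-selector split (the function body is illustrative; only the names `_split_entrypoint_resources` and `validate_label_selector` come from the PR description):
```py
def split_entrypoint_resources(entrypoint_resources):
    """Separate numeric resource requests from label-selector strings."""
    numeric, selectors = {}, {}
    for key, value in (entrypoint_resources or {}).items():
        if isinstance(value, (int, float)):
            numeric[key] = value          # passed via the existing `resources` option
        else:
            selectors[key] = str(value)   # passed via `label_selector` after validation
    return numeric, selectors


print(split_entrypoint_resources({"CPU": 2, "fragile_node": "!1"}))
# ({'CPU': 2}, {'fragile_node': '!1'})
```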
update with more up-to-date information, and format the markdown file a bit Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
as it builds the artifact generically, not running OSS-specific logic (e.g. uploading to the Ray wheels S3). Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
β¦ into a separate module (ray-project#60188) This PR stacks on ray-project#60121. This is 5/N in a series of PRs to remove Centralized Actor Scheduling by the GCS (introduced in ray-project#15943). The feature is off by default and no longer in use or supported. In this PR, - Moving the gcs_actor_* files into a separate bazel module `/ray/gcs/actor/` - Moves the LocalLeaseManager into `/ray/raylet/scheduling` with its friends - Enabling cpplint Pure refactoring. No logic changes. --------- Signed-off-by: irabbani <irabbani@anyscale.com> Signed-off-by: Ibrahim Rabbani <irabbani@anyscale.com> Co-authored-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
β¦project#60240) ## Description The test seems to have gone flaky because we're spawning enough tasks fast enough to trigger throttling on downloading the input dataset (getting errors and failing tasks and thus causing the job to fail). We deliberated a bit, and we decided this test isn't actually adding much value anyhow. So instead of adding retries or vendoring the dataset, we're just gonna put the kabosh on this. ## Related issues > Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to ray-project#1234". ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: zac <zac@anyscale.com> Signed-off-by: Zac Policzer <zac@anyscale.com>
…ding behavior (ray-project#60199) Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
setting the lower bound for aiohttp to v3.13.3 due to security vulnerabilities in previous versions; also upgrading aiosignal to 1.4.0 due to the new aiohttp version. Open issue: ray-project#59943 Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com> Co-authored-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
…ject#60197) Adds a new script `push_ray_image.py` for publishing Wanda-cached Ray images to Docker Hub. The focus is replicating the tagging logic from `ci/ray_ci/docker_container.py`. Many test cases were added to try to replicate the existing publishing cases, but it'll be good to hear if any others would be helpful. Signed-off-by: andrew <andrew@anyscale.com>
## Description Updates the IMPALA examples and premerge with CartPole and TicTacToe. We only have a minimal number of examples as most users should use PPO or APPO rather than IMPALA. --------- Signed-off-by: Mark Towers <mark@anyscale.com> Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com> Co-authored-by: Mark Towers <mark@anyscale.com> Co-authored-by: Kamil Kaczmarek <kamil@anyscale.com>
remove unused sdk mock methods Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
…t#60226) x86 is already ported, so this just follows up on that. Added comments where architecture-specific deviations were made. ray-dashboard needed to be updated because, AFAIK, wanda only pulls the host's default architecture. ray-wheel-build-aarch64 needs ray-dashboard, so we needed to add $ARCH_SUFFIX to make the name unique across architectures. Topic: wanda-aarch64-wheel Signed-off-by: andrew <andrew@anyscale.com> --------- Signed-off-by: andrew <andrew@anyscale.com> Signed-off-by: Andrew Pollack-Gray <andrew@anyscale.com> Co-authored-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
no real use; the only usage installs `ray[default]`, which is not the right usage. Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
all tests are using python 3.10 or above now Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
removes python 3.9 reference in CI scripts and docs; ray only supports python 3.10+ now Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
so that it is consistent no matter how we build the wheel. The `[cpp]` and `[all-cpp]` extras are not included in `[all]` today, so unless users explicitly specify them, they will be skipped. As a result, we do not need to drop them from the extra declarations but can just always include them. Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
## Description While working on ray-project#60241, I realized that it is possible to use `OpState` as keys to dictionaries. This didn't occur to me in the past, so I had to use a workaround where I would have a `progress_manager_uuid` flag in `OpState` to link between the progress managers and `OpState`, creating tight coupling. This PR removes this. ## Related issues N/A ## Additional information N/A Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>
…ject#60208) This PR updates DefaultClusterAutoscalerV2 to safely handle nodes with 0 logical CPUs by replacing direct dictionary access (`r["CPU"]`) with `r.get("CPU", 0)`, preventing crashes on dedicated GPU nodes. This fix has been discussed firsthand with @bveeramani. Fixes ray-project#60166 --------- Signed-off-by: Limark Dcunha <limarkdcunha@gmail.com>
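A minimal before/after illustration of the crash and the `dict.get` fix:
```py
# Dedicated GPU node: no "CPU" key at all in its logical resources.
resources = {"GPU": 8}

# Before: direct indexing raises KeyError on such nodes.
try:
    cpus = resources["CPU"]
except KeyError:
    print("resources['CPU'] raised KeyError")

# After: default to 0 logical CPUs instead of crashing.
cpus = resources.get("CPU", 0)
print(cpus)  # 0
```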
β¦meters for the 'serve' API (ray-project#56507) <!-- Thank you for your contribution! Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- Please add a reviewer to the assignee section when you create a PR. If you don't have the access to it, we will shortly find a reviewer and assign them to your PR. --> ## Why are these changes needed? Serve `http_options` and `proxy_location` (which determines `http_options.location`) can't be updated in runtime. When users try to update them they receive warnings in the following format: ``` WARNING 2025-11-03 20:51:10,462 serve 23061 -- The new client HTTP config differs from the existing one in the following fields: ['host', 'port', 'location']. The new HTTP config is ignored. -- 2025-11-03 20:51:59,590 WARNING serve_head.py:281 -- Serve is already running on this Ray cluster and it's not possible to update its HTTP options without restarting it. Following options are attempted to be updated: ['host']. ``` This PR: - change warning to failing with the error `RayServeConfigException` - eliminate `validate_http_options` function in the `dashboard.modules.serve.serve_head` as it's not needed anymore - update documentation for [Proxy config](https://docs.ray.io/en/latest/serve/production-guide/config.html#proxy-config) that the parameter is global and can't be updated at runtime. - change `HTTPOptions.location` default value `HeadOnly` -> `EveryNode` (it's likely the desired value) -------------------------------------------------------------- User scenario: - have a file `hello_world.py`: ``` # hello_world.py from ray.serve import deployment @deployment async def hello_world(): return "Hello, world!" hello_world_app = hello_world.bind() ``` - execute commands: ``` ray stop ray start --head serve build -o config.yaml hello_world:hello_world_app # generate `config.yaml` serve deploy config.yaml # in the `config.yaml` file update: # proxy_location: EveryNode -> HeadOnly # http_options.host: 0.0.0.0 -> 0.0.0.1 # http_options.port: 8000 -> 8001 serve deploy config.yaml ``` Output before the change: ``` # stdout: bash$ serve deploy config.yaml 2025-09-14 17:19:15,606 INFO scripts.py:239 -- Deploying from config file: 'config.yaml'. 2025-09-14 17:19:15,619 SUCC scripts.py:359 -- Sent deploy request successfully. * Use `serve status` to check applications' statuses. * Use `serve config` to see the current application config(s). # /tmp/ray/session_latest/logs/dashboard_ServeHead.log 2025-09-14 17:19:15,615 WARNING serve_head.py:177 -- Serve is already running on this Ray cluster and it's not possible to update its HTTP options without restarting it. Following options are attempted to be updated: ['location', 'host', 'port']. ``` Output after the change: ``` # stdout: bash$ serve deploy config.yaml 2025-11-03 21:04:51,252 INFO scripts.py:243 -- Deploying from config file: 'config.yaml'. 
Traceback (most recent call last): File "~/ray/.venv/bin/serve", line 7, in <module> sys.exit(cli()) File "~/ray/.venv/lib/python3.9/site-packages/click/core.py", line 1157, in __call__ return self.main(*args, **kwargs) File "~/ray/.venv/lib/python3.9/site-packages/click/core.py", line 1078, in main rv = self.invoke(ctx) File "~/ray/.venv/lib/python3.9/site-packages/click/core.py", line 1688, in invoke return _process_result(sub_ctx.command.invoke(sub_ctx)) File "~/ray/.venv/lib/python3.9/site-packages/click/core.py", line 1434, in invoke return ctx.invoke(self.callback, **ctx.params) File "~/ray/.venv/lib/python3.9/site-packages/click/core.py", line 783, in invoke return __callback(*args, **kwargs) File "~/ray/python/ray/serve/scripts.py", line 360, in deploy ServeSubmissionClient(address).deploy_applications( File "~/ray/python/ray/dashboard/modules/serve/sdk.py", line 80, in deploy_applications self._raise_error(response) File "~/ray/python/ray/dashboard/modules/dashboard_sdk.py", line 283, in _raise_error raise RuntimeError( RuntimeError: Request failed with status code 500: Traceback (most recent call last): File "~/ray/python/ray/dashboard/optional_utils.py", line 188, in decorator return await f(self, *args, **kwargs) File "~/ray/python/ray/dashboard/modules/serve/serve_head.py", line 37, in check return await func(self, *args, **kwargs) File "~/ray/python/ray/dashboard/modules/serve/serve_head.py", line 167, in put_all_applications client = await serve_start_async( File "~/ray/python/ray/serve/_private/api.py", line 148, in serve_start_async _check_http_options(client.http_config, http_options) File "~/ray/python/ray/serve/_private/api.py", line 51, in _check_http_options raise RayServeConfigException( ray.serve.exceptions.RayServeConfigException: Attempt to update `http_options` or `proxy_location` has been detected! Attempted updates: `{'location': {'previous': 'EveryNode', 'new': 'HeadOnly'}, 'host': {'previous': '0.0.0.0', 'new': '0.0.0.1'}, 'port': {'previous': 8000, 'new': 8001}}`. HTTP config is global to your Ray cluster, and you can't update it during runtime. Please restart Ray Serve to apply the change. ``` The same behavior for serve REST API: ``` bash$ curl -X PUT "http://localhost:8265/api/serve/applications/" -H "Accept: application/json" -H "Content-Type: application/json" -d '{ > "http_options": { > Makefile config.yaml config_changed.yaml discrepancy.py hello_world_2.py out test_failure.py __pycache__/ config_3_apps.yaml curl hello_world.py new rez.json > "host": "0.0.0.1" > }, > "applications": [ > { > "name": "app1", > "route_prefix": "/", > "import_path": "hello_world:hello_world_app", > "runtime_env": {}, > "deployments": [ > { "name": "hello_world" } > ] > } > ] > }' Traceback (most recent call last): File "~/ray/python/ray/dashboard/optional_utils.py", line 188, in decorator return await f(self, *args, **kwargs) File "~/ray/python/ray/dashboard/modules/serve/serve_head.py", line 37, in check return await func(self, *args, **kwargs) File "~/ray/python/ray/dashboard/modules/serve/serve_head.py", line 167, in put_all_applications client = await serve_start_async( File "~/ray/python/ray/serve/_private/api.py", line 148, in serve_start_async _check_http_options(client.http_config, http_options) File "~/ray/python/ray/serve/_private/api.py", line 51, in _check_http_options raise RayServeConfigException( ray.serve.exceptions.RayServeConfigException: Attempt to update `http_options` or `proxy_location` has been detected! 
Attempted updates: `{'host': {'previous': '0.0.0.0', 'new': '0.0.0.1'}}`. HTTP config is global to your Ray cluster, and you can't update it during runtime. Please restart Ray Serve to apply the change. ``` -------------------------------------------- A thing worth to mention is that, this change makes explicit the discrepancy between default `host` value in [serve.config.HTTPOptions](https://github.com/ray-project/ray/blob/master/python/ray/serve/config.py#L433) (host="127.0.0.1") vs [serve.schema.HTTPOptionsSchema](https://github.com/ray-project/ray/blob/master/python/ray/serve/schema.py#L683) (host="0.0.0.0"). `serve.config.HTTPOptions` is primarily used in imperative serve API (Python API or CLI with params) and `serve.schema.HTTPOptionsSchema` is used in declarative serve API (REST API or `deploy/run` with config file) Previously, when users use commands `start` and then `deploy` or `run` with default params - `http_options` from the `start` command were used. Now we explicitly failing in this scenario: ``` ray stop ray start --head serve build -o config.yaml hello_world:hello_world_app # generate `config.yaml` with default values serve start serve deploy config.yaml ... ray.serve.exceptions.RayServeConfigException: Attempt to update `http_options` or `proxy_location` has been detected! Attempted updates: `{'host': {'previous': '127.0.0.1', 'new': '0.0.0.0'}}`. HTTP config is global to your Ray cluster, and you can't update it during runtime. Please restart Ray Serve to apply the change. ``` Maybe the `host` default value should be aligned in `HTTPOptions` and `HTTPOptionsSchema`. <!-- Please give a short summary of the change and the problem this solves. --> ## Related issue number <!-- For example: "Closes ray-project#1234" --> Closes ray-project#56163 ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [x] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [x] Unit tests - [ ] Release tests - [ ] This PR is not tested :( <!-- CURSOR_SUMMARY --> --- > [!NOTE] > Disallow runtime updates to global `http_options`/`proxy_location` by raising `RayServeConfigException`, fix default proxy location handling, refactor validation, and update docs/tests/CLI accordingly. > > - **Serve API (backend)** > - Raise `RayServeConfigException` on attempts to change global `http_options`/`proxy_location` via `_check_http_options` (now compares against `client.http_config`, normalizes `ProxyLocation`/`DeploymentMode`). > - Fix default proxy location handling: `_prepare_http_options` sets `DeploymentMode.EveryNode` when `proxy_location=None`; `start()` uses it. > - Adjust `serve_start`/`serve_start_async` to pass `client.http_config` into `_check_http_options`. > - Add `RayServeConfigException` in `ray.serve.exceptions`. > - **Dashboard REST** > - Remove `validate_http_options` warning logic from `serve_head.py`; rely on backend check. > - PUT with changed HTTP/proxy config now fails (500) instead of warning. > - **CLI/Tests** > - Update tests to expect failure on config changes and to use explicit HTTP host where needed. 
> - Add tests for `_prepare_http_options` and `serve.start` rejection on changed HTTP config; add CLI test verifying detailed diff in error. > - **Docs** > - Note that `proxy_location` and HTTP/gRPC configs are cluster-global and cannot be updated at runtime. > > <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 1f690a6. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup> <!-- /CURSOR_SUMMARY --> --------- Signed-off-by: axreldable <aleksei.starikov.ax@gmail.com>
## Why are these changes needed? When `RAY_SERVE_USE_GRPC_BY_DEFAULT=1`, inter-deployment calls use gRPC instead of Ray actor calls. In this mode, `_ray_trace_ctx` is not injected into kwargs since gRPC calls bypass Ray's tracing decorators. Tracing context propagation for gRPC mode requires additional work to properly capture and forward the context. This PR skips the `test_deployment_remote_calls_with_tracing` test in gRPC mode until a proper solution is implemented. ## Related issue number Tracking issue: ray-project#60223 ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for <https://docs.ray.io/en/master/>. - [x] I've added any new APIs to the API Reference. For doc changes, see [Contribute Docs](https://docs.ray.io/en/latest/ray-contribute/docs.html). - [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [x] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Seiji Eicher <seiji@anyscale.com>
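A sketch of what the skip looks like with pytest, assuming the `RAY_SERVE_USE_GRPC_BY_DEFAULT` environment variable named above (the exact condition used in the test suite may differ):
```py
import os

import pytest

USE_GRPC = os.environ.get("RAY_SERVE_USE_GRPC_BY_DEFAULT") == "1"


@pytest.mark.skipif(
    USE_GRPC,
    reason="gRPC inter-deployment calls do not yet propagate _ray_trace_ctx "
    "(tracked in ray-project#60223).",
)
def test_deployment_remote_calls_with_tracing():
    ...
```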
## Description Organize read/write tests under python/ray/data/tests/datasource and update the HuggingFace helper import to match the new path. ## Related issues Link related issues: "Fixes ray-project#60164" --------- Signed-off-by: kriyanshii <kriyanshishah06@gmail.com> Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu> Co-authored-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com> Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu>
β¦es (ray-project#60084) (ray-project#60091) Categorize APIs into Public APIs and Developer APIs, and sort them alphabetically by service name. Changes: - Reorganized loading_data.rst and saving_data.rst with Public APIs first, then Developer APIs - Sorted all APIs alphabetically by service name within each section - Sections that originally had APIs for both Public and Developer APIs were divided to respective sections - Removed datasource.FastFileMetadataProvider API that has been removed ([reference](ray-project#59027)) Fixes ray-project#60084 Signed-off-by: mgchoi239 <mg.choi.239@gmail.com> --------- Signed-off-by: mgchoi239 <mg.choi.239@gmail.com> Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu> Co-authored-by: mgchoi239 <mg.choi.239@gmail.com> Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu>
ray-project#60133) ## Description > Make DefaultClusterAutoscalerV2 knobs configurable via environment variables ## Related issues > Closes ray-project#60004 --------- Signed-off-by: Rushikesh Adhav <adhavrushikesh6@gmail.com> Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
## Description Features were introduced to log progress separately in non-atty situations to prevent spamming. This feature was introduced all over the place, so this PR groups the logging part into a separate `LoggingExecutionProgressManager`, similar to how we group the other implementations (ie: rich, tqdm) ## Related issues Fixes ray-project#60083 ## Additional information All feedback to specific UI is welcome. --------- Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>
- adding .claude folder to .gitignore so that users can --------- Signed-off-by: harshit <harshit@anyscale.com>
Code Review
This pull request is an automated daily merge from master to main. It includes a wide range of changes, primarily focused on a major refactoring of the CI/CD system, dependency updates, and significant documentation improvements. Key changes include a new modular build system using wanda, dropping Python 3.9 support in many areas, and adding new tutorials and internal design documents. I've identified one potential issue with the CI test selection rules that could lead to inefficiencies.
```
*
@ ml tune train data serve
@ core_cpp cpp java python doc
@ linux_wheels macos_wheels dashboard tools release_tests
;
```
This wildcard rule at the end of the file will match every file changed in a pull request and assign a large set of tags (ml, tune, train, data, serve, core_cpp, cpp, java, python, doc, linux_wheels, macos_wheels, dashboard, tools, release_tests). This will cause a significant number of tests to run for any change, regardless of its scope, potentially leading to very long and expensive CI runs. Was this intentional, or should this rule be more specific or removed?
This pull request has been automatically marked as stale because it has not had recent activity. You can always ask for help on our discussion forum or Ray's public Slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed.
This Pull Request was created automatically to merge the latest changes from `master` into the `main` branch.
Created: 2026-01-20
Merge direction: `master` → `main`
Triggered by: Scheduled
Please review and merge if everything looks good.