
πŸ”„ daily merge: master β†’ main 2026-01-20#750

Open
antfin-oss wants to merge 410 commits into main from
create-pull-request/patch-390ae76911

Conversation

@antfin-oss

This Pull Request was created automatically to merge the latest changes from the master branch into the main branch.

πŸ“… Created: 2026-01-20
πŸ”€ Merge direction: master β†’ main
πŸ€– Triggered by: Scheduled

Please review and merge if everything looks good.

aslonnie and others added 30 commits January 3, 2026 17:07
not used anywhere anymore. The min install tests use Dockerfiles to set up
the test environments now

Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
and remove python 3.9 related tests

Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
## Description

When the Ray process unexpectedly terminates and the node ID changes, we
cannot determine the previous and current node IDs through events because we
did not include `node_id` in the base event.

This feature is needed by the history server for the Ray cluster. When
the Ray process unexpectedly terminates, we have to flush the events
generated by the previous node. If the Ray process was restarted quickly,
it is difficult to know which events were generated by the previous node.

This PR adds `node_id` to the base event to show where the event is
being emitted.

### Main changes

- `src/ray/protobuf/public/events_base_event.proto`
    - Add the node ID to the base event proto (`RayEvent`)

For GCS:

- `src/ray/gcs/gcs_server_main.cc`
    - Add `--node_id` as a CLI argument
- `src/ray/observability/` and `src/ray/gcs/` (some files)
    - Add `node_id` as an argument and pass it to `RayEvent`

For CoreWorker:

- `src/ray/core_worker/`
    - Pass `node_id` to `RayEvent`

Python side:

- `python/ray/_private/node.py`
    - Pass `node_id` when starting the GCS server


## Related issues

Closes ray-project#58879

## Additional information

### Testing process

1. Export env vars:
    - `RAY_enable_core_worker_ray_event_to_aggregator=1`
    - `RAY_DASHBOARD_AGGREGATOR_AGENT_EVENTS_EXPORT_ADDR="http://localhost:8000"`
2. `ray start --head --system-config='{"enable_ray_event":true}'`
3. Submit a simple job: `ray job submit -- python rayjob.py`. E.g.

```py
import ray

@ray.remote
def hello_world():
    return "hello world"

ray.init()
print(ray.get(hello_world.remote()))
```

4. Run the event listener (script below) to start listening on the event
export address: `python event_listener.py`

```py
import http.server
import json
import socketserver

PORT = 8000

class EventReceiver(http.server.SimpleHTTPRequestHandler):
    def do_POST(self):
        # Read and print the exported event batch, then acknowledge it.
        content_length = int(self.headers['Content-Length'])
        post_data = self.rfile.read(content_length)
        print(json.loads(post_data.decode('utf-8')))
        self.send_response(200)
        self.send_header('Content-type', 'application/json')
        self.end_headers()
        self.wfile.write(json.dumps({"status": "success", "message": "Event received"}).encode('utf-8'))


if __name__ == "__main__":
    with socketserver.TCPServer(("", PORT), EventReceiver) as httpd:
        print(f"Serving event listener on http://localhost:{PORT}")
        print("Set RAY_DASHBOARD_AGGREGATOR_AGENT_EVENTS_EXPORT_ADDR to this address.")
        httpd.serve_forever()
```

You will get events like the following (a small helper for grouping them by node follows these examples):

- GCS event:

```json
[
   {
      "eventId":"A+yrzknbyDjQALBvaimTPcyZWll9Te3Tw+FEnQ==",
      "sourceType":"GCS",
      "eventType":"DRIVER_JOB_LIFECYCLE_EVENT",
      "timestamp":"2025-12-07T10: 54: 12.621560Z",
      "severity":"INFO",
      "sessionName":"session_2025-12-07_17-33-33_853835_27993",
      "driverJobLifecycleEvent":{
         "jobId":"BAAAAA==",
         "stateTransitions":[
            {
               "state":"FINISHED",
               "timestamp":"2025-12-07T10: 54: 12.621562Z"
            }
         ]
      },
      "nodeId":"k4hj3FDLYStB38nSSRZQRwOjEV32EoAjQe3KPw==",   // <- nodeId set
      "message":""
   }
]
```

- CoreWorker event:

```json
[
   {
      "eventId":"TIAp8D4NwN/ne3VhPHQ0QnsBCYSkZmOUWoe6zQ==",
      "sourceType":"CORE_WORKER",
      "eventType":"TASK_DEFINITION_EVENT",
      "timestamp":"2025-12-07T10:54:12.025967Z",
      "severity":"INFO",
      "sessionName":"session_2025-12-07_17-33-33_853835_27993",
      "taskDefinitionEvent":{
         "taskId":"yoDzqOi6LlD///////////////8EAAAA",
         "taskFunc":{
            "pythonFunctionDescriptor":{
               "moduleName":"rayjob",
               "functionName":"hello_world",
               "functionHash":"a37aacc3b7884c2da4aec32db6151d65",
               "className":""
            }
         },
         "taskName":"hello_world",
         "requiredResources":{
            "CPU":1.0
         },
         "jobId":"BAAAAA==",
         "parentTaskId":"//////////////////////////8EAAAA",
         "placementGroupId":"////////////////////////",
         "serializedRuntimeEnv":"{}",
         "taskAttempt":0,
         "taskType":"NORMAL_TASK",
         "language":"PYTHON",
         "refIds":{
            
         }
      },
      "nodeId":"k4hj3FDLYStB38nSSRZQRwOjEV32EoAjQe3KPw==",   // <- nodeId set here
      "message":""
   }
]
```
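As a quick sanity check on these payloads, a small helper like the following (a sketch only, not part of this PR) groups exported events by `nodeId`, which is what the history server needs to separate events from a restarted node:

```python
import json
from collections import defaultdict

def group_events_by_node(payload: str) -> dict:
    """Group a JSON event batch (as received by the listener) by nodeId."""
    events_by_node = defaultdict(list)
    for event in json.loads(payload):
        # Events emitted before this change carry no nodeId; bucket them
        # under "unknown".
        node_id = event.get("nodeId", "unknown")
        events_by_node[node_id].append(event["eventType"])
    return dict(events_by_node)

# Minimal batch shaped like the GCS event above (nodeId truncated).
batch = '[{"eventType": "DRIVER_JOB_LIFECYCLE_EVENT", "nodeId": "k4hj3FDL..."}]'
print(group_events_by_node(batch))
```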

---------

Signed-off-by: machichima <nary12321@gmail.com>
Signed-off-by: Future-Outlier <eric901201@gmail.com>
Co-authored-by: Future-Outlier <eric901201@gmail.com>
Co-authored-by: Mengjin Yan <mengjinyan3@gmail.com>
from python 3.9 to python 3.10

we are practically already using python 3.10 everywhere

Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
and remove python dependency requirements for python 3.9 or below

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
…ded_decoding… (ray-project#59421)

Signed-off-by: Sathyanarayanaa-T <tsathyanarayanaa@gmail.com>
Co-authored-by: Jeffrey Wang <jeffreywang@anyscale.com>
## Description
Move gcs_callback_types.h to rpc_callback_types.h

## Related issues
Closes ray-project#58597

---------

Signed-off-by: tianyi-ge <tianyig@outlook.com>
Co-authored-by: tianyi-ge <tianyig@outlook.com>
…ay-project#59345)

## Description
Using stateful models in Offline RL is an important feature, and the
major prerequisites for it are already implemented in RLlib's stack.
However, some minor changes are needed to actually train such
models in the new stack. This PR implements these minor but important
changes at different locations in the code:

1. It introduces the `STATE_OUT` key in the outputs of `BC`'s `forward`
function to make the next hidden state available to the connectors and
loss function.
2. It adds the initial state to the batch for offline data in
`AddStatesFromEpisodesToBatch`.
3. It adds a burn-in for the state in `MARWIL` that can be controlled
via `burnin_len`.
4. It generates sequence sampling in the `OfflinePreLearner` dependent
on `max_seq_len`, `lookback_len`, and `burnin_len`.
5. It fixes multiple smaller bugs: in `OfflineEnvRunner` when recording
from class-referenced environments, and in the
`offline_rl_with_image_data.py` and `cartpole_recording.py` examples
when loading the `RLModule` from checkpoint.
6. It fixes the use of `explore=True` in evaluation of Offline RL tests
and examples.
7. It adds recorded expert data from `StatelessCartPole` to
`s3://ray-example-data/rllib/offline-data/statelesscartpole`.
8. It adds a test for learning on a single episode and a single batch
from recorded stateful expert data, and also a test that uses the
initial states for sequences instead of the recorded states.
9. It adds a new config parameter `prelearner_use_recorded_module_states`
to either use recorded states from the data (`True`) or the initial
state from the `RLModule` (`False`).

## Related issues


## Additional information
The only API change is the introduction of a `burnin_len` to the
`MARWIL/BC` config.

> __Note:__ Stateful model training is only possible in `BC` and `MARWIL`
so far for Offline RL. For `IQL` and `CQL` these changes have to be
initiated through the off-policy algorithms (`DQN`/`SAC`), and for these
all the buffers need to provide sequence sampling, which is currently
implemented solely in the `EpisodeReplayBuffer`. Therefore a couple of
follow-up PRs need to be produced:

1. Introduce sequence sampling to `PrioritizedEpisodeReplayBuffer`.
2. Introduce sequence sampling to `MultiAgentEpisodeReplayBuffer`.
3. Introduce sequence sampling to
`MultiAgentPrioritizedEpisodeReplayBuffer`.
4. Introduce stateful model training in `SAC`.
5. Introduce stateful model training to `IQL/CQL`.

---------

Signed-off-by: simonsays1980 <simon.zehnder@gmail.com>
)

## Description

## Related issues
> Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to
ray-project#1234".

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

Signed-off-by: You-Cheng Lin <mses010108@gmail.com>
…59774)

## Description
Instead of rendering a large JSON blob for Operator metrics, render the
log in tabular form for better readability.

## Related issues
> Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to
ray-project#1234".

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

---------

Signed-off-by: Goutam <goutam@anyscale.com>
…ing row count (ray-project#59513)

## Description
```python
import time

import ray


def process_single_row(x):
    time.sleep(.3)
    x['id2'] = x['id']
    return x

class PrepareImageUdf:
    def __call__(self, x):
        time.sleep(5)
        return x

class ChatUdf:
    def __call__(self, x):
        time.sleep(5)
        return x

def preprocess(x):
    time.sleep(.3)
    return x

def task2(x):
    time.sleep(.3)
    return x

def task3(x):
    time.sleep(.3)
    return x

def filterfn(x):
    return True


ds = (
    ray.data.range(1024, override_num_blocks=1024)
    .map_batches(task3, compute=ray.data.TaskPoolStrategy(size=1))
    .drop_columns(cols="id")
    .map(process_single_row)
    .filter(filterfn)
    .map(preprocess)
    .map_batches(PrepareImageUdf, zero_copy_batch=True, batch_size=64, compute=ray.data.ActorPoolStrategy(min_size=1, max_size=12))
)
ds.explain()
```
Here's a fun question: what should this return as the optimized physical
plan?
Answer:
```python
-------- Physical Plan (Optimized) --------
ActorPoolMapOperator[MapBatches(drop_columns)->Map(process_single_row)->Filter(filterfn)->Map(preprocess)->MapBatches(PrepareImageUdf)]
+- TaskPoolMapOperator[MapBatches(task3)]
   +- TaskPoolMapOperator[ReadRange]
      +- InputDataBuffer[Input]
```

Cool. It looks like it fused mostly everything from `drop_columns` to
`PrepareImageUdf`.
OK, what if I added these lines: what happens now?
```python
ds = (
    ds
    .map_batches(ChatUdf, zero_copy_batch=True, batch_size=64, compute=ray.data.ActorPoolStrategy(min_size=1, max_size=12))
)
ds.explain()
```
Answer:
```python
-------- Physical Plan (Optimized) --------
ActorPoolMapOperator[MapBatches(ChatUdf)]
+- ActorPoolMapOperator[Map(preprocess)->MapBatches(PrepareImageUdf)]
   +- TaskPoolMapOperator[MapBatches(drop_columns)->Map(process_single_row)->Filter(filterfn)]
      +- TaskPoolMapOperator[MapBatches(task3)]
         +- TaskPoolMapOperator[ReadRange]
            +- InputDataBuffer[Input]
```
Huh? Why did `preprocess -> PrepareImageUdf` get de-fused?

The issue is that operator map fusion does not preserve whether or not
the row counts can be modified. This PR addresses that.

## Related issues
None

## Additional information
None

---------

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
## Description

- Add `conda clean --all -y` to `ci/env/install-miniforge.sh` to reduce CI
image size.
  - Local Mac arm64 build comparison:
    - baseline: 4.95GB (compressed), 4,954,968,534 bytes (uncompressed)
    - clean: 4.3GB (compressed), 4,300,122,206 bytes (uncompressed)
    - delta: ~0.61 GiB (~0.65 GB)
- Note: I only have a Mac arm64 environment, so the size measurements
were taken on Mac (aarch64) builds.

## Related issues
Fixes ray-project#59727
## Additional information
  Test

  - `docker build -f ci/docker/base.test.Dockerfile -t ray-base-test:baseline .`
  - `docker build -f ci/docker/base.test.Dockerfile -t ray-base-test:clean .`
  - `docker images --format "{{.Repository}}:{{.Tag}}\t{{.Size}}" | rg "ray-base-test"`
  - `docker image inspect -f '{{.Size}}' ray-base-test:baseline`
  - `docker image inspect -f '{{.Size}}' ray-base-test:clean`

---------

Signed-off-by: yaommen <myanstu@163.com>
Co-authored-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
…ay-project#57279)

## Why are these changes needed?

In `StandardAutoscaler`, `self.provider.internal_ip()` can raise
exceptions if the node is not found by the provider, such as if too much
time has passed since the node was preempted and the provider has
"forgotten" about the node. Any exceptions raised by
`self.provider.internal_ip()` will cause `StandardAutoscaler` updates
and node termination to fail.

This change wraps most calls to `self.provider.internal_ip()` within
try-catch blocks and provides reasonable fallback behavior. This should
allow `StandardAutoscaler` updates and node termination to keep
functioning in the presence of "forgotten" nodes.

## Related issue number

Addresses ray-project#29698

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run pre-commit jobs to lint the changes in this PR.
([pre-commit
setup](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting))
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Yifan Mai <yifan@cs.stanford.edu>
Co-authored-by: Rueian <rueiancsie@gmail.com>
## Description
When token auth is enabled, the Ray log APIs need to pass the auth
token in their request headers (the `get_log()` and `list_logs()` functions
bypass the StateApiClient and use raw `requests.get()`).

Signed-off-by: sampan <sampan@anyscale.com>
Co-authored-by: sampan <sampan@anyscale.com>
…de loop (ray-project#59190)

The `parse_resource_demands()` and `pending_placement_groups`
computation were being called inside the node iteration loop, causing
redundant computation for each node. Since resource demands and
placement groups are global (not per-node), these should be computed
once before the loop.
    
This reduces time complexity from O(N × M) to O(N + M), where N is the
number of nodes and M is the number of resource demands. For a cluster
with 100 nodes, this eliminates ~99% of redundant computation and
reduces GIL hold time in the main thread.
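A minimal, name-simplified sketch of the before/after shape of this change (illustrative only; the real autoscaler code differs):

```python
def parse_resource_demands(load_by_shape):
    # Stand-in for the real parsing; cost is proportional to the M demands.
    return list(load_by_shape)

def pending_placement_groups(placement_groups):
    return [pg for pg in placement_groups if pg["state"] == "PENDING"]

def summarize_nodes(nodes, load_by_shape, placement_groups):
    # Node-independent values are now computed once, outside the node loop,
    # instead of once per node: O(N + M) rather than O(N * M).
    demands = parse_resource_demands(load_by_shape)
    pending = pending_placement_groups(placement_groups)
    return [(node, len(demands), len(pending)) for node in nodes]

print(summarize_nodes(["node-1", "node-2"], [{"CPU": 1}], [{"state": "PENDING"}]))
```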

---------

Signed-off-by: mingfei <mingfei@mds-trading.com>
Co-authored-by: mingfei <mingfei@mds-trading.com>
Co-authored-by: Rueian <rueiancsie@gmail.com>
…project#58701)

## Description
> Support Viewing PIDs for Dashboard and Runtime Env Agent

## Related issues
> Link related issues: "Related to ray-project#58700".

---------

Signed-off-by: yang <yanghang233@126.com>
Signed-off-by: Hang Yang <yanghang233@126.com>
Signed-off-by: tianyi-ge <tianyig@outlook.com>
Co-authored-by: tianyi-ge <tianyig@outlook.com>
## Description
Before you would get a message that looks like:
```
Task Actor.f failed. There are 0 retries remaining, so the task will be retried. Error: The actor is temporarily unavailable: IOError: The actor was restarted
```
when a task with 1 retry gets retried. It doesn't make sense to be retried
when there are "0 retries"; this happens because we decrement the count
before pushing this message. This PR fixes it with a +1 in the message
string. We'd also print the same message for preemptions, which could be
confusing, e.g. "0 retries remaining" but still retrying when a node is
preempted, because preemption doesn't count against retries. Now the message
explicitly says that the retry is because of preemption and won't count
against retries.
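A toy illustration of the off-by-one in the message (hypothetical function; the actual fix is in the core worker's C++ retry path):

```python
def retry_message(retries_left_after_decrement: int, preempted: bool) -> str:
    # The counter has already been decremented when the message is built,
    # so add 1 back for display; preemption-driven retries are called out
    # separately because they don't consume a retry.
    if preempted:
        return ("Task failed because its node was preempted; retrying "
                "without counting against the remaining retries.")
    return (f"Task failed. There are {retries_left_after_decrement + 1} "
            "retries remaining, so the task will be retried.")

print(retry_message(0, preempted=False))  # reports 1 retry remaining
```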

Also removing an unused ray config -
`raylet_fetch_timeout_milliseconds`.

---------

Signed-off-by: dayshah <dhyey2019@gmail.com>
…ject#59506)

## Description

Currently, when we get an error in `tail_job_logs`, we do not raise it
because of the backward-compatibility issue mentioned in
ray-project#57037 (comment).
This causes inconvenience: when `tail_job_logs` completes, we cannot
tell whether it completed successfully or hit an error.

This PR raises the `tail_job_logs` error only on newer Ray versions to
keep backward compatibility.
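A sketch of the version-gating idea, with a hypothetical helper and placeholder threshold (the PR's actual gate may look different):

```python
from packaging import version

def maybe_raise_tail_logs_error(err: Exception, server_ray_version: str,
                                min_version: str = "3.0.0") -> None:
    # Older servers relied on tail_job_logs swallowing errors, so only
    # surface the error when the server is new enough. The threshold is a
    # placeholder, not the value used by the PR.
    if version.parse(server_ray_version) >= version.parse(min_version):
        raise err
```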

## Related issues

Related to: ray-project/kuberay#4285

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

---------

Signed-off-by: machichima <nary12321@gmail.com>
Co-authored-by: Rueian <rueiancsie@gmail.com>
- Fix parent_task_id to use SubmitterTaskId for concurrent actors
- Add missing fields: call_site, label_selector, is_debugger_paused,
actor_repr_name
- Fix func_or_class_name to use CallString() for consistency with
event_buffer -> GCS path

Signed-off-by: sampan <sampan@anyscale.com>
Signed-off-by: Sampan S Nayak <sampansnayak2@gmail.com>
Co-authored-by: sampan <sampan@anyscale.com>
…59434)

## Description
I am using Ray with the following code:
```python
import logging

import pandas as pd
import ray

logger = logging.getLogger(__name__)


class QwenPredictor:
    def __init__(self):
        logger.info("download ckpt...")

    async def __call__(self, x):
        logger.info("start predict...")
        return x

if __name__ == "__main__":
    ray.init(
        ignore_reinit_error=True,
        logging_config=ray.LoggingConfig(
            encoding="TEXT", log_level="INFO",
            additional_log_standard_attrs=["asctime"]
        )
    )

    context = ray.data.DataContext.get_current()
    context.enable_progress_bars = False

    input = ray.data.from_pandas(
        pd.DataFrame({
            "id": [i for i in range(10)],
        })
    )

    output = input.map_batches(
        fn=QwenPredictor,
        batch_size=10,
        num_cpus=1,
        concurrency=1
    )

    output.count()
```
Execute this code, and I get:
```
2025-12-15 20:24:37,628	INFO 1.py:11 -- download ckpt... asctime=2025-12-15 20:24:37,628 job_id=01000000 worker_id=3762350aee1ab375c12794dfb65aaaeac9bca9877b29e41d05b5fb03 node_id=e23283ea1db083d7c27ed7be410e3de6f9c2c3cbaf863799bc7faf3a actor_id=20004433089e747379beb49001000000 task_id=ffffffffffffffff20004433089e747379beb49001000000 task_name=MapWorker(MapBatches(Qwen3ASRPredictor)).__init__ task_func_name=ray.data._internal.execution.operators.actor_pool_map_operator.MapWorker(MapBatches(Qwen3ASRPredictor)).__init__ actor_name= timestamp_ns=1765801477628095000
2025-12-15 20:24:37,656	INFO 1.py:14 -- start predict... asctime=2025-12-15 20:24:37,656 job_id=01000000 worker_id=3762350aee1ab375c12794dfb65aaaeac9bca9877b29e41d05b5fb03 node_id=e23283ea1db083d7c27ed7be410e3de6f9c2c3cbaf863799bc7faf3a actor_id=20004433089e747379beb49001000000 task_id=bc18b04d6a8770425e09011be724738669b2826e01000000 task_name= task_func_name= actor_name= timestamp_ns=1765801477656984000
```
It can be seen that when a log value is an empty string, Ray still
outputs `${log key}=` (for example, `task_name= task_func_name= actor_name=`).

I further investigated why these values were empty and found that in
Ray Data, a thread pool/thread is created to execute the `__call__`
method
([sync](https://github.com/ray-project/ray/blob/fb2c7b2a50bebb702893605886b838ecd89a75ee/python/ray/data/_internal/execution/util.py#L79),
[async](https://github.com/ray-project/ray/blob/fb2c7b2a50bebb702893605886b838ecd89a75ee/python/ray/data/_internal/planner/plan_udf_map_op.py#L102)),
and this new thread does not set the task's spec
([link](https://github.com/ray-project/ray/blob/fb2c7b2a50bebb702893605886b838ecd89a75ee/src/ray/core_worker/core_worker.cc#L2852)).
Therefore, the task name is an empty string
([link](https://github.com/ray-project/ray/blob/fb2c7b2a50bebb702893605886b838ecd89a75ee/src/ray/core_worker/core_worker.h#L263)).

Since setting the task spec is an internal behavior, I think it's not
easy to modify. Therefore, in this PR, I made a small modification: when
a log value is an empty string, the log key is not displayed.
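The gist of the fix, as a standalone sketch rather than the actual Ray formatter code: skip `key=value` pairs whose value is empty when rendering the context suffix.

```python
def render_context(record_attrs: dict) -> str:
    # Only render keys with non-empty values, so lines no longer end with
    # dangling "task_name= task_func_name= actor_name=".
    return " ".join(f"{key}={value}" for key, value in record_attrs.items() if value != "")

attrs = {"job_id": "01000000", "task_name": "", "actor_name": ""}
print(render_context(attrs))  # job_id=01000000
```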

---------

Co-authored-by: xiaowen.wxw <wxw403883@alibaba-inc.com>
…cking (ray-project#59278)

## Description

## Problem Statement
Currently, the scheduler's node feasibility and availability checks are
inconsistent with the actual resource allocation logic. The scheduler
reasons only about aggregated GPU capacity per node, while the (local)
allocator enforces constraints based on the per-GPU topology.

For example, consider a node with two GPUs, each with 0.2 GPU remaining.
The scheduler observes 0.4 GPU available in total and concludes that an
actor requesting 0.4 GPU can be placed on this node. However, the (local)
allocator rejects the request because no single GPU has 0.4 GPU
available.
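The two-GPU example in numbers, as a toy check (not the scheduler's actual data structures):

```python
def aggregate_check(per_gpu_free, request):
    # What the scheduler currently reasons about: total free GPU on the node.
    return sum(per_gpu_free) >= request

def per_gpu_check(per_gpu_free, request):
    # What the local allocator actually needs: one GPU with enough room.
    return any(free >= request for free in per_gpu_free)

per_gpu_free = [0.2, 0.2]
print(aggregate_check(per_gpu_free, 0.4))  # True  -> scheduler says placeable
print(per_gpu_check(per_gpu_free, 0.4))    # False -> allocator rejects it
```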


## What this PR does
The high-level goal of this PR is to make node feasibility and
availability checks consistent between the scheduler and the resource
allocator.

Although the detailed design is still a work in progress and needs a big
refactor, the first step is to make the scheduler's node feasibility and
availability checks consistent and centralized.

Right now, Ray has three scheduling paths:
- Normal task scheduling
- Normal actor scheduling
- Placement group
  - Placement Group reservation(scheduling bundle)
  - Task/Actor with Placement Group

Tasks and actors essentially share the same scheduling path and use the
same node feasibility and availability check function. Placement group
scheduling, however, implements its own logic in certain paths, even
though it is conceptually the same.

Since we may override or extend the node feasibility and availability
checks in later PRs, it is better to first ensure that all scheduling
paths use a single, shared implementation of this logic.

This PR addresses that problem.







## Related issues
Related to ray-project#52133 ray-project#54729

## Additional information

Here I list all the cases to make sure we are relying on the same node
feasibility and availability checking function. Later we can focus
on changing the function and the underlying data structure:

**Normal task/actor scheduling:**
- HybridSchedulingPolicy:
-
https://github.com/ray-project/ray/blob/555fab350c1c3179195889c437fe6213b416114c/src/ray/raylet/scheduling/policy/hybrid_scheduling_policy.cc#L41
-
https://github.com/ray-project/ray/blob/555fab350c1c3179195889c437fe6213b416114c/src/ray/raylet/scheduling/policy/hybrid_scheduling_policy.cc#L137
 
- SpreadSchedulingPolicy:
-
https://github.com/ray-project/ray/blob/555fab350c1c3179195889c437fe6213b416114c/src/ray/raylet/scheduling/policy/spread_scheduling_policy.cc#L49
-
https://github.com/ray-project/ray/blob/555fab350c1c3179195889c437fe6213b416114c/src/ray/raylet/scheduling/policy/spread_scheduling_policy.cc#L54

- RandomSchedulingPolicy
-
https://github.com/ray-project/ray/blob/456d1903277668c1f79f3eb230b908a6e6c403a8/src/ray/raylet/scheduling/policy/random_scheduling_policy.cc#L47-L48

- NodeAffinitySchedulingPolicy
  - Don't care, just schedule to user specified node by default
- Having fallback option that checks:
https://github.com/ray-project/ray/blob/1180868dd4472b444aaffb83a72779adc0dbe1e8/src/ray/raylet/scheduling/policy/node_affinity_scheduling_policy.cc#L26-L30


- NodeLabelSchedulingPolicy
-
https://github.com/ray-project/ray/blob/1180868dd4472b444aaffb83a72779adc0dbe1e8/src/ray/raylet/scheduling/policy/node_label_scheduling_policy.cc#L171
-
https://github.com/ray-project/ray/blob/1180868dd4472b444aaffb83a72779adc0dbe1e8/src/ray/raylet/scheduling/policy/node_label_scheduling_policy.cc#L186




**Placement Group reservation(scheduling bundle):**
- PACK/SPREAD/STRICT_SPREAD
-
https://github.com/ray-project/ray/blob/1180868dd4472b444aaffb83a72779adc0dbe1e8/src/ray/raylet/scheduling/policy/scorer.cc#L58
    - Note, after this PR, it will also be IsAvailable 
- STRICT_SPREAD
-
https://github.com/ray-project/ray/blob/1180868dd4472b444aaffb83a72779adc0dbe1e8/src/ray/raylet/scheduling/policy/scorer.cc#L58
    - Note, after this PR, it will also be IsAvailable 
-
https://github.com/ray-project/ray/blob/1180868dd4472b444aaffb83a72779adc0dbe1e8/src/ray/raylet/scheduling/policy/bundle_scheduling_policy.cc#L185



**Task/Actor with Placement Group:**
- AffinityWithBundleSchedulingPolicy
-
https://github.com/ray-project/ray/blob/1180868dd4472b444aaffb83a72779adc0dbe1e8/src/ray/raylet/scheduling/policy/affinity_with_bundle_scheduling_policy.cc#L25-L26

---------

Signed-off-by: yicheng <yicheng@anyscale.com>
Co-authored-by: yicheng <yicheng@anyscale.com>
…project#59830)

Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: Dhyey Shah <dhyey2019@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…t#58695)

## Description

This PR adds a new documentation page, Head Node Memory Management,
under the Ray Core advanced topics section.

## Related issues
Closes ray-project#58621

## Additional information
<img width="2048" height="1358" alt="image"
src="https://github.com/user-attachments/assets/3b98150d-05e6-4d15-9cd3-7e05e82ff516"
/>
<img width="2048" height="498" alt="image"
src="https://github.com/user-attachments/assets/4ec8fe43-e3a5-4df4-bca7-376ae407c77b"
/>

---------

Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
Co-authored-by: Ibrahim Rabbani <israbbani@gmail.com>
Co-authored-by: Rueian <rueiancsie@gmail.com>
Co-authored-by: Ibrahim Rabbani <irabbani@anyscale.com>
…y-project#59845)

- [x] Update the docstring for `ray.shutdown()` in
`python/ray/_private/worker.py` to clarify (see the example after this
checklist):
  - When connecting to a remote cluster via `ray.init(address="xxx")`,
    `ray.shutdown()` only disconnects the client and does NOT terminate the
    remote cluster
  - Only local clusters started by `ray.init()` will have their processes
    terminated by `ray.shutdown()`
  - Clarified that `ray.init()` without an address argument will auto-detect
    existing clusters
- [x] Add a documentation note to `doc/source/ray-core/starting-ray.rst`
explaining the same behavior difference
- [x] Review the changes via code_review
- [x] Run codeql_checker for security scan (no code changes requiring
analysis)
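For reference, a short example of the two cases the docstring now distinguishes (assumes a cluster already started with `ray start --head` for the first case):

```python
import ray

# Case 1: connect to an existing cluster. shutdown() only disconnects this
# driver; the cluster itself keeps running.
ray.init(address="auto")
ray.shutdown()

# Case 2: start a local cluster. shutdown() terminates the Ray processes
# that init() started.
ray.init()
ray.shutdown()
```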

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
## Description
Upgrading the CUDA base GPU image from 11.8 to 12.8.1.
This is required for future py3.13 dependency upgrades.

---------

Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Co-authored-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
…roject#59735)

## Description
### Problem
Using `--entrypoint-resources '{"fragile_node":"!1"}'` with the Job API
raises an error saying only numeric values are allowed.

### Expected
`--entrypoint-resources` should accept label selectors just like
`ray.remote`/placement groups, so entrypoints can target or avoid nodes
with specific labels.


## Related issues
> Link related issues: "Fixes ray-project#58662 ", "Closes ray-project#58662", or "Related to
ray-project#58662".

## Additional information

### Implementation approach
- Relax `JobSubmitRequest.entrypoint_resources` validation to allow
string values (`python/ray/dashboard/modules/job/common.py`).
- Add `_split_entrypoint_resources()` to separate numeric requests from
selector strings and run them through `validate_label_selector`
(`python/ray/dashboard/modules/job/job_manager.py`); see the sketch after
this list.
- Pass numeric resources via the existing `resources` option and the
selector dict via `label_selector` when spawning the job supervisor,
leaving the field unset if only resources were provided
(`python/ray/dashboard/modules/job/job_manager.py`).
- Extend CLI parsing/tests to cover string-valued resources and assert
selector plumbing through the job manager
(`python/ray/dashboard/modules/job/tests/test_cli.py`,
`python/ray/dashboard/modules/job/tests/test_common.py`,
`python/ray/dashboard/modules/job/tests/test_job_manager.py`).
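A rough sketch of the splitting idea (a hypothetical standalone function; the PR's `_split_entrypoint_resources()` lives in `job_manager.py` and also runs `validate_label_selector`):

```python
def split_entrypoint_resources(entrypoint_resources: dict):
    # Numeric values are real resource requests; string values are treated
    # as label selectors (e.g. {"fragile_node": "!1"}).
    resources, label_selector = {}, {}
    for key, value in entrypoint_resources.items():
        if isinstance(value, (int, float)):
            resources[key] = value
        else:
            label_selector[key] = value
    return resources or None, label_selector or None

print(split_entrypoint_resources({"CPU": 1, "fragile_node": "!1"}))
# ({'CPU': 1}, {'fragile_node': '!1'})
```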

Signed-off-by: yaommen <myanstu@163.com>
update with more up-to-date information, and format the markdown file a
bit

Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
aslonnie and others added 22 commits January 17, 2026 00:21
as it builds the artifact generically, not running OSS-specific logic
(e.g. uploading to the ray wheels S3)

Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
… into a separate module (ray-project#60188)

This PR stacks on ray-project#60121.

This is 5/N in a series of PRs to remove Centralized Actor Scheduling by
the GCS (introduced in ray-project#15943).
The feature is off by default and no longer in use or supported.

In this PR,

- Moving the gcs_actor_* files into a separate bazel module
`/ray/gcs/actor/`
- Moves the LocalLeaseManager into `/ray/raylet/scheduling` with its
friends
- Enabling cpplint

Pure refactoring. No logic changes.

---------

Signed-off-by: irabbani <irabbani@anyscale.com>
Signed-off-by: Ibrahim Rabbani <irabbani@anyscale.com>
Co-authored-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
…project#60240)

## Description
The test seems to have gone flaky because we're spawning enough tasks
fast enough to trigger throttling on downloading the input dataset
(getting errors and failing tasks and thus causing the job to fail). We
deliberated a bit, and we decided this test isn't actually adding much
value anyhow. So instead of adding retries or vendoring the dataset,
we're just gonna put the kabosh on this.

## Related issues
> Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to
ray-project#1234".

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

---------

Signed-off-by: zac <zac@anyscale.com>
Signed-off-by: Zac Policzer <zac@anyscale.com>
…ding behavior (ray-project#60199)

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
Setting a lower bound of aiohttp v3.13.3 due to security vulnerabilities
in previous versions;
also upgrading aiosignal to 1.4.0 due to the new aiohttp version.

Open issue: ray-project#59943

Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Co-authored-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
…ject#60197)

Adds new script `push_ray_image.py` for publishing Wanda-cached Ray
images to Docker Hub.

The focus here is replicating the tagging logic from
`ci/ray_ci/docker_container.py`. Many test
cases were added to try to replicate the existing publishing cases, but
it'll be good to hear
if any others would be helpful.

Signed-off-by: andrew <andrew@anyscale.com>
## Description
Updates the IMPALA examples and premerge with CartPole and TicTacToe. 
We only have a minimal number of examples as most users should use PPO
or APPO rather than IMPALA.

---------

Signed-off-by: Mark Towers <mark@anyscale.com>
Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
Co-authored-by: Mark Towers <mark@anyscale.com>
Co-authored-by: Kamil Kaczmarek <kamil@anyscale.com>
remove unused sdk mock methods

Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
…t#60226)

x86 is already ported, so this just follows up on that. Added comments
where architecture-specific deviations were made.

ray-dashboard needed to be updated because AFAIK, wanda only pulls the
host default architecture. ray-wheel-build-aarch64 needs ray-dashboard,
so we needed to add $ARCH_SUFFIX to make the name unique across the
architectures.

Topic: wanda-aarch64-wheel

Signed-off-by: andrew <andrew@anyscale.com>

---------

Signed-off-by: andrew <andrew@anyscale.com>
Signed-off-by: Andrew Pollack-Gray <andrew@anyscale.com>
Co-authored-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
no real use. the only usage installs `ray[default]` which is not the
right usage.

Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
all tests are using python 3.10 or above now

Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
removes python 3.9 reference in CI scripts and docs; ray only supports
python 3.10+ now

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
so that it is consistent no matter how we build the wheel.

The `[cpp]` and `[all-cpp]` extras are not included in `[all]` today, so
unless users explicitly specify them, they will be skipped. As a result,
we do not need to drop them from the extra declarations but can just
always include them.

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
## Description
While working on ray-project#60241, I realized that it is possible to use `OpState`
as keys to dictionaries. This didn't occur to me in the past, so I had
to use a workaround where I would have a `progress_manager_uuid` flag in
`OpState` to link between the progress managers and `OpState`, creating
tight coupling. This PR removes this.
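In miniature (a toy stand-in class; the real `OpState` lives in Ray Data's streaming executor):

```python
class OpState:
    """Toy stand-in for Ray Data's OpState."""
    def __init__(self, name):
        self.name = name

# Before: indirection through an id stored on the object, e.g.
# managers[op_state.progress_manager_uuid] = manager
# After: the object itself is hashable by identity and can key the dict.
op = OpState("ReadRange")
managers = {op: "progress manager for ReadRange"}
print(managers[op])
```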

## Related issues
N/A

## Additional information
N/A

Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>
…ject#60208)

This PR updates DefaultClusterAutoscalerV2 to safely handle nodes with 0
logical CPUs by replacing direct dictionary access (`r["CPU"]`) with
`r.get("CPU", 0)`, preventing crashes on dedicated GPU nodes.
This fix has been discussed firsthand with @bveeramani.

Fixes ray-project#60166

---------

Signed-off-by: Limark Dcunha <limarkdcunha@gmail.com>
…meters for the 'serve' API (ray-project#56507)

<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?

Serve `http_options` and `proxy_location` (which determines
`http_options.location`) can't be updated in runtime. When users try to
update them they receive warnings in the following format:
```
WARNING 2025-11-03 20:51:10,462 serve 23061 -- The new client HTTP config differs from the existing one in the following fields: ['host', 'port', 'location']. The new HTTP config is ignored.
--
2025-11-03 20:51:59,590 WARNING serve_head.py:281 -- Serve is already running on this Ray cluster and it's not possible to update its HTTP options without restarting it. Following options are attempted to be updated: ['host'].
```

This PR:
- change warning to failing with the error `RayServeConfigException`
- eliminate `validate_http_options` function in the
`dashboard.modules.serve.serve_head` as it's not needed anymore
- update documentation for [Proxy
config](https://docs.ray.io/en/latest/serve/production-guide/config.html#proxy-config)
that the parameter is global and can't be updated at runtime.
- change `HTTPOptions.location` default value `HeadOnly` -> `EveryNode`
(it's likely the desired value)

--------------------------------------------------------------
User scenario:
- have a file `hello_world.py`:
```
# hello_world.py
from ray.serve import deployment

@deployment
async def hello_world():
    return "Hello, world!"

hello_world_app = hello_world.bind()
```
- execute commands:
```
ray stop
ray start --head
serve build -o config.yaml hello_world:hello_world_app  # generate `config.yaml`
serve deploy config.yaml
# in the `config.yaml` file update:
# proxy_location: EveryNode -> HeadOnly
# http_options.host: 0.0.0.0 -> 0.0.0.1
# http_options.port: 8000 -> 8001
serve deploy config.yaml
```
Output before the change:
```
# stdout:
bash$ serve deploy config.yaml
2025-09-14 17:19:15,606 INFO scripts.py:239 -- Deploying from config file: 'config.yaml'.
2025-09-14 17:19:15,619 SUCC scripts.py:359 -- 
Sent deploy request successfully.
 * Use `serve status` to check applications' statuses.
 * Use `serve config` to see the current application config(s).

# /tmp/ray/session_latest/logs/dashboard_ServeHead.log
2025-09-14 17:19:15,615 WARNING serve_head.py:177 -- Serve is already running on this Ray cluster and it's not possible to update its HTTP options without restarting it. Following options are attempted to be updated: ['location', 'host', 'port'].
```
Output after the change:
```
# stdout:
bash$ serve deploy config.yaml
2025-11-03 21:04:51,252 INFO scripts.py:243 -- Deploying from config file: 'config.yaml'.
Traceback (most recent call last):
  File "~/ray/.venv/bin/serve", line 7, in <module>
    sys.exit(cli())
  File "~/ray/.venv/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "~/ray/.venv/lib/python3.9/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "~/ray/.venv/lib/python3.9/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "~/ray/.venv/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "~/ray/.venv/lib/python3.9/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "~/ray/python/ray/serve/scripts.py", line 360, in deploy
    ServeSubmissionClient(address).deploy_applications(
  File "~/ray/python/ray/dashboard/modules/serve/sdk.py", line 80, in deploy_applications
    self._raise_error(response)
  File "~/ray/python/ray/dashboard/modules/dashboard_sdk.py", line 283, in _raise_error
    raise RuntimeError(
RuntimeError: Request failed with status code 500: Traceback (most recent call last):
  File "~/ray/python/ray/dashboard/optional_utils.py", line 188, in decorator
    return await f(self, *args, **kwargs)
  File "~/ray/python/ray/dashboard/modules/serve/serve_head.py", line 37, in check
    return await func(self, *args, **kwargs)
  File "~/ray/python/ray/dashboard/modules/serve/serve_head.py", line 167, in put_all_applications
    client = await serve_start_async(
  File "~/ray/python/ray/serve/_private/api.py", line 148, in serve_start_async
    _check_http_options(client.http_config, http_options)
  File "~/ray/python/ray/serve/_private/api.py", line 51, in _check_http_options
    raise RayServeConfigException(
ray.serve.exceptions.RayServeConfigException: Attempt to update `http_options` or `proxy_location` has been detected! Attempted updates: `{'location': {'previous': 'EveryNode', 'new': 'HeadOnly'}, 'host': {'previous': '0.0.0.0', 'new': '0.0.0.1'}, 'port': {'previous': 8000, 'new': 8001}}`. HTTP config is global to your Ray cluster, and you can't update it during runtime. Please restart Ray Serve to apply the change.

```

The same behavior for serve REST API:
```
bash$ curl -X PUT "http://localhost:8265/api/serve/applications/"   -H "Accept: application/json"   -H "Content-Type: application/json"   -d '{
> "http_options": {
> 
> "host": "0.0.0.1"
> },
>   "applications": [
>     {
>       "name": "app1",
>       "route_prefix": "/",
>       "import_path": "hello_world:hello_world_app",
>       "runtime_env": {},
>       "deployments": [
>         { "name": "hello_world" }
>       ]
>     }
>   ]
> }'
Traceback (most recent call last):
  File "~/ray/python/ray/dashboard/optional_utils.py", line 188, in decorator
    return await f(self, *args, **kwargs)
  File "~/ray/python/ray/dashboard/modules/serve/serve_head.py", line 37, in check
    return await func(self, *args, **kwargs)
  File "~/ray/python/ray/dashboard/modules/serve/serve_head.py", line 167, in put_all_applications
    client = await serve_start_async(
  File "~/ray/python/ray/serve/_private/api.py", line 148, in serve_start_async
    _check_http_options(client.http_config, http_options)
  File "~/ray/python/ray/serve/_private/api.py", line 51, in _check_http_options
    raise RayServeConfigException(
ray.serve.exceptions.RayServeConfigException: Attempt to update `http_options` or `proxy_location` has been detected! Attempted updates: `{'host': {'previous': '0.0.0.0', 'new': '0.0.0.1'}}`. HTTP config is global to your Ray cluster, and you can't update it during runtime. Please restart Ray Serve to apply the change.
```

--------------------------------------------

A thing worth mentioning is that this change makes explicit the
discrepancy between the default `host` value in
[serve.config.HTTPOptions](https://github.com/ray-project/ray/blob/master/python/ray/serve/config.py#L433)
(host="127.0.0.1") vs
[serve.schema.HTTPOptionsSchema](https://github.com/ray-project/ray/blob/master/python/ray/serve/schema.py#L683)
(host="0.0.0.0"). `serve.config.HTTPOptions` is primarily used in the
imperative serve API (Python API or CLI with params), and
`serve.schema.HTTPOptionsSchema` is used in the declarative serve API (REST
API or `deploy/run` with a config file).
Previously, when users ran the `start` command and then `deploy` or `run`
with default params, the `http_options` from the `start` command were used.
Now we explicitly fail in this scenario:
```
ray stop
ray start --head
serve build -o config.yaml hello_world:hello_world_app  # generate `config.yaml` with default values

serve start
serve deploy config.yaml
...
ray.serve.exceptions.RayServeConfigException: Attempt to update `http_options` or `proxy_location` has been detected! Attempted updates: `{'host': {'previous': '127.0.0.1', 'new': '0.0.0.0'}}`. HTTP config is global to your Ray cluster, and you can't update it during runtime. Please restart Ray Serve to apply the change.
``` 
Maybe the `host` default value should be aligned in `HTTPOptions` and
`HTTPOptionsSchema`.


<!-- Please give a short summary of the change and the problem this
solves. -->

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

Closes ray-project#56163

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [x] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> Disallow runtime updates to global `http_options`/`proxy_location` by
raising `RayServeConfigException`, fix default proxy location handling,
refactor validation, and update docs/tests/CLI accordingly.
> 
> - **Serve API (backend)**
> - Raise `RayServeConfigException` on attempts to change global
`http_options`/`proxy_location` via `_check_http_options` (now compares
against `client.http_config`, normalizes
`ProxyLocation`/`DeploymentMode`).
> - Fix default proxy location handling: `_prepare_http_options` sets
`DeploymentMode.EveryNode` when `proxy_location=None`; `start()` uses
it.
> - Adjust `serve_start`/`serve_start_async` to pass
`client.http_config` into `_check_http_options`.
>   - Add `RayServeConfigException` in `ray.serve.exceptions`.
> - **Dashboard REST**
> - Remove `validate_http_options` warning logic from `serve_head.py`;
rely on backend check.
> - PUT with changed HTTP/proxy config now fails (500) instead of
warning.
> - **CLI/Tests**
> - Update tests to expect failure on config changes and to use explicit
HTTP host where needed.
> - Add tests for `_prepare_http_options` and `serve.start` rejection on
changed HTTP config; add CLI test verifying detailed diff in error.
> - **Docs**
> - Note that `proxy_location` and HTTP/gRPC configs are cluster-global
and cannot be updated at runtime.
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
1f690a6. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

---------

Signed-off-by: axreldable <aleksei.starikov.ax@gmail.com>
## Why are these changes needed?

When `RAY_SERVE_USE_GRPC_BY_DEFAULT=1`, inter-deployment calls use gRPC
instead of Ray actor calls. In this mode, `_ray_trace_ctx` is not
injected into kwargs since gRPC calls bypass Ray's tracing decorators.

Tracing context propagation for gRPC mode requires additional work to
properly capture and forward the context. This PR skips the
`test_deployment_remote_calls_with_tracing` test in gRPC mode until a
proper solution is implemented.
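A hedged sketch of what such a gate typically looks like in pytest (the marker and condition below are illustrative, not the PR's exact code):

```python
import os

import pytest

USE_GRPC = os.environ.get("RAY_SERVE_USE_GRPC_BY_DEFAULT") == "1"

@pytest.mark.skipif(
    USE_GRPC,
    reason="Tracing context is not yet propagated over gRPC "
    "inter-deployment calls (see ray-project#60223).",
)
def test_deployment_remote_calls_with_tracing():
    ...
```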

## Related issue number

Tracking issue: ray-project#60223

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
<https://docs.ray.io/en/master/>.
- [x] I've added any new APIs to the API Reference. For doc changes, see
[Contribute
Docs](https://docs.ray.io/en/latest/ray-contribute/docs.html).
- [x] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
## Description
Organize read/write tests under python/ray/data/tests/datasource and
update the HuggingFace helper import to match the new path.

## Related issues
Link related issues: "Fixes ray-project#60164"

---------

Signed-off-by: kriyanshii <kriyanshishah06@gmail.com>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Co-authored-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu>
…es (ray-project#60084) (ray-project#60091)

Categorize APIs into Public APIs and Developer APIs, and sort them
alphabetically by service name.

Changes:
- Reorganized loading_data.rst and saving_data.rst with Public APIs
first, then Developer APIs
- Sorted all APIs alphabetically by service name within each section
- Sections that originally had APIs for both Public and Developer APIs
were divided to respective sections
- Removed datasource.FastFileMetadataProvider API that has been removed
([reference](ray-project#59027))

Fixes ray-project#60084

Signed-off-by: mgchoi239 <mg.choi.239@gmail.com>

---------

Signed-off-by: mgchoi239 <mg.choi.239@gmail.com>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Co-authored-by: mgchoi239 <mg.choi.239@gmail.com>
Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu>
ray-project#60133)

## Description
> Make DefaultClusterAutoscalerV2 knobs configurable via environment
variables
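The usual pattern for such knobs, as a sketch; the environment variable name and default below are placeholders, not the ones this PR introduces:

```python
import os

# Placeholder knob: fall back to a compiled-in default when the
# environment variable is unset.
SCALE_UP_COOLDOWN_S = float(os.environ.get("RAY_DATA_EXAMPLE_SCALE_UP_COOLDOWN_S", "30"))
print(SCALE_UP_COOLDOWN_S)
```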

## Related issues
> Closes ray-project#60004

---------

Signed-off-by: Rushikesh Adhav <adhavrushikesh6@gmail.com>
Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
## Description
Features were introduced to log progress separately in non-atty
situations to prevent spamming. This feature was introduced all over the
place, so this PR groups the logging part into a separate
`LoggingExecutionProgressManager`, similar to how we group the other
implementations (i.e., rich, tqdm).

## Related issues
Fixes ray-project#60083 

## Additional information
All feedback to specific UI is welcome.

---------

Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>
- adding .claude folder to .gitignore so that users can

---------

Signed-off-by: harshit <harshit@anyscale.com>

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request is an automated daily merge from master to main. It includes a wide range of changes, primarily focused on a major refactoring of the CI/CD system, dependency updates, and significant documentation improvements. Key changes include a new modular build system using wanda, dropping Python 3.9 support in many areas, and adding new tutorials and internal design documents. I've identified one potential issue with the CI test selection rules that could lead to inefficiencies.

Comment on lines +262 to +266
*
@ ml tune train data serve
@ core_cpp cpp java python doc
@ linux_wheels macos_wheels dashboard tools release_tests
;


medium

This wildcard rule at the end of the file will match every file changed in a pull request and assign a large set of tags (ml, tune, train, data, serve, core_cpp, cpp, java, python, doc, linux_wheels, macos_wheels, dashboard, tools, release_tests). This will cause a significant number of tests to run for any change, regardless of its scope, potentially leading to very long and expensive CI runs. Was this intentional, or should this rule be more specific or removed?

@github-actions

github-actions bot commented Feb 3, 2026

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the stale label Feb 3, 2026
