
πŸ”„ daily merge: master β†’ main 2026-01-15 #746

Open
antfin-oss wants to merge 328 commits into main from create-pull-request/patch-24f610b1be

Conversation

@antfin-oss

This Pull Request was created automatically to merge the latest changes from master into the main branch.

πŸ“… Created: 2026-01-15
πŸ”€ Merge direction: master β†’ main
πŸ€– Triggered by: Scheduled

Please review and merge if everything looks good.

codope and others added 30 commits December 22, 2025 16:25
…er initialization (ray-project#59611)

## Description
Fixes a race condition in
`MetricsAgentClientImpl::WaitForServerReadyWithRetry` where concurrent
HealthCheck callbacks could both attempt to initialize the exporter,
causing GCS to crash with:
```
Check failed: !exporting_started_ RayEventRecorder::StartExportingEvents() should be called only once.
```
The `exporter_initialized_` flag was a non-atomic bool. When multiple
HealthCheck RPCs completed simultaneously, their callbacks could both
read false before either set it to true, leading to `init_exporter_fn`
being called twice. Changed the flag to `std::atomic<bool>` to ensure
only one callback wins the race.
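
As a rough Python analog of the fix (the real change is C++ `std::atomic<bool>`; the names here are illustrative, and Python needs a lock where C++ can use an atomic exchange):

```
import threading

class ExporterGuard:
    def __init__(self):
        self._initialized = False
        self._lock = threading.Lock()

    def try_init(self, init_exporter_fn):
        # Test-and-set under a lock: the first callback flips the flag and
        # runs the initializer; every later callback sees True and bails.
        with self._lock:
            if self._initialized:
                return False
            self._initialized = True
        init_exporter_fn()
        return True
```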

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
…STOP_REQUESTED in autoscaler v2 (ray-project#59550)

## Description

When the autoscaler attempts to terminate QUEUED instances to enforce
the `max_num_nodes_per_type` limit, the reconciler crashes with an
assertion error. This happens because QUEUED instances are selected for
termination, but the state machine doesn't allow transitioning them to a
terminated state.

The reconciler assumes all non-ALLOCATED instances have Ray running and
attempts to transition QUEUED β†’ RAY_STOP_REQUESTED, which is invalid.


https://github.com/ray-project/ray/blob/ba727da47a1a4af1f58c1642839deb0defd82d7a/python/ray/autoscaler/v2/instance_manager/reconciler.py#L1178-L1197

This occurs when `max_workers` configuration is dynamically reduced or
when instances exceed the limit.

```
2025-12-04 06:21:55,298	INFO event_logger.py:77 -- Removing 167 nodes of type elser-v2-ingest (max number of worker nodes per type reached).
2025-12-04 06:21:55,307 - INFO - Update instance QUEUED->RAY_STOP_REQUESTED (id=e13a1528-ffd9-403b-9fd1-b54e9c2698a0, type=elser-v2-ingest, cloud_instance_id=, ray_id=): draining ray: Terminating node due to MAX_NUM_NODE_PER_TYPE: max_num_nodes=None, max_num_nodes_per_type=220
2025-12-04 06:21:55,307	INFO instance_manager.py:263 -- Update instance QUEUED->RAY_STOP_REQUESTED (id=e13a1528-ffd9-403b-9fd1-b54e9c2698a0, type=elser-v2-ingest, cloud_instance_id=, ray_id=): draining ray: Terminating node due to MAX_NUM_NODE_PER_TYPE: max_num_nodes=None, max_num_nodes_per_type=220
2025-12-04 06:21:55,307 - ERROR - Invalid status transition from QUEUED to RAY_STOP_REQUESTED
```

This PR adds a valid transition, `QUEUED -> TERMINATED`, to allow canceling queued instances.
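
A rough sketch of the state-machine change (the transition table and helper are hypothetical, not the actual reconciler code):

```
# Hypothetical transition table; the fix adds TERMINATED as a valid
# target for QUEUED so queued instances can be canceled directly.
VALID_TRANSITIONS = {
    "QUEUED": {"REQUESTED", "TERMINATED"},  # TERMINATED newly allowed
    "RAY_RUNNING": {"RAY_STOP_REQUESTED"},
}

def transition(state, new_state):
    if new_state not in VALID_TRANSITIONS.get(state, set()):
        raise ValueError(f"Invalid status transition from {state} to {new_state}")
    return new_state
```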

## Related issues
Closes ray-project#59219


Signed-off-by: win5923 <ken89@kimo.com>
## Description
When running the code below:
```
from ray import ActorID
ActorID.nil().job_id
```
or
```
from ray import TaskID
TaskID.nil().job_id()
```

The following error is shown:
<img width="1912" height="331" alt="ζˆͺεœ– 2025-12-18 δΈ‹εˆ6 49 18"
src="https://github.com/user-attachments/assets/b4200ef8-10df-4c91-83ff-f96f7874b0ce"
/>

The program should raise an error instead of crashing; this PR fixes that by adding a helper function that performs a nil check.
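
A minimal sketch of the nil-check pattern (the helper name and message are illustrative):

```
def check_id_not_nil(id_obj, attribute):
    # Raise a Python-level error instead of letting native code crash
    # when an attribute of a nil ID is requested.
    if id_obj.is_nil():
        raise ValueError(f"Cannot get {attribute} of a nil ID: {id_obj}")
```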

## Related issues
Closes [ray-project#53872](ray-project#53872)

## Additional information
After the fix, it now raises a `ValueError`:
<img width="334" height="52" alt="ζˆͺεœ– 2025-12-20 上午8 47 30"
src="https://github.com/user-attachments/assets/00228923-2d26-4cb4-bf53-615945d2ce6c"
/>

<img width="668" height="103" alt="ζˆͺεœ– 2025-12-20 上午8 47 49"
src="https://github.com/user-attachments/assets/ee68213a-681a-4499-bef2-2e13533e3ffd"
/>

---------

Signed-off-by: Alex Wu <c.alexwu@gmail.com>
…0 GPUs on CPU-only cluster (ray-project#59514)

If you request zero GPUs from the autoscaling coordinator but GPUs don't
exist on the cluster, the autoscaling coordinator crashes.

This PR fixes that bug.
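
A minimal sketch of the kind of guard involved, assuming the crash came from dividing by the cluster's (zero) GPU total:

```
def gpu_utilization(requested_gpus, total_gpus):
    # Requesting 0 GPUs on a cluster with 0 GPUs should be a no-op,
    # not a ZeroDivisionError.
    if total_gpus == 0:
        return 0.0
    return requested_gpus / total_gpus
```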

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
…object construction (ray-project#59500)

## Description
Reduce overhead added by token authentication:

- Return shared_ptr from AuthenticationTokenLoader::GetToken() instead
of constructing a new AuthenticationToken object copy every time (which
would also add object destruction overhead)
- Cache the token in the client interceptor at construction (previously
GetToken() was called for every RPC); see the sketch after this list
- Use CompareWithMetadata() to validate tokens directly from string_view
without constructing new AuthenticationToken objects
- Pass shared_ptr through ServerCallFactory to avoid per-call copies
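
A minimal Python analog of the interceptor caching (names hypothetical; the actual change is in Ray's C++ gRPC interceptor):

```
class AuthClientInterceptor:
    def __init__(self, token_loader):
        # Load the token once at construction and keep a shared reference,
        # instead of re-loading (and copying) it on every RPC.
        self._token = token_loader.get_token()

    def add_auth_metadata(self, metadata):
        metadata.append(("authorization", self._token))
        return metadata
```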

Release tests:

Without this change, the microbenchmark `multi_client_put_gigabytes` was in the 25-30 range, e.g. this run:
https://buildkite.com/ray-project/release/builds/70658

With this change it is in the 40-45 range:
https://buildkite.com/ray-project/release/builds/72070

---------

Signed-off-by: sampan <sampan@anyscale.com>
Co-authored-by: sampan <sampan@anyscale.com>
## Description
This is a series of PRs to refactor/consolidate progress reporting and
to decouple it from the executor, opstates, etc.

## Related issues
Split from ray-project#58173 

## Additional information
N/A

---------

Signed-off-by: kyuds <kyuseung1016@gmail.com>
Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>
## Description

Move extension types to ray.data

ray.air.util.object_extensions -> ray.data._internal.object_extensions
ray.air.util.tensor_extensions -> ray.data._internal.tensor_extensions

## Related issues
Closes ray-project#59418

Signed-off-by: will <zzchun8@gmail.com>
…ovements. (ray-project#59544)

## Description
APPO targets high-throughput RL and relies on asynchrony - via actor
parallelism and multi-threading. In our tests, the local-learner setup
(num_learners=0) underperformed, prompting a deeper investigation into
root causes and improvements as follows.

### Causes
- __Producer-driven + weak backpressure:__ The data flow was
producer-driven: there was no bounded, blocking handoff to pace
producers, and the in-queue for the `LearnerThread` was a non-blocking
`deque`. As a result, the `Learner` wasn't setting the pace (sawtooth
throughput and stalls at the 20-update barrier, i.e. a starved learner;
additionally, wasted CPU and a hot GIL from polling empty queues).
- __Thread contention at multiple places:__ `Learner` and
`_LearnerThread` shared the `_num_updates_lock` and
`metrics._threading_lock` (an `RLock`). On every update both threads
contended on the same lock, and every 20 updates they occasionally
contended on the other shared lock as well.
- __Global timesteps, race condition:__ `update()` wrote a global
`_CURRENT_GLOBAL_TIMESTEPS` that the `_LearnerThread` had to read
later. Rapid calls to `update()` could overwrite this value before the
thread consumed it, so timesteps could mismatch the batch actually trained on.
- __Aggregation race + spurious reduces:__ The "every 20 updates" path
checked a copy of `_num_updates` without synchronization, then reset it
at `>=20`, with no threading event/condition to align the producer with
the consumer. This often returned `{}` or reduced at odd times.
- __No pinned memory or stream handoff:__ GPU copies used
`pin_memory=False` and there was no explicit stream handoff; any
implicit sync could land inside the learner's update timing.
- __Reference resolving on the producer's hot path:__
`TrainingData.solve_refs()` was called synchronously inside
`LearnerGroup.update()` before queuing, sometimes costing around 25% of
the time in a call. This extends the window in which producer and
learner can drift.
- __Mixed queue semantics + missing task semantics:__ The code mixed
`queue.Queue` and `deque` (and optionally `CircularBuffer`).
`task_done()`/`join()` semantics don't exist for `deque`, so correctness
relied on polling and manual drops, and there was no bounded, blocking
handoff to pace producers. This was brittle under load.
- __No clean stop path:__ The thread used a `stop` flag, but no sentinel
was enqueued; if the thread was blocked or polling, shutdown could hang
or counters could be incremented after a stop.
- __Complete multi-learner stalling:__ In multi-learner setups with
__multi-agent__ policies, asynchronous batches (i.e. batches with
different combinations of policies) led to stalls in Torch's
`DistributedDataParallel` asynchronous gradient synchronization: one
rank computed gradients for a policy not present on the other rank(s)
and waited indefinitely for synced gradients.

### Improvements by this PR
- __Consumer-driven:__ The `Learner` dictates the pace through blocking
queues (the `Learner` does `get()`, the producer does `put()`). This
avoids busy polling (no CPU burn) and lets `Learner.update()` return
quickly when no results are ready. It also avoids learner starvation:
bigger queues absorb producer burstiness.
- __Edge-triggered aggregation:__ Only `_LearnerThread` increments a
private counter and on __exactly__ the `broadcast_interval`th update
fires an __event__ (`_agg_event`). The producer simply `wait`s for the
event and `clear`s it (no lock fight). Furthermore, the `_LearnerThread`
now reduces metrics and returns them through an out-queue from which the
`Learner` picks them up and returns them to the main process. All of
these measures reduce thread contention to an absolute minimum.
- __Pass meta-data with batch:__ The `Learner` enqueues a tuple `(batch,
timesteps)` so the `_LearnerThread` consumes the correct timesteps
atomically. This also reduces communication and boilerplate.
- __(Optional) Deferral of reference resolving:__ Post-solve references
in `Learner.update()` to return faster in asynchronous calls.
- __Clean stop + consistent semantics:__ Use a `_STOP_SENTINEL` through
the learner queue instead of relying on a boolean flag alone, and call
`task_done()` on a real `queue.Queue` (when not using `CircularBuffer`);
see the sketch after this list. The buffer/queue API inside `Learner` is
also unified.
- __Safe-guarding multi-agent multi-learner training:__ Manual gradient
synchronization replaces Torch's `DistributedDataParallel` hooks-based
synchronization for multi-learner multi-agent setups. Gradients on each
rank are zero-padded and synced after all gradients have been computed.
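
A minimal sketch of the consumer-driven handoff and sentinel-based shutdown described above (illustrative only, not the actual `_LearnerThread` code):

```
import queue
import threading

_STOP_SENTINEL = object()

def learner_loop(in_queue, update_fn):
    # Blocking get() paces the producers and avoids busy polling; the
    # sentinel gives a clean shutdown path.
    while True:
        item = in_queue.get()
        try:
            if item is _STOP_SENTINEL:
                return
            batch, timesteps = item  # meta-data travels with the batch
            update_fn(batch, timesteps)
        finally:
            in_queue.task_done()

in_q = queue.Queue(maxsize=32)  # bounded: put() blocks when full
t = threading.Thread(target=learner_loop, args=(in_q, lambda b, ts: None))
t.start()
in_q.put(({"obs": []}, 1000))
in_q.put(_STOP_SENTINEL)
t.join()
```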

## Related issues


## Additional information
Because this PR reshapes the data flow, a few tuning tips are useful:

- Circular buffer vs. simple queue. The old CircularBuffer prevented
learner starvation but its push/pop are slower. The new consumer-driven
pipeline is generally more efficient - assuming producers are reasonably
fast and the learner queue isn’t tiny. Use `use_circular_buffer=True`
only when producing is expensive/irregular (it lets the learner keep
iterating over buffered data, similar to `num_epochs > 1` but in
cycles). Otherwise, prefer the simple queue. Recommended defaults:
`simple_queue_size=32` for `APPO`; `IMPALA` keeps a smaller
`learner_queue_size=3`.
- Unified interval: broadcast & metrics reduction. Previously, weights
were synced by `broadcast_interval` while metrics were reduced every
fixed 20 updates. The new design unifies these: `broadcast_interval` now
controls both weight sync and metrics reduction. In practice, ~10
balances steady flow with acceptable off-policy lag.
- Scale producers to match a fast learner. The `_LearnerThread` applies
updates quickly, so overall throughput is often producer-bound. To feed
it well, increase `num_env_runners` and/or `num_envs_per_env_runner`.


### Next steps
This PR improves dataflow focused on the learner(s). The next steps are:
- To increase throughput in `AggregatorActors`
- To improve dataflow in IMPALA's main thread.
- To boost performance in loss calculation.
- To check asynchronous calls to `EnvRunner`s and `Learner`s.
- To test resolving references in either `_GPULoaderThreads` or
`_LearnerThread` instead of the `Learner`'s main thread.

### Tests
APPO in this PR was tested on the following (multi-agent) environments:
- `CartPole-v1`
- `ALE:Pong-v5`
- `Footsies` (see https://github.com/chasemcd/FootsiesGym)

#### `CartPole-v1`
This PR improves performance significantly for high-producer scenarios
like `CartPole-v1`. All tests used:
- `broadcast_interval=10`
- `use_circular_buffer=False`
- `num_aggregator_actors_per_learner=3`
- `num_env_runners=num_learners x 32`
- `episodes_to_numpy=False`
- `num_gpus_per_learner=1` 

<img width="757" height="404" alt="image (3)"
src="https://github.com/user-attachments/assets/3beee428-d4c0-42f4-811d-61d81de484c2"
/>

#### `ALE:Pong-v5`
All tests used:
- `broadcast_interval=10`
- `use_circular_buffer=False`
- `num_aggregator_actors_per_learner=6`
- `num_env_runners=num_learners x 32`
- `episodes_to_numpy=True` (`FrameStack` connector with 4 frames)
- `num_gpus_per_learner=1`

<img width="676" height="366" alt="image"
src="https://github.com/user-attachments/assets/43d08a87-0cc1-4902-8150-adc1c3203be6"
/>

---------

Signed-off-by: simonsays1980 <simon.zehnder@gmail.com>
## Description
This PR improves type hinting in `ray._common.retry`. It previously
contained many `Any`s and unspecified generics; it should now be fully
specified.
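
For illustration, the kind of signature this enables (hypothetical, not necessarily the actual API of `ray._common.retry`):

```
import functools
import time
from typing import Callable, ParamSpec, TypeVar

P = ParamSpec("P")
R = TypeVar("R")

def retry(times: int, delay_s: float = 0.1) -> Callable[[Callable[P, R]], Callable[P, R]]:
    # ParamSpec/TypeVar preserve the wrapped function's exact parameter
    # and return types instead of degrading them to Any.
    def decorator(fn: Callable[P, R]) -> Callable[P, R]:
        @functools.wraps(fn)
        def wrapper(*args: P.args, **kwargs: P.kwargs) -> R:
            for attempt in range(times):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == times - 1:
                        raise
                    time.sleep(delay_s)
            raise AssertionError("unreachable")
        return wrapper
    return decorator
```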

---------

Signed-off-by: Jonas Dedden <university@jonas-dedden.de>
## Description
When using `runtime_env.working_dir` with a remote zip archive URL (for
example, `https://gitee.com/whaozi/kuberay/repository/archive/master.zip`),
Ray downloads an HTML page instead of the actual zip file. This causes
the Ray job to fail when accessing files from the working directory.

Downloading the same URL with standard tools such as `wget` works as
expected and returns the correct zip archive. This PR addresses the
inconsistency in how `runtime_env.working_dir` handles remote archive
downloads.

#### For example
```
import ray

ray.init(include_dashboard=False, ignore_reinit_error=True)
@ray.remote(
    runtime_env={"working_dir": "https://gitee.com/whaozi/kuberay/repository/archive/master.zip"}
)
def list_repo_files():
    import pathlib
    return sorted(p.name for p in pathlib.Path(".").iterdir())

print(ray.get(list_repo_files.remote()))
ray.shutdown()
```

The extracted directory https_gitee_com_whaozi_kuberay_repository_archive_master is empty,
and
https_gitee_com_whaozi_kuberay_repository_archive_master.zip is an HTML
file:
<img width="1438" height="550" alt="image"
src="https://github.com/user-attachments/assets/ec330c99-3bf7-431a-8f3e-6c1789e257ab"
/>


#### Testing with wget
```
wget https://gitee.com/whaozi/kuberay/repository/archive/master.zip
--2025-08-05 14:28:52--  https://gitee.com/whaozi/kuberay/repository/archive/master.zip
Resolving gitee.com (gitee.com)... 180.76.198.77, 180.76.199.13, 180.76.198.225
Connecting to gitee.com (gitee.com)|180.76.198.77|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://gitee.com/whaozi/kuberay/repository/blazearchive/master.zip?Expires=1754430533&Signature=8EMfEVLuEJRLPHsJPQnqkwoSfWYTon6sdYUD7VrHZcM%3D [following]
--2025-08-05 14:28:54--  https://gitee.com/whaozi/kuberay/repository/blazearchive/master.zip?Expires=1754430533&Signature=8EMfEVLuEJRLPHsJPQnqkwoSfWYTon6sdYUD7VrHZcM%3D
Reusing existing connection to gitee.com:443.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: β€˜master.zip’

master.zip                                 [                                                  <=>                        ]  10.37M  1.23MB/s    in 13s
```
I think we are not handling HTTP redirection here. If I directly use the
redirected URL, it works:
```
from smart_open import open as open_file

with open_file("https://gitee.com/whaozi/kuberay/repository/blazearchive/master.zip?Expires=1754430533&Signature=8EMfEVLuEJRLPHsJPQnqkwoSfWYTon6sdYUD7VrHZcM%3D", "rb") as fin:
    with open_file("/tmp/jjyao_test.zip", "wb") as fout:
        fout.write(fin.read())
```

#### Problem
When using runtime_env.working_dir with a remote zip URL (e.g. gitee
archives), Ray’s HTTPS downloader uses the default Python-urllib
user-agent, and some hosts respond with HTML rather than the archive.
The working directory then contains HTML and the Ray job fails, while
wget succeeds because it presents a curl-like user-agent.

#### Solution
_download_https_uri() now sets curl-like headers
(ray-runtime-env-curl/1.0 UA + Accept: */*, configurable via
RAY_RUNTIME_ENV_HTTP_USER_AGENT). This keeps Ray’s behavior consistent
with curl/wget, allowing gitee and similar hosts to return the proper
zip file. A regression test verifies the headers are set.
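
A minimal sketch of the header change, assuming a urllib-based downloader (the env var name is from this PR; the rest is illustrative):

```
import os
import urllib.request

def _download_https_uri(url, dest):
    user_agent = os.environ.get(
        "RAY_RUNTIME_ENV_HTTP_USER_AGENT", "ray-runtime-env-curl/1.0"
    )
    # curl-like headers so hosts like gitee serve the archive, not HTML.
    req = urllib.request.Request(
        url, headers={"User-Agent": user_agent, "Accept": "*/*"}
    )
    with urllib.request.urlopen(req) as resp, open(dest, "wb") as f:
        f.write(resp.read())
```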

## Related issues
Fixes ray-project#52233

## Additional information

---------

Signed-off-by: yaommen <myanstu@163.com>
Previously, if the user did not specify them, Ray preassigned the GCS
port, dashboard agent port, runtime environment port, etc., and passed
them to each component at startup. This created a race condition: Ray
might believe a port is free, but by the time the port information is
propagated to each component, another process may have already bound to
that port.

This can cause user-facing issues, for example when Raylet heartbeat
messages are missed frequently enough that the GCS considers the node
unhealthy and removes it.

We originally did this because there was no standard local service
discovery, so components had no way to know each other’s serving ports
unless they were preassigned.

The final port discovery design is here:
<img width="2106" height="1492" alt="image"
src="https://github.com/user-attachments/assets/eaac8190-99d8-404b-8a8d-283a4f2f0f33"
/>



This PR addresses port discovery for:
- GCS reporting back to the startup script (driver) βœ…
- The runtime env agent reporting back to the raylet βœ…
- The dashboard agent reporting back to the raylet βœ…
- The raylet blocking registration with the GCS until it has collected
port information from all agents βœ…
- GCS adding InitMetricsExporter to node_added_listeners_ so it starts
the MetricsExporter as soon as the raylet registers with the GCS with
complete port information βœ…
- The Ray client server obtaining the runtime env agent port from GCS βœ…
- Ensuring that both a connected-only driver (e.g., `ray.init()`) and a
startup driver still receive all port information from the GCS βœ…
- Ensuring GCS fault tolerance works, using the same GCS port as before βœ…
- Ensuring no metric loss βœ…
- Cleaning up the old cached-port code βœ…

(Note that this PR is a clean-up version of
ray-project#59065)

## Consideration
**GCS Fault tolerance:**
GCS fault tolerance requires GCS to restart using exactly the same port,
even if it initially starts with a dynamically assigned port (0). Before
this PR, GCS cached the port in a file, and this PR preserves the same
behavior (although ideally, the port should only be read from the file
by the Raylet and its agent).

This can be further improved by storing the GCS port in Redis, but that
should be addressed in a separate PR.

**GCS start sequence related:**
OpenCensus Exporter and the Event Aggregator Client are now constructed
without connecting to the agent port; instead, they defer the actual
connection until the head Raylet registers via a callback. At that
point, the actual metrics_agent_port is known from the node information.

The OpenTelemetry Exporter is now also initialized at head Raylet
registration time.

**Ray nodes that share the same file system:**
There are cases where people run multiple Ray nodes from the same or
different Ray clusters, so the port file name is based on a fixed prefix
plus the node ID.
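
A sketch of that naming/writing scheme (the helper and file layout are hypothetical):

```
import json
import os
import tempfile

def write_port_file(session_dir, component, node_id, port):
    # Fixed prefix plus node ID so multiple Ray nodes sharing a file
    # system don't clobber each other's port files.
    path = os.path.join(session_dir, f"{component}_port_{node_id}.json")
    fd, tmp = tempfile.mkstemp(dir=session_dir)
    with os.fdopen(fd, "w") as f:
        json.dump({"port": port}, f)
    os.replace(tmp, path)  # atomic rename: readers never see a partial file
```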

## Related issues
Closes ray-project#54321

## Test

For GCS-related work, here is a detailed test I wrote that covers seven
starting/connecting cases:
-
https://github.com/Yicheng-Lu-llll/ray/blob/port-self-discovery-test-file/python/ray/tests/test_gcs_port_reporting.py
  - ray.init starts a head node and exposes a dynamic GCS port.
  - Connect a driver via address="auto" using the address file
  - Connect a driver via an explicit address
  - CLI starts head with dynamic GCS port
  - CLI starts worker connecting to the head via GCS address
  - CLI starts head with an explicit GCS port
  - CLI starts head with default GCS port

For runtime env agent:
-
https://github.com/Yicheng-Lu-llll/ray/blob/port-self-discovery-test-file/test_agent_port.py
  - ray start --head (auto port discovery)
  - ray start --head with fixed runtime-env-agent-port
  - ray.init() local cluster (auto port discovery)
  - (we don't have ray.init() with fixed _runtime_env_agent_port)

Test that ray_client_server works correctly with dynamic runtime env
agent port:
-
https://github.com/Yicheng-Lu-llll/ray/blob/port-self-discovery-test-file/test_ray_client_with_runtime_env.py

For dashboard agent ports, the existing tests already cover this quite
well.


## Follow up
- The dashboard agent reporting back to the raylet
- The dashboard agent now also writes to GCS, but we should allow only
the raylet to write to GCS

## Performance

before this PR:

```shell
[0.000s] Starting ray.init()...
[0.000s] Session dir created
[0.070s] Process: gcs_server
[6.885s] Process: runtime_env_agent
[6.955s] Process: raylet
[6.955s] Process: dashboard_agent
2025-12-12 04:47:34,391 INFO worker.py:2014 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265 
/home/ubuntu/ray/python/ray/_private/worker.py:2062: FutureWarning: Tip: In future versions of Ray, Ray will no longer override accelerator visible devices env var if num_gpus=0 or num_gpus=None (default). To enable this behavior and turn off this error message, set RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=0
  warnings.warn(
[9.061s] ray.init() completed
```

After this PR:
```shell
[0.000s] Starting ray.init()...
[0.075s] Process: gcs_server
[0.075s] Session dir created
[0.075s] File: gcs_server_port.json = 39451
[6.976s] Process: raylet
[6.976s] Process: dashboard_agent
[6.976s] Process: runtime_env_agent
[7.576s] File: runtime_env_agent_port.json = 38747
[7.640s] File: metrics_agent_port.json = 40005
[8.083s] File: metrics_export_port.json = 44515
[8.083s] File: dashboard_agent_listen_port.json = 52365
2025-12-12 02:02:54,925 INFO worker.py:1998 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265 
/home/ubuntu/ray/python/ray/_private/worker.py:2046: FutureWarning: Tip: In future versions of Ray, Ray will no longer override accelerator visible devices env var if num_gpus=0 or num_gpus=None (default). To enable this behavior and turn off this error message, set RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=0
  warnings.warn(
[10.035s] ray.init() completed
```
Most of the time is actually spent starting the GCS: we wait for the GCS
to be ready and to write the cluster info. Port reporting itself is fast
(file appearance time minus raylet start time).

https://github.com/ray-project/ray/blob/863ae9fd573b13a05dcae63b483e9b1eb0175571/python/ray/_private/node.py#L1365-L367

---------

Signed-off-by: yicheng <yicheng@anyscale.com>
Co-authored-by: yicheng <yicheng@anyscale.com>
Clean Up Deprecated Env Var and Document Undocumented Env Vars

### Summary

Remove deprecated `RAY_SERVE_ENABLE_JSON_LOGGING` and add documentation
for undocumented Ray Serve environment variables.

### Changes

| File | Description |
|------|-------------|
| `python/ray/serve/_private/constants.py` | Removed
`RAY_SERVE_ENABLE_JSON_LOGGING`, renamed `SERVE_ROOT_URL_ENV_KEY` β†’
`RAY_SERVE_ROOT_URL`, removed deprecated `CONTROLLER_MAX_CONCURRENCY`
fallback |
| `python/ray/serve/_private/logging_utils.py` | Removed deprecated JSON
logging logic and warning |
| `python/ray/serve/_private/controller.py` | Updated to use
`RAY_SERVE_ROOT_URL` constant |
| `doc/source/serve/monitoring.md` | Removed deprecation note, added
`RAY_SERVE_CONTROLLER_CALLBACK_IMPORT_PATH` docs |
| `doc/source/serve/advanced-guides/performance.md` | Added
`RAY_SERVE_CONTROLLER_MAX_CONCURRENCY` docs |
| `doc/source/serve/production-guide/config.md` | Added
`RAY_SERVE_ROOT_URL` docs |

### New Documentation

| Environment Variable | Description |
|---------------------|-------------|
| `RAY_SERVE_CONTROLLER_MAX_CONCURRENCY` | Max concurrent requests for
Controller (default: 15000) |
| `RAY_SERVE_CONTROLLER_CALLBACK_IMPORT_PATH` | Callback for custom
Controller initialization |
| `RAY_SERVE_ROOT_URL` | Override root URL (useful behind load
balancers) |

### Migration

Users using `RAY_SERVE_ENABLE_JSON_LOGGING=1` should migrate to
`LoggingConfig` with `encoding="JSON"`.
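
For example (assuming the `LoggingConfig` API in `ray.serve.schema`):

```
from ray import serve
from ray.serve.schema import LoggingConfig

# Instead of RAY_SERVE_ENABLE_JSON_LOGGING=1, request JSON encoding
# explicitly, e.g. per deployment:
@serve.deployment(logging_config=LoggingConfig(encoding="JSON"))
class MyDeployment:
    def __call__(self, request):
        return "ok"
```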

---------

Signed-off-by: harshit <harshit@anyscale.com>
ray-project#59659)

Signed-off-by: yicheng <yicheng@anyscale.com>
Co-authored-by: yicheng <yicheng@anyscale.com>
## Description
Follow up to ray-project#59350
- motivation: better abstraction for progress bars and type checking in
general.

## Related issues
N/A

## Additional information
N/A

---------

Signed-off-by: kyuds <kyuseung1016@gmail.com>
Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>
…_usage` a utility function (ray-project#59674)

This PR makes the
`ReservationOpResourceAllocator._get_ineligible_ops_with_usage` method a
utility function named `get_ineligible_op_usage`, so that the logic can
be reused by other allocator implementations.

---------

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…ay-project#59668)

To avoid circular dependencies, this PR updates `ranker.py` to import
`ResourceManager` only during type checking.
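
The standard pattern, sketched (module path assumed):

```
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen only by type checkers, never executed at runtime, so the
    # runtime import cycle is broken.
    from ray.data._internal.execution.resource_manager import ResourceManager

def rank(resource_manager: "ResourceManager"):
    ...
```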

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>

Co-authored-by: Alexey Kudinkin <alexey.kudinkin@gmail.com>
…y_accounting` (ray-project#59671)

`PhysicalOperator` has an `implements_accurate_memory_accounting`
method. Subclasses override it and return `True` if they properly call
the `OpRuntimeMetrics` hooks like `on_input_queued`.

We previously needed this method because operators like
`AllToAllOperator` didn't update `OpRuntimeMetrics`, and that would
cause issues for resource allocation. However, now that all operators
implement accurate memory accounting, this method isn't necessary
anymore.

(Sanity check from Claude below)
```
⏺ Based on my search, no concrete PhysicalOperator subclass returns False for implements_accurate_memory_accounting.

  Here's the breakdown:

  Base class default: PhysicalOperator returns False at physical_operator.py:773

  All concrete operators return True:

  | Operator                | Source                                                         |
  |-------------------------|----------------------------------------------------------------|
  | MapOperator             | Overrides at map_operator.py:704                               |
  | TaskPoolMapOperator     | Inherits from MapOperator                                      |
  | ActorPoolMapOperator    | Inherits from MapOperator                                      |
  | LimitOperator           | Overrides at limit_operator.py:135                             |
  | InputDataBuffer         | Overrides at input_data_buffer.py:98                           |
  | OutputSplitter          | Overrides at output_splitter.py:285                            |
  | AggregateNumRows        | Overrides at aggregate_num_rows.py:63                          |
  | UnionOperator           | Overrides at union_operator.py:138                             |
  | ZipOperator             | Overrides at zip_operator.py:152                               |
  | HashShuffleOperator     | Inherits from HashShufflingOperatorBase (hash_shuffle.py:1005) |
  | JoinOperator            | Inherits from HashShufflingOperatorBase                        |
  | HashAggregateOperator   | Inherits from HashShufflingOperatorBase                        |
  | OneToOneOperator (base) | Overrides at base_physical_operator.py:231                     |

  The only class that returns False (by inheritance) is the abstract NAryOperator base class, but its two concrete subclasses (UnionOperator and ZipOperator) both override it to return True.
```

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
## Description

Remove dead code in `python/ray/data/util/data_batch_conversion.py` and
`python/ray/data/util/torch_utils.py`,
which is related to PR: ray-project#59420 


## Related issues

Related to ray-project#59420.

Signed-off-by: will <zzchun8@gmail.com>
Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
## Related issues
Closes ray-project#59652

---------

Signed-off-by: You-Cheng Lin <mses010108@gmail.com>
Signed-off-by: You-Cheng Lin <106612301+owenowenisme@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
## Description
After removing the deprecated `read_parquet_bulk` API,
`FastFileMetadataProvider` became dead code with no remaining usage in
the codebase.

This commit removes:
- FastFileMetadataProvider class implementation
- All imports and exports of FastFileMetadataProvider
- Tests that specifically tested FastFileMetadataProvider
- Documentation references to FastFileMetadataProvider
- Code comments mentioning FastFileMetadataProvider

## Related issues
Fixes ray-project#59010

---------

Signed-off-by: rushikesh.adhav <adhavrushikesh6@gmail.com>
Signed-off-by: Rushikesh Adhav <adhavrushikesh6@gmail.com>
…ect#59733)

## Description

The constructor in ClusterAutoscaler base class is not necessary, and it
adds complexity because it requires all ClusterAutoscaling
implementations to accept the same dependencies

This PR remove the constructor in ClusterAutoscaler. Sub-classes can
prevent using the dependencies that are not used

## Related issues
Closes ray-project#59684


---------

Signed-off-by: machichima <nary12321@gmail.com>
Otherwise, it fails to build with a missing header when gRPC is upgraded.

Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
## Description
bump gymnasium to 1.2.2 in byod-rllib
follow up on:
* ray-project#59530

related:
* ray-project#59572

Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
## Description

Previously, GPU allocation was special-cased in
`ReservationOpResourceAllocator`:
1. GPU operators got all available GPUs (limits.gpu - op_usage.gpu)
regardless of their max_resource_usage
2. The check `max_resource_usage != inf() and max_resource_usage.gpu >
0` failed for unbounded actor pools (max_size=None), causing them to get
zero GPU budget
3. GPU was stripped from remaining shared resources (.copy(gpu=0))

This caused a bug where ActorPoolStrategy(min_size=1, max_size=None)
with GPU actors couldn't autoscale beyond the initial actor.
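
The configuration that previously got stuck, sketched (the dataset and UDF are illustrative):

```
import ray
from ray.data import ActorPoolStrategy

class GpuModel:
    def __call__(self, batch):
        return batch  # stand-in for real GPU inference

ds = ray.data.range(10_000).map_batches(
    GpuModel,
    compute=ActorPoolStrategy(min_size=1, max_size=None),  # unbounded pool
    num_gpus=1,  # GPU actors; the pool can now autoscale past one actor
)
```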

**Changes**
_task_pool_map_operator.py_
- Added min_max_resource_requirements() method for consistency with
ActorPoolMapOperator
- Returns (min=1 task resources, max=max_concurrency * task resources or
inf)

_resource_manager.py_

- Removed GPU special-casing entirely
- GPU now flows through the same allocation path as CPU and memory
- Operators are capped by their max_resource_usage for all resource
types uniformly
- Remaining shared resources (including GPU) go to unbounded downstream
operators



---------

Signed-off-by: Goutam <goutam@anyscale.com>
…9742)

the original dataset is being fetched from the Internet and it is
returning 403

Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
CI Bazel usage should all run on Python 3.10.

Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
not used by rllib any more

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
abrarsheikh and others added 22 commits January 14, 2026 00:29
Added a notebook that demonstrates a Serve application that takes a
reference to a video as input and returns scene changes, tags, and a
video description (from the corpus).


https://anyscale-ray--59859.com.readthedocs.build/en/59859/serve/tutorials/video-analysis/README.html

---------

Signed-off-by: abrar <abrar@anyscale.com>
…ject#60109)

Follow up to
ray-project#52573 (comment)

---------

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Co-authored-by: iamjustinhsu <140442892+iamjustinhsu@users.noreply.github.com>
they are not required for test orchestration, as rayci can properly
track dependencies now.

Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Opening draft PR for Ray technical charter.

Planning to add GitHub usernames before merging.

---------

Signed-off-by: Robert Nishihara <rkn@anyscale.com>
Signed-off-by: Philipp Moritz <pcmoritz@gmail.com>
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
There isn't really a need to have a ray check on the event exporter, as
there isn't an important correctness invariant here: one call will
succeed. We already take some measure of caution with a mutex in the
event recorder, but ray-checking right after the mutex is just asking
for trouble.

---------

Signed-off-by: zac <zac@anyscale.com>
Currently seeing issues where crane is not available in the uploading
environment. Default to Docker if crane is not available.


https://buildkite.com/ray-project/postmerge/builds/15375/steps/canvas?jid=019bb99d-6f9e-45fa-92e3-a5a1d9373e8d#019bb99d-6f9e-45fa-92e3-a5a1d9373e8d/L198

Topic: crane-fix

Signed-off-by: andrew <andrew@anyscale.com>

Signed-off-by: andrew <andrew@anyscale.com>
…ay-project#59991)

Updates lock files for images and uses relative paths in Buildkite
configs. Moves the base extra test deps lock files from the ray_release
path to python/deplocks/base_extra_testdeps.
Release test run: https://buildkite.com/ray-project/release/builds/74936

---------

Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
…ct#59969)

The ray-cpp wheel contains only C++ headers, libraries, and executables
with no Python-specific code. Previously we built 4 identical wheels
(one per Python version: cp310, cp311, cp312, cp313), wasting CI time
and storage.

This change produces a single wheel tagged py3-none-manylinux2014_*
that works with any Python 3.x version.

Changes:
- Add ray-cpp-core.wanda.yaml and Dockerfile for cpp core
- Add ray-cpp-wheel.wanda.yaml for cpp wheel builds
- Add ci/build/build-ray-cpp-wheel.sh for Python-agnostic wheel builds
- Add RayCppBdistWheel class to setup.py that forces py3-none tags
  (necessary because BinaryDistribution.has_ext_modules() causes
  bdist_wheel to use interpreter-specific ABI tags by default); see the
  sketch after this list
- Update ray-cpp-wheel.wanda.yaml to build single wheel per architecture
- Update .buildkite/build.rayci.yml to remove Python version matrix
  for cpp wheel build/upload steps
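
A sketch of the tag override (the actual class lives in setup.py; this shows the technique):

```
from wheel.bdist_wheel import bdist_wheel

class RayCppBdistWheel(bdist_wheel):
    def get_tag(self):
        # Keep the platform tag (manylinux2014_*) but replace the
        # interpreter/ABI tags that has_ext_modules() would force.
        _, _, plat = super().get_tag()
        return "py3", "none", plat
```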

Topic: ray-cpp-wheel
Relative: ray-wheel

Signed-off-by: andrew <andrew@anyscale.com>

---------

Signed-off-by: andrew <andrew@anyscale.com>
Signed-off-by: cristianjd <cristian.j.derr@gmail.com>
The governance information is now integrated into the contributor
documentation at doc/source/ray-contribute/getting-involved.rst:399-437,
making it easily discoverable for community members interested in
advancing their involvement in the Ray project.

---------

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Robert Nishihara <robertnishihara@gmail.com>
## Description
Completing the fixed-size array namespace operations

## Related issues
Related to ray-project#58674 

## Additional information

---------

Signed-off-by: 400Ping <fourhundredping@gmail.com>
Signed-off-by: Ping <fourhundredping@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Builds running into errors since we were not logged in.


https://buildkite.com/ray-project/postmerge/builds/15387/steps/canvas?jid=019bbaad-93e8-40e9-aeef-3592bd609f2d#019bbaad-93e8-40e[…]3592bd609f2d/L198

Topic: login-ecr-fix
Signed-off-by: andrew <andrew@anyscale.com>

Signed-off-by: andrew <andrew@anyscale.com>
These checks were being done manually, but click can handle this
behavior. Removing this block in favor of click.

Signed-off-by: andrew <andrew@anyscale.com>
Support arrow format in OneHotEncoder.


Benchmark: TPC-H SF10

The improvement for `OneHotEncoder`:

Before:

| Batch Size | Throughput (rows/sec) | Avg Latency (ms) | P50 (ms) | P95 (ms) | P99 (ms) | Min (ms) | Max (ms) |
|:----------:|----------------------:|:----------------:|:--------:|:--------:|:--------:|:--------:|:--------:|
| 1 | 374 | 2.67 | 2.66 | 2.86 | 2.93 | 2.51 | 3.33 |
| 5 | 1,867 | 2.68 | 2.67 | 2.82 | 2.85 | 2.52 | 3.48 |
| 10 | 3,590 | 2.79 | 2.77 | 2.91 | 3.33 | 2.61 | 3.47 |
| 20 | 6,911 | 2.89 | 2.88 | 3.08 | 3.41 | 2.72 | 3.62 |
| 50 | 15,841 | 3.16 | 3.14 | 3.28 | 3.58 | 3.00 | 3.76 |
| 100 | 27,019 | 3.70 | 3.60 | 3.86 | 4.66 | 3.40 | 9.73 |

After:

| Batch Size | Throughput (rows/sec) | Avg Latency (ms) | P50 (ms) | P95 (ms) | P99 (ms) | Min (ms) | Max (ms) |
|:----------:|----------------------:|:----------------:|:--------:|:--------:|:--------:|:--------:|:--------:|
| 1 | 1,187 | 0.84 | 0.82 | 0.92 | 1.21 | 0.80 | 1.32 |
| 5 | 5,888 | 0.85 | 0.82 | 0.99 | 1.13 | 0.80 | 1.17 |
| 10 | 12,144 | 0.82 | 0.81 | 0.86 | 0.94 | 0.79 | 1.19 |
| 20 | 24,385 | 0.82 | 0.81 | 0.86 | 0.88 | 0.79 | 1.17 |
| 50 | 59,987 | 0.83 | 0.82 | 0.87 | 0.91 | 0.81 | 1.18 |
| 100 | 118,853 | 0.84 | 0.83 | 0.88 | 0.97 | 0.81 | 1.18 |


### Null behavior

We keep the null behaviors the same as the old pandas implementations.

| Encoder | Path | Null Input Behavior | Unseen Category Behavior |
|---------|------|---------------------|--------------------------|
| OrdinalEncoder | Pandas | **ValueError** | NaN |
| OrdinalEncoder | Arrow | **ValueError** | null (NaN when converted) |
| OneHotEncoder | Pandas | **ValueError** | all-zeros vector |
| OneHotEncoder | Arrow | **ValueError** | all-zeros vector |
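
A minimal usage sketch of the preprocessor (data illustrative):

```
import ray
from ray.data.preprocessors import OneHotEncoder

ds = ray.data.from_items([{"color": "red"}, {"color": "blue"}, {"color": "red"}])
encoder = OneHotEncoder(columns=["color"])
# fit() learns the category set; transform now takes the fast Arrow path.
ds = encoder.fit_transform(ds)
print(ds.take_all())
```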

---------

Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
Signed-off-by: xgui <xgui@anyscale.com>
No need to be as verbose

Signed-off-by: andrew <andrew@anyscale.com>
Generates the base-slim depset, installs the lock file in the base-slim
image, and updates paths for constraints files and requirements files.

---------

Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Security upgrade. Fixes ray-project#60079

Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Bumping requests to minimum version 2.32.5 due to security
vulnerabilities with requests<2.32.4.

https://security.snyk.io/package/pip/requests

Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Co-authored-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Increase timeout for several flaky train tests.

---------

Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
Signed-off-by: xgui <xgui@anyscale.com>
While looking at the UpdateObjectLocationsBatch path and the pubsub that
the owner uses to communicate object location updates, I noticed that we
send over both the object locations and the primary node ID. However,
the primary node ID is not used anywhere in the OBOD. What's interesting
is that there seem to be a couple of paths:

- Each raylet initially subscribes to the owner for object location
updates. After the raylet is done (note it unsubscribes once done
pulling), it sends its object location updates via an
UpdateObjectLocationsBatch RPC to the owner. The owner then publishes
this update to all the other currently subscribed raylets via the same
pubsub previously mentioned.
- We also have a pinned_node_id that is contained in the reference
counter and set by PushTaskReply. In PushTaskReply we still publish an
object locations update in UpdateObjectPendingCreationInternal, but we
send an empty list of location updates (that's only populated by
UpdateObjectLocationsBatch) and set a pending_creation boolean flag that
indicates when the object is done being created. It's possible that
UpdateObjectLocationsBatch is received before PushTaskReply, which is
why we have this guard.

From looking at ray-project#25004 it seems
this is intentional, and we only want to use UpdateObjectLocationsBatch
to modify the locations set. We send this locations set to any raylet
that's trying to decide where to pull from (makes sense), but we also
send the primary_node_id, which is copied from pinned_node_id. The
latter isn't used at all, so there's no point in sending it. I removed
it from the proto and a couple of other places where it doesn't seem
necessary.

---------

Signed-off-by: joshlee <joshlee@anyscale.com>

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request is a large automated merge from master to main, incorporating a wide range of changes across the Ray ecosystem. The updates include significant CI/CD pipeline refactoring, new features in Ray Serve, Data, and Core, extensive documentation improvements, and internal changes to Java worker initialization. Key highlights are the modularization of the build process, the introduction of asynchronous inference and advanced multimodal capabilities, and enhanced monitoring and fault tolerance in Ray Serve. The codebase is also moving away from Python 3.9. Overall, these changes represent a substantial step forward in functionality, maintainability, and user experience. My review focuses on a minor point in the new CI configurations.

```
RAYCI_DISABLE_JAVA: "false"
RAYCI_WANDA_ALWAYS_REBUILD: "true"
JDK_SUFFIX: "-jdk"
ARCH_SUFFIX: "aarch64"
```


medium

The environment variable ARCH_SUFFIX is defined here for the manylinux-cibase-jdk-aarch64 step. However, looking at the corresponding Wanda configuration file (ci/docker/manylinux-cibase.wanda.yaml), this variable doesn't appear to be used as a build argument or within the associated Dockerfile. This makes it redundant. To improve maintainability and avoid confusion, it's best to remove unused environment variables.

@github-actions

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the stale label Jan 29, 2026