🔄 daily merge: master → main 2026-01-28 #759

Open
antfin-oss wants to merge 526 commits into main from create-pull-request/patch-4dfa82a39c

@antfin-oss
This pull request was created automatically to merge the latest changes from master into the main branch.

📅 Created: 2026-01-28
🔀 Merge direction: master → main
🤖 Triggered by: Scheduled

Please review and merge if everything looks good.

aslonnie and others added 30 commits January 12, 2026 11:52
stop using the large oss ci test base

Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
…#59896)

## Description

Addresses a critical issue in the `DefaultAutoscalerV2`, where nodes
were not being properly scaled from zero. With this update, clusters
managed by Ray will now automatically provision additional nodes when
there is workload demand, even when starting from an idle (zero-node)
state.

## Related issues
Closes ray-project#59682



---------

Signed-off-by: Hsien-Cheng Huang <ryankert01@gmail.com>
Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: sampan <sampan@anyscale.com>
Co-authored-by: sampan <sampan@anyscale.com>
…#59616)

## Description
We observed that raylet frequently emits log messages of the form
"Dropping sync message with stale version", which can become quite noisy
in practice.

This behavior occurs because raylet does not update the message version
for sync messages received from the GCS, and stale-version broadcast
messages are expected to be skipped by default. As a result, these log
entries are generated repeatedly even though this is normal and
non-actionable behavior.

Given that this does not indicate an error or unexpected state, logging
it at the INFO level significantly increases log noise and makes it
harder to identify genuinely important events.

We propose demoting this log from INFO to DEBUG in
RaySyncerBidiReactorBase to keep raylet logs cleaner while still
preserving the information for debugging purposes when needed.


![img_v3_02t7_be5071d6-99d2-4b3c-b189-66aa77476d3g](https://github.com/user-attachments/assets/ed91c317-3a86-441c-a2bf-b317ac0af618)

## Related issues
Closes ray-project#59615

## Additional information
- Change log level from INFO to DEBUG for "Dropping sync message with
stale version" in RaySyncerBidiReactorBase.

Signed-off-by: Mao Yancan <yancan.mao@bytedance.com>
Co-authored-by: Mao Yancan <yancan.mao@bytedance.com>
## Description
Runs linkcheck on the docs, in particular for RLlib, where we've moved
tuned-examples to examples/algorithms.
Further, updated GitHub links that were automatically redirected.

Some of the RLlib examples are still missing, but I'm going to fix these
in the algorithm premerge PRs, i.e.,
ray-project#59007

---------

Signed-off-by: Mark Towers <mark@anyscale.com>
Co-authored-by: Mark Towers <mark@anyscale.com>
…project#60050)

Add support for authenticating HTTPS downloads in runtime environments
using
bearer tokens via the RAY_RUNTIME_ENV_BEARER_TOKEN environment variable.
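
As a rough illustration of the idea (not Ray's actual implementation), a download helper could attach the token as a standard `Authorization: Bearer` header when the environment variable is set. The function name `build_download_request` and the example URL are hypothetical; only the environment variable name comes from the description above.

```python
import os
import urllib.request


def build_download_request(url):
    """Hypothetical helper: add a bearer token header when one is configured."""
    request = urllib.request.Request(url)
    token = os.environ.get("RAY_RUNTIME_ENV_BEARER_TOKEN")
    if token:
        # Standard HTTP bearer scheme; the header is omitted when no token is set.
        request.add_header("Authorization", f"Bearer {token}")
    return request


# Example usage with a placeholder URL (no network access happens here).
os.environ["RAY_RUNTIME_ENV_BEARER_TOKEN"] = "example-token"
example_request = build_download_request("https://example.com/package.zip")
```

The request object can then be passed to `urllib.request.urlopen`; leaving the variable unset keeps the request unauthenticated.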

Fixes [ray-project#46833](ray-project#46833)

Signed-off-by: Denis Khachyan <khachyanda@gmail.com>
…ect#60014)

## Description
This PR fixes a critical deadlock issue in Ray Client that occurs when
garbage collection triggers `ClientObjectRef.__del__()` while the
DataClient lock is held.

When using Ray Client, a deadlock can occur in the following scenario:

1. Main thread acquires DataClient.lock (e.g., in _async_send())
2. Garbage collection is triggered while holding the lock
3. GC calls `ClientObjectRef.__del__()`
4. `__del__()` attempts to call call_release() → _release_server() →
DataClient.ReleaseObject()
5. ReleaseObject() tries to acquire the same DataClient.lock
6. Deadlock: The same thread tries to acquire a non-reentrant lock it
already holds

## Related issues
> Fixes ray-project#59643 

## Additional information
This PR implements a deferred release pattern that completely avoids the
deadlock:

1. Deferred Release Queue: Introduces _release_queue (a thread-safe
queue.SimpleQueue) to collect object IDs that need to be released
2. Background Release Thread: Adds _release_thread that processes the
release queue asynchronously
3. Non-blocking `__del__`: `ClientObjectRef.__del__()` now only puts IDs
into the queue (no lock acquisition)
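
The deferred release pattern described above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the Ray Client code: `DataClientSketch`, `ObjectRefSketch`, `defer_release`, and `close` are hypothetical names, and only `queue.SimpleQueue` plus the idea of a background release thread come from the description.

```python
import queue
import threading


class DataClientSketch:
    """Hypothetical stand-in for a client guarded by a non-reentrant lock."""

    def __init__(self):
        self.lock = threading.Lock()  # non-reentrant, as in the scenario above
        self.released = []
        self._release_queue = queue.SimpleQueue()
        self._release_thread = threading.Thread(target=self._drain, daemon=True)
        self._release_thread.start()

    def release_object(self, object_id):
        # Takes the lock, so it must not run from __del__ on a thread holding it.
        with self.lock:
            self.released.append(object_id)

    def defer_release(self, object_id):
        # Safe inside __del__: putting into the queue never touches self.lock.
        self._release_queue.put(object_id)

    def _drain(self):
        while True:
            object_id = self._release_queue.get()
            if object_id is None:  # sentinel: stop the worker
                return
            self.release_object(object_id)

    def close(self):
        self._release_queue.put(None)
        self._release_thread.join()


class ObjectRefSketch:
    def __init__(self, client, object_id):
        self._client = client
        self._id = object_id

    def __del__(self):
        # Only enqueue; the background thread performs the locked release.
        self._client.defer_release(self._id)


client = DataClientSketch()
with client.lock:
    ref = ObjectRefSketch(client, "obj-1")
    del ref  # finalizer runs while the lock is held, without deadlocking
client.close()
```

Because the finalizer only enqueues, it is safe no matter which locks the collecting thread happens to hold; the actual release waits until the lock is free.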

---------

Signed-off-by: redgrey1993 <ulyer555@hotmail.com>
Co-authored-by: redgrey1993 <ulyer555@hotmail.com>
Context
---
This change revisits the `HashShuffleAggregator` protocol by

 - Removing the global lock (per aggregator)
 - Making the shard-accepting flow lock-free
 - Relocating all state from `ShuffleAggregation` into the Aggregator itself
 - Adding dynamic compaction (exponentially increasing compaction period)
   to amortize compaction costs
 - Adding debugging state dumps
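
To make the "exponentially increasing compaction period" idea concrete, here is a toy sketch. It assumes a buffer that merges pending shards once their count reaches a threshold and then doubles that threshold; `CompactingBuffer`, its fields, and the doubling factor are illustrative assumptions, not the aggregator's actual design.

```python
class CompactingBuffer:
    """Hypothetical shard buffer with an exponentially growing compaction period."""

    def __init__(self, initial_period=2):
        self._period = initial_period
        self._pending = []    # shards accepted since the last compaction
        self._compacted = []  # rows merged so far
        self.compactions = 0

    def accept(self, shard):
        self._pending.append(shard)
        if len(self._pending) >= self._period:
            self._compact()

    def _compact(self):
        for shard in self._pending:
            self._compacted.extend(shard)
        self._pending.clear()
        self.compactions += 1
        self._period *= 2  # wait twice as long before compacting again

    def rows(self):
        # Merged rows plus anything still waiting for the next compaction.
        return self._compacted + [row for shard in self._pending for row in shard]


buffer = CompactingBuffer(initial_period=2)
for i in range(14):
    buffer.accept([i])
```

With a doubling period, the number of compactions grows logarithmically with the number of accepted shards, which is the amortization the bullet list refers to.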


---------

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Adding CONSTRAINTS_FILE docker arg for ray base-deps image

release test run: https://buildkite.com/ray-project/release/builds/74879

---------

Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
## Description
1. The jax dependency was introduced in
ray-project#58322
2. The current test environment targets CUDA 12.1, which limits the jax
version to 0.4.14 or below.
3. jax <= 0.4.14 does not support Python 3.12.
4. Skip the jax test when it runs against Python 3.12+.

---------

Signed-off-by: Lehui Liu <lehui@anyscale.com>
…se (ray-project#60080)

Signed-off-by: Future-Outlier <eric901201@gmail.com>
## Description
Add model inference release test that closely reflects user workloads.

Release test run:
https://console.anyscale-staging.com/cld_vy7xqacrvddvbuy95auinvuqmt/prj_xqmpk8ps6civt438u1hp5pi88g/jobs/prodjob_glehkcquv9k26ta69f8lkc94nl?job-logs-section-tabs=application_logs&job-tab=overview&metrics-tab=data


---------

Signed-off-by: Goutam <goutam@anyscale.com>
Reverts ray-project#59983

The symlink does not work with newer versions of wanda, and the newer wanda behavior is the correct one.
…ct#59987)

- Bump .rayciversion from 0.21.0 to 0.25.0
- Move rules files to .buildkite/ with *.rules.txt naming convention
- Add always.rules.txt for always-run lint rules
- Add test.rules.test.txt with test cases
- Add test-rules CI step in cicd.rayci.yml (auto-discovery)
- Update macOS config to use new rules file paths

Topic: update-rayci-latest

Signed-off-by: andrew <andrew@anyscale.com>
…ay-project#60057)

## Summary

When running prefill-decode disaggregation with NixlConnector and data
parallelism, both prefill and decode deployments were using the same
port base for their ZMQ side channel. This caused "Address already in
use" errors when both deployments had workers on the same node:

```
zmq.error.ZMQError: Address already in use (addr='tcp://10.0.75.118:40009')
Exception in thread nixl_handshake_listener
```

## Changes

Fix by setting different `NIXL_SIDE_CHANNEL_PORT_BASE` values for
prefill (40000) and decode (41000) configs to ensure port isolation.
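
A small sketch of why separate port bases avoid the collision. It assumes, purely for illustration, that each worker listens on `port_base + worker_index`; the helper names and the worker count of 16 are hypothetical, while the base values 40000 and 41000 come from the description above.

```python
def side_channel_port(port_base, worker_index):
    # Assumption for illustration: each worker listens on base + index.
    return port_base + worker_index


def ports_for(port_base, num_workers):
    return {side_channel_port(port_base, i) for i in range(num_workers)}


# Both deployments on the same base: every port is claimed twice on a shared node.
shared = ports_for(40000, 16) & ports_for(40000, 16)

# Distinct bases for prefill and decode: the port ranges no longer overlap.
isolated = ports_for(40000, 16) & ports_for(41000, 16)
```

Under this assumption the two ranges stay disjoint as long as a deployment has fewer than 1000 workers per node.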

## Test plan

- Run `test_llm_serve_prefill_decode_with_data_parallelism` - should
complete without timeout
- The test previously hung forever waiting for "READY message from DP
Coordinator"

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
…se (ray-project#60092)

Signed-off-by: Future-Outlier <eric901201@gmail.com>
- Fix ProgressBar to honor `use_ray_tqdm` in `DataContext`. 
- Note that `tqdm_ray` is designed to work in non-interactive contexts
(workers/actors) by sending JSON progress updates to the driver.

---------

Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
…ray-project#59933)

## Description
The `DefaultAutoscaler2` implementation needs an
`AutoscalingCoordinator` and a way to get all of the
`_NodeResourceSpec`.

Currently, we can't explicitly inject fake implementations of either
dependency. This is problematic because the tests need to assume what
the implementation of each dependency looks like and use brittle mocks.

To solve this:
- Add the `FakeAutoscalingCoordinator` implementation to a new
`fake_autoscaling_coordinator.py` module (you can use the code below)
- `DefaultClusterAutoscalerV2` has two new parameters
`autoscaling_coordinator: Optional[AutoscalingCoordinator] = None` and
`get_node_counts: Callable[[], Dict[_NodeResourceSpec, int]] =
get_node_resource_spec_and_count`. If `autoscaling_coordinator` is None,
you can use the default implementation.
- Update `test_try_scale_up_cluster` to use the explicit seams rather
than mocks. Where possible, assert against the public interface rather
than implementation details
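
The explicit seams described above follow a common dependency injection shape. The sketch below uses simplified hypothetical types (`FakeCoordinator`, `AutoscalerSketch`, string node specs, and an invented scale-up rule) to show how a test can pass fakes through the constructor instead of patching; it is not the real `DefaultClusterAutoscalerV2`.

```python
from typing import Callable, Dict, Optional


class FakeCoordinator:
    """Hypothetical in-memory fake that records requests instead of contacting a cluster."""

    def __init__(self):
        self.requests = []

    def request_resources(self, resources):
        self.requests.append(resources)


def default_node_counts() -> Dict[str, int]:
    # Stand-in for the real default lookup; returns no nodes in this sketch.
    return {}


class AutoscalerSketch:
    def __init__(
        self,
        coordinator: Optional[FakeCoordinator] = None,
        get_node_counts: Callable[[], Dict[str, int]] = default_node_counts,
    ):
        # Fall back to a default when no collaborator is injected.
        self._coordinator = coordinator if coordinator is not None else FakeCoordinator()
        self._get_node_counts = get_node_counts

    def try_scale_up(self):
        # Invented rule for illustration: ask for one more node of each known type.
        request = {spec: count + 1 for spec, count in self._get_node_counts().items()}
        self._coordinator.request_resources(request)


fake = FakeCoordinator()
AutoscalerSketch(fake, lambda: {"cpu-4": 2}).try_scale_up()
```

The test then asserts on what the fake recorded, which is part of its public interface, rather than on which internal calls were made.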


## Related issues
Closes ray-project#59683

---------

Signed-off-by: 400Ping <fourhundredping@gmail.com>
Signed-off-by: Ping <fourhundredping@gmail.com>
## Description
RLlib's rayci.yml
[file](https://github.com/ray-project/ray/blob/master/.buildkite/rllib.rayci.yml)
and the BUILD.bazel
[file](https://github.com/ray-project/ray/blob/master/rllib/BUILD.bazel)
are disconnected such that there are old tags in the BUILD not the rayci
and vice-versa.
This PR attempts to clean up both files without modifying what tests are
or aren't run currently

---------

Signed-off-by: Mark Towers <mark@anyscale.com>
Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
Co-authored-by: Mark Towers <mark@anyscale.com>
Co-authored-by: Kamil Kaczmarek <kamil@anyscale.com>
…g tracer file handles (ray-project#60078)

This fix resolves Serve's Windows test failure:
```
[2026-01-12T22:52:13Z] =================================== ERRORS ====================================
--
[2026-01-12T22:52:13Z] _______ ERROR at teardown of test_deployment_remote_calls_with_tracing ________
[2026-01-12T22:52:13Z]
[2026-01-12T22:52:13Z]     @pytest.fixture
[2026-01-12T22:52:13Z]     def cleanup_spans():
[2026-01-12T22:52:13Z]         """Cleanup temporary spans_dir folder at beginning and end of test."""
[2026-01-12T22:52:13Z]         if os.path.exists(spans_dir):
[2026-01-12T22:52:13Z]             shutil.rmtree(spans_dir)
[2026-01-12T22:52:13Z]         os.makedirs(spans_dir, exist_ok=True)
[2026-01-12T22:52:13Z]         yield
[2026-01-12T22:52:13Z]         # Enable tracing only sets up tracing once per driver process.
[2026-01-12T22:52:13Z]         # We set ray.__traced__ to False here so that each
[2026-01-12T22:52:13Z]         # test will re-set up tracing.
[2026-01-12T22:52:13Z]         ray.__traced__ = False
[2026-01-12T22:52:13Z]         if os.path.exists(spans_dir):
[2026-01-12T22:52:13Z] >           shutil.rmtree(spans_dir)
[2026-01-12T22:52:13Z]
[2026-01-12T22:52:13Z] python\ray\serve\tests\test_serve_with_tracing.py:30:
[2026-01-12T22:52:13Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2026-01-12T22:52:13Z] C:\Miniconda3\lib\shutil.py:750: in rmtree
[2026-01-12T22:52:13Z]     return _rmtree_unsafe(path, onerror)
[2026-01-12T22:52:13Z] C:\Miniconda3\lib\shutil.py:620: in _rmtree_unsafe
[2026-01-12T22:52:13Z]     onerror(os.unlink, fullname, sys.exc_info())
[2026-01-12T22:52:13Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2026-01-12T22:52:13Z]
[2026-01-12T22:52:13Z] path = '/tmp/spans/'
[2026-01-12T22:52:13Z] onerror = <function rmtree.<locals>.onerror at 0x000002C0FFBBDA20>
[2026-01-12T22:52:13Z]
[2026-01-12T22:52:13Z]     def _rmtree_unsafe(path, onerror):
[2026-01-12T22:52:13Z]         try:
[2026-01-12T22:52:13Z]             with os.scandir(path) as scandir_it:
[2026-01-12T22:52:13Z]                 entries = list(scandir_it)
[2026-01-12T22:52:13Z]         except OSError:
[2026-01-12T22:52:13Z]             onerror(os.scandir, path, sys.exc_info())
[2026-01-12T22:52:13Z]             entries = []
[2026-01-12T22:52:13Z]         for entry in entries:
[2026-01-12T22:52:13Z]             fullname = entry.path
[2026-01-12T22:52:13Z]             if _rmtree_isdir(entry):
[2026-01-12T22:52:13Z]                 try:
[2026-01-12T22:52:13Z]                     if entry.is_symlink():
[2026-01-12T22:52:13Z]                         # This can only happen if someone replaces
[2026-01-12T22:52:13Z]                         # a directory with a symlink after the call to
[2026-01-12T22:52:13Z]                         # os.scandir or entry.is_dir above.
[2026-01-12T22:52:13Z]                         raise OSError("Cannot call rmtree on a symbolic link")
[2026-01-12T22:52:13Z]                 except OSError:
[2026-01-12T22:52:13Z]                     onerror(os.path.islink, fullname, sys.exc_info())
[2026-01-12T22:52:13Z]                     continue
[2026-01-12T22:52:13Z]                 _rmtree_unsafe(fullname, onerror)
[2026-01-12T22:52:13Z]             else:
[2026-01-12T22:52:13Z]                 try:
[2026-01-12T22:52:13Z] >                   os.unlink(fullname)
[2026-01-12T22:52:13Z] E                   PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: '/tmp/spans/15464.txt'
[2026-01-12T22:52:13Z]
[2026-01-12T22:52:13Z] C:\Miniconda3\lib\shutil.py:618: PermissionError
```

**Cause:** The `setup_local_tmp_tracing.py` module opens a file handle
for the `ConsoleSpanExporter` that is never explicitly closed. On
Windows, files cannot be deleted while they're open, causing
`shutil.rmtree` to fail with `PermissionError: [WinError 32]` during the
`cleanup_spans` fixture teardown.

**Fix:** Added `trace.get_tracer_provider().shutdown()` in the
`ray_serve_with_tracing` fixture teardown to properly flush and close
the span exporter's file handles before the cleanup fixture attempts to
delete the spans directory.
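
The ordering issue can be illustrated without OpenTelemetry. In this standard-library sketch a plain file handle stands in for the span exporter's file, and the temporary directory is a placeholder for the spans directory; the point is only that the handle is closed before the directory is removed, which is what the provider shutdown achieves in the fix.

```python
import os
import shutil
import tempfile

# Placeholder for the spans directory used by the tests.
spans_dir = tempfile.mkdtemp()

# Stand-in for the exporter's open file handle.
handle = open(os.path.join(spans_dir, "spans.txt"), "w")
handle.write("span\n")

# Teardown order matters on Windows: release the handle first,
# then delete the directory that contains the file.
handle.close()
shutil.rmtree(spans_dir)
```

On POSIX systems the deletion would succeed even with the handle open, which is why the failure only showed up in the Windows tests.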

---------

Signed-off-by: doyoung <doyoung@anyscale.com>
### Why are these changes needed?

When `fit()` is called multiple times on a `Preprocessor`, the `stats_`
dict was not being reset before computing new stats. This caused stale
stats from previous `fit()` calls to persist when stat keys are
data-dependent (e.g., when columns are auto-detected from the data).

This violates the documented behavior in the `fit()` docstring:

> Calling it more than once will overwrite all previously fitted state:
> `preprocessor.fit(A).fit(B)` is equivalent to `preprocessor.fit(B)`.

**Example of the bug:**

```python
# Preprocessor that auto-detects columns from data
preprocessor = DataDependentPreprocessor()

# Dataset A has columns: a, b
dataset_a = ray.data.from_items([{"a": 1.0, "b": 10.0}, ...])
preprocessor.fit(dataset_a)
# stats_ = {"mean(a)": 2.0, "mean(b)": 20.0}

# Dataset B has columns: b, c (no "a" column)
dataset_b = ray.data.from_items([{"b": 100.0, "c": 1000.0}, ...])
preprocessor.fit(dataset_b)

# BUG: stats_ = {"mean(a)": 2.0, "mean(b)": 200.0, "mean(c)": 2000.0}
#      "mean(a)" is STALE - it should not exist!
# EXPECTED: stats_ = {"mean(b)": 200.0, "mean(c)": 2000.0}
```
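
A minimal sketch of the reset, assuming a toy preprocessor whose stat keys depend on the columns present in the data. `ColumnMeanPreprocessor` is hypothetical and operates on plain lists of dicts rather than Ray datasets; the only point is that `stats_` is cleared at the start of every `fit()`.

```python
class ColumnMeanPreprocessor:
    """Hypothetical preprocessor whose stat keys depend on the columns it sees."""

    def __init__(self):
        self.stats_ = {}

    def fit(self, rows):
        # Drop state from any earlier fit() before computing new stats.
        self.stats_ = {}
        columns = sorted({key for row in rows for key in row})
        for column in columns:
            values = [row[column] for row in rows if column in row]
            self.stats_[f"mean({column})"] = sum(values) / len(values)
        return self


preprocessor = ColumnMeanPreprocessor()
preprocessor.fit([{"a": 1.0, "b": 10.0}, {"a": 3.0, "b": 30.0}])
preprocessor.fit([{"b": 100.0, "c": 1000.0}])
```

After the second call only the keys derived from the second dataset remain, matching the documented `fit(A).fit(B)` equivalence.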

---------

Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
Signed-off-by: xgui <xgui@anyscale.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: sampan <sampan@anyscale.com>
Co-authored-by: sampan <sampan@anyscale.com>
…project#60072)

Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: Dhyey Shah <dhyey2019@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…ay-project#60037)

## Description

As mentioned in
ray-project#59740 (comment),
add explicit args in `_AutoscalingCoordinatorActor` constructor to
improve maintainability.

## Related issues

Follow-up: ray-project#59740

## Additional information
- Pass in mock function in testing as args rather than using `patch`

---------

Signed-off-by: machichima <nary12321@gmail.com>
Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
…ay-project#60028)

Capture the install script content in BuildContext digest by inlining it
as a constant and adding install_python_deps_script_digest field. This
ensures build reproducibility when the script changes.
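
The idea of folding a script's content into a build digest can be sketched as follows. This is an illustration in Python under stated assumptions: the script body is a placeholder, `script_digest` and `build_context_digest` are hypothetical helpers, and SHA-256 is an assumed hash choice. Only the field name `install_python_deps_script_digest` comes from the description above.

```python
import hashlib

# Placeholder script content, inlined as a constant so it can be hashed.
INSTALL_PYTHON_DEPS_SCRIPT = "#!/bin/bash\nset -euo pipefail\necho installing\n"


def script_digest(script):
    return hashlib.sha256(script.encode("utf-8")).hexdigest()


def build_context_digest(fields):
    """Hash all fields in a stable key order so equal inputs give equal digests."""
    hasher = hashlib.sha256()
    for key in sorted(fields):
        hasher.update(f"{key}={fields[key]}\n".encode("utf-8"))
    return hasher.hexdigest()


base_fields = {
    "python_version": "3.11",  # hypothetical extra field
    "install_python_deps_script_digest": script_digest(INSTALL_PYTHON_DEPS_SCRIPT),
}
edited_fields = dict(
    base_fields,
    install_python_deps_script_digest=script_digest(INSTALL_PYTHON_DEPS_SCRIPT + "# edit\n"),
)
```

Any edit to the script changes its digest, which in turn changes the overall build context digest and invalidates stale cached builds.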

Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
the "test-rules" test job was missing the forge dependency

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
This migrates ray wheel builds from CLI-based approach to wanda-based
container builds for x86_64.

Changes:
- Add ray-wheel.wanda.yaml and Dockerfile for wheel builds
- Update build.rayci.yml wheel steps to use wanda
- Add wheel upload steps that extract from wanda cache

Topic: ray-wheel

Signed-off-by: andrew <andrew@anyscale.com>
ray-project#60114)

…eed up iter_batches (ray-project#58467)"

This reverts commit 2a042d4.

## Description
Reverts # 58467

## Related issues

## Additional information

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
…ctors (ray-project#59850)

Signed-off-by: dragongu <andrewgu@vip.qq.com>
abrarsheikh and others added 24 commits January 26, 2026 13:25
…ay-project#60468)

What I observe
1. test_metrics is timing out at 900s
2. It sometimes passes (1 out of 3 times)
3. It does not consistently fail on one specific test, so whatever the
problem is, it is exogenous to any individual test

I am speculating, but trying these two changes
1. Health-check the metrics server before starting the Serve metrics tests
2. Split the metrics tests into two files. This would help if one large
metrics file is taking more than 900s to run. When I run it locally, it
takes about 500s, so this is plausible.



https://buildkite.com/ray-project/postmerge/builds/15633#019becb0-fe66-42d2-85e7-2e90d74fba17/L11134

https://buildkite.com/ray-project/postmerge/builds/15633#019becb0-fe64-480d-99d1-68846a93f0f1/L11107

https://buildkite.com/ray-project/postmerge/builds/15633#019becb0-fe61-4c22-bcb2-4909d6ddb6f8/L4432

---------

Signed-off-by: abrar <abrar@anyscale.com>
## Description
Add a test for the repr function of MapWorker to ensure that the string
is always produced even if args aren't recoverable.

This is adding the test related to this PR:
ray-project#58731


Signed-off-by: Goutam <goutam@anyscale.com>
…ay-project#60377)

- Add new `s3_url` data format that lists JPEG files from S3 and
downloads images via `map_batches`

---------

Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
…ject#60513)

## Summary
- Add CODEOWNERS entry for
`/doc/source/data/doc_code/working-with-llms/` to assign ownership to
the ray-llm team

## Test plan
- N/A (CODEOWNERS change only)

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
## Description
### Goal: 
Make `ray.data._internal.logical.operators` a package entry point with
short imports and an alphabetized `__all__`.

### Changes:
- Add/complete `__all__` in operator modules and re-export via
`__init__.py`.
- Update imports to `from ray.data._internal.logical.operators import
...`.
- Keep intra-operator dependencies using module paths to avoid cycles.

## Related issues
Related to ray-project#60204 

## Additional information

---------

Signed-off-by: 400Ping <jiekai.chang326@gmail.com>
Signed-off-by: Jie-Kai Chang <fourhundredping@gmail.com>
Signed-off-by: 400Ping <fourhundredping@gmail.com>
Co-authored-by: Jie-Kai Chang <fourhundredping@gmail.com>
…ray-project#60334)

## Description
> PR1: Remove in‑place mutations in logical rules

#### Issue goal: 
Make LogicalOperator immutable and comparable to prevent in-place
mutations during optimization.

The issue is split into two PRs for easier review. 
#### This PR focuses on: 
changing logical optimization rules from in-place edits to copy/rebuild
the DAG, as a precursor to immutability.


## Related issues
> Link related issues: "Fixes ray-project#60312", "Closes ray-project#60312", or "Related to
ray-project#60312".

## Additional information
#### Implementation details
Update limit_pushdown, predicate_pushdown, and inherit_batch_format to
rebuild nodes and rewire inputs instead of mutating dependencies.
Optimization semantics are unchanged; only construction changes.
#### API changes
None externally; internal logic switches from in-place mutation to
rebuilding.
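
The difference between mutating and rebuilding can be shown with a toy plan. The sketch below uses a hypothetical frozen dataclass `Op` and a simplified rule `push_limit_down`; neither is Ray's `LogicalOperator` nor its actual `limit_pushdown` rule, and the rewrite shown is only valid for row-preserving maps.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class Op:
    """Hypothetical immutable, comparable plan node."""

    name: str
    inputs: Tuple["Op", ...] = ()
    limit: Optional[int] = None


def push_limit_down(op: Op) -> Op:
    """Rebuild `Limit -> Map -> X` as `Map -> Limit -> X` without mutating any node."""
    if op.name == "Limit" and op.inputs and op.inputs[0].name == "Map":
        map_op = op.inputs[0]
        new_limit = replace(op, inputs=map_op.inputs)  # copy with rewired inputs
        return replace(map_op, inputs=(new_limit,))
    return op


read = Op("Read")
plan = Op("Limit", (Op("Map", (read,)),), limit=10)
optimized = push_limit_down(plan)
```

Because the nodes are frozen, the original `plan` is still intact after the rule runs, so it can be compared against `optimized` or reused elsewhere.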

---------

Signed-off-by: yaommen <myanstu@163.com>
…0507)

## Summary
- Remove all runtime pip install commands from basic_llm_example.py
- Add `doc/source/data/doc_code/working-with-llms/` to LLM CI test rules

## Why is this change needed?

### 1. Remove unnecessary pip installs
The basic_llm_example.py doc test was running pip install commands at
runtime:
- `pip install --upgrade ray[llm]`
- `pip install --upgrade transformers`
- `pip install numpy==1.26.4`

These are unnecessary because the llmgpubuild Docker image already has
all dependencies installed via the lock file.

The `pip install --upgrade transformers` line specifically caused the
test to break when transformers v5.0.0 was released (Jan 26, 2026),
because vLLM 0.13.0 imports `ALLOWED_LAYER_TYPES` from
`transformers.configuration_utils` - a constant that was split into
separate constants in v5.

### 2. Fix CI test triggering
Changes to `doc/source/data/doc_code/working-with-llms/` were not
triggering LLM CI tests because the path wasn't in
`.buildkite/test.rules.txt`. The tests have `team:llm` and `gpu` tags
and run on the llmgpubuild image, so they should be triggered by the LLM
rules.

## Related issue
- vLLM issue: vllm-project/vllm#31181

## Test plan
- [ ] LLM CI tests should now be triggered for this PR
- [ ] `//doc:source/data/doc_code/working-with-llms/basic_llm_example`
test should pass

---------

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
The test test_proxy_router_updated_replicas_then_gcs_failure was failing
with httpx.ReadTimeout because it didn't ensure the proxy's replica
queue length cache was populated before killing GCS. The equivalent
handle test (test_handle_router_updated_replicas_then_gcs_failure) uses
check_cache_populated=True to ensure the cache is populated, but the
proxy test was missing this check.


https://buildkite.com/ray-project/postmerge/builds/15633#019becb0-fe62-4df7-b279-6e41f6cbd6c3/L1137

https://buildkite.com/ray-project/postmerge/builds/15625#019bebd9-ce8e-4cd0-bac9-c6659cb3c659/L1111

https://buildkite.com/ray-project/postmerge/builds/15625#019bebd9-ce85-48a0-84df-530baea6c481/L1111

I was not able to repro this locally since this is a timing issue
between when the GCS is killed and new replica getting added + probe
happening.

Signed-off-by: abrar <abrar@anyscale.com>
## Description
The existing release tests only include TPC-H query Q1. To achieve full
coverage, we plan to add the remaining 21 TPC-H queries to our test
suite, for a total of 22 test cases.

This PR moves the TPC-H tests into a new folder and extracts common logic
into common.py for future tests.

Also,
[examples.citusdata.com/tpch_queries.html](https://examples.citusdata.com/tpch_queries.html)
is no longer available, so this PR updates the link to
[tpc.org/tpch](https://www.tpc.org/tpch/)

Release test run to verify that the changes are correct:
https://buildkite.com/ray-project/release/builds/77070/steps/canvas

---------

Signed-off-by: You-Cheng Lin <mses010108@gmail.com>
…ray-project#60271)

## Description
- Limit the user-code event loop's default ThreadPoolExecutor size to
the deployment's ray_actor_options["num_cpus"] (fractional values round
up; values <= 0 leave the defaults).
- This ensures asyncio.to_thread in Serve replicas respects the CPU
reservation and avoids oversubscription.
- Added a Serve test that verifies the default executor's max_workers
matches num_cpus.
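
The sizing rule in the list above can be sketched as follows. The helper names `executor_size_for` and `configure_default_executor` are hypothetical; `math.ceil`, `ThreadPoolExecutor`, and `loop.set_default_executor` are standard library calls, and how Serve wires this into a replica is not shown.

```python
import asyncio
import math
from concurrent.futures import ThreadPoolExecutor


def executor_size_for(num_cpus):
    """Return max_workers for a CPU reservation, or None to keep the default."""
    if num_cpus is None or num_cpus <= 0:
        return None
    return math.ceil(num_cpus)  # fractional reservations round up


def configure_default_executor(loop, num_cpus):
    size = executor_size_for(num_cpus)
    if size is None:
        return None  # leave the loop's default executor untouched
    executor = ThreadPoolExecutor(max_workers=size)
    # asyncio.to_thread and run_in_executor(None, ...) use this executor.
    loop.set_default_executor(executor)
    return executor
```

Calling `configure_default_executor(loop, 0.5)` would install a single-worker pool, while `num_cpus=0` leaves asyncio's default sizing in place.
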
## Related issues
> Link related issues: "Fixes ray-project#59750 ", "Closes ray-project#59750 ", or "Related to
ray-project#59750 ".

## Additional information
- Tests run:
  - `python -m pytest python/ray/serve/tests/unit/test_user_callable_wrapper.py`
  - `python -m pytest python/ray/serve/tests/test_replica_sync_methods.py`

---------

Signed-off-by: yaommen <myanstu@163.com>
…ct spaces (ray-project#60451)

## Description

We don't natively build encoders for dict spaces and so we don't account
for them in the forward method of the DQN rlm.
This is an issue because users may still want to use encoder configs for
dictionaries or they may want to override DQNRLModule.build_encoder etc.

This PR makes a fix and introduces testing for different types of
forward passes, observation spaces, and configurations for the DQN RL
Module.

---------

Signed-off-by: Artur Niederfahrenhorst <attaismyname@googlemail.com>
## Description
This should say `False`

## Related issues

## Additional information

---------

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Signed-off-by: iamjustinhsu <140442892+iamjustinhsu@users.noreply.github.com>
Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu>
…ct#60222)

The actor repr name is _only_ used in task receiver when replying to the
`PushTask` RPC for an actor creation task. Making it one of the task
execution outputs instead of a stateful field. I've opted to make it an
outparam for the core worker task execution callback as well, rather
than adding a custom method for it.

My meta goal is to make the logic that handles a task execution result
in the task receiver fully stateless.

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…eue` (ray-project#60538)

ray-project#60017 and
ray-project#60228 refactored the
`FIFOBundleQueue` interface and renamed `FIFOBundleQueue.popleft` to
`FIFOBundleQueue.get_next`. However, this name change wasn't reflected
in the `UnionOperator` implementation, and as a result the operator can
error when it clears its output queue.

This change also fixes the flaky `test_union.py`.

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
…alues (ray-project#60488)

## Description

This PR improves numerical stability in preprocessor scalers
(`StandardScaler` and `MinMaxScaler`) by extending division-by-zero
handling to also cover near-zero values.

**Current behavior:**  
The scalers only check for exact zero values (e.g., `std == 0` or `diff
== 0`), which can lead to numerical instability when dealing with
near-zero values (e.g., `std = 1e-10`). This is a common edge case in
real-world data preprocessing where columns have extremely small
variance or range.

**Changes made:**
- Added `_EPSILON = 1e-8` constant to define near-zero threshold
(following sklearn's approach)
- Updated `StandardScaler._transform_pandas()` and `_scale_column()` to
use `< _EPSILON` instead of `== 0`
- Updated `MinMaxScaler._transform_pandas()` similarly
- Added comprehensive test cases covering near-zero and exact-zero edge
cases

**Impact:**  
This change prevents numerical instability (NaN/inf values) when scaling
columns with very small but non-zero variance/range, while maintaining
backward compatibility for normal use cases.
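
A pure-Python sketch of the near-zero guard. It assumes that a divisor below the threshold is replaced by 1; that substitution and the function name `standard_scale` are assumptions made for illustration, not a copy of the scaler code. The threshold value is the one quoted above.

```python
import statistics

_EPSILON = 1e-8  # threshold value quoted in the description


def standard_scale(values):
    """Scale to zero mean and unit std, guarding against near-zero std."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std < _EPSILON:
        # Near-constant column: dividing by a tiny std would blow values up.
        # Assumption for this sketch: fall back to a divisor of 1.
        std = 1.0
    return [(value - mean) / std for value in values]


normal = standard_scale([0.0, 2.0])
near_constant = standard_scale([1.0, 1.0 + 1e-10, 1.0 - 1e-10])
```

Ordinary columns are scaled exactly as before, while the near-constant column stays close to zero instead of being amplified by a factor of roughly 1e10.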

## Related issues

Addresses TODO comments in `python/ray/data/preprocessors/scaler.py`:
- Line 117: `# TODO: extend this to handle near-zero values.`
- Line 271: `# TODO: extend this to handle near-zero values.`

## Additional information

### Implementation Details

**Epsilon Value Selection:**  
The threshold `_EPSILON = 1e-8` was chosen to align with
industry-standard practices (e.g., sklearn, numpy). This value
effectively handles floating-point precision issues without incorrectly
treating legitimate small variances as zero.

**Modified Methods:**
1. `StandardScaler._transform_pandas()` - Pandas transformation path
2. `StandardScaler._scale_column()` - PyArrow transformation path
3. `MinMaxScaler._transform_pandas()` - Pandas transformation path

**Backward Compatibility:**  
✅ For normal data (variance/range > 1e-8), behavior is **identical** to
before
✅ Only triggers new logic for extreme edge cases (variance/range < 1e-8)
✅ All existing tests pass without modification

### Test Coverage

Added three new test cases:
1. `test_standard_scaler_near_zero_std()` - Tests data with std ≈
4.7e-11
2. `test_min_max_scaler_near_zero_range()` - Tests data with range ≈
1e-10
3. `test_standard_scaler_exact_zero_std()` - Regression test for exact
zero case

Signed-off-by: slfan1989 <slfan1989@apache.org>
…ject#60479)

## Description
Add type annotations to Ray's annotation decorators so type checkers can
properly infer return types through decorated functions.

Before this change, decorators like `@PublicAPI` caused type checkers to
lose function signature information. After this change, decorated
functions retain their full type signatures.
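
One common way to keep signatures visible through a decorator is to type it with a `TypeVar` bound to `Callable`. The sketch below is a generic illustration: `public_api_sketch` and the `_annotated` attribute are hypothetical and do not reproduce the annotations in Ray's `annotations.py`.

```python
from typing import Callable, List, TypeVar

F = TypeVar("F", bound=Callable[..., object])


def public_api_sketch(func: F) -> F:
    """Hypothetical marker decorator; returning F preserves the decorated signature."""
    func._annotated = "PublicAPI"  # type: ignore[attr-defined]
    return func


@public_api_sketch
def take_all(limit: int) -> List[int]:
    return list(range(limit))


# A type checker now sees take_all as (limit: int) -> List[int]
# instead of an untyped callable.
rows = take_all(3)
```

Without the `F -> F` annotation, a checker typically has to treat the decorated function as untyped, which is the loss of information described above.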

## Related issues
Related to ray-project#59303

## Additional information
Running pyrefly with ray was complaining when calling take_all() which
led me down this rabbit hole. I tried to add annotations to all the
public facing decorators I could find that had reasonably clear fixes.
I did some drive-by type fixes in annotations.py to make it fully pass

---------

Signed-off-by: Julian Meyers <Julian@MeyersWorld.com>
## Description
Moved arrow_utils.py to a direct subpackage of `ray.data.util`.

## Related issues
Closes ray-project#60420 

## Additional information
Moved the file to a `ray.data` subpackage and updated the import paths.
This addresses a minor readability issue.

---------

Signed-off-by: Hyunoh-Yeo <hyunoh.yeo@gmail.com>
Signed-off-by: Hyunoh Yeo <113647638+Hyunoh-Yeo@users.noreply.github.com>
… behavior (ray-project#60394)

## Summary

This PR fixes a startup crash when running `ray start --head
--no-redirect-output` (and the same flag in KubeRay-generated `ray
start` commands). The CLI previously routed this option through a
deprecated `RayParams.redirect_output` parameter, which raises a
`DeprecationWarning` as an exception and prevents Ray from starting. The
PR also corrects the effective behavior of `--no-redirect-output` by
using the supported mechanism (`RAY_LOG_TO_STDERR=1`) to disable log
redirection.

## Description

### What happened 

- The CLI option `--no-redirect-output` was mapped to
`RayParams.redirect_output`.
- `RayParams._check_usage()` raises `DeprecationWarning("The
redirect_output argument is deprecated.")` whenever `redirect_output` is
not `None`, which terminates `ray start`.
- Additionally, the previous mapping effectively inverted intent by
setting `redirect_output=True` when `--no-redirect-output` was provided.

### What was expected to happen

- `ray start --no-redirect-output` should **not crash**.
- It should disable redirecting non-worker stdout/stderr into
`.out/.err` files (i.e., logs should go to stderr/console), consistent
with the flag name and help text.

### What this PR changes

- Stop passing the deprecated `redirect_output` argument into
`RayParams` from the `ray start` CLI.
- When `--no-redirect-output` is set, configure the supported behavior
by setting `RAY_LOG_TO_STDERR=1`.
- This leverages the existing fallback logic in
`Node.should_redirect_logs()` which checks `RAY_LOG_TO_STDERR` when
`RayParams.redirect_output` is `None`.
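
The flow described above can be sketched with two small functions. Both helpers are hypothetical simplifications: the real CLI handler and `Node.should_redirect_logs()` are not reproduced here, and the exact comparison against `"1"` is an assumption. Only the environment variable name and the "fall back when the parameter is `None`" rule come from the description.

```python
import os


def apply_no_redirect_output(no_redirect_output, environ=os.environ):
    """Hypothetical CLI helper: translate the flag into the environment variable."""
    if no_redirect_output:
        environ["RAY_LOG_TO_STDERR"] = "1"


def should_redirect_logs(redirect_output, environ=os.environ):
    # Fallback described above: consult the env var only when the parameter is None.
    if redirect_output is None:
        return environ.get("RAY_LOG_TO_STDERR") != "1"
    return redirect_output


# Use a plain dict as the environment so the example has no side effects.
env = {}
redirect_before = should_redirect_logs(None, env)
apply_no_redirect_output(True, env)
redirect_after = should_redirect_logs(None, env)
```

The deprecated parameter is never passed, so the deprecation check is not triggered and the flag takes effect through the environment variable alone.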

### Testing
<img width="1280" height="468" alt="image"
src="https://github.com/user-attachments/assets/6eb32b2e-80fa-4c05-b308-1700e92b1efb"
/>


## Related issues
Closes ray-project#60367

---------

Signed-off-by: daiping8 <dai.ping88@zte.com.cn>
…ray-project#60526)

## Description
Currently we use `get_browsers_no_post_put_middleware` to block PUT/POST
requests from browsers since these endpoints are not intended to be
called from a browser context (e.g., via DNS rebinding or CSRF).
However, DELETE methods were not blocked, allowing browser-based
requests to delete jobs or shut down Serve applications.

This PR switches from a blocklist (POST/PUT) to an allowlist
(GET/HEAD/OPTIONS) approach, ensuring only explicitly safe methods are
permitted from browsers. This also covers PATCH and any future HTTP
methods by default.
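
The allowlist approach can be sketched as a small predicate. The browser detection shown here (a `Mozilla`-style `User-Agent` prefix) is an assumption for illustration, and `is_browser_request` and `is_allowed` are hypothetical names; the set of safe methods is the one named above.

```python
SAFE_BROWSER_METHODS = frozenset({"GET", "HEAD", "OPTIONS"})


def is_browser_request(headers):
    # Assumption for illustration: treat a Mozilla-style User-Agent as a browser.
    return headers.get("User-Agent", "").startswith("Mozilla")


def is_allowed(method, headers):
    """Allow everything from non-browsers; only safe methods from browsers."""
    if not is_browser_request(headers):
        return True
    return method.upper() in SAFE_BROWSER_METHODS


browser = {"User-Agent": "Mozilla/5.0"}
script = {"User-Agent": "python-requests/2.31"}
```

Compared with a blocklist of POST and PUT, the allowlist rejects DELETE, PATCH, and any method added later without needing to enumerate them.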

## Related issues

---------

Signed-off-by: sampan <sampan@anyscale.com>
Co-authored-by: sampan <sampan@anyscale.com>
…ct#60536)

Use rayci.env to consolidate MANYLINUX_VERSION. In the future, rayci.env
will also be used to consolidate RAY_VERSION and other related fields.

Signed-off-by: andrew <andrew@anyscale.com>
…kerfiles (ray-project#60386)

- Add --mount=type=cache to ray-core and ray-java Dockerfiles
- Update ray-cpp-core to use shared cache ID
(ray-bazel-cache-${HOSTTYPE})
- Configure Bazel repository cache inside the mount for faster
dependency resolution
- Auto-disable remote cache uploads when BUILDKITE_BAZEL_CACHE_URL is
empty, preventing 403 errors on local builds without AWS credentials

All python-agnostic images now share the same Bazel cache per
architecture, maximizing cache reuse while preventing cross-architecture
toolchain conflicts.

Signed-off-by: andrew <andrew@anyscale.com>
Signed-off-by: abrar <abrar@anyscale.com>
@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request is an automated daily merge from master to main. It contains a large number of changes, primarily focused on a major refactoring of the CI/CD and build system. Key changes include migrating to a more modular, Wanda-based build process, improving multi-architecture support, dropping support for Python 3.9, and adding support for Python 3.13. There are also extensive documentation updates, including new examples, better organization, and clarifications. The overall changes appear to be a significant step forward in modernizing the project's infrastructure. I've found one minor inconsistency in a CI configuration file, for which I've left a comment.

RAYCI_DISABLE_JAVA: "false"
RAYCI_WANDA_ALWAYS_REBUILD: "true"
JDK_SUFFIX: "-jdk"
ARCH_SUFFIX: "aarch64"

medium

The ARCH_SUFFIX environment variable is defined here for the manylinux-cibase-jdk-aarch64 step, but it's not defined for the other aarch64 step (manylinux-cibase-aarch64) or for any of the x86_64 steps. This seems inconsistent. Based on the build scripts, this variable doesn't appear to be used. For consistency and to avoid confusion, consider removing this line.
