daily merge: master → main 2026-01-28 #759
Conversation
stop using the large oss ci test base Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
…#59896) ## Description Addresses a critical issue in the `DefaultAutoscalerV2`, where nodes were not being properly scaled from zero. With this update, clusters managed by Ray will now automatically provision additional nodes when there is workload demand, even when starting from an idle (zero-node) state. ## Related issues Closes ray-project#59682 ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: Hsien-Cheng Huang <ryankert01@gmail.com> Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: sampan <sampan@anyscale.com> Co-authored-by: sampan <sampan@anyscale.com>
…#59616) ## Description We observed that raylet frequently emits log messages of the form "Dropping sync message with stale version", which can become quite noisy in practice. This behavior occurs because raylet does not update the message version for sync messages received from the GCS, and stale-version broadcast messages are expected to be skipped by default. As a result, these log entries are generated repeatedly even though this is normal and non-actionable behavior. Given that this does not indicate an error or unexpected state, logging it at the INFO level significantly increases log noise and makes it harder to identify genuinely important events. We propose demoting this log from INFO to DEBUG in RaySyncerBidiReactorBase to keep raylet logs cleaner while still preserving the information for debugging purposes when needed. ## Related issues Closes ray-project#59615 ## Additional information - Change log level from INFO to DEBUG for "Dropping sync message with stale version" in RaySyncerBidiReactorBase. Signed-off-by: Mao Yancan <yancan.mao@bytedance.com> Co-authored-by: Mao Yancan <yancan.mao@bytedance.com>
## Description Runs linkcheck on docs, in particular for RLlib, where we've moved tuned-examples to examples/algorithms. Further, updated GitHub links that were automatically redirected. There are problems with some of the RLlib examples missing, but I'm going to fix these in the algorithm premerge PRs, i.e., ray-project#59007 --------- Signed-off-by: Mark Towers <mark@anyscale.com> Co-authored-by: Mark Towers <mark@anyscale.com>
β¦9787) Signed-off-by: ahao-anyscale <ahao@anyscale.com>
…project#60050) Add support for authenticating HTTPS downloads in runtime environments using bearer tokens via the RAY_RUNTIME_ENV_BEARER_TOKEN environment variable. Fixes [ray-project#46833](ray-project#46833) Signed-off-by: Denis Khachyan <khachyanda@gmail.com>
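A minimal usage sketch of the feature described above, assuming the environment variable is read by the node that downloads the remote package; the URL and token are placeholders:

```python
import os
import ray

# Hypothetical token and package URL, for illustration only.
# RAY_RUNTIME_ENV_BEARER_TOKEN must be set before the Ray node that
# downloads the runtime env package starts (or be exported cluster-wide).
os.environ["RAY_RUNTIME_ENV_BEARER_TOKEN"] = "my-secret-token"

ray.init(
    runtime_env={
        # The HTTPS download request would then carry
        # an "Authorization: Bearer <token>" header.
        "working_dir": "https://example.com/private/my_project.zip",
    }
)
```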
…ect#60014) ## Description This PR fixes a critical deadlock issue in Ray Client that occurs when garbage collection triggers `ClientObjectRef.__del__()` while the DataClient lock is held. When using Ray Client, a deadlock can occur in the following scenario: 1. Main thread acquires DataClient.lock (e.g., in _async_send()) 2. Garbage collection is triggered while holding the lock 3. GC calls `ClientObjectRef.__del__()` 4. `__del__()` attempts to call call_release() → _release_server() → DataClient.ReleaseObject() 5. ReleaseObject() tries to acquire the same DataClient.lock 6. Deadlock: The same thread tries to acquire a non-reentrant lock it already holds ## Related issues > Fixes ray-project#59643 ## Additional information This PR implements a deferred release pattern that completely avoids the deadlock: 1. Deferred Release Queue: Introduces _release_queue (a thread-safe queue.SimpleQueue) to collect object IDs that need to be released 2. Background Release Thread: Adds _release_thread that processes the release queue asynchronously 3. Non-blocking `__del__`: `ClientObjectRef.__del__()` now only puts IDs into the queue (no lock acquisition) --------- Signed-off-by: redgrey1993 <ulyer555@hotmail.com> Co-authored-by: redgrey1993 <ulyer555@hotmail.com>
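A simplified sketch of the deferred-release pattern described above (not the actual Ray Client code; the class and attribute names here are illustrative):

```python
import queue
import threading


class DataClientSketch:
    def __init__(self):
        self.lock = threading.Lock()  # non-reentrant, as in the deadlock scenario
        self._release_queue = queue.SimpleQueue()
        self._release_thread = threading.Thread(
            target=self._process_releases, daemon=True
        )
        self._release_thread.start()

    def release(self, object_id: bytes) -> None:
        # Called from ClientObjectRef.__del__: it never touches the lock,
        # so GC running while the lock is held cannot deadlock.
        self._release_queue.put(object_id)

    def _process_releases(self) -> None:
        # Background thread drains the queue and does the lock-protected work.
        while True:
            object_id = self._release_queue.get()
            with self.lock:
                self._send_release_request(object_id)

    def _send_release_request(self, object_id: bytes) -> None:
        ...  # placeholder for the RPC that tells the server to release the object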
Context --- This change revisits the `HashShuffleAggregator` protocol by - Removing the global lock (per aggregator) - Making the shard-accepting flow lock-free - Relocating all state from `ShuffleAggregation` into the Aggregator itself - Adding dynamic compaction (exponentially increasing compaction period) to amortize compaction costs - Adding debugging state dumps ## Related issues > Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to ray-project#1234". ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
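The "exponentially increasing compaction period" idea can be illustrated with a rough, generic sketch (this is not the aggregator's actual code; the initial period and growth factor are assumptions):

```python
class CompactionScheduler:
    """Amortizes compaction cost by growing the interval between compactions."""

    def __init__(self, initial_period: int = 4, growth_factor: int = 2):
        self._period = initial_period        # compact after this many accepted shards
        self._growth = growth_factor
        self._since_last_compaction = 0

    def on_shard_accepted(self) -> bool:
        """Returns True when a compaction should run now."""
        self._since_last_compaction += 1
        if self._since_last_compaction >= self._period:
            self._since_last_compaction = 0
            self._period *= self._growth     # exponentially increase the period
            return True
        return False
```

With this kind of schedule, the number of compactions grows only logarithmically with the number of accepted shards, which is what amortizes the cost.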
Adding CONSTRAINTS_FILE docker arg for ray base-deps image release test run: https://buildkite.com/ray-project/release/builds/74879 --------- Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
## Description 1. Jax dependency is introduced in ray-project#58322. 2. The current test environment is for CUDA 12.1, which limits the jax version to below 0.4.14. 3. jax <= 0.4.14 does not support py3.12. 4. Skip jax tests if they run against py3.12+. --------- Signed-off-by: Lehui Liu <lehui@anyscale.com>
…se (ray-project#60080) Signed-off-by: Future-Outlier <eric901201@gmail.com>
## Description Add model inference release test that closely reflects user workloads. Release test run: https://console.anyscale-staging.com/cld_vy7xqacrvddvbuy95auinvuqmt/prj_xqmpk8ps6civt438u1hp5pi88g/jobs/prodjob_glehkcquv9k26ta69f8lkc94nl?job-logs-section-tabs=application_logs&job-tab=overview&metrics-tab=data ## Related issues > Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to ray-project#1234". ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: Goutam <goutam@anyscale.com>
Reverts ray-project#59983. The symlink does not work with the newer version of wanda, where the newer version of wanda is doing the right thing.
…ct#59987) - Bump .rayciversion from 0.21.0 to 0.25.0 - Move rules files to .buildkite/ with *.rules.txt naming convention - Add always.rules.txt for always-run lint rules - Add test.rules.test.txt with test cases - Add test-rules CI step in cicd.rayci.yml (auto-discovery) - Update macOS config to use new rules file paths Topic: update-rayci-latest Signed-off-by: andrew <andrew@anyscale.com>
…ay-project#60057) ## Summary When running prefill-decode disaggregation with NixlConnector and data parallelism, both prefill and decode deployments were using the same port base for their ZMQ side channel. This caused "Address already in use" errors when both deployments had workers on the same node: ``` zmq.error.ZMQError: Address already in use (addr='tcp://10.0.75.118:40009') Exception in thread nixl_handshake_listener ``` ## Changes Fix by setting different `NIXL_SIDE_CHANNEL_PORT_BASE` values for prefill (40000) and decode (41000) configs to ensure port isolation. ## Test plan - Run `test_llm_serve_prefill_decode_with_data_parallelism` - should complete without timeout - The test previously hung forever waiting for "READY message from DP Coordinator" Signed-off-by: Seiji Eicher <seiji@anyscale.com>
…se (ray-project#60092) Signed-off-by: Future-Outlier <eric901201@gmail.com>
- Fix ProgressBar to honor `use_ray_tqdm` in `DataContext`. - Note that `tqdm_ray` is designed to work in non-interactive contexts (workers/actors) by sending JSON progress updates to the driver. --------- Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
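For reference, a short sketch of toggling the flag that the progress bar is now expected to honor (assuming the `use_ray_tqdm` field on `DataContext` described above):

```python
import ray

ctx = ray.data.DataContext.get_current()
# Fall back to plain tqdm on the driver instead of the distributed
# tqdm_ray implementation (e.g., in a purely interactive session).
ctx.use_ray_tqdm = False

ds = ray.data.range(1000).map(lambda row: row)
ds.materialize()  # progress reporting should respect the flag set above
```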
…ray-project#59933) ## Description The `DefaultAutoscaler2` implementation needs an `AutoscalingCoordinator` and a way to get all of the `_NodeResourceSpec`. Currently, we can't explicitly inject fake implementations of either dependency. This is problematic because the tests need to assume what the implementation of each dependency looks like and use brittle mocks. To solve this: - Add the `FakeAutoscalingCoordinator` implementation to a new `fake_autoscaling_coordinator.py` module (you can use the code below) - `DefaultClusterAutoscalerV2` has two new parameters `autoscaling_coordinator: Optional[AutoscalingCoordinator] = None` and `get_node_counts: Callable[[], Dict[_NodeResourceSpec, int]] = get_node_resource_spec_and_count`. If `autoscaling_coordinator` is None, you can use the default implementation. - Update `test_try_scale_up_cluster` to use the explicit seams rather than mocks. Where possible, assert against the public interface rather than implementation details ## Related issues Closes ray-project#59683 --------- Signed-off-by: 400Ping <fourhundredping@gmail.com> Signed-off-by: Ping <fourhundredping@gmail.com>
## Description RLlib's rayci.yml [file](https://github.com/ray-project/ray/blob/master/.buildkite/rllib.rayci.yml) and the BUILD.bazel [file](https://github.com/ray-project/ray/blob/master/rllib/BUILD.bazel) are disconnected, such that there are old tags in the BUILD file that are not in the rayci file and vice versa. This PR attempts to clean up both files without modifying which tests are or aren't run currently --------- Signed-off-by: Mark Towers <mark@anyscale.com> Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com> Co-authored-by: Mark Towers <mark@anyscale.com> Co-authored-by: Kamil Kaczmarek <kamil@anyscale.com>
…g tracer file handles (ray-project#60078) This fix resolves Serve's Windows test failure:
```
[2026-01-12T22:52:13Z] =================================== ERRORS ====================================
[2026-01-12T22:52:13Z] _______ ERROR at teardown of test_deployment_remote_calls_with_tracing ________
[2026-01-12T22:52:13Z]
[2026-01-12T22:52:13Z]     @pytest.fixture
[2026-01-12T22:52:13Z]     def cleanup_spans():
[2026-01-12T22:52:13Z]         """Cleanup temporary spans_dir folder at beginning and end of test."""
[2026-01-12T22:52:13Z]         if os.path.exists(spans_dir):
[2026-01-12T22:52:13Z]             shutil.rmtree(spans_dir)
[2026-01-12T22:52:13Z]         os.makedirs(spans_dir, exist_ok=True)
[2026-01-12T22:52:13Z]         yield
[2026-01-12T22:52:13Z]         # Enable tracing only sets up tracing once per driver process.
[2026-01-12T22:52:13Z]         # We set ray.__traced__ to False here so that each
[2026-01-12T22:52:13Z]         # test will re-set up tracing.
[2026-01-12T22:52:13Z]         ray.__traced__ = False
[2026-01-12T22:52:13Z]         if os.path.exists(spans_dir):
[2026-01-12T22:52:13Z] >           shutil.rmtree(spans_dir)
[2026-01-12T22:52:13Z]
[2026-01-12T22:52:13Z] python\ray\serve\tests\test_serve_with_tracing.py:30:
[2026-01-12T22:52:13Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2026-01-12T22:52:13Z] C:\Miniconda3\lib\shutil.py:750: in rmtree
[2026-01-12T22:52:13Z]     return _rmtree_unsafe(path, onerror)
[2026-01-12T22:52:13Z] C:\Miniconda3\lib\shutil.py:620: in _rmtree_unsafe
[2026-01-12T22:52:13Z]     onerror(os.unlink, fullname, sys.exc_info())
[2026-01-12T22:52:13Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2026-01-12T22:52:13Z]
[2026-01-12T22:52:13Z] path = '/tmp/spans/'
[2026-01-12T22:52:13Z] onerror = <function rmtree.<locals>.onerror at 0x000002C0FFBBDA20>
[2026-01-12T22:52:13Z]
[2026-01-12T22:52:13Z]     def _rmtree_unsafe(path, onerror):
[2026-01-12T22:52:13Z]         try:
[2026-01-12T22:52:13Z]             with os.scandir(path) as scandir_it:
[2026-01-12T22:52:13Z]                 entries = list(scandir_it)
[2026-01-12T22:52:13Z]         except OSError:
[2026-01-12T22:52:13Z]             onerror(os.scandir, path, sys.exc_info())
[2026-01-12T22:52:13Z]             entries = []
[2026-01-12T22:52:13Z]         for entry in entries:
[2026-01-12T22:52:13Z]             fullname = entry.path
[2026-01-12T22:52:13Z]             if _rmtree_isdir(entry):
[2026-01-12T22:52:13Z]                 try:
[2026-01-12T22:52:13Z]                     if entry.is_symlink():
[2026-01-12T22:52:13Z]                         # This can only happen if someone replaces
[2026-01-12T22:52:13Z]                         # a directory with a symlink after the call to
[2026-01-12T22:52:13Z]                         # os.scandir or entry.is_dir above.
[2026-01-12T22:52:13Z]                         raise OSError("Cannot call rmtree on a symbolic link")
[2026-01-12T22:52:13Z]                 except OSError:
[2026-01-12T22:52:13Z]                     onerror(os.path.islink, fullname, sys.exc_info())
[2026-01-12T22:52:13Z]                     continue
[2026-01-12T22:52:13Z]                 _rmtree_unsafe(fullname, onerror)
[2026-01-12T22:52:13Z]             else:
[2026-01-12T22:52:13Z]                 try:
[2026-01-12T22:52:13Z] >                   os.unlink(fullname)
[2026-01-12T22:52:13Z] E                   PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: '/tmp/spans/15464.txt'
[2026-01-12T22:52:13Z]
[2026-01-12T22:52:13Z] C:\Miniconda3\lib\shutil.py:618: PermissionError
```
**Cause:** The `setup_local_tmp_tracing.py` module opens a file handle for the `ConsoleSpanExporter` that is never explicitly closed. On Windows, files cannot be deleted while they're open, causing `shutil.rmtree` to fail with `PermissionError: [WinError 32]` during the `cleanup_spans` fixture teardown.
**Fix:** Added `trace.get_tracer_provider().shutdown()` in the `ray_serve_with_tracing` fixture teardown to properly flush and close the span exporter's file handles before the cleanup fixture attempts to delete the spans directory. --------- Signed-off-by: doyoung <doyoung@anyscale.com>
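A minimal sketch of this kind of teardown, assuming an OpenTelemetry SDK tracer provider is installed by the test setup (the fixture name here is illustrative, not the actual one in the Serve test suite):

```python
import pytest
from opentelemetry import trace


@pytest.fixture
def serve_with_tracing_sketch():
    # ... set up Serve with tracing enabled, exporting spans to files ...
    yield
    # Flush and close the span exporter's file handles so Windows can
    # delete the spans directory during the cleanup fixture.
    provider = trace.get_tracer_provider()
    if hasattr(provider, "shutdown"):  # the no-op default provider has no shutdown()
        provider.shutdown()
```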
### Why are these changes needed?
When `fit()` is called multiple times on a `Preprocessor`, the `stats_`
dict was not being reset before computing new stats. This caused stale
stats from previous `fit()` calls to persist when stat keys are
data-dependent (e.g., when columns are auto-detected from the data).
This violates the documented behavior in the `fit()` docstring:
> Calling it more than once will overwrite all previously fitted state:
> `preprocessor.fit(A).fit(B)` is equivalent to `preprocessor.fit(B)`.
**Example of the bug:**
```python
# Preprocessor that auto-detects columns from data
preprocessor = DataDependentPreprocessor()
# Dataset A has columns: a, b
dataset_a = ray.data.from_items([{"a": 1.0, "b": 10.0}, ...])
preprocessor.fit(dataset_a)
# stats_ = {"mean(a)": 2.0, "mean(b)": 20.0}
# Dataset B has columns: b, c (no "a" column)
dataset_b = ray.data.from_items([{"b": 100.0, "c": 1000.0}, ...])
preprocessor.fit(dataset_b)
# BUG: stats_ = {"mean(a)": 2.0, "mean(b)": 200.0, "mean(c)": 2000.0}
# "mean(a)" is STALE - it should not exist!
# EXPECTED: stats_ = {"mean(b)": 200.0, "mean(c)": 2000.0}
```
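A sketch of the corresponding fix, assuming `fit()` delegates stat computation to a `_fit()` hook; the names follow the description above rather than the exact Ray source:

```python
class PreprocessorSketch:
    def fit(self, ds):
        # Reset previously fitted state so that fit(A).fit(B) behaves
        # exactly like fit(B), even when stat keys are data-dependent.
        self.stats_ = {}
        return self._fit(ds)  # subclasses compute and store stats_ here

    def _fit(self, ds):
        raise NotImplementedError
```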
---------
Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
Signed-off-by: xgui <xgui@anyscale.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: sampan <sampan@anyscale.com> Co-authored-by: sampan <sampan@anyscale.com>
…project#60072) Signed-off-by: dayshah <dhyey2019@gmail.com> Signed-off-by: Dhyey Shah <dhyey2019@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…ay-project#60037) ## Description As mentioned in ray-project#59740 (comment), add explicit args in `_AutoscalingCoordinatorActor` constructor to improve maintainability. ## Related issues Follow-up: ray-project#59740 ## Additional information - Pass in mock function in testing as args rather than using `patch` --------- Signed-off-by: machichima <nary12321@gmail.com> Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
…ay-project#60028) Capture the install script content in BuildContext digest by inlining it as a constant and adding install_python_deps_script_digest field. This ensures build reproducibility when the script changes. Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com> Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
the "test-rules" test job was missing the forge dependency Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
This migrates ray wheel builds from a CLI-based approach to wanda-based container builds for x86_64. Changes: - Add ray-wheel.wanda.yaml and Dockerfile for wheel builds - Update build.rayci.yml wheel steps to use wanda - Add wheel upload steps that extract from wanda cache Topic: ray-wheel Signed-off-by: andrew <andrew@anyscale.com>
ray-project#60114) …eed up iter_batches (ray-project#58467)" This reverts commit 2a042d4. ## Description Reverts ray-project#58467 ## Related issues ## Additional information Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
…ctors (ray-project#59850) Signed-off-by: dragongu <andrewgu@vip.qq.com>
…ay-project#60468) What I observe: 1. test_metrics is timing out at 900s 2. It sometimes passes (1 out of 3 times) 3. It does not consistently fail on one specific test, so whatever the problem is, it is exogenous to any individual test I am speculating, but am trying these two changes: 1. Health check the metrics Serve app before starting the Serve metrics tests 2. Split the metrics tests into two files; this would work in the event that one large metrics file is taking more than 900s to run. When I run locally, it takes about 500s, so this is plausible. https://buildkite.com/ray-project/postmerge/builds/15633#019becb0-fe66-42d2-85e7-2e90d74fba17/L11134 https://buildkite.com/ray-project/postmerge/builds/15633#019becb0-fe64-480d-99d1-68846a93f0f1/L11107 https://buildkite.com/ray-project/postmerge/builds/15633#019becb0-fe61-4c22-bcb2-4909d6ddb6f8/L4432 --------- Signed-off-by: abrar <abrar@anyscale.com>
## Description Add Test for repr function for MapWorker to ensure that the string always outputs even if args aren't recoverable. This is adding the test related to this PR: ray-project#58731 ## Related issues > Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to ray-project#1234". ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. Signed-off-by: Goutam <goutam@anyscale.com>
…ay-project#60377) - Add new `s3_url` data format that lists JPEG files from S3 and downloads images via `map_batches` --------- Signed-off-by: xgui <xgui@anyscale.com> Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com> Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
…ject#60513) ## Summary - Add CODEOWNERS entry for `/doc/source/data/doc_code/working-with-llms/` to assign ownership to the ray-llm team ## Test plan - N/A (CODEOWNERS change only) Signed-off-by: Seiji Eicher <seiji@anyscale.com>
## Description ### Goal: Make `ray.data._internal.logical.operators` a package entry point with short imports and an alphabetized `__all__`. ### Changes: Add/complete `__all__` in operator modules and re-export via `__init__.py`. Update imports to use `from ray.data._internal.logical.operators import ...`. Keep intra-operator dependencies using module paths to avoid cycles. ## Related issues Related to ray-project#60204 ## Additional information --------- Signed-off-by: 400Ping <jiekai.chang326@gmail.com> Signed-off-by: Jie-Kai Chang <fourhundredping@gmail.com> Signed-off-by: 400Ping <fourhundredping@gmail.com> Co-authored-by: Jie-Kai Chang <fourhundredping@gmail.com>
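A small illustration of the re-export pattern described above; the module and class names are assumed for illustration and the real package contains more operators:

```python
# ray/data/_internal/logical/operators/__init__.py (illustrative excerpt)
from ray.data._internal.logical.operators.map_operator import MapBatches, MapRows
from ray.data._internal.logical.operators.read_operator import Read

# Alphabetized public surface enabling short imports such as
#   from ray.data._internal.logical.operators import Read
__all__ = [
    "MapBatches",
    "MapRows",
    "Read",
]
```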
…ray-project#60334) ## Description > PR1: Remove in-place mutations in logical rules #### Issue goal: Make LogicalOperator immutable and comparable to prevent in-place mutations during optimization. The issue is split into two PRs for easier review. #### This PR focuses on: changing logical optimization rules from in-place edits to copy/rebuild the DAG, as a precursor to immutability. ## Related issues > Link related issues: "Fixes ray-project#60312", "Closes ray-project#60312", or "Related to ray-project#60312". ## Additional information #### Implementation details Update limit_pushdown, predicate_pushdown, and inherit_batch_format to rebuild nodes and rewire inputs instead of mutating dependencies; optimization semantics are unchanged; only construction changes. #### API changes None externally; internal logic switches from in-place mutation to rebuilding. --------- Signed-off-by: yaommen <myanstu@163.com>
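A generic sketch of the copy/rebuild approach this PR moves the rules toward, using hypothetical operator and rule names rather than Ray's actual classes:

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class Op:
    name: str
    inputs: Tuple["Op", ...] = ()
    limit: Optional[int] = None


def push_down_limit(op: Op) -> Op:
    """Rebuilds the DAG bottom-up instead of mutating nodes in place."""
    new_inputs = tuple(push_down_limit(i) for i in op.inputs)
    if op.name == "Limit" and new_inputs and new_inputs[0].name == "Map":
        map_node = new_inputs[0]
        # Rebuild Limit(Map(x)) as Map(Limit(x)); neither original node changes.
        pushed_limit = Op("Limit", map_node.inputs, limit=op.limit)
        return replace(map_node, inputs=(pushed_limit,))
    return replace(op, inputs=new_inputs)
```

Because every node is frozen and rules return new nodes, the original plan stays intact and optimized plans can be compared structurally.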
…0507) ## Summary - Remove all runtime pip install commands from basic_llm_example.py - Add `doc/source/data/doc_code/working-with-llms/` to LLM CI test rules ## Why is this change needed? ### 1. Remove unnecessary pip installs The basic_llm_example.py doc test was running pip install commands at runtime: - `pip install --upgrade ray[llm]` - `pip install --upgrade transformers` - `pip install numpy==1.26.4` These are unnecessary because the llmgpubuild Docker image already has all dependencies installed via the lock file. The `pip install --upgrade transformers` line specifically caused the test to break when transformers v5.0.0 was released (Jan 26, 2026), because vLLM 0.13.0 imports `ALLOWED_LAYER_TYPES` from `transformers.configuration_utils` - a constant that was split into separate constants in v5. ### 2. Fix CI test triggering Changes to `doc/source/data/doc_code/working-with-llms/` were not triggering LLM CI tests because the path wasn't in `.buildkite/test.rules.txt`. The tests have `team:llm` and `gpu` tags and run on the llmgpubuild image, so they should be triggered by the LLM rules. ## Related issue - vLLM issue: vllm-project/vllm#31181 ## Test plan - [ ] LLM CI tests should now be triggered for this PR - [ ] `//doc:source/data/doc_code/working-with-llms/basic_llm_example` test should pass --------- Signed-off-by: Seiji Eicher <seiji@anyscale.com>
The test test_proxy_router_updated_replicas_then_gcs_failure was failing with httpx.ReadTimeout because it didn't ensure the proxy's replica queue length cache was populated before killing GCS. The equivalent handle test (test_handle_router_updated_replicas_then_gcs_failure) uses check_cache_populated=True to ensure the cache is populated, but the proxy test was missing this check. https://buildkite.com/ray-project/postmerge/builds/15633#019becb0-fe62-4df7-b279-6e41f6cbd6c3/L1137 https://buildkite.com/ray-project/postmerge/builds/15625#019bebd9-ce8e-4cd0-bac9-c6659cb3c659/L1111 https://buildkite.com/ray-project/postmerge/builds/15625#019bebd9-ce85-48a0-84df-530baea6c481/L1111 I was not able to repro this locally since this is a timing issue between when the GCS is killed and new replica getting added + probe happening. Signed-off-by: abrar <abrar@anyscale.com>
## Description The existing release tests only include TPC-H query Q1. To achieve full coverage, we plan to incorporate the remaining 21 TPC-H queries into our test suite, for a total of 22 test cases. This PR moves the TPC-H tests into a new folder and extracts common logic into common.py for other future tests. Also, [examples.citusdata.com/tpch_queries.html](https://examples.citusdata.com/tpch_queries.html) is unavailable now, so this PR updates the link to [tpc.org/tpch](https://www.tpc.org/tpch/). Release test link to make sure the changes are correct: https://buildkite.com/ray-project/release/builds/77070/steps/canvas --------- Signed-off-by: You-Cheng Lin <mses010108@gmail.com>
…y-project#60430) Signed-off-by: Seiji Eicher <seiji@anyscale.com>
…ray-project#60271) ## Description - Limit the user-code event loop's default ThreadPoolExecutor size to the deployment's ray_actor_options["num_cpus"] (fractional values round up, <=0 leaves defaults). - This ensures asyncio.to_thread in Serve replicas respects the CPU reservation and avoids oversubscription. - Added a Serve test that verifies the default executor's max_workers matches num_cpus. ## Related issues > Link related issues: "Fixes ray-project#59750", "Closes ray-project#59750", or "Related to ray-project#59750". ## Additional information - Tests run: - python -m pytest python/ray/serve/tests/unit/test_user_callable_wrapper.py - python -m pytest python/ray/serve/tests/test_replica_sync_methods.py --------- Signed-off-by: yaommen <myanstu@163.com>
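A rough sketch of the sizing logic described above, assuming the replica configures its event loop's default executor at startup; the function name is illustrative:

```python
import asyncio
import math
from concurrent.futures import ThreadPoolExecutor


def configure_default_executor(loop: asyncio.AbstractEventLoop, num_cpus: float) -> None:
    # <= 0 keeps asyncio's default executor sizing; fractional values round up.
    if num_cpus <= 0:
        return
    max_workers = max(1, math.ceil(num_cpus))
    loop.set_default_executor(ThreadPoolExecutor(max_workers=max_workers))


# Example: a replica with ray_actor_options={"num_cpus": 0.5} would get a
# single-threaded default executor, so asyncio.to_thread() cannot oversubscribe
# the CPU reservation.
```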
…ct spaces (ray-project#60451) ## Description We don't natively build encoders for dict spaces and so we don't account for them in the forward method of the DQN RLModule. This is an issue because users may still want to use encoder configs for dictionaries or they may want to override DQNRLModule.build_encoder etc. This PR makes a fix and introduces testing for different types of forward passes, observation spaces and configurations for the DQN RL Module. --------- Signed-off-by: Artur Niederfahrenhorst <attaismyname@googlemail.com>
## Description This should say `False` ## Related issues ## Additional information --------- Signed-off-by: iamjustinhsu <jhsu@anyscale.com> Signed-off-by: iamjustinhsu <140442892+iamjustinhsu@users.noreply.github.com> Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu>
…ct#60222) The actor repr name is _only_ used in task receiver when replying to the `PushTask` RPC for an actor creation task. Making it one of the task execution outputs instead of a stateful field. I've opted to make it an outparam for the core worker task execution callback as well, rather than adding a custom method for it. My meta goal is to make the logic that handles a task execution result in the task receiver fully stateless. --------- Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…eue` (ray-project#60538) ray-project#60017 and ray-project#60228 refactored the `FIFOBundleQueue` interface and renamed `FIFOBundleQueue.popleft` to `FIFOBundleQueue.get_next`. However, this name change wasn't reflected in the `UnionOperator` implementation, and as a result the operator can error when it clears its output queue. This change also fixes the flaky `test_union.py`. Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
…alues (ray-project#60488) ## Description This PR improves numerical stability in preprocessor scalers (`StandardScaler` and `MinMaxScaler`) by extending division-by-zero handling to also cover near-zero values. **Current behavior:** The scalers only check for exact zero values (e.g., `std == 0` or `diff == 0`), which can lead to numerical instability when dealing with near-zero values (e.g., `std = 1e-10`). This is a common edge case in real-world data preprocessing where columns have extremely small variance or range. **Changes made:** - Added `_EPSILON = 1e-8` constant to define near-zero threshold (following sklearn's approach) - Updated `StandardScaler._transform_pandas()` and `_scale_column()` to use `< _EPSILON` instead of `== 0` - Updated `MinMaxScaler._transform_pandas()` similarly - Added comprehensive test cases covering near-zero and exact-zero edge cases **Impact:** This change prevents numerical instability (NaN/inf values) when scaling columns with very small but non-zero variance/range, while maintaining backward compatibility for normal use cases. ## Related issues Addresses TODO comments in `python/ray/data/preprocessors/scaler.py`: - Line 117: `# TODO: extend this to handle near-zero values.` - Line 271: `# TODO: extend this to handle near-zero values.` ## Additional information ### Implementation Details **Epsilon Value Selection:** The threshold `_EPSILON = 1e-8` was chosen to align with industry-standard practices (e.g., sklearn, numpy). This value effectively handles floating-point precision issues without incorrectly treating legitimate small variances as zero. **Modified Methods:** 1. `StandardScaler._transform_pandas()` - Pandas transformation path 2. `StandardScaler._scale_column()` - PyArrow transformation path 3. `MinMaxScaler._transform_pandas()` - Pandas transformation path **Backward Compatibility:** ✅ For normal data (variance/range > 1e-8), behavior is **identical** to before ✅ Only triggers new logic for extreme edge cases (variance/range < 1e-8) ✅ All existing tests pass without modification ### Test Coverage Added three new test cases: 1. `test_standard_scaler_near_zero_std()` - Tests data with std ≈ 4.7e-11 2. `test_min_max_scaler_near_zero_range()` - Tests data with range ≈ 1e-10 3. `test_standard_scaler_exact_zero_std()` - Regression test for exact zero case Signed-off-by: slfan1989 <slfan1989@apache.org>
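A condensed numpy sketch of the near-zero guard being added, simplified relative to the actual pandas/PyArrow code paths in the scalers:

```python
import numpy as np

_EPSILON = 1e-8  # threshold below which a denominator is treated as zero


def standard_scale(values: np.ndarray, mean: float, std: float) -> np.ndarray:
    # Treat near-zero std the same as zero std to avoid huge or NaN outputs.
    if std < _EPSILON:
        return values - mean  # or zeros, depending on the chosen convention
    return (values - mean) / std


def min_max_scale(values: np.ndarray, low: float, high: float) -> np.ndarray:
    diff = high - low
    if diff < _EPSILON:
        return np.zeros_like(values, dtype=float)
    return (values - low) / diff
```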
…ject#60479) ## Description Add type annotations to Ray's annotation decorators so type checkers can properly infer return types through decorated functions. Before this change, decorators like `@PublicAPI` caused type checkers to lose function signature information. After this change, decorated functions retain their full type signatures. ## Related issues Related to ray-project#59303 ## Additional information Running pyrefly with Ray was complaining when calling take_all(), which led me down this rabbit hole. I tried to add annotations to all the public facing decorators I could find that had reasonably clear fixes. I did some drive-by type fixes in annotations.py to make it fully pass --------- Signed-off-by: Julian Meyers <Julian@MeyersWorld.com>
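A minimal sketch of the kind of annotation that preserves signatures through a decorator, using ParamSpec. This is an illustration of the technique only; the real `@PublicAPI` also accepts arguments (e.g., a stability level), which this sketch omits:

```python
from typing import Callable, TypeVar

from typing_extensions import ParamSpec  # use typing.ParamSpec on Python 3.10+

P = ParamSpec("P")
R = TypeVar("R")


def public_api(func: Callable[P, R]) -> Callable[P, R]:
    """Marks a function as public API without erasing its signature."""
    setattr(func, "_is_public_api", True)  # illustrative marker attribute
    return func


@public_api
def take_all(limit: int = 100) -> list:
    return list(range(limit))


# A type checker now infers take_all's return type as `list` instead of
# treating the decorated function as untyped.
```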
## Description Moved arrow_utils.py to a direct subpackage of `ray.data.util`. ## Related issues Closes ray-project#60420 ## Additional information moved file to `ray.data` subpackage. modified import paths. A minor readability issue. --------- Signed-off-by: Hyunoh-Yeo <hyunoh.yeo@gmail.com> Signed-off-by: Hyunoh Yeo <113647638+Hyunoh-Yeo@users.noreply.github.com>
… behavior (ray-project#60394) ## Summary This PR fixes a startup crash when running `ray start --head --no-redirect-output` (and the same flag in KubeRay-generated `ray start` commands). The CLI previously routed this option through a deprecated `RayParams.redirect_output` parameter, which raises a `DeprecationWarning` as an exception and prevents Ray from starting. The PR also corrects the effective behavior of `--no-redirect-output` by using the supported mechanism (`RAY_LOG_TO_STDERR=1`) to disable log redirection. ## Description ### What happened - The CLI option `--no-redirect-output` was mapped to `RayParams.redirect_output`. - `RayParams._check_usage()` raises `DeprecationWarning("The redirect_output argument is deprecated.")` whenever `redirect_output` is not `None`, which terminates `ray start`. - Additionally, the previous mapping effectively inverted intent by setting `redirect_output=True` when `--no-redirect-output` was provided. ### What was expected to happen - `ray start --no-redirect-output` should **not crash**. - It should disable redirecting non-worker stdout/stderr into `.out/.err` files (i.e., logs should go to stderr/console), consistent with the flag name and help text. ### What this PR changes - Stop passing the deprecated `redirect_output` argument into `RayParams` from the `ray start` CLI. - When `--no-redirect-output` is set, configure the supported behavior by setting: `RAY_LOG_TO_STDERR=1` - This leverages the existing fallback logic in `Node.should_redirect_logs()` which checks `RAY_LOG_TO_STDERR` when `RayParams.redirect_output` is `None`. ### Testing <img width="1280" height="468" alt="image" src="https://github.com/user-attachments/assets/6eb32b2e-80fa-4c05-b308-1700e92b1efb" /> ## Related issues Closes ray-project#60367 --------- Signed-off-by: daiping8 <dai.ping88@zte.com.cn>
…ray-project#60526) ## Description Currently we use `get_browsers_no_post_put_middleware` to block PUT/POST requests from browsers since these endpoints are not intended to be called from a browser context (e.g., via DNS rebinding or CSRF). However, DELETE methods were not blocked, allowing browser-based requests to delete jobs or shut down Serve applications. This PR switches from a blocklist (POST/PUT) to an allowlist (GET/HEAD/OPTIONS) approach, ensuring only explicitly safe methods are permitted from browsers. This also covers PATCH and any future HTTP methods by default. ## Related issues > Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to ray-project#1234". ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: sampan <sampan@anyscale.com> Co-authored-by: sampan <sampan@anyscale.com>
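A simplified sketch of the allowlist idea written as an aiohttp middleware; the real dashboard middleware also has to decide what counts as a browser request, and the browser detection below is only a placeholder:

```python
from aiohttp import web

SAFE_METHODS = {"GET", "HEAD", "OPTIONS"}


def _looks_like_browser(request: web.Request) -> bool:
    # Placeholder heuristic; the real check inspects request headers more carefully.
    return "Mozilla" in request.headers.get("User-Agent", "")


@web.middleware
async def browsers_safe_methods_only(request: web.Request, handler):
    # Allowlist: anything outside GET/HEAD/OPTIONS (POST, PUT, DELETE, PATCH,
    # and any future methods) is rejected for browser-originated requests.
    if _looks_like_browser(request) and request.method not in SAFE_METHODS:
        raise web.HTTPMethodNotAllowed(request.method, sorted(SAFE_METHODS))
    return await handler(request)
```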
…ject#60502) Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
…ct#60536) Use it to consolidate MANYLINUX_VERSION. In the future, rayci.env will also be used to consolidate RAY_VERSION and other related fields. Signed-off-by: andrew <andrew@anyscale.com>
…kerfiles (ray-project#60386) - Add --mount=type=cache to ray-core and ray-java Dockerfiles - Update ray-cpp-core to use shared cache ID (ray-bazel-cache-${HOSTTYPE}) - Configure Bazel repository cache inside the mount for faster dependency resolution - Auto-disable remote cache uploads when BUILDKITE_BAZEL_CACHE_URL is empty, preventing 403 errors on local builds without AWS credentials All python-agnostic images now share the same Bazel cache per architecture, maximizing cache reuse while preventing cross-architecture toolchain conflicts. Signed-off-by: andrew <andrew@anyscale.com>
Signed-off-by: abrar <abrar@anyscale.com>
Code Review
This pull request is an automated daily merge from master to main. It contains a large number of changes, primarily focused on a major refactoring of the CI/CD and build system. Key changes include migrating to a more modular, Wanda-based build process, improving multi-architecture support, dropping support for Python 3.9, and adding support for Python 3.13. There are also extensive documentation updates, including new examples, better organization, and clarifications. The overall changes appear to be a significant step forward in modernizing the project's infrastructure. I've found one minor inconsistency in a CI configuration file, for which I've left a comment.
```
RAYCI_DISABLE_JAVA: "false"
RAYCI_WANDA_ALWAYS_REBUILD: "true"
JDK_SUFFIX: "-jdk"
ARCH_SUFFIX: "aarch64"
```
The ARCH_SUFFIX environment variable is defined here for the manylinux-cibase-jdk-aarch64 step, but it's not defined for the other aarch64 step (manylinux-cibase-aarch64) or for any of the x86_64 steps. This seems inconsistent. Based on the build scripts, this variable doesn't appear to be used. For consistency and to avoid confusion, consider removing this line.
This Pull Request was created automatically to merge the latest changes from the `master` into the `main` branch.
Created: 2026-01-28
Merge direction: `master` → `main`
Triggered by: Scheduled
Please review and merge if everything looks good.