πŸ”„ daily merge: master β†’ main 2026-01-21#753

Open
antfin-oss wants to merge 434 commits into main from
create-pull-request/patch-b16808c752

Conversation

@antfin-oss

This Pull Request was created automatically to merge the latest changes from master into the main branch.

πŸ“… Created: 2026-01-21
πŸ”€ Merge direction: master β†’ main
πŸ€– Triggered by: Scheduled

Please review and merge if everything looks good.

nadongjun and others added 30 commits January 6, 2026 01:54
…t#58695)

## Description

This PR adds a new documentation page, Head Node Memory Management,
under the Ray Core advanced topics section.

## Related issues
Closes ray-project#58621

## Additional information
<img width="2048" height="1358" alt="image"
src="https://github.com/user-attachments/assets/3b98150d-05e6-4d15-9cd3-7e05e82ff516"
/>
<img width="2048" height="498" alt="image"
src="https://github.com/user-attachments/assets/4ec8fe43-e3a5-4df4-bca7-376ae407c77b"
/>

---------

Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
Co-authored-by: Ibrahim Rabbani <israbbani@gmail.com>
Co-authored-by: Rueian <rueiancsie@gmail.com>
Co-authored-by: Ibrahim Rabbani <irabbani@anyscale.com>
…y-project#59845)

- [x] Update the docstring for `ray.shutdown()` in
`python/ray/_private/worker.py` to clarify (sketched below):
- When connecting to a remote cluster via `ray.init(address="xxx")`,
`ray.shutdown()` only disconnects the client and does NOT terminate the
remote cluster
- Only local clusters started by `ray.init()` will have their processes
terminated by `ray.shutdown()`
- Clarified that `ray.init()` without address argument will auto-detect
existing clusters
- [x] Add documentation note to `doc/source/ray-core/starting-ray.rst`
explaining the same behavior difference
- [x] Review the changes via code_review
- [x] Run codeql_checker for security scan (no code changes requiring
analysis)
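
A minimal sketch of the behavior difference (the address mode is illustrative):

```python
import ray

# Case 1: connect to an existing cluster by address.
# ray.shutdown() only disconnects this driver; the cluster keeps running.
ray.init(address="auto")  # "auto" attaches to an already-running cluster
ray.shutdown()

# Case 2: no address and no running cluster: ray.init() starts a local
# cluster, and ray.shutdown() terminates the processes it started.
ray.init()
ray.shutdown()
```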

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
## Description
Upgrading the CUDA base GPU image from 11.8 to 12.8.1.
This is required for future py3.13 dependency upgrades.

---------

Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Co-authored-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
…roject#59735)

## Description
### Problem
Using --entrypoint-resources '{"fragile_node":"!1"}' with the Job API
raises an error saying only numeric values are allowed.

### Expected behavior
--entrypoint-resources should accept label selectors just like
ray.remote/PlacementGroups, so entrypoints can target or avoid nodes
with specific labels (see the example below).



## Related issues
Fixes ray-project#58662.

## Additional information

### Implementation approach
- Relax `JobSubmitRequest.entrypoint_resources` validation to allow
string values (`python/ray/dashboard/modules/job/common.py`).
- Add `_split_entrypoint_resources()` to separate numeric requests from
selector strings and run them through `validate_label_selector`
(`python/ray/dashboard/modules/job/job_manager.py`).
- Pass numeric resources via the existing `resources` option and
selector dict via `label_selector` when spawning the job supervisor,
leaving the field unset if only resources were provided
(`python/ray/dashboard/modules/job/job_manager.py`).
- Extend CLI parsing/tests to cover string-valued resources and assert
selector plumbing through the job manager
(`python/ray/dashboard/modules/job/tests/test_cli.py`,
`python/ray/dashboard/modules/job/tests/test_common.py`,
`python/ray/dashboard/modules/job/tests/test_job_manager.py`).
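
Assuming the relaxed validation above, a submission mixing numeric
requests with a selector might look like this (the address and the
`fragile_node` label are illustrative, mirroring the problem statement):

```python
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://127.0.0.1:8265")

# Numeric values are resource requests; string values are now validated
# as label selectors and passed to the job supervisor via `label_selector`.
client.submit_job(
    entrypoint="python my_script.py",
    entrypoint_resources={"CPU": 1, "fragile_node": "!1"},
)
```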

Signed-off-by: yaommen <myanstu@163.com>
update with more up-to-date information, and format the markdown file a
bit

Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
…-project#59848)

# Fix StreamingRepartition hang with empty upstream results

## Summary
Fix a bug where `StreamingRepartitionRefBundler` would hang when
processing empty datasets (0 rows).

## Problem
When upstream operations (e.g., `filter`, `map`) produce an empty
result (0 rows), the resulting empty `RefBundle` gets added to
`_pending_bundles` but never gets flushed because:
1. `add_bundle()` adds empty bundles (0 rows) to `_pending_bundles`
2. `_total_pending_rows` remains 0
3. `done_adding_bundles()` checks `len(_pending_bundles) > 0` and calls
`flush_remaining=True`
4. `_try_build_ready_bundle(flush_remaining=True)` checks
`_total_pending_rows > 0` β†’ False, so no flush happens
5. Empty bundles remain in `_pending_bundles` forever (memory leak)

## Reproduction
```python
import ray
ray.init()
ds = ray.data.range(5).filter(lambda row: row['id'] > 100)
ds = ds.repartition(target_num_rows_per_block=8)
ds.count()
```

## Solution
Changed flush condition in `_try_build_ready_bundle()` from checking
`_total_pending_rows > 0` to `len(self._pending_bundles) > 0`:

```python
# Before:
if flush_remaining and self._total_pending_rows > 0:

# After:
if flush_remaining and len(self._pending_bundles) > 0:
```

This ensures pending empty bundles are flushed rather than lingering in
the bundler state, preventing both hangs and memory leaks.

Signed-off-by: dragongu <andrewgu@vip.qq.com>
…uts (ray-project#59883)

If you have a pipeline like `read --> [some cpu transformation] --> [gpu
transformation init_concurrency =N] --> write`, the `gpu transformation`
might downscale to 0 actors if the CPU transformation is slow. This
basically nullifies `init_concurrency` and can cause cold-start delays.
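
A schematic of the kind of pipeline described, as a hedged sketch: the
`init_concurrency` knob above is approximated here by the minimum of Ray
Data's actor-pool `concurrency` range, which is an assumption about how
the option maps onto the public API:

```python
import ray

def slow_cpu_transform(row):
    # Stand-in for an expensive CPU-bound step that starves the GPU stage.
    return row

class GpuModel:
    # Stand-in for a GPU inference actor.
    def __call__(self, batch):
        return batch

ds = (
    ray.data.range(100_000)
    .map(slow_cpu_transform)
    # The GPU stage pre-starts its actor pool; before this fix, a slow
    # upstream stage could let the pool scale back down to 0 actors,
    # reintroducing cold-start delays.
    .map_batches(GpuModel, concurrency=(4, 8), num_gpus=1)
)
ds.write_parquet("local:///tmp/out")  # illustrative sink
```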

---------

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: You-Cheng Lin <106612301+owenowenisme@users.noreply.github.com>
Co-authored-by: You-Cheng Lin <106612301+owenowenisme@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
stop using python 3.9

Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
…butes (ray-project#59894)

## Description
The `StatelessCartPole` example from APPO is timing out. This could be
due to the latest changes in the APPO data pipeline. This PR modifies
the setup of the example by using the new APPO attributes.

## Related issues
Fixes
https://buildkite.com/ray-project/postmerge/builds/15188#019b8f6e-2850-465e-a98c-63c29fbf98f7/L4702


---------

Signed-off-by: simonsays1980 <simon.zehnder@gmail.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>
this avoids the need to put in the dummy no-op files, and also allows us
to add env vars in the future.

Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
## Description
ray: handle dual task errors with read-only args

- avoid writing to user-defined args when building RayTaskError hybrids
- fall back to RayTaskError-only with warning if subclassing fails
- add regression test covering read-only args user exceptions

## Related issues
Fixes ray-project#59437
ray-project#59846)

## Description
dashboard agent services such as reporter agent and event aggregator
agent do not run in minimal ray installs (`pip install ray`). this pr
skips client creation (and adds an info log to guide users) when using
minimal installs.

## Related issues
Fixes ray-project#59665


---------

Signed-off-by: sampan <sampan@anyscale.com>
Co-authored-by: sampan <sampan@anyscale.com>
## Description
this pr adds auth middleware to the dashboard http agent service and
configures clients to include token headers in their requests. the pr
also covers passing auth headers in the state_manager runtime env agent
api call, which was previously missed.


---------

Signed-off-by: sampan <sampan@anyscale.com>
Co-authored-by: sampan <sampan@anyscale.com>
…prising (ray-project#59390)

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
## Description
This PR adds support in the `JaxTrainer` to schedule across multiple TPU
slices using the `ray.util.tpu` public utilities.

To support this, this PR adds new `AcceleratorConfig`s to the V2 scaling
config, which consolidate the accelerator related fields for TPU and
GPU. When `TPUAcceleratorConfig` is specified, the JaxTrainer utilizes a
`SlicePlacementGroup` to atomically reserve `num_slices` TPU slices of
the desired topology, auto-detecting the required values for
`num_workers` and `resources_per_worker` when unspecified.
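
Pending the usage examples mentioned below, here is a hedged sketch; the
class and field names come from this description, while the import paths
and exact signatures are assumptions, not the final API:

```python
from ray.train import ScalingConfig
from ray.train.v2.jax import JaxTrainer, TPUAcceleratorConfig  # hypothetical paths

def train_fn():
    ...  # JAX training loop, run once per TPU worker

trainer = JaxTrainer(
    train_loop_per_worker=train_fn,
    scaling_config=ScalingConfig(
        accelerator_config=TPUAcceleratorConfig(  # consolidates TPU fields
            num_slices=2,    # atomically reserve two slices...
            topology="4x4",  # ...of this topology via a SlicePlacementGroup
        ),
        # num_workers / resources_per_worker are auto-detected when unset.
    ),
)
result = trainer.fit()
```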

TODO: I'll add some manual testing and usage examples in the comments.

## Related issues
ray-project#55162


---------

Signed-off-by: ryanaoleary <ryanaoleary@google.com>
Signed-off-by: Ryan O'Leary <113500783+ryanaoleary@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…e policy (ray-project#59803)

## Description
Given a typical scenario of a fast-producing operator followed by a
slow-producing operator, how do the backpressure policy and resource
allocator behave? This change adds tests to cement the expected
behavior.

## Related issues
DATA-1712


---------

Signed-off-by: Goutam <goutam@anyscale.com>
This PR adds documentation for several Ray Serve environment variables
that were defined in `constants.py` but missing from the documentation,
and also cleans up deprecated legacy environment variable names (a usage
sketch follows the change list).

### Changes Made

#### Documentation additions

**`doc/source/serve/production-guide/config.md`** (Proxy config
section):
- `RAY_SERVE_ALWAYS_RUN_PROXY_ON_HEAD_NODE` - Control whether to always
run a proxy on the head node
- `RAY_SERVE_PROXY_HEALTH_CHECK_TIMEOUT_S` - Proxy health check timeout
- `RAY_SERVE_PROXY_HEALTH_CHECK_PERIOD_S` - Proxy health check period
- `RAY_SERVE_PROXY_READY_CHECK_TIMEOUT_S` - Proxy ready check timeout
- `RAY_SERVE_PROXY_MIN_DRAINING_PERIOD_S` - Minimum proxy draining
period

**`doc/source/serve/production-guide/fault-tolerance.md`** (New "Replica
constructor retries" section):
- `RAY_SERVE_MAX_PER_REPLICA_RETRY_COUNT` - Max constructor retries per
replica
- `RAY_SERVE_MAX_DEPLOYMENT_CONSTRUCTOR_RETRY_COUNT` - Max constructor
retries per deployment

**`doc/source/serve/advanced-guides/performance.md`**:
- `RAY_SERVE_PROXY_PREFER_LOCAL_NODE_ROUTING` - Proxy node locality
routing preference
- `RAY_SERVE_PROXY_PREFER_LOCAL_AZ_ROUTING` - Proxy AZ locality routing
preference
- `RAY_SERVE_MAX_CACHED_HANDLES` - Max cached deployment handles
(controller debugging section)

**`doc/source/serve/monitoring.md`**:
- `RAY_SERVE_HTTP_PROXY_CALLBACK_IMPORT_PATH` - HTTP proxy
initialization callback
- `SERVE_SLOW_STARTUP_WARNING_S` - Slow startup warning threshold
- `SERVE_SLOW_STARTUP_WARNING_PERIOD_S` - Slow startup warning interval

#### Code cleanup

**`python/ray/serve/_private/constants.py`**:
- Removed legacy fallback for `MAX_DEPLOYMENT_CONSTRUCTOR_RETRY_COUNT`
(now only `RAY_SERVE_MAX_DEPLOYMENT_CONSTRUCTOR_RETRY_COUNT`)
- Removed legacy fallback for `MAX_PER_REPLICA_RETRY_COUNT` (now only
`RAY_SERVE_MAX_PER_REPLICA_RETRY_COUNT`)
- Removed legacy fallback for `MAX_CACHED_HANDLES` (now only
`RAY_SERVE_MAX_CACHED_HANDLES`)

**`python/ray/serve/_private/constants_utils.py`**:
- Removed `MAX_DEPLOYMENT_CONSTRUCTOR_RETRY_COUNT` and
`MAX_PER_REPLICA_RETRY_COUNT` from the deprecated names whitelist
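
As a usage sketch for the variables documented above (the value, and the
assumption that Serve reads these constants at import time, are
illustrative):

```python
import os

# Set before importing Serve so the constant picks up the override.
os.environ["RAY_SERVE_MAX_PER_REPLICA_RETRY_COUNT"] = "3"

from ray import serve

@serve.deployment
class Flaky:
    def __init__(self):
        # A failing constructor is retried at most 3 times on this replica.
        ...

serve.run(Flaky.bind())
```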

---------

Signed-off-by: harshit <harshit@anyscale.com>
…reating (ray-project#59610)

Signed-off-by: dayshah <dhyey2019@gmail.com>
## Description
allow
`RAY_DASHBOARD_AGGREGATOR_AGENT_PUBLISHER_HTTP_ENDPOINT_EXPOSABLE_EVENT_TYPES`
to accept `ALL` so that all events are exported. this will be used by
the history server. (without this config, kuberay needs to explicitly
list each event type, which is tedious as this list may grow in the
future.)


---------

Signed-off-by: sampan <sampan@anyscale.com>
Signed-off-by: Sampan S Nayak <sampansnayak2@gmail.com>
Co-authored-by: sampan <sampan@anyscale.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…project#59784)

## Description
run state api and task event unit tests with both the default flow
(task_event -> gcs) and the aggregator flow (task_event -> aggregator ->
gcs) to smooth the transition from the default to the aggregator flow

---------

Signed-off-by: sampan <sampan@anyscale.com>
Signed-off-by: Sampan S Nayak <sampansnayak2@gmail.com>
Co-authored-by: sampan <sampan@anyscale.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Co-authored-by: elliot-barn <elliot.barnwell@anyscale.com>
AnyscaleJobRunner is the only implementation/child class of
CommandRunner right now. There is no need to use inheritance.

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
)

Add BuildContext TypedDict to capture post_build_script, python_depset,
their SHA256 digests, and environment variables for custom BYOD image
builds.

Changes:
- Add build_context.py with BuildContext TypedDict and helper functions:
  - make_build_context: constructs BuildContext with computed file digests
  - encode_build_context: deterministic minified JSON serialization
  - decode_build_context: JSON deserialization
  - build_context_digest: SHA256 digest of encoded context
- Refactor build_anyscale_custom_byod_image to accept BuildContext
instead of individual post_build_script and python_depset arguments
- Update callers: custom_byod_build.py, ray_bisect.py
- Add comprehensive unit tests

πŸ€– Generated with [Claude Code](https://claude.com/claude-code)

Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
…project#59839)

# Fix `ArrowInvalid` error in checkpoint filter when converting PyArrow
chunks to NumPy arrays

## Issue

Fixes `ArrowInvalid` error when checkpoint filtering converts PyArrow
chunks to NumPy arrays with `zero_copy_only=True`:

```
  File "/usr/local/lib/python3.10/dist-packages/ray/data/checkpoint/checkpoint_filter.py", line 249, in filter_rows_for_block
    masks = list(executor.map(filter_with_ckpt_chunk, ckpt_chunks))
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 621, in result_iterator
    yield _result_or_cancel(fs.pop())
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 319, in _result_or_cancel
    return fut.result(timeout)
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/data/checkpoint/checkpoint_filter.py", line 229, in filter_with_ckpt_chunk
    ckpt_ids = ckpt_chunk.to_numpy(zero_copy_only=True)
  File "pyarrow/array.pxi", line 1789, in pyarrow.lib.Array.to_numpy
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Needed to copy 1 chunks with 0 nulls, but zero_copy_only was True
```

This error occurs when checkpoint data is loaded from Ray's object
store, where PyArrow buffers may reside in shared memory and cannot be
zero-copied to NumPy.

## Reproduction

```python
#!/usr/bin/env python3
import ray
from ray.data import DataContext
from ray.data.checkpoint import CheckpointConfig
import tempfile

ray.init()

with tempfile.TemporaryDirectory() as ckpt_dir, \
     tempfile.TemporaryDirectory() as data_dir, \
     tempfile.TemporaryDirectory() as output_dir:
    # Step 1: Create data
    ray.data.range(10).map(lambda x: {"id": f"id_{x['id']}"}).write_parquet(data_dir)

    # Step 2: Enable checkpoint and write
    ctx = DataContext.get_current()
    ctx.checkpoint_config = CheckpointConfig(
        checkpoint_path=ckpt_dir,
        id_column="id",
        delete_checkpoint_on_success=False
    )
    ray.data.read_parquet(data_dir).filter(lambda x: x["id"] != 'id_0').write_parquet(output_dir)

    # Step 3: Second write triggers checkpoint filtering
    ray.data.read_parquet(data_dir).write_parquet(output_dir)

ray.shutdown()
```

## Solution

Change `to_numpy(zero_copy_only=True)` to
`to_numpy(zero_copy_only=False)` in
`BatchBasedCheckpointFilter.filter_rows_for_block()`. This allows
PyArrow to copy data when necessary.

### Changes

**File**: `ray/python/ray/data/checkpoint/checkpoint_filter.py`

- Line 229: Changed `ckpt_chunk.to_numpy(zero_copy_only=True)` to
`ckpt_chunk.to_numpy(zero_copy_only=False)`

### Performance Impact

No performance regression expected. PyArrow will only perform a copy
when zero-copy is not possible.

Signed-off-by: dragongu <andrewgu@vip.qq.com>
## Description
Adds a repr_name field to the actor_lifecycle_event schema and populates
it when available.

## Related issues
Closes ray-project#59813


---------

Signed-off-by: sampan <sampan@anyscale.com>
Signed-off-by: Sampan S Nayak <sampansnayak2@gmail.com>
Co-authored-by: sampan <sampan@anyscale.com>
…y-project#59893)

## Description

Fix inconsistent task name in metrics between RUNNING and FINISHED
states.

When a Ray task is defined with a custom name via
`.options(name="custom_name")`, the `ray_tasks` metrics show
inconsistent names:
- **RUNNING** state: shows the original function name (e.g., `RemoteFn`)
- **FINISHED/FAILED** state: shows the custom name (e.g., `test`)

**Root cause:** The RUNNING task counter in `CoreWorker` uses
`FunctionDescriptor()->CallString()` to get the task name, while
finished task events correctly use `TaskSpecification::GetName()`.

**Fix:** Changed both `HandlePushTask` and `ExecuteTask` in
`core_worker.cc` to use `task_spec.GetName()` consistently, which
properly returns the custom name when set.
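
A quick way to observe the fix (standard Ray APIs; the metric should now
report the custom name in all states):

```python
import ray

ray.init()

@ray.remote
def remote_fn():
    return 1

# With the fix, the `ray_tasks` metric reports "test" for the RUNNING
# state as well as FINISHED/FAILED, instead of the function name.
ray.get(remote_fn.options(name="test").remote())
```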

## Related issues

None - this PR addresses a newly discovered bug.

## Additional information

**Files changed:**
- `src/ray/core_worker/core_worker.cc` - Use `GetName()` instead of
`FunctionDescriptor()->CallString()` for metrics
- `python/ray/tests/test_task_metrics.py` - Added test
`test_task_custom_name_metrics` to verify custom names appear correctly
in metrics

Signed-off-by: Yuan Jiewei <jieweihh.yuan@gmail.com>
Co-authored-by: Yuan Jiewei <jieweihh.yuan@gmail.com>
## Description
update metrics export docs based on changes in
ray-project#59337


---------

Signed-off-by: sampan <sampan@anyscale.com>
Co-authored-by: sampan <sampan@anyscale.com>
…ray-project#59808)

Adds a new RLlib algorithm TQC, which extends SAC with distributional
critics using quantile regression to control Q-function overestimation
bias.

Key components (usage sketch below):
- TQC algorithm configuration and implementation
- Default TQC RLModule with multiple quantile critics
- TQC catalog for building network components
- Comprehensive test suite covering compilation, simple environments,
and parameter validation
- Documentation including
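
A hedged sketch of what using the new algorithm might look like; the
import path and option names follow RLlib's SAC-style conventions and
are assumptions, not the final API:

```python
from ray.rllib.algorithms.tqc import TQCConfig  # hypothetical module path

config = (
    TQCConfig()
    .environment("Pendulum-v1")
    .training(
        # Several quantile critics approximate the return distribution;
        # dropping the top quantiles curbs Q-value overestimation.
        num_critics=2,     # assumed option name
        num_quantiles=25,  # assumed option name
    )
)
algo = config.build()
print(algo.train())
```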


---------

Signed-off-by: tk42 <nsplat@gmail.com>
Co-authored-by: simonsays1980 <simon.zehnder@gmail.com>
kyuds and others added 22 commits January 19, 2026 21:39
…60304)

## Description
We had a separate field in `OpState` to keep track of outputted rows.
`OpRuntimeMetrics` exists per `PhysicalOperator` and already has a field
tracking outputted rows, so there is no need to keep a duplicate in
`OpState`.

## Related issues
N/A

## Additional information
N/A

Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>
## Description
This PR removes an obsolete HalfCheetah release test.

## Related issues
See also: ray-project#59007

Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
## Description

Currently ray attach only allows opening an SSH session on the head
node. It could be useful to allow attaching to worker nodes to check
what state the execution environment and file system are in (e.g.
running conda list, examining config files such as ~/.keras/keras.json).


## Related issues

Closes ray-project#7064

## Additional information

This PR adds a `--node-ip` arg to `ray attach` to specify the node IP to
attach to. Usage: `ray attach cluster.yaml --node-ip <node ip>`. Defaults
to the head node if `--node-ip` is not provided.

Added a unit test and tested on GCP (see
ray-project#59931 (comment)).

---------

Signed-off-by: machichima <nary12321@gmail.com>
…oject#60276)

so that we are not pretending that we are fetching results or terminating
jobs.

Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
…in homepage (ray-project#60229)

## Summary

Replaced the Ray Tune example in the homepage (`index.html`) to show
vanilla Ray Tune usage instead of V1 tune+train integration.

**Changes:**
- Removed `ScalingConfig` and `LightGBMTrainer` imports (Ray Train components)
- Added a pure Ray Tune example demonstrating:
  - An objective function that trains a model with hyperparameters and reports metrics
  - Hyperparameter search space using common Tune methods (`loguniform`, `choice`, `randint`)
  - Running 1000 trials with the `Tuner` API
  - Retrieving the best result

This makes the example clearer for users who want to learn Ray Tune's
hyperparameter optimization capabilities without the complexity of Ray
Train integration.
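
A sketch of the kind of vanilla Tune example described (search space and
metric are illustrative, not the exact homepage snippet):

```python
from ray import tune

def objective(config):
    # Stand-in for real training: compute and report a metric per trial.
    score = config["lr"] * config["num_layers"]
    tune.report({"score": score})

search_space = {
    "lr": tune.loguniform(1e-4, 1e-1),
    "num_layers": tune.randint(1, 5),
    "activation": tune.choice(["relu", "tanh"]),
}

tuner = tune.Tuner(
    objective,
    param_space=search_space,
    tune_config=tune.TuneConfig(num_samples=1000),
)
results = tuner.fit()
print(results.get_best_result(metric="score", mode="max").config)
```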

Signed-off-by: xgui <xgui@anyscale.com>
if a test is not stable, it should be on manual frequency. we will no
longer treat unstable tests differently.

Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
…ect#60264)

the alias is not used anywhere.

this clears all the `__init__.py` files under the `ray_release/`
directory, making it consistent with other files and easier to convert
everything to idiomatic bazel

Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
we have always been using a constant. if one needs more logs, they can
go to anyscale's UI and view logs there.

Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
)

metrics saving is handled in job wrapper

Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
…#60277)

just save the sdk as a private member instead

Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
…oject#60272)

so that it is not going back and forth between the implementation and
the abstract class, and not implemented as a property.

Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
## Description
Deprecate the Predictor API and its concrete subclasses:
DLPredictor(Predictor), LightGBMPredictor(Predictor),
TensorflowPredictor(DLPredictor), TorchPredictor(DLPredictor),
XGBoostPredictor(Predictor), and TorchDetectionPredictor(TorchPredictor).

## Related issues
Closes ray-project#60266 

## Additional information
Added `@Deprecated` annotations to the corresponding classes; a
`DeprecationWarning` is raised when the superclass constructor is called
(sketched below).
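
The warning mechanics presumably follow the standard pattern, roughly:

```python
import warnings

class Predictor:
    def __init__(self):
        warnings.warn(
            "`Predictor` and its subclasses are deprecated.",
            DeprecationWarning,
            stacklevel=2,
        )

class TorchPredictor(Predictor):
    def __init__(self):
        super().__init__()  # subclasses warn via the superclass constructor

TorchPredictor()  # emits a single DeprecationWarning
```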

---------

Signed-off-by: Hyunoh-Yeo <hyunoh.yeo@gmail.com>
Signed-off-by: Hyunoh-Yeo <113647638+Hyunoh-Yeo@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…dule IDs (ray-project#60234)

### Description

This PR fixes a bug in the RLlib `MultiAgentEnvRunner` where module
episode return metrics were incorrectly calculated when multiple agents
share the same module ID. Previously, the code was overwriting returns
instead of accumulating them, leading to incorrect metrics.

- Fixed the module return calculation logic in `MultiAgentEnvRunner` to
properly accumulate returns when multiple agents use the same module ID
(sketched below)
- Added test case to verify that module metrics returns equal the sum of
agent returns assigned to that module
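
In essence (a simplified sketch, not the actual runner code):

```python
# Two agents map to the same module ID.
agent_to_module = {"agent_0": "shared_policy", "agent_1": "shared_policy"}
agent_returns = {"agent_0": 1.0, "agent_1": 2.0}

# Before: the return for a shared module ID was overwritten per agent.
module_returns = {}
for agent_id, ret in agent_returns.items():
    module_returns[agent_to_module[agent_id]] = ret  # bug: keeps last agent only

# After: returns accumulate across all agents mapped to the same module.
module_returns = {}
for agent_id, ret in agent_returns.items():
    module_id = agent_to_module[agent_id]
    module_returns[module_id] = module_returns.get(module_id, 0.0) + ret

assert module_returns["shared_policy"] == 3.0  # sum of the agent returns
```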

### Related issues
Fixes ray-project#59860

### Files modified:
- `rllib/env/multi_agent_env_runner.py`: Core bug fix
- `rllib/env/tests/test_multi_agent_env_runner.py`: New test case called
`test_module_metrics_returns_equal_sum_of_agent_returns()`

---------

Signed-off-by: Adam Kelloway <kelloway@amazon.com>
Co-authored-by: Adam Kelloway <kelloway@amazon.com>
install from a tarball from the official source, rather than a deb.

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
…oject#60151)

removing requirement file and constraint file build args from the
following images:
- base-deps
- base-extra
- base-extra-test-deps
- base-slim (defaulting constraints file as a build arg)

defaulting PYTHON_DEPSET & CONSTRAINTS_FILE args in the dockerfile

renaming the ray-llm, ray-gpu, and ray base extra test-deps lock files.
IMAGE_TYPE defined on the BK jobs will determine which lock file to copy
to the image


hello world release test run:
https://buildkite.com/ray-project/release/builds/76001#

---------

Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
ray-project#59897)

Signed-off-by: Aydin Abiar <aydin@anyscale.com>
Signed-off-by: Aydin Abiar <62435714+Aydin-ab@users.noreply.github.com>
Co-authored-by: Aydin Abiar <aydin@anyscale.com>
…netes token authentication (ray-project#59621)

## Description

Per discussion from REP PR
(ray-project/enhancements#63), this PR adds a
server-side config `RAY_ENABLE_K8S_TOKEN_RBAC=true` to enable
Kubernetes-based token authentication. This must be set in addition to
`RAY_AUTH_MODE=token`. The main benefit of this change is that the
server-side authentication flow becomes opaque to clients, and all
clients only need to set `RAY_AUTH_MODE=token` along with their token.

---------

Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
…-project#60283)

## Summary
- Fix Ray Data's cluster autoscalers (V1 and V2) to respect
user-configured `resource_limits` set via `ExecutionOptions`
- Cap autoscaling resource requests to not exceed user-specified CPU and
GPU limits
- Update `get_total_resources()` to return the minimum of cluster
resources and user limits

## Why are these changes needed?

Previously, Ray Data's cluster autoscalers did not respect
user-configured resource limits. When a user set explicit limits like:

```python
ctx = ray.data.DataContext.get_current()
ctx.execution_options.resource_limits = ExecutionResources(cpu=8)
```

The autoscaler would ignore these limits and continue to request more
cluster resources from Ray's autoscaler, causing unnecessary node
upscaling even when the executor couldn't use the additional resources.

This was problematic because:
1. Users explicitly setting resource limits expect Ray Data to stay
within those bounds
2. Unnecessary cluster scaling wastes cloud resources and money
3. The `ResourceManager.get_global_limits()` already respects user
limits, but the autoscaler bypassed this by requesting resources
directly

## Test Plan

Added comprehensive unit tests for both autoscaler implementations

## Related issue number

Fixes ray-project#60085

## Checks
- [x] I've signed off every commit
- [x] I've run `scripts/format.sh` to lint the changes in this PR
- [x] I've included any doc changes needed
- [x] I've added any new tests if needed


---------

Signed-off-by: Marwan Sarieddine <sarieddine.marwan@gmail.com>
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
…t#60267)

it is always an instance of AnyscaleJobRunner.

Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
…0278)

and saves the job ID in `_job_id`. this makes the information flow
clearer and simpler.

this is preparation for refactoring the job sdk usage.

Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
per anyscale#727

Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
… from Ray Data (ray-project#60292)

## Description
Remove all top-level imports of `ray.data` from the `ray.train` module.
Imports needed only for type annotations should be guarded behind
`if TYPE_CHECKING:`. Imports needed at runtime should be moved inline
(lazy imports within functions/methods).
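
The pattern is the standard `TYPE_CHECKING` guard plus lazy runtime
imports; the function below is illustrative, not code from the PR:

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Visible only to type checkers; never executed at runtime, so
    # importing ray.train no longer pulls in ray.data.
    from ray.data import Dataset

def materialize_if_dataset(obj) -> "Dataset":
    # The runtime import is deferred until the function actually runs.
    from ray.data import Dataset as RuntimeDataset
    assert isinstance(obj, RuntimeDataset)
    return obj.materialize()
```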

## Related issues
Fixes ray-project#60152.

---------

Signed-off-by: Haichuan Hu <kaisennhu@gmail.com>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu>
@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request primarily focuses on updating and refactoring the CI/CD pipeline, removing Python 3.9 support, and introducing new build steps for C++ wheels. Several documentation files have also been updated to reflect these changes and improve clarity. The removal of the oss tag from various build steps across different platforms might impact how these jobs are categorized or filtered in the CI system. Additionally, the refactoring of Bazel sharding logic and dependency management indicates a significant overhaul of the build infrastructure.

Comment on lines +151 to +153
if (ConfigInternal::Instance().worker_type != WorkerType::DRIVER) {
options.worker_id = WorkerID::FromHex(ConfigInternal::Instance().worker_id);
}


critical

The options.startup_token assignment has been replaced with a conditional options.worker_id assignment. This aligns with the renaming of startup_token to worker_id. Ensure that the WorkerID::FromHex conversion is robust and handles all possible string inputs for worker_id.

head_args.insert(head_args.end(), args.begin(), args.end());
}
startup_token = absl::GetFlag<int64_t>(FLAGS_startup_token);
worker_id = absl::GetFlag<std::string>(FLAGS_ray_worker_id);


critical

The assignment startup_token = absl::GetFlag<int64_t>(FLAGS_startup_token); has been replaced with worker_id = absl::GetFlag<std::string>(FLAGS_ray_worker_id);. This change must be carefully reviewed to ensure that the new worker_id is correctly retrieved and used in all relevant parts of the system, especially considering the type change from int64_t to std::string.

Comment on lines +80 to +83
ABSL_FLAG(std::string,
ray_worker_id,
"",
"The worker ID assigned to this worker process by the raylet (hex string).");


critical

The startup_token flag has been replaced with ray_worker_id of type std::string. This is a significant change in how worker identification is handled. Ensure all components that rely on startup_token are updated to use ray_worker_id and that the string format is correctly parsed and used.

Comment on lines +147 to +149
# Correct example of ray.get(), using the object store to fetch the RDT object because the caller
# is not part of the collective group.
print(ray.get(tensor, _use_object_store=True))


critical

The _tensor_transport="object_store" parameter has been updated to _use_object_store=True. This is a breaking API change that needs to be clearly communicated to users, along with migration instructions.

Suggested change (the suggested lines match the current ones; the
substantive change is in the prose below):

# Correct example of ray.get(), using the object store to fetch the RDT object because the caller
# is not part of the collective group.
print(ray.get(tensor, _use_object_store=True))

The :func:`ray.get <ray.get>` function can also be used as usual to retrieve the result of an RDT object. However, :func:`ray.get <ray.get>` will by default use the same tensor transport as the one specified in the :func:`@ray.method <ray.method>` decorator. For collective-based transports, this will not work if the caller is not part of the collective group.

Before: Therefore, users need to specify the Ray object store as the tensor transport explicitly by setting ``_tensor_transport`` in :func:`ray.get <ray.get>`.
After: Therefore, users need to specify the Ray object store as the tensor transport explicitly by setting ``_use_object_store`` in :func:`ray.get <ray.get>`.


critical

The parameter for specifying the Ray object store as the tensor transport has been changed from _tensor_transport to _use_object_store. This is a breaking API change that needs to be clearly communicated to users, along with migration instructions.

- python
- macos_wheels
- oss
job_env: MACOS


medium

The oss tag has been removed from this step. Please confirm if this removal is intentional and if there's a new mechanism for handling oss related jobs.

@@ -19,11 +22,11 @@ steps:
tags:


medium

The oss tag has been removed from this step. Please confirm if this removal is intentional and if there's a new mechanism for handling oss related jobs.

Comment on lines +698 to 704
Note that Ray decouples the lifetime option and the name option. If you only specify
the name without specifying ``lifetime="detached"``, then you can only retrieve the placement group
while the driver where you created the placement group is still running.
It's recommended to always specify the name when creating the detached placement group. If you don't,
there is no way to retrieve the placement group from another process, and there is no way
to kill it once you exit the driver script that created the placement group.


medium

The description for detached placement groups has been expanded to emphasize the importance of specifying the name when creating a detached placement group, and the consequences of not doing so. This provides crucial guidance for users.

@@ -484,7 +484,6 @@ steps:
# avoid running them for every C++ code change.
tags:
- spark_on_ray


medium

The oss tag has been removed from this step. Please confirm if this removal is intentional and if there's a new mechanism for handling oss related jobs.

@@ -432,7 +433,6 @@ steps:
tags:
- java
- python


medium

The oss tag has been removed from this step. Similar to other instances, please confirm if this removal is intentional and if there's a new mechanism for handling oss related jobs.

@github-actions

github-actions bot commented Feb 4, 2026

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the stale label Feb 4, 2026
