
[perf] Improve performance for putting jagged tensor#36

Merged
0oshowero0 merged 8 commits into Ascend:main from 0oshowero0:jagged_tensor
Feb 25, 2026

Conversation

0oshowero0 (Collaborator) commented Feb 25, 2026

Background

When users input a TensorDict containing jagged tensors (nested tensors), the put_data process becomes extremely slow.

Specifically, the _filter_storage_data function uses itemgetter(*batch_indexes)(data[fname]) to extract individual items from each tensor in the TensorDict. This indexing approach works efficiently for strided tensors but is extremely inefficient for jagged tensors.

Root Cause

For jagged tensors, itemgetter with multiple batch indexes performs one indexing operation per index, and each such access costs $\mathcal{O}(n)$. Extracting $k$ samples therefore costs $\mathcal{O}(k \cdot n)$, which degenerates to $\mathcal{O}(n^2)$ when the number of extracted samples grows with the batch size.

Solution

We unbind the nested tensor once up front, then read each sample from the resulting tuple of per-sample views.

  # unbind nested tensor
  results: dict = {}
  for field in sorted(data.keys()):
      field_data = data[field]
      if isinstance(field_data, Tensor) and field_data.is_nested:
          results[field] = field_data.unbind()
      else:
          results[field] = field_data

Simple Reproduction Script

  import torch
  import time
  from operator import itemgetter

  # Create a jagged tensor with 1000 variable-length samples
  lengths = torch.randint(10, 50, (1000,))
  offsets = torch.cat([torch.zeros(1, dtype=torch.long), lengths.cumsum(0)])
  values = torch.randn(int(offsets[-1]), 128)
  jagged = torch.nested.as_nested_tensor(
      [values[offsets[i]:offsets[i + 1]] for i in range(1000)],
      layout=torch.jagged
  )

  batch_indexes = list(range(0, 1000, 10))  # 100 indexes

  # Method 1: Direct itemgetter on jagged tensor (SLOW)
  start = time.perf_counter()
  result = itemgetter(*batch_indexes)(jagged)
  print(f"Direct itemgetter: {(time.perf_counter() - start)*1000:.2f} ms")

  # Method 2: Unbind first, then itemgetter (FAST)
  start = time.perf_counter()
  field_list = jagged.unbind()
  result = itemgetter(*batch_indexes)(field_list)
  print(f"Unbind + itemgetter: {(time.perf_counter() - start)*1000:.2f} ms")

Output:

Direct itemgetter: 150.94 ms
Unbind + itemgetter: 1.80 ms

Signed-off-by: 0oshowero0 <o0shower0o@outlook.com>
Copilot AI review requested due to automatic review settings February 25, 2026 02:56
@ascend-robot

CLA Signature Pass

0oshowero0, thanks for your pull request. All authors of the commits have signed the CLA. 👍

Copilot AI (Contributor) left a comment

Pull request overview

This PR targets a performance bottleneck when put_data processes TensorDict fields backed by jagged (nested) tensors by avoiding repeated expensive multi-indexing on jagged tensors.

Changes:

  • Optimize _filter_storage_data to unbind jagged tensors before applying itemgetter over multiple batch indexes.
  • Add a note in KVStorageManager._generate_values indicating a similar potential optimization for jagged tensors.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

  • transfer_queue/storage/managers/simple_backend_manager.py: Adds a jagged-tensor fast path in _filter_storage_data by unbinding before multi-index selection.
  • transfer_queue/storage/managers/base.py: Adds a TODO note in _generate_values related to jagged tensor handling.



Copilot AI (Contributor) left a comment

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

Comment on lines 205 to 214
  # unbind jagged tensor
  results: dict = {}
  for field in sorted(data.keys()):
      field_data = data[field]

      # For jagged tensors, unbind() first to accelerate indexing process
      if isinstance(field_data, Tensor) and field_data.layout == torch.jagged:
          results[field] = field_data.unbind()
      else:
          results[field] = field_data
Copilot AI commented Feb 25, 2026
This change introduces a jagged-tensor fast path (pre-unbind before indexing), but there’s no test exercising put_data with layout=torch.jagged. Adding a unit test that uses a jagged tensor field and asserts the data sent to _put_to_single_storage_unit matches expected samples would prevent regressions (and ensure the performance fix stays wired in).


Copilot AI (Contributor) left a comment

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Comment on lines 205 to 215 (the same jagged-tensor fast path quoted above)
Copilot AI commented Feb 25, 2026
This adds a new jagged-tensor fast path (unbind() before indexing), but there isn't a unit test exercising it. Consider extending the existing tests/test_async_simple_storage_manager.py::test_async_storage_manager_mock_operations to include a layout=torch.jagged nested tensor and assert unbind() is called and that _put_to_single_storage_unit receives the expected sliced items.



@0oshowero0 0oshowero0 merged commit 73ed4c9 into Ascend:main Feb 25, 2026
5 checks passed