Reduce latency

**Is your feature request related to a problem? Please describe.**
CUDA graphs: necessary to reduce the impact of the many sources of latency in the optimizer step, such as per-kernel launch overhead.
Multi-tensor apply: reduces the number of kernel launches, which saves latency, and is also more bandwidth-efficient.
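
To illustrate the multi-tensor point, here is a minimal sketch that compares a per-tensor SGD-style update with a batched one. It assumes PyTorch's underscore-prefixed `torch._foreach_add_` op, which is private and may change between releases. The function names `sgd_step_loop` and `sgd_step_foreach` are hypothetical and not part of this repository.

```python
import torch


def sgd_step_loop(params, grads, lr):
    # One in-place op per parameter tensor: N tensors means N launches on GPU.
    for p, g in zip(params, grads):
        p.add_(g, alpha=-lr)


def sgd_step_foreach(params, grads, lr):
    # One call over the whole list; on CUDA this can be batched into far
    # fewer kernel launches (on CPU it simply loops internally).
    torch._foreach_add_(params, grads, alpha=-lr)


torch.manual_seed(0)
base = [torch.randn(4, 3), torch.randn(5)]
grads = [torch.randn_like(t) for t in base]

params_loop = [t.clone() for t in base]
params_foreach = [t.clone() for t in base]

sgd_step_loop(params_loop, grads, lr=0.1)
sgd_step_foreach(params_foreach, grads, lr=0.1)
```

Both paths should produce the same parameters; the difference is only in how many kernels are launched when the tensors live on a GPU.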

**Describe the solution you'd like**
All optimizers should support a `capturable` argument, as the native PyTorch optimizers do, e.g. https://docs.pytorch.org/docs/stable/generated/torch.optim.adam.Adam_class.html#adam

Constants (betas, for example), the step counter, and everything else the optimizer step reads must live on the GPU for the step to be CUDA graph capturable.
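
As a rough illustration of that requirement, the sketch below keeps the learning rate, betas, and step counter as 0-dim tensors on the parameters' device, so the step contains only tensor ops and no host-side reads. The class `CapturableAdamSketch` and the helper `capture_step` are hypothetical names for illustration, not this repository's API or PyTorch's implementation. The capture helper follows the warm-up-then-capture pattern from PyTorch's CUDA graphs notes and assumes gradients are written into static `.grad` tensors before each replay.

```python
import torch


class CapturableAdamSketch:
    """Hypothetical minimal Adam whose step uses only on-device tensor state."""

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
        self.params = list(params)
        dev = self.params[0].device
        # Hyperparameters and the step counter are 0-dim tensors on the
        # parameters' device, so the step never reads them on the host.
        self.lr = torch.tensor(lr, device=dev)
        self.beta1 = torch.tensor(betas[0], device=dev)
        self.beta2 = torch.tensor(betas[1], device=dev)
        self.eps = eps
        self.step_count = torch.zeros((), device=dev)
        self.exp_avg = [torch.zeros_like(p) for p in self.params]
        self.exp_avg_sq = [torch.zeros_like(p) for p in self.params]

    @torch.no_grad()
    def step(self):
        # The counter is incremented on device; no .item() or Python int.
        self.step_count += 1
        bias_c1 = 1 - self.beta1 ** self.step_count
        bias_c2 = 1 - self.beta2 ** self.step_count
        for p, m, v in zip(self.params, self.exp_avg, self.exp_avg_sq):
            g = p.grad
            m.mul_(self.beta1).add_(g * (1 - self.beta1))
            v.mul_(self.beta2).add_(g * g * (1 - self.beta2))
            denom = (v / bias_c2).sqrt() + self.eps
            p.sub_(self.lr * (m / bias_c1) / denom)


def capture_step(opt):
    """Capture opt.step() into a CUDA graph (requires a CUDA device)."""
    # Warm up on a side stream before capturing.
    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side):
        opt.step()
    torch.cuda.current_stream().wait_stream(side)
    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        opt.step()
    return graph  # call graph.replay() after refreshing the static grads
```

The per-parameter loop is kept for readability; a real implementation would presumably combine this with multi-tensor ops. Note that the warm-up and capture calls in `capture_step` each run a real optimizer step, so state would need to be reset afterwards if that matters.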


cc @gdengk @FDecaYed @BoxiangW 


Reduce latency #109

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Reduce latency #109

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions