This is a beta release of cuPyLMA.

Changelog

Known Issues and Future Work

Multi-GPU acceleration is currently limited by kernel launch overhead: we plan to explore CUDA Graphs to reduce it.
The optimizer does not inherit from torch.optim.Optimizer, which means extra work when migrating existing code: we plan to restructure the optimizer so that it follows the PyTorch optimizer interface.
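
For readers unfamiliar with the PyTorch optimizer interface referred to in the last item, the following is a minimal sketch of a class that inherits from torch.optim.Optimizer. It is not cuPyLMA code and does not reflect the planned design: `SketchOptimizer` and `fit_line` are hypothetical names, and the update rule is plain gradient descent standing in for the real algorithm. It only shows the pieces existing training loops typically rely on (`param_groups`, `zero_grad()`, and `step(closure)`).

```python
import torch


class SketchOptimizer(torch.optim.Optimizer):
    """Hypothetical stand-in used for illustration; not cuPyLMA's optimizer."""

    def __init__(self, params, lr=0.1):
        # Hyperparameters go into `defaults` and are copied into each param group.
        super().__init__(params, dict(lr=lr))

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            # Re-enable autograd so the closure can recompute loss and gradients.
            with torch.enable_grad():
                loss = closure()
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is not None:
                    # Placeholder update: plain gradient descent.
                    p.add_(p.grad, alpha=-group["lr"])
        return loss


def fit_line(steps=200):
    """Fit y = 3x + 0.5 with the sketch optimizer; return (first, last) loss."""
    torch.manual_seed(0)
    x = torch.linspace(-1.0, 1.0, 32).unsqueeze(1)
    y = 3.0 * x + 0.5
    model = torch.nn.Linear(1, 1)
    optimizer = SketchOptimizer(model.parameters(), lr=0.1)

    def closure():
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x), y)
        loss.backward()
        return loss

    first = closure().item()
    last = first
    for _ in range(steps):
        # step() returns the loss evaluated by the closure before the update.
        last = optimizer.step(closure).item()
    return first, last
```

Because the class derives from torch.optim.Optimizer, code written against the standard interface (for example, a loop that calls `optimizer.step(closure)`) can use it without changes, which is the migration benefit the item above describes.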