Piper is a PyTorch library for training large models with flexible pipeline parallel schedules.
We assume a Linux-based environment.

- Create a conda environment with `python==3.10`
- Install the requirements in `requirements.txt`
- Modify the PyTorch and Ray dependencies according to the instructions below
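Assuming conda and pip are available, the steps above might look like the following sketch (the environment name `piper` is an assumption, not mandated by the repository):

```shell
# Illustrative setup commands; the environment name "piper" is an
# assumption, not something the repository requires.
conda create -n piper python==3.10 -y
conda activate piper
pip install -r requirements.txt
# Then apply the PyTorch and Ray modifications described below.
```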
## PyTorch
Piper's `RemoteTensor` is not traceable by TorchDynamo.
- WIP: following FakeTensor's approach, register operator implementations that make `RemoteTensor` transparently traceable by TorchDynamo.
- Modification: Add to the beginning of `transform` in `convert_frame.py`:
```python
####### PIPER MODIFICATION START #######
# Instead of tracing RemoteTensors, trace their
# underlying FakeTensor
from src.piper_utils import RemoteTensor
for k, v in locals.items():
    if isinstance(v, RemoteTensor):
        locals[k] = v._fake
####### PIPER MODIFICATION END #######
```
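The unwrapping step can be illustrated in isolation. The following sketch uses hypothetical stand-ins for `RemoteTensor` and `FakeTensor` (only the `_fake` attribute is modeled, not Piper's actual classes):

```python
# Minimal sketch: replace RemoteTensor-like wrappers in a frame's locals
# with their underlying fake tensors before Dynamo traces the frame.
class FakeTensor:
    """Stand-in for torch._subclasses.FakeTensor."""

class RemoteTensor:
    """Hypothetical stand-in for Piper's RemoteTensor."""
    def __init__(self, fake):
        self._fake = fake  # the FakeTensor Dynamo should trace instead

def unwrap_remote_tensors(frame_locals):
    # Mirrors the patch: swap each RemoteTensor for its ._fake in place.
    for k, v in frame_locals.items():
        if isinstance(v, RemoteTensor):
            frame_locals[k] = v._fake
    return frame_locals

fake = FakeTensor()
frame_locals = {"x": RemoteTensor(fake), "y": 42}
unwrap_remote_tensors(frame_locals)
# frame_locals["x"] now holds the underlying FakeTensor; "y" is untouched
```

Mutating the dict in place is safe here because only values of existing keys are reassigned; no keys are added or removed during iteration.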
Piper's `RemoteTensor` causes recompilation bugs because it is not traceable by TorchDynamo.
- WIP: same as above
- Modification: Add at the beginning of `CheckFunctionManager.__init__` in `guards.py`:
```python
####### PIPER MODIFICATION START #######
def filter_guards(guard):
    return not guard.inner_create_fn().__name__ == "TENSOR_MATCH"
guards = list(filter(filter_guards, guards))
####### PIPER MODIFICATION END #######
```
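The filtering logic can be exercised on its own. In this sketch, `Guard` is a simplified stand-in for Dynamo's guard objects, modeling only the `inner_create_fn()` accessor the patch relies on:

```python
# Sketch of the guard filter: drop guards whose create function is named
# TENSOR_MATCH, keeping all other guards. `Guard` is a stand-in for
# Dynamo's guard objects, not the real class.
class Guard:
    def __init__(self, create_fn):
        self._create_fn = create_fn

    def inner_create_fn(self):
        return self._create_fn

# Dummy create functions; only their __name__ matters to the filter.
def TENSOR_MATCH():
    pass

def TYPE_MATCH():
    pass

def drop_tensor_match_guards(guards):
    def keep(guard):
        return guard.inner_create_fn().__name__ != "TENSOR_MATCH"
    return list(filter(keep, guards))

guards = [Guard(TENSOR_MATCH), Guard(TYPE_MATCH)]
filtered = drop_tensor_match_guards(guards)
# Only the TYPE_MATCH guard survives the filter
```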
## Ray
Tensor transport backends currently support only one return value per task.
- WIP: Upstream this into Ray.
- Modifications (2): Comment out the assertion in `ActorMethod._remote()` and add logic for handling multiple return values with a GPU object manager.
```python
####### PIPER MODIFICATION START #######
# if num_returns != 1:
#     raise ValueError(
#         f"Currently, methods with tensor_transport={tensor_transport.name} only support 1 return value. "
#         "Please make sure the actor method is decorated with `@ray.method(num_returns=1)` (the default)."
#     )
####### PIPER MODIFICATION END #######
```
```python
####### PIPER MODIFICATION START #######
gpu_object_manager = ray._private.worker.global_worker.gpu_object_manager
if isinstance(object_refs, ObjectRef):
    object_ref = object_refs
    gpu_object_manager.add_gpu_object_ref(
        object_ref, self._actor, tensor_transport
    )
else:
    for object_ref in object_refs:
        assert isinstance(object_ref, ObjectRef)
        gpu_object_manager.add_gpu_object_ref(
            object_ref, self._actor, tensor_transport
        )
####### PIPER MODIFICATION END #######
```
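The control flow of this second modification can be shown in isolation. Here `ObjectRef` and `GpuObjectManager` are simplified stand-ins for Ray's internals, used only to illustrate how a single return value and multiple return values are both registered:

```python
# Sketch of the multi-return handling: register every ObjectRef with the
# GPU object manager, whether the task returned one ref or a list of refs.
# ObjectRef and GpuObjectManager are stand-ins, not Ray's real classes.
class ObjectRef:
    def __init__(self, ref_id):
        self.ref_id = ref_id

class GpuObjectManager:
    def __init__(self):
        self.registered = []

    def add_gpu_object_ref(self, object_ref, actor, tensor_transport):
        self.registered.append(object_ref)

def register_refs(manager, object_refs, actor=None, tensor_transport=None):
    # Normalize: object_refs is either a single ObjectRef or a list of them.
    if isinstance(object_refs, ObjectRef):
        manager.add_gpu_object_ref(object_refs, actor, tensor_transport)
    else:
        for object_ref in object_refs:
            assert isinstance(object_ref, ObjectRef)
            manager.add_gpu_object_ref(object_ref, actor, tensor_transport)

mgr = GpuObjectManager()
register_refs(mgr, ObjectRef("a"))                    # single return value
register_refs(mgr, [ObjectRef("b"), ObjectRef("c")])  # multiple return values
# mgr.registered now holds all three refs
```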