Description
Terra resource management is not compatible with the spawn start method for ProcessPoolExecutor. The current workaround is to use a ThreadPoolExecutor.
torch appears to require the spawn start method when CUDA is used from a ProcessPoolExecutor; see:
https://pytorch.org/docs/stable/notes/multiprocessing.html#cuda-in-multiprocessing
The start method can be set globally with the following torch code:
torch.multiprocessing.set_start_method("spawn", force=True)
Alternatively, pass mp_context when initializing the ProcessPoolExecutor:
from multiprocessing import get_context
from terra.executor.process import ProcessPoolExecutor
mp_context = get_context('spawn')
Executor = ProcessPoolExecutor(max_workers=3, mp_context=mp_context)
Unfortunately, a spawned ProcessPoolExecutor re-imports Python modules in each child process. Because the resource lock directory path includes os.getpid(), each child process ends up with a different lock directory:
terra/terra/executor/resources.py
Lines 126 to 129 in e24792b
self.lock_dir = os.path.join(settings.terra.lock_dir,
                             platform.node(),
                             str(os.getpid()),
                             resource_name)
Because each child process uses a different lock directory, no child process is aware of the resource locks held by the others. Every child process is therefore able to claim the first resource, which results in processing failure.
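A minimal sketch of why the paths diverge, assuming the same path layout as the snippet quoted above. The helper name build_lock_dir, the base directory, and the PIDs are hypothetical and chosen only for illustration; they are not part of terra.

```python
import os
import platform


def build_lock_dir(base_dir, pid, resource_name):
    # Same layout as the quoted snippet: <lock_dir>/<hostname>/<pid>/<resource_name>.
    # base_dir stands in for settings.terra.lock_dir (hypothetical value below).
    return os.path.join(base_dir, platform.node(), str(pid), resource_name)


# Two spawned workers evaluate os.getpid() independently, so each gets its own
# directory. The PIDs here are made up for the example.
worker_a = build_lock_dir("/tmp/terra_locks", 1001, "gpu")
worker_b = build_lock_dir("/tmp/terra_locks", 1002, "gpu")

# The directories differ only in the PID component, so a lock file written
# under worker_a is never seen by a process scanning worker_b.
dirs_match = worker_a == worker_b
```

Here dirs_match is False: lock files written by one worker live under a directory the other worker never inspects.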
The spawn start method can be tested by adding the following to test_executor_resources.py after TestResourceProcess. However, this change currently produces a different error: the data dictionary is empty because each spawned child re-imports the test module (e.g., simple_acquire is unable to find data[name]).
# Requires: from multiprocessing import get_context
class ProcessPoolExecutorSpawn(ProcessPoolExecutor):
    # ProcessPoolExecutor that always uses the spawn start method
    def __init__(self, *args, **kwargs):
        kwargs['mp_context'] = get_context('spawn')
        super().__init__(*args, **kwargs)


class TestResourceProcessSpawn(TestResourceProcess):
    # Test for multiprocess spawn case
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.Executor = ProcessPoolExecutorSpawn
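The empty data dictionary can be reproduced outside of terra with the standard library alone. The following is a sketch under stated assumptions: data, lookup, and run are hypothetical names standing in for the test module's state and helpers, and the standard concurrent.futures.ProcessPoolExecutor is used instead of terra's executor.

```python
from concurrent.futures import ProcessPoolExecutor
from multiprocessing import get_context

# Module-level state, analogous to the test module's data dictionary.
data = {}


def lookup(name):
    # Runs in the worker. Under spawn the module is re-imported in the child,
    # so the child sees a fresh, empty `data` rather than the parent's copy.
    return data.get(name, "<missing>")


def run(start_method):
    # Populate the dictionary in the parent, then ask a worker to read it back.
    data["r1"] = "populated in parent"
    ctx = get_context(start_method)
    with ProcessPoolExecutor(max_workers=1, mp_context=ctx) as pool:
        return pool.submit(lookup, "r1").result()
```

When this is saved as a script and run("spawn") is called under an if __name__ == "__main__": guard, the worker should report the entry as missing, whereas run("fork") on a POSIX system should return the populated value, since a forked child inherits the parent's memory.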
Issue discovered by @decrispell during terra_real3d development while attempting to run multiple torch tasks, each with a single assigned GPU.