Description
Hello dStripe Development Team,
I am writing to report a critical instability issue with the publicly available dstripe.sif container.
When I run the container, even with a standard single-GPU command, a data-loading worker process crashes and produces an EOFError traceback. Despite this error, the main script continues to run and reports success at the end, so the results cannot be trusted or reproduced.
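Because the script reports success even after the traceback appears, the exit status alone cannot be used to detect the failure. A minimal sketch of a wrapper that also scans the output for the failure markers seen in this report is shown here. The function name `run_and_verify` and the marker list are illustrative assumptions, not part of dStripe, and the command passed in would be whatever invocation is normally used.

```python
import subprocess

# Strings taken from the traceback in this report. Treating them as failure
# markers is an assumption made for this sketch.
FAILURE_MARKERS = ("Traceback (most recent call last):", "EOFError")


def run_and_verify(cmd):
    """Run cmd and return (ok, matched_markers).

    ok is False if the process exits with a non-zero status or if any
    failure marker appears in its combined stdout/stderr.
    """
    proc = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so thread tracebacks are captured
        universal_newlines=True,   # decode output as text
    )
    matched = [marker for marker in FAILURE_MARKERS if marker in proc.stdout]
    ok = proc.returncode == 0 and not matched
    return ok, matched
```

A run would then count as successful only when `ok` is true, which guards against a reported success that hides a crashed worker.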
Error Details:
- EOFError raised in multiprocessing/connection.py, reached through torch/multiprocessing/reductions.py and batchgenerators/dataloading/multi_threaded_augmenter.py.
Debugging Steps Taken:
- The error occurs consistently, even when using a single GPU (-device 0).
- The issue persists across multiple datasets and on systems with very large amounts of RAM (512 GB).
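The steps above rule out GPU count and system RAM, but not per-process limits inside the container. Since the traceback fails while a file descriptor is being passed between processes (`rebuild_storage_fd`), it may be worth recording the open-file limit and the shared-memory capacity from inside the container. The following is a minimal standard-library sketch. Whether these limits are involved in this failure is an assumption that has not been verified, and `collect_ipc_limits` is a hypothetical helper name.

```python
import os
import resource  # Unix-only standard-library module
import shutil


def collect_ipc_limits(shm_path="/dev/shm"):
    """Return the open-file limits and shared-memory capacity seen by this process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    info = {
        "nofile_soft": soft,
        "nofile_hard": hard,
        "shm_total_bytes": None,
        "shm_free_bytes": None,
    }
    # /dev/shm may be absent or mounted elsewhere; leave the values as None then.
    if os.path.isdir(shm_path):
        usage = shutil.disk_usage(shm_path)
        info["shm_total_bytes"] = usage.total
        info["shm_free_bytes"] = usage.free
    return info


print(collect_ipc_limits())
```

Running this inside and outside the container would show whether the container environment imposes tighter limits than the host.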
The traceback and behavior point to an instability in the container's software environment, which is built on Python 3.7. Python 3.7 reached end-of-life in June 2023 and no longer receives bug or security fixes. This outdated environment is a likely cause of the multiprocessing instability.
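For context on what the traceback itself shows: its final frames are in multiprocessing/connection.py, where `recv_bytes` raises EOFError when the other end of a connection is closed before any data arrives. The sketch below reproduces that mechanism in isolation with the standard library. It only illustrates how the exception arises; it does not reproduce the container failure or explain why the peer closed its end. The function name is hypothetical.

```python
import multiprocessing as mp


def recv_after_peer_closed():
    """Return the name of the exception raised when reading from a
    connection whose other end has already been closed."""
    reader, writer = mp.Pipe()
    writer.close()  # simulates the peer process going away
    try:
        reader.recv_bytes(256)
    except EOFError as exc:
        return type(exc).__name__
    finally:
        reader.close()
    return None


print(recv_after_peer_closed())
```

The same exception type appears at the bottom of the traceback in this report, which is consistent with a worker-side connection closing early.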
For the benefit of the research community and to ensure reproducible science, would it be possible for you to provide an updated container built on a modern, supported Python version (e.g., Python 3.9+)?
Thank you for your work on this valuable tool.
Sean
Full traceback:
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/opt/env/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/opt/env/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/env/lib/python3.7/site-packages/batchgenerators/dataloading/multi_threaded_augmenter.py", line 102, in results_loop
    item = current_queue.get()
  File "/opt/env/lib/python3.7/multiprocessing/queues.py", line 113, in get
    return _ForkingPickler.loads(res)
  File "/opt/env/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 151, in rebuild_storage_fd
    fd = df.detach()
  File "/opt/env/lib/python3.7/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
  File "/opt/env/lib/python3.7/multiprocessing/resource_sharer.py", line 87, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
  File "/opt/env/lib/python3.7/multiprocessing/connection.py", line 498, in Client
    answer_challenge(c, authkey)
  File "/opt/env/lib/python3.7/multiprocessing/connection.py", line 742, in answer_challenge
    message = connection.recv_bytes(256)  # reject large message
  File "/opt/env/lib/python3.7/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/opt/env/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/opt/env/lib/python3.7/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError