Instability and EOFError in dStripe Container due to Outdated Python 3.7 Environment #3

@seanyang9527

Hello dStripe Development Team,

I am writing to report a critical instability issue with the publicly available dstripe.sif container.

When running the container, even with a standard single-GPU command, a data-loading worker process crashes, producing an EOFError traceback. Despite this critical error, the main script continues to run and incorrectly reports success at the end, leading to untrustworthy and non-reproducible results.
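
The "crash but still reports success" behavior matches how Python treats an uncaught exception in a background thread: the thread dies, the traceback is printed, and the main thread carries on. The following is a minimal sketch of that mechanism, not dStripe or batchgenerators code. The names `results_loop` (borrowed from the traceback only as a label) and `run_pipeline` are hypothetical stand-ins, and `threading.excepthook` requires Python 3.8 or newer, so it would not run as-is inside the current 3.7 container.

```python
import threading


def results_loop():
    # Hypothetical stand-in for a background thread that reads worker queues.
    # It fails immediately with the same exception type seen in the traceback.
    raise EOFError


def run_pipeline():
    """Run the failing thread and report what the main thread observed."""
    captured = []

    def hook(args):
        # Record the exception type instead of only printing a traceback.
        captured.append(args.exc_type)

    previous_hook = threading.excepthook  # available from Python 3.8
    threading.excepthook = hook
    try:
        worker = threading.Thread(target=results_loop)
        worker.start()
        worker.join()
    finally:
        threading.excepthook = previous_hook

    # The main thread reaches this point even though the worker thread died.
    main_continued = True
    return captured, main_continued


if __name__ == "__main__":
    errors, finished = run_pipeline()
    print("thread errors:", [e.__name__ for e in errors])
    print("main thread finished:", finished)
```

A wrapper script could use the captured list to exit with a non-zero status, so that a dead loader thread is not reported as a successful run.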

Error Details:

  • EOFError raised in multiprocessing/connection.py, reached through batchgenerators/dataloading/multi_threaded_augmenter.py (results_loop) and torch/multiprocessing/reductions.py (rebuild_storage_fd).

Debugging Steps Taken:

  • The error occurs consistently, even when using a single GPU (-device 0).
  • The issue persists across multiple datasets and on systems with very large amounts of RAM (512 GB).
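
To make reports like this easier to compare across systems, the environment details could be collected with a short script run inside the container. This is only a sketch under assumptions: the function name `collect_environment_report` is hypothetical, and the choice of values (Python version, open-file limit, free space in `/dev/shm`) is a guess at what is relevant, based on the traceback passing through file-descriptor sharing. It is not a confirmed diagnosis.

```python
import os
import resource
import shutil
import sys


def collect_environment_report(shm_path="/dev/shm"):
    """Gather a few environment values that may matter for multiprocessing."""
    # Soft and hard limits on open file descriptors for this process.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    report = {
        "python_version": "{}.{}.{}".format(*sys.version_info[:3]),
        "fd_soft_limit": soft,
        "fd_hard_limit": hard,
        "shm_free_bytes": None,  # stays None if shm_path is not present
    }
    if os.path.isdir(shm_path):
        report["shm_free_bytes"] = shutil.disk_usage(shm_path).free
    return report


if __name__ == "__main__":
    for key, value in collect_environment_report().items():
        print(f"{key}: {value}")
```

The script uses only the standard library and avoids syntax newer than Python 3.7, so it should run in the existing container as well as in an updated one.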

The traceback and behavior strongly point to an instability in the container's software environment, which is built on Python 3.7. As you know, Python 3.7 has been End-of-Life since June 2023 and no longer receives bug or security fixes. This is likely the root cause of the multiprocessing instability.

For the benefit of the research community and to ensure reproducible science, would it be possible for you to provide an updated container built on a modern, supported Python version (e.g., Python 3.9+)?

Thank you for your work on this valuable tool.

Sean

Full traceback:
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/opt/env/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/opt/env/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/env/lib/python3.7/site-packages/batchgenerators/dataloading/multi_threaded_augmenter.py", line 102, in results_loop
    item = current_queue.get()
  File "/opt/env/lib/python3.7/multiprocessing/queues.py", line 113, in get
    return _ForkingPickler.loads(res)
  File "/opt/env/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 151, in rebuild_storage_fd
    fd = df.detach()
  File "/opt/env/lib/python3.7/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
  File "/opt/env/lib/python3.7/multiprocessing/resource_sharer.py", line 87, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
  File "/opt/env/lib/python3.7/multiprocessing/connection.py", line 498, in Client
    answer_challenge(c, authkey)
  File "/opt/env/lib/python3.7/multiprocessing/connection.py", line 742, in answer_challenge
    message = connection.recv_bytes(256)         # reject large message
  File "/opt/env/lib/python3.7/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/opt/env/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/opt/env/lib/python3.7/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError
