Description
While using mpire, I occasionally ran into a peculiar situation:
- The program would hang indefinitely on the last few tasks.
- Pressing Ctrl+C failed to terminate the program.
I collected worker insights for the hanging state, which showed that some workers were not running any tasks and had been accumulating waiting_time for a long time. Meanwhile, by monitoring the main thread with debugpy, I found that it was waiting on these workers (either for them to start or to finish), and the worker processes did exist.
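For reference, this is roughly how the insights were collected (a minimal sketch; it assumes mpire's enable_insights option and get_insights() method, and the worker function and sizes are purely illustrative):

```python
from mpire import WorkerPool

def work(x):
    return x * x

# enable_insights=True asks mpire to record per-worker statistics such as
# start-up, waiting and working times.
with WorkerPool(n_jobs=4, enable_insights=True) as pool:
    pool.map(work, range(1000))
    insights = pool.get_insights()  # dict with the per-worker timings
    print(insights)
```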
I then tried to attach debugpy to the hanging worker processes, but the attach failed (I later learned that a blocking call was preventing the connection). Eventually I captured their stack traces with py-spy and found that they were stuck in _flush_std_streams, which multiprocessing calls when starting a process:
```python
def _flush_std_streams():
    try:
        sys.stdout.flush()
    except (AttributeError, ValueError):
        pass
    try:
        sys.stderr.flush()
    except (AttributeError, ValueError):
        pass
```

(located in multiprocessing.util)
I inspected sys.stdout and found that it had never been replaced with another stream, so a hang inside the flush call seemed unlikely.
Later, I found a related issue on Python's GitHub: python/cpython#91776. It is still unresolved and is believed to be caused by the streams' internal locks being held during writes. In some cases a user may have other threads running while a WorkerPool executes, and those threads may be writing to sys.stdout or sys.stderr. If a fork happens at that moment, the forked child, controlled by mpire and multiprocessing, executes _flush_std_streams; if the internal lock of stdout or stderr was held by a writing thread at the time of the fork, the flush waits forever and the child hangs.
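To make the failure mode concrete, here is a minimal sketch of the kind of setup that can trigger it (the worker function, the thread body and the sizes are purely illustrative, and it assumes the default fork start method on Linux):

```python
import sys
import threading
from mpire import WorkerPool

def writer():
    # Background thread that keeps writing to stdout. If a fork happens while
    # this thread holds stdout's internal lock, the forked child can block
    # forever in multiprocessing's _flush_std_streams.
    while True:
        sys.stdout.write("logging from another thread\n")

threading.Thread(target=writer, daemon=True).start()

def work(x):
    return x * x

# With the fork start method, a run like this can occasionally hang on the
# last few tasks, as described above.
with WorkerPool(n_jobs=4) as pool:
    pool.map(work, range(1000))
```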
Even without extra threads there is a small probability of hanging (much lower; your code already hints at this where it waits on the task_queue with a timeout), possibly due to operating-system-level file descriptor delays; the exact cause is unknown.
Although this issue is not directly caused by mpire, fork is a primary start method for mpire, and writing to stdout/stderr from other threads while a WorkerPool is running is a common pattern, so I'm recording it here in case someone needs it:
Reopening stdout and stderr in the forked child:

```python
import os
import sys

def reopen_std_streams():
    # Give the forked child fresh stream objects on file descriptors 1 and 2,
    # so it never touches the parent's possibly locked stdout/stderr.
    sys.stdout = os.fdopen(1, "w")
    sys.stderr = os.fdopen(2, "w")

os.register_at_fork(after_in_child=reopen_std_streams)
```

(This only works on Linux & Python >= 3.7, where os.register_at_fork is available.)
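Note that the streams created by os.fdopen share the underlying file descriptors 1 and 2 with the originals, so the child's output still goes to the same place; the point is only that the child gets stream objects whose internal locks are freshly created and therefore cannot be stuck in the locked state inherited from the parent.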
Another approach would be for mpire to take a lock around the fork in _start_worker that other logging threads could also acquire, so that no thread is mid-write to stdout/stderr at the moment of fork, together with documentation of the caveat.
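A rough sketch of that idea from the user side, to illustrate the locking discipline (stdio_lock and safe_log are hypothetical names; a built-in version would presumably take the same kind of lock inside _start_worker, which would be much finer-grained than locking around the whole map call as done here):

```python
import threading
from mpire import WorkerPool

# Hypothetical lock shared by every thread that writes to stdout/stderr and
# by the code path that forks the workers.
stdio_lock = threading.Lock()

def safe_log(msg):
    # Logging threads write through this wrapper, so they never hold the
    # stream's internal lock without also holding stdio_lock.
    with stdio_lock:
        print(msg, flush=True)

def work(x):
    return x * x

# mpire typically forks its workers when the map call starts, so the lock is
# held across the whole call here; a lock taken only around _start_worker
# inside mpire would avoid blocking the logging threads for the entire map.
with stdio_lock, WorkerPool(n_jobs=4) as pool:
    results = pool.map(work, range(100))
```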
If you could look into this issue further, I would greatly appreciate it.