This repository was archived by the owner on Jul 16, 2021. It is now read-only.

[bug] Dask worker dies during dask-xgboost classifier training : test_core.py::test_classifier #68

@pradghos

Description


A Dask worker dies during dask-xgboost classifier training. This is observed while running test_core.py::test_classifier.

Configuration used:

Dask Version: 2.9.2
Distributed Version: 2.9.3
XGBoost Version: 0.90
Dask-XGBoost Version: 0.1.9
OS-release : 4.14.0-115.16.1.el7a.ppc64le
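For reference, the versions above can be collected with a short stdlib-only snippet. The module names used here are assumptions about how each package is importable (e.g. dask-xgboost installs as `dask_xgboost`):

```python
# Hedged sketch: report the installed version of each package from the
# configuration list, skipping any that are not importable.
import importlib

def get_version(module_name):
    """Return module.__version__ if the module is importable, else None."""
    try:
        mod = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(mod, "__version__", None)

for name in ["dask", "distributed", "xgboost", "dask_xgboost"]:
    print(name, get_version(name))
```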

Steps to reproduce:

  1. The test creates a cluster with two workers:
> /mnt/pai/home/pradghos/dask-xgboost/dask_xgboost/tests/test_core.py(38)test_classifier()
-> with cluster() as (s, [a, b]):
(Pdb) n
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO -   Scheduler at:     tcp://127.0.0.1:45767
distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:40743
distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:40743
distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:45767
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -                Memory:                  612.37 GB
distributed.worker - INFO -       Local Directory: /mnt/pai/home/pradghos/dask-xgboost/dask_xgboost/tests/_test_worker-c6ea91c7-746e-4c7a-9c13-f5afcd244966/worker-ebbqtfdu
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:33373
distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:33373
distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:45767
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -                Memory:                  612.37 GB
distributed.worker - INFO -       Local Directory: /mnt/pai/home/pradghos/dask-xgboost/dask_xgboost/tests/_test_worker-050815d2-54f6-4edc-9a03-dd075213449d/worker-i1yr8xvc
distributed.worker - INFO - -------------------------------------------------
distributed.scheduler - INFO - Register worker <Worker 'tcp://127.0.0.1:40743', name: tcp://127.0.0.1:40743, memory: 0, processing: 0>
distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:40743
distributed.core - INFO - Starting established connection
distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:45767
distributed.worker - INFO - -------------------------------------------------
distributed.scheduler - INFO - Register worker <Worker 'tcp://127.0.0.1:33373', name: tcp://127.0.0.1:33373, memory: 0, processing: 0>
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:33373
distributed.core - INFO - Starting established connection
distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:45767
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection

  2. After a couple of steps, fit is called on the dask-xgboost classifier:
-> a.fit(X2, y2)
(Pdb) distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:40743
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:40743
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:40743
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:40743
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:40743
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:40743
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33373

distributed.worker - DEBUG - Execute key: array-original-8d35e675b41aad38dc334c7f79ea1982 worker: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Send compute response to scheduler: array-original-8d35e675b41aad38dc334c7f79ea1982, {'op': 'task-finished', 'status': 'OK', 'nbytes': 80, 'type': <class 'numpy.ndarray'>, 'start': 1580372953.2651937, 'stop': 1580372953.265216, 'thread': 140735736705456, 'key': 'array-original-8d35e675b41aad38dc334c7f79ea1982'}
distributed.worker - DEBUG - Execute key: ('array-8d35e675b41aad38dc334c7f79ea1982', 0) worker: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Send compute response to scheduler: ('array-8d35e675b41aad38dc334c7f79ea1982', 0), {'op': 'task-finished', 'status': 'OK', 'nbytes': 40, 'type': <class 'numpy.ndarray'>, 'start': 1580372953.2696354, 'stop': 1580372953.2696435, 'thread': 140735736705456, 'key': "('array-8d35e675b41aad38dc334c7f79ea1982', 0)"}
distributed.worker - DEBUG - Execute key: ('array-8d35e675b41aad38dc334c7f79ea1982', 1) worker: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Send compute response to scheduler: ('array-8d35e675b41aad38dc334c7f79ea1982', 1), {'op': 'task-finished', 'status': 'OK', 'nbytes': 40, 'type': <class 'numpy.ndarray'>, 'start': 1580372953.2705007, 'stop': 1580372953.2705073, 'thread': 140735736705456, 'key': "('array-8d35e675b41aad38dc334c7f79ea1982', 1)"}
distributed.worker - DEBUG - Deleted 1 keys
distributed.worker - DEBUG - Execute key: ('unique_internal-getitem-a6b7823aa95705e499984f972c2b58b3', 0) worker: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Send compute response to scheduler: ('unique_internal-getitem-a6b7823aa95705e499984f972c2b58b3', 0), {'op': 'task-finished', 'status': 'OK', 'nbytes': 16, 'type': <class 'numpy.ndarray'>, 'start': 1580372953.2753158, 'stop': 1580372953.275466, 'thread': 140735736705456, 'key': "('unique_internal-getitem-a6b7823aa95705e499984f972c2b58b3', 0)"}
distributed.worker - DEBUG - Execute key: ('unique_internal-getitem-a6b7823aa95705e499984f972c2b58b3', 1) worker: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Send compute response to scheduler: ('unique_internal-getitem-a6b7823aa95705e499984f972c2b58b3', 1), {'op': 'task-finished', 'status': 'OK', 'nbytes': 16, 'type': <class 'numpy.ndarray'>, 'start': 1580372953.2762377, 'stop': 1580372953.2763371, 'thread': 140735736705456, 'key': "('unique_internal-getitem-a6b7823aa95705e499984f972c2b58b3', 1)"}
distributed.worker - DEBUG - Deleted 1 keys
distributed.worker - DEBUG - Deleted 1 keys
distributed.worker - DEBUG - Execute key: ('getitem-a6b7823aa95705e499984f972c2b58b3', 0) worker: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Send compute response to scheduler: ('getitem-a6b7823aa95705e499984f972c2b58b3', 0), {'op': 'task-finished', 'status': 'OK', 'nbytes': 16, 'type': <class 'numpy.ndarray'>, 'start': 1580372953.2805014, 'stop': 1580372953.2805073, 'thread': 140735736705456, 'key': "('getitem-a6b7823aa95705e499984f972c2b58b3', 0)"}
distributed.worker - DEBUG - Execute key: ('getitem-a6b7823aa95705e499984f972c2b58b3', 1) worker: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Send compute response to scheduler: ('getitem-a6b7823aa95705e499984f972c2b58b3', 1), {'op': 'task-finished', 'status': 'OK', 'nbytes': 16, 'type': <class 'numpy.ndarray'>, 'start': 1580372953.2813187, 'stop': 1580372953.2813244, 'thread': 140735736705456, 'key': "('getitem-a6b7823aa95705e499984f972c2b58b3', 1)"}
distributed.worker - DEBUG - Deleted 1 keys
distributed.worker - DEBUG - Deleted 1 keys

The Dask worker dies:

distributed.worker - DEBUG - Deleted 1 keys
distributed.worker - DEBUG - Deleted 1 keys
distributed.worker - DEBUG - Deleted 1 keys
/mnt/pai/home/pradghos/anaconda3/envs/gdf37/lib/python3.7/site-packages/dask/dataframe/_compat.py:19: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
  import pandas.util.testing as tm  # noqa: F401
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:40743
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Heartbeat skipped: channel busy
distributed.worker - DEBUG - Heartbeat skipped: channel busy
distributed.worker - INFO - Run out-of-band function 'start_tracker'
distributed.worker - DEBUG - Deleted 1 keys
distributed.worker - DEBUG - Deleted 1 keys
distributed.worker - DEBUG - Deleted 1 keys
/mnt/pai/home/pradghos/anaconda3/envs/gdf37/lib/python3.7/site-packages/dask/dataframe/_compat.py:19: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
  import pandas.util.testing as tm  # noqa: F401
/mnt/pai/home/pradghos/anaconda3/envs/gdf37/lib/python3.7/site-packages/dask/dataframe/_compat.py:19: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
  import pandas.util.testing as tm  # noqa: F401
distributed.scheduler - INFO - Remove worker <Worker 'tcp://127.0.0.1:40743', name: tcp://127.0.0.1:40743, memory: 1, processing: 1>
distributed.core - INFO - Removing comms to tcp://127.0.0.1:40743     ===========================>>> One worker dies 
/mnt/pai/home/pradghos/anaconda3/envs/gdf37/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown
  len(cache))
distributed.worker - DEBUG - Execute key: train_part-e17e49e3769aaa4870dc8cc01a1e015e worker: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33373
distributed.worker - DEBUG - future state: train_part-e17e49e3769aaa4870dc8cc01a1e015e - RUNNING   ===  One worker is running infinitely 
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33373
distributed.worker - DEBUG - future state: train_part-e17e49e3769aaa4870dc8cc01a1e015e - RUNNING
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33373
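When scanning longer runs of these logs, it can help to extract the scheduler's worker-removal events programmatically. A minimal stdlib helper (the regex is an assumption based on the "Remove worker" line format seen above):

```python
# Hedged helper: scan distributed scheduler log text for worker-removal
# events and return the removed worker addresses.
import re

def removed_workers(log_text):
    """Return the worker addresses mentioned in 'Remove worker' lines."""
    pattern = re.compile(r"Remove worker <Worker '([^']+)'")
    return pattern.findall(log_text)

log = ("distributed.scheduler - INFO - Remove worker "
       "<Worker 'tcp://127.0.0.1:40743', name: tcp://127.0.0.1:40743, "
       "memory: 1, processing: 1>")
print(removed_workers(log))  # → ['tcp://127.0.0.1:40743']
```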

It is not clear why the Dask worker dies at that point.
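One plausible reading of the hang (an assumption, not confirmed by these logs): dask-xgboost starts a Rabit tracker (the `start_tracker` out-of-band function above) that expects every worker to join the collective training step; if one worker dies before joining, the surviving worker's train_part task waits for a peer that never arrives. The effect resembles an unfilled barrier, as this toy stdlib sketch illustrates:

```python
# Illustrative sketch only (not dask-xgboost internals): a collective step
# that expects 2 participants blocks when only 1 shows up.
import threading

barrier = threading.Barrier(2)  # tracker expects 2 workers

def worker_train():
    try:
        # Real collective training has no timeout, so it would hang forever;
        # a short timeout is used here just to demonstrate the stall.
        barrier.wait(timeout=0.2)
        return "trained"
    except threading.BrokenBarrierError:
        return "stuck: peer never arrived"

# Only one of the two expected workers joins the collective step.
result = worker_train()
print(result)  # prints: stuck: peer never arrived
```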

Thanks!
Pradipta
