This repository was archived by the owner on Jul 16, 2020. It is now read-only.

XGBoost with GPU hangs on Rabit initialization #61

@trams

Hello nice people,

I came across this article, https://medium.com/rapids-ai/nvidia-gpus-and-apache-spark-one-step-closer-2d99e37ac8fd, which is why I am creating this issue.
I am very excited to start using it. It took some time to learn which versions are available: 1.0.0-Beta and 1.0.0-Beta2. I picked the latter.

When I launched distributed training without GPUs (tree method hist) to make sure CPU-based training works, I noticed that it hung. No iterations were completed, and I saw the following:

2019-10-17 00:22:29 INFO  ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger - Traceback (most recent call last):
2019-10-17 00:22:29 INFO  ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger -   File "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
2019-10-17 00:22:29 INFO  ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger -     self.run()
2019-10-17 00:22:29 INFO  ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger -   File "/usr/lib64/python2.7/threading.py", line 765, in run
2019-10-17 00:22:29 INFO  ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger -     self.__target(*self.__args, **self.__kwargs)
2019-10-17 00:22:29 INFO  ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger -   File "/hdfs/uuid/15b919c9-64e8-43cc-a842-bc62d81ea28d/yarn/data/usercache/o.pryimak/appcache/application_1569890150796_1745812/container_e139_1569890150796_1745812_01_000001/tmp/tracker2210854510286838443.py", line 324, in run
2019-10-17 00:22:29 INFO  ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger -     self.accept_slaves(nslave)
2019-10-17 00:22:29 INFO  ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger -   File "/hdfs/uuid/15b919c9-64e8-43cc-a842-bc62d81ea28d/yarn/data/usercache/o.pryimak/appcache/application_1569890150796_1745812/container_e139_1569890150796_1745812_01_000001/tmp/tracker2210854510286838443.py", line 268, in accept_slaves
2019-10-17 00:22:29 INFO  ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger -     s = SlaveEntry(fd, s_addr)
2019-10-17 00:22:29 INFO  ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger -   File "/hdfs/uuid/15b919c9-64e8-43cc-a842-bc62d81ea28d/yarn/data/usercache/o.pryimak/appcache/application_1569890150796_1745812/container_e139_1569890150796_1745812_01_000001/tmp/tracker2210854510286838443.py", line 64, in __init__
2019-10-17 00:22:29 INFO  ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger -     assert magic == kMagic, 'invalid magic number=%d from %s' % (magic, self.host)
2019-10-17 00:22:29 INFO  ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger - AssertionError: invalid magic number=542393671 from 172.28.42.144
2019-10-17 00:22:29 INFO  ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger - 
2019-10-17 00:22:29 INFO  ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger - Tracker Process ends with exit code 0
2019-10-17 00:22:29 INFO  ml.dmlc.xgboost4j.java.RabitTracker - Tracker Process ends with exit code 0
2019-10-17 00:22:29 INFO  XGBoostSpark - Rabit returns with exit code 0
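
One observation that may help with debugging: the "invalid magic number" in the assertion can be decoded back into the raw bytes the tracker received. The sketch below assumes the tracker reads the magic as a 4-byte little-endian signed integer (typical on x86 hosts); the helper name `decode_magic` is hypothetical and not part of the tracker script.

```python
import struct


def decode_magic(value: int) -> bytes:
    """Return the 4 raw bytes behind a magic number.

    Assumption: the value was read as a little-endian signed 32-bit int.
    """
    return struct.pack("<i", value)


if __name__ == "__main__":
    # The value reported in the AssertionError above.
    raw = decode_magic(542393671)
    print(raw)  # b'GET '
```

The bytes spell `GET ` (with a trailing space), which looks like the start of an HTTP request line. If the byte-order assumption holds, this suggests that something other than a Rabit worker (for example an HTTP client or a port probe) connected to the tracker port from 172.28.42.144. This is an inference from the log, not a confirmed diagnosis.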

Could you point me to your source code repo and tell me which version (git SHA-1) you used to build 1.0.0-Beta, so I can try to troubleshoot?

Also, any pointers on how to work around this are welcome.
Can I enable the Scala-based tracker? Do you know how?
