This repository was archived by the owner on Jul 16, 2020. It is now read-only.
XGBoost with GPU hangs on Rabit initialization #61
Open
Description
Hello nice people,
I came across this article: https://medium.com/rapids-ai/nvidia-gpus-and-apache-spark-one-step-closer-2d99e37ac8fd (which is why I am opening this issue).
I am very excited to start using it. It took some time to find out which versions are available: 1.0.0-Beta and 1.0.0-Beta2. I picked the latter.
When I launched distributed training without GPUs (tree method hist) to make sure CPU-based training works, I noticed that it hung. No iterations had completed, and the log showed:
2019-10-17 00:22:29 INFO ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger - Traceback (most recent call last):
2019-10-17 00:22:29 INFO ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger - File "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
2019-10-17 00:22:29 INFO ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger - self.run()
2019-10-17 00:22:29 INFO ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger - File "/usr/lib64/python2.7/threading.py", line 765, in run
2019-10-17 00:22:29 INFO ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger - self.__target(*self.__args, **self.__kwargs)
2019-10-17 00:22:29 INFO ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger - File "/hdfs/uuid/15b919c9-64e8-43cc-a842-bc62d81ea28d/yarn/data/usercache/o.pryimak/appcache/application_1569890150796_1745812/container_e139_1569890150796_1745812_01_000001/tmp/tracker2210854510286838443.py", line 324, in run
2019-10-17 00:22:29 INFO ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger - self.accept_slaves(nslave)
2019-10-17 00:22:29 INFO ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger - File "/hdfs/uuid/15b919c9-64e8-43cc-a842-bc62d81ea28d/yarn/data/usercache/o.pryimak/appcache/application_1569890150796_1745812/container_e139_1569890150796_1745812_01_000001/tmp/tracker2210854510286838443.py", line 268, in accept_slaves
2019-10-17 00:22:29 INFO ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger - s = SlaveEntry(fd, s_addr)
2019-10-17 00:22:29 INFO ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger - File "/hdfs/uuid/15b919c9-64e8-43cc-a842-bc62d81ea28d/yarn/data/usercache/o.pryimak/appcache/application_1569890150796_1745812/container_e139_1569890150796_1745812_01_000001/tmp/tracker2210854510286838443.py", line 64, in __init__
2019-10-17 00:22:29 INFO ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger - assert magic == kMagic, 'invalid magic number=%d from %s' % (magic, self.host)
2019-10-17 00:22:29 INFO ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger - AssertionError: invalid magic number=542393671 from 172.28.42.144
2019-10-17 00:22:29 INFO ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger -
2019-10-17 00:22:29 INFO ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger - Tracker Process ends with exit code 0
2019-10-17 00:22:29 INFO ml.dmlc.xgboost4j.java.RabitTracker - Tracker Process ends with exit code 0
2019-10-17 00:22:29 INFO XGBoostSpark - Rabit returns with exit code 0
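One observation on the assertion above: the rejected value 542393671 is 0x20544547, which is the ASCII text "GET " when laid out as a little-endian 32-bit integer. That would be consistent with something other than a Rabit worker (for example an HTTP client, health check, or port scanner at 172.28.42.144) connecting to the tracker port, though this is only a hypothesis. The sketch below decodes the number; it assumes the tracker reads the handshake as a 4-byte integer in little-endian order (the native order on x86), and the expected magic value 0xff99 is an assumption taken from the upstream dmlc tracker script, not confirmed for this build.

```python
import struct

# Assumption: handshake value expected by the upstream dmlc tracker script.
K_MAGIC = 0xFF99


def decode_magic(magic: int) -> bytes:
    """Return the 4 raw bytes behind a 32-bit little-endian integer."""
    return struct.pack("<i", magic)


if __name__ == "__main__":
    received = 542393671  # value reported in the AssertionError
    raw = decode_magic(received)
    print(hex(received), raw)
    if received != K_MAGIC and raw.isalpha() is False and raw.strip().isalpha():
        # Printable text instead of the magic number suggests a non-Rabit client.
        print("Looks like text from another protocol:", raw.decode("ascii"))
```

If the bytes decode to the start of an HTTP request, checking what else on the cluster connects to the tracker's host and port may be a useful next step.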
Could you point me to your source code repo and tell me which version (git SHA-1) you used to build 1.0.0-Beta, so that I can try to troubleshoot?
Any pointers on how to work around this are also welcome.
Can I enable the Scala-based tracker? If so, how?