I didn't change the code except to replace the TPU with multiple GPUs, and now my training is stuck at a very high loss. I suspect this is caused by a large learning rate after warm-start, since the loss is normal before iteration 6255, which is the number of warm-start iterations.
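To sanity-check this, I want to print the learning rate right before and right after the warm-start boundary. This is only a rough sketch with a placeholder warmup-then-cosine schedule; `base_lr` and `total_steps` are made-up values, not the actual flags from the repo:

```python
import numpy as np

# Placeholder schedule for a sanity check, NOT the one used in the repo:
# linear warmup to base_lr over warmup_steps, then cosine decay.
base_lr = 0.016          # assumed value, not from the repo config
warmup_steps = 6255      # the warm-start iteration count seen in the log below
total_steps = 100000     # assumed value, not from the repo config

def lr_at(step):
    """Learning rate of the placeholder schedule at a given global step."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + np.cos(np.pi * progress))

# Compare the LR just before and just after the loss jump in the log below.
for step in (6208, 6272):
    print(step, lr_at(step))
```

If the real schedule behaves like this, the learning rate should not jump abruptly at step 6255, which would suggest the problem is something other than the schedule itself.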
Here is my training log:
I0618 17:29:12.284162 140720170149632 tf_logging.py:115] global_step/sec: 1.26958
I0618 17:29:12.284616 140720170149632 tf_logging.py:115] loss = 6.7994366, step = 5760 (50.410 sec)
I0618 17:30:03.805228 140720170149632 tf_logging.py:115] global_step/sec: 1.24221
I0618 17:30:03.805763 140720170149632 tf_logging.py:115] loss = 6.8188353, step = 5824 (51.521 sec)
I0618 17:30:54.878273 140720170149632 tf_logging.py:115] global_step/sec: 1.25311
I0618 17:30:54.878723 140720170149632 tf_logging.py:115] loss = 6.8027916, step = 5888 (51.073 sec)
I0618 17:31:46.115418 140720170149632 tf_logging.py:115] global_step/sec: 1.24909
I0618 17:31:46.144254 140720170149632 tf_logging.py:115] loss = 6.8010216, step = 5952 (51.266 sec)
I0618 17:32:37.585529 140720170149632 tf_logging.py:115] global_step/sec: 1.24344
I0618 17:32:37.585925 140720170149632 tf_logging.py:115] loss = 6.789137, step = 6016 (51.442 sec)
I0618 17:33:28.797896 140720170149632 tf_logging.py:115] global_step/sec: 1.2497
I0618 17:33:28.798456 140720170149632 tf_logging.py:115] loss = 6.799903, step = 6080 (51.213 sec)
I0618 17:34:19.681088 140720170149632 tf_logging.py:115] global_step/sec: 1.25778
I0618 17:34:19.681564 140720170149632 tf_logging.py:115] loss = 6.803883, step = 6144 (50.883 sec)
I0618 17:35:09.831330 140720170149632 tf_logging.py:115] global_step/sec: 1.27617
I0618 17:35:09.831943 140720170149632 tf_logging.py:115] loss = 6.7922, step = 6208 (50.150 sec)
I0618 17:35:46.901006 140720170149632 tf_logging.py:115] Saving checkpoints for 6255 into /mnt/cephfs_wj/cv/wangxing/tmp/model-single-path-search/lambda-val-0.020/model.ckpt.
I0618 17:36:07.512706 140720170149632 tf_logging.py:115] global_step/sec: 1.10954
I0618 17:36:07.513106 140720170149632 tf_logging.py:115] loss = 94.23678, step = 6272 (57.681 sec)
I0618 17:36:57.975293 140720170149632 tf_logging.py:115] global_step/sec: 1.26827
I0618 17:36:57.975636 140720170149632 tf_logging.py:115] loss = 84.60893, step = 6336 (50.463 sec)
I0618 17:37:49.209366 140720170149632 tf_logging.py:115] global_step/sec: 1.24917
I0618 17:37:49.210039 140720170149632 tf_logging.py:115] loss = 83.81077, step = 6400 (51.234 sec)
I0618 17:38:40.446595 140720170149632 tf_logging.py:115] global_step/sec: 1.24909
I0618 17:38:40.447212 140720170149632 tf_logging.py:115] loss = 83.7096, step = 6464 (51.237 sec)
I0618 17:39:31.800470 140720170149632 tf_logging.py:115] global_step/sec: 1.24625
I0618 17:39:31.800811 140720170149632 tf_logging.py:115] loss = 75.41687, step = 6528 (51.354 sec)
I0618 17:40:22.979326 140720170149632 tf_logging.py:115] global_step/sec: 1.25052
I0618 17:40:22.979668 140720170149632 tf_logging.py:115] loss = 75.42241, step = 6592 (51.179 sec)
I0618 17:41:14.112971 140720170149632 tf_logging.py:115] global_step/sec: 1.25162
I0618 17:41:14.137188 140720170149632 tf_logging.py:115] loss = 75.344826, step = 6656 (51.157 sec)
I0618 17:42:05.177355 140720170149632 tf_logging.py:115] global_step/sec: 1.25332
I0618 17:42:05.177694 140720170149632 tf_logging.py:115] loss = 75.358315, step = 6720 (51.041 sec)
I0618 17:42:56.014090 140720170149632 tf_logging.py:115] global_step/sec: 1.25893
I0618 17:42:56.014433 140720170149632 tf_logging.py:115] loss = 75.37303, step = 6784 (50.837 sec)
I0618 17:43:47.115759 140720170149632 tf_logging.py:115] global_step/sec: 1.25241
I0618 17:43:47.116162 140720170149632 tf_logging.py:115] loss = 75.35231, step = 6848 (51.102 sec)
I0618 17:44:38.047000 140720170149632 tf_logging.py:115] global_step/sec: 1.2566
I0618 17:44:38.047545 140720170149632 tf_logging.py:115] loss = 75.34932, step = 6912 (50.931 sec)