Has anyone reproduced the results reported in the paper with multiple GPUs? #9

@xingwangsfu

I changed nothing in the code except replacing the TPU with multiple GPUs, and my training is now stuck at a very high loss. I suspect this is caused by a large learning rate after the warm-start phase: the loss is normal (about 6.8) up to iteration 6255, which is the number of warm-start iterations, then jumps to about 94 right after the checkpoint at step 6255 and plateaus around 75.
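
One way to sanity-check the learning-rate hypothesis is to compute the rate a schedule would reach at the end of a ramp for a given global batch size. The following is a minimal sketch of a generic linear ramp combined with the linear batch-size scaling heuristic. It is not the schedule used in this repository, and it assumes that the warm-start phase coincides with a learning-rate ramp, which I have not verified. `scaled_peak_lr`, `lr_at_step`, `BASE_LR`, `ref_batch`, and the batch sizes are hypothetical names and values chosen for illustration; only the step count 6255 comes from my run.

```python
def scaled_peak_lr(base_lr, global_batch, ref_batch=256):
    """Linear scaling heuristic: peak LR grows proportionally with the global batch size."""
    return base_lr * global_batch / ref_batch


def lr_at_step(step, peak_lr, ramp_steps):
    """Linear ramp up to peak_lr over ramp_steps, then constant at peak_lr."""
    if step < ramp_steps:
        return peak_lr * (step + 1) / ramp_steps
    return peak_lr


RAMP_STEPS = 6255  # taken from my run; all other values are hypothetical
BASE_LR = 0.016    # hypothetical base rate for a reference batch of 256

for name, batch in [("large global batch", 1024), ("smaller global batch", 256)]:
    peak = scaled_peak_lr(BASE_LR, batch)
    # Compare the rate at the first step with the rate once the ramp has finished.
    print(name, batch, lr_at_step(0, peak, RAMP_STEPS), lr_at_step(RAMP_STEPS, peak, RAMP_STEPS))
```

If the global batch size on the GPUs is smaller than on the TPU while the peak rate is left unchanged, the effective rate per sample after the ramp would be larger than intended, which is the kind of mismatch I am suspecting here.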

Here is my training log:
I0618 17:29:12.284162 140720170149632 tf_logging.py:115] global_step/sec: 1.26958
I0618 17:29:12.284616 140720170149632 tf_logging.py:115] loss = 6.7994366, step = 5760 (50.410 sec)
I0618 17:30:03.805228 140720170149632 tf_logging.py:115] global_step/sec: 1.24221
I0618 17:30:03.805763 140720170149632 tf_logging.py:115] loss = 6.8188353, step = 5824 (51.521 sec)
I0618 17:30:54.878273 140720170149632 tf_logging.py:115] global_step/sec: 1.25311
I0618 17:30:54.878723 140720170149632 tf_logging.py:115] loss = 6.8027916, step = 5888 (51.073 sec)
I0618 17:31:46.115418 140720170149632 tf_logging.py:115] global_step/sec: 1.24909
I0618 17:31:46.144254 140720170149632 tf_logging.py:115] loss = 6.8010216, step = 5952 (51.266 sec)
I0618 17:32:37.585529 140720170149632 tf_logging.py:115] global_step/sec: 1.24344
I0618 17:32:37.585925 140720170149632 tf_logging.py:115] loss = 6.789137, step = 6016 (51.442 sec)
I0618 17:33:28.797896 140720170149632 tf_logging.py:115] global_step/sec: 1.2497
I0618 17:33:28.798456 140720170149632 tf_logging.py:115] loss = 6.799903, step = 6080 (51.213 sec)
I0618 17:34:19.681088 140720170149632 tf_logging.py:115] global_step/sec: 1.25778
I0618 17:34:19.681564 140720170149632 tf_logging.py:115] loss = 6.803883, step = 6144 (50.883 sec)
I0618 17:35:09.831330 140720170149632 tf_logging.py:115] global_step/sec: 1.27617
I0618 17:35:09.831943 140720170149632 tf_logging.py:115] loss = 6.7922, step = 6208 (50.150 sec)
I0618 17:35:46.901006 140720170149632 tf_logging.py:115] Saving checkpoints for 6255 into /mnt/cephfs_wj/cv/wangxing/tmp/model-single-path-search/lambda-val-0.020/model.ckpt.
I0618 17:36:07.512706 140720170149632 tf_logging.py:115] global_step/sec: 1.10954
I0618 17:36:07.513106 140720170149632 tf_logging.py:115] loss = 94.23678, step = 6272 (57.681 sec)
I0618 17:36:57.975293 140720170149632 tf_logging.py:115] global_step/sec: 1.26827
I0618 17:36:57.975636 140720170149632 tf_logging.py:115] loss = 84.60893, step = 6336 (50.463 sec)
I0618 17:37:49.209366 140720170149632 tf_logging.py:115] global_step/sec: 1.24917
I0618 17:37:49.210039 140720170149632 tf_logging.py:115] loss = 83.81077, step = 6400 (51.234 sec)
I0618 17:38:40.446595 140720170149632 tf_logging.py:115] global_step/sec: 1.24909
I0618 17:38:40.447212 140720170149632 tf_logging.py:115] loss = 83.7096, step = 6464 (51.237 sec)
I0618 17:39:31.800470 140720170149632 tf_logging.py:115] global_step/sec: 1.24625
I0618 17:39:31.800811 140720170149632 tf_logging.py:115] loss = 75.41687, step = 6528 (51.354 sec)
I0618 17:40:22.979326 140720170149632 tf_logging.py:115] global_step/sec: 1.25052
I0618 17:40:22.979668 140720170149632 tf_logging.py:115] loss = 75.42241, step = 6592 (51.179 sec)
I0618 17:41:14.112971 140720170149632 tf_logging.py:115] global_step/sec: 1.25162
I0618 17:41:14.137188 140720170149632 tf_logging.py:115] loss = 75.344826, step = 6656 (51.157 sec)
I0618 17:42:05.177355 140720170149632 tf_logging.py:115] global_step/sec: 1.25332
I0618 17:42:05.177694 140720170149632 tf_logging.py:115] loss = 75.358315, step = 6720 (51.041 sec)
I0618 17:42:56.014090 140720170149632 tf_logging.py:115] global_step/sec: 1.25893
I0618 17:42:56.014433 140720170149632 tf_logging.py:115] loss = 75.37303, step = 6784 (50.837 sec)
I0618 17:43:47.115759 140720170149632 tf_logging.py:115] global_step/sec: 1.25241
I0618 17:43:47.116162 140720170149632 tf_logging.py:115] loss = 75.35231, step = 6848 (51.102 sec)
I0618 17:44:38.047000 140720170149632 tf_logging.py:115] global_step/sec: 1.2566
I0618 17:44:38.047545 140720170149632 tf_logging.py:115] loss = 75.34932, step = 6912 (50.931 sec)
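
To pinpoint where the loss jumps in a longer log, a small parser can help. This is a minimal sketch that assumes the `loss = ..., step = ...` line format shown in the log above; `parse_losses`, `first_jump`, and the 5x threshold are illustrative choices of mine, not part of the repository or of TensorFlow.

```python
import re

# Matches the "loss = <float>, step = <int>" fragment of each logging line.
LOSS_RE = re.compile(r"loss = ([0-9.eE+-]+), step = (\d+)")


def parse_losses(log_text):
    """Return (step, loss) pairs in the order they appear in the log."""
    return [(int(m.group(2)), float(m.group(1))) for m in LOSS_RE.finditer(log_text)]


def first_jump(pairs, factor=5.0):
    """Return the first (step, loss) whose loss exceeds factor times the previous loss, else None."""
    for (_, prev_loss), (step, loss) in zip(pairs, pairs[1:]):
        if loss > factor * prev_loss:
            return step, loss
    return None


# Three lines copied from the log above, around the checkpoint at step 6255.
sample = """\
I0618 17:35:09.831943 140720170149632 tf_logging.py:115] loss = 6.7922, step = 6208 (50.150 sec)
I0618 17:36:07.513106 140720170149632 tf_logging.py:115] loss = 94.23678, step = 6272 (57.681 sec)
I0618 17:36:57.975636 140720170149632 tf_logging.py:115] loss = 84.60893, step = 6336 (50.463 sec)
"""

pairs = parse_losses(sample)
print(first_jump(pairs))  # (6272, 94.23678)
```

On the log above this reports step 6272, the first logged step after the checkpoint at 6255, which is consistent with the jump happening right when the warm-start phase ends.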