I tried to use TUPE in the NMT encoder, but got a loss-exploding error. Does TUPE need some fix to be used for NMT?
The error looks like this:
2021-01-02 14:08:12 | INFO | train_inner | epoch 001: 12110 / 53999 loss=4.403, nll_loss=2.737, ppl=6.67, wps=46907.8, ups=1.07, wpb=43719.3, bsz=1694.4, num_updates=12100, lr=0.000325246, gnorm=0.245, loss_scale=4, train_wall=93, wall=0
2021-01-02 14:09:40 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 2.0
2021-01-02 14:09:41 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 1.0
2021-01-02 14:09:42 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 0.5
2021-01-02 14:09:43 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 0.25
2021-01-02 14:09:44 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 0.125
2021-01-02 14:09:45 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 0.0625
2021-01-02 14:09:46 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 0.03125
2021-01-02 14:09:46 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 0.015625
2021-01-02 14:09:47 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 0.0078125
2021-01-02 14:09:48 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 0.00390625
2021-01-02 14:09:49 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 0.001953125
2021-01-02 14:09:50 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 0.0009765625
2021-01-02 14:09:51 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 0.00048828125
2021-01-02 14:09:52 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 0.000244140625
2021-01-02 14:09:53 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 0.0001220703125
2021-01-02 14:09:54 | WARNING | fairseq.nan_detector | NaN detected in output of module.encoder.layers.0.self_attn.dropout_module, shape: torch.Size([2816, 34, 34]), forward input max: nan, input min: nan
2021-01-02 14:09:54 | WARNING | fairseq.nan_detector | NaN detected in output of module.encoder.layers.0.self_attn.dropout_module, shape: torch.Size([7296, 13, 13]), forward input max: nan, input min: nan
2021-01-02 14:09:54 | WARNING | fairseq.nan_detector | NaN detected in output of module.encoder.layers.0.self_attn.dropout_module, shape: torch.Size([2304, 38, 38]), forward input max: nan, input min: nan
2021-01-02 14:09:54 | WARNING | fairseq.nan_detector | NaN detected in output of module.encoder.layers.0.self_attn.dropout_module, shape: torch.Size([2816, 33, 33]), forward input max: nan, input min: nan
2021-01-02 14:09:54 | WARNING | fairseq.nan_detector | NaN detected in output of module.decoder.output_projection, shape: torch.Size([176, 23, 47038]), backward
2021-01-02 14:09:54 | WARNING | fairseq.nan_detector | NaN detected in output of module.decoder.output_projection, shape: torch.Size([144, 40, 47038]), backward
2021-01-02 14:09:54 | WARNING | fairseq.nan_detector | NaN detected in output of module.decoder.output_projection, shape: torch.Size([176, 31, 47038]), backward
2021-01-02 14:09:54 | WARNING | fairseq.nan_detector | NaN detected in output of module.decoder.output_projection, shape: torch.Size([456, 12, 47038]), backward
FloatingPointError: Minimum loss scale reached (0.0001). Your loss is probably exploding. Try lowering the learning rate, using gradient clipping or increasing the batch size.
Any help is appreciated! Thanks.
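For reference, the mitigations named in the error message (gradient clipping, a lower learning rate, a larger effective batch) would map onto fairseq-train flags roughly like the sketch below. This is only an illustration: the data path, architecture name, and hyperparameter values are placeholders, not my actual configuration.

# Hypothetical fairseq-train invocation (placeholder data path, arch, and values)
# illustrating the mitigations suggested by the FloatingPointError:
#   --clip-norm 0.1   -> enable gradient clipping
#   --lr 0.0003       -> lower the peak learning rate
#   --update-freq 4   -> larger effective batch via gradient accumulation
fairseq-train data-bin/my-dataset \
    --arch transformer \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.1 \
    --lr 0.0003 --lr-scheduler inverse_sqrt --warmup-updates 8000 --warmup-init-lr 1e-07 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 4096 --update-freq 4 \
    --fp16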