
[BUG] Loss drops to 0 after a few thousand steps when using fp16=True #493

@silpara

Description


The model's training loss suddenly drops to 0 after a little over 1000 steps. I've tried different datasets as well but get the same behaviour.

Details

I am following the notebook Transformers4Rec/examples/tutorial to train a next-item click-prediction model on my own dataset of item sequences.

The params I've changed are: learning_rate=0.01, fp16=True, per_device_train_batch_size=64, and d_model=16; the rest are as in the notebook. Below are the logs for the first day of data.

{'loss': 14.3249, 'learning_rate': 0.009976389135451803, 'epoch': 0.01}
{'loss': 14.083, 'learning_rate': 0.009964554115628145, 'epoch': 0.01}
{'loss': 13.9319, 'learning_rate': 0.009953074146399196, 'epoch': 0.01}
{'loss': 13.8982, 'learning_rate': 0.009947452511982957, 'epoch': 0.02}
{'loss': 12.6002, 'learning_rate': 0.009938812947511687, 'epoch': 0.02}
{'loss': 0.0, 'learning_rate': 0.009926977927688029, 'epoch': 0.02}
{'loss': 0.0, 'learning_rate': 0.00991514290786437, 'epoch': 0.03}
{'loss': 0.0, 'learning_rate': 0.009903307888040712, 'epoch': 0.03}
{'loss': 0.0, 'learning_rate': 0.009891472868217054, 'epoch': 0.04}
{'loss': 0.0, 'learning_rate': 0.009879637848393396, 'epoch': 0.04}
{'loss': 0.0, 'learning_rate': 0.009867802828569739, 'epoch': 0.04}
{'loss': 0.0, 'learning_rate': 0.00985596780874608, 'epoch': 0.05}
{'loss': 0.0, 'learning_rate': 0.009844132788922422, 'epoch': 0.05}
{'loss': 0.0, 'learning_rate': 0.009832297769098764, 'epoch': 0.05}
{'loss': 0.0, 'learning_rate': 0.009820462749275106, 'epoch': 0.06}
{'loss': 0.0, 'learning_rate': 0.009808627729451447, 'epoch': 0.06}
{'loss': 0.0, 'learning_rate': 0.00979679270962779, 'epoch': 0.06}
{'loss': 0.0, 'learning_rate': 0.00978495768980413, 'epoch': 0.07}
{'loss': 0.0, 'learning_rate': 0.009773122669980473, 'epoch': 0.07}
{'loss': 0.0, 'learning_rate': 0.009761287650156814, 'epoch': 0.07}
{'loss': 0.0, 'learning_rate': 0.009749452630333156, 'epoch': 0.08}
{'loss': 0.0, 'learning_rate': 0.009737617610509498, 'epoch': 0.08}
{'loss': 0.0, 'learning_rate': 0.00972578259068584, 'epoch': 0.09}
{'loss': 0.0, 'learning_rate': 0.009713947570862181, 'epoch': 0.09}
{'loss': 0.0, 'learning_rate': 0.009702112551038523, 'epoch': 0.09}
{'loss': 0.0, 'learning_rate': 0.009690277531214864, 'epoch': 0.1}
{'loss': 0.0, 'learning_rate': 0.009678442511391206, 'epoch': 0.1}
{'loss': 0.0, 'learning_rate': 0.009666607491567548, 'epoch': 0.1}
{'loss': 0.0, 'learning_rate': 0.00965477247174389, 'epoch': 0.11}
{'loss': 0.0, 'learning_rate': 0.009642937451920231, 'epoch': 0.11}
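For reference, this is roughly how I set the changed arguments (a sketch based on the tutorial notebook; the T4RecTrainingArguments import path is the one used there and may differ in other versions):

```python
# Sketch of my changed settings; all other arguments are as in the
# Transformers4Rec tutorial notebook. Paths/names assumed from that notebook.
from transformers4rec.config.trainer import T4RecTrainingArguments

training_args = T4RecTrainingArguments(
    output_dir="./tmp",
    learning_rate=0.01,              # changed from the notebook default
    fp16=True,                       # changed: mixed-precision training
    per_device_train_batch_size=64,  # changed
    # d_model=16 is set separately on the model config (e.g. XLNetConfig),
    # not here -- shown only to list everything I changed.
)
```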

Additionally, I am using the Merlin container nvcr.io/nvidia/merlin/merlin-pytorch-training:22.05 for training.

Any suggestions on what might be the issue here?
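In case it's relevant: my own (speculative) suspicion is fp16's limited dynamic range, since its maximum representable value is only 65504. A quick NumPy illustration of what I mean:

```python
import numpy as np

# float16 can only represent magnitudes up to 65504
print(np.finfo(np.float16).max)   # 65504.0

# anything larger overflows to inf
print(np.float16(70000.0))        # inf

# e.g. exp() of a moderately large logit already overflows in fp16,
# which can corrupt the loss computation (inf/NaN)
print(np.exp(np.float16(12.0)))   # inf
```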

Metadata


Assignees

No one assigned

    Labels

    P1, bug (Something isn't working)
