
OOM on 4B model training #23

@armanakbari

Description


Hi,

I am getting OOM with 47 GB of GPU memory when trying to train the 4B model. The neat-packing step creates sequences of 40960 tokens, producing an attention matrix too large to fit in memory. But even after changing 40960 to 2048, I still get OOM.

Any suggestions? What GPUs did you use? Is there any way I can make it work with 47 GB of memory? (I also tried ZeRO-3.)
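For context, a back-of-the-envelope estimate of why the 40960-token packed sequences blow up if the attention scores are fully materialized (a sketch; the head count of 32 and fp16 scores are assumptions, not confirmed details of this model):

```python
def attn_scores_bytes(seq_len: int, num_heads: int, dtype_bytes: int = 2) -> int:
    """Memory for one layer's materialized attention score matrix.

    Each head holds a seq_len x seq_len score matrix; dtype_bytes=2
    assumes fp16/bf16. Ignores activations, weights, and optimizer state.
    """
    return seq_len * seq_len * num_heads * dtype_bytes

# Packed 40960-token sequence, assuming 32 heads, fp16:
print(attn_scores_bytes(40960, 32) / 2**30)  # 100.0 GiB per layer

# Same estimate at 2048 tokens:
print(attn_scores_bytes(2048, 32) / 2**30)   # 0.25 GiB per layer
```

If 2048-token sequences still OOM, the score matrices are no longer the bottleneck, which points at activations, weights, and optimizer state instead (where memory-efficient attention kernels, gradient checkpointing, or ZeRO offload would matter more than sequence length).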
