
Global Self-Attention without patch-embedding like ViT #8

@f-fuchs

Hey,

Just checking that I understand the paper correctly: are you computing global self-attention without any kind of patch embedding, as described in the ViT paper?

That could explain why the model is training so slowly for me...
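For reference, here's a minimal sketch of how I'm thinking about the cost difference, assuming a 224x224 input and ViT's 16x16 patches (the sizes are just my assumptions, not taken from this repo):

```python
# Rough sketch of why per-pixel tokens make global self-attention slow,
# assuming a 224x224 input and ViT-style 16x16 patches (my numbers, not the repo's).
import torch
import torch.nn as nn

H = W = 224          # input resolution (assumed)
P = 16               # ViT patch size (assumed)
D = 64               # embedding dim, arbitrary for the sketch

# Without patch embedding: every pixel is a token.
tokens_pixels = H * W                       # 50,176 tokens
# With ViT-style patch embedding: one token per PxP patch.
tokens_patches = (H // P) * (W // P)        # 196 tokens

# Self-attention cost scales with tokens^2:
print(f"pixel tokens: {tokens_pixels:,} -> {tokens_pixels**2:,} attention pairs")
print(f"patch tokens: {tokens_patches:,} -> {tokens_patches**2:,} attention pairs")

# ViT's patch embedding is just a strided convolution over the image:
x = torch.randn(1, 3, H, W)
patch_embed = nn.Conv2d(3, D, kernel_size=P, stride=P)
tokens = patch_embed(x).flatten(2).transpose(1, 2)   # shape (1, 196, D)
print(tokens.shape)
```

So if the model really attends over every pixel instead of patches, the attention matrix is orders of magnitude larger, which would match the slowdown I'm seeing.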

Thanks
