When collating different words into a batch, the image sizes vary. Currently the images are zero-padded to a common size, but the padding is not something the model should attend to. Instead, we need to add an attention mask over the pixels as well, so the model does not behave differently when a sample is batched together with longer or shorter texts.
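A minimal sketch of such a collate function, assuming the images arrive as PyTorch tensors of shape (C, H, W); the name `collate_with_pixel_mask` and the mask layout are illustrative, not an existing API:

```python
import torch


def collate_with_pixel_mask(images):
    """Pad variable-size word images to a common size and return a
    pixel-level attention mask (True = real pixel, False = padding)."""
    channels = images[0].shape[0]
    max_h = max(img.shape[1] for img in images)
    max_w = max(img.shape[2] for img in images)

    batch = torch.zeros(len(images), channels, max_h, max_w)
    mask = torch.zeros(len(images), max_h, max_w, dtype=torch.bool)

    for i, img in enumerate(images):
        _, h, w = img.shape
        batch[i, :, :h, :w] = img  # zero-pad bottom/right, as before
        mask[i, :h, :w] = True     # mark only the real pixels
    return batch, mask
```

Before it is used inside attention, this mask would typically be downsampled to the model's patch or feature resolution, so that attention scores over padded positions can be masked out regardless of how the batch was composed.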