Decoding is currently only greedy:
https://github.com/sign/image-latent-transformer/blob/d655f2206e3a1ac25ec2cd2f8feb89484deaf7cf/README.md?plain=1#L69-L73
However, for machine translation (MT), beam search might work better.
Implementing it is non-trivial. It might require decoding several candidates per beam at the byte level, then predicting the next token at the word level, and using the joint probability of the two to prune the beams.
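
To make the idea concrete, here is a minimal sketch of one possible shape for this, not a proposal for the actual implementation. It assumes two hypothetical callables that do not exist in the repo: `propose_words` (the byte-level decoder returning several candidate next words with their byte-level log-probabilities) and `score_word` (the word-level model's log-probability for a candidate). The joint score is assumed to be the plain sum of the two log-probabilities, with no length normalization. The toy tables at the bottom are made up purely to show a case where greedy and beam search disagree.

```python
import heapq
import math
from typing import Callable, List, Tuple

Words = Tuple[str, ...]
Beam = Tuple[float, Words]  # (cumulative joint log-prob, words so far)


def hierarchical_beam_search(
    propose_words: Callable[[Words], List[Tuple[str, float]]],  # hypothetical byte-level decoder
    score_word: Callable[[Words, str], float],  # hypothetical word-level scorer
    beam_size: int = 2,
    max_words: int = 3,
    eos: str = "</s>",
) -> Beam:
    beams: List[Beam] = [(0.0, ())]
    finished: List[Beam] = []
    for _ in range(max_words):
        candidates: List[Beam] = []
        for score, words in beams:
            # Byte level: over-generate candidate next words for this beam.
            for word, byte_lp in propose_words(words):
                # Word level: rescore the candidate, then sum (assumed joint score).
                joint = score + byte_lp + score_word(words, word)
                candidates.append((joint, words + (word,)))
        if not candidates:
            break
        # Prune to the best `beam_size` hypotheses by joint score.
        beams = []
        for cand in heapq.nlargest(beam_size, candidates):
            (finished if cand[1][-1] == eos else beams).append(cand)
        if not beams:
            break
    return max(finished + beams)


# Toy stand-ins for the two models (made-up numbers).
BYTE_TABLE = {
    (): [("a", math.log(0.6)), ("b", math.log(0.4))],
    ("a",): [("x", math.log(0.5)), ("y", math.log(0.5))],
    ("b",): [("z", math.log(0.9))],
}


def toy_propose(prefix: Words) -> List[Tuple[str, float]]:
    return BYTE_TABLE.get(prefix, [("</s>", 0.0)])


def toy_score(prefix: Words, word: str) -> float:
    # The word-level model dislikes "y" and is neutral otherwise.
    return math.log(0.5) if word == "y" else 0.0


greedy = hierarchical_beam_search(toy_propose, toy_score, beam_size=1)
beam = hierarchical_beam_search(toy_propose, toy_score, beam_size=2)
print(greedy[1], beam[1])
```

With `beam_size=1` this degenerates to greedy decoding and commits to the locally best first word, while `beam_size=2` keeps the second-best first word alive long enough for it to win on joint score. In the real model the byte-level expansion would itself need a small beam or sampling step per word, which is where most of the complexity would be.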