How does consistent self-attention fit into the semantic motion predictor?

Great work, everyone! I'd like to ask how consistent self-attention fits into the semantic motion predictor. As far as I can see in the paper, the semantic motion predictor takes only two images as input (one as the start frame and one as the end frame).