How to predict future frames conditioned on the previous 4 frames?

Hi, the paper says that the trained model can predict 1 or 4 future frames conditioned on the previous 4 frames.
How is this achieved in each case? Specifically, what are the values of x (the input frame) and z (the latent) when predicting each frame?