Hello,
I'm trying to generate my own example with 16 frames, and I'm hitting this error with CogVideoX at line 71:
latent_model_input = torch.cat([latent_model_input, latent_image_input], dim=2)
RuntimeError: Sizes of tensors must match except in dimension 2. Expected size 4 but got size 13 for tensor number 1 in the list.
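If I understand the shapes correctly, the mismatch may come from CogVideoX's temporal VAE compression (which I believe is a factor of 4): 16 input frames give 4 latent frames, while the other tensor seems to be prepared for the default 49 frames, which gives 13. A small sketch of that arithmetic (the helper name and the compression ratio are my assumptions, not from the pipeline code):

```python
def latent_frames(num_frames: int, compression: int = 4) -> int:
    # Assumed formula for CogVideoX's temporal downsampling:
    # the first frame is kept, then every `compression`-th frame.
    return (num_frames - 1) // compression + 1

print(latent_frames(16))  # -> 4  (my custom frame count)
print(latent_frames(49))  # -> 13 (the pipeline's default, matching the error)
```

So the two tensors being concatenated on dim=2 were apparently built from different frame counts.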
Since this is a custom pipeline, is there a known fix, or a constraint on the number of frames I need to respect?