Hi, great work! I have a small, friendly question. CogVideoX seems to be limited to generating videos whose frame count is 4f + 1, yet the paper's goal is to generate 12 frames, and the online demo for the "Multi-View Material Generation" part has 25 frames. I'm curious how the authors managed to generate videos with a frame count of 4f (e.g., by fine-tuning CogVideoX's VAE), and how the 25-frame online demo was produced with the proposed model, given that it generates 12 frames.
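
To make the arithmetic behind my question concrete, here is a tiny sketch of the constraint I'm assuming (the factor 4 is the temporal compression ratio I understand CogVideoX's 3D VAE to use; the function name is just mine for illustration):

```python
# Frame-count constraint I'm assuming for CogVideoX:
# the 3D VAE compresses time by 4x, so valid clip lengths are 4*f + 1.

TEMPORAL_COMPRESSION = 4  # assumed temporal downsampling factor of the VAE

def satisfies_cogvideox_constraint(num_frames: int) -> bool:
    """Return True if num_frames can be written as 4*f + 1 for some f >= 0."""
    return num_frames >= 1 and (num_frames - 1) % TEMPORAL_COMPRESSION == 0

print(satisfies_cogvideox_constraint(12))  # False -> 12 = 4*3, not of the form 4*f + 1
print(satisfies_cogvideox_constraint(25))  # True  -> 25 = 4*6 + 1
```

So 25 frames fits the 4f + 1 pattern, but 12 does not, which is the part I'm hoping you can explain.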