Hello, and thank you for sharing your work—very cool!
I've been trying the demo, and from what I've learned from your paper, your model can control the generated video using a single image, a text prompt, and a specified trajectory. However, I noticed there's no option to input a text prompt in the demo. Where do you condition on the prompt?
I also came across the term "motion bucket" in the demo. Could you clarify what it refers to and how it affects the output?