Hi, thanks for your great work! For the grounding task, is the whole video sequence fed into the LLM, or just a single image?