Thank you for sharing this interesting work! Even after reading through the VISTA code, I'm having trouble determining which modules are actually trained. Is it correct that only the MLP is trained, while the vision encoder and the LLM are kept frozen, as depicted in Figure 1 of the paper?
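
To make sure I'm asking the right question, here is a minimal PyTorch sketch of the setup I have in mind. The module names (`vision_encoder`, `mlp`, `llm`) and the tiny `nn.Linear` stand-ins are hypothetical placeholders, not the actual VISTA components:

```python
import torch.nn as nn


class VistaLikeModel(nn.Module):
    """Hypothetical stand-in for the VISTA architecture (not the real code)."""

    def __init__(self):
        super().__init__()
        self.vision_encoder = nn.Linear(16, 8)  # placeholder for the vision encoder
        self.mlp = nn.Linear(8, 8)              # placeholder for the MLP connector
        self.llm = nn.Linear(8, 4)              # placeholder for the LLM


model = VistaLikeModel()

# Freeze everything, then unfreeze only the MLP.
for p in model.parameters():
    p.requires_grad = False
for p in model.mlp.parameters():
    p.requires_grad = True

# List which parameters would receive gradients.
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)  # only mlp.* parameters
```

Is this freezing scheme what the training code actually does, or are some encoder/LLM layers (e.g. via LoRA or partial unfreezing) also updated?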