Thank you for your wonderful project.
Can this language-binding model handle scenarios where a modality is missing? Specifically, is it possible to perform inference without the audio modality, and if so, how much would omitting it affect performance?
Lastly, is there a model checkpoint that was trained without a specific modality (in my case, audio)? I’m planning to use this model architecture for my project, but I don’t intend to use the audio modality.
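To make the question concrete, here is a minimal sketch of what I have in mind: calling the model with only the modalities I actually have, so audio is simply omitted. The class and encoder names below are toy placeholders for illustration, not the repo's real API.

```python
import torch
import torch.nn as nn

# Toy stand-in for a multimodal model with per-modality encoder towers.
# All names here are hypothetical, not the project's actual classes.
class ToyMultimodalModel(nn.Module):
    def __init__(self, in_dim: int = 16, embed_dim: int = 8):
        super().__init__()
        self.encoders = nn.ModuleDict({
            "vision": nn.Linear(in_dim, embed_dim),
            "text": nn.Linear(in_dim, embed_dim),
            "audio": nn.Linear(in_dim, embed_dim),
        })

    def forward(self, inputs: dict) -> dict:
        # Encode only the modalities actually provided, so a missing
        # modality (e.g. audio) is skipped rather than raising an error.
        return {name: self.encoders[name](x) for name, x in inputs.items()}

model = ToyMultimodalModel()
# Inference without audio: pass only vision and text inputs.
out = model({"vision": torch.randn(2, 16), "text": torch.randn(2, 16)})
print(sorted(out))  # only the modalities that were actually encoded
```

Is this roughly how the released model behaves when a modality's input is absent, or does the forward pass assume all modalities are present?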
Thank you so much.

Sincerely