D-ORCA: Dialogue-Centric Optimization for Robust Audio-Visual Captioning
video-understanding tsinghua-university multimodal-llm video-llm dialogue-centric omni-llm audio-visual-llm
Updated Feb 11, 2026 - Python