Hi! First of all, thank you for open-sourcing this project—it's an excellent piece of work and the overall design is really impressive.
I have a clarification question regarding the framework figure, especially panel (b) (Audio–Video cross-attention details). My understanding of the timestep source for the gate in the diagram seems different from what I observe in the code, so I’d like to confirm the intended meaning.
1) What I understand from Fig. (b)
In panel (b), it appears that the Video branch uses a gate derived from the Audio timestep, and the Audio branch uses a gate derived from the Video timestep (i.e., the gate seems to come from the other modality’s timestep).
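To make my reading of the figure concrete, here is a minimal sketch of what panel (b) seems to depict (all function names, shapes, and the `adaln_gate` helper are mine, not from the codebase): each branch's residual gate is derived from the *other* modality's timestep embedding.

```python
import numpy as np

def adaln_gate(ts_emb, W):
    """Stand-in for an AdaLN-single-style gate projection (hypothetical)."""
    return np.tanh(ts_emb @ W)

# My reading of Fig. (b): cross-modality gating.
def a2v_as_i_read_the_figure(vx, cross_out, audio_ts_emb, W):
    gate = adaln_gate(audio_ts_emb, W)   # VIDEO update gated by the AUDIO timestep
    return vx + gate * cross_out

def v2a_as_i_read_the_figure(ax, cross_out, video_ts_emb, W):
    gate = adaln_gate(video_ts_emb, W)   # AUDIO update gated by the VIDEO timestep
    return ax + gate * cross_out
```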
2) What I observe in the code
However, following the code path, the gates seem to be generated using each modality’s own timestep, rather than the other modality’s:
- In preprocessing, `MultiModalTransformerArgsPreprocessor._prepare_cross_attention_timestep(...)` explicitly uses `timestep=modality.timesteps`, meaning each modality (audio/video) independently produces its own `cross_scale_shift_timestep` and `cross_gate_timestep`.
- In `BasicAVTransformerBlock.forward` (Audio–Video cross-attention):
  - A2V (Audio→Video) updates `vx` and uses `gate_out_a2v`, computed from `video.cross_gate_timestep`
  - V2A (Video→Audio) updates `ax` and uses `gate_out_v2a`, computed from `audio.cross_gate_timestep`
So while the scale/shift modulation on the Q/KV sides uses both audio/video timesteps (which makes sense), the final residual gating (gate) appears to be tied to the updated/query side (video for A2V, audio for V2A).
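In other words, my reading of the forward pass is the following (a paraphrase with hypothetical helper names and shapes, not the actual code): the gate is produced from the query-side modality's own `cross_gate_timestep`.

```python
import numpy as np

def gate(ts_emb, W):
    """Stand-in for the AdaLN-single gate projection (hypothetical)."""
    return np.tanh(ts_emb @ W)

# A2V: the VIDEO stream is updated, and the gate comes from the VIDEO
# cross_gate_timestep (the query side), not from the audio one.
def a2v(vx, cross_attn_out, video_gate_ts_emb, W_a2v):
    gate_out_a2v = gate(video_gate_ts_emb, W_a2v)
    return vx + gate_out_a2v * cross_attn_out

# V2A: symmetric, gated by the AUDIO cross_gate_timestep.
def v2a(ax, cross_attn_out, audio_gate_ts_emb, W_v2a):
    gate_out_v2a = gate(audio_gate_ts_emb, W_v2a)
    return ax + gate_out_v2a * cross_attn_out
```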
Additionally, LTXModel._init_preprocessors seems to intentionally bind these:
- the video preprocessor uses `av_ca_a2v_gate_adaln_single` as `cross_gate_adaln`
- the audio preprocessor uses `av_ca_v2a_gate_adaln_single` as `cross_gate_adaln`
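Roughly this wiring, as I understand it (the two `av_ca_*` identifier names are from the repo; the `Preprocessor` class and placeholder objects here are my simplification):

```python
class Preprocessor:
    """Simplified stand-in for the per-modality args preprocessor."""
    def __init__(self, cross_gate_adaln):
        self.cross_gate_adaln = cross_gate_adaln

# Placeholders for the two gate AdaLN modules named in the repo.
av_ca_a2v_gate_adaln_single = object()
av_ca_v2a_gate_adaln_single = object()

# Each query-side modality is bound to its own gate AdaLN:
video_preprocessor = Preprocessor(cross_gate_adaln=av_ca_a2v_gate_adaln_single)
audio_preprocessor = Preprocessor(cross_gate_adaln=av_ca_v2a_gate_adaln_single)
```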
3) My question
Is Fig. (b) meant as a high-level conceptual illustration—to emphasize cross-attention Q/K/V interaction and that both timesteps participate in conditioning—rather than to strictly indicate that the gate is computed from the other modality’s timestep?
Or alternatively:
- is the figure annotation simplified / potentially misleading, or
- does the figure correspond to a different version (e.g., an earlier implementation where the gate was conditioned on the other modality’s timestep), while the current code uses per-modality timesteps?
Thanks a lot for your clarification, and again, great work!