
Question about the timestep source of cross-modality AdaLN gates in Fig. (b) vs. the current implementation (audio/video timesteps) #60

@mingtiannihao

Hi! First of all, thank you for open-sourcing this project—it's an excellent piece of work and the overall design is really impressive.

I have a clarification question regarding the framework figure, specifically panel (b) (the Audio–Video cross-attention details). My understanding of where the gate's timestep comes from in the diagram seems to differ from what I observe in the code, so I'd like to confirm the intended meaning.


1) What I understand from Fig. (b)

In panel (b), the Video branch appears to use a gate derived from the Audio timestep, and the Audio branch a gate derived from the Video timestep (i.e., each gate seems to come from the other modality's timestep).
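
To make my reading concrete, here is a minimal runnable sketch of the gating pattern I believe the figure shows. Everything in it (the `AdaLNGate` module, the tensor names, the `tanh` nonlinearity) is hypothetical and purely illustrative, not code from this repository:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a gate-producing AdaLN module (NOT from the repo).
class AdaLNGate(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, t_emb: torch.Tensor) -> torch.Tensor:
        # Map a timestep embedding to a per-channel residual gate.
        return torch.tanh(self.proj(t_emb))

dim = 64
gate_a2v = AdaLNGate(dim)  # gates the video-side (A2V) residual
gate_v2a = AdaLNGate(dim)  # gates the audio-side (V2A) residual

t_audio = torch.randn(1, 1, dim)   # audio timestep embedding (illustrative shape)
t_video = torch.randn(1, 1, dim)   # video timestep embedding
vx = torch.randn(1, 16, dim)       # video tokens
ax = torch.randn(1, 16, dim)       # audio tokens
a2v_out = torch.randn(1, 16, dim)  # Audio -> Video cross-attention output
v2a_out = torch.randn(1, 16, dim)  # Video -> Audio cross-attention output

# Fig. (b) as I read it: each residual gate comes from the OTHER modality.
vx = vx + gate_a2v(t_audio) * a2v_out  # video residual gated by AUDIO timestep
ax = ax + gate_v2a(t_video) * v2a_out  # audio residual gated by VIDEO timestep
```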


2) What I observe in the code

However, following the code path, the gates seem to be generated using each modality’s own timestep, rather than the other modality’s:

  • In preprocessing, MultiModalTransformerArgsPreprocessor._prepare_cross_attention_timestep(...) explicitly uses timestep=modality.timesteps, meaning each modality (audio/video) independently produces:

    • cross_scale_shift_timestep
    • cross_gate_timestep
  • In BasicAVTransformerBlock.forward (Audio–Video cross-attention):

    • A2V (Audio→Video) updates vx and uses gate_out_a2v, computed from video.cross_gate_timestep
    • V2A (Video→Audio) updates ax and uses gate_out_v2a, computed from audio.cross_gate_timestep

So while the scale/shift modulation on the Q/KV sides uses both the audio and video timesteps (which makes sense), the final residual gating appears to be tied to the updated/query side's own timestep (video for A2V, audio for V2A).
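
For contrast, the residual update I observe in the code looks like the following, continuing the hypothetical setup from the sketch above (only the gate inputs change):

```python
# Same hypothetical setup as the previous sketch, but with the gating I
# observe in the code: each gate is derived from that modality's OWN
# (query-side) timestep rather than the other modality's.
vx = vx + gate_a2v(t_video) * a2v_out  # A2V: video residual gated by VIDEO timestep
ax = ax + gate_v2a(t_audio) * v2a_out  # V2A: audio residual gated by AUDIO timestep
```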

Additionally, LTXModel._init_preprocessors seems to intentionally bind these:

  • the video preprocessor uses av_ca_a2v_gate_adaln_single as cross_gate_adaln
  • the audio preprocessor uses av_ca_v2a_gate_adaln_single as cross_gate_adaln
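
Schematically, I understand that binding like this (a sketch with hypothetical stand-in types; the repo's actual constructor signature may differ):

```python
from dataclasses import dataclass

import torch.nn as nn

# Hypothetical stand-ins, for illustration only.
@dataclass
class PreprocessorSketch:
    cross_gate_adaln: nn.Module

a2v_gate_adaln = nn.Linear(64, 64)  # stands in for av_ca_a2v_gate_adaln_single
v2a_gate_adaln = nn.Linear(64, 64)  # stands in for av_ca_v2a_gate_adaln_single

# Each modality's preprocessor is bound to the gate AdaLN of the path where
# that modality is the updated/query side:
video_preprocessor = PreprocessorSketch(cross_gate_adaln=a2v_gate_adaln)  # video updated in A2V
audio_preprocessor = PreprocessorSketch(cross_gate_adaln=v2a_gate_adaln)  # audio updated in V2A
```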

3) My question

Is Fig. (b) meant as a high-level conceptual illustration—to emphasize cross-attention Q/K/V interaction and that both timesteps participate in conditioning—rather than to strictly indicate that the gate is computed from the other modality’s timestep?

Or alternatively:

  • is the figure annotation simplified / potentially misleading, or
  • does the figure correspond to a different version (e.g., an earlier implementation where the gate was conditioned on the other modality’s timestep), while the current code uses per-modality timesteps?

Thanks a lot for your clarification, and again, great work!
