
Question about the timestep source of cross-modality AdaLN gates in Fig. (b) vs. the current implementation (audio/video timesteps) #60

@mingtiannihao

Hi! First of all, thank you for open-sourcing this project—it's an excellent piece of work and the overall design is really impressive.

I have a clarification question regarding the framework figure, specifically panel (b) (the Audio–Video cross-attention details). My understanding of where the gate's timestep comes from in the diagram seems to differ from what I observe in the code, so I'd like to confirm the intended meaning.


1) What I understand from Fig. (b)

In panel (b), the Video branch appears to use a gate derived from the Audio timestep, and the Audio branch a gate derived from the Video timestep (i.e., each gate seems to come from the other modality's timestep).
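
To make my reading concrete, here is a minimal runnable sketch of the gating pattern I believe the figure shows. Everything in it (the `AdaLNGate` module, the tensor names, the `tanh` nonlinearity) is hypothetical and purely illustrative, not code from this repository:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a gate-producing AdaLN module (NOT from the repo).
class AdaLNGate(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, t_emb: torch.Tensor) -> torch.Tensor:
        # Map a timestep embedding to a per-channel residual gate.
        return torch.tanh(self.proj(t_emb))

dim = 64
gate_a2v = AdaLNGate(dim)  # gates the video-side (A2V) residual
gate_v2a = AdaLNGate(dim)  # gates the audio-side (V2A) residual

t_audio = torch.randn(1, 1, dim)   # audio timestep embedding (illustrative shape)
t_video = torch.randn(1, 1, dim)   # video timestep embedding
vx = torch.randn(1, 16, dim)       # video tokens
ax = torch.randn(1, 16, dim)       # audio tokens
a2v_out = torch.randn(1, 16, dim)  # Audio -> Video cross-attention output
v2a_out = torch.randn(1, 16, dim)  # Video -> Audio cross-attention output

# Fig. (b) as I read it: each residual gate comes from the OTHER modality.
vx = vx + gate_a2v(t_audio) * a2v_out  # video residual gated by AUDIO timestep
ax = ax + gate_v2a(t_video) * v2a_out  # audio residual gated by VIDEO timestep
```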


2) What I observe in the code

However, following the code path, the gates seem to be generated using each modality’s own timestep, rather than the other modality’s:

  • In preprocessing, MultiModalTransformerArgsPreprocessor._prepare_cross_attention_timestep(...) explicitly uses timestep=modality.timesteps, meaning each modality (audio/video) independently produces:

    • cross_scale_shift_timestep
    • cross_gate_timestep
  • In BasicAVTransformerBlock.forward (Audio–Video cross-attention):

    • A2V (Audio→Video) updates vx and uses gate_out_a2v, computed from video.cross_gate_timestep
    • V2A (Video→Audio) updates ax and uses gate_out_v2a, computed from audio.cross_gate_timestep

So while the scale/shift modulation on the Q/KV sides uses both the audio and video timesteps (which makes sense), the final residual gating appears to be tied to the updated/query side's own timestep (video for A2V, audio for V2A).
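
For contrast, the residual update I observe in the code looks like the following, continuing the hypothetical setup from the sketch above (only the gate inputs change):

```python
# Same hypothetical setup as the previous sketch, but with the gating I
# observe in the code: each gate is derived from that modality's OWN
# (query-side) timestep rather than the other modality's.
vx = vx + gate_a2v(t_video) * a2v_out  # A2V: video residual gated by VIDEO timestep
ax = ax + gate_v2a(t_audio) * v2a_out  # V2A: audio residual gated by AUDIO timestep
```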

Additionally, LTXModel._init_preprocessors seems to intentionally bind these:

  • the video preprocessor uses av_ca_a2v_gate_adaln_single as cross_gate_adaln
  • the audio preprocessor uses av_ca_v2a_gate_adaln_single as cross_gate_adaln
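
Schematically, I understand that binding like this (a sketch with hypothetical stand-in types; the repo's actual constructor signature may differ):

```python
from dataclasses import dataclass

import torch.nn as nn

# Hypothetical stand-ins, for illustration only.
@dataclass
class PreprocessorSketch:
    cross_gate_adaln: nn.Module

a2v_gate_adaln = nn.Linear(64, 64)  # stands in for av_ca_a2v_gate_adaln_single
v2a_gate_adaln = nn.Linear(64, 64)  # stands in for av_ca_v2a_gate_adaln_single

# Each modality's preprocessor is bound to the gate AdaLN of the path where
# that modality is the updated/query side:
video_preprocessor = PreprocessorSketch(cross_gate_adaln=a2v_gate_adaln)  # video updated in A2V
audio_preprocessor = PreprocessorSketch(cross_gate_adaln=v2a_gate_adaln)  # audio updated in V2A
```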

3) My question

Is Fig. (b) meant as a high-level conceptual illustration—to emphasize cross-attention Q/K/V interaction and that both timesteps participate in conditioning—rather than to strictly indicate that the gate is computed from the other modality’s timestep?

Or alternatively:

  • is the figure annotation simplified / potentially misleading, or
  • does the figure correspond to a different version (e.g., an earlier implementation where the gate was conditioned on the other modality’s timestep), while the current code uses per-modality timesteps?

Thanks a lot for your clarification, and again, great work!
