
THE INFERENCE PROCESS DOES NOT PERFORM DIFFUSION SAMPLING #85

@lixirui142

Description


After carefully checking and debugging the inference process (i.e., forward_test() for TrajectoryHead), I found that it is entirely incorrect, or at least it is not a diffusion sampling process. There is no iterative denoising in the inference process. The model just sends 20 noised anchor trajs to the diff decoder and returns its output traj predictions as the final result.

Explanation:

First, the number of inference timesteps is set to 1000 in forward_test():

```python
self.diffusion_scheduler.set_timesteps(1000, device)
```

So the scheduler only denoises by 1 timestep per scheduler.step() call, i.e. 999 -> 998 -> 997 -> ... -> 1 -> 0.

```python
roll_timesteps = (np.arange(0, step_num) * step_ratio).round()[::-1].copy().astype(np.int64)
```

Then roll_timesteps is computed as [10, 0] (step_num=2, step_ratio=10): there are 2 denoising steps, the first at noise level 10 and the second at noise level 0. The intended timestep interval is therefore 10, so that the scheduler can denoise the sample from ts=10 all the way to ts=0. To get that interval, the number of inference timesteps needs to be set to 100: self.diffusion_scheduler.set_timesteps(100, device).
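To make the effect of the step count concrete, here is a minimal sketch, assuming a diffusers DDIMScheduler with prediction_type="sample" (the repo may configure its scheduler differently; the shapes and names below are placeholders, not taken from the code):

```python
import torch
from diffusers import DDIMScheduler

scheduler = DDIMScheduler(num_train_timesteps=1000, prediction_type="sample")

sample = torch.randn(1, 20, 8, 3)     # stand-in for the 20 noised anchor trajs
x_start = torch.zeros_like(sample)    # stand-in for the decoder's clean prediction

# Current behaviour: 1000 inference steps -> stride 1000 // 1000 = 1,
# so one step at t=10 only reaches t=9.
scheduler.set_timesteps(1000)
almost_unchanged = scheduler.step(model_output=x_start, timestep=10, sample=sample).prev_sample

# Proposed fix: 100 inference steps -> stride 1000 // 100 = 10,
# so one step at t=10 lands on t=0, matching roll_timesteps = [10, 0].
scheduler.set_timesteps(100)
almost_clean = scheduler.step(model_output=x_start, timestep=10, sample=sample).prev_sample
```

For the DDIM scheduler the per-step stride is num_train_timesteps // num_inference_steps, which is why set_timesteps(100) gives the intended interval of 10.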

Let's first look at what happens in the first iteration (k=10)

```python
img = self.diffusion_scheduler.step(
    model_output=x_start,
    timestep=k,
    sample=img
).prev_sample
```

Here, the scheduler denoises the noisy trajectories (img) based on the predicted original sample (x_start) and the current timestep (k=10). However, because the code uses 1000 inference steps, the scheduler only denoises the sample from ts=10 to ts=9, i.e. a single step out of 1000.

[Figures: the 20 candidate trajectories before and after the scheduler step]

The two figures above show the 20 trajectories before and after the scheduler step. Since the scheduler only advanced by a single 1/1000 step, the input and output trajs are almost identical: the updated trajectories sit at timestep 9 instead of the expected 0.
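This near-identity can also be checked numerically from the scheduler's own noise schedule. A small sanity-check sketch, again assuming a diffusers DDIMScheduler (it uses diffusers' default linear beta schedule, so the absolute numbers will differ from the repo's schedule, but the qualitative point holds for any smooth schedule):

```python
import torch
from diffusers import DDIMScheduler

sched = DDIMScheduler(num_train_timesteps=1000, prediction_type="sample")
a10, a9, a0 = sched.alphas_cumprod[10], sched.alphas_cumprod[9], sched.alphas_cumprod[0]

# DDIM (eta=0): prev_sample = sqrt(a_prev) * x_start + sqrt(1 - a_prev) * eps_hat,
# where eps_hat is reconstructed from the current sample. Since a_9 is almost equal
# to a_10, prev_sample stays almost equal to sample regardless of x_start.
print((1 - a10).sqrt().item(), (1 - a9).sqrt().item())  # noise weights at t=10 vs t=9: very close
print((1 - a0).sqrt().item())                           # noise weight at t=0: much smaller
```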

```python
trunc_timesteps = torch.ones((bs,), device=device, dtype=torch.long) * 8
```

There is actually another bug here: the code sets the initial noise level to 8, which does not match the timestep (k=10) passed to the model in the first iteration.
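I don't know the exact surrounding code, but a consistent version could look roughly like the sketch below; all names (anchor_trajs, noise, etc.) and shapes are placeholders, and a diffusers DDIMScheduler stands in for the repo's scheduler:

```python
import numpy as np
import torch
from diffusers import DDIMScheduler

scheduler = DDIMScheduler(num_train_timesteps=1000, prediction_type="sample")

step_num, step_ratio = 2, 10
roll_timesteps = (np.arange(0, step_num) * step_ratio).round()[::-1].copy().astype(np.int64)  # [10, 0]

bs = 1
anchor_trajs = torch.zeros(bs, 20, 8, 3)   # placeholder clean anchor trajectories
noise = torch.randn_like(anchor_trajs)

# Derive the initial noise level from roll_timesteps[0] (=10) instead of hard-coding 8,
# so the noising timestep matches the timestep later passed to the decoder and scheduler.
trunc_timesteps = torch.full((bs,), int(roll_timesteps[0]), dtype=torch.long)
img = scheduler.add_noise(anchor_trajs, noise, trunc_timesteps)  # noised at t=10
```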

Then in the second iteration (k=0)

Since the updated noisy trajs are almost identical to the original noisy trajs (ts=9 vs ts=10), the diff decoder input in the second iteration is almost identical to that of the first iteration, i.e. it is still essentially the raw noisy anchor trajectories.

```python
mode_idx = poses_cls.argmax(dim=-1)
mode_idx = mode_idx[..., None, None, None].repeat(1, 1, self._num_poses, 3)
best_reg = torch.gather(poses_reg, 1, mode_idx).squeeze(1)
return {"trajectory": best_reg}
```

Meanwhile, as shown in the code above, the method directly returns the second (final) iteration's model output (poses_cls, poses_reg) as the final result. This brings us back to the initial conclusion: the model just sends 20 noised anchor trajs to the diff decoder and returns its output traj predictions as the final result.
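For reference, here is the same selection logic on toy tensors to make the shapes explicit (shapes are illustrative, not the repo's actual dimensions):

```python
import torch

bs, num_modes, num_poses = 1, 20, 8
poses_cls = torch.randn(bs, num_modes)                  # per-mode classification scores
poses_reg = torch.randn(bs, num_modes, num_poses, 3)    # per-mode predicted trajectories

mode_idx = poses_cls.argmax(dim=-1)                     # index of the highest-scoring mode
mode_idx = mode_idx[..., None, None, None].repeat(1, 1, num_poses, 3)
best_reg = torch.gather(poses_reg, 1, mode_idx).squeeze(1)   # (bs, num_poses, 3)
```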

Why does it still work?

The diff decoder is trained to take noisy trajectories and predict the clean trajectories. Therefore, feeding 20 noised anchor trajs to the diff decoder and returning its output traj predictions is still a valid inference strategy. Here is the visualized final result:

[Figure: visualization of the final predicted trajectories]

However, the actual inference is effectively a one-step prediction rather than iterative denoising. It does not perform a standard diffusion sampling process and is not aligned with the description in the paper, so it is still a bug and should be fixed.
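For concreteness, here is a minimal sketch of what the intended two-step sampling loop could look like; this is not a patch for the repo, it assumes a diffusers DDIMScheduler with prediction_type="sample", and decoder() and all shapes are placeholders:

```python
import numpy as np
import torch
from diffusers import DDIMScheduler

scheduler = DDIMScheduler(num_train_timesteps=1000, prediction_type="sample")
scheduler.set_timesteps(100)          # stride 10, so one step moves t=10 -> t=0

step_num, step_ratio = 2, 10
roll_timesteps = (np.arange(0, step_num) * step_ratio).round()[::-1].copy().astype(np.int64)

def decoder(noisy_trajs, t):
    # Placeholder for the diff decoder: predicts clean trajectories from noisy ones.
    return torch.zeros_like(noisy_trajs)

img = torch.randn(1, 20, 8, 3)        # 20 noised anchor trajectories (placeholder shape)
for k in roll_timesteps:              # k = 10, then k = 0
    x_start = decoder(img, int(k))
    img = scheduler.step(model_output=x_start, timestep=int(k), sample=img).prev_sample
```

With 100 inference steps, the first scheduler.step already moves the sample from t=10 to t=0, so the second iteration actually refines an (almost) clean sample instead of re-processing the raw noisy anchors.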
