@claude do two profile runs, each with conc 4 and for 1 decode step: dsr1 fp8 on b200, and dsr1 fp8 on mi300.
Then compare the 2 traces and break down where the time went kernel by kernel (wall-clock time and %): how long did MoE take, etc.?
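
For the kernel-by-kernel breakdown, a rough sketch of what I mean, assuming each trace is exported as Chrome-trace-style JSON where GPU kernels show up as complete events (`"ph": "X"`, `"cat": "kernel"`, `"dur"` in microseconds). The function names and the `moe`/`expert` name matching are hypothetical, and the real kernel names will differ between b200 and mi300. Note that summed kernel time is GPU busy time, so it only approximates wall-clock time if kernels do not overlap and gaps are small.

```python
from collections import defaultdict


def kernel_breakdown(trace_events):
    """Sum duration per kernel name and return (name, total_us, percent) rows.

    Assumes Chrome-trace-style events: dicts with "ph", "cat", "name", "dur".
    Only complete events ("ph" == "X") in category "kernel" are counted.
    """
    totals = defaultdict(float)
    for ev in trace_events:
        if ev.get("ph") == "X" and ev.get("cat") == "kernel":
            totals[ev["name"]] += float(ev.get("dur", 0.0))

    grand_total = sum(totals.values())
    rows = [
        (name, dur, 100.0 * dur / grand_total if grand_total else 0.0)
        for name, dur in totals.items()
    ]
    # Largest contributors first.
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows


def moe_time(rows, substrings=("moe", "expert")):
    """Total (microseconds, percent) for kernels whose name looks MoE-related.

    The substrings are a guess (hypothetical); adjust them to the kernel
    names that actually appear in each trace.
    """
    picked = [
        (dur, pct)
        for name, dur, pct in rows
        if any(s in name.lower() for s in substrings)
    ]
    return sum(d for d, _ in picked), sum(p for _, p in picked)


if __name__ == "__main__":
    # Tiny made-up event list, only to show the shape of the input.
    sample = [
        {"ph": "X", "cat": "kernel", "name": "fused_moe_kernel", "dur": 30.0},
        {"ph": "X", "cat": "kernel", "name": "gemm", "dur": 50.0},
        {"ph": "X", "cat": "kernel", "name": "gemm", "dur": 20.0},
        {"ph": "X", "cat": "cpu_op", "name": "aten::mm", "dur": 999.0},
    ]
    for name, dur, pct in kernel_breakdown(sample):
        print(f"{name:24s} {dur:10.1f} us {pct:6.1f} %")
    print("MoE:", moe_time(kernel_breakdown(sample)))
```

Run it once per trace (e.g. on the `traceEvents` list loaded with `json.load`) and put the two tables side by side.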