I observe that the output of cargo asm can differ from the code that cargo bench actually executes. The difference can be significant. Is this expected behavior?
I get one variant of the assembly by launching
RUSTFLAGS='-C target-cpu=native' cargo asm --bench small_matmul small_matmul::matmul4x4_view
To get the other variant, I first execute RUSTFLAGS='-C target-cpu=native' cargo bench, and note the name of the generated executable (for example target/release/deps/small_matmul-dcebd4e101d1c383). I then launch
objdump --disassemble -M intel --no-show-raw-insn --demangle target/release/deps/small_matmul-dcebd4e101d1c383 | less
and search for the function of interest within “less”.
I can try to narrow this down further and/or provide exact reproduction steps, but before I do so, I would like to ask whether this is a known issue or whether I am making a mistake.
An example follows.
Assembly produced by cargo asm:
.section .text.small_matmul::matmul4x4_view,"ax",@progbits
.p2align 4
.type small_matmul::matmul4x4_view,@function
small_matmul::matmul4x4_view:
.cfi_startproc
vbroadcastsd ymm0, qword ptr [rdi]
vmovupd ymm1, ymmword ptr [rsi]
vmovupd ymm2, ymmword ptr [rsi + 32]
vmovupd ymm3, ymmword ptr [rsi + 64]
vmovupd ymm4, ymmword ptr [rsi + 96]
vmulpd ymm0, ymm0, ymm1
vaddpd ymm0, ymm0, ymmword ptr [rdx]
vbroadcastsd ymm5, qword ptr [rdi + 8]
vmulpd ymm5, ymm5, ymm2
vaddpd ymm0, ymm5, ymm0
vbroadcastsd ymm5, qword ptr [rdi + 16]
vmulpd ymm5, ymm5, ymm3
vaddpd ymm0, ymm5, ymm0
vbroadcastsd ymm5, qword ptr [rdi + 24]
vmulpd ymm5, ymm5, ymm4
vaddpd ymm0, ymm5, ymm0
vmovupd ymmword ptr [rdx], ymm0
vbroadcastsd ymm0, qword ptr [rdi + 32]
vmulpd ymm0, ymm0, ymm1
vaddpd ymm0, ymm0, ymmword ptr [rdx + 32]
vbroadcastsd ymm1, qword ptr [rdi + 40]
vmulpd ymm1, ymm1, ymm2
vaddpd ymm0, ymm1, ymm0
vbroadcastsd ymm1, qword ptr [rdi + 48]
vmulpd ymm1, ymm1, ymm3
vaddpd ymm0, ymm1, ymm0
vbroadcastsd ymm1, qword ptr [rdi + 56]
vmulpd ymm1, ymm1, ymm4
vaddpd ymm0, ymm1, ymm0
vmovupd ymmword ptr [rdx + 32], ymm0
vbroadcastsd ymm0, qword ptr [rdi + 64]
vmovupd ymm1, ymmword ptr [rsi]
vmovupd ymm2, ymmword ptr [rsi + 32]
vmovupd ymm3, ymmword ptr [rsi + 64]
vmovupd ymm4, ymmword ptr [rsi + 96]
vmulpd ymm0, ymm0, ymm1
vaddpd ymm0, ymm0, ymmword ptr [rdx + 64]
vbroadcastsd ymm5, qword ptr [rdi + 72]
vmulpd ymm5, ymm5, ymm2
vaddpd ymm0, ymm5, ymm0
vbroadcastsd ymm5, qword ptr [rdi + 80]
vmulpd ymm5, ymm5, ymm3
vaddpd ymm0, ymm5, ymm0
vbroadcastsd ymm5, qword ptr [rdi + 88]
vmulpd ymm5, ymm5, ymm4
vaddpd ymm0, ymm5, ymm0
vmovupd ymmword ptr [rdx + 64], ymm0
vbroadcastsd ymm0, qword ptr [rdi + 96]
vmulpd ymm0, ymm0, ymm1
vaddpd ymm0, ymm0, ymmword ptr [rdx + 96]
vbroadcastsd ymm1, qword ptr [rdi + 104]
vmulpd ymm1, ymm1, ymm2
vaddpd ymm0, ymm1, ymm0
vbroadcastsd ymm1, qword ptr [rdi + 112]
vmulpd ymm1, ymm1, ymm3
vaddpd ymm0, ymm1, ymm0
vbroadcastsd ymm1, qword ptr [rdi + 120]
vmulpd ymm1, ymm1, ymm4
vaddpd ymm0, ymm1, ymm0
vmovupd ymmword ptr [rdx + 96], ymm0
vzeroupper
ret
Assembly output by objdump (observe how this variant begins with four vbroadcastsd instructions, while above it starts with a single one):
00000000000b8ed0 <small_matmul::matmul4x4_view>:
b8ed0: vbroadcastsd ymm0,QWORD PTR [rdi+0x18]
b8ed6: vbroadcastsd ymm1,QWORD PTR [rdi+0x10]
b8edc: vbroadcastsd ymm2,QWORD PTR [rdi+0x8]
b8ee2: vbroadcastsd ymm3,QWORD PTR [rdi]
b8ee7: vmulpd ymm3,ymm3,YMMWORD PTR [rsi]
b8eeb: vaddpd ymm3,ymm3,YMMWORD PTR [rdx]
b8eef: vmulpd ymm2,ymm2,YMMWORD PTR [rsi+0x20]
b8ef4: vaddpd ymm2,ymm2,ymm3
b8ef8: vmulpd ymm1,ymm1,YMMWORD PTR [rsi+0x40]
b8efd: vaddpd ymm1,ymm1,ymm2
b8f01: vmulpd ymm0,ymm0,YMMWORD PTR [rsi+0x60]
b8f06: vaddpd ymm0,ymm0,ymm1
b8f0a: vmovupd YMMWORD PTR [rdx],ymm0
b8f0e: vbroadcastsd ymm0,QWORD PTR [rdi+0x38]
b8f14: vbroadcastsd ymm1,QWORD PTR [rdi+0x28]
b8f1a: vbroadcastsd ymm2,QWORD PTR [rdi+0x20]
b8f20: vmulpd ymm2,ymm2,YMMWORD PTR [rsi]
b8f24: vaddpd ymm2,ymm2,YMMWORD PTR [rdx+0x20]
b8f29: vmulpd ymm1,ymm1,YMMWORD PTR [rsi+0x20]
b8f2e: vbroadcastsd ymm3,QWORD PTR [rdi+0x30]
b8f34: vaddpd ymm1,ymm1,ymm2
b8f38: vmulpd ymm2,ymm3,YMMWORD PTR [rsi+0x40]
b8f3d: vmulpd ymm0,ymm0,YMMWORD PTR [rsi+0x60]
b8f42: vaddpd ymm1,ymm2,ymm1
b8f46: vaddpd ymm0,ymm0,ymm1
b8f4a: vmovupd YMMWORD PTR [rdx+0x20],ymm0
b8f4f: vbroadcastsd ymm0,QWORD PTR [rdi+0x58]
b8f55: vbroadcastsd ymm1,QWORD PTR [rdi+0x50]
b8f5b: vbroadcastsd ymm2,QWORD PTR [rdi+0x48]
b8f61: vbroadcastsd ymm3,QWORD PTR [rdi+0x40]
b8f67: vmulpd ymm3,ymm3,YMMWORD PTR [rsi]
b8f6b: vaddpd ymm3,ymm3,YMMWORD PTR [rdx+0x40]
b8f70: vmulpd ymm2,ymm2,YMMWORD PTR [rsi+0x20]
b8f75: vaddpd ymm2,ymm2,ymm3
b8f79: vmulpd ymm1,ymm1,YMMWORD PTR [rsi+0x40]
b8f7e: vaddpd ymm1,ymm1,ymm2
b8f82: vmulpd ymm0,ymm0,YMMWORD PTR [rsi+0x60]
b8f87: vaddpd ymm0,ymm0,ymm1
b8f8b: vmovupd YMMWORD PTR [rdx+0x40],ymm0
b8f90: vbroadcastsd ymm0,QWORD PTR [rdi+0x78]
b8f96: vbroadcastsd ymm1,QWORD PTR [rdi+0x70]
b8f9c: vbroadcastsd ymm2,QWORD PTR [rdi+0x60]
b8fa2: vmulpd ymm2,ymm2,YMMWORD PTR [rsi]
b8fa6: vaddpd ymm2,ymm2,YMMWORD PTR [rdx+0x60]
b8fab: vbroadcastsd ymm3,QWORD PTR [rdi+0x68]
b8fb1: vmulpd ymm3,ymm3,YMMWORD PTR [rsi+0x20]
b8fb6: vaddpd ymm2,ymm3,ymm2
b8fba: vmulpd ymm1,ymm1,YMMWORD PTR [rsi+0x40]
b8fbf: vaddpd ymm1,ymm1,ymm2
b8fc3: vmulpd ymm0,ymm0,YMMWORD PTR [rsi+0x60]
b8fc8: vaddpd ymm0,ymm0,ymm1
b8fcc: vmovupd YMMWORD PTR [rdx+0x60],ymm0
b8fd1: vzeroupper
b8fd4: ret
b8fd5: int3
b8fd6: int3
b8fd7: int3
b8fd8: int3
b8fd9: int3
b8fda: int3
b8fdb: int3
b8fdc: int3
b8fdd: int3
b8fde: int3
b8fdf: int3
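For readers comparing the two listings: both appear to compute the same thing, a 4x4 double-precision update of the form C += A * B, where each row of C is accumulated from broadcast elements of A (rdi) multiplied by rows of B (rsi) and stored back through rdx. The listings differ in instruction scheduling and in whether the rows of B are held in registers or used as memory operands. The sketch below is a reconstruction from the assembly, not the actual source of matmul4x4_view; the function name, the flat row-major [f64; 16] layout, and the loop order are all assumptions made for illustration.

```rust
/// Hypothetical stand-in for `small_matmul::matmul4x4_view`, reconstructed
/// from the disassembly above. Assumes row-major 4x4 matrices stored as
/// flat arrays; the real function may use a different view type.
#[inline(never)] // keep a standalone symbol so it can be found in disassembly
pub fn matmul4x4_sketch(a: &[f64; 16], b: &[f64; 16], c: &mut [f64; 16]) {
    for i in 0..4 {
        for k in 0..4 {
            // Corresponds to one `vbroadcastsd` of a[i][k] in the listings.
            let aik = a[i * 4 + k];
            for j in 0..4 {
                // Row of c accumulates aik * (row k of b): one
                // `vmulpd` followed by one `vaddpd` per (i, k) pair.
                c[i * 4 + j] += aik * b[k * 4 + j];
            }
        }
    }
}
```

Calling this with A set to the identity matrix should leave each element of C increased by the corresponding element of B, which is a quick way to check that a kernel of this shape matches the behavior both listings suggest.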