Output of "cargo asm" does not match objdump of actual benchmark #434

@grothesque

I observe that the output of cargo asm can disagree with what is actually executed by cargo bench, and the difference can be significant. Is this expected behavior?

I get one variant of the assembly by launching

RUSTFLAGS='-C target-cpu=native' cargo asm --bench small_matmul small_matmul::matmul4x4_view

To get the other variant, I first execute RUSTFLAGS='-C target-cpu=native' cargo bench, and note the name of the generated executable (for example target/release/deps/small_matmul-dcebd4e101d1c383). I then launch

objdump --disassemble -M intel --no-show-raw-insn --demangle target/release/deps/small_matmul-dcebd4e101d1c383 | less

and search for the function of interest within less.

I can try to narrow this down further and/or provide exact reproduction steps, but before I do so, I would like to ask whether this is a known issue or whether I am making a mistake somewhere.

An example follows.

Assembly produced by cargo asm:

.section .text.small_matmul::matmul4x4_view,"ax",@progbits
	.p2align	4
.type	small_matmul::matmul4x4_view,@function
small_matmul::matmul4x4_view:
	.cfi_startproc
	vbroadcastsd ymm0, qword ptr [rdi]
	vmovupd ymm1, ymmword ptr [rsi]
	vmovupd ymm2, ymmword ptr [rsi + 32]
	vmovupd ymm3, ymmword ptr [rsi + 64]
	vmovupd ymm4, ymmword ptr [rsi + 96]
	vmulpd ymm0, ymm0, ymm1
	vaddpd ymm0, ymm0, ymmword ptr [rdx]
	vbroadcastsd ymm5, qword ptr [rdi + 8]
	vmulpd ymm5, ymm5, ymm2
	vaddpd ymm0, ymm5, ymm0
	vbroadcastsd ymm5, qword ptr [rdi + 16]
	vmulpd ymm5, ymm5, ymm3
	vaddpd ymm0, ymm5, ymm0
	vbroadcastsd ymm5, qword ptr [rdi + 24]
	vmulpd ymm5, ymm5, ymm4
	vaddpd ymm0, ymm5, ymm0
	vmovupd ymmword ptr [rdx], ymm0
	vbroadcastsd ymm0, qword ptr [rdi + 32]
	vmulpd ymm0, ymm0, ymm1
	vaddpd ymm0, ymm0, ymmword ptr [rdx + 32]
	vbroadcastsd ymm1, qword ptr [rdi + 40]
	vmulpd ymm1, ymm1, ymm2
	vaddpd ymm0, ymm1, ymm0
	vbroadcastsd ymm1, qword ptr [rdi + 48]
	vmulpd ymm1, ymm1, ymm3
	vaddpd ymm0, ymm1, ymm0
	vbroadcastsd ymm1, qword ptr [rdi + 56]
	vmulpd ymm1, ymm1, ymm4
	vaddpd ymm0, ymm1, ymm0
	vmovupd ymmword ptr [rdx + 32], ymm0
	vbroadcastsd ymm0, qword ptr [rdi + 64]
	vmovupd ymm1, ymmword ptr [rsi]
	vmovupd ymm2, ymmword ptr [rsi + 32]
	vmovupd ymm3, ymmword ptr [rsi + 64]
	vmovupd ymm4, ymmword ptr [rsi + 96]
	vmulpd ymm0, ymm0, ymm1
	vaddpd ymm0, ymm0, ymmword ptr [rdx + 64]
	vbroadcastsd ymm5, qword ptr [rdi + 72]
	vmulpd ymm5, ymm5, ymm2
	vaddpd ymm0, ymm5, ymm0
	vbroadcastsd ymm5, qword ptr [rdi + 80]
	vmulpd ymm5, ymm5, ymm3
	vaddpd ymm0, ymm5, ymm0
	vbroadcastsd ymm5, qword ptr [rdi + 88]
	vmulpd ymm5, ymm5, ymm4
	vaddpd ymm0, ymm5, ymm0
	vmovupd ymmword ptr [rdx + 64], ymm0
	vbroadcastsd ymm0, qword ptr [rdi + 96]
	vmulpd ymm0, ymm0, ymm1
	vaddpd ymm0, ymm0, ymmword ptr [rdx + 96]
	vbroadcastsd ymm1, qword ptr [rdi + 104]
	vmulpd ymm1, ymm1, ymm2
	vaddpd ymm0, ymm1, ymm0
	vbroadcastsd ymm1, qword ptr [rdi + 112]
	vmulpd ymm1, ymm1, ymm3
	vaddpd ymm0, ymm1, ymm0
	vbroadcastsd ymm1, qword ptr [rdi + 120]
	vmulpd ymm1, ymm1, ymm4
	vaddpd ymm0, ymm1, ymm0
	vmovupd ymmword ptr [rdx + 96], ymm0
	vzeroupper
	ret

Assembly output by objdump (observe how this variant begins with four vbroadcastsd instructions, while above it starts with a single one):

00000000000b8ed0 <small_matmul::matmul4x4_view>:
   b8ed0:       vbroadcastsd ymm0,QWORD PTR [rdi+0x18]
   b8ed6:       vbroadcastsd ymm1,QWORD PTR [rdi+0x10]
   b8edc:       vbroadcastsd ymm2,QWORD PTR [rdi+0x8]
   b8ee2:       vbroadcastsd ymm3,QWORD PTR [rdi]
   b8ee7:       vmulpd ymm3,ymm3,YMMWORD PTR [rsi]
   b8eeb:       vaddpd ymm3,ymm3,YMMWORD PTR [rdx]
   b8eef:       vmulpd ymm2,ymm2,YMMWORD PTR [rsi+0x20]
   b8ef4:       vaddpd ymm2,ymm2,ymm3
   b8ef8:       vmulpd ymm1,ymm1,YMMWORD PTR [rsi+0x40]
   b8efd:       vaddpd ymm1,ymm1,ymm2
   b8f01:       vmulpd ymm0,ymm0,YMMWORD PTR [rsi+0x60]
   b8f06:       vaddpd ymm0,ymm0,ymm1
   b8f0a:       vmovupd YMMWORD PTR [rdx],ymm0
   b8f0e:       vbroadcastsd ymm0,QWORD PTR [rdi+0x38]
   b8f14:       vbroadcastsd ymm1,QWORD PTR [rdi+0x28]
   b8f1a:       vbroadcastsd ymm2,QWORD PTR [rdi+0x20]
   b8f20:       vmulpd ymm2,ymm2,YMMWORD PTR [rsi]
   b8f24:       vaddpd ymm2,ymm2,YMMWORD PTR [rdx+0x20]
   b8f29:       vmulpd ymm1,ymm1,YMMWORD PTR [rsi+0x20]
   b8f2e:       vbroadcastsd ymm3,QWORD PTR [rdi+0x30]
   b8f34:       vaddpd ymm1,ymm1,ymm2
   b8f38:       vmulpd ymm2,ymm3,YMMWORD PTR [rsi+0x40]
   b8f3d:       vmulpd ymm0,ymm0,YMMWORD PTR [rsi+0x60]
   b8f42:       vaddpd ymm1,ymm2,ymm1
   b8f46:       vaddpd ymm0,ymm0,ymm1
   b8f4a:       vmovupd YMMWORD PTR [rdx+0x20],ymm0
   b8f4f:       vbroadcastsd ymm0,QWORD PTR [rdi+0x58]
   b8f55:       vbroadcastsd ymm1,QWORD PTR [rdi+0x50]
   b8f5b:       vbroadcastsd ymm2,QWORD PTR [rdi+0x48]
   b8f61:       vbroadcastsd ymm3,QWORD PTR [rdi+0x40]
   b8f67:       vmulpd ymm3,ymm3,YMMWORD PTR [rsi]
   b8f6b:       vaddpd ymm3,ymm3,YMMWORD PTR [rdx+0x40]
   b8f70:       vmulpd ymm2,ymm2,YMMWORD PTR [rsi+0x20]
   b8f75:       vaddpd ymm2,ymm2,ymm3
   b8f79:       vmulpd ymm1,ymm1,YMMWORD PTR [rsi+0x40]
   b8f7e:       vaddpd ymm1,ymm1,ymm2
   b8f82:       vmulpd ymm0,ymm0,YMMWORD PTR [rsi+0x60]
   b8f87:       vaddpd ymm0,ymm0,ymm1
   b8f8b:       vmovupd YMMWORD PTR [rdx+0x40],ymm0
   b8f90:       vbroadcastsd ymm0,QWORD PTR [rdi+0x78]
   b8f96:       vbroadcastsd ymm1,QWORD PTR [rdi+0x70]
   b8f9c:       vbroadcastsd ymm2,QWORD PTR [rdi+0x60]
   b8fa2:       vmulpd ymm2,ymm2,YMMWORD PTR [rsi]
   b8fa6:       vaddpd ymm2,ymm2,YMMWORD PTR [rdx+0x60]
   b8fab:       vbroadcastsd ymm3,QWORD PTR [rdi+0x68]
   b8fb1:       vmulpd ymm3,ymm3,YMMWORD PTR [rsi+0x20]
   b8fb6:       vaddpd ymm2,ymm3,ymm2
   b8fba:       vmulpd ymm1,ymm1,YMMWORD PTR [rsi+0x40]
   b8fbf:       vaddpd ymm1,ymm1,ymm2
   b8fc3:       vmulpd ymm0,ymm0,YMMWORD PTR [rsi+0x60]
   b8fc8:       vaddpd ymm0,ymm0,ymm1
   b8fcc:       vmovupd YMMWORD PTR [rdx+0x60],ymm0
   b8fd1:       vzeroupper
   b8fd4:       ret
   b8fd5:       int3
   b8fd6:       int3
   b8fd7:       int3
   b8fd8:       int3
   b8fd9:       int3
   b8fda:       int3
   b8fdb:       int3
   b8fdc:       int3
   b8fdd:       int3
   b8fde:       int3
   b8fdf:       int3
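
For reference, both listings appear to perform the same arithmetic: each row of C (at rdx) is updated with the four products of a broadcast element of A (at rdi) and a row of B (at rsi), added in the same order. They differ in instruction scheduling and in whether the rows of B are first loaded into registers or used as memory operands. A minimal sketch of that computation follows. It is a hypothetical stand-in reconstructed from the assembly, not the actual benchmark source: the name `matmul4x4`, the flat row-major `[f64; 16]` layout, and the signature are all assumptions, and the real `matmul4x4_view` presumably operates on array views.

```rust
/// Hypothetical stand-in for `small_matmul::matmul4x4_view`, reconstructed
/// from the assembly above (assumed layout: row-major 4x4 matrices of f64).
/// Computes C += A * B.
#[inline(never)] // keep it a separate symbol so it can be found in a disassembly
pub fn matmul4x4(a: &[f64; 16], b: &[f64; 16], c: &mut [f64; 16]) {
    for i in 0..4 {
        for k in 0..4 {
            // Corresponds to one `vbroadcastsd` of an element of A.
            let aik = a[4 * i + k];
            // Corresponds to one `vmulpd` / `vaddpd` pair on a row of B.
            for j in 0..4 {
                c[4 * i + j] += aik * b[4 * k + j];
            }
        }
    }
}

fn main() {
    // A = identity, so C (initially zero) ends up equal to B.
    let mut a = [0.0_f64; 16];
    for i in 0..4 {
        a[4 * i + i] = 1.0;
    }
    let b: [f64; 16] = core::array::from_fn(|i| i as f64);
    let mut c = [0.0_f64; 16];

    matmul4x4(&a, &b, &mut c);
    assert_eq!(c, b);
    println!("{:?}", c);
}
```

Since the two listings seem to differ only in scheduling and register use, the question is about which code generation the benchmark binary actually gets, not about the numerical result.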
