
Conversation

@Web-Coke

This is the fix for #125

New:

goos: linux
goarch: amd64
pkg: github.com/oklog/ulid/v2
cpu: Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
=== RUN   BenchmarkNewCompare
BenchmarkNewCompare
BenchmarkNewCompare-2           770801167                1.494 ns/op    21421.13 MB/s          0 B/op          0 allocs/op
PASS

Old:

goos: linux
goarch: amd64
pkg: github.com/oklog/ulid/v2
cpu: Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
=== RUN   BenchmarkOldCompare
BenchmarkOldCompare
BenchmarkOldCompare-2           414179769                3.045 ns/op    10509.64 MB/s          0 B/op          0 allocs/op
PASS

As the results show, the time for ULID Compare has decreased by about 51% (3.045 ns/op to 1.494 ns/op)!

All test cases and benchmarks have passed!

Web-Coke added 3 commits May 13, 2025 14:43
Performance Optimization
	Add benchmark tests for new and old comparison functions:   ulid_test.go
	Little Endian mode:   compare_le.go
	Move the Compare function to BE mode or LE mode:   ulid.go
@peterbourgon
Member

peterbourgon commented May 21, 2025

ULIDs are always big-endian, so if you want to look at a ULID as two hi/lo uint64s without copying, I think you can just do e.g.

var id ULID
hi := binary.BigEndian.Uint64(id[:8])
lo := binary.BigEndian.Uint64(id[8:])

Extending that to a full Compare method might look like e.g.

func (id ULID) Compare(other ULID) int {
	ahi := binary.BigEndian.Uint64(id[:8]) 
	bhi := binary.BigEndian.Uint64(other[:8])
	if ahi < bhi {
		return -1
	}
	if ahi > bhi {
		return 1
	}
	alo := binary.BigEndian.Uint64(id[8:]) 
	blo := binary.BigEndian.Uint64(other[8:])
	if alo < blo {
		return -1
	}
	if alo > blo {
		return 1
	}
	return 0
}
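(For readers following along: the sketch above can be tried outside the ulid package by writing it as a free function over raw 16-byte arrays; the name compareULID is illustrative, not from the library.)

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// compareULID compares two 16-byte IDs as two big-endian uint64 halves,
// returning -1, 0, or 1 with the same ordering as bytes.Compare.
func compareULID(a, b [16]byte) int {
	ahi := binary.BigEndian.Uint64(a[:8])
	bhi := binary.BigEndian.Uint64(b[:8])
	switch {
	case ahi < bhi:
		return -1
	case ahi > bhi:
		return 1
	}
	alo := binary.BigEndian.Uint64(a[8:])
	blo := binary.BigEndian.Uint64(b[8:])
	switch {
	case alo < blo:
		return -1
	case alo > blo:
		return 1
	}
	return 0
}

func main() {
	x := [16]byte{1, 2, 3}
	y := [16]byte{1, 2, 4}
	fmt.Println(compareULID(x, y)) // -1: x sorts before y
	fmt.Println(compareULID(y, x)) // 1
	fmt.Println(compareULID(x, x)) // 0
}
```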

I think your proposed Compare methods actually return incorrect results for e.g. 01JVR914GPGV0BTVAK499X0591 vs 01JVR915GP0DP7QFH5JKQ1HETF.

The current BenchmarkCompare uses the same two ULIDs in each b.N loop, which differ immediately in their first byte, so they're not really exercising the Compare method(s) in a useful way. One quick guess at a more useful benchmark would be e.g.

func BenchmarkCompare(b *testing.B) {
	for _, testcase := range []struct {
		name string
		a    ulid.ULID
		b    ulid.ULID
	}{
		{
			name: "zeroes",
			a:    ulid.ULID{},
			b:    ulid.ULID{},
		},
		{
			name: "time a/b",
			a:    ulid.MustParseStrict(`01JVR914GPGV0BTVAK499X0591`),
			b:    ulid.MustParseStrict(`01JVR915GP0DP7QFH5JKQ1HETF`),
		},
		{
			name: "time b/a",
			a:    ulid.MustParseStrict(`01JVR915GP0DP7QFH5JKQ1HETF`),
			b:    ulid.MustParseStrict(`01JVR914GPGV0BTVAK499X0591`),
		},
		{
			name: "entropy by 1",
			a:    ulid.MustParseStrict(`01JVR915GP0DP7QFH5JKQ1HET0`),
			b:    ulid.MustParseStrict(`01JVR915GP0DP7QFH5JKQ1HET1`),
		},
		{
			name: "handcrafted",
			a:    ulid.ULID{5, 4, 3, 2, 1},
			b:    ulid.ULID{1, 2, 3, 4, 5},
		},
	} {
		b.Run(testcase.name, func(b *testing.B) {
			for _, fn := range []struct {
				name string
				fn   func(a, b ulid.ULID) int
			}{
				{"CompareBytes", ulid.ULID.CompareBytes},
				{"CompareUint64", ulid.ULID.Compare64},
			} {
				b.Run(fn.name, func(b *testing.B) {
					b.ReportAllocs()

					var want int
					switch {
					case testcase.a.String() < testcase.b.String():
						want = -1
					case testcase.a.String() > testcase.b.String():
						want = 1
					}

					for i := 0; i < b.N; i++ {
						if have := fn.fn(testcase.a, testcase.b); want != have {
							b.Fatalf("%s Compare %s: want %v, have %v", testcase.a, testcase.b, want, have)
						}
					}
				})
			}
		})
	}
}

which for my arm64 macbook pro returns


goos: darwin
goarch: arm64
pkg: github.com/oklog/ulid/v2
cpu: Apple M3 Pro
BenchmarkCompare
BenchmarkCompare/zeroes
BenchmarkCompare/zeroes/CompareBytes
BenchmarkCompare/zeroes/CompareBytes-11         	475559071	         2.423 ns/op	       0 B/op	       0 allocs/op
BenchmarkCompare/zeroes/CompareUint64
BenchmarkCompare/zeroes/CompareUint64-11        	740542461	         1.623 ns/op	       0 B/op	       0 allocs/op
BenchmarkCompare/time_a/b
BenchmarkCompare/time_a/b/CompareBytes
BenchmarkCompare/time_a/b/CompareBytes-11       	553243116	         2.165 ns/op	       0 B/op	       0 allocs/op
BenchmarkCompare/time_a/b/CompareUint64
BenchmarkCompare/time_a/b/CompareUint64-11      	873565628	         1.377 ns/op	       0 B/op	       0 allocs/op
BenchmarkCompare/time_b/a
BenchmarkCompare/time_b/a/CompareBytes
BenchmarkCompare/time_b/a/CompareBytes-11       	555538088	         2.167 ns/op	       0 B/op	       0 allocs/op
BenchmarkCompare/time_b/a/CompareUint64
BenchmarkCompare/time_b/a/CompareUint64-11      	869904038	         1.375 ns/op	       0 B/op	       0 allocs/op
BenchmarkCompare/entropy_by_1
BenchmarkCompare/entropy_by_1/CompareBytes
BenchmarkCompare/entropy_by_1/CompareBytes-11   	491751549	         2.437 ns/op	       0 B/op	       0 allocs/op
BenchmarkCompare/entropy_by_1/CompareUint64
BenchmarkCompare/entropy_by_1/CompareUint64-11  	735243434	         1.647 ns/op	       0 B/op	       0 allocs/op
BenchmarkCompare/handcrafted
BenchmarkCompare/handcrafted/CompareBytes
BenchmarkCompare/handcrafted/CompareBytes-11    	552388168	         2.164 ns/op	       0 B/op	       0 allocs/op
BenchmarkCompare/handcrafted/CompareUint64
BenchmarkCompare/handcrafted/CompareUint64-11   	868318581	         1.380 ns/op	       0 B/op	       0 allocs/op
PASS
ok  	github.com/oklog/ulid/v2	14.032s

Of course the performance of CompareUint64 would be totally different on a 32-bit architecture, probably CompareBytes would be faster there!

This is just my first pass response. I'm not sure how I feel about architecture-specific optimizations in general, and I'm not sure how to evaluate architecture-specific optimizations that improve something from 2ns to 1.3ns. To me this is not a large improvement; if it were 10x or 100x that's a different thing!

@Web-Coke
Author

First of all, thank you very much for your reply.
It's problematic if you code it this way:

var id ULID
hi := binary.BigEndian.Uint64(id[:8])
...

It's equivalent to the following code:

var id ULID
hi := uint64(id[0x07]) | uint64(id[0x06])<<8 | uint64(id[0x05])<<16 | uint64(id[0x04])<<24 | uint64(id[0x03])<<32 | uint64(id[0x02])<<40 | uint64(id[0x01])<<48 | uint64(id[0x00])<<56
...

Here's the corresponding assembly (conversion part only, x86_64 CPU):

// The previous AX stores the address of id[0], which is equivalent to a dereference

// This instruction is exactly equal to
// hi := uint64(id[0x00]) | uint64(id[0x01])<<8 | uint64(id[0x02])<<16 | uint64(id[0x03])<<24 | uint64(id[0x04])<<32 | uint64(id[0x05])<<40 | uint64(id[0x06])<<48 | uint64(id[0x07])<<56
// hi := *(*uint64)(unsafe.Pointer(&id[0]))

// If our id is [0x01, 0x02, 0x03, 0x04], we get a 0x04030201 result
// The reason for this is X86_64 CPU stores numbers in Little Endian mode
MOVQ (AX), AX
// Invert the byte order of the AX registers, and we get a result of 0x01020304
BSWAPQ AX

If you're running on a CPU in Little Endian mode, you won't have any issues (e.g. x86, amd64, arm, etc.). However, when running on a CPU in Big Endian mode, the code may behave abnormally (e.g. IBM POWER processors), because *(*uint64)(unsafe.Pointer(&id[0])) will yield the 0x01020304 result directly!

Next, let's talk about the performance issue. The reason your arm64 MacBook Pro doesn't see a significant performance boost is that the ARM64 implementation doesn't use SIMD (Single Instruction, Multiple Data) instructions for acceleration the way the x86_64 one does; please refer to https://github.com/golang/go/blob/master/src/internal/bytealg/compare_arm64.s for details.
There, our code is merely equivalent to a version with no loops and no redundant comparisons, so the performance gains are not very significant.

On an x86_64 CPU, however, it is different: it takes the CPU a certain amount of time to switch into SIMD mode, and SIMD mode can reduce the CPU's clock frequency! With a large amount of data, SIMD instructions are excellent, but our data length is only 16 bytes, and at that size the overhead of SIMD is very uneconomical. So on my x86_64 host, the performance improvement is still relatively obvious:

cpu: AMD Ryzen 9 4900H with Radeon Graphics         
BenchmarkCompare/zeroes/CompareBytes-16         	341527234	         3.523 ns/op	       0 B/op	       0 allocs/op
BenchmarkCompare/zeroes/CompareUint64-16        	524140376	         2.331 ns/op	       0 B/op	       0 allocs/op
BenchmarkCompare/time_a/b/CompareBytes-16       	264096885	         4.558 ns/op	       0 B/op	       0 allocs/op
BenchmarkCompare/time_a/b/CompareUint64-16      	598199716	         1.995 ns/op	       0 B/op	       0 allocs/op
BenchmarkCompare/time_b/a/CompareBytes-16       	260912161	         4.611 ns/op	       0 B/op	       0 allocs/op
BenchmarkCompare/time_b/a/CompareUint64-16      	611868098	         1.974 ns/op	       0 B/op	       0 allocs/op
BenchmarkCompare/entropy_by_1/CompareBytes-16   	264065038	         4.565 ns/op	       0 B/op	       0 allocs/op
BenchmarkCompare/entropy_by_1/CompareUint64-16  	515535445	         2.307 ns/op	       0 B/op	       0 allocs/op
BenchmarkCompare/handcrafted/CompareBytes-16    	260333334	         4.584 ns/op	       0 B/op	       0 allocs/op
BenchmarkCompare/handcrafted/CompareUint64-16   	611201800	         1.956 ns/op	       0 B/op	       0 allocs/op
PASS

To sum up, I think we could choose different implementations for different CPUs: x86_64, x86, and arm64 could adopt the new way of comparing, while all other architectures continue to use bytes.Compare. We could use //go:build to distinguish them. What do you think?
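(Sketched out, the //go:build split being suggested here might look something like the following two files; the file names and tag lists are illustrative, not an actual patch to the library. Each architecture then compiles exactly one of the two Compare definitions.)

```go
// File: compare_fast.go (architectures where the two-uint64 compare wins)
//go:build amd64 || 386 || arm64

package ulid

import "encoding/binary"

// Compare views the ULID as two big-endian uint64 halves.
func (id ULID) Compare(other ULID) int {
	ahi := binary.BigEndian.Uint64(id[:8])
	bhi := binary.BigEndian.Uint64(other[:8])
	switch {
	case ahi < bhi:
		return -1
	case ahi > bhi:
		return 1
	}
	alo := binary.BigEndian.Uint64(id[8:])
	blo := binary.BigEndian.Uint64(other[8:])
	switch {
	case alo < blo:
		return -1
	case alo > blo:
		return 1
	}
	return 0
}

// File: compare_generic.go (everything else keeps the existing behavior)
//go:build !amd64 && !386 && !arm64

package ulid

import "bytes"

// Compare falls back to a plain byte comparison.
func (id ULID) Compare(other ULID) int {
	return bytes.Compare(id[:], other[:])
}
```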
Finally, I would like to express my gratitude for your hard work on this library; it is very useful!

@peterbourgon
Member

peterbourgon commented May 21, 2025

AFAIU one of the primary purposes of functions like binary.BigEndian.Uint64 is to parse input bytes of a known endian-ness into a primitive value (in this case a uint64) that's correct for the current architecture, whatever it is. That is, for the same input bytes, it should produce a uint64 with little-endian byte order on a little-endian host, and the same uint64 with big-endian byte order on a big-endian host.

I would be very surprised if binary.BigEndian.Uint64 didn't work correctly on a little-endian host. Do you have any way to test this claim out?


Next we will talk about the performance issue, the reason why your Arm64 MacBook Pro doesn't have a significant performance boost...

Well, my benchmarks show CompareBytes running in ~2ns, and CompareUint64 in ~1ns, a ~2x performance improvement. And your results show CompareBytes running in 3.5-4.5ns, and CompareUint64 in 1.9-2.3ns, which is in the best case a ~4x performance improvement. To me these are basically equivalent results 😇 -- if it was 10x difference, then that would be a different story.


I'm pushing back on this a little bit, because if we start using build tags to specialize implementations, then it wouldn't make sense to just optimize Compare, many (most?) methods would benefit from specialized implementations in this way. And that's a whole lot of work, and more importantly an enormously larger maintenance burden for me. I'm reticent to opt-in to those additional costs, especially since these optimizations can easily be made by consumers in their own code.

@Web-Coke
Author

Web-Coke commented May 22, 2025

I'm so sorry, but it turns out I misunderstood! The reason I thought the binary.BigEndian.Uint64 function would produce incorrect results on a CPU in Big Endian mode is that I mistakenly believed *(*uint64)(unsafe.Pointer(&id[0])) is equivalent to uint64(id[0x00]) | uint64(id[0x01])<<8 | uint64(id[0x02])<<16 | uint64(id[0x03])<<24 | uint64(id[0x04])<<32 | uint64(id[0x05])<<40 | uint64(id[0x06])<<48 | uint64(id[0x07])<<56, when in fact they behave differently under different endiannesses!

Whether in Little Endian mode or Big Endian mode, the instruction generated by *(*uint64)(unsafe.Pointer(&id[0])) is always MOVQ (AX), AX.

In Little Endian mode:

// MOVQ (AX), AX
*(*uint64)(unsafe.Pointer(&id[0]))

// MOVQ (AX), AX
uint64(id[0x00]) | uint64(id[0x01])<<8 | uint64(id[0x02])<<16 | uint64(id[0x03])<<24 | uint64(id[0x04])<<32 | uint64(id[0x05])<<40 | uint64(id[0x06])<<48 | uint64(id[0x07])<<56

// MOVQ (AX), AX
// BSWAPQ AX
uint64(id[0x07]) | uint64(id[0x06])<<8 | uint64(id[0x05])<<16 | uint64(id[0x04])<<24 | uint64(id[0x03])<<32 | uint64(id[0x02])<<40 | uint64(id[0x01])<<48 | uint64(id[0x00])<<56

In Big Endian mode:

// MOVQ (AX), AX
*(*uint64)(unsafe.Pointer(&id[0]))

// MOVQ (AX), AX
// BSWAPQ AX
uint64(id[0x00]) | uint64(id[0x01])<<8 | uint64(id[0x02])<<16 | uint64(id[0x03])<<24 | uint64(id[0x04])<<32 | uint64(id[0x05])<<40 | uint64(id[0x06])<<48 | uint64(id[0x07])<<56

// MOVQ (AX), AX
uint64(id[0x07]) | uint64(id[0x06])<<8 | uint64(id[0x05])<<16 | uint64(id[0x04])<<24 | uint64(id[0x03])<<32 | uint64(id[0x02])<<40 | uint64(id[0x01])<<48 | uint64(id[0x00])<<56

That is, the binary.BigEndian.Uint64 function will always produce a consistent result either way; I was mistaken! I'm really sorry 😭!!!

I'm pushing back on this a little bit, because if we start using build tags to specialize implementations, then it wouldn't make sense to just optimize Compare, many (most?) methods would benefit from specialized implementations in this way. And that's a whole lot of work, and more importantly an enormously larger maintenance burden for me. I'm reticent to opt-in to those additional costs, especially since these optimizations can easily be made by consumers in their own code.

It's true that my idea of using build tags was not well thought through, and as you said, it's a lot of work! It's a better idea to let users choose how to optimize in their own code!

In about 1-3 days (after making sure you see this reply) I'm going to close this pull request. I'm sorry I didn't contribute anything to this library (and even caused some trouble). I hope you, the other contributors, and everyone have a pleasant life!

@peterbourgon
Member

Haha, no problem at all! I'm happy to have these kinds of discussions. And please don't read my comments as negative!

@Web-Coke
Author

And please don't read my comments as negative!

Not at all!! It has been a pleasure to communicate with you. Now that you've seen the reply, I'm closing this pull request. I wish you a happy life!

@Web-Coke Web-Coke closed this May 22, 2025