Optimize ULID Compare performance #126
Conversation
Performance Optimization:

- Add benchmark tests for the new and old comparison functions: ulid_test.go
- Add a Little Endian comparison implementation: compare_le.go
- Make the Compare function use the BE or LE implementation depending on the target architecture: ulid.go
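The contents of compare_le.go and the ulid.go change are not shown in this thread, so the following is only a rough sketch of one way such a build-tag split is commonly structured; the file name, build tags, helper name, and function body are assumptions, not the PR's actual code.

```go
// compare_fast.go -- hypothetical fast path, gated to 64-bit targets.
//go:build amd64 || arm64

package ulid

import "encoding/binary"

// compareFast orders two ULIDs by viewing each as two big-endian uint64
// halves; the ULID byte layout itself is always big-endian.
func compareFast(a, b ULID) int {
	ahi, bhi := binary.BigEndian.Uint64(a[:8]), binary.BigEndian.Uint64(b[:8])
	switch {
	case ahi < bhi:
		return -1
	case ahi > bhi:
		return 1
	}
	alo, blo := binary.BigEndian.Uint64(a[8:]), binary.BigEndian.Uint64(b[8:])
	switch {
	case alo < blo:
		return -1
	case alo > blo:
		return 1
	}
	return 0
}
```

A sibling file with the inverted build constraint could keep the existing byte-by-byte comparison, with Compare in ulid.go calling whichever implementation was built in.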
ULIDs are always big-endian, so if you want to look at a ULID as two hi/lo uint64s without copying, I think you can just do e.g.

```go
var id ULID
hi := binary.BigEndian.Uint64(id[:8])
lo := binary.BigEndian.Uint64(id[8:])
```

Extending that to a full Compare method might look like e.g.

```go
func (id ULID) Compare(other ULID) int {
	ahi := binary.BigEndian.Uint64(id[:8])
	bhi := binary.BigEndian.Uint64(other[:8])
	if ahi < bhi {
		return -1
	}
	if ahi > bhi {
		return 1
	}
	alo := binary.BigEndian.Uint64(id[8:])
	blo := binary.BigEndian.Uint64(other[8:])
	if alo < blo {
		return -1
	}
	if alo > blo {
		return 1
	}
	return 0
}
```

I think your proposed Compare methods can actually return incorrect results for some inputs. Also, the current BenchmarkCompare uses the same two ULIDs in each b.N loop, which differ immediately in their first byte, so they're not really exercising the Compare method(s) in a useful way. One quick guess at a more useful benchmark would be e.g.

```go
func BenchmarkCompare(b *testing.B) {
	for _, testcase := range []struct {
		name string
		a    ulid.ULID
		b    ulid.ULID
	}{
		{
			name: "zeroes",
			a:    ulid.ULID{},
			b:    ulid.ULID{},
		},
		{
			name: "time a/b",
			a:    ulid.MustParseStrict(`01JVR914GPGV0BTVAK499X0591`),
			b:    ulid.MustParseStrict(`01JVR915GP0DP7QFH5JKQ1HETF`),
		},
		{
			name: "time b/a",
			a:    ulid.MustParseStrict(`01JVR915GP0DP7QFH5JKQ1HETF`),
			b:    ulid.MustParseStrict(`01JVR914GPGV0BTVAK499X0591`),
		},
		{
			name: "entropy by 1",
			a:    ulid.MustParseStrict(`01JVR915GP0DP7QFH5JKQ1HET0`),
			b:    ulid.MustParseStrict(`01JVR915GP0DP7QFH5JKQ1HET1`),
		},
		{
			name: "handcrafted",
			a:    ulid.ULID{5, 4, 3, 2, 1},
			b:    ulid.ULID{1, 2, 3, 4, 5},
		},
	} {
		b.Run(testcase.name, func(b *testing.B) {
			for _, fn := range []struct {
				name string
				fn   func(a, b ulid.ULID) int
			}{
				{"CompareBytes", ulid.ULID.CompareBytes},
				{"CompareUint64", ulid.ULID.Compare64},
			} {
				b.Run(fn.name, func(b *testing.B) {
					b.ReportAllocs()
					var want int
					switch {
					case testcase.a.String() < testcase.b.String():
						want = -1
					case testcase.a.String() > testcase.b.String():
						want = 1
					}
					for i := 0; i < b.N; i++ {
						if have := fn.fn(testcase.a, testcase.b); want != have {
							b.Fatalf("%s Compare %s: want %v, have %v", testcase.a, testcase.b, want, have)
						}
					}
				})
			}
		})
	}
}
```

which for my arm64 MacBook Pro returns:

Of course the performance of CompareUint64 would be totally different on a 32-bit architecture; probably CompareBytes would be faster there!

This is just my first-pass response. I'm not sure how I feel about architecture-specific optimizations in general, and I'm not sure how to evaluate architecture-specific optimizations that improve something from 2ns to 1.3ns. To me this is not a large improvement; if it were 10x or 100x that would be a different thing!
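Separately from benchmarking, a quick property check can catch the correctness concern mentioned above: whatever the optimized method is, it should agree with the existing byte-by-byte ordering. This is only a sketch; it assumes the Compare64 method named in the benchmark above and the github.com/oklog/ulid/v2 import path.

```go
package ulid_test

import (
	"bytes"
	"math/rand"
	"testing"

	"github.com/oklog/ulid/v2"
)

func TestCompare64AgainstBytes(t *testing.T) {
	rng := rand.New(rand.NewSource(1)) // deterministic inputs for reproducibility
	for i := 0; i < 100000; i++ {
		var a, b ulid.ULID
		rng.Read(a[:])
		rng.Read(b[:])

		// bytes.Compare on the raw 16 bytes is the existing, canonical ordering.
		want := bytes.Compare(a[:], b[:])
		if have := a.Compare64(b); want != have { // Compare64 is assumed from the benchmark above
			t.Fatalf("%s vs %s: want %d, have %d", a, b, want, have)
		}
	}
}
```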
First of all, thank you very much for your reply.

This:

```go
var id ULID
hi := binary.BigEndian.Uint64(id[:8])
...
```

is equivalent to the following code:

```go
var id ULID
hi := uint64(id[0x07]) | uint64(id[0x06])<<8 | uint64(id[0x05])<<16 | uint64(id[0x04])<<24 | uint64(id[0x03])<<32 | uint64(id[0x02])<<40 | uint64(id[0x01])<<48 | uint64(id[0x00])<<56
...
```

Here is its assembly code (conversion part only, x86_64 CPU):

If you're running on a CPU in Little Endian mode, you won't have any issues (e.g. x86, amd64, arm, etc.). However, running on a CPU in Big Endian mode (e.g. IBM POWER processors), the code may behave abnormally.

Next, let's talk about the performance issue. The reason your arm64 MacBook Pro doesn't see a significant performance boost is that ARM CPUs don't use SIMD (Single Instruction, Multiple Data) instructions for acceleration the way x86_64 CPUs do; please refer to https://github.com/golang/go/blob/master/src/internal/bytealg/compare_arm64.s for details.

On an x86_64 CPU it is different: it takes a certain amount of time for the CPU to switch into SIMD mode, and SIMD mode lowers the CPU's frequency. With a large amount of data, SIMD-related instructions are excellent, but our data length is only 16 bytes, and in that case the overhead caused by SIMD is very uneconomical. So on my x86_64 host the performance improvement is still relatively obvious:

To sum up, I think we may be able to choose different implementations for different CPUs: x86_64, x86, and arm64 CPUs can adopt the new way of comparing, while other CPUs continue to use the existing byte-by-byte Compare.
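For reference, that equivalence can be spot-checked with a small program like this (a sketch; the byte values are arbitrary):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

func main() {
	id := [16]byte{0x01, 0x8F, 0x7A, 0x3C, 0x55, 0x10, 0xAB, 0xCD}

	// Both expressions assemble the uint64 from id[0]..id[7] in big-endian order.
	hi := binary.BigEndian.Uint64(id[:8])
	manual := uint64(id[7]) | uint64(id[6])<<8 | uint64(id[5])<<16 | uint64(id[4])<<24 |
		uint64(id[3])<<32 | uint64(id[2])<<40 | uint64(id[1])<<48 | uint64(id[0])<<56

	fmt.Println(hi == manual) // true
}
```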
AFAIU one of the primary purposes of functions like binary.BigEndian.Uint64 is to abstract over the endianness of the host architecture. I would be very surprised if it behaved incorrectly on a Big Endian CPU.
Well, my benchmarks show CompareBytes running in ~2ns and CompareUint64 in ~1ns, a ~2x performance improvement. And your results show CompareBytes running in 3.5-4.5ns and CompareUint64 in 1.9-2.3ns, which is in the best case a ~4x performance improvement. To me these are basically equivalent results 😇 -- if it were a 10x difference, then that would be a different story.

I'm pushing back on this a little bit, because if we start using build tags to specialize implementations, then it wouldn't make sense to just optimize Compare; many (most?) methods would benefit from specialized implementations in this way. And that's a whole lot of work, and more importantly an enormously larger maintenance burden for me. I'm reluctant to opt in to those additional costs, especially since these optimizations can easily be made by consumers in their own code.
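As an illustration of that last point (a sketch only; the helper names are made up and the import path assumes ulid v2), a consumer can keep a faster ordering entirely in its own code:

```go
package myapp

import (
	"encoding/binary"
	"sort"

	"github.com/oklog/ulid/v2"
)

// lessULID reports whether a sorts before b, comparing each ULID as two
// big-endian uint64 halves instead of byte-by-byte.
func lessULID(a, b ulid.ULID) bool {
	ahi, bhi := binary.BigEndian.Uint64(a[:8]), binary.BigEndian.Uint64(b[:8])
	if ahi != bhi {
		return ahi < bhi
	}
	return binary.BigEndian.Uint64(a[8:]) < binary.BigEndian.Uint64(b[8:])
}

// SortULIDs sorts ids in ascending order without any changes to the ulid package itself.
func SortULIDs(ids []ulid.ULID) {
	sort.Slice(ids, func(i, j int) bool { return lessULID(ids[i], ids[j]) })
}
```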
Haha, no problem at all! I'm happy to have these kinds of discussions. And please don't read my comments as negative!
And I don't! It is a pleasure to communicate with you. Now that you've seen the reply, I'm closing this pull request, and I wish you a happy life!
This is the fix for #125
New:
Old:
As can be seen, the time for ULID Compare has decreased by about 51%!
All test cases and benchmarks have passed!
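For anyone trying to reproduce numbers like this, one common approach is to run the benchmarks repeatedly before and after the change (for example `go test -run=NONE -bench=Compare -count=10`) and compare the two outputs with `benchstat`; the exact flag values here are only a suggestion.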