-
Notifications
You must be signed in to change notification settings - Fork 2
Optimizations and build improvements. #5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Also, enable link-time optimize.
Inspired by https://www.microsoft.com/en-us/research/blog/optimizing-barnes-hut-t-sne/ but backported to our fork. First of all, remove all manual memory management and use STL containers as appropriate. Remove the buffer used for caching diffs, and instead compute on the fly - it's a cheap computation, and repeating it reduces memory cache misses which matters a lot more than an extra subtraction or two. It also saves <dimension> doubles worth of memory per node. Share the widths structure between nodes at the same level. This saves (<dimension> - 2) * 2^<dimension> pointers, and costs <dimension> doubles. So at dimension 2 this actually costs more than it saves, but still helps with memory locality. At higher dimension it saves a lot. Minor optimization of the forces inequality, which should help both memory locality and speed. Ran include-what-you-use to get the headers right. Removed using namespace std to avoid problems with future C++ standards upgrades.
Also stop carrying around parent pointers. We don't need them. Stuff that can be easily flowed down from the top at query time now is, rather than being duplicated at every level. All told, by special-casing the top node we can remove 4 pointers, an int, and a bool from every node.
Reduce manual memory management. Allow the compiler to take advantage of compile-time knowledge of the dimension when using the templated form of SPTree. Especially since, for us, D=2 or 3, the benefits of loop unrolling can be significant. Remove blas from the python build. It isn't actually being used, so there's no reason to introduce a dependency on it. Adjust the build settings for the python setup build to be the same as the Makefile, at least on linux. Specifically, turn on LTO and gc-sections. These make a large difference in binary size, especially when statically linking libc++.
Also get rid of 'using namespace std' in tsne.cpp. Instead, 'using std::vector' gets us the same benefit with less risk of being broken by future versions of c++ introducing symbols with names we already use. Use nullptr instead of NULL.
Add some assertions.
Because we use tsne through the cython wrapper, we don't actually use the c++ main method.
As opposed to tricks with linking and #defines which cython doesn't do a great job obeying.
The class had no members, which made it useless.
It's not used by the python wrapper, so no need to compile it.
Properly take advantage of dynamic dispatch.
Rather than an array of pointers, a pointer to an array. Improved memory locality yeilds a further ~20% performance improvement!
Also enable sse3.
Saves 10% or so.
Also, tabs -> spaces.
| break; | ||
| } | ||
| } | ||
| index[i] = -1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line is unreplicated. Is it unnecessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep. index is only read for leaf nodes, which this no longer is.
tsne/bh_sne_src/sptree.cpp
Outdated
| div *= 2; | ||
| } | ||
| children[i] = new SPTree(this, data, new_corner, new_width); | ||
| children[i] = unique_ptr<SPTree>(new SPTree(this, data, move(new_corner), child_widths.get())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this safe? You're passing the child_widths to SPTree::init which then sets the raw ptr to a point_t::width pointer--but child_widths will be freed before subdivide() returns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, which is one reason why that changed in a later revision.
tsne/bh_sne_src/tsne.cpp
Outdated
| double H = .0; | ||
| for(int m = 0; m < K; m++) H += beta * (distances[m + 1] * distances[m + 1] * cur_P[m]); | ||
| for (int m = 0; m < K; m++) | ||
| H += beta * (distances[m + 1] * distances[m + 1] * cur_P[m]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could do the same factorization of beta here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, there's more that can be done, but 57% of profile samples are on line 264 of sptree.cpp so unless something was offensive to my sense of aesthetics I didn't look too closely. Will update this, however.
Why do it N times when you can do it just once?
|
More specifics on the profiling data: The rest is trivial. Within │ if (cum_size == 0 ||
0.27 │ mov 0x28(%rdi),%eax
14.48 │ test %eax,%eax
│ ↓ je cc
1.21 │ cmp $0x1,%eaxand │ double mult = cum_size * sqdist;
7.65 │ mov %eax,%eax(which is why the fast-path escape for The │ const double* data_2 = data + col_P[i] * NDims;
2.29 │3ab0: mov 0x148(%rsp),%rsi
5.82 │ mov (%r15,%rdi,4),%ebx
0.65 │ add %ebx,%ebx
│ diffs[d] = data_1[d] - data_2[d];
0.95 │ movupd (%rax),%xmm1
2.71 │ movupd (%rsi,%rbx,8),%xmm2
6.35 │ subpd %xmm2,%xmm1
│ sqdist += diffs[d] * diffs[d];
0.79 │ movapd %xmm1,%xmm2
1.00 │ mulpd %xmm2,%xmm2
2.94 │ movapd %xmm2,%xmm3
6.62 │ addsd %xmm4,%xmm3
0.51 │ movhlp %xmm2,%xmm2
1.17 │ addsd %xmm3,%xmm2
│ sqdist = val_P[i] / sqdist;
2.73 │ movsd (%r12,%rdi,8),%xmm3
8.92 │ divsd %xmm2,%xmm3
│ pos_f[d] += sqdist * diffs[d];
5.33 │ movddu %xmm3,%xmm2
1.75 │ mulpd %xmm1,%xmm2
8.05 │ addpd %xmm2,%xmm0
20.55 │ movupd %xmm0,(%rcx)As you can see the for loops don't show up because they've been unrolled and vectorized, which is of course the whole point of templating on |
Also add visibility attributes for internal classes. This allows for more aggressive linker garbage collection and optimization. Turning on -fvisiblity=hidden globablly would require cython to set visibility attributes on the entry points it exports, so we're not going to get the full benefit, unfortunately. Also change containsPoint comparison to abs(corner-point)>width. This replaces a second subtract/compare with a single vectorized bitwise and instruction, and by reducing branch counts also helps CPU pipelining.
|
Closing in favor of #10. Though this does more optimization than that one, that one made its changes more carefully. Keeping this branch around however for future reference. |
NOTE: Absolutely DO NOT merge this before modifying the sake configs to pin the revision being built for PD/CS tarballs for older versions.
A whole bunch of changes:
Testing on cellranger sample 74484, there are some small changes in the output. I'm not sure if this is because I fixed a bug, introduced a bug, or just a result of numerical instability with some of the changes to arithmetic.
Before:
After: