
Conversation


@adam-azarchs adam-azarchs commented Jun 25, 2018

NOTE: Absolutely DO NOT merge this before modifying the sake configs to pin the revision being built for PD/CS tarballs for older versions.

A whole bunch of changes:

  1. Code cleanups.
  • Use appropriate C++ data structures, rather than manual memory management.
  • clang-format: no more mixed spaces and tabs.
  • Namespaces.
  • TSNE shouldn't have been a class - it doesn't have members.
  2. Build improvements.
  • Make a single .so with dynamic dispatch, rather than choosing which dimension mode to use at link time. Much more reliable and debuggable.
  • Enable LTO.
  • Get rid of the build dependency on blas/atlas. It wasn't actually being used, and it made packaging much more complex.
  • Break tsne's main and data load/save methods into separate files, which don't need to be built when linking the Python extension.
  3. Run-time optimizations.
  • Mostly, reducing the size of the tree nodes by eliminating redundancy and passing data down from above where appropriate.
  • Flow the templating on dimension further up the stack.
  • Less needless pointer indirection.
  • Small code restructurings to allow better loop vectorization by the compiler.
  • Arithmetic simplifications to eliminate redundant computations.
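The single-.so dynamic-dispatch approach in item 2 can be sketched roughly like this (the type and function names are illustrative, not the actual sptree code): one virtual interface hides the compile-time dimension, and a small factory picks the instantiation at run time instead of at link time.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>

// Abstract interface hiding the compile-time dimension parameter.
struct TreeBase {
    virtual ~TreeBase() = default;
    virtual int dims() const = 0;
};

// Dimension-templated implementation; inside it, every per-coordinate
// loop has a compile-time trip count the compiler can unroll.
template <int D>
struct Tree : TreeBase {
    int dims() const override { return D; }
};

// Single entry point in the one shared object: dispatch on no_dims at
// run time, rather than building a separate .so per dimension.
std::unique_ptr<TreeBase> make_tree(int no_dims) {
    switch (no_dims) {
        case 2: return std::unique_ptr<TreeBase>(new Tree<2>());
        case 3: return std::unique_ptr<TreeBase>(new Tree<3>());
        default: throw std::invalid_argument("unsupported no_dims");
    }
}
```

The dispatch cost is paid once per tree operation at the top, while the hot inner loops still see a constant dimension.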

Testing on cellranger sample 74484, there are some small changes in the output. I'm not sure if this is because I fixed a bug, introduced a bug, or just a result of numerical instability with some of the changes to arithmetic.

Before:

Using random seed: 209652396
Using no_dims = 2, perplexity = 30.000000, and theta = 0.500000
Using max_iter = 1000, stop_lying_iter = 250, mom_switch_iter = 250
Computing input similarities...
Building tree...
 - point 0 of 17030
 - point 10000 of 17030
Input similarities computed in 2.31 seconds (sparsity = 0.007233)!
Learning embedding...
Iteration 50: error is 103.968878 (50 iterations in 3.29 seconds)
Iteration 100: error is 95.993534 (50 iterations in 2.98 seconds)
Iteration 150: error is 84.868840 (50 iterations in 2.63 seconds)
Iteration 200: error is 82.552828 (50 iterations in 2.58 seconds)
Iteration 250: error is 4.310005 (50 iterations in 2.51 seconds)
Iteration 300: error is 3.307660 (50 iterations in 2.29 seconds)
Iteration 350: error is 2.946275 (50 iterations in 2.27 seconds)
Iteration 400: error is 2.718556 (50 iterations in 2.29 seconds)
Iteration 450: error is 2.556061 (50 iterations in 2.30 seconds)
Iteration 500: error is 2.433838 (50 iterations in 2.31 seconds)
Iteration 550: error is 2.337590 (50 iterations in 2.31 seconds)
Iteration 600: error is 2.260225 (50 iterations in 2.32 seconds)
Iteration 650: error is 2.196263 (50 iterations in 2.32 seconds)
Iteration 700: error is 2.143605 (50 iterations in 2.32 seconds)
Iteration 750: error is 2.098874 (50 iterations in 2.32 seconds)
Iteration 800: error is 2.060644 (50 iterations in 2.32 seconds)
Iteration 850: error is 2.027915 (50 iterations in 2.32 seconds)
Iteration 900: error is 1.999645 (50 iterations in 2.32 seconds)
Iteration 950: error is 1.975346 (50 iterations in 2.32 seconds)
Iteration 999: error is 1.955789 (50 iterations in 2.27 seconds)
Fitting performed in 48.59 seconds.

After:

Using random seed: 209652396
Using D = 10, no_dims = 2, perplexity = 30.000000, and theta = 0.500000
Using max_iter = 1000, stop_lying_iter = 250, mom_switch_iter = 250
Computing input similarities...
Computing approximate perplexity...
Building tree...
 - point 0 of 17030
 - point 10000 of 17030
Input similarities computed in 2.43 seconds (sparsity = 0.007233)!
Learning embedding...
Iteration 50: error is 103.968878 (50 iterations in 2.09 seconds)
Iteration 100: error is 95.993534 (50 iterations in 1.85 seconds)
Iteration 150: error is 84.868779 (50 iterations in 1.61 seconds)
Iteration 200: error is 82.552551 (50 iterations in 1.58 seconds)
Iteration 250: error is 4.310008 (50 iterations in 1.53 seconds)
Iteration 300: error is 3.307299 (50 iterations in 1.39 seconds)
Iteration 350: error is 2.945971 (50 iterations in 1.38 seconds)
Iteration 400: error is 2.715925 (50 iterations in 1.38 seconds)
Iteration 450: error is 2.553057 (50 iterations in 1.39 seconds)
Iteration 500: error is 2.431074 (50 iterations in 1.39 seconds)
Iteration 550: error is 2.335568 (50 iterations in 1.39 seconds)
Iteration 600: error is 2.258516 (50 iterations in 1.39 seconds)
Iteration 650: error is 2.195174 (50 iterations in 1.39 seconds)
Iteration 700: error is 2.142492 (50 iterations in 1.40 seconds)
Iteration 750: error is 2.097975 (50 iterations in 1.41 seconds)
Iteration 800: error is 2.060355 (50 iterations in 1.41 seconds)
Iteration 850: error is 2.027727 (50 iterations in 1.44 seconds)
Iteration 900: error is 1.999039 (50 iterations in 1.42 seconds)
Iteration 950: error is 1.974375 (50 iterations in 1.42 seconds)
Iteration 999: error is 1.953511 (50 iterations in 1.39 seconds)
Fitting performed in 29.65 seconds.

Also, enable link-time optimization.
Inspired by https://www.microsoft.com/en-us/research/blog/optimizing-barnes-hut-t-sne/
but backported to our fork.

First of all, remove all manual memory management and use STL containers
as appropriate.

Remove the buffer used for caching diffs, and instead compute on the fly
- it's a cheap computation, and repeating it reduces memory cache misses
which matters a lot more than an extra subtraction or two.  It also
saves <dimension> doubles worth of memory per node.
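A minimal sketch of that diff-buffer removal, with a hypothetical helper name: the per-coordinate difference is recomputed inside the loop rather than read back from a cached per-node buffer, trading a subtraction for fewer cache misses.

```cpp
#include <cassert>

// Sketch: recompute the difference on the fly instead of caching it in a
// per-node `diffs` buffer.  The subtraction is cheap; not touching an
// extra D-double buffer per node avoids memory cache misses.
template <int D>
double squared_distance(const double* a, const double* b) {
    double sqdist = 0.0;
    for (int d = 0; d < D; ++d) {
        const double diff = a[d] - b[d];  // computed here, never stored
        sqdist += diff * diff;
    }
    return sqdist;
}

// Example: squared distance between (0,0) and (3,4) is 25.
inline double demo() {
    const double a[2] = {0.0, 0.0}, b[2] = {3.0, 4.0};
    return squared_distance<2>(a, b);
}
```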

Share the widths structure between nodes at the same level.  This saves
(<dimension> - 2) * 2^<dimension> pointers, and costs <dimension>
doubles.  So at dimension 2 this actually costs more than it saves, but
still helps with memory locality.  At higher dimension it saves a lot.
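The space tradeoff described above can be checked with a little arithmetic. This sketch assumes 8-byte pointers and doubles; the helper names are made up for the example.

```cpp
#include <cassert>

// From the commit message: sharing one widths array per tree level saves
// (D - 2) * 2^D pointers and costs D doubles.
constexpr long bytes_saved(int D) {
    return (D - 2) * (1L << D) * 8;  // pointers no longer stored per node
}
constexpr long bytes_cost(int D) {
    return D * 8;  // one shared widths array of D doubles per level
}
```

At D = 2 the saving is zero while the cost is 16 bytes (hence "costs more than it saves"), but at D = 3 it already saves 64 bytes against a 24-byte cost, and the gap widens quickly with dimension.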

Minor optimization of the forces inequality, which should help both
memory locality and speed.

Ran include-what-you-use to get the headers right.  Removed using
namespace std to avoid problems with future C++ standards upgrades.
Also stop carrying around parent pointers.  We don't need them.

Stuff that can be easily flowed down from the top at query time now is,
rather than being duplicated at every level.

All told, by special-casing the top node we can remove 4 pointers, an
int, and a bool from every node.
Reduce manual memory management.

Allow the compiler to take advantage of compile-time knowledge of the
dimension when using the templated form of SPTree.  Especially since,
for us, D=2 or 3, the benefits of loop unrolling can be significant.
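A rough illustration of why compile-time knowledge of the dimension helps (the function name is invented for the example): with NDims as a template parameter, the loop trip count is a constant, so the compiler can fully unroll and, for D = 2 or 3, vectorize it.

```cpp
#include <cassert>

// Sketch: the trip count of every per-coordinate loop is a compile-time
// constant, so the compiler unrolls it instead of emitting loop control.
template <int NDims>
void accumulate_force(const double* diff, double mult, double* force) {
    for (int d = 0; d < NDims; ++d)   // trip count known at compile time
        force[d] += mult * diff[d];
}

// Only the dimensions actually used need to be instantiated:
template void accumulate_force<2>(const double*, double, double*);
template void accumulate_force<3>(const double*, double, double*);

// Example: force starts at (0,0); adding 3 * (1,2) gives (3,6).
inline double demo() {
    double diff[2] = {1.0, 2.0}, force[2] = {0.0, 0.0};
    accumulate_force<2>(diff, 3.0, force);
    return force[0] + force[1];
}
```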

Remove blas from the python build.  It isn't actually being used, so
there's no reason to introduce a dependency on it.

Adjust the build settings for the python setup build to be the same as
the Makefile, at least on linux.  Specifically, turn on LTO and
gc-sections.  These make a large difference in binary size, especially
when statically linking libc++.
Also get rid of 'using namespace std' in tsne.cpp.  Instead,
'using std::vector' gets us the same benefit with less risk of being
broken by future versions of C++ introducing symbols with names we
already use.

Use nullptr instead of NULL.
Add some assertions.
Because we use tsne through the cython wrapper, we don't actually use
the c++ main method.
As opposed to tricks with linking and #defines which cython doesn't do a
great job obeying.
The class had no members, which made it useless.
It's not used by the python wrapper, so no need to compile it.
Properly take advantage of dynamic dispatch.
Rather than an array of pointers, use a pointer to an array.  The improved
memory locality yields a further ~20% performance improvement!
Also, tabs -> spaces.
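A sketch of that array-of-pointers to pointer-to-array change (Node and NodeChildren are illustrative stand-ins, not the real SPTree types): all 2^D children live in one contiguous allocation, so walking them touches adjacent cache lines instead of chasing scattered heap pointers.

```cpp
#include <cassert>
#include <memory>

struct Node {          // stand-in for the real tree node type
    int cum_size = 0;
};

// Before (sketch): 2^D separately heap-allocated children, each behind
// its own pointer, scattered across the heap:
//   std::unique_ptr<Node> children[1 << D];
//
// After (sketch): a single pointer to one contiguous array of children.
template <int D>
struct NodeChildren {
    std::unique_ptr<Node[]> children{new Node[1 << D]};
};

// Example: a D = 2 node owns four children in one contiguous block.
inline int demo() {
    NodeChildren<2> c;
    return c.children ? (1 << 2) : 0;
}
```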
@adam-azarchs adam-azarchs requested a review from nlhepler June 25, 2018 22:17
break;
}
}
index[i] = -1;


This line is unreplicated. Is it unnecessary?

Author

Yep. index is only read for leaf nodes, which this no longer is.

div *= 2;
}
children[i] = new SPTree(this, data, new_corner, new_width);
children[i] = unique_ptr<SPTree>(new SPTree(this, data, move(new_corner), child_widths.get()));


Is this safe? You're passing the child_widths to SPTree::init which then sets the raw ptr to a point_t::width pointer--but child_widths will be freed before subdivide() returns.

Author

No, which is one reason why that changed in a later revision.

double H = .0;
for(int m = 0; m < K; m++) H += beta * (distances[m + 1] * distances[m + 1] * cur_P[m]);
for (int m = 0; m < K; m++)
H += beta * (distances[m + 1] * distances[m + 1] * cur_P[m]);


You could do the same factorization of beta here.
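The reviewer's suggestion, sketched with hypothetical names: beta is loop-invariant, so it can be multiplied in once after the sum instead of K times inside it.

```cpp
#include <cassert>
#include <vector>

// Sketch of the suggested factorization: hoist the loop-invariant `beta`
// out of the accumulation, so it multiplies the sum once rather than
// multiplying every term.
double entropy_term(double beta, const std::vector<double>& distances,
                    const std::vector<double>& cur_P, int K) {
    double H = 0.0;
    for (int m = 0; m < K; ++m)
        H += distances[m + 1] * distances[m + 1] * cur_P[m];
    return beta * H;  // factored out of the loop
}

// Example: beta = 2, one term with distance 3 and P = 1  =>  2 * 9 = 18.
inline double demo() {
    return entropy_term(2.0, {0.0, 3.0}, {1.0}, 1);
}
```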

Author

Yeah, there's more that can be done, but 57% of profile samples are on line 264 of sptree.cpp so unless something was offensive to my sense of aesthetics I didn't look too closely. Will update this, however.

Why do it N times when you can do it just once?
@adam-azarchs (Author)

More specifics on the profiling data:

55.27% TSNE::_sptree_internal::SPTreeNode<2>::computeNonEdgeForces
16.57% TSNE::run
3.67%  TSNE::_sptree_internal::SPTreeNode<2>::insert

The rest is trivial. Within computeNonEdgeForces, the biggest things are:

       │         if (cum_size == 0 ||
  0.27 │        mov    0x28(%rdi),%eax
 14.48 │        test   %eax,%eax
       │      ↓ je     cc
  1.21 │        cmp    $0x1,%eax

and

       │             double mult = cum_size * sqdist;
  7.65 │        mov    %eax,%eax

(which is why there is a fast-path escape for cum_size == 1)

The TSNE::run samples are actually in SPTree<2>::computeEdgeForces, which was inlined, and is as one would expect dominated by the inner loop:

       │                  const double* data_2 = data + col_P[i] * NDims;
  2.29 │3ab0:   mov    0x148(%rsp),%rsi
  5.82 │        mov    (%r15,%rdi,4),%ebx
  0.65 │        add    %ebx,%ebx
       │                      diffs[d] = data_1[d] - data_2[d];
  0.95 │        movupd (%rax),%xmm1
  2.71 │        movupd (%rsi,%rbx,8),%xmm2
  6.35 │        subpd  %xmm2,%xmm1
       │                      sqdist += diffs[d] * diffs[d];
  0.79 │        movapd %xmm1,%xmm2
  1.00 │        mulpd  %xmm2,%xmm2
  2.94 │        movapd %xmm2,%xmm3
  6.62 │        addsd  %xmm4,%xmm3
  0.51 │        movhlps %xmm2,%xmm2
  1.17 │        addsd  %xmm3,%xmm2
       │                  sqdist = val_P[i] / sqdist;
  2.73 │        movsd  (%r12,%rdi,8),%xmm3
  8.92 │        divsd  %xmm2,%xmm3
       │                      pos_f[d] += sqdist * diffs[d];
  5.33 │        movddup %xmm3,%xmm2
  1.75 │        mulpd  %xmm1,%xmm2
  8.05 │        addpd  %xmm2,%xmm0
 20.55 │        movupd %xmm0,(%rcx)

As you can see, the for loops don't show up because they've been unrolled and vectorized, which is of course the whole point of templating on NDims.

Also add visibility attributes for internal classes.  This allows for
more aggressive linker garbage collection and optimization.  Turning on
-fvisibility=hidden globally would require cython to set visibility
attributes on the entry points it exports, so we're not going to get the
full benefit, unfortunately.

Also change containsPoint comparison to abs(corner-point)>width.
This replaces a second subtract/compare with a single vectorized
bitwise and instruction, and by reducing branch counts also helps CPU
pipelining.
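A sketch of the containsPoint change described above (contains_point is an illustrative name): one fabs and one compare per coordinate, instead of testing both sides of the interval. fabs compiles down to a single bitwise AND that clears the sign bit, and halving the comparisons reduces branch counts.

```cpp
#include <cassert>
#include <cmath>

// Sketch: a point is inside the cell iff, in every dimension, its offset
// from the cell corner is within the cell's half-width.  One fabs+compare
// replaces the two-sided (point < corner - width || point > corner + width)
// test.
template <int D>
bool contains_point(const double* corner, const double* width,
                    const double* point) {
    for (int d = 0; d < D; ++d)
        if (std::abs(corner[d] - point[d]) > width[d])
            return false;
    return true;
}

// Example: cell centered at the origin with half-width 1 in each dimension.
inline bool demo() {
    const double corner[2] = {0.0, 0.0}, width[2] = {1.0, 1.0};
    const double inside[2] = {0.5, -0.5}, outside[2] = {2.0, 0.0};
    return contains_point<2>(corner, width, inside) &&
           !contains_point<2>(corner, width, outside);
}
```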
@adam-azarchs (Author)

Closing in favor of #10. Though this PR does more optimization than that one, that one made its changes more carefully. Keeping this branch around, however, for future reference.
