
Conversation


@adam-azarchs adam-azarchs commented Jun 25, 2018

NOTE: Absolutely DO NOT merge this before modifying the sake configs to pin the revision being built for PD/CS tarballs for older versions.

A whole bunch of changes:

  1. Code cleanups.
  • Use appropriate C++ data structures, rather than manual memory management.
  • clang-format: no more mixed spaces and tabs.
  • Namespaces.
  • TSNE shouldn't have been a class - it doesn't have members.
  2. Build improvements.
  • Make a single .so with dynamic dispatch, rather than choosing which dimension mode to use at link time. Much more reliable and debuggable.
  • Enable LTO.
  • Get rid of the build dependency on blas/atlas. It wasn't actually being used, and it made packaging much more complex.
  • Break tsne's main and data load/save methods into separate files, which don't need to be built when linking the Python extension.
  3. Run-time optimizations.
  • Mostly, reducing the size of the tree nodes by eliminating redundancy and passing data down from above where appropriate.
  • Flow the templating on dimension further up the stack.
  • Less needless pointer indirection.
  • Small code restructurings to allow better loop vectorization by the compiler.
  • Arithmetic simplifications to eliminate redundant computations.
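The single-.so dynamic-dispatch approach in item 2 can be sketched roughly like this (the type and function names are illustrative, not the actual sptree code): one virtual interface hides the compile-time dimension, and a small factory picks the instantiation at run time instead of at link time.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>

// Abstract interface hiding the compile-time dimension parameter.
struct TreeBase {
    virtual ~TreeBase() = default;
    virtual int dims() const = 0;
};

// Dimension-templated implementation; inside it, every per-coordinate
// loop has a compile-time trip count the compiler can unroll.
template <int D>
struct Tree : TreeBase {
    int dims() const override { return D; }
};

// Single entry point in the one shared object: dispatch on no_dims at
// run time, rather than building a separate .so per dimension.
std::unique_ptr<TreeBase> make_tree(int no_dims) {
    switch (no_dims) {
        case 2: return std::unique_ptr<TreeBase>(new Tree<2>());
        case 3: return std::unique_ptr<TreeBase>(new Tree<3>());
        default: throw std::invalid_argument("unsupported no_dims");
    }
}
```

The dispatch cost is paid once per tree operation at the top, while the hot inner loops still see a constant dimension.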

Testing on cellranger sample 74484, there are some small changes in the output. I'm not sure if this is because I fixed a bug, introduced a bug, or just a result of numerical instability with some of the changes to arithmetic.

Before:

Using random seed: 209652396
Using no_dims = 2, perplexity = 30.000000, and theta = 0.500000
Using max_iter = 1000, stop_lying_iter = 250, mom_switch_iter = 250
Computing input similarities...
Building tree...
 - point 0 of 17030
 - point 10000 of 17030
Input similarities computed in 2.31 seconds (sparsity = 0.007233)!
Learning embedding...
Iteration 50: error is 103.968878 (50 iterations in 3.29 seconds)
Iteration 100: error is 95.993534 (50 iterations in 2.98 seconds)
Iteration 150: error is 84.868840 (50 iterations in 2.63 seconds)
Iteration 200: error is 82.552828 (50 iterations in 2.58 seconds)
Iteration 250: error is 4.310005 (50 iterations in 2.51 seconds)
Iteration 300: error is 3.307660 (50 iterations in 2.29 seconds)
Iteration 350: error is 2.946275 (50 iterations in 2.27 seconds)
Iteration 400: error is 2.718556 (50 iterations in 2.29 seconds)
Iteration 450: error is 2.556061 (50 iterations in 2.30 seconds)
Iteration 500: error is 2.433838 (50 iterations in 2.31 seconds)
Iteration 550: error is 2.337590 (50 iterations in 2.31 seconds)
Iteration 600: error is 2.260225 (50 iterations in 2.32 seconds)
Iteration 650: error is 2.196263 (50 iterations in 2.32 seconds)
Iteration 700: error is 2.143605 (50 iterations in 2.32 seconds)
Iteration 750: error is 2.098874 (50 iterations in 2.32 seconds)
Iteration 800: error is 2.060644 (50 iterations in 2.32 seconds)
Iteration 850: error is 2.027915 (50 iterations in 2.32 seconds)
Iteration 900: error is 1.999645 (50 iterations in 2.32 seconds)
Iteration 950: error is 1.975346 (50 iterations in 2.32 seconds)
Iteration 999: error is 1.955789 (50 iterations in 2.27 seconds)
Fitting performed in 48.59 seconds.

After:

Using random seed: 209652396
Using D = 10, no_dims = 2, perplexity = 30.000000, and theta = 0.500000
Using max_iter = 1000, stop_lying_iter = 250, mom_switch_iter = 250
Computing input similarities...
Computing approximate perplexity...
Building tree...
 - point 0 of 17030
 - point 10000 of 17030
Input similarities computed in 2.43 seconds (sparsity = 0.007233)!
Learning embedding...
Iteration 50: error is 103.968878 (50 iterations in 2.09 seconds)
Iteration 100: error is 95.993534 (50 iterations in 1.85 seconds)
Iteration 150: error is 84.868779 (50 iterations in 1.61 seconds)
Iteration 200: error is 82.552551 (50 iterations in 1.58 seconds)
Iteration 250: error is 4.310008 (50 iterations in 1.53 seconds)
Iteration 300: error is 3.307299 (50 iterations in 1.39 seconds)
Iteration 350: error is 2.945971 (50 iterations in 1.38 seconds)
Iteration 400: error is 2.715925 (50 iterations in 1.38 seconds)
Iteration 450: error is 2.553057 (50 iterations in 1.39 seconds)
Iteration 500: error is 2.431074 (50 iterations in 1.39 seconds)
Iteration 550: error is 2.335568 (50 iterations in 1.39 seconds)
Iteration 600: error is 2.258516 (50 iterations in 1.39 seconds)
Iteration 650: error is 2.195174 (50 iterations in 1.39 seconds)
Iteration 700: error is 2.142492 (50 iterations in 1.40 seconds)
Iteration 750: error is 2.097975 (50 iterations in 1.41 seconds)
Iteration 800: error is 2.060355 (50 iterations in 1.41 seconds)
Iteration 850: error is 2.027727 (50 iterations in 1.44 seconds)
Iteration 900: error is 1.999039 (50 iterations in 1.42 seconds)
Iteration 950: error is 1.974375 (50 iterations in 1.42 seconds)
Iteration 999: error is 1.953511 (50 iterations in 1.39 seconds)
Fitting performed in 29.65 seconds.

Also, enable link-time optimization.
Inspired by https://www.microsoft.com/en-us/research/blog/optimizing-barnes-hut-t-sne/
but backported to our fork.

First of all, remove all manual memory management and use STL containers
as appropriate.

Remove the buffer used for caching diffs, and instead compute on the fly
- it's a cheap computation, and repeating it reduces memory cache misses
which matters a lot more than an extra subtraction or two.  It also
saves <dimension> doubles worth of memory per node.
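A minimal sketch of that diff-buffer removal, with a hypothetical helper name: the per-coordinate difference is recomputed inside the loop rather than read back from a cached per-node buffer, trading a subtraction for fewer cache misses.

```cpp
#include <cassert>

// Sketch: recompute the difference on the fly instead of caching it in a
// per-node `diffs` buffer.  The subtraction is cheap; not touching an
// extra D-double buffer per node avoids memory cache misses.
template <int D>
double squared_distance(const double* a, const double* b) {
    double sqdist = 0.0;
    for (int d = 0; d < D; ++d) {
        const double diff = a[d] - b[d];  // computed here, never stored
        sqdist += diff * diff;
    }
    return sqdist;
}

// Example: squared distance between (0,0) and (3,4) is 25.
inline double demo() {
    const double a[2] = {0.0, 0.0}, b[2] = {3.0, 4.0};
    return squared_distance<2>(a, b);
}
```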

Share the widths structure between nodes at the same level.  This saves
(<dimension> - 2) * 2^<dimension> pointers, and costs <dimension>
doubles.  So at dimension 2 this actually costs more than it saves, but
still helps with memory locality.  At higher dimension it saves a lot.
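The space tradeoff described above can be checked with a little arithmetic. This sketch assumes 8-byte pointers and doubles; the helper names are made up for the example.

```cpp
#include <cassert>

// From the commit message: sharing one widths array per tree level saves
// (D - 2) * 2^D pointers and costs D doubles.
constexpr long bytes_saved(int D) {
    return (D - 2) * (1L << D) * 8;  // pointers no longer stored per node
}
constexpr long bytes_cost(int D) {
    return D * 8;  // one shared widths array of D doubles per level
}
```

At D = 2 the saving is zero while the cost is 16 bytes (hence "costs more than it saves"), but at D = 3 it already saves 64 bytes against a 24-byte cost, and the gap widens quickly with dimension.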

Minor optimization of the forces inequality, which should help both
memory locality and speed.

Ran include-what-you-use to get the headers right.  Removed using
namespace std to avoid problems with future C++ standards upgrades.
Also stop carrying around parent pointers.  We don't need them.

Stuff that can be easily flowed down from the top at query time now is,
rather than being duplicated at every level.

All told, by special-casing the top node we can remove 4 pointers, an
int, and a bool from every node.
Reduce manual memory management.

Allow the compiler to take advantage of compile-time knowledge of the
dimension when using the templated form of SPTree.  Especially since,
for us, D=2 or 3, the benefits of loop unrolling can be significant.
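A rough illustration of why compile-time knowledge of the dimension helps (the function name is invented for the example): with NDims as a template parameter, the loop trip count is a constant, so the compiler can fully unroll and, for D = 2 or 3, vectorize it.

```cpp
#include <cassert>

// Sketch: the trip count of every per-coordinate loop is a compile-time
// constant, so the compiler unrolls it instead of emitting loop control.
template <int NDims>
void accumulate_force(const double* diff, double mult, double* force) {
    for (int d = 0; d < NDims; ++d)   // trip count known at compile time
        force[d] += mult * diff[d];
}

// Only the dimensions actually used need to be instantiated:
template void accumulate_force<2>(const double*, double, double*);
template void accumulate_force<3>(const double*, double, double*);

// Example: force starts at (0,0); adding 3 * (1,2) gives (3,6).
inline double demo() {
    double diff[2] = {1.0, 2.0}, force[2] = {0.0, 0.0};
    accumulate_force<2>(diff, 3.0, force);
    return force[0] + force[1];
}
```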

Remove blas from the python build.  It isn't actually being used, so
there's no reason to introduce a dependency on it.

Adjust the build settings for the python setup build to be the same as
the Makefile, at least on linux.  Specifically, turn on LTO and
gc-sections.  These make a large difference in binary size, especially
when statically linking libc++.
Also get rid of 'using namespace std' in tsne.cpp.  Instead,
'using std::vector' gets us the same benefit with less risk of being
broken by future versions of C++ introducing symbols with names we
already use.

Use nullptr instead of NULL.
Add some assertions.
Because we use tsne through the cython wrapper, we don't actually use
the c++ main method.
As opposed to tricks with linking and #defines which cython doesn't do a
great job obeying.
The class had no members, which made it useless.
It's not used by the python wrapper, so no need to compile it.
Properly take advantage of dynamic dispatch.
Rather than an array of pointers, use a pointer to an array.  The improved
memory locality yields a further ~20% performance improvement!
Also, tabs -> spaces.
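A sketch of that array-of-pointers to pointer-to-array change (Node and NodeChildren are illustrative stand-ins, not the real SPTree types): all 2^D children live in one contiguous allocation, so walking them touches adjacent cache lines instead of chasing scattered heap pointers.

```cpp
#include <cassert>
#include <memory>

struct Node {          // stand-in for the real tree node type
    int cum_size = 0;
};

// Before (sketch): 2^D separately heap-allocated children, each behind
// its own pointer, scattered across the heap:
//   std::unique_ptr<Node> children[1 << D];
//
// After (sketch): a single pointer to one contiguous array of children.
template <int D>
struct NodeChildren {
    std::unique_ptr<Node[]> children{new Node[1 << D]};
};

// Example: a D = 2 node owns four children in one contiguous block.
inline int demo() {
    NodeChildren<2> c;
    return c.children ? (1 << 2) : 0;
}
```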
@adam-azarchs adam-azarchs requested a review from nlhepler June 25, 2018 22:17
break;
}
}
index[i] = -1;


This line is unreplicated. Is it unnecessary?

Author

Yep. index is only read for leaf nodes, which this no longer is.

div *= 2;
}
children[i] = new SPTree(this, data, new_corner, new_width);
children[i] = unique_ptr<SPTree>(new SPTree(this, data, move(new_corner), child_widths.get()));


Is this safe? You're passing the child_widths to SPTree::init which then sets the raw ptr to a point_t::width pointer--but child_widths will be freed before subdivide() returns.

Author

No, which is one reason why that changed in a later revision.

double H = .0;
for(int m = 0; m < K; m++) H += beta * (distances[m + 1] * distances[m + 1] * cur_P[m]);
for (int m = 0; m < K; m++)
H += beta * (distances[m + 1] * distances[m + 1] * cur_P[m]);


You could do the same factorization of beta here.
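The reviewer's suggestion, sketched with hypothetical names: beta is loop-invariant, so it can be multiplied in once after the sum instead of K times inside it.

```cpp
#include <cassert>
#include <vector>

// Sketch of the suggested factorization: hoist the loop-invariant `beta`
// out of the accumulation, so it multiplies the sum once rather than
// multiplying every term.
double entropy_term(double beta, const std::vector<double>& distances,
                    const std::vector<double>& cur_P, int K) {
    double H = 0.0;
    for (int m = 0; m < K; ++m)
        H += distances[m + 1] * distances[m + 1] * cur_P[m];
    return beta * H;  // factored out of the loop
}

// Example: beta = 2, one term with distance 3 and P = 1  =>  2 * 9 = 18.
inline double demo() {
    return entropy_term(2.0, {0.0, 3.0}, {1.0}, 1);
}
```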

Author

Yeah, there's more that can be done, but 57% of profile samples are on line 264 of sptree.cpp so unless something was offensive to my sense of aesthetics I didn't look too closely. Will update this, however.

Why do it N times when you can do it just once?
@adam-azarchs (Author)

More specifics on the profiling data:

55.27% TSNE::_sptree_internal::SPTreeNode<2>::computeNonEdgeForces
16.57% TSNE::run
3.67%  TSNE::_sptree_internal::SPTreeNode<2>::insert

The rest is trivial. Within computeNonEdgeForces, the biggest things are:

       │         if (cum_size == 0 ||
  0.27 │        mov    0x28(%rdi),%eax
 14.48 │        test   %eax,%eax
       │      ↓ je     cc
  1.21 │        cmp    $0x1,%eax

and

       │             double mult = cum_size * sqdist;
  7.65 │        mov    %eax,%eax

(which is why there is a fast-path escape for cum_size == 1)

The TSNE::run samples are actually in SPTree<2>::computeEdgeForces, which was inlined, and is as one would expect dominated by the inner loop:

       │                  const double* data_2 = data + col_P[i] * NDims;
  2.29 │3ab0:   mov    0x148(%rsp),%rsi
  5.82 │        mov    (%r15,%rdi,4),%ebx
  0.65 │        add    %ebx,%ebx
       │                      diffs[d] = data_1[d] - data_2[d];
  0.95 │        movupd (%rax),%xmm1
  2.71 │        movupd (%rsi,%rbx,8),%xmm2
  6.35 │        subpd  %xmm2,%xmm1
       │                      sqdist += diffs[d] * diffs[d];
  0.79 │        movapd %xmm1,%xmm2
  1.00 │        mulpd  %xmm2,%xmm2
  2.94 │        movapd %xmm2,%xmm3
  6.62 │        addsd  %xmm4,%xmm3
  0.51 │        movhlps %xmm2,%xmm2
  1.17 │        addsd  %xmm3,%xmm2
       │                  sqdist = val_P[i] / sqdist;
  2.73 │        movsd  (%r12,%rdi,8),%xmm3
  8.92 │        divsd  %xmm2,%xmm3
       │                      pos_f[d] += sqdist * diffs[d];
  5.33 │        movddup %xmm3,%xmm2
  1.75 │        mulpd  %xmm1,%xmm2
  8.05 │        addpd  %xmm2,%xmm0
 20.55 │        movupd %xmm0,(%rcx)

As you can see, the for loops don't show up because they've been unrolled and vectorized, which is of course the whole point of templating on NDims.

Also add visibility attributes for internal classes.  This allows for
more aggressive linker garbage collection and optimization.  Turning on
-fvisibility=hidden globally would require cython to set visibility
attributes on the entry points it exports, so we're not going to get the
full benefit, unfortunately.

Also change containsPoint comparison to abs(corner-point)>width.
This replaces a second subtract/compare with a single vectorized
bitwise and instruction, and by reducing branch counts also helps CPU
pipelining.
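A sketch of the containsPoint change described above (contains_point is an illustrative name): one fabs and one compare per coordinate, instead of testing both sides of the interval. fabs compiles down to a single bitwise AND that clears the sign bit, and halving the comparisons reduces branch counts.

```cpp
#include <cassert>
#include <cmath>

// Sketch: a point is inside the cell iff, in every dimension, its offset
// from the cell corner is within the cell's half-width.  One fabs+compare
// replaces the two-sided (point < corner - width || point > corner + width)
// test.
template <int D>
bool contains_point(const double* corner, const double* width,
                    const double* point) {
    for (int d = 0; d < D; ++d)
        if (std::abs(corner[d] - point[d]) > width[d])
            return false;
    return true;
}

// Example: cell centered at the origin with half-width 1 in each dimension.
inline bool demo() {
    const double corner[2] = {0.0, 0.0}, width[2] = {1.0, 1.0};
    const double inside[2] = {0.5, -0.5}, outside[2] = {2.0, 0.0};
    return contains_point<2>(corner, width, inside) &&
           !contains_point<2>(corner, width, outside);
}
```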
@adam-azarchs (Author)

Closing in favor of #10. Though this PR does more optimization than that one, that one made its changes more carefully. Keeping this branch around, however, for future reference.
