Major build cleanup. #10

adam-azarchs · 2019-02-26T03:20:27Z

Fix the Makefile dependencies so things get rebuilt when they should be.

Only produce one .so file, which knows how to run both 2d and 3d.

Break out main and the I/O methods into a separate cpp file, which is
not used for the python extension module.

Do not create pointless objects. The methods are all static. TSNE is
now a namespace, not a class, and the exported run method is wrapped
in extern "C" to make linkage easier.

Minor cleanup of includes / using statements.

Because I just couldn't help myself, remove a couple places where
new/delete were being used for no good reason.

Get rid of a few places where NDIMS (the build time constant) and a
run-time variable, which are supposed to have the same value, were both
being used in a method. This means flowing the template a bit further
up the tree. There's lots more places this could be done - this only
addresses the cases where not doing so could lead to errors if the
template variable was inconsistent with the run-time variable.

Kill off the last pointless vestiges of the cblas dependency.

Use hidden visibility where appropriate - only export the run() method.
This cuts the size of the .so and allows the compiler to inline more
aggressively, especially since we've also turned on LTO.

Together, all of these changes result in a build with a single 901k
extension module, instead of two 1.9MB extension modules.

nlhepler · 2019-02-26T21:56:58Z

I am happy with where this is, I think the only thing I could pessimistically think of is a test (with fixed seed) that equates the result with a known quantity.

adam-azarchs · 2019-03-13T20:36:45Z

I moved the consistency test to the start of the commit chain and remade it after disabling -ffast-math from the build. The new HEAD commit still passes. There were some differences introduced by different treatment of duplicate points which have been fixed now.

Then I turned -ffast-math back on since it seems to still pass the tests.

Turning off -ffast-math is because otherwise any little compile flags change will change the output. Also remove unwanted library_dirs and include_dirs.

Fix the Makefile dependencies so things get rebuilt when they should be. Add a `test` target. Only produce one .so file, which knows how to run both 2d and 3d. Break out main and the I/O methods into a separate cpp file, which is not used for the python extension module. Do not create pointless objects. The methods are all static. TSNE is now a namespace, not a class, and the exported `run` method is wrapped in `extern "C"` to make linkage easier. Minor cleanup of includes / using statements. Because I just couldn't help myself, remove a couple places where new/delete were being used for no good reason. Get rid of a few places where NDIMS (the build time constant) and a run-time variable, which are supposed to have the same value, were both being used in a method. This means flowing the template a bit further up the tree. There's lots more places this could be done - this only addresses the cases where not doing so could lead to errors if the template variable was inconsistent with the run-time variable. Kill off the last pointless vestiges of the cblas dependency. Use hidden visibility where appropriate - only export the run() method. This cuts the size of the .so and allows the compiler to inline more aggressively, especially since we've also turned on LTO. Together, all of these changes result in a build with a single 901k extension module, instead of two 1.9MB extension modules. Set -std=c++14 in the build. This is required by some of the cleanups because of the use of the `nullptr` keyword. Older gcc will not default to c++11 or higher, so will fail to build nullptr. And c++17 doesn't work because ``` include/python2.7/stringobject.h:175:5: error: ISO C++17 does not allow 'register' storage class specifier [-Wregister] register Py_ssize_t *len /* pointer to length variable or NULL ^~~~~~~~~ ``` Update travis build to Xenial container. This updates gcc to 5.4, which has the required c++ standards support.

Also flow down templating to a few more places, and use arrays instead of vectors where possible.

Allocate an array of children, rather than having an array of child pointers. Cleanup useless methods, constness.

This decreases the in-memory size of the tree, which significantly increases the speed since it's easier for the CPU to fit it in cache. There's more that can be done with this - in particular, the boundary. But that's more complicated.

Rather than downloading an entire miniconda distribution, just `pip install` the requisite packages. Should speed up the build significantly. Also, use clang compiler only, because otherwise we fail the numerical stability test (which is ok, generally, but annoying). Use lld to link, because the llvm LTO plugin for ld.gold doesn't work on travis. Turn of stability test on CI. Minor compiler optimization differences can destroy repeatability. Getting it to not do that is too complicated.

1. Don't copy the data. Just keep a reference. 2. Don't store the dimension in every datapoint. That's ridiculously inefficient. 3. Don't store the index in every datapoint. Just compute it as needed from pointer arithmetic.

This allows for more sse optimizations in a few cases.

Makes the datapoints smaller and allows vectorization in initialization to work better.

In addition to being more clear, this allows the compiler to recognize that pos_f is not aliased anywhere else so it can vectorize the calculation better.

Not only was that an important slowdown, but it also impacted repeatability across libm implementations.

adam-azarchs · 2019-03-31T02:50:03Z

On a somewhat larger dataset, which took around a minute on marsoc, running on hydra we see

key              change    before       after
     walltime   -21.1 %        35          28
 max_walltime   -21.1 %        35          28
        utime   -21.5 %        34          27
    max_utime   -21.5 %        34          27
        stime    -5.5 %         1           1
    max_stime    -5.5 %         1           1
          rss   -------    628028      627996
      max_rss   -------    623012      622860
     max_vmem   -------   1294012     1291956

which is roughly consistent with the improvement I expected to see.

adam-azarchs requested a review from nlhepler February 26, 2019 03:20

adam-azarchs force-pushed the azarchs/build-improvements branch from 1450688 to 18b22ea Compare March 13, 2019 20:32

adam-azarchs force-pushed the azarchs/build-improvements branch 8 times, most recently from 4dd81ff to 4184349 Compare March 13, 2019 23:10

Lance Hepler and others added 11 commits March 13, 2019 16:17

Add algorithm stability test, turn off -ffast-math

86098d4

Turning off -ffast-math is because otherwise any little compile flags change will change the output. Also remove unwanted library_dirs and include_dirs.

Turn on LTO.

68d9b11

Remove TSNE namespace, leave anonymous namespace.

69eb8ba

clang-format c++ code.

c33068f

Replace uses of malloc/free with RAII types.

9be2fb5

Also flow down templating to a few more places, and use arrays instead of vectors where possible.

Specialize for QT_NODE_CAPACITY=1.

4fea127

Remove some useless members from sptree nodes.

4c3309f

Flatten sptree child memory.

0f7ef1e

Allocate an array of children, rather than having an array of child pointers. Cleanup useless methods, constness.

adam-azarchs force-pushed the azarchs/build-improvements branch from 4184349 to 8c86710 Compare March 13, 2019 23:23

Add support for PGO build with clang.

6f1065d

adam-azarchs force-pushed the azarchs/build-improvements branch from 70d6d93 to 6f1065d Compare March 14, 2019 01:23

adam-azarchs added 2 commits March 25, 2019 18:50

Make VpTree DataPoint objects much more efficient.

ec20c0a

1. Don't copy the data. Just keep a reference. 2. Don't store the dimension in every datapoint. That's ridiculously inefficient. 3. Don't store the index in every datapoint. Just compute it as needed from pointer arithmetic.

Align SPTree nodes on 16-byte bounds.

1fb79c1

This allows for more sse optimizations in a few cases.

adam-azarchs mentioned this pull request Mar 29, 2019

Optimizations and build improvements. #5

Closed

Restructure vptree to work on indices.

cba336e

Makes the datapoints smaller and allows vectorization in initialization to work better.

adam-azarchs added 2 commits March 29, 2019 11:06

Have computeEdgeForces return a vector<double>

7a774ee

In addition to being more clear, this allows the compiler to recognize that pos_f is not aliased anywhere else so it can vectorize the calculation better.

Remove sqrt call.

453270e

Not only was that an important slowdown, but it also impacted repeatability across libm implementations.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Major build cleanup. #10

Major build cleanup. #10

Uh oh!

adam-azarchs commented Feb 26, 2019

Uh oh!

nlhepler commented Feb 26, 2019 •

edited

Loading

Uh oh!

adam-azarchs commented Mar 13, 2019

Uh oh!

adam-azarchs commented Mar 31, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Major build cleanup. #10

Are you sure you want to change the base?

Major build cleanup. #10

Uh oh!

Conversation

adam-azarchs commented Feb 26, 2019

Uh oh!

nlhepler commented Feb 26, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

adam-azarchs commented Mar 13, 2019

Uh oh!

adam-azarchs commented Mar 31, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

nlhepler commented Feb 26, 2019 •

edited

Loading