Skip to content

Conversation

@adam-azarchs
Copy link

Fix the Makefile dependencies so things get rebuilt when they should be.

Only produce one .so file, which knows how to run both 2d and 3d.

Break out main and the I/O methods into a separate cpp file, which is
not used for the python extension module.

Do not create pointless objects. The methods are all static. TSNE is
now a namespace, not a class, and the exported run method is wrapped
in extern "C" to make linkage easier.

Minor cleanup of includes / using statements.

Because I just couldn't help myself, remove a couple places where
new/delete were being used for no good reason.

Get rid of a few places where NDIMS (the build time constant) and a
run-time variable, which are supposed to have the same value, were both
being used in a method. This means flowing the template a bit further
up the tree. There's lots more places this could be done - this only
addresses the cases where not doing so could lead to errors if the
template variable was inconsistent with the run-time variable.

Kill off the last pointless vestiges of the cblas dependency.

Use hidden visibility where appropriate - only export the run() method.
This cuts the size of the .so and allows the compiler to inline more
aggressively, especially since we've also turned on LTO.

Together, all of these changes result in a build with a single 901k
extension module, instead of two 1.9MB extension modules.

@nlhepler
Copy link

nlhepler commented Feb 26, 2019

I am happy with where this is, I think the only thing I could pessimistically think of is a test (with fixed seed) that equates the result with a known quantity.

@adam-azarchs adam-azarchs force-pushed the azarchs/build-improvements branch from 1450688 to 18b22ea Compare March 13, 2019 20:32
@adam-azarchs
Copy link
Author

I moved the consistency test to the start of the commit chain and remade it after disabling -ffast-math from the build. The new HEAD commit still passes. There were some differences introduced by different treatment of duplicate points which have been fixed now.

Then I turned -ffast-math back on since it seems to still pass the tests.

@adam-azarchs adam-azarchs force-pushed the azarchs/build-improvements branch 8 times, most recently from 4dd81ff to 4184349 Compare March 13, 2019 23:10
Lance Hepler and others added 11 commits March 13, 2019 16:17
Turning off -ffast-math is because otherwise any little compile flags
change will change the output.

Also remove unwanted library_dirs and include_dirs.
Fix the Makefile dependencies so things get rebuilt when they should be.
Add a `test` target.

Only produce one .so file, which knows how to run both 2d and 3d.

Break out main and the I/O methods into a separate cpp file, which is
not used for the python extension module.

Do not create pointless objects.  The methods are all static.  TSNE is
now a namespace, not a class, and the exported `run` method is wrapped
in `extern "C"` to make linkage easier.

Minor cleanup of includes / using statements.

Because I just couldn't help myself, remove a couple places where
new/delete were being used for no good reason.

Get rid of a few places where NDIMS (the build time constant) and a
run-time variable, which are supposed to have the same value, were both
being used in a method.  This means flowing the template a bit further
up the tree.  There's lots more places this could be done - this only
addresses the cases where not doing so could lead to errors if the
template variable was inconsistent with the run-time variable.

Kill off the last pointless vestiges of the cblas dependency.

Use hidden visibility where appropriate - only export the run() method.
This cuts the size of the .so and allows the compiler to inline more
aggressively, especially since we've also turned on LTO.

Together, all of these changes result in a build with a single 901k
extension module, instead of two 1.9MB extension modules.

Set -std=c++14 in the build.  This is required by some of the cleanups
because of the use of the `nullptr` keyword.  Older gcc will not default
to c++11 or higher, so will fail to build nullptr.
And c++17 doesn't work because
```
include/python2.7/stringobject.h:175:5: error:
      ISO C++17 does not allow 'register' storage class specifier [-Wregister]
    register Py_ssize_t *len    /* pointer to length variable or NULL
    ^~~~~~~~~
```

Update travis build to Xenial container.  This updates gcc to 5.4, which has
the required c++ standards support.
Also flow down templating to a few more places, and use arrays instead
of vectors where possible.
Allocate an array of children, rather than having an array of child
pointers.

Cleanup useless methods, constness.
This decreases the in-memory size of the tree, which significantly
increases the speed since it's easier for the CPU to fit it in cache.

There's more that can be done with this - in particular, the boundary.
But that's more complicated.
Rather than downloading an entire miniconda distribution, just `pip install`
the requisite packages.  Should speed up the build significantly.

Also, use clang compiler only, because otherwise we fail the numerical
stability test (which is ok, generally, but annoying).  Use lld to link,
because the llvm LTO plugin for ld.gold doesn't work on travis.

Turn of stability test on CI.  Minor compiler optimization differences can
destroy repeatability.  Getting it to not do that is too complicated.
@adam-azarchs adam-azarchs force-pushed the azarchs/build-improvements branch from 4184349 to 8c86710 Compare March 13, 2019 23:23
@adam-azarchs adam-azarchs force-pushed the azarchs/build-improvements branch from 70d6d93 to 6f1065d Compare March 14, 2019 01:23
1. Don't copy the data.  Just keep a reference.
2. Don't store the dimension in every datapoint.  That's ridiculously
   inefficient.
3. Don't store the index in every datapoint.  Just compute it as needed
   from pointer arithmetic.
This allows for more sse optimizations in a few cases.
Makes the datapoints smaller and allows vectorization in initialization
to work better.
In addition to being more clear, this allows the compiler to recognize
that pos_f is not aliased anywhere else so it can vectorize the
calculation better.
Not only was that an important slowdown, but it also impacted
repeatability across libm implementations.
@adam-azarchs
Copy link
Author

On a somewhat larger dataset, which took around a minute on marsoc, running on hydra we see

key              change    before       after
     walltime   -21.1 %        35          28
 max_walltime   -21.1 %        35          28
        utime   -21.5 %        34          27
    max_utime   -21.5 %        34          27
        stime    -5.5 %         1           1
    max_stime    -5.5 %         1           1
          rss   -------    628028      627996
      max_rss   -------    623012      622860
     max_vmem   -------   1294012     1291956

which is roughly consistent with the improvement I expected to see.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants