ivf-pq::search faster kernel selection experiments by achirkin · Pull Request #1 · achirkin/raft

achirkin · 2023-03-15T07:35:29Z

  The `select` function contains rather heavy logic that chooses the scheduling parameters and
  tweaks the kernel config. This works fine for all but the smallest work sizes (n_queries *
  n_probes).

  However, for very small work sizes (e.g. n_queries == 1), the time spent on scheduling can
  exceed the execution time of the similarity kernel. The biggest impact comes from the
  `cudaFuncSetAttribute` calls, which increase the maximum available shmem. Moreover, these
  calls change the carveout setting, which adds extra gaps in the GPU timeline around the kernel
  call (the driver sets the GPU state).

  To overcome this problem, I try to be more conservative about calling `cudaFuncSetAttribute`
  here. Unfortunately, in my experiments, the performance improvement is observed only for the
  smallest work sizes (n_queries = 1, n_probes <= 10). In addition, the statefulness of the
  GPU/kernel configs complicates benchmarking: the carveout setting is preserved across runs.
  The current changeset may therefore hurt performance: if the first invocation of the kernel
  increases the carveout, the algorithm may skip `cudaFuncSetAttribute` on the second
  invocation; as a result, the second run launches the kernel with a wrong carveout setting
  (which could hurt the L1 cache hit rate).
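
A minimal sketch of the stale-carveout pitfall described above. It does not call CUDA: `FakeKernelState` and `configure_conservatively` are hypothetical names that only model the per-kernel state which real code would change through `cudaFuncSetAttribute`, and the 48 KiB default limit is an assumption made for illustration. The skip condition is one plausible "conservative" policy, not necessarily the one used in this changeset.

```cpp
#include <cassert>
#include <cstddef>

// Simulated per-kernel state. In real code this state lives in the driver and
// is modified by cudaFuncSetAttribute; nothing here touches the GPU.
struct FakeKernelState {
  std::size_t max_dynamic_smem = std::size_t{48} * 1024;  // assumed default limit
  int carveout_percent         = -1;                      // -1 stands for "driver default"
  int set_attribute_calls      = 0;                       // number of simulated driver calls
};

// Hypothetical conservative policy: update the attributes only when the
// requested shared memory exceeds the limit that was already granted.
inline void configure_conservatively(FakeKernelState& k,
                                     std::size_t smem_needed,
                                     int carveout_wanted)
{
  if (smem_needed <= k.max_dynamic_smem) {
    return;  // skip the expensive call; the previous carveout stays in effect
  }
  k.max_dynamic_smem = smem_needed;
  k.carveout_percent = carveout_wanted;
  k.set_attribute_calls += 1;
}

// True if the kernel would run with a carveout other than the one requested.
inline bool carveout_is_stale(const FakeKernelState& k, int carveout_wanted)
{
  return k.carveout_percent != carveout_wanted;
}
```

In this model, a first call requesting 96 KiB with a carveout of 100 updates the state, while a second call requesting 16 KiB with a carveout of 25 is skipped, so `carveout_is_stale(k, 25)` is true afterwards. This is the situation in which a benchmark run inherits the setting of the previous one.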

Demangle the error stack trace provided by GCC.

Example output:
```bash
RAFT failure at file=/workspace/raft/cpp/bench/ann/src/raft/raft_ann_bench_utils.h line=127: Ooops!
Obtained 16 stack frames
#1 in /workspace/raft/cpp/build/libraft_ivf_pq_ann_bench.so: raft::logic_error::logic_error(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) +0x5e [0x7fb20acce45e]
#2 in /workspace/raft/cpp/build/libraft_ivf_pq_ann_bench.so: raft::bench::ann::configured_raft_resources::stream_wait(CUstream_st*) const +0x2e3 [0x7fb20acd0ac3]
#3 in /workspace/raft/cpp/build/libraft_ivf_pq_ann_bench.so: raft::bench::ann::RaftIvfPQ<float, long>::search(float const*, int, int, unsigned long*, float*, CUstream_st*) const +0x63e [0x7fb20acd44fe]
#4 in ./cpp/build/ANN_BENCH: void raft::bench::ann::bench_search<float>(benchmark::State&, raft::bench::ann::Configuration::Index, unsigned long, std::shared_ptr<raft::bench::ann::Dataset<float> const>, raft::bench::ann::Objective) +0xf76 [0x55853859f586]
#5 in ./cpp/build/ANN_BENCH: benchmark::internal::LambdaBenchmark<benchmark::RegisterBenchmark<void (&)(benchmark::State&, raft::bench::ann::Configuration::Index, unsigned long, std::shared_ptr<raft::bench::ann::Dataset<float> const>, raft::bench::ann::Objective), raft::bench::ann::Configuration::Index&, unsigned long&, std::shared_ptr<raft::bench::ann::Dataset<float> const>&, raft::bench::ann::Objective&>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, void (&)(benchmark::State&, raft::bench::ann::Configuration::Index, unsigned long, std::shared_ptr<raft::bench::ann::Dataset<float> const>, raft::bench::ann::Objective), raft::bench::ann::Configuration::Index&, unsigned long&, std::shared_ptr<raft::bench::ann::Dataset<float> const>&, raft::bench::ann::Objective&)::{lambda(benchmark::State&)#1}>::Run(benchmark::State&) +0x84 [0x558538548f14]
#6 in ./cpp/build/ANN_BENCH: benchmark::internal::BenchmarkInstance::Run(long, int, benchmark::internal::ThreadTimer*, benchmark::internal::ThreadManager*, benchmark::internal::PerfCountersMeasurement*) const +0x168 [0x5585385d6498]
#7 in ./cpp/build/ANN_BENCH(+0x149108) [0x5585385b7108]
#8 in ./cpp/build/ANN_BENCH: benchmark::internal::BenchmarkRunner::DoNIterations() +0x34f [0x5585385b8c7f]
#9 in ./cpp/build/ANN_BENCH: benchmark::internal::BenchmarkRunner::DoOneRepetition() +0x119 [0x5585385b99b9]
#10 in ./cpp/build/ANN_BENCH(+0x13afdd) [0x5585385a8fdd]
#11 in ./cpp/build/ANN_BENCH: benchmark::RunSpecifiedBenchmarks(benchmark::BenchmarkReporter*, benchmark::BenchmarkReporter*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) +0x58e [0x5585385aa8fe]
#12 in ./cpp/build/ANN_BENCH: benchmark::RunSpecifiedBenchmarks() +0x6a [0x5585385aaada]
#13 in ./cpp/build/ANN_BENCH: raft::bench::ann::run_main(int, char**) +0x11ed [0x5585385980cd]
#14 in /lib/x86_64-linux-gnu/libc.so.6(+0x28150) [0x7fb213e28150]
#15 in /lib/x86_64-linux-gnu/libc.so.6: __libc_start_main +0x89 [0x7fb213e28209]
#16 in ./cpp/build/ANN_BENCH(+0xbfcef) [0x55853852dcef]
```

Authors:
- Artem M. Chirkin (https://github.com/achirkin)

Approvers:
- Tamas Bela Feher (https://github.com/tfeher)

URL: rapidsai#2188

Experiments

c594b59

github-actions bot added the cpp label Mar 15, 2023

achirkin changed the title ~~Experiments~~ ivf-pq::search faster kernel selection experiments Mar 15, 2023

achirkin wants to merge 1 commit into branch-23.04 from fea-ivf-pq-decide-faster
