ivf-pq::search faster kernel selection experiments #1

Open
achirkin wants to merge 1 commit into branch-23.04 from fea-ivf-pq-decide-faster

@achirkin
Owner

  The `select` function contains rather heavy logic that chooses the scheduling parameters and
  tweaks the kernel config. This works fine for all but the smallest work sizes (n_queries *
  n_probes).

  However, for very small work sizes (e.g. n_queries == 1), the time spent on scheduling can
  exceed the execution time of the similarity kernel. The biggest impact comes from the
  `cudaFuncSetAttribute` calls, which raise the maximum available shared memory (shmem).
  Moreover, these calls change the carveout setting, which adds extra gaps in the GPU timeline
  around the kernel call (while the driver sets the GPU state).

  To overcome this problem, I try to be more conservative about calling `cudaFuncSetAttribute`
  here. Unfortunately, in my experiments, the performance improvement is observed only for the
  smallest work sizes (n_queries = 1, n_probes <= 10). In addition, the statefulness of the
  GPU/kernel configs complicates benchmarking: the carveout setting is preserved across runs.
  The current changeset may therefore hurt performance: if the carveout is increased on the
  first invocation of the kernel, the algorithm may skip `cudaFuncSetAttribute` on the second
  invocation; as a result, the second run launches the kernel with a wrong carveout setting
  (which could hurt the L1 cache hit rate).
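
The stale-carveout hazard described above can be sketched in plain C++. This is a minimal model under stated assumptions, not the code in this changeset: `kernel_attrs`, `attr_cache`, and the setter callback are hypothetical names, the callback merely stands in for the real `cudaFuncSetAttribute` calls, and the 48 KiB starting limit is assumed to be the default dynamic shared memory limit.

```cpp
#include <cstddef>
#include <functional>
#include <unordered_map>
#include <utility>

// Hypothetical model of the per-kernel state that persists across launches.
struct kernel_attrs {
  // Assumed default limit for dynamic shared memory without an opt-in.
  std::size_t max_dynamic_smem = 48 * 1024;
  // Preferred shared memory carveout in percent; -1 models "driver default".
  int carveout_percent = -1;
};

// Hypothetical cache that skips the expensive attribute update whenever the
// previously configured shared memory limit is already large enough.
class attr_cache {
 public:
  // Stand-in for the real attribute-setting calls (assumption, not a CUDA API).
  using setter_t = std::function<void(const void*, const kernel_attrs&)>;

  explicit attr_cache(setter_t setter) : setter_(std::move(setter)) {}

  // Returns true if the setter was invoked, false if the update was skipped.
  bool ensure(const void* kernel, std::size_t smem_needed, int carveout_wanted)
  {
    kernel_attrs& cur = state_[kernel];
    if (smem_needed <= cur.max_dynamic_smem) {
      // Conservative path: no call, so the old carveout stays in effect even
      // if carveout_wanted differs from it.
      return false;
    }
    cur.max_dynamic_smem = smem_needed;
    cur.carveout_percent = carveout_wanted;
    setter_(kernel, cur);
    return true;
  }

  // Attributes currently in effect for the given kernel.
  const kernel_attrs& current(const void* kernel) { return state_[kernel]; }

 private:
  setter_t setter_;
  std::unordered_map<const void*, kernel_attrs> state_;
};
```

In this model, a first call such as `ensure(k, 64 * 1024, 100)` invokes the setter, while a later `ensure(k, 16 * 1024, 25)` is skipped and leaves the carveout at 100. That is the situation in which the second launch could run with less L1 cache than the selection logic would otherwise have asked for.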

github-actions bot added the cpp label on Mar 15, 2023
achirkin changed the title from "Experiments" to "ivf-pq::search faster kernel selection experiments" on Mar 15, 2023
achirkin added a commit that referenced this pull request Feb 16, 2024
Demangle the error stack trace provided by GCC.
Example output:
```bash
RAFT failure at file=/workspace/raft/cpp/bench/ann/src/raft/raft_ann_bench_utils.h line=127: Ooops!
Obtained 16 stack frames
#1 in /workspace/raft/cpp/build/libraft_ivf_pq_ann_bench.so: raft::logic_error::logic_error(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) +0x5e [0x7fb20acce45e]
#2 in /workspace/raft/cpp/build/libraft_ivf_pq_ann_bench.so: raft::bench::ann::configured_raft_resources::stream_wait(CUstream_st*) const +0x2e3 [0x7fb20acd0ac3]
#3 in /workspace/raft/cpp/build/libraft_ivf_pq_ann_bench.so: raft::bench::ann::RaftIvfPQ<float, long>::search(float const*, int, int, unsigned long*, float*, CUstream_st*) const +0x63e [0x7fb20acd44fe]
#4 in ./cpp/build/ANN_BENCH: void raft::bench::ann::bench_search<float>(benchmark::State&, raft::bench::ann::Configuration::Index, unsigned long, std::shared_ptr<raft::bench::ann::Dataset<float> const>, raft::bench::ann::Objective) +0xf76 [0x55853859f586]
#5 in ./cpp/build/ANN_BENCH: benchmark::internal::LambdaBenchmark<benchmark::RegisterBenchmark<void (&)(benchmark::State&, raft::bench::ann::Configuration::Index, unsigned long, std::shared_ptr<raft::bench::ann::Dataset<float> const>, raft::bench::ann::Objective), raft::bench::ann::Configuration::Index&, unsigned long&, std::shared_ptr<raft::bench::ann::Dataset<float> const>&, raft::bench::ann::Objective&>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, void (&)(benchmark::State&, raft::bench::ann::Configuration::Index, unsigned long, std::shared_ptr<raft::bench::ann::Dataset<float> const>, raft::bench::ann::Objective), raft::bench::ann::Configuration::Index&, unsigned long&, std::shared_ptr<raft::bench::ann::Dataset<float> const>&, raft::bench::ann::Objective&)::{lambda(benchmark::State&)#1}>::Run(benchmark::State&) +0x84 [0x558538548f14]
#6 in ./cpp/build/ANN_BENCH: benchmark::internal::BenchmarkInstance::Run(long, int, benchmark::internal::ThreadTimer*, benchmark::internal::ThreadManager*, benchmark::internal::PerfCountersMeasurement*) const +0x168 [0x5585385d6498]
#7 in ./cpp/build/ANN_BENCH(+0x149108) [0x5585385b7108]
#8 in ./cpp/build/ANN_BENCH: benchmark::internal::BenchmarkRunner::DoNIterations() +0x34f [0x5585385b8c7f]
#9 in ./cpp/build/ANN_BENCH: benchmark::internal::BenchmarkRunner::DoOneRepetition() +0x119 [0x5585385b99b9]
#10 in ./cpp/build/ANN_BENCH(+0x13afdd) [0x5585385a8fdd]
#11 in ./cpp/build/ANN_BENCH: benchmark::RunSpecifiedBenchmarks(benchmark::BenchmarkReporter*, benchmark::BenchmarkReporter*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) +0x58e [0x5585385aa8fe]
#12 in ./cpp/build/ANN_BENCH: benchmark::RunSpecifiedBenchmarks() +0x6a [0x5585385aaada]
#13 in ./cpp/build/ANN_BENCH: raft::bench::ann::run_main(int, char**) +0x11ed [0x5585385980cd]
#14 in /lib/x86_64-linux-gnu/libc.so.6(+0x28150) [0x7fb213e28150]
#15 in /lib/x86_64-linux-gnu/libc.so.6: __libc_start_main +0x89 [0x7fb213e28209]
#16 in ./cpp/build/ANN_BENCH(+0xbfcef) [0x55853852dcef]
```

Authors:
  - Artem M. Chirkin (https://github.com/achirkin)

Approvers:
  - Tamas Bela Feher (https://github.com/tfeher)

URL: rapidsai#2188