Add SIMD optimizations to sparse inverted index #1414
base: main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: lyang24. The full list of commands accepted by this bot can be found here. Details: needs approval from an approver in each of these files. Approvers can indicate their approval by writing
@lyang24 🔍 Important: PR Classification Needed! For efficient project management and a seamless review process, it's essential to classify your PR correctly. Here's how: for any PR outside the kind/improvement category, ensure you link to the associated issue using the format: “issue: #”. Thanks for your efforts and contribution to the community!
Force-pushed 7cbf329 to d2784cc (Compare)
To maintainers: what is the best acceptance test / bench suite for this? I think there might be little perf benefit if the doc list is short.
alexanderguzhva
left a comment
any benchmarks / unit tests?
```cpp
n_cols() const = 0;
};

// CPU feature detection at runtime
```
please use src/simd/instruction_set.h instead
src/simd/sparse_simd.h (Outdated)

```cpp
// Apply computer function and scatter
for (size_t k = 0; k < SIMD_WIDTH; ++k) {
    scores[id_array[k]] = current_scores_array[k] + computer(contrib_array[k], doc_len_ratios_array[k]);
```
no _mm256_scatter_ps?
src/simd/sparse_simd.h (Outdated)

```cpp
_mm256_store_si256(reinterpret_cast<__m256i*>(id_array), doc_ids);

for (size_t k = 0; k < SIMD_WIDTH; ++k) {
    scores[id_array[k]] = new_scores_array[k];
```
please add _mm256_scatter_ps here
src/simd/sparse_simd.h (Outdated)

```cpp
_mm512_store_si512(reinterpret_cast<__m512i*>(id_array), doc_ids);

for (size_t k = 0; k < SIMD_WIDTH; ++k) {
    scores[id_array[k]] += computer(contrib_array[k], doc_len_ratios_array[k]);
```
no _mm512_scatter_ps here?
src/simd/sparse_simd.h (Outdated)

```cpp
for (; j + SIMD_WIDTH <= plist_ids.size(); j += SIMD_WIDTH) {
    // Prefetch
    for (size_t k = 0; k < SIMD_WIDTH && j + PREFETCH_DISTANCE + k < plist_ids.size(); ++k) {
        _mm_prefetch(reinterpret_cast<const char*>(&scores[plist_ids[j + PREFETCH_DISTANCE + k]]), _MM_HINT_T1);
```
why _MM_HINT_T1 instead of T0?
src/simd/sparse_simd.h (Outdated)

```cpp
for (; j + SIMD_WIDTH <= plist_ids.size(); j += SIMD_WIDTH) {
    // Prefetch ahead
    for (size_t k = 0; k < SIMD_WIDTH && j + PREFETCH_DISTANCE + k < plist_ids.size(); ++k) {
        _mm_prefetch(reinterpret_cast<const char*>(&scores[plist_ids[j + PREFETCH_DISTANCE + k]]), _MM_HINT_T1);
```
why _MM_HINT_T1 instead of T0?
also, why is _mm_prefetch() used, while the earlier avx2 version uses __builtin_prefetch()? Would it be possible to use only 1 approach everywhere? :)
Thank you for the meticulous review, will fix the suggestions and add a comprehensive bench in the next few weeks.
Force-pushed 7de909a to 5f39d24 (Compare)
I did some initial benchmarking of prefetching; AVX2 seems to hurt perf and AVX512 is a mixed bag. cc @alexanderguzhva does this align with expectation? The benchmark was run on AWS EC2 instances and the code is in the repo:

```
cd build/Release
./benchmark/benchmark_sparse_simd
```

AI-summarized report: AVX512 SIMD Benchmark Results - Complete Summary

Test Environment

Query Type Explanation: the benchmark tests two query generation strategies:
- Random Query: query terms selected randomly from the vocabulary
- Heavy Terms Query: query forced to include the top-10 most frequent terms

Example from benchmark output:

Complete Results Table

Key Findings
- ✅ When AVX512 helps (6/16 cases):
- ❌ When AVX512 hurts (10/16 cases):

Why the dichotomy? AVX512 processes 16 floats per iteration, so performance depends on whether posting lists are long enough to amortize the setup cost. Real-world queries typically include common terms (stop words, domain keywords), making the "heavy terms" scenario more representative.

Conclusions
src/simd/instruction_set.h (Outdated)

```cpp
}  // namespace faiss
#endif  // __x86_64__ || _M_X64

// Lightweight CPU feature detection for sparse vector SIMD
```
No.
InstructionSet class has all the needed facilities to perform checks.
Please remove SIMDCapabilities completely, because it is totally redundant.
If any features are needed, then I would modify InstructionSet class.
Also, AVX2 code is not very interesting in 2026, so it looks like a correct decision to get rid of it.
Yep, used InstructionSet instead.
overall, lgtm. please fix the formatting and this
src/simd/sparse_simd.h (Outdated)

```cpp
const boost::span<const float>* doc_len_ratios_spans_ptr) {
    // Static asserts for type safety
    static_assert(sizeof(table_t) == 4, "SIMD gather requires 32-bit doc IDs");
    static_assert(std::is_same_v<QType, float>, "SIMD operations require float values");
```
the element type of inverted_index_vals_spans should be uint16_t for BM25, so will this function still be called?
Yes, this is a bug; the benchmark shows BM25 underperforms on AVX512. I will fix this by leaving BM25 on the regular scalar path.
Force-pushed 7b0750d to 5726f8d (Compare)
Implementation:
- Add AVX512 implementation for IP metric (1.4x-3x speedup on dense posting lists)
- Use separate compilation unit (sparse_simd_avx512.cc) for proper runtime CPU detection
- Runtime dispatch via faiss::InstructionSet - library works on any CPU
- Disable SIMD for BM25 metric (0.77x-0.80x slowdown due to DocValueComputer overhead)
- Only enable for IP metric with float values on AVX512-capable x86_64 CPUs

Code Quality:
- Remove 109 lines of redundant code (duplicate ARM dispatcher, inline implementation)
- Unified scalar fallback works across all platforms (x86_64, ARM, etc.)
- Add comprehensive benchmark with Zipf distribution for realistic testing
- Add TODO for future per-posting-list size threshold optimization

Signed-off-by: lyang24 <lanqingy93@gmail.com>
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Implements AVX512-optimized compute_all_distances() for sparse inverted index search with runtime CPU detection and scalar fallback.

Implementation details:
- AVX512 path (when hardware supports it):
- Scalar fallback (matches original code structure):
- Runtime dispatcher:

Design decisions:
- No AVX2: AVX2 gather is too slow (12-20 cycles vs 4 for scalar loads) and lacks hardware scatter, making it slower than scalar code.
- No manual prefetch: random doc IDs in posting lists cannot be predicted by software prefetching, and manual prefetch pollutes the cache and wastes memory bandwidth. Hardware prefetchers handle random access better.
- Scalar fallback unchanged: matches the original implementation exactly to ensure correctness and avoid micro-optimizations that may not help.