Conversation


@lyang24 lyang24 commented Dec 27, 2025

Implements AVX512-optimized compute_all_distances() for sparse inverted index search
with runtime CPU detection and scalar fallback.

Implementation details:

  1. AVX512 path (when hardware supports it):

    • 16-wide SIMD vectorization using _mm512_i32gather_ps() and _mm512_i32scatter_ps()
    • 2x loop unrolling (processes 32 elements per iteration) to hide gather latency
    • Static asserts for type safety (32-bit doc IDs, float values)
    • Supports both BM25 and IP metrics
    • BM25: gathers doc_len_ratios, computes via scalar loop, scatters results
    • IP: fully vectorized multiply-add with gather/scatter (see the sketch at the end of this description)
  2. Scalar fallback (matches original code structure):

    • Simple double loop over query terms and posting lists
    • Metric type check inside inner loop (no optimizations added)
    • Works on all x86-64 CPUs and ARM/Apple Silicon
  3. Runtime dispatcher:

    • CPU capability detection using __builtin_cpu_supports()
    • Automatically selects AVX512 or scalar based on hardware

Design decisions:

  • No AVX2: AVX2 gather is too slow (12-20 cycles vs 4 for scalar loads)
    and lacks hardware scatter, making it slower than scalar code

  • No manual prefetch: Random doc IDs in posting lists cannot be predicted
    by software prefetching, and manual prefetch pollutes cache and wastes
    memory bandwidth. Hardware prefetchers handle random access better.

  • Scalar fallback unchanged: Matches original implementation exactly to
    ensure correctness and avoid micro-optimizations that may not help
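
For concreteness, a minimal sketch of the IP inner loop described above (gather current scores by doc ID, fused multiply-add with the query weight, scatter back, scalar tail). Names are illustrative and the 2x unrolling is omitted for brevity; this is not the exact code in this PR, and doc IDs are assumed unique within a posting list:

```cpp
#include <immintrin.h>
#include <cstdint>
#include <cstddef>

// Illustrative sketch of the AVX512 IP path: scores[doc_ids[j]] += query_weight * vals[j]
void accumulate_ip_avx512(float* scores, const int32_t* doc_ids, const float* vals,
                          size_t n, float query_weight) {
    const __m512 qw = _mm512_set1_ps(query_weight);
    size_t j = 0;
    for (; j + 16 <= n; j += 16) {
        __m512i ids = _mm512_loadu_si512(doc_ids + j);                 // 16 doc ids
        __m512 cur  = _mm512_i32gather_ps(ids, scores, /*scale=*/4);   // gather current scores
        __m512 v    = _mm512_loadu_ps(vals + j);                       // posting-list values
        __m512 upd  = _mm512_fmadd_ps(qw, v, cur);                     // qw * v + cur
        _mm512_i32scatter_ps(scores, ids, upd, /*scale=*/4);           // scatter back
    }
    for (; j < n; ++j) {                                               // scalar tail
        scores[doc_ids[j]] += query_weight * vals[j];
    }
}

// Runtime dispatch as described: take the AVX512 path only when the CPU supports it.
bool use_avx512_path() { return __builtin_cpu_supports("avx512f"); }
```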

@sre-ci-robot
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: lyang24
To complete the pull request process, please assign alwayslove2013 after the PR has been reviewed.
You can assign the PR to them by writing /assign @alwayslove2013 in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


mergify bot commented Dec 27, 2025

@lyang24 🔍 Important: PR Classification Needed!

For efficient project management and a seamless review process, it's essential to classify your PR correctly. Here's how:

  1. If you're fixing a bug, label it as kind/bug.
  2. For small tweaks (less than 20 lines without altering any functionality), please use kind/improvement.
  3. Significant changes that don't modify existing functionalities should be tagged as kind/enhancement.
  4. Adjusting APIs or changing functionality? Go with kind/feature.

For any PR outside the kind/improvement category, ensure you link to the associated issue using the format: “issue: #”.

Thanks for your efforts and contribution to the community!

@mergify mergify bot added needs-dco and removed dco-passed labels Dec 29, 2025
@lyang24 lyang24 force-pushed the simd4sparse branch 2 times, most recently from 7cbf329 to d2784cc on December 29, 2025 05:41
@mergify mergify bot added dco-passed and removed needs-dco labels Dec 29, 2025

lyang24 commented Dec 29, 2025

to maintainers - what is the best acceptance test / bench suite for this? i think there might be little perf benefit if the doc list is short

@lyang24 lyang24 marked this pull request as ready for review December 29, 2025 05:55

@alexanderguzhva alexanderguzhva left a comment


any benchmarks / unit tests?

n_cols() const = 0;
};

// CPU feature detection at runtime
Collaborator

please use src/simd/instruction_set.h instead


// Apply computer function and scatter
for (size_t k = 0; k < SIMD_WIDTH; ++k) {
scores[id_array[k]] = current_scores_array[k] + computer(contrib_array[k], doc_len_ratios_array[k]);
Collaborator

no _mm256_scatter_ps?

_mm256_store_si256(reinterpret_cast<__m256i*>(id_array), doc_ids);

for (size_t k = 0; k < SIMD_WIDTH; ++k) {
scores[id_array[k]] = new_scores_array[k];
Collaborator

please add _mm256_scatter_ps here

_mm512_store_si512(reinterpret_cast<__m512i*>(id_array), doc_ids);

for (size_t k = 0; k < SIMD_WIDTH; ++k) {
scores[id_array[k]] += computer(contrib_array[k], doc_len_ratios_array[k]);
Collaborator

no _mm512_scatter_ps here?
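
(For reference, the intrinsic being pointed to has the shape below; it could replace the scalar store-back only once the 16 updated values are packed in a vector register. Variable names are illustrative, and unique doc ids within the block are assumed.)

```cpp
#include <immintrin.h>

// Sketch: scatter 16 updated scores back to scores[doc_ids[k]] in one instruction
// (requires the new values to already be packed in a __m512 register).
inline void scatter_scores(float* scores, __m512i doc_ids, __m512 updated) {
    _mm512_i32scatter_ps(scores, doc_ids, updated, /*scale=*/4);
}
```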

for (; j + SIMD_WIDTH <= plist_ids.size(); j += SIMD_WIDTH) {
// Prefetch
for (size_t k = 0; k < SIMD_WIDTH && j + PREFETCH_DISTANCE + k < plist_ids.size(); ++k) {
_mm_prefetch(reinterpret_cast<const char*>(&scores[plist_ids[j + PREFETCH_DISTANCE + k]]), _MM_HINT_T1);
Collaborator

why _MM_HINT_T1 instead of T0?

for (; j + SIMD_WIDTH <= plist_ids.size(); j += SIMD_WIDTH) {
// Prefetch ahead
for (size_t k = 0; k < SIMD_WIDTH && j + PREFETCH_DISTANCE + k < plist_ids.size(); ++k) {
_mm_prefetch(reinterpret_cast<const char*>(&scores[plist_ids[j + PREFETCH_DISTANCE + k]]), _MM_HINT_T1);
@alexanderguzhva alexanderguzhva Dec 29, 2025

why _MM_HINT_T1 instead of T0?
also, why is _mm_prefetch() used, while the earlier avx2 version uses __builtin_prefetch()? Would it be possible to use only 1 approach everywhere? :)


lyang24 commented Dec 29, 2025

> any benchmarks / unit tests?

thank you for the meticulous review, will fix the suggestions and add a comprehensive bench in the next few weeks.

@lyang24 lyang24 marked this pull request as draft December 29, 2025 23:18
@lyang24 lyang24 force-pushed the simd4sparse branch 5 times, most recently from 7de909a to 5f39d24 on December 30, 2025 13:38
@mergify mergify bot added needs-dco and removed dco-passed labels Dec 30, 2025

lyang24 commented Dec 30, 2025

> any benchmarks / unit tests?

I did some initial benchmarks: prefetching and avx2 seem to hurt perf, and avx512 is a mixed bag. cc @alexanderguzhva does this align with expectation? The benchmark is run on aws ec2 instances and the code is in the repo.

```
cd build/Release
make benchmark_sparse_simd -j8

./benchmark/benchmark_sparse_simd
```

AI-summarized report:

AVX512 SIMD Benchmark Results - Complete Summary

Test Environment

  • CPU: Intel c7i.2xlarge (Ice Lake, AVX512F supported)
  • Compiler: GCC with -mavx512f -mavx512dq
  • Dataset: Synthetic data with Zipf distribution (realistic sparse search workload)

Query Type Explanation

The benchmark tests two query generation strategies:

Random Query: Query terms selected randomly from vocabulary

  • Most terms are infrequent (short posting lists of ~1-10 documents)
  • SIMD loop rarely triggers due to insufficient elements
  • Represents unrealistic worst-case scenario

Heavy Terms Query: Query forced to include the top-10 most frequent terms

  • Heavy terms have long posting lists (hundreds to thousands of documents)
  • SIMD loop gets properly exercised with sufficient elements to amortize overhead
  • Represents realistic search queries (common words like "the", "search", "data")

Example from benchmark output:

```
Sparse IP (avg=32):
Top-10 heaviest terms: 17593 8796 5864 4398 3518 2932 2513 2199 1954 1759
Median posting length: 7
```

  • Random query: Likely hits median-length terms (~7 elements) → SIMD overhead dominates
  • Heavy query: Guaranteed to hit top-10 terms (~2000-17000 elements) → SIMD wins
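
A sketch of the two query-generation strategies described above (illustrative only, not the benchmark's exact code; `term_freqs[t]` is the posting-list length of term `t`):

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// heavy == true: force the 10 terms with the longest posting lists into the query,
// then fill the rest at random; heavy == false: all terms chosen at random.
std::vector<int32_t> make_query(const std::vector<int32_t>& term_freqs, size_t n_terms,
                                bool heavy, std::mt19937& rng) {
    std::vector<int32_t> q;
    if (heavy) {
        std::vector<int32_t> ids(term_freqs.size());
        std::iota(ids.begin(), ids.end(), 0);
        std::partial_sort(ids.begin(), ids.begin() + 10, ids.end(),
                          [&](int32_t a, int32_t b) { return term_freqs[a] > term_freqs[b]; });
        q.assign(ids.begin(), ids.begin() + 10);   // top-10 heaviest terms
    }
    std::uniform_int_distribution<int32_t> pick(0, (int32_t)term_freqs.size() - 1);
    while (q.size() < n_terms) {
        q.push_back(pick(rng));                    // remaining terms chosen uniformly at random
    }
    return q;
}
```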

Complete Results Table

| Test Case | Avg Posting Length | Query Type | Metric | Speedup | Verdict |
|---|---|---|---|---|---|
| Ultra-sparse IP (random) | 8 | Random | IP | 0.70x | ❌ Slower |
| Ultra-sparse IP (heavy) | 8 | Heavy terms | IP | 1.35x | ✅ Faster |
| Sparse IP (random) | 32 | Random | IP | 0.73x | ❌ Slower |
| Sparse IP (heavy) | 32 | Heavy terms | IP | 2.11x | ✅ Faster |
| Sparse BM25 (heavy) | 32 | Heavy terms | BM25 | 0.77x | ❌ Slower |
| Medium IP (random) | 128 | Random | IP | 0.50x | ❌ Slower |
| Medium IP (heavy) | 128 | Heavy terms | IP | 1.58x | ✅ Faster |
| Medium BM25 (heavy) | 128 | Heavy terms | BM25 | 0.78x | ❌ Slower |
| Dense IP (random) | 512 | Random | IP | 0.87x | ❌ Slower |
| Dense IP (heavy) | 512 | Heavy terms | IP | 1.69x | ✅ Faster |
| Dense BM25 (heavy) | 512 | Heavy terms | BM25 | 0.77x | ❌ Slower |
| Very Dense IP (heavy) | 2048 | Heavy terms | IP | 3.06x | ✅ Best Case |
| Very Dense BM25 (heavy) | 2048 | Heavy terms | BM25 | 0.80x | ❌ Slower |
| Real-world IP | 256 | Heavy terms | IP | 1.44x | ✅ Faster |
| Real-world BM25 | 256 | Heavy terms | BM25 | 0.78x | ❌ Slower |

Key Findings

✅ When AVX512 Helps (6/16 cases):

  • IP metric with heavy (frequent) terms: 1.35x - 3.06x speedup
  • Best performance: Very dense posting lists (avg=2048) → 3.06x speedup
  • Realistic workload: Real-world distribution (avg=256) → 1.44x speedup

❌ When AVX512 Hurts (10/16 cases):

  1. Random queries (0.50x-0.87x): Queries that miss heavy terms process short posting lists where SIMD overhead dominates
  2. BM25 metric (0.77x-0.80x): Per-element DocValueComputer function call overhead negates SIMD benefits across all densities

Why the Dichotomy?

AVX512 processes 16 floats per iteration. Performance depends on whether posting lists are long enough:

  • Heavy terms query hitting top-10 terms: Posting lists of 1000-17000 elements → SIMD loop runs 60-1000+ iterations → overhead amortized ✅
  • Random query hitting median terms: Posting lists of 1-10 elements → SIMD loop skipped, only scalar tail → pure overhead ❌

Real-world queries typically include common terms (stop words, domain keywords), making the "heavy terms" scenario more representative.

Conclusions

  1. AVX512 is beneficial for IP metric with frequent terms - the primary use case for sparse search
  2. Posting list length matters: Need avg > 32 elements for consistent speedup (see the gating sketch after this list)
  3. Query distribution matters: Only queries hitting heavy terms benefit from SIMD
  4. BM25 needs algorithmic changes: Current per-element function call approach incompatible with SIMD efficiency
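
Reflecting conclusion 2, a per-posting-list gate (mentioned as a TODO in the commit) could look roughly like this; the threshold value and names are illustrative, not the PR's actual code:

```cpp
#include <cstddef>

// Hypothetical per-posting-list gate: only take the AVX-512 path when the list is long
// enough to amortize gather/scatter overhead.
constexpr size_t kSimdMinPostingLen = 32;

template <typename ScalarFn, typename SimdFn>
void accumulate_posting_list(size_t posting_len, bool has_avx512f,
                             ScalarFn&& scalar_path, SimdFn&& simd_path) {
    if (has_avx512f && posting_len >= kSimdMinPostingLen) {
        simd_path();     // 16-wide gather/FMA/scatter loop
    } else {
        scalar_path();   // plain loop; also covers short lists and the tail-only case
    }
}
```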

} // namespace faiss
#endif // __x86_64__ || _M_X64

// Lightweight CPU feature detection for sparse vector SIMD
Collaborator

No.

InstructionSet class has all the needed facilities to perform checks.
Please remove SIMDCapabilities completely, because it is totally redundant.
If any features are needed, then I would modify InstructionSet class.

Also, AVX2 code is not very interesting in 2026, so it looks like a correct decision to get rid of it.

Author

yep used instruction set instead
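
(For context, a dispatch through InstructionSet might look roughly like the sketch below; the include path follows the reviewer's pointer to src/simd/instruction_set.h, but the accessor names GetInstance()/AVX512F()/AVX512DQ() are assumed rather than verified against that header.)

```cpp
#include "simd/instruction_set.h"  // path per the review comment above

// Hypothetical runtime check via InstructionSet instead of a separate SIMDCapabilities class.
inline bool sparse_use_avx512() {
    const auto& ins = faiss::InstructionSet::GetInstance();   // accessor names assumed
    return ins.AVX512F() && ins.AVX512DQ();
}
```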

@alexanderguzhva

overall, lgtm. please fix the formatting and this InstructionSet issue.
@sparknack do you have any suggestions about this code?

const boost::span<const float>* doc_len_ratios_spans_ptr) {
// Static asserts for type safety
static_assert(sizeof(table_t) == 4, "SIMD gather requires 32-bit doc IDs");
static_assert(std::is_same_v<QType, float>, "SIMD operations require float values");
@sparknack sparknack Dec 31, 2025

the element type of inverted_index_vals_spans should be uint16_t for BM25, so will this function still be called?

@lyang24 lyang24 Dec 31, 2025

yes, this is a bug - per the benchmark, bm25 underperforms on avx512, so i will fix this by leaving bm25 on the regular scalar path.
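
a minimal sketch of that routing - only the IP metric with float values takes the SIMD kernel, while BM25 and non-float value types such as uint16_t stay scalar (names are illustrative):

```cpp
#include <type_traits>

// Sketch: eligibility check for the AVX512 kernel. BM25 values arrive as uint16_t and its
// computer runs per element, so only IP over float values is routed to SIMD.
template <typename QType>
bool should_use_avx512(bool is_bm25, bool has_avx512f) {
    return std::is_same_v<QType, float> && !is_bm25 && has_avx512f;
}
```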

@lyang24 lyang24 force-pushed the simd4sparse branch 3 times, most recently from 7b0750d to 5726f8d on December 31, 2025 07:31
Implementation:
- Add AVX512 implementation for IP metric (1.4x-3x speedup on dense posting lists)
- Use separate compilation unit (sparse_simd_avx512.cc) for proper runtime CPU detection
- Runtime dispatch via faiss::InstructionSet - library works on any CPU
- Disable SIMD for BM25 metric (0.77x-0.80x slowdown due to DocValueComputer overhead)
- Only enable for IP metric with float values on AVX512-capable x86_64 CPUs

Code Quality:
- Remove 109 lines of redundant code (duplicate ARM dispatcher, inline implementation)
- Unified scalar fallback works across all platforms (x86_64, ARM, etc.)
- Add comprehensive benchmark with Zipf distribution for realistic testing
- Add TODO for future per-posting-list size threshold optimization

Signed-off-by: lyang24 <lanqingy93@gmail.com>

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>