Conversation


@lyang24 lyang24 commented Dec 27, 2025

Implements AVX512-optimized compute_all_distances() for sparse inverted index search
with runtime CPU detection and scalar fallback.

Implementation details:

  1. AVX512 path (when hardware supports it):

    • 16-wide SIMD vectorization using _mm512_i32gather_ps() and _mm512_i32scatter_ps()
    • 2x loop unrolling (processes 32 elements per iteration) to hide gather latency
    • Static asserts for type safety (32-bit doc IDs, float values)
    • Supports both BM25 and IP metrics
    • BM25: gathers doc_len_ratios, computes via scalar loop, scatters results
    • IP: fully vectorized multiply-add with gather/scatter (see the sketch at the end of this description)
  2. Scalar fallback (matches original code structure):

    • Simple double loop over query terms and posting lists
    • Metric type check inside inner loop (no optimizations added)
    • Works on all x86-64 CPUs and ARM/Apple Silicon
  3. Runtime dispatcher:

    • CPU capability detection using __builtin_cpu_supports()
    • Automatically selects AVX512 or scalar based on hardware

Design decisions:

  • No AVX2: AVX2 gather is too slow (12-20 cycles vs 4 for scalar loads)
    and lacks hardware scatter, making it slower than scalar code

  • No manual prefetch: Random doc IDs in posting lists cannot be predicted
    by software prefetching, and manual prefetch pollutes cache and wastes
    memory bandwidth. Hardware prefetchers handle random access better.

  • Scalar fallback unchanged: Matches original implementation exactly to
    ensure correctness and avoid micro-optimizations that may not help
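
For concreteness, a minimal sketch of the IP inner loop described above (gather current scores by doc ID, fused multiply-add with the query weight, scatter back, scalar tail). Names are illustrative and the 2x unrolling is omitted for brevity; this is not the exact code in this PR, and doc IDs are assumed unique within a posting list:

```cpp
#include <immintrin.h>
#include <cstdint>
#include <cstddef>

// Illustrative sketch of the AVX512 IP path: scores[doc_ids[j]] += query_weight * vals[j]
void accumulate_ip_avx512(float* scores, const int32_t* doc_ids, const float* vals,
                          size_t n, float query_weight) {
    const __m512 qw = _mm512_set1_ps(query_weight);
    size_t j = 0;
    for (; j + 16 <= n; j += 16) {
        __m512i ids = _mm512_loadu_si512(doc_ids + j);                 // 16 doc ids
        __m512 cur  = _mm512_i32gather_ps(ids, scores, /*scale=*/4);   // gather current scores
        __m512 v    = _mm512_loadu_ps(vals + j);                       // posting-list values
        __m512 upd  = _mm512_fmadd_ps(qw, v, cur);                     // qw * v + cur
        _mm512_i32scatter_ps(scores, ids, upd, /*scale=*/4);           // scatter back
    }
    for (; j < n; ++j) {                                               // scalar tail
        scores[doc_ids[j]] += query_weight * vals[j];
    }
}

// Runtime dispatch as described: take the AVX512 path only when the CPU supports it.
bool use_avx512_path() { return __builtin_cpu_supports("avx512f"); }
```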

@sre-ci-robot
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: lyang24
To complete the pull request process, please assign alwayslove2013 after the PR has been reviewed.
You can assign the PR to them by writing /assign @alwayslove2013 in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


mergify bot commented Dec 27, 2025

@lyang24 🔍 Important: PR Classification Needed!

For efficient project management and a seamless review process, it's essential to classify your PR correctly. Here's how:

  1. If you're fixing a bug, label it as kind/bug.
  2. For small tweaks (less than 20 lines without altering any functionality), please use kind/improvement.
  3. Significant changes that don't modify existing functionalities should be tagged as kind/enhancement.
  4. Adjusting APIs or changing functionality? Go with kind/feature.

For any PR outside the kind/improvement category, ensure you link to the associated issue using the format: “issue: #”.

Thanks for your efforts and contribution to the community!

@mergify mergify bot added needs-dco and removed dco-passed labels Dec 29, 2025
@lyang24 lyang24 force-pushed the simd4sparse branch 2 times, most recently from 7cbf329 to d2784cc on December 29, 2025 05:41
@mergify mergify bot added dco-passed and removed needs-dco labels Dec 29, 2025

lyang24 commented Dec 29, 2025

to maintainers - what is the best acceptance test / bench suite for this? i think there might be little perf benefit if the doc list is short

@lyang24 lyang24 marked this pull request as ready for review December 29, 2025 05:55

@alexanderguzhva alexanderguzhva left a comment


any benchmarks / unit tests?

n_cols() const = 0;
};

// CPU feature detection at runtime
Collaborator

please use src/simd/instruction_set.h instead


// Apply computer function and scatter
for (size_t k = 0; k < SIMD_WIDTH; ++k) {
scores[id_array[k]] = current_scores_array[k] + computer(contrib_array[k], doc_len_ratios_array[k]);
Collaborator

no _mm256_scatter_ps?

_mm256_store_si256(reinterpret_cast<__m256i*>(id_array), doc_ids);

for (size_t k = 0; k < SIMD_WIDTH; ++k) {
scores[id_array[k]] = new_scores_array[k];
Collaborator

please add _mm256_scatter_ps here

_mm512_store_si512(reinterpret_cast<__m512i*>(id_array), doc_ids);

for (size_t k = 0; k < SIMD_WIDTH; ++k) {
scores[id_array[k]] += computer(contrib_array[k], doc_len_ratios_array[k]);
Collaborator

no _mm512_scatter_ps here?
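
(For reference, the intrinsic being pointed to has the shape below; it could replace the scalar store-back only once the 16 updated values are packed in a vector register. Variable names are illustrative, and unique doc ids within the block are assumed.)

```cpp
#include <immintrin.h>

// Sketch: scatter 16 updated scores back to scores[doc_ids[k]] in one instruction
// (requires the new values to already be packed in a __m512 register).
inline void scatter_scores(float* scores, __m512i doc_ids, __m512 updated) {
    _mm512_i32scatter_ps(scores, doc_ids, updated, /*scale=*/4);
}
```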

for (; j + SIMD_WIDTH <= plist_ids.size(); j += SIMD_WIDTH) {
// Prefetch
for (size_t k = 0; k < SIMD_WIDTH && j + PREFETCH_DISTANCE + k < plist_ids.size(); ++k) {
_mm_prefetch(reinterpret_cast<const char*>(&scores[plist_ids[j + PREFETCH_DISTANCE + k]]), _MM_HINT_T1);
Collaborator

why _MM_HINT_T1 instead of T0?

for (; j + SIMD_WIDTH <= plist_ids.size(); j += SIMD_WIDTH) {
// Prefetch ahead
for (size_t k = 0; k < SIMD_WIDTH && j + PREFETCH_DISTANCE + k < plist_ids.size(); ++k) {
_mm_prefetch(reinterpret_cast<const char*>(&scores[plist_ids[j + PREFETCH_DISTANCE + k]]), _MM_HINT_T1);
@alexanderguzhva alexanderguzhva Dec 29, 2025

why _MM_HINT_T1 instead of T0?
also, why is _mm_prefetch() used, while the earlier avx2 version uses __builtin_prefetch()? Would it be possible to use only 1 approach everywhere? :)


lyang24 commented Dec 29, 2025

> any benchmarks / unit tests?

thank you for the meticulous review, will fix the suggestions and add a comprehensive bench in the next few weeks.

@lyang24 lyang24 marked this pull request as draft December 29, 2025 23:18
@lyang24 lyang24 force-pushed the simd4sparse branch 5 times, most recently from 7de909a to 5f39d24 on December 30, 2025 13:38
@mergify mergify bot added needs-dco and removed dco-passed labels Dec 30, 2025

lyang24 commented Dec 30, 2025

> any benchmarks / unit tests?

I did some initial benchmarks: prefetching and avx2 seem to hurt perf, and avx512 is a mixed bag. cc @alexanderguzhva does this align with expectation? The benchmark is run on aws ec2 instances and the code is in the repo.

```
cd build/Release
make benchmark_sparse_simd -j8

./benchmark/benchmark_sparse_simd
```

AI-summarized report:

AVX512 SIMD Benchmark Results - Complete Summary

Test Environment

  • CPU: Intel c7i.2xlarge (Ice Lake, AVX512F supported)
  • Compiler: GCC with -mavx512f -mavx512dq
  • Dataset: Synthetic data with Zipf distribution (realistic sparse search workload)

Query Type Explanation

The benchmark tests two query generation strategies:

Random Query: Query terms selected randomly from vocabulary

  • Most terms are infrequent (short posting lists of ~1-10 documents)
  • SIMD loop rarely triggers due to insufficient elements
  • Represents unrealistic worst-case scenario

Heavy Terms Query: Query forced to include the top-10 most frequent terms

  • Heavy terms have long posting lists (hundreds to thousands of documents)
  • SIMD loop gets properly exercised with sufficient elements to amortize overhead
  • Represents realistic search queries (common words like "the", "search", "data")

Example from benchmark output:

```
Sparse IP (avg=32):
Top-10 heaviest terms: 17593 8796 5864 4398 3518 2932 2513 2199 1954 1759
Median posting length: 7
```

  • Random query: Likely hits median-length terms (~7 elements) → SIMD overhead dominates
  • Heavy query: Guaranteed to hit top-10 terms (~2000-17000 elements) → SIMD wins
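
A sketch of the two query-generation strategies described above (illustrative only, not the benchmark's exact code; `term_freqs[t]` is the posting-list length of term `t`):

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// heavy == true: force the 10 terms with the longest posting lists into the query,
// then fill the rest at random; heavy == false: all terms chosen at random.
std::vector<int32_t> make_query(const std::vector<int32_t>& term_freqs, size_t n_terms,
                                bool heavy, std::mt19937& rng) {
    std::vector<int32_t> q;
    if (heavy) {
        std::vector<int32_t> ids(term_freqs.size());
        std::iota(ids.begin(), ids.end(), 0);
        std::partial_sort(ids.begin(), ids.begin() + 10, ids.end(),
                          [&](int32_t a, int32_t b) { return term_freqs[a] > term_freqs[b]; });
        q.assign(ids.begin(), ids.begin() + 10);   // top-10 heaviest terms
    }
    std::uniform_int_distribution<int32_t> pick(0, (int32_t)term_freqs.size() - 1);
    while (q.size() < n_terms) {
        q.push_back(pick(rng));                    // remaining terms chosen uniformly at random
    }
    return q;
}
```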

Complete Results Table

| Test Case | Avg Posting Length | Query Type | Metric | Speedup | Verdict |
|---|---|---|---|---|---|
| Ultra-sparse IP (random) | 8 | Random | IP | 0.70x | ❌ Slower |
| Ultra-sparse IP (heavy) | 8 | Heavy terms | IP | 1.35x | ✅ Faster |
| Sparse IP (random) | 32 | Random | IP | 0.73x | ❌ Slower |
| Sparse IP (heavy) | 32 | Heavy terms | IP | 2.11x | ✅ Faster |
| Sparse BM25 (heavy) | 32 | Heavy terms | BM25 | 0.77x | ❌ Slower |
| Medium IP (random) | 128 | Random | IP | 0.50x | ❌ Slower |
| Medium IP (heavy) | 128 | Heavy terms | IP | 1.58x | ✅ Faster |
| Medium BM25 (heavy) | 128 | Heavy terms | BM25 | 0.78x | ❌ Slower |
| Dense IP (random) | 512 | Random | IP | 0.87x | ❌ Slower |
| Dense IP (heavy) | 512 | Heavy terms | IP | 1.69x | ✅ Faster |
| Dense BM25 (heavy) | 512 | Heavy terms | BM25 | 0.77x | ❌ Slower |
| Very Dense IP (heavy) | 2048 | Heavy terms | IP | 3.06x | ✅ Best Case |
| Very Dense BM25 (heavy) | 2048 | Heavy terms | BM25 | 0.80x | ❌ Slower |
| Real-world IP | 256 | Heavy terms | IP | 1.44x | ✅ Faster |
| Real-world BM25 | 256 | Heavy terms | BM25 | 0.78x | ❌ Slower |

Key Findings

✅ When AVX512 Helps (6/16 cases):

  • IP metric with heavy (frequent) terms: 1.35x - 3.06x speedup
  • Best performance: Very dense posting lists (avg=2048) → 3.06x speedup
  • Realistic workload: Real-world distribution (avg=256) → 1.44x speedup

❌ When AVX512 Hurts (10/16 cases):

  1. Random queries (0.50x-0.87x): Queries that miss heavy terms process short posting lists where SIMD overhead dominates
  2. BM25 metric (0.77x-0.80x): Per-element DocValueComputer function call overhead negates SIMD benefits across all densities

Why the Dichotomy?

AVX512 processes 16 floats per iteration. Performance depends on whether posting lists are long enough:

  • Heavy terms query hitting top-10 terms: Posting lists of 1000-17000 elements → SIMD loop runs 60-1000+ iterations → overhead amortized ✅
  • Random query hitting median terms: Posting lists of 1-10 elements → SIMD loop skipped, only scalar tail → pure overhead ❌

Real-world queries typically include common terms (stop words, domain keywords), making the "heavy terms" scenario more representative.

Conclusions

  1. AVX512 is beneficial for IP metric with frequent terms - the primary use case for sparse search
  2. Posting list length matters: Need avg > 32 elements for consistent speedup (see the gating sketch after this list)
  3. Query distribution matters: Only queries hitting heavy terms benefit from SIMD
  4. BM25 needs algorithmic changes: Current per-element function call approach incompatible with SIMD efficiency
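
Reflecting conclusion 2, a per-posting-list gate (mentioned as a TODO in the commit) could look roughly like this; the threshold value and names are illustrative, not the PR's actual code:

```cpp
#include <cstddef>

// Hypothetical per-posting-list gate: only take the AVX-512 path when the list is long
// enough to amortize gather/scatter overhead.
constexpr size_t kSimdMinPostingLen = 32;

template <typename ScalarFn, typename SimdFn>
void accumulate_posting_list(size_t posting_len, bool has_avx512f,
                             ScalarFn&& scalar_path, SimdFn&& simd_path) {
    if (has_avx512f && posting_len >= kSimdMinPostingLen) {
        simd_path();     // 16-wide gather/FMA/scatter loop
    } else {
        scalar_path();   // plain loop; also covers short lists and the tail-only case
    }
}
```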

} // namespace faiss
#endif // __x86_64__ || _M_X64

// Lightweight CPU feature detection for sparse vector SIMD
Collaborator

No.

InstructionSet class has all the needed facilities to perform checks.
Please remove SIMDCapabilities completely, because it is totally redundant.
If any features are needed, then I would modify InstructionSet class.

Also, AVX2 code is not very interesting in 2026, so it looks like a correct decision to get rid of it.

Author

yep used instruction set instead
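
(For context, a dispatch through InstructionSet might look roughly like the sketch below; the include path follows the reviewer's pointer to src/simd/instruction_set.h, but the accessor names GetInstance()/AVX512F()/AVX512DQ() are assumed rather than verified against that header.)

```cpp
#include "simd/instruction_set.h"  // path per the review comment above

// Hypothetical runtime check via InstructionSet instead of a separate SIMDCapabilities class.
inline bool sparse_use_avx512() {
    const auto& ins = faiss::InstructionSet::GetInstance();   // accessor names assumed
    return ins.AVX512F() && ins.AVX512DQ();
}
```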

@alexanderguzhva

overall, lgtm. please fix the formatting and this InstructionSet issue.
@sparknack do you have any suggestions about this code?

const boost::span<const float>* doc_len_ratios_spans_ptr) {
// Static asserts for type safety
static_assert(sizeof(table_t) == 4, "SIMD gather requires 32-bit doc IDs");
static_assert(std::is_same_v<QType, float>, "SIMD operations require float values");
@sparknack sparknack Dec 31, 2025

the element type of inverted_index_vals_spans should be uint16_t for BM25, so will this function still be called?

@lyang24 lyang24 Dec 31, 2025

yes, this is a bug - per the benchmark, bm25 underperforms on avx512, so i will fix this by leaving bm25 on the regular scalar path.
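
a minimal sketch of that routing - only the IP metric with float values takes the SIMD kernel, while BM25 and non-float value types such as uint16_t stay scalar (names are illustrative):

```cpp
#include <type_traits>

// Sketch: eligibility check for the AVX512 kernel. BM25 values arrive as uint16_t and its
// computer runs per element, so only IP over float values is routed to SIMD.
template <typename QType>
bool should_use_avx512(bool is_bm25, bool has_avx512f) {
    return std::is_same_v<QType, float> && !is_bm25 && has_avx512f;
}
```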

@lyang24 lyang24 force-pushed the simd4sparse branch 3 times, most recently from 7b0750d to 5726f8d on December 31, 2025 07:31
Implementation:
- Add AVX512 implementation for IP metric (1.4x-3x speedup on dense posting lists)
- Use separate compilation unit (sparse_simd_avx512.cc) for proper runtime CPU detection
- Runtime dispatch via faiss::InstructionSet - library works on any CPU
- Disable SIMD for BM25 metric (0.77x-0.80x slowdown due to DocValueComputer overhead)
- Only enable for IP metric with float values on AVX512-capable x86_64 CPUs

Code Quality:
- Remove 109 lines of redundant code (duplicate ARM dispatcher, inline implementation)
- Unified scalar fallback works across all platforms (x86_64, ARM, etc.)
- Add comprehensive benchmark with Zipf distribution for realistic testing
- Add TODO for future per-posting-list size threshold optimization

Signed-off-by: lyang24 <lanqingy93@gmail.com>

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>