Writing low level CPU kernels with OMP & AVX2 support #28

@shivendrra

Overview

Optimize the CPU kernel implementations in the csrc/cpu directory to use SIMD instructions (AVX2) and multi-threading (OpenMP) for faster tensor operations. The goal is to speed up inference workloads on CPU and make the library more efficient for production deployments.

Operations to Optimize

Binary Operations (Priority 1)

  • add_ops, add_scalar_ops
  • sub_ops, sub_scalar_ops
  • mul_ops, mul_scalar_ops
  • div_ops, div_scalar_ops
  • pow_array_ops, pow_scalar_ops
  • Broadcasted operations: add/sub/mul/div_broadcasted_array_ops
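
As a rough illustration of what a broadcasted kernel could look like, the sketch below adds a length-`cols` vector to every row of a row-major matrix. The function name `add_row_broadcast`, its signature, and the row-wise broadcast shape are assumptions made for this example; the actual `add_broadcasted_array_ops` interface in csrc/cpu may differ. The AVX path is guarded by `__AVX2__` so the code still builds as plain scalar C when AVX2 is not enabled.

```c
#include <stddef.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

/* Hypothetical helper (not the library's actual signature):
   out[r][c] = a[r][c] + b[c] for a row-major [rows x cols] matrix a. */
void add_row_broadcast(const float* a, const float* b, float* out,
                       size_t rows, size_t cols) {
  for (size_t r = 0; r < rows; r++) {
    const float* a_row = a + r * cols;
    float* out_row = out + r * cols;
    size_t c = 0;
#if defined(__AVX2__)
    /* Process 8 floats per iteration; b is re-read for every row. */
    for (; c + 8 <= cols; c += 8) {
      __m256 va = _mm256_loadu_ps(a_row + c);
      __m256 vb = _mm256_loadu_ps(b + c);
      _mm256_storeu_ps(out_row + c, _mm256_add_ps(va, vb));
    }
#endif
    /* Scalar tail (or the whole row when AVX2 is unavailable). */
    for (; c < cols; c++) out_row[c] = a_row[c] + b[c];
  }
}
```

Vectorizing along the contiguous inner dimension keeps the loads unaligned but sequential; other broadcast shapes (for example a column vector) would need a different inner loop.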

Helper Operations (Priority 1)

  • ones_array_ops, zeros_array_ops
  • fill_array_ops
  • linspace_array_ops
  • arange_array_ops (already efficient, low priority)
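
The helper operations are mostly store-bound, so a sketch of them is short. The names `fill_f32` and `linspace_f32` and their signatures are hypothetical, and the linspace version assumes NumPy-style semantics where both endpoints are included; the library's own helpers may define these differently.

```c
#include <stddef.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

/* Hypothetical fill kernel: writes `value` to every element of out. */
void fill_f32(float* out, float value, size_t size) {
  size_t i = 0;
#if defined(__AVX2__)
  const __m256 v = _mm256_set1_ps(value);  /* broadcast value to 8 lanes */
  for (; i + 8 <= size; i += 8) _mm256_storeu_ps(out + i, v);
#endif
  for (; i < size; i++) out[i] = value;    /* scalar tail */
}

/* Hypothetical linspace kernel, assuming both endpoints are included.
   Each element depends only on its index, so iterations are independent
   and the loop could be split across threads. */
void linspace_f32(float* out, float start, float stop, size_t size) {
  if (size == 0) return;
  if (size == 1) { out[0] = start; return; }
  const float step = (stop - start) / (float)(size - 1);
  for (size_t i = 0; i < size; i++) out[i] = start + step * (float)i;
  out[size - 1] = stop;  /* pin the endpoint against rounding drift */
}
```

Under these assumptions, ones and zeros reduce to `fill_f32(out, 1.0f, size)` and `fill_f32(out, 0.0f, size)`.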

Future Operations (Priority 2)

  • Unary ops: sqrt, exp, log, sin, cos, tan
  • Reduction ops: sum, mean, min, max, var, std, clip, clamp
  • Shape ops: transpose, reshape, smaller, greater, flatten
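
Reductions differ from the element-wise kernels because the vector lanes have to be combined at the end. A minimal sketch of a sum, assuming a hypothetical `sum_f32` function that is not part of the current code base:

```c
#include <stddef.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

/* Hypothetical reduction kernel: returns the sum of a[0..size). */
float sum_f32(const float* a, size_t size) {
  float total = 0.0f;
  size_t i = 0;
#if defined(__AVX2__)
  __m256 acc = _mm256_setzero_ps();        /* 8 running partial sums */
  for (; i + 8 <= size; i += 8)
    acc = _mm256_add_ps(acc, _mm256_loadu_ps(a + i));
  float lanes[8];
  _mm256_storeu_ps(lanes, acc);            /* spill lanes for the final add */
  for (int k = 0; k < 8; k++) total += lanes[k];
#endif
  for (; i < size; i++) total += a[i];     /* scalar tail */
  return total;
}
```

Because the vector path adds elements in a different order than a plain loop, results can differ in the last bits for general floating-point input, which is worth keeping in mind when writing tests for mean, var, and std.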

Core Operations (Priority 3)

  • core file
  • dtype & assignments
  • contiguous ops on arrays

Implementation Strategy

Adaptive Thresholds

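#define SIMD_THRESHOLD 64    /* minimum element count before taking the AVX2 path */
#define OMP_THRESHOLD 4096   /* minimum element count before spawning OpenMP threads */

Template (loop bodies are left empty for each operation to fill in):

#include <stddef.h>  /* size_t */

void operation(const float* a, const float* b, float* out, size_t size) {
#if USE_AVX2
  if (size >= SIMD_THRESHOLD) {
    /* Largest multiple of 8 <= size. The size_t cast keeps the mask
       full width on platforms where unsigned long is 32 bits. */
    const size_t simd_size = size & ~(size_t)7;
#ifdef _OPENMP
    if (size >= OMP_THRESHOLD) {
      /* A size_t loop index requires OpenMP 3.0 or newer. */
      #pragma omp parallel for schedule(static)
      for (size_t i = 0; i < simd_size; i += 8) {
        /* AVX2 body: load 8 floats from a and b, compute, store to out */
      }
    } else
#endif
    {
      for (size_t i = 0; i < simd_size; i += 8) {
        /* same AVX2 body, single-threaded */
      }
    }
    for (size_t i = simd_size; i < size; i++) {
      /* scalar tail for the remaining (size % 8) elements */
    }
    return;
  }
#endif
#ifdef _OPENMP
  if (size >= OMP_THRESHOLD) {
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < size; i++) {
      /* scalar body, multi-threaded */
    }
    return;
  }
#endif
  for (size_t i = 0; i < size; i++) {
    /* scalar body, single-threaded fallback */
  }
}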

| Operation  | AVX2 Intrinsic   |
|------------|------------------|
| Load       | _mm256_loadu_ps  |
| Store      | _mm256_storeu_ps |
| Set scalar | _mm256_set1_ps   |
| Add        | _mm256_add_ps    |
| Sub        | _mm256_sub_ps    |
| Mul        | _mm256_mul_ps    |
| Div        | _mm256_div_ps    |
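
To make the template concrete, here is one possible way to fill it in for element-wise addition using the intrinsics listed above. This is a sketch under a few assumptions rather than the intended implementation: the name `add_ops_sketch` is hypothetical, the build is assumed to signal AVX2 through the compiler's `__AVX2__` macro instead of `USE_AVX2`, and the duplicated threaded and single-threaded loops are merged by using OpenMP's `if` clause. A signed loop index is used because some older OpenMP implementations reject unsigned ones.

```c
#include <stddef.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

#define SIMD_THRESHOLD 64
#define OMP_THRESHOLD 4096

/* Hypothetical instance of the template: out[i] = a[i] + b[i]. */
void add_ops_sketch(const float* a, const float* b, float* out, size_t size) {
  size_t done = 0;  /* number of elements already handled by the SIMD path */
#if defined(__AVX2__)
  if (size >= SIMD_THRESHOLD) {
    const size_t simd_size = size & ~(size_t)7;  /* multiple of 8 */
    const ptrdiff_t n_vec = (ptrdiff_t)(simd_size / 8);
#ifdef _OPENMP
    /* Threads are only used when the array is large enough. */
    #pragma omp parallel for schedule(static) if (size >= OMP_THRESHOLD)
#endif
    for (ptrdiff_t v = 0; v < n_vec; v++) {
      const size_t i = (size_t)v * 8;
      const __m256 va = _mm256_loadu_ps(a + i);
      const __m256 vb = _mm256_loadu_ps(b + i);
      _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
    }
    done = simd_size;
  }
#endif
  /* Scalar path: the short tail after SIMD, or the whole array when
     AVX2 is unavailable or size is below SIMD_THRESHOLD. */
#ifdef _OPENMP
  #pragma omp parallel for schedule(static) if (size - done >= OMP_THRESHOLD)
#endif
  for (ptrdiff_t i = (ptrdiff_t)done; i < (ptrdiff_t)size; i++) {
    out[i] = a[i] + b[i];
  }
}
```

With GCC or Clang this would typically be built with `-mavx2 -fopenmp`; without those flags both guards compile out and the function reduces to a plain scalar loop, which makes it easy to compare the paths for correctness.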
