
#include <immintrin.h>

static inline void sgemv(float *out, const float *weights, int rows, int cols, int col_stride, const float *x)
{
   int i, j;
   i=0;
   /* Main pass: 16 output rows per iteration, two 256-bit accumulators. */
   for (;i<rows-15;i+=16) {
      float *y;
      __m256 vy0, vy8;
      y = &out[i];
      vy0 = _mm256_setzero_ps();
      vy8 = _mm256_setzero_ps();
      for (j=0;j<cols;j++) {
         __m256 vxj;
         __m256 vw;
         vxj = _mm256_broadcast_ss(&x[j]);                     /* broadcast x[j] to 8 lanes */
         vw = _mm256_loadu_ps(&weights[j*col_stride + i]);     /* rows i..i+7 of column j   */
         vy0 = _mm256_fmadd_ps(vw, vxj, vy0);
         vw = _mm256_loadu_ps(&weights[j*col_stride + i + 8]); /* rows i+8..i+15            */
         vy8 = _mm256_fmadd_ps(vw, vxj, vy8);
      }
      _mm256_storeu_ps(&y[0], vy0);
      _mm256_storeu_ps(&y[8], vy8);
   }
   /* Remainder pass: 8 rows at a time, one 256-bit accumulator. */
   for (;i<rows-7;i+=8) {
      float *y;
      __m256 vy0;
      y = &out[i];
      vy0 = _mm256_setzero_ps();
      for (j=0;j<cols;j++) {
         __m256 vxj;
         __m256 vw;
         vxj = _mm256_broadcast_ss(&x[j]);
         vw = _mm256_loadu_ps(&weights[j*col_stride + i]);
         vy0 = _mm256_fmadd_ps(vw, vxj, vy0);
      }
      _mm256_storeu_ps(&y[0], vy0);
   }
   /* Remainder pass: 4 rows at a time, one 128-bit accumulator. */
   for (;i<rows-3;i+=4) {
      float *y;
      __m128 vy0;
      y = &out[i];
      vy0 = _mm_setzero_ps();
      for (j=0;j<cols;j++) {
         __m128 vxj;
         __m128 vw;
         vxj = _mm_set1_ps(x[j]);
         vw = _mm_loadu_ps(&weights[j*col_stride + i]);
         vy0 = _mm_fmadd_ps(vw, vxj, vy0);
      }
      _mm_storeu_ps(&y[0], vy0);
   }
   /* Scalar fallback for the last 0-3 rows. */
   for (;i<rows;i++) {
      out[i] = 0;
      for (j=0;j<cols;j++)
         out[i] += weights[j*col_stride + i]*x[j];
   }
}

This code implements a highly optimized version of the SGEMV (single-precision general matrix-vector multiplication) operation, widely used in linear algebra and machine learning. The function computes the matrix-vector product y = A · x, where:

  • A is a dense single-precision floating-point matrix (represented by weights),
  • x is a vector (represented by x),
  • y is the result vector (represented by out).
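
In plain scalar C, the same computation looks like the following minimal reference sketch (the name sgemv_ref is illustrative; it uses the column-major layout with col_stride described under Inputs below):

/* Scalar reference: y[i] = sum over j of A[i][j] * x[j].
 * Element A[i][j] is stored at weights[j*col_stride + i] (column-major). */
static void sgemv_ref(float *out, const float *weights, int rows, int cols,
                      int col_stride, const float *x)
{
   int i, j;
   for (i = 0; i < rows; i++) {
      float sum = 0.0f;
      for (j = 0; j < cols; j++)
         sum += weights[j*col_stride + i] * x[j];
      out[i] = sum;
   }
}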

What this code does:

The implementation uses SIMD (Single Instruction, Multiple Data) vectorization with the AVX (Advanced Vector Extensions) and FMA instruction sets (e.g., _mm256_fmadd_ps), which process several floating-point operations per instruction and significantly boost performance for large matrix computations.
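
As a standalone illustration (not part of the function above), the following sketch applies one fused multiply-add across eight floats at once; it assumes a compiler and CPU with AVX and FMA support:

#include <immintrin.h>

/* acc[0..7] += a[0..7] * s, all eight lanes updated by a single FMA. */
static inline void fmadd8(float *acc, const float *a, float s)
{
   __m256 vs   = _mm256_set1_ps(s);        /* broadcast the scalar into 8 lanes */
   __m256 va   = _mm256_loadu_ps(a);       /* load 8 floats (unaligned is fine) */
   __m256 vacc = _mm256_loadu_ps(acc);
   vacc = _mm256_fmadd_ps(va, vs, vacc);   /* vacc = va * vs + vacc             */
   _mm256_storeu_ps(acc, vacc);            /* write the 8 updated sums back     */
}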

Here is a breakdown of the function's purpose and structure:


Inputs:

  1. float *out: Output vector y, where results will be stored.
  2. const float *weights: Input matrix A, stored in column-major order.
  3. int rows: Number of rows in the matrix A.
  4. int cols: Number of columns in the matrix A.
  5. int col_stride: The stride (or spacing, in elements) between successive columns in memory.
  6. const float *x: Input vector x.
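
Concretely, element A[i][j] is read from weights[j*col_stride + i], so col_stride is the leading dimension of the column-major layout and must be at least rows. A hypothetical call (sizes chosen purely for illustration) could look like:

/* A is 100 x 64, stored column by column with no padding (col_stride == rows). */
float A[100 * 64], x[64], y[100];
/* ... fill A (element A[i][j] goes to A[j*100 + i]) and x ... */
sgemv(y, A, /*rows=*/100, /*cols=*/64, /*col_stride=*/100, x);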

How It Works:

  1. Optimization for Vector Lengths:

    • The function processes the rows of the matrix in chunks of 16, 8, 4, then falls back to scalar (1 row at a time) computation for any remainder.
    • This chunking allows the exploitation of SIMD registers to handle multiple rows in parallel.
  2. Inner Loop:

    • For each chunk, the function iterates over the columns of the matrix A and accumulates the corresponding dot products with the vector x.
    • The AVX _mm256 intrinsics load 8 single-precision values into a 256-bit register (the 16-row pass keeps two such registers live) and apply fused multiply-add operations (_mm256_fmadd_ps), which compute acc = (a · b) + acc in a single instruction; see the sketch after this list.
  3. Handling Edge Cases:

    • When the number of rows isn't a multiple of 16, 8, or 4, scalar operations are performed for the remaining rows.
  4. Intermediate Storage:

    • Results are stored in out[i] (the vector y) using _mm256_storeu_ps or _mm_storeu_ps.
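
To make the inner loop concrete, here is a self-contained sketch of a single 8-row tile that mirrors the 8-wide pass of the function above (the name sgemv_tile8 is illustrative, not part of the original code):

#include <immintrin.h>

/* out[i0..i0+7] = sum over j of A[i0+k][j] * x[j], k = 0..7, with one AVX accumulator. */
static inline void sgemv_tile8(float *out, const float *weights, int i0,
                               int cols, int col_stride, const float *x)
{
   __m256 vy = _mm256_setzero_ps();                               /* 8 running sums       */
   for (int j = 0; j < cols; j++) {
      __m256 vxj = _mm256_broadcast_ss(&x[j]);                    /* x[j] in all 8 lanes  */
      __m256 vw  = _mm256_loadu_ps(&weights[j*col_stride + i0]);  /* 8 rows of column j   */
      vy = _mm256_fmadd_ps(vw, vxj, vy);                          /* vy += vw * vxj       */
   }
   _mm256_storeu_ps(&out[i0], vy);                                /* write 8 results      */
}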

Performance Highlights:

  • Chunked computation: The code starts with the largest SIMD-friendly chunk (16 rows at a time) and moves to smaller chunks as it handles edge cases.
  • SIMD intrinsics: each 256-bit AVX register holds 8 single-precision (32-bit) floats, so a single instruction replaces 8 scalar operations.
  • Fused multiply-add (FMA): These instructions combine multiplication and addition, reducing latency and improving throughput.

Code Flow Summary:

  1. For chunks of 16 rows:
    • Process using two 256-bit AVX registers (vy0 and vy8 for 8 rows each).
  2. For chunks of 8 rows:
    • Use one 256-bit AVX register.
  3. For chunks of 4 rows:
    • Use one 128-bit SSE register (_mm_ intrinsics).
  4. Handle remaining rows (less than 4) using scalar operations.
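
For example (row count chosen purely for illustration), with rows = 23 the loop bounds rows-15 = 8, rows-7 = 16, and rows-3 = 20 split the work as follows:

  • 16-row pass: i = 0 … 15 (entered because 0 < 8)
  • 8-row pass: skipped (16 < 16 is false)
  • 4-row pass: i = 16 … 19 (entered because 16 < 20)
  • Scalar pass: i = 20, 21, 22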

Output:

The function fills the array out with the result y = A · x, where:

  y[i] = Σ_{j=0}^{cols-1} A[i][j] · x[j]
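
For a quick sanity check, a minimal, hypothetical test driver is sketched below; it assumes sgemv and the sgemv_ref reference sketch from earlier are in the same translation unit, and the sizes are arbitrary (chosen so that the 16-row, 8-row, 4-row, and scalar passes are all exercised):

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
   int rows = 29, cols = 19, col_stride = 29;     /* col_stride == rows: no padding */
   float *A   = malloc(sizeof(float) * (size_t)col_stride * cols);
   float *x   = malloc(sizeof(float) * cols);
   float *y   = malloc(sizeof(float) * rows);
   float *ref = malloc(sizeof(float) * rows);

   for (int k = 0; k < col_stride * cols; k++) A[k] = (float)(k % 7) * 0.25f;
   for (int j = 0; j < cols; j++) x[j] = (float)(j % 5) - 2.0f;

   sgemv(y, A, rows, cols, col_stride, x);        /* SIMD version       */
   sgemv_ref(ref, A, rows, cols, col_stride, x);  /* scalar reference   */

   /* FMA rounds once per multiply-add, so tiny differences from the scalar
    * result are expected; they should be on the order of machine epsilon. */
   float max_err = 0.0f;
   for (int i = 0; i < rows; i++) max_err = fmaxf(max_err, fabsf(y[i] - ref[i]));
   printf("max abs difference: %g\n", max_err);

   free(A); free(x); free(y); free(ref);
   return 0;
}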

Use Case:

This function is commonly found in:

  • Machine learning libraries (for dense matrix-vector multiplications).
  • High-performance numerical computing.
  • Situations requiring low-level performance tuning for linear algebra operations.