
#include <immintrin.h>

static inline void sgemv(float *out, const float *weights, int rows, int cols, int col_stride, const float *x)
{
   int i, j;
   i=0;
   /* Main pass: 16 output rows per iteration, two 256-bit accumulators. */
   for (;i<rows-15;i+=16) {
      float *y;
      __m256 vy0, vy8;
      y = &out[i];
      vy0 = _mm256_setzero_ps();
      vy8 = _mm256_setzero_ps();
      for (j=0;j<cols;j++) {
         __m256 vxj;
         __m256 vw;
         vxj = _mm256_broadcast_ss(&x[j]);                     /* broadcast x[j] to 8 lanes */
         vw = _mm256_loadu_ps(&weights[j*col_stride + i]);     /* rows i..i+7 of column j   */
         vy0 = _mm256_fmadd_ps(vw, vxj, vy0);
         vw = _mm256_loadu_ps(&weights[j*col_stride + i + 8]); /* rows i+8..i+15            */
         vy8 = _mm256_fmadd_ps(vw, vxj, vy8);
      }
      _mm256_storeu_ps(&y[0], vy0);
      _mm256_storeu_ps(&y[8], vy8);
   }
   /* Remainder pass: 8 rows at a time, one 256-bit accumulator. */
   for (;i<rows-7;i+=8) {
      float *y;
      __m256 vy0;
      y = &out[i];
      vy0 = _mm256_setzero_ps();
      for (j=0;j<cols;j++) {
         __m256 vxj;
         __m256 vw;
         vxj = _mm256_broadcast_ss(&x[j]);
         vw = _mm256_loadu_ps(&weights[j*col_stride + i]);
         vy0 = _mm256_fmadd_ps(vw, vxj, vy0);
      }
      _mm256_storeu_ps(&y[0], vy0);
   }
   /* Remainder pass: 4 rows at a time, one 128-bit accumulator. */
   for (;i<rows-3;i+=4) {
      float *y;
      __m128 vy0;
      y = &out[i];
      vy0 = _mm_setzero_ps();
      for (j=0;j<cols;j++) {
         __m128 vxj;
         __m128 vw;
         vxj = _mm_set1_ps(x[j]);
         vw = _mm_loadu_ps(&weights[j*col_stride + i]);
         vy0 = _mm_fmadd_ps(vw, vxj, vy0);
      }
      _mm_storeu_ps(&y[0], vy0);
   }
   /* Scalar fallback for the last 0-3 rows. */
   for (;i<rows;i++) {
      out[i] = 0;
      for (j=0;j<cols;j++)
         out[i] += weights[j*col_stride + i]*x[j];
   }
}

This code implements a highly optimized version of the SGEMV (single-precision general matrix-vector multiplication) operation, widely used in linear algebra and machine learning. The function computes the matrix-vector product y = A · x, where:

  • A is a dense single-precision floating-point matrix (represented by weights),
  • x is a vector (represented by x),
  • y is the result vector (represented by out).
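
In plain scalar C, the same computation looks like the following minimal reference sketch (the name sgemv_ref is illustrative; it uses the column-major layout with col_stride described under Inputs below):

/* Scalar reference: y[i] = sum over j of A[i][j] * x[j].
 * Element A[i][j] is stored at weights[j*col_stride + i] (column-major). */
static void sgemv_ref(float *out, const float *weights, int rows, int cols,
                      int col_stride, const float *x)
{
   int i, j;
   for (i = 0; i < rows; i++) {
      float sum = 0.0f;
      for (j = 0; j < cols; j++)
         sum += weights[j*col_stride + i] * x[j];
      out[i] = sum;
   }
}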

What this code does:

The implementation uses SIMD (Single Instruction, Multiple Data) vectorization with the AVX (Advanced Vector Extensions) and FMA instruction sets (e.g., _mm256_fmadd_ps), which process several floating-point operations per instruction and significantly boost performance for large matrix computations.
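
As a standalone illustration (not part of the function above), the following sketch applies one fused multiply-add across eight floats at once; it assumes a compiler and CPU with AVX and FMA support:

#include <immintrin.h>

/* acc[0..7] += a[0..7] * s, all eight lanes updated by a single FMA. */
static inline void fmadd8(float *acc, const float *a, float s)
{
   __m256 vs   = _mm256_set1_ps(s);        /* broadcast the scalar into 8 lanes */
   __m256 va   = _mm256_loadu_ps(a);       /* load 8 floats (unaligned is fine) */
   __m256 vacc = _mm256_loadu_ps(acc);
   vacc = _mm256_fmadd_ps(va, vs, vacc);   /* vacc = va * vs + vacc             */
   _mm256_storeu_ps(acc, vacc);            /* write the 8 updated sums back     */
}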

Here is a breakdown of the function's purpose and structure:


Inputs:

  1. float *out: Output vector y, where results will be stored.
  2. const float *weights: Input matrix A, stored in column-major order.
  3. int rows: Number of rows in the matrix A.
  4. int cols: Number of columns in the matrix A.
  5. int col_stride: The stride (or spacing, in elements) between successive columns in memory.
  6. const float *x: Input vector x.
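
Concretely, element A[i][j] is read from weights[j*col_stride + i], so col_stride is the leading dimension of the column-major layout and must be at least rows. A hypothetical call (sizes chosen purely for illustration) could look like:

/* A is 100 x 64, stored column by column with no padding (col_stride == rows). */
float A[100 * 64], x[64], y[100];
/* ... fill A (element A[i][j] goes to A[j*100 + i]) and x ... */
sgemv(y, A, /*rows=*/100, /*cols=*/64, /*col_stride=*/100, x);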

How It Works:

  1. Optimization for Vector Lengths:

    • The function processes the rows of the matrix in chunks of 16, 8, 4, then falls back to scalar (1 row at a time) computation for any remainder.
    • This chunking allows the exploitation of SIMD registers to handle multiple rows in parallel.
  2. Inner Loop:

    • For each chunk, the function iterates over the columns of the matrix A and accumulates the corresponding dot products with the vector x.
    • The AVX _mm256 intrinsics load 8 single-precision values into a 256-bit register (the 16-row pass keeps two such registers live) and apply fused multiply-add operations (_mm256_fmadd_ps), which compute acc = (a · b) + acc in a single instruction; see the sketch after this list.
  3. Handling Edge Cases:

    • When the number of rows isn't a multiple of 16, 8, or 4, scalar operations are performed for the remaining rows.
  4. Intermediate Storage:

    • Results are stored in out[i] (the vector y) using _mm256_storeu_ps or _mm_storeu_ps.
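
To make the inner loop concrete, here is a self-contained sketch of a single 8-row tile that mirrors the 8-wide pass of the function above (the name sgemv_tile8 is illustrative, not part of the original code):

#include <immintrin.h>

/* out[i0..i0+7] = sum over j of A[i0+k][j] * x[j], k = 0..7, with one AVX accumulator. */
static inline void sgemv_tile8(float *out, const float *weights, int i0,
                               int cols, int col_stride, const float *x)
{
   __m256 vy = _mm256_setzero_ps();                               /* 8 running sums       */
   for (int j = 0; j < cols; j++) {
      __m256 vxj = _mm256_broadcast_ss(&x[j]);                    /* x[j] in all 8 lanes  */
      __m256 vw  = _mm256_loadu_ps(&weights[j*col_stride + i0]);  /* 8 rows of column j   */
      vy = _mm256_fmadd_ps(vw, vxj, vy);                          /* vy += vw * vxj       */
   }
   _mm256_storeu_ps(&out[i0], vy);                                /* write 8 results      */
}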

Performance Highlights:

  • Chunked computation: The code starts with the largest SIMD-friendly chunk (16 rows at a time) and moves to smaller chunks as it handles edge cases.
  • SIMD intrinsics: each 256-bit AVX register holds 8 single-precision (32-bit) floats, so a single instruction replaces 8 scalar operations.
  • Fused multiply-add (FMA): These instructions combine multiplication and addition, reducing latency and improving throughput.

Code Flow Summary:

  1. For chunks of 16 rows:
    • Process using two 256-bit AVX registers (vy0 and vy8 for 8 rows each).
  2. For chunks of 8 rows:
    • Use one 256-bit AVX register.
  3. For chunks of 4 rows:
    • Use one 128-bit SSE register (_mm_ intrinsics).
  4. Handle remaining rows (less than 4) using scalar operations.
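
For example (row count chosen purely for illustration), with rows = 23 the loop bounds rows-15 = 8, rows-7 = 16, and rows-3 = 20 split the work as follows:

  • 16-row pass: i = 0 … 15 (entered because 0 < 8)
  • 8-row pass: skipped (16 < 16 is false)
  • 4-row pass: i = 16 … 19 (entered because 16 < 20)
  • Scalar pass: i = 20, 21, 22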

Output:

The function fills the array out with the result y = A · x, where:

  y[i] = Σ_{j=0}^{cols-1} A[i][j] · x[j]
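
For a quick sanity check, a minimal, hypothetical test driver is sketched below; it assumes sgemv and the sgemv_ref reference sketch from earlier are in the same translation unit, and the sizes are arbitrary (chosen so that the 16-row, 8-row, 4-row, and scalar passes are all exercised):

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
   int rows = 29, cols = 19, col_stride = 29;     /* col_stride == rows: no padding */
   float *A   = malloc(sizeof(float) * (size_t)col_stride * cols);
   float *x   = malloc(sizeof(float) * cols);
   float *y   = malloc(sizeof(float) * rows);
   float *ref = malloc(sizeof(float) * rows);

   for (int k = 0; k < col_stride * cols; k++) A[k] = (float)(k % 7) * 0.25f;
   for (int j = 0; j < cols; j++) x[j] = (float)(j % 5) - 2.0f;

   sgemv(y, A, rows, cols, col_stride, x);        /* SIMD version       */
   sgemv_ref(ref, A, rows, cols, col_stride, x);  /* scalar reference   */

   /* FMA rounds once per multiply-add, so tiny differences from the scalar
    * result are expected; they should be on the order of machine epsilon. */
   float max_err = 0.0f;
   for (int i = 0; i < rows; i++) max_err = fmaxf(max_err, fabsf(y[i] - ref[i]));
   printf("max abs difference: %g\n", max_err);

   free(A); free(x); free(y); free(ref);
   return 0;
}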

Use Case:

This function is commonly found in:

  • Machine learning libraries (for dense matrix-vector multiplications).
  • High-performance numerical computing.
  • Situations requiring low-level performance tuning for linear algebra operations.