This code defines a function `sgemv` that performs a **single-precision generalized matrix-vector multiplication**, using highly optimized SIMD (Single Instruction, Multiple Data) instructions, specifically AVX and SSE intrinsics.
## Purpose
The function computes the product of a tall matrix `weights` (with dimensions `rows x cols`) and a vector `x` (of size `cols`). The result is stored in the output vector `out` (of size `rows`).

- `weights`: a pointer to the matrix, assumed to be stored in column-major order.
- `x`: a pointer to the vector to multiply.
- `out`: a pointer to the output vector where the result of the matrix-vector multiplication is stored.
- Additional arguments (like `col_stride`) accommodate alignment or padding in the matrix.
## Breakdown of What the Code Does
- **Parameters:**
  - `weights`: a `rows x cols` matrix stored in memory.
  - `x`: a `cols`-dimensional vector to be multiplied with the matrix.
  - `out`: a `rows`-dimensional output vector to store the result.
  - `rows`: the number of rows in the matrix.
  - `cols`: the number of columns in the matrix.
  - `col_stride`: the stride between successive columns, i.e., the distance in memory between the starts of adjacent matrix columns.
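To make the parameters concrete, here is a plain scalar version of the same computation. It is a sketch assuming column-major storage with `col_stride` elements between column starts; the name `sgemv_scalar` and the exact argument order are illustrative, not the real function's signature:

```c
#include <stddef.h>

/* Scalar reference: out = weights * x, column-major storage.
 * Column j starts at weights + j * col_stride. */
static void sgemv_scalar(float *out, const float *weights, int rows, int cols,
                         int col_stride, const float *x)
{
    for (int i = 0; i < rows; i++)
        out[i] = 0.0f;
    for (int j = 0; j < cols; j++) {
        const float *col = weights + (size_t)j * col_stride;
        for (int i = 0; i < rows; i++)
            out[i] += col[i] * x[j];   /* accumulate x[j] * (column j) */
    }
}
```

The SIMD version computes exactly this, but vectorizes the inner loop over `i` so that 16, 8, or 4 rows are updated per instruction.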
- **Optimization with SIMD:** the key optimization is the use of AVX and SSE intrinsics for parallelized operations. The matrix-vector product is split into blocks of rows and vectorized to speed up computation:
  - AVX intrinsics (256-bit registers) handle 8 single-precision floats at a time, processing chunks of 16 or 8 rows.
  - SSE intrinsics (128-bit registers) compute 4 floats at a time when fewer than 8 rows remain.
  - Any remaining rows (fewer than 4) are computed in a standard scalar loop.
- **Loop description:**
  - **First loop (16 rows at a time):** processes 16 rows per iteration using AVX registers (`_mm256` intrinsics). The matrix `weights` is multiplied with the vector `x` in blocks of 16 rows; the results are accumulated in `vy0` and `vy8` and then stored in the `out` array.
  - **Second loop (8 rows at a time):** if fewer than 16 rows remain, the next loop processes 8 rows at a time, still using AVX.
  - **Third loop (4 rows at a time):** when fewer than 8 rows remain, the next loop processes 4 rows at a time using SSE registers (`_mm` intrinsics for 128-bit operations).
  - **Final scalar loop (1 row at a time):** the remaining rows (fewer than 4) are handled with standard scalar operations.
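The four-stage structure can be sketched in plain scalar C; each call to the hypothetical helper `rows_block` stands in for what the real code does with one block of SIMD registers (names and signature are illustrative, not taken from the source):

```c
#include <stddef.h>

/* Compute out[i .. i+n-1] for a column-major matrix. */
static void rows_block(float *out, const float *weights, int cols,
                       int col_stride, const float *x, int i, int n)
{
    for (int b = 0; b < n; b++) {
        float acc = 0.0f;
        for (int j = 0; j < cols; j++)
            acc += weights[(size_t)j * col_stride + i + b] * x[j];
        out[i + b] = acc;
    }
}

/* The 16/8/4/1 row-blocking structure described above. */
static void sgemv_blocked(float *out, const float *weights, int rows, int cols,
                          int col_stride, const float *x)
{
    int i = 0;
    for (; i + 16 <= rows; i += 16) rows_block(out, weights, cols, col_stride, x, i, 16);
    for (; i + 8  <= rows; i += 8)  rows_block(out, weights, cols, col_stride, x, i, 8);
    for (; i + 4  <= rows; i += 4)  rows_block(out, weights, cols, col_stride, x, i, 4);
    for (; i < rows; i++)           rows_block(out, weights, cols, col_stride, x, i, 1);
}
```

For example, with `rows = 21` the stages handle 16, 0, 4, and 1 rows respectively.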
- **Vectorization details:** the actual multiply-accumulate is performed using fused multiply-add (FMA) intrinsics: `_mm256_fmadd_ps` (for AVX) and `_mm_fmadd_ps` (for SSE). These instructions compute `a * b + c` in a single operation, which is efficient and incurs only one rounding instead of two.
- **Memory access:**
  - The function uses *unaligned* loads (`_mm256_loadu_ps` and `_mm_loadu_ps`) to bring matrix data into registers, so the matrix need not sit on an aligned boundary.
  - Broadcasting is used for the entries of `x` (`_mm256_broadcast_ss` and `_mm_set1_ps`), replicating a single scalar value across every lane of a SIMD register.
- **Final scalar block:** once all rows that fit into SIMD registers have been computed, a final scalar loop handles the last 1-3 remaining rows sequentially, ensuring that all rows are processed.
## Performance Considerations
This implementation is designed to efficiently utilize modern x86-64 CPUs with AVX and SSE instruction sets for highly optimized matrix-vector multiplication, taking advantage of:
- Register-level parallelism.
- Fused multiply-add operations.
- SIMD to process multiple rows simultaneously.
However, performance may depend on the alignment of `weights`, `x`, and `out` in memory, as unaligned memory access can slightly reduce efficiency.
## Summary
In simple terms, the code implements an efficient single-precision matrix-vector multiplication (`y = Wx`) for a tall matrix, using AVX/SSE for SIMD optimization. It processes the matrix block-wise (16, 8, or 4 rows at a time) and then computes any remaining rows using scalar code.