This code implements a matrix-vector multiplication with single-precision floating-point numbers (`float`), using SIMD (Single Instruction, Multiple Data) intrinsics for optimization. Here is a detailed breakdown of what the code does:
Function Overview
- Name: `sgemv` (single-precision general matrix-vector multiply)
- Inputs:
  - `float *out`: Pointer where the resulting output vector is stored.
  - `const float *weights`: Pointer to the input matrix (weights).
  - `int rows`: Number of rows in the matrix.
  - `int cols`: Number of columns in the matrix.
  - `int col_stride`: Stride between columns (distance in memory from one column to the next).
  - `const float *x`: Pointer to the input vector for the multiplication.
- Output: The `out` pointer is filled with the resulting vector of the matrix-vector product.
The function computes:

`out[i] = Dot(weights[i, :], x)   // for each row i of the matrix`

where `weights[i, :]` is the i-th row of the matrix and `x` is the input vector.
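As a point of reference, here is a minimal scalar sketch of that computation. It is not taken from the source; it only assumes the layout described above, where element (i, j) of the matrix is stored at `weights[j * col_stride + i]`:

```c
/* Plain scalar matrix-vector multiply: out = weights * x.
 * Assumes column-strided storage: element (i, j) lives at
 * weights[j * col_stride + i]. */
void sgemv_reference(float *out, const float *weights, int rows, int cols,
                     int col_stride, const float *x)
{
    for (int i = 0; i < rows; i++) {
        float acc = 0.0f;
        for (int j = 0; j < cols; j++)
            acc += weights[j * col_stride + i] * x[j];  /* dot(weights[i, :], x) */
        out[i] = acc;
    }
}
```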
Key Details / Steps:
- **Debug Statement**: The first line prints the dimensions of the matrix (`rows` x `cols`), providing debug information about the input:
  `printf("Running sgemv with matrix of size %i x %i \n", rows, cols);`
- **SIMD-Optimized Loop**: The main logic of the matrix-vector multiplication is split into two parts: an optimized part using SIMD intrinsics, and a scalar part for the remaining rows. SIMD details:
  - **Outer loop (rows)**: The outer loop processes 4 rows at a time using SIMD instructions (`_mm_...` intrinsics). It steps in increments of 4, which is why it iterates as `i = 0; i < rows - 3; i += 4`:
    `for (; i < rows - 3; i += 4) { ... }`
  - **Core SIMD computation**: For each group of 4 rows, the code initializes a `__m128` vector (`vy0`) with zeros, representing the 4 accumulated dot products. Then, for each column `j`, it:
    - Broadcasts the j-th element of the vector `x` into a SIMD register `vxj` using `_mm_set1_ps(x[j])`.
    - Loads the 4 matrix elements for the current column and the current 4 rows into another SIMD register `vw` using `_mm_loadu_ps(&weights[j * col_stride + i])`.
    - Performs a fused multiply-add (FMA) into the accumulator: `vy0 = _mm_fmadd_ps(vw, vxj, vy0);`. This accumulates the dot products across columns for the current set of 4 rows.
  - **Store the results**: After the inner loop over the `j` columns finishes, the computed vector (`vy0`) is stored back into the `out` array (the store goes through a pointer `y`, which presumably points at `&out[i]`): `_mm_storeu_ps(&y[0], vy0);`
  - During this loop, debug information is printed for each calculation, showing the value of `x[j]`, the four weights, and the corresponding computation.
- **Tail Processing (Remaining Rows)**: After the SIMD loop over groups of 4 rows, any remaining rows (up to 3, since they could not fill a group of 4) are handled by a straightforward dot-product loop. It iterates over the unprocessed rows (`i < rows`) and computes the dot product of each row with the vector `x` manually: `for (; i < rows; i++) { out[i] = 0; for (j = 0; j < cols; j++) { out[i] += weights[j * col_stride + i] * x[j]; } }` A full reconstructed sketch of both paths follows this list.
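Putting the pieces together, a reconstructed sketch of the whole function might look like the following. This is an approximation based on the description above, not the original source: the local pointer `y`, the omission of the per-iteration debug prints, and the build requirements (an FMA-capable x86 target, e.g. compiled with `-mfma`) are assumptions.

```c
#include <immintrin.h>   /* SSE and FMA intrinsics */
#include <stdio.h>

void sgemv(float *out, const float *weights, int rows, int cols,
           int col_stride, const float *x)
{
    int i, j;
    printf("Running sgemv with matrix of size %i x %i \n", rows, cols);

    /* SIMD path: process 4 rows at a time. */
    for (i = 0; i < rows - 3; i += 4) {
        float *y = &out[i];              /* assumed: y points into the out array */
        __m128 vy0 = _mm_setzero_ps();   /* 4 running dot-product accumulators */
        for (j = 0; j < cols; j++) {
            /* Broadcast x[j] into all 4 lanes. */
            __m128 vxj = _mm_set1_ps(x[j]);
            /* Load rows i..i+3 of column j (column-strided layout). */
            __m128 vw = _mm_loadu_ps(&weights[j * col_stride + i]);
            /* Fused multiply-add: vy0 += vw * vxj. */
            vy0 = _mm_fmadd_ps(vw, vxj, vy0);
        }
        _mm_storeu_ps(&y[0], vy0);       /* write the 4 results back to out[i..i+3] */
    }

    /* Scalar tail: the remaining (rows % 4) rows. */
    for (; i < rows; i++) {
        out[i] = 0;
        for (j = 0; j < cols; j++)
            out[i] += weights[j * col_stride + i] * x[j];
    }
}
```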
Summary
- Purpose: This code computes `out = weights * x`, where `weights` is a `rows x cols` matrix and `x` is a vector of size `cols`.
- Optimization: The bulk of the computation uses SIMD intrinsics to process 4 rows in parallel, leveraging vectorized `_mm_...` instructions (e.g., `_mm_fmadd_ps` for fused multiply-add).
- Handling edge cases: If `rows` is not divisible by 4, the last few rows are processed sequentially.
- Debugging: It includes print statements that log intermediate values, which is useful for debugging but not production-ready due to the performance overhead.
In short, this is a SIMD-optimized implementation of a basic matrix-vector multiplication, with debug instrumentation still in place.
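For completeness, a hypothetical call site (sizes and values invented purely for illustration) could drive the sketch above like this, with `col_stride` equal to `rows` for a densely packed column-strided matrix:

```c
#include <stdio.h>

/* Assumes the sgemv sketch above is visible in this translation unit. */
void sgemv(float *out, const float *weights, int rows, int cols,
           int col_stride, const float *x);

int main(void)
{
    enum { ROWS = 6, COLS = 3 };
    float weights[ROWS * COLS];  /* column j occupies weights[j*ROWS .. j*ROWS + ROWS - 1] */
    float x[COLS] = { 1.0f, 2.0f, 3.0f };
    float out[ROWS];

    /* Fill each column with a simple test pattern: element (i, j) = i + 1. */
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            weights[j * ROWS + i] = (float)(i + 1);

    sgemv(out, weights, ROWS, COLS, /*col_stride=*/ROWS, x);

    for (int i = 0; i < ROWS; i++)
        printf("out[%d] = %f\n", i, out[i]);  /* expected: 6 * (i + 1) */
    return 0;
}
```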