This code implements a matrix-vector multiplication with single-precision floating-point numbers (`float`), using SIMD (Single Instruction, Multiple Data) intrinsics for optimization. Here is a detailed breakdown of what the code does:
Function Overview
- Name: `sgemv` (single-precision general matrix-vector multiply)
- Inputs:
  - `float *out`: Pointer where the resulting output vector is stored.
  - `const float *weights`: Pointer to the input matrix (weights).
  - `int rows`: Number of rows in the matrix.
  - `int cols`: Number of columns in the matrix.
  - `int col_stride`: Stride between columns (distance in memory from one column to the next).
  - `const float *x`: Pointer to the input vector for the multiplication.
- Output: The `out` pointer is filled with the resulting vector of the matrix-vector product.
The function computes:

`out[i] = Dot(weights[i, :], x)   // for each row i of the matrix`

where `weights[i, :]` is the i-th row of the matrix and `x` is the input vector.
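As a point of reference, here is a minimal scalar sketch of that computation. It is not taken from the source; it only assumes the layout described above, where element (i, j) of the matrix is stored at `weights[j * col_stride + i]`:

```c
/* Plain scalar matrix-vector multiply: out = weights * x.
 * Assumes column-strided storage: element (i, j) lives at
 * weights[j * col_stride + i]. */
void sgemv_reference(float *out, const float *weights, int rows, int cols,
                     int col_stride, const float *x)
{
    for (int i = 0; i < rows; i++) {
        float acc = 0.0f;
        for (int j = 0; j < cols; j++)
            acc += weights[j * col_stride + i] * x[j];  /* dot(weights[i, :], x) */
        out[i] = acc;
    }
}
```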
Key Details / Steps:
- **Debug Statement**: The first line prints the dimensions of the matrix (`rows` x `cols`), providing debug information about the input:
  `printf("Running sgemv with matrix of size %i x %i \n", rows, cols);`
- **SIMD-Optimized Loop**: The main logic of the matrix-vector multiplication is split into two parts: an optimized part using SIMD intrinsics, and a scalar part for the remaining rows. SIMD details:
  - **Outer loop (rows)**: The outer loop processes 4 rows at a time using SIMD instructions (`_mm_...` intrinsics). It steps in increments of 4, which is why it iterates as `i = 0; i < rows - 3; i += 4`:
    `for (; i < rows - 3; i += 4) { ... }`
  - **Core SIMD computation**: For each group of 4 rows, the code initializes a `__m128` vector (`vy0`) with zeros, representing the 4 accumulated dot products. Then, for each column `j`, it:
    - Broadcasts the j-th element of the vector `x` into a SIMD register `vxj` using `_mm_set1_ps(x[j])`.
    - Loads the 4 matrix elements for the current column and the current 4 rows into another SIMD register `vw` using `_mm_loadu_ps(&weights[j * col_stride + i])`.
    - Performs a fused multiply-add (FMA) into the accumulator: `vy0 = _mm_fmadd_ps(vw, vxj, vy0);`. This accumulates the dot products across columns for the current set of 4 rows.
  - **Store the results**: After the inner loop over the `j` columns finishes, the computed vector (`vy0`) is stored back into the `out` array (the store goes through a pointer `y`, which presumably points at `&out[i]`): `_mm_storeu_ps(&y[0], vy0);`
  - During this loop, debug information is printed for each calculation, showing the value of `x[j]`, the four weights, and the corresponding computation.
- **Tail Processing (Remaining Rows)**: After the SIMD loop over groups of 4 rows, any remaining rows (up to 3, since they could not fill a group of 4) are handled by a straightforward dot-product loop. It iterates over the unprocessed rows (`i < rows`) and computes the dot product of each row with the vector `x` manually: `for (; i < rows; i++) { out[i] = 0; for (j = 0; j < cols; j++) { out[i] += weights[j * col_stride + i] * x[j]; } }` A full reconstructed sketch of both paths follows this list.
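Putting the pieces together, a reconstructed sketch of the whole function might look like the following. This is an approximation based on the description above, not the original source: the local pointer `y`, the omission of the per-iteration debug prints, and the build requirements (an FMA-capable x86 target, e.g. compiled with `-mfma`) are assumptions.

```c
#include <immintrin.h>   /* SSE and FMA intrinsics */
#include <stdio.h>

void sgemv(float *out, const float *weights, int rows, int cols,
           int col_stride, const float *x)
{
    int i, j;
    printf("Running sgemv with matrix of size %i x %i \n", rows, cols);

    /* SIMD path: process 4 rows at a time. */
    for (i = 0; i < rows - 3; i += 4) {
        float *y = &out[i];              /* assumed: y points into the out array */
        __m128 vy0 = _mm_setzero_ps();   /* 4 running dot-product accumulators */
        for (j = 0; j < cols; j++) {
            /* Broadcast x[j] into all 4 lanes. */
            __m128 vxj = _mm_set1_ps(x[j]);
            /* Load rows i..i+3 of column j (column-strided layout). */
            __m128 vw = _mm_loadu_ps(&weights[j * col_stride + i]);
            /* Fused multiply-add: vy0 += vw * vxj. */
            vy0 = _mm_fmadd_ps(vw, vxj, vy0);
        }
        _mm_storeu_ps(&y[0], vy0);       /* write the 4 results back to out[i..i+3] */
    }

    /* Scalar tail: the remaining (rows % 4) rows. */
    for (; i < rows; i++) {
        out[i] = 0;
        for (j = 0; j < cols; j++)
            out[i] += weights[j * col_stride + i] * x[j];
    }
}
```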
Summary
- Purpose: This code computes `out = weights * x`, where `weights` is a `rows x cols` matrix and `x` is a vector of size `cols`.
- Optimization: The bulk of the computation uses SIMD intrinsics to process 4 rows in parallel, leveraging vectorized `_mm_...` instructions (e.g., `_mm_fmadd_ps` for fused multiply-add).
- Handling edge cases: If `rows` is not divisible by 4, the last few rows are processed sequentially.
- Debugging: It includes print statements that log intermediate values, which is useful for debugging but not production-ready due to the performance overhead.
In short, this is a SIMD-optimized implementation of a basic matrix-vector multiplication, with debug instrumentation still in place.
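For completeness, a hypothetical call site (sizes and values invented purely for illustration) could drive the sketch above like this, with `col_stride` equal to `rows` for a densely packed column-strided matrix:

```c
#include <stdio.h>

/* Assumes the sgemv sketch above is visible in this translation unit. */
void sgemv(float *out, const float *weights, int rows, int cols,
           int col_stride, const float *x);

int main(void)
{
    enum { ROWS = 6, COLS = 3 };
    float weights[ROWS * COLS];  /* column j occupies weights[j*ROWS .. j*ROWS + ROWS - 1] */
    float x[COLS] = { 1.0f, 2.0f, 3.0f };
    float out[ROWS];

    /* Fill each column with a simple test pattern: element (i, j) = i + 1. */
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            weights[j * ROWS + i] = (float)(i + 1);

    sgemv(out, weights, ROWS, COLS, /*col_stride=*/ROWS, x);

    for (int i = 0; i < ROWS; i++)
        printf("out[%d] = %f\n", i, out[i]);  /* expected: 6 * (i + 1) */
    return 0;
}
```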