
#include <arm_neon.h>

/* Computes out = weights * x, where the matrix is stored column by column
   with a stride of col_stride floats between columns, processing 16 output
   rows per iteration of the outer loop. Assumes rows is a multiple of 16. */
static inline void sgemv16x1(float *out, const float *weights, int rows,
                             int cols, int col_stride, const float *x)
{
   int i, j;
   for (i = 0; i < rows; i += 16) {
      float * restrict y = &out[i];
      /* keep y[0..15] in registers for duration of inner loop */
      float32x4_t y0_3   = vdupq_n_f32(0);
      float32x4_t y4_7   = vdupq_n_f32(0);
      float32x4_t y8_11  = vdupq_n_f32(0);
      float32x4_t y12_15 = vdupq_n_f32(0);
      for (j = 0; j < cols; j++) {
         const float * restrict w;
         float32x4_t wvec0_3, wvec4_7, wvec8_11, wvec12_15;
         float32x4_t xj;
         /* 16 consecutive weights of column j, starting at row i */
         w = &weights[j*col_stride + i];
         wvec0_3   = vld1q_f32(&w[0]);
         wvec4_7   = vld1q_f32(&w[4]);
         wvec8_11  = vld1q_f32(&w[8]);
         wvec12_15 = vld1q_f32(&w[12]);
         /* broadcast x[j] into all four lanes */
         xj = vld1q_dup_f32(&x[j]);
         y0_3   = vmlaq_f32(y0_3,   wvec0_3,   xj);
         y4_7   = vmlaq_f32(y4_7,   wvec4_7,   xj);
         y8_11  = vmlaq_f32(y8_11,  wvec8_11,  xj);
         y12_15 = vmlaq_f32(y12_15, wvec12_15, xj);
      }
      /* save y[0..15] back to memory */
      vst1q_f32(&y[0],  y0_3);
      vst1q_f32(&y[4],  y4_7);
      vst1q_f32(&y[8],  y8_11);
      vst1q_f32(&y[12], y12_15);
   }
}

This code implements a highly optimized matrix-vector multiplication for a specific case: it computes the product of a matrix (given as weights) with a vector x, writing the result to the vector out. The matrix is stored column by column, with element (i, j) at weights[j*col_stride + i]. The implementation takes advantage of SIMD (Single Instruction Multiple Data) instructions, specifically NEON intrinsics for ARM processors, to accelerate the computation.

Here is an explanation of what the code does:


Input Parameters:

  1. out: Pointer to the output array, which will hold the result of the matrix-vector multiplication. It has a size equal to the number of rows of the matrix.
  2. weights: Pointer to the weight matrix (rows x cols). It is stored column by column: column j starts at weights[j*col_stride], so element (i, j) is weights[j*col_stride + i].
  3. rows: Number of rows in the matrix.
  4. cols: Number of columns in the matrix.
  5. col_stride: Stride (step size) between the starts of consecutive columns in memory, measured in floats; it must be at least rows. (A layout example follows this list.)
  6. x: Pointer to the input vector, which has a length of cols.
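To make the storage layout concrete, here is a minimal calling sketch. The sizes and fill values are invented for illustration; it assumes sgemv16x1 from above is in scope and that rows is a multiple of 16:

#include <stdlib.h>

int main(void) {
    int rows = 32, cols = 3, col_stride = 32; /* col_stride >= rows */
    float *weights = malloc((size_t)col_stride * cols * sizeof(float));
    float x[3] = {1.0f, 2.0f, 3.0f};
    float out[32];

    /* Element (i, j) lives at weights[j*col_stride + i]. */
    for (int j = 0; j < cols; j++)
        for (int i = 0; i < rows; i++)
            weights[j*col_stride + i] = (float)(i + j); /* arbitrary fill */

    sgemv16x1(out, weights, rows, cols, col_stride, x);
    free(weights);
    return 0;
}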

What the Code Does:

  1. Outer Loop (for (i=0; i<rows; i+=16)):

    • The matrix is processed in blocks of 16 rows at a time for efficiency. Each iteration of the loop computes the outputs for 16 rows of the matrix. The code assumes rows is a multiple of 16; there is no remainder loop.
  2. Register Initialization:

    • Four 128-bit SIMD registers (y0_3, y4_7, y8_11, y12_15) are initialized to zero at the beginning of the outer loop. These registers will hold the partial sums for the rows processed in this iteration.
  3. Inner Loop (for (j=0; j<cols; j++)):

    • This loop iterates over the columns of the matrix. For each column j:
      a. Load 16 floating-point values (the weights of column j for the current 16 rows) into four SIMD registers (wvec0_3, wvec4_7, wvec8_11, wvec12_15).
      b. Broadcast the corresponding input element x[j] into all four lanes of a SIMD register xj (vld1q_dup_f32).
      c. Accumulate the product of the weights and xj into the corresponding y registers with multiply-add operations (vmlaq_f32).
  4. Store Results:

    • After processing all columns for a block of 16 rows, the computed results in the SIMD registers are stored back into memory (out[i] through out[i+15]) using vst1q_f32. (A scalar equivalent of the whole computation is sketched after this list.)
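For reference, the same computation written as plain scalar C. This sketch is ours (the name sgemv_ref is not in the original) and is useful for checking the SIMD version on small inputs:

/* Scalar reference: same layout, element (i, j) at weights[j*col_stride + i]. */
static void sgemv_ref(float *out, const float *weights, int rows,
                      int cols, int col_stride, const float *x)
{
    for (int i = 0; i < rows; i++) {
        float acc = 0.0f;
        for (int j = 0; j < cols; j++)
            acc += weights[j*col_stride + i] * x[j];
        out[i] = acc;
    }
}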

Optimizations Used:

  1. SIMD with NEON Intrinsics:
    • The code utilizes NEON intrinsics (vdupq_n_f32, vld1q_f32, vld1q_dup_f32, vmlaq_f32, vst1q_f32) to perform operations on 4 floating-point values simultaneously, providing significant performance improvements on compatible ARM processors.
  2. Blocking with 16 Rows:
    • The matrix is processed in chunks of 16 rows at a time, i.e. four 4-lane NEON registers' worth. Because the accumulators stay in registers across the entire inner loop, each output element is written to memory exactly once per block. The scheme assumes rows is a multiple of 16; a tail-handling sketch follows this list.
  3. Fused Multiply-Add:
    • vmlaq_f32 performs a multiply-accumulate on four lanes at once; on AArch64, compilers typically lower it to a fused fmla instruction, combining the multiplication and addition into a single operation and reducing computational cycles.
  4. Memory Access Efficiency:
    • The loads from each column are contiguous, and the restrict-qualified pointers inform the compiler that no aliasing occurs, allowing for better optimization. (vld1q_f32 and vst1q_f32 do not require 16-byte alignment, so the code also works on unaligned buffers.)
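As noted above, the function has no remainder handling for row counts that are not multiples of 16. A caller could handle the tail with a scalar loop; the wrapper below is a hypothetical sketch (sgemv_any_rows is our name, and it assumes sgemv16x1 is in scope):

/* SIMD path for full 16-row blocks, scalar fallback for the remaining rows. */
static void sgemv_any_rows(float *out, const float *weights, int rows,
                           int cols, int col_stride, const float *x)
{
    int main_rows = rows & ~15; /* largest multiple of 16 not exceeding rows */
    sgemv16x1(out, weights, main_rows, cols, col_stride, x);
    for (int i = main_rows; i < rows; i++) {
        float acc = 0.0f;
        for (int j = 0; j < cols; j++)
            acc += weights[j*col_stride + i] * x[j];
        out[i] = acc;
    }
}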

Final Result:

The function performs the mathematical operation:

out[i] = ∑_(j=0 to cols-1) (weights[j * col_stride + i] * x[j]) for i = 0 to rows-1

Where the processing of rows is optimized by grouping the computations in blocks of 16 rows using NEON intrinsics.

Effectively, this is a highly optimized variant of a standard matrix-vector multiplication implementation tailored for ARM architectures.
