
September 1, 2025 at 03:12 AM

```c
#include <immintrin.h>

/* Sparse matrix-vector multiply: out = weights * x.
 * The non-zero weights are stored as 8x4 blocks (8 rows by 4 columns).
 * For each band of 8 rows, idx holds the number of non-zero blocks,
 * followed by the starting column of each block. */
static inline void sparse_sgemv8x4(float *out, const float *weights,
                                   const int *idx, int rows, const float *x)
{
   int i, j;
   for (i = 0; i < rows; i += 8)
   {
      float *y = &out[i];
      __m256 vy0 = _mm256_setzero_ps();   /* accumulator for 8 rows */
      int cols = *idx++;                  /* non-zero 4-column blocks in this band */
      for (j = 0; j < cols; j++)
      {
         int id = *idx++;                 /* starting column of this block */
         __m256 vxj, vw;
         /* Column id */
         vxj = _mm256_broadcast_ss(&x[id]);
         vw  = _mm256_loadu_ps(&weights[0]);
         vy0 = _mm256_fmadd_ps(vw, vxj, vy0);
         /* Column id+1 */
         vxj = _mm256_broadcast_ss(&x[id+1]);
         vw  = _mm256_loadu_ps(&weights[8]);
         vy0 = _mm256_fmadd_ps(vw, vxj, vy0);
         /* Column id+2 */
         vxj = _mm256_broadcast_ss(&x[id+2]);
         vw  = _mm256_loadu_ps(&weights[16]);
         vy0 = _mm256_fmadd_ps(vw, vxj, vy0);
         /* Column id+3 */
         vxj = _mm256_broadcast_ss(&x[id+3]);
         vw  = _mm256_loadu_ps(&weights[24]);
         vy0 = _mm256_fmadd_ps(vw, vxj, vy0);
         weights += 32;                   /* advance past the 8x4 block */
      }
      _mm256_storeu_ps(&y[0], vy0);       /* write the 8 accumulated results */
   }
}
```

This code performs a sparse Single-Precision General Matrix-Vector multiplication (SGEMV) for an 8x4 block, utilizing SIMD (Single Instruction, Multiple Data) intrinsics via AVX (Advanced Vector Extensions). Here's a detailed breakdown of its functionality:

Purpose:

The function computes a matrix-vector product (out = weights * x) for a sparse matrix stored in a compressed block format, using AVX SIMD instructions for efficiency. It processes 8 rows at a time and consumes the non-zero weights in 8x4 blocks (8 rows by 4 columns), hence the 8x4 in the name.


Arguments:

  1. float *out: The output vector where the result of the multiplication is stored.
  2. const float *weights: A compressed representation of the matrix's non-zero values.
  3. const int *idx: Encodes the sparse structure: for each band of 8 rows, a count of non-zero 4-column blocks followed by the starting column of each block.
  4. int rows: Total number of rows in the matrix; the loop assumes this is a multiple of 8.
  5. const float *x: Input dense vector (right-hand side of the multiplication).
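As a sketch of the compressed layout these arguments imply (the concrete numbers and names below are a hypothetical toy example, not taken from the source): for each band of 8 rows, the idx stream holds a block count followed by block starting columns, and weights holds the matching 8x4 blocks, each stored column-major (8 floats per column) so that _mm256_loadu_ps can fetch a whole column at once.

```c
/* Hypothetical toy band: 8 rows whose non-zeros live in two 4-wide
 * blocks, starting at columns 4 and 12. */
static const int toy_idx[] = { 2, 4, 12 };

/* Offset of matrix entry (r, col) inside a block that starts at column
 * id: columns are stored contiguously, 8 floats each. */
static int block_offset(int r, int col, int id) {
    return (col - id) * 8 + r;   /* column-major within the block */
}
```

For example, entry (row 3, column 6) of a block starting at column 4 lives at offset (6-4)*8 + 3 = 19 within that block's 32 floats.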

High-Level Functionality:

  1. The loop processes the sparse matrix in chunks of 8 rows at a time.
  2. For each row block:
    • It initializes an 8-element AVX register vy0 to accumulate the dot product.
    • It reads the number of non-zero 4-column blocks in this band (cols) from *idx++.
    • For each non-zero block:
      • The block's starting column (id) is read from *idx++.
      • For each of the four columns id..id+3, the corresponding value of x is broadcast to all 8 lanes of an AVX register (vxj), 8 weight values (one column of the block) are loaded into vw, and an FMA (Fused Multiply-Add) accumulates vw * vxj into vy0.
      • The weights pointer advances by 32 floats (the full 8x4 block).
  3. After processing the block, vy0 is stored in the out array for the corresponding 8 rows.
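The steps above can be mirrored in portable scalar C. This is a reference sketch for clarity, not the author's code; the function name sparse_sgemv8x4_scalar is made up here, but it produces the same result as the SIMD version for inputs in the 8x4 block format:

```c
/* Scalar reference for the same computation (hypothetical name; the
 * real function uses AVX intrinsics). */
static void sparse_sgemv8x4_scalar(float *out, const float *weights,
                                   const int *idx, int rows, const float *x)
{
    for (int i = 0; i < rows; i += 8) {
        float acc[8] = {0};          /* plays the role of vy0 */
        int cols = *idx++;           /* non-zero 4-column blocks in this band */
        for (int j = 0; j < cols; j++) {
            int id = *idx++;         /* starting column of the block */
            for (int c = 0; c < 4; c++) {
                float xv = x[id + c];               /* the broadcast value */
                for (int r = 0; r < 8; r++)
                    acc[r] += weights[c*8 + r] * xv; /* the FMA step */
            }
            weights += 32;           /* skip past the 8x4 block */
        }
        for (int r = 0; r < 8; r++)
            out[i + r] = acc[r];     /* the final store */
    }
}
```

With one band, one block at column 0, all 32 weights equal to 1, and x = {1, 2, 3, 4}, every output row accumulates 1+2+3+4 = 10.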

Key Components:

  1. SIMD Operations:

    • _mm256_setzero_ps(): Initializes a 256-bit AVX register to zero, representing 8 single-precision floating-point zero values.
    • _mm256_broadcast_ss(&x[id]): Broadcasts a single float from the vector x to all 8 elements of an AVX register.
    • _mm256_loadu_ps(&weights[n]): Loads 8 consecutive floats (one column of the 8x4 block) from weights into an AVX register; the u suffix means no alignment is required.
    • _mm256_fmadd_ps(vw, vxj, vy0): Performs vy0 = (vw * vxj) + vy0 in a single operation.
    • _mm256_storeu_ps(&y[0], vy0): Stores the 8 accumulated floats in vy0 to out[i..i+7].
  2. Sparse Matrix Handling:

    • The array idx encodes the sparsity pattern: for each band of 8 rows, it lists how many 4-column blocks contain non-zeros and where each block starts, letting the algorithm skip the all-zero columns entirely.
    • cols = *idx++ retrieves the number of non-zero blocks for the current band of rows.
  3. Efficiency:

    • By operating on 8 rows simultaneously, the code leverages SIMD instructions for higher performance compared to scalar processing.
    • The compressed block representation reduces memory bandwidth by storing only the non-zero 8x4 blocks of the matrix.

Example Use Case:

The function is useful in scenarios where a sparse matrix needs to be multiplied by a dense vector efficiently, such as in:

  1. Machine learning (e.g., sparse neural network layers, recommendation systems).
  2. Scientific computing (e.g., solving sparse systems of linear equations).
  3. Data compression or sparse data representations.

Conclusion:

This function performs a sparse matrix-vector multiplication using AVX intrinsics for efficient computation. It processes a block of 8 rows at a time, leveraging SIMD parallelism to accelerate the dot product computation between sparse rows and the input vector.
