In the last post, we figured out how to write cache-friendly code to boost the performance of our matrix multiplication algorithm on the CPU, going from 1 gflop/s to 30 gflop/s. In this part, we will discuss how to use the SIMD features of the CPU to boost that performance further.
What is SIMD?
SIMD stands for “Single Instruction, Multiple Data”. It describes instructions that perform the same operation on multiple data points simultaneously. SIMD applies when the same operation is performed on independent data elements (e.g. adding two arrays element-wise); it does not apply when there is a sequential dependency between operations (e.g. a for loop that sums a whole array into one accumulator).
AVX (Advanced Vector Extensions) is a set of CPU instructions designed for performing SIMD operations. AVX has instructions for operations like addition and multiplication that act on wide 256-bit registers, each of which can hold 4 double-precision or 8 single-precision floating-point numbers at the same time.
In C, there are header files that let you access these instructions as intrinsics without writing assembly. Here is an example of using SIMD in C to write a simple “vector add” program.
#include <immintrin.h> // provides access to all SIMD intrinsics

int main(int argc, char** argv){
    float a[4] = {1, 2, 3, 4}, b[4] = {5, 6, 7, 8}, result[4];
    __m128 va, vb, vresult; // 128-bit register type holding 4 single-precision floats
    // (__m128d is the equivalent type holding 2 doubles instead of 4 floats)
    va = _mm_loadu_ps(a);           // load the contents of a into va
    vb = _mm_loadu_ps(b);           // load the contents of b into vb
    vresult = _mm_add_ps(va, vb);   // add va and vb element-wise
    _mm_storeu_ps(result, vresult); // store the register back to memory
    return 0;
}
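The 256-bit AVX version follows exactly the same pattern, just with the __m256 type and the _mm256_* intrinsics. Here is a minimal sketch of the same vector add using 8 floats at a time:

#include <immintrin.h>

int main(void){
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8}, b[8] = {8, 7, 6, 5, 4, 3, 2, 1}, result[8];
    __m256 va = _mm256_loadu_ps(a);         // load 8 floats at once
    __m256 vb = _mm256_loadu_ps(b);
    __m256 vresult = _mm256_add_ps(va, vb); // 8 additions with a single instruction
    _mm256_storeu_ps(result, vresult);      // result[i] == a[i] + b[i]
    return 0;
}

One practical note: with GCC or Clang you typically need to compile with -mavx2 -mfma (or simply -march=native) for the AVX and FMA intrinsics used later in this post, while the 128-bit SSE intrinsics from the first example are enabled by default on x86-64.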
Matrix multiplication with SIMD
It is really easy to use SIMD instructions to speed up our matrix multiplication algorithm (although keeping track of all the indexing can get really annoying; I had to draw it out with pen and paper when debugging ._.)
If you look at where we left off last part, we just have to find a loop that we can parallelize easily. Choosing the k loop seems to be the easiest option to implement.
void matmul(float A[N][N], float BT[N][N], float C[N][N]){
    for (int bx = 0; bx < N; bx += BLOCK_SIZE){
        for (int by = 0; by < N; by += BLOCK_SIZE){
            // inner loop
            for (int k = 0; k < N; k++){
                for (int x = bx; x < bx + BLOCK_SIZE; x++){
                    for (int y = by; y < by + BLOCK_SIZE; y++){
                        C[x][y] += A[x][k] * BT[y][k];
                    }
                }
            }
        }
    }
}
With SIMD, we can keep an array of __m256 accumulators for the results (I use __m256 because my CPU does not support AVX-512, which has 512-bit registers), and then use the fused multiply-add intrinsic _mm256_fmadd_ps, which performs result[i] = a[i]*b[i] + result[i], to replace the C[x][y] += A[x][k] * BT[y][k] part of our computation from the previous algorithm.
You can find a list and documentation of SIMD instructions at the Intel Intrinsics Guide.
Note: I also changed BLOCK_SIZE into separate BLOCK_X and BLOCK_Y values so the block shape can be tuned for better cache locality.
void matmul(float A[N][N], float BT[N][N], float C[N][N]){
    for(int by = 0; by < N; by += BLOCK_Y){
        for(int bx = 0; bx < N; bx += BLOCK_X){
            __m256 tc[BLOCK_Y][BLOCK_X] = {}; // per-block accumulators, zero-initialized
            for(int k = 0; k < N; k += 8){
                for(int y = 0; y < BLOCK_Y; y++){
                    __m256 a_vec = _mm256_loadu_ps(&A[by+y][k]); // 8 consecutive elements of row by+y of A
                    for(int x = 0; x < BLOCK_X; x++){
                        __m256 b_vec = _mm256_loadu_ps(&BT[bx+x][k]); // 8 consecutive elements of row bx+x of BT
                        tc[y][x] = _mm256_fmadd_ps(a_vec, b_vec, tc[y][x]); // tc[y][x] += a_vec * b_vec
                    }
                }
            }
            // reduce the 8 lanes of each accumulator into one float and write it out
            for(int y = 0; y < BLOCK_Y; y++){
                for(int x = 0; x < BLOCK_X; x++){
                    float tmp = 0.0f;
                    for(int i = 0; i < 8; i++) tmp += tc[y][x][i]; // lane indexing is a GCC/Clang extension
                    C[by+y][bx+x] = tmp;
                }
            }
        }
    }
}
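The final scalar loop that sums the 8 lanes of each accumulator relies on a GCC/Clang extension (indexing into a __m256 with []). If you prefer to stay within the intrinsics API, the same horizontal sum can be written roughly like this (a sketch; hsum256_ps is just a hypothetical helper name, not part of the intrinsics headers):

// sum the 8 lanes of a __m256 into a single float
static inline float hsum256_ps(__m256 v){
    __m128 lo  = _mm256_castps256_ps128(v);   // lower 4 lanes
    __m128 hi  = _mm256_extractf128_ps(v, 1); // upper 4 lanes
    __m128 sum = _mm_add_ps(lo, hi);          // 4 partial sums
    sum = _mm_hadd_ps(sum, sum);              // 2 partial sums
    sum = _mm_hadd_ps(sum, sum);              // final sum in lane 0
    return _mm_cvtss_f32(sum);
}

// the write-back then becomes: C[by+y][bx+x] = hsum256_ps(tc[y][x]);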
This implementation achieves ~110 gflop/s on my machine, which is close to the ~140 gflop/s of OpenBLAS. Based on my research, further improvements and optimizations require tuning for the specific hardware characteristics of my CPU. I may pursue that in the future, but for now I am happy with the result we achieved!
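For reference, the gflop/s numbers above use the usual 2*N^3 flop count for an N x N matrix multiplication. A minimal timing harness, assuming the matmul signature and the N, BLOCK_X, BLOCK_Y constants defined earlier, looks something like this:

#include <stdio.h>
#include <time.h>

// N, BLOCK_X, BLOCK_Y and matmul() as defined above
int main(void){
    static float A[N][N], BT[N][N], C[N][N]; // static so large matrices don't blow the stack

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    matmul(A, BT, C);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("%.1f gflop/s\n", 2.0 * N * N * N / secs / 1e9);
    return 0;
}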