This blog is part 1 of my matmul series, where I explore different ways to implement matrix multiplication and iteratively improve on them. (Unsurprisingly, there are many blogs and tutorials on this topic. This one aims to explain it from the perspective that makes sense to me.)
If you have just learned matrix multiplication in Linear Algebra 1 and were to program it the way you would calculate it by hand, it would look something like this.
// assuming NxN matrices, with N a compile-time constant (e.g. #define N 8192)
void matmul(float A[N][N], float B[N][N], float C[N][N]){
    for(int x = 0; x < N; x++){
        for(int y = 0; y < N; y++){
            float tmp = 0;
            // dot product of row x of A with column y of B
            for(int k = 0; k < N; k++) tmp += A[x][k] * B[k][y];
            C[x][y] = tmp;
        }
    }
}
Each element of the output matrix C requires a row of matrix A to be multiplied elementwise with a column of matrix B, and those products summed. The elementwise multiplications take N flops and summing the products takes another N flops. There are N^2 output elements in total, so the whole operation requires 2*N^3 flops.
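The gflop/s figures quoted below can be measured with a harness along these lines (a minimal sketch, not the exact benchmark code; the random initialization and clock_gettime timing are my own choices):
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 8192

void matmul(float A[N][N], float B[N][N], float C[N][N]); // the kernel above

int main(void){
    float (*A)[N] = malloc(sizeof(float[N][N]));
    float (*B)[N] = malloc(sizeof(float[N][N]));
    float (*C)[N] = malloc(sizeof(float[N][N]));
    for(int i = 0; i < N*N; i++){
        ((float*)A)[i] = (float)rand() / RAND_MAX;
        ((float*)B)[i] = (float)rand() / RAND_MAX;
    }

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    matmul(A, B, C);
    clock_gettime(CLOCK_MONOTONIC, &end);

    double secs = (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) * 1e-9;
    double total_flop = 2.0 * N * N * N; // 2*N^3, as derived above
    printf("GFLOP/S: %f\n", total_flop / secs * 1e-9);

    free(A); free(B); free(C);
    return 0;
}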
On my machine with an i9-13900k CPU, this algorithm gets about 1 gflop/s. For reference, a matmul in numpy (which uses OpenBLAS) reaches about 160 gflop/s on a single thread.
import os
os.environ['OPENBLAS_NUM_THREADS'] = '1'  # uses only 1 thread for fair comparison
import numpy as np
import time

N = 8192

if __name__ == "__main__":
    TOTAL_FLOP = 2*N*N*N
    A = np.random.randn(N, N).astype(np.float32)
    B = np.random.randn(N, N).astype(np.float32)
    for _ in range(1):
        start = time.monotonic()
        C = A @ B
        end = time.monotonic()
        exec_time = end - start
        GFLOPS = TOTAL_FLOP/exec_time*1e-9
        print(f"GFLOP/S: {GFLOPS}")
So what are the optimizations that provide 2+ orders of magnitude of performance over our naive algorithm? Surprisingly, it's still the same algorithm computing the same number of flops, so all the optimizations lie in making the hardware run faster.
(There are algorithms that require asymptotically less computation, but they are rarely used in practice due to their large constant overhead, see Strassen_algorithm)
CPU Hardware Background
Before we get into how to optimize the naive implementation, we should clarify where more CPU performance can come from. If we look at the algorithm above, we notice that all the work happens on these two lines. Let's break down what happens at the hardware level when we run them.
tmp += A[x][k] * B[k][y]; // performed N times per output
C[x][y] = tmp; // performed once per output
1. Load A[x][k] from matrix A
2. Load B[k][y] from matrix B
3. Multiply them and add the result to tmp
4. Repeat the above N times
5. Store the result in C[x][y]
We know that step 3 only takes a few CPU cycles, so most of the time is spent loading from matrices A and B. The key to speeding this up is leveraging the CPU's caches. Cache access times are not well documented, but here is a rough estimate of how long it takes to fetch data from each level of the hierarchy.
- Register access: 1 cpu cycle
- L1 cache hit: 3 cpu cycles
- L2 cache hit: 10 cpu cycles
- L3 cache hit: 70 cpu cycles
- DRAM access: 250 cpu cycles
A cache line on the i9-13900k is 64 bytes, so to keep hitting cache we want to 1. access contiguous data, since it is loaded in 64-byte chunks, and 2. reuse that data as much as possible once it is in the cache.
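To make the first point concrete, here is a toy illustration (my own sketch, using the same compile-time N as the rest of the post) of the two access patterns on a row-major NxN matrix:
// walking a row touches consecutive floats: 16 of them share one 64-byte cache line
float sum_row(float M[N][N], int i){
    float s = 0;
    for(int j = 0; j < N; j++) s += M[i][j]; // stride of 4 bytes, cache friendly
    return s;
}

// walking a column jumps N*4 bytes between accesses: every element lands on a new cache line
float sum_col(float M[N][N], int j){
    float s = 0;
    for(int i = 0; i < N; i++) s += M[i][j]; // stride of N*4 bytes, prone to misses
    return s;
}
The strided pattern is exactly what happens to B[k][y] in the naive matmul, which motivates the first optimization below.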
Cache-aware implementations
To ensure that the accessed data is contiguous, we will make a simple yet important change. Notice that when we load A[x][k] and B[k][y], k is the inner loop variable that changes fastest. If both A and B are stored in row-major order, the accesses to B[k][y] are strided and prone to cache misses. A simple change is to transpose B into BT, and access it as BT[y][k] instead.
void transpose(float B[N][N], float BT[N][N]){
    for (int i = 0; i < N; i += 8) {
        for (int j = 0; j < N; j++) {
            // the inner loop helps data reuse, a sneak peek into the next optimization
            for (int l = 0; l < 8; l++) {
                BT[i + l][j] = B[j][i + l];
            }
        }
    }
}
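With B transposed, the matmul itself only changes in how it indexes the second operand; a minimal sketch of the transposed version:
void matmul(float A[N][N], float BT[N][N], float C[N][N]){
    for(int x = 0; x < N; x++){
        for(int y = 0; y < N; y++){
            float tmp = 0;
            // both A[x][...] and BT[y][...] are now read contiguously
            for(int k = 0; k < N; k++) tmp += A[x][k] * BT[y][k];
            C[x][y] = tmp;
        }
    }
}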
This simple transpose increases our naive algorithm's performance from 1 gflop/s to 6 gflop/s!
Now, to ensure we reuse data as much as possible while it is in cache, we implement tiling. The key idea of tiling is to compute a BLOCK_SIZE x BLOCK_SIZE tile of the output in the inner loop instead of one element at a time. This reuses the loaded elements of A and BT across the whole tile.
// C must be zero-initialized, since each k iteration accumulates into it
void matmul(float A[N][N], float BT[N][N], float C[N][N]){
    for (int bx = 0; bx < N; bx += BLOCK_SIZE){
        for (int by = 0; by < N; by += BLOCK_SIZE){
            // inner loop: accumulate one BLOCK_SIZE x BLOCK_SIZE tile of C
            for (int k = 0; k < N; k++){
                for (int x = bx; x < bx + BLOCK_SIZE; x++){
                    for (int y = by; y < by + BLOCK_SIZE; y++){
                        C[x][y] += A[x][k] * BT[y][k];
                    }
                }
            }
        }
    }
}
The block size should be chosen so that the tile's working set fits in the CPU's cache; on my i9-13900k a block size of 8 achieves 30 gflop/s.
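Putting the pieces together, here is a sketch of how the transpose and the tiled kernel might be wired up (the matmul_tiled wrapper name and the zeroing of C via memset are my own additions):
#include <stdlib.h>
#include <string.h>

#define N 8192
#define BLOCK_SIZE 8

void transpose(float B[N][N], float BT[N][N]);              // as defined above
void matmul(float A[N][N], float BT[N][N], float C[N][N]);  // the tiled kernel above

// convenience wrapper: handles the transpose and the zero-init of C
void matmul_tiled(float A[N][N], float B[N][N], float C[N][N]){
    float (*BT)[N] = malloc(sizeof(float[N][N]));
    memset(C, 0, sizeof(float[N][N])); // the tiled kernel accumulates into C
    transpose(B, BT);
    matmul(A, BT, C);
    free(BT);
}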
So we went from 1 gflop/s to 30 gflop/s through cache-aware optimizations. This is around the maximum we can achieve without using more advanced CPU features or very model-specific finetuning. Now it is time to move on to SIMD instructions!
(TODO: You can mathematically determine theoretical maximum performance by cross validating with an instruction latency table)