This blog is a step-by-step tutorial on how to leverage cache-aware algorithms and SIMD instructions to write an efficient matrix multiplication kernel for x86 CPUs.
To recap, this is how you would implement the matrix multiplication as if you were calculating it on paper.
// assuming N x N matrices, with N a compile-time constant (e.g. #define N 1024)
void matmul(float A[N][N], float B[N][N], float C[N][N]) {
    for (int x = 0; x < N; x++) {
        for (int y = 0; y < N; y++) {
            float tmp = 0;
            for (int k = 0; k < N; k++) tmp += A[x][k] * B[k][y];
            C[x][y] = tmp;
        }
    }
}
For each element of the output matrix C, a row of matrix A is multiplied elementwise with a column of matrix B, and the products are summed up (a reduction). The elementwise multiplication takes N flops and the summation takes roughly another N flops, and there are N^2 output elements in total, so the whole operation costs about 2*N^3 flops, where N is the size of the matrix. For N = 8192 (the size used in the benchmark below), that is roughly 1.1 * 10^12 flops.
To set a baseline, numpy uses OpenBLAS, which reaches around 160 gflop/s on my CPU (i9-13900k). The naive algorithm, by comparison, only reaches 1 gflop/s.
import time, os
os.environ['OPENBLAS_NUM_THREADS'] = '1'  # use only 1 thread
import numpy as np

N = 8192

if __name__ == "__main__":
    TOTAL_FLOP = 2 * N * N * N
    A = np.random.randn(N, N).astype(np.float32)
    B = np.random.randn(N, N).astype(np.float32)
    for _ in range(1):
        start = time.monotonic()
        C = A @ B
        end = time.monotonic()
        exec_time = end - start
        GFLOPS = TOTAL_FLOP / exec_time * 1e-9
        print(f"GFLOP/S: {GFLOPS}")
So what are the optimizations that provide two-plus orders of magnitude of performance over our naive algorithm? Surprisingly, it's still the same algorithm with the same number of flops computed, so all the optimizations lie in making the hardware run faster.
(There are algorithms that need fewer operations, but they are rarely used in practice due to their large constant overhead; see the Strassen algorithm.)
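To time the C kernels in the rest of this post, a minimal harness along the lines of the sketch below works (this is an illustration using clock_gettime, not necessarily the exact harness behind the numbers here). Compile with something like gcc -O2 -march=native so the compiler is allowed to emit the FMA/AVX instructions used later on.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 1024  // example size for the sketch; the numpy benchmark above uses 8192

static float A[N][N], B[N][N], C[N][N];  // static arrays are zero-initialized

// the kernel under test, defined in a separate file and linked in
void matmul(float A[N][N], float B[N][N], float C[N][N]);

int main(void) {
    // fill the inputs with random values
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = (float)rand() / RAND_MAX;
            B[i][j] = (float)rand() / RAND_MAX;
        }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    matmul(A, B, C);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("GFLOP/S: %f\n", 2.0 * N * N * N / secs * 1e-9);
    return 0;
}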
CPU Hardware Background
Before we get into how to optimize the naive implementation, we should clarify where the extra CPU performance can come from. If we look at the algorithm above, we notice that all the work gets done on these two lines. Let's break down what happens at the hardware level when we run them.
tmp += A[x][k] * B[k][y]; // performed N times per output
C[x][y] = tmp;            // performed once per output
1. Load A[x][k] from matrix A
2. Load B[k][y] from matrix B
3. Multiply them and add the result to tmp
4. Repeat steps 1-3 N times
5. Store the result in C[x][y]
We know that step 3 should only take a few CPU cycles, so most of the time is spent loading from matrices A and B. The key to speeding this up is leveraging the CPU's caches. CPU cache access times are not well documented, but here is a rough estimate of how long it takes to fetch data from each level of the memory hierarchy.
- Register access: 1 CPU cycle
- L1 cache hit: 3 CPU cycles
- L2 cache hit: 10 CPU cycles
- L3 cache hit: 70 CPU cycles
- DRAM access: 250 CPU cycles
A cache line on the i9-13900k is 64 bytes, so to keep getting cache hits we want to (1) access contiguous data, since it is loaded in 64-byte chunks, and (2) reuse that data as much as possible once it is in the cache.
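To make the contiguity point concrete, here is a small standalone sketch (my own illustration, separate from the matmul code): both functions sum the same N x N matrix, but the first walks memory in stride-1 order while the second jumps a whole row's worth of bytes on every access, which typically makes it several times slower for large N.
// stride-1 traversal: consecutive addresses, one 64-byte cache line serves 16 floats
float sum_row_major(float M[N][N]) {
    float s = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += M[i][j];
    return s;
}

// stride-N traversal: each access lands on a different cache line for large N
float sum_col_major(float M[N][N]) {
    float s = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += M[i][j];
    return s;
}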
Cache-aware implementations
To ensure that the accessed data is contiguous, we will make a simple yet important change. Notice that when we load A[x][k] and B[k][y], k is the inner-loop variable that changes fastest. Since both A and B are stored in row-major order, the accesses to A[x][k] are contiguous, but the accesses to B[k][y] jump a full row apart on every iteration and are prone to cache misses. A simple fix is to transpose B into BT and access BT[y][k] instead.
void transpose(float B[N][N], float BT[N][N]) {
    for (int i = 0; i < N; i += 8) {
        for (int j = 0; j < N; j++) {
            // the inner loop reuses the cache line of B[j] (a sneak peek at the next optimization)
            for (int l = 0; l < 8; l++) {
                BT[i + l][j] = B[j][i + l];
            }
        }
    }
}
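With BT in hand, the main loop only changes in how B is indexed; here is a minimal sketch (the name matmul_transposed is just for illustration):
// same naive triple loop, but both A and BT are now read contiguously along k
void matmul_transposed(float A[N][N], float BT[N][N], float C[N][N]) {
    for (int x = 0; x < N; x++) {
        for (int y = 0; y < N; y++) {
            float tmp = 0;
            for (int k = 0; k < N; k++) tmp += A[x][k] * BT[y][k];
            C[x][y] = tmp;
        }
    }
}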
This simple transpose increases our naive algorithm's performance from 1 gflop/s to 6 gflop/s!
Now, to ensure we reuse data as much as possible while it is in cache, we implement tiling. The key idea of tiling is to compute a BLOCK_SIZE x BLOCK_SIZE tile of outputs in the inner loop instead of one element at a time, so that each loaded element of A and BT gets reused across the whole tile instead of just once.
// assumes C is zero-initialized, since results are now accumulated with +=
void matmul(float A[N][N], float BT[N][N], float C[N][N]) {
    for (int bx = 0; bx < N; bx += BLOCK_SIZE) {
        for (int by = 0; by < N; by += BLOCK_SIZE) {
            // inner loop: accumulate one BLOCK_SIZE x BLOCK_SIZE tile of C
            for (int k = 0; k < N; k++) {
                for (int x = bx; x < bx + BLOCK_SIZE; x++) {
                    for (int y = by; y < by + BLOCK_SIZE; y++) {
                        C[x][y] += A[x][k] * BT[y][k];
                    }
                }
            }
        }
    }
}
The block size should be chosen so that a tile's working set fits in cache; on my i9-13900k, a block size of 8 achieves 30 gflop/s.
So we went from 1 gflop/s to 30 gflop/s through cache-aware optimizations. This is around the maximum we can achieve without using more advanced features or very CPU-model-specific fine-tuning. Now it is time to move on to SIMD instructions!
(You can estimate the theoretical maximum performance by cross-referencing an instruction latency and throughput table.)
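As a rough sketch of that estimate (assuming, per published instruction tables, that each P-core of the i9-13900k can issue two 256-bit FMA instructions per cycle and boosts to about 5.5 GHz): 2 FMAs x 8 floats x 2 flops = 32 flops per cycle, or roughly 32 * 5.5 GHz ≈ 176 gflop/s for a single core, which lines up with the ~160 gflop/s OpenBLAS figure above.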
What is SIMD?
SIMD stands for “Single Instruction/Multiple Data”. It describes instructions that perform the same operation on multiple data points simultaneously. SIMD works well when the same operation is applied to independent data elements (e.g. adding two arrays element-wise). It does not apply directly when there is a sequential dependency between operations (e.g. a for loop that sums up a whole array).
AVX (Advanced Vector Extensions) is a set of x86 CPU instructions designed for performing SIMD operations. AVX has instructions for operations like addition and multiplication that act on wide 256-bit registers. Each 256-bit register can hold 4 double-precision or 8 single-precision floating point numbers at the same time.
In C, there are header files that let you access these instructions easily without writing assembly. Here is an example of using SIMD intrinsics in C to write a simple vector-add program.
#include <immintrin.h> // provides access to the SIMD intrinsics
#include <stdio.h>

int main(int argc, char** argv) {
    float a[4] = {1, 2, 3, 4}, b[4] = {5, 6, 7, 8}, result[4];
    __m128 va, vb, vresult;          // 128-bit register type holding 4 single-precision floats
                                     // (__m128d is the equivalent type for 2 doubles)
    va = _mm_loadu_ps(a);            // load the contents of a into va
    vb = _mm_loadu_ps(b);            // load the contents of b into vb
    vresult = _mm_add_ps(va, vb);    // add va and vb element-wise
    _mm_storeu_ps(result, vresult);  // store the register contents back to memory
    for (int i = 0; i < 4; i++) printf("%f\n", result[i]);
    return 0;
}
Matrix multiplication with SIMD
It is really easy to use SIMD instructions to speed up our matrix multiplication algorithm (though getting the indexing right for everything can get really annoying; I had to draw it out with pen and paper while debugging ._.)
Picking up where we left off in the last part, we just have to find a loop that we can parallelize easily. Vectorizing the k loop seems to be the easiest option to implement.
void matmul(float A[N][N], float BT[N][N], float C[N][N]) {
    for (int bx = 0; bx < N; bx += BLOCK_SIZE) {
        for (int by = 0; by < N; by += BLOCK_SIZE) {
            // inner loop
            for (int k = 0; k < N; k++) {
                for (int x = bx; x < bx + BLOCK_SIZE; x++) {
                    for (int y = by; y < by + BLOCK_SIZE; y++) {
                        C[x][y] += A[x][k] * BT[y][k];
                    }
                }
            }
        }
    }
}
With SIMD, we can keep an array of __m256 accumulators for the results (I use __m256 because my CPU does not support AVX-512, which has 512-bit registers), and then use the fused multiply-add intrinsic _mm256_fmadd_ps, which performs result[i] = a[i]*b[i] + result[i], to replace the C[x][y] += A[x][k] * BT[y][k] part of our computation from the previous algorithm.
You can find a list and documentation of SIMD instructions at the Intel Intrinsics Guide.
Note: I also split the single BLOCK_SIZE into separate BLOCK_X and BLOCK_Y, which gives finer-grained control over how much data each tile keeps in cache and in registers.
void matmul(float A[N][N], float BT[N][N], float C[N][N]) {
    for (int by = 0; by < N; by += BLOCK_Y) {
        for (int bx = 0; bx < N; bx += BLOCK_X) {
            // one 256-bit accumulator per output element of the BLOCK_Y x BLOCK_X tile
            __m256 tc[BLOCK_Y][BLOCK_X];
            for (int y = 0; y < BLOCK_Y; y++)
                for (int x = 0; x < BLOCK_X; x++)
                    tc[y][x] = _mm256_setzero_ps();
            for (int k = 0; k < N; k += 8) {
                for (int y = 0; y < BLOCK_Y; y++) {
                    __m256 a_vec = _mm256_loadu_ps(&A[by + y][k]);
                    for (int x = 0; x < BLOCK_X; x++) {
                        __m256 b_vec = _mm256_loadu_ps(&BT[bx + x][k]);
                        tc[y][x] = _mm256_fmadd_ps(a_vec, b_vec, tc[y][x]);
                    }
                }
            }
            // horizontal sum: reduce each 8-wide accumulator into one element of C
            for (int y = 0; y < BLOCK_Y; y++) {
                for (int x = 0; x < BLOCK_X; x++) {
                    float tmp = 0.0f;
                    // indexing a __m256 with [] relies on the GCC/Clang vector extension
                    for (int i = 0; i < 8; i++) tmp += tc[y][x][i];
                    C[by + y][bx + x] = tmp;
                }
            }
        }
    }
}
This implementation achieves ~110 gflop/s on my machine, which is close to the ~140 gflop/s from OpenBLAS. Based on my research, further improvements require tuning for the specific hardware details of my CPU. I may pursue that in the future, but for now I am happy with the result we achieved!