SIMD Instructions for Transformer Math

SIMD (Single Instruction Multiple Data) lets one CPU instruction operate on a vector of values. Modern transformer kernels rely on SIMD to hit peak FLOPS. Knowing what's available helps you pick the right libraries and chips.

AVX2 — 256-bit vectors

Introduced with Intel Haswell in 2013. Processes 8 FP32 values per instruction; with two FMA units that is up to 32 FP32 FLOPs per cycle per core. Still the common baseline on consumer Intel and AMD chips. Enough for SLM inference at roughly 5-20 tokens/sec on 7B-class models.
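Peak figures like these come from simple arithmetic: lanes per vector, times FMA units, times 2 FLOPs per fused multiply-add, times clock speed. The sketch below is a back-of-envelope calculator, not a benchmark. The function names are made up for illustration, and the two FMA units and 3.0 GHz clock in the usage lines are assumptions about an example core, not values from any specific chip.

```python
def simd_lanes(vector_bits: int, elem_bits: int) -> int:
    """Number of elements one vector register holds."""
    return vector_bits // elem_bits


def peak_flops_per_cycle(vector_bits: int, elem_bits: int, fma_units: int) -> int:
    """Theoretical peak FLOPs per cycle per core.

    Each lane of a fused multiply-add counts as 2 FLOPs (one mul, one add).
    """
    return simd_lanes(vector_bits, elem_bits) * fma_units * 2


def peak_gflops(vector_bits: int, elem_bits: int, fma_units: int,
                clock_ghz: float) -> float:
    """Theoretical peak GFLOPS per core at a given sustained clock."""
    return peak_flops_per_cycle(vector_bits, elem_bits, fma_units) * clock_ghz


# Assumed example core: 256-bit vectors, FP32, 2 FMA units, 3.0 GHz.
print(simd_lanes(256, 32))               # 8 lanes
print(peak_flops_per_cycle(256, 32, 2))  # 32 FLOPs per cycle
print(peak_gflops(256, 32, 2, 3.0))      # 96.0 GFLOPS
```

Real kernels land well below this ceiling, since memory bandwidth and clock throttling under sustained vector load both cut into it.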

AVX-512 — 512-bit vectors

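16 FP32 or 32 BF16 values per instruction, or up to 64 FP32 FLOPs per cycle per core on chips with two 512-bit FMA units. Available on Intel Skylake-X and later server and workstation parts and on AMD Zen 4 and later; recent Intel consumer chips (Alder Lake onward) ship with it disabled. At best it doubles AVX2 throughput, but Zen 4 splits 512-bit operations across 256-bit units, so the real gain is often smaller. Support for the sub-extensions (AVX-512-VNNI, BF16, FP16) varies by chip, so check before optimizing.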
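One way to check support on Linux is to read the flags line of /proc/cpuinfo. The sketch below assumes a Linux x86 machine and the flag spellings the Linux kernel uses there (for example avx512_vnni, amx_bf16). The helper names are hypothetical, and other operating systems need a different mechanism.

```python
from typing import Dict, Set

# Display name -> flag spelling on the "flags" line of Linux /proc/cpuinfo.
X86_SIMD_FLAGS = {
    "AVX2": "avx2",
    "AVX-512F": "avx512f",
    "AVX-512-VNNI": "avx512_vnni",
    "AVX-512-BF16": "avx512_bf16",
    "AVX-512-FP16": "avx512_fp16",
    "AMX-TILE": "amx_tile",
    "AMX-BF16": "amx_bf16",
    "AMX-INT8": "amx_int8",
}


def parse_cpu_flags(cpuinfo_text: str) -> Set[str]:
    """Return the flag set from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def simd_support(cpuinfo_text: str) -> Dict[str, bool]:
    """Map each extension name to whether its flag is present."""
    flags = parse_cpu_flags(cpuinfo_text)
    return {name: flag in flags for name, flag in X86_SIMD_FLAGS.items()}


def read_cpuinfo(path: str = "/proc/cpuinfo") -> str:
    """Read cpuinfo; returns an empty string where the file does not exist."""
    try:
        with open(path, encoding="utf-8") as f:
            return f.read()
    except OSError:
        return ""


for name, present in simd_support(read_cpuinfo()).items():
    print(f"{name:14s} {'yes' if present else 'no'}")
```

Taking the cpuinfo text as a parameter keeps the parsing testable on any machine; only read_cpuinfo touches the filesystem.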

AMX (Advanced Matrix Extensions)

# Sapphire Rapids (4th-gen Xeon Scalable) and later
# 2D tile registers: 8 tiles, each 16 rows x 64 bytes (16x64 INT8 or 16x32 BF16)
# Single instruction: tile matmul into a 16x16 accumulator
#   (up to 8192 BF16 or 16384 INT8 multiply-adds)

Designed for the dense matmuls that dominate transformer inference. Peak BF16 throughput is on the order of 1 TFLOPS per core, roughly 10× AVX-512 on matmul-heavy code. It is Intel's response to GPUs taking over AI workloads. AMX needs OS kernel support (on Linux, 5.16 or later) and library support; OpenBLAS, MKL, and oneDNN already use it.
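The tile geometry determines how much work one tile instruction does and how many such instructions a large matmul needs. The sketch below works through that arithmetic under the commonly documented layout: 16 rows of 64 bytes per tile, with a 16x16 FP32 or INT32 accumulator. The function names are illustrative, and the count ignores loads, stores, and any padding strategy a real library would use.

```python
import math

TILE_ROWS = 16        # rows per tile register (assumed maximum configuration)
TILE_ROW_BYTES = 64   # bytes per tile row
ACC_BYTES = 4         # accumulator elements are FP32 or INT32


def tile_shape(elem_bytes: int) -> tuple:
    """(rows, elements per row) for a tile holding elem_bytes-wide values."""
    return TILE_ROWS, TILE_ROW_BYTES // elem_bytes


def macs_per_tile_op(elem_bytes: int) -> int:
    """Multiply-adds in one tile matmul: M x N outputs, each a K-long dot product."""
    m, k = tile_shape(elem_bytes)
    _, n = tile_shape(ACC_BYTES)
    return m * n * k


def tile_ops_for_matmul(m: int, n: int, k: int, elem_bytes: int) -> int:
    """Tile matmul instructions needed for an (m x k) @ (k x n) product."""
    tile_m, tile_k = tile_shape(elem_bytes)
    _, tile_n = tile_shape(ACC_BYTES)
    return math.ceil(m / tile_m) * math.ceil(n / tile_n) * math.ceil(k / tile_k)


print(tile_shape(1))        # INT8 tile: (16, 64)
print(tile_shape(2))        # BF16 tile: (16, 32)
print(macs_per_tile_op(2))  # 8192 BF16 multiply-adds per tile instruction
print(tile_ops_for_matmul(4096, 4096, 4096, 2))  # 8388608 tile instructions
```

Dimensions that are not multiples of the tile size still cost a full tile instruction per partial block, which is why libraries pad or repack weights ahead of time.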

ARM NEON / SVE

Apple Silicon (M1/M2/M3): 128-bit NEON plus a custom matrix coprocessor. ARM Neoverse: SVE, where the vector length is chosen by the implementation (128 to 2048 bits), and on newer cores SVE2 with BF16 matrix-multiply extensions. Both are fast for inference, and llama.cpp's ARM kernels are well-tuned.

Picking hardware for CPU inference

Mid-tier:   AMD 7900X (AVX-512) / Intel i7-14700K (AVX2; AVX-512 disabled), no AMX
            ~10-30 tokens/sec on 7B Q4

Server:     Intel Sapphire Rapids+ (AMX)
            ~50-100 tokens/sec on 7B Q4

Mobile:     Apple M3 Pro+
            ~30-60 tokens/sec on 7B Q4 (via MLX)

CPU inference speed depends heavily on SIMD generation. AMX on the latest Xeons can be competitive with consumer GPUs for SLM serving. Apple Silicon's matrix coprocessor (also called AMX, but unrelated to Intel's) makes it the strongest option for laptops.

AVX-512 = 16 FP32 per instruction. AMX adds tile matmul, ~10× faster on matmul. Pick hardware by SIMD generation.