Articles in this category
All 18 articles, sorted alphabetically
ARTICLE · 01
AWQ vs GPTQ Quantization
Two ways to get 4-bit weights — and when each one wins.
ARTICLE · 02
bitsandbytes vs AutoAWQ vs AutoGPTQ
Library comparison for INT4 quantization.
ARTICLE · 03
FP8 Inference Explained
H100-era format with INT8-like speed and FP-like dynamic range.
ARTICLE · 04
Future Of Extreme Quantization
ARTICLE · 05
GGUF Format Explained
The on-disk format powering llama.cpp and local inference.
ARTICLE · 06
GPU Kernels for INT4 Inference
Why hardware support is moving the bar.
ARTICLE · 07
Guide To Quantizing With Bitsandbytes
ARTICLE · 08
INT8 vs INT4 Quantization
What you actually lose and what you gain.
ARTICLE · 09
INT8 Calibration for LLMs
SmoothQuant and the activation outlier story.
ARTICLE · 10
Mixed-Precision Inference
FP16, BF16, FP8, and INT4 in one model.
ARTICLE · 11
QLoRA Fine-Tuning Explained
4-bit base + LoRA adapter = fine-tune big models cheaply.
ARTICLE · 12
Quantization-Aware Training
When QAT beats post-training.
ARTICLE · 13
Quantization Deep Dive: How 4-bit and 1.5-bit Models Retain 99% of Their Original Accuracy
ARTICLE · 14
Quantization Evaluation Methodology
Beyond perplexity — task-specific eval.
ARTICLE · 15
Quantization for Attention
KV cache, attention scores, and special cases.
ARTICLE · 16
Quantization for Embeddings
Scalar, binary, and product quantization.
ARTICLE · 17
SmoothQuant Intuition
Why migrating outliers works.
ARTICLE · 18
Speculative Decoding
Use a small model to draft and a big model to verify.