Quantization

INT8/INT4, GGUF, AWQ, GPTQ, SmoothQuant, FP8 KV cache.

18 Articles
18 Topics covered
Articles in this category

All 18 articles, sorted alphabetically

ARTICLE · 01

AWQ vs GPTQ Quantization

Two ways to get 4-bit weights — and when each one wins.

ARTICLE · 02

bitsandbytes vs AutoAWQ vs AutoGPTQ

Library comparison for INT4 quant.

ARTICLE · 03

FP8 Inference Explained

H100-era format with INT8-like speed and FP-like dynamic range.

ARTICLE · 04

Future Of Extreme Quantization

ARTICLE · 05

GGUF Format Explained

The on-disk format powering llama.cpp and local inference.

ARTICLE · 06

GPU Kernels for INT4 Inference

Why hardware support is moving the bar.

ARTICLE · 07

Guide To Quantizing With Bitsandbytes

ARTICLE · 08

INT8 vs INT4 Quantization

What you actually lose and what you gain.

ARTICLE · 09

INT8 Calibration for LLMs

SmoothQuant and the activation outlier story.

ARTICLE · 10

Mixed-Precision Inference

FP16, BF16, FP8, and INT4 in one model.

ARTICLE · 11

QLoRA Fine-Tuning Explained

4-bit base + LoRA adapter = fine-tune big models cheaply.

ARTICLE · 12

Quantization-Aware Training

When QAT beats post-training.

ARTICLE · 13

Quantization Deep Dive: How 4-bit and 1.5-bit Models Retain 99% of Their Original Accuracy

ARTICLE · 14

Quantization Evaluation Methodology

Beyond perplexity — task-specific eval.

ARTICLE · 15

Quantization for Attention

KV cache, attention scores, and special cases.

ARTICLE · 16

Quantization for Embeddings

Scalar, binary, and product quantization.

ARTICLE · 17

SmoothQuant Intuition

Why migrating outliers works.

ARTICLE · 18

Speculative Decoding

Use a small model to draft and a big model to verify.
