Quantization for Embeddings

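Vector databases at scale (>10M vectors) hit a memory wall: 16M vectors × 768 dims × 4 bytes (FP32) ≈ 49GB for the embeddings alone, before any index overhead. Quantization cuts this by 4-32x, or more with product quantization, with controllable quality loss. The techniques differ from LLM weight quantization: the goal here is to keep nearest-neighbor rankings intact, not to preserve a model's outputs.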
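The arithmetic behind those figures is easy to check. The sketch below counts raw vector storage only and ignores index overhead; the helper name `embedding_bytes` is made up for illustration.

```python
def embedding_bytes(num_vectors: int, dims: int, bits_per_dim: float) -> float:
    """Raw storage for the vectors alone (no index structures), in bytes."""
    return num_vectors * dims * bits_per_dim / 8


N, D = 16_000_000, 768  # the example figures from the text

fp32 = embedding_bytes(N, D, 32)
for name, bits in [("FP32", 32), ("INT8 scalar", 8), ("binary", 1)]:
    size = embedding_bytes(N, D, bits)
    print(f"{name:12s} {size / 1e9:6.2f} GB  ({fp32 / size:.0f}x smaller)")
```

Swapping in your own vector count and dimensionality gives a quick estimate of where each technique would land.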


Scalar quantization (SQ)

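FP32 → INT8 per dimension. 4x memory reduction. Typically ~1-3% recall drop on standard benchmarks. Cheap and widely supported: Faiss ships scalar-quantizer indexes (including HNSW + SQ), and pgvector offers a related half-precision type (halfvec) that gives 2x rather than 4x. The default first step.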
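A minimal sketch of the idea in pure Python, assuming a simple per-dimension min/max calibration and unsigned 8-bit codes (0-255). Real libraries differ in how they calibrate ranges and store codes; the function names here are illustrative, not any library's API.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Ranges = List[Tuple[float, float]]


def fit_ranges(vectors: Sequence[Vector]) -> Ranges:
    """Per-dimension (min, max) taken from a calibration sample."""
    return [(min(col), max(col)) for col in zip(*vectors)]


def quantize(vec: Vector, ranges: Ranges) -> bytes:
    """Map each float to one byte using that dimension's calibrated range."""
    codes = []
    for x, (lo, hi) in zip(vec, ranges):
        scale = (hi - lo) / 255 or 1.0  # guard against constant dimensions
        q = round((x - lo) / scale)
        codes.append(min(255, max(0, q)))  # clamp values outside the range
    return bytes(codes)


def dequantize(codes: bytes, ranges: Ranges) -> List[float]:
    """Approximate reconstruction of the original floats."""
    return [lo + c * ((hi - lo) / 255) for c, (lo, hi) in zip(codes, ranges)]


sample = [[0.0, -1.0, 0.5], [1.0, 1.0, 0.5], [0.25, 0.0, 0.5]]
ranges = fit_ranges(sample)
codes = quantize([0.25, 0.0, 0.5], ranges)  # 3 bytes instead of 12
approx = dequantize(codes, ranges)
```

Each reconstructed value is off by at most half a quantization step for in-range inputs, which is why the recall impact tends to be small.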

Product quantization (PQ)

Split each vector into M sub-vectors; cluster each sub-space into 256 centroids; store one centroid ID (1 byte) per sub-vector, so a vector shrinks to M bytes. 32-128x memory reduction depending on M. Larger recall drop (5-15%); largely recoverable with re-ranking. Used in Faiss IVFPQ.
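To make the split / cluster / store-IDs steps concrete, here is a toy pure-Python sketch with a basic k-means. It assumes the dimensionality is divisible by M and that there are at least k training vectors; all names are illustrative, and production systems would use an optimized library instead.

```python
import random
from typing import List, Sequence

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point: Vector, centroids: Sequence[Vector]) -> int:
    return min(range(len(centroids)), key=lambda i: sq_dist(point, centroids[i]))


def kmeans(points: List[Vector], k: int, iters: int = 10, seed: int = 0) -> List[List[float]]:
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        groups: List[List[Vector]] = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        for i, g in enumerate(groups):
            if g:  # keep the old centroid if a cluster went empty
                centroids[i] = [sum(col) / len(g) for col in zip(*g)]
    return centroids


def split(vec: Vector, m: int) -> List[Vector]:
    step = len(vec) // m
    return [vec[i * step:(i + 1) * step] for i in range(m)]


def train_pq(vectors: Sequence[Vector], m: int, k: int = 256) -> List[List[List[float]]]:
    """One codebook of k centroids per sub-space (k <= 256 so an ID fits in a byte)."""
    subspaces = zip(*(split(v, m) for v in vectors))
    return [kmeans(list(sub), k) for sub in subspaces]


def encode(vec: Vector, codebooks) -> bytes:
    """Replace each sub-vector with the ID of its nearest centroid."""
    subs = split(vec, len(codebooks))
    return bytes(nearest(s, cb) for s, cb in zip(subs, codebooks))


def decode(codes: bytes, codebooks) -> List[float]:
    """Approximate the vector by concatenating the chosen centroids."""
    return [x for c, cb in zip(codes, codebooks) for x in cb[c]]
```

With 768 dims and M = 96, each vector becomes 96 bytes instead of 3072, which is the 32x end of the range; M = 24 gives the 128x end.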


Binary quantization

Each dimension → 1 bit (sign). 32x memory reduction. Surprisingly good recall for high-dim embeddings (Cohere's Embed v3 is designed for this). Hamming distance is fast; re-rank top-K with full precision.
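As a rough illustration, sign-based binarization and Hamming distance fit in a few lines. This sketch packs the bits into a single Python int for simplicity; real systems pack into byte arrays and use hardware popcount. The function names are illustrative.

```python
from typing import Sequence


def binarize(vec: Sequence[float]) -> int:
    """Pack sign bits into one int: bit i is 1 when vec[i] > 0."""
    bits = 0
    for i, x in enumerate(vec):
        if x > 0:
            bits |= 1 << i
    return bits


def hamming(a: int, b: int) -> int:
    """Number of positions where the two bit patterns differ."""
    return bin(a ^ b).count("1")


a = binarize([0.3, -0.1, 0.8, -0.5])
b = binarize([0.2, 0.4, -0.9, -0.1])
distance = hamming(a, b)  # the vectors disagree in sign on dims 1 and 2
```

A 768-dim vector needs 768 bits, or 96 bytes, against 3072 bytes in FP32.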

Matryoshka embeddings

Trained so the first N dims are useful on their own, the first 2N are better, and the full vector is best. Pick the truncation level at runtime. Combines naturally with quantization: truncating to a quarter of the dims and scalar-quantizing to INT8 gives 4 × 4 = 16x memory reduction.
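A small sketch of runtime truncation and the resulting compression arithmetic. Re-normalizing after truncation is an assumption here (a common choice when searching by cosine similarity), and the 768 → 192 example is illustrative rather than a recommendation for any particular model.

```python
import math
from typing import List, Sequence


def truncate(vec: Sequence[float], dims: int) -> List[float]:
    """Keep the first `dims` values and rescale to unit length."""
    head = list(vec[:dims])
    norm = math.sqrt(sum(x * x for x in head)) or 1.0  # avoid dividing by zero
    return [x / norm for x in head]


def compression_ratio(full_dims: int, kept_dims: int,
                      bits_per_dim: int, full_bits: int = 32) -> float:
    """How many times smaller the stored vector is than full-width FP32."""
    return (full_dims * full_bits) / (kept_dims * bits_per_dim)


short = truncate([3.0, 4.0, 12.0], 2)
ratio = compression_ratio(768, 192, 8)  # quarter of the dims, INT8 each
```

The two factors multiply, so the same helper can be used to compare other truncation and bit-width combinations.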

Two-stage retrieval

Stage 1: search quantized vectors (fast, recall-imperfect). Stage 2: rerank top K candidates with full-precision vectors (accurate). Recovers most recall at small extra cost. Standard pattern at scale.
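The pattern can be sketched as follows, assuming sign-bit codes for stage 1 and dot-product scoring for stage 2. The `oversample` parameter and all function names are hypothetical; a real deployment would use an index for stage 1 rather than a full scan.

```python
from typing import List, Sequence

Vector = Sequence[float]


def sign_bits(vec: Vector) -> int:
    """Binary code: bit i is 1 when vec[i] > 0."""
    return sum(1 << i for i, x in enumerate(vec) if x > 0)


def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def two_stage_search(query: Vector, vectors: Sequence[Vector], codes: Sequence[int],
                     k: int = 10, oversample: int = 4) -> List[int]:
    q_code = sign_bits(query)
    # Stage 1: cheap scan over compact codes, keeping k * oversample candidates.
    by_hamming = sorted(range(len(codes)), key=lambda i: hamming(q_code, codes[i]))
    shortlist = by_hamming[: k * oversample]
    # Stage 2: exact scores on full-precision vectors, for the shortlist only.
    reranked = sorted(shortlist, key=lambda i: dot(query, vectors[i]), reverse=True)
    return reranked[:k]


vectors = [[1, 1, 1, 1], [1, 1, 1, -1], [-1, -1, -1, -1], [2, 2, 2, 2]]
codes = [sign_bits(v) for v in vectors]
top = two_stage_search([1, 1, 1, 1], vectors, codes, k=1, oversample=2)
```

In this toy data, vectors 0 and 3 tie in stage 1 because they share the same sign pattern; only the full-precision rerank can tell them apart. A larger `oversample` trades extra rerank work for better recall.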

SQ for an easy 4x. PQ for 32x or more, with care. Binary for the right embedding model. Matryoshka for runtime flexibility. Two-stage retrieval rescues recall.