Model weights live as tensors in memory and as files on disk. The disk format determines load speed, mmap-ability, and quantization support. Three formats dominate: SafeTensors, GGUF, and PyTorch's native .pt/.bin. Knowing them helps you debug 'why does loading take 30 seconds'.

SafeTensors

[8-byte header length, little-endian]
[JSON header with tensor metadata]
[raw tensor bytes, contiguous, aligned]

Hugging Face's format. Single file, no Python pickle (avoids arbitrary code execution). Mmap-friendly. The JSON header records each tensor's name, dtype, shape, and byte offsets into the data section. Used as the default for most Hugging Face model uploads since 2023.
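To make the layout concrete, here is a minimal sketch that builds a tiny SafeTensors-style blob by hand and reads it back using only the standard library. The helper names (build_safetensors, read_header, tensor_bytes) and the tensor names "w" and "b" are made up for illustration; real code would normally use the safetensors library instead.

```python
import json
import struct


def build_safetensors(tensors):
    """Serialize {name: (dtype, shape, raw_bytes)} into the three-part layout."""
    header, payload = {}, b""
    for name, (dtype, shape, raw) in tensors.items():
        # data_offsets are relative to the start of the data section
        header[name] = {
            "dtype": dtype,
            "shape": shape,
            "data_offsets": [len(payload), len(payload) + len(raw)],
        }
        payload += raw
    header_bytes = json.dumps(header, separators=(",", ":")).encode("utf-8")
    # 8-byte little-endian header length, then JSON header, then raw bytes
    return struct.pack("<Q", len(header_bytes)) + header_bytes + payload


def read_header(blob):
    """Return (parsed JSON header, absolute offset where tensor data starts)."""
    (header_len,) = struct.unpack_from("<Q", blob, 0)
    header = json.loads(blob[8:8 + header_len].decode("utf-8"))
    return header, 8 + header_len


def tensor_bytes(blob, name):
    """Slice out the raw bytes of one tensor using the header offsets."""
    header, data_start = read_header(blob)
    begin, end = header[name]["data_offsets"]
    return blob[data_start + begin:data_start + end]


blob = build_safetensors({
    "w": ("F32", [2, 2], struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)),
    "b": ("F32", [2], struct.pack("<2f", 0.5, -0.5)),
})
header, data_start = read_header(blob)
w = struct.unpack("<4f", tensor_bytes(blob, "w"))
b = struct.unpack("<2f", tensor_bytes(blob, "b"))
```

Because the header is plain JSON with explicit offsets, a reader can locate any tensor without deserializing the others, which is what makes the format easy to mmap.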

GGUF — for llama.cpp

[magic 'GGUF' + version]
[metadata KV pairs: arch, hparams, tokenizer]
[tensor info: name, shape, dtype, offset]
[aligned tensor data]

Self-describing: metadata, tokenizer, and weights in one file. Mmap-friendly. Supports quantization variants (Q4_K_M, Q5_K_M, etc.) inline. Used by llama.cpp, Ollama, LM Studio. The de facto local LLM format.
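As a rough illustration of the fixed-size prefix, the sketch below parses the magic, version, and the two counts that precede the metadata. It assumes the little-endian layout used by recent GGUF versions (32-bit version, then 64-bit tensor count and 64-bit metadata KV count); the function name parse_gguf_prefix and the sample counts are made up, and the bytes are synthetic rather than taken from a real model file.

```python
import struct

GGUF_MAGIC = b"GGUF"


def parse_gguf_prefix(blob):
    """Parse the fixed 20-byte prefix of a GGUF file (assumed v2+ layout)."""
    if blob[:4] != GGUF_MAGIC:
        raise ValueError("not a GGUF file: bad magic")
    # "<IQQ" = little-endian uint32 version, uint64 tensor count, uint64 KV count
    version, tensor_count, kv_count = struct.unpack_from("<IQQ", blob, 4)
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


# Synthetic prefix: version 3, 291 tensors, 24 metadata entries (made-up numbers)
sample = GGUF_MAGIC + struct.pack("<IQQ", 3, 291, 24)
info = parse_gguf_prefix(sample)

try:
    parse_gguf_prefix(b"NOPE" + sample[4:])
    rejected = False
except ValueError:
    rejected = True
```

The metadata KV pairs and tensor info records follow this prefix; parsing them requires handling typed values and length-prefixed strings, which is omitted here.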

PyTorch native (.bin / .pt)

Uses Python's pickle protocol, so loading a file can execute arbitrary code: a security risk with untrusted checkpoints. Typically slower to load than SafeTensors. Files follow a naming convention: pytorch_model.bin for a single file, pytorch_model-00001-of-00004.bin and so on when sharded. Being phased out in favor of SafeTensors.
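A harmless demonstration of why pickle-based checkpoints are risky: any object can tell pickle which callable to invoke when it is loaded. The Payload class here is a made-up example that asks the loader to evaluate an arithmetic expression; a malicious file could name any importable function instead.

```python
import pickle


class Payload:
    """Hypothetical object whose pickle form runs code chosen by its author."""

    def __reduce__(self):
        # pickle will call eval("6 * 7") at load time and return the result
        return (eval, ("6 * 7",))


untrusted_bytes = pickle.dumps(Payload())

# The loader never asked to evaluate anything, yet the expression runs
result = pickle.loads(untrusted_bytes)
```

Here the result is just the number 42 instead of a Payload instance. SafeTensors avoids this whole class of problem because its header is JSON and its body is raw bytes, so there is nothing to execute.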

Loading speed: mmap is the trick

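# Naive: read all bytes into memory, then construct tensors
# Slow for big models, can require 2× RAM during load

# mmap: map file directly into virtual memory
import mmap
import numpy as np

with open('model.safetensors', 'rb') as f:
    weights = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
# offset (absolute byte position) and shape come from the parsed header; float32 assumed
tensor = np.frombuffer(weights, dtype=np.float32, count=int(np.prod(shape)), offset=offset).reshape(shape)
# No copy: pages fault in on access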

With mmap, cold-start time is the time to read the bytes you actually use, paged in by the OS on demand. For dense inference, every token touches all the weights anyway, so the full model still ends up being read from disk. But process start is fast even on a slow disk.
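The zero-copy view can be tried with the standard library alone. This sketch writes eight floats to a temporary file, maps it, and reads three of them through a memoryview without first copying the whole file into a bytes object. The function name read_floats_mmap and the file name weights.bin are made up, and native-endian float32 data is assumed.

```python
import mmap
import os
import struct
import tempfile


def read_floats_mmap(path, offset, count):
    """Read `count` native float32 values starting at byte `offset` via mmap."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            end = offset + 4 * count
            # Views are released in reverse order before the mapping is closed
            with memoryview(mm) as whole, whole[offset:end] as window, \
                    window.cast("f") as view:
                # Only the pages backing this window need to be faulted in
                return list(view)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "weights.bin")
    with open(path, "wb") as f:
        f.write(struct.pack("8f", *[float(i) for i in range(8)]))
    # Skip the first four floats (16 bytes) and read the next three
    values = read_floats_mmap(path, offset=16, count=3)
```

A real loader would keep the mapping open and hand out tensor views that live as long as the model does; the sketch copies into a list only so the mapping can be closed cleanly.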

Sharding for big models

Models larger than a few GB are sharded across multiple files. Hugging Face: model-00001-of-00007.safetensors plus an index file (model.safetensors.index.json) mapping tensor names to shard files. SafeTensors and GGUF both support sharding. This allows partial download and partial load.
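The index file is what makes partial loading practical: given the tensors you want, you can work out which shards to open. The sketch below uses a hand-written index in the shape of Hugging Face's weight_map layout; the tensor names, shard count, total_size value, and the helper shards_for are invented for illustration.

```python
import json
from collections import defaultdict

# Hand-written stand-in for an index file (names and sizes are made up)
index_json = """
{
  "metadata": {"total_size": 48},
  "weight_map": {
    "embed.weight": "model-00001-of-00002.safetensors",
    "layers.0.attn.weight": "model-00001-of-00002.safetensors",
    "layers.1.attn.weight": "model-00002-of-00002.safetensors",
    "lm_head.weight": "model-00002-of-00002.safetensors"
  }
}
"""


def shards_for(index, wanted):
    """Group the wanted tensor names by the shard file that holds them."""
    by_shard = defaultdict(list)
    for name in wanted:
        by_shard[index["weight_map"][name]].append(name)
    return dict(by_shard)


index = json.loads(index_json)
plan = shards_for(index, ["embed.weight", "layers.0.attn.weight"])
all_shards = sorted(set(index["weight_map"].values()))
```

In this example both requested tensors live in the first shard, so the second file never needs to be downloaded or opened.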

SafeTensors for HF, GGUF for llama.cpp ecosystem. Both mmap-friendly. Avoid raw .pt for security. Sharding for big models.