Three attention patterns dominate transformer designs: causal (decoder-only LLMs), bidirectional (encoder-only models like BERT), and prefix-LM (bidirectional over the input, causal over the output, the same pattern encoder-decoder models follow). Each fits different tasks, and knowing which one a task needs narrows your model choice.
Causal attention — decoder-only
Each token attends only to itself and earlier tokens; future positions are masked out. Used by GPT, Llama, Claude, Mistral, and nearly every modern general-purpose LLM. Right for: generation, completion, chat. The most flexible choice for general-purpose deployment.
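A minimal sketch of the causal mask described above, in plain Python. The names `causal_mask` and `render` are illustrative, not from any library; real implementations typically build the same lower-triangular pattern as a tensor and apply it to the attention scores before the softmax.

```python
def causal_mask(n):
    """mask[i][j] is True when query position i may attend to key position j."""
    # Position i sees itself and everything before it, never anything after.
    return [[j <= i for j in range(n)] for i in range(n)]


def render(mask):
    """Draw the mask as a grid: 'x' = may attend, '.' = masked out."""
    return "\n".join(" ".join("x" if ok else "." for ok in row) for row in mask)


print(render(causal_mask(4)))
# x . . .
# x x . .
# x x x .
# x x x x
```

Row i is the view from token i: the last row sees the whole sequence, the first row sees only itself.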
Bidirectional attention — encoder-only
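Each token attends to all positions, left and right. Used by BERT, RoBERTa, and sentence-transformers models. Right for: classification, embeddings, NER. Gives better representations for understanding tasks, because every token sees its full context, but cannot generate text autoregressively.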
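To make the "full context" point concrete, here is a small sketch of a bidirectional mask and what one token can see under it. The helper names and the example sentence are made up for illustration.

```python
def bidirectional_mask(n):
    """Every query position may attend to every key position."""
    return [[True] * n for _ in range(n)]


def visible_tokens(tokens, mask, i):
    """Tokens that position i is allowed to attend to under the given mask."""
    return [tok for tok, ok in zip(tokens, mask[i]) if ok]


tokens = ["the", "bank", "of", "the", "river"]
mask = bidirectional_mask(len(tokens))

# "bank" (position 1) sees "river" to its right, which disambiguates it.
print(visible_tokens(tokens, mask, 1))
# ['the', 'bank', 'of', 'the', 'river']
```

Under a causal mask the same token would see only "the" and "bank", which is one intuition for why encoders tend to give better representations for understanding tasks.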
Prefix-LM — the hybrid
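Bidirectional attention over the prompt (the prefix), causal attention over the generated tokens. Encoder-decoder models such as T5 and the original Transformer produce the same effective pattern with a separate bidirectional encoder and a causal decoder that cross-attends to it; a strict prefix-LM such as UniLM does it in a single stack using only a mask. Right for: translation, summarization, structured generation. Lost popularity to decoder-only models, which match its quality on most tasks.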
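The single-stack version of this pattern can be sketched as one mask. The function name `prefix_lm_mask` and the parameter `prefix_len` are assumptions for illustration; the rule is simply "any key inside the prefix is visible to everyone, everything else follows the causal rule".

```python
def prefix_lm_mask(n, prefix_len):
    """mask[i][j] is True when query position i may attend to key position j.

    The first `prefix_len` positions are the prompt and are visible to every
    position (bidirectional). Later positions follow the causal rule.
    """
    if not 0 <= prefix_len <= n:
        raise ValueError("prefix_len must be between 0 and n")
    return [[j < prefix_len or j <= i for j in range(n)] for i in range(n)]


def render(mask):
    """Draw the mask as a grid: 'x' = may attend, '.' = masked out."""
    return "\n".join(" ".join("x" if ok else "." for ok in row) for row in mask)


# 3 prompt tokens followed by 2 generated tokens.
print(render(prefix_lm_mask(5, 3)))
# x x x . .
# x x x . .
# x x x . .
# x x x x .
# x x x x x
```

With `prefix_len=0` this reduces to the causal mask, and with `prefix_len=n` it becomes fully bidirectional, so the two earlier patterns are the endpoints of this one.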
Why decoder-only won
The same training objective (next-token prediction) covers every task, so one model and one training recipe scale together. The architecture is simpler (no separate encoder). The bitter-lesson conclusion: scale plus a unified objective beats specialized architectures.
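One way to see why a single objective can cover many tasks: any task written out as a token sequence yields (context, next token) training pairs in exactly the same way. The sketch below uses hand-split toy token lists and made-up prompt formats; neither comes from a real tokenizer or dataset.

```python
def next_token_pairs(tokens):
    """Return (context, target) pairs: predict tokens[i] from tokens[:i]."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


# Hypothetical text formatting of a translation example.
translation = ["translate", "fr->en", ":", "chat", "=>", "cat"]
# Hypothetical text formatting of a chat turn.
chat = ["user", ":", "hi", "assistant", ":", "hello"]

# Both tasks go through the same function; nothing is task-specific.
for example in (translation, chat):
    for context, target in next_token_pairs(example):
        print(" ".join(context), "->", target)
```

In practice the loss is often restricted to the answer tokens, but the objective itself stays the same across tasks.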
When to use which today
Generation: decoder-only LLM. Embeddings or classification: bidirectional encoder (BERT family or modern fine-tunes). Specific seq2seq tasks such as translation: encoder-decoder if you have a quality reason; otherwise a prompted decoder-only LLM usually works.