Causal vs Bidirectional Attention

Three attention patterns dominate transformer designs: causal (decoder-only LLMs), bidirectional (encoder-only models like BERT), and prefix-LM (a hybrid that is bidirectional over the prompt and causal over the output, closely related to encoder-decoder models). Each fits different tasks, and knowing which pattern a task needs narrows your model choice.

Causal attention — decoder-only

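Each token attends only to itself and earlier tokens; future positions are masked out before the softmax. Used by GPT, Llama, Claude, Mistral, and nearly every other modern generative LLM. Right for: generation, completion, chat. Because most tasks can be phrased as text continuation, it is also the most flexible choice for general-purpose deployment.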
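As a minimal sketch of the masking step, assuming plain Python lists stand in for attention score matrices (the helper names `causal_mask` and `masked_softmax` are illustrative, not taken from any library):

```python
import math

def causal_mask(n):
    """mask[i][j] is True when query position i may attend to key position j."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    """Row-wise softmax that gives masked-out positions exactly zero weight."""
    out = []
    for row, allowed in zip(scores, mask):
        visible = [s for s, ok in zip(row, allowed) if ok]
        peak = max(visible)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(row, allowed)]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

# Uniform scores make the effect of the mask easy to see:
# each row spreads its weight evenly over the positions it is allowed to see.
scores = [[0.0] * 4 for _ in range(4)]
weights = masked_softmax(scores, causal_mask(4))
for row in weights:
    print(row)
```

Real implementations usually add a large negative value to the masked scores instead of filtering them, but the resulting attention weights follow the same lower-triangular pattern.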

Bidirectional attention — encoder-only

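Each token attends to every position, to its left and to its right. Used by BERT, RoBERTa, and most sentence-transformers models. Right for: classification, embeddings, NER. Seeing the full context tends to give better representations for understanding tasks, but these models are not trained to generate text left to right.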
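A rough sketch of how this looks for an embedding use case, assuming a padded batch element and mean pooling over real tokens (one common pooling choice, not the only one). The names `bidirectional_mask` and `mean_pool` and the toy vectors are made up for illustration:

```python
def bidirectional_mask(is_real):
    """mask[i][j] is True when key position j is a real (non-padding) token.

    There is no ordering constraint: every query sees every real token.
    """
    return [[bool(key_ok) for key_ok in is_real] for _ in is_real]

def mean_pool(token_vectors, is_real):
    """Average the vectors of real tokens into one fixed-size embedding."""
    kept = [v for v, ok in zip(token_vectors, is_real) if ok]
    dim = len(kept[0])
    return [sum(v[d] for v in kept) / len(kept) for d in range(dim)]

is_real = [1, 1, 1, 0]  # the last position is padding
mask = bidirectional_mask(is_real)

# Toy per-token output vectors; a real encoder would produce these.
vectors = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0], [9.0, 9.0]]
embedding = mean_pool(vectors, is_real)
print(mask[0])     # the first token can see later tokens, but not padding
print(embedding)
```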

Prefix-LM — the hybrid

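Bidirectional attention on the prompt, causal on the generation. A strict prefix-LM does this inside a single stack (UniLM, and the prefix-LM variant studied in the T5 paper). Encoder-decoder models such as T5 and the original Transformer get the same visibility pattern from a bidirectional encoder plus a causal decoder that cross-attends to it. Right for: translation, summarization, structured generation. Lost popularity to decoder-only models, which match its quality on most tasks.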
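The single-stack version of this pattern can be sketched as one mask, assuming the first `prefix_len` positions are the prompt (the function name and parameter are illustrative):

```python
def prefix_lm_mask(n, prefix_len):
    """mask[i][j] is True when position i may attend to position j.

    Every position sees the whole prefix; positions after the prefix
    additionally see themselves and earlier generated positions.
    """
    return [[j < prefix_len or j <= i for j in range(n)] for i in range(n)]

mask = prefix_lm_mask(5, prefix_len=3)
for row in mask:
    print("".join("x" if ok else "." for ok in row))
```

Setting `prefix_len` to 0 recovers the causal mask, and setting it to `n` recovers the fully bidirectional one, which is one way to see the three patterns as points on the same spectrum.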

Why decoder-only won

One training objective (next-token prediction) covers every task, so the same model and data pipeline scale together. The architecture is simpler, with no separate encoder. This is the bitter-lesson conclusion: scale plus a unified objective beats specialized architectures.
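To make the objective concrete, here is a small sketch of the shift-by-one setup and the resulting loss. The `predict` argument is a hypothetical stand-in for a model, and `uniform_model` is a toy that assigns equal probability to every token:

```python
import math

def shift_for_next_token(token_ids):
    """Split a sequence into inputs and targets offset by one position."""
    return token_ids[:-1], token_ids[1:]

def next_token_loss(token_ids, predict):
    """Mean negative log-likelihood of each token given the tokens before it.

    `predict` takes a prefix (list of token ids) and returns a list of
    probabilities over the vocabulary. It is an assumed interface.
    """
    inputs, targets = shift_for_next_token(token_ids)
    total = 0.0
    for t, target in enumerate(targets):
        probs = predict(inputs[: t + 1])  # only tokens up to position t are visible
        total -= math.log(probs[target])
    return total / len(targets)

VOCAB_SIZE = 4

def uniform_model(prefix):
    """Toy model: every token is equally likely regardless of the prefix."""
    return [1.0 / VOCAB_SIZE] * VOCAB_SIZE

tokens = [0, 2, 1, 3]
inputs, targets = shift_for_next_token(tokens)
loss = next_token_loss(tokens, uniform_model)
print(inputs, targets, loss)
```

A translation pair, a question with its answer, or a chat transcript can all be flattened into one token sequence and fed through the same loss, which is what makes the objective uniform across tasks.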

When to use which today

Generation: decoder-only LLM. Embeddings or classification: bidirectional encoder (BERT family or modern fine-tunes). Specific seq2seq tasks such as translation: encoder-decoder if you have a measured quality reason; otherwise a prompted or fine-tuned decoder-only LLM works.

In short: causal for generation, bidirectional for embeddings and classification, encoder-decoder only rarely. Decoder-only won the scaling race.