Masked LM

Mask 15% of tokens. Predict masked tokens from bidirectional context. Encoder sees full sentence.

Advertisement

NSP

Next-sentence prediction: two sentences, predict if consecutive. Later shown unnecessary (RoBERTa dropped it).

Advertisement

Fine-tuning

Add task-specific head. Train few epochs on downstream data. Beat previous SOTA on 11 tasks with same architecture.

Descendants

RoBERTa (better training). DeBERTa (disentangled attention). ELECTRA (replaced-token detection). DistilBERT (compressed).