Masked LM
Mask 15% of tokens. Predict masked tokens from bidirectional context. Encoder sees full sentence.
Advertisement
NSP
Next-sentence prediction: two sentences, predict if consecutive. Later shown unnecessary (RoBERTa dropped it).
Advertisement
Fine-tuning
Add task-specific head. Train few epochs on downstream data. Beat previous SOTA on 11 tasks with same architecture.
Descendants
RoBERTa (better training). DeBERTa (disentangled attention). ELECTRA (replaced-token detection). DistilBERT (compressed).