Demonstrates how the different "heads" in a multi-head attention mechanism each specialize in a different linguistic feature, such as grammatical structure, pronoun reference, or subject-verb relationships.
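The mechanics behind this can be sketched in a few lines of NumPy. This is a minimal illustration, not a trained model: the token list, the dimensions, and the helper name `per_head_attention` are all assumptions made for the example, and the weights are random. It only shows that each head works on its own slice of the query and key projections and so produces its own attention pattern over the same sentence. The specialization described above would emerge from training, which this sketch does not do.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def per_head_attention(x, w_q, w_k, num_heads):
    """Return attention weights of shape (num_heads, seq_len, seq_len).

    Hypothetical helper for illustration only.
    """
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    # Project the inputs, then split the feature axis into separate heads.
    q = (x @ w_q).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    k = (x @ w_k).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    # Scaled dot-product scores, computed independently for each head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    return softmax(scores, axis=-1)

rng = np.random.default_rng(0)
tokens = ["the", "cat", "sat", "down"]  # placeholder sentence
d_model, num_heads = 8, 2               # assumed toy dimensions

# Random stand-ins for token embeddings and learned projection matrices.
x = rng.normal(size=(len(tokens), d_model))
w_q = rng.normal(size=(d_model, d_model))
w_k = rng.normal(size=(d_model, d_model))

attn = per_head_attention(x, w_q, w_k, num_heads)

for h in range(num_heads):
    print(f"head {h}:")
    print(np.round(attn[h], 2))
```

Each printed matrix has one row per token, and each row sums to 1: it is that token's distribution of attention over the sentence. The two heads print different matrices because they use different parts of the projections. In a trained model, inspecting these per-head matrices is one way to see which relationship a given head has picked up.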