Multi-Headed Self Attention — By Hand | by Daniel Warfield | Jul, 2024


Hand computing the cornerstone of modern AI

Daniel Warfield
Towards Data Science
“Focus” By Daniel Warfield using MidJourney. All images by the author unless otherwise specified. Article originally made available on Intuitively and Exhaustively Explained.

Multi-headed attention is likely the most important architectural paradigm in machine learning. This summary goes over all critical mathematical operations within multi-headed self attention, allowing you to understand its inner workings at a fundamental level. If you’d like to learn more about the intuition behind this topic, check out the IAEE article.

Multi-headed self attention (MHSA) is used in a variety of contexts, each of which might format the input differently. In a natural language processing context, one would likely use a word-to-vector embedding, paired with positional encoding, to calculate a vector that represents each word. Generally, regardless of the type of data, multi-headed self attention expects a sequence of vectors, where each vector represents one element of the input, such as a word.
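
As a rough illustration of that input format, here is a minimal sketch of turning a few words into a sequence of vectors. Everything in it is an assumption made for the example: the three-word vocabulary, the vector size of 4, the randomly filled `table` (a real model would learn these values), and the sinusoidal positional encoding, which is one common choice rather than the only one.

```python
import math
import random


def sinusoidal_position(pos, dim):
    """Return a dim-length positional vector for index pos (sinusoidal scheme)."""
    vec = []
    for i in range(dim):
        # Each pair of dimensions (2k, 2k+1) shares one frequency.
        angle = pos / (10000 ** (2 * (i // 2) / dim))
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec


def embed_sequence(words, table, dim):
    """Map each word to its embedding plus the vector for its position."""
    sequence = []
    for pos, word in enumerate(words):
        pe = sinusoidal_position(pos, dim)
        sequence.append([e + p for e, p in zip(table[word], pe)])
    return sequence


rng = random.Random(0)
DIM = 4  # hypothetical vector size, kept tiny for readability
words = ["the", "cat", "sat"]
# Hypothetical embedding table with random values; a real model learns these.
table = {w: [rng.uniform(-1, 1) for _ in range(DIM)] for w in words}

inputs = embed_sequence(words, table, DIM)
print(len(inputs), len(inputs[0]))  # 3 4
```

The result, `inputs`, is a list of three vectors of length four: one vector per word, which is the shape of input that multi-headed self attention expects.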


