Attention Pattern Matrix
An Attention Pattern Matrix is a matrix that represents the relevance or importance of each Token to every other Token within the Attention Mechanism of Transformer Models.
- Context:
- It can have each column reflect the Attention Distribution for a specific token, indicating the extent to which each other token influences its updated embedding.
- It can be computed by performing a Softmax on the Dot Products between each token's Query Vector and all the Key Vectors, normalizing these values so that each column's entries lie between 0 and 1 and sum to 1.
- It can be applied as a Column-Wise Matrix Operation, with the softmax computed independently for each column.
- It can have a size that grows with the square of the Context Size (Sequence Length), which can become a bottleneck when scaling to very long sequences.
- It can incorporate masking during LLM Training to prevent tokens from attending to subsequent tokens in sequence prediction tasks, by setting the entries where the key position follows the query position to negative infinity before applying the softmax, so that those weights become zero.
- It can use the Weighted Sums of the Value Vectors, where the weights are taken from the attention pattern, to update the token embeddings to be context-aware (see the sketch after this list).
- ...
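The following is a minimal NumPy sketch (not taken from this page) of how such a matrix can be formed and used. The function name attention_pattern, the column-per-token array shapes, and the scaling by the square root of the head dimension (as in standard scaled dot-product attention) are illustrative assumptions rather than details fixed by this page.

```python
import numpy as np

def attention_pattern(Q, K, V, causal=True):
    """Sketch: column-wise attention pattern matrix and context-aware embeddings.

    Q, K, V have shape (d_head, seq_len): one column per token, matching the
    column-per-token convention described above (illustrative assumption).
    """
    d_head, seq_len = Q.shape

    # Dot product of every Key Vector with every Query Vector, scaled by
    # sqrt(d_head) as in standard scaled dot-product attention (assumption):
    # scores[i, j] = (k_i . q_j) / sqrt(d_head); rows = keys, columns = queries.
    scores = (K.T @ Q) / np.sqrt(d_head)

    if causal:
        # Masking: forbid any token from attending to subsequent tokens by
        # setting entries whose key position follows the query position to -inf.
        key_pos = np.arange(seq_len)[:, None]    # row index    = key position
        query_pos = np.arange(seq_len)[None, :]  # column index = query position
        scores = np.where(key_pos > query_pos, -np.inf, scores)

    # Softmax down each column: entries become non-negative and sum to 1,
    # so each column is the Attention Distribution for one token.
    scores = scores - scores.max(axis=0, keepdims=True)  # numerical stability
    pattern = np.exp(scores)
    pattern = pattern / pattern.sum(axis=0, keepdims=True)

    # Each updated embedding is a Weighted Sum of the Value Vectors, with the
    # weights drawn from the corresponding column of the pattern.
    updated = V @ pattern                        # shape (d_head, seq_len)
    return pattern, updated

# Toy usage: 8 tokens with a 64-dimensional attention head.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((64, 8)) for _ in range(3))
pattern, updated = attention_pattern(Q, K, V)
print(pattern.shape, pattern.sum(axis=0))  # (8, 8); each column sums to 1
```

Because the pattern is a seq_len-by-seq_len array, both its memory footprint and the cost of forming it grow quadratically with the Context Size, which is the scaling bottleneck noted above.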
- Example(s):
- an Attention Pattern in Machine Translation that effectively captures the relevance of source language words to each target language word during translation.
- an Attention Pattern in Document Summarization that identifies key sentences or phrases relevant to the overall context of the document.
- ...
- Counter-Example(s):
- Kernel Methods, which use predefined kernel functions to compute similarities, as opposed to learning dynamic, context-based patterns like attention mechanisms.
- ...
- See: Attention Mechanism, Transformer (Model), Query Vector, Key Vector, Value Vector, Softmax Function, Masking (Machine Learning).