Grouped Query Attention (GQA) Mechanism


A Grouped Query Attention (GQA) Mechanism is an attention mechanism that improves the inference efficiency of Large Language Models by sharing each key and value head among a group of query heads, reducing the memory and bandwidth spent on keys and values while retaining most of the quality of full multi-head attention.



References

2024

  Input: Query, Key, and Value tensors; number of query heads H and number of key-value heads G (H a multiple of G)
  Output: Attention output tensor
  - Project the input into H query heads but only G key-value heads
  - For each query head:
    - Match it with the key-value head shared by its group of H/G query heads
    - Compute attention scores and the attention output
  - Concatenate the outputs from all query heads
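
The following is a minimal runnable sketch of the procedure above in PyTorch. The function name, tensor shapes, and the repeat_interleave-based head matching are illustrative assumptions rather than a reference implementation; it assumes the number of query heads is a multiple of the number of key-value heads.

  import torch

  def grouped_query_attention(q, k, v):
      # q: (batch, n_q_heads, seq_len, head_dim)
      # k, v: (batch, n_kv_heads, seq_len, head_dim), n_q_heads % n_kv_heads == 0
      batch, n_q_heads, seq_len, head_dim = q.shape
      n_kv_heads = k.shape[1]
      group_size = n_q_heads // n_kv_heads

      # Match each query head with its group's key-value head by repeating
      # every KV head once per group member (consecutive grouping assumed).
      k = k.repeat_interleave(group_size, dim=1)
      v = v.repeat_interleave(group_size, dim=1)

      # Scaled dot-product attention, computed per query head.
      scores = (q @ k.transpose(-2, -1)) / head_dim ** 0.5
      weights = scores.softmax(dim=-1)
      out = weights @ v  # (batch, n_q_heads, seq_len, head_dim)

      # Concatenate the outputs from all heads along the feature dimension.
      return out.transpose(1, 2).reshape(batch, seq_len, n_q_heads * head_dim)

  # Example: 8 query heads sharing 2 key-value heads (groups of 4).
  q = torch.randn(2, 8, 16, 64)
  k = torch.randn(2, 2, 16, 64)
  v = torch.randn(2, 2, 16, 64)
  print(grouped_query_attention(q, k, v).shape)  # torch.Size([2, 16, 512])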

2024

  • GPT-4
    • The Grouped Query Attention (GQA) mechanism is a development in attention mechanisms for transformer models that combines elements of Multi-Head Attention (MHA) and Multi-Query Attention (MQA), improving computational efficiency without significantly compromising model performance.
    • GQA divides the query heads of a traditional multi-head model into several groups, with each group sharing a single key and value head. This setup can be tuned to the needs of the model: GQA-1, with a single group, is equivalent to MQA; GQA-H, where every query head forms its own group, is equivalent to MHA; and GQA-G allows any number of groups G between the two extremes (see the sketch after this list).
    • This approach reduces the memory overhead of storing separate keys and values for each head, which is especially beneficial in models handling large context windows or batch sizes. For example, GQA has been applied in DeciLM, which varies the GQA configuration across layers to optimize the speed-accuracy tradeoff, a layer-specific grouping that further improves transformer efficiency.
    • Overall, GQA offers a flexible trade-off between computational efficiency and model expressiveness, providing a way to balance the quality of attention with the need for speed in large-scale models.
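
To make the grouping spectrum and its memory effect concrete, here is a back-of-the-envelope sketch in Python; the head count, head dimension, layer count, and fp16 element size below are illustrative assumptions, not figures from any particular model.

  # Per-token KV-cache size: 2 tensors (K and V) x n_kv_heads x head_dim
  # x n_layers x bytes per element.
  def kv_cache_bytes_per_token(n_kv_heads, head_dim=128, n_layers=32, elem_bytes=2):
      return 2 * n_kv_heads * head_dim * n_layers * elem_bytes

  H = 32  # number of query heads
  for name, groups in [("GQA-1 (= MQA)", 1), ("GQA-G, G=8", 8), ("GQA-H (= MHA)", H)]:
      assert H % groups == 0  # each group of H/groups query heads shares one KV head
      print(f"{name}: {kv_cache_bytes_per_token(groups) // 1024} KiB per token")

Under these assumptions the per-token cache falls from 512 KiB (MHA) to 128 KiB with eight groups to 16 KiB (MQA), which is the storage reduction described above; a scheme like DeciLM's varies the group count per layer to tune this trade-off.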

2023