In the book I'm reading, I see the following passage:
"Additionally, we added an output projection layer (self.out_proj) to MultiHeadAttention after combining the heads, which is not present in the CausalAttention class. This output projection layer is not strictly necessary (see appendix B for more details), but it is commonly used in many LLM architectures, which is why I added it here for completeness."
I tried using the same d_out, setting dropout to 0.0, and commenting out self.out_proj, and it turns out that MultiHeadAttention and CausalAttention generate similar results. I asked an AI, which told me:
"Multi-head attention ≈ a way to allow the model to learn multiple types of relationships in parallel and recombine them. Without the out_proj, multi-head is not much more expressive than a single head with the same dimensionality, which is why your outputs look nearly identical."
If what the AI says is correct, then self.out_proj should be emphasized more strongly, rather than described as not strictly necessary.
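For concreteness, here is roughly what my test looked like. These are simplified stand-ins I wrote for this post (not the book's exact CausalAttention and MultiHeadAttention listings), with dropout left at 0.0 and no out_proj in the multi-head variant:

```python
import torch
import torch.nn as nn

torch.manual_seed(123)

class SingleHeadCausal(nn.Module):
    """Stand-in for a single-head causal attention class (one full-width head, no dropout)."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.W_query = nn.Linear(d_in, d_out, bias=False)
        self.W_key = nn.Linear(d_in, d_out, bias=False)
        self.W_value = nn.Linear(d_in, d_out, bias=False)

    def forward(self, x):
        b, num_tokens, _ = x.shape
        queries, keys, values = self.W_query(x), self.W_key(x), self.W_value(x)
        attn_scores = queries @ keys.transpose(1, 2) / keys.shape[-1] ** 0.5
        mask = torch.triu(torch.ones(num_tokens, num_tokens, dtype=torch.bool), diagonal=1)
        attn_scores = attn_scores.masked_fill(mask, float("-inf"))
        return torch.softmax(attn_scores, dim=-1) @ values


class MultiHeadNoProj(nn.Module):
    """Stand-in for a multi-head causal attention class with out_proj commented out."""
    def __init__(self, d_in, d_out, num_heads):
        super().__init__()
        assert d_out % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads
        self.W_query = nn.Linear(d_in, d_out, bias=False)
        self.W_key = nn.Linear(d_in, d_out, bias=False)
        self.W_value = nn.Linear(d_in, d_out, bias=False)
        # self.out_proj = nn.Linear(d_out, d_out)  # removed for this experiment

    def forward(self, x):
        b, num_tokens, _ = x.shape

        def split(t):  # (b, tokens, d_out) -> (b, heads, tokens, head_dim)
            return t.view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)

        queries = split(self.W_query(x))
        keys = split(self.W_key(x))
        values = split(self.W_value(x))
        attn_scores = queries @ keys.transpose(2, 3) / self.head_dim ** 0.5
        mask = torch.triu(torch.ones(num_tokens, num_tokens, dtype=torch.bool), diagonal=1)
        attn_scores = attn_scores.masked_fill(mask, float("-inf"))
        context = torch.softmax(attn_scores, dim=-1) @ values      # (b, heads, tokens, head_dim)
        return context.transpose(1, 2).reshape(b, num_tokens, -1)  # concat heads; no out_proj


x = torch.randn(2, 6, 8)                       # toy batch: 2 sequences, 6 tokens, d_in=8
single = SingleHeadCausal(d_in=8, d_out=8)
multi = MultiHeadNoProj(d_in=8, d_out=8, num_heads=2)
print(single(x).shape, multi(x).shape)         # both torch.Size([2, 6, 8])
```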
-
Thanks for the feedback!
If I understand correctly: when removing the output projection layer, using 1 large head (via the CausalAttention class) versus multiple smaller heads that add up to the same size (via the MultiHeadAttention class) leads to approximately the same results?
I think this could be because we use a very simple dataset and short training here. In practice, at a larger scale, I don't think this will be true. I.e., I think the AI answer is incorrect.
That's because a single head with the same overall dimensionality is not equivalent to multi-head attention: the single head has just one Q/K/V projection, while multi-head attention has n_heads sets of them, so the attention matrices can differ between the heads. I.e., the different heads can learn to pay attention to different aspects.
Regarding the out_proj, it can be useful for "combining" the heads in the sense that it adds a learned linear transformation on top of the concatenated head outputs. But I was saying that it is not strictly necessary because this paper found that it can be removed without affecting the loss.
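To illustrate the point about the attention matrices with a quick toy example (random, untrained weights and made-up dimensions, not code from the book): viewing the same query/key projections as per-head slices yields n_heads distinct attention matrices instead of the single one you get from a full-width head.

```python
import torch
import torch.nn as nn

torch.manual_seed(42)
num_tokens, d_in, num_heads, head_dim = 5, 8, 2, 4
d_out = num_heads * head_dim

x = torch.randn(1, num_tokens, d_in)
W_query = nn.Linear(d_in, d_out, bias=False)
W_key = nn.Linear(d_in, d_out, bias=False)
queries, keys = W_query(x), W_key(x)

# Single large head: one attention matrix computed over the full d_out width.
single_attn = torch.softmax(queries @ keys.transpose(1, 2) / d_out ** 0.5, dim=-1)
print(single_attn.shape)        # torch.Size([1, 5, 5]) -- one pattern for all dimensions

# Multi-head view of the same projections: num_heads separate attention matrices.
qh = queries.view(1, num_tokens, num_heads, head_dim).transpose(1, 2)
kh = keys.view(1, num_tokens, num_heads, head_dim).transpose(1, 2)
head_attn = torch.softmax(qh @ kh.transpose(2, 3) / head_dim ** 0.5, dim=-1)
print(head_attn.shape)                                   # torch.Size([1, 2, 5, 5])
print(torch.allclose(head_attn[0, 0], head_attn[0, 1]))  # False: the heads attend differently
```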
-
You are absolutely right. I redid the test with both large and small dimensions and found that, even for small dimensions and even when sharing W_query, W_key, and W_value between the multi-head and single-head classes, the sub-dimension (per-head) operations of multi-head attention result in different attn_scores in the two classes. Even without masking, scaling, or softmax, the final matrix multiplication in multi-head attention multiplies each head's sub-dimension attn_scores with the corresponding sub-dimension slice of the values and then concatenates along the last dimension (d_out), which leads to a fundamentally and meaningfully different output.
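Here is a minimal version of the check I ran (toy tensors standing in for the shared projected queries/keys/values; no masking, scaling, or softmax):

```python
import torch

torch.manual_seed(0)
num_tokens, d_out, num_heads = 4, 8, 2
head_dim = d_out // num_heads

# Pretend these are the already-projected queries, keys, and values,
# shared between the single-head and multi-head variants.
queries = torch.randn(num_tokens, d_out)
keys = torch.randn(num_tokens, d_out)
values = torch.randn(num_tokens, d_out)

# Single head: one attn_scores matrix over all d_out dimensions.
attn_scores_full = queries @ keys.T               # (4, 4)
out_single = attn_scores_full @ values            # (4, 8)

# Multi-head: per-head attn_scores over head_dim sub-dimensions only.
qh = queries.view(num_tokens, num_heads, head_dim).transpose(0, 1)   # (2, 4, 4)
kh = keys.view(num_tokens, num_heads, head_dim).transpose(0, 1)
vh = values.view(num_tokens, num_heads, head_dim).transpose(0, 1)
attn_scores_heads = qh @ kh.transpose(1, 2)                          # (2, 4, 4), one per head
out_multi = (attn_scores_heads @ vh).transpose(0, 1).reshape(num_tokens, d_out)

# The full-width scores equal the sum of the per-head scores ...
print(torch.allclose(attn_scores_full, attn_scores_heads.sum(dim=0)))  # True
# ... but each head applies only its own scores to its own value slice,
# so the concatenated multi-head output differs from the single-head output.
print(torch.allclose(out_single, out_multi))                           # False
```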
The AI also corrected its initial answer.