
Is self.out_proj necessary? #804

Jessen-Li started this conversation in General

In the book I'm reading, I came across the following passage:

"Additionally, we added an output projection layer (self.out_proj) to MultiHeadAttention after combining the heads, which is not present in the CausalAttention class. This output projection layer is not strictly necessary (see appendix B for more details), but it is commonly used in many LLM architectures, which is why I added it here for completeness."

I tried using the same d_out, setting dropout to 0.0, and commenting out self.out_proj, and it turns out that MultiHeadAttention and CausalAttention generate similar results. I asked an AI, which told me:

"Multi-head attention ≈ a way to allow the model to learn multiple types of relationships in parallel and recombine them. Without the out_proj, multi-head is not much more expressive than a single head with the same dimensionality, which is why your outputs look nearly identical."

If what the AI says is correct, then the importance of self.out_proj should be stressed differently in the book.
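For reference, here is a rough sketch of the kind of comparison I ran. It assumes the chapter 3 CausalAttention and MultiHeadAttention classes are importable (the import path below is just a placeholder for wherever you keep the book's code), and it disables the output projection by swapping it for nn.Identity instead of editing the class:

```python
# Rough sketch of the comparison described above (not the book's exact code).
# Assumption: CausalAttention and MultiHeadAttention are the chapter 3 classes
# with signatures (d_in, d_out, context_length, dropout[, num_heads]); the
# import below is a placeholder, not an official package path.
import torch
import torch.nn as nn

from ch03 import CausalAttention, MultiHeadAttention  # placeholder import

torch.manual_seed(123)
d_in, d_out, context_length = 3, 4, 6
x = torch.rand(2, context_length, d_in)  # (batch, tokens, d_in)

ca = CausalAttention(d_in, d_out, context_length, dropout=0.0)
mha = MultiHeadAttention(d_in, d_out, context_length, dropout=0.0, num_heads=2)

# Disable the output projection without editing the class:
mha.out_proj = nn.Identity()

print(ca(x).shape)   # torch.Size([2, 6, 4])
print(mha(x).shape)  # torch.Size([2, 6, 4]) -- same shape, though the values
                     # are not expected to match in general
```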


Replies: 1 comment 1 reply


Thanks for the feedback!

If I understand correctly: when the output projection layer is removed, using 1 large head (via the CausalAttention class) versus multiple smaller heads that add up to the same size (via the MultiHeadAttention class) leads to approximately the same results?

I think this could be because we use a very simple dataset and short training here. In practice, at a larger scale, I don't think this will be true. I.e., I think the AI answer is incorrect.

That's because a single head with the same overall dimensionality is not equivalent to multi-head attention: the single head has just one Q/K/V projection, while multi-head attention has num_heads sets of them, so the attention matrices can differ between the heads. I.e., the different heads can learn to pay attention to different aspects.

Regarding the out_proj, it can be useful for "combining" the heads in the sense that it adds a learned linear transformation over the concatenated head outputs. But I was saying that it is not strictly necessary because this paper found that it can be removed without affecting the loss.
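To build some intuition for what out_proj adds, here is a small self-contained sketch (plain PyTorch, not the book's class): without it, each output dimension comes from exactly one head; with it, every output dimension is a learned mix of all heads.

```python
# Self-contained intuition sketch (not the book's code): out_proj is just a
# learned linear layer over the concatenated head outputs, so it lets every
# output dimension mix information from all heads.
import torch
import torch.nn as nn

torch.manual_seed(0)
num_tokens, num_heads, head_dim = 4, 2, 3
d_out = num_heads * head_dim

# Pretend these are the two heads' context vectors for one sequence.
head_outputs = [torch.rand(num_tokens, head_dim) for _ in range(num_heads)]
concat = torch.cat(head_outputs, dim=-1)  # (num_tokens, d_out)

# Without out_proj: columns 0-2 come only from head 0, columns 3-5 only from head 1.
print(torch.equal(concat[:, :head_dim], head_outputs[0]))  # True

# With out_proj: a learned (d_out x d_out) transformation recombines the heads,
# so each output column generally depends on both heads.
out_proj = nn.Linear(d_out, d_out)
mixed = out_proj(concat)
print(mixed.shape)  # torch.Size([4, 6])
```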

1 reply

You are absolutely right. I redid the test with both large and small dimensions and found that even for small dimensions, and even when sharing W_key, W_query, and W_value between the multi-head and single-head versions, the per-head (sub-dimension) computation of multi-head attention produces different attn_scores than the single-head class. Even without masking, scaling, or softmax, multi-head attention multiplies the sub-dimension attn_scores by the sub-dimension slices of the values and then concatenates along the last dimension (d_out), which leads to a fundamentally and meaningfully different output.
The AI also corrected its initial answer.
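For anyone else following along, here is a minimal self-contained version of that test (plain tensor math, not the book's classes): the query/key/value projections are shared, yet the per-head attention scores and the concatenated context vectors already differ from the single-head results before any masking, scaling, or softmax.

```python
# Minimal self-contained version of the test described above (not the book's
# code): share W_query/W_key/W_value between a single-head and a multi-head
# computation and compare the raw attention scores and context vectors.
import torch

torch.manual_seed(123)
num_tokens, d_in, d_out, num_heads = 5, 4, 6, 2
head_dim = d_out // num_heads

x = torch.rand(num_tokens, d_in)
W_query = torch.rand(d_in, d_out)
W_key = torch.rand(d_in, d_out)
W_value = torch.rand(d_in, d_out)

q, k, v = x @ W_query, x @ W_key, x @ W_value  # shared projections

# Single head: one (num_tokens x num_tokens) score matrix.
scores_single = q @ k.T
ctx_single = scores_single @ v  # no masking/scaling/softmax, as in the test above

# Multi-head: split the same q/k/v into head_dim-sized chunks, one per head.
qh = q.view(num_tokens, num_heads, head_dim).transpose(0, 1)  # (heads, tokens, head_dim)
kh = k.view(num_tokens, num_heads, head_dim).transpose(0, 1)
vh = v.view(num_tokens, num_heads, head_dim).transpose(0, 1)
scores_multi = qh @ kh.transpose(1, 2)                        # one score matrix per head
ctx_multi = (scores_multi @ vh).transpose(0, 1).reshape(num_tokens, d_out)

print(torch.allclose(scores_single, scores_multi[0]))  # False
print(torch.allclose(ctx_single, ctx_multi))           # False
```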

