# 1.4 Attention Block - Python Example

The attention mechanism in Transformers is a powerful mathematical framework that enables models to focus on different parts of the input sequence, allowing for better understanding of context and relationships within the data. This is particularly useful in tasks such as natural language processing and image recognition.

## Mathematical Intuition of Attention Block

### Key Concepts

1. **Queries, Keys, and Values**: In the context of attention, each input is transformed into three vectors:
   - **Query (Q)**: Represents the item for which we want to find relevant information.
   - **Key (K)**: Represents the items in the input that can provide information.
   - **Value (V)**: Represents the actual information associated with each key.

2. **Scaled Dot-Product Attention**: The attention score between queries and keys is computed using the dot product, scaled by the square root of the dimension of the key vectors, followed by a softmax operation to obtain attention weights. The output is then a weighted sum of the value vectors.

   The formula for the attention mechanism can be summarized as:

   $$
   \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
   $$

   where $d_k$ is the dimension of the key vectors (a minimal function sketch of this formula appears right after this list).

3. **Multi-Head Attention**: Instead of performing a single attention function, multiple attention heads are used. Each head learns a different representation by applying the attention mechanism independently, and the per-head outputs are then concatenated.
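
The sketch referenced above is a direct NumPy rendering of the scaled dot-product formula. It is a minimal illustration, not part of the original example; the helper names `softmax` and `scaled_dot_product_attention` are arbitrary, and the row-wise max subtraction is a standard numerical-stability trick.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max before exponentiating for numerical stability
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # raw attention scores, shape (n_queries, n_keys)
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted sum of values, shape (n_queries, d_v)

# Quick usage with random toy data
rng = np.random.default_rng(0)
Q, K, V = rng.random((3, 2)), rng.random((3, 2)), rng.random((3, 2))
print(scaled_dot_product_attention(Q, K, V))
```

The step-by-step example in the next section expands this same computation, starting from raw token vectors and explicit weight matrices.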

### End-to-End Process Example

To illustrate the attention mechanism, we can implement a simple version using Python and NumPy. Below is a step-by-step example.

```python
import numpy as np

# Define input dimensions
d_model = 4    # Dimension of the model
d_k = 2        # Dimension of keys and queries
d_v = 2        # Dimension of values
num_heads = 2  # Number of attention heads (not used in this single-head example)

# Sample input data (3 tokens in the sequence, each represented by a vector of size d_model)
X = np.array([[1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 1, 1, 1]])

# Randomly initialize weight matrices for queries, keys, and values
W_Q = np.random.rand(d_model, d_k)
W_K = np.random.rand(d_model, d_k)
W_V = np.random.rand(d_model, d_v)

# Compute queries, keys, and values. @ is the matrix multiplication operator.
Q = X @ W_Q
K = X @ W_K
V = X @ W_V

# Compute attention scores: dot products of queries and keys, scaled by sqrt(d_k)
scores = Q @ K.T / np.sqrt(d_k)

# Row-wise softmax so each query's weights sum to 1
attention_weights = np.exp(scores) / np.sum(np.exp(scores), axis=1, keepdims=True)

# Compute the output as the attention-weighted sum of the values
output = attention_weights @ V

print("Queries:\n", Q)
print("Keys:\n", K)
print("Values:\n", V)
print("Attention Weights:\n", attention_weights)
print("Output:\n", output)
```

### Explanation of the Code

1. **Input Data**: We define a simple input matrix `X` representing three tokens, each with a feature vector of size `d_model`.

2. **Weight Matrices**: Random weight matrices `W_Q`, `W_K`, and `W_V` are initialized for transforming the input into queries, keys, and values.

3. **Computing Q, K, V**: The input matrix is multiplied by the corresponding weight matrices to obtain the queries, keys, and values.

4. **Attention Scores**: The dot product of queries and keys is computed, scaled, and passed through a softmax function to obtain attention weights.

5. **Output Calculation**: The final output is computed as a weighted sum of the values based on the attention weights.
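
As a quick sanity check on the listing above (a small addition, not part of the original example), the following lines can be run right after it to confirm that each query's attention weights form a probability distribution and that there is one output vector per input token:

```python
# Each row of attention_weights is a softmax output, so it sums to 1
assert np.allclose(attention_weights.sum(axis=1), 1.0)

# One output vector of size d_v per input token
assert output.shape == (X.shape[0], d_v)
```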

This example demonstrates the core functionality of the attention mechanism, capturing the relationships between different tokens in the input sequence. Multi-head attention can be implemented similarly by repeating the process for multiple sets of weight matrices and concatenating the results, as sketched below.
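
The following is a minimal sketch of that idea, written as a continuation of the listing above (it reuses `X`, `d_model`, `d_k`, `d_v`, and `num_heads`). The final output projection `W_O` is an illustrative assumption about how full Transformers mix the concatenated heads, not something described in the text above.

```python
# Repeat the single-head computation once per head, each with its own weights
heads = []
for _ in range(num_heads):
    W_Q_h = np.random.rand(d_model, d_k)
    W_K_h = np.random.rand(d_model, d_k)
    W_V_h = np.random.rand(d_model, d_v)

    Q_h = X @ W_Q_h
    K_h = X @ W_K_h
    V_h = X @ W_V_h

    scores_h = Q_h @ K_h.T / np.sqrt(d_k)
    weights_h = np.exp(scores_h) / np.sum(np.exp(scores_h), axis=1, keepdims=True)
    heads.append(weights_h @ V_h)

# Concatenate the per-head outputs along the feature axis: shape (3, num_heads * d_v)
multi_head_output = np.concatenate(heads, axis=-1)

# Illustrative assumption: a final linear projection W_O mixes the concatenated heads
W_O = np.random.rand(num_heads * d_v, d_model)
multi_head_output = multi_head_output @ W_O

print("Multi-Head Output:\n", multi_head_output)
```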