The attention mechanism in Transformers enables a model to focus on the most relevant parts of the input sequence, capturing context and relationships within the data. This is particularly useful in tasks such as natural language processing and image recognition.

## Mathematical Intuition of Attention Block

### Key Concepts

1. **Queries, Keys, and Values**: In the context of attention, each input is transformed into three vectors:
   - **Query (Q)**: Represents the item for which we want to find relevant information.
   - **Key (K)**: Represents the items in the input that can provide information.
   - **Value (V)**: Represents the actual information associated with each key.

2. **Scaled Dot-Product Attention**: The attention score between queries and keys is computed using the dot product, scaled by the square root of the dimension of the key vectors, followed by a softmax operation to obtain attention weights. The output is then a weighted sum of the value vectors.

    The formula for the attention mechanism can be summarized as:

    $$
    \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
    $$

    where $d_k$ is the dimension of the key vectors. A small worked example follows this list.

3. **Multi-Head Attention**: Instead of performing a single attention function, multiple attention heads are used. Each head learns a different representation by applying the attention mechanism independently with its own projection matrices, and the outputs of all heads are then concatenated.
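
To make the formula concrete, here is a small worked example with made-up numbers: take $d_k = 2$, a single query $q = [1, 0]$, keys $k_1 = [1, 0]$ and $k_2 = [0, 1]$, and values $v_1 = [1, 2]$ and $v_2 = [3, 4]$.

$$
\frac{qK^T}{\sqrt{d_k}} = \left[\tfrac{1}{\sqrt{2}},\ 0\right] \approx [0.71,\ 0], \qquad \text{softmax}\big([0.71,\ 0]\big) \approx [0.67,\ 0.33]
$$

$$
\text{output} \approx 0.67\,v_1 + 0.33\,v_2 \approx [1.66,\ 2.66]
$$

The key most similar to the query receives the larger weight, so the output is a blend of the values that leans toward $v_1$.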

### End-to-End Process Example

To illustrate the attention mechanism, we can implement a simple single-head version using Python and NumPy. Below is a step-by-step example.

```python
import numpy as np

# Define input dimensions
d_model = 4    # Dimension of the model (size of each token's feature vector)
d_k = 2        # Dimension of keys and queries
d_v = 2        # Dimension of values
num_heads = 2  # Number of attention heads (used in the multi-head extension discussed later)

# Sample input data (3 tokens in the sequence, each represented by a vector of size d_model)
X = np.array([[1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 1, 1, 1]])

# Randomly initialize weight matrices for queries, keys, and values
W_Q = np.random.rand(d_model, d_k)
W_K = np.random.rand(d_model, d_k)
W_V = np.random.rand(d_model, d_v)

# Compute queries, keys, and values. @ is the matrix multiplication operator.
Q = X @ W_Q
K = X @ W_K
V = X @ W_V

# Compute attention scores, scaled by sqrt(d_k)
scores = Q @ K.T / np.sqrt(d_k)
# Row-wise softmax turns the scores into attention weights
attention_weights = np.exp(scores) / np.sum(np.exp(scores), axis=1, keepdims=True)

# Compute the output as a weighted sum of the values
output = attention_weights @ V

print("Queries:\n", Q)
print("Keys:\n", K)
print("Values:\n", V)
print("Attention Weights:\n", attention_weights)
print("Output:\n", output)
```

### Explanation of the Code

1. **Input Data**: We define a simple input matrix `X` representing three tokens, each with a feature vector of size `d_model`.

2. **Weight Matrices**: Random weight matrices `W_Q`, `W_K`, and `W_V` are initialized for transforming the input into queries, keys, and values.

3. **Computing Q, K, V**: The input matrix is multiplied by the corresponding weight matrices to obtain the queries, keys, and values.

4. **Attention Scores**: The dot product of queries and keys is computed, scaled by $\sqrt{d_k}$, and passed through a softmax function to obtain attention weights; a quick sanity check on these weights follows this list.

5. **Output Calculation**: The final output is computed as a weighted sum of the values based on the attention weights.
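
As a quick check of step 4, the snippet below can be appended to the end of the script above (it reuses `attention_weights`, `output`, `X`, and `d_v` from that example): each row of the attention weights is a probability distribution over the three tokens, so the rows should sum to 1, and the output should contain one `d_v`-dimensional vector per token.

```python
# Appended after the single-head example above.
# Each row of attention_weights should sum to 1 (softmax over the keys),
# and the output should have shape (number of tokens, d_v).
assert np.allclose(attention_weights.sum(axis=1), 1.0)
assert output.shape == (X.shape[0], d_v)  # (3, 2)

print("Row sums:", attention_weights.sum(axis=1))
print("Output shape:", output.shape)
```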

This example demonstrates the core functionality of the attention mechanism, capturing the relationships between different tokens in the input sequence. Multi-head attention can be implemented similarly by repeating the process with multiple sets of weight matrices and concatenating the results, as sketched below.
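
Below is a minimal sketch of that multi-head extension, reusing the same random-projection setup as the single-head example; the helper names `softmax`, `multi_head_attention`, and the output projection `W_O` are illustrative choices for this sketch, not a reference implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the row max before exponentiating
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, num_heads=2, d_k=2, d_v=2, seed=0):
    """Toy multi-head attention with randomly initialized projections."""
    rng = np.random.default_rng(seed)
    d_model = X.shape[1]
    head_outputs = []
    for _ in range(num_heads):
        # Each head gets its own Q/K/V projection matrices
        W_Q = rng.random((d_model, d_k))
        W_K = rng.random((d_model, d_k))
        W_V = rng.random((d_model, d_v))
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
        head_outputs.append(weights @ V)              # (seq_len, d_v)
    # Concatenate the heads and project back to d_model
    concat = np.concatenate(head_outputs, axis=-1)    # (seq_len, num_heads * d_v)
    W_O = rng.random((num_heads * d_v, d_model))
    return concat @ W_O                               # (seq_len, d_model)

X = np.array([[1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 1, 1, 1]], dtype=float)
print(multi_head_attention(X, num_heads=2).shape)  # (3, 4)
```

In a real Transformer the projection matrices are learned parameters rather than random draws, and the softmax is usually implemented in the numerically stable form shown here.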