To illustrate how the attention block in a Transformer works, let's walk through the process step by step using a sample sentence: embedding the words, applying the attention mechanism, and finally computing a probability distribution over the next word from the attention output.

### Example Sentence

Let's take the sentence: **"Life is short"**.

### Step 1: Word Embedding

First, we convert each word into an embedding vector. For simplicity, we'll use random embeddings; in a real model these would be learned.

```python
import numpy as np

# Define the sentence and create a dictionary for word indices
sentence = "Life is short"
words = sentence.split()
word_to_index = {word: i for i, word in enumerate(words)}

# Create random embeddings for each word
embedding_dim = 4  # Dimension of the embedding
embeddings = np.random.rand(len(words), embedding_dim)

print("Word Indices:", word_to_index)
print("Word Embeddings:\n", embeddings)
```

### Step 2: Compute Queries, Keys, and Values

The attention mechanism computes queries (Q), keys (K), and values (V) from the embeddings using three weight matrices. In a trained Transformer these matrices are learned; here we initialize them randomly to keep the example self-contained.

```python
# Initialize weight matrices for Q, K, and V
W_Q = np.random.rand(embedding_dim, embedding_dim)
W_K = np.random.rand(embedding_dim, embedding_dim)
W_V = np.random.rand(embedding_dim, embedding_dim)

# Compute Q, K, V
Q = embeddings @ W_Q
K = embeddings @ W_K
V = embeddings @ W_V

print("Queries (Q):\n", Q)
print("Keys (K):\n", K)
print("Values (V):\n", V)
```

### Step 3: Compute Attention Scores

Next, we calculate the attention scores as the dot product of the queries and keys, scaled by the square root of the key dimension, and then apply a row-wise softmax to turn the scores into attention weights.

```python
# Compute attention scores
scores = Q @ K.T / np.sqrt(embedding_dim)  # Scale by the square root of the key dimension
attention_weights = np.exp(scores) / np.sum(np.exp(scores), axis=1, keepdims=True)  # Softmax over keys

print("Attention Scores:\n", scores)
print("Attention Weights:\n", attention_weights)
```
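One practical caveat: the inline softmax above exponentiates the raw scores directly, which can overflow when the scores get large. A numerically stable variant subtracts the row-wise maximum before exponentiating; the shift cancels out, so the weights are identical. The sketch below uses a small helper of our own (NumPy has no built-in softmax).

```python
def softmax(x, axis=-1):
    # Subtract the row-wise max so np.exp never sees very large values;
    # the shift cancels in the ratio, leaving the weights unchanged.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp_x = np.exp(shifted)
    return exp_x / np.sum(exp_x, axis=axis, keepdims=True)

# Drop-in replacement for the inline computation above
attention_weights = softmax(scores, axis=1)
```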

### Step 4: Compute Output of the Attention Block

The output of the attention block is computed as a weighted sum of the values, using the attention weights.

```python
# Compute the output of the attention block
output = attention_weights @ V

print("Output of Attention Block:\n", output)
```
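As an aside, steps 2 through 4 are often packaged as a single routine. The sketch below simply restates the code above as one function, under the same assumptions (a single attention head and the randomly initialized weight matrices from this example).

```python
def attention_block(X, W_Q, W_K, W_V):
    # X: (seq_len, embedding_dim) token embeddings
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # scaled dot-product scores
    weights = np.exp(scores) / np.sum(np.exp(scores), axis=1, keepdims=True)  # softmax over keys
    return weights @ V  # weighted sum of values

# Reproduces the step-by-step result above for the same inputs
print("Attention block output:\n", attention_block(embeddings, W_Q, W_K, W_V))
```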

### Step 5: Probability Distribution for Next Word

To predict the next word, we apply a simple linear layer followed by a softmax to the output of the attention block. This simulates how a language model turns the attention output into next-word probabilities; for simplicity, the "vocabulary" here is just the three words of the sentence rather than a full vocabulary.

```python
# Initialize weights for the output layer
# (the output "vocabulary" is just the three sentence words, for simplicity)
W_out = np.random.rand(embedding_dim, len(words))

# Compute logits
logits = output @ W_out

# Compute probabilities using softmax
probabilities = np.exp(logits) / np.sum(np.exp(logits), axis=1, keepdims=True)

print("Logits:\n", logits)
print("Probability Distribution for Next Word:\n", probabilities)
```

### Summary of the Process

1. **Word Embedding**: Convert words into embedding vectors.
2. **Compute Q, K, V**: Use weight matrices (learned during training) to compute queries, keys, and values from the embeddings.
3. **Attention Scores**: Calculate scores from the scaled dot product of queries and keys, then apply softmax to obtain attention weights.
4. **Output of Attention Block**: Compute the output as a weighted sum of the values based on the attention weights.
5. **Next Word Probability**: Generate a probability distribution for the next word using a linear transformation followed by softmax.

### Final Output

The final output is a probability distribution over the next word for each position in the input sentence; the row for the last token is the one a language model would use to predict the next word. Because every weight in this example is random rather than trained, the numbers themselves are arbitrary, but the computation shows how attention lets the model weigh the context and relationships between words when producing that distribution.

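As a final illustration, here is one way to pull out the last position's distribution and pick the most likely next word. The `index_to_word` helper below is our own addition, built from the `word_to_index` mapping defined earlier; with random, untrained weights the predicted word is of course meaningless.

```python
# The probabilities matrix has one row per input position; the last row is the
# distribution for the word that would follow the full sentence.
next_word_probs = probabilities[-1]

# Invert the word_to_index mapping so we can report the predicted word
index_to_word = {i: word for word, i in word_to_index.items()}
predicted_word = index_to_word[int(np.argmax(next_word_probs))]

print("Next-word distribution (last position):", next_word_probs)
print("Predicted next word:", predicted_word)
```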