To illustrate how the attention block in a Transformer works, let's break the process down step by step using a sample sentence, its word embeddings, and the attention mechanism. We will also show how to compute a probability distribution over the next word from the attention output.

### Example Sentence

Let's take the sentence: **"Life is short"**.

### Step 1: Word Embedding

First, we need to convert the words into embedding vectors. For simplicity, we'll use random embeddings for each word.

```python
import numpy as np

# Define the sentence and create a dictionary for word indices
sentence = "Life is short"
words = sentence.split()
word_to_index = {word: i for i, word in enumerate(words)}

# Create random embeddings for each word
embedding_dim = 4  # Dimension of the embedding
embeddings = np.random.rand(len(words), embedding_dim)

print("Word Indices:", word_to_index)
print("Word Embeddings:\n", embeddings)
```
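
As an aside, in a trained Transformer these vectors come from a learned embedding table indexed by token id, not from fresh random numbers. A minimal sketch of that lookup, reusing `word_to_index` from above (the table here is still random and only stands in for trained weights, so it is functionally equivalent to the block above):

```python
# Sketch: embedding lookup as a trained model would do it (weights are a random stand-in)
vocab_size = len(word_to_index)
embedding_table = np.random.rand(vocab_size, embedding_dim)

token_ids = [word_to_index[w] for w in words]
embeddings = embedding_table[token_ids]  # one row per token: shape (3, 4)
```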

### Step 2: Compute Queries, Keys, and Values

In the attention mechanism, we compute the queries (Q), keys (K), and values (V) from the embeddings by multiplying them with weight matrices. In a trained model these matrices are learned; here we initialize them randomly.

```python
# Initialize weight matrices for Q, K, and V
W_Q = np.random.rand(embedding_dim, embedding_dim)
W_K = np.random.rand(embedding_dim, embedding_dim)
W_V = np.random.rand(embedding_dim, embedding_dim)

# Compute Q, K, V
Q = embeddings @ W_Q
K = embeddings @ W_K
V = embeddings @ W_V

print("Queries (Q):\n", Q)
print("Keys (K):\n", K)
print("Values (V):\n", V)
```
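
A quick shape check (not in the original code) makes the bookkeeping explicit: with three tokens and `embedding_dim = 4`, each of Q, K, and V keeps one row per token.

```python
# (3, 4) @ (4, 4) -> (3, 4): one query/key/value vector per word
print(Q.shape, K.shape, V.shape)  # (3, 4) (3, 4) (3, 4)
```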

### Step 3: Compute Attention Scores

Next, we calculate the attention scores as the scaled dot product of the queries and keys, then apply a softmax to obtain the attention weights.

```python
# Compute attention scores
scores = Q @ K.T / np.sqrt(embedding_dim)  # Scale by the square root of the key/query dimension
attention_weights = np.exp(scores) / np.sum(np.exp(scores), axis=1, keepdims=True)  # Row-wise softmax

print("Attention Scores:\n", scores)
print("Attention Weights:\n", attention_weights)
```
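
The softmax above is written in the most direct form. With larger score values it is common to subtract the row-wise maximum before exponentiating so the exponentials cannot overflow; this variant is mathematically equivalent (an aside, not part of the original walk-through):

```python
# Numerically stable softmax: shifting each row by its max does not change the result
shifted = scores - scores.max(axis=1, keepdims=True)
attention_weights = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
```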

### Step 4: Compute Output of the Attention Block

The output of the attention block is computed as a weighted sum of the values, using the attention weights.

```python
# Compute the output of the attention block
output = attention_weights @ V

print("Output of Attention Block:\n", output)
```
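
Since each row of `attention_weights` sums to 1, each row of `output` is a weighted average (convex combination) of the value vectors, i.e. a context-aware representation of the corresponding word. A quick sanity check, added here for illustration:

```python
# Each row of the attention weights is a probability distribution over the three tokens
print(attention_weights.sum(axis=1))  # approximately [1. 1. 1.]
```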

### Step 5: Probability Distribution for the Next Word

To predict the next word, we apply a simple linear layer followed by a softmax to the output of the attention block. For simplicity, the output layer here projects onto the three words of our sentence as a toy vocabulary; a real model projects onto its full vocabulary, and the row corresponding to the last token gives the distribution for the next word.

```python
# Initialize weights for the output layer (the "vocabulary" here is just the 3 words of the sentence)
W_out = np.random.rand(embedding_dim, len(words))

# Compute logits: one row of vocabulary scores per input position
logits = output @ W_out

# Compute probabilities using softmax
probabilities = np.exp(logits) / np.sum(np.exp(logits), axis=1, keepdims=True)

print("Logits:\n", logits)
print("Probability Distribution for Next Word:\n", probabilities)
```
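
To turn this into an actual prediction, we read off the distribution for the last position and take its most likely word. With untrained random weights the choice is meaningless, but the mechanics match what a real model does:

```python
# The last row corresponds to the final input token, so it models the next word
index_to_word = {i: w for w, i in word_to_index.items()}
next_word_probs = probabilities[-1]
predicted_word = index_to_word[int(np.argmax(next_word_probs))]
print("Predicted next word:", predicted_word)
```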

### Summary of the Process

1. **Word Embedding**: Convert the words into embedding vectors.
2. **Compute Q, K, V**: Use weight matrices (learned in a trained model, random here) to compute queries, keys, and values from the embeddings.
3. **Attention Scores**: Calculate scores from the scaled dot product of queries and keys, then apply softmax to obtain attention weights.
4. **Output of Attention Block**: Compute the output as a weighted sum of the values based on the attention weights.
5. **Next Word Probability**: Generate a probability distribution for the next word using a linear transformation followed by softmax.

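The five steps can also be tied together into a single forward pass. A minimal sketch under the same assumptions as above (random weights, a single attention head, and the sentence's three words as a toy vocabulary):

```python
def attention_next_word_probs(embeddings, W_Q, W_K, W_V, W_out):
    """Toy single-head attention followed by a next-word softmax (illustrative only)."""
    d_k = W_K.shape[1]
    Q, K, V = embeddings @ W_Q, embeddings @ W_K, embeddings @ W_V        # Step 2: projections
    scores = Q @ K.T / np.sqrt(d_k)                                       # Step 3: scaled dot product
    weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # softmax
    context = weights @ V                                                 # Step 4: weighted sum of values
    logits = context @ W_out                                              # Step 5: project to vocabulary
    return np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

probs = attention_next_word_probs(embeddings, W_Q, W_K, W_V, W_out)
print("Next-word distribution (last position):", probs[-1])
```
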
### Final Output

The final output is a probability distribution over candidate next words, based on the attention mechanism applied to the input sentence. Because the attention output mixes information across all positions, this distribution reflects the context and relationships between the words rather than any single word in isolation.