Commit 9bdaf97: Create 1.5 MLP Block - Python Example.md (parent a83c6ee)

## Multi-Layer Perceptron (MLP) in Transformers

The Multi-Layer Perceptron (MLP) is a key component of the Transformer architecture, responsible for refining the representation of each token using a non-linear transformation. Here's the mathematical intuition behind the MLP in Transformers:

### Mathematical Formulation

The MLP in Transformers operates across the features of each token, applying the same non-linear transformation to each token independently. Given the output of the self-attention layer `y^{(m)}_n` for token `n` at layer `m`, the MLP computes:

$$
x^{(m+1)}_n = \text{MLP}_\theta\left(y^{(m)}_n\right)
$$

where `\theta` represents the parameters of the MLP, which are shared across all tokens.

The MLP typically consists of one or two hidden layers with a width equal to the number of features `D` (or larger; a hidden width of `4D` is a common choice in practice). The computational cost of this step is roughly `N * D * D`, where `N` is the sequence length.
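
As a back-of-the-envelope sketch of that `N * D * D` scaling, the snippet below plugs in illustrative sizes (the values of `N`, `D`, and the hidden width here are assumptions for the sake of the arithmetic, not values from the example that follows):

```python
# Rough cost estimate for one MLP block (illustrative sizes only)
N = 1024          # sequence length (number of tokens)
D = 512           # feature dimension per token
hidden = 4 * D    # an assumed hidden-layer width

# Two linear layers (D -> hidden -> D), applied independently to each of the N tokens
mults_per_token = D * hidden + hidden * D
total_mults = N * mults_per_token
print(f"roughly {total_mults:,} multiplications per MLP block")  # ~2.1 billion
```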

### Example Implementation in Python and NumPy

Here's a simple example of implementing the MLP component of a Transformer using Python and NumPy:

```python
import numpy as np

np.random.seed(0)  # fix the seed so the random weights are reproducible

# Define MLP parameters
D = 4            # Number of features per token
hidden_size = 8  # Size of the hidden layer

# Sample input from the self-attention layer: 3 tokens with D features each
y = np.array([[1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 1, 1, 1]])

# Initialize MLP weights and biases (shared across all tokens)
W1 = np.random.rand(D, hidden_size)
b1 = np.random.rand(1, hidden_size)
W2 = np.random.rand(hidden_size, D)
b2 = np.random.rand(1, D)

# Compute MLP output
h = np.maximum(0, y @ W1 + b1)  # ReLU activation in the hidden layer
x = h @ W2 + b2                 # Linear output layer

print("Input from self-attention layer:\n", y)
print("Output of the MLP:\n", x)
```

In this example:

1. We define the MLP parameters, including the number of features `D` and the size of the hidden layer.
2. We create a sample input `y` from the self-attention layer (three tokens, each with `D = 4` features).
3. We initialize the weights and biases of the MLP randomly.
4. We compute the output of the MLP by applying the following steps:
   - Compute the hidden-layer activation using a ReLU non-linearity.
   - Apply the output-layer weights and bias to obtain the final output.
5. Finally, we print the input from the self-attention layer and the output of the MLP.
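
Because `W1`, `b1`, `W2`, and `b2` are shared across tokens, running the MLP on each token row separately gives exactly the same result as the batched computation above. The sketch below (reusing the arrays from the example) verifies that equivalence:

```python
# Apply the same MLP to each token independently and compare with the batched result
x_per_token = np.stack([
    np.maximum(0, token @ W1 + b1) @ W2 + b2  # identical transformation for every token
    for token in y
]).squeeze(1)  # each per-token result has shape (1, D); drop the singleton axis

assert np.allclose(x, x_per_token)  # matches the batched output row for row
```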

The MLP in Transformers acts as a non-linear feature extractor, processing the output of the self-attention layer independently for each token. It helps capture complex interactions between features and refine the representations learned by the self-attention mechanism.
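
As a side note, many real Transformer implementations (e.g. BERT and GPT-2) use a GELU activation instead of ReLU for this hidden layer. A minimal drop-in variant of the hidden-layer computation, using the common tanh approximation of GELU, might look like this:

```python
def gelu(z):
    # Tanh approximation of the GELU activation (Hendrycks & Gimpel, 2016)
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

h_gelu = gelu(y @ W1 + b1)   # replaces the ReLU hidden layer from the example above
x_gelu = h_gelu @ W2 + b2    # the linear output layer is unchanged
```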