## Multi-Layer Perceptron (MLP) in Transformers

The Multi-Layer Perceptron (MLP) is a key component of the Transformer architecture, responsible for refining the representation of each token through a non-linear transformation. Here's the mathematical intuition behind the MLP in Transformers:

### Mathematical Formulation

The MLP in Transformers operates across the features of each token, applying the same non-linear transformation to every token independently. Given the output of the self-attention layer $y^{(m)}_n$ for token $n$ at layer $m$, the MLP computes:

$$
x^{(m+1)}_n = \text{MLP}_\theta(y^{(m)}_n)
$$

where $\theta$ denotes the parameters of the MLP, which are shared across all tokens.
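
For example, with a single hidden layer and a ReLU non-linearity (the form used in the NumPy example below; the weight names $W_1, b_1, W_2, b_2$ correspond to `W1`, `b1`, `W2`, `b2` in that code and are otherwise illustrative), the MLP works out to:

$$
\text{MLP}_\theta\left(y^{(m)}_n\right) = W_2 \, \max\left(0,\; W_1 \, y^{(m)}_n + b_1\right) + b_2,
\qquad \theta = \{W_1, b_1, W_2, b_2\}
$$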

The MLP typically consists of one or two hidden layers whose width is at least the number of features `D` (in the original Transformer it is `4 * D`). Because the same MLP is applied to each of the `N` tokens independently, the computational cost of this step is roughly `N * D * D` (up to the constant factor set by the hidden width), where `N` is the sequence length.
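
As a rough sanity check on that scaling (a minimal sketch: the sequence length, feature dimension, and hidden width below are assumed example values, not taken from the text), one can count the multiply-accumulates in the two matrix products:

```python
# Back-of-the-envelope cost of the MLP in one Transformer layer.
# N, D and H are assumed example values, not prescribed by the text.
N = 1024      # sequence length
D = 768       # feature dimension
H = 4 * D     # hidden width (4 * D, as in the original Transformer)

# Two matrix multiplies over the whole sequence:
#   (N, D) @ (D, H) and (N, H) @ (H, D),
# each costing about 2 * N * D * H floating-point operations.
flops = 2 * N * D * H + 2 * N * H * D   # = 4 * N * D * H, i.e. proportional to N * D * D
print(f"Approximate MLP FLOPs per layer: {flops:.2e}")  # about 9.7e+09
```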

### Example Implementation in Python and NumPy

Here's a simple example of implementing the MLP component of a Transformer using Python and NumPy:

```python
import numpy as np

np.random.seed(0)  # fix the seed so the random weights are reproducible

# Define MLP dimensions
D = 4            # number of features per token
hidden_size = 8  # width of the hidden layer

# Sample input from the self-attention layer: one row per token
y = np.array([[1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [1., 1., 1., 1.]])

# Initialize MLP weights and biases (the same parameters are used for every token)
W1 = np.random.rand(D, hidden_size)
b1 = np.random.rand(1, hidden_size)
W2 = np.random.rand(hidden_size, D)
b2 = np.random.rand(1, D)

# Compute the MLP output, applied row-wise (i.e. independently per token)
h = np.maximum(0, y @ W1 + b1)  # ReLU activation in the hidden layer
x = h @ W2 + b2                 # linear output layer

print("Input from self-attention layer:\n", y)
print("Output of the MLP:\n", x)
```

In this example:

1. We define the MLP dimensions: the number of features `D` and the size of the hidden layer.

2. We create a sample input `y` from the self-attention layer, with one row per token.

3. We initialize the weights and biases of the MLP randomly; the same parameters are shared by every token.

4. We compute the output of the MLP in two steps:
   - Compute the hidden-layer activation using a ReLU non-linearity.
   - Apply the output-layer weights and biases to obtain the final output.

5. Finally, we print the input from the self-attention layer and the output of the MLP.

The MLP in a Transformer acts as a non-linear feature extractor, processing the output of the self-attention layer independently for each token (see the short check below). It helps capture complex interactions between features and refine the representations learned by the self-attention mechanism.
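
One way to make this token-wise independence concrete is the following check (a minimal sketch: the `mlp` helper and its random weights are assumptions for illustration, not part of the example above). Permuting the tokens before the MLP and permuting them after gives the same result:

```python
import numpy as np

rng = np.random.default_rng(0)
D, hidden_size = 4, 8

# Toy MLP weights, analogous to W1, b1, W2, b2 in the example above
W1, b1 = rng.standard_normal((D, hidden_size)), rng.standard_normal((1, hidden_size))
W2, b2 = rng.standard_normal((hidden_size, D)), rng.standard_normal((1, D))

def mlp(y):
    """Apply the same two-layer MLP to every token (row) of y."""
    h = np.maximum(0, y @ W1 + b1)  # ReLU hidden layer
    return h @ W2 + b2              # linear output layer

y = rng.standard_normal((3, D))  # three tokens from the attention layer
perm = [2, 0, 1]                 # reorder the tokens

# Because the MLP acts on each token independently with shared weights,
# permuting the input rows simply permutes the output rows.
assert np.allclose(mlp(y[perm]), mlp(y)[perm])
print("Permuting tokens before or after the MLP gives the same output.")
```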