
attentif


A toy implementation of "Attention Is All You Need"

[Figure: matplotlib plot of training loss vs. step]

Demo

BERT

[Screenshot: JupyterLab solving a fill-mask task with BERT]

GPT2

[Screenshot: JupyterLab solving a text-generation task with GPT2]
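
To make the fill-mask task concrete, here is a minimal sketch of the decoding step. The random logits are a stand-in for what attentif's BertModel (with a language-modeling head) would produce; the project's actual demo code may differ.

```python
import torch
from transformers import AutoTokenizer

# attentif borrows tokenizers from transformers, so we can use one here.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
input_ids = tokenizer("The capital of France is [MASK].", return_tensors="pt")["input_ids"]

# Stand-in for model output: real logits would have shape (batch, seq_len, vocab_size).
logits = torch.randn(1, input_ids.size(1), tokenizer.vocab_size)

# Fill-mask means taking the argmax over the vocabulary at the [MASK] position.
mask_pos = (input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
print(tokenizer.decode([logits[0, mask_pos].argmax().item()]))
```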

Motivation

I made this project to gain a deeper understanding of the Transformer architecture and the BERT, RoBERTa, T5, and GPT models. We often rely on existing Transformer implementations such as Hugging Face Transformers when we need to train a model. However, I wanted to see whether I could implement them from scratch, referring only to the paper.

This project does include:

  • torch.nn.Module
  • torch.nn.Parameter
  • Existing tokenizer implementations from transformers
  • And other primitive functions offered by PyTorch

This project does not include:

  • Any models from transformers
  • nn.Transformer
  • nn.MultiheadAttention
  • nn.Embedding
  • nn.LayerNorm
  • nn.functional.softmax
  • And other existing modules that play an essential role in the Transformer architecture
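
To illustrate what ruling these modules out means in practice: even softmax has to be rebuilt from primitive tensor operations. A minimal sketch (not necessarily the exact code in src/layers):

```python
import torch

def softmax(x: torch.Tensor, dim: int = -1) -> torch.Tensor:
    # Subtract the per-slice max before exponentiating for numerical stability;
    # softmax is shift-invariant, so the result is unchanged.
    shifted = x - x.max(dim=dim, keepdim=True).values
    exp = shifted.exp()
    return exp / exp.sum(dim=dim, keepdim=True)
```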

Features

We have implemented the following features so far. You can find the layers and functions in src/layers, and the models in src/models.

Functions

  • dropout
  • softmax
  • gelu
  • positional_encoding
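
As an example, here is a sketch of positional_encoding following Section 3.5 of the paper, assuming an even d_model (the actual implementation in src/layers may differ):

```python
import torch

def positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)            # (seq_len, 1)
    div_term = torch.pow(10000.0, torch.arange(0, d_model, 2).float() / d_model)  # (d_model/2,)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position / div_term)  # even dimensions
    pe[:, 1::2] = torch.cos(position / div_term)  # odd dimensions
    return pe
```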

Layers

  • MultiHeadAttention
  • FeedForwardNetwork
  • LayerNorm
  • TokenEmbedding
  • TransformerEncoder
  • TransformerEncoderBlock
  • TransformerDecoder
  • TransformerDecoderBlock
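
The core of MultiHeadAttention is scaled dot-product attention, Equation (1) of the paper. A sketch of that building block, reusing the from-scratch softmax above (again, not necessarily the exact code in src/layers):

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # (..., seq_q, seq_k)
    if mask is not None:
        # Masked positions become -inf so they get zero weight after softmax
        # (e.g. the causal mask used in TransformerDecoderBlock).
        scores = scores.masked_fill(mask == 0, float("-inf"))
    # `softmax` is the from-scratch version sketched earlier in this README.
    return softmax(scores, dim=-1) @ v
```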

Models

  • BertModel
  • GPT2Model
  • T5Model
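
A hypothetical usage sketch follows. The import path and constructor arguments are illustrative assumptions (mirroring BERT-base hyperparameters), not the project's actual API:

```python
import torch
from src.models import BertModel  # assumed import path

# Hypothetical constructor signature; the real one may differ.
model = BertModel(vocab_size=30522, d_model=768, num_heads=12, num_layers=12)

input_ids = torch.randint(0, 30522, (1, 128))  # a batch of one 128-token sequence
hidden_states = model(input_ids)               # expected shape: (1, 128, 768)
```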

Schedulers

We use the schedulers from transformers for now, but plan to implement them from scratch in the future.

  • AdamW
  • CrossEntropy
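
Here is a sketch of how these pieces might fit together in a training step, assuming AdamW and CrossEntropy refer to torch.optim.AdamW and torch.nn.CrossEntropyLoss, with model and dataloader as placeholders:

```python
import torch
from transformers import get_linear_schedule_with_warmup

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=1_000, num_training_steps=100_000
)
criterion = torch.nn.CrossEntropyLoss()

for input_ids, labels in dataloader:  # placeholder DataLoader of token ids
    logits = model(input_ids)         # (batch, seq_len, vocab_size)
    loss = criterion(logits.view(-1, logits.size(-1)), labels.view(-1))
    loss.backward()
    optimizer.step()
    scheduler.step()                  # advance the warmup/decay schedule
    optimizer.zero_grad()
```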

References

  • Vaswani et al., "Attention Is All You Need", 2017. https://arxiv.org/abs/1706.03762
  • Devlin et al., "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", 2018. https://arxiv.org/abs/1810.04805
  • Radford et al., "Language Models are Unsupervised Multitask Learners", 2019.
  • Raffel et al., "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer", 2019. https://arxiv.org/abs/1910.10683
