New Bonus Materials: Byte Pair Encoding (BPE) Tokenizer From Scratch #489
-
Hi all, here's some new bonus material that I thought you might enjoy 😊
Byte Pair Encoding (BPE) Tokenizer From Scratch
Happy weekend!
-
What I have not yet fully understood is why whitespace is preserved as part of some tokens during the BPE training process rather than treating whitespace as separate tokens. Is this because separate whitespace tokens would significantly increase context window usage?
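To make the question concrete, here is a small check with tiktoken (used purely for illustration, not the notebook's own tokenizer), showing that the GPT-2 BPE vocabulary folds a leading space into the word token instead of spending a separate token on it:

# Illustration with tiktoken (not the notebook's tokenizer): the GPT-2 BPE
# vocabulary folds a leading space into the word token rather than emitting
# a standalone whitespace token.
import tiktoken

enc = tiktoken.get_encoding("gpt2")

for text in ("hello", " hello", "hello world"):
    ids = enc.encode(text)
    print(f"{text!r:15} -> {ids} -> {[enc.decode([i]) for i in ids]}")

# Typically " hello" maps to a single token that already contains the space,
# so ordinary text does not pay one extra token per word for the spaces.

If every space became its own token, ordinary English text would need roughly one extra token per word, which is the context-window cost the question hints at.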
-
Btw, for GPT-4, multiple whitespaces have dedicated tokens, which makes it a much better tokenizer for coding tasks.
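A quick way to see the difference is the sketch below (using tiktoken; the exact token counts depend on the vocabularies, so treat the printed numbers as illustrative):

# Sketch comparing how the GPT-2 and GPT-4 (cl100k_base) vocabularies
# tokenize an indented line of code; exact counts depend on the vocabulary.
import tiktoken

line = "        return x  # 8-space indent"

for name in ("gpt2", "cl100k_base"):
    enc = tiktoken.get_encoding(name)
    ids = enc.encode(line)
    print(f"{name:12}: {len(ids):2d} tokens -> {[enc.decode([i]) for i in ids]}")

# cl100k_base includes dedicated tokens for runs of whitespace, so the leading
# indentation is usually absorbed into far fewer tokens than with GPT-2.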
-
Interesting - yeah, that makes sense for code indentation! 👍🏻 It's actually quite insightful how tokenizers have evolved over time, becoming multimodal 🧠
-
Oh nice! This goes right onto my bookmark list. It kind of reminds me of the Byte Latent Transformer from December. These tokenizer-free approaches could make for a nice article one day.
-
> These tokenizer-free approaches could be a nice article one day.

Yeah, this would be awesome!
-
BTW, as you mentioned minbpe in the notebook:
- You should be able to train the tokenizer with HF's tokenizers framework. This should be possible with the BpeTrainer; see this example code:

  from tokenizers import Tokenizer
  from tokenizers.models import BPE
  from tokenizers.trainers import BpeTrainer

  # Initialize tokenizer
  tokenizer = Tokenizer(BPE(unk_token="<|endoftext|>"))

  # Configure trainer
  trainer = BpeTrainer(
      vocab_size=1000,
      special_tokens=["<|endoftext|>"]
  )

  # Train the tokenizer
  tokenizer.train(files=["the-verdict.txt"], trainer=trainer)

  # Save the tokenizer
  tokenizer.save("tokenizer.json")

- You can also load the original tokenizer, e.g. for GPT-2 (see the sketch after this list)
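For the second point, here is a minimal sketch of loading the pretrained GPT-2 tokenizer via HF (assuming the tokenizer files can be fetched from the Hugging Face Hub; take the exact snippet as an illustration, not as the notebook's approach):

# Sketch: loading the original pretrained GPT-2 tokenizer through HF
# (assumes the tokenizer files can be downloaded from the Hugging Face Hub).

# Option 1: the tokenizers library directly
from tokenizers import Tokenizer

gpt2_tok = Tokenizer.from_pretrained("gpt2")
print(gpt2_tok.encode("Hello, world!").ids)

# Option 2: via transformers
from transformers import AutoTokenizer

hf_gpt2 = AutoTokenizer.from_pretrained("gpt2")
print(hf_gpt2.encode("Hello, world!"))

Both routes should yield the same token IDs, since they load the same vocabulary and merges.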
-
> You should be able to train the tokenizer with HF's tokenizer framework.

Ah yes, this is what @Aananda-giri did in #485
-
Yes, exactly. I think it would be great to add this to the introduction text in the notebook at

- The difference between the implementations above and my implementation in this notebook is that it also includes a function for training the tokenizer (for educational purposes)
- There's also an implementation called minBPE with training support, which may be more performant (my implementation here is focused on educational purposes); in contrast to minbpe, my implementation additionally allows loading the original OpenAI tokenizer vocabulary and merges

because HF's tokenizers can do both training and loading a pretrained tokenizer (along with transformers).
-
@d-kleine Sure, I can add a note about that. I find HF code really hard to read, tbh, so I would prefer recommending minBPE. But yeah, I added the note as part of #495
-
Thanks! 👍🏻