
New Bonus Materials: Byte Pair Encoding (BPE) Tokenizer From Scratch #489

rasbt announced in Announcements
Discussion options

Hi all, here's some new bonus material that I thought you might enjoy 😊

Byte Pair Encoding (BPE) Tokenizer From Scratch

Happy weekend!


Replies: 3 comments 11 replies

Comment options

What I have not yet fully understood is why whitespace is preserved inside some tokens during the BPE training process rather than treating whitespace as separate tokens. Is this because separate whitespace tokens would grow the context window utilization significantly?
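For readers skimming the thread, here is a minimal sketch (not part of the original question; it assumes the tiktoken package) of what "preserving the space inside a token" looks like with the GPT-2 BPE:

import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("gpt2")
ids = enc.encode("Hello world")
print([enc.decode([i]) for i in ids])
# ['Hello', ' world'] -- the space is kept inside the second token
# instead of being emitted as a separate whitespace token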

7 replies
Comment options

rasbt Jan 20, 2025
Maintainer Author

Btw, for GPT-4, multiple whitespaces have dedicated tokens, which makes it a much better tokenizer for coding tasks.

[Screenshot: GPT-4 tokenizer output showing dedicated multi-whitespace tokens, 2025-01-20]
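As an illustration of this point (an editorial sketch, not from the original comment; it assumes the tiktoken package, which exposes both vocabularies):

import tiktoken

gpt2 = tiktoken.get_encoding("gpt2")          # GPT-2 BPE vocabulary
gpt4 = tiktoken.get_encoding("cl100k_base")   # GPT-4 BPE vocabulary

indented = "        return x"  # eight leading spaces, typical code indentation
print(len(gpt2.encode(indented)))  # more tokens: the spaces are mostly separate tokens
print(len(gpt4.encode(indented)))  # fewer tokens: runs of spaces map to dedicated tokens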

Comment options

Interesting - yeah, that makes sense for code indentations! 👍🏻 It's actually quite insightful to see how tokenizers have evolved over time, becoming multimodal 🧠

Comment options

BTW Aleph Alpha just proposed a new tokenizer-free autoregressive LLM architecture:

Comment options

rasbt Jan 24, 2025
Maintainer Author

Oh nice! This goes right onto my bookmark list. It kind of reminds me of the Byte Latent Transformer from December. These tokenizer-free approaches could be a nice article one day.

Comment options

These tokenizer-free approaches could be a nice article one day.

Yeah, this would be awesome!

Comment options

Hah, he's got a good point! Not all zeros are the same kind of zero
On Sat, Jan 18, 2025 at 6:09 PM Sebastian Raschka wrote: Great question. I am also not sure about the history behind it, but I strongly suspect it's to keep the context window utilization small. Note that if you have multiple white spaces after each other, they get treated as separate white space characters, so white space characters do exist in GPT-2 tokenizers. [Screenshot: GPT-2 tokenizer output, 2025-01-18]
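The screenshot referenced in the quote is not preserved here; a small sketch (again assuming tiktoken) reproduces the point that GPT-2 splits a run of whitespace into separate space tokens:

import tiktoken

enc = tiktoken.get_encoding("gpt2")
ids = enc.encode("a    b")  # four spaces between 'a' and 'b'
print([enc.decode([i]) for i in ids])
# Expected (per the quote above): the whitespace run is split into
# individual ' ' tokens rather than one multi-space token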
Comment options

BTW, as you mentioned minbpe in the notebook: you should be able to train the tokenizer with HF's tokenizer framework:

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

# Initialize the tokenizer with a BPE model
tokenizer = Tokenizer(BPE(unk_token="<|endoftext|>"))

# Configure the trainer
trainer = BpeTrainer(
    vocab_size=1000,
    special_tokens=["<|endoftext|>"]
)

# Train the tokenizer
tokenizer.train(files=["the-verdict.txt"], trainer=trainer)

# Save the tokenizer
tokenizer.save("tokenizer.json")
  • You can also load the original tokenizer, e.g. for GPT-2
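For that second bullet, a minimal sketch of loading the original GPT-2 tokenizer with the same library (the exact snippet is not shown in the thread; it assumes access to the Hugging Face Hub):

from tokenizers import Tokenizer

# Download and load the pretrained GPT-2 tokenizer from the Hugging Face Hub
gpt2_tokenizer = Tokenizer.from_pretrained("gpt2")

encoding = gpt2_tokenizer.encode("Hello, world!")
print(encoding.tokens)  # BPE tokens
print(encoding.ids)     # corresponding token IDs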
4 replies
Comment options

rasbt Jan 21, 2025
Maintainer Author

You should be able to train the tokenizer with HF's tokenizer framework.

Ah yes, this is what @Aananda-giri did in #485

Comment options

Yes, exactly. I think it would be great to add this to the introduction text in the notebook at:

  • The difference between the implementations above and my implementation in this notebook is that it also includes a function for training the tokenizer (for educational purposes)
  • There's also an implementation called minBPE with training support, which may be more performant (my implementation here is focused on educational purposes); in contrast to minbpe, my implementation additionally allows loading the original OpenAI tokenizer vocabulary and merges

because HF's tokenizers can do both training and loading a pretrained tokenizer (along with transformers)

Comment options

rasbt Jan 21, 2025
Maintainer Author

@d-kleine Sure, I can add a note about that. I find HF code really hard to read tbh so I would prefer recommending minBPE. But yeah, I added the note as part of #495

Comment options

Thanks! 👍🏻
