
New Bonus Materials: Byte Pair Encoding (BPE) Tokenizer From Scratch #489

rasbt announced in Announcements
Discussion options

Hi all, here's some new bonus material that I thought you might enjoy 😊

Byte Pair Encoding (BPE) Tokenizer From Scratch

Happy weekend!


Replies: 3 comments 11 replies

Comment options

What I have not yet fully understood is why whitespace is preserved inside some tokens during the BPE training process rather than treating whitespace as separate tokens. Is this because separate whitespace tokens would grow the context window utilization significantly?
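For readers skimming the thread, here is a minimal sketch (not part of the original question; it assumes the tiktoken package) of what "preserving the space inside a token" looks like with the GPT-2 BPE:

import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("gpt2")
ids = enc.encode("Hello world")
print([enc.decode([i]) for i in ids])
# ['Hello', ' world'] -- the space is kept inside the second token
# instead of being emitted as a separate whitespace token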

7 replies
Comment options

rasbt Jan 20, 2025
Maintainer Author

Btw, for GPT-4, multiple whitespaces have dedicated tokens, which makes it a much better tokenizer for coding tasks.

[Screenshot: GPT-4 tokenizer output showing dedicated multi-whitespace tokens, 2025-01-20]
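As an illustration of this point (an editorial sketch, not from the original comment; it assumes the tiktoken package, which exposes both vocabularies):

import tiktoken

gpt2 = tiktoken.get_encoding("gpt2")          # GPT-2 BPE vocabulary
gpt4 = tiktoken.get_encoding("cl100k_base")   # GPT-4 BPE vocabulary

indented = "        return x"  # eight leading spaces, typical code indentation
print(len(gpt2.encode(indented)))  # more tokens: the spaces are mostly separate tokens
print(len(gpt4.encode(indented)))  # fewer tokens: runs of spaces map to dedicated tokens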

Comment options

Interesting - yeah, that makes sense for code indentations! 👍🏻 It's actually quite insightful to see how tokenizers have evolved over time, becoming multimodal 🧠

Comment options

BTW Aleph Alpha just proposed a new tokenizer-free autoregressive LLM architecture:

Comment options

rasbt Jan 24, 2025
Maintainer Author

Oh nice! This goes right onto my bookmark list. It kind of reminds me of the Byte Latent Transformer from December. These tokenizer-free approaches could be a nice article one day.

Comment options

These tokenizer-free approaches could be a nice article one day.

Yeah, this would be awesome!

Comment options

Hah, he's got a good point! Not all zeros are the same kind of zero
On Sat, Jan 18, 2025 at 6:09 PM Sebastian Raschka wrote: Great question. I am also not sure about the history behind it, but I strongly suspect it's to keep the context window utilization small. Note that if you have multiple white spaces after each other, they get treated as separate white space characters, so white space characters do exist in GPT-2 tokenizers. [Screenshot: GPT-2 tokenizer output, 2025-01-18]
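The screenshot referenced in the quote is not preserved here; a small sketch (again assuming tiktoken) reproduces the point that GPT-2 splits a run of whitespace into separate space tokens:

import tiktoken

enc = tiktoken.get_encoding("gpt2")
ids = enc.encode("a    b")  # four spaces between 'a' and 'b'
print([enc.decode([i]) for i in ids])
# Expected (per the quote above): the whitespace run is split into
# individual ' ' tokens rather than one multi-space token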
Comment options

BTW, as you mentioned minbpe in the notebook: you should be able to train the tokenizer with HF's tokenizer framework:

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

# Initialize the tokenizer with a BPE model
tokenizer = Tokenizer(BPE(unk_token="<|endoftext|>"))

# Configure the trainer
trainer = BpeTrainer(
    vocab_size=1000,
    special_tokens=["<|endoftext|>"]
)

# Train the tokenizer
tokenizer.train(files=["the-verdict.txt"], trainer=trainer)

# Save the tokenizer
tokenizer.save("tokenizer.json")
  • You can also load the original tokenizer, e.g. for GPT-2
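For that second bullet, a minimal sketch of loading the original GPT-2 tokenizer with the same library (the exact snippet is not shown in the thread; it assumes access to the Hugging Face Hub):

from tokenizers import Tokenizer

# Download and load the pretrained GPT-2 tokenizer from the Hugging Face Hub
gpt2_tokenizer = Tokenizer.from_pretrained("gpt2")

encoding = gpt2_tokenizer.encode("Hello, world!")
print(encoding.tokens)  # BPE tokens
print(encoding.ids)     # corresponding token IDs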
4 replies
Comment options

rasbt Jan 21, 2025
Maintainer Author

You should be able to train the tokenizer with HF's tokenizer framework.

Ah yes, this is what @Aananda-giri did in #485

Comment options

Yes, exactly. I think it would be great to add this to the introduction text in the notebook at:

  • The difference between the implementations above and my implementation in this notebook is that it also includes a function for training the tokenizer (for educational purposes)
  • There's also an implementation called minBPE with training support, which may be more performant (my implementation here is focused on educational purposes); in contrast to minbpe, my implementation additionally allows loading the original OpenAI tokenizer vocabulary and merges

because HF's tokenizers can do both training and loading a pretrained tokenizer (along with transformers)

Comment options

rasbt Jan 21, 2025
Maintainer Author

@d-kleine Sure, I can add a note about that. I find HF code really hard to read tbh so I would prefer recommending minBPE. But yeah, I added the note as part of #495

Comment options

Thanks! 👍🏻
