BPE implementation tokenize beyond word boundaries? #608

Unanswered
asafam asked this question in Q&A

Thanks for a great review and clear code. I have a question: Shouldn't the training in BPE be applied at the word level? Your implementation can generate merged tokens that include multiple words.


Replies: 1 comment · 3 replies


Thanks for the comment. I thought the words = line.split() would prevent that. Do you have an example where this issue occurs?

3 replies

In the initial tokenization step in the train() method, it looks like the code processes processed_text as a flat sequence of characters rather than splitting it into words first (i.e., there is no whitespace-level chunking before tokenization):

token_ids = [self.inverse_vocab[char] for char in processed_text]

To the best of my knowledge, standard BPE tokenization expects the input to be chunked at the word level before merges are applied. Without that chunking, merges can end up crossing word boundaries.
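For illustration, here is a minimal sketch of what I mean by whitespace-level chunking before the character-level step (inverse_vocab is taken from the snippet above; everything else here is hypothetical, not your actual train() code):

```python
# Hypothetical sketch: split on whitespace first, then map each word's
# characters to IDs, so that later merges stay within word boundaries.
def word_level_token_ids(processed_text, inverse_vocab):
    token_ids_per_word = []
    for word in processed_text.split():                    # whitespace-level chunking
        word_ids = [inverse_vocab[char] for char in word]  # character IDs within the word
        token_ids_per_word.append(word_ids)
    return token_ids_per_word

# Pair counting and merging would then run per word, so a candidate pair
# can never span two different words.
```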

Would love to hear your thoughts; maybe there's a reason behind this choice. I'm just getting into the depths of it now and may be missing a key point.

Anyhow, thanks for putting it out. That's a great blog post and it was very helpful to me.


> In the initial tokenization step in the train() method, it looks like the code processes processed_text as a flat sequence of characters rather than splitting it into words first (i.e., there is no whitespace-level chunking before tokenization):

Having examined the implementation, this observation looks correct to me. But it's important to note that this implementation represents a specific variant of BPE known as byte-level BPE.

The byte-level BPE approach from 2019 (used for GPT-2) differs from the original BPE from 2016: as a pre-tokenization step, it uses the marker Ġ to indicate the beginning of a word, and during training it merges the token pair with the highest rank (i.e., it iteratively merges the pair that occurs most often in the corpus; the lower the rank value, the higher the rank).
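To make the rank idea concrete, here is a small self-contained toy sketch (my own illustration, not the code from the post) that marks word starts with Ġ and then repeatedly merges the most frequent pair, assigning ranks in merge order:

```python
from collections import Counter

def pre_tokenize(text):
    # GPT-2-style convention: the space before a word becomes the Ġ marker.
    tokens = []
    for i, word in enumerate(text.split()):
        if i > 0:
            tokens.append("Ġ")       # marks "this word was preceded by a space"
        tokens.extend(list(word))    # start from single characters
    return tokens

def train_merges(tokens, num_merges):
    merges = {}  # (left, right) -> rank; lower rank value = merged earlier
    for rank in range(num_merges):
        pair_counts = Counter(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)  # most frequent pair
        merges[best] = rank
        # Apply the new merge rule to the whole token sequence
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges, tokens

toks = pre_tokenize("the cat sat on the mat")
merges, toks = train_merges(toks, num_merges=8)
print(merges)  # merge rules in rank order (most frequent pair gets rank 0)
print(toks)    # on a tiny toy corpus a merge can still absorb a Ġ marker;
               # on a large, diverse corpus this is rare, as discussed below
```

At encoding time, a byte-level BPE tokenizer then applies these learned rules in rank order, always merging the pair with the lowest rank value that is present in the sequence.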

> Your implementation can generate merged tokens that include multiple words.

Addressing this point: while this is technically possible in byte-level BPE, in practice it is rare, for two reasons:

  • With the Ġ markers, word boundaries are recognized
  • The highest-rank merging rule ensures that only the token pair that occurs most often in the training corpus is considered for the next merge; in GPT-2, this is also limited to about 50,000 learned merges

So in practice you typically won't find cross-word tokens, because the tokenizer has been trained on a large, diverse corpus of text where the vocabulary consists of meaningful subwords or whole words rather than multi-word combinations.
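As a quick sanity check (assuming the tiktoken package is available), you can decode the actual GPT-2 token IDs one by one; the learned pieces are subwords or words, typically with their leading space, and you won't find pieces spanning two words:

```python
# Inspect GPT-2's learned vocabulary pieces with OpenAI's tiktoken
# (pip install tiktoken); decode each token ID individually.
import tiktoken

enc = tiktoken.get_encoding("gpt2")
ids = enc.encode("Byte-level BPE rarely produces tokens that span two words")
pieces = [enc.decode([i]) for i in ids]
print(pieces)
# Each piece is a (sub)word, usually carrying its leading space
# (the space that Ġ represents in the published vocab/merges files);
# no piece contains characters from two different words.
```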

@rasbt What do you think?


Seeing this a bit late, and many thanks for the thorough answer here @d-kleine. I agree with your points! (Regarding Ġ, this was actually one of the biggest issues I had when reimplementing it; I kind of missed it at first and was wondering why things were working differently from the 2019 implementation that I used as a reference.)
