BPE implementation tokenize beyond word boundaries? #608

Unanswered
asafam asked this question in Q&A

Thanks for a great review and clear code. I have a question: Shouldn't the training in BPE be applied at the word level? Your implementation can generate merged tokens that include multiple words.


Replies: 1 comment · 3 replies


Thanks for the comment. I thought the words = line.split() would prevent that. Do you have an example where this issue occurs?

3 replies

In the initial tokenization step in the train() method, it looks like the code processes processed_text as a flat sequence of characters rather than splitting it into words first (i.e., there is no whitespace-level chunking before tokenization):

token_ids = [self.inverse_vocab[char] for char in processed_text]

To the best of my knowledge, standard BPE tokenization expects the input to be chunked at the word level before merges are applied. Without that chunking, merges can end up crossing word boundaries.
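For illustration, here is a minimal sketch of what I mean by whitespace-level chunking before the character-level step (inverse_vocab is taken from the snippet above; everything else here is hypothetical, not your actual train() code):

```python
# Hypothetical sketch: split on whitespace first, then map each word's
# characters to IDs, so that later merges stay within word boundaries.
def word_level_token_ids(processed_text, inverse_vocab):
    token_ids_per_word = []
    for word in processed_text.split():                    # whitespace-level chunking
        word_ids = [inverse_vocab[char] for char in word]  # character IDs within the word
        token_ids_per_word.append(word_ids)
    return token_ids_per_word

# Pair counting and merging would then run per word, so a candidate pair
# can never span two different words.
```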

Would love to hear your thoughts; maybe there's a reason behind this choice. I'm just getting into the depths of it now and may be missing a key point.

Anyhow, thanks for putting it out. That's a great blog post and it was very helpful to me.


> In the initial tokenization step in the train() method, it looks like the code processes processed_text as a flat sequence of characters rather than splitting it into words first (i.e., there is no whitespace-level chunking before tokenization):

Having examined the implementation, this observation looks correct to me. But it's important to note that this implementation represents a specific variant of BPE known as byte-level BPE.

The byte-level BPE approach from 2019 (used for GPT-2) differs from the original BPE from 2016: as a pre-tokenization step, it uses the marker Ġ to indicate the beginning of a word, and during training it merges the token pair with the highest rank (i.e., it iteratively merges the pair that occurs most often in the corpus; the lower the rank value, the higher the rank).
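To make the rank idea concrete, here is a small self-contained toy sketch (my own illustration, not the code from the post) that marks word starts with Ġ and then repeatedly merges the most frequent pair, assigning ranks in merge order:

```python
from collections import Counter

def pre_tokenize(text):
    # GPT-2-style convention: the space before a word becomes the Ġ marker.
    tokens = []
    for i, word in enumerate(text.split()):
        if i > 0:
            tokens.append("Ġ")       # marks "this word was preceded by a space"
        tokens.extend(list(word))    # start from single characters
    return tokens

def train_merges(tokens, num_merges):
    merges = {}  # (left, right) -> rank; lower rank value = merged earlier
    for rank in range(num_merges):
        pair_counts = Counter(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)  # most frequent pair
        merges[best] = rank
        # Apply the new merge rule to the whole token sequence
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges, tokens

toks = pre_tokenize("the cat sat on the mat")
merges, toks = train_merges(toks, num_merges=8)
print(merges)  # merge rules in rank order (most frequent pair gets rank 0)
print(toks)    # on a tiny toy corpus a merge can still absorb a Ġ marker;
               # on a large, diverse corpus this is rare, as discussed below
```

At encoding time, a byte-level BPE tokenizer then applies these learned rules in rank order, always merging the pair with the lowest rank value that is present in the sequence.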

> Your implementation can generate merged tokens that include multiple words.

Addressing this point: while this is technically possible in byte-level BPE, in practice it is rare, for two reasons:

  • With the Ġ markers, word boundaries are recognized
  • The highest-rank merging rule ensures that only the token pair that occurs most often in the training corpus is considered for the next merge; in GPT-2, this is also limited to about 50,000 learned merges

So in practice you typically won't find cross-word tokens, because the tokenizer has been trained on a large, diverse corpus of text where the vocabulary consists of meaningful subwords or whole words rather than multi-word combinations.
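As a quick sanity check (assuming the tiktoken package is available), you can decode the actual GPT-2 token IDs one by one; the learned pieces are subwords or words, typically with their leading space, and you won't find pieces spanning two words:

```python
# Inspect GPT-2's learned vocabulary pieces with OpenAI's tiktoken
# (pip install tiktoken); decode each token ID individually.
import tiktoken

enc = tiktoken.get_encoding("gpt2")
ids = enc.encode("Byte-level BPE rarely produces tokens that span two words")
pieces = [enc.decode([i]) for i in ids]
print(pieces)
# Each piece is a (sub)word, usually carrying its leading space
# (the space that Ġ represents in the published vocab/merges files);
# no piece contains characters from two different words.
```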

@rasbt What do you think?


Seeing this a bit late, and many thanks for the thorough answer here @d-kleine. I agree with your points! (Regarding Ġ, this was actually one of the biggest issues I had when reimplementing it; I kind of missed it at first and was wondering why things were working differently from the 2019 implementation that I used as a reference.)
