-
Thanks for a great review and clear code. I have a question: Shouldn't the training in BPE be applied at the word level? Your implementation can generate merged tokens that include multiple words.
-
Thanks for the comment. I thought the words = line.split() line would prevent that. Do you have an example where this issue occurs?
-
In the initial tokenization step in the train() method, it looks like the code processes processed_text as a flat sequence of characters, rather than splitting it into words first (i.e., no whitespace-level chunking before tokenization):
token_ids = [self.inverse_vocab[char] for char in processed_text]
To the best of my knowledge, standard BPE tokenization expects the input to be split at the word level before applying merges. Without word boundaries, the merging can end up crossing words.
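To make the concern concrete, here is a tiny standalone sketch (toy text, not the repository's exact preprocessing) contrasting pair counting over a flat character sequence with pair counting after whitespace splitting; in the flat case, the most frequent pairs can include the space character, i.e., they span word boundaries:

from collections import Counter

text = "to be or not to be"

# Flat character sequence: pairs can span word boundaries (e.g. ('o', ' ')).
flat_pairs = Counter(zip(text, text[1:]))

# Whitespace-level chunking first: pairs are counted inside each word only.
word_pairs = Counter(
    pair for word in text.split() for pair in zip(word, word[1:])
)

print(flat_pairs.most_common(3))  # top pairs include ones containing spaces
print(word_pairs.most_common(3))  # e.g. [(('t', 'o'), 2), (('b', 'e'), 2), (('o', 'r'), 1)]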
Would love to hear your thoughts. Maybe there's a reason behind this choice; I'm still digging into the details and may be missing a key point.
Anyhow, thanks for putting it out. That's a great blog post and it was very helpful to me.
-
In the initial tokenization step in the train() method, it looks like the code processes processed_text as a flat sequence of characters, rather than splitting it into words first (i.e., no whitespace-level chunking before tokenization).
Upon examining the implementation, this looks correct to me. But it's important to note that this implementation represents a specific variant of BPE known as byte-level BPE.
The byte-level BPE approach from 2019 (GPT-2) differs from the original BPE from 2016: it uses the marker Ġ in a pre-tokenization step to indicate the beginning of a word, and during training it merges the token pair with the highest rank (i.e., it iteratively merges the token pair that occurs most often in the corpus; the lower the rank value, the higher the rank).
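To illustrate, here is a toy sketch written from the description above (not the GPT-2 reference code): pre-tokenization keeps each word as its own chunk with a Ġ marker standing in for the leading space, and each training step merges the most frequent pair, recording the step index as the merge's rank.

from collections import Counter

def byte_level_bpe_training_sketch(corpus, num_merges):
    """Toy sketch of GPT-2-style BPE training on a tiny corpus.

    Pre-tokenization splits the text into word chunks; every word except
    the first is prefixed with the Ġ marker, which stands for the leading
    space and keeps word beginnings visible in the vocabulary. At each
    step, the pair with the highest count is merged; the step at which a
    merge is learned is its rank (rank 0 = highest-ranked merge).
    """
    words = corpus.split()
    sequences = [list(words[0])] + [["Ġ"] + list(w) for w in words[1:]]

    merges = {}  # (left, right) -> rank
    for rank in range(num_merges):
        # Count adjacent pairs within each pre-tokenized chunk only,
        # so a merge can never span two chunks (i.e., two words).
        pair_counts = Counter()
        for seq in sequences:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges[best] = rank

        # Apply the newly learned merge everywhere in the corpus.
        merged = best[0] + best[1]
        new_sequences = []
        for seq in sequences:
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_sequences.append(out)
        sequences = new_sequences
    return merges, sequences

merges, seqs = byte_level_bpe_training_sketch("low lower lowest", num_merges=4)
print(merges)  # e.g. {('l', 'o'): 0, ('lo', 'w'): 1, ('Ġ', 'low'): 2, ('Ġlow', 'e'): 3}
print(seqs)    # e.g. [['low'], ['Ġlowe', 'r'], ['Ġlowe', 's', 't']]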
Your implementation can generate merged tokens that include multiple words.
Addressing this point: While this would be technically possible in byte-level BPE, in practice it is rare due to these factors:
- With the Ġ markers, word boundaries are recognized.
- The highest-rank merging rule ensures that only the token pair that occurs most often in the training corpus is considered for the next merge. Also, the number of learned merges is limited (50,000 in GPT-2).
So you typically won't find cross-word tokens: because the tokenizer has been trained on a large, diverse corpus of text, the vocabulary consists of meaningful subwords or whole words rather than multi-word combinations.
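As a quick sanity check against the released GPT-2 vocabulary (assuming the Hugging Face transformers package is available; this is not the code from the blog post):

from transformers import GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
print(tok.tokenize("the quick brown fox"))
# Should print something like: ['the', 'Ġquick', 'Ġbrown', 'Ġfox']
# Each token is a single (sub)word with its Ġ word-beginning marker;
# none of the learned tokens spans two words.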
@rasbt What do you think?
-
Seeing this a bit late, and many thanks for the thorough answer here @d-kleine. I agree with your points! (Regarding Ġ, this was actually one of the biggest issues I had when reimplementing it; I kind of missed it at first and was wondering why things were working differently from the 2019 implementation that I used as a reference.)