Copied to Clipboard
l', 'o', 'w', '</w>'): 1,
('l', 'o', 'w', 'e', 'r', '</w>'): 1,
('l', 'o', 'w', 'e', 's', 't', '</w>'): 1
}
The </w> marker helps distinguish word boundaries.
Step 2: Count Adjacent Pairs
We scan all words and count adjacent symbol pairs.
For example:
l o w </w>
produces:
(l, o)
(o, w)
(w, </w>)
Each pair is counted weighted by word frequency.
This gives us a global frequency table of all pairs in the corpus.
Step 3: Select the Most Frequent Pair
We select the most frequent adjacent pair:
('l', 'o')
This becomes a merge rule:
(l, o) → lo
Step 4: Merge Across the Entire Vocabulary
We replace every occurrence of:
l o
with:
lo
So:
('l','o','w','</w>')
becomes:
('lo','w','</w>')
Step 5: Repeat the Process
We recompute pair frequencies and continue merging:
(l,o) → lo
(lo,w) → low
(low,e) → lowe
Each iteration builds more complex tokens from simpler ones.
Training vs Tokenization (CRITICAL CONCEPT)
This is where many people get confused.
During Training
BPE:
- Counts pair frequencies
- Learns merge rules
- Builds a vocabulary
Example:
(l,o) → lo
(lo,w) → low
During Tokenization (Inference)
The tokenizer:
- Does NOT learn anything
- Does NOT count frequencies
- ONLY applies learned merge rules
Example:
Input:
lowly
Start:
l o w l y
Apply merges:
(l,o) → lo
(lo,w) → low
Result:
low l y
No learning happens here — only rule application.
Why BPE Works So Well
1. Solves OOV Problem
Even unseen words can be represented:
playfulness → play + ful + ness
2. Smaller Vocabulary
Instead of storing full words:
play
playing
player
played
we reuse:
play
ing
er
3. Better Generalization
Words with shared roots reuse tokens:
play
playing
player
This allows models to share meaning across related words.
Limitations of BPE
Despite its power, BPE has limitations:
- Sensitive to training corpus
- Greedy merging strategy
- Weak for multilingual balance
- Does not "understand" language — only frequency patterns
Because of this, modern LLMs often use:
- Byte-level BPE (GPT-style tokenizers)
- WordPiece (BERT)
- SentencePiece Unigram (LLaMA, T5)
My Python Implementation (From Scratch)
Below is my simplified BPE implementation:
import sys
def main(av: list[str]) -> int:
ac = len(av)
# Step 1: Build word frequencies
vocab = {}
for i in range(1, ac):
vocab[av[i]] = vocab.get(av[i], 0) + 1
# Step 2: Split into characters
word_dict = {
tuple(word) + ('</w>',): freq
for word, freq in vocab.items()
}
# BPE training loop
num_merges = 200
for merge_step in range(num_merges):
# Step 3: Count pairs
pair_count = {}
for word, freq in word_dict.items():
symbols = list(word)
for i in range(len(symbols) - 1):
pair = (symbols[i], symbols[i + 1])
pair_count[pair] = pair_count.get(pair, 0) + freq
if not pair_count:
break
# Step 4: Best pair
best_pair = max(pair_count, key=pair_count.get)
if pair_count[best_pair] < 2:
break
merged_symbol = ''.join(best_pair)
print(f"Merge {merge_step + 1}: {best_pair} -> {merged_symbol}")
# Step 5: Merge vocabulary
new_word_dict = {}
for word, freq in word_dict.items():
symbols = list(word)
new_symbols = []
i = 0
while i < len(symbols):
if (
i < len(symbols) - 1 and
(symbols[i], symbols[i + 1]) == best_pair
):
new_symbols.append(merged_symbol)
i += 2
else:
new_symbols.append(symbols[i])
i += 1
new_word_dict[tuple(new_symbols)] = freq
word_dict = new_word_dict
print("\nFinal vocabulary:")
print(word_dict)
return 0
if __name__ == "__main__":
sys.exit(main(sys.argv))
This implementation:
- Builds vocabulary
- Computes pair frequencies
- Learns merge rules
- Iteratively updates tokens
Key Insight
BPE is surprisingly simple:
It does not understand language — it only learns which character pairs appear together most frequently.
Yet this simple idea is powerful enough to be used in all modern LLMs.
Conclusion
Byte Pair Encoding sits at the core of modern NLP systems.
It provides a balance between:
- Word-level tokenization (too rigid)
- Character-level tokenization (too long)
By learning frequent subword patterns, BPE allows models to efficiently represent both common and unseen words.
Understanding BPE is one of the best entry points into understanding how LLMs actually process language.