DEV Community

l', 'o', 'w', '</w>'): 1, ('l', 'o', 'w', 'e', 'r', '</w>'): 1, ('l', 'o', 'w', 'e', 's', 't', '</w>'): 1 }

The </w> marker helps distinguish word boundaries.

Step 2: Count Adjacent Pairs

We scan all words and count adjacent symbol pairs.

For example:

l o w </w>

produces:

(l, o)
(o, w)
(w, </w>)

Each pair is counted weighted by word frequency.

This gives us a global frequency table of all pairs in the corpus.

Step 3: Select the Most Frequent Pair

We select the most frequent adjacent pair:

('l', 'o')

This becomes a merge rule:

(l, o) → lo

Step 4: Merge Across the Entire Vocabulary

We replace every occurrence of:

l o

with:

lo

So:

('l','o','w','</w>')

becomes:

('lo','w','</w>')

Step 5: Repeat the Process

We recompute pair frequencies and continue merging:

(l,o) → lo
(lo,w) → low
(low,e) → lowe

Each iteration builds more complex tokens from simpler ones.

Training vs Tokenization (CRITICAL CONCEPT)

This is where many people get confused.

During Training

BPE:

Counts pair frequencies
Learns merge rules
Builds a vocabulary

Example:

(l,o) → lo
(lo,w) → low

During Tokenization (Inference)

The tokenizer:

Does NOT learn anything
Does NOT count frequencies
ONLY applies learned merge rules

Example:

Input:

lowly

Start:

l o w l y

Apply merges:

(l,o) → lo
(lo,w) → low

Result:

low l y

No learning happens here — only rule application.

Why BPE Works So Well

1. Solves OOV Problem

Even unseen words can be represented:

playfulness → play + ful + ness

2. Smaller Vocabulary

Instead of storing full words:

play
playing
player
played

we reuse:

play
ing
er

3. Better Generalization

Words with shared roots reuse tokens:

play
playing
player

This allows models to share meaning across related words.

Limitations of BPE

Despite its power, BPE has limitations:

Sensitive to training corpus
Greedy merging strategy
Weak for multilingual balance
Does not "understand" language — only frequency patterns

Because of this, modern LLMs often use:

Byte-level BPE (GPT-style tokenizers)
WordPiece (BERT)
SentencePiece Unigram (LLaMA, T5)

My Python Implementation (From Scratch)

Below is my simplified BPE implementation:

import sys
def main(av: list[str]) -> int:
 ac = len(av)
 # Step 1: Build word frequencies
 vocab = {}
 for i in range(1, ac):
 vocab[av[i]] = vocab.get(av[i], 0) + 1
 # Step 2: Split into characters
 word_dict = {
 tuple(word) + ('</w>',): freq
 for word, freq in vocab.items()
 }
 # BPE training loop
 num_merges = 200
 for merge_step in range(num_merges):
 # Step 3: Count pairs
 pair_count = {}
 for word, freq in word_dict.items():
 symbols = list(word)
 for i in range(len(symbols) - 1):
 pair = (symbols[i], symbols[i + 1])
 pair_count[pair] = pair_count.get(pair, 0) + freq
 if not pair_count:
 break
 # Step 4: Best pair
 best_pair = max(pair_count, key=pair_count.get)
 if pair_count[best_pair] < 2:
 break
 merged_symbol = ''.join(best_pair)
 print(f"Merge {merge_step + 1}: {best_pair} -> {merged_symbol}")
 # Step 5: Merge vocabulary
 new_word_dict = {}
 for word, freq in word_dict.items():
 symbols = list(word)
 new_symbols = []
 i = 0
 while i < len(symbols):
 if (
 i < len(symbols) - 1 and
 (symbols[i], symbols[i + 1]) == best_pair
 ):
 new_symbols.append(merged_symbol)
 i += 2
 else:
 new_symbols.append(symbols[i])
 i += 1
 new_word_dict[tuple(new_symbols)] = freq
 word_dict = new_word_dict
 print("\nFinal vocabulary:")
 print(word_dict)
 return 0
if __name__ == "__main__":
 sys.exit(main(sys.argv))

This implementation:

Builds vocabulary
Computes pair frequencies
Learns merge rules
Iteratively updates tokens

Key Insight

BPE is surprisingly simple:

It does not understand language — it only learns which character pairs appear together most frequently.

Yet this simple idea is powerful enough to be used in all modern LLMs.

Conclusion

Byte Pair Encoding sits at the core of modern NLP systems.

It provides a balance between:

Word-level tokenization (too rigid)
Character-level tokenization (too long)

By learning frequent subword patterns, BPE allows models to efficiently represent both common and unseen words.

Understanding BPE is one of the best entry points into understanding how LLMs actually process language.