A parallelized BPE tokenizer built from scratch as part of Stanford's CS336 assignment.
No HuggingFace. No SentencePiece. Just raw Python and a lot of profiling.
- `train.py` - BPE training with multiprocessing for pre-tokenization (sketched below)
- `tokenizer.py` - CLI for BPE encoding and decoding
- `trained-tokenizers/` - Trained vocabulary and merge files for TinyStories (10K) and OpenWebText (32K)
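Pre-tokenization is the part that parallelizes cleanly: each worker counts pre-tokens in its own chunk, and the per-chunk counts are merged afterwards. A minimal sketch of that idea (function names are illustrative, not the actual `train.py` API; assumes a GPT-2 style pre-tokenization pattern via the `regex` package):

```python
import regex as re
from collections import Counter
from multiprocessing import Pool

# GPT-2 style pre-tokenization pattern: splits text into word-like chunks
# before byte pairs are counted.
PAT = re.compile(
    r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
)

def pretokenize_chunk(chunk: str) -> Counter:
    """Count pre-token occurrences (as byte tuples) in one chunk of text."""
    counts = Counter()
    for match in PAT.finditer(chunk):
        counts[tuple(match.group().encode("utf-8"))] += 1
    return counts

def parallel_pretokenize(chunks: list[str], workers: int = 8) -> Counter:
    """Map pre-tokenization over chunks, then reduce the per-chunk counts."""
    total = Counter()
    # Call from under `if __name__ == "__main__"` on spawn-based platforms.
    with Pool(workers) as pool:
        for partial in pool.imap_unordered(pretokenize_chunk, chunks):
            total.update(partial)
    return total
```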
```bash
# Train a tokenizer
python train.py --input sample-data/TinyStoriesV2-GPT4-valid.txt --vocab-size 10000

# Encode text
python tokenizer.py --encode "Hello world" --vocab trained-tokenizers/TinyStories/vocab.json --merges trained-tokenizers/TinyStories/merges.txt

# Decode tokens
python tokenizer.py --decode "15496 995" --vocab trained-tokenizers/TinyStories/vocab.json --merges trained-tokenizers/TinyStories/merges.txt
```
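At encode time, each pre-token is reduced by greedily applying the learned merges in rank order. A minimal sketch of that loop (assumes the merges have been loaded into a rank dictionary; the actual `tokenizer.py` interface may differ):

```python
def apply_bpe(pretoken: bytes, merge_ranks: dict[tuple[bytes, bytes], int]) -> list[bytes]:
    """Greedily merge the lowest-ranked adjacent pair until no merge applies."""
    parts = [bytes([b]) for b in pretoken]
    while len(parts) > 1:
        # Find the adjacent pair with the best (lowest) merge rank.
        best, best_rank = None, None
        for i in range(len(parts) - 1):
            rank = merge_ranks.get((parts[i], parts[i + 1]))
            if rank is not None and (best_rank is None or rank < best_rank):
                best, best_rank = i, rank
        if best is None:
            break
        # Merge that pair into a single token and repeat.
        parts[best:best + 2] = [parts[best] + parts[best + 1]]
    return parts
```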
Profiled with Scalene.
Evaluated on the validation sets (compression ratio, bytes per token):
- OpenWebText (32K vocab): 4.37
- TinyStories (10K vocab): 4.12
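The ratio is just UTF-8 bytes of input divided by the number of tokens produced; a minimal sketch (the `tokenizer` object and its `encode()` method are assumed, not part of this repo's documented API):

```python
def compression_ratio(text: str, tokenizer) -> float:
    """UTF-8 bytes of the input divided by the number of tokens produced."""
    num_bytes = len(text.encode("utf-8"))
    num_tokens = len(tokenizer.encode(text))
    return num_bytes / num_tokens
```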
Wrote about the whole process here: Building a BPE Tokenizer from Scratch