
Commit 1f1e59c

Added tokenizer.
1 parent 08e7d0e commit 1f1e59c

File tree

3 files changed: +51,776 −0 lines changed


tokenizer.py

Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
from curtsies.fmtfuncs import red, green, on_blue, yellow, blue, cyan
from tokenizers import ByteLevelBPETokenizer
from transformers import GPT2Config, GPT2LMHeadModel, GPT2Tokenizer

TRAIN_BASE = False  # set to True to (re)train the BPE tokenizer from scratch
TOKENIZER_DIR = "tokenizer"

paths = ["data.txt"]

if TRAIN_BASE:
    # Train a byte-level BPE tokenizer on the raw text corpus.
    tokenizer = ByteLevelBPETokenizer()

    tokenizer.train(files=paths, vocab_size=52000, min_frequency=2, special_tokens=[
        "<s>",
        "<pad>",
        "</s>",
        "<unk>",
        "<mask>",
    ])

    # Writes vocab.json and merges.txt into TOKENIZER_DIR.
    tokenizer.save_model(TOKENIZER_DIR)

inp = "print('hello world!')"

# Reload the trained vocab/merges with the Hugging Face GPT-2 tokenizer class
# and register the same special tokens used during training.
tokenizer = GPT2Tokenizer.from_pretrained(TOKENIZER_DIR)
tokenizer.add_special_tokens({
    "eos_token": "</s>",
    "bos_token": "<s>",
    "unk_token": "<unk>",
    "pad_token": "<pad>",
    "mask_token": "<mask>"
})

t = tokenizer.encode(inp)
print(t)
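The ByteLevelBPETokenizer trained above operates on bytes rather than unicode characters, so any input string can be tokenized without falling back to `<unk>`. As a self-contained illustration (stdlib only, and not code from this commit), the sketch below mirrors the byte-to-printable-character mapping that GPT-2-style byte-level tokenizers apply before BPE merges:

```python
def bytes_to_unicode():
    """Map every byte 0-255 to a printable unicode character.

    Printable ASCII and Latin-1 bytes map to themselves; the rest
    (space, control bytes, etc.) are shifted into the 256+ range,
    which is why GPT-2-style tokens render a leading space as 'Ġ'.
    """
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


mapping = bytes_to_unicode()
inp = "print('hello world!')"
# Every byte of the input gets a printable stand-in, so no <unk> is needed.
print("".join(mapping[b] for b in inp.encode("utf-8")))
# -> print('helloĠworld!')
```

Because the mapping covers all 256 byte values, the trained vocabulary can represent arbitrary text, and the special tokens (`<s>`, `<pad>`, etc.) are the only symbols added outside it.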

