Is there a GPT2 tokenizer optimized for Chinese characters? #739

Unanswered
Jessen-Li asked this question in Q&A

from transformers import AutoTokenizer, GPT2LMHeadModel
tokenizer = AutoTokenizer.from_pretrained("uer/gpt2-chinese-cluecorpussmall", cache_dir="./gpt2_ch")

Debug information shows the returned tokenizer class is <class 'transformers.models.bert.tokenization_bert.BertTokenizer'>.

model = GPT2LMHeadModel.from_pretrained("uer/gpt2-chinese-cluecorpussmall", cache_dir="./gpt2_ch")

Is this a mismatch, and how do I solve it? GPT2Tokenizer cannot be loaded from "uer/gpt2-chinese-cluecorpussmall".

The output of the trained model contains special tokens like [CLS] and [SEP]; would simply removing them work?
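
For reference, here is a minimal sketch of what I am assuming should work: the checkpoint appears to ship a BERT-style vocabulary (which would explain why AutoTokenizer resolves to BertTokenizer), so I pair that tokenizer with GPT2LMHeadModel and strip the special tokens at decode time instead of editing the string by hand. The prompt text and the generation parameters below are just placeholders.

from transformers import BertTokenizerFast, GPT2LMHeadModel

# Load the BERT-style tokenizer explicitly so the pairing with the GPT2 model is deliberate.
tokenizer = BertTokenizerFast.from_pretrained("uer/gpt2-chinese-cluecorpussmall", cache_dir="./gpt2_ch")
model = GPT2LMHeadModel.from_pretrained("uer/gpt2-chinese-cluecorpussmall", cache_dir="./gpt2_ch")

inputs = tokenizer("这是很久之前的事情了", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_length=50,
    do_sample=True,
    top_p=0.95,
    pad_token_id=tokenizer.pad_token_id,
)

# skip_special_tokens=True drops [CLS], [SEP] and [PAD] from the decoded text,
# so no manual cleanup of the generated string should be needed.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

If this pairing is intended by the model authors, then decoding with skip_special_tokens=True would be the cleaner alternative to removing [CLS] and [SEP] manually.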
