This repository was archived by the owner on Sep 10, 2025. It is now read-only.
Improve Tokenizer New Type Onboarding #1536
Status: Open
Assignee: @zhenyan-zhang-meta
Description
🚀 The feature, motivation and pitch
As a sequel to #1518, where we added an enum for tokenizer types to simplify `TokenizerArgs.__post_init__`, we need to improve it further to simplify onboarding of new tokenizer types:
Tasks
- Move `TokenizerType` to a centralized place
  - We now have two of them (see the first sketch after this list):

    Lines 67 to 69 in 0299a37:

    ```python
    class TokenizerType(Enum):
        Tiktoken = auto()
        SentencePiece = auto()
    ```

    `torchchat/torchchat/cli/builder.py`, lines 241 to 245 in 0299a37:

    ```python
    class TokenizerType(Enum):
        NONE = 0
        TIKTOKEN = 1
        SENTENCEPIECE = 2
        HF_TOKENIZER = 3
    ```
- Check all getters of tokenizer types
  - They may be simple enough to inline at the call sites (see the second sketch after this list):

    `torchchat/torchchat/generate.py`, line 368 in 0299a37:

    ```python
    self.is_llama3_model = self.tokenizer_args.is_tiktoken()
    ```
- Add documentation for future tokenizer onboarding
  - We may need to point people to update the model validation logic (see the third sketch after this list):

    `torchchat/torchchat/cli/builder.py`, lines 290 to 322 in 0299a37:

    ```python
    def validate_model(
        self,
        model: Optional[Model],
        model_description: str = "model",
    ) -> None:
        if model is None:
            return

        if self.tokenizer_type == TokenizerType.NONE:
            raise RuntimeError(f"no tokenizer was found at {self.tokenizer_path}")

        is_tiktoken = self.is_tiktoken()
        is_sentencepiece = self.is_sentencepiece()
        is_hf_tokenizer = self.is_hf_tokenizer()

        use_tiktoken = model.config.use_tiktoken
        use_hf_tokenizer = model.config.use_hf_tokenizer
        use_sentencepiece = not (use_tiktoken or use_hf_tokenizer)

        if (
            (is_tiktoken and not use_tiktoken)
            or (is_hf_tokenizer and not use_hf_tokenizer)
            or (is_sentencepiece and not use_sentencepiece)
        ):
            raise RuntimeError(
                "model-specified tokenizer ({}) does not match provided tokenizer ({}) for {}".format(
                    tokenizer_setting_to_name(use_tiktoken, use_hf_tokenizer),
                    tokenizer_setting_to_name(is_tiktoken, is_hf_tokenizer),
                    model_description,
                )
            )

        return
    ```
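For the centralization task, here is a minimal sketch of what a single shared enum could look like. The module path in the comment is an assumption (any shared location would do), and the member set follows the richer of the two existing definitions, the one in `builder.py`:

```python
# Hypothetical shared module, e.g. torchchat/tokenizer/types.py (the path is
# an assumption); both existing TokenizerType definitions would be replaced
# by imports of this one.
from enum import Enum


class TokenizerType(Enum):
    NONE = 0
    TIKTOKEN = 1
    SENTENCEPIECE = 2
    HF_TOKENIZER = 3
```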
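For the getter task, a sketch of what inlining could look like at the `generate.py` call site, assuming the shared enum above and that `TokenizerArgs` exposes a `tokenizer_type` field (the field name is an assumption):

```python
# Sketched replacement for the generate.py line above: compare the enum
# directly instead of routing through a per-type getter like is_tiktoken().
# `tokenizer_type` as a field on TokenizerArgs is an assumption.
self.is_llama3_model = (
    self.tokenizer_args.tokenizer_type == TokenizerType.TIKTOKEN
)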
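For the documentation task, the validation logic is the step most likely to be missed when onboarding a new type. A rough sketch of how the check could be restructured so a new tokenizer type touches only one derivation function; the function names here are hypothetical, and the precedence mirrors the current `validate_model`:

```python
from enum import Enum


class TokenizerType(Enum):
    NONE = 0
    TIKTOKEN = 1
    SENTENCEPIECE = 2
    HF_TOKENIZER = 3


def expected_tokenizer_type(use_tiktoken: bool, use_hf_tokenizer: bool) -> TokenizerType:
    # Mirrors the precedence in the current validate_model: tiktoken first,
    # then the HF tokenizer, with SentencePiece as the fallback.
    if use_tiktoken:
        return TokenizerType.TIKTOKEN
    if use_hf_tokenizer:
        return TokenizerType.HF_TOKENIZER
    return TokenizerType.SENTENCEPIECE


def check_tokenizer_matches(
    actual: TokenizerType,
    use_tiktoken: bool,
    use_hf_tokenizer: bool,
    model_description: str = "model",
) -> None:
    # Onboarding a new type would then mean: add an enum member, extend
    # expected_tokenizer_type(), and update TokenizerArgs.__post_init__.
    expected = expected_tokenizer_type(use_tiktoken, use_hf_tokenizer)
    if actual != expected:
        raise RuntimeError(
            f"model-specified tokenizer ({expected.name}) does not match "
            f"provided tokenizer ({actual.name}) for {model_description}"
        )
```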
To test, run a model with each tokenizer type:
- `python torchchat.py generate llama2` (SentencePiece)
- `python torchchat.py generate llama3` (TikToken)
- `python torchchat.py generate granite-code` (HF tokenizer)
cc @Jack-Khuu @byjlw