This repository was archived by the owner on Sep 10, 2025. It is now read-only.
Improve Tokenizer New Type Onboarding #1536
Status: Open
Assignee: @zhenyan-zhang-meta
Description
🚀 The feature, motivation and pitch
As a sequel to #1518, where we added an enum for tokenizer types to simplify `TokenizerArgs.__post_init__`, we need to improve it further to simplify onboarding of new tokenizer types:
Tasks
- Move `TokenizerType` to a centralized place
  - We now have two of them (see the first sketch after this list):

    Lines 67 to 69 in 0299a37:

    ```python
    class TokenizerType(Enum):
        Tiktoken = auto()
        SentencePiece = auto()
    ```

    `torchchat/torchchat/cli/builder.py`, lines 241 to 245 in 0299a37:

    ```python
    class TokenizerType(Enum):
        NONE = 0
        TIKTOKEN = 1
        SENTENCEPIECE = 2
        HF_TOKENIZER = 3
    ```
- Check all getters of tokenizer types
  - They may be simple enough to inline at the call sites (see the second sketch after this list):

    `torchchat/torchchat/generate.py`, line 368 in 0299a37:

    ```python
    self.is_llama3_model = self.tokenizer_args.is_tiktoken()
    ```
- Add documentation for future tokenizer onboarding
  - We may need to point people to update the model validation logic (see the third sketch after this list):

    `torchchat/torchchat/cli/builder.py`, lines 290 to 322 in 0299a37:

    ```python
    def validate_model(
        self,
        model: Optional[Model],
        model_description: str = "model",
    ) -> None:
        if model is None:
            return

        if self.tokenizer_type == TokenizerType.NONE:
            raise RuntimeError(f"no tokenizer was found at {self.tokenizer_path}")

        is_tiktoken = self.is_tiktoken()
        is_sentencepiece = self.is_sentencepiece()
        is_hf_tokenizer = self.is_hf_tokenizer()

        use_tiktoken = model.config.use_tiktoken
        use_hf_tokenizer = model.config.use_hf_tokenizer
        use_sentencepiece = not (use_tiktoken or use_hf_tokenizer)

        if (
            (is_tiktoken and not use_tiktoken)
            or (is_hf_tokenizer and not use_hf_tokenizer)
            or (is_sentencepiece and not use_sentencepiece)
        ):
            raise RuntimeError(
                "model-specified tokenizer ({}) does not match provided tokenizer ({}) for {}".format(
                    tokenizer_setting_to_name(use_tiktoken, use_hf_tokenizer),
                    tokenizer_setting_to_name(is_tiktoken, is_hf_tokenizer),
                    model_description,
                )
            )

        return
    ```
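For the centralization task, here is a minimal sketch of what a single shared enum could look like. The module path in the comment is an assumption (any shared location would do), and the member set follows the richer of the two existing definitions, the one in `builder.py`:

```python
# Hypothetical shared module, e.g. torchchat/tokenizer/types.py (the path is
# an assumption); both existing TokenizerType definitions would be replaced
# by imports of this one.
from enum import Enum


class TokenizerType(Enum):
    NONE = 0
    TIKTOKEN = 1
    SENTENCEPIECE = 2
    HF_TOKENIZER = 3
```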
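For the getter task, a sketch of what inlining could look like at the `generate.py` call site, assuming the shared enum above and that `TokenizerArgs` exposes a `tokenizer_type` field (the field name is an assumption):

```python
# Sketched replacement for the generate.py line above: compare the enum
# directly instead of routing through a per-type getter like is_tiktoken().
# `tokenizer_type` as a field on TokenizerArgs is an assumption.
self.is_llama3_model = (
    self.tokenizer_args.tokenizer_type == TokenizerType.TIKTOKEN
)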
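For the documentation task, the validation logic is the step most likely to be missed when onboarding a new type. A rough sketch of how the check could be restructured so a new tokenizer type touches only one derivation function; the function names here are hypothetical, and the precedence mirrors the current `validate_model`:

```python
from enum import Enum


class TokenizerType(Enum):
    NONE = 0
    TIKTOKEN = 1
    SENTENCEPIECE = 2
    HF_TOKENIZER = 3


def expected_tokenizer_type(use_tiktoken: bool, use_hf_tokenizer: bool) -> TokenizerType:
    # Mirrors the precedence in the current validate_model: tiktoken first,
    # then the HF tokenizer, with SentencePiece as the fallback.
    if use_tiktoken:
        return TokenizerType.TIKTOKEN
    if use_hf_tokenizer:
        return TokenizerType.HF_TOKENIZER
    return TokenizerType.SENTENCEPIECE


def check_tokenizer_matches(
    actual: TokenizerType,
    use_tiktoken: bool,
    use_hf_tokenizer: bool,
    model_description: str = "model",
) -> None:
    # Onboarding a new type would then mean: add an enum member, extend
    # expected_tokenizer_type(), and update TokenizerArgs.__post_init__.
    expected = expected_tokenizer_type(use_tiktoken, use_hf_tokenizer)
    if actual != expected:
        raise RuntimeError(
            f"model-specified tokenizer ({expected.name}) does not match "
            f"provided tokenizer ({actual.name}) for {model_description}"
        )
```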
To test, run a model with each tokenizer type:
- `python torchchat.py generate llama2` (SentencePiece)
- `python torchchat.py generate llama3` (TikToken)
- `python torchchat.py generate granite-code` (HF tokenizer)
cc @Jack-Khuu @byjlw