This repository was archived by the owner on Sep 10, 2025. It is now read-only.

Improve Tokenizer New Type Onboarding #1536

Open

Assignees: @zhenyan-zhang-meta

Labels: actionable (Items in the backlog waiting for an appropriate impl/fix), good first issue (Good for newcomers), triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Description

🚀 The feature, motivation and pitch


As a sequel to #1518, where we added an enum for tokenizer types to simplify TokenizerArgs.__post_init__, we need to further improve it to simplify onboarding of new tokenizer types.
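For context, here is a minimal sketch of the enum and getters implied by the validate_model snippet in the task list below. Only TokenizerType.NONE and the getter names appear in that snippet; the remaining member names and the overall shape are assumptions, not torchchat's actual definition.

```python
from dataclasses import dataclass
from enum import Enum, auto


class TokenizerType(Enum):
    # NONE is referenced by validate_model below; the other members are
    # assumed from the is_* getter names and may not match torchchat exactly.
    NONE = auto()
    TIKTOKEN = auto()
    SENTENCEPIECE = auto()
    HF_TOKENIZER = auto()


@dataclass
class TokenizerArgs:
    tokenizer_type: TokenizerType = TokenizerType.NONE

    # With the enum, each getter reduces to a single comparison.
    def is_tiktoken(self) -> bool:
        return self.tokenizer_type == TokenizerType.TIKTOKEN

    def is_sentencepiece(self) -> bool:
        return self.tokenizer_type == TokenizerType.SENTENCEPIECE

    def is_hf_tokenizer(self) -> bool:
        return self.tokenizer_type == TokenizerType.HF_TOKENIZER
```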

Tasks


  • Move TokenizerType to a centralized place
  • Check all getters of tokenizer types
  • Add documentation for future tokenizer onboarding
    • We may need to point people to update the model validation logic (the tokenizer_setting_to_name helper it calls is sketched after the snippet):
```python
def validate_model(
    self,
    model: Optional[Model],
    model_description: str = "model",
) -> None:
    if model is None:
        return

    if self.tokenizer_type == TokenizerType.NONE:
        raise RuntimeError(f"no tokenizer was found at {self.tokenizer_path}")

    is_tiktoken = self.is_tiktoken()
    is_sentencepiece = self.is_sentencepiece()
    is_hf_tokenizer = self.is_hf_tokenizer()

    use_tiktoken = model.config.use_tiktoken
    use_hf_tokenizer = model.config.use_hf_tokenizer
    use_sentencepiece = not (use_tiktoken or use_hf_tokenizer)

    if (
        (is_tiktoken and not use_tiktoken) or
        (is_hf_tokenizer and not use_hf_tokenizer) or
        (is_sentencepiece and not use_sentencepiece)
    ):
        raise RuntimeError(
            "model-specified tokenizer ({}) does not match provided tokenizer ({}) for {}".format(
                tokenizer_setting_to_name(use_tiktoken, use_hf_tokenizer),
                tokenizer_setting_to_name(is_tiktoken, is_hf_tokenizer),
                model_description,
            )
        )

    return
```
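tokenizer_setting_to_name is referenced above but not shown. A plausible sketch, assuming it simply maps the two config flags to a human-readable label (the exact strings are assumptions):

```python
def tokenizer_setting_to_name(use_tiktoken: bool, use_hf_tokenizer: bool) -> str:
    # Assumed behavior: map the mutually exclusive config flags to a
    # human-readable label, defaulting to SentencePiece when neither is set.
    if use_tiktoken:
        return "TikToken"
    if use_hf_tokenizer:
        return "HF"
    return "SentencePiece"
```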

To test, run a model with each tokenizer type:

  • python torchchat.py generate llama2
  • python torchchat.py generate llama3
  • python torchchat.py generate granite-code

cc @Jack-Khuu @byjlw
