
Cannot tokenize byte sequences that are not valid UTF-8 due to design flaw #51

Open
@sharpobject

Description

Hello,

The BPE algorithm can tokenize any byte sequence, and LLMs generally accept any sequence of tokens, using token dictionaries that can represent any byte sequence. However, the encode method in bpe-openai accepts a type that must be valid UTF-8. As a result, there are many byte sequences, some only one byte long, that this library cannot tokenize at all.
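To illustrate the class of inputs affected, here is a minimal sketch using only the Rust standard library. It assumes the issue's premise that encode takes a UTF-8 string type (`&str` or similar): any such API forces a UTF-8 validity check that rejects these byte sequences before the tokenizer ever sees them, even though a byte-level BPE vocabulary could represent them.

```rust
fn main() {
    // 0xFF can never appear in a valid UTF-8 string, yet a byte-level
    // BPE vocabulary has a token for the single byte 0xFF.
    let lone_byte: &[u8] = &[0xFF];
    // Converting to &str (required by a UTF-8-typed encode API) fails,
    // so this sequence cannot be tokenized through such an interface.
    assert!(std::str::from_utf8(lone_byte).is_err());

    // The same applies to truncated multi-byte sequences, e.g. the lead
    // byte of a two-byte encoding without its continuation byte.
    let truncated: &[u8] = &[0xC3];
    assert!(std::str::from_utf8(truncated).is_err());

    println!("neither sequence is valid UTF-8, so neither can reach encode");
}
```

An encode method taking `&[u8]` (or an additional `encode_bytes` variant) would avoid this restriction, since BPE itself operates on bytes.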
