3,020 questions
Advice · 0 votes · 0 replies · 83 views
Does OpenAI API TPM limit count input tokens, output tokens, or both?
I’m a bit confused about how OpenAI’s API rate limits work - specifically the TPM (tokens per minute) limit.
If I have, for example, 2 million TPM, is that limit calculated based on:
only the input ...
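My understanding from OpenAI's rate-limit docs is that both sides count: the prompt tokens plus the completion tokens the request could generate (the limiter budgets your max_tokens up front). A rough pre-flight sketch with tiktoken, with the encoding name left for you to match to your model:

```python
# Rough pre-flight estimate of TPM cost for one request. My understanding is
# that both prompt tokens and (reserved) completion tokens count toward TPM.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # pick the encoding for your model
prompt = "Explain tokenization in one paragraph."
prompt_tokens = len(enc.encode(prompt))
max_output_tokens = 256   # the rate limiter also budgets your max_tokens

print(prompt_tokens + max_output_tokens)  # approximate TPM debit per request
```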
1 vote · 2 answers · 464 views
How can I match the token count used by BGE-M3 embedding model before embedding?
For my particular project, it would be very helpful to know how many tokens the BGE-M3 embedding model would break a string down into before I embed the text. I could embed the string and count the ...
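Since BGE-M3 publishes its tokenizer on the Hugging Face Hub, the count can be had without embedding anything; a minimal sketch using the BAAI/bge-m3 checkpoint:

```python
# Count the tokens BGE-M3 would produce without running the embedding model;
# only the tokenizer files of the BAAI/bge-m3 Hub checkpoint are downloaded.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
text = "How many tokens will this string become?"
ids = tokenizer.encode(text)               # includes the special tokens
print(len(ids), tokenizer.convert_ids_to_tokens(ids))
```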
1 vote · 0 answers · 175 views
Convert SentencePiece tokenizer to ONNX
I'm developing a Python FAQ system based on embeddings to perform similarity queries between a user's question and the FAQ knowledge base. The FAQ needs to run on Android smartphones.
I'm using the ...
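One possible route, hedged: onnxruntime-extensions can export some Hugging Face tokenizers, including SentencePiece-based ones, as an ONNX graph that runs on Android alongside onnxruntime. The gen_processing_models call and the model name below are assumptions to verify against your installed version:

```python
# Untested sketch: onnxruntime-extensions can export some Hugging Face
# tokenizers (including SentencePiece-based ones) as an ONNX graph that runs
# on Android via onnxruntime + the extensions package. gen_processing_models
# and the checkpoint name below are assumptions to verify for your setup.
import onnx
from onnxruntime_extensions import gen_processing_models
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your-sentencepiece-model")  # placeholder
pre_model, _ = gen_processing_models(tokenizer, pre_kwargs={})
onnx.save(pre_model, "tokenizer.onnx")
```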
1 vote · 2 answers · 150 views
Strtok retains old data
I am currently writing a shell after taking a bit of a break from C, and I have found this problem with strtok. If I were to write "cd ../" on one line and then "ls" on the next it ...
0 votes · 1 answer · 55 views
Efficient multi-host TPU dataset processing
I want to train an LLM on a TPU v4-32 using JAX/Flax. The dataset is stored in a mounted Google Cloud Storage bucket. The dataset (Red-Pajama-v2) consists of 5000 shards, which are stored in .json.gz files: ~/...
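A common multi-host pattern is to give each host a disjoint slice of the shard list keyed on jax.process_index(), so no process ever touches all 5000 files; a sketch with an illustrative mount path:

```python
# A common multi-host pattern: each host reads only its own slice of the shard
# list, so no process touches all 5000 files. The mount path is illustrative.
import glob
import jax

shards = sorted(glob.glob("/mnt/red-pajama-v2/*.json.gz"))  # hypothetical path
host_shards = shards[jax.process_index()::jax.process_count()]

for path in host_shards:
    ...  # decompress, tokenize, and pack this host's shards only
```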
2 votes · 1 answer · 849 views
How to properly save and load a PEFT-trained Unsloth model with resized token embeddings?
I'm using Unsloth's FastVisionModel with the base model unsloth/qwen2-VL-2B-Instruct to train on a dataset that includes text with many unique characters. Here's the overall process I followed:
...
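A minimal sketch of the usual save/load ordering using plain transformers + peft (which Unsloth wraps); the checkpoint names are placeholders, and a vision checkpoint may need a different auto class than shown:

```python
# Sketch with plain transformers + peft (which Unsloth wraps). Names are
# placeholders; a vision checkpoint may need a different auto class. The key
# ordering: resize the base model's embeddings to the saved tokenizer's vocab
# BEFORE applying the adapter, so the checkpoint shapes line up.
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./adapter")      # saved with added tokens
model = AutoModelForCausalLM.from_pretrained("base-model")  # placeholder name
model.resize_token_embeddings(len(tokenizer))
model = PeftModel.from_pretrained(model, "./adapter")
```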
1 vote · 1 answer · 109 views
OpenAI GPT-3 token logprobs and word-level surprisal: inconsistent values and missing outputs for multi-token targets
I’m trying to compute word-level surprisal values for a set of sentence stimuli using OpenAI’s Completions API (legacy endpoint).
In information-theoretic terms, surprisal is the negative base-2 ...
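One detail that often trips this up: a word spanning several BPE pieces needs the logprobs of all its pieces summed (chain rule) before converting natural logs to bits; a sketch with illustrative numbers:

```python
# Word-level surprisal from per-token natural-log probabilities (as returned
# by the legacy Completions API with echo=True, logprobs=0): sum the logprobs
# of all BPE pieces in the word, then convert to base 2. Numbers illustrative.
import math

word_pieces = [("sur", -3.1), ("pris", -2.4), ("al", -0.7)]  # (piece, ln prob)
ln_p_word = sum(lp for _, lp in word_pieces)   # chain rule over sub-tokens
surprisal_bits = -ln_p_word / math.log(2)
print(round(surprisal_bits, 3))
```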
0 votes · 0 answers · 46 views
Byte-level tokenizer, slow merges
I need to speed up the merging process within the train_bpe function. The merging is fast enough with a smaller pretoken dictionary. However, if the pretoken dictionary is very large / the text ...
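The classic fix is to stop recounting every pair after each merge and instead update the pair counts incrementally, touching only the pairs adjacent to merged positions; a minimal sketch over a {word_tuple: freq} pretoken dictionary:

```python
# Incremental pair-count updates: after picking the best pair, re-merge only
# the words that contain it, adjusting only the counts the merge touches.
from collections import Counter

def merge_word(word, pair, pair_counts, freq):
    """Merge `pair` inside one pretoken `word` (a tuple of string symbols),
    updating only the pair counts that the merge actually changes."""
    a, b = pair
    out, i = [], 0
    while i < len(word):
        if i < len(word) - 1 and word[i] == a and word[i + 1] == b:
            if out:                          # pair with the left neighbor changes
                pair_counts[(out[-1], a)] -= freq
                pair_counts[(out[-1], a + b)] += freq
            if i + 2 < len(word):            # pair with the right neighbor changes
                pair_counts[(b, word[i + 2])] -= freq
                pair_counts[(a + b, word[i + 2])] += freq
            pair_counts[(a, b)] -= freq
            out.append(a + b)
            i += 2
        else:
            out.append(word[i])
            i += 1
    return tuple(out)

# usage: build pair_counts once as a Counter over all pretokens, then call
# merge_word only on words containing the chosen pair on each merge step.
```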
0 votes · 0 answers · 32 views
Retrieving spaCy transformer tokenization ids
While using the spaCy transformer pipeline en_core_web_trf, how do I retrieve the underlying transformer tokenization (often roberta-base)? Either the tokenizer ids, the tokenizer strings, or (preferably) both.
Actual ...
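A version-dependent sketch: in spacy-transformers 1.x the wordpiece batch is exposed on doc._.trf_data, though newer en_core_web_trf builds (curated transformers) use a different structure, so verify these attribute names against your install:

```python
# Version-dependent sketch: in spacy-transformers 1.x the wordpiece batch is
# exposed on doc._.trf_data; newer en_core_web_trf builds (curated
# transformers) differ, so check these attribute names for your install.
import spacy

nlp = spacy.load("en_core_web_trf")
doc = nlp("Tokenization is fun.")
wp = doc._.trf_data.wordpieces
print(wp.strings)     # sub-token strings (RoBERTa byte-pair pieces)
print(wp.input_ids)   # matching tokenizer ids
```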
0 votes · 0 answers · 51 views
How do I test a tflite generative model?
I have a Qwen2.5-0.5 tflite model and I would like to test it in Python (not just the encode/decode aspect but also the model's generation abilities) and in C or C++ before deploying on edge, and then deploy it ...
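For the Python side, one hedged sketch is to drive the model through tf.lite.Interpreter's signature runners; the signature names depend entirely on how the model was exported, so inspect them first:

```python
# Sketch for driving a generative .tflite model from Python. Signature names
# ("prefill"/"decode") and argument names depend entirely on how the model was
# exported, so inspect get_signature_list() first; the ones below are assumed.
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="qwen2.5-0.5b.tflite")
print(interpreter.get_signature_list())    # discover exported entry points

# runner = interpreter.get_signature_runner("decode")  # name is an assumption
# out = runner(tokens=..., input_pos=...)              # args per the list above
```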
2 votes · 0 answers · 45 views
How to get an exact substring match search with wildcards for Solr in ColdFusion?
I am trying to implement a search in ColdFusion (with indexing through Solr) where it gets a match on exact substrings and exact substrings only.
Here's my sample code:
<cfset criteriaString = '*#...
1 vote · 1 answer · 107 views
How to handle German language specific characters like (ä, ö, ü, ß) while tokenizing using GPT2Tokenizer?
I am working with German texts, which I need to tokenize using GPT2Tokenizer.
To tokenize the text, I wrote the following implementation:
from transformers import GPT2Tokenizer
text = "...
0 votes · 2 answers · 207 views
Fixing Missing NLTK Tokenizer Resources
Repeated LookupError even though NLTK is downloaded:
Resource punkt_tab not found.
Please use the NLTK Downloader to obtain the resource:
>>> import nltk
nltk.download('...
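Recent NLTK releases load the Punkt data from punkt_tab, so downloading plain punkt is not enough; fetching exactly the resource named in the error is usually the fix:

```python
# Recent NLTK versions load Punkt from the "punkt_tab" resource, so
# downloading plain "punkt" is not enough; fetch exactly what the error names.
import nltk

nltk.download("punkt_tab")

from nltk.tokenize import word_tokenize
print(word_tokenize("It works now."))
```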
1 vote · 1 answer · 93 views
How do I remove escape characters from output of nltk.word_tokenize?
How do I get rid of non-printing (escaped) characters from the output of the nltk.word_tokenize method? I am working through the book 'Natural Language Processing with Python' and am following the ...
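The stray characters typically originate in the raw source text rather than in word_tokenize itself; one sketch is to strip non-printable characters from each token after tokenizing:

```python
# The stray "\x.." characters usually come from the raw text, not from
# word_tokenize; one option is to drop non-printable characters per token.
from nltk.tokenize import word_tokenize

raw = "Some web text\x97with odd control\x0ccharacters."
tokens = word_tokenize(raw)
clean = ["".join(ch for ch in tok if ch.isprintable()) for tok in tokens]
print([tok for tok in clean if tok])   # drop tokens that were pure junk
```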
0 votes · 0 answers · 61 views
PunktTokenizer does not work with Russian `я.`
When tokenizing paragraphs into sentences in Russian, I am observing a special case where the sequence is not treated as the end of a sentence. The case occurs with я. at the end of the ...
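One mitigation sketch: train Punkt on an in-domain Russian corpus so that я. is not learned as a non-breaking abbreviation; the training corpus below is a placeholder:

```python
# Mitigation sketch: train Punkt on in-domain Russian text so that "я." is not
# learned as a non-breaking abbreviation; the corpus here is a placeholder.
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

trainer = PunktTrainer()
trainer.train("<большой русский корпус>", finalize=False)  # placeholder corpus
trainer.finalize_training()

tokenizer = PunktSentenceTokenizer(trainer.get_params())
print(tokenizer.tokenize("Это был я. Потом мы ушли."))
```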