3,020 questions
Advice · 0 votes · 0 replies · 83 views
Does OpenAI API TPM limit count input tokens, output tokens, or both?
I’m a bit confused about how OpenAI’s API rate limits work - specifically the TPM (tokens per minute) limit.
If I have, for example, 2 million TPM, is that limit calculated based on:
only the input ...
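My understanding from OpenAI's rate-limit docs is that both sides count: the prompt tokens plus the completion tokens the request could generate (the limiter budgets your max_tokens up front). A rough pre-flight sketch with tiktoken, with the encoding name left for you to match to your model:

```python
# Rough pre-flight estimate of TPM cost for one request. My understanding is
# that both prompt tokens and (reserved) completion tokens count toward TPM.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # pick the encoding for your model
prompt = "Explain tokenization in one paragraph."
prompt_tokens = len(enc.encode(prompt))
max_output_tokens = 256   # the rate limiter also budgets your max_tokens

print(prompt_tokens + max_output_tokens)  # approximate TPM debit per request
```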
1 vote · 2 answers · 464 views
How can I match the token count used by BGE-M3 embedding model before embedding?
For my particular project, it would be very helpful to know how many tokens the BGE-M3 embedding model would break a string down into before I embed the text. I could embed the string and count the ...
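Since BGE-M3 publishes its tokenizer on the Hugging Face Hub, the count can be had without embedding anything; a minimal sketch using the BAAI/bge-m3 checkpoint:

```python
# Count the tokens BGE-M3 would produce without running the embedding model;
# only the tokenizer files of the BAAI/bge-m3 Hub checkpoint are downloaded.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
text = "How many tokens will this string become?"
ids = tokenizer.encode(text)               # includes the special tokens
print(len(ids), tokenizer.convert_ids_to_tokens(ids))
```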
1 vote · 0 answers · 175 views
Convert SentencePiece tokenizer to ONNX
I'm developing a Python FAQ system based on embeddings to perform similarity queries between a user's question and the FAQ knowledge base. The FAQ needs to run on Android smartphones.
I'm using the ...
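One possible route, hedged: onnxruntime-extensions can export some Hugging Face tokenizers, including SentencePiece-based ones, as an ONNX graph that runs on Android alongside onnxruntime. The gen_processing_models call and the model name below are assumptions to verify against your installed version:

```python
# Untested sketch: onnxruntime-extensions can export some Hugging Face
# tokenizers (including SentencePiece-based ones) as an ONNX graph that runs
# on Android via onnxruntime + the extensions package. gen_processing_models
# and the checkpoint name below are assumptions to verify for your setup.
import onnx
from onnxruntime_extensions import gen_processing_models
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your-sentencepiece-model")  # placeholder
pre_model, _ = gen_processing_models(tokenizer, pre_kwargs={})
onnx.save(pre_model, "tokenizer.onnx")
```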
1 vote · 2 answers · 150 views
Strtok retains old data
I am currently writing a shell after taking a bit of a break from C, and I have found this problem with strtok. If I were to write "cd ../" on one line and then "ls" on the next it ...
0 votes · 1 answer · 55 views
Efficient multi-host TPU dataset processing
I want to train an LLM on a TPU v4-32 using JAX/Flax. The dataset is stored in a mounted Google Cloud Storage bucket. The dataset (Red-Pajama-v2) consists of 5000 shards, which are stored in .json.gz files: ~/...
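A common multi-host pattern is to give each host a disjoint slice of the shard list keyed on jax.process_index(), so no process ever touches all 5000 files; a sketch with an illustrative mount path:

```python
# A common multi-host pattern: each host reads only its own slice of the shard
# list, so no process touches all 5000 files. The mount path is illustrative.
import glob
import jax

shards = sorted(glob.glob("/mnt/red-pajama-v2/*.json.gz"))  # hypothetical path
host_shards = shards[jax.process_index()::jax.process_count()]

for path in host_shards:
    ...  # decompress, tokenize, and pack this host's shards only
```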
2 votes · 1 answer · 849 views
How to properly save and load a PEFT-trained Unsloth model with resized token embeddings?
I'm using Unsloth's FastVisionModel with the base model unsloth/qwen2-VL-2B-Instruct to train on a dataset that includes text with many unique characters. Here's the overall process I followed:
...
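A minimal sketch of the usual save/load ordering using plain transformers + peft (which Unsloth wraps); the checkpoint names are placeholders, and a vision checkpoint may need a different auto class than shown:

```python
# Sketch with plain transformers + peft (which Unsloth wraps). Names are
# placeholders; a vision checkpoint may need a different auto class. The key
# ordering: resize the base model's embeddings to the saved tokenizer's vocab
# BEFORE applying the adapter, so the checkpoint shapes line up.
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./adapter")      # saved with added tokens
model = AutoModelForCausalLM.from_pretrained("base-model")  # placeholder name
model.resize_token_embeddings(len(tokenizer))
model = PeftModel.from_pretrained(model, "./adapter")
```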
1 vote · 1 answer · 109 views
OpenAI GPT-3 token logprobs and word-level surprisal: inconsistent values and missing outputs for multi-token targets
I’m trying to compute word-level surprisal values for a set of sentence stimuli using OpenAI’s Completions API (legacy endpoint).
In information-theoretic terms, surprisal is the negative base-2 ...
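One detail that often trips this up: a word spanning several BPE pieces needs the logprobs of all its pieces summed (chain rule) before converting natural logs to bits; a sketch with illustrative numbers:

```python
# Word-level surprisal from per-token natural-log probabilities (as returned
# by the legacy Completions API with echo=True, logprobs=0): sum the logprobs
# of all BPE pieces in the word, then convert to base 2. Numbers illustrative.
import math

word_pieces = [("sur", -3.1), ("pris", -2.4), ("al", -0.7)]  # (piece, ln prob)
ln_p_word = sum(lp for _, lp in word_pieces)   # chain rule over sub-tokens
surprisal_bits = -ln_p_word / math.log(2)
print(round(surprisal_bits, 3))
```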
0 votes · 0 answers · 46 views
Byte-level tokenizer, slow merges
I need to speed up the merging process within the train_bpe function. The merging is fast enough with a smaller pretoken dictionary. However, if the pretoken dictionary is very large / the text ...
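The classic fix is to stop recounting every pair after each merge and instead update the pair counts incrementally, touching only the pairs adjacent to merged positions; a minimal sketch over a {word_tuple: freq} pretoken dictionary:

```python
# Incremental pair-count updates: after picking the best pair, re-merge only
# the words that contain it, adjusting only the counts the merge touches.
from collections import Counter

def merge_word(word, pair, pair_counts, freq):
    """Merge `pair` inside one pretoken `word` (a tuple of string symbols),
    updating only the pair counts that the merge actually changes."""
    a, b = pair
    out, i = [], 0
    while i < len(word):
        if i < len(word) - 1 and word[i] == a and word[i + 1] == b:
            if out:                          # pair with the left neighbor changes
                pair_counts[(out[-1], a)] -= freq
                pair_counts[(out[-1], a + b)] += freq
            if i + 2 < len(word):            # pair with the right neighbor changes
                pair_counts[(b, word[i + 2])] -= freq
                pair_counts[(a + b, word[i + 2])] += freq
            pair_counts[(a, b)] -= freq
            out.append(a + b)
            i += 2
        else:
            out.append(word[i])
            i += 1
    return tuple(out)

# usage: build pair_counts once as a Counter over all pretokens, then call
# merge_word only on words containing the chosen pair on each merge step.
```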
0 votes · 0 answers · 32 views
Retrieving spaCy transformer tokenization ids
While using the spaCy transformer pipeline en_core_web_trf, how do I retrieve the underlying transformer tokenization (often roberta-base)? Either the tokenizer ids, the tokenizer strings, or (preferably) both.
Actual ...
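A version-dependent sketch: in spacy-transformers 1.x the wordpiece batch is exposed on doc._.trf_data, though newer en_core_web_trf builds (curated transformers) use a different structure, so verify these attribute names against your install:

```python
# Version-dependent sketch: in spacy-transformers 1.x the wordpiece batch is
# exposed on doc._.trf_data; newer en_core_web_trf builds (curated
# transformers) differ, so check these attribute names for your install.
import spacy

nlp = spacy.load("en_core_web_trf")
doc = nlp("Tokenization is fun.")
wp = doc._.trf_data.wordpieces
print(wp.strings)     # sub-token strings (RoBERTa byte-pair pieces)
print(wp.input_ids)   # matching tokenizer ids
```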
0 votes · 0 answers · 51 views
How do I test a tflite generative model?
I have a Qwen2.5-0.5 tflite model and I would like to test it in Python (not just the encode/decode aspect but also the model's generation abilities) and in C or C++ before deploying on edge, and then deploy it ...
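For the Python side, one hedged sketch is to drive the model through tf.lite.Interpreter's signature runners; the signature names depend entirely on how the model was exported, so inspect them first:

```python
# Sketch for driving a generative .tflite model from Python. Signature names
# ("prefill"/"decode") and argument names depend entirely on how the model was
# exported, so inspect get_signature_list() first; the ones below are assumed.
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="qwen2.5-0.5b.tflite")
print(interpreter.get_signature_list())    # discover exported entry points

# runner = interpreter.get_signature_runner("decode")  # name is an assumption
# out = runner(tokens=..., input_pos=...)              # args per the list above
```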
2 votes · 0 answers · 45 views
How to get an exact substring match search with wildcards for Solr in ColdFusion?
I am trying to implement a search in ColdFusion (with indexing through Solr) where it gets a match on exact substrings and exact substrings only.
Here's my sample code:
<cfset criteriaString = '*#...
1 vote · 1 answer · 107 views
How to handle German language specific characters like (ä, ö, ü, ß) while tokenizing using GPT2Tokenizer?
I am working with German texts, which I need to tokenize using GPT2Tokenizer.
To tokenize the text, I wrote the following implementation:
from transformers import GPT2Tokenizer
text = "...
0 votes · 2 answers · 207 views
Fixing Missing NLTK Tokenizer Resources
Repeated LookupError even though NLTK is downloaded:
Resource punkt_tab not found.
Please use the NLTK Downloader to obtain the resource:
>>> import nltk
nltk.download('...
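Recent NLTK releases load the Punkt data from punkt_tab, so downloading plain punkt is not enough; fetching exactly the resource named in the error is usually the fix:

```python
# Recent NLTK versions load Punkt from the "punkt_tab" resource, so
# downloading plain "punkt" is not enough; fetch exactly what the error names.
import nltk

nltk.download("punkt_tab")

from nltk.tokenize import word_tokenize
print(word_tokenize("It works now."))
```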
1 vote · 1 answer · 93 views
How do I remove escape characters from output of nltk.word_tokenize?
How do I get rid of non-printing (escaped) characters from the output of the nltk.word_tokenize method? I am working through the book 'Natural Language Processing with Python' and am following the ...
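The stray characters typically originate in the raw source text rather than in word_tokenize itself; one sketch is to strip non-printable characters from each token after tokenizing:

```python
# The stray "\x.." characters usually come from the raw text, not from
# word_tokenize; one option is to drop non-printable characters per token.
from nltk.tokenize import word_tokenize

raw = "Some web text\x97with odd control\x0ccharacters."
tokens = word_tokenize(raw)
clean = ["".join(ch for ch in tok if ch.isprintable()) for tok in tokens]
print([tok for tok in clean if tok])   # drop tokens that were pure junk
```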
0 votes · 0 answers · 61 views
PunktTokenizer does not work with Russian `я.`
When tokenizing paragraphs into sentences in Russian, I am observing a special case where the sequence is not treated as the end of a sentence. The case occurs with я. at the end of the ...
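One mitigation sketch: train Punkt on an in-domain Russian corpus so that я. is not learned as a non-breaking abbreviation; the training corpus below is a placeholder:

```python
# Mitigation sketch: train Punkt on in-domain Russian text so that "я." is not
# learned as a non-breaking abbreviation; the corpus here is a placeholder.
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

trainer = PunktTrainer()
trainer.train("<большой русский корпус>", finalize=False)  # placeholder corpus
trainer.finalize_training()

tokenizer = PunktSentenceTokenizer(trainer.get_params())
print(tokenizer.tokenize("Это был я. Потом мы ушли."))
```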