Stack Overflow
Advice
0 votes
0 replies
83 views

I’m a bit confused about how OpenAI’s API rate limits work - specifically the TPM (tokens per minute) limit. If I have, for example, 2 million TPM, is that limit calculated based on: only the input ...
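A minimal sketch for checking this empirically, assuming the openai>=1.0 Python SDK and any model you can call (the model name below is a placeholder): the rate-limit headers returned with each response show the TPM budget draining, and that budget is generally charged for prompt tokens plus the max_tokens you request.

    # Inspect OpenAI rate-limit headers (assumes openai>=1.0 and OPENAI_API_KEY set).
    from openai import OpenAI

    client = OpenAI()
    resp = client.chat.completions.with_raw_response.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": "ping"}],
        max_tokens=1,
    )
    print(resp.headers.get("x-ratelimit-limit-tokens"))
    print(resp.headers.get("x-ratelimit-remaining-tokens"))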
1 vote
2 answers
464 views

For my particular project, it would be very helpful to know how many tokens the BGE-M3 embedding model would break a string down into before I embed the text. I could embed the string and count the ...
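One way to get the count without embedding anything, assuming the model is the Hugging Face checkpoint BAAI/bge-m3 and the transformers library is available: load its tokenizer alone and count the ids it produces.

    # Count tokens exactly as BGE-M3's tokenizer would produce them.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
    text = "How many tokens will this become?"
    ids = tokenizer.encode(text)  # includes special tokens such as <s> and </s>
    print(len(ids))
    print(tokenizer.convert_ids_to_tokens(ids))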
1 vote
0 answers
175 views

I'm developing (in Python) an FAQ system based on embeddings to perform similarity queries between a user's question and the FAQ knowledge base. The FAQ needs to run on Android smartphones. I'm using the ...
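For the similarity-query half, a numpy-only sketch (model loading omitted; the vector sizes and embeddings are placeholders) that is light enough to port to a phone: normalize once, then rank by dot product.

    # Brute-force cosine similarity between one query vector and the FAQ matrix.
    import numpy as np

    faq_vecs = np.random.rand(100, 384).astype(np.float32)  # placeholder FAQ embeddings
    query = np.random.rand(384).astype(np.float32)          # placeholder query embedding

    faq_norm = faq_vecs / np.linalg.norm(faq_vecs, axis=1, keepdims=True)
    scores = faq_norm @ (query / np.linalg.norm(query))     # cosine similarities
    best = int(np.argmax(scores))
    print(best, float(scores[best]))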
1 vote
2 answers
150 views

I am currently writing a shell after taking a bit of a break from C, and I have found a problem with strtok. If I were to write "cd ../" on one line and then "ls" on the next, it ...
0 votes
1 answer
55 views

I want to train an LLM on a TPUv4-32 using JAX/Flax. The dataset is stored in a mounted Google Cloud Storage bucket. The dataset (Red-Pajama-v2) consists of 5000 shards, which are stored as .json.gz files: ~/...
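Before any JAX enters the picture, the shards can be streamed line by line, which avoids holding a whole .json.gz in memory. This sketch assumes JSON Lines shards with a "raw_content" field (RedPajama-v2's documented field name, but verify against your copy).

    # Stream one shard lazily.
    import gzip
    import json

    def iter_shard(path):
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                yield json.loads(line)["raw_content"]  # field name assumed

    # Example: count documents in a shard (the path is a placeholder).
    # print(sum(1 for _ in iter_shard("shard-0000.json.gz")))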
2 votes
1 answer
849 views

I'm using Unsloth's FastVisionModel with the base model unsloth/qwen2-VL-2B-Instruct to train on a dataset that includes text with many unique characters. Here's the overall process I followed: ...
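When the training text contains characters the base vocabulary has never seen, one common preparatory step is to add them as tokens and resize the embeddings. A hedged sketch, following Unsloth's documented pattern of from_pretrained returning a model/tokenizer pair; the character list is hypothetical, and this may not be what the asker's pipeline needs.

    # Add unseen characters to the vocab and grow the embedding matrix to match.
    from unsloth import FastVisionModel

    model, tokenizer = FastVisionModel.from_pretrained("unsloth/qwen2-VL-2B-Instruct")
    new_chars = ["㊀", "㊁"]  # hypothetical characters missing from the vocabulary
    # If a processor is returned instead of a tokenizer, use processor.tokenizer.
    if tokenizer.add_tokens(new_chars):
        model.resize_token_embeddings(len(tokenizer))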
1 vote
1 answer
109 views

I’m trying to compute word-level surprisal values for a set of sentence stimuli using OpenAI’s Completions API (legacy endpoint). In information-theoretic terms, surprisal is the negative base-2 ...
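Since the API reports natural-log probabilities, surprisal in bits is -logprob / ln 2. A sketch, assuming a completions-capable model that still honors echo=True with logprobs (the model name is an assumption); word-level surprisal then means summing these token surprisals over each word's tokens.

    import math
    from openai import OpenAI

    client = OpenAI()
    resp = client.completions.create(
        model="davinci-002",   # assumption: any legacy completions model
        prompt="The cat sat on the mat.",
        max_tokens=0,
        echo=True,             # score the prompt tokens themselves
        logprobs=0,
    )
    lp = resp.choices[0].logprobs
    for token, logprob in zip(lp.tokens, lp.token_logprobs):
        if logprob is not None:                   # the first token has no logprob
            print(token, -logprob / math.log(2))  # surprisal in bits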
0 votes
0 answers
46 views

I need to speed up the merging process within the train_bpe function. The merging is fast enough while using a smaller pretoken dictionary. However, if the pretoken dictionary is very large / the text ...
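The usual fix is to stop rescanning every pretoken after each merge and instead keep the pair counts incrementally, adjusting only the words a merge actually changed. A sketch with hypothetical names (not the asker's train_bpe internals):

    from collections import Counter

    def count_pairs(pretokens):
        # pretokens: {tuple_of_symbols: frequency}
        pairs = Counter()
        for word, freq in pretokens.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        return pairs

    def apply_merge(pretokens, pair, pairs):
        # Merge `pair` everywhere, updating `pairs` only for words that changed.
        merged = pair[0] + pair[1]
        out = Counter()
        for word, freq in pretokens.items():
            new_word, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                    new_word.append(merged)
                    i += 2
                else:
                    new_word.append(word[i])
                    i += 1
            new_word = tuple(new_word)
            if new_word != word:
                for a, b in zip(word, word[1:]):
                    pairs[(a, b)] -= freq
                for a, b in zip(new_word, new_word[1:]):
                    pairs[(a, b)] += freq
            out[new_word] += freq
        return out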
0 votes
0 answers
32 views

While using the spaCy transformer pipeline en_core_web_trf, how do I retrieve the transformer tokenization (often roberta-base)? It can be the tokenizer ids, the tokenizer strings, or both (preferably). Actual ...
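One route, with the caveat that the attribute layout is version-dependent; this matches spacy-transformers 1.1+, where the wordpieces hang off doc._.trf_data.

    import spacy

    nlp = spacy.load("en_core_web_trf")
    doc = nlp("Tokenization is fun.")
    wp = doc._.trf_data.wordpieces   # WordpieceBatch in spacy-transformers 1.1+
    print(wp.strings)                # wordpiece strings, per span
    print(wp.input_ids)              # matching ids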
0 votes
0 answers
51 views

I have a Qwen2.5-0.5 tflite model and I would like to test it in Python (not just the encode/decode aspect but the model's generation abilities) and in C or C++ before deploying on edge, and then deploy it ...
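For the Python side, tf.lite.Interpreter is the standard harness. Note that LLM .tflite exports typically expose separate prefill/decode signatures with a KV cache, so this single-invoke sketch (file name and input contents are placeholders) is only the skeleton:

    import numpy as np
    import tensorflow as tf

    interp = tf.lite.Interpreter(model_path="qwen2.5-0.5.tflite")  # placeholder path
    interp.allocate_tensors()
    inp = interp.get_input_details()
    out = interp.get_output_details()
    ids = np.zeros(inp[0]["shape"], dtype=inp[0]["dtype"])  # placeholder token ids
    interp.set_tensor(inp[0]["index"], ids)
    interp.invoke()
    print(interp.get_tensor(out[0]["index"]).shape)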
2 votes
0 answers
45 views

I am trying to implement a search in ColdFusion (with indexing through Solr) where it gets a match on exact substrings and exact substrings only. Here's my sample code: <cfset criteriaString = '*#...
1 vote
1 answer
107 views

I am working with German texts, which I need to tokenize using GPT2Tokenizer. To tokenize the text, I wrote the implementation as follows: from transformers import GPT2Tokenizer text = "...
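A completed version of that snippet, assuming the stock "gpt2" vocabulary: GPT-2's byte-level BPE encodes German without errors, but umlauts usually split into several byte pieces because the merges were learned on English text.

    from transformers import GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    text = "Die Straße ist schön."
    ids = tokenizer.encode(text)
    print(ids)
    print(tokenizer.convert_ids_to_tokens(ids))  # note the multi-piece umlauts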
0 votes
2 answers
207 views

Repeated LookupError even though NLTK is downloaded: Resource punkt_tab not found. Please use the NLTK Downloader to obtain the resource: >>> import nltk nltk.download('...
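The error names the missing resource directly; downloading punkt_tab (and, if downloads keep landing somewhere NLTK does not search, pinning the data directory) is usually the fix. The path below is only an example.

    import nltk

    nltk.download("punkt_tab", download_dir="/tmp/nltk_data")  # example path
    nltk.data.path.append("/tmp/nltk_data")

    from nltk.tokenize import word_tokenize
    print(word_tokenize("It works now."))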
1 vote
1 answer
93 views

How do I get rid of non-printing (escaped) characters from the output of the nltk.word_tokenize method? I am working through the book 'Natural Language Processing with Python' and am following the ...
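One standard-library approach: keep only printable characters in each token and drop tokens that end up empty. str.isprintable() is False for control and zero-width format characters, which covers the usual escapes; the sample string is a made-up illustration.

    from nltk.tokenize import word_tokenize

    tokens = word_tokenize("hello\u200bworld \x0c test")
    cleaned = ["".join(ch for ch in t if ch.isprintable()) for t in tokens]
    cleaned = [t for t in cleaned if t]  # drop tokens that were only junk
    print(cleaned)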
0 votes
0 answers
61 views

When tokenizing paragraphs into sentences in Russian, I am observing a special case where the sequence is not treated as the end of a sentence. The case occurs with "я." at the end of the ...
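Punkt treats a single letter followed by a period as a likely initial (as in "A."), which is one reason "я." may not end a sentence. A hedged sketch for inspecting and editing the learned parameters, assuming the Russian Punkt pickle that older NLTK versions ship (newer NLTK uses punkt_tab instead):

    import nltk

    tok = nltk.data.load("tokenizers/punkt/russian.pickle")  # older NLTK layout
    print("я" in tok._params.abbrev_types)  # was it learned as an abbreviation?
    tok._params.abbrev_types.discard("я")
    print(tok.tokenize("Это сделал я. Потом ушёл."))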
