
Why does any model running with llama-server behave differently? #9660

Unanswered
alexcardo asked this question in Q&A
Discussion options

What exactly do I need to do to force llama-server to behave the same way it does in llama-cli or in any other implementation?

I'll explain. Every model run with llama.cpp works as expected from within any app like Ollama or LM Studio, or even from llama-cli. Yet as soon as I try to run the model through the llama.cpp server, I hit the same issue I've been stumbling over for months.

My idea is to use a model as a translator. I've tried lots of them; currently I'm working with Qwen 2.5 at Q4.

If I literally ask the model to "Translate this text from Dutch to English" in -cnv (chat) mode, the result is always English output. Yet once I attempt the same thing in production mode (in my case, llama-server), the model can unexpectedly write the same text back in Dutch, completely ignoring my instructions. The bug may or may not occur, but once it does, it keeps happening on every subsequent run, i.e. with every API call.

I've spent months on this issue. There are no flexible Python instructions, so I'm using the example presented in the official documentation (via the openai library)...

I'm totally disappointed, and I don't know what to do.

All I need is for the model to behave exactly the same way it behaves in conversation mode, and that's it.


Replies: 2 comments 4 replies

Comment options

Are you using curl with the llama.cpp server? What configuration parameters are you sending via curl?

4 replies
Comment options

No, I don't use curl, as I need a Python implementation. As mentioned above, I use the approach provided in the official instructions... In this particular example I'm dealing with the LM Studio server (which is based on llama.cpp), but I experience the same behavior with the bare llama-server.

import requests  # assumed here; the original snippet does not show how the request is sent

url = "http://127.0.0.1:8080/v1/chat/completions"
headers = {
    "Content-Type": "application/json"
}
data = {
    "model": "lmstudio-community/Qwen2.5-7B-Instruct-GGUF/Qwen2.5-7B-Instruct-Q4_K_M.gguf",
    "messages": [
        {"role": "system", "content": "You are ChatGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."},
        {"role": "user", "content": f'''Translate this text from Dutch to English. Keep markdown: {markdown_output}'''}
    ],
    "temperature": 0,
    "max_tokens": -1,
    "stream": False
}

# markdown_output is defined elsewhere in the script
response = requests.post(url, headers=headers, json=data)
print(response.json()["choices"][0]["message"]["content"])

As mentioned here:

Examples:

You can use the Python openai library with the appropriate settings:

import openai

client = openai.OpenAI(
    base_url="http://localhost:8080/v1",  # "http://<Your api-server IP>:port"
    api_key="sk-no-key-required"
)
completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are ChatGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."},
        {"role": "user", "content": "Write a limerick about python exceptions"}
    ]
)
print(completion.choices[0].message)
Comment options

You can take this code presented in the official documentation, use the same model with the same quant, send 10 articles for translation to the llama.cpp server API, and you'll get 4 of them translated from Dutch to English, while the remaining 6 stay in Dutch.

Meanwhile, if you feed them all through -cnv (chat) mode, all of them are translated correctly.

I've tried Llama 3.1/3.2, Gemma, Qwen, OLMoE, etc. All of them behave the same way with the llama.cpp server.

Perhaps I need to use a prompt template somehow in the API request...

Comment options

Did you discover the solution to this? I am having the same problem.

Comment options

I have exactly the same problem too.

Comment options

Hi!
I hope this helps:
I noticed that the default parameters for the CLI tools and for the server are different, mainly because when you run the model with a CLI tool the params are defined at startup, whereas in server mode (and over the API) you can set params on every request.
In my case I was experimenting with google gemma-3-4b-it.
There are many more params besides temperature that might differ in server mode.
You can query /props to see them, or in the browser UI you can check the settings menu in the top right corner.
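For instance, here is a minimal sketch of pulling those server-side defaults over HTTP, assuming the requests library and the 127.0.0.1:8080 address used earlier in this thread (the exact layout of the returned JSON can vary between llama.cpp versions):

import requests  # assumed; any HTTP client will do

# GET /props returns the server's properties, including the default
# generation settings that requests fall back to when a field is omitted
props = requests.get("http://127.0.0.1:8080/props").json()
print(props)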
For example, here are the llama-mtmd-cli tool's default params:

--temp N temperature (default: 0.2)
--top-k N top-k sampling (default: 40, 0 = disabled)
--top-p N top-p sampling (default: 0.9, 1.0 = disabled)
--min-p N min-p sampling (default: 0.1, 0.0 = disabled)
--xtc-probability N xtc probability (default: 0.0, 0.0 = disabled)
--xtc-threshold N xtc threshold (default: 0.1, 1.0 = disabled)
--typical N locally typical sampling, parameter p (default: 1.0, 1.0 = disabled)
--repeat-last-n N last n tokens to consider for penalize (default: 64, 0 = disabled, -1 = ctx_size)
--repeat-penalty N penalize repeat sequence of tokens (default: 1.0, 1.0 = disabled)
--presence-penalty N repeat alpha presence penalty (default: 0.0, 0.0 = disabled)
--frequency-penalty N repeat alpha frequency penalty (default: 0.0, 0.0 = disabled)
--dry-multiplier N set DRY sampling multiplier (default: 0.0, 0.0 = disabled)
--dry-base N set DRY sampling base value (default: 1.75)
--dry-allowed-length N set allowed length for DRY sampling (default: 2)
--dry-penalty-last-n N set DRY penalty for the last n tokens (default: -1, 0 = disable, -1 = context size)
--dry-sequence-breaker STRING add sequence breaker for DRY sampling, clearing out default breakers ('\n', ':', '"', '*') in the process; use "none" to not use any sequence breakers
--dynatemp-range N dynamic temperature range (default: 0.0, 0.0 = disabled)
--dynatemp-exp N dynamic temperature exponent (default: 1.0)
--mirostat N use Mirostat sampling. Top K, Nucleus and Locally Typical samplers are ignored if used. (default: 0, 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0)
--mirostat-lr N Mirostat learning rate, parameter eta (default: 0.1)
--mirostat-ent N Mirostat target entropy, parameter tau (default: 5.0)

while the server shows:

"params": {
"n_predict": -1,
"seed": 4294967295,
"temperature": 0.800000011920929,
"dynatemp_range": 0.0,
"dynatemp_exponent": 1.0,
"top_k": 40,
"top_p": 0.949999988079071,
"min_p": 0.05000000074505806,
"top_n_sigma": -1.0,
"xtc_probability": 0.0,
"xtc_threshold": 0.10000000149011612,
"typical_p": 1.0,
"repeat_last_n": 64,
"repeat_penalty": 1.0,
"presence_penalty": 0.0,
"frequency_penalty": 0.0,
"dry_multiplier": 0.0,
"dry_base": 1.75,
"dry_allowed_length": 2,
"dry_penalty_last_n": 4096,
"dry_sequence_breakers": [
"\n",
":",
""",
"*"
],
"mirostat": 0,
"mirostat_tau": 5.0,
"mirostat_eta": 0.10000000149011612,
"stop": [],
"max_tokens": -1,
"n_keep": 0,
"n_discard": 0,
"ignore_eos": false,
"stream": true,
"logit_bias": [],
"n_probs": 0,
"min_keep": 0,
"grammar": "",
"grammar_lazy": false,
"grammar_triggers": [],
"preserved_tokens": [],
"chat_format": "Content-only",
"reasoning_format": "none",
"reasoning_in_content": false,
"thinking_forced_open": false,
"samplers": [
"penalties",
"dry",
"top_n_sigma",
"top_k",
"typ_p",
"top_p",
"min_p",
"xtc",
"temperature"
],
"speculative.n_max": 16,
"speculative.n_min": 0,
"speculative.p_min": 0.75,
"timings_per_token": false,
"post_sampling_probs": false,
"lora": []
},

As you can see, the default temperature and the default min_p are different. In your case the temperature was defined in your request, but min_p was not set.
min_p roughly means "a minimum value that a token must reach to be considered at all".
So your server ran with a lower min_p than the CLI.
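For anyone hitting this, a minimal sketch of pinning those samplers explicitly in the request, based on the asker's snippet above. The values simply mirror the CLI defaults quoted earlier and are assumptions to start from, not the one correct configuration; recent llama.cpp builds accept these extra sampling fields on /v1/chat/completions, but check your server version if any of them seem to be ignored.

import requests  # assumed; same HTTP client as in the snippet above

url = "http://127.0.0.1:8080/v1/chat/completions"
data = {
    # same model id as in the asker's snippet
    "model": "lmstudio-community/Qwen2.5-7B-Instruct-GGUF/Qwen2.5-7B-Instruct-Q4_K_M.gguf",
    "messages": [
        {"role": "system", "content": "You are a translator."},
        {"role": "user", "content": "Translate this text from Dutch to English. Keep markdown: <your text here>"}
    ],
    # pin the sampling parameters instead of relying on server defaults;
    # these values mirror the CLI defaults listed earlier in this thread
    "temperature": 0.2,
    "top_k": 40,
    "top_p": 0.9,
    "min_p": 0.1,
    "repeat_penalty": 1.0,
    "seed": 42,        # a fixed seed also makes runs reproducible
    "max_tokens": -1,
    "stream": False
}
response = requests.post(url, json=data)
print(response.json()["choices"][0]["message"]["content"])

If you go through the openai client instead, these non-OpenAI fields can typically be passed via the extra_body argument of client.chat.completions.create().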

0 replies
