
Why does any model running with llama-server behave differently? #9660

Unanswered
alexcardo asked this question in Q&A
Discussion options

What exactly do I need to do to force llama-server to behave the same way it does in llama-cli or in any other implementation?

I'll explain. Every model run with llama.cpp works as expected from within any app like Ollama or LM Studio, or even from llama-cli. Yet as soon as I try to run the model through the llama.cpp server, I hit the same issue I've been stumbling over for months.

My idea is to use a model as a translator. I've tried lots of them; currently I'm working with Qwen 2.5 at Q4.

If I literally ask the model to "Translate this text from Dutch to English" in -cnv (chat) mode, the result is always English output. Yet once I attempt the same thing in production mode (in my case, llama-server), the model can unexpectedly write the same text back in Dutch, completely ignoring my instructions. The bug may or may not occur, but once it does, it keeps happening on every subsequent run, i.e. with every API call.

I've spent months on this issue. There are no flexible Python instructions, so I'm using the example presented in the official documentation (via the openai library)...

I'm totally disappointed, and I don't know what to do.

All I need is for the model to behave exactly the same way it behaves in conversation mode, and that's it.


Replies: 2 comments 4 replies

Comment options

Are you using curl with the llama.cpp server? What configuration parameters are you sending via curl?

4 replies
Comment options

No, I don't use curl, as I need a Python implementation. As mentioned above, I use the approach provided in the official instructions... In this particular example I'm dealing with the LM Studio server (which is based on llama.cpp), but I experience the same behavior with the bare llama-server.

import requests  # assumed here; the original snippet does not show how the request is sent

url = "http://127.0.0.1:8080/v1/chat/completions"
headers = {
    "Content-Type": "application/json"
}
data = {
    "model": "lmstudio-community/Qwen2.5-7B-Instruct-GGUF/Qwen2.5-7B-Instruct-Q4_K_M.gguf",
    "messages": [
        {"role": "system", "content": "You are ChatGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."},
        {"role": "user", "content": f'''Translate this text from Dutch to English. Keep markdown: {markdown_output}'''}
    ],
    "temperature": 0,
    "max_tokens": -1,
    "stream": False
}

# markdown_output is defined elsewhere in the script
response = requests.post(url, headers=headers, json=data)
print(response.json()["choices"][0]["message"]["content"])

As mentioned here:

Examples:

You can use the Python openai library with the appropriate settings:

import openai

client = openai.OpenAI(
    base_url="http://localhost:8080/v1",  # "http://<Your api-server IP>:port"
    api_key="sk-no-key-required"
)
completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are ChatGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."},
        {"role": "user", "content": "Write a limerick about python exceptions"}
    ]
)
print(completion.choices[0].message)
Comment options

You can take this code presented in the official documentation, use the same model with the same quant, send 10 articles for translation to the llama.cpp server API, and you'll get 4 of them translated from Dutch to English, while the remaining 6 stay in Dutch.

Meanwhile, if you feed them all through -cnv (chat) mode, all of them are translated correctly.

I've tried Llama 3.1/3.2, Gemma, Qwen, OLMoE, etc. All of them behave the same way with the llama.cpp server.

Perhaps I need to use a prompt template somehow in the API request...

Comment options

Did you discover the solution to this? I am having the same problem.

Comment options

I have exactly the same problem too.

Comment options

Hi!
I hope this helps:
I noticed that the default parameters for the CLI tools and for the server are different, mainly because when you run the model with a CLI tool the params are defined at startup, whereas in server mode (and over the API) you can set params on every request.
In my case I was experimenting with google gemma-3-4b-it.
There are many more params besides temperature that might differ in server mode.
You can query /props to see them, or in the browser UI you can check the settings menu in the top right corner.
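For instance, here is a minimal sketch of pulling those server-side defaults over HTTP, assuming the requests library and the 127.0.0.1:8080 address used earlier in this thread (the exact layout of the returned JSON can vary between llama.cpp versions):

import requests  # assumed; any HTTP client will do

# GET /props returns the server's properties, including the default
# generation settings that requests fall back to when a field is omitted
props = requests.get("http://127.0.0.1:8080/props").json()
print(props)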
For example, here are the llama-mtmd-cli tool's default params:

--temp N temperature (default: 0.2)
--top-k N top-k sampling (default: 40, 0 = disabled)
--top-p N top-p sampling (default: 0.9, 1.0 = disabled)
--min-p N min-p sampling (default: 0.1, 0.0 = disabled)
--xtc-probability N xtc probability (default: 0.0, 0.0 = disabled)
--xtc-threshold N xtc threshold (default: 0.1, 1.0 = disabled)
--typical N locally typical sampling, parameter p (default: 1.0, 1.0 = disabled)
--repeat-last-n N last n tokens to consider for penalize (default: 64, 0 = disabled, -1 = ctx_size)
--repeat-penalty N penalize repeat sequence of tokens (default: 1.0, 1.0 = disabled)
--presence-penalty N repeat alpha presence penalty (default: 0.0, 0.0 = disabled)
--frequency-penalty N repeat alpha frequency penalty (default: 0.0, 0.0 = disabled)
--dry-multiplier N set DRY sampling multiplier (default: 0.0, 0.0 = disabled)
--dry-base N set DRY sampling base value (default: 1.75)
--dry-allowed-length N set allowed length for DRY sampling (default: 2)
--dry-penalty-last-n N set DRY penalty for the last n tokens (default: -1, 0 = disable, -1 = context size)
--dry-sequence-breaker STRING add sequence breaker for DRY sampling, clearing out default breakers ('\n', ':', '"', '*') in the process; use "none" to not use any sequence breakers
--dynatemp-range N dynamic temperature range (default: 0.0, 0.0 = disabled)
--dynatemp-exp N dynamic temperature exponent (default: 1.0)
--mirostat N use Mirostat sampling. Top K, Nucleus and Locally Typical samplers are ignored if used. (default: 0, 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0)
--mirostat-lr N Mirostat learning rate, parameter eta (default: 0.1)
--mirostat-ent N Mirostat target entropy, parameter tau (default: 5.0)

while the server shows:

"params": {
"n_predict": -1,
"seed": 4294967295,
"temperature": 0.800000011920929,
"dynatemp_range": 0.0,
"dynatemp_exponent": 1.0,
"top_k": 40,
"top_p": 0.949999988079071,
"min_p": 0.05000000074505806,
"top_n_sigma": -1.0,
"xtc_probability": 0.0,
"xtc_threshold": 0.10000000149011612,
"typical_p": 1.0,
"repeat_last_n": 64,
"repeat_penalty": 1.0,
"presence_penalty": 0.0,
"frequency_penalty": 0.0,
"dry_multiplier": 0.0,
"dry_base": 1.75,
"dry_allowed_length": 2,
"dry_penalty_last_n": 4096,
"dry_sequence_breakers": [
"\n",
":",
""",
"*"
],
"mirostat": 0,
"mirostat_tau": 5.0,
"mirostat_eta": 0.10000000149011612,
"stop": [],
"max_tokens": -1,
"n_keep": 0,
"n_discard": 0,
"ignore_eos": false,
"stream": true,
"logit_bias": [],
"n_probs": 0,
"min_keep": 0,
"grammar": "",
"grammar_lazy": false,
"grammar_triggers": [],
"preserved_tokens": [],
"chat_format": "Content-only",
"reasoning_format": "none",
"reasoning_in_content": false,
"thinking_forced_open": false,
"samplers": [
"penalties",
"dry",
"top_n_sigma",
"top_k",
"typ_p",
"top_p",
"min_p",
"xtc",
"temperature"
],
"speculative.n_max": 16,
"speculative.n_min": 0,
"speculative.p_min": 0.75,
"timings_per_token": false,
"post_sampling_probs": false,
"lora": []
},

As you can see, the default temperature and the default min_p are different. In your case the temperature was defined in your request, but min_p was not set.
min_p roughly means "a minimum value that a token must reach to be considered at all".
So your server ran with a lower min_p than the CLI.
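For anyone hitting this, a minimal sketch of pinning those samplers explicitly in the request, based on the asker's snippet above. The values simply mirror the CLI defaults quoted earlier and are assumptions to start from, not the one correct configuration; recent llama.cpp builds accept these extra sampling fields on /v1/chat/completions, but check your server version if any of them seem to be ignored.

import requests  # assumed; same HTTP client as in the snippet above

url = "http://127.0.0.1:8080/v1/chat/completions"
data = {
    # same model id as in the asker's snippet
    "model": "lmstudio-community/Qwen2.5-7B-Instruct-GGUF/Qwen2.5-7B-Instruct-Q4_K_M.gguf",
    "messages": [
        {"role": "system", "content": "You are a translator."},
        {"role": "user", "content": "Translate this text from Dutch to English. Keep markdown: <your text here>"}
    ],
    # pin the sampling parameters instead of relying on server defaults;
    # these values mirror the CLI defaults listed earlier in this thread
    "temperature": 0.2,
    "top_k": 40,
    "top_p": 0.9,
    "min_p": 0.1,
    "repeat_penalty": 1.0,
    "seed": 42,        # a fixed seed also makes runs reproducible
    "max_tokens": -1,
    "stream": False
}
response = requests.post(url, json=data)
print(response.json()["choices"][0]["message"]["content"])

If you go through the openai client instead, these non-OpenAI fields can typically be passed via the extra_body argument of client.chat.completions.create().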

0 replies
