Add support for different reasoning fields #15362

aldehir started this conversation in Ideas
Discussion options

Several clients do not support the reasoning_content field, and it seems like both clients and inference servers have converged on a reasoning field.

One such example is OpenAI's own compatibility test for GPT-OSS: llama.cpp fails 30/30 tests simply because it places the reasoning in reasoning_content rather than reasoning.

The openai-agents-js library from OpenAI now includes this functionality (openai/openai-agents-js@7b437d9). I've verified that it sends the reasoning back within the message, which satisfies the gpt-oss spec's requirement that the CoT accompany the final tool call message.
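To make the difference concrete, the two response shapes look roughly like this (a hand-written TypeScript sketch; the values are illustrative, not actual server output):

// Sketch only: illustrative values, not real output.
// What llama.cpp currently returns in a Chat Completions response:
const currentShape = {
  choices: [{
    message: {
      role: "assistant",
      content: "The answer is 4.",
      reasoning_content: "The user asks for 2+2, which is 4.", // CoT lives here today
    },
  }],
};

// What clients such as the OpenAI Agents SDK look for instead:
const proposedShape = {
  choices: [{
    message: {
      role: "assistant",
      content: "The answer is 4.",
      reasoning: "The user asks for 2+2, which is 4.", // same CoT, different key
    },
  }],
};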

Analysis (with reasoning_content field)

> start
> tsx index.ts -k 1 --provider=llamacpp
❯ Processing case 0 (attempt 1)
❯ Processing case 1 (attempt 1)
❯ Processing case 2 (attempt 1)
❯ Processing case 3 (attempt 1)
❯ Processing case 4 (attempt 1)
› Case 0 (attempt 1): Failed 
✔ Processing case 0 (attempt 1)
❯ Processing case 5 (attempt 1)
› Case 2 (attempt 1): Failed 
✔ Processing case 2 (attempt 1)
❯ Processing case 6 (attempt 1)
› Case 6 (attempt 1): Failed 
✔ Processing case 6 (attempt 1)
❯ Processing case 7 (attempt 1)
› Case 3 (attempt 1): Failed 
✔ Processing case 3 (attempt 1)
❯ Processing case 8 (attempt 1)
› Case 5 (attempt 1): Failed 
✔ Processing case 5 (attempt 1)
❯ Processing case 9 (attempt 1)
› Case 7 (attempt 1): Failed 
✔ Processing case 7 (attempt 1)
❯ Processing case 10 (attempt 1)
› Case 8 (attempt 1): Failed 
✔ Processing case 8 (attempt 1)
❯ Processing case 11 (attempt 1)
› Case 10 (attempt 1): Failed 
✔ Processing case 10 (attempt 1)
❯ Processing case 12 (attempt 1)
› Case 11 (attempt 1): Failed 
✔ Processing case 11 (attempt 1)
❯ Processing case 13 (attempt 1)
› Case 9 (attempt 1): Failed 
✔ Processing case 9 (attempt 1)
❯ Processing case 14 (attempt 1)
› Case 12 (attempt 1): Failed 
✔ Processing case 12 (attempt 1)
❯ Processing case 15 (attempt 1)
› Case 13 (attempt 1): Failed 
✔ Processing case 13 (attempt 1)
❯ Processing case 16 (attempt 1)
› Case 14 (attempt 1): Failed 
✔ Processing case 14 (attempt 1)
❯ Processing case 17 (attempt 1)
› Case 17 (attempt 1): Failed 
✔ Processing case 17 (attempt 1)
❯ Processing case 18 (attempt 1)
› Case 16 (attempt 1): Failed Tool call with wrong arguments but correct schema. Check logs for full details. Not failing this test. Parsed: {"data":[[10],[20],[30]],"chart_type":"bar","title":"Sales Q1–Q3","x_label":"Quarter","y_label":"Sales"} Expected: {"data":[[1,10],[2,20],[3,30]],"chart_type":"bar","title":"Quarterly Sales"}
✔ Processing case 16 (attempt 1)
❯ Processing case 19 (attempt 1)
› Case 19 (attempt 1): Failed Tool call with wrong arguments but correct schema. Check logs for full details. Not failing this test. Parsed: {"data":[[100],[150],[120]],"chart_type":"bar","title":"Visits per Day","x_label":"Day","y_label":"Visits"} Expected: {"data":[[1,100],[2,150],[3,120]],"chart_type":"bar","title":"Daily Visits","y_label":"Visitors"}
✔ Processing case 19 (attempt 1)
❯ Processing case 20 (attempt 1)
› Case 18 (attempt 1): Failed Tool call with wrong arguments but correct schema. Check logs for full details. Not failing this test. Parsed: {"data":[[70,72,68,65]],"chart_type":"line","title":"Temperature over 4 days","x_label":"Day","y_label":"Temperature"} Expected: {"data":[[1,70],[2,72],[3,68],[4,65]],"chart_type":"line","x_label":"Day"}
✔ Processing case 18 (attempt 1)
❯ Processing case 21 (attempt 1)
› Case 20 (attempt 1): Failed 
✔ Processing case 20 (attempt 1)
❯ Processing case 22 (attempt 1)
› Case 21 (attempt 1): Failed Tool call with wrong arguments but correct schema. Check logs for full details. Not failing this test. Parsed: {"table":"orders","columns":["order_id","amount"],"filters":"status = 'shipped'","limit":100} Expected: {"table":"orders","columns":["order_id","amount"],"filters":"status = 'shipped'"}
✔ Processing case 21 (attempt 1)
❯ Processing case 23 (attempt 1)
› Case 22 (attempt 1): Failed Tool call with wrong arguments but correct schema. Check logs for full details. Not failing this test. Parsed: {"table":"products","columns":["name","price"],"filters":"","order_by":"price DESC"} Expected: {"table":"products","columns":["name","price"],"limit":10,"order_by":"price DESC"}
✔ Processing case 22 (attempt 1)
❯ Processing case 24 (attempt 1)
› Case 4 (attempt 1): Failed 
✔ Processing case 4 (attempt 1)
❯ Processing case 25 (attempt 1)
› Case 1 (attempt 1): Failed 
✔ Processing case 1 (attempt 1)
❯ Processing case 26 (attempt 1)
› Case 25 (attempt 1): Failed 
✔ Processing case 25 (attempt 1)
❯ Processing case 27 (attempt 1)
› Case 27 (attempt 1): Failed 
✔ Processing case 27 (attempt 1)
❯ Processing case 28 (attempt 1)
› Case 26 (attempt 1): Failed 
✔ Processing case 26 (attempt 1)
❯ Processing case 29 (attempt 1)
› Case 28 (attempt 1): Failed 
✔ Processing case 28 (attempt 1)
› Case 29 (attempt 1): Failed 
✔ Processing case 29 (attempt 1)
› Case 23 (attempt 1): Failed Tool call with wrong arguments but correct schema. Check logs for full details. Not failing this test. Parsed: {"table":"audit_log","columns":["*"],"filters":"","limit":3} Expected: {"table":"audit_log","columns":["id","timestamp","action"],"limit":3}
✔ Processing case 23 (attempt 1)
› Case 24 (attempt 1): Failed Tool call with wrong arguments but correct schema. Check logs for full details. Not failing this test. Parsed: {"table":"customers","columns":["name","city"],"filters":"city = 'Berlin'","limit":100} Expected: {"table":"customers","columns":["name","city"],"filters":"city = 'Berlin'"}
✔ Processing case 24 (attempt 1)
› Case 15 (attempt 1): Failed Tool call with wrong arguments but correct schema. Check logs for full details. Not failing this test. Parsed: {"data":[[1,2],[2,4],[3,9]],"chart_type":"line","title":"Sample Line Chart","x_label":"X","y_label":"Y"} Expected: {"data":[[1,2],[2,4],[3,9]],"chart_type":"line"}
✔ Processing case 15 (attempt 1)
Results written to /home/alde/dev/github.com/openai/gpt-oss/compatibility-test/rollout_llamacpp_20250816_132634.jsonl
Summary:
 Provider: llamacpp
 Total input cases: 30
 Tries: 1
 Total tasks: 30
 Total runs: 30
 Invalid Chat Completions API responses: 30 (out of 30)
 pass@k (k=1..1): 1=0.000
 pass^k (k=1..1): 1=0.000
 pass@k (k=1): 0.000
 pass^k (k=1): 0.000
 Wrong-input tool calls: 8
 Invalid cases.jsonl lines: 0
 Analysis written to /home/alde/dev/github.com/openai/gpt-oss/compatibility-test/analysis_llamacpp_20250816_132634.json

When modified to use reasoning, llama.cpp passes 30/30 tests with reasoning_effort = low.

Analysis (with reasoning field)
> start
> tsx index.ts -k 1 --provider=llamacpp
❯ Processing case 0 (attempt 1)
❯ Processing case 1 (attempt 1)
❯ Processing case 2 (attempt 1)
❯ Processing case 3 (attempt 1)
❯ Processing case 4 (attempt 1)
› Case 0 (attempt 1): Success 
✔ Processing case 0 (attempt 1)
❯ Processing case 5 (attempt 1)
› Case 4 (attempt 1): Success 
✔ Processing case 4 (attempt 1)
❯ Processing case 6 (attempt 1)
› Case 3 (attempt 1): Success 
✔ Processing case 3 (attempt 1)
❯ Processing case 7 (attempt 1)
› Case 2 (attempt 1): Success 
✔ Processing case 2 (attempt 1)
❯ Processing case 8 (attempt 1)
› Case 5 (attempt 1): Success 
✔ Processing case 5 (attempt 1)
❯ Processing case 9 (attempt 1)
› Case 1 (attempt 1): Success 
✔ Processing case 1 (attempt 1)
❯ Processing case 10 (attempt 1)
› Case 10 (attempt 1): Success 
✔ Processing case 10 (attempt 1)
❯ Processing case 11 (attempt 1)
› Case 6 (attempt 1): Success 
✔ Processing case 6 (attempt 1)
❯ Processing case 12 (attempt 1)
› Case 11 (attempt 1): Success 
✔ Processing case 11 (attempt 1)
❯ Processing case 13 (attempt 1)
› Case 12 (attempt 1): Success 
✔ Processing case 12 (attempt 1)
❯ Processing case 14 (attempt 1)
› Case 9 (attempt 1): Success 
✔ Processing case 9 (attempt 1)
❯ Processing case 15 (attempt 1)
› Case 14 (attempt 1): Success 
✔ Processing case 14 (attempt 1)
❯ Processing case 16 (attempt 1)
› Case 13 (attempt 1): Success 
✔ Processing case 13 (attempt 1)
❯ Processing case 17 (attempt 1)
› Case 16 (attempt 1): Success Tool call with wrong arguments but correct schema. Check logs for full details. Not failing this test. Parsed: {"data":[[10],[20],[30]],"chart_type":"bar","title":"Sales Q1-Q3","x_label":"Quarter","y_label":"Sales"} Expected: {"data":[[1,10],[2,20],[3,30]],"chart_type":"bar","title":"Quarterly Sales"}
✔ Processing case 16 (attempt 1)
❯ Processing case 18 (attempt 1)
› Case 17 (attempt 1): Success 
✔ Processing case 17 (attempt 1)
❯ Processing case 19 (attempt 1)
› Case 15 (attempt 1): Success Tool call with wrong arguments but correct schema. Check logs for full details. Not failing this test. Parsed: {"data":[[1,2],[2,4],[3,9]],"chart_type":"line","title":"Simple Line Chart","x_label":"X","y_label":"Y"} Expected: {"data":[[1,2],[2,4],[3,9]],"chart_type":"line"}
✔ Processing case 15 (attempt 1)
❯ Processing case 20 (attempt 1)
› Case 19 (attempt 1): Success Tool call with wrong arguments but correct schema. Check logs for full details. Not failing this test. Parsed: {"data":[[100],[150],[120]],"chart_type":"bar","title":"Visits per Day","x_label":"Day","y_label":"Visits"} Expected: {"data":[[1,100],[2,150],[3,120]],"chart_type":"bar","title":"Daily Visits","y_label":"Visitors"}
✔ Processing case 19 (attempt 1)
❯ Processing case 21 (attempt 1)
› Case 18 (attempt 1): Success Tool call with wrong arguments but correct schema. Check logs for full details. Not failing this test. Parsed: {"data":[[70],[72],[68],[65]],"chart_type":"line","title":"Temperature over 4 days","x_label":"Day","y_label":"Temperature"} Expected: {"data":[[1,70],[2,72],[3,68],[4,65]],"chart_type":"line","x_label":"Day"}
✔ Processing case 18 (attempt 1)
❯ Processing case 22 (attempt 1)
› Case 20 (attempt 1): Success 
✔ Processing case 20 (attempt 1)
❯ Processing case 23 (attempt 1)
› Case 21 (attempt 1): Success Tool call with wrong arguments but correct schema. Check logs for full details. Not failing this test. Parsed: {"table":"orders","columns":["order_id","amount"],"filters":"status = 'shipped'","limit":100} Expected: {"table":"orders","columns":["order_id","amount"],"filters":"status = 'shipped'"}
✔ Processing case 21 (attempt 1)
❯ Processing case 24 (attempt 1)
› Case 22 (attempt 1): Success Tool call with wrong arguments but correct schema. Check logs for full details. Not failing this test. Parsed: {"table":"products","columns":["name","price"],"order_by":"price DESC"} Expected: {"table":"products","columns":["name","price"],"limit":10,"order_by":"price DESC"}
✔ Processing case 22 (attempt 1)
❯ Processing case 25 (attempt 1)
› Case 23 (attempt 1): Success Tool call with wrong arguments but correct schema. Check logs for full details. Not failing this test. Parsed: {"table":"audit_log","columns":["*"],"filters":"","limit":3,"order_by":"id ASC"} Expected: {"table":"audit_log","columns":["id","timestamp","action"],"limit":3}
✔ Processing case 23 (attempt 1)
❯ Processing case 26 (attempt 1)
› Case 25 (attempt 1): Success 
✔ Processing case 25 (attempt 1)
❯ Processing case 27 (attempt 1)
› Case 26 (attempt 1): Success 
✔ Processing case 26 (attempt 1)
❯ Processing case 28 (attempt 1)
› Case 24 (attempt 1): Success 
✔ Processing case 24 (attempt 1)
❯ Processing case 29 (attempt 1)
› Case 27 (attempt 1): Success 
✔ Processing case 27 (attempt 1)
› Case 28 (attempt 1): Success 
✔ Processing case 28 (attempt 1)
› Case 29 (attempt 1): Success 
✔ Processing case 29 (attempt 1)
› Case 7 (attempt 1): Success 
✔ Processing case 7 (attempt 1)
› Case 8 (attempt 1): Success 
✔ Processing case 8 (attempt 1)
Results written to /home/alde/dev/github.com/openai/gpt-oss/compatibility-test/rollout_llamacpp_20250816_132243.jsonl
Summary:
 Provider: llamacpp
 Total input cases: 30
 Tries: 1
 Total tasks: 30
 Total runs: 30
 Invalid Chat Completions API responses: 0 (out of 30)
 pass@k (k=1..1): 1=1.000
 pass^k (k=1..1): 1=1.000
 pass@k (k=1): 1.000
 pass^k (k=1): 1.000
 Wrong-input tool calls: 7
 Invalid cases.jsonl lines: 0
 Analysis written to /home/alde/dev/github.com/openai/gpt-oss/compatibility-test/analysis_llamacpp_20250816_132243.json

I'm willing to submit a pull request for this issue, but since it appears to be a feature request, I'm posting it in discussions first to gather feedback.

From what I can recall, agentic coding tools like codex and crush support reasoning but lack reasoning_content support. There are likely other tools that behave similarly.

Thoughts?


Replies: 3 comments 10 replies

Comment options

Since llama-server is supposed to provide OpenAI-compatible endpoints, it makes a lot of sense to support reasoning as the default.

We could add "openai" as an option for --reasoning-format. Operators could then choose between auto, none, deepseek, and openai.

4 replies
Comment options

Would it make sense to send both "reasoning" and "reasoning_content", at least until there is consensus in the community? This would avoid adding an extra argument and complicating the UX.

Comment options

Sending both keys would make the API response really confusing. It should send one or the other.

Adding an argument to make it configurable is a good tradeoff: the server would support more clients in exchange for a bit of code and a tiny bit more complexity.

The server's UX is a steep learning curve for new users, but I think that's OK because it's so configurable/powerful. The UX would benefit more from better docs than from fewer options, IMO.
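For context, a client that wants to tolerate either convention only needs a small shim; a rough TypeScript sketch (hypothetical helper, not part of any SDK):

// Hypothetical helper: read the CoT whichever key the server emits.
interface AssistantMessage {
  role: "assistant";
  content: string | null;
  reasoning?: string;          // field clients like the OpenAI Agents SDK expect
  reasoning_content?: string;  // field llama.cpp emits today
}

function extractReasoning(message: AssistantMessage): string | undefined {
  return message.reasoning ?? message.reasoning_content;
}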

Comment options

OK, makes sense.

I'm getting the impression that reasoning is the more widely adopted field, so it makes sense for it to be the default.

Comment options

Cloud providers like Groq and Cerebras also use "reasoning" for the reasoning tokens. Beyond renaming the response field to "reasoning", it would make sense to document if and how llama-server supports reasoning in incoming requests; this does not appear to be documented anywhere at the moment. It is especially important with models like GPT-OSS, which need the past reasoning because they perform tool calls during reasoning [1].
[1] https://platform.openai.com/docs/guides/reasoning#keeping-reasoning-items-in-context
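To make the request side concrete, passing the prior reasoning back alongside its tool call could look roughly like this (a sketch only; whether llama-server accepts a reasoning field on incoming assistant messages is exactly the part that is undocumented, and the tool call mirrors the generate_chart example shown later in this thread):

// Sketch of a follow-up Chat Completions request body. The "reasoning" field on the
// assistant message is an assumption, per the discussion above.
const followUpRequest = {
  model: "gpt-oss-20b",
  messages: [
    { role: "user", content: "Plot a simple line chart for these points: (1,2),(2,4),(3,9)." },
    {
      role: "assistant",
      content: null,
      reasoning: "Need to call generate_chart function. Plot simple line chart...", // prior CoT
      tool_calls: [{
        id: "call_0",
        type: "function",
        function: {
          name: "generate_chart",
          arguments: JSON.stringify({ chart_type: "line", data: [[1, 2], [2, 4], [3, 9]] }),
        },
      }],
    },
    { role: "tool", tool_call_id: "call_0", content: "{\"image_png_base64\":\"...\"}" },
  ],
};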

Comment options

What are those "Success Tool call with wrong arguments but correct schema" errors about? Is that something that needs fixing?

5 replies
Comment options

aldehir Aug 18, 2025
Collaborator Author

@Mushoz gpt-oss is responding with values for optional parameters, but the tests only expect the required parameters. I don't believe it's indicative of a failure, since they are optional and the model is opting to include them.

Comment options

aldehir Aug 18, 2025
Collaborator Author

Actually, maybe the template isn't properly marking them optional in the system prompt. I'll check it out later.

Comment options

aldehir Aug 19, 2025
Collaborator Author

The generated system prompt looks correct. The 20B model does appear to produce incorrect results in 1 or 2 test cases, but for the rest the message only appears because the model includes optional parameters. This is within the margin of error for OpenAI's compatibility test.

system prompt
<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025年08月19日
Reasoning: medium
# Valid channels: analysis, commentary, final. Channel must be included for every message.
Calls to these tools must go to the commentary channel: 'functions'.<|end|><|start|>developer<|message|># Tools
## functions
namespace functions {
// Creates a base64-encoded PNG chart from tabular data—use for quick visualizations inside chat.
type generate_chart = (_: {
// Type of chart to generate
chart_type: "line" | "bar" | "scatter",
// 2-D numeric data matrix
data: number[][],
// Chart title
title?: string, // default: "",
// Label for the x-axis
x_label?: string, // default: "",
// Label for the y-axis
y_label?: string, // default: "",
}) => any;
} // namespace functions<|end|><|start|>user<|message|>Plot a simple line chart for these points: (1,2),(2,4),(3,9).<|end|><|start|>assistant<|channel|>analysis<|message|>Need to call generate_chart function. Plot simple line chart. Data as numeric matrix? Probably 2-D matrix of coordinates. Provide data: [[1,2],[2,4],[3,9]] and chart_type 'line'. Title? maybe "Simple Line Chart". Provide x_label and y_label maybe "X" and "Y". Use the function.<|end|><|start|>assistant to=functions.generate_chart<|channel|>commentary json<|message|>{"chart_type":"line","data":[[1,2],[2,4],[3,9]],"title":"Simple Line Chart","x_label":"X","y_label":"Y"}<|call|><|start|>functions.generate_chart to=assistant<|channel|>commentary<|message|>{"image_png_base64":"iVBORw0KGgoAAAANSUhEUgAA..."}<|end|><|start|>assistant

I also found this:

There is currently no generally agreed upon specification in the community with the general properties on a message being either reasoning or reasoning_content. To be compatible with clients like the OpenAI Agents SDK we recommend using a reasoning field as the primary property for the raw CoT in Chat Completions.

https://cookbook.openai.com/articles/gpt-oss/verifying-implementations#chat-completions

Comment options

Thanks for double checking!

Comment options

Those errors are just there because we verify whether the content matches what we expected, but in some cases the model returns slightly different data. It's not a big deal, as it's most likely a model performance or prompting issue, which is why it's not treated as a failure.

Comment options

Are there still plans to implement this?

1 reply
Comment options

aldehir Oct 25, 2025
Collaborator Author

I no longer have plans to implement this, per #15408 (comment). It seems the preferred solution is to implement the OpenAI Responses API, which supports passing along the CoT in both stateful and stateless/ZDR operation.

From what I can tell, not many clients have adopted passing back the CoT, so even if this were changed there would still be a lack of client support. In contrast, implementing a Responses API endpoint would address this issue in the clients that support it.
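For reference, the CoT pass-through on the Responses API side looks roughly like this against OpenAI's hosted API (how a llama.cpp implementation would expose it is still open, and the model name is a placeholder):

import OpenAI from "openai";

const client = new OpenAI();

// Stateful: the server retains the reasoning items; later turns chain by response id.
const first = await client.responses.create({
  model: "gpt-oss-20b", // placeholder model name
  input: "Plot a simple line chart for (1,2),(2,4),(3,9).",
});
const second = await client.responses.create({
  model: "gpt-oss-20b",
  previous_response_id: first.id,
  input: "Now make it a bar chart.",
});

// Stateless / ZDR: nothing is stored server-side; request encrypted reasoning content
// and pass the returned output items back verbatim on the next request, so the CoT
// travels with the conversation instead of living on the server.
const zdr = await client.responses.create({
  model: "gpt-oss-20b",
  store: false,
  include: ["reasoning.encrypted_content"],
  input: "Plot a simple line chart for (1,2),(2,4),(3,9).",
});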
