I'm trying to benchmark speed. I feed one prompt at a time via the OpenAI API and wait for a complete response before submitting the next request.
However, I get multiple speed readings for long prompts. I guess it's splitting them into multiple batches?
Is there a way to configure it so that it also reports the overall speed for the entire request?
I'm running vLLM like this:
vllm serve Qwen/Qwen3-30B-A3B-FP8 --max-model-len 34100 --tensor-parallel-size 2 --max-log-len 200 --disable-uvicorn-access-log --no-enable-prefix-caching > log.txt
I disabled prefix caching to make sure every request gets processed fresh, without any prompt caching.
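For reference, my client side looks roughly like this (a minimal sketch; the prompt list is a placeholder, and the system message and sampling parameters match the log below):

```python
# Minimal sketch of the benchmark client: one prompt at a time, waiting for
# the full (non-streaming) response before sending the next request.
# Assumes the openai Python client and the default vLLM port (8000).
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

prompts = ["Provide a summary as well as a detail analysis of the following: ..."]

for prompt in prompts:
    start = time.perf_counter()
    response = client.chat.completions.create(
        model="Qwen/Qwen3-30B-A3B-FP8",
        messages=[
            {"role": "system", "content": "You are a helpful assistant. /no_think"},
            {"role": "user", "content": prompt},
        ],
        temperature=0.7,
        top_p=0.8,
        max_tokens=2000,
    )
    elapsed = time.perf_counter() - start
    print(f"{response.usage.completion_tokens} completion tokens in {elapsed:.2f}s")
```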
Here's the log for one request:
INFO 04-30 12:14:21 [logger.py:39] Received request chatcmpl-eb86ff143abf4dbb91c69374aacea6a2: prompt: '<|im_start|>system\nYou are a helpful assistant. /no_think<|im_end|>\n<|im_start|>user\nProvide a summary as well as a detail analysis of the following:\nPortugal (Portuguese pronunciation: [puɾtuˈɣal] ),', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=2000, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO 04-30 12:14:21 [async_llm.py:252] Added request chatcmpl-eb86ff143abf4dbb91c69374aacea6a2.
INFO 04-30 12:14:26 [loggers.py:111] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 41.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 14.0%, Prefix cache hit rate: 0.0%
INFO 04-30 12:14:36 [loggers.py:111] Engine 000: Avg prompt throughput: 3206.6 tokens/s, Avg generation throughput: 19.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 31.6%, Prefix cache hit rate: 0.0%
INFO 04-30 12:14:46 [loggers.py:111] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 77.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 32.3%, Prefix cache hit rate: 0.0%
INFO 04-30 12:14:56 [loggers.py:111] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 47.6 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 04-30 12:15:06 [loggers.py:111] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
Thanks so much!
Replies: 1 comment
Good question—this is a classic benchmark pitfall (mapped to ProblemMap No.12: "Nonuniform batch slicing & speed reporting").
By default, vLLM splits very long prompts into multiple internal batches or "chunks" (chunked prefill) to keep throughput high, which is especially visible with a large max_model_len and long prompts.
The "Avg prompt throughput" and "Avg generation throughput" lines are averages over each logging interval (every 10 seconds in the log above), not over the whole end-to-end request (from prompt receipt to the last generated token).
If you want the true overall speed for a single long prompt, you need to:
- Log timestamps at both the start and the final completion of the request, outside the per-interval reporting.
- Compute:
(total tokens generated + prompt tokens) / (completion_time - start_time)
This gives you the real end-to-end tokens/sec, including all of vLLM's internal scheduling and queuing overhead (see the client-side sketch after this list).
- (Optional) Patch the vLLM API server to emit a custom "request_total_throughput" metric per completed request.
- Always disable prefix caching for a fair comparison (as you already do).
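Concretely, the client-side calculation can be done like this (a minimal sketch, assuming the openai Python client pointed at the default vLLM server address; the prompt is a placeholder, and token counts come from the usage field in the response):

```python
# Minimal sketch: end-to-end tokens/sec for one request, timed on the client
# so it includes all server-side queuing and scheduling overhead.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def measure_request(messages, model="Qwen/Qwen3-30B-A3B-FP8", max_tokens=2000):
    start = time.perf_counter()
    response = client.chat.completions.create(
        model=model, messages=messages, max_tokens=max_tokens
    )
    elapsed = time.perf_counter() - start
    usage = response.usage  # prompt/completion token counts reported by the server
    total_tps = (usage.prompt_tokens + usage.completion_tokens) / elapsed
    gen_tps = usage.completion_tokens / elapsed
    print(f"{elapsed:.2f}s: {total_tps:.1f} tok/s end-to-end, "
          f"{gen_tps:.1f} tok/s generation")
    return total_tps

measure_request([{"role": "user", "content": "your long prompt here"}])
```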
This measurement removes hidden "gaps" caused by micro-batch scheduling, pipeline bubbles, and token handoff.
For a full checklist on fair benchmarking and speed reporting (and to avoid accidental over-reporting on multi-GPU or multi-query setups), see:
https://github.com/onestardao/WFGY/blob/main/ProblemMap/README.md
Let me know if you want an example script or custom log parser.
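In the meantime, here is a rough sketch of a log parser for the per-interval lines above (it assumes the loggers.py format shown in your log; as noted, interval averages are only a proxy for end-to-end speed):

```python
# Rough sketch: aggregate the per-interval throughput readings from a vLLM log.
import re
import sys

pattern = re.compile(
    r"Avg prompt throughput: ([\d.]+) tokens/s, "
    r"Avg generation throughput: ([\d.]+) tokens/s"
)

prompt_rates, gen_rates = [], []
with open(sys.argv[1]) as f:  # e.g. python parse_log.py log.txt
    for line in f:
        m = pattern.search(line)
        if m:
            prompt_rates.append(float(m.group(1)))
            gen_rates.append(float(m.group(2)))

# Each reading covers one fixed logging interval, so the mean of the
# non-zero readings is only a rough proxy for per-request speed.
active = [r for r in gen_rates if r > 0]
if active:
    print(f"mean generation throughput while active: "
          f"{sum(active) / len(active):.1f} tok/s over {len(active)} intervals")
```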