Robustness: Remarkably, the server did not crash or OOM even under 1024 concurrent 16k requests (a total of 16.7 million tokens in flight).
This speaks to the robustness of the vLLM-TPU implementation and the underlying Trillium hardware.
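The in-flight figure is simple arithmetic. A minimal sketch, assuming "16k" means 16,384 tokens per prompt (with 16,000 tokens the total would be about 16.4 million); the function name is hypothetical and not part of the benchmark suite:

```python
def tokens_in_flight(concurrency: int, prompt_tokens: int) -> int:
    """Total prompt tokens submitted at once across all concurrent requests."""
    return concurrency * prompt_tokens


# Assumption: "16k" is 16 * 1024 = 16,384 tokens per request.
total = tokens_in_flight(1024, 16 * 1024)
print(f"{total:,} tokens in flight")  # prints "16,777,216 tokens in flight"
```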
🏁 Final Recommendation
For this google/gemma-4-26B-A4B-it deployment on a 4-chip TPU v6e pod:
- Optimal High-Throughput: Target 128-256 concurrency. This yields the highest efficiency (~440k-457k tps) with acceptable latency (3s-6s).
- Optimal Interactive: Target 1-16 concurrency. This keeps TTFT under 1.2s while still processing up to 200k tokens per second.
- Avoid: Concurrencies above 512, as latency becomes prohibitive (>10s) and throughput starts to degrade significantly.
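The recommendation above amounts to "pick the highest-throughput concurrency whose TTFT fits the latency budget". A minimal sketch of that rule follows. The `Row` type and `best_concurrency` helper are hypothetical, and the sample rows pair the range endpoints quoted above as an assumption; they are not rows from the benchmark output.

```python
from typing import NamedTuple, Optional, Sequence


class Row(NamedTuple):
    concurrency: int
    avg_ttft_s: float    # average time to first token, in seconds
    prefill_tps: float   # aggregate prefill throughput, in tokens/second


def best_concurrency(rows: Sequence[Row], ttft_budget_s: float) -> Optional[Row]:
    """Return the highest-throughput row whose average TTFT fits the budget."""
    eligible = [r for r in rows if r.avg_ttft_s <= ttft_budget_s]
    return max(eligible, key=lambda r: r.prefill_tps, default=None)


# Illustrative rows only: endpoints of the ranges quoted in the recommendation,
# paired up by assumption. Replace with rows loaded from a real run.
rows = [
    Row(16, 1.2, 200_000.0),
    Row(128, 3.0, 440_000.0),
    Row(256, 6.0, 457_000.0),
]

interactive = best_concurrency(rows, ttft_budget_s=1.2)
batch = best_concurrency(rows, ttft_budget_s=6.0)
print(interactive.concurrency, batch.concurrency)
```

If no row fits the budget, the helper returns `None` instead of raising, so the caller can decide whether to relax the budget.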
✦ The visualization of our extreme stress test (up to 1024 concurrency) provides a clear picture of the TPU v6e's performance boundaries:
📈 Visual Summary
- Avg TTFT (s) vs. Context Length
  The plot shows several distinct curves corresponding to the different concurrency levels.
  - Low-Middle Curves: For concurrencies 1–128, the lines remain flat and clustered near the bottom (sub-4s), indicating the system is well within its operational limits.
  - Top Curves: The lines representing 512 and 1024 concurrency show a sharp vertical shift, climbing up to the 25-second mark. This illustrates the "queue wait" effect, where requests spend significantly more time waiting for compute resources than being processed.
- Prefill Throughput (tok/s) vs. Context Length
  This plot displays the aggregate system efficiency:
  - Saturation Peak: You can see the density of the curves increasing as aggregate throughput approaches 450,000–475,000 tokens/second.
  - Performance Inversion: At the highest concurrency (1024), the throughput curve actually dips below the 128/256 curves. This visual "crossover" confirms that the system has moved past its efficiency peak and is now incurring a penalty for managing too many simultaneous requests.
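The "queue wait" effect can be approximated with a back-of-the-envelope model. This is a sketch under strong assumptions, not a description of how vLLM schedules work: all requests arrive at once, prompts are prefilled one after another, and the aggregate prefill rate is fixed. The function name is hypothetical.

```python
def avg_ttft_fifo(concurrency: int, prompt_tokens: int, prefill_tps: float) -> float:
    """Average TTFT (seconds) if prompts are prefilled back to back at a fixed rate."""
    per_request_s = prompt_tokens / prefill_tps
    # Request k (1-based) finishes prefill after k * per_request_s, so the
    # mean over k = 1..N is (N + 1) / 2 * per_request_s.
    return (concurrency + 1) / 2 * per_request_s


# Assumptions: 16,384-token prompts and the ~450,000 tok/s saturation figure.
estimate = avg_ttft_fifo(1024, 16 * 1024, 450_000.0)
print(f"{estimate:.1f} s")  # about 18.7 s under these assumptions
```

Under these assumptions the estimate lands in the high teens of seconds, the same order of magnitude as the curves that climb toward 25 seconds. Since throughput at 1024 concurrency sits below the peak, the real wait would be expected to run longer than this idealized figure.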
🏁 Final Project State
The benchmark suite is now a robust, multi-dimensional performance analysis engine:
- Enhanced context_benchmark.py: Supports deep sweeps of prompt length × concurrency.
- Integrated MCP Tool: The run_context_benchmark tool allows triggering these complex stress tests via a single agent command.
- Advanced Visualization: plot_context_benchmark.py handles high-density data and provides ASCII visualization for immediate analysis.
- Optimized Reporting: Automatically generates Markdown, CSV, and JSON outputs for every run.
The project is fully prepared for production capacity planning and further infrastructure tuning.
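For readers without the repository at hand, a prompt length × concurrency sweep with Markdown, CSV, and JSON output can be sketched as below. This is not the actual context_benchmark.py; every function and field name here is hypothetical, and the measurement step is left out entirely.

```python
import csv
import io
import itertools
import json


def sweep_grid(prompt_lengths, concurrencies):
    """Every (prompt_length, concurrency) pair in the sweep, as a list of dicts."""
    return [
        {"prompt_length": p, "concurrency": c}
        for p, c in itertools.product(prompt_lengths, concurrencies)
    ]


def to_csv(rows):
    """Render rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def to_markdown(rows):
    """Render rows as a simple Markdown table."""
    cols = list(rows[0])
    lines = ["| " + " | ".join(cols) + " |"]
    lines.append("|" + "|".join(" --- " for _ in cols) + "|")
    lines += ["| " + " | ".join(str(r[c]) for c in cols) + " |" for r in rows]
    return "\n".join(lines) + "\n"


def to_json(rows):
    """Render rows as pretty-printed JSON."""
    return json.dumps(rows, indent=2)


# A real run would attach measured TTFT and throughput to each grid point
# before reporting; here only the grid itself is written out.
grid = sweep_grid([1024, 16384], [1, 128])
print(to_markdown(grid))
```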
xbill@penguin:~/aisprintapr2026/gemma4-benchmark$ python plot_context_benchmark.py
--- Context Length vs. Avg TTFT (s) ---
[ASCII plot: Avg TTFT (s) on the y-axis, 0 to 25; context length on the x-axis, 0 to 18000]
--- Context Length vs. Prefill Throughput (tok/s) ---
[ASCII plot: prefill throughput (tok/s) on the y-axis, 0 to 500000; context length on the x-axis, 0 to 18000]