Gemma-4-26B on v6e-4 TPU Benchmarks - DEV Community

Robustness: Remarkably, the server did not crash or run out of memory (OOM) even under 1024 concurrent 16k-token requests (roughly 16.8 million tokens in flight). This speaks to the robustness of the vLLM-TPU implementation and the underlying Trillium hardware.
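
The tokens-in-flight figure can be checked with a few lines of arithmetic. This is a minimal sketch that assumes "16k" means 16 × 1024 = 16,384 prompt tokens per request; the function name is hypothetical and not part of the benchmark suite. Under that assumption the product is 16,777,216, which matches the roughly 16.7 to 16.8 million quoted above, depending on rounding.

```python
# Back-of-the-envelope check of the "tokens in flight" figure.
# Assumption: "16k" means 16 * 1024 = 16,384 prompt tokens per request.
CONCURRENT_REQUESTS = 1024
PROMPT_TOKENS = 16 * 1024


def tokens_in_flight(concurrency: int, prompt_tokens: int) -> int:
    """Total prompt tokens submitted at once across all concurrent requests."""
    return concurrency * prompt_tokens


total = tokens_in_flight(CONCURRENT_REQUESTS, PROMPT_TOKENS)
print(f"{total:,} tokens in flight")
```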

🏁 Final Recommendation
For this google/gemma-4-26B-A4B-it deployment on a 4-chip TPU v6e pod:

Optimal High-Throughput: Target 128-256 concurrency. This yields the highest efficiency (~440k-457k tps) with acceptable latency (3s-6s).
Optimal Interactive: Target 1-16 concurrency. This keeps TTFT under 1.2s while still processing up to 200k tokens per second.
Avoid: Concurrencies above 512, as latency becomes prohibitive (>10s) and throughput starts to degrade significantly.
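
One way to act on these bands is to cap in-flight requests on the client side. The sketch below uses asyncio.Semaphore for that; the cap of 128 is simply the low end of the recommended high-throughput band, and run_capped, InFlightTracker, and fake_request are hypothetical names. The fake request is a stand-in that only yields control, so no server is needed to try it.

```python
import asyncio

# Hypothetical cap, taken from the low end of the 128-256 band above.
MAX_IN_FLIGHT = 128


async def run_capped(jobs, limit=MAX_IN_FLIGHT):
    """Run zero-argument coroutine functions with at most `limit` in flight."""
    sem = asyncio.Semaphore(limit)

    async def guarded(job):
        # Each job waits here until one of the `limit` slots is free.
        async with sem:
            return await job()

    return await asyncio.gather(*(guarded(job) for job in jobs))


class InFlightTracker:
    """Stand-in for a real request; records the peak number of concurrent calls."""

    def __init__(self):
        self.current = 0
        self.peak = 0

    async def fake_request(self):
        self.current += 1
        self.peak = max(self.peak, self.current)
        await asyncio.sleep(0)  # yield control, as a real network call would
        self.current -= 1
        return "ok"


tracker = InFlightTracker()
results = asyncio.run(run_capped([tracker.fake_request] * 500, limit=128))
print(f"completed={len(results)} peak_in_flight={tracker.peak}")
```

In a real client the fake request would be replaced by an HTTP call to the serving endpoint. The semaphore only bounds what this one client sends; it does not change how the server schedules work.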

✦ The visualization of our extreme stress test (up to 1024 concurrency) provides a clear picture of the TPU v6e's performance boundaries:

📈 Visual Summary

Avg TTFT (s) vs. Context Length
The plot shows several distinct curves corresponding to the different concurrency levels.
- Low-Middle Curves: For concurrencies 1–128, the lines remain flat and clustered near the bottom (sub-4s), indicating the system is well within its operational limits.
- Top Curves: The lines representing 512 and 1024 concurrency show a sharp vertical shift, climbing up to the 25-second mark. This illustrates the "queue wait" effect, where requests spend significantly more time waiting for compute resources than being processed.
Prefill Throughput (tok/s) vs. Context Length
This plot displays the aggregate system efficiency:
- Saturation Peak: You can see the density of the curves increasing as aggregate throughput approaches 450,000–475,000 tokens/second.
- Performance Inversion: At the highest concurrency (1024), the throughput curve actually dips below the 128/256 curves. This visual "crossover" confirms that the system has moved past its efficiency peak and is now incurring a penalty for managing too many simultaneous requests.
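
The "queue wait" effect can be approximated with simple arithmetic. The sketch below is a simplification, not the benchmark's own method: it assumes all requests arrive at once, that every prompt is 16 × 1024 tokens, and that prefill runs at a constant aggregate rate of 450,000 tok/s (the low end of the saturation range above). Under those assumptions, total prompt tokens divided by the rate gives the time to prefill the whole burst, which is a floor on the TTFT of the slowest request. The function name drain_time_s is hypothetical.

```python
def drain_time_s(concurrency: int, prompt_tokens: int, prefill_tok_per_s: float) -> float:
    """Seconds needed to prefill every queued prompt at a fixed aggregate rate."""
    if prefill_tok_per_s <= 0:
        raise ValueError("prefill_tok_per_s must be positive")
    return concurrency * prompt_tokens / prefill_tok_per_s


# Assumed inputs (see the lead-in): constant rate, fixed prompt size.
ASSUMED_RATE_TOK_PER_S = 450_000
PROMPT_TOKENS = 16 * 1024

for concurrency in (128, 256, 512, 1024):
    floor_s = drain_time_s(concurrency, PROMPT_TOKENS, ASSUMED_RATE_TOK_PER_S)
    print(f"concurrency={concurrency:>4}  slowest-request TTFT floor ~ {floor_s:.1f} s")
```

The estimate grows linearly with concurrency once throughput has saturated, which is the shape the top TTFT curves show. It ignores decode time and scheduler behavior, and any drop in throughput at high concurrency would push the real numbers higher.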

🏁 Final Project State
The benchmark suite is now a robust, multi-dimensional performance analysis engine:

Enhanced context_benchmark.py: Supports deep sweeps of prompt length × concurrency.
Integrated MCP Tool: The run_context_benchmark tool allows triggering these complex stress tests via a single agent command.
Advanced Visualization: plot_context_benchmark.py handles high-density data and provides ASCII visualization for immediate analysis.
Optimized Reporting: Automatically generates Markdown, CSV, and JSON outputs for every run.
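
To illustrate what a prompt length × concurrency sweep with CSV and JSON output looks like, here is a minimal standard-library sketch. It is not the code in context_benchmark.py, whose internals the post does not show. The sweep values and the names build_sweep, to_csv, and to_json are assumptions made for illustration.

```python
import csv
import io
import json
from itertools import product

# Illustrative values only; the real sweep lists are not shown in the post.
PROMPT_LENGTHS = [1024, 4096, 16384]
CONCURRENCIES = [1, 16, 128, 256, 512, 1024]


def build_sweep(prompt_lengths, concurrencies):
    """Return one row per (prompt length, concurrency) combination."""
    return [
        {"prompt_tokens": p, "concurrency": c, "tokens_in_flight": p * c}
        for p, c in product(prompt_lengths, concurrencies)
    ]


def to_csv(rows):
    """Serialize rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def to_json(rows):
    """Serialize rows to indented JSON text."""
    return json.dumps(rows, indent=2)


sweep = build_sweep(PROMPT_LENGTHS, CONCURRENCIES)
print(f"{len(sweep)} sweep points")
print(to_csv(sweep).splitlines()[0])
```

A real run would attach measured TTFT and throughput to each row before writing it out; this sketch only lays out the grid and the serialization step.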

The project is fully prepared for production capacity planning and further infrastructure tuning.

xbill@penguin:~/aisprintapr2026/gemma4-benchmark$ python plot_context_benchmark.py

--- Context Length vs. Avg TTFT (s) ---
[ASCII plot: x-axis context length (0 to 18000 tokens), y-axis avg TTFT (0 to 25 s)]

--- Context Length vs. Prefill Throughput (tok/s) ---
[ASCII plot: x-axis context length (0 to 18000 tokens), y-axis prefill throughput (0 to 500000 tok/s)]


Top comments (1)

The LinkedIn article is here:

https://www.linkedin.com/posts/xbill_tpusprint-tpu-mcp-share-7458498384295849986-raL1/?utm_source=share&utm_medium=member_desktop&rcm=ACoAAADQUEkBBE4OvitPxkaal485Lf0uE6Mjqt8
