Practical Gemma 4 Benchmarking with LM Studio

DEV Community

Windows Task Manager Performance tab showing the Intel AI Boost NPU at 0% utilization alongside GPU graphs.

Adjust Settings Carefully

LM Studio can expose runtime settings such as context length, offload behavior, and other advanced options depending on the model and version. These settings matter, but they should be changed carefully.

For benchmarking, I prefer to change one variable at a time. If I change model size, quantization, context length, and offload behavior all at once, then I cannot tell which change caused the difference.

A better approach is:

Pick one model package.
Load it with reasonable defaults.
Record baseline behavior.
Change one setting if needed.
Record what changed.

That keeps the benchmark useful instead of turning it into guesswork.

A few of the settings that can affect results include:

Context length: Larger context windows require more memory, especially because of the KV cache.
GPU offload: Controls how much of the model work is moved to the GPU instead of staying on the CPU.
CPU thread pool size: Affects how much CPU parallelism the runtime may use.
Evaluation batch size: Can affect throughput and memory behavior during generation.

In this experiment, the main setting I modified was GPU offload. For a smaller set of tests, I may also have increased the context length, possibly doubling the default context window. I did not rely on evaluation batch size as a primary tuning variable for these results.

LM Studio model runtime settings showing context length, GPU offload layers, CPU thread pool size, evaluation batch size, and max concurrent predictions.

One important detail: in LM Studio, GPU Offload is not a percentage. A value such as 19 means that 19 model layers are being offloaded to the GPU. The remaining layers continue to run through the CPU/system RAM path.

That makes GPU offload one of the most important tuning controls for larger models. More layers on the GPU can improve speed, but only if there is enough VRAM left for the model, runtime overhead, and KV cache. If too many layers are offloaded, the model may fail to load, consume nearly all VRAM, or slow down because there is not enough memory headroom left for context and cache.

The practical rule is to increase GPU offload until performance stops improving, VRAM headroom becomes too tight, or the model becomes unstable. For final benchmark numbers, I would want to record the exact GPU offload value, context length, whether KV cache was offloaded to GPU memory, and the observed VRAM usage for each run.

The key lesson is that these settings can change the result. If one model is tested with a different context length, offload level, or batch size than another, the comparison may not be fair. For a clean benchmark, the settings should either remain consistent or be recorded clearly when they differ.

8. Benchmark Methodology

For this article, I treated "usable" as more important than "technically loads."

A model is useful if it:

Loads reliably
Uses the intended GPU
Leaves some memory headroom
Responds at a usable speed
Handles the prompt types I actually care about
Does not make the rest of the system unusable

A model is less useful if it:

Barely loads
Consumes nearly all VRAM
Runs painfully slowly
Becomes unstable as context grows
Requires closing everything else on the machine just to function

That distinction matters because the goal is not simply to prove that a large model can start. The goal is to find models that are practical for real work.

Benchmark Test Categories

The original test plan included several categories:

Load test: Does the model load without crashing?
Memory test: How much VRAM does it consume idle and during generation?
Speed test: How quickly does it respond? How many tokens/sec does LM Studio report when available?
Coding test: Can it refactor or explain TypeScript accurately?
Reasoning test: Can it explain tradeoffs and compare approaches?
Structured output test: Can it return tables, lists, and code blocks cleanly?
Long-context test: Does performance degrade as context grows?

These categories mattered because no single prompt tells the whole story. A tiny model may feel excellent for quick coding help but weaker for deep reasoning. A larger model may produce richer answers but become too slow to use interactively. Unfortunately, the number of types of prompts for a given model are large, and I did not test image/vision parsing or generation in these benchmarks. So be sure to curate your prompt tests in accordance with the general type of work you will be applying for this model.

Benchmark Prompt Selection

A benchmark prompt should not be too trivial. If the prompt only asks for a short answer, the model may finish before there is enough time to observe GPU utilization, VRAM behavior, CPU usage, or throughput. On the other hand, if the prompt is too open-ended or inconsistent, the results may be hard to compare between runs.

For the main sustained-generation test, I used a structured CAP theorem prompt:

Explain how distributed systems handle consistency, availability, and partition tolerance (CAP theorem).
Then:
1. Compare strong consistency vs eventual consistency with real-world examples
2. Describe how databases like Cassandra and PostgreSQL handle tradeoffs differently
3. Provide a step-by-step scenario of a network partition and how each system responds
4. Summarize tradeoffs in a comparison table
Be detailed and structured. Aim for ~600-800 words.

This prompt works well because it forces sustained generation. It asks for explanation, comparison, scenario reasoning, and structured output. It also gives the model a target length, which makes repeated runs easier to compare.

While the model runs, the useful things to watch are:

Metric	What I Wanted to See
GPU utilization	Active use rather than idle behavior
Dedicated VRAM	Stable usage without maxing out
CPU usage	Moderate use, not completely dominant
Output behavior	Smooth generation without long stalls
Tokens/sec	Consistent throughput if visible in LM Studio

The most important benchmarking rule is to use the same exact prompt every time. Otherwise, I may be benchmarking prompt differences instead of model/runtime differences. Even with identical prompts, models have some variation in their outputs and are not deterministic, so for us accuracy or correctness is not a comparison 1 for 1 character by character for the output returned, but in whether the model successfully replies to the prompt with sufficient detail and thought to be useful.

For smaller models, I also tested rapid iteration and practical coding prompts, such as a TypeScript refactor, JavaScript closures, a React follow-up, and a REST API explanation. This mattered because smaller models are not always meant to win deep reasoning tests. Sometimes their best use case is fast iteration.

Other Benchmark Prompts Used

The CAP theorem prompt was the main sustained-generation benchmark, but it was not the only prompt used. I also used shorter prompts to test coding quality, conversational speed, context retention, and general explanation throughput.

Prompt Name	Prompt Type	Intended Model Class	What It Tests	Notes
TypeScript Refactor	Coding / rapid iteration	E4B / E2B, also retested on 26B / 31B	Code quality, type safety, iteration speed	Primary practical coding prompt.
JavaScript Closures	Back-and-forth chat speed	E4B / E2B, also retested on larger models	Latency, clarity, short explanation	Followed by the React use-case prompt.
React Follow-up	Context retention follow-up	E4B / E2B	Context retention and conversational speed	Run immediately after JavaScript Closures.
REST API Speed	Speed / explanation	E4B / E2B	Throughput and general explanation quality	Shorter speed-oriented run.
Database Index Speed	Speed / technical explanation	Optional / all models	Shorter technical throughput test	Useful when CAP is too long.

The TypeScript refactor prompt was:

You are helping me refactor code.
Here is a TypeScript function:
function processData(data: any[]) {
 let result = [];
 for (let i = 0; i < data.length; i++) {
 if (data[i].active === true) {
 result.push(data[i].value * 2);
 }
 }
 return result;
}
1. Refactor this to be more functional and readable
2. Add type safety
3. Explain your changes briefly

The JavaScript closures prompt was:

Explain closures in JavaScript in simple terms, then give 2 examples.

The React follow-up prompt was:

Now show me a real-world use case in React.

The REST API speed prompt was:

Write a detailed explanation of how a REST API works, including request lifecycle, headers, and status codes.

The optional database index prompt was:

Write a detailed explanation of how a database index works, including B-trees, hashing, and query optimization. Provide examples.

Fresh Restart Baseline

Before doing the real tuning pass, I restarted Windows and opened only the tools needed for the test. That gave me a cleaner baseline before loading any model.

For the formal notes, I did not try to create an artificial bare-metal lab environment. Each test began after a restart, once normal startup apps had settled. LM Studio, Task Manager, Snipping Tool, Notepad++ for observations, and Excel for logging were open. Normal background tools such as Bitdefender, JetBrains Toolbox, WebView2, password manager, Steam client, Edge background processes, and other usual services were still present. That is intentional. This was a practical workstation benchmark, not a synthetic lab benchmark.

In that state, the system was generally idle: low CPU usage, no meaningful GPU activity, and about 0.1 to 1.1 GB of GPU memory used before a model was loaded.

That baseline matters. Even before loading a model, the machine is not starting from zero VRAM usage. Windows, displays, LM Studio, CUDA initialization, and other GPU reservations can already consume memory. Total VRAM is not the same thing as available VRAM.

Benchmark Caveats

This was a practical benchmark, not a lab-grade controlled benchmark. Some runs used text notes rather than full screenshots. Some measurements were approximate. Not every run captured tokens/sec. Some prompts were repeated after previous model runs, so runtime caching, warmed filesystem cache, LM Studio state, or other reuse effects may have influenced later runs. I did not verify whether any thinking-mode behavior attempted external searching or benefited from cached information.

That means these results should not be read as universal scores for the models. They describe how these specific model packages behaved on this laptop, using this version of LM Studio, with these runtime settings, under a practical workstation baseline.

9. Results and Observations

After working through the setup and tuning process, the benchmark became more interesting than I expected.

At first, the story looked simple: E2B and E4B were practical, while the larger 26B and 31B models looked too large for interactive work on a 16 GB VRAM laptop.

But after retuning GPU offload, the story changed. The larger models were not simply "too big." They were painful when over-offloaded to the GPU. Reducing GPU offload left more VRAM headroom, shifted more work to CPU and system RAM, and dramatically improved responsiveness.

Summary Table

Model / Config	Context	GPU Offload	Load Time	VRAM After Load	Representative Prompt Time	Practical Feel	Best Use
Gemma 4 E2B	4096	35 / Max	~4 sec	~3.9 GB	CAP: ~10.4 sec	Very fast	lightweight local assistant
Gemma 4 E4B	8192	42	~4.6 sec	~5.4 GB	CAP: ~12.3 sec	Fast, better quality	daily driver candidate
Gemma 4 26B A4B over-offloaded	4096	24	~42 sec	~15.5 GB	CAP: ~32 min	Too slow	example of no-headroom failure
Gemma 4 26B A4B retuned	4096	16	TBD	~12.2 GB	CAP: ~32.8 sec off / ~50.4 sec thinking	Surprisingly usable	larger reasoning with CPU assist
Gemma 4 31B high offload	4096	35	~56.8 sec	~15.4 GB	JS closures stopped around 2 min	Very slow	stress test / over-offload example
Gemma 4 31B retuned	4096	24	~13 sec	~11.6 GB	CAP: ~6 min 1 sec	Usable but slower	large-model experiment

Gemma 4 E2B: The Lightweight Baseline

E2B was the fastest and lightest model in this test set. It loaded in only a few seconds and settled around 3.9 GB of dedicated VRAM after load. That left plenty of headroom on a 16 GB GPU.

For quick prompts, it felt genuinely responsive. The TypeScript refactor, JavaScript closure explanation, React follow-up, REST API explanation, and CAP theorem benchmark all completed quickly enough to feel interactive.

Representative times included:

Prompt	Approx Time
TypeScript refactor	~6.8-7.9 sec
JavaScript closures	~8.9 sec
React follow-up	~7.3 sec
REST API explanation	~7.6 sec
CAP theorem benchmark	~10.4 sec

The main caveat was quality. E2B was fast and useful, but not perfect. In one closure example, it produced a questionable JavaScript snippet using let name = name;, which is the kind of mistake I would want to catch before trusting the output. That makes E2B useful for quick local assistance, but not necessarily the model I would trust most for careful code review.

My practical read: E2B is a great sanity check and a very fast fallback model. It proves the local setup is working and is useful when speed matters more than depth.

Gemma 4 E4B: The Best Practical Balance

E4B was slower than E2B, but it produced stronger, cleaner answers. It used more VRAM, settling around 5.4 GB after load, but that is still well within the available GPU budget on this machine.

The E4B model also ran with a larger context length: 8192 instead of E2B’s 4096. That is important because the comparison is not perfectly apples-to-apples. Even so, E4B still felt practical.

Representative times included:

Prompt	Approx Time
TypeScript refactor	~14.8 sec
JavaScript closures	~11.6 sec
React follow-up	~13.8 sec
REST API explanation	~17.2 sec
CAP theorem benchmark	~12.3 sec

The TypeScript refactor answer was more polished than E2B’s. The closure explanation avoided the obvious bug I saw in the E2B example. The REST API and CAP theorem outputs were also more structured.

My practical read: E4B is the best daily-driver candidate so far. It is not as instant as E2B, but it still feels responsive, leaves plenty of VRAM headroom, and produces better output.

Gemma 4 26B A4B: Initial Over-Offload Result

The 26B A4B model was the first major reality check.

On paper, it is tempting to assume the larger model will be better. In practice, the first configuration was not better for interactive work.

The model loaded, but loading took around 42 seconds. After load, it consumed roughly 15.5 GB of dedicated VRAM. That left almost no practical headroom on a 16 GB GPU.

The first TypeScript refactor run accidentally had thinking mode enabled. That alone took 7 minutes and 44 seconds before the full output completed at about 15 minutes and 7 seconds. Turning thinking off helped, but not enough: the same type of refactor still took about 7 minutes and 32 seconds at roughly 1.03 tokens per second.

The CAP theorem benchmark was even worse for interactive use, taking about 32 minutes and 20 seconds.

This was the clearest example of the difference between "it loads" and "it is useful." The 26B model did run, but it consumed nearly the entire VRAM budget and was not competitive with E4B in that configuration.

Gemma 4 26B A4B: Retuned Result

The "Aha!" Moment: Less GPU Offload Was Faster

Reducing GPU offload from 24 layers to 16 on the 26B A4B model made the model dramatically faster. The lower setting preserved VRAM headroom and let CPU/system RAM participate more effectively.

One of the most surprising results from the benchmark came after reducing GPU offload from 24 layers to 16 layers on the 26B A4B model.

More GPU offload was not automatically better

At first, I expected lower GPU offload to make the model slower. Instead, the opposite happened.

With GPU offload reduced to 16 layers, dedicated VRAM usage dropped from roughly 15.5 GB to about 12.2 GB. Total GPU memory stabilized around 12.4 GB while system RAM rose substantially into the 35-38 GB range. CPU utilization increased into the 40-44% range.

Most importantly, the model became dramatically more responsive.

Representative timings after retuning included:

Prompt	Thinking Mode	Approx Time
TypeScript refactor	Off	~12.85 sec
TypeScript refactor	On	~26.34 sec
JavaScript closures	On	~26.65 sec
CAP theorem benchmark	Off	~32.80 sec
CAP theorem benchmark	On	~50.45 sec

This was shocking compared with the earlier GPU-heavy configuration where some prompts took many minutes.

The practical interpretation is that the earlier configuration was likely over-offloaded to the GPU. The model technically fit, but it left too little VRAM headroom for efficient runtime behavior, KV cache growth, and the rest of the inference pipeline.

By reducing GPU offload, the system allowed more work to flow through CPU and system RAM instead of trying to force nearly everything into constrained GPU memory. Even though CPU utilization increased significantly, the overall runtime improved.

In other words, more GPU offload was not automatically better.

This became one of the most important lessons of the benchmark. A balanced workload between GPU VRAM and system RAM can outperform an over-constrained all-GPU configuration, especially on a 16 GB laptop GPU where memory headroom matters.

Another interesting observation was memory recovery behavior after generation completed. System RAM usage appeared to fall gradually after prompts finished, suggesting that portions of the runtime allocation, cache, or working memory were being reclaimed over time rather than instantly released.

Gemma 4 31B: Initial Stress Test

I also tried loading the Gemma 4 31B model at what appeared to be its default runtime settings: context length 4096, GPU offload 35, CPU thread pool size 9, evaluation batch size 512, and max concurrent predictions 4.

At those settings, the model loaded in about 56.83 seconds. After load, system RAM was around 38.9 GB, dedicated GPU memory was around 15.4 GB, and total GPU memory was around 15.7 GB.

That memory profile looked very similar to the earlier over-offloaded 26B case. The model loaded, but it left very little VRAM headroom.

I started a JavaScript closures prompt with thinking off. It got through the initial thinking phase at around 40 seconds, but the actual response generation was extremely slow. Since the goal is practical usability rather than proving that a huge model can technically grind through a prompt, I stopped the run around the two-minute mark and treated the default 31B setting as an over-offload stress test.

Gemma 4 31B: Retuned Result

After the first 31B attempt at 35 GPU-offloaded layers proved too slow, I reduced GPU offload to 24 layers while keeping context length at 4096, CPU thread pool size at 9, evaluation batch size at 512, and max concurrent predictions at 4.

That changed the behavior dramatically.

At 24 GPU-offloaded layers, the 31B model loaded in about 12.98 seconds. Dedicated GPU memory dropped to about 11.6 GB, with total GPU memory around 11.8 GB. At rest, CPU usage was around 4 percent and system RAM was around 39.4 GB.

During the TypeScript refactor test, system RAM rose to about 40.5 GB, CPU usage reached about 44 percent, dedicated VRAM stayed around 11.8 GB, and total GPU memory was around 12.0 GB. The TypeScript refactor completed in about 1 minute 17.69 seconds, using about 13 percent of the context window.

The CAP theorem benchmark took about 6 minutes 1.13 seconds. LM Studio reported about 4.61 tokens per second, 1699 tokens used, and about 1.99 seconds before the EOS token was found. During that run, CPU usage rose to about 42 percent, system RAM was around 35.2 GB, dedicated VRAM was around 11.7 GB, and total GPU memory was around 11.9 GB.

This was still slower than the retuned 26B model, but it was no longer unusable. Reducing GPU offload again changed the model from a memory-constrained crawl into a model that could complete the benchmark.

One possible additional factor was thermals. During the 31B CAP run, GPU temperature appeared to move between roughly 71 and 78 degrees Celsius. At the higher end of that temperature range, the CPU clock appeared to fall from around 4.77 GHz toward roughly 4.15-4.2 GHz as temperatures came back down. That raises the possibility that heat or power management may have influenced sustained performance. I was not tracking thermals rigorously enough to prove throttling, so this remains an observation rather than a conclusion.

Another useful observation: GPU utilization percentage did not appear to rise above about 25 percent, even while GPU memory use stayed nearly constant. That suggests the bottleneck was not simply raw GPU compute utilization. Memory pressure, CPU participation, thermal behavior, or synchronization overhead may have mattered more than the GPU utilization percentage alone.

The 31B test reinforces the larger lesson: model tuning is not only about pushing more layers onto the GPU. On this laptop, reducing GPU offload preserved VRAM headroom and made a larger model significantly more usable.

10. What Surprised Me

Several things surprised me during this process.

First, the smaller models were more capable than expected. E2B was not perfect, but it was genuinely useful and extremely fast. E4B was even better and felt like a practical local daily driver.

Second, the 26B model was not simply "too large." It was terrible when I over-offloaded it to the GPU, but dramatically better after reducing GPU offload and letting CPU/system RAM participate more.

Third, the 31B model also became more usable after reducing GPU offload. It was still slower than the retuned 26B model, but it crossed from "not worth waiting for" into "this can complete the benchmark."

Fourth, GPU utilization percentage by itself was not enough to explain what was happening. A low GPU utilization percentage did not mean the run was cheap or efficient. GPU memory, CPU utilization, system RAM, thermals, and model settings all mattered.

Fifth, tokens/sec and total completion time are not the same thing. A larger model may generate fewer tokens, or answer more compactly, while still having lower tokens/sec. That means total prompt time depends both on generation speed and on how much the model decides to say.

Finally, the biggest surprise was that the best GPU offload setting was not necessarily the model default, the highest setting available, or the highest setting that technically fit in VRAM. Lowering GPU offload below the default improved performance for the larger models by preserving headroom.

11. Lessons Learned

The biggest lesson is simple: start small and work upward.

The smaller models are not just training wheels. They are the best way to confirm that the runtime is configured correctly, the intended GPU is being used, and the machine can generate responses at a useful speed.

The second lesson is that "loads" does not mean "usable." A model can load and still be unpleasant if it consumes nearly all VRAM, leaves no room for runtime behavior, or generates too slowly.

The third lesson is that VRAM matters more than total system RAM for this kind of GPU-accelerated local inference. Having 64 GB of system RAM is helpful, but it does not make a 16 GB GPU behave like a 24 GB or 48 GB GPU.

The fourth lesson is that available VRAM is less than advertised VRAM. Windows, displays, runtime overhead, LM Studio, KV cache, and other applications all consume memory before and during generation.

The fifth lesson is that smaller models may be better daily drivers. E4B was not the largest model I tried, but it had the best balance of speed, output quality, and headroom.

The sixth lesson is that settings matter. Context length, GPU offload, KV cache behavior, thinking/reasoning mode, thread pool size, and batch settings can change the result dramatically.

The seventh lesson is that maximum GPU offload is not automatically best. On constrained VRAM systems, a lower offload setting can leave enough memory headroom to make the whole pipeline faster.

Finally, record the settings. If you do not record context length, GPU offload layers, quantization, tokens/sec, and memory use, it becomes very hard to explain later why one run felt better than another.

12. Practical Recommendations

On this class of laptop, I would treat E4B as the practical default model from this test set.

E2B is useful as a very fast fallback and sanity-check model. It is lightweight, responsive, and easy to keep around. But its output needs a little more review.

E4B is the better daily assistant candidate. It is still fast enough to feel interactive, but its explanations and code responses were stronger.

The 26B A4B model is no longer something I would dismiss as unusable. After retuning GPU offload downward, it became surprisingly practical for larger reasoning prompts. I still would not keep it as my default daily model on a 16 GB VRAM system, but it is worth keeping as a tuned larger-model option.

The 31B model is also not impossible, but it requires more patience and careful tuning. At its higher/default offload setting it behaved like a stress test. After reducing offload, it could complete real prompts, but it remained slower than the 26B model.

For someone with similar hardware, my recommendation would be:

Start with E2B to confirm the setup.
Try E4B as the likely daily driver.
Try 26B A4B only after understanding GPU offload and VRAM headroom.
Treat 31B as a larger-model experiment, not a default.
Leave VRAM headroom for the operating system, context cache, and other applications.
Do not assume the default GPU offload setting is best.
Try lower GPU offload values if a model loads but performs badly.

Hardware Starting Points

Hardware	Suggested Starting Point
CPU only	Smallest quantized model
8 GB VRAM	E2B or E4B Q4
12 GB VRAM	E4B or mid-size Q4/Q5
16 GB VRAM	E4B comfortably; larger Q4/MoE models with careful offload tuning
24 GB+ VRAM	Larger models and higher quantizations become more practical

Use Case Starting Points

Goal	Suggested Direction
Fast chat	Smaller model, lower memory footprint
Coding assistant	E4B or larger if VRAM allows
Architecture reasoning	Larger model or MoE variant, tuned carefully
Long-context work	Leave extra VRAM for KV cache
Background productivity	Avoid consuming all VRAM so other apps stay usable
Multi-model workflow	Prefer models that leave enough headroom to load/unload without disrupting the machine

For someone with less VRAM, start smaller and be more aggressive about quantization.

For someone with more VRAM, the larger models become more interesting, especially if you want to keep a fast model and a reasoning model available without constantly fighting memory pressure.

13. Conclusion

Local AI is not only about saving money.

It is about privacy, offline access, control, experimentation, and learning how models actually behave on the hardware you own.

The encouraging part is that useful local AI is already possible on consumer hardware. The caution is that model names alone do not tell you whether the experience will be good. Neither does parameter count. Neither does the fact that a model loads.

The most surprising lesson from this test was that the largest useful configuration was not the one that pushed the most layers onto the GPU. In fact, some of the worst results came from trying to keep too much of the model in VRAM.

The best model is not simply the biggest model that loads. The best setting is also not necessarily the highest GPU offload value, the model’s default offload value, or even the highest value that fits inside available VRAM. On a constrained VRAM system, lowering GPU offload below the default may actually improve local model performance.

That changed how I think about local model tuning.

The goal is not maximum GPU offload. The goal is a balanced configuration that leaves enough VRAM for the model, runtime overhead, KV cache, and the rest of the system.

For everyday use on this laptop, E4B still looks like the best default model. It is fast, useful, and leaves plenty of headroom. E2B is a great lightweight fallback. The larger 26B and 31B models are better treated as reasoning or stress-test models that require careful tuning before they become practical.

A local model is not useful merely because it loads. It is useful when it fits your hardware, your workflow, and your patience.

The real benchmark is not "how big a model can I start?"

The better benchmark is: can I use this model, with these settings, on this machine, without breaking my flow?

If you’ve tried tuning Gemma 4 or other local models on your workstation or laptop, what GPU offload settings have worked best for your VRAM? I was surprised that lowering offload actually improved my speeds.

Top comments (1)

timothy_western_ed7594e0a profile image

Timothy Western

Joined

Jul 14, 2024

• Jun 22

If anyone happens to read this, and has learned anything from the more recent updates to Gemma 4 or later models, please feel free to share, or link from these comments, things grow so fast in this space, its important that people learn how to adapt and grow!