Key takeaways:
- QLoRA lets you fine-tune a 27B-parameter model in about 22 GB of VRAM. A single consumer GPU handles what required a cluster two years ago.
- LoRA applied to all transformer layers (not just attention) consistently outperforms the default HuggingFace PEFT configuration, per Sebastian Raschka's experiments.
- Most fine-tuning projects should start as prompting projects. Seriously. I've watched teams burn weeks on fine-tuning when a well-crafted system prompt would've gotten them 90% of the way.
- Unsloth makes Gemma-family fine-tuning 1.6x faster and uses 60% less VRAM than standard HuggingFace training, with critical float16 fixes for T4 and older GPUs.
- Evaluation is not optional. If you can't measure whether fine-tuning helped, you shouldn't be fine-tuning.
Fine-tuning without evaluation is just expensive prompt engineering with extra steps.
What Is Supervised Fine-Tuning (SFT) and Why It Matters in 2026
Supervised fine-tuning takes a pre-trained foundation model and continues training it on labeled input-output pairs specific to your task. Unlike pre-training (which requires billions of tokens and millions of dollars), SFT works with hundreds to thousands of examples and a single GPU.
SFT matters more in 2026 than it did in 2024 precisely because the base models got so much better. Gemma 4, released by Google DeepMind with full Apache 2.0 licensing, is multimodal (text, image, audio) and introduces real architectural innovations: Per-Layer Embeddings (PLE) and a Shared KV Cache. As Merve Noyan and the HuggingFace team noted at launch, they "struggled to find good fine-tuning examples because they are so good out of the box."
That quote should be your first filter. If a frontier open-source model handles your task well with a good system prompt, fine-tuning adds cost and complexity for marginal gain. SFT shines when you need the model to reliably follow a specific format, speak in domain-specific language, or compress a long prompt template into learned behavior.
The full fine-tuning approach — updating every parameter — creates a memory footprint approximately 12x larger than the model itself due to optimizer states and gradients. Artur Niederfahrenhorst and colleagues at Anyscale documented this thoroughly. For a 7B model, that's over 80 GB of VRAM. This is why parameter-efficient methods like LoRA and QLoRA aren't nice-to-haves. They're the only practical path for most teams.
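The arithmetic behind that figure can be sketched in a few lines. This is a back-of-the-envelope estimate, assuming roughly 12 bytes of training state per parameter, a rule of thumb that matches the full fine-tuning figures used in this article rather than a measured value. The constant `BYTES_PER_PARAM` and the function name are illustrative.

```python
# Rough full fine-tuning memory estimate.
# Assumption: ~12 bytes of training state per parameter (weights, gradients,
# optimizer states). Real usage also depends on activations, batch size,
# sequence length, and optimizer choice.
BYTES_PER_PARAM = 12

def full_finetune_gb(num_params: float) -> float:
    """Estimated VRAM in GB (1 GB = 1e9 bytes) for full fine-tuning."""
    return num_params * BYTES_PER_PARAM / 1e9

for name, params in [("7B", 7e9), ("12B", 12e9), ("27B", 27e9)]:
    print(f"{name}: ~{full_finetune_gb(params):.0f} GB")
```

Even the smallest of these exceeds any single consumer GPU, which is the practical case for parameter-efficient methods.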
LoRA vs QLoRA: How They Work and When to Use Each
Low-Rank Adaptation (LoRA) was introduced by Edward J. Hu et al. at Microsoft Research in 2021. The core idea is elegant: freeze the pre-trained model's weights entirely, then inject small trainable rank-decomposition matrices into each transformer layer. Instead of updating a weight matrix W directly, LoRA decomposes the update ΔW into two much smaller matrices A and B where ΔW = A × B. The rank r of these matrices controls capacity.
The results are striking. Compared to full fine-tuning of GPT-3 175B with Adam, LoRA reduces trainable parameters by 10,000x and GPU memory by 3x. And because the adapter matrices can be merged back into the base weights after training, there's zero additional inference latency. Your fine-tuned model runs at exactly the same speed as the original.
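To make the decomposition and the merge step concrete, here is a minimal pure-Python sketch. It follows this article's notation (ΔW = A × B, with A of shape d × r and B of shape r × k). The layer sizes and helper names are illustrative assumptions, and real implementations use tensor libraries rather than nested lists.

```python
# Pure-Python sketch of a LoRA update, using the article's notation
# (delta_W = A x B). Shapes: W is d x k, A is d x r, B is r x k.

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_param_count(d: int, k: int, r: int) -> int:
    """Trainable parameters in the adapter: A (d x r) plus B (r x k)."""
    return r * (d + k)

def merge(W, A, B, scale=1.0):
    """Fold the adapter into the frozen weight: W' = W + scale * (A x B).
    In LoRA the scale is typically alpha / r."""
    delta = matmul(A, B)
    return [[w + scale * dw for w, dw in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

# A 4096 x 4096 projection at rank 16 (illustrative sizes).
full = 4096 * 4096
adapter = lora_param_count(4096, 4096, 16)
print(full, adapter, full // adapter)

# Tiny merge example: 2 x 2 weight, rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0], [2.0]]   # 2 x 1
B = [[0.5, 0.0]]     # 1 x 2
print(merge(W, A, B))
```

After merging, the model holds a single weight matrix of the original shape, which is why inference cost does not change.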
QLoRA, introduced by Tim Dettmers et al. at the University of Washington in May 2023, stacks three additional innovations on top of LoRA:
- 4-bit NormalFloat (NF4): an information-theoretically optimal data type for normally distributed neural network weights
- Double quantization: quantizes the quantization constants themselves, squeezing out additional memory savings
- Paged optimizers: use NVIDIA unified memory to handle memory spikes during gradient computation without OOM crashes
The result: QLoRA fine-tunes a 65B-parameter model on a single 48GB GPU while preserving full 16-bit fine-tuning performance. Their best model, Guanaco, reached 99.3% of ChatGPT's performance on the Vicuna benchmark after just 24 hours of training on one GPU.
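The double quantization saving is easy to check by hand. The sketch below assumes the settings described in the QLoRA paper (weight blocks of 64, 32-bit first-level constants, 8-bit second-level constants in blocks of 256). The function names are illustrative.

```python
# Memory overhead of quantization constants, in bits per model parameter.
# Assumed settings (block size 64, 32-bit first-level constants, 8-bit
# second-level constants in blocks of 256) follow the QLoRA paper.

def overhead_single(block_size: int = 64, const_bits: int = 32) -> float:
    """One full-precision constant per block of weights."""
    return const_bits / block_size

def overhead_double(block_size: int = 64, q_const_bits: int = 8,
                    second_block: int = 256, second_const_bits: int = 32) -> float:
    """Constants are themselves quantized to 8 bits, plus one 32-bit
    constant per second-level block of 256 constants."""
    return q_const_bits / block_size + second_const_bits / (block_size * second_block)

def saved_gb(num_params: float) -> float:
    """GB saved by double quantization for a model of the given size."""
    bits = (overhead_single() - overhead_double()) * num_params
    return bits / 8 / 1e9

print(overhead_single())          # 0.5 bits per parameter
print(overhead_double())          # about 0.127 bits per parameter
print(round(saved_gb(65e9), 2))   # roughly 3 GB on a 65B model
```

Three gigabytes sounds small next to the model itself, but on a 48 GB card it can be the difference between fitting and not fitting.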
When to use which:
- LoRA (16-bit base): When you have ample VRAM (40GB+), want maximum quality, and are fine-tuning a model under 12B parameters. The quality ceiling is slightly higher because no information is lost to quantization.
- QLoRA (4-bit base): When VRAM is constrained (16-24 GB), when fine-tuning larger models (12B-70B), or when training on consumer hardware. The quality gap vs. LoRA is negligible for most practical tasks.
For most practitioners in 2026, QLoRA is the default. The VRAM savings are too significant to ignore, and the quality trade-off is minimal. Based on the benchmark data I maintain at kunalganglani.com/llm-benchmarks, quantization quality cliffs are model-family-specific. A blanket recommendation doesn't hold, but QLoRA's NF4 specifically handles the weight distributions of modern transformer architectures well.
The Decision Framework: Fine-Tuning vs Prompt Engineering
This is the section most guides skip, and it's the most important one. I've seen teams jump straight to fine-tuning because it feels more "real" than prompt engineering. It's not. It's just more expensive.
Before you spin up a GPU instance, work through this checklist:
- Have you tried few-shot prompting? Give the model 3-5 examples of your desired input-output format in the system prompt. If this gets you 90%+ of the way there, stop.
- Have you tried RAG? If the model needs domain knowledge it doesn't have, retrieval-augmented generation with a vector database is cheaper and more maintainable than baking knowledge into weights.
- Do you have at least 500 high-quality training examples? Below this threshold, LoRA fine-tuning rarely outperforms a well-engineered prompt. The sweet spot starts around 1,000-5,000 examples.
- Is the task about format, not knowledge? Fine-tuning excels at teaching consistent output structure, tone, and domain terminology. It's mediocre at injecting new factual knowledge.
- Will you call this model >10,000 times? Fine-tuning amortizes. If your fine-tuned model eliminates a 500-token system prompt, that's real money at scale. But only if the volume justifies the upfront training cost.
- Can you measure improvement? If you can't define a metric to compare before and after, you have no way to know if fine-tuning helped.
As Maxime Labonne puts it in his Unsloth fine-tuning guide: try few-shot prompting or RAG first before committing to fine-tuning. This isn't just good advice. It's the economically rational path.
Fine-tuning makes sense when you need: consistent JSON output across millions of API calls, domain-specific medical/legal/financial terminology, a particular conversational persona, or reduced latency by eliminating long context windows. If none of those apply, prompt engineering is your answer.
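The amortization argument from the checklist can be put into numbers. This is a rough sketch: every price below is a hypothetical placeholder, not a quote from any provider, and the function name is illustrative.

```python
# Break-even estimate for fine-tuning that removes a long system prompt.
# All prices below are hypothetical placeholders.

def break_even_calls(training_cost_usd: float,
                     prompt_tokens_saved: int,
                     usd_per_million_input_tokens: float) -> float:
    """Number of calls after which the saved prompt tokens pay for training."""
    saving_per_call = prompt_tokens_saved * usd_per_million_input_tokens / 1_000_000
    return training_cost_usd / saving_per_call

calls = break_even_calls(training_cost_usd=20.0,
                         prompt_tokens_saved=500,
                         usd_per_million_input_tokens=0.10)
print(f"Break-even after ~{calls:,.0f} calls")
```

With these placeholder numbers the break-even lands in the hundreds of thousands of calls, which shows how strongly the answer depends on volume and token price. The estimate ignores engineering time as well as latency and consistency benefits, so treat it as a lower bound on the analysis, not the whole analysis.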
GPU Requirements: Model Size × Technique × VRAM
This is the table I wish someone had given me before I wasted money on oversized instances. Here are realistic VRAM requirements for fine-tuning in 2026, accounting for Unsloth optimizations where applicable:
| Model Size | Full Fine-Tuning | LoRA (16-bit) | QLoRA (4-bit) | QLoRA + Unsloth |
| --- | --- | --- | --- | --- |
| 1B | ~12 GB | ~6 GB | ~4 GB | ~3 GB |
| 4B | ~48 GB | ~18 GB | ~8 GB | ~5 GB |
| 7-8B | ~80 GB | ~20 GB | ~10 GB | ~6 GB |
| 12B | ~144 GB | ~32 GB | ~16 GB | ~10 GB |
| 27B | ~324 GB | ~72 GB | ~28 GB | ~22 GB |
| 70B | ~840 GB | ~180 GB | ~48 GB | ~36 GB |
Numbers assume sequence length 2048, batch size 1, gradient checkpointing enabled. Actual requirements vary with batch size, sequence length, and optimizer choice.
The practical takeaway: an RTX 4090 (24 GB) handles QLoRA fine-tuning of anything up to 12B comfortably. With Unsloth's optimizations, you can squeeze Gemma 4 27B into that same 24 GB card. A free Colab T4 (16 GB) works for models up to about 12B with QLoRA + Unsloth, though you'll be constrained on batch size.
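If you want to script that lookup, the table translates directly into a small helper. The figures are copied from the table above and inherit its assumptions. The dictionary keys and function name are illustrative.

```python
# VRAM figures (GB) copied from the table above; same assumptions apply
# (sequence length 2048, batch size 1, gradient checkpointing enabled).
VRAM_GB = {
    "1B":   {"full": 12,  "lora16": 6,   "qlora": 4,  "qlora_unsloth": 3},
    "4B":   {"full": 48,  "lora16": 18,  "qlora": 8,  "qlora_unsloth": 5},
    "7-8B": {"full": 80,  "lora16": 20,  "qlora": 10, "qlora_unsloth": 6},
    "12B":  {"full": 144, "lora16": 32,  "qlora": 16, "qlora_unsloth": 10},
    "27B":  {"full": 324, "lora16": 72,  "qlora": 28, "qlora_unsloth": 22},
    "70B":  {"full": 840, "lora16": 180, "qlora": 48, "qlora_unsloth": 36},
}
MODEL_ORDER = ["1B", "4B", "7-8B", "12B", "27B", "70B"]

def largest_model_that_fits(gpu_vram_gb: float, technique: str):
    """Largest model whose estimated requirement fits in the given VRAM,
    or None if nothing fits. Leaves no headroom for larger batches."""
    fitting = [m for m in MODEL_ORDER if VRAM_GB[m][technique] <= gpu_vram_gb]
    return fitting[-1] if fitting else None

print(largest_model_that_fits(24, "qlora"))          # RTX 4090
print(largest_model_that_fits(24, "qlora_unsloth"))  # RTX 4090 with Unsloth
print(largest_model_that_fits(16, "qlora_unsloth"))  # Colab T4
```

The helper uses a plain "fits or doesn't" comparison, so in practice leave a few gigabytes of margin if you plan to raise batch size or sequence length.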
For local LLM enthusiasts running on Apple Silicon, note that unified memory changes the equation. An M4 Max with 128 GB unified memory can technically load a 70B model for QLoRA, but throughput will be significantly slower than a dedicated NVIDIA GPU due to memory bandwidth differences. I've run training jobs on both, and for serious fine-tuning work, NVIDIA GPUs on CUDA remain the pragmatic choice.
Check out the local LLM hardware guide for current GPU recommendations.
Step-by-Step: Fine-Tuning Gemma 4 With Unsloth and QLoRA
Unsloth joined the official PyTorch ecosystem in May 2026, cementing its position as the recommended efficient fine-tuning framework. Daniel Han and Michael Han, Unsloth's co-founders, have specifically optimized for Gemma-family models, achieving 1.6x faster training and 60% less VRAM than standard HuggingFace pipelines.
Here's the workflow for fine-tuning Gemma 4 12B with QLoRA on a 24 GB GPU:
Step 1: Environment setup. Install Unsloth with pip install unsloth. If you're on a T4 or older GPU that only supports float16 tensor cores, Unsloth automatically handles the bfloat16 activation + manual float16 matrix multiply workaround that prevents gradient overflow. Without this fix, training produces NaN losses on these GPUs. Don't skip this. If you haven't set up your Python environment yet, the Python AI development setup guide covers the full stack.
Step 2: Load the model with 4-bit quantization. Use Unsloth's FastModel.from_pretrained() with load_in_4bit=True. Unsloth uses dynamic 4-bit quantization that's more accurate than standard GPTQ or AWQ for training purposes. For Gemma 4 specifically, be aware of the new Per-Layer Embeddings (PLE) architecture. Unsloth handles this automatically, but if you're using raw PEFT, you need to exclude embedding layers from quantization.
Step 3: Configure the PEFT adapter. This is where LoRA hyperparameters come in (covered in the next section).
Step 4: Prepare your dataset. Covered below.
Step 5: Train with SFTTrainer. Unsloth wraps HuggingFace's TRL SFTTrainer with its own optimizations. Training Gemma 4 12B on 1,000 examples with QLoRA typically takes 15-30 minutes on an A100 or RTX 4090, and about 45-60 minutes on a T4.
Step 6: Save and export. You can save the adapter separately, merge it into the base model, or export directly to GGUF.
Setting Up Your Dataset and Chat Template
Your dataset quality matters more than your hyperparameters. I've shipped enough fine-tuned models to know this is true every single time: a perfect LoRA configuration trained on noisy data will produce a worse model than default settings on clean, well-structured examples.
Format your data as conversations using the model's chat template. For Gemma 4, this means the <start_of_turn> / <end_of_turn> format. Unsloth's standardize_data_formats() function handles conversion from common formats (ShareGPT, Alpaca, OpenAI-style) automatically.
Practical dataset guidance:
- Minimum viable dataset: 200-500 examples for format/style adaptation. Below 200, you're almost certainly better off with few-shot prompting.
- Sweet spot: 1,000-5,000 examples. This is where LoRA consistently outperforms prompting on task-specific evaluations.
- Diminishing returns: Beyond 10,000 examples, gains flatten unless you're training on genuinely diverse data. More data isn't always better. More distinct data is.
- Quality filter: Every example should be something you'd be proud to show as model output. One garbage example teaches the model that garbage is acceptable.
The Philipp Schmid guide on HuggingFace covers dataset preparation mechanics well, though it targets the 2024 toolchain. The principles haven't changed, but the specific APIs have.
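To see what the turn format looks like on the page, here is an illustrative renderer for OpenAI-style messages. It assumes Gemma 4 keeps the turn layout used by earlier Gemma releases, where the assistant role is named "model". The function name is hypothetical, and in a real pipeline the tokenizer's own chat template should be the source of truth.

```python
# Illustrative rendering of the <start_of_turn> / <end_of_turn> format.
# In a real pipeline, prefer the tokenizer's own chat template, which also
# handles special tokens such as the beginning-of-sequence marker.

ROLE_MAP = {"user": "user", "assistant": "model"}  # Gemma names the assistant "model"

def render_gemma_turns(messages):
    """Turn a list of {"role", "content"} dicts into one training string."""
    parts = []
    for msg in messages:
        role = ROLE_MAP[msg["role"]]  # raises KeyError on unsupported roles
        parts.append(f"<start_of_turn>{role}\n{msg['content'].strip()}<end_of_turn>\n")
    return "".join(parts)

example = [
    {"role": "user", "content": "Summarize this ticket in one line."},
    {"role": "assistant", "content": "Login fails after password reset."},
]
print(render_gemma_turns(example))
```

Printing a few rendered examples before training is a cheap way to catch template mismatches early.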
Configuring LoRA Hyperparameters: Rank, Alpha, and Target Modules
I've tested these defaults across three different model families this year. They work:
- Rank (r): Start with 16. This controls adapter capacity. Higher rank = more parameters = more expressiveness but more memory. For most tasks, r=16 is sufficient. Sebastian Raschka found that very large ranks (r=256) can help on certain tasks, but r=16-64 covers the practical range.
- Alpha: Set to 2x your rank (alpha=32 for r=16). Alpha scales the adapter's contribution. The ratio alpha/r controls the effective learning rate multiplier for the adapters.
- Dropout: 0.05. Some practitioners use 0, but a small dropout helps prevent overfitting on small datasets.
- Learning rate: 2e-4 with cosine scheduling. This is higher than you'd typically use for full fine-tuning, but the Anyscale team found that lower learning rates improve LoRA checkpoint reliability, so reduce it if training is unstable.
- Target modules: Apply LoRA to ALL linear layers, not just the attention Q and V matrices. Sebastian Raschka's experiments showed this consistently improves downstream task performance. The default HuggingFace PEFT configuration only targets attention layers, leaving gains on the table. In Unsloth, use target_modules="all-linear".
For Gemma 4 specifically, the Shared KV Cache architecture means the key-value projections are shared across certain layer groups. This doesn't change your LoRA config (Unsloth handles the mapping correctly), but it means the effective parameter count of your adapters may be slightly lower than you'd expect from the rank alone.
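The defaults above can be collected in one place. The helper below is a hypothetical convenience function, not part of any library. Its key names are chosen to match the parameters of PEFT's `LoraConfig`, so the result could be passed as `LoraConfig(**lora_kwargs())` if you use PEFT directly.

```python
# Hypothetical helper that collects the defaults above into keyword arguments.
# Key names mirror PEFT's LoraConfig parameters; the learning rate is kept
# separate because it belongs to the trainer, not the adapter config.

LEARNING_RATE = 2e-4

def lora_kwargs(rank: int = 16, alpha_multiplier: int = 2, dropout: float = 0.05) -> dict:
    """LoRA settings with alpha tied to rank (alpha = multiplier * rank)."""
    return {
        "r": rank,
        "lora_alpha": alpha_multiplier * rank,
        "lora_dropout": dropout,
        "target_modules": "all-linear",
        "task_type": "CAUSAL_LM",
    }

cfg = lora_kwargs()
print(cfg)
print("adapter scale (alpha / r):", cfg["lora_alpha"] / cfg["r"])
```

Tying alpha to rank keeps the adapter scale constant when you sweep over ranks, so a rank change is not silently also a scale change.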
Running the Training Loop
With Unsloth + TRL's SFTTrainer, the training configuration is straightforward. Key settings beyond the LoRA config:
- Batch size: Start with 1 if VRAM is tight. Use gradient accumulation steps (4-8) to achieve an effective batch size of 4-8 without increasing memory.
- Epochs: 1-3 for most tasks. More epochs on a small dataset lead to overfitting fast. Monitor your validation loss.
- Max sequence length: 2048 is a safe default. Gemma 4 supports much longer contexts, but activation memory grows with sequence length, and attention cost grows quadratically without memory-efficient kernels. Only increase if your data actually contains long documents.
- Gradient checkpointing: Always enable. It trades ~20% more computation for massive VRAM savings.
- Warmup steps: 5-10% of total training steps.
Unsloth's training loop automatically applies its optimizations: fused cross-entropy, custom CUDA kernels for attention, and async gradient checkpointing. On Gemma 4 12B with QLoRA, expect a speedup in line with the roughly 1.6x figure quoted earlier, relative to the vanilla HuggingFace Trainer on the same hardware.
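The settings above interact, so it helps to compute the resulting step counts before launching a run. The sketch below is plain arithmetic plus a generic warmup-then-cosine schedule. It is an approximation for planning purposes and is not claimed to match the exact schedule implemented by TRL or Unsloth.

```python
import math

def training_steps(num_examples: int, batch_size: int, grad_accum: int, epochs: int) -> int:
    """Optimizer steps: one step per effective batch, per epoch."""
    effective_batch = batch_size * grad_accum
    return math.ceil(num_examples / effective_batch) * epochs

def lr_at(step: int, total_steps: int, warmup_steps: int, peak_lr: float = 2e-4) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1 + math.cos(math.pi * progress))

total = training_steps(num_examples=1000, batch_size=1, grad_accum=8, epochs=3)
warmup = round(0.05 * total)  # 5% warmup
print(total, warmup)
for step in (0, warmup, total // 2, total):
    print(step, f"{lr_at(step, total, warmup):.2e}")
```

For 1,000 examples at an effective batch size of 8 over 3 epochs this gives 375 optimizer steps, which is a useful sanity check against what the trainer reports at startup.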
When I built the pipeline for this site's multi-agent blog publishing system, I learned something that applies directly here: model-per-job-shape beats one-model-everywhere on both cost and quality. Don't try to make one fine-tuned model do everything. Train separate adapters for separate tasks and swap them at inference time. LoRA adapters are tiny (typically 10-100 MB) and can be hot-swapped without reloading the base model.
Merging the Adapter vs Keeping It Separate: The Deployment Decision
After training, you have a choice that affects your entire inference architecture:
Merge and export (single model): Call merge_and_unload() to fold the adapter weights back into the base model. The result is a standard model with zero inference overhead. Export to GGUF for use with Ollama, LM Studio, or llama.cpp. This is the right choice when you have one fine-tuned task and want maximum simplicity.
Keep adapters separate (multi-adapter serving): Store the base model once and load different LoRA adapters per request. This is the right choice when you have multiple fine-tuned variants (one per customer, one per task) and want to avoid storing N copies of a multi-gigabyte model. Tools like vLLM support serving multiple LoRA adapters from a single base model with minimal overhead. See the vLLM vs Ollama comparison for production serving options.
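The storage argument is simple arithmetic. In the sketch below, the base model size, adapter size, and variant count are hypothetical placeholders chosen only to illustrate the gap.

```python
def storage_gb(num_variants: int, base_model_gb: float, adapter_mb: float,
               merged: bool = False) -> float:
    """Storage footprint for serving several fine-tuned variants."""
    if merged:
        return num_variants * base_model_gb               # one full copy per variant
    return base_model_gb + num_variants * adapter_mb / 1000  # shared base + adapters

# Hypothetical numbers: 20 variants, 8 GB base model, 50 MB adapters.
print(storage_gb(20, base_model_gb=8, adapter_mb=50, merged=True))
print(storage_gb(20, base_model_gb=8, adapter_mb=50, merged=False))
```

The gap widens linearly with the number of variants, which is why per-customer or per-task fine-tunes usually favor separate adapters.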
For local AI use cases — running on your own hardware for privacy or cost — merged GGUF export is almost always the right call. The operational simplicity of a single file you can load in Ollama outweighs the flexibility of adapter serving.
To export to GGUF with Unsloth: use save_pretrained_gguf() with your desired quantization level. Q4_K_M is a solid default for inference quality, but test against your evaluation suite before committing.
What Is QAT and How Does It Differ From QLoRA?
Quantization-Aware Training (QAT) is gaining real traction in 2026. Google released a Gemma 4 12B QAT model in June 2026, which tells you something about where this technique is headed.
The distinction matters:
- QLoRA: Quantizes the frozen base model to 4-bit, then trains LoRA adapters in higher precision. The quantization is applied before training and the model never learns to compensate for quantization artifacts.
- QAT: Simulates quantization during training, allowing the model to learn weight values that work well in their quantized representation. The result is a natively quantized model that performs better at low bit-widths than post-training quantization.
QAT is complementary to QLoRA, not a replacement. You might use QLoRA to fine-tune a model cheaply, then apply QAT as a final optimization step before deployment. Or you might start from a QAT-optimized base model (like Google's Gemma 4 12B QAT) and fine-tune it with standard LoRA.
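"Simulating quantization during training" usually means a quantize-then-dequantize step in the forward pass. The sketch below shows only that step, using a simple symmetric integer grid with one scale per tensor. That grid is an assumption made for illustration: it is not NF4, and real QAT setups also need a gradient rule (commonly a straight-through estimator) and finer-grained scales.

```python
def fake_quantize(weights, bits: int = 4):
    """Quantize to a symmetric integer grid and immediately dequantize.
    The returned floats are the values the network sees in a QAT forward pass."""
    qmax = 2 ** (bits - 1) - 1                      # e.g. 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax
    if scale == 0:
        return list(weights)                        # all-zero tensor: nothing to do
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]

weights = [7.0, -3.0, 1.2]
quantized = fake_quantize(weights, bits=4)
print(quantized)                                    # [7.0, -3.0, 1.0]
print([round(abs(w - q), 3) for w, q in zip(weights, quantized)])
```

Because the loss is computed on the rounded values, training can push weights toward points on the grid where rounding hurts least. QLoRA's frozen base model never gets that chance.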
Unsloth added QAT support in October 2025. If you're deploying to edge devices or need aggressive quantization (2-bit, 3-bit), QAT-trained models hold up dramatically better than post-training quantized equivalents.
Evaluating Your Fine-Tuned Model: Did It Actually Improve?
This is where I see teams fail over and over. Too many practitioners declare victory based on vibes — "it feels better" — without measuring anything. That's not engineering. That's wishful thinking.
Set up evaluation before you start training. Here's how I approach it:
1. Hold out a test set. Take 10-15% of your dataset and never train on it. Non-negotiable.
2. Define task-specific metrics. For classification: accuracy, F1. For generation: use an LLM-as-judge approach (have GPT-4 or Claude rate outputs on your criteria). For structured output: exact-match on schema compliance. For coding tasks, consider referencing approaches from the Gemma fine-tuning for code generation post.
3. Run lm-evaluation-harness. EleutherAI's lm-evaluation-harness is the standard tool for general capability evaluation. Run it on both your base model and fine-tuned model to check you haven't degraded general capabilities while improving task performance. This regression check is critical. Fine-tuning on narrow data can cause the model to catastrophically forget broader skills.
4. Compare against the prompting baseline. Your fine-tuned model needs to beat the best prompt you can engineer. If few-shot prompting with the base model scores 85% and your fine-tuned model scores 87%, that 2% gain probably isn't worth the operational complexity of maintaining a custom model.
5. Test at your target quantization. If you're deploying as Q4_K_M GGUF, evaluate at that quantization level, not at full precision. Performance can drop at lower bit-widths, and you need to know before deployment.
Building this site's multi-agent blog pipeline taught me something I keep coming back to: deterministic gates before LLM review catch more issues than doubling the review model's size. The same applies here. Automated, deterministic quality checks (exact-match on format, schema validation, regression suite) catch more problems than eyeballing outputs.
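As a concrete example of a deterministic gate, the sketch below checks JSON schema compliance and compares a fine-tuned score against the prompting baseline. The function names and the 5-point minimum gain are hypothetical choices for illustration, so pick a threshold that reflects your own operational costs.

```python
import json

def schema_compliance(outputs, required_keys) -> float:
    """Fraction of outputs that parse as a JSON object containing every required key."""
    ok = 0
    for text in outputs:
        try:
            obj = json.loads(text)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict) and set(required_keys).issubset(obj):
            ok += 1
    return ok / len(outputs) if outputs else 0.0

def beats_baseline(finetuned_score: float, baseline_score: float,
                   min_gain: float = 0.05) -> bool:
    """True only if the fine-tuned model clears the prompting baseline by min_gain."""
    return finetuned_score - baseline_score >= min_gain

outputs = ['{"label": "bug", "priority": 2}', "Sure! Here is the JSON:", '{"label": "bug"}']
print(schema_compliance(outputs, {"label", "priority"}))
print(beats_baseline(finetuned_score=0.87, baseline_score=0.85))
```

Run the same gate on the base model with your best prompt and on the fine-tuned model at its deployment quantization, so the comparison reflects what you will actually ship.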
Common Pitfalls and How to Avoid Them
Overfitting on small datasets. If your training loss drops to near-zero but validation loss diverges, you're memorizing, not learning. Reduce epochs, increase dropout, or get more data.
Wrong chat template. Each model family has its own special tokens and conversation format. Using Llama's template on Gemma produces garbage. Unsloth handles this, but if you're rolling your own pipeline, verify the template matches the model's tokenizer. I've seen this bite people more than any hyperparameter mistake.
Ignoring Gemma 4's architectural differences. Gemma 4's Per-Layer Embeddings mean each transformer layer has its own embedding projection, not a shared one. If you're writing custom training code (not using Unsloth), make sure your LoRA configuration accounts for this. Shared KV Cache similarly means certain layers share key-value projections, which affects how LoRA adapters interact with attention.
Training on the wrong data format. Multi-turn conversations need multi-turn training data. If your training examples are all single-turn but your deployment is multi-turn, the model won't learn turn-taking behavior.
Float16 overflow on T4/V100 GPUs. Gemma models produce infinite activations in float16 mixed precision on GPUs without bfloat16 tensor cores. As Daniel Han and Michael Han documented, Unsloth is currently the only framework that correctly handles this with its three-fold fix: bfloat16 activations, manual float16 matrix multiplies, and float32 upcast for non-matmul operations.
Skipping evaluation. I've said it twice. I'll say it a third time. If you don't measure, you don't know. Ship an eval harness before you ship a fine-tuned model.
Not version-controlling your experiments. Track your hyperparameters, dataset version, base model version, and metrics for every run. Weights & Biases works great. A simple CSV file works too. Whatever stops you from repeating failed experiments.
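For the CSV route, something as small as the following is enough. The field names are an assumption about what is worth tracking, and the model name in the example is a placeholder, not an official identifier.

```python
import csv
import os
import tempfile

FIELDS = ["run_id", "base_model", "dataset_version", "rank", "alpha", "lr", "eval_score"]

def log_run(path: str, **run) -> None:
    """Append one experiment record, writing the header if the file is new.
    Missing fields are left blank; unknown fields raise ValueError."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(run)

# Example: write to a temporary directory so nothing is left in the project tree.
path = os.path.join(tempfile.mkdtemp(), "runs.csv")
log_run(path, run_id="run-001", base_model="gemma-4-12b-placeholder",
        dataset_version="v3", rank=16, alpha=32, lr=2e-4, eval_score=0.87)
with open(path, newline="") as f:
    print(f.read())
```

Rejecting unknown fields is deliberate: a typo in a column name fails loudly instead of silently creating a ragged log.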
What Changed in Fine-Tuning Between 2024 and 2026
If you're coming from a 2024-era guide (and most of the top-ranking articles are), here's what's different:
- Unsloth joined the PyTorch ecosystem (May 2026), making it the officially endorsed efficient fine-tuning path rather than a third-party hack.
- Gemma 4 introduced PLE and Shared KV Cache. These are architectural changes that require framework-level support for correct LoRA placement.
- QAT models ship from the provider. Google released Gemma 4 12B QAT in June 2026, so you can start from a model that's already optimized for low-precision deployment.
- Unsloth launched an API endpoint (May 2026), so you can fine-tune without managing GPU infrastructure at all.
- NVIDIA collaboration (May 2026) means Unsloth's CUDA kernels are optimized for current-gen hardware (Blackwell architecture, RTX 50xx series).
- Context-length fine-tuning expanded to 500K+ tokens (Unsloth, December 2025), enabling fine-tuning on book-length documents.
Every competing article currently ranking for "how to fine-tune open source LLM LoRA QLoRA 2026" predates all of these developments. That's not a minor gap. Their GPU tables, code examples, and framework recommendations are outdated.
The Fine-Tuning Decision Checklist
Here's the framework I use for every fine-tuning project:
- Can prompting solve this? Test few-shot with 5-10 examples. If accuracy exceeds your threshold, stop here.
- Can RAG solve this? If the gap is knowledge, not format, build a retrieval pipeline with a vector database first.
- Do you have 500+ clean examples? If not, invest in data collection before GPU time.
- Pick your base model. In mid-2026, Gemma 4 12B is the best bang-for-VRAM open-source model for most tasks. Gemma 4 vs GPT-4o Mini covers the comparison in depth.
- Use QLoRA + Unsloth. Unless you have specific reasons for LoRA 16-bit or full fine-tuning, QLoRA is the default.
- Apply LoRA to all linear layers. r=16, alpha=32, dropout=0.05, lr=2e-4.
- Evaluate against your prompting baseline. If the fine-tuned model doesn't beat it by a meaningful margin, don't deploy it.
- Export to GGUF for local deployment or keep adapters separate for multi-tenant serving.
This is one of those things where the boring answer is actually the right one. Item 1, "can prompting solve this?", eliminates 80% of fine-tuning projects before they start. But for the 20% where fine-tuning is the right call, the toolchain in 2026 makes it genuinely accessible. A $1,500 RTX 4090 and 30 minutes of training time can produce a specialized model that would have cost tens of thousands of dollars in compute two years ago.
The gap between "I have an idea for a specialized model" and "I have a deployed specialized model" has never been smaller. Stop reading guides. Start measuring whether your task actually needs fine-tuning. And if it does, Unsloth + QLoRA + Gemma 4 is the stack that makes it work.
Originally published on kunalganglani.com