
Fine-Tuning LLMs with LoRA and QLoRA: A Practical Guide
by Deep Parmar
CTO, Sunbots & Xwits

LoRA and QLoRA let you adapt a large model cheaply by training a small set of extra "adapter" weights instead of the whole model. The base model is frozen — you are not retraining it. You are training a thin wrapper on top of it. QLoRA takes this further by first quantising the base model to 4-bit precision, which dramatically cuts VRAM usage and makes fine-tuning viable on a single consumer GPU.
These techniques changed the economics of model customisation. What previously needed a cluster of A100s can now run overnight on a gaming GPU.
When Should You Fine-Tune at All?
Fine-tuning is the right tool in a narrow set of situations. It is the wrong tool in many more.
Fine-tune when:
- You need the model to produce output in a very specific format, style, or structure consistently.
- You are adding domain knowledge that the base model genuinely lacks — not just terminology, but concepts.
- Prompt engineering and retrieval (RAG) have hit a ceiling and the quality gap is still unacceptable.
- You have quality-labelled training data: at minimum a few hundred examples, ideally thousands.
Do not fine-tune when:
- You want the model to "know more facts." RAG is the right answer for factual retrieval. Fine-tuning memorises facts poorly and degrades when facts change.
- Your real problem is prompt structure. Write better prompts first. Fine-tuning for formatting issues is often wasteful.
- You have fewer than a couple hundred high-quality examples. You can fine-tune on small data, but you will mostly be overfitting.
- The task changes frequently. A fine-tuned model is a snapshot. Updating it requires retraining.
The decision framework for RAG vs fine-tuning is worth reading before you commit to either path. They solve different problems and are frequently confused.
LoRA vs QLoRA Explained Simply
LoRA stands for Low-Rank Adaptation. The core idea is that the weight updates needed to adapt a model to a new task can be expressed as the product of two small matrices — instead of changing a large weight matrix directly, you train two skinny matrices whose product approximates the change. This is the "low-rank" part: the update is low-rank in the mathematical sense, meaning it has far fewer degrees of freedom than a full weight update.
In practice, LoRA adds adapters — pairs of those small matrices — to specific layers of the transformer, usually the attention layers. The base model weights are frozen. Only the adapter weights are trained. At inference, you either merge the adapters into the base model weights (no runtime overhead) or apply them on the fly.
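A minimal NumPy sketch of the idea, under assumed toy dimensions (the sizes 512, r = 16 and alpha = 16 are illustrative, not taken from any particular model). It builds a frozen weight W and two skinny matrices A and B, then checks that applying the adapter on the fly gives the same output as merging it into W once.

```python
import numpy as np

rng = np.random.default_rng(0)

d_out, d_in, r, alpha = 512, 512, 16, 16   # toy sizes, chosen for illustration
scale = alpha / r                          # standard LoRA scaling factor

W = rng.normal(size=(d_out, d_in))         # frozen base weight, never updated
A = rng.normal(size=(r, d_in)) * 0.01      # trainable, shape r x d_in
B = rng.normal(size=(d_out, r)) * 0.01     # trainable, shape d_out x r
# (real LoRA initialises B to zero; random values are used here so the check is non-trivial)

x = rng.normal(size=(d_in,))

# Option 1: apply the adapter on the fly, alongside the frozen weight.
y_adapter = W @ x + scale * (B @ (A @ x))

# Option 2: merge once, then run a plain matrix multiply.
W_merged = W + scale * (B @ A)
y_merged = W_merged @ x

full_params = W.size
lora_params = A.size + B.size
same_output = bool(np.allclose(y_adapter, y_merged))

print(f"full update: {full_params:,} params, LoRA update: {lora_params:,} params")
print("merged and on-the-fly outputs match:", same_output)
```

With these toy sizes the adapter holds 16x fewer parameters than the weight matrix it modifies, and the gap widens as the layer gets larger relative to r.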
QLoRA (Quantised LoRA) extends this by quantising the frozen base model to 4-bit precision before training begins. This cuts the VRAM required to hold the base model by roughly 4x compared to 16-bit. You then train LoRA adapters on top of the quantised base model in a higher-precision format (typically bfloat16), so the adapters themselves are not quantised.
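The result: a 7B model, whose 16-bit weights alone take about 14 GB and whose full fine-tuning needs several times that once gradients and optimiser state are counted, needs roughly 16–20 GB with 16-bit LoRA and 8–10 GB with QLoRA. With short sequences and a small batch size, QLoRA on a 7B model can fit on a GPU with as little as 8 GB of VRAM.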
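The weights-only part of these figures is simple arithmetic, sketched below. The helper name is hypothetical, and the calculation deliberately ignores activations, adapter gradients, optimiser state and quantisation overhead, which is why real training runs need more than the numbers it prints.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Memory needed to hold the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


n_params = 7e9  # a 7B-parameter model

fp16_gb = weight_memory_gb(n_params, 16)  # frozen base held in 16-bit
int4_gb = weight_memory_gb(n_params, 4)   # frozen base quantised to 4-bit

print(f"16-bit weights: {fp16_gb:.1f} GB")
print(f"4-bit weights:  {int4_gb:.1f} GB")
print(f"reduction:      {fp16_gb / int4_gb:.0f}x")
```

The difference between these weights-only numbers and the totals quoted above is the working memory that training needs on top of the frozen base.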
The Practical Workflow
Step 1: Data Preparation
Data quality dominates everything else. A fine-tuning run on 500 excellent examples beats one on 5,000 mediocre ones.
For supervised instruction fine-tuning — the most common case — you need input-output pairs in a consistent format. The standard chat template uses system, user, and assistant turns:
{
  "conversations": [
    {"role": "system", "content": "You are a GST filing assistant for Indian businesses."},
    {"role": "user", "content": "What is the deadline for GSTR-3B?"},
    {"role": "assistant", "content": "GSTR-3B is due on the 20th of the following month for monthly filers."}
  ]
}
Format matters. Use whatever template the base model was trained with. Mismatched templates are a common source of poor fine-tuning results that are hard to diagnose.
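Before any training run, it can help to check every record against the shape shown above. The sketch below is one possible validator; the function name and the specific rules (system turn first, assistant turn last) are assumptions for illustration, not requirements of any particular library. It checks structure only. Rendering the turns into the base model's own template is a separate step, normally handled by the tokenizer's chat template.

```python
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_record(record: dict) -> list[str]:
    """Return a list of problems found in one training record (empty list = OK)."""
    turns = record.get("conversations")
    if not isinstance(turns, list) or not turns:
        return ["missing or empty 'conversations' list"]

    problems = []
    for i, turn in enumerate(turns):
        if not isinstance(turn, dict):
            problems.append(f"turn {i}: not an object")
            continue
        role = turn.get("role")
        content = turn.get("content")
        if role not in ALLOWED_ROLES:
            problems.append(f"turn {i}: unknown role {role!r}")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"turn {i}: empty content")
        if role == "system" and i != 0:
            problems.append(f"turn {i}: system turn must come first")

    last = turns[-1]
    if not isinstance(last, dict) or last.get("role") != "assistant":
        problems.append("last turn must be the assistant reply to train on")
    return problems


good = {"conversations": [
    {"role": "system", "content": "You are a GST filing assistant for Indian businesses."},
    {"role": "user", "content": "What is the deadline for GSTR-3B?"},
    {"role": "assistant", "content": "GSTR-3B is due on the 20th of the following month for monthly filers."},
]}
bad = {"conversations": [
    {"role": "user", "content": "What is the deadline for GSTR-3B?"},
    {"role": "bot", "content": ""},
]}

print(validate_record(good))
print(validate_record(bad))
```

Running a check like this over the whole dataset catches malformed records early, when they are cheap to fix, rather than after a training run has produced confusing output.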
If you lack real labelled data, synthetic data generation — using a frontier model to produce training examples that you then verify — is now a standard approach. The risks and methods involved are covered in the synthetic data guide.
Step 2: Choose a Base Model
Pick a model that is close to your target task, licensed for your use case, and small enough to fit your hardware. Current starting points as of mid-2026:
- Gemma 3 27B or Gemma 4 variants (Google; check the licence per release, since Gemma 3 ships under Google's Gemma Terms of Use rather than Apache 2.0): Strong general base, well-documented fine-tuning support.
- Phi-4 14B (Microsoft, MIT): Punches above its weight on reasoning tasks, trains faster due to smaller size.
- Qwen3.5 series (Alibaba, Apache 2.0): Strong multilingual support; useful if your domain includes non-English text.
- Llama 3.x (Meta): The most documented fine-tuning ecosystem, large community, but check the Llama licence for commercial use.
Start smaller than you think you need. Fine-tuning a 7B model is faster and cheaper to iterate on than a 27B model. Prove the approach at small scale before scaling up.
Step 3: Train with Unsloth (or Axolotl / TRL)
For single-GPU training on consumer hardware, Unsloth is the current practical choice. Its maintainers report approximately 70% less VRAM usage than a standard LoRA setup and roughly 2x faster training than vanilla TRL; expect the actual gain to vary with model and sequence length. It supports QLoRA with 4-bit quantisation out of the box and has been updated for current model families.
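from unsloth import FastLanguageModel

# Load the base model already quantised to 4-bit (the QLoRA setup).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gemma-3-27b-bnb-4bit",
    max_seq_length=4096,
    load_in_4bit=True,
)

# Attach LoRA adapters; only these adapter weights are trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # adapter rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    lora_alpha=16,  # scaling numerator; effective scale is lora_alpha / r
    lora_dropout=0,
    bias="none",
)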
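The common defaults in 2026 tooling are r=16, lora_alpha=16, targeting all linear layers. The example above targets only the attention projections; to cover all linear layers, add the MLP projections (gate_proj, up_proj and down_proj in Llama-style architectures) to target_modules. These are reasonable starting points. r (rank) controls the capacity of the adapters: higher rank means more expressiveness, more parameters, and more risk of overfitting on small data. Start at 16. Increase only if quality is clearly insufficient.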
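To see how rank drives adapter size, the sketch below counts trainable parameters for a few values of r. The architecture numbers are assumptions for illustration only (hidden size 4096, 32 layers, square attention projections with no grouped-query attention); real models differ, so treat the output as an order-of-magnitude guide.

```python
def lora_trainable_params(r: int, layer_shapes: dict, n_layers: int) -> int:
    """Count LoRA parameters: each adapted (d_in, d_out) layer adds r * (d_in + d_out)."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in layer_shapes.values())
    return per_layer * n_layers


HIDDEN = 4096    # assumed hidden size
N_LAYERS = 32    # assumed number of transformer layers

# Assumed shapes: every attention projection maps HIDDEN -> HIDDEN.
attention_shapes = {
    name: (HIDDEN, HIDDEN) for name in ("q_proj", "k_proj", "v_proj", "o_proj")
}

counts = {r: lora_trainable_params(r, attention_shapes, N_LAYERS) for r in (8, 16, 64)}

for r, n in counts.items():
    print(f"r={r:>2}: {n:,} trainable parameters")
```

Adapter size grows linearly with r, so moving from r=16 to r=64 quadruples the number of weights the training data has to constrain.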
For multi-GPU or larger pipelines, Axolotl (YAML-driven, flexible) and TRL (for RLHF and advanced objectives) are the alternatives. Axolotl is particularly good for teams that want a reproducible, config-driven pipeline.
Step 4: Evaluate Before You Serve
Never judge a fine-tuned model by training loss alone. Low training loss tells you the model fits the training set, possibly by memorising it. It does not tell you whether it generalised.
Build a small held-out evaluation set — examples your model never saw during training — and score it on the specific quality dimensions that matter for your use case. For structured output tasks, measure exact-match format compliance. For language quality tasks, use a combination of automated scoring and human review.
Compare against the unmodified base model on the same eval set. If the fine-tuned model is not meaningfully better on your specific task, you have a data problem, a hyperparameter problem, or a problem definition problem.
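The comparison described above can be sketched in a few lines. Everything concrete here is hypothetical: the output schema (category and priority keys), the two held-out references, and the base and fine-tuned outputs are made-up placeholders standing in for real model generations.

```python
import json

REQUIRED_KEYS = {"category", "priority"}  # hypothetical output schema


def is_format_compliant(output: str) -> bool:
    """True if the output parses as a JSON object containing the required keys."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and REQUIRED_KEYS <= parsed.keys()


def score(outputs: list[str], references: list[str]) -> dict:
    """Format compliance and exact-match rate over a held-out set."""
    n = len(references)
    compliant = sum(is_format_compliant(o) for o in outputs)
    exact = sum(o.strip() == ref.strip() for o, ref in zip(outputs, references))
    return {"format_compliance": compliant / n, "exact_match": exact / n}


# Placeholder held-out references and model outputs (not real results).
references = [
    '{"category": "billing", "priority": "high"}',
    '{"category": "login", "priority": "low"}',
]
base_outputs = [
    "The category is billing and the priority is high.",
    '{"category": "login", "priority": "low"}',
]
tuned_outputs = [
    '{"category": "billing", "priority": "high"}',
    '{"category": "login", "priority": "medium"}',
]

base_scores = score(base_outputs, references)
tuned_scores = score(tuned_outputs, references)
print("base: ", base_scores)
print("tuned:", tuned_scores)
```

Reporting the two metrics separately matters: in this placeholder data the fine-tuned outputs are always well-formed but no more accurate than the base model, which points at a data problem rather than a formatting one.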
Step 5: Serve the Adapters
After training, you have two options. Merge the adapters into the base model weights — a one-time operation that produces a single model file with no runtime overhead. Or load the base model and apply adapters at inference time — useful if you want to maintain multiple specialisations off the same base.
Merging is the simpler and more portable option for most deployments. Once exported to GGUF, the merged model loads in Ollama, LM Studio, or any GGUF-compatible runtime just like any other model.
Cost and Hardware Expectations
QLoRA on a 7B model: 8–10 GB VRAM. Consumer RTX 4070 or equivalent. Training time for a few hundred examples: 30 minutes to a few hours depending on sequence length. GPU costs on cloud (A10, L4): a few dollars for a complete run.
LoRA on a 7B model (16-bit): 16–20 GB VRAM. RTX 4090 or equivalent. Faster training but higher VRAM cost.
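QLoRA on a 27B model: 20–28 GB VRAM. An RTX 4090 or an A10G (24 GB each) covers the lower part of that range; longer sequences or larger batches need a bigger card. Training is significantly slower. Cloud is usually the right call here.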
QLoRA on a 70B model: A single A100 80 GB handles it. Cost on cloud: roughly $5–$25 per training run depending on dataset size and provider.
The pattern: use QLoRA to get the most capability out of the smallest GPU. Use cloud for anything above 27B unless you have dedicated hardware.
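The figures above can be turned into a quick lookup for a given card. This is a rough sketch: the table simply restates the ranges quoted in this section, the function name is hypothetical, and the 70B row uses 80 GB for both bounds because the text gives a GPU rather than a range.

```python
# (setup, low_gb, high_gb), restating the estimates quoted in this section.
VRAM_ESTIMATES = [
    ("QLoRA 7B", 8, 10),
    ("LoRA 7B (16-bit)", 16, 20),
    ("QLoRA 27B", 20, 28),
    ("QLoRA 70B", 80, 80),  # assumption: one A100 80 GB, no range given
]


def setups_that_fit(gpu_vram_gb: float, conservative: bool = True) -> list[str]:
    """List setups whose estimated VRAM fits on the given GPU.

    conservative=True compares against the upper estimate,
    conservative=False against the lower one.
    """
    fits = []
    for name, low, high in VRAM_ESTIMATES:
        needed = high if conservative else low
        if needed <= gpu_vram_gb:
            fits.append(name)
    return fits


print(setups_that_fit(12))                      # a 12 GB consumer card
print(setups_that_fit(24))                      # a 24 GB card, upper estimates
print(setups_that_fit(24, conservative=False))  # a 24 GB card, lower estimates
```

The gap between the conservative and optimistic answers for a 24 GB card is the practical meaning of a range like 20–28 GB: whether a run fits depends on sequence length and batch size.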
The Pitfalls That Waste People's Time
Training on the wrong data format. The chat template must match what the base model expects. Using the wrong template is the single most common cause of "it trained fine but outputs garbage."
Not establishing a baseline. Before fine-tuning, test the base model with a carefully written system prompt on your task. Many teams skip this and fine-tune a model that a good system prompt would have handled.
Using too high a rank on a small dataset. High r values on small datasets cause overfitting. The model performs perfectly on the training examples and fails on anything else. Start at r=8 or r=16. Only increase if you have sufficient data and the quality improvement justifies it.
Forgetting to evaluate the base capability. Fine-tuning can improve performance on your target task while degrading performance on general tasks. If the model needs to be useful outside your specific fine-tuned domain, test for regression on general benchmarks.
Treating fine-tuning as one-shot. Fine-tuning is iterative. The first run reveals data problems you could not see before you had results. Budget for at least two or three training iterations.
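For the overfitting pitfall in particular, a simple check across iterations is to track loss on the held-out set next to training loss and flag the point where the two diverge. The sketch below assumes one loss value per epoch; the function name and the loss curves are made up for illustration, not measurements from a real run.

```python
from typing import Optional


def first_overfit_epoch(train_losses: list, eval_losses: list) -> Optional[int]:
    """Return the first epoch index where train loss fell but eval loss rose, else None."""
    for i in range(1, min(len(train_losses), len(eval_losses))):
        train_improved = train_losses[i] < train_losses[i - 1]
        eval_worsened = eval_losses[i] > eval_losses[i - 1]
        if train_improved and eval_worsened:
            return i
    return None


# Hypothetical per-epoch losses (epoch indices start at 0).
train_losses = [1.90, 1.20, 0.70, 0.30, 0.10]
eval_losses = [1.95, 1.40, 1.10, 1.20, 1.50]

overfit_at = first_overfit_epoch(train_losses, eval_losses)
best_epoch = min(range(len(eval_losses)), key=eval_losses.__getitem__)

print("eval loss starts rising at epoch:", overfit_at)
print("checkpoint with lowest eval loss:", best_epoch)
```

In a pattern like this one, the checkpoint worth keeping is the one with the lowest held-out loss, not the last one, and a lower rank or more data is the usual next experiment.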
---
Fine-tuning is a precision instrument. It does one thing well: it adapts a model's behaviour to match a distribution of examples you show it. Define that distribution carefully, and it earns its place in the stack.
