What is the difference between LoRA and QLoRA?

Both train small adapter weights on top of a frozen base model. QLoRA additionally quantises the base model to 4-bit precision before training, which cuts the memory needed to hold the base weights by roughly 4x compared to 16-bit. LoRA (without quantisation) keeps 16-bit base weights and needs more VRAM but may train slightly faster. QLoRA is the practical choice when hardware is the constraint.

How much data do I need to fine-tune a model?

The honest answer is: it depends on the task. For format and style adaptation, a few hundred high-quality examples can produce visible improvement. For new domain knowledge or behaviour patterns, you typically want a few thousand examples minimum. More data improves generalisation, but quality matters more than quantity. Fifty carefully curated examples can outperform five hundred noisy ones.

Should I fine-tune or use RAG?

Use RAG when your task is factual retrieval — the model needs to answer questions based on documents that change over time. Use fine-tuning when you need consistent output format, style, or domain behaviour that prompting alone does not reliably produce. Many production systems use both: a fine-tuned model with RAG retrieval on top.

What is rank (r) in LoRA and how do I choose it?

The rank `r` controls the capacity of the LoRA adapters — how many degrees of freedom the adapters have to adapt the model. Higher rank means more expressiveness and more parameters, but also more risk of overfitting on small datasets. Start at `r=16`. Increase to `r=32` or `r=64` only if you have a large, clean dataset and clear evidence that the lower rank is limiting quality.
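To make the capacity trade-off concrete, the sketch below counts trainable adapter parameters at several ranks. The layer shape used here (32 layers, four 4096 x 4096 projections per layer) is an assumed, illustrative configuration, not the dimensions of any particular model.

```python
def lora_adapter_params(r, d_in=4096, d_out=4096, modules_per_layer=4, num_layers=32):
    """Trainable parameters added by LoRA adapters of rank r.

    Each adapted weight matrix (d_out x d_in) gets two small matrices:
    A with shape (r x d_in) and B with shape (d_out x r).
    All dimensions are illustrative assumptions.
    """
    per_matrix = r * (d_in + d_out)
    return per_matrix * modules_per_layer * num_layers


for r in (8, 16, 32, 64):
    print(f"r={r:>2}: {lora_adapter_params(r):,} trainable parameters")
```

The count grows linearly with `r`, so doubling the rank doubles the number of parameters that a small dataset has to constrain.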

Can I fine-tune a model without a GPU?

Technically yes, on CPU, but it is impractically slow for any useful fine-tuning run. A small QLoRA run on a 7B model that takes roughly 30 minutes to a few hours on a single GPU can take many hours or days on CPU. For one-off experiments, cloud GPU providers (Colab, Lambda, RunPod, Vast.ai) are the practical answer if you do not have a local GPU. Short runs are inexpensive: a few dollars for a complete training job on a 7B model.

How do I know if my fine-tuning actually worked?

Build a held-out evaluation set before you train — examples the model never sees during training. Score the fine-tuned model on this set and compare to the base model on the same examples. If the fine-tuned model is not measurably better on your specific task, the issue is usually data quality, data format, or the fact that a better system prompt would have solved the problem without fine-tuning.
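One simple way to carve out that held-out set is a seeded shuffle followed by a fixed split. This is a minimal sketch; the 10% fraction, the seed, and the `split_holdout` name are arbitrary choices made for illustration.

```python
import random


def split_holdout(examples, eval_fraction=0.1, seed=13):
    """Shuffle once with a fixed seed, then carve off a held-out eval set.

    Returns (train, held_out). The held-out examples must never be used
    for training, or the comparison against the base model is meaningless.
    """
    if not 0 < eval_fraction < 1:
        raise ValueError("eval_fraction must be between 0 and 1")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, round(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]


examples = [{"id": i} for i in range(200)]
train, held_out = split_holdout(examples)
print(len(train), len(held_out))
```

Fixing the seed keeps the split reproducible, so later training iterations are scored on exactly the same held-out examples.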

Fine-Tuning LLMs with LoRA & QLoRA

LoRA and QLoRA let you adapt a large model cheaply by training a small set of extra "adapter" weights instead of the whole model. The base model is frozen — you are not retraining it. You are training a thin wrapper on top of it. QLoRA takes this further by first quantising the base model to 4-bit precision, which dramatically cuts VRAM usage and makes fine-tuning viable on a single consumer GPU.

These techniques changed the economics of model customisation. What previously needed a cluster of A100s can now run overnight on a gaming GPU.

When Should You Fine-Tune at All?

Fine-tuning is the right tool in a narrow set of situations. It is the wrong tool in many more.

Fine-tune when:

You need the model to produce output in a very specific format, style, or structure consistently.
You are adding domain knowledge that the base model genuinely lacks — not just terminology, but concepts.
Prompt engineering and retrieval (RAG) have hit a ceiling and the quality gap is still unacceptable.
You have quality-labelled training data: at minimum a few hundred examples, ideally thousands.

Do not fine-tune when:

You want the model to "know more facts." RAG is the right answer for factual retrieval. Fine-tuning memorises facts poorly, and whatever it does memorise goes stale when the facts change.
Your real problem is prompt structure. Write better prompts first. Fine-tuning to fix formatting that a clearer prompt would handle is wasteful.
You have fewer than a couple hundred high-quality examples. You can fine-tune on small data, but you will mostly be overfitting.
The task changes frequently. A fine-tuned model is a snapshot. Updating it requires retraining.

The decision framework for RAG vs fine-tuning is worth reading before you commit to either path. They solve different problems and are frequently confused.

LoRA vs QLoRA Explained Simply

LoRA stands for Low-Rank Adaptation. The core idea is that the weight updates needed to adapt a model to a new task can be expressed as the product of two small matrices — instead of changing a large weight matrix directly, you train two skinny matrices whose product approximates the change. This is the "low-rank" part: the update is low-rank in the mathematical sense, meaning it has far fewer degrees of freedom than a full weight update.
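Written out, the idea looks like this. The symbols below are introduced here for illustration and follow the usual LoRA notation; the alpha/r scaling is the convention used by common tooling, where alpha corresponds to the `lora_alpha` setting.

```latex
W' = W_0 + \Delta W = W_0 + \frac{\alpha}{r}\, B A,
\qquad B \in \mathbb{R}^{d \times r},\quad
A \in \mathbb{R}^{r \times k},\quad
r \ll \min(d, k)
```

Here W_0 is the frozen d x k weight matrix. Only B and A are trained, which means r(d + k) trainable parameters instead of dk.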

In practice, LoRA adds adapters — pairs of those small matrices — to specific layers of the transformer, usually the attention layers. The base model weights are frozen. Only the adapter weights are trained. At inference, you either merge the adapters into the base model weights (no runtime overhead) or apply them on the fly.
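The two inference options give the same output, which a toy example can show. The sketch below uses plain Python lists and made-up 3 x 3 numbers, not real model weights; `scale` stands in for the adapter scaling factor.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_scaled(w, delta, scale):
    """Return w + scale * delta, element by element."""
    return [[wi + scale * di for wi, di in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Toy sizes: a 3x3 frozen weight with a rank-1 adapter (B is 3x1, A is 1x3).
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [3.0, 0.0, 1.0]]
B = [[0.5], [-1.0], [2.0]]
A = [[1.0, 2.0, -1.0]]
scale = 2.0                 # illustrative stand-in for lora_alpha / r
x = [[1.0], [2.0], [3.0]]   # one input column vector

# Option 1: merge once, then run a plain matrix multiply.
W_merged = add_scaled(W, matmul(B, A), scale)
y_merged = matmul(W_merged, x)

# Option 2: keep the adapter separate and add its contribution on the fly.
base_out = matmul(W, x)
adapter_out = matmul(B, matmul(A, x))
y_on_the_fly = add_scaled(base_out, adapter_out, scale)

print(y_merged, y_on_the_fly)
```

Merging pays the cost of computing B times A once; the on-the-fly path keeps the adapter as a separate, swappable piece at the price of two extra small multiplies per adapted layer.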

QLoRA (Quantised LoRA) extends this by quantising the frozen base model to 4-bit precision before training begins. This cuts the VRAM required to hold the base model by roughly 4x compared to 16-bit. You then train LoRA adapters on top of the quantised base model in a higher-precision format (typically bfloat16), so the adapters themselves are not quantised.

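The result: a 7B model whose 16-bit weights alone take about 14 GB, and whose full fine-tuning needs several times that once gradients and optimizer state are counted, needs roughly 16–20 GB with 16-bit LoRA and 8–10 GB with QLoRA. With short sequences and small batches, a 7B QLoRA run can fit on a GPU with as little as 8 GB of VRAM.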
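The roughly 4x saving on the base model is plain arithmetic on bits per weight. The sketch below counts weights only; it deliberately ignores quantisation metadata, adapter weights, gradients, optimizer state, and activations, which is why real training runs need more VRAM than these figures.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Memory needed just to hold the weights, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


params_7b = 7_000_000_000
print(weight_memory_gb(params_7b, 16))  # 16-bit base weights
print(weight_memory_gb(params_7b, 4))   # 4-bit quantised base weights
```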

The Practical Workflow

Step 1: Data Preparation

Data quality dominates everything else. A fine-tuning run on 500 excellent examples beats one on 5,000 mediocre ones.

For supervised instruction fine-tuning — the most common case — you need input-output pairs in a consistent format. The standard chat template uses system, user, and assistant turns:

{
  "conversations": [
    {"role": "system", "content": "You are a GST filing assistant for Indian businesses."},
    {"role": "user", "content": "What is the deadline for GSTR-3B?"},
    {"role": "assistant", "content": "GSTR-3B is due on the 20th of the following month for monthly filers."}
  ]
}

Format matters. Use whatever template the base model was trained with. Mismatched templates are a common source of poor fine-tuning results that are hard to diagnose.
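A cheap structural check on every record catches many format problems before training starts. The sketch below assumes the `conversations` layout shown above; the specific rules (system turn first, final turn from the assistant) are assumptions that suit supervised instruction data, and the check does not replace rendering examples through the base model's own chat template.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_record(record):
    """Return a list of problems found in one training record (empty if clean)."""
    turns = record.get("conversations")
    if not isinstance(turns, list) or not turns:
        return ["missing or empty 'conversations' list"]
    problems = []
    for i, turn in enumerate(turns):
        if not isinstance(turn, dict):
            problems.append(f"turn {i}: not an object")
            continue
        role = turn.get("role")
        content = turn.get("content")
        if role not in ALLOWED_ROLES:
            problems.append(f"turn {i}: unknown role {role!r}")
        if role == "system" and i != 0:
            problems.append(f"turn {i}: system turn must come first")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"turn {i}: empty content")
    last = turns[-1]
    if not isinstance(last, dict) or last.get("role") != "assistant":
        problems.append("last turn must be from the assistant")
    return problems


line = json.dumps({"conversations": [
    {"role": "system", "content": "You are a GST filing assistant for Indian businesses."},
    {"role": "user", "content": "What is the deadline for GSTR-3B?"},
    {"role": "assistant", "content": "GSTR-3B is due on the 20th of the following month for monthly filers."},
]})
print(validate_record(json.loads(line)))
```

Running this over a JSONL file line by line and refusing to train while any record reports problems is usually cheaper than diagnosing a bad run afterwards.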

If you lack real labelled data, synthetic data generation — using a frontier model to produce training examples that you then verify — is now a standard approach. The risks and methods involved are covered in the synthetic data guide.

Step 2: Choose a Base Model

Pick a model that is close to your target task, licensed for your use case, and small enough to fit your hardware. Current starting points as of mid-2026:

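Gemma 3 27B or Gemma 4 variants (Google; Gemma 3 is released under the Gemma Terms of Use rather than Apache 2.0, so check the licence of the variant you pick): Strong general base, well-documented fine-tuning support.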
Phi-4 14B (Microsoft, MIT): Punches above its weight on reasoning tasks, trains faster due to smaller size.
Qwen3.5 series (Alibaba, Apache 2.0): Strong multilingual support; useful if your domain includes non-English text.
Llama 3.x (Meta): The most documented fine-tuning ecosystem, large community, but check the Llama licence for commercial use.

Start smaller than you think you need. Fine-tuning a 7B model is faster and cheaper to iterate on than a 27B model. Prove the approach at small scale before scaling up.

Step 3: Train with Unsloth (or Axolotl / TRL)

For single-GPU training on consumer hardware, Unsloth is the current practical choice. The project reports approximately 70% less VRAM usage than a standard LoRA setup and roughly 2x faster training than vanilla TRL; treat both as vendor figures and measure on your own workload. It supports QLoRA with 4-bit quantisation out of the box and has been updated for current model families.

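from unsloth import FastLanguageModel

# Load the base model with its weights already quantised to 4-bit (QLoRA).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gemma-3-27b-bnb-4bit",
    max_seq_length=4096,  # longest sequence, in tokens, used during training
    load_in_4bit=True,
)

# Attach LoRA adapters; the base weights stay frozen.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # adapter rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections only
    lora_alpha=16,  # scaling factor applied to the adapter update
    lora_dropout=0,
    bias="none",  # do not train bias terms
)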

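The default hyperparameters in 2026 tooling are r=16, lora_alpha=16, targeting all linear layers. The snippet above is narrower: it targets only the four attention projections, and adding gate_proj, up_proj, and down_proj to target_modules extends the adapters to the MLP layers as well. These are reasonable starting points. r (rank) controls the capacity of the adapters: higher rank means more expressiveness, more parameters, and more risk of overfitting on small data. Start at 16. Increase only if quality is clearly insufficient.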

For multi-GPU or larger pipelines, Axolotl (YAML-driven, flexible) and TRL (for RLHF and advanced objectives) are the alternatives. Axolotl is particularly good for teams that want a reproducible, config-driven pipeline.

Step 4: Evaluate Before You Serve

Never judge a fine-tuned model by training loss alone. A low training loss tells you the model fits the training set, which may only mean it memorised it. It does not tell you whether the model generalised.

Build a small held-out evaluation set — examples your model never saw during training — and score it on the specific quality dimensions that matter for your use case. For structured output tasks, measure exact-match format compliance. For language quality tasks, use a combination of automated scoring and human review.

Compare against the unmodified base model on the same eval set. If the fine-tuned model is not meaningfully better on your specific task, you have a data problem, a hyperparameter problem, or a problem definition problem.
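For a structured-output task, the base-versus-fine-tuned comparison can be as simple as the sketch below. The required keys and the sample outputs are hypothetical placeholders; in practice both lists would hold each model's responses to the same held-out prompts.

```python
import json


def is_compliant(output, required_keys):
    """True if output parses as a JSON object containing every required key."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed)


def compliance_rate(outputs, required_keys):
    """Fraction of outputs that pass the exact format check."""
    return sum(is_compliant(o, required_keys) for o in outputs) / len(outputs)


# Hypothetical outputs from both models on the same four held-out prompts.
required = {"form", "due_day"}
base_outputs = [
    '{"form": "GSTR-3B", "due_day": 20}',
    'The form is GSTR-3B, due on the 20th.',
    '{"form": "GSTR-3B"}',
    '{"form": "GSTR-3B", "due_day": 20}',
]
tuned_outputs = [
    '{"form": "GSTR-3B", "due_day": 20}',
    '{"form": "GSTR-3B", "due_day": 20}',
    '{"form": "GSTR-3B", "due_day": 20}',
    'due_day: 20',
]

base_rate = compliance_rate(base_outputs, required)
tuned_rate = compliance_rate(tuned_outputs, required)
print(f"base={base_rate:.2f} tuned={tuned_rate:.2f} delta={tuned_rate - base_rate:+.2f}")
```

Format compliance is only one dimension. A model can emit perfectly shaped JSON with wrong values, so pair a check like this with a correctness score on the same examples.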

Step 5: Serve the Adapters

After training, you have two options. Merge the adapters into the base model weights — a one-time operation that produces a single model file with no runtime overhead. Or load the base model and apply adapters at inference time — useful if you want to maintain multiple specialisations off the same base.

Merging is the simpler and more portable option for most deployments. Once converted to GGUF, the merged model loads in Ollama, LM Studio, or any GGUF-compatible runtime just like any other model.

Cost and Hardware Expectations

QLoRA on a 7B model: 8–10 GB VRAM. Consumer RTX 4070 or equivalent. Training time for a few hundred examples: 30 minutes to a few hours depending on sequence length. GPU costs on cloud (A10, L4): a few dollars for a complete run.

LoRA on a 7B model (16-bit): 16–20 GB VRAM. RTX 4090 or equivalent. Faster training but higher VRAM cost.

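QLoRA on a 27B model: 20–28 GB VRAM. A 24 GB card (RTX 4090 or A10G) covers only the lower part of that range, so keep sequence length and batch size modest or move to a larger GPU. Training is significantly slower. Cloud is usually the right call here.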

QLoRA on a 70B model: A single A100 80 GB handles it. Cost on cloud: roughly $5–$25 per training run depending on dataset size and provider.

The pattern: use QLoRA to get the most capability out of the smallest GPU. Use cloud for anything above 27B unless you have dedicated hardware.
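The figures in this section can be turned into a quick lookup when sizing a GPU. The ranges below are copied from the text above; the three labels ("fits", "tight", "too small") and the `fit_status` helper are an illustrative convention, not a measurement.

```python
# VRAM ranges in GB, copied from the figures quoted in this section.
VRAM_RANGES_GB = {
    ("qlora", "7B"): (8, 10),
    ("lora-16bit", "7B"): (16, 20),
    ("qlora", "27B"): (20, 28),
}


def fit_status(method, model_size, gpu_vram_gb):
    """Classify a GPU against the quoted range: 'fits', 'tight', or 'too small'."""
    low, high = VRAM_RANGES_GB[(method, model_size)]
    if gpu_vram_gb >= high:
        return "fits"
    if gpu_vram_gb >= low:
        return "tight"
    return "too small"


for config in VRAM_RANGES_GB:
    print(config, "on a 24 GB card:", fit_status(*config, 24))
```

A "tight" result means the run may work only with shorter sequences or smaller batches, so it is worth a short trial run before committing to a long job.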

The Pitfalls That Waste People's Time

Training on the wrong data format. The chat template must match what the base model expects. Using the wrong template is the single most common cause of "it trained fine but outputs garbage."

Not establishing a baseline. Before fine-tuning, test the base model with a carefully written system prompt on your task. Many teams skip this and fine-tune a model that a good system prompt would have handled.

Using too high a rank on a small dataset. High r values on small datasets cause overfitting. The model performs perfectly on the training examples and fails on anything else. Start at r=8 or r=16. Only increase if you have sufficient data and the quality improvement justifies it.

Forgetting to evaluate the base capability. Fine-tuning can improve performance on your target task while degrading performance on general tasks. If the model needs to be useful outside your specific fine-tuned domain, test for regression on general benchmarks.

Treating fine-tuning as one-shot. Fine-tuning is iterative. The first run reveals data problems you could not see before you had results. Budget for at least two or three training iterations.

---

Fine-tuning is a precision instrument. It does one thing well: it adapts a model's behaviour to match a distribution of examples you show it. Define that distribution carefully, and it earns its place in the stack.
