What is the minimum RAM or VRAM needed to run a local LLM?

The minimum is 8 GB of GPU VRAM — or 16 GB of Apple Silicon unified memory — to run a 7B-class model comfortably. With only system RAM and no GPU, 16 GB of RAM can run 3B–7B models at 8–20 tokens per second using CPU inference. This is usable for batch tasks but slow for conversation.

Is Ollama or LM Studio better for developers?

Ollama is better for developers who want a CLI-first, scriptable tool with a clean local API. LM Studio is better if you want a visual interface for exploring models, comparing outputs, or running a local server for a small team. Both expose an OpenAI-compatible API, so your existing code works with either.

Can I use local LLMs with my existing OpenAI API code?

Yes. Both Ollama (on port 11434) and LM Studio (on port 1234) expose an OpenAI-compatible API. You change the base URL in your client and set a dummy API key. Most OpenAI client libraries support this with a one-line configuration change.

Are open-source local models good enough for production use?

For many tasks, yes. Document summarisation, classification, code assistance, extraction, and RAG-based question answering all work well with models like Gemma 3 27B or Phi-4 14B. For tasks requiring frontier reasoning quality — complex multi-step planning, nuanced judgment — cloud models still lead. The honest answer is: test on your specific task. Benchmarks rarely predict task-specific performance reliably.

What does quantisation mean and which quantisation should I use?

Quantisation reduces model weights from 32-bit or 16-bit floats to 4-bit or 8-bit integers, shrinking file size and VRAM usage at a modest quality cost. Q4_K_M is the most common balanced choice — it fits larger models into consumer hardware with minimal quality degradation. Q8 is closer to full precision but uses more VRAM. Q2 is very small but noticeably weaker. Start with Q4_K_M.

How much disk space do local LLMs take?

Expect 2–5 GB for 3B–7B models at Q4, 8–12 GB for 14B models, 15–20 GB for 27B models, and 35–45 GB for 70B models at Q4_K_M quantisation. You need a fast NVMe SSD — models load from disk into RAM/VRAM on startup, and a slow disk means slow load times even if inference is fast.

Run LLMs Locally: 2026 Guide to Local AI

You can run capable open models entirely on your own laptop or desktop using tools like Ollama or LM Studio. No API key. No data leaving your machine. No per-token bill. The constraint is hardware: specifically, how much RAM or GPU VRAM you have determines which models you can run and how fast they respond.

I have run local models in production contexts — including for client-side AI work where data never touches a server — and the gap between local and cloud has closed more than most people realise. Here is what actually matters.

Why Run Locally?

Four reasons that genuinely hold up, and one that is usually overstated.

Privacy. Your prompts, documents, and data stay on your machine. For anything involving client data, sensitive business logic, or personal information, this is not a nice-to-have. It is a requirement. GDPR, India's DPDPA, and most enterprise policies create real liability when data passes through third-party model APIs.

Cost at scale. At low usage, API billing is trivial. At high usage — batch document processing, developer tooling running thousands of completions a day, agentic pipelines that make many calls per task — local inference costs nothing per token after the hardware is purchased.

Offline capability. A model that runs locally works without internet. For edge deployments, field tools, or applications in unreliable connectivity environments, this is the only option.

Control. You choose the model, the version, the context length, the quantisation. You are not subject to model updates that change behaviour without notice. You can pin a model and keep it pinned.

The overstated reason: local models as a substitute for frontier models. They are not, for most tasks. If you need the reasoning quality of a top cloud model, you need a cloud model. Local inference is about privacy, cost, and control — not capability parity at the top end.

What Hardware You Realistically Need

GPU VRAM is the hard limit. It determines which models load at all. System RAM and CPU affect load time and CPU-only fallback speed.

Minimum (7B-class models): 8 GB GPU VRAM, or an Apple Silicon Mac with 16 GB unified memory. This gets you models like Gemma 3 4B, Phi-4-mini, or Llama 3.2 3B running at useful speeds.

Comfortable (14B–27B models): 16–24 GB VRAM. An RTX 4070 Ti (12 GB) handles 7B–14B models in 4-bit quantisation. An RTX 4090 (24 GB) comfortably runs 27B models. Apple Silicon with 32 GB or 64 GB unified memory is a strong option here.

Serious (32B–70B models): 24–40 GB VRAM, or Apple Silicon with 48 GB+. A 70B model at 4-bit quantisation (Q4_K_M) needs roughly 40 GB.

CPU-only fallback: A machine with 16 GB system RAM can run 3B–7B models at 8–20 tokens per second using CPU inference. Usable for non-interactive batch tasks, painful for conversation.

Add headroom. Budget 25% on top of model weights for the KV cache at 8K context, up to 100% at 32K context. A model that just barely fits at idle will fail once the conversation gets long.

The Main Tools

Ollama (current version: v0.30.8 as of June 2026)

Ollama is the fastest way to get a model running. Install it, run ollama pull , and you have a local OpenAI-compatible API at http://localhost:11434. It handles model downloads, GGUF conversion, and GPU offloading automatically. The library has over 4,500 models. It now supports structured outputs (JSON schema-constrained responses), Gemma 4 speculative decoding on Apple Silicon, and cached API responses for lower latency.

ollama pull gemma3:27b
ollama run gemma3:27b

The API is intentionally minimal. You point your existing OpenAI-compatible code at the local endpoint and change nothing else.

LM Studio (current version: 0.4.16 as of June 2026)

LM Studio is the GUI-first option. It ships a model browser connected to Hugging Face, shows you RAM/VRAM requirements before you download, and runs an OpenAI-compatible server on port 1234. It also ships llmster, a headless daemon for running without the GUI — useful for CI/CD pipelines or headless Linux servers.

A newer feature, LM Link, lets you access your desktop models from a phone over an encrypted connection. For developers who want to test voice or mobile integrations against their local model, this is useful.

LM Studio is the right choice when you want a visual interface, are evaluating multiple models, or are sharing a local inference server with a small team.

llama.cpp

The low-level engine that most local inference tooling is built on. If you need maximum control — custom quantisation, specific hardware optimisations, embedding into a C++ application — llama.cpp is the right level to work at. Ollama and LM Studio both use it under the hood. Directly using llama.cpp means more configuration and no GUI, but gives you the most direct control over inference parameters.

Which Open Models to Start With

As of mid-2026, these families and specific variants are worth knowing.

For most tasks, start here:

Gemma 3 27B (Google, Apache 2.0): Strong general-purpose model, runs in 16 GB VRAM at Q4. Well-supported in Ollama.
Phi-4 14B (Microsoft, MIT): Excellent at reasoning for its size. Fits in 8–12 GB VRAM with quantisation.
Llama 3.2 3B (Meta): The most-pulled model on Ollama. Fast, small, good for tooling where latency matters.
Qwen3.5 (Alibaba, Apache 2.0): Strong multilingual support, good for code, available in many sizes.

For coding specifically:

DeepSeek V3.2 and GLM-5 family models are strong at code tasks, though they are larger and harder to run on consumer hardware.

For long context:

Llama 4 Scout (released April 2026) supports up to 10M token context. The full model is large, but it signals where the open ecosystem is heading.

Licensing note: Gemma 4, Qwen3, and Phi-4 are all Apache 2.0 or MIT licensed, which means no restrictions on commercial use. Check the licence before you ship anything.

For a deeper look at how small models compare in production use, including latency and quality trade-offs, that post goes further.

Good Use Cases for Local LLMs

Where local inference genuinely earns its place:

Developer tooling. Code completion, PR review, doc generation — tasks that run thousands of times a day against your own codebase. Sending that code to a cloud API has both privacy and cost implications. Local inference handles both.

Document processing. Summarising, extracting, classifying documents that contain sensitive information. Legal, financial, medical contexts where data residency matters.

Agentic pipelines. Agents make many LLM calls per task. At cloud API prices, this gets expensive fast. Local inference makes agentic architectures economical.

Prototyping. You can iterate quickly on prompts and workflows without burning API credits. The feedback loop is tighter.

Offline applications. Field tools, edge devices, rural deployments — anything where connectivity is not guaranteed.

The Honest Limits

Local models in 2026 are genuinely good. They are not frontier models.

On complex reasoning, extended planning, and tasks requiring broad world knowledge, the gap between a local 7B–27B model and a top cloud model is real. For many practical tasks — classification, extraction, summarisation, code assistance — the gap is small enough not to matter. For tasks that require the best possible reasoning, it matters a lot.

Multimodal support (vision, audio) in local models is improving but still limited compared to cloud options. Running a multimodal model locally requires significantly more VRAM than the text-only equivalent.

Speed on CPU-only hardware is functional but not comfortable for interactive use. If you are on a machine without a capable GPU and no Apple Silicon, manage your expectations on latency.

Finally, running local models requires some operational discipline. Models need disk space (anywhere from 2 GB to 40 GB+ per model). Updates are manual. You are responsible for evaluating whether a new model version is better or worse for your use case.

None of these are reasons not to use local inference. They are reasons to be clear-eyed about when it is the right tool.

---

The rule I use: if the data is sensitive, or the scale makes API costs meaningful, or the environment is offline — run it locally. Otherwise, use the best cloud model you can access.

---

Run LLMs on Your Own Machine: A 2026 Guide to Local AI

Why Run Locally?

What Hardware You Realistically Need

The Main Tools

Which Open Models to Start With

Good Use Cases for Local LLMs

The Honest Limits

Frequently Asked Questions

Related Posts

Small Models, Big Wins: When Phi-4 or Gemma Beats GPT-4 in Your Stack

Privacy-First AI: Why Client-Side Inference Matters