One-Line Summary: Georgi Gerganov's llama.cpp project, started in March 2023 as a C/C++ port of LLaMA inference, sparked a revolution in local AI by proving that large language models could run on ordinary laptops and even phones without a GPU.
Prerequisites: 01-llama-1.md, 04-quantization-and-compression.md, 04-the-open-source-ecosystem.md
What Is llama.cpp?
Imagine the early days of personal computing. Mainframes existed -- powerful, expensive, locked away in corporate data centers. Then someone figured out how to build a computer that fit on a desk and cost a fraction of the price. It wasn't as powerful as the mainframe, but it was yours. You could run it at home, modify it, build whatever you wanted on it. llama.cpp is that moment for AI: the project that took language models out of the data center and put them on your laptop.
When Meta released the LLaMA model weights in February 2023, they provided a reference implementation in Python using PyTorch. Running it required a modern NVIDIA GPU with substantial VRAM, CUDA libraries, and a Python environment with specific dependency versions. For AI researchers, this was routine. For the millions of developers, tinkerers, and curious users who didn't have a $10,000 GPU setup, it was a brick wall.
Georgi Gerganov, a Bulgarian software engineer known for his work on the Whisper.cpp speech recognition port, saw an opportunity. On March 10, 2023 -- less than two weeks after LLaMA's release -- he published the first commit of llama.cpp: a from-scratch implementation of LLaMA inference in plain C/C++ with no dependencies on Python, PyTorch, or CUDA. It ran on CPU. It ran on a MacBook. It ran, remarkably, on a Raspberry Pi. The AI world would never be the same.
How It Works
llama.cpp -- bringing AI from data centers to laptops:
Before llama.cpp: After llama.cpp:
┌──────────────────────────┐ ┌──────────────────────────┐
│ Requirements: │ │ Requirements: │
│ - NVIDIA GPU ($5K+) │ │ - Any computer │
│ - CUDA drivers │ │ - Download 2 files: │
│ - Python + PyTorch │ │ 1. llama.cpp binary │
│ - Conda environment │ │ 2. GGUF model file │
│ - ML expertise │ │ - That's it. │
└──────────────────────────┘ └──────────────────────────┘
Supported hardware:
┌────────────┬────────────┬───────────────┬──────────────┐
│ Desktop │ Laptop │ Phone/Edge │ Server │
│ (CPU/GPU) │ (Apple M) │ (ARM) │ (NVIDIA) │
├────────────┼────────────┼───────────────┼──────────────┤
│ x86 │ Metal GPU │ NEON SIMD │ CUDA │
│ AVX2/512 │ Unified │ Raspberry Pi │ Multi-GPU │
│ Vulkan │ Memory │ Android │ RPC cluster │
└────────────┴────────────┴───────────────┴──────────────┘
Performance (7B Q4_K_M):
MacBook Air M2: ~30-40 tok/s
RTX 4090: ~100+ tok/s
Raspberry Pi 5 (3B model): ~5-8 tok/s
Ecosystem: Ollama (100K+ stars) | LM Studio (millions of downloads)
The GGML Tensor Library
At the heart of llama.cpp is GGML (Georgi Gerganov Machine Learning), a custom tensor computation library written in C. Unlike PyTorch or TensorFlow, which are designed for flexibility and GPU-first computation, GGML is optimized for a specific use case: running Transformer inference efficiently on CPUs. It uses:
- Fixed memory allocation: No dynamic memory management during inference, eliminating garbage collection pauses
- Quantized computation kernels: Native support for INT4, INT5, INT8, and mixed-precision arithmetic
- SIMD optimization: Hand-tuned kernels for x86 (AVX, AVX2, AVX-512), ARM (NEON), and Apple Silicon (Accelerate framework, Metal GPU)
- Memory-mapped files: Models are loaded via mmap, so the operating system manages memory paging -- only the parts of the model actively needed are loaded into RAM
This architecture means GGML can run a 7B parameter model in 4 GB of RAM, generating tokens at 10-30 tokens per second on a modern laptop CPU. On Apple Silicon Macs, which have unified memory shared between CPU and GPU, performance is even better -- the Metal backend can leverage the GPU while keeping the model in shared memory.
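To make the quantized-kernel idea concrete, here is a heavily simplified sketch of a 4-bit block-quantized dot product, loosely modeled on GGML's Q4_0 layout (32 weights per block, one scale per block). It is illustrative only -- the real GGML kernels pack fp16 scales, use SIMD intrinsics, and implement more elaborate formats such as Q4_K -- but the structure is the same: small blocks, a per-block scale, and dequantization on the fly inside the inner loop.

```cpp
#include <cstdint>
#include <cstddef>

// Simplified 4-bit quantized block: 32 weights, one float scale.
// (Real GGML blocks store the scale as fp16 and use more compact layouts.)
struct BlockQ4 {
    float   scale;       // per-block scaling factor
    uint8_t quants[16];  // 32 x 4-bit values, two per byte, stored as (q - 8)
};

// Dot product between one quantized weight row and a float activation vector.
// Weights are dequantized on the fly: w = scale * (q - 8).
float dot_q4(const BlockQ4 *row, const float *x, size_t n_blocks) {
    float sum = 0.0f;
    for (size_t b = 0; b < n_blocks; ++b) {
        const BlockQ4 &blk = row[b];
        for (int i = 0; i < 16; ++i) {
            const int lo = (blk.quants[i] & 0x0F) - 8;  // low nibble
            const int hi = (blk.quants[i] >> 4)   - 8;  // high nibble
            sum += blk.scale * (lo * x[b * 32 + 2 * i] + hi * x[b * 32 + 2 * i + 1]);
        }
    }
    return sum;
}
```

In this toy layout a row of 4,096 weights takes about 2.5 KB instead of 16 KB at fp32, which is the kind of saving that lets a 7B model fit in roughly 4 GB.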
The GGUF Format
Early llama.cpp used a format called GGML, but as the project grew, the community developed GGUF (GPT-Generated Unified Format) -- a self-contained binary format that stores everything needed to load and run a model in a single file: architecture metadata, tokenizer vocabulary, quantization parameters, and the weight tensors themselves.
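As a rough sketch of what "self-contained" means at the byte level: a GGUF file begins with a small fixed header (magic, version, tensor count, metadata count), followed by the metadata key-value pairs and tensor descriptions. The fragment below reads only the fixed header; the field layout follows recent versions of the public GGUF specification, but treat it as an approximation rather than a reference parser.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

// Read the fixed-size GGUF header. Per the GGUF spec (recent versions), the
// file starts with the 4-byte magic "GGUF", a uint32 format version, a uint64
// tensor count, and a uint64 metadata key-value count, all little-endian.
// The metadata and tensor info that follow are omitted here.
int main(int argc, char **argv) {
    if (argc < 2) { std::fprintf(stderr, "usage: %s model.gguf\n", argv[0]); return 1; }

    std::FILE *f = std::fopen(argv[1], "rb");
    if (!f) { std::perror("fopen"); return 1; }

    char     magic[4];
    uint32_t version   = 0;
    uint64_t n_tensors = 0, n_kv = 0;

    if (std::fread(magic, 1, 4, f) != 4 || std::memcmp(magic, "GGUF", 4) != 0) {
        std::fprintf(stderr, "not a GGUF file\n");
        std::fclose(f);
        return 1;
    }
    std::fread(&version,   sizeof(version),   1, f);
    std::fread(&n_tensors, sizeof(n_tensors), 1, f);
    std::fread(&n_kv,      sizeof(n_kv),      1, f);
    std::fclose(f);

    std::printf("GGUF v%u: %llu tensors, %llu metadata keys\n",
                version, (unsigned long long)n_tensors, (unsigned long long)n_kv);
    return 0;
}
```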
GGUF's key innovation is supporting mixed quantization: different layers can use different precision levels. A common configuration like Q4_K_M (the 4-bit "k-quant" scheme in its medium variant) keeps the most quality-sensitive tensors -- such as parts of the attention and feed-forward output projections -- at higher precision while quantizing the rest more aggressively. This achieves better quality-per-byte than uniform quantization across all layers.
By 2024, GGUF had become a de facto community standard. The Hugging Face model hub hosted thousands of GGUF-quantized models, with prolific quantizers like TheBloke and bartowski providing multiple quantization variants for virtually every major open-weight model release.
Supported Hardware and Backends
What started as a CPU-only project expanded to support an impressive range of hardware:
- CPU: x86 (Intel, AMD) and ARM, with architecture-specific optimizations
- Apple Metal: GPU acceleration on Macs and iPhones, leveraging unified memory
- CUDA: NVIDIA GPU support for users who have discrete GPUs
- Vulkan: Cross-platform GPU support covering AMD, Intel, and NVIDIA
- SYCL: Intel GPU acceleration
- RPC: Distributed inference across multiple machines over a network
This hardware flexibility is critical: it means the same codebase runs on everything from an NVIDIA H100 in a data center to an M1 MacBook Air to an Android phone.
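In practice, backend selection is mostly invisible to the user: a build compiled with Metal, CUDA, or Vulkan support will offload as many layers as requested and run the rest on the CPU. The minimal sketch below uses llama.cpp's C API; the names follow llama.h as of roughly 2024 and may have shifted since, so treat it as an illustration of the idea rather than current reference code.

```cpp
#include "llama.h"   // llama.cpp's public C API (header names as of ~2024)
#include <cstdio>

int main(int argc, char **argv) {
    if (argc < 2) { std::fprintf(stderr, "usage: %s model.gguf\n", argv[0]); return 1; }

    llama_backend_init();

    // n_gpu_layers controls how many Transformer layers are offloaded to the
    // compiled-in GPU backend (Metal/CUDA/Vulkan). 0 = pure CPU; a large
    // value offloads everything that fits, with the remainder on the CPU.
    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 99;

    llama_model *model = llama_load_model_from_file(argv[1], mparams);
    if (!model) { std::fprintf(stderr, "failed to load model\n"); return 1; }

    // ... create a context, tokenize a prompt, and sample tokens here ...

    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```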
Performance Characteristics
Real-world performance depends on model size, quantization level, and hardware. Representative numbers as of 2024-2025:
- 7B model, Q4_K_M, M2 MacBook Air: ~30-40 tokens/second
- 13B model, Q4_K_M, M2 Pro Mac: ~20-30 tokens/second
- 70B model, Q4_K_M, M3 Max 128GB Mac: ~10-15 tokens/second
- 7B model, Q4_K_M, RTX 4090: ~100+ tokens/second (GPU-accelerated)
- 3B model, Q4_K_M, Raspberry Pi 5: ~5-8 tokens/second
These speeds are for token generation (the bottleneck phase). Prompt processing is typically much faster per token because prompt tokens are evaluated in parallel batches, making that phase compute-bound rather than memory-bandwidth-bound.
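A back-of-the-envelope model explains the shape of these numbers. Single-stream generation is memory-bandwidth bound: each new token must read essentially every weight once, so

tokens/second ≈ memory bandwidth (bytes/s) ÷ model size (bytes)

For a ~4 GB Q4_K_M 7B model on an RTX 4090 with roughly 1,000 GB/s of memory bandwidth, that gives a ceiling of about 250 tokens/second; the observed ~100+ reflects kernel overhead and imperfect bandwidth utilization. The same arithmetic, applied to the far lower bandwidth of a laptop SoC or a Raspberry Pi, predicts the gaps between the rows above (rough estimates, not measurements).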
Why It Matters
AI On Your Laptop Becomes Real
Before llama.cpp, "running AI locally" meant setting up a Python environment, installing CUDA drivers, managing PyTorch versions, and hoping your GPU had enough VRAM. After llama.cpp, it meant downloading a single binary and a GGUF file. The barrier dropped from "AI engineering expertise" to "can download two files." This wasn't just a technical simplification -- it was a cultural transformation. Millions of people who would never configure a CUDA environment could now interact with a language model running entirely on their own hardware.
The Privacy and Sovereignty Argument
When you run a model locally, your data never leaves your machine. No API calls, no cloud logging, no terms of service granting the provider rights to your interactions. For lawyers reviewing confidential documents, doctors discussing patient cases, journalists protecting sources, or anyone in a jurisdiction with strict data residency requirements, local inference isn't a preference -- it's a necessity. llama.cpp made this possible without enterprise infrastructure.
Spawning an Ecosystem
llama.cpp didn't just prove local inference was viable -- it spawned an entire ecosystem of tools that made it accessible:
- Ollama (2023): A one-command model manager ("ollama run llama3") that wraps llama.cpp in an elegant CLI and REST API. Made local AI as simple as Docker made containers.
- LM Studio (2023): A polished desktop application with a GUI for downloading, managing, and chatting with local models. No command line required.
- Open WebUI (2023): A ChatGPT-like web interface for local models.
- Jan (2024): Another desktop client focused on privacy and offline-first usage.
- text-generation-webui (2023): Gradio-based interface supporting multiple backends including llama.cpp.
These tools collectively brought local AI to an audience of millions -- not just developers, but designers, writers, students, and professionals who simply wanted a private, free AI assistant.
The Community
By 2025, llama.cpp's GitHub repository had accumulated thousands of contributors. The project moved faster than most corporate engineering teams, with major features and optimizations landing weekly. The community developed specialized quantization methods, contributed hardware-specific optimizations, created compatibility layers for dozens of model architectures, and maintained GGUF converters for every major model release. This was open-source collaboration at its finest: a shared infrastructure project that benefited the entire ecosystem.
Key Technical Details
- First commit: March 10, 2023, by Georgi Gerganov -- less than two weeks after LLaMA's public weight release
- Core library: GGML, written in C, with no external dependencies
- GGUF format: self-contained model files with mixed-precision quantization support
- 7B model in Q4_K_M: ~4 GB file size, runs in ~6 GB RAM (see the size arithmetic after this list)
- 70B model in Q4_K_M: ~40 GB file size, runs on hardware with 48-64+ GB memory
- Supported architectures: LLaMA, Mistral, Phi, Gemma, Qwen, Command R, and dozens more
- Hardware backends: CPU (x86/ARM), Metal, CUDA, Vulkan, SYCL, RPC
- Ecosystem tools: Ollama (100K+ GitHub stars), LM Studio (millions of downloads)
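The file sizes in this list follow directly from the quantization arithmetic (rough estimates -- Q4_K_M averages somewhat more than 4 bits per weight because of per-block scales and the higher-precision tensors):

7B parameters × ~4.5 bits/weight ÷ 8 ≈ 3.9 GB (the "~4 GB" file)
70B parameters × ~4.5 bits/weight ÷ 8 ≈ 39 GB (the "~40 GB" file)

RAM usage is higher than the file size because the KV cache, activation buffers, and the runtime itself need memory on top of the memory-mapped weights.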
Common Misconceptions
- "Local models are too slow to be useful." A 7B model at 30-40 tokens/second on a MacBook is plenty fast for interactive use. Larger models are slower but still practical for tasks where latency tolerance is higher, like code generation or document analysis.
- "You need an NVIDIA GPU." llama.cpp was designed CPU-first and works excellently on Apple Silicon Macs, AMD processors, and even ARM chips. GPU acceleration is optional and cross-vendor.
- "Local models are much worse than cloud APIs." For many tasks, a well-quantized 70B model running locally is competitive with cloud-hosted models. The quality gap has narrowed dramatically, especially for coding, summarization, and domain-specific tasks.
- "Quantization makes models useless." At Q4_K_M (4-bit mixed precision), quality loss is typically under 1% on standard benchmarks. Most users cannot distinguish between a 4-bit quantized model and the original in blind testing.
- "llama.cpp is only for hobbyists." Major companies use llama.cpp-derived inference for production workloads. Its efficiency and hardware flexibility make it a legitimate deployment option for real-world applications.
Connections to Other Concepts
llama.cpp depends critically on 04-quantization-and-compression.md for making models fit in limited memory. The models it runs come from 01-phi-series.md, 02-gemma.md, the LLaMA family, and virtually every open-weight project. Fine-tuned models created via 05-lora-and-fine-tuning-democratization.md are commonly converted to GGUF for local deployment. The broader movement it enables is captured in 07-the-slm-revolution.md, and its role in the open ecosystem feeds directly into 07-open-vs-closed-the-narrowing-gap.md.
Further Reading
- Gerganov, "llama.cpp" GitHub repository (2023-present) -- the project itself, with extensive documentation and community discussion
- Gerganov, "ggml" GitHub repository (2023-present) -- the underlying tensor library
- GGUF specification documentation -- the community-standard model format
- Ollama documentation (2023-present) -- the most popular user-facing tool built on llama.cpp
- Hugging Face GGUF model hub -- the primary distribution point for quantized models