One-Line Summary: Install Ollama on your machine and verify it is running — the simplest way to run open-source LLMs locally.
Prerequisites: macOS, Linux, or Windows (WSL2), terminal access, ~10 GB free disk space
What Is Ollama
Ollama is an open-source tool that packages LLM weights, configuration, and a runtime into a single, easy-to-use system. Think of it as "Docker for LLMs" — you pull models by name and run them with a single command.
Under the hood, Ollama uses llama.cpp for inference, which means:
- CPU inference works out of the box — no GPU required
- Automatic quantization support — run large models on limited hardware
- Apple Silicon acceleration — uses Metal on M1/M2/M3/M4 Macs
- NVIDIA GPU acceleration — uses CUDA when available
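To see what quantization buys you, a back-of-the-envelope estimate helps. The sketch below is ours, not Ollama's: the bit widths are approximate llama.cpp figures, and it counts weights only, ignoring KV cache and runtime overhead:

```python
# Rough memory footprint of model weights at different precisions.
# Bits-per-weight values are approximate llama.cpp quantization levels.

def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB for a given parameter count."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9

for label, bits in [("FP16", 16), ("Q8_0", 8.5), ("Q4_K_M", 4.85)]:
    print(f"7B model at {label}: ~{weight_memory_gb(7, bits):.1f} GB")
```

A 7B model drops from roughly 14 GB of weights at FP16 to around 4 GB at 4-bit quantization, which is why quantized models fit on laptops.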
Install on macOS
Download and install from the official site:
```bash
# Download the macOS installer
# Visit https://ollama.com/download and install the app

# Or install via Homebrew
brew install ollama
```

Install on Linux
The install script handles everything automatically:
```bash
# One-line install script — downloads and configures Ollama
curl -fsSL https://ollama.com/install.sh | sh
```

This installs the ollama binary and sets up a systemd service. On Linux, Ollama will automatically detect and use NVIDIA GPUs if the CUDA drivers are installed.
Install on Windows
Ollama runs natively on Windows or inside WSL2:
```bash
# Option 1: Download the Windows installer from https://ollama.com/download

# Option 2: Inside WSL2, use the Linux install script
curl -fsSL https://ollama.com/install.sh | sh
```

Verify the Installation
After installing, confirm Ollama is working:
```bash
# Check the installed version
ollama --version
```

You should see output like `ollama version 0.4.x` or newer.
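If you are scripting your setup, the same check can be automated. Here is a small sketch (the `ollama_version` helper is our own, not part of Ollama; it assumes the binary is on your PATH when installed and returns None otherwise):

```python
# Scriptable version check, handy in setup scripts or CI.
# Returns the version string, or None if ollama is not installed.
import shutil
import subprocess
from typing import Optional

def ollama_version() -> Optional[str]:
    if shutil.which("ollama") is None:
        return None
    out = subprocess.run(["ollama", "--version"],
                         capture_output=True, text=True)
    text = out.stdout.strip()
    # Output looks like "ollama version is 0.4.x"; keep the last token
    return text.split()[-1] if text else None

print(ollama_version())
```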
Start the Ollama Server
On macOS, the Ollama app starts the server automatically. On Linux, the systemd service handles it. You can also start it manually:
```bash
# Start the Ollama server in the foreground
ollama serve
```

In a separate terminal, verify the server is responding:

```bash
# Check that the API is reachable — should return "Ollama is running"
curl http://localhost:11434
```

You should see the response:

```
Ollama is running
```

Understand the Ollama Architecture
When you run Ollama, here is what happens:
```
┌──────────────┐      HTTP API       ┌──────────────────┐
│  ollama CLI  │────────────────────►│  Ollama Server   │
│   or curl    │   localhost:11434   │                  │
└──────────────┘                     │  ┌────────────┐  │
                                     │  │ llama.cpp  │  │
                                     │  │  runtime   │  │
                                     │  └─────┬──────┘  │
                                     │        │         │
                                     │  ┌─────▼──────┐  │
                                     │  │Model files │  │
                                     │  │ ~/.ollama/ │  │
                                     │  └────────────┘  │
                                     └──────────────────┘
```

Key details:
- Model storage: Models are stored in ~/.ollama/models/ on macOS and Linux
- Port: The server listens on port 11434 by default
- API: Ollama exposes a REST API that is compatible with the OpenAI chat completions format
- Concurrency: Ollama serves a limited number of requests in parallel (configurable via the OLLAMA_NUM_PARALLEL environment variable) and queues the rest
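Because the server is just an HTTP endpoint, you can probe it from any language, not only curl. A minimal Python sketch using only the standard library (it assumes the default port and returns False rather than raising when no server is listening):

```python
# Minimal reachability check for the Ollama server's HTTP endpoint.
import urllib.error
import urllib.request

def ollama_is_running(base_url: str = "http://localhost:11434",
                      timeout: float = 2.0) -> bool:
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            # The root endpoint replies with the plain text "Ollama is running"
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

print(ollama_is_running())
```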
Check GPU Detection
If you have an NVIDIA GPU, verify Ollama can see it:
```bash
# Check if NVIDIA drivers are installed
nvidia-smi

# Ollama will log GPU detection when starting — check the logs
# On Linux with systemd:
journalctl -u ollama --no-pager | head -20
```

If you see your GPU listed, Ollama will automatically offload model layers to it for faster inference. If not, CPU inference still works — just slower.
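The shell checks above can be scripted too. This sketch mirrors them in Python; note it assumes that a working nvidia-smi on the PATH implies usable CUDA drivers, which is usually (but not always) the case:

```python
# Quick check for an NVIDIA GPU, roughly the way Ollama detects one:
# nvidia-smi present and exiting cleanly means the drivers are installed.
import shutil
import subprocess

def has_nvidia_gpu() -> bool:
    smi = shutil.which("nvidia-smi")
    if smi is None:
        return False
    result = subprocess.run([smi], capture_output=True)
    return result.returncode == 0

print(has_nvidia_gpu())
```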
You now have Ollama installed and running. In the next step, we will pull our first model and start chatting with it.
← Previous: Step 1 - What We're Building | Next: Step 3 - Run Your First Model →