# Running LLMs Locally with Ollama
Ollama made running LLMs locally trivially easy. One command, and you have a local ChatGPT alternative. Here’s everything you need to know.
## What Is Ollama
Ollama is a tool that:
- Downloads and manages LLM models
- Runs inference locally
- Provides an OpenAI-compatible API
- Works on macOS, Linux, and Windows
## Installation
```bash
# macOS / Linux
curl -fsSL https://ollama.ai/install.sh | sh

# Windows: download the installer from ollama.ai

# Verify
ollama --version
```
## Quick Start
```bash
# Download and run Llama 2
ollama run llama2

# Chat directly
>>> What is Python?
Python is a high-level, interpreted programming language...

# Exit with /bye
>>> /bye
```
That’s it. Local LLM in seconds.
## Available Models
```bash
# List downloaded models
ollama list

# Pull specific models
ollama pull mistral
ollama pull codellama
ollama pull llama2:13b
ollama pull mixtral
```
| Model | Size | Use Case |
|---|---|---|
| llama2 | 3.8GB | General purpose |
| llama2:13b | 7.3GB | Better quality |
| mistral | 4.1GB | Efficient, high quality |
| codellama | 3.8GB | Code generation |
| mixtral | 26GB | Best open quality |
| phi | 1.6GB | Lightweight |
### Model Variants
```bash
# Different quantization levels
ollama pull llama2:7b-q4_0   # Smallest, fastest
ollama pull llama2:7b-q8_0   # Better quality
ollama pull llama2:7b        # Default (q4_0)

# Different sizes
ollama pull llama2:7b
ollama pull llama2:13b
ollama pull llama2:70b       # Needs serious hardware
```
## API Usage
Ollama runs a local API server:
```bash
# Start server (usually automatic)
ollama serve
```
### Generate Endpoint
```bash
curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "What is machine learning?"
}'
```
### Chat Endpoint
```bash
curl http://localhost:11434/api/chat -d '{
  "model": "llama2",
  "messages": [
    {"role": "user", "content": "Hello!"}
  ]
}'
```
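The chat endpoint is stateless: the client keeps the conversation history and resends the full `messages` list on every turn, appending the assistant's reply before asking the next question. A sketch of that bookkeeping (the reply string here is a stand-in, not real model output):

```python
def next_turn(history, user_text, assistant_reply):
    """Return the message list to carry into the following request."""
    return history + [
        {"role": "user", "content": user_text},
        {"role": "assistant", "content": assistant_reply},
    ]

history = []
history = next_turn(history, "Hello!", "Hi! How can I help?")

# The next request sends both prior turns plus the new question:
payload = {
    "model": "llama2",
    "messages": history + [{"role": "user", "content": "What is Ollama?"}],
}
print(len(payload["messages"]))  # 3
```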
### OpenAI-Compatible
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Required but unused
)

response = client.chat.completions.create(
    model="llama2",
    messages=[
        {"role": "user", "content": "Explain Python decorators"}
    ]
)
print(response.choices[0].message.content)
```
## Python Integration
```python
import ollama

# Simple generation
response = ollama.generate(model='llama2', prompt='Why is the sky blue?')
print(response['response'])

# Chat
response = ollama.chat(model='llama2', messages=[
    {'role': 'user', 'content': 'What is recursion?'}
])
print(response['message']['content'])

# Streaming
for chunk in ollama.chat(model='llama2',
                         messages=[{'role': 'user', 'content': 'Tell me a story'}],
                         stream=True):
    print(chunk['message']['content'], end='', flush=True)
```
## Custom Models (Modelfiles)
Create specialized versions:
```
# Modelfile
FROM llama2

# Set parameters
PARAMETER temperature 0.7
PARAMETER num_ctx 4096

# Set system prompt
SYSTEM """
You are a Python expert. Always provide code examples.
Format code with proper syntax highlighting.
"""
```
```bash
# Create custom model
ollama create python-expert -f Modelfile

# Use it
ollama run python-expert
```
## LangChain Integration
```python
from langchain_community.llms import Ollama
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

llm = Ollama(model="llama2")

prompt = PromptTemplate(
    input_variables=["topic"],
    template="Explain {topic} in simple terms."
)

chain = LLMChain(llm=llm, prompt=prompt)
result = chain.run("quantum computing")
```
## Hardware Requirements
| Model Size | RAM Required | GPU VRAM |
|---|---|---|
| 7B | 8GB | 6GB |
| 13B | 16GB | 10GB |
| 30B | 32GB | 24GB |
| 70B | 64GB | 48GB |
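These figures roughly follow from bytes per weight: the default q4 quantization stores about half a byte per parameter, plus overhead for the KV cache and runtime. A back-of-the-envelope estimator; the 20% overhead factor is my assumption, not an Ollama number:

```python
def est_memory_gb(params_billions, bits_per_weight=4, overhead=1.2):
    """Rough footprint: parameters * bits/8, padded for KV cache and runtime."""
    return params_billions * bits_per_weight / 8 * overhead

print(round(est_memory_gb(7), 1))   # 4.2
print(round(est_memory_gb(70), 1))  # 42.0
```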
Apple Silicon works well—M1/M2/M3 run these models efficiently.
## Performance Tips
### GPU Acceleration
```bash
# Ollama auto-detects GPUs
# For NVIDIA, ensure CUDA is installed
nvidia-smi  # Check GPU

# For Apple Silicon, Metal is used automatically
```
### Context Length
```
# Increase context in Modelfile
PARAMETER num_ctx 8192
```
### Parallel Requests
```bash
# Set environment variable
OLLAMA_NUM_PARALLEL=4 ollama serve
```
## Use Cases
### Local Development Assistant
```python
import ollama

def get_code_help(code, question):
    """Ask the local codellama model about a snippet of code."""
    response = ollama.generate(
        model='codellama',
        prompt=f"Code:\n{code}\n\nQuestion: {question}"
    )
    return response['response']
```
### Private Document Q&A
```python
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.llms import Ollama
from langchain.chains import RetrievalQA

# Embeddings run locally too
embeddings = OllamaEmbeddings(model="llama2")
vectorstore = Chroma.from_documents(docs, embeddings)  # docs: your pre-split documents

# RAG with local LLM
llm = Ollama(model="llama2")
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=vectorstore.as_retriever())
```
### Offline Operation
```bash
# Pull models while online
ollama pull llama2
ollama pull codellama

# Use anywhere, no internet needed
ollama run llama2
```
## Comparison
| Feature | Ollama | LM Studio | llama.cpp |
|---|---|---|---|
| Ease of use | Best | Good | Complex |
| API server | Built-in | Built-in | Manual |
| GUI | No | Yes | No |
| Custom models | Yes | Limited | Yes |
| Resource usage | Efficient | Efficient | Most efficient |
## Common Issues
### Model Won’t Load
```bash
# Check available memory
free -h   # Linux
vm_stat   # macOS

# Try smaller quantization
ollama pull llama2:7b-q4_0
```
### Slow Generation
```bash
# Ensure GPU is being used
ollama run llama2 --verbose

# Check GPU utilization
nvidia-smi  # or Activity Monitor on macOS
```
## Final Thoughts
Ollama democratized local LLMs. No complex setup, no CUDA debugging, no Python environment issues. Just `ollama run llama2`.
For privacy-conscious development, offline work, or cost-free experimentation—it’s the obvious choice.
Local AI, no cloud required.