# Running LLMs Locally with Ollama
Ollama made running LLMs locally trivially easy. One command, and you have a local ChatGPT alternative. Here’s everything you need to know.
## What Is Ollama
Ollama is a tool that:
- Downloads and manages LLM models
- Runs inference locally
- Provides an OpenAI-compatible API
- Works on macOS, Linux, and Windows
## Installation
```bash
# macOS / Linux
curl -fsSL https://ollama.ai/install.sh | sh

# Windows: download the installer from ollama.ai

# Verify
ollama --version
```
## Quick Start
```bash
# Download and run Llama 2
ollama run llama2

# Chat directly
>>> What is Python?
Python is a high-level, interpreted programming language...

# Exit with /bye
>>> /bye
```
That’s it. Local LLM in seconds.
## Available Models
```bash
# List downloaded models
ollama list

# Pull specific models
ollama pull mistral
ollama pull codellama
ollama pull llama2:13b
ollama pull mixtral
```
| Model | Size | Use Case |
|---|---|---|
| llama2 | 3.8GB | General purpose |
| llama2:13b | 7.3GB | Better quality |
| mistral | 4.1GB | Efficient, high quality |
| codellama | 3.8GB | Code generation |
| mixtral | 26GB | Best open quality |
| phi | 1.6GB | Lightweight |
### Model Variants
```bash
# Different quantization levels
ollama pull llama2:7b-q4_0   # Smallest, fastest
ollama pull llama2:7b-q8_0   # Better quality
ollama pull llama2:7b        # Default (q4_0)

# Different sizes
ollama pull llama2:7b
ollama pull llama2:13b
ollama pull llama2:70b       # Needs serious hardware
```
## API Usage
Ollama runs a local API server:
```bash
# Start server (usually automatic)
ollama serve
```
### Generate Endpoint
```bash
curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "What is machine learning?"
}'
```
### Chat Endpoint
```bash
curl http://localhost:11434/api/chat -d '{
  "model": "llama2",
  "messages": [
    {"role": "user", "content": "Hello!"}
  ]
}'
```
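The chat endpoint is stateless: the client keeps the conversation history and resends the full `messages` list on every turn, appending the assistant's reply before asking the next question. A sketch of that bookkeeping (the reply string here is a stand-in, not real model output):

```python
def next_turn(history, user_text, assistant_reply):
    """Return the message list to carry into the following request."""
    return history + [
        {"role": "user", "content": user_text},
        {"role": "assistant", "content": assistant_reply},
    ]

history = []
history = next_turn(history, "Hello!", "Hi! How can I help?")

# The next request sends both prior turns plus the new question:
payload = {
    "model": "llama2",
    "messages": history + [{"role": "user", "content": "What is Ollama?"}],
}
print(len(payload["messages"]))  # 3
```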
### OpenAI-Compatible
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Required but unused
)

response = client.chat.completions.create(
    model="llama2",
    messages=[
        {"role": "user", "content": "Explain Python decorators"}
    ]
)
print(response.choices[0].message.content)
```
## Python Integration
```python
import ollama

# Simple generation
response = ollama.generate(model='llama2', prompt='Why is the sky blue?')
print(response['response'])

# Chat
response = ollama.chat(model='llama2', messages=[
    {'role': 'user', 'content': 'What is recursion?'}
])
print(response['message']['content'])

# Streaming
for chunk in ollama.chat(model='llama2',
                         messages=[{'role': 'user', 'content': 'Tell me a story'}],
                         stream=True):
    print(chunk['message']['content'], end='', flush=True)
```
## Custom Models (Modelfiles)
Create specialized versions:
```
# Modelfile
FROM llama2

# Set parameters
PARAMETER temperature 0.7
PARAMETER num_ctx 4096

# Set system prompt
SYSTEM """
You are a Python expert. Always provide code examples.
Format code with proper syntax highlighting.
"""
```
```bash
# Create custom model
ollama create python-expert -f Modelfile

# Use it
ollama run python-expert
```
## LangChain Integration
```python
from langchain_community.llms import Ollama
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

llm = Ollama(model="llama2")

prompt = PromptTemplate(
    input_variables=["topic"],
    template="Explain {topic} in simple terms."
)

chain = LLMChain(llm=llm, prompt=prompt)
result = chain.run("quantum computing")
```
## Hardware Requirements
| Model Size | RAM Required | GPU VRAM |
|---|---|---|
| 7B | 8GB | 6GB |
| 13B | 16GB | 10GB |
| 30B | 32GB | 24GB |
| 70B | 64GB | 48GB |
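These figures roughly follow from bytes per weight: the default q4 quantization stores about half a byte per parameter, plus overhead for the KV cache and runtime. A back-of-the-envelope estimator; the 20% overhead factor is my assumption, not an Ollama number:

```python
def est_memory_gb(params_billions, bits_per_weight=4, overhead=1.2):
    """Rough footprint: parameters * bits/8, padded for KV cache and runtime."""
    return params_billions * bits_per_weight / 8 * overhead

print(round(est_memory_gb(7), 1))   # 4.2
print(round(est_memory_gb(70), 1))  # 42.0
```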
Apple Silicon works well—M1/M2/M3 run these models efficiently.
## Performance Tips
### GPU Acceleration
```bash
# Ollama auto-detects GPUs
# For NVIDIA, ensure CUDA is installed
nvidia-smi  # Check GPU

# For Apple Silicon, Metal is used automatically
```
### Context Length
```
# Increase context in Modelfile
PARAMETER num_ctx 8192
```
### Parallel Requests
```bash
# Set environment variable
OLLAMA_NUM_PARALLEL=4 ollama serve
```
## Use Cases
### Local Development Assistant
```python
import ollama

def get_code_help(code, question):
    """Ask the local codellama model about a snippet of code."""
    response = ollama.generate(
        model='codellama',
        prompt=f"Code:\n{code}\n\nQuestion: {question}"
    )
    return response['response']
```
### Private Document Q&A
```python
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.llms import Ollama
from langchain.chains import RetrievalQA

# Embeddings run locally too
embeddings = OllamaEmbeddings(model="llama2")
vectorstore = Chroma.from_documents(docs, embeddings)  # docs: your pre-split documents

# RAG with local LLM
llm = Ollama(model="llama2")
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=vectorstore.as_retriever())
```
### Offline Operation
```bash
# Pull models while online
ollama pull llama2
ollama pull codellama

# Use anywhere, no internet needed
ollama run llama2
```
## Comparison
| Feature | Ollama | LM Studio | llama.cpp |
|---|---|---|---|
| Ease of use | Best | Good | Complex |
| API server | Built-in | Built-in | Manual |
| GUI | No | Yes | No |
| Custom models | Yes | Limited | Yes |
| Resource usage | Efficient | Efficient | Most efficient |
## Common Issues
### Model Won’t Load
```bash
# Check available memory
free -h   # Linux
vm_stat   # macOS

# Try smaller quantization
ollama pull llama2:7b-q4_0
```
### Slow Generation
```bash
# Ensure GPU is being used
ollama run llama2 --verbose

# Check GPU utilization
nvidia-smi  # or Activity Monitor on macOS
```
## Final Thoughts
Ollama democratized local LLMs. No complex setup, no CUDA debugging, no Python environment issues. Just `ollama run llama2`.
For privacy-conscious development, offline work, or cost-free experimentation—it’s the obvious choice.
Local AI, no cloud required.