# Llama 2: Open Source Commercial AI
On July 18, 2023, Meta released Llama 2—not just for research, but with commercial use permissions. This changed the calculus for AI deployment.
## What Changed

### Llama 1 (February 2023)

- License: Research only
- Usage: Academics, experiments
- Commercial: Prohibited

### Llama 2 (July 2023)

- License: Commercial allowed*
- Usage: Anyone
- Commercial: Yes (with conditions)

*Except for products with more than 700 million monthly active users, which require Meta's permission.
## The Models
| Model | Parameters | Context | Use Case |
|---|---|---|---|
| Llama 2 7B | 7 billion | 4K tokens | Edge, mobile |
| Llama 2 13B | 13 billion | 4K tokens | General purpose |
| Llama 2 70B | 70 billion | 4K tokens | Best quality |
| Llama 2-Chat | All sizes | 4K tokens | Conversation |
## Performance

Compared to GPT-3.5:
| Benchmark | Llama 2 70B | GPT-3.5 |
|---|---|---|
| MMLU | 68.9% | 70.0% |
| TriviaQA | 85.0% | 87.3% |
| NaturalQuestions | 46.9% | 44.9% |
| HumanEval | 29.9% | 48.1% |
Competitive on knowledge benchmarks, though notably weaker on code (HumanEval). However, it is free and runs locally.
## Running Llama 2

### With Ollama

```bash
# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh

# Run Llama 2
ollama run llama2

# Or specific variants
ollama run llama2:13b
ollama run llama2:70b
```
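Beyond the CLI, Ollama also serves a local HTTP API (by default on port 11434) whose `/api/generate` endpoint accepts a JSON body with `model`, `prompt`, and `stream` fields. A minimal sketch of building that request body; the helper function name is illustrative:

```python
import json

def build_ollama_request(model, prompt, stream=False):
    """Build the JSON body for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": stream}
    return json.dumps(payload)

# POST this body to http://localhost:11434/api/generate, e.g.:
# requests.post("http://localhost:11434/api/generate",
#               data=build_ollama_request("llama2", "Hello"))
```

With `stream=False` the endpoint returns a single JSON object containing the full response rather than a stream of chunks.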
### With Hugging Face

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.float16,
)

prompt = "[INST] What is Python? [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=500)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
```
### Quantized (Lower Memory)

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
# 7B model fits in ~4 GB of VRAM
```
## The Chat Format

Llama 2-Chat uses a specific prompt format:

```
[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>
What is the capital of France? [/INST]
```

Multi-turn:

```
[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>
What is the capital of France? [/INST] The capital of France is Paris.
[INST] What is its population? [/INST]
```
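The multi-turn format can be assembled programmatically. A minimal sketch (the helper name is illustrative; in practice the tokenizer's built-in chat template usually handles this for you):

```python
def build_llama2_prompt(system, turns):
    """Assemble a multi-turn Llama 2 chat prompt.

    `turns` is a list of (user_message, assistant_reply) pairs; the final
    pair's reply may be None to leave the prompt open for generation.
    """
    prompt = f"[INST] <<SYS>>\n{system}\n<</SYS>>\n"
    first = True
    for user, assistant in turns:
        if first:
            prompt += f"{user} [/INST]"  # system block already opened [INST]
            first = False
        else:
            prompt += f"[INST] {user} [/INST]"
        if assistant is not None:
            prompt += f" {assistant}\n"
    return prompt
```

Getting this format exactly right matters: Llama 2-Chat was fine-tuned on it, and deviations degrade output quality.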
## Fine-Tuning

### With QLoRA

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import TrainingArguments
from trl import SFTTrainer

# Prepare the 4-bit-loaded model (from the quantized example above) for training
model = prepare_model_for_kbit_training(model)

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, peft_config)

# Train
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=512,
    args=TrainingArguments(
        output_dir="./results",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
    ),
    peft_config=peft_config,
)
trainer.train()
```
## Resources Needed
| Task | GPU Memory | Time |
|---|---|---|
| Inference 7B (4-bit) | 4 GB | - |
| Inference 13B (4-bit) | 8 GB | - |
| Fine-tune 7B (QLoRA) | 16 GB | Hours |
| Fine-tune 13B (QLoRA) | 24 GB | Hours |
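The inference figures above follow roughly from parameter count times bytes per weight, plus room for activations and the KV cache. A back-of-the-envelope estimator; the 20% overhead factor is a rough assumption, not a measured constant:

```python
def weight_memory_gb(params_billion, bits_per_weight, overhead=0.2):
    """Rough GPU memory estimate: quantized weights plus a fixed overhead fraction."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * (1 + overhead) / 1e9

print(round(weight_memory_gb(7, 4), 1))   # 7B at 4-bit -> 4.2 (GB)
print(round(weight_memory_gb(13, 4), 1))  # 13B at 4-bit -> 7.8 (GB)
```

Both estimates line up with the table: roughly 4 GB for 7B and 8 GB for 13B at 4-bit.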
## Use Cases

### Enterprise Chatbots

```python
# On-premise: no data leaves your servers
def enterprise_chat(query, context):
    prompt = f"""[INST] <<SYS>>
You are a helpful assistant for Acme Corp. Only answer based on the provided context.
<</SYS>>
Context: {context}
Question: {query} [/INST]"""
    return generate(prompt)  # generate() wraps the model call shown earlier
```
### Code Generation

```python
# Code Llama extends Llama 2 with additional training on code
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("codellama/CodeLlama-7b-hf")
```
### Local AI Assistants

```bash
# Run locally, no API costs
ollama run llama2
>>> Write a Python function to calculate compound interest
```
## Deployment Options

### Cloud
| Provider | Service |
|---|---|
| AWS | SageMaker endpoints |
| Azure | Azure ML |
| GCP | Vertex AI |
| Replicate | Hosted API |
| Together | API access |
### On-Premise
| Tool | Approach |
|---|---|
| vLLM | High-performance serving |
| TGI | Hugging Face inference |
| Ollama | Easy local deployment |
| llama.cpp | CPU-efficient |
### Edge

```
Llama 2 7B (quantized)
├── Apple M1/M2 MacBooks → Works well
├── NVIDIA Jetson → Possible
└── Modern phones → Experimental
```
## Ecosystem
Llama 2 spawned derivatives:
| Model | Specialization |
|---|---|
| Code Llama | Programming |
| Llama 2 Long | Extended context |
| Various GGUF | Quantized for llama.cpp |
| Fine-tuned variants | Domain-specific |
## The Strategic Implications

### For Meta

> "If you can't beat OpenAI on models, commoditize the model layer and compete on distribution."
Meta wins if AI runs on their infrastructure, regardless of model source.
### For Companies

- Before: "We can't use local models commercially"
- After: "We can deploy Llama 2 in production"

The buy-vs.-build calculation changed.
### For Open Source
Llama 2 validated that:
- Open-weight models can be competitive
- Commercial use attracts contributors
- The community can iterate rapidly
## Final Thoughts
Llama 2’s commercial license was the real innovation. Capable models existed before—the permission to use them commercially didn’t.
This enabled:
- On-premise AI without vendor lock-in
- Privacy-preserving deployments
- Custom fine-tuning for enterprises
- A thriving open-source ecosystem
Llama 2 is not as good as GPT-4, but it is good enough for many use cases, and the gap keeps narrowing.
July 2023: Commercial open-source AI became real.