# Llama 3: The New Open Source King
Meta released Llama 3 in April 2024. The 70B model matched GPT-4 on many benchmarks. The 8B model was the best in its class. Open-source AI reached a new milestone.
## The Release
| Model | Parameters | Context | Best For |
|---|---|---|---|
| Llama 3 8B | 8 billion | 8K | Edge/local |
| Llama 3 70B | 70 billion | 8K | Quality |
| Llama 3 400B+ | 400+ billion | - | Coming later |
## Benchmarks

### Llama 3 8B vs Competition
| Benchmark | Llama 3 8B | Llama 2 13B | Mistral 7B |
|---|---|---|---|
| MMLU | 68.4% | 54.8% | 60.1% |
| HumanEval | 62.2% | 29.9% | 26.2% |
| GSM8K | 79.6% | 44.4% | 52.2% |
An 8B model beating a 13B model—architecture matters.
### Llama 3 70B vs GPT-4
| Benchmark | Llama 3 70B | GPT-4 |
|---|---|---|
| MMLU | 82.0% | 86.4% |
| HumanEval | 81.7% | 67.0% |
| GSM8K | 93.0% | 92.0% |
Competitive with GPT-4, with open weights.
## Running Llama 3

### Ollama (Easiest)

```bash
# Install Ollama, then pull and run the default 8B model
ollama run llama3

# Or a specific size
ollama run llama3:70b
```
### Hugging Face

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Python?"},
]

# Render the chat template, including the header that cues the model to reply
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=500)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
### vLLM (Production)

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
sampling_params = SamplingParams(temperature=0.7, max_tokens=500)

prompts = ["What is machine learning?"]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```
## Chat Template

Llama 3 uses a new chat format:

```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

What is Python?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
```

Tokenizers handle this automatically with `apply_chat_template`.
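To make the format concrete, here is a minimal sketch that assembles the template by hand. The `render_chat` helper is hypothetical, written only to illustrate the token layout; in real code, use `apply_chat_template`.

```python
# Illustrative only: hand-rolls the Llama 3 chat format shown above.
def render_chat(messages, add_generation_prompt=True):
    parts = ["<|begin_of_text|>"]
    for msg in messages:
        parts.append(f"<|start_header_id|>{msg['role']}<|end_header_id|>\n\n")
        parts.append(msg["content"] + "<|eot_id|>")
    if add_generation_prompt:
        # Open the assistant turn so the model knows to respond
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

prompt = render_chat([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Python?"},
])
print(prompt)
```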
## What Made It Better

### Training Data
- 15 trillion tokens (vs 2T for Llama 2)
- Higher quality curation
- More recent data
### Architecture Improvements
- Grouped Query Attention (GQA)
- Larger vocabulary (128K tokens)
- Better tokenizer
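The core idea of grouped query attention is that several query heads share one key/value head, shrinking the KV cache without giving up query-head diversity. A rough NumPy sketch, with head counts and shapes made up for the example (Llama 3 8B itself uses 32 query heads and 8 KV heads):

```python
import numpy as np

def gqa(q, k, v):
    """Grouped Query Attention sketch: q has more heads than k/v,
    and each KV head is shared by a group of query heads."""
    group = q.shape[0] // k.shape[0]
    # Broadcast each KV head across its group of query heads
    k = np.repeat(k, group, axis=0)   # (n_q_heads, seq, d)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                # (n_q_heads, seq, d)

seq, d = 4, 8
q = np.random.randn(8, seq, d)  # 8 query heads
k = np.random.randn(2, seq, d)  # only 2 KV heads -> 4x smaller KV cache
v = np.random.randn(2, seq, d)
print(gqa(q, k, v).shape)  # (8, 4, 8)
```

The KV cache savings come directly from the smaller `k`/`v` tensors: only `n_kv_heads` heads are stored per layer, while attention quality stays close to full multi-head attention.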
### Post-Training
- More RLHF iterations
- Better instruction following
- Improved safety tuning
## Memory Requirements
| Model | FP16 | INT8 | INT4 |
|---|---|---|---|
| 8B | 16GB | 8GB | 4GB |
| 70B | 140GB | 70GB | 35GB |
Most developers can run 8B locally.
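The table follows from a back-of-the-envelope rule: weight memory is roughly parameter count times bytes per parameter (activations and KV cache add overhead on top):

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate memory for weights only; activations and KV cache are extra."""
    return n_params * bits_per_param / 8 / 1e9

for n_params, name in [(8e9, "8B"), (70e9, "70B")]:
    for bits, fmt in [(16, "FP16"), (8, "INT8"), (4, "INT4")]:
        print(f"{name} {fmt}: {weight_memory_gb(n_params, bits):.0f} GB")
# 8B FP16 -> 16 GB, 70B INT4 -> 35 GB, matching the table above
```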
## Use Cases

### Local Development
```python
# Fast iteration without API costs (requires a running Ollama server)
import ollama

def code_review(code: str) -> str:
    response = ollama.chat(
        model='llama3',
        messages=[{
            'role': 'user',
            'content': f'Review this code:\n\n{code}'
        }]
    )
    return response['message']['content']
```
### Private RAG

```python
from langchain_community.llms import Ollama
from langchain.chains import RetrievalQA

llm = Ollama(model="llama3")

# RAG over private data, no external API calls
# (assumes `vectorstore` was built earlier, e.g. from local documents)
qa = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever()
)
```
### Fine-Tuning

```python
from peft import LoraConfig
from trl import SFTTrainer

# LoRA: train small adapter matrices instead of all the base weights
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
)

# Assumes `model` and `dataset` were loaded earlier
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=lora_config,
    max_seq_length=2048,
)
trainer.train()
```
## Comparison with Alternatives
| Model | Quality | Speed | Cost | Privacy |
|---|---|---|---|---|
| GPT-4 | Best | Fast | $$$ | Cloud |
| Claude 3 | Great | Fast | $$ | Cloud |
| Llama 3 70B | Great | Medium | Free* | Local |
| Llama 3 8B | Good | Fast | Free* | Local |
*Hardware costs apply.
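Whether "free" actually wins depends on volume. An illustrative break-even sketch, with made-up placeholder prices rather than current rates:

```python
def breakeven_tokens(hardware_cost_usd: float, api_cost_per_mtok: float) -> float:
    """Tokens at which one-time hardware cost equals cumulative API spend.
    Ignores electricity, ops time, and hardware resale value."""
    return hardware_cost_usd / api_cost_per_mtok * 1e6

# Hypothetical numbers: a $2,000 GPU vs an API at $10 per million tokens
print(f"{breakeven_tokens(2000, 10):,.0f} tokens")  # 200,000,000 tokens
```

Below that volume the API is cheaper; above it, local hardware pays for itself, before counting the privacy benefit.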
## License
Llama 3 Community License:
- ✅ Commercial use allowed
- ✅ Fine-tuning allowed
- ✅ Derivative models allowed
- ⚠️ Need a separate license from Meta if your products exceed 700M monthly active users
More permissive than many open models.
## Ecosystem

### Fine-Tuned Variants
Within weeks of release:
- Code-specialized versions
- Domain-specific fine-tunes
- Quantized versions
- Extended context versions
### Hosting Options
| Provider | Model | Type |
|---|---|---|
| Together AI | Llama 3 | API |
| Replicate | Llama 3 | API |
| AWS Bedrock | Llama 3 | Managed |
| Ollama | Llama 3 | Local |
| vLLM | Llama 3 | Self-hosted |
## Impact

### For Developers
Before: "Need GPT-4 for quality, pay per token"
After: "Llama 3 70B is close enough, run locally"
### For Enterprise
Before: "Can't use AI, data privacy concerns"
After: "Deploy on-premise, data never leaves"
### For Open Source

- Llama 1: leaked weights, research only
- Llama 2: commercial license
- Llama 3: better quality, same license
Clear trajectory toward parity.
## Final Thoughts
Llama 3 validated that open-weight models can compete with the best proprietary ones. The 8B model is particularly impressive—GPT-3.5 level quality in a model you can run on a laptop.
For many use cases, the question “which API?” becomes “do I even need an API?”
Open source caught up. Again.