# Llama 3: The New Open Source King
Meta released Llama 3 in April 2024. The 70B model matched GPT-4 on many benchmarks. The 8B model was the best in its class. Open-source AI reached a new milestone.
## The Release
| Model | Parameters | Context | Best For |
|---|---|---|---|
| Llama 3 8B | 8 billion | 8K | Edge/local |
| Llama 3 70B | 70 billion | 8K | Quality |
| Llama 3 400B+ | 400+ billion | - | Coming later |
## Benchmarks

### Llama 3 8B vs Competition
| Benchmark | Llama 3 8B | Llama 2 13B | Mistral 7B |
|---|---|---|---|
| MMLU | 68.4% | 54.8% | 60.1% |
| HumanEval | 62.2% | 29.9% | 26.2% |
| GSM8K | 79.6% | 44.4% | 52.2% |
An 8B model beating a 13B model—architecture matters.
### Llama 3 70B vs GPT-4
| Benchmark | Llama 3 70B | GPT-4 |
|---|---|---|
| MMLU | 82.0% | 86.4% |
| HumanEval | 81.7% | 67.0% |
| GSM8K | 93.0% | 92.0% |
Competitive with GPT-4, with open weights.
## Running Llama 3

### Ollama (Easiest)

```bash
# Install Ollama, then pull and run the default 8B model
ollama run llama3

# Or a specific size
ollama run llama3:70b
```
### Hugging Face

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Python?"},
]

# Render the chat template, including the header that cues the model to reply
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=500)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
### vLLM (Production)

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
sampling_params = SamplingParams(temperature=0.7, max_tokens=500)

prompts = ["What is machine learning?"]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```
## Chat Template

Llama 3 uses a new chat format:

```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

What is Python?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
```

Tokenizers handle this automatically with `apply_chat_template`.
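To make the format concrete, here is a minimal sketch that assembles the template by hand. The `render_chat` helper is hypothetical, written only to illustrate the token layout; in real code, use `apply_chat_template`.

```python
# Illustrative only: hand-rolls the Llama 3 chat format shown above.
def render_chat(messages, add_generation_prompt=True):
    parts = ["<|begin_of_text|>"]
    for msg in messages:
        parts.append(f"<|start_header_id|>{msg['role']}<|end_header_id|>\n\n")
        parts.append(msg["content"] + "<|eot_id|>")
    if add_generation_prompt:
        # Open the assistant turn so the model knows to respond
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

prompt = render_chat([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Python?"},
])
print(prompt)
```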
## What Made It Better

### Training Data
- 15 trillion tokens (vs 2T for Llama 2)
- Higher quality curation
- More recent data
### Architecture Improvements
- Grouped Query Attention (GQA)
- Larger vocabulary (128K tokens)
- Better tokenizer
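The core idea of grouped query attention is that several query heads share one key/value head, shrinking the KV cache without giving up query-head diversity. A rough NumPy sketch, with head counts and shapes made up for the example (Llama 3 8B itself uses 32 query heads and 8 KV heads):

```python
import numpy as np

def gqa(q, k, v):
    """Grouped Query Attention sketch: q has more heads than k/v,
    and each KV head is shared by a group of query heads."""
    group = q.shape[0] // k.shape[0]
    # Broadcast each KV head across its group of query heads
    k = np.repeat(k, group, axis=0)   # (n_q_heads, seq, d)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                # (n_q_heads, seq, d)

seq, d = 4, 8
q = np.random.randn(8, seq, d)  # 8 query heads
k = np.random.randn(2, seq, d)  # only 2 KV heads -> 4x smaller KV cache
v = np.random.randn(2, seq, d)
print(gqa(q, k, v).shape)  # (8, 4, 8)
```

The KV cache savings come directly from the smaller `k`/`v` tensors: only `n_kv_heads` heads are stored per layer, while attention quality stays close to full multi-head attention.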
### Post-Training
- More RLHF iterations
- Better instruction following
- Improved safety tuning
## Memory Requirements
| Model | FP16 | INT8 | INT4 |
|---|---|---|---|
| 8B | 16GB | 8GB | 4GB |
| 70B | 140GB | 70GB | 35GB |
Most developers can run 8B locally.
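The table follows from a back-of-the-envelope rule: weight memory is roughly parameter count times bytes per parameter (activations and KV cache add overhead on top):

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate memory for weights only; activations and KV cache are extra."""
    return n_params * bits_per_param / 8 / 1e9

for n_params, name in [(8e9, "8B"), (70e9, "70B")]:
    for bits, fmt in [(16, "FP16"), (8, "INT8"), (4, "INT4")]:
        print(f"{name} {fmt}: {weight_memory_gb(n_params, bits):.0f} GB")
# 8B FP16 -> 16 GB, 70B INT4 -> 35 GB, matching the table above
```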
## Use Cases

### Local Development
```python
# Fast iteration without API costs (requires a running Ollama server)
import ollama

def code_review(code: str) -> str:
    response = ollama.chat(
        model='llama3',
        messages=[{
            'role': 'user',
            'content': f'Review this code:\n\n{code}'
        }]
    )
    return response['message']['content']
```
### Private RAG

```python
from langchain_community.llms import Ollama
from langchain.chains import RetrievalQA

llm = Ollama(model="llama3")

# RAG over private data, no external API calls
# (assumes `vectorstore` was built earlier, e.g. from local documents)
qa = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever()
)
```
### Fine-Tuning

```python
from peft import LoraConfig
from trl import SFTTrainer

# LoRA: train small adapter matrices instead of all the base weights
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
)

# Assumes `model` and `dataset` were loaded earlier
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=lora_config,
    max_seq_length=2048,
)
trainer.train()
```
## Comparison with Alternatives
| Model | Quality | Speed | Cost | Privacy |
|---|---|---|---|---|
| GPT-4 | Best | Fast | $$$ | Cloud |
| Claude 3 | Great | Fast | $$ | Cloud |
| Llama 3 70B | Great | Medium | Free* | Local |
| Llama 3 8B | Good | Fast | Free* | Local |
*Hardware costs apply.
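Whether "free" actually wins depends on volume. An illustrative break-even sketch, with made-up placeholder prices rather than current rates:

```python
def breakeven_tokens(hardware_cost_usd: float, api_cost_per_mtok: float) -> float:
    """Tokens at which one-time hardware cost equals cumulative API spend.
    Ignores electricity, ops time, and hardware resale value."""
    return hardware_cost_usd / api_cost_per_mtok * 1e6

# Hypothetical numbers: a $2,000 GPU vs an API at $10 per million tokens
print(f"{breakeven_tokens(2000, 10):,.0f} tokens")  # 200,000,000 tokens
```

Below that volume the API is cheaper; above it, local hardware pays for itself, before counting the privacy benefit.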
## License
Llama 3 Community License:
- ✅ Commercial use allowed
- ✅ Fine-tuning allowed
- ✅ Derivative models allowed
- ⚠️ Need a separate license from Meta if your products exceed 700M monthly active users
More permissive than many open models.
## Ecosystem

### Fine-Tuned Variants
Within weeks of release:
- Code-specialized versions
- Domain-specific fine-tunes
- Quantized versions
- Extended context versions
### Hosting Options
| Provider | Model | Type |
|---|---|---|
| Together AI | Llama 3 | API |
| Replicate | Llama 3 | API |
| AWS Bedrock | Llama 3 | Managed |
| Ollama | Llama 3 | Local |
| vLLM | Llama 3 | Self-hosted |
## Impact

### For Developers
Before: "Need GPT-4 for quality, pay per token"
After: "Llama 3 70B is close enough, run locally"
### For Enterprise
Before: "Can't use AI, data privacy concerns"
After: "Deploy on-premise, data never leaves"
### For Open Source

- Llama 1: leaked weights, research only
- Llama 2: commercial license
- Llama 3: better quality, same license
Clear trajectory toward parity.
## Final Thoughts
Llama 3 validated that open-weight models can compete with the best proprietary ones. The 8B model is particularly impressive—GPT-3.5 level quality in a model you can run on a laptop.
For many use cases, the question “which API?” becomes “do I even need an API?”
Open source caught up. Again.