# Llama 3: The New Open Source King


Meta released Llama 3 in April 2024. The 70B model matched GPT-4 on many benchmarks. The 8B model was the best in its class. Open-source AI reached a new milestone.

## The Release

| Model | Parameters | Context | Best For |
|-------|------------|---------|----------|
| Llama 3 8B | 8 billion | 8K | Edge/local |
| Llama 3 70B | 70 billion | 8K | Quality |
| Llama 3 400B+ | 400+ billion | - | Coming later |

## Benchmarks

### Llama 3 8B vs Competition

| Benchmark | Llama 3 8B | Llama 2 13B | Mistral 7B |
|-----------|------------|-------------|------------|
| MMLU | 68.4% | 54.8% | 60.1% |
| HumanEval | 62.2% | 29.9% | 26.2% |
| GSM8K | 79.6% | 44.4% | 52.2% |

An 8B model beating a 13B model: better data and architecture matter more than raw parameter count.

### Llama 3 70B vs GPT-4

| Benchmark | Llama 3 70B | GPT-4 |
|-----------|-------------|-------|
| MMLU | 82.0% | 86.4% |
| HumanEval | 81.7% | 67.0% |
| GSM8K | 93.0% | 92.0% |

Competitive with GPT-4, with openly available weights.

## Running Llama 3

### Ollama (Easiest)

```bash
# Pull and run the 8B model (install Ollama first from ollama.com)
ollama run llama3

# Or a specific size
ollama run llama3:70b
```

### Hugging Face

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Gated repo: request access on Hugging Face before downloading
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Python?"}
]

# add_generation_prompt appends the assistant header so the model answers
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=500)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### vLLM (Production)

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
sampling_params = SamplingParams(temperature=0.7, max_tokens=500)

prompts = ["What is machine learning?"]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
```

## Chat Template

Llama 3 uses a new chat format:

```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

What is Python?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
```

Tokenizers handle this automatically with `apply_chat_template`.
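To see what the template does, it can be reproduced by hand. A minimal sketch for illustration only; in real code, let the tokenizer's `apply_chat_template` do this:

```python
def build_llama3_prompt(messages: list[dict]) -> str:
    """Hand-rolled Llama 3 chat formatting, mirroring the template above.
    Illustrative only -- use the tokenizer's built-in template in practice."""
    prompt = "<|begin_of_text|>"
    for m in messages:
        prompt += (
            f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n"
            f"{m['content']}<|eot_id|>"
        )
    # Cue the model to respond (what add_generation_prompt=True adds)
    prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt

print(build_llama3_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Python?"},
]))
```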

## What Made It Better

### Training Data

- Pretrained on roughly 15 trillion tokens, about 7x more than Llama 2
- A substantially larger share of code and non-English data in the mix

### Architecture Improvements

- New tokenizer with a 128K-token vocabulary (up from 32K), encoding text more efficiently
- Grouped-query attention (GQA) in both the 8B and 70B models for faster inference

### Post-Training

- A combination of supervised fine-tuning (SFT), rejection sampling, PPO, and DPO

## Memory Requirements

| Model | FP16 | INT8 | INT4 |
|-------|------|------|------|
| 8B | 16 GB | 8 GB | 4 GB |
| 70B | 140 GB | 70 GB | 35 GB |

Most developers can run 8B locally.
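The table follows from simple arithmetic: parameter count times bytes per parameter. A quick back-of-the-envelope estimator (weights only; the KV cache and activations add overhead on top):

```python
def estimate_weight_memory_gb(params_billion: float, bits: int) -> float:
    """Approximate memory for model weights alone, in decimal GB.
    Excludes KV cache and activations, so real usage runs higher."""
    bytes_per_param = bits / 8           # FP16 -> 2, INT8 -> 1, INT4 -> 0.5
    return params_billion * bytes_per_param  # billions of params * bytes = GB

print(estimate_weight_memory_gb(8, 4))   # 4.0  -> matches the INT4 column
print(estimate_weight_memory_gb(70, 16)) # 140.0 -> FP16 70B needs multiple GPUs
```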

## Use Cases

### Local Development

```python
# Fast iteration without API costs
import ollama

def code_review(code: str) -> str:
    response = ollama.chat(
        model='llama3',
        messages=[{
            'role': 'user',
            'content': f'Review this code:\n\n{code}'
        }]
    )
    return response['message']['content']
```

### Private RAG

```python
from langchain_community.llms import Ollama
from langchain.chains import RetrievalQA

llm = Ollama(model="llama3")

# RAG with private data, no API calls
# vectorstore: an existing vector store built over your documents
qa = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever()
)
```

### Fine-Tuning

```python
from peft import LoraConfig
from trl import SFTTrainer

# model and dataset assumed already loaded (see the Hugging Face section)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=lora_config,
    max_seq_length=2048,
)
trainer.train()
```
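Why LoRA makes fine-tuning tractable: only small adapter matrices train, not the full 8B weights. A rough sanity check on adapter size, a sketch assuming Llama 3 8B dimensions (hidden size 4096, 32 layers, 1024-dim k/v projections from GQA):

```python
def lora_param_count(r: int, layers: int,
                     module_shapes: list[tuple[int, int]]) -> int:
    """Trainable LoRA parameters: each (out_dim, in_dim) projection gets
    an A matrix (r x in_dim) and a B matrix (out_dim x r)."""
    per_layer = sum(r * in_dim + out_dim * r for out_dim, in_dim in module_shapes)
    return layers * per_layer

# Llama 3 8B: q_proj/o_proj are 4096x4096; k_proj/v_proj are 1024x4096 (GQA)
shapes = [(4096, 4096), (1024, 4096), (1024, 4096), (4096, 4096)]
total = lora_param_count(r=16, layers=32, module_shapes=shapes)
print(f"{total:,}")  # ~13.6M trainable parameters, a tiny fraction of 8B
```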

## Comparison with Alternatives

| Model | Quality | Speed | Cost | Privacy |
|-------|---------|-------|------|---------|
| GPT-4 | Best | Fast | $$$ | Cloud |
| Claude 3 | Great | Fast | $$ | Cloud |
| Llama 3 70B | Great | Medium | Free* | Local |
| Llama 3 8B | Good | Fast | Free* | Local |

*Hardware costs apply.
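"Free" versus API pricing reduces to a break-even calculation. A sketch with hypothetical placeholder numbers (real prices vary by provider and hardware, and this ignores electricity):

```python
def break_even_tokens(hardware_cost_usd: float,
                      api_price_per_million_usd: float) -> float:
    """Tokens you'd need to process before local hardware pays for
    itself, assuming a fixed API price and ignoring running costs."""
    return hardware_cost_usd / api_price_per_million_usd * 1_000_000

# Hypothetical: a $1,600 GPU vs. an API charging $10 per million tokens
print(f"{break_even_tokens(1600, 10):,.0f} tokens")
```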

## License

Llama 3 Community License highlights:

- Free for commercial use
- Services with more than 700 million monthly active users need a separate license from Meta
- Derivative works must display "Built with Meta Llama 3"

More permissive than many open models, though not an OSI-approved open source license.

## Ecosystem

### Fine-Tuned Variants

Within weeks of release, the community had published fine-tunes built on the open weights, including Dolphin and Hermes variants of the 8B model.

### Hosting Options

| Provider | Model | Type |
|----------|-------|------|
| Together AI | Llama 3 | API |
| Replicate | Llama 3 | API |
| AWS Bedrock | Llama 3 | Managed |
| Ollama | Llama 3 | Local |
| vLLM | Llama 3 | Self-hosted |

## Impact

### For Developers

```
Before: "Need GPT-4 for quality, pay per token"
After:  "Llama 3 70B is close enough, run locally"
```

### For Enterprise

```
Before: "Can't use AI, data privacy concerns"
After:  "Deploy on-premise, data never leaves"
```

### For Open Source

```
Llama 1: Leaked, research only
Llama 2: Commercial license
Llama 3: Better quality, same license
```

Clear trajectory toward parity.

## Final Thoughts

Llama 3 validated that open-weight models can compete with the best proprietary ones. The 8B model is particularly impressive—GPT-3.5 level quality in a model you can run on a laptop.

For many use cases, the question “which API?” becomes “do I even need an API?”


Open source caught up. Again.
