Llama 2: Open Source Commercial AI


On July 18, 2023, Meta released Llama 2—not just for research, but with commercial use permissions. This changed the calculus for AI deployment.

What Changed

Llama 1 (February 2023)

License: Research only
Usage: Academics, experiments
Commercial: Prohibited

Llama 2 (July 2023)

License: Commercial allowed*
Usage: Anyone
Commercial: Yes (with conditions)
*Except if >700M monthly users (need Meta permission)

The Models

| Model | Parameters | Context | Use Case |
|---|---|---|---|
| Llama 2 7B | 7 billion | 4K tokens | Edge, mobile |
| Llama 2 13B | 13 billion | 4K tokens | General purpose |
| Llama 2 70B | 70 billion | 4K tokens | Best quality |
| Llama 2-Chat | All sizes | 4K tokens | Conversation |

Performance

Compared to GPT-3.5:

| Benchmark | Llama 2 70B | GPT-3.5 |
|---|---|---|
| MMLU | 68.9% | 70.0% |
| TriviaQA | 85.0% | 87.3% |
| NaturalQuestions | 46.9% | 44.9% |
| HumanEval | 29.9% | 48.1% |

Competitive on knowledge benchmarks, but clearly behind on code (HumanEval). However: it’s free, and it runs locally.

Running Llama 2

With Ollama

# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh

# Run Llama 2
ollama run llama2

# Or specific variants
ollama run llama2:13b
ollama run llama2:70b

With Hugging Face

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.float16
)

prompt = "[INST] What is Python? [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=500)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)

Quantized (Lower Memory)

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    quantization_config=bnb_config,
    device_map="auto"
)
# 7B model in ~4GB VRAM

The Chat Format

Llama 2 Chat uses a specific prompt format:

[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>

What is the capital of France? [/INST]

Multi-turn:

[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>

What is the capital of France? [/INST] The capital of France is Paris.

[INST] What is its population? [/INST]
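The template above can be assembled programmatically. A minimal sketch (`build_llama2_prompt` is a hypothetical helper, not a library function; in production, prefer the tokenizer's built-in chat template, which also inserts the `<s>`/`</s>` tokens omitted here for readability):

```python
def build_llama2_prompt(system, turns):
    """Assemble a Llama 2 chat prompt from a system message and a list of
    (user, assistant) turns; pass assistant=None for the pending request."""
    prompt = ""
    for i, (user, assistant) in enumerate(turns):
        if i == 0:
            # System prompt is embedded inside the first [INST] block
            user = f"<<SYS>>\n{system}\n<</SYS>>\n\n{user}"
        prompt += f"[INST] {user} [/INST]"
        if assistant is not None:
            prompt += f" {assistant}\n\n"
    return prompt

print(build_llama2_prompt(
    "You are a helpful assistant.",
    [("What is the capital of France?", "The capital of France is Paris."),
     ("What is its population?", None)],
))
```

This reproduces the multi-turn example above: each completed exchange is closed before the next `[INST]` block opens, and the final block is left open for the model to complete.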

Fine-Tuning

With QLoRA

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import TrainingArguments
from trl import SFTTrainer

# Prepare the 4-bit quantized model (loaded above) for training
model = prepare_model_for_kbit_training(model)

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, peft_config)

# Train
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=512,
    args=TrainingArguments(
        output_dir="./results",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
    ),
    peft_config=peft_config,
)

trainer.train()

Resources Needed

| Task | GPU Memory | Time |
|---|---|---|
| Inference 7B (4-bit) | 4 GB | n/a |
| Inference 13B (4-bit) | 8 GB | n/a |
| Fine-tune 7B (QLoRA) | 16 GB | Hours |
| Fine-tune 13B (QLoRA) | 24 GB | Hours |
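These figures follow from simple arithmetic: the weights alone need roughly parameters × bits ÷ 8 bytes, and runtime adds overhead for activations and the KV cache. A back-of-envelope sketch (`estimate_weight_gb` is a hypothetical helper, not a library function):

```python
def estimate_weight_gb(params_billion, bits_per_param):
    """Weight-only memory in GB: parameters * bits / 8.
    Actual runtime usage is higher (activations, KV cache, framework overhead)."""
    return params_billion * bits_per_param / 8

print(estimate_weight_gb(7, 4))    # 3.5  -> 7B at 4-bit: ~3.5 GB of weights
print(estimate_weight_gb(13, 4))   # 6.5  -> 13B at 4-bit: ~6.5 GB
print(estimate_weight_gb(7, 16))   # 14.0 -> 7B at fp16: ~14 GB
```

This is why 4-bit quantization matters: it brings the 7B model from a data-center GPU down to consumer hardware.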

Use Cases

Enterprise Chatbots

# On-premise, no data leaves your servers.
# generate() is a placeholder for your inference call
# (tokenize, model.generate, decode; or a request to Ollama/vLLM).
def enterprise_chat(query, context):
    prompt = f"""[INST] <<SYS>>
You are a helpful assistant for Acme Corp. Only answer based on the provided context.
<</SYS>>

Context: {context}

Question: {query} [/INST]"""

    return generate(prompt)

Code Generation

# Code Llama extends Llama 2 for code
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("codellama/CodeLlama-7b-hf")

Local AI Assistants

# Run locally, no API costs
ollama run llama2
>>> Write a Python function to calculate compound interest

Deployment Options

Cloud

| Provider | Service |
|---|---|
| AWS | SageMaker endpoints |
| Azure | Azure ML |
| GCP | Vertex AI |
| Replicate | Hosted API |
| Together | API access |

On-Premise

| Tool | Approach |
|---|---|
| vLLM | High-performance serving |
| TGI | Hugging Face inference |
| Ollama | Easy local deployment |
| llama.cpp | CPU-efficient |

Edge

Llama 2 7B (quantized)
├── Apple M1/M2 MacBooks → Works well
├── NVIDIA Jetson → Possible
└── Modern phones → Experimental

Ecosystem

Llama 2 spawned derivatives:

| Model | Specialization |
|---|---|
| Code Llama | Programming |
| Llama 2 Long | Extended context |
| Various GGUF | Quantized for llama.cpp |
| Fine-tuned variants | Domain-specific |

The Strategic Implications

For Meta

“If you can’t beat OpenAI on models, commoditize the model layer and compete on distribution.”

Meta wins if AI runs on their infrastructure, regardless of model source.

For Companies

Before: "We can't use local models commercially"
After:  "We can deploy Llama 2 in production"

The buy vs. build calculation changed.
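That calculation can be sketched with arithmetic. All numbers here are purely illustrative assumptions (not real API prices or hosting quotes), and `breakeven_million_tokens` is a hypothetical helper:

```python
def breakeven_million_tokens(api_price_per_million_usd, gpu_cost_per_month_usd):
    """Monthly volume (in millions of tokens) at which a fixed-cost
    self-hosted GPU matches pay-per-token API pricing. Illustrative only."""
    return gpu_cost_per_month_usd / api_price_per_million_usd

# Hypothetical inputs: $2 per million tokens vs. a $600/month GPU server
print(breakeven_million_tokens(2.0, 600))  # 300.0 -> ~300M tokens/month
```

Below the break-even volume the API is cheaper; above it, fixed-cost self-hosting wins, ignoring engineering time and quality differences, which often dominate.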

For Open Source

Llama 2 validated that openly licensed weights and commercial viability could coexist, and the derivative ecosystem above shows how quickly the community built on it.

Final Thoughts

Llama 2’s commercial license was the real innovation. Capable models existed before—the permission to use them commercially didn’t.

This enabled on-premise chatbots, fine-tuning on private data, and local assistants with no per-token API costs.

Not as good as GPT-4, but good enough for many use cases—and getting better.


July 2023: Commercial open-source AI became real.
