# Llama 2: Open Source Commercial AI
On July 18, 2023, Meta released Llama 2—not just for research, but with commercial use permissions. This changed the calculus for AI deployment.
## What Changed

### Llama 1 (February 2023)

- License: Research only
- Usage: Academics, experiments
- Commercial: Prohibited

### Llama 2 (July 2023)

- License: Commercial allowed*
- Usage: Anyone
- Commercial: Yes (with conditions)

*Except for products with more than 700 million monthly active users, which require Meta's permission.
## The Models
| Model | Parameters | Context | Use Case |
|---|---|---|---|
| Llama 2 7B | 7 billion | 4K tokens | Edge, mobile |
| Llama 2 13B | 13 billion | 4K tokens | General purpose |
| Llama 2 70B | 70 billion | 4K tokens | Best quality |
| Llama 2-Chat | All sizes | 4K tokens | Conversation |
## Performance

Compared to GPT-3.5:
| Benchmark | Llama 2 70B | GPT-3.5 |
|---|---|---|
| MMLU | 68.9% | 70.0% |
| TriviaQA | 85.0% | 87.3% |
| NaturalQuestions | 46.9% | 44.9% |
| HumanEval | 29.9% | 48.1% |
Competitive on knowledge benchmarks, though notably weaker on code (HumanEval). However, it is free and runs locally.
## Running Llama 2

### With Ollama

```bash
# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh

# Run Llama 2
ollama run llama2

# Or specific variants
ollama run llama2:13b
ollama run llama2:70b
```
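Beyond the CLI, Ollama also serves a local HTTP API (by default on port 11434) whose `/api/generate` endpoint accepts a JSON body with `model`, `prompt`, and `stream` fields. A minimal sketch of building that request body; the helper function name is illustrative:

```python
import json

def build_ollama_request(model, prompt, stream=False):
    """Build the JSON body for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": stream}
    return json.dumps(payload)

# POST this body to http://localhost:11434/api/generate, e.g.:
# requests.post("http://localhost:11434/api/generate",
#               data=build_ollama_request("llama2", "Hello"))
```

With `stream=False` the endpoint returns a single JSON object containing the full response rather than a stream of chunks.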
### With Hugging Face

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.float16,
)

prompt = "[INST] What is Python? [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=500)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
```
### Quantized (Lower Memory)

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
# 7B model fits in ~4 GB of VRAM
```
## The Chat Format

Llama 2-Chat uses a specific prompt format:

```
[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>
What is the capital of France? [/INST]
```

Multi-turn:

```
[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>
What is the capital of France? [/INST] The capital of France is Paris.
[INST] What is its population? [/INST]
```
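The multi-turn format can be assembled programmatically. A minimal sketch (the helper name is illustrative; in practice the tokenizer's built-in chat template usually handles this for you):

```python
def build_llama2_prompt(system, turns):
    """Assemble a multi-turn Llama 2 chat prompt.

    `turns` is a list of (user_message, assistant_reply) pairs; the final
    pair's reply may be None to leave the prompt open for generation.
    """
    prompt = f"[INST] <<SYS>>\n{system}\n<</SYS>>\n"
    first = True
    for user, assistant in turns:
        if first:
            prompt += f"{user} [/INST]"  # system block already opened [INST]
            first = False
        else:
            prompt += f"[INST] {user} [/INST]"
        if assistant is not None:
            prompt += f" {assistant}\n"
    return prompt
```

Getting this format exactly right matters: Llama 2-Chat was fine-tuned on it, and deviations degrade output quality.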
## Fine-Tuning

### With QLoRA

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import TrainingArguments
from trl import SFTTrainer

# Prepare the 4-bit-loaded model (from the quantized example above) for training
model = prepare_model_for_kbit_training(model)

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, peft_config)

# Train
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=512,
    args=TrainingArguments(
        output_dir="./results",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
    ),
    peft_config=peft_config,
)
trainer.train()
```
## Resources Needed
| Task | GPU Memory | Time |
|---|---|---|
| Inference 7B (4-bit) | 4 GB | - |
| Inference 13B (4-bit) | 8 GB | - |
| Fine-tune 7B (QLoRA) | 16 GB | Hours |
| Fine-tune 13B (QLoRA) | 24 GB | Hours |
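The inference figures above follow roughly from parameter count times bytes per weight, plus room for activations and the KV cache. A back-of-the-envelope estimator; the 20% overhead factor is a rough assumption, not a measured constant:

```python
def weight_memory_gb(params_billion, bits_per_weight, overhead=0.2):
    """Rough GPU memory estimate: quantized weights plus a fixed overhead fraction."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * (1 + overhead) / 1e9

print(round(weight_memory_gb(7, 4), 1))   # 7B at 4-bit -> 4.2 (GB)
print(round(weight_memory_gb(13, 4), 1))  # 13B at 4-bit -> 7.8 (GB)
```

Both estimates line up with the table: roughly 4 GB for 7B and 8 GB for 13B at 4-bit.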
## Use Cases

### Enterprise Chatbots

```python
# On-premise: no data leaves your servers
def enterprise_chat(query, context):
    prompt = f"""[INST] <<SYS>>
You are a helpful assistant for Acme Corp. Only answer based on the provided context.
<</SYS>>
Context: {context}
Question: {query} [/INST]"""
    return generate(prompt)  # generate() wraps the model call shown earlier
```
### Code Generation

```python
# Code Llama extends Llama 2 with additional training on code
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("codellama/CodeLlama-7b-hf")
```
### Local AI Assistants

```bash
# Run locally, no API costs
ollama run llama2
>>> Write a Python function to calculate compound interest
```
## Deployment Options

### Cloud
| Provider | Service |
|---|---|
| AWS | SageMaker endpoints |
| Azure | Azure ML |
| GCP | Vertex AI |
| Replicate | Hosted API |
| Together | API access |
### On-Premise
| Tool | Approach |
|---|---|
| vLLM | High-performance serving |
| TGI | Hugging Face inference |
| Ollama | Easy local deployment |
| llama.cpp | CPU-efficient |
### Edge

```
Llama 2 7B (quantized)
├── Apple M1/M2 MacBooks → Works well
├── NVIDIA Jetson → Possible
└── Modern phones → Experimental
```
## Ecosystem
Llama 2 spawned derivatives:
| Model | Specialization |
|---|---|
| Code Llama | Programming |
| Llama 2 Long | Extended context |
| Various GGUF | Quantized for llama.cpp |
| Fine-tuned variants | Domain-specific |
## The Strategic Implications

### For Meta

> "If you can't beat OpenAI on models, commoditize the model layer and compete on distribution."
Meta wins if AI runs on their infrastructure, regardless of model source.
### For Companies

- Before: "We can't use local models commercially"
- After: "We can deploy Llama 2 in production"

The buy-vs.-build calculation changed.
### For Open Source
Llama 2 validated that:
- Open-weight models can be competitive
- Commercial use attracts contributors
- The community can iterate rapidly
## Final Thoughts
Llama 2’s commercial license was the real innovation. Capable models existed before—the permission to use them commercially didn’t.
This enabled:
- On-premise AI without vendor lock-in
- Privacy-preserving deployments
- Custom fine-tuning for enterprises
- A thriving open-source ecosystem
Llama 2 is not as good as GPT-4, but it is good enough for many use cases, and the gap keeps narrowing.
July 2023: Commercial open-source AI became real.