LLaMA: Meta Leaks the Keys to the Castle


Meta released LLaMA in February 2023 as a research project. Within days, the weights were “leaked” to 4chan. Suddenly, anyone could run a capable LLM locally. The implications are still unfolding.

What Happened

Feb 24: Meta announces LLaMA
Mar 3: Weights leaked via torrent posted to 4chan
Mar 11: llama.cpp enables local inference
Mar 13: Stanford releases Alpaca (fine-tuned LLaMA)

A million-dollar model could now run on a laptop.

Why It Mattered

Before LLaMA

Want to use an LLM?
├── Use OpenAI API → Pay per token, data leaves your control
├── Use GPT-J/NeoX → Open, but lower quality
└── Train your own → Millions of dollars

After LLaMA

Want to use an LLM?
├── Run LLaMA locally → Free, private, offline
├── Fine-tune on your data → Possible with consumer GPU
└── Build custom applications → No API limits

The Models

| Model | Parameters | Memory (FP32) | Memory (4-bit) |
|---|---|---|---|
| LLaMA 7B | 7 billion | 28 GB | 4 GB |
| LLaMA 13B | 13 billion | 52 GB | 8 GB |
| LLaMA 30B | 30 billion | 120 GB | 16 GB |
| LLaMA 65B | 65 billion | 260 GB | 32 GB |

With quantization, 7B runs on a MacBook.
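The arithmetic behind that table is simple: parameter count times bytes per weight. A minimal sketch (the `overhead` factor is my own knob; real usage is higher once you count the KV cache and activations):

```python
def model_memory_gb(n_params_billion, bits_per_weight, overhead=1.0):
    """Approximate weight-only memory footprint in GB.

    Ignores KV cache and activation memory, so treat results as a floor.
    """
    bytes_per_weight = bits_per_weight / 8
    return n_params_billion * 1e9 * bytes_per_weight * overhead / 1e9

# LLaMA 7B at full (32-bit) precision vs 4-bit quantized
print(model_memory_gb(7, 32))  # ~28 GB
print(model_memory_gb(7, 4))   # ~3.5 GB
```

Going from 32 bits to 4 bits is an 8x reduction, which is exactly why 7B fits in a MacBook's unified memory.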

Running Locally

llama.cpp

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

# Convert weights to GGML format
python convert.py ./models/7B/

# Run
./main -m ./models/7B/ggml-model-q4_0.bin \
    -p "The meaning of life is" \
    -n 128

Ollama (easier)

# Install
curl -fsSL https://ollama.ai/install.sh | sh

# Run
ollama run llama2
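Beyond the CLI, Ollama serves a local HTTP API (by default on port 11434), so any program on your machine can query the model. A sketch using only the standard library, assuming `llama2` has already been pulled:

```python
import json
import urllib.request

def build_payload(prompt, model="llama2"):
    # stream=False asks for a single JSON object instead of a token stream
    return {"model": model, "prompt": prompt, "stream": False}

def generate(prompt, model="llama2", host="http://localhost:11434"):
    """POST a generation request to a locally running Ollama server."""
    data = json.dumps(build_payload(prompt, model)).encode()
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# generate("Why is the sky blue?")  # requires the Ollama server running
```

Note the loop never leaves localhost: the prompt and the completion stay on your machine.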

LM Studio (GUI)

Download models, click run. No command line needed.

The Derivatives

LLaMA spawned an ecosystem:

| Model | What It Added |
|---|---|
| Alpaca | Instruction tuning |
| Vicuna | Chat fine-tuning |
| WizardLM | Complex reasoning |
| Orca | Microsoft's explanation tuning |
| CodeLlama | Code specialization |

Fine-Tuning Democratized

Before

Fine-tune GPT-3:
- OpenAI approval required
- Data uploaded to OpenAI
- Limited customization
- Pay per training token

After

from transformers import LlamaForCausalLM, LlamaTokenizer
from peft import LoraConfig, get_peft_model

# Load model
model = LlamaForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Add LoRA adapters
config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.1,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)

# Train on your data
# ... (runs on single GPU)

LoRA made fine-tuning accessible on consumer hardware.
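Why it fits on consumer hardware comes down to a parameter count. Each adapted weight matrix W gets two low-rank factors A (r × d) and B (d × r), so only 2·r·d parameters train per matrix. A back-of-envelope using LLaMA 7B's actual shape (hidden size 4096, 32 layers) and the config above (r=8, adapting q_proj and v_proj):

```python
def lora_trainable_params(hidden=4096, layers=32, r=8, adapted_per_layer=2):
    """Trainable parameters added by LoRA on square hidden x hidden projections.

    Each adapted matrix contributes A (r x hidden) plus B (hidden x r),
    i.e. 2 * r * hidden parameters.
    """
    per_matrix = 2 * r * hidden
    return layers * adapted_per_layer * per_matrix

full = 7e9  # rough total parameter count of the 7B model
lora = lora_trainable_params()
print(lora)                  # 4194304 adapter parameters
print(f"{lora / full:.4%}")  # well under 0.1% of the full model
```

Training ~4M parameters instead of 7B is the difference between a data center and a single consumer GPU.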

Use Cases Unlocked

Privacy-First Applications

Before: "I can't send this data to OpenAI"
After:  Run the model locally, data never leaves

Medical records, legal documents, proprietary code—now processable with LLMs.

Offline Operation

# Edge deployment
No internet required
Works on planes, in bunkers, on Mars

Cost Reduction

OpenAI API: $0.002 per 1K tokens
Local LLaMA: Free after hardware

For high-volume applications, local wins.
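"Local wins" is easy to sanity-check with a break-even calculation. A sketch using the $0.002/1K rate above and a hypothetical $2,000 machine (electricity ignored, which favors the API side slightly):

```python
def api_cost(tokens, price_per_1k=0.002):
    """Dollar cost of pushing `tokens` through a metered API."""
    return tokens / 1000 * price_per_1k

def break_even_tokens(hardware_cost, price_per_1k=0.002):
    """Token volume at which local hardware pays for itself."""
    return hardware_cost / price_per_1k * 1000

# hypothetical $2,000 machine vs the $0.002/1K API rate
print(break_even_tokens(2000))  # 1000000000.0 tokens, i.e. one billion
```

A billion tokens sounds like a lot, but a busy production chatbot can burn through that in months, and the hardware keeps working afterward.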

Customization

# Train on your domain
fine_tune(llama, medical_papers)
fine_tune(llama, legal_briefs)
fine_tune(llama, company_docs)

Your model, your specialization.

The Controversy

Meta’s Position

“We released to researchers, we didn’t leak it.”

The genie was out—intentionally or not.

The Safety Debate

Pro-open:
- Democratizes AI access
- Enables research
- Reduces centralization of power

Pro-closed:
- Misuse potential
- No content filtering
- No usage controls

My Take

You can’t un-release knowledge. Better to work on safety at the application layer than pretend containment is possible.

What Changed

For Developers

No API keys, no rate limits, no per-token bills. LLM features can ship offline and run on the user's machine.

For Companies

Sensitive data (medical records, legal documents, proprietary code) can stay in-house, and high-volume workloads no longer scale linearly in API cost.

For Research

Open weights let researchers inspect, probe, and fine-tune a capable model directly, which API access never allowed.

Running LLaMA Today

Hardware Requirements

| Config | Requirements |
|---|---|
| Minimal | 8 GB RAM, any CPU |
| Good | 16 GB RAM, Apple M1/M2 |
| Great | 32 GB RAM, NVIDIA GPU |
| Optimal | 80 GB+ VRAM, multiple GPUs |

Quick Start

# Using text-generation-webui
git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
./start_linux.sh  # or start_macos.sh

# Download model through UI
# Start chatting

The Future

LLaMA showed that capable models could be commoditized. This led to:

- Meta's own Llama 2, released openly (with a commercial license) in July 2023
- A wave of community fine-tunes: Alpaca, Vicuna, WizardLM, and beyond
- A mature local-inference toolchain: llama.cpp, Ollama, LM Studio

The era of AI being API-only is over.


February 2023: AI escaped from the cloud.
