LLaMA: Meta Leaks the Keys to the Castle
Meta released LLaMA in February 2023 as a gated research release. Within a week, the weights were “leaked” via a torrent posted to 4chan. Suddenly, anyone could run a capable LLM locally. The implications are still unfolding.
What Happened
- Feb 24: Meta announces LLaMA
- Mar 3: Weights leak via a torrent posted to 4chan
- Mar 11: llama.cpp enables local inference
- Mar 13: Stanford releases Alpaca (fine-tuned LLaMA)
A million-dollar model could now run on a laptop.
Why It Mattered
Before LLaMA
```
Want to use an LLM?
├── Use OpenAI API → Pay per token, data leaves your control
├── Use GPT-J/NeoX → Open, but lower quality
└── Train your own → Millions of dollars
```
After LLaMA
```
Want to use an LLM?
├── Run LLaMA locally → Free, private, offline
├── Fine-tune on your data → Possible with consumer GPU
└── Build custom applications → No API limits
```
The Models
| Model | Parameters | Memory (fp32) | Memory (4-bit) |
|---|---|---|---|
| LLaMA 7B | 7 billion | 28 GB | 4 GB |
| LLaMA 13B | 13 billion | 52 GB | 8 GB |
| LLaMA 30B | 30 billion | 120 GB | 16 GB |
| LLaMA 65B | 65 billion | 260 GB | 32 GB |
With quantization, 7B runs on a MacBook.
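The table’s numbers follow from simple arithmetic: each parameter takes 4 bytes at fp32 and a nominal half byte at 4-bit (real q4_0 files run slightly larger because quantized formats store per-block scale factors). A quick sanity check:

```python
def weights_gb(params_billion, bits_per_weight):
    """Approximate weight storage in GB: parameters x bits per weight / 8."""
    return params_billion * bits_per_weight / 8

print(weights_gb(7, 32))   # 28.0 GB at fp32, matching the table
print(weights_gb(7, 4))    # 3.5 GB nominal at 4-bit; q4_0 files are a bit
                           # larger due to per-block scale factors
```

The same formula reproduces every row of the fp32 column, which is why quantization is the whole story of running LLaMA on a laptop.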
Running Locally
llama.cpp
```shell
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

# Convert the PyTorch weights to GGML format (f16)
python convert.py ./models/7B/

# Quantize the f16 weights down to 4-bit
./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin q4_0

# Run
./main -m ./models/7B/ggml-model-q4_0.bin \
    -p "The meaning of life is" \
    -n 128
```
Ollama (easier)
```shell
# Install
curl -fsSL https://ollama.ai/install.sh | sh

# Run
ollama run llama2
```
LM Studio (GUI)
Download models, click run. No command line needed.
The Derivatives
LLaMA spawned an ecosystem:
| Model | What It Added |
|---|---|
| Alpaca | Instruction tuning |
| Vicuna | Chat fine-tuning |
| WizardLM | Complex reasoning |
| Orca | Microsoft’s explanation tuning |
| CodeLlama | Code specialization |
Fine-Tuning Democratized
Before
Fine-tune GPT-3:
- OpenAI approval required
- Data uploaded to OpenAI
- Limited customization
- Pay per training token
After
```python
from transformers import LlamaForCausalLM, LlamaTokenizer
from peft import LoraConfig, get_peft_model

# Load the base model (the "-hf" repo is the transformers-format checkpoint;
# requires accepting Meta's license on the Hugging Face Hub)
model = LlamaForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Add LoRA adapters to the attention query and value projections
config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.1,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)

# Train on your data
# ... (runs on a single GPU)
```
LoRA made fine-tuning accessible on consumer hardware.
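The economy is easy to quantify: instead of updating a full weight matrix, LoRA trains two low-rank factors, A (d×r) and B (r×d), per adapted projection. A back-of-the-envelope count for a 7B model (hidden size 4096, 32 layers, adapting q_proj and v_proj as in the config above; this is the standard LoRA parameter arithmetic, not a measured figure):

```python
def lora_trainable_params(d_model, r, n_layers, targets_per_layer=2):
    # Each adapted projection gets A (d_model x r) plus B (r x d_model)
    return n_layers * targets_per_layer * 2 * d_model * r

trainable = lora_trainable_params(d_model=4096, r=8, n_layers=32)
total = 7_000_000_000
print(trainable)                   # 4194304 adapter parameters
print(f"{trainable / total:.4%}")  # roughly 0.06% of the full model
```

Training ~4M parameters instead of 7B is the difference between a data center and a single consumer GPU.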
Use Cases Unlocked
Privacy-First Applications
Before: "I can't send this data to OpenAI"
After: Run the model locally, data never leaves
Medical records, legal documents, proprietary code—now processable with LLMs.
Offline Operation
Edge deployment needs no internet: a local model works on planes, in bunkers, on Mars.
Cost Reduction
OpenAI API: $0.002 per 1K tokens (gpt-3.5-turbo, 2023 pricing)
Local LLaMA: free after the hardware cost
For high-volume applications, local wins.
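A rough break-even sketch makes the point concrete (the $2,000 hardware figure is an assumption, and the math ignores electricity and the API’s quality edge):

```python
api_cost_per_1k_tokens = 0.002  # gpt-3.5-turbo era pricing
hardware_cost = 2_000           # assumed one-time cost of a capable local machine

# Tokens you'd have to process via the API before it costs more than the hardware
break_even_tokens = hardware_cost / api_cost_per_1k_tokens * 1_000
print(f"{break_even_tokens:,.0f}")  # 1,000,000,000 tokens
```

A billion tokens sounds like a lot, but a busy production application can burn through that in months.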
Customization
```python
# Illustrative pseudocode: specialize one base model per domain
fine_tune(llama, medical_papers)
fine_tune(llama, legal_briefs)
fine_tune(llama, company_docs)
```
Your model, your specialization.
The Controversy
Meta’s Position
“We released it to researchers; we didn’t leak it.”
The genie was out—intentionally or not.
The Safety Debate
Pro-open:
- Democratizes AI access
- Enables research
- Reduces centralization of power
Pro-closed:
- Misuse potential
- No content filtering
- No usage controls
My Take
You can’t un-release knowledge. Better to work on safety at the application layer than pretend containment is possible.
What Changed
For Developers
- Local AI development is viable
- No API dependency for MVP
- Experimentation is free
For Companies
- On-premise LLMs are possible
- Data privacy concerns addressed
- Build vs. API decision changed
For Research
- Reproducibility improved
- Open experimentation
- Faster iteration
Running LLaMA Today
Hardware Requirements
| Config | Requirements |
|---|---|
| Minimal | 8GB RAM, any CPU |
| Good | 16GB RAM, Apple M1/M2 |
| Great | 32GB RAM, NVIDIA GPU |
| Optimal | 80GB+ VRAM, multiple GPUs |
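One way to read the two tables together: pick the largest 4-bit model that fits your RAM with headroom for the KV cache and the OS. A small helper using the 4-bit figures from the models table earlier (the 25% headroom factor is an assumption):

```python
# 4-bit memory footprints (GB) from the models table above
MODELS_4BIT_GB = {"7B": 4, "13B": 8, "30B": 16, "65B": 32}

def largest_model_that_fits(ram_gb, headroom=0.25):
    # Reserve `headroom` of RAM for the KV cache, OS, etc. (assumed factor)
    budget = ram_gb * (1 - headroom)
    fits = [name for name, gb in MODELS_4BIT_GB.items() if gb <= budget]
    return fits[-1] if fits else None

print(largest_model_that_fits(8))   # 7B
print(largest_model_that_fits(32))  # 30B
```

So the “minimal” 8GB config maps to 7B, “good” to 13B, and “great” to 30B, consistent with the hardware table.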
Quick Start
```shell
# Using text-generation-webui
git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
./start_linux.sh  # or start_macos.sh on a Mac

# Download a model through the UI, then start chatting
```
The Future
LLaMA showed that capable models could be commoditized. This led to:
- LLaMA 2 (official commercial release)
- Mistral (efficient alternatives)
- Continuous improvement in local inference
The era of AI being API-only is over.
February 2023: AI escaped from the cloud.