CLIP: Connecting Text and Images
CLIP (Contrastive Language-Image Pre-training) might be the most important AI model you’ve never heard of. It connects text and images, and it’s the foundation for image generation, search, and more.
What is CLIP?
CLIP learns to match images with text descriptions:
Image: [photo of a dog] ←→ Text: "a photo of a dog" ✓ Match
Image: [photo of a dog] ←→ Text: "a photo of a cat" ✗ No match
CLIP was trained on 400 million image-text pairs collected from the internet.
How It Works
Two Encoders
Image Encoder (Vision Transformer or ResNet)
            ↓
  [Image embedding vector]
            ↕  similarity comparison
  [Text embedding vector]
            ↑
Text Encoder (Transformer)
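The only architectural requirement is that both encoders land in the same embedding space. Here is a minimal sketch of the dual-encoder layout, assuming hypothetical image_backbone and text_backbone modules that expose an output_dim attribute:

import torch.nn as nn

class DualEncoder(nn.Module):
    """Minimal sketch: two encoders projected into one shared space."""
    def __init__(self, image_backbone, text_backbone, embed_dim=512):
        super().__init__()
        # image_backbone / text_backbone are hypothetical modules that
        # return (batch, output_dim) feature tensors
        self.image_backbone = image_backbone   # e.g. a ViT or ResNet
        self.text_backbone = text_backbone     # a Transformer
        self.image_proj = nn.Linear(image_backbone.output_dim, embed_dim)
        self.text_proj = nn.Linear(text_backbone.output_dim, embed_dim)

    def forward(self, images, tokens):
        image_emb = self.image_proj(self.image_backbone(images))
        text_emb = self.text_proj(self.text_backbone(tokens))
        return image_emb, text_emb

The projection layers are what make the comparison meaningful: both modalities end up as vectors of the same dimension in the same space.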
Contrastive Learning
# Training objective (simplified): symmetric cross-entropy over a
# batch of (image, text) pairs
import torch
import torch.nn.functional as F

def clip_loss(image_embeddings, text_embeddings, temperature=0.07):
    # 1. Normalize so dot products are cosine similarities
    image_embeddings = F.normalize(image_embeddings, dim=-1)
    text_embeddings = F.normalize(text_embeddings, dim=-1)
    # 2. Similarity matrix: entry [i, j] compares image i with text j
    logits = image_embeddings @ text_embeddings.T / temperature
    # 3. Correct pairs sit on the diagonal
    targets = torch.arange(len(logits), device=logits.device)
    # 4. Maximize similarity for correct pairs (diagonal) and minimize
    #    it for incorrect pairs (off-diagonal), in both directions
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.T, targets)) / 2
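A quick smoke test on random embeddings, just to show the shapes involved; for untrained random vectors the loss sits near log(batch_size):

imgs = torch.randn(8, 512)    # batch of 8 image embeddings
txts = torch.randn(8, 512)    # the 8 matching text embeddings
print(clip_loss(imgs, txts))  # scalar; roughly log(8) here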
Zero-Shot Classification
CLIP can classify without training on specific classes:
# Define candidate classes as text
labels = [
    "a photo of a dog",
    "a photo of a cat",
    "a photo of a bird",
]

# Encode image and text (pseudo-API; see "Using CLIP" below for the real one)
image_features = clip.encode_image(image)
text_features = clip.encode_text(labels)

# Compare: the highest-similarity label wins
similarities = image_features @ text_features.T
predicted_class = labels[similarities.argmax()]
No fine-tuning needed.
Why CLIP Matters
Flexibility
Traditional computer vision:
Train on dogs → Only recognizes dogs
New class? → Retrain or fine-tune
CLIP:
Train once → Recognize anything describable in text
New class? → Just describe it
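With the zero-shot classifier above, "just describe it" is literal: adding a class is appending a string.

labels.append("a photo of a zebra")  # the classifier now covers zebras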
Language-Vision Bridge
CLIP enables:
- Text-to-image search
- Image-to-text matching
- Zero-shot classification
- Image generation guidance (DALL-E, Stable Diffusion)
Applications
Image Generation
CLIP guides image generation:
# CLIP-guided generation (simplified sketch)
text_features = clip.encode_text("a painting of a sunset")

while not converged:
    generated_image = generator.step()
    # CLIP scores how well the image matches the text
    image_features = clip.encode_image(generated_image)
    # Optimize to maximize similarity (minimize negative similarity)
    loss = -similarity(image_features, text_features)
    loss.backward()
    optimizer.step()
DALL-E 2 builds directly on CLIP embeddings, and Stable Diffusion conditions generation on a CLIP text encoder; Midjourney is closed-source but appears to rely on similar components.
Image Search
# Search images by text query
query = "sunset over mountains"
query_features = clip.encode_text(query)

# Score every image in the database against the query
similarities = [
    similarity(query_features, clip.encode_image(img))
    for img in database
]

# Return the best matches, highest similarity first
ranked = sorted(zip(database, similarities), key=lambda pair: pair[1], reverse=True)
top_images = [img for img, _ in ranked]
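Re-encoding every image on every query would be wasteful; in practice you embed the database once and reuse it. A sketch assuming the same pseudo-API, with embeddings as NumPy vectors:

import numpy as np

# One-time: embed and L2-normalize the whole database
image_matrix = np.stack([clip.encode_image(img) for img in database])
image_matrix /= np.linalg.norm(image_matrix, axis=1, keepdims=True)

def search(query, k=5):
    q = clip.encode_text(query)
    q /= np.linalg.norm(q)
    scores = image_matrix @ q        # cosine similarity to every image
    top = np.argsort(-scores)[:k]    # indices of the k best matches
    return [database[i] for i in top]

At larger scale, the same precomputed embeddings go into an approximate nearest-neighbor index (FAISS, Annoy, etc.) instead of a dense matrix product.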
Visual Question Answering
question = "What color is the car in this image?"
options = ["red", "blue", "green", "black"]

image_features = clip.encode_image(image)
option_features = [
    clip.encode_text(f"The car is {color}")
    for color in options
]

# Pick the option whose text best matches the image
similarities = [similarity(image_features, opt) for opt in option_features]
answer = options[similarities.index(max(similarities))]
Using CLIP
Installation
# Official OpenAI implementation
pip install git+https://github.com/openai/CLIP.git
# or the Hugging Face implementation
pip install transformers
Basic Usage
import clip
import torch
from PIL import Image

# Load model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Prepare inputs
image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a dog", "a cat", "a bird"]).to(device)

# Get features
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

# Normalize, then compare
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)

# Result: one probability per label
print(f"Label probabilities: {similarity}")
With Hugging Face
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")
inputs = processor(
    text=["a photo of a cat", "a photo of a dog"],
    images=image,
    return_tensors="pt",
    padding=True,
)

outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # image-text similarity scores
probs = logits_per_image.softmax(dim=1)      # probabilities over the labels
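To read the prediction back out:

labels = ["a photo of a cat", "a photo of a dog"]
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")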
Limitations
Prompt Engineering Matters
# These give different results:
"dog"
"a dog"
"a photo of a dog"
"a professional photo of a dog"
"a cute fluffy dog"
The text description significantly affects matching.
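One mitigation, reported in the CLIP paper, is prompt ensembling: embed each class with several templates and average the normalized embeddings. A sketch using the same pseudo-API as earlier (normalize stands in for an assumed L2-normalization helper):

templates = [
    "a photo of a {}",
    "a blurry photo of a {}",
    "a painting of a {}",
]

def class_embedding(class_name):
    # Average the normalized embedding of every template
    # (normalize is a hypothetical L2-normalization helper)
    embeddings = [
        normalize(clip.encode_text(t.format(class_name)))
        for t in templates
    ]
    return normalize(sum(embeddings) / len(embeddings))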
Biases
CLIP inherits internet biases:
- Gender associations with professions
- Cultural representations
- Western-centric training data
Abstract Concepts
CLIP struggles with:
- Numbers (“3 cats” vs “2 cats”)
- Negation (“no dogs”)
- Spatial relationships (“cat on the left of dog”)
- Abstract concepts (“freedom”, “justice”)
Impact
CLIP changed the AI landscape:
- Made text-to-image possible at scale
- Enabled zero-shot visual classification
- Bridged language and vision research
- Opened multimodal AI development
Most major image generation models build on CLIP or a CLIP-style text-image encoder.
Variants and Successors
| Model | Improvements |
|---|---|
| CLIP | Original OpenAI model |
| OpenCLIP | Open-source reproduction, trained on public (LAION) data |
| BLIP | Adds caption generation on top of contrastive matching |
| SigLIP | Sigmoid loss in place of the softmax contrastive objective |
| EVA-CLIP | Improved training recipes for scaling the vision encoder |
Final Thoughts
CLIP is infrastructure. You might not use it directly, but the AI tools you use probably depend on it.
Understanding CLIP helps understand modern image generation, search, and classification. It’s the bridge that connected language models to the visual world.
Images have captions. CLIP understands both.