Deep Learning for Computer Vision: CNNs Explained
From facial recognition to autonomous vehicles, Convolutional Neural Networks (CNNs) are the backbone of modern computer vision. Understanding how they work is essential for any ML practitioner.
Why CNNs for Images?
Images have spatial structure. A pixel’s meaning depends on its neighbors. Traditional neural networks treat inputs as flat vectors, losing this structure.
CNNs preserve spatial relationships through:
- Local connectivity: Neurons only see small regions
- Weight sharing: Same filters applied across the image
- Hierarchical learning: Simple features combine into complex ones
The Convolution Operation
A convolution slides a small filter (kernel) across the image, computing dot products:
```python
import numpy as np

# 3x3 edge-detection filter
kernel = np.array([[-1, -1, -1],
                   [-1,  8, -1],
                   [-1, -1, -1]])

# Slide over the image, computing a weighted sum at each position
def convolve(image, kernel):
    h, w = image.shape
    output = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            output[i, j] = np.sum(image[i:i+3, j:j+3] * kernel)
    return output
```
Different kernels detect different features:
- Edge detection
- Blur
- Sharpen
- Pattern matching
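To see why the 3x3 filter above responds to edges, a quick numeric check on two hand-built 3x3 patches (pure NumPy, illustrative only):

```python
import numpy as np

kernel = np.array([[-1, -1, -1],
                   [-1,  8, -1],
                   [-1, -1, -1]])

flat = np.full((3, 3), 5.0)              # uniform region: no edge
edge = np.array([[0, 9, 9],              # vertical edge through the patch
                 [0, 9, 9],
                 [0, 9, 9]], dtype=float)

print(np.sum(flat * kernel))  # 0.0  -> flat regions are suppressed
print(np.sum(edge * kernel))  # 27.0 -> strong response at the edge
```

The center weight of 8 exactly cancels eight identical neighbors, so only positions where the neighborhood differs from the center produce a signal.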
CNN Architecture
A typical CNN stacks several layer types:
Convolutional Layers
Learn filters that detect features:
```python
import torch.nn as nn

nn.Conv2d(
    in_channels=3,    # RGB input
    out_channels=32,  # 32 different filters
    kernel_size=3,    # 3x3 filters
    padding=1,        # maintain spatial dimensions
)
```
Activation Functions
ReLU is standard:
```python
nn.ReLU()  # max(0, x)
```
Pooling Layers
Reduce spatial dimensions:
```python
nn.MaxPool2d(kernel_size=2, stride=2)  # halve spatial dimensions
```
Fully Connected Layers
Final classification:
```python
nn.Linear(in_features=512, out_features=10)  # 10 classes
```
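Chaining the layer types above, a quick shape trace shows how each one transforms a batch (a sketch; the channel counts here are arbitrary):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)          # batch of one 32x32 RGB image
x = nn.Conv2d(3, 32, 3, padding=1)(x)  # -> (1, 32, 32, 32): padding keeps H, W
x = nn.ReLU()(x)                       # shape unchanged
x = nn.MaxPool2d(2, 2)(x)              # -> (1, 32, 16, 16): spatial dims halved
x = nn.Flatten()(x)                    # -> (1, 32 * 16 * 16) = (1, 8192)
x = nn.Linear(8192, 10)(x)             # -> (1, 10): one score per class
print(x.shape)  # torch.Size([1, 10])
```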
Complete CNN Example
```python
import torch.nn as nn

class SimpleCNN(nn.Module):
    """A small CNN for 32x32 RGB inputs (e.g., CIFAR-10)."""

    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            # Conv Block 1: 32x32 -> 16x16
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),
            # Conv Block 2: 16x16 -> 8x8
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),
            # Conv Block 3: 8x8 -> 4x4
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 4 * 4, 512),  # 128 channels x 4x4 spatial
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(512, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        x = self.classifier(x)
        return x
```
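The `128 * 4 * 4` in the classifier isn't magic: each pooling layer halves the spatial dimensions, so a 32x32 input (assumed here, as for CIFAR-10) shrinks to 4x4 after three blocks:

```python
# Each MaxPool2d(2, 2) halves height and width
size = 32                 # assumed input resolution (e.g., CIFAR-10)
for block in range(3):    # three conv blocks, each ending in a pool
    size //= 2            # 32 -> 16 -> 8 -> 4
channels = 128            # output channels of the last conv layer
print(channels * size * size)  # 2048, i.e. 128 * 4 * 4
```

Feeding a different input resolution changes this number, which is the most common cause of shape-mismatch errors when adapting a CNN.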
Famous Architectures
LeNet (1998)
The original CNN for digit recognition. Simple architecture that proved the concept.
AlexNet (2012)
Won ImageNet, sparked the deep learning revolution:
- Deeper than LeNet
- ReLU activation
- Dropout regularization
- GPU training
VGG (2014)
Simple, uniform architecture:
- All 3x3 convolutions
- VGG-16, VGG-19 variants
- Easy to understand, still used today
ResNet (2015)
Introduced skip connections to train very deep networks:
```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        identity = x
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)
        out = self.conv2(out)
        out = self.bn2(out)
        out += identity  # skip connection!
        out = self.relu(out)
        return out
```
ResNet-50, ResNet-101, ResNet-152 remain popular.
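Why skip connections help can be seen directly: if a block's convolutional path contributes nothing (weights zeroed here purely for illustration), the input still flows through unchanged, so stacking more blocks cannot make the network worse:

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(16, 16, 3, padding=1, bias=False)
nn.init.zeros_(conv.weight)       # conv path outputs all zeros

x = torch.rand(1, 16, 8, 8)       # non-negative input, so ReLU is a no-op
out = torch.relu(conv(x) + x)     # residual form: F(x) + x
print(torch.equal(out, x))        # True: the identity survives the block
```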
YOLO (2016)
Real-time object detection:
- Single forward pass
- Predicts bounding boxes and classes
- Fast enough for video
Training CNNs
Data Augmentation
Critical for preventing overfitting:
```python
from torchvision import transforms

transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(10),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```
Transfer Learning
Don't train from scratch; use pretrained models:
```python
import torch.nn as nn
import torch.optim as optim
from torchvision import models

# Load pretrained ResNet
# (newer torchvision versions use the weights= argument instead)
model = models.resnet50(pretrained=True)

# Freeze early layers
for param in model.parameters():
    param.requires_grad = False

# Replace final layer (resnet50's final features are 2048-dim)
model.fc = nn.Linear(2048, num_classes)

# Train only the final layer
optimizer = optim.Adam(model.fc.parameters())
```
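The freeze-and-replace pattern works on any model. A minimal check with a toy network (hypothetical layer sizes) confirms that only the freshly created head stays trainable:

```python
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))

for param in model.parameters():   # freeze everything
    param.requires_grad = False

model[2] = nn.Linear(16, 2)        # new head: requires_grad=True by default

trainable = [p for p in model.parameters() if p.requires_grad]
print(sum(p.numel() for p in trainable))  # 34 = 16*2 weights + 2 biases
```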
Batch Normalization
Stabilizes training:
```python
nn.BatchNorm2d(num_features=64)
```
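What "stabilizes" means concretely: in training mode, batch norm rescales each channel of a batch to roughly zero mean and unit variance, regardless of how skewed the incoming activations are (a quick sketch):

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(num_features=4)
x = 10 * torch.randn(8, 4, 16, 16) + 5   # skewed activations: mean ~5, std ~10

out = bn(x)                              # training mode by default
print(out.mean(dim=(0, 2, 3)))           # per-channel means near 0
print(out.std(dim=(0, 2, 3)))            # per-channel stds near 1
```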
Visualizing What CNNs Learn
Feature Maps
Early layers detect edges and textures. Later layers detect complex patterns.
Gradient-weighted Class Activation Mapping (Grad-CAM)
Shows which regions influenced the prediction:
```python
# Compute gradients of the class score with respect to the feature maps
# Weight each feature map by its average gradient
# The result shows "where the network looked"
```
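Those three steps can be sketched with a forward hook on a toy model (the architecture and the class index chosen here are placeholders, not a full Grad-CAM implementation):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10),
)

feats = {}
def save_maps(module, inputs, output):
    output.retain_grad()          # keep gradients for this intermediate tensor
    feats["maps"] = output
model[1].register_forward_hook(save_maps)   # hook on the last conv activation

x = torch.randn(1, 3, 32, 32)
scores = model(x)
scores[0, 3].backward()           # gradient of one class's score

maps = feats["maps"]
weights = maps.grad.mean(dim=(2, 3), keepdim=True)  # average gradient per map
cam = torch.relu((weights * maps).sum(dim=1))       # weighted sum of maps
print(cam.shape)  # torch.Size([1, 32, 32]): one heatmap over the input grid
```

Upsampled to the input resolution and overlaid on the image, `cam` highlights the regions that pushed the chosen class score up.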
Common Applications
- Image Classification: Cat vs dog, medical imaging
- Object Detection: YOLO, Faster R-CNN
- Semantic Segmentation: Pixel-wise classification
- Face Recognition: Identity verification
- Pose Estimation: Human body keypoints
- Style Transfer: Artistic rendering
Practical Tips
- Start with pretrained models: Transfer learning almost always beats training from scratch
- Use data augmentation: Free regularization
- Monitor with TensorBoard: Watch loss curves, visualize filters
- Experiment with learning rate: Too high → diverge, too low → slow
- Start simple: Get a baseline before adding complexity
Final Thoughts
CNNs are remarkably effective at extracting visual features. The core ideas—local connectivity, weight sharing, hierarchical learning—mirror how biological vision systems work.
Understanding CNNs deeply will serve you well. They’re the foundation for most computer vision and increasingly appear in other domains like audio and text.
See the patterns. Learn the features.