Deep Learning for Computer Vision: CNNs Explained

ai machine-learning computer-vision

From facial recognition to autonomous vehicles, Convolutional Neural Networks (CNNs) are the backbone of modern computer vision. Understanding how they work is essential for any ML practitioner.

Why CNNs for Images?

Images have spatial structure. A pixel’s meaning depends on its neighbors. Traditional neural networks treat inputs as flat vectors, losing this structure.
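
To make the loss concrete: in a flattened image, horizontal neighbors stay adjacent, but vertical neighbors end up a full row apart (a small pure-Python sketch):

```python
# Flattening a 2D image into a 1D vector: horizontal neighbors stay
# adjacent, but vertical neighbors end up `width` positions apart.
width, height = 32, 32
image = [[(r, c) for c in range(width)] for r in range(height)]
flat = [px for row in image for px in row]

# Pixel (5, 10) and the pixel directly below it, (6, 10):
idx_a = 5 * width + 10
idx_b = 6 * width + 10
print(idx_b - idx_a)  # 32: vertical neighbors are a full row apart
```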

CNNs preserve spatial relationships through:

  - Local connectivity: each unit looks only at a small patch of the input
  - Weight sharing: the same filter is applied at every position in the image
  - Hierarchical learning: simple patterns in early layers compose into complex ones deeper in

The Convolution Operation

A convolution slides a small filter (kernel) across the image, computing dot products:

import numpy as np

image = np.random.rand(28, 28)  # a grayscale image

# 3x3 edge detection filter
kernel = np.array([
    [-1, -1, -1],
    [-1,  8, -1],
    [-1, -1, -1],
])

# Slide over the image, computing a weighted sum at each position
h, w = image.shape
output = np.zeros((h - 2, w - 2))
for i in range(h - 2):
    for j in range(w - 2):
        output[i, j] = np.sum(image[i:i+3, j:j+3] * kernel)

Different kernels detect different features:

  - Edge detectors (like the one above) respond to sharp intensity changes
  - Averaging kernels blur; high-center kernels sharpen
  - In a CNN, kernel values aren't hand-designed; they're learned from data

CNN Architecture

A typical CNN stacks several layer types:

Convolutional Layers

Learn filters that detect features:

import torch.nn as nn

nn.Conv2d(
    in_channels=3,    # RGB input
    out_channels=32,  # 32 different filters
    kernel_size=3,    # 3x3 filters
    padding=1         # Maintain spatial dimensions
)
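
As a sanity check on the padding=1 choice, the standard output-size formula can be computed directly (conv_out_size is a throwaway helper here, not a PyTorch function):

```python
# Output spatial size of a convolution: (W - K + 2P) // S + 1
def conv_out_size(w, kernel_size, padding=0, stride=1):
    return (w - kernel_size + 2 * padding) // stride + 1

# kernel_size=3 with padding=1 keeps the spatial size unchanged:
print(conv_out_size(32, kernel_size=3, padding=1))  # 32
# Without padding, each conv shaves off 2 pixels:
print(conv_out_size(32, kernel_size=3, padding=0))  # 30
```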

Activation Functions

ReLU is standard:

nn.ReLU()  # max(0, x)

Pooling Layers

Reduce spatial dimensions:

nn.MaxPool2d(kernel_size=2, stride=2)  # Halve dimensions
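
What max-pooling actually does can be sketched in a few lines of NumPy: split the input into non-overlapping 2x2 blocks and keep each block's maximum:

```python
import numpy as np

# 2x2 max pooling with stride 2, sketched in NumPy
x = np.array([
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [7, 8, 0, 1],
    [6, 5, 2, 3],
])
h, w = x.shape
# Reshape into (row-blocks, 2, col-blocks, 2), then take the max
# over each 2x2 block
pooled = x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))
print(pooled)
# [[4 5]
#  [8 3]]
```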

Fully Connected Layers

Final classification:

nn.Linear(in_features=512, out_features=10)  # 10 classes

Complete CNN Example

import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        
        self.features = nn.Sequential(
            # Conv Block 1
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),
            
            # Conv Block 2
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),
            
            # Conv Block 3
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),
        )
        
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 4 * 4, 512),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(512, num_classes),
        )
    
    def forward(self, x):
        x = self.features(x)
        x = self.classifier(x)
        return x
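
Where does the 128 * 4 * 4 come from? Assuming 32x32 inputs (CIFAR-10 style), each conv keeps the spatial size (padding=1) and each pool halves it; a quick trace confirms the flattened size:

```python
# Trace shapes through the three conv blocks, assuming 32x32 inputs.
# Each conv with padding=1 preserves the spatial size; each 2x2
# max-pool halves it.
size, channels = 32, 3
for out_channels in (32, 64, 128):
    channels = out_channels  # conv changes channel count only
    size //= 2               # max-pool halves the spatial size
print(channels, size, size, channels * size * size)  # 128 4 4 2048
```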

Famous Architectures

LeNet (1998)

The original CNN for digit recognition. Simple architecture that proved the concept.

AlexNet (2012)

Won ImageNet, sparked the deep learning revolution:

  - ReLU activations instead of sigmoid/tanh
  - Dropout for regularization
  - GPU training, which made the scale practical

VGG (2014)

Simple, uniform architecture:

  - Only 3x3 convolutions and 2x2 max-pooling, stacked 16-19 layers deep
  - Showed that depth with small filters beats shallower designs with large filters

ResNet (2015)

Introduced skip connections to train very deep networks:

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        identity = x
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)
        out = self.conv2(out)
        out = self.bn2(out)
        out += identity  # Skip connection!
        out = self.relu(out)
        return out

ResNet-50, ResNet-101, ResNet-152 remain popular.
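
The idea behind the skip connection shows up even in a toy sketch (ignoring the ReLU): if the learned layers contribute nothing, the block reduces to the identity, so adding depth can't destroy information:

```python
# A residual block computes x + f(x). If the learned function f
# outputs zeros, the block passes its input through unchanged.
def residual_block(x, f):
    return [xi + fi for xi, fi in zip(x, f(x))]

x = [1.0, -2.0, 3.0]
identity_like = residual_block(x, lambda v: [0.0] * len(v))
print(identity_like)  # [1.0, -2.0, 3.0]
```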

YOLO (2016)

Real-time object detection:

  - A single forward pass predicts bounding boxes and class probabilities together
  - Divides the image into a grid; each cell predicts boxes for objects centered there
  - Fast enough for video, hence the name: "You Only Look Once"

Training CNNs

Data Augmentation

Critical for preventing overfitting:

from torchvision import transforms

transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(10),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])

Transfer Learning

Don’t train from scratch—use pretrained models:

import torch.nn as nn
import torch.optim as optim
from torchvision import models

# Load pretrained ResNet
model = models.resnet50(pretrained=True)

# Freeze early layers
for param in model.parameters():
    param.requires_grad = False

# Replace final layer
model.fc = nn.Linear(2048, num_classes)

# Train only the final layer
optimizer = optim.Adam(model.fc.parameters())

Batch Normalization

Stabilizes training:

nn.BatchNorm2d(num_features=64)
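
What the layer computes is simple to sketch in NumPy: normalize to zero mean and unit variance, then apply a learned scale (gamma) and shift (beta). The real layer does this per channel over a batch; this toy version uses a single vector:

```python
import numpy as np

# Batch normalization, sketched: normalize, then scale and shift.
def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    mean = x.mean()
    var = x.var()
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = np.array([1.0, 2.0, 3.0, 4.0])
y = batch_norm(x)
# After normalization: mean ~0, variance ~1
print(round(float(y.mean()), 6), round(float(y.var()), 6))
```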

Visualizing What CNNs Learn

Feature Maps

Early layers detect edges and textures. Later layers detect complex patterns.

Gradient-weighted Class Activation Mapping (Grad-CAM)

Shows which regions influenced the prediction:

# Compute gradients of class score with respect to feature maps
# Weight feature maps by average gradient
# Results show "where the network looked"
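
Those steps are only a few lines of NumPy on toy data (the shapes here are made up for illustration):

```python
import numpy as np

# Grad-CAM in miniature: weight each feature map by its average
# gradient, sum the weighted maps, and keep positive evidence.
rng = np.random.default_rng(0)
feature_maps = rng.random((8, 7, 7))  # (channels, H, W)
gradients = rng.random((8, 7, 7))     # d(class score)/d(feature maps)

weights = gradients.mean(axis=(1, 2))  # one weight per channel
cam = np.maximum((weights[:, None, None] * feature_maps).sum(axis=0), 0)
cam /= cam.max()                       # normalize to [0, 1]
print(cam.shape)  # (7, 7): upsample to image size to overlay
```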

Common Applications

  - Image classification (what is in this image?)
  - Object detection (what, and where?)
  - Semantic segmentation (label every pixel)
  - Face recognition
  - Medical imaging (tumor detection, X-ray analysis)

Practical Tips

  1. Start with pretrained models: Transfer learning almost always beats training from scratch
  2. Use data augmentation: Free regularization
  3. Monitor with tensorboard: Watch loss curves, visualize filters
  4. Experiment with learning rate: Too high → diverge, too low → slow
  5. Start simple: Get a baseline before adding complexity

Final Thoughts

CNNs are remarkably effective at extracting visual features. The core ideas—local connectivity, weight sharing, hierarchical learning—mirror how biological vision systems work.

Understanding CNNs deeply will serve you well. They’re the foundation for most computer vision and increasingly appear in other domains like audio and text.


See the patterns. Learn the features.
