Deep Learning for Computer Vision: CNNs Explained
From facial recognition to autonomous vehicles, Convolutional Neural Networks (CNNs) are the backbone of modern computer vision. Understanding how they work is essential for any ML practitioner.
Why CNNs for Images?
Images have spatial structure. A pixel’s meaning depends on its neighbors. Traditional neural networks treat inputs as flat vectors, losing this structure.
CNNs preserve spatial relationships through:
- Local connectivity: Neurons only see small regions
- Weight sharing: Same filters applied across the image
- Hierarchical learning: Simple features combine into complex ones
The Convolution Operation
A convolution slides a small filter (kernel) across the image, computing dot products:
```python
import numpy as np

# 3x3 edge-detection filter
kernel = np.array([[-1, -1, -1],
                   [-1,  8, -1],
                   [-1, -1, -1]])

# Slide over the image, computing a weighted sum at each position
def convolve(image, kernel):
    h, w = image.shape
    output = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            output[i, j] = np.sum(image[i:i+3, j:j+3] * kernel)
    return output
```
Different kernels detect different features:
- Edge detection
- Blur
- Sharpen
- Pattern matching
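To see why the 3x3 filter above responds to edges, a quick numeric check on two hand-built 3x3 patches (pure NumPy, illustrative only):

```python
import numpy as np

kernel = np.array([[-1, -1, -1],
                   [-1,  8, -1],
                   [-1, -1, -1]])

flat = np.full((3, 3), 5.0)              # uniform region: no edge
edge = np.array([[0, 9, 9],              # vertical edge through the patch
                 [0, 9, 9],
                 [0, 9, 9]], dtype=float)

print(np.sum(flat * kernel))  # 0.0  -> flat regions are suppressed
print(np.sum(edge * kernel))  # 27.0 -> strong response at the edge
```

The center weight of 8 exactly cancels eight identical neighbors, so only positions where the neighborhood differs from the center produce a signal.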
CNN Architecture
A typical CNN stacks several layer types:
Convolutional Layers
Learn filters that detect features:
```python
import torch.nn as nn

nn.Conv2d(
    in_channels=3,    # RGB input
    out_channels=32,  # 32 different filters
    kernel_size=3,    # 3x3 filters
    padding=1,        # maintain spatial dimensions
)
```
Activation Functions
ReLU is standard:
```python
nn.ReLU()  # max(0, x)
```
Pooling Layers
Reduce spatial dimensions:
```python
nn.MaxPool2d(kernel_size=2, stride=2)  # halve spatial dimensions
```
Fully Connected Layers
Final classification:
```python
nn.Linear(in_features=512, out_features=10)  # 10 classes
```
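Chaining the layer types above, a quick shape trace shows how each one transforms a batch (a sketch; the channel counts here are arbitrary):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)          # batch of one 32x32 RGB image
x = nn.Conv2d(3, 32, 3, padding=1)(x)  # -> (1, 32, 32, 32): padding keeps H, W
x = nn.ReLU()(x)                       # shape unchanged
x = nn.MaxPool2d(2, 2)(x)              # -> (1, 32, 16, 16): spatial dims halved
x = nn.Flatten()(x)                    # -> (1, 32 * 16 * 16) = (1, 8192)
x = nn.Linear(8192, 10)(x)             # -> (1, 10): one score per class
print(x.shape)  # torch.Size([1, 10])
```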
Complete CNN Example
```python
import torch.nn as nn

class SimpleCNN(nn.Module):
    """A small CNN for 32x32 RGB inputs (e.g., CIFAR-10)."""

    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            # Conv Block 1: 32x32 -> 16x16
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),
            # Conv Block 2: 16x16 -> 8x8
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),
            # Conv Block 3: 8x8 -> 4x4
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 4 * 4, 512),  # 128 channels x 4x4 spatial
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(512, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        x = self.classifier(x)
        return x
```
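The `128 * 4 * 4` in the classifier isn't magic: each pooling layer halves the spatial dimensions, so a 32x32 input (assumed here, as for CIFAR-10) shrinks to 4x4 after three blocks:

```python
# Each MaxPool2d(2, 2) halves height and width
size = 32                 # assumed input resolution (e.g., CIFAR-10)
for block in range(3):    # three conv blocks, each ending in a pool
    size //= 2            # 32 -> 16 -> 8 -> 4
channels = 128            # output channels of the last conv layer
print(channels * size * size)  # 2048, i.e. 128 * 4 * 4
```

Feeding a different input resolution changes this number, which is the most common cause of shape-mismatch errors when adapting a CNN.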
Famous Architectures
LeNet (1998)
The original CNN for digit recognition. Simple architecture that proved the concept.
AlexNet (2012)
Won ImageNet, sparked the deep learning revolution:
- Deeper than LeNet
- ReLU activation
- Dropout regularization
- GPU training
VGG (2014)
Simple, uniform architecture:
- All 3x3 convolutions
- VGG-16, VGG-19 variants
- Easy to understand, still used today
ResNet (2015)
Introduced skip connections to train very deep networks:
```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        identity = x
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)
        out = self.conv2(out)
        out = self.bn2(out)
        out += identity  # skip connection!
        out = self.relu(out)
        return out
```
ResNet-50, ResNet-101, ResNet-152 remain popular.
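Why skip connections help can be seen directly: if a block's convolutional path contributes nothing (weights zeroed here purely for illustration), the input still flows through unchanged, so stacking more blocks cannot make the network worse:

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(16, 16, 3, padding=1, bias=False)
nn.init.zeros_(conv.weight)       # conv path outputs all zeros

x = torch.rand(1, 16, 8, 8)       # non-negative input, so ReLU is a no-op
out = torch.relu(conv(x) + x)     # residual form: F(x) + x
print(torch.equal(out, x))        # True: the identity survives the block
```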
YOLO (2016)
Real-time object detection:
- Single forward pass
- Predicts bounding boxes and classes
- Fast enough for video
Training CNNs
Data Augmentation
Critical for preventing overfitting:
```python
from torchvision import transforms

transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(10),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```
Transfer Learning
Don't train from scratch; use pretrained models:
```python
import torch.nn as nn
import torch.optim as optim
from torchvision import models

# Load pretrained ResNet
# (newer torchvision versions use the weights= argument instead)
model = models.resnet50(pretrained=True)

# Freeze early layers
for param in model.parameters():
    param.requires_grad = False

# Replace final layer (resnet50's final features are 2048-dim)
model.fc = nn.Linear(2048, num_classes)

# Train only the final layer
optimizer = optim.Adam(model.fc.parameters())
```
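The freeze-and-replace pattern works on any model. A minimal check with a toy network (hypothetical layer sizes) confirms that only the freshly created head stays trainable:

```python
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))

for param in model.parameters():   # freeze everything
    param.requires_grad = False

model[2] = nn.Linear(16, 2)        # new head: requires_grad=True by default

trainable = [p for p in model.parameters() if p.requires_grad]
print(sum(p.numel() for p in trainable))  # 34 = 16*2 weights + 2 biases
```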
Batch Normalization
Stabilizes training:
```python
nn.BatchNorm2d(num_features=64)
```
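What "stabilizes" means concretely: in training mode, batch norm rescales each channel of a batch to roughly zero mean and unit variance, regardless of how skewed the incoming activations are (a quick sketch):

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(num_features=4)
x = 10 * torch.randn(8, 4, 16, 16) + 5   # skewed activations: mean ~5, std ~10

out = bn(x)                              # training mode by default
print(out.mean(dim=(0, 2, 3)))           # per-channel means near 0
print(out.std(dim=(0, 2, 3)))            # per-channel stds near 1
```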
Visualizing What CNNs Learn
Feature Maps
Early layers detect edges and textures. Later layers detect complex patterns.
Gradient-weighted Class Activation Mapping (Grad-CAM)
Shows which regions influenced the prediction:
```python
# Compute gradients of the class score with respect to the feature maps
# Weight each feature map by its average gradient
# The result shows "where the network looked"
```
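Those three steps can be sketched with a forward hook on a toy model (the architecture and the class index chosen here are placeholders, not a full Grad-CAM implementation):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10),
)

feats = {}
def save_maps(module, inputs, output):
    output.retain_grad()          # keep gradients for this intermediate tensor
    feats["maps"] = output
model[1].register_forward_hook(save_maps)   # hook on the last conv activation

x = torch.randn(1, 3, 32, 32)
scores = model(x)
scores[0, 3].backward()           # gradient of one class's score

maps = feats["maps"]
weights = maps.grad.mean(dim=(2, 3), keepdim=True)  # average gradient per map
cam = torch.relu((weights * maps).sum(dim=1))       # weighted sum of maps
print(cam.shape)  # torch.Size([1, 32, 32]): one heatmap over the input grid
```

Upsampled to the input resolution and overlaid on the image, `cam` highlights the regions that pushed the chosen class score up.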
Common Applications
- Image Classification: Cat vs dog, medical imaging
- Object Detection: YOLO, Faster R-CNN
- Semantic Segmentation: Pixel-wise classification
- Face Recognition: Identity verification
- Pose Estimation: Human body keypoints
- Style Transfer: Artistic rendering
Practical Tips
- Start with pretrained models: Transfer learning almost always beats training from scratch
- Use data augmentation: Free regularization
- Monitor with TensorBoard: Watch loss curves, visualize filters
- Experiment with learning rate: Too high → diverge, too low → slow
- Start simple: Get a baseline before adding complexity
Final Thoughts
CNNs are remarkably effective at extracting visual features. The core ideas—local connectivity, weight sharing, hierarchical learning—mirror how biological vision systems work.
Understanding CNNs deeply will serve you well. They’re the foundation for most computer vision and increasingly appear in other domains like audio and text.
See the patterns. Learn the features.