VGG

Theoretical Introduction

Depth and Simplicity in Convolutional Neural Networks

The VGG architecture, developed by the Visual Geometry Group at the University of Oxford and presented in 2014, constitutes a milestone in the evolution of deep learning applied to computer vision. While LeNet-5 establishes the conceptual foundations of convolutional neural networks, VGG systematically demonstrates that increasing the depth of the network, combined with an extremely simple and homogeneous structure, leads to significant improvements in performance. The most influential variants of this family are VGG-16 and VGG-19, named according to the number of trainable layers that compose them.

The design of VGG is characterized by its structural simplicity: a sequence of convolutions of fixed size, periodically separated by pooling operations, followed by a block of fully connected layers. This minimalist philosophy contrasts with later, more complex architectures, such as Inception, and turns VGG into a fundamental didactic reference for understanding deep convolutional networks. The architecture shows that highly discriminative visual representations can be obtained through the systematic repetition of basic components, without the need to introduce complex operations or specialized modules.

Design Principles

The distinctive feature of the VGG architecture lies in its deliberately conservative and homogeneous design philosophy. Unlike other architectures that combine convolutional filters of different sizes (for example, \(5 \times 5\) or \(7 \times 7\)), VGG strictly standardizes its components and maintains a highly regular structure throughout the network.

The architecture employs only \(3 \times 3\) convolutional filters, which are applied repeatedly in sequence. As the network depth increases and the spatial resolution of the feature maps decreases, the number of filters is progressively increased, typically doubling this number after each max pooling operation. This strategy allows the network to incrementally enrich the representational capacity of the feature maps while controlling the computational cost in terms of spatial dimensions.

In addition, VGG makes systematic use of convolutions with appropriate padding in order to preserve the spatial resolution within each block. At the end of each block, it applies \(2 \times 2\) max pooling operations with stride 2 to reduce the spatial dimensions of the feature maps. This alternation between groups of convolutions with preserved resolution and pooling layers with downsampling produces a hierarchical representation of the input.

Within this hierarchy, the early layers capture local and low-level information, such as edges, corners, and simple textures. As the signal propagates through deeper layers, the successive application of \(3 \times 3\) convolutions over increasingly abstract feature maps enables the network to aggregate broader spatial context and encode more complex and semantically rich patterns. Consequently, the deeper layers represent higher-level, more abstract features, such as object parts or entire object configurations, which are especially useful for tasks like image classification and object recognition.

Justification for the Use of Small Filters

The choice of \(3 \times 3\) convolutional filters is justified by a combination of theoretical and practical arguments. Although LeNet-5 employs \(5 \times 5\) filters, the designers of VGG demonstrate that several consecutive layers of \(3 \times 3\) filters can emulate the receptive field of larger filters while requiring fewer parameters and incorporating a greater number of intermediate nonlinearities.

From the perspective of the receptive field, a single \(k \times k\) convolution can be approximated by a sequence of smaller convolutions, such as \(3 \times 3\), provided that padding is chosen appropriately. In particular, two \(3 \times 3\) layers possess an effective receptive field equivalent to a \(5 \times 5\) filter, and three \(3 \times 3\) layers approximate a \(7 \times 7\) receptive field. This decomposition proves advantageous both in terms of parameter efficiency and in terms of the expressive capacity of the model.

To understand this equivalence, it is useful to consider the notion of effective receptive field of a neuron in a convolutional network. Intuitively, the receptive field indicates the number of pixels from the original image that influence the activation of a neuron in a given layer. In the simplest case, with stride 1 and appropriate padding, a convolutional layer with filters of size \(5 \times 5\) possesses a receptive field of \(5 \times 5 = 25\) pixels. In contrast, two consecutive \(3 \times 3\) layers produce an effective receptive field of \(5 \times 5\): the first layer observes \(3 \times 3\) pixels of the image, and the second layer observes \(3 \times 3\) neurons of the previous layer, whose receptive fields overlap in such a manner that, combined, they cover \(5 \times 5\) pixels of the original image.

Architectural Organization of VGG-16

The VGG-16 variant receives color (RGB) input images of size \(224 \times 224\) pixels and is organized as a hierarchical sequence of five convolutional blocks, followed by a set of fully connected layers responsible for the final classification. Each block groups several \(3 \times 3\) convolutions followed by a \(2 \times 2\) max pooling operation.

In its original configuration for ImageNet, the architecture is organized as follows:

Block	Conv Layers	Filters	Typical Output Size	Pooling
Block 1	2	64	\(112 \times 112 \times 64\)	MaxPool \(2 \times 2\)
Block 2	2	128	\(56 \times 56 \times 128\)	MaxPool \(2 \times 2\)
Block 3	3	256	\(28 \times 28 \times 256\)	MaxPool \(2 \times 2\)
Block 4	3	512	\(14 \times 14 \times 512\)	MaxPool \(2 \times 2\)
Block 5	3	512	\(7 \times 7 \times 512\)	MaxPool \(2 \times 2\)
FC6	-	4096	4096	-
FC7	-	4096	4096	-
FC8	-	1000	1000	Softmax

The first two convolutional blocks are composed of two convolutional layers each. The first block employs 64 filters, while the second employs 128 filters, always with filter size \(3 \times 3\). At the end of each block, a \(2 \times 2\) max pooling operation is applied, whose role is to halve the spatial resolution and concentrate the most relevant information. The third, fourth, and fifth blocks increase the effective depth of the network by means of three consecutive convolutional layers per block. The number of filters increases to 256 in the third block and to 512 in the last two blocks. At these deeper stages, the network learns highly abstract features, such as object parts, complex textures, and high-level visual configurations, which are crucial for class discrimination.

After the convolutional blocks, the resulting feature maps are transformed into a one-dimensional vector that feeds the classifier layers. The final segment consists of two dense layers with 4096 neurons each, followed by an output layer with 1000 neurons, corresponding to the thousand categories of the ImageNet dataset. The Softmax activation makes it possible to interpret the output as a probability distribution over classes. Throughout the architecture, the ReLU activation function is used, which accelerates training and helps mitigate the vanishing gradient problem. The combination of these design decisions—moderate depth, small filters, homogeneous structure—positions VGG-16 as a high-performance model on ImageNet, though at the cost of a very large number of parameters.

Impact, Advantages, and Limitations

VGG-16 achieves second place in the 2014 ImageNet challenge, but its impact extends far beyond the competition. The scientific and technical community adopts this architecture as a reference due to its clear, regular, and easily interpretable design. This clarity makes it a fundamental tool both for research and teaching on deep convolutional networks, as well as for the development of numerous subsequent works in transfer learning and feature extraction.

Among its advantages, it is worth highlighting its homogeneous and modular architecture, which facilitates implementation and experimentation; its excellent capacity for learning hierarchical features in images; and its suitability for transfer learning. The initial layers of VGG learn generic and robust representations focused on edges, textures, and local patterns that can be reused effectively in a wide variety of computer vision tasks by adapting only the final layers.

However, VGG also presents significant limitations. The very large number of parameters (on the order of 138 million in VGG-16) implies considerable memory consumption (around 500 MB in 32-bit precision) and a high computational cost both in training and inference. These characteristics make the architecture unsuitable for devices with limited resources and increase the cost of large-scale production deployment. In addition, a substantial portion of the parameters is concentrated in the final fully connected layers, which has motivated later architectures to replace these layers with more efficient mechanisms such as global average pooling.

These constraints have driven the development of subsequent architectures such as Inception, ResNet, or MobileNet, which aim to maintain or improve performance while reducing computational cost, facilitating the training of deeper networks, and adapting to resource-constrained environments. Despite this, VGG remains a classic reference model due to its conceptual transparency and its capacity to serve as a starting point in numerous practical applications.

Practical Implementation of VGG-16 on CIFAR-10

This section implements a variant of VGG-16 adapted to the CIFAR-10 dataset, which contains color images of \(32 \times 32\) pixels belonging to 10 categories. The original architecture, designed for ImageNet (\(224 \times 224\)), is modified to accommodate the smaller input size and the reduced number of classes in CIFAR-10, while preserving the VGG design philosophy.

Library Imports

The following modules are imported to implement VGG-16 and train it on CIFAR-10. The goal is to have at our disposal the PyTorch tools required to define the model, manage data, train the network, and analyze the results in a systematic way.

# Standard libraries
import time
from typing import Any, List

# 3pps
# Third-party libraries
import matplotlib.pyplot as plt
import numpy as np
import torch
from sklearn.manifold import TSNE
from torch import nn
from torch.utils.data import DataLoader
from torchinfo import summary
from torchvision import datasets, transforms
from tqdm import tqdm

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")

This block verifies GPU compatibility when available and provides basic information about the execution environment, which is useful for reproducing experiments and diagnosing configuration issues.

Global Hyperparameter Configuration

Constants used throughout the implementation are defined next, such as batch size, number of epochs, learning rate, and number of classes. Centralizing these parameters facilitates experimentation and model tuning.

# Global configuration
BATCH_SIZE: int = 128
NUM_EPOCHS: int = 1
LEARNING_RATE: float = 1e-3
WEIGHT_DECAY: float = 5e-4
NUM_CLASSES: int = 10  # CIFAR-10 has 10 classes
INPUT_SIZE: int = 32  # CIFAR-10: 32×32 images

# CIFAR-10 class names
CIFAR10_CLASSES = [
    "airplane",
    "automobile",
    "bird",
    "cat",
    "deer",
    "dog",
    "frog",
    "horse",
    "ship",
    "truck",
]

print("Configuration:")
print(f"  Batch size: {BATCH_SIZE}")
print(f"  Epochs: {NUM_EPOCHS}")
print(f"  Learning rate: {LEARNING_RATE}")
print(f"  Number of classes: {NUM_CLASSES}")

This configuration provides a reasonable starting point for training VGG-16 on CIFAR-10, establishing a compromise between performance and computational cost that can be adjusted according to available resources.

Auxiliary Visualization Function

Visual exploration of the data helps to better understand the problem and to verify that preprocessing is applied correctly. A function is defined to visualize images with their labels and, optionally, with model predictions.

def show_images(images, labels, predictions=None, classes=CIFAR10_CLASSES):
    """
    Visualizes a set of images with their labels.

    Args:
        images: Image tensor [N, C, H, W]
        labels: Label tensor [N]
        predictions: Optional prediction tensor [N]
        classes: List of class names
    """
    n_images = len(images)
    fig, axes = plt.subplots(1, n_images, figsize=(2 * n_images, 3))

    if n_images == 1:
        axes = [axes]

    for idx, (img, label, ax) in enumerate(zip(images, labels, axes)):
        # Denormalize image for visualization
        img = img / 2 + 0.5  # Revert normalization [-1, 1] -> [0, 1]
        img = img.numpy().transpose((1, 2, 0))

        ax.imshow(img)

        title = f"True: {classes[label]}"
        if predictions is not None:
            pred = predictions[idx]
            color = "green" if pred == label else "red"
            title += f"\nPred: {classes[pred]}"
            ax.set_title(title, fontsize=10, color=color, fontweight="bold")
        else:
            ax.set_title(title, fontsize=10, fontweight="bold")

        ax.axis("off")

    plt.tight_layout()
    plt.show()

print("Visualization function defined correctly")

This function is used later to inspect both samples from the dataset and correct and incorrect model predictions, providing an essential visual diagnostic tool.

CIFAR-10 Dataset Preparation

The CIFAR-10 data are loaded and the data augmentation and normalization transformations applied to the images are defined. CIFAR-10 contains 60,000 color images of \(32 \times 32\) pixels distributed across 10 classes and presents greater variability and complexity than MNIST due to the diversity of objects, backgrounds, and capture conditions.

from torch.utils.data import Subset

# Transformations with data augmentation for training
transform_train = transforms.Compose(
    [
        transforms.RandomCrop(32, padding=4),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize(
            (0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)
        ),
    ]
)

# Transformations for validation (no augmentation)
transform_test = transforms.Compose(
    [
        transforms.ToTensor(),
        transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
    ]
)

print("Downloading CIFAR-10 dataset...")

# Training set
train_dataset_full = datasets.CIFAR10(
    root="./data", train=True, download=True, transform=transform_train
)

# Test set
test_dataset_full = datasets.CIFAR10(
    root="./data", train=False, download=True, transform=transform_test
)

# Limit samples
train_dataset = Subset(train_dataset_full, range(5000))
test_dataset = Subset(test_dataset_full, range(1000))

print("\nDataset statistics:")
print(f"  Training samples: {len(train_dataset):,}")
print(f"  Test samples: {len(test_dataset):,}")
print(f"  Number of classes: {len(train_dataset_full.classes)}")
print(f"  Classes: {', '.join(train_dataset_full.classes)}")
print("  Image size: 32×32 pixels (RGB)")

Data augmentation is introduced to increase the generalization capacity of the model. RandomCrop with padding simulates variations in framing and object position, while RandomHorizontalFlip increases robustness to horizontal symmetries. Both mechanisms reduce overfitting by generating slightly different versions of each image at each epoch, effectively expanding the training set without requiring additional data.

Normalization is performed using the mean and standard deviation of the complete CIFAR-10 dataset for each RGB channel:

\[ \mu = (0.4914, 0.4822, 0.4465), \quad \sigma = (0.2470, 0.2435, 0.2616). \]

This operation centers and scales the values of each channel using the transformation:

\[ x_{\text{normalized}} = \frac{x - \mu}{\sigma}, \]

which facilitates optimization and stabilizes training by improving the numerical conditioning of gradients.

DataLoader Creation

Once the datasets are defined, DataLoader objects are created to manage batch iteration during training and evaluation, including optimizations to accelerate data loading.

print("Configuring DataLoaders:")
print(f"  Batch size: {BATCH_SIZE}")

# Training DataLoader
train_dataloader = DataLoader(
    dataset=train_dataset,
    batch_size=BATCH_SIZE,
    shuffle=True,
    num_workers=4,
    persistent_workers=True,
    pin_memory=True,
)

# Test DataLoader
test_dataloader = DataLoader(
    dataset=test_dataset,
    batch_size=BATCH_SIZE,
    shuffle=False,
    num_workers=4,
    persistent_workers=True,
    pin_memory=True,
)

print(f"  Training batches: {len(train_dataloader)}")
print(f"  Test batches: {len(test_dataloader)}")

Common PyTorch optimizations are applied. The parameter num_workers=4 allows data loading in parallel via auxiliary processes, taking advantage of multiple CPU cores. The use of pin_memory=True improves the speed of data transfer to the GPU by using pinned (non-pageable) memory. Finally, persistent_workers=True avoids recreating the worker processes at each epoch, reducing initialization overhead and accelerating the data pipeline.

Initial Visual Exploration of the Dataset

Before defining the model, it is useful to inspect some images from the training set and verify the tensor dimensions and the effect of normalization.

# Obtain one data batch
data_iter = iter(train_dataloader)
train_images, train_labels = next(data_iter)

print("Batch dimensions:")
print(f"  Images: {train_images.shape}")
print(f"  Labels: {train_labels.shape}")
print(f"\n  Interpretation: {BATCH_SIZE} RGB images of size 32×32 pixels")

# Visualize first 8 examples
print("\nVisualizing first 8 samples...")
show_images(train_images[:8], train_labels[:8])

# Statistics of normalized images
print("\nStatistics after normalization:")
print(f"  Min value: {train_images.min():.3f}")
print(f"  Max value: {train_images.max():.3f}")
print("  Mean per channel:")
for i, channel in enumerate(["R", "G", "B"]):
    print(f"    {channel}: {train_images[:, i, :, :].mean():.3f}")

This analysis allows the verification that data are correctly loaded, preprocessing is applied appropriately, and data augmentation transformations produce reasonable variations without excessively distorting the images.

Definition of the VGG-16 Architecture for CIFAR-10

A VGG-16 variant adapted to \(32 \times 32\) images is implemented next. The original architecture for ImageNet is designed for \(224 \times 224\), so adjustments in the final layers are required to adapt to the reduced input size and the smaller number of classes.

class VGG16(nn.Module):
    """
    Implementation of VGG-16 adapted for CIFAR-10 (32×32 pixels).

    Architecture:
    - 5 convolutional blocks with configuration [64, 128, 256, 512, 512]
    - All filters are 3×3
    - 2×2 MaxPooling after each block
    - 3 fully connected layers at the end
    - BatchNorm to stabilize training
    - Dropout for regularization
    """

    def __init__(self, num_classes: int = 10, **kwargs: Any) -> None:
        super().__init__(**kwargs)

        self.num_classes = num_classes

        # Block 1: 2 conv layers with 64 filters
        self.block1 = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),  # 32×32 -> 16×16
        )

        # Block 2: 2 conv layers with 128 filters
        self.block2 = nn.Sequential(
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),  # 16×16 -> 8×8
        )

        # Block 3: 3 conv layers with 256 filters
        self.block3 = nn.Sequential(
            nn.Conv2d(128, 256, kernel_size=3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),  # 8×8 -> 4×4
        )

        # Block 4: 3 conv layers with 512 filters
        self.block4 = nn.Sequential(
            nn.Conv2d(256, 512, kernel_size=3, padding=1),
            nn.BatchNorm2d(512),
            nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, kernel_size=3, padding=1),
            nn.BatchNorm2d(512),
            nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, kernel_size=3, padding=1),
            nn.BatchNorm2d(512),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),  # 4×4 -> 2×2
        )

        # Block 5: 3 conv layers with 512 filters
        self.block5 = nn.Sequential(
            nn.Conv2d(512, 512, kernel_size=3, padding=1),
            nn.BatchNorm2d(512),
            nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, kernel_size=3, padding=1),
            nn.BatchNorm2d(512),
            nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, kernel_size=3, padding=1),
            nn.BatchNorm2d(512),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),  # 2×2 -> 1×1
        )

        # Classifier layers
        # For CIFAR-10 (32×32), after 5 poolings: 32 / 2^5 = 1
        # Therefore: 512 × 1 × 1 = 512 features
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(512 * 1 * 1, 512),
            nn.BatchNorm1d(512),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(512, 512),
            nn.BatchNorm1d(512),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(512, num_classes),
        )

        # Weight initialization
        self._initialize_weights()

    def _initialize_weights(self):
        """
        Initializes weights using He initialization for layers with ReLU.
        """
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode="fan_out", nonlinearity="relu")
                if m.bias is not None:
                    nn.init.constant_(m.bias, 0)
            elif isinstance(m, (nn.BatchNorm2d, nn.BatchNorm1d)):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.Linear):
                nn.init.normal_(m.weight, 0, 0.01)
                nn.init.constant_(m.bias, 0)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Forward pass of VGG-16.

        Args:
            x: Input tensor [B, 3, 32, 32]

        Returns:
            Classification logits [B, num_classes]
        """
        x = self.block1(x)
        x = self.block2(x)
        x = self.block3(x)
        x = self.block4(x)
        x = self.block5(x)
        x = self.classifier(x)
        return x

    def get_features(self, x: torch.Tensor) -> torch.Tensor:
        """
        Extracts features before the final classification layer.
        Useful for embedding visualization.
        """
        x = self.block1(x)
        x = self.block2(x)
        x = self.block3(x)
        x = self.block4(x)
        x = self.block5(x)
        x = torch.flatten(x, 1)
        return x

print("VGG-16 architecture defined correctly")

Model Instantiation and Complexity Analysis

Once the architecture is defined, the model is instantiated, moved to the appropriate device (CPU or GPU), and its structure and parameter count are analyzed to verify that the implementation is correct.

# Create the model
model = VGG16(num_classes=NUM_CLASSES)

# Determine available device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

print(f"Device used: {device}")
print(f"\n{'=' * 70}")
print("VGG-16 ARCHITECTURE SUMMARY")
print(f"{'=' * 70}\n")

# Detailed architecture summary
summary(model, input_size=(BATCH_SIZE, 3, 32, 32), device=str(device))

# Parameter count per block
print(f"\n{'=' * 70}")
print("PARAMETER ANALYSIS PER BLOCK")
print(f"{'=' * 70}")

def count_parameters(module):
    return sum(p.numel() for p in module.parameters())

print(f"  Block 1: {count_parameters(model.block1):>12,} parameters")
print(f"  Block 2: {count_parameters(model.block2):>12,} parameters")
print(f"  Block 3: {count_parameters(model.block3):>12,} parameters")
print(f"  Block 4: {count_parameters(model.block4):>12,} parameters")
print(f"  Block 5: {count_parameters(model.block5):>12,} parameters")
print(f"  Classifier: {count_parameters(model.classifier):>12,} parameters")
print(f"  {'-' * 66}")

total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"  TOTAL: {total_params:>12,} parameters")
print(f"  Trainable: {trainable_params:>12,} parameters")
print(f"  Memory (float32): {total_params * 4 / (1024 ** 2):>10.2f} MB")

In the original VGG-16 for ImageNet, about 14.7 million parameters correspond to the convolutional layers and around 123 million to the fully connected layers (approximately 89 % of the total). The CIFAR-10 variant drastically reduces the parameters of the dense layers by going from 4096 to 512 neurons, resulting in a more manageable model suited to this dataset, although it remains considerably large compared with more modern, efficient architectures.

Training Configuration

The optimizer, loss function, and a learning rate scheduler are now defined to improve convergence and adapt dynamically to the evolution of training.

print("TRAINING CONFIGURATION")
print(f"{'=' * 70}")
print(f"  Epochs: {NUM_EPOCHS}")
print(f"  Initial learning rate: {LEARNING_RATE}")
print(f"  Weight decay (L2): {WEIGHT_DECAY}")
print(f"  Batch size: {BATCH_SIZE}")
print(f"{'=' * 70}\n")

# Optimizer: SGD with momentum
optimizer = torch.optim.SGD(
    params=model.parameters(), lr=LEARNING_RATE, momentum=0.9, weight_decay=WEIGHT_DECAY
)

# Learning rate scheduler: reduces LR when progress plateaus
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer,
    mode="max",  # Monitor accuracy (to be maximized)
    factor=0.1,  # Reduce LR to 10 % of current value
    patience=3,  # After 3 epochs without improvement
    min_lr=1e-6,
)

# Loss function: Cross-Entropy
loss_function = nn.CrossEntropyLoss()

print("Optimizer: SGD with momentum=0.9")
print("  SGD with momentum accumulates gradients with exponential decay")
print("  This helps escape local minima and accelerates convergence")
print("\nScheduler: ReduceLROnPlateau")
print("  Reduces the learning rate when accuracy stops improving")
print("  Reduction factor: 0.1 (LR → 0.1 × LR)")
print("  Patience: 3 epochs")
print("\nLoss function: CrossEntropyLoss")

Historically, for architectures such as VGG, SGD with momentum has been very effective, especially when sufficient training time is available and the learning rate is carefully tuned. The momentum term accumulates past gradients as follows:

\[ v_t = \beta v_{t-1} + (1 - \beta)\nabla L(\theta_{t-1}), \quad \theta_t = \theta_{t-1} - \text{lr} \cdot v_t, \]

where \(0 < \beta < 1\) is the momentum coefficient, typically 0.9. This mechanism accelerates descent in directions of consistent gradient and damps oscillations in directions of high curvature, improving both the speed of convergence and training stability.

The ReduceLROnPlateau scheduler adjusts the learning rate dynamically based on the evolution of performance on the validation set. When the monitored metric, in this case test accuracy, stops improving for a given number of epochs, the patience parameter, the scheduler reduces the learning rate, allowing a finer adjustment of parameters near a local optimum.

Training Loop

The training loop consists of two phases per epoch: a training phase, during which the model parameters are updated, and a validation phase, during which performance is evaluated without modifying the weights. Losses and accuracies on both phases are recorded, as well as the evolution of the learning rate.

from tqdm import tqdm
import time

# Lists to store metrics
train_losses, train_accuracies = [], []
test_losses, test_accuracies = [], []
learning_rates = []

# Auxiliary function to compute accuracy
def calculate_accuracy(outputs, labels):
    _, predicted = torch.max(outputs, 1)
    correct = (predicted == labels).sum().item()
    total = labels.size(0)
    return correct, total

print("STARTING TRAINING\n")
print(f"{'=' * 70}\n")

# Start time
start_time = time.time()

for epoch in range(NUM_EPOCHS):
    epoch_start_time = time.time()

    # ============ TRAINING PHASE ============
    model.train()
    running_loss, correct, total = 0.0, 0, 0

    train_loop = tqdm(
        train_dataloader, desc=f"Epoch {epoch + 1}/{NUM_EPOCHS} [TRAIN]", leave=False
    )

    for batch_image, batch_label in train_loop:
        batch_image = batch_image.to(device)
        batch_label = batch_label.to(device)

        optimizer.zero_grad()

        outputs = model(batch_image)
        loss = loss_function(outputs, batch_label)

        loss.backward()
        optimizer.step()

        running_loss += loss.item()
        batch_correct, batch_total = calculate_accuracy(outputs, batch_label)
        correct += batch_correct
        total += batch_total

        train_loop.set_postfix(
            {"loss": f"{loss.item():.4f}", "acc": f"{100 * correct / total:.2f}%"}
        )

    epoch_train_loss = running_loss / len(train_dataloader)
    epoch_train_acc = 100 * correct / total
    train_losses.append(epoch_train_loss)
    train_accuracies.append(epoch_train_acc)

    # ============ VALIDATION PHASE ============
    model.eval()
    test_loss, correct_test, total_test = 0.0, 0, 0

    test_loop = tqdm(
        test_dataloader, desc=f"Epoch {epoch + 1}/{NUM_EPOCHS} [TEST]", leave=False
    )

    with torch.no_grad():
        for images, labels in test_loop:
            images = images.to(device)
            labels = labels.to(device)

            outputs = model(images)
            loss = loss_function(outputs, labels)

            test_loss += loss.item()
            batch_correct, batch_total = calculate_accuracy(outputs, labels)
            correct_test += batch_correct
            total_test += batch_total

            test_loop.set_postfix(
                {
                    "loss": f"{loss.item():.4f}",
                    "acc": f"{100 * correct_test / total_test:.2f}%",
                }
            )

    epoch_test_loss = test_loss / len(test_dataloader)
    epoch_test_acc = 100 * correct_test / total_test
    test_losses.append(epoch_test_loss)
    test_accuracies.append(epoch_test_acc)

    # Update learning rate scheduler
    scheduler.step(epoch_test_acc)
    current_lr = optimizer.param_groups[0]["lr"]
    learning_rates.append(current_lr)

    # Epoch time
    epoch_time = time.time() - epoch_start_time

    # Epoch report
    print(f"Epoch [{epoch + 1}/{NUM_EPOCHS}] - Time: {epoch_time:.2f}s")
    print(f"  Train → Loss: {epoch_train_loss:.4f} | Acc: {epoch_train_acc:.2f}%")
    print(f"  Test  → Loss: {epoch_test_loss:.4f} | Acc: {epoch_test_acc:.2f}%")
    print(f"  LR: {current_lr:.6f}")
    print(f"  {'─' * 66}\n")

# Total time
total_time = time.time() - start_time

print(f"\n{'=' * 70}")
print("TRAINING COMPLETED")
print(f"{'=' * 70}")
print(f"  Total time: {total_time / 60:.2f} minutes")
print(f"  Average time per epoch: {total_time / NUM_EPOCHS:.2f} seconds")
print(f"  Final accuracy (train): {train_accuracies[-1]:.2f}%")
print(f"  Final accuracy (test): {test_accuracies[-1]:.2f}%")
print(f"  Best accuracy (test): {max(test_accuracies):.2f}%")

# Save the model
torch.save(
    {
        "epoch": NUM_EPOCHS,
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
        "train_losses": train_losses,
        "train_accuracies": train_accuracies,
        "test_losses": test_losses,
        "test_accuracies": test_accuracies,
    },
    "vgg16_cifar10.pth",
)

print("\nModel saved as 'vgg16_cifar10.pth'")

VGG is computationally intensive due to the number of convolution operations and the large number of channels in the deeper layers. On a modern GPU (for example, an RTX 3080), each epoch may require on the order of tens of seconds with the described configuration, whereas on CPU the process can be an order of magnitude slower. The instruction model.train() activates training-specific behaviors, such as updating BatchNorm statistics and applying Dropout, while model.eval() deactivates these behaviors to ensure deterministic and reproducible evaluation.

Visualization of Training Metrics

Inspecting the evolution of loss, accuracy, and learning rate across epochs makes it possible to identify potential issues such as overfitting, training stagnation, or inadequate learning rate schedules.

import os

os.makedirs("results", exist_ok=True)

epochs_range = range(1, NUM_EPOCHS + 1)

# Create figure with three subplots
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(18, 5))

# Loss plot
ax1.plot(
    epochs_range, train_losses, "o-", label="Train Loss", linewidth=2, markersize=6
)
ax1.plot(epochs_range, test_losses, "s-", label="Test Loss", linewidth=2, markersize=6)
ax1.set_xlabel("Epoch", fontsize=12, fontweight="bold")
ax1.set_ylabel("Loss", fontsize=12, fontweight="bold")
ax1.set_title("Loss Evolution", fontsize=14, fontweight="bold")
ax1.legend(fontsize=11)
ax1.grid(True, alpha=0.3)
ax1.set_xticks(list(epochs_range))

# Accuracy plot
ax2.plot(
    epochs_range,
    train_accuracies,
    "o-",
    label="Train Accuracy",
    linewidth=2,
    markersize=6,
)
ax2.plot(
    epochs_range,
    test_accuracies,
    "s-",
    label="Test Accuracy",
    linewidth=2,
    markersize=6,
)
ax2.set_xlabel("Epoch", fontsize=12, fontweight="bold")
ax2.set_ylabel("Accuracy (%)", fontsize=12, fontweight="bold")
ax2.set_title("Accuracy Evolution", fontsize=14, fontweight="bold")
ax2.legend(fontsize=11)
ax2.grid(True, alpha=0.3)
ax2.set_xticks(list(epochs_range))

# Learning rate plot
ax3.plot(epochs_range, learning_rates, "o-", color="red", linewidth=2, markersize=6)
ax3.set_xlabel("Epoch", fontsize=12, fontweight="bold")
ax3.set_ylabel("Learning Rate", fontsize=12, fontweight="bold")
ax3.set_title("Learning Rate Schedule", fontsize=14, fontweight="bold")
ax3.set_yscale("log")
ax3.grid(True, alpha=0.3)
ax3.set_xticks(list(epochs_range))

plt.tight_layout()
plt.savefig("results/vgg16_training_history.png", dpi=300, bbox_inches="tight")
plt.show()

# Quantitative analysis
print("\nResult analysis:")
diff = train_accuracies[-1] - test_accuracies[-1]
print(f"  Overfitting detected: {'YES' if diff > 10 else 'NO'}")
print(f"  Train-test gap: {diff:.2f}%")
print(f"  Best epoch (test acc): {np.argmax(test_accuracies) + 1}")
print(f"  Gain since epoch 1: {test_accuracies[-1] - test_accuracies[0]:.2f}%")

The loss and accuracy curves help detect typical behaviors: if training loss decreases while validation loss increases, overfitting is evident; if both remain high, the model may be underfitting or the learning rate may be inadequate. The learning-rate plot on a logarithmic scale shows the stepwise decreases produced by the scheduler, which usually coincide with refinement phases during which the model parameters are adjusted more precisely.

Confusion Matrix and Per-Class Analysis

The confusion matrix provides a detailed view of per-class performance and helps identify systematic error patterns, thereby clarifying which categories are more difficult for the model to distinguish.

# 3pps
import os
import seaborn as sns
from sklearn.metrics import classification_report, confusion_matrix

os.makedirs("results", exist_ok=True)

print("Generating confusion matrix...")

# Obtain all predictions
model.eval()
all_predictions = []
all_labels = []

with torch.no_grad():
    for images, labels in tqdm(test_dataloader, desc="Evaluating"):
        images = images.to(device)
        outputs = model(images)
        _, predicted = torch.max(outputs, 1)

        all_predictions.extend(predicted.cpu().numpy())
        all_labels.extend(labels.numpy())

# Compute confusion matrix
cm = confusion_matrix(all_labels, all_predictions)

# Visualize confusion matrix
plt.figure(figsize=(12, 10))
sns.heatmap(
    cm,
    annot=True,
    fmt="d",
    cmap="Blues",
    xticklabels=CIFAR10_CLASSES,
    yticklabels=CIFAR10_CLASSES,
    cbar_kws={"label": "Number of samples"},
)
plt.xlabel("Prediction", fontsize=12, fontweight="bold")
plt.ylabel("True Label", fontsize=12, fontweight="bold")
plt.title("Confusion Matrix - VGG16 on CIFAR-10", fontsize=14, fontweight="bold")
plt.tight_layout()
plt.savefig("results/vgg16_confusion_matrix.png", dpi=300, bbox_inches="tight")
plt.show()

# Classification report
print("\nClassification Report:")
print("=" * 70)
print(
    classification_report(
        all_labels, all_predictions, target_names=CIFAR10_CLASSES, digits=3
    )
)

# Per-class accuracy analysis
print("\nPer-Class Accuracy Analysis:")
print("=" * 70)
class_correct = cm.diagonal()
class_total = cm.sum(axis=1)
class_accuracy = 100 * class_correct / class_total

for idx, class_name in enumerate(CIFAR10_CLASSES):
    print(
        f"  {class_name:12s}: {class_accuracy[idx]:6.2f}% "
        f"({class_correct[idx]:4d}/{class_total[idx]:4d})"
    )

# Most confused class pairs
print("\nMost Confused Class Pairs:")
print("=" * 70)
confusion_pairs = []
for i in range(len(CIFAR10_CLASSES)):
    for j in range(len(CIFAR10_CLASSES)):
        if i != j:
            confusion_pairs.append((cm[i, j], CIFAR10_CLASSES[i], CIFAR10_CLASSES[j]))

confusion_pairs.sort(reverse=True)
for count, true_class, pred_class in confusion_pairs[:5]:
    print(f"  {true_class:12s} → {pred_class:12s}: {count:4d} times")

The main diagonal of the matrix reflects correct classifications, while off-diagonal elements quantify confusions between class pairs. On CIFAR-10 it is common to see confusions between cat and dog, automobile and truck, or bird and airplane, indicating that certain categories share similar visual patterns from the model’s perspective. This analysis is valuable for identifying model limitations and guiding potential improvements, such as collecting additional data for problematic classes or applying class-balancing techniques.

Visualization of Correct and Incorrect Predictions

To better understand model behavior, it is useful to visualize some correct predictions and some errors, providing a qualitative perspective that complements quantitative metrics.

print("Visualizing model predictions...\n")

# Obtain one test batch
data_iter = iter(test_dataloader)
test_images, test_labels = next(data_iter)

# Make predictions
model.eval()
with torch.no_grad():
    test_images_device = test_images.to(device)
    outputs = model(test_images_device)
    _, predictions = torch.max(outputs, 1)
    predictions = predictions.cpu()

# Visualize first 8 predictions
print("First 8 predictions:")
show_images(test_images[:8], test_labels[:8], predictions[:8])

# Find misclassified examples
incorrect_indices = (predictions != test_labels).nonzero(as_tuple=True)[0]

if len(incorrect_indices) >= 8:
    print("\nExamples of incorrect predictions:")
    error_indices = incorrect_indices[:8]
    show_images(
        test_images[error_indices],
        test_labels[error_indices],
        predictions[error_indices],
    )
else:
    print(f"\nOnly {len(incorrect_indices)} errors in this batch")

This qualitative analysis helps detect systematic error patterns, such as consistently confusing one type of vehicle with another or certain animal classes with one another. Visual inspection of errors can also reveal issues in the data, such as incorrect labels or ambiguous images that would be difficult for a human to classify.

Extraction and Visualization of Intermediate Features

One of VGG’s strengths is its ability to learn hierarchical features in depth. Activations of intermediate layers can be inspected to better understand which types of patterns the network captures at each convolutional block.

import os

os.makedirs("results", exist_ok=True)

def get_activation_maps(model, image, layer_name):
    """
    Extracts activation maps from a specific layer.
    """
    activations = {}

    def hook_fn(module, input, output):
        activations["output"] = output

    # Register hook
    layer = dict([*model.named_modules()])[layer_name]
    hook = layer.register_forward_hook(hook_fn)

    # Forward pass
    model.eval()
    with torch.no_grad():
        _ = model(image.unsqueeze(0).to(device))

    hook.remove()
    return activations["output"].squeeze().cpu()

# Select one test image
test_image, test_label = test_dataset[0]
print(f"Analyzing image of class: {CIFAR10_CLASSES[test_label]}\n")

# Show original image
plt.figure(figsize=(4, 4))
img_display = test_image / 2 + 0.5  # Denormalize
plt.imshow(img_display.permute(1, 2, 0))
plt.title(f"Original Image: {CIFAR10_CLASSES[test_label]}", fontweight="bold")
plt.axis("off")
plt.tight_layout()
plt.show()

# Select and visualize activations from different blocks
layers_to_visualize = {
    "block1": "block1.0",  # First conv of block1
    "block2": "block2.0",  # First conv of block2
    "block3": "block3.0",  # First conv of block3
    "block5": "block5.0",  # First conv of block5
}

for block_name, layer_name in layers_to_visualize.items():
    print(f"Visualizing activations of {block_name}...")

    activations = get_activation_maps(model, test_image, layer_name)

    # Select first 16 filters for visualization
    num_filters = min(16, activations.shape[0])

    fig, axes = plt.subplots(4, 4, figsize=(12, 12))
    fig.suptitle(
        f"Activations of {block_name.upper()} - " f"{CIFAR10_CLASSES[test_label]}",
        fontsize=16,
        fontweight="bold",
    )

    for idx, ax in enumerate(axes.flat):
        if idx < num_filters:
            activation = activations[idx]
            ax.imshow(activation, cmap="viridis")
            ax.set_title(f"Filter {idx}", fontsize=10)
        ax.axis("off")

    plt.tight_layout()
    plt.savefig(f"results/vgg16_activations_{block_name}.png", dpi=300, bbox_inches="tight")
    plt.show()

print("\nInterpretation of block-wise activations:")
print("=" * 70)
print("  Block 1: Detects low-level features")
print("           (edges, corners, simple color variations)")
print("  Block 2-3: Detect mid-level patterns")
print("             (textures, more complex shapes, repetitive patterns)")
print("  Block 4-5: Detect high-level features")
print("             (object parts, combinations of textures and shapes)")