LeNet

Historical Context and Relevance of LeNet-5

The LeNet-5 architecture, developed by Yann LeCun and his collaborators between 1988 and 1998, constitutes one of the earliest and most influential convolutional neural network architectures. It was specifically designed to address the problem of automatic handwritten character recognition, a task of considerable practical importance at the time. LeNet-5 was not merely a theoretical contribution, but was successfully deployed in real-world industrial systems, most notably for automatic check processing in the United States. For this reason, it is widely regarded as one of the first concrete and large-scale applications of deep learning in an operational setting.

A central contribution of LeNet-5 lies in its demonstration that hierarchical feature representations can be learned directly from raw image data. By progressively transforming the input through multiple layers, the network captures increasingly abstract patterns while preserving the underlying spatial organization of the image. This approach leads to a significant reduction in the number of trainable parameters when compared to traditional fully connected multilayer perceptrons. In earlier architectures, each pixel was treated as an independent input feature, resulting in a loss of spatial information and a rapid growth in the number of parameters as input resolution increased. LeNet-5 overcomes these limitations by explicitly leveraging the two-dimensional structure of images and the strong local correlations that exist between neighboring pixels.

The architecture illustrates how the coordinated use of convolutional layers, subsampling operations, and nonlinear activation functions enables the construction of models that are both expressive and computationally efficient. Convolutional layers enforce local connectivity and parameter sharing, subsampling layers progressively reduce spatial resolution while increasing robustness, and nonlinearities allow the network to model complex decision boundaries. Together, these components yield systems that are resilient to variations in position, scale, and moderate geometric deformations of handwritten characters, without incurring prohibitive computational costs.

Through these design choices, LeNet-5 established a set of architectural principles that continue to underpin modern convolutional neural networks used in computer vision today. Its emphasis on spatial locality, hierarchical feature learning, and efficiency makes it a direct conceptual ancestor of many contemporary deep learning models, and a foundational milestone in the historical development of neural network architectures.

Conceptual Foundations of LeNet-5

Before convolutional architectures such as LeNet-5, neural approaches to image recognition typically relied on fully connected multilayer perceptrons. This design has a fundamental structural limitation: the two-dimensional image is flattened into a one-dimensional vector before it is processed, so the network receives no explicit information about the spatial arrangement of pixels. Neighboring pixels are treated no differently from pixels that are far apart, which prevents the model from exploiting spatial locality and makes it highly sensitive to small translations, local deformations, or changes in the position of the object within the image. Moreover, the number of weights in the first layer grows in proportion to the number of pixels, which leads to high computational cost, harder training, and a strong tendency toward overfitting, particularly when training data are limited.
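The sensitivity to translation can be illustrated with a small sketch. The helper names (flatten, vertical_stroke, dot) and the 5 x 5 toy image are assumptions chosen for illustration, not part of the original material. The example flattens an image of a vertical stroke and a copy shifted by one pixel, then measures how many active inputs the two vectors share.

```python
def flatten(image):
    """Flatten a 2D list of pixels row by row, as an MLP input layer would see it."""
    return [pixel for row in image for pixel in row]


def vertical_stroke(size, column):
    """Build a size x size binary image containing a single vertical stroke."""
    return [[1 if c == column else 0 for c in range(size)] for _ in range(size)]


def dot(u, v):
    """Inner product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))


original = flatten(vertical_stroke(5, column=1))
shifted = flatten(vertical_stroke(5, column=2))  # same stroke, one pixel to the right

print(dot(original, original))  # 5: the stroke overlaps fully with itself
print(dot(original, shifted))   # 0: after a one-pixel shift no active input is shared
```

From the point of view of a fully connected first layer, the two inputs activate disjoint sets of weights, even though they depict the same shape.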

LeNet-5 introduced a decisive conceptual shift by combining convolutional operations, subsampling mechanisms, and systematic weight sharing. Convolutional layers are designed to detect local patterns, such as edges, corners, or elementary strokes, while explicitly preserving the two-dimensional structure of the input image. Each convolutional filter is applied by sliding it across the image, acting as a specialized detector that responds strongly to a specific visual pattern wherever it appears. In this way, local features are extracted consistently across the entire spatial extent of the image.
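As a minimal sketch of the sliding-filter idea, the following pure-Python function computes an unpadded, stride-1 response map. The function name correlate2d_valid, the 3 x 3 edge kernel, and the toy image are hypothetical and chosen only for illustration. Strictly speaking the operation is a cross-correlation, which is what deep learning frameworks usually call convolution.

```python
def correlate2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the response map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh)
                for dj in range(kw)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Hypothetical 3x3 detector for a dark-to-bright vertical edge
edge_kernel = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]

# 5x6 image: left half dark (0), right half bright (1)
image = [[0, 0, 0, 1, 1, 1] for _ in range(5)]

response = correlate2d_valid(image, edge_kernel)
for row in response:
    print(row)  # each row is [0, 3, 3, 0]
```

The same nine weights produce a strong response in every row where the edge occurs and zero in the uniform regions, which is the consistent, position-independent detection described above.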

Subsampling layers, implemented in LeNet-5 through average pooling operations, further transform these feature maps by progressively reducing their spatial resolution. This dimensionality reduction introduces a degree of invariance to small translations and minor geometric deformations, as the precise location of a feature becomes less critical at coarser scales. At the same time, subsampling significantly decreases computational complexity and reduces the number of parameters required in subsequent layers, thereby improving efficiency and stability during training.
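The effect of subsampling can be sketched with a plain 2 x 2 average pooling function. The name avg_pool_2x2 and the toy feature maps are assumptions for illustration; the original LeNet-5 subsampling layers are not reproduced exactly here.

```python
def avg_pool_2x2(feature_map):
    """Average non-overlapping 2x2 blocks, halving height and width."""
    h, w = len(feature_map), len(feature_map[0])
    return [
        [
            (
                feature_map[i][j]
                + feature_map[i][j + 1]
                + feature_map[i + 1][j]
                + feature_map[i + 1][j + 1]
            )
            / 4
            for j in range(0, w - 1, 2)
        ]
        for i in range(0, h - 1, 2)
    ]


fm = [
    [0, 4, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 8, 0],
    [0, 0, 0, 0],
]
# Same activations, each moved by one pixel inside its pooling window
fm_shifted = [
    [0, 0, 0, 0],
    [4, 0, 0, 0],
    [0, 0, 0, 8],
    [0, 0, 0, 0],
]

print(avg_pool_2x2(fm))          # [[1.0, 0.0], [0.0, 2.0]]
print(avg_pool_2x2(fm_shifted))  # [[1.0, 0.0], [0.0, 2.0]]
```

The pooled maps are identical because each activation stays inside the same 2 x 2 window. A shift that crosses a window boundary would change the output, so the invariance is only approximate.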

A further key design principle is weight sharing. Instead of learning a distinct set of weights for each spatial position, the same convolutional filter is reused across all locations in the image. This approach drastically reduces the total number of parameters and enforces the idea that a meaningful visual pattern, such as a vertical stroke or an edge, should be recognized consistently regardless of its position. As a result, weight sharing enhances generalization and contributes to the robustness of the learned representations.
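A rough parameter count makes the saving concrete. The sketch below assumes an illustrative setting that is not stated in the paragraph above: a 32 x 32 grayscale input mapped to six 28 x 28 feature maps, once with shared 5 x 5 filters and once with a fully connected layer producing the same number of outputs. The helper names are hypothetical.

```python
def conv_params(in_channels, out_channels, kernel_size):
    """Weights plus one bias per filter; independent of the image size."""
    return out_channels * (in_channels * kernel_size * kernel_size + 1)


def dense_params(in_features, out_features):
    """One weight per input-output pair plus one bias per output."""
    return out_features * (in_features + 1)


# Assumed setting: 32x32x1 input, six 28x28 output maps, 5x5 filters
shared = conv_params(in_channels=1, out_channels=6, kernel_size=5)
unshared = dense_params(in_features=32 * 32, out_features=28 * 28 * 6)

print(shared)    # 156
print(unshared)  # 4821600
```

Under these assumptions, weight sharing and local connectivity reduce the layer from several million parameters to a few hundred.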

The combined effect of convolution, subsampling, and weight sharing enables LeNet-5 to learn hierarchical representations in a progressive and structured manner. Early layers capture simple, localized features, while deeper layers integrate these elements into increasingly abstract and task-specific representations. This hierarchical organization of features remains a fundamental principle in contemporary convolutional neural network architectures, from AlexNet onward, and it continues to inform more recent vision systems, including some transformer-based architectures that build hierarchical feature maps.

Structural Organization of the LeNet-5 Architecture

The original LeNet-5 architecture is composed of seven trainable layers that combine convolutional operations, subsampling mechanisms, and fully connected transformations. The network is designed to process grayscale images of size \(32 \times 32\), a resolution that is slightly larger than the standard MNIST digit images of \(28 \times 28\). This deliberate enlargement introduces a uniform margin around the digit, which facilitates the application of convolutional filters near image boundaries and allows the model to handle small translations without discarding relevant edge information.

From a structural perspective, the architecture can be divided into two main components: a convolutional feature extraction stage followed by a fully connected classification stage. In the convolutional part, convolutional layers and average pooling layers alternate, progressively transforming the input image into a set of compact and informative feature representations. The final part of the network consists of dense layers that operate on these extracted features to perform classification.

The characteristic dimensions of the layers in the original LeNet-5 architecture are summarized in the following table, which illustrates how spatial resolution decreases while the number of feature channels increases as the data propagate through the network:

Layer	Type	Input	Output
C1	Convolution	\(32 \times 32 \times 1\)	\(28 \times 28 \times 6\)
S2	Average Pooling	\(28 \times 28 \times 6\)	\(14 \times 14 \times 6\)
C3	Convolution	\(14 \times 14 \times 6\)	\(10 \times 10 \times 16\)
S4	Average Pooling	\(10 \times 10 \times 16\)	\(5 \times 5 \times 16\)
C5	Convolution	\(5 \times 5 \times 16\)	\(1 \times 1 \times 120\)
F6	Fully Connected	120	84
Output	Fully Connected	84	10
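The spatial sizes in the table follow from the standard output-size relation \(\text{out} = \lfloor (\text{in} + 2p - k) / s \rfloor + 1\). The sketch below checks them under the assumption, taken from the original design rather than from the table itself, that convolutions use \(5 \times 5\) kernels with stride 1 and pooling uses \(2 \times 2\) windows with stride 2, all without padding. The function name out_size is hypothetical.

```python
def out_size(in_size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling window."""
    return (in_size + 2 * padding - kernel) // stride + 1


# (name, kernel, stride); kernel sizes are assumed, not listed in the table
layers = [("C1", 5, 1), ("S2", 2, 2), ("C3", 5, 1), ("S4", 2, 2), ("C5", 5, 1)]

size = 32
sizes = {}
for name, kernel, stride in layers:
    size = out_size(size, kernel, stride)
    sizes[name] = size
    print(f"{name}: {size} x {size}")
```

The printed sizes (28, 14, 10, 5, 1) match the Output column of the table.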

In the original implementation, LeNet-5 employs saturating sigmoid-type activation functions, principally a scaled \(\tanh\), rather than the rectified linear units commonly used in modern architectures. Furthermore, subsampling is performed by averaging (followed in the original by a trainable scaling and bias) instead of max pooling, reflecting both the theoretical preferences and computational constraints of the period in which the model was developed. Despite these differences, the total number of trainable parameters is approximately 60,000, which is relatively small when compared to fully connected networks designed to process inputs of similar dimensionality.
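For reference, the layer table can be approximated in PyTorch as follows. This is a simplified sketch, not a faithful reproduction: it assumes \(5 \times 5\) kernels, uses plain nn.AvgPool2d without trainable coefficients, connects every C3 filter to all six input maps, and replaces the original output layer with an ordinary linear layer. The variable name classic_lenet5 is hypothetical.

```python
import torch
from torch import nn

# Simplified sketch of the classic layout (see the assumptions in the lead-in)
classic_lenet5 = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5),     # C1: 32x32x1 -> 28x28x6
    nn.Tanh(),
    nn.AvgPool2d(kernel_size=2),        # S2: -> 14x14x6
    nn.Conv2d(6, 16, kernel_size=5),    # C3: -> 10x10x16
    nn.Tanh(),
    nn.AvgPool2d(kernel_size=2),        # S4: -> 5x5x16
    nn.Conv2d(16, 120, kernel_size=5),  # C5: -> 1x1x120
    nn.Tanh(),
    nn.Flatten(),
    nn.Linear(120, 84),                 # F6
    nn.Tanh(),
    nn.Linear(84, 10),                  # Output (logits)
)

dummy = torch.zeros(1, 1, 32, 32)
output_shape = tuple(classic_lenet5(dummy).shape)
n_params = sum(p.numel() for p in classic_lenet5.parameters())

print(output_shape)  # (1, 10)
print(n_params)      # 61706
```

The count differs slightly from the original network because of the simplifications listed above, but it is consistent with the figure of roughly 60,000 parameters.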

PyTorch Implementation

This section presents a modern, functional implementation inspired by LeNet using PyTorch. The goal is to have a complete workflow, executable end-to-end and directly convertible into a Jupyter Notebook. The MNIST dataset is used as the reference dataset for handwritten digit classification.

Importing Libraries

First, the necessary libraries from Python’s standard library and third-party packages are imported, including modules for model construction, data handling, visualization, and embedding analysis.

# Standard libraries
from typing import Any

# Third-party libraries
import matplotlib.pyplot as plt
import torch
from sklearn.manifold import TSNE
from torch import nn
from torch.utils.data import DataLoader
from torchinfo import summary
from torchvision import datasets, transforms
from tqdm import tqdm

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

Device Setup

The device is automatically selected based on GPU availability. If CUDA is available, the GPU is used; otherwise, the model runs on the CPU.

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Device used: {device}")

Auxiliary Visualization Function

A helper function is defined to visually inspect examples from the dataset, displaying a set of images with their corresponding labels. This helps quickly verify that preprocessing is correct and samples are interpreted properly.

def show_images(images, labels):
    """Display a row of images with their digit labels as titles."""
    fig, axes = plt.subplots(1, len(images), figsize=(15, 3))
    if len(images) == 1:
        axes = [axes]

    for img, label, ax in zip(images, labels, axes):
        ax.imshow(img.squeeze(), cmap="gray")
        ax.set_title(f"Digit: {label}")
        ax.axis("off")

    plt.tight_layout()
    plt.show()

Loading and Preprocessing the MNIST Dataset

The MNIST dataset contains grayscale images of handwritten digits sized \(28 \times 28\). Preprocessing includes normalization using the mean \(\mu = 0.1307\) and standard deviation \(\sigma = 0.3081\), estimated from the dataset itself. Normalization is defined as:

\[ x_{\text{normalized}} = \frac{x - \mu}{\sigma}. \]

This centers the data and scales it, facilitating and stabilizing the training of deep networks by improving the numerical conditioning of optimization operations.

transform = transforms.Compose(
    [transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))]
)

train_dataset = datasets.MNIST(
    root="./data", train=True, download=True, transform=transform
)

test_dataset = datasets.MNIST(
    root="./data", train=False, download=True, transform=transform
)

print(f"Training samples: {len(train_dataset)}")
print(f"Test samples: {len(test_dataset)}")
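To make the normalization formula concrete, the following sketch applies it to the two extreme pixel values. It assumes pixels have already been scaled to the range [0, 1], which is what transforms.ToTensor() does for 8-bit images; the function name normalize is hypothetical.

```python
MEAN, STD = 0.1307, 0.3081  # MNIST statistics quoted in the text


def normalize(x, mean=MEAN, std=STD):
    """Apply x_normalized = (x - mean) / std to a pixel already scaled to [0, 1]."""
    return (x - mean) / std


background = normalize(0.0)  # black pixel
stroke = normalize(1.0)      # fully white pixel

print(f"{background:.4f}")  # -0.4242
print(f"{stroke:.4f}")      # 2.8215
```

Because most MNIST pixels are background, the normalized values are asymmetric around zero: background pixels sit slightly below zero, while stroke pixels reach values close to 2.8.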

Creating DataLoaders

DataLoaders are created from the training and test sets to batch samples, shuffle the training examples at every epoch, and load data in background worker processes (num_workers=2). Transfer to the computation device is handled later, inside the training loop.

BATCH_SIZE = 32

train_dataloader = DataLoader(
    train_dataset, batch_size=BATCH_SIZE, shuffle=True, num_workers=2
)

test_dataloader = DataLoader(
    test_dataset, batch_size=BATCH_SIZE, shuffle=False, num_workers=2
)

Visual Inspection of the Dataset

Before training, it is useful to inspect some training samples. The mean and standard deviation of a batch are also computed to verify normalization: values close to 0 and 1, respectively, indicate that it was applied correctly.

images, labels = next(iter(train_dataloader))
show_images(images[:10], labels[:10])

print(f"Batch mean: {images.mean():.3f}")
print(f"Batch standard deviation: {images.std():.3f}")

Defining a Modern LeNet Version in PyTorch

A modern, simplified variant of LeNet is defined, adapted to MNIST and to current deep learning practice. It does not replicate the original LeNet-5: it takes \(28 \times 28\) inputs directly, uses two strided convolutions instead of alternating convolution and pooling layers, and ends with a single linear layer. It does preserve the spirit of the design, with a convolutional part for spatial feature extraction and a fully connected part for classification. Batch normalization and ReLU activations, both standard in contemporary architectures, are included to improve convergence speed and training stability.

class LeNet(nn.Module):
    def __init__(self, input_channels: int = 1):
        super().__init__()

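        # Shape comments assume a 1x28x28 input
        self.features = nn.Sequential(
            # 1x28x28 -> 16x13x13 (strided convolution replaces pooling)
            nn.Conv2d(input_channels, 16, kernel_size=4, stride=2),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            # 16x13x13 -> 32x5x5
            nn.Conv2d(16, 32, kernel_size=4, stride=2),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            # 32x5x5 -> 32x1x1 (global average pooling)
            nn.AdaptiveAvgPool2d((1, 1)),
        )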

        self.classifier = nn.Sequential(nn.Flatten(), nn.Linear(32, 10))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        x = self.classifier(x)
        return x

The features block applies two convolutional layers that reduce spatial resolution via stride=2, each followed by batch normalization and a ReLU activation. nn.AdaptiveAvgPool2d((1, 1)) then averages each feature map down to size \(1 \times 1\), so the classifier always receives 32 values regardless of the spatial size of the feature maps. The classifier block flattens these features and applies a linear layer to produce logits for the 10 digit classes. The logits are passed directly to CrossEntropyLoss, which applies log-softmax internally, so the model itself contains no softmax layer.
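The shapes and parameter count of this model can be checked by hand. The sketch below redoes the arithmetic in plain Python for a \(28 \times 28\) input; the helper name conv_out is hypothetical, and the count treats the scale and shift of each batch normalization layer as trainable parameters, which is the PyTorch default.

```python
def conv_out(size, kernel, stride):
    """Output size of an unpadded convolution along one spatial dimension."""
    return (size - kernel) // stride + 1


after_conv1 = conv_out(28, kernel=4, stride=2)           # 13
after_conv2 = conv_out(after_conv1, kernel=4, stride=2)  # 5

conv1 = 16 * (1 * 4 * 4 + 1)    # weights + biases of the first convolution
bn1 = 2 * 16                    # scale and shift per channel
conv2 = 32 * (16 * 4 * 4 + 1)   # weights + biases of the second convolution
bn2 = 2 * 32
linear = 10 * (32 + 1)          # final classifier
total = conv1 + bn1 + conv2 + bn2 + linear

print(after_conv1, after_conv2)  # 13 5
print(total)                     # 8922
```

With roughly 9,000 parameters, this variant is considerably smaller than the original LeNet-5.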

Model Instantiation and Analysis

The model is instantiated and moved to the selected device. torchinfo.summary then provides a structured overview of the architecture, including per-layer output shapes and parameter counts.

model = LeNet().to(device)

summary(model, input_size=(BATCH_SIZE, 1, 28, 28), device=str(device))

total_params = sum(p.numel() for p in model.parameters())
print(f"Total trainable parameters: {total_params:,}")

Training Setup

Training hyperparameters, optimizer, and loss function are defined. AdamW is used; it combines Adam’s adaptive learning rates with weight decay that is decoupled from the gradient update. The chosen loss is CrossEntropyLoss, suitable for multi-class classification with integer labels.

NUM_EPOCHS = 2
LEARNING_RATE = 1e-3
WEIGHT_DECAY = 1e-4

optimizer = torch.optim.AdamW(
    model.parameters(), lr=LEARNING_RATE, weight_decay=WEIGHT_DECAY
)

loss_function = nn.CrossEntropyLoss()
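To clarify what the loss measures, the following pure-Python sketch computes the cross-entropy of a single sample from raw logits and an integer label. It is an illustration of the formula, not the PyTorch implementation; the function name cross_entropy is hypothetical. By default, nn.CrossEntropyLoss averages this per-sample quantity over the batch.

```python
import math


def cross_entropy(logits, target):
    """Negative log-probability of class `target` under softmax(logits)."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


# All ten classes equally likely: the loss equals ln(10)
uniform = cross_entropy([0.0] * 10, target=3)

# A large logit on the correct class drives the loss toward zero
confident = cross_entropy([10.0 if k == 3 else 0.0 for k in range(10)], target=3)

print(f"{uniform:.4f}")  # 2.3026
print(confident < 1e-3)  # True
```

A loss near \(\ln 10 \approx 2.30\) at the start of training is therefore expected for an untrained 10-class model.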

Training and Validation Loop

Training is organized in epochs. Each epoch updates model parameters on the training set, then evaluates performance on the test set, which serves here as the validation set, without updating parameters. Loss and accuracy are recorded for both sets to analyze learning progression and detect issues such as overfitting.

train_losses, test_losses = [], []
train_accuracies, test_accuracies = [], []

for epoch in range(NUM_EPOCHS):

    model.train()
    running_loss, correct, total = 0.0, 0, 0

    for images, labels in tqdm(
        train_dataloader, desc=f"Epoch {epoch+1}/{NUM_EPOCHS} [TRAIN]"
    ):
        images, labels = images.to(device), labels.to(device)

        optimizer.zero_grad()
        outputs = model(images)
        loss = loss_function(outputs, labels)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()
        _, preds = torch.max(outputs, 1)
        total += labels.size(0)
        correct += (preds == labels).sum().item()

    train_losses.append(running_loss / len(train_dataloader))
    train_accuracies.append(100 * correct / total)

    model.eval()
    test_loss, correct, total = 0.0, 0, 0

    with torch.no_grad():
        for images, labels in tqdm(
            test_dataloader, desc=f"Epoch {epoch+1}/{NUM_EPOCHS} [TEST]"
        ):
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            loss = loss_function(outputs, labels)

            test_loss += loss.item()
            _, preds = torch.max(outputs, 1)
            total += labels.size(0)
            correct += (preds == labels).sum().item()

    test_losses.append(test_loss / len(test_dataloader))
    test_accuracies.append(100 * correct / total)

    print(f"Epoch {epoch+1}")
    print(f"  Train → Loss: {train_losses[-1]:.4f} | Acc: {train_accuracies[-1]:.2f}%")
    print(f"  Test  → Loss: {test_losses[-1]:.4f} | Acc: {test_accuracies[-1]:.2f}%")

model.train() activates training-specific behaviors, such as updating batch normalization statistics and applying dropout if present. model.eval() disables these behaviors for deterministic evaluation, and torch.no_grad() during validation avoids gradient computation, reducing memory usage and computation time.
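The difference between the two modes can be observed on a single batch normalization layer. The sketch below uses a small nn.BatchNorm1d layer and random data purely for illustration; the variable names are hypothetical.

```python
import torch
from torch import nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(3)            # running_mean starts at 0, momentum defaults to 0.1
x = torch.randn(8, 3) * 2.0 + 5.0

bn.train()
y_train = bn(x)                   # normalizes with batch statistics, updates running stats
mean_after_train = bn.running_mean.clone()

bn.eval()
with torch.no_grad():
    y_eval = bn(x)                # normalizes with the stored running statistics
mean_after_eval = bn.running_mean.clone()

print(y_train.mean(dim=0))        # close to 0 for every feature
print(mean_after_train)           # 0.1 * batch mean after one update
print(torch.equal(mean_after_train, mean_after_eval))  # True
```

In training mode the layer both uses and updates batch statistics; in evaluation mode it only reads the stored running estimates, so repeated evaluation leaves the model unchanged.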

Visualizing Metric Evolution

After training, loss and accuracy evolution for training and testing is plotted. This visual analysis helps identify overfitting, underfitting, or learning stagnation, guiding potential architecture or hyperparameter adjustments.

epochs = range(1, NUM_EPOCHS + 1)

plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.plot(epochs, train_losses, label="Train Loss")
plt.plot(epochs, test_losses, label="Test Loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend()
plt.title("Loss Evolution")

plt.subplot(1, 2, 2)
plt.plot(epochs, train_accuracies, label="Train Accuracy")
plt.plot(epochs, test_accuracies, label="Test Accuracy")
plt.xlabel("Epoch")
plt.ylabel("Accuracy (%)")
plt.legend()
plt.title("Accuracy Evolution")

plt.tight_layout()
plt.show()

Comparing training and testing curves provides insights into model generalization. For example, increasing training accuracy with stagnant or decreasing test accuracy usually indicates overfitting, while high loss on both sets suggests insufficient model capacity or training time.
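One simple way to read such curves programmatically is to locate the epoch with the lowest test loss and to measure the final gap between test and training loss. The helper below is a hypothetical sketch of that idea, and the loss values are made up for illustration; they are not results from the model above.

```python
def summarize_curves(train_losses, test_losses):
    """Return the epoch with the lowest test loss and the final test/train gap."""
    best_epoch = min(range(len(test_losses)), key=test_losses.__getitem__) + 1
    final_gap = test_losses[-1] - train_losses[-1]
    return best_epoch, final_gap


# Made-up curves for illustration only
train = [0.90, 0.50, 0.30, 0.20, 0.10]
test = [0.95, 0.60, 0.45, 0.50, 0.60]

best_epoch, final_gap = summarize_curves(train, test)
print(best_epoch)           # 3
print(round(final_gap, 2))  # 0.5
```

A best epoch well before the last one, combined with a growing gap, is the pattern usually associated with overfitting.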

Visualizing Embeddings with t-SNE

Finally, the structure of embeddings produced by the model is analyzed using t-SNE (t-distributed Stochastic Neighbor Embedding). The model’s linear outputs (logits) are extracted as example representations. t-SNE projects these high-dimensional vectors into 2D space while preserving local neighborhood relations. This projection visually shows how the model separates different classes in feature space.

model.eval()

max_samples = 1000
embeddings, all_labels = [], []

with torch.no_grad():
    num_collected = 0
    for images, labels in train_dataloader:
        # Stop once enough samples have been gathered
        if num_collected >= max_samples:
            break
        outputs = model(images.to(device))
        embeddings.append(outputs.cpu())
        all_labels.append(labels)
        num_collected += labels.size(0)

# Concatenate the batches and keep exactly max_samples points
embeddings = torch.cat(embeddings)[:max_samples].numpy()
all_labels = torch.cat(all_labels)[:max_samples].numpy()

tsne = TSNE(
    n_components=2,
    perplexity=30,
    random_state=42,
    max_iter=300,  # called n_iter in scikit-learn versions before 1.5
    learning_rate=200,
    n_jobs=-1,
)
X_embedded = tsne.fit_transform(embeddings)

plt.figure(figsize=(10, 8))
scatter = plt.scatter(
    X_embedded[:, 0], X_embedded[:, 1], c=all_labels, cmap="tab10", alpha=0.6, s=10
)
plt.colorbar(scatter, ticks=range(10))
plt.title("t-SNE of embeddings learned by LeNet")
plt.xlabel("Component 1")
plt.ylabel("Component 2")
plt.show()

When the model has learned a good data representation, points corresponding to different classes tend to form separate clusters in 2D space. This visualization provides an intuitive perspective on how the model internally organizes information and distinguishes handwritten digit classes. Because the embeddings used here are the logits, clear separation suggests that the final linear layer maps the learned features to well-separated class scores. Since t-SNE distorts global distances, the plot is best read as qualitative evidence that the learned representations are discriminative, not as a quantitative measure.
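An alternative to using the logits is to project the 32-dimensional features that enter the final linear layer. The sketch below shows one way to extract them, assuming a model with the same features/classifier layout as the LeNet class defined earlier. TinyNet is a hypothetical stand-in so that the example is self-contained; in the notebook the trained model would be passed instead.

```python
import torch
from torch import nn


class TinyNet(nn.Module):
    """Stand-in with the same two-block layout as the LeNet class (illustration only)."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=4, stride=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, 1)),
        )
        self.classifier = nn.Sequential(nn.Flatten(), nn.Linear(32, 10))

    def forward(self, x):
        return self.classifier(self.features(x))


@torch.no_grad()
def penultimate_features(net, images):
    """Return the flattened output of `net.features`, i.e. the classifier's input."""
    net.eval()
    return torch.flatten(net.features(images), start_dim=1)


net = TinyNet()
batch = torch.zeros(4, 1, 28, 28)
feats = penultimate_features(net, batch)
print(tuple(feats.shape))  # (4, 32)
```

Running t-SNE on these vectors instead of the logits shows the space in which the final layer actually draws its linear decision boundaries.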