
Normalization

Normalization constitutes a key stage both in input data preprocessing and in the internal design of neural network architectures. Its primary objective is to control the scale of numerical values, ensuring that different features are in comparable ranges and that training is stable, efficient, and less sensitive to initialization or hyperparameter choices.

In the context of images, normalization falls into two broad conceptual blocks: input data normalization, applied before images enter the network, and layer normalization, applied to the internal activations of the network during training. Although both categories pursue similar objectives, they operate at different stages of the data flow and with different mechanisms.

Input Data Normalization

Input data normalization is applied directly to images before they are processed by the network layers. In the case of images, one works with tensors or arrays where each pixel can be represented with raw values in the range \([0, 255]\) or, after prior conversion, with floating-point values.

The purpose of this normalization is threefold. First, it provides numerical stability by avoiding excessively large or small values, which can cause exploding or vanishing gradients. Second, it accelerates training, as gradients propagate more uniformly through the network. Finally, it prevents one feature from dominating others simply because of its scale, so that all dimensions of the feature space contribute comparably to learning.

Motivation for Normalizing Input

Input normalization fulfills several essential objectives. In terms of numerical stability, it prevents activations from reaching magnitudes that hinder the convergence of optimization algorithms. Homogenizing the input also helps the gradients computed during backpropagation stay at reasonable orders of magnitude, which allows more aggressive learning rates without compromising convergence. Finally, bringing all features into similar ranges has a balancing effect, so the model is not biased toward components with larger numerical values.

Input Normalization Techniques

Various standard techniques exist for normalizing images, each suitable for certain scenarios and architectures. A first technique consists of Min-Max normalization to the range \([0, 1]\). In this case, the image is linearly rescaled using its minimum and maximum values:

# 3pps
import numpy as np

def normalize_min_max(image):
    """Brings the image to the range [0, 1]."""
    image = image.astype(np.float32)
    normalized = (image - image.min()) / (image.max() - image.min() + 1e-8)
    return normalized

This method is useful when one wants to work with values bounded between 0 and 1, for example in simple models or when one wishes to visualize or combine different data sources normalized to the same range.
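As a quick sanity check, the function can be applied to a synthetic 8-bit image with known extremes (the function is repeated here so the snippet runs standalone):

```python
import numpy as np

def normalize_min_max(image):
    """Brings the image to the range [0, 1]."""
    image = image.astype(np.float32)
    return (image - image.min()) / (image.max() - image.min() + 1e-8)

# Synthetic 2x2 "image" containing the full 8-bit extremes 0 and 255
image = np.array([[0, 128], [64, 255]], dtype=np.uint8)
normalized = normalize_min_max(image)
print(normalized.min(), normalized.max())  # approximately 0.0 and 1.0
```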

A widely used variant in deep neural networks consists of bringing values to the range \([-1, 1]\). For typical 8-bit images, a direct way to achieve this is to first divide by 255 and then apply a linear transformation:

def normalize_minus_one_to_one(image):
    """Brings the image to the range [-1, 1]."""
    image = image.astype(np.float32) / 255.0
    normalized = 2.0 * image - 1.0
    return normalized

This type of normalization is common in architectures such as Generative Adversarial Networks (GANs), where it is preferable for input data to be centered around zero.

Another fundamental approach is standardization or \(z\)-score normalization. In this case, the mean is subtracted from the data and the result is divided by the standard deviation:

def standardize(image):
    """Standardizes: (x - mean) / standard deviation."""
    image = image.astype(np.float32)
    mean = image.mean()
    std = image.std()
    standardized = (image - mean) / (std + 1e-8)
    return standardized

This technique transforms data so that it has approximately zero mean and unit variance. In computer vision, it is frequently used at the channel level, utilizing precomputed means and standard deviations over large datasets, such as ImageNet.

In practice, frameworks like PyTorch facilitate input normalization through predefined transformations. A typical example for models pretrained on ImageNet is as follows:

# 3pps
from torchvision import transforms

# Standard transformation for pretrained models (ImageNet)
transform = transforms.Compose(
    [
        transforms.ToTensor(),  # Converts to tensor and scales to [0, 1]
        transforms.Normalize(
            mean=[0.485, 0.456, 0.406],  # ImageNet mean per channel
            std=[0.229, 0.224, 0.225],  # ImageNet standard deviation per channel
        ),
    ]
)

In this pipeline, the ToTensor transform converts the image to a floating-point tensor and scales values to the range \([0, 1]\). Normalize then applies channel-wise standardization using global statistics from the original training set. This practice ensures that pretrained models receive inputs in the same statistical regime for which they were optimized.
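Conceptually, Normalize is per-channel standardization with broadcasting. A minimal sketch in plain PyTorch, assuming a hypothetical input already scaled to \([0, 1]\) as ToTensor would produce:

```python
import torch

# Hypothetical input in [0, 1], shaped (C, H, W) as ToTensor would produce
x = torch.rand(3, 224, 224)

# ImageNet per-channel statistics, reshaped for broadcasting over (H, W)
mean = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)

# Channel-wise standardization, the operation transforms.Normalize performs
x_norm = (x - mean) / std
print(x_norm.shape)  # torch.Size([3, 224, 224])
```

The transformation is invertible: multiplying by `std` and adding `mean` recovers the original tensor.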

Layer Normalization in Neural Networks

Layer normalization is performed within the network architecture, on the intermediate activations that are generated as data advances through different layers. Unlike input data normalization, which is fixed preprocessing, layer normalization is implemented as differentiable blocks that form part of the model and that, in many cases, contain learnable parameters.

The general idea consists of normalizing activations along certain dimensions (for example, over the batch, over channels, or over all elements of a sample), and then applying a linear transformation with scale and shift parameters that are learned during training. This mitigates the so-called "internal covariate shift" and stabilizes the distribution of activations, which facilitates the training of deep networks.

Local Response Normalization (LRN)

Local Response Normalization (LRN) is a technique introduced in early networks such as AlexNet. Its purpose is to perform normalization based on the response of neighboring channels, mimicking certain lateral inhibition mechanisms observed in the biological visual system. Although it is included here for historical completeness, in practice its current use is residual, as it has been widely displaced by more effective methods such as Batch Normalization or Layer Normalization.

A schematic implementation of LRN in PyTorch can be structured as a class that receives parameters such as neighborhood size \(n\), coefficients \(\alpha\) and \(\beta\), and a constant \(k\):

# 3pps
import torch
import torch.nn as nn

class LocalResponseNormalization(nn.Module):
    def __init__(self, k=2.0, n=5, alpha=1e-4, beta=0.75):
        super().__init__()
        self.k = k
        self.n = n
        self.alpha = alpha
        self.beta = beta

    def forward(self, x):
        # Sum of squared activations over a window of n neighboring channels,
        # computed via 3D average pooling along the channel dimension
        squared = x.pow(2).unsqueeze(1)  # (N, 1, C, H, W)
        summed = nn.functional.avg_pool3d(
            squared, (self.n, 1, 1), stride=1, padding=(self.n // 2, 0, 0)
        ).squeeze(1) * self.n
        # AlexNet formulation: x / (k + alpha * sum)^beta
        return x / (self.k + self.alpha * summed).pow(self.beta)

Although LRN had relevance in early works with deep CNNs, its current impact is very limited and it is not considered a recommendable choice for modern architectures.
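For completeness, PyTorch ships a built-in nn.LocalResponseNorm. Note one detail when comparing parameters: PyTorch's version divides alpha by the window size internally, so it is close in spirit, though not parameter-for-parameter identical, to the classic AlexNet formulation:

```python
import torch
import torch.nn as nn

# Built-in LRN; PyTorch scales alpha by the window size internally
lrn = nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0)

x = torch.randn(1, 16, 8, 8)
out = lrn(x)
print(out.shape)  # torch.Size([1, 16, 8, 8])
```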

Global Response Normalization (GRN)

Global Response Normalization (GRN) is proposed as a more recent alternative to local normalization. Instead of normalizing with respect to neighboring channels, GRN considers the global response of all channels for each spatial position and regulates the magnitude of activations per channel from that global information. The objective is to prevent certain channels from becoming redundant or systematically dominating the representation, promoting a more balanced distribution of energy across channels.

A typical GRN implementation in PyTorch can take the following form:

class GlobalResponseNormalization(nn.Module):
    def __init__(self, num_channels, eps=1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1, num_channels, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, num_channels, 1, 1))
        self.eps = eps

    def forward(self, x):
        # Calculate global norm per channel (p=2 over spatial dimensions)
        gx = torch.norm(x, p=2, dim=(2, 3), keepdim=True)
        # Normalize with respect to the mean of global norms
        nx = gx / (gx.mean(dim=1, keepdim=True) + self.eps)
        # Rescale and add residual component
        return self.gamma * (x * nx) + self.beta + x

In this block, an \(L^2\) norm per channel is first calculated by aggregating over spatial dimensions. Subsequently, this norm is normalized with respect to its mean and used to rescale the original activations through the learnable parameters \(\gamma\) and \(\beta\), to which the input itself is also added as a residual term. This type of normalization has been explored in modern convolutional architectures and in masked autoencoder models.
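The shapes and effect of the two statistics can be checked in isolation, without the module wrapper (the epsilon value below mirrors the default used in the class):

```python
import torch

x = torch.randn(2, 8, 4, 4)

# L2 norm per channel, aggregated over the spatial dimensions -> (2, 8, 1, 1)
gx = torch.norm(x, p=2, dim=(2, 3), keepdim=True)
# Normalize each channel's norm by the mean norm across channels
nx = gx / (gx.mean(dim=1, keepdim=True) + 1e-6)

print(gx.shape, nx.shape)
# By construction, the normalized responses average to roughly 1 over channels
print(nx.mean(dim=1).squeeze())
```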

Batch Normalization (BN)

Batch Normalization (BN) is one of the most influential internal normalization techniques in deep networks. Its central idea consists of normalizing activations using statistics (mean and variance) calculated over the training batch itself for each channel.

For an activation tensor \(x\) of size \((N, C, H, W)\), where \(N\) is the batch size, \(C\) is the number of channels, and \((H, W)\) is the spatial dimension, the mean and variance per channel are calculated in training mode:

\[ \mu_c = \frac{1}{N H W} \sum_{n,h,w} x_{n,c,h,w}, \quad \sigma_c^2 = \frac{1}{N H W} \sum_{n,h,w} (x_{n,c,h,w} - \mu_c)^2. \]

Next, normalization is performed:

\[ \hat{x}_{n,c,h,w} = \frac{x_{n,c,h,w} - \mu_c}{\sqrt{\sigma_c^2 + \varepsilon}}, \]

and an affine transformation is applied with learnable parameters \(\gamma_c\) and \(\beta_c\):

\[ y_{n,c,h,w} = \gamma_c \hat{x}_{n,c,h,w} + \beta_c. \]

A simplified implementation of two-dimensional Batch Normalization can be expressed in PyTorch as follows:

class BatchNormalization2D(nn.Module):
    def __init__(self, num_channels, eps=1e-5, momentum=0.1):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(1, num_channels, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, num_channels, 1, 1))
        self.eps = eps
        self.momentum = momentum

        # Accumulated statistics for inference
        self.register_buffer("running_mean", torch.zeros(1, num_channels, 1, 1))
        self.register_buffer("running_var", torch.ones(1, num_channels, 1, 1))

    def forward(self, x):
        if self.training:
            mean = x.mean(dim=(0, 2, 3), keepdim=True)
            # Biased variance, matching the normalization formula
            var = x.var(dim=(0, 2, 3), keepdim=True, unbiased=False)

            # Update accumulated statistics, detached from the autograd graph
            with torch.no_grad():
                self.running_mean = (
                    1 - self.momentum
                ) * self.running_mean + self.momentum * mean
                self.running_var = (
                    1 - self.momentum
                ) * self.running_var + self.momentum * var
        else:
            mean = self.running_mean
            var = self.running_var

        x_norm = (x - mean) / torch.sqrt(var + self.eps)
        return self.gamma * x_norm + self.beta

During training, batch statistics are used, and running means and variances are updated with a certain momentum. During inference, those accumulated running statistics are used instead, which guarantees deterministic behavior.

Among the main advantages of Batch Normalization are training acceleration, the possibility of using higher learning rates, and reduced dependence on weight initialization. In many architectures, BN also contributes to reducing the need for additional regularization techniques such as Dropout. However, it also presents limitations. In particular, its performance degrades when the batch size is very small, as mean and variance estimates become noisy, and its behavior differs between training and inference modes, which requires careful management of train and eval modes.
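The train/eval distinction can be observed directly with the built-in nn.BatchNorm2d:

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(4)
x = torch.randn(8, 4, 16, 16)

bn.train()
out_train = bn(x)  # normalizes with batch statistics, updates running stats

bn.eval()
out_eval = bn(x)   # normalizes with the accumulated running statistics

# The two modes generally differ, since they use different statistics
print(torch.allclose(out_train, out_eval))  # typically False
```

After the single training pass, `bn.running_mean` and `bn.running_var` have already moved away from their initial values of 0 and 1.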

Layer Normalization (LN)

Layer Normalization (LN) is designed to overcome some limitations of BN, especially in contexts where batch size is small or where the model structure does not adapt well to batch normalization, such as in recurrent networks or Transformers. In LN, normalization is performed independently for each sample, aggregating over all its feature dimensions.

If one considers an input tensor \(x\) associated with an individual sample, LN calculates the mean and variance over all relevant dimensions (for example, over channels and spatial positions), and normalizes analogously to BN but without depending on other samples in the batch. Thus, the normalization behavior is identical in training and inference, and does not depend on batch size.

A schematic implementation of Layer Normalization for tensors of type \((N, C, H, W)\) can be written as follows:

class LayerNormalization2D(nn.Module):
    def __init__(self, num_channels=None, eps=1e-6):
        super().__init__()
        self.eps = eps
        if num_channels is not None:
            self.gamma = nn.Parameter(torch.ones(1, num_channels, 1, 1))
            self.beta = nn.Parameter(torch.zeros(1, num_channels, 1, 1))
        else:
            self.gamma = None
            self.beta = None

    def forward(self, x):
        mean = x.mean(dim=(1, 2, 3), keepdim=True)
        # Biased variance, matching the normalization formula
        var = x.var(dim=(1, 2, 3), keepdim=True, unbiased=False)
        x_norm = (x - mean) / torch.sqrt(var + self.eps)
        if self.gamma is not None and self.beta is not None:
            return self.gamma * x_norm + self.beta
        return x_norm

This normalization is especially suitable for attention-based architectures, such as Transformers, and for recurrent networks, where dependence on batch statistics could introduce undesired noise. Additionally, by not differentiating between training and inference modes, it simplifies the operational flow of the model and facilitates the use of very small batch sizes, even equal to one.
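As a practical note, PyTorch's nn.GroupNorm with a single group computes exactly these per-sample statistics over \((C, H, W)\), so it can serve as a drop-in for this scheme, even with a batch size of one:

```python
import torch
import torch.nn as nn

# GroupNorm with num_groups=1 normalizes each sample over (C, H, W)
ln = nn.GroupNorm(num_groups=1, num_channels=8)

x = torch.randn(1, 8, 4, 4)  # works even with batch size 1
out = ln(x)

# With default affine parameters (gamma=1, beta=0), the output has
# approximately zero mean and unit variance per sample
print(out.mean().item(), out.var(unbiased=False).item())
```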