Gradient Descent

Gradient descent is at the core of training algorithms in machine learning and deep learning. In essence, it is an iterative procedure that adjusts model parameters in the direction opposite to the gradient of the cost function in order to minimize it. This section first presents a purely numerical example in two dimensions, to visualize descent trajectories, and then several practical examples in PyTorch that show how the gradient is used to learn the parameters of simple models.
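
The basic update can be sketched in a few lines. As a purely illustrative warm-up (a one-dimensional f(x) = x², not one of the examples below), repeatedly stepping against the gradient drives x toward the minimum:

```python
# Illustrative only: minimize f(x) = x^2, whose gradient is 2x
def gradient(x: float) -> float:
    return 2.0 * x

x = 5.0  # initial point
learning_rate = 0.1

for _ in range(50):
    x = x - learning_rate * gradient(x)  # step against the gradient

print(x)  # approaches the minimizer x = 0
```

Each step multiplies x by (1 - 2 * learning_rate), so the iterates shrink geometrically toward zero.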

Example 1: Gradient Descent in a Two-Dimensional Landscape

In this first example, a nonlinear function of two variables is defined and its gradient is calculated analytically. Starting from several random initial points, gradient descent is applied and the trajectories are visualized in the parameter plane, which provides geometric intuition for the optimization process.

The function considered is:

\[ f(x_1, x_2) = \sin(x_1)\cos(x_2) + \sin(0.5\, x_1)\cos(0.5\, x_2), \]

implemented in NumPy as:

# Third-party packages
import matplotlib.pyplot as plt
import numpy as np

# Function definition
def function(input: np.ndarray) -> np.ndarray:
    assert input.shape[-1] == 2, "The input must contain 2 elements"
    return np.sin(input[:, 0]) * np.cos(input[:, 1]) + np.sin(
        0.5 * input[:, 0]
    ) * np.cos(0.5 * input[:, 1])

Next, the partial derivatives are defined analytically, that is, the gradient \(\nabla f(x_1, x_2) = (\partial f/\partial x_1, \partial f/\partial x_2)\):

# Gradient calculation (partial derivatives)

def gradient_fn(input: np.ndarray) -> np.ndarray:
    assert input.shape[-1] == 2, "The input must contain 2 elements"

    df_x1 = np.cos(input[:, 0]) * np.cos(input[:, 1]) + 0.5 * np.cos(
        0.5 * input[:, 0]
    ) * np.cos(0.5 * input[:, 1])
    df_x2 = -np.sin(input[:, 0]) * np.sin(input[:, 1]) - 0.5 * np.sin(
        0.5 * input[:, 0]
    ) * np.sin(0.5 * input[:, 1])

    return np.stack([df_x1, df_x2], axis=1)
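
Analytic gradients are easy to get wrong, so it is worth validating them against a central finite-difference approximation. A self-contained sanity check (the function and its gradient are re-stated here so the snippet runs on its own):

```python
import numpy as np

# Re-stated function and analytic gradient
def f(p: np.ndarray) -> np.ndarray:
    return np.sin(p[:, 0]) * np.cos(p[:, 1]) + np.sin(0.5 * p[:, 0]) * np.cos(0.5 * p[:, 1])

def grad_f(p: np.ndarray) -> np.ndarray:
    df_x1 = np.cos(p[:, 0]) * np.cos(p[:, 1]) + 0.5 * np.cos(0.5 * p[:, 0]) * np.cos(0.5 * p[:, 1])
    df_x2 = -np.sin(p[:, 0]) * np.sin(p[:, 1]) - 0.5 * np.sin(0.5 * p[:, 0]) * np.sin(0.5 * p[:, 1])
    return np.stack([df_x1, df_x2], axis=1)

def finite_difference(p: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    # Central differences along each coordinate
    grads = np.zeros_like(p)
    for j in range(p.shape[1]):
        shift = np.zeros_like(p)
        shift[:, j] = eps
        grads[:, j] = (f(p + shift) - f(p - shift)) / (2 * eps)
    return grads

points = np.array([[1.0, 2.0], [3.5, 0.5]])
print(np.allclose(grad_f(points), finite_difference(points), atol=1e-5))  # True
```

If the analytic and numerical gradients disagree beyond the tolerance, there is a sign or factor error in the derivation.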

The gradient descent algorithm is implemented as:

# Gradient descent algorithm

def gradient_descent(
    num_points: int = 10,
    num_iterations: int = 30,
    learning_rate: float = 1e-1,
):
    dim = 2
    # Random initialization in the domain [0, 10] x [0, 10]
    X = np.random.rand(num_points, dim) * 10
    trajectories = [X.copy()]

    for _ in range(num_iterations):
        X = X - learning_rate * gradient_fn(input=X)
        trajectories.append(X.copy())

    return np.array(trajectories)

The algorithm is executed for several initial points and their trajectories are plotted in the \((x_1, x_2)\) plane:

# Execute gradient descent
trajectory = gradient_descent(num_points=5, num_iterations=30)

# Visualize trajectories in 2D plane
for i in range(trajectory.shape[1]):
    plt.plot(trajectory[:, i, 0], trajectory[:, i, 1], marker="o")

plt.xlabel("x1")
plt.ylabel("x2")
plt.title("Gradient Descent Trajectories")
plt.grid()
plt.show()

Each curve shows how a point moves iteratively in the descent direction of \(f\). This example visually illustrates the fundamental idea: the gradient indicates the direction of maximum increase, and the algorithm moves in the opposite direction to approach function minima.

Example 2: Fitting a Quadratic Function in PyTorch

In the second example, it is shown how to apply gradient descent in PyTorch to fit a quadratic function to synthetically generated data. A relationship between time and velocity is simulated that approximately follows a parabola, with added noise:

# Third-party packages
import matplotlib.pyplot as plt
import torch

# Synthetic data
time_steps = torch.arange(0, 20).float()
velocity = torch.randn(20) * 3 + 0.75 * (time_steps - 9.5) ** 2 + 1

plt.scatter(time_steps, velocity)
plt.xlabel("Time")
plt.ylabel("Velocity")
plt.title("Synthetic data (time vs. velocity)")
plt.show()

velocity.shape, time_steps.shape

The assumed model is a quadratic function of the form

\[\hat{v}(t) = a t^2 + b t + c, \]

where \((a, b, c)\) are learnable parameters:

def quadratic_fn(time_step: torch.Tensor, parameters: torch.Tensor) -> torch.Tensor:
    a, b, c = parameters
    return a * (time_step**2) + b * time_step + c

def loss_function(predicted: torch.Tensor, real: torch.Tensor) -> torch.Tensor:
    return (real - predicted).square().mean()
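
For intuition, the mean squared error on a tiny hand-made example (illustrative values only):

```python
import torch

predicted = torch.tensor([1.0, 2.0, 3.0])
real = torch.tensor([1.0, 0.0, 3.0])

# Squared errors are (0, 4, 0); their mean is 4/3
mse = (real - predicted).square().mean()
print(mse)  # tensor(1.3333)
```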

Parameters are initialized randomly and the initial prediction is observed:

parameters = torch.randn(3, requires_grad=True)
parameters

predictions = quadratic_fn(time_step=time_steps, parameters=parameters)
predictions

To visualize the fit, an auxiliary function is defined:

def show_preds(time_steps, real, preds: torch.Tensor):
    plt.scatter(time_steps, real, color="blue", label="Real")
    plt.scatter(
        time_steps,
        preds.detach().cpu().numpy(),
        color="red",
        label="Predicted",
    )
    plt.legend()
    plt.show()

show_preds(time_steps, velocity, predictions)

The initial loss is calculated as:

loss_val = loss_function(predictions, velocity)
loss_val

Next, a manual gradient descent step is applied: the gradient is calculated using backward(), parameters are updated, and gradients are reset:

# Calculate gradients
loss_val.backward()
parameters.grad

# Gradient descent step
lr = 1e-5
parameters.data = parameters.data - lr * parameters.grad.data
parameters.grad = None

# New prediction after update
predictions = quadratic_fn(time_step=time_steps, parameters=parameters)
show_preds(time_steps, velocity, predictions)

To repeat this process systematically, it is encapsulated in a function:

def apply_step_training(
    time_steps,
    learnable_params,
    target_data,
    lr: float = 1e-5,
):
    predictions = quadratic_fn(time_step=time_steps, parameters=learnable_params)
    loss_val = loss_function(predicted=predictions, real=target_data)
    loss_val.backward()

    # Update parameters without gradient tracking
    with torch.no_grad():
        learnable_params -= lr * learnable_params.grad

    # Reset gradients
    learnable_params.grad.zero_()

    show_preds(time_steps, target_data, predictions)
    return predictions, learnable_params, loss_val

Training is executed for several epochs:

# Third-party packages
from tqdm import tqdm

num_epochs = 20
learnable_params = torch.randn(3, requires_grad=True)

for epoch in tqdm(range(num_epochs)):
    predictions, learnable_params, loss_val = apply_step_training(
        time_steps=time_steps,
        learnable_params=learnable_params,
        target_data=velocity,
    )
    print(f"Epoch {epoch+1}, loss: {loss_val.item():.4f}")

This flow illustrates the key training components in PyTorch:

  • Definition of a differentiable function.
  • Loss calculation.
  • Call to backward() to obtain gradients.
  • Manual parameter update within a torch.no_grad() context.
  • Gradient reset before the next iteration.
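
These same components can also be driven by a built-in optimizer instead of manual updates. A sketch using torch.optim.SGD on the same quadratic fit (the data and model are re-stated inline so the snippet is self-contained; the seed is arbitrary):

```python
import torch

torch.manual_seed(0)  # arbitrary seed, for reproducibility

# Same synthetic data as above
time_steps = torch.arange(0, 20).float()
velocity = torch.randn(20) * 3 + 0.75 * (time_steps - 9.5) ** 2 + 1

params = torch.randn(3, requires_grad=True)
optimizer = torch.optim.SGD([params], lr=1e-5)

initial_loss = None
for _ in range(20):
    optimizer.zero_grad()                      # reset gradients
    preds = params[0] * time_steps**2 + params[1] * time_steps + params[2]
    loss = (velocity - preds).square().mean()  # mean squared error
    loss.backward()                            # compute gradients
    optimizer.step()                           # update parameters
    if initial_loss is None:
        initial_loss = loss.item()

print(loss.item() < initial_loss)  # True: the loss decreases over training
```

optimizer.step() performs exactly the update written by hand earlier, and optimizer.zero_grad() replaces the manual gradient reset.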

Example 3: Manually Implemented Linear Layer and Simple Linear Module

In this part, two complementary ideas are introduced: the abstraction of a linear layer and the implementation of a linear model in PyTorch as a subclass of nn.Module.

First, a function that would represent a linear layer applied to an input is sketched:

def linear_layer(input_tensor: torch.Tensor) -> torch.Tensor:
    # input_tensor: (B, N)
    # w: (N,) and b: scalar, assumed defined elsewhere
    return input_tensor @ w + b

And a minimalist class:

class LinearLayer:
    def __init__(self, input_shape: int) -> None:
        self.w = torch.randn(input_shape)
        self.b = torch.randn(1)

Although this is just a sketch, it serves to connect with PyTorch's standard implementation using nn.Module. Next, a fully functional linear model is proposed:

# Third-party packages
import matplotlib.pyplot as plt
import numpy as np
import torch
from sklearn.model_selection import train_test_split
from torch import nn

class Linear(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.weight = nn.Parameter(data=torch.rand(1), requires_grad=True)
        self.bias = nn.Parameter(data=torch.rand(1), requires_grad=True)

    def forward(self, input_tensor: torch.Tensor) -> torch.Tensor:
        return self.weight * input_tensor + self.bias
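
This hand-written module behaves like torch.nn.Linear(1, 1) applied to inputs with a trailing feature dimension. A quick check, copying the parameters across (illustrative only; the class is re-stated so the snippet runs on its own):

```python
import torch
from torch import nn

# Re-stated custom module
class Linear(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.weight = nn.Parameter(data=torch.rand(1), requires_grad=True)
        self.bias = nn.Parameter(data=torch.rand(1), requires_grad=True)

    def forward(self, input_tensor: torch.Tensor) -> torch.Tensor:
        return self.weight * input_tensor + self.bias

custom = Linear()
builtin = nn.Linear(in_features=1, out_features=1)

# Copy the custom parameters into the built-in layer
with torch.no_grad():
    builtin.weight.copy_(custom.weight.view(1, 1))
    builtin.bias.copy_(custom.bias)

x = torch.linspace(0, 1, 5)
out_custom = custom(x)                              # shape (5,)
out_builtin = builtin(x.unsqueeze(-1)).squeeze(-1)  # shape (5,)
print(torch.allclose(out_custom, out_builtin))      # True
```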

The available device is checked:

device = "cuda" if torch.cuda.is_available() else "cpu"
device

Synthetic data following a linear relationship is generated:

start = 0
end = 1
steps = 0.02
X = np.arange(start, end, steps)

bias = 0.3
weight = 0.7
y = weight * X + bias

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

X_train = torch.from_numpy(X_train.astype(np.float32))
X_test = torch.from_numpy(X_test.astype(np.float32))
y_train = torch.from_numpy(y_train.astype(np.float32))
y_test = torch.from_numpy(y_test.astype(np.float32))

plt.scatter(X_train, y_train, c="b", s=4, label="Training")
plt.legend()
plt.show()

plt.scatter(X_test, y_test, c="g", s=4, label="Testing")
plt.legend()
plt.show()

The model is initialized and its parameters are inspected:

linear_model = Linear()
list(linear_model.parameters())
linear_model.state_dict()

Before training, the model is evaluated on the test set:

linear_model.eval()
with torch.no_grad():
    predictions = linear_model(X_test)

predictions

Here an important distinction is introduced, between torch.no_grad() and torch.inference_mode(). From PyTorch's documentation:

  • no_grad: Disables gradient tracking during the block, which avoids storing information for autograd.
  • inference_mode: Analogous to no_grad but stricter and more efficient: it also disables view tracking and version counting, and guarantees that tensors created in this context cannot later be used in autograd-tracked computations (doing so raises an error).

In practice, inference_mode is recommended for inference code, where it is known that the model will not be trained or updated. This reduces overhead and increases safety against accidental parameter modifications:

with torch.inference_mode():
    predictions_2 = linear_model(X_test)

predictions_2
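
The extra strictness can be observed directly: a tensor created under inference_mode is marked as an inference tensor, and reusing it in an autograd-tracked computation raises a RuntimeError (a small illustrative check, separate from the model above):

```python
import torch

x = torch.ones(3, requires_grad=True)

with torch.no_grad():
    y_no_grad = x * 2    # regular tensor, no graph recorded

with torch.inference_mode():
    y_inference = x * 2  # inference tensor

print(y_no_grad.is_inference())    # False
print(y_inference.is_inference())  # True

# Mixing an inference tensor into an autograd computation fails
try:
    (y_inference * x).sum().backward()
except RuntimeError as error:
    print(f"RuntimeError: {error}")
```

The no_grad result, in contrast, is an ordinary tensor and can still participate in later autograd computations.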

plt.scatter(X_test, predictions, c="r", s=4, label="Predictions (no_grad)")
plt.scatter(X_test, y_test, c="b", s=4, label="Real")
plt.legend()
plt.show()

A loss function and optimizer based on PyTorch are defined:

loss_fn = nn.L1Loss()  # Mean absolute error
optimizer = torch.optim.SGD(linear_model.parameters(), lr=0.01)

Next, the model is trained for several epochs, iterating over training data and evaluating on test data:

num_epochs: int = 50

for epoch in range(num_epochs):
    epoch_losses_train = []
    epoch_losses_test = []

    # Training phase
    linear_model.train()
    for x, y_true in zip(X_train, y_train):
        optimizer.zero_grad()

        output_model = linear_model(x)
        loss = loss_fn(output_model, y_true.unsqueeze(0))

        loss.backward()
        optimizer.step()

        epoch_losses_train.append(loss.item())

    # Evaluation phase
    linear_model.eval()
    with torch.inference_mode():
        for x, y_true in zip(X_test, y_test):
            output_model = linear_model(x)
            loss = loss_fn(output_model, y_true.unsqueeze(0))
            epoch_losses_test.append(loss.item())

    print(
        f"Epoch: {epoch+1}, "
        f"Train Loss: {np.mean(epoch_losses_train):.4f}, "
        f"Test Loss: {np.mean(epoch_losses_test):.4f}"
    )

After training, final predictions are compared with real data:

with torch.inference_mode():
    predictions_trained = linear_model(X_test)

plt.scatter(X_test, predictions_trained, c="r", s=4, label="Predictions")
plt.scatter(X_test, y_test, c="b", s=4, label="Real")
plt.legend()
plt.show()

Finally, it is illustrated how to save and load the trained model:

# Save only the state dict
torch.save(linear_model.state_dict(), "linear_model_state.pth")

# Load the state dict
linear_model_loaded = Linear()  # Create a new instance
linear_model_loaded.load_state_dict(
    torch.load("linear_model_state.pth", weights_only=True)
)
linear_model_loaded.eval()

with torch.inference_mode():
    predictions_loaded = linear_model_loaded(X_test)

plt.scatter(X_test, predictions_loaded, c="r", s=4, label="Predictions (loaded)")
plt.scatter(X_test, y_test, c="b", s=4, label="Real")
plt.legend()
plt.show()