Gradient Descent
Gradient descent is the core of training algorithms in machine learning and deep learning. In essence, it is an iterative procedure that adjusts model parameters in the direction opposite to the gradient of the cost function, in order to minimize it. This section first presents a purely numerical example in two dimensions, to visualize descent trajectories, and then several practical examples in PyTorch that show how the gradient is used to learn the parameters of simple models.
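The update rule that every example in this section instantiates can be written compactly as:

\[
\theta_{t+1} = \theta_t - \eta \, \nabla J(\theta_t)
\]

where \(\theta_t\) denotes the parameters at iteration \(t\), \(\eta > 0\) is the learning rate, and \(J\) is the cost function being minimized.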
Example 1: Gradient Descent in a Two-Dimensional Landscape
In this first example, a nonlinear function of two variables is defined and its gradients are calculated analytically. From several random initial points, gradient descent is applied and the trajectories are visualized in the parameter plane, which provides a geometric idea of the optimization process.
The function considered is:

\[
f(x_1, x_2) = \sin(x_1)\cos(x_2) + \sin(0.5\,x_1)\cos(0.5\,x_2)
\]

implemented in NumPy as:
# Third-party libraries
import matplotlib.pyplot as plt
import numpy as np
# Function definition
def function(input: np.ndarray) -> np.ndarray:
    assert input.shape[-1] == 2, "The input must contain 2 elements"
    return np.sin(input[:, 0]) * np.cos(input[:, 1]) + np.sin(
        0.5 * input[:, 0]
    ) * np.cos(0.5 * input[:, 1])
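Written out, the gradient components that the code below implements are:

\[
\frac{\partial f}{\partial x_1} = \cos(x_1)\cos(x_2) + 0.5\cos(0.5\,x_1)\cos(0.5\,x_2)
\]

\[
\frac{\partial f}{\partial x_2} = -\sin(x_1)\sin(x_2) - 0.5\sin(0.5\,x_1)\sin(0.5\,x_2)
\]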
Next, the partial derivatives are defined analytically, that is, the gradient \(\nabla f(x_1, x_2) = (\partial f/\partial x_1, \partial f/\partial x_2)\):
# Gradient calculation (partial derivatives)
def gradient_fn(input: np.ndarray) -> np.ndarray:
    assert input.shape[-1] == 2, "The input must contain 2 elements"
    df_x1 = np.cos(input[:, 0]) * np.cos(input[:, 1]) + 0.5 * np.cos(
        0.5 * input[:, 0]
    ) * np.cos(0.5 * input[:, 1])
    df_x2 = -np.sin(input[:, 0]) * np.sin(input[:, 1]) - 0.5 * np.sin(
        0.5 * input[:, 0]
    ) * np.sin(0.5 * input[:, 1])
    return np.stack([df_x1, df_x2], axis=1)
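Hand-derived gradients are easy to get subtly wrong, so it is worth checking them against central finite differences. The following self-contained sketch repeats the two definitions above and compares the analytic gradient with a numerical estimate at random points (numerical_gradient is a helper introduced here only for this check):

```python
import numpy as np

np.random.seed(0)


def function(input: np.ndarray) -> np.ndarray:
    return np.sin(input[:, 0]) * np.cos(input[:, 1]) + np.sin(
        0.5 * input[:, 0]
    ) * np.cos(0.5 * input[:, 1])


def gradient_fn(input: np.ndarray) -> np.ndarray:
    df_x1 = np.cos(input[:, 0]) * np.cos(input[:, 1]) + 0.5 * np.cos(
        0.5 * input[:, 0]
    ) * np.cos(0.5 * input[:, 1])
    df_x2 = -np.sin(input[:, 0]) * np.sin(input[:, 1]) - 0.5 * np.sin(
        0.5 * input[:, 0]
    ) * np.sin(0.5 * input[:, 1])
    return np.stack([df_x1, df_x2], axis=1)


def numerical_gradient(X: np.ndarray, h: float = 1e-5) -> np.ndarray:
    # Central differences along each coordinate
    grads = np.zeros_like(X)
    for j in range(X.shape[1]):
        X_plus, X_minus = X.copy(), X.copy()
        X_plus[:, j] += h
        X_minus[:, j] -= h
        grads[:, j] = (function(X_plus) - function(X_minus)) / (2 * h)
    return grads


X = np.random.rand(4, 2) * 10
assert np.allclose(gradient_fn(X), numerical_gradient(X), atol=1e-6)
```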
The gradient descent algorithm is implemented as:
# Gradient descent algorithm
def gradient_descent(
    num_points: int = 10,
    num_iterations: int = 30,
    learning_rate: float = 1e-3,
):
    dim = 2
    # Random initialization in the domain [0, 10] x [0, 10]
    X = np.random.rand(num_points, dim) * 10
    trajectories = [X.copy()]
    for _ in range(num_iterations):
        X = X - learning_rate * gradient_fn(input=X)
        trajectories.append(X.copy())
    return np.array(trajectories)
The algorithm is executed for several initial points and their trajectories are plotted in the \((x_1, x_2)\) plane:
# Execute gradient descent
trajectory = gradient_descent(num_points=5, num_iterations=30)
# Visualize trajectories in 2D plane
for i in range(trajectory.shape[1]):
    plt.plot(trajectory[:, i, 0], trajectory[:, i, 1], marker="o")
plt.xlabel("x1")
plt.ylabel("x2")
plt.title("Gradient Descent Trajectories")
plt.grid()
plt.show()
Each curve shows how a point moves iteratively in the descent direction of \(f\). This example visually illustrates the fundamental idea: the gradient indicates the direction of maximum increase, and the algorithm moves in the opposite direction to approach function minima.
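A useful sanity check on this picture is that \(f\) actually decreases along each trajectory. The sketch below repeats the definitions so it runs standalone, then verifies the monotone decrease (the seed, tolerance, and point count are choices made here, not part of the original example):

```python
import numpy as np

np.random.seed(0)


def function(input: np.ndarray) -> np.ndarray:
    return np.sin(input[:, 0]) * np.cos(input[:, 1]) + np.sin(
        0.5 * input[:, 0]
    ) * np.cos(0.5 * input[:, 1])


def gradient_fn(input: np.ndarray) -> np.ndarray:
    df_x1 = np.cos(input[:, 0]) * np.cos(input[:, 1]) + 0.5 * np.cos(
        0.5 * input[:, 0]
    ) * np.cos(0.5 * input[:, 1])
    df_x2 = -np.sin(input[:, 0]) * np.sin(input[:, 1]) - 0.5 * np.sin(
        0.5 * input[:, 0]
    ) * np.sin(0.5 * input[:, 1])
    return np.stack([df_x1, df_x2], axis=1)


X = np.random.rand(5, 2) * 10
values = [function(X)]
for _ in range(30):
    X = X - 1e-3 * gradient_fn(X)
    values.append(function(X))

values = np.array(values)  # shape (31, 5): iterations x points
# With a sufficiently small learning rate, every point descends monotonically
assert np.all(values[1:] <= values[:-1] + 1e-9)
```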
Example 2: Fitting a Quadratic Function in PyTorch
In the second example, it is shown how to apply gradient descent in PyTorch to fit a quadratic function to synthetically generated data. A relationship between time and velocity is simulated that approximately follows a parabola, with added noise:
# Third-party libraries
import matplotlib.pyplot as plt
import torch
# Synthetic data
time_steps = torch.arange(0, 20).float()
velocity = torch.randn(20) * 3 + 0.75 * (time_steps - 9.5) ** 2 + 1
plt.scatter(time_steps, velocity)
plt.xlabel("Time")
plt.ylabel("Velocity")
plt.title("Synthetic data (time vs. velocity)")
plt.show()
velocity.shape, time_steps.shape
The assumed model is a quadratic function of the form

\[
v(t) = a\,t^2 + b\,t + c
\]

where \((a, b, c)\) are learnable parameters:
def quadratic_fn(time_step: torch.Tensor, parameters: torch.Tensor) -> torch.Tensor:
    a, b, c = parameters
    return a * (time_step**2) + b * time_step + c


def loss_function(predicted: torch.Tensor, real: torch.Tensor) -> torch.Tensor:
    return (real - predicted).square().mean()
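This loss is the mean squared error, so it should agree with PyTorch's built-in nn.MSELoss; a quick self-contained check:

```python
import torch
from torch import nn


def loss_function(predicted: torch.Tensor, real: torch.Tensor) -> torch.Tensor:
    return (real - predicted).square().mean()


predicted = torch.randn(20)
real = torch.randn(20)
# Both compute the mean of squared residuals
assert torch.allclose(loss_function(predicted, real), nn.MSELoss()(predicted, real))
```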
Parameters are initialized randomly and the initial prediction is observed:
parameters = torch.randn(3, requires_grad=True)
parameters
predictions = quadratic_fn(time_step=time_steps, parameters=parameters)
predictions
To visualize the fit, an auxiliary function is defined:
def show_preds(time_steps, real, preds: torch.Tensor):
    plt.scatter(time_steps, real, color="blue", label="Real")
    plt.scatter(
        time_steps,
        preds.detach().cpu().numpy(),
        color="red",
        label="Predicted",
    )
    plt.legend()
    plt.show()
show_preds(time_steps, velocity, predictions)
The initial loss is calculated as:
loss_val = loss_function(predictions, velocity)
loss_val
Next, a manual gradient descent step is applied: the gradient is calculated using backward(), parameters are updated, and gradients are reset:
# Calculate gradients
loss_val.backward()
parameters.grad
# Gradient descent step
lr = 1e-5
parameters.data = parameters.data - lr * parameters.grad.data
parameters.grad = None
# New prediction after update
predictions = quadratic_fn(time_step=time_steps, parameters=parameters)
show_preds(time_steps, velocity, predictions)
To repeat this process systematically, it is encapsulated in a function:
def apply_step_training(
    time_steps,
    learnable_params,
    target_data,
    lr: float = 1e-5,
):
    predictions = quadratic_fn(time_step=time_steps, parameters=learnable_params)
    loss_val = loss_function(predicted=predictions, real=target_data)
    loss_val.backward()
    # Update parameters without gradient tracking
    with torch.no_grad():
        learnable_params -= lr * learnable_params.grad
    # Reset gradients
    learnable_params.grad.zero_()
    show_preds(time_steps, target_data, predictions)
    return predictions, learnable_params, loss_val
Training is executed for several epochs:
# Third-party libraries
from tqdm import tqdm
num_epochs = 20
learnable_params = torch.randn(3, requires_grad=True)
for epoch in tqdm(range(num_epochs)):
    predictions, learnable_params, loss_val = apply_step_training(
        time_steps=time_steps,
        learnable_params=learnable_params,
        target_data=velocity,
    )
    print(f"Epoch {epoch+1}, loss: {loss_val}")
This flow illustrates the key training components in PyTorch:
- Definition of a differentiable function.
- Loss calculation.
- Call to backward() to obtain gradients.
- Manual parameter update within a torch.no_grad() context.
- Gradient reset before the next iteration.
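As a variation, the same manual loop can be expressed with torch.optim.SGD, which takes over the update and the gradient reset; the sketch below repeats the data generation and model so it runs standalone (the seed and number of steps are arbitrary choices made here):

```python
import torch

torch.manual_seed(0)


def quadratic_fn(time_step: torch.Tensor, parameters: torch.Tensor) -> torch.Tensor:
    a, b, c = parameters
    return a * (time_step**2) + b * time_step + c


# Same synthetic data as above
time_steps = torch.arange(0, 20).float()
velocity = torch.randn(20) * 3 + 0.75 * (time_steps - 9.5) ** 2 + 1

learnable_params = torch.randn(3, requires_grad=True)
optimizer = torch.optim.SGD([learnable_params], lr=1e-5)

losses = []
for _ in range(20):
    # zero_grad replaces the manual gradient reset
    optimizer.zero_grad()
    predictions = quadratic_fn(time_step=time_steps, parameters=learnable_params)
    loss_val = (velocity - predictions).square().mean()
    loss_val.backward()
    # step replaces the manual update inside torch.no_grad()
    optimizer.step()
    losses.append(loss_val.item())

assert losses[-1] < losses[0]
```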
Example 3: Manually Implemented Linear Layer and Simple Linear Module
In this part, two complementary ideas are introduced: the abstraction of a linear layer
and the implementation of a linear model in PyTorch as a subclass of nn.Module.
First, a function that would represent a linear layer applied to an input is sketched:
def linear_layer(tensor_entrada: torch.Tensor) -> torch.Tensor:
    # tensor_entrada: (B, N)
    # w: (N,), assumed defined in the enclosing scope
    # b: scalar, assumed defined in the enclosing scope
    return tensor_entrada @ w + b
And a minimalist class:
class LinearLayer:
    def __init__(self, input_shape: int) -> None:
        self.w = torch.randn(input_shape)
        self.b = torch.randn(1)
Although this is just a sketch, it serves to connect with PyTorch's standard
implementation using nn.Module. Next, a fully functional linear model is proposed:
# Third-party libraries
import matplotlib.pyplot as plt
import numpy as np
import torch
from sklearn.model_selection import train_test_split
from torch import nn
class Linear(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.weight = nn.Parameter(data=torch.rand(1), requires_grad=True)
        self.bias = nn.Parameter(data=torch.rand(1), requires_grad=True)

    def forward(self, input_tensor: torch.Tensor) -> torch.Tensor:
        return self.weight * input_tensor + self.bias
The available device is checked:
device = "cuda" if torch.cuda.is_available() else "cpu"
device
Synthetic data following a linear relationship is generated:
start = 0
end = 1
steps = 0.02
X = np.arange(start, end, steps)
bias = 0.3
weight = 0.7
y = weight * X + bias
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
X_train = torch.from_numpy(X_train.astype(np.float32))
X_test = torch.from_numpy(X_test.astype(np.float32))
y_train = torch.from_numpy(y_train.astype(np.float32))
y_test = torch.from_numpy(y_test.astype(np.float32))
plt.scatter(X_train, y_train, c="b", s=4, label="Training")
plt.legend()
plt.show()
plt.scatter(X_test, y_test, c="g", s=4, label="Testing")
plt.legend()
plt.show()
The model is initialized and its parameters are inspected:
linear_model = Linear()
list(linear_model.parameters())
linear_model.state_dict()
Before training, the model is evaluated on the test set:
linear_model.eval()
with torch.no_grad():
    predictions = linear_model(X_test)
predictions
Here an important distinction is introduced: torch.no_grad() and torch.inference_mode(). From PyTorch's documentation:
- no_grad: disables gradient tracking during the block, which avoids storing information for autograd.
- inference_mode: analogous to no_grad but stricter and more efficient: it also disables view tracking and version counting, and ensures that tensors created in this context are not subsequently used in computations with autograd.
In practice, inference_mode is recommended for inference code, where it is known that
the model will not be trained or updated. This reduces overhead and increases safety
against accidental parameter modifications:
with torch.inference_mode():
    predictions_2 = linear_model(X_test)
predictions_2
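The difference between the two contexts can be observed directly on small tensors: a tensor produced under no_grad is merely detached from the autograd graph, while one produced under inference_mode is additionally flagged as an inference tensor:

```python
import torch

x = torch.ones(3, requires_grad=True)

with torch.no_grad():
    y = x * 2
with torch.inference_mode():
    z = x * 2

# Both outputs are detached from the graph
assert y.requires_grad is False
assert z.requires_grad is False
# Only the inference_mode output carries the inference flag
assert y.is_inference() is False
assert z.is_inference() is True
```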
plt.scatter(X_test, predictions, c="r", s=4, label="Predictions (no_grad)")
plt.scatter(X_test, y_test, c="b", s=4, label="Real")
plt.legend()
plt.show()
A loss function and optimizer based on PyTorch are defined:
loss_fn = nn.L1Loss() # Mean absolute error
optimizer = torch.optim.SGD(linear_model.parameters(), lr=0.01)
Next, the model is trained for several epochs, iterating over training data and evaluating on test data:
num_epochs: int = 50
for epoch in range(num_epochs):
    epoch_losses_train = []
    epoch_losses_test = []
    # Training phase
    linear_model.train()
    for x, y_true in zip(X_train, y_train):
        optimizer.zero_grad()
        output_model = linear_model(x)
        loss = loss_fn(output_model, y_true.unsqueeze(0))
        loss.backward()
        optimizer.step()
        epoch_losses_train.append(loss.item())
    # Evaluation phase
    linear_model.eval()
    with torch.inference_mode():
        for x, y_true in zip(X_test, y_test):
            output_model = linear_model(x)
            loss = loss_fn(output_model, y_true.unsqueeze(0))
            epoch_losses_test.append(loss.item())
    print(
        f"Epoch: {epoch+1}, "
        f"Train Loss: {np.mean(epoch_losses_train):.4f}, "
        f"Test Loss: {np.mean(epoch_losses_test):.4f}"
    )
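The loop above iterates sample by sample for clarity; with a dataset this small, the model can also be trained full-batch, computing a single loss over all training points per epoch. A self-contained sketch under the same data-generating assumptions (the seed and epoch count are arbitrary choices made here):

```python
import torch
from torch import nn

torch.manual_seed(42)

# Same synthetic linear relationship as above, without the train/test split
X_train = torch.arange(0, 1, 0.02, dtype=torch.float32)
y_train = 0.7 * X_train + 0.3


class Linear(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.weight = nn.Parameter(torch.rand(1))
        self.bias = nn.Parameter(torch.rand(1))

    def forward(self, input_tensor: torch.Tensor) -> torch.Tensor:
        return self.weight * input_tensor + self.bias


linear_model = Linear()
loss_fn = nn.L1Loss()
optimizer = torch.optim.SGD(linear_model.parameters(), lr=0.01)

losses = []
for epoch in range(200):
    optimizer.zero_grad()
    # One loss over the whole training set per epoch
    loss = loss_fn(linear_model(X_train), y_train)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

assert losses[-1] < losses[0]
```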
After training, final predictions are compared with real data:
with torch.inference_mode():
    predictions_trained = linear_model(X_test)
plt.scatter(X_test, predictions_trained, c="r", s=4, label="Predictions")
plt.scatter(X_test, y_test, c="b", s=4, label="Real")
plt.legend()
plt.show()
Finally, it is illustrated how to save and load the trained model:
# Save only the state dict
torch.save(linear_model.state_dict(), "linear_model_state.pth")
# Load the state dict
linear_model_loaded = Linear() # Create a new instance
linear_model_loaded.load_state_dict(
    torch.load("linear_model_state.pth", weights_only=True)
)
linear_model_loaded.eval()
with torch.inference_mode():
    predictions_loaded = linear_model_loaded(X_test)
plt.scatter(X_test, predictions_loaded, c="r", s=4, label="Predictions (loaded)")
plt.scatter(X_test, y_test, c="b", s=4, label="Real")
plt.legend()
plt.show()
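A final sanity check after loading is that the restored model reproduces the original model's outputs exactly. A self-contained sketch using the same Linear class and a temporary file (the temporary path is a choice made here):

```python
import tempfile
from pathlib import Path

import torch
from torch import nn


class Linear(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.weight = nn.Parameter(torch.rand(1))
        self.bias = nn.Parameter(torch.rand(1))

    def forward(self, input_tensor: torch.Tensor) -> torch.Tensor:
        return self.weight * input_tensor + self.bias


model = Linear()
X = torch.linspace(0, 1, 30)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "linear_model_state.pth"
    torch.save(model.state_dict(), path)
    loaded = Linear()  # New instance with different random parameters
    loaded.load_state_dict(torch.load(path, weights_only=True))

# The loaded model matches the original bit for bit
with torch.inference_mode():
    assert torch.equal(model(X), loaded(X))
```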