
Basic Mathematics and Automatic Differentiation

This section introduces some fundamental concepts of differential calculus applied to machine learning and illustrates how PyTorch computes gradients automatically through its autograd system. The objective is to connect the traditional mathematical formulation (symbolic calculus) with its practical implementation in code, and to show how these gradients are used in typical tasks such as linear regression, logistic regression, or multiclass classification.

The central idea is as follows: a differentiable function that depends on one or more tensors with requires_grad=True is defined, a scalar value is computed from them, and backward() is invoked. From that moment, PyTorch traverses the computational graph it has built internally and computes the partial derivatives of the scalar output with respect to each differentiable input, storing them in the .grad attribute of the corresponding tensors.
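
As a minimal sketch of this workflow (the function and values below are chosen purely for illustration):

import torch

# Tensor with gradient tracking enabled
t = torch.tensor(1.5, requires_grad=True)

# Scalar value computed from it: s = 3t^2
s = 3 * t**2

# Backward pass: PyTorch traverses the graph and stores ds/dt = 6t in t.grad
s.backward()
print(t.grad)  # tensor(9.)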

Gradient Calculation: PyTorch Versus SymPy

To illustrate the parallelism between symbolic calculus and automatic differentiation, consider the scalar function of two variables:

\[ f(x_1, x_2) = x_1^2 + 3 x_1 x_2 + x_2^2. \]

In PyTorch, a tensor x with two components is defined and gradient tracking is activated:

# 3pps
import sympy as sp
import torch

# Create input tensor with gradient tracking
x = torch.tensor([2.0, 3.0], requires_grad=True)

# Define the differentiable function: f(x1, x2) = x1^2 + 3*x1*x2 + x2^2
y = x[0] ** 2 + 3 * x[0] * x[1] + x[1] ** 2

# Calculate gradients
y.backward()

# Gradients with respect to each input
grad_x1 = x.grad[0]  # ∂f/∂x1
grad_x2 = x.grad[1]  # ∂f/∂x2

print("PyTorch gradients:")
print("Gradient ∂f/∂x1:", grad_x1)
print("Gradient ∂f/∂x2:", grad_x2)

PyTorch automatically constructs the operation graph that leads from x to y and, when invoking y.backward(), calculates the partial derivatives \(\frac{\partial f}{\partial x_1}\) and \(\frac{\partial f}{\partial x_2}\) at the specific point \(x = [2, 3]\). These derivatives are stored in x.grad.
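
As a quick sanity check, these values can be compared against the hand-derived expressions \(\frac{\partial f}{\partial x_1} = 2 x_1 + 3 x_2\) and \(\frac{\partial f}{\partial x_2} = 3 x_1 + 2 x_2\). The following sketch continues the snippet above, so x and x.grad are assumed to be defined:

# Hand-derived gradient at (x1=2, x2=3): [2*2 + 3*3, 3*2 + 2*3] = [13, 12]
expected = torch.tensor([13.0, 12.0])
print(torch.allclose(x.grad, expected))  # True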

In parallel, the same function can be represented symbolically with SymPy:

# Define symbolic variables
x1, x2 = sp.symbols("x1 x2")

# Define the same function symbolically
f = x1**2 + 3 * x1 * x2 + x2**2

# Calculate symbolic derivatives
df_dx1 = sp.diff(f, x1)
df_dx2 = sp.diff(f, x2)

print("SymPy derivative formulas:")
print("∂f/∂x1 =", df_dx1)
print("∂f/∂x2 =", df_dx2)

# Evaluate derivatives at point (x1=2, x2=3)
grad_x1_sym = df_dx1.evalf(subs={x1: 2, x2: 3})
grad_x2_sym = df_dx2.evalf(subs={x1: 2, x2: 3})

print("SymPy symbolic gradients evaluated at (x1=2, x2=3):")
print("Gradient x1:", grad_x1_sym)
print("Gradient x2:", grad_x2_sym)

SymPy provides closed-form symbolic expressions for the derivatives and allows evaluating them at specific points. Comparing the SymPy results with those from PyTorch shows that automatic differentiation matches the analytical derivatives, which helps validate the implementation and connect theory with practice.
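
One way to make that comparison explicit is to convert the SymPy values to Python floats and check that they agree with the PyTorch gradients. This sketch continues the two snippets above, so grad_x1, grad_x2, grad_x1_sym, and grad_x2_sym are assumed to be defined:

import math

# Both approaches should give 13 and 12 at the point (x1=2, x2=3)
print(math.isclose(float(grad_x1_sym), grad_x1.item()))  # True
print(math.isclose(float(grad_x2_sym), grad_x2.item()))  # True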

Examples

Below are several simple examples that illustrate how PyTorch calculates derivatives in different contexts: single-variable functions, multi-variable functions, chain rule application, and simple linear and logistic models. These examples give an intuitive picture of how the autograd system tracks operations and applies the rules of differential calculus.

# 3pps
import torch

# Example 1: Quadratic function
# y = x², dy/dx = 2x
x = torch.tensor(3.0, requires_grad=True)
y = x**2
y.backward()
print(f"y = x² | x={x.item()}, dy/dx={x.grad.item()}")

In this first case, the function is one-dimensional and simple. PyTorch automatically applies the power rule for derivatives and obtains dy/dx = 2x evaluated at x = 3.
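
A useful cross-check is a numerical finite-difference approximation of the same derivative. The sketch below (with an arbitrarily chosen step size h) approximates dy/dx at x = 3 and compares it with the value stored by autograd:

# Central finite difference for y = x^2 at x = 3: (f(x+h) - f(x-h)) / (2h)
h = 1e-4
numeric = ((3.0 + h) ** 2 - (3.0 - h) ** 2) / (2 * h)
print(numeric, x.grad.item())  # both approximately 6.0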

In a scenario with multiple variables, PyTorch calculates partial gradients:

# Example 2: Multiple variables
# z = 2a + 3b, dz/da = 2, dz/db = 3
a = torch.tensor(4.0, requires_grad=True)
b = torch.tensor(5.0, requires_grad=True)
z = 2 * a + 3 * b
z.backward()
print(f"z = 2a + 3b | dz/da={a.grad.item()}, dz/db={b.grad.item()}")

Here, a.grad contains \(\frac{\partial z}{\partial a}\) and b.grad contains \(\frac{\partial z}{\partial b}\), as expected from a linear function in two variables.

The chain rule is applied implicitly when the function is composed of several intermediate operations:

# Example 3: Chain rule
# y = (2x + 1)², dy/dx = 4(2x + 1)
x = torch.tensor(3.0, requires_grad=True)
y = (2 * x + 1) ** 2
y.backward()
print(f"y = (2x+1)² | x={x.item()}, dy/dx={x.grad.item()}")

In this case, PyTorch internally decomposes the function into elementary steps (multiplication, addition, power) and combines their derivatives following the chain rule, without the user needing to do it explicitly.
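
The same decomposition can be written out by hand to see how the pieces combine: with u = 2x + 1, the chain rule gives dy/dx = dy/du · du/dx = 2u · 2 = 4(2x + 1). The following sketch evaluates this expression at x = 3 and compares it with the autograd result:

# Manual chain rule for y = (2x + 1)^2 at x = 3
u = 2 * 3.0 + 1        # intermediate value u = 7
dy_du = 2 * u          # derivative of u^2 with respect to u
du_dx = 2              # derivative of 2x + 1 with respect to x
print(dy_du * du_dx, x.grad.item())  # both 28.0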

Linear Regression and Logistic Regression

Derivatives play a central role when working with linear and logistic models, since they quantify how the model output changes under small variations in the inputs or parameters. The following examples show how PyTorch calculates gradients with respect to the inputs in simple configurations.

In a linear model with two features, with weights w and bias b, the output is:

\[ y = w_1 x_1 + w_2 x_2 + b, \]

so the derivatives with respect to the inputs are \(\frac{\partial y}{\partial x_1} = w_1\) and \(\frac{\partial y}{\partial x_2} = w_2\):

# Example 4: Linear regression
# y = w·x + b, dy/dx = w
x = torch.tensor([2.0, 3.0], requires_grad=True)
w = torch.tensor([0.5, -1.0])
b = 2.0

y = w[0] * x[0] + w[1] * x[1] + b
y.backward()

print(f"Linear | dy/dx1={x.grad[0].item()}, dy/dx2={x.grad[1].item()}")

PyTorch reproduces these derivatives exactly: the gradient of the output with respect to each component of x matches the corresponding weight. This behavior is what generalizes to computing gradients with respect to the model parameters during training, as sketched below.
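
As an illustration of that generalization, the following sketch mirrors Example 4 but tracks gradients on the parameters rather than the inputs; for this linear model \(\frac{\partial y}{\partial w_i} = x_i\) and \(\frac{\partial y}{\partial b} = 1\):

# Same linear model, now differentiating with respect to the parameters
x = torch.tensor([2.0, 3.0])
w = torch.tensor([0.5, -1.0], requires_grad=True)
b = torch.tensor(2.0, requires_grad=True)

y = w[0] * x[0] + w[1] * x[1] + b
y.backward()
print(f"Linear (params) | dy/dw={w.grad.tolist()}, dy/db={b.grad.item()}")
# dy/dw=[2.0, 3.0], dy/db=1.0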

In the case of logistic regression, a sigmoid function is applied over the linear combination:

\[ z = w_1 x_1 + w_2 x_2 + b, \qquad y = \sigma(z) = \frac{1}{1 + e^{-z}}. \]

The derivative with respect to the inputs is given by the chain rule: \(\frac{\partial y}{\partial x_i} = \sigma'(z)\, w_i\), where \(\sigma'(z) = \sigma(z)\,(1 - \sigma(z))\). PyTorch handles this composition automatically:

# Example 5: Logistic regression
# y = σ(w·x + b), dy/dx = σ'(z)·w
x = torch.tensor([2.0, 3.0], requires_grad=True)
w = torch.tensor([0.5, -1.0])
b = 2.0
z = w[0] * x[0] + w[1] * x[1] + b
y = torch.sigmoid(z)

y.backward()
print(f"Logistic | dy/dx1={x.grad[0].item():.4f}, dy/dx2={x.grad[1].item():.4f}")

The values contained in x.grad reflect the sensitivity of the predicted probability with respect to each of the input features, and illustrate how the nonlinear activation function (the sigmoid) affects the gradient.
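
The chain-rule formula above can be verified directly. This sketch continues Example 5, so z, w, and x.grad are assumed to still be available:

# Analytical gradient: sigma'(z) * w, with sigma'(z) = sigma(z) * (1 - sigma(z))
with torch.no_grad():
    sig = torch.sigmoid(z)
    analytic = sig * (1 - sig) * w
print(torch.allclose(x.grad, analytic))  # True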

Multiclass Classification

In multiclass classification tasks, it is common to use a linear layer followed by a softmax function. The linear layer calculates a score or logit for each class, and softmax transforms these scores into probabilities that sum to 1. Below is a simple example with three input features and three output classes.

Consider an input vector x and a weight matrix W, where each column of W can be interpreted as the weight vector associated with a class. From them, the logits are obtained as a matrix product and softmax is applied:

# 3pps
import torch
import torch.nn.functional as F

# Input features
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Weight matrix for 3 classes
W = torch.tensor(
    [
        [0.2, -0.5, 0.3],
        [0.4, 0.1, -0.2],
        [0.1, 0.3, 0.2],
    ],
    requires_grad=False,
)
b = torch.tensor([0.0, 0.0, 0.0])

# Linear scores for each class: logits = W^T x + b
logits = torch.matmul(x, W) + b  # shape [3]

# Apply Softmax to obtain probabilities
probs = F.softmax(logits, dim=0)

# Select the probability of the predicted class (the highest)
pred_class_idx = probs.argmax()
top_prob = probs[pred_class_idx]

# Calculate gradients with respect to the input
top_prob.backward()

print("Multiclass Classification | Probabilities:", probs.detach().numpy())
print("Predicted class index:", pred_class_idx.item())
print("Gradients inputs:", x.grad.detach().numpy())

In this example, logits is a one-dimensional tensor of size 3 containing the linear score of each class. The F.softmax function transforms these logits into a probability vector. Next, the probability of the class with the highest value (top_prob) is selected and backward() is called to calculate the gradient of that probability with respect to the input vector x.
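
To make the softmax step concrete, the probabilities can also be reproduced by hand from the logits (a sketch continuing the snippet above, so logits and probs are assumed to be defined):

# Manual softmax: exponentiate the logits and normalize so the result sums to 1
with torch.no_grad():
    manual_probs = torch.exp(logits) / torch.exp(logits).sum()
print(torch.allclose(probs, manual_probs))  # True
print(probs.sum().item())  # approximately 1.0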

The resulting values in x.grad indicate how the probability of the predicted class would vary if each component of the input were slightly perturbed. This information can be used, for example, to analyze the model's sensitivity to input features or as the basis for explanation techniques and adversarial example generation.
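
As a small illustration of that sensitivity (a sketch continuing the snippet above, with an arbitrarily chosen step size eps), moving the input a small step along the gradient direction should slightly increase the probability of the predicted class:

# Perturb the input along the gradient direction and recompute the probabilities
with torch.no_grad():
    eps = 0.1
    x_perturbed = x + eps * x.grad
    new_probs = F.softmax(torch.matmul(x_perturbed, W) + b, dim=0)
print("Probability before:", top_prob.item())
print("Probability after: ", new_probs[pred_class_idx].item())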