PyTorch Geometric

Introduction

PyTorch Geometric (PyG) is a library built on top of PyTorch that provides efficient tools for deep learning on graph-structured data. It offers a collection of graph-specific data structures, common benchmark datasets, useful transformations, and implementations of state-of-the-art graph neural network layers. This tutorial introduces the core components of PyG through practical examples, covering data representation, built-in datasets, message passing layers, and a complete node classification pipeline.

Imports

import torch
import torch.nn.functional as F
from torch_geometric.data import Data
from torch_geometric.datasets import Planetoid
from torch_geometric.nn import GCNConv
from torch_geometric.utils import to_dense_adj

Graph Representation with Data

In PyG, a graph is represented using the torch_geometric.data.Data object. At minimum, a graph requires an edge index tensor of shape [2, num_edges] that encodes source and destination nodes for each edge. Node features, edge features, labels, and any other tensor can be attached as additional attributes.

The following example constructs a small undirected cycle over 4 nodes. Because PyG stores each edge as a directed entry, an undirected edge must appear in both directions in the edge index. Each node has a 2-dimensional feature vector, and a single integer label is assigned to each node.

# Edge list in COO format: each column is a directed edge (src, dst).
# The 4 undirected edges of the cycle are stored in both directions.
edge_index = torch.tensor(
    [[0, 1, 1, 2, 2, 3, 3, 0],
     [1, 0, 2, 1, 3, 2, 0, 3]],
    dtype=torch.long,
)

# Node feature matrix: 4 nodes, 2 features each
x = torch.tensor(
    [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0],
     [0.0, 0.0]],
    dtype=torch.float,
)

# Node labels
y = torch.tensor([0, 1, 0, 1], dtype=torch.long)

data = Data(x=x, edge_index=edge_index, y=y)
print(data)
print(f"Number of nodes: {data.num_nodes}")
print(f"Number of edges: {data.num_edges}")
print(f"Number of node features: {data.num_node_features}")
print(f"Has isolated nodes: {data.has_isolated_nodes()}")
print(f"Has self-loops: {data.has_self_loops()}")
print(f"Is undirected: {data.is_undirected()}")

The edge index uses COO (Coordinate) format, which is memory-efficient for sparse graphs. Note that for an undirected graph, each edge must appear in both directions. The dense adjacency matrix can be recovered using the to_dense_adj utility:

adj = to_dense_adj(edge_index, max_num_nodes=data.num_nodes)
print("Adjacency matrix:\n", adj.squeeze(0))
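The scatter that to_dense_adj performs can be sketched in plain Python. The following is a minimal version, assuming a single graph and the 4-node cycle with each undirected edge stored in both directions:

```python
# Minimal pure-Python sketch of what to_dense_adj does for one graph:
# scatter a 1 into an N x N matrix at each (src, dst) position.
edge_index = [[0, 1, 1, 2, 2, 3, 3, 0],   # source nodes
              [1, 0, 2, 1, 3, 2, 0, 3]]   # destination nodes
num_nodes = 4

adj = [[0.0] * num_nodes for _ in range(num_nodes)]
for src, dst in zip(*edge_index):
    adj[src][dst] = 1.0
# the matrix is symmetric because every edge is stored in both directions
```

Since both directions of every edge are present, the resulting matrix is symmetric, which is exactly what is_undirected() checks for.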

Loading a Benchmark Dataset

PyG provides access to many standard graph learning benchmarks. The Planetoid collection includes the Cora, CiteSeer, and PubMed citation network datasets, which are widely used for semi-supervised node classification. In these datasets, each node represents a document, edges represent citation links, and node features are bag-of-words vectors. The task is to predict the topic category of each document.

dataset = Planetoid(root="/tmp/Cora", name="Cora")

print(f"Dataset: {dataset}")
print(f"Number of graphs: {len(dataset)}")
print(f"Number of classes: {dataset.num_classes}")
print(f"Number of node features: {dataset.num_node_features}")

cora = dataset[0]
print("\nGraph properties:")
print(f"  Nodes: {cora.num_nodes}")
print(f"  Edges: {cora.num_edges}")
print(f"  Training nodes: {cora.train_mask.sum().item()}")
print(f"  Validation nodes: {cora.val_mask.sum().item()}")
print(f"  Test nodes: {cora.test_mask.sum().item()}")

The dataset provides boolean masks (train_mask, val_mask, test_mask) that indicate which nodes belong to each split. This is the standard semi-supervised setting where only a small fraction of nodes have labels available during training.
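How a boolean mask restricts supervision can be sketched with plain Python lists. The toy labels and mask below are illustrative, not taken from Cora:

```python
# Hedged sketch: a boolean mask selects the labeled subset of nodes.
labels = [0, 2, 1, 1, 0, 2]
train_mask = [True, False, True, False, False, True]

# only the masked entries would contribute to the training loss
supervised = [y for y, m in zip(labels, train_mask) if m]

# fraction of nodes with labels available during training
label_rate = sum(train_mask) / len(train_mask)
```

For Cora, the analogous label rate is 140 training nodes out of 2708, i.e. roughly 5% of nodes are labeled.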

Graph Convolutional Network

A Graph Convolutional Network (GCN) applies learned linear transformations followed by neighborhood aggregation at each layer. The GCNConv layer in PyG implements the propagation rule introduced by Kipf and Welling (2017):

\[\mathbf{X}^{(l+1)} = \hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2} \mathbf{X}^{(l)} \mathbf{W}^{(l)}\]

where \(\hat{A} = A + I\) is the adjacency matrix with added self-loops and \(\hat{D}\) is its diagonal degree matrix. The following model stacks two GCNConv layers with a ReLU activation and dropout in between:

class GCN(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels):
        super().__init__()
        self.conv1 = GCNConv(in_channels, hidden_channels)
        self.conv2 = GCNConv(hidden_channels, out_channels)

    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.conv2(x, edge_index)
        return x
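The normalization in the propagation rule can be verified by hand on a toy graph. A pure-Python sketch (no PyG) applying \(\hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2}\) to a 4-node cycle:

```python
# Hedged sketch: symmetric normalization on the 4-node cycle 0-1-2-3-0.
A = [[0, 1, 0, 1],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [1, 0, 1, 0]]
n = len(A)

# A_hat = A + I (add self-loops)
A_hat = [[A[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]

# D_hat is diagonal with the row sums of A_hat; every node has degree 3 here
deg = [sum(row) for row in A_hat]

# D^{-1/2} A_hat D^{-1/2}: divide each entry by sqrt(deg_i * deg_j)
norm = [[A_hat[i][j] / (deg[i] ** 0.5 * deg[j] ** 0.5) for j in range(n)]
        for i in range(n)]
# every nonzero entry is 1/3, so each row sums to 1: on this graph the
# layer averages a node's features with its neighbors' before the
# linear transform
```

GCNConv applies this normalization internally using sparse operations, so the dense matrices here are purely for illustration.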

Training and Evaluation

The training loop follows the standard PyTorch pattern, with the key difference that the loss is computed only on the masked training nodes. Similarly, evaluation is performed on the validation and test masks.

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = GCN(dataset.num_node_features, 16, dataset.num_classes).to(device)
data = dataset[0].to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)

model.train()
for epoch in range(200):
    optimizer.zero_grad()
    out = model(data.x, data.edge_index)
    loss = F.cross_entropy(out[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()

    if (epoch + 1) % 50 == 0:
        print(f"Epoch {epoch + 1:03d}, Loss: {loss.item():.4f}")

model.eval()
with torch.no_grad():
    out = model(data.x, data.edge_index)
    pred = out.argmax(dim=1)

    for split, mask in [("Train", data.train_mask), ("Val", data.val_mask), ("Test", data.test_mask)]:
        correct = (pred[mask] == data.y[mask]).sum().item()
        total = mask.sum().item()
        print(f"{split} Accuracy: {correct / total:.4f}")
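Since test accuracy should only be read once, a common refinement is to select the model by its validation accuracy across epochs. A minimal sketch of that bookkeeping, where val_history stands in for per-epoch evaluation results (the numbers are illustrative dummies):

```python
# Hedged sketch: pick the epoch with the best validation accuracy.
val_history = [0.62, 0.71, 0.78, 0.76, 0.79, 0.77]

best_val, best_epoch = 0.0, -1
for epoch, acc in enumerate(val_history):
    if acc > best_val:
        best_val, best_epoch = acc, epoch
# in practice, save model.state_dict() whenever best_epoch updates
# and restore it before computing the final test accuracy
```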