PyTorch Geometric
Introduction
PyTorch Geometric (PyG) is a library built on top of PyTorch that provides efficient tools for deep learning on graph-structured data. It offers a collection of graph-specific data structures, common benchmark datasets, useful transformations, and implementations of state-of-the-art graph neural network layers. This tutorial introduces the core components of PyG through practical examples, covering data representation, built-in datasets, message passing layers, and a complete node classification pipeline.
Imports
import torch
import torch.nn.functional as F
from torch_geometric.data import Data
from torch_geometric.datasets import Planetoid
from torch_geometric.nn import GCNConv
from torch_geometric.utils import to_dense_adj
Graph Representation with Data
In PyG, a graph is represented using the torch_geometric.data.Data object. At minimum,
a graph requires an edge index tensor of shape [2, num_edges] that encodes source and
destination nodes for each edge. Node features, edge features, labels, and any other
tensor can be attached as additional attributes.
The following example constructs a small undirected cycle with 4 nodes. Because PyG stores edges as a directed list, each of the 4 undirected edges must appear in both directions, giving 8 columns in the edge index. Each node has a 2-dimensional feature vector, and a single integer label is assigned to each node.
# Edge list in COO format: each column is a directed edge (src, dst);
# every undirected edge is stored in both directions
edge_index = torch.tensor(
    [[0, 1, 1, 2, 2, 3, 3, 0],
     [1, 0, 2, 1, 3, 2, 0, 3]],
    dtype=torch.long,
)
# Node feature matrix: 4 nodes, 2 features each
x = torch.tensor(
    [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0],
     [0.0, 0.0]],
    dtype=torch.float,
)
# Node labels
y = torch.tensor([0, 1, 0, 1], dtype=torch.long)
data = Data(x=x, edge_index=edge_index, y=y)
print(data)
print(f"Number of nodes: {data.num_nodes}")
print(f"Number of edges: {data.num_edges}")
print(f"Number of node features: {data.num_node_features}")
print(f"Has isolated nodes: {data.has_isolated_nodes()}")
print(f"Has self-loops: {data.has_self_loops()}")
print(f"Is undirected: {data.is_undirected()}")
The edge index uses COO (Coordinate) format, which is memory-efficient for sparse graphs.
Note that for an undirected graph, each edge must appear in both directions. The dense
adjacency matrix can be recovered using the to_dense_adj utility:
adj = to_dense_adj(edge_index, max_num_nodes=data.num_nodes)
print("Adjacency matrix:\n", adj.squeeze(0))
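To make the COO layout concrete, the same conversion can be sketched in plain PyTorch on a toy graph (a minimal illustration, not PyG's actual implementation of to_dense_adj):

```python
import torch

# Toy 4-node directed cycle in COO format: column i is the edge (src[i], dst[i])
edge_index = torch.tensor([[0, 1, 2, 3],
                           [1, 2, 3, 0]], dtype=torch.long)

num_nodes = 4
adj = torch.zeros(num_nodes, num_nodes)
# Write a 1 into each (src, dst) cell of the dense matrix
adj[edge_index[0], edge_index[1]] = 1.0

print(adj)
```

Each column of edge_index becomes exactly one nonzero entry in the dense matrix, which is why COO storage scales with the number of edges rather than the square of the number of nodes.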
Loading a Benchmark Dataset
PyG provides access to many standard graph learning benchmarks. The Planetoid collection includes the Cora, CiteSeer, and PubMed citation network datasets, which are widely used for semi-supervised node classification. In these datasets, each node represents a document, edges represent citation links, and node features are bag-of-words vectors. The task is to predict the topic category of each document.
dataset = Planetoid(root="/tmp/Cora", name="Cora")
print(f"Dataset: {dataset}")
print(f"Number of graphs: {len(dataset)}")
print(f"Number of classes: {dataset.num_classes}")
print(f"Number of node features: {dataset.num_node_features}")
cora = dataset[0]
print(f"\nGraph properties:")
print(f" Nodes: {cora.num_nodes}")
print(f" Edges: {cora.num_edges}")
print(f" Training nodes: {cora.train_mask.sum().item()}")
print(f" Validation nodes: {cora.val_mask.sum().item()}")
print(f" Test nodes: {cora.test_mask.sum().item()}")
The dataset provides boolean masks (train_mask, val_mask, test_mask) that indicate
which nodes belong to each split. This is the standard semi-supervised setting where only
a small fraction of nodes have labels available during training.
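The effect of such a boolean mask can be seen with plain tensors (toy numbers, unrelated to Cora): indexing with the mask selects only the rows of the labeled nodes, and the loss is computed over those rows alone.

```python
import torch
import torch.nn.functional as F

# Toy logits for 5 nodes over 3 classes, with labels for every node
logits = torch.randn(5, 3)
labels = torch.tensor([0, 2, 1, 0, 2])

# Only nodes 0 and 3 belong to the training split
train_mask = torch.tensor([True, False, False, True, False])

# Boolean indexing keeps just the masked rows; the loss ignores the rest
loss = F.cross_entropy(logits[train_mask], labels[train_mask])
print(logits[train_mask].shape, loss.item())
```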
Graph Convolutional Network
A Graph Convolutional Network (GCN) applies learned linear transformations followed by
neighborhood aggregation at each layer. The GCNConv layer in PyG implements the
propagation rule introduced by Kipf and Welling (2017):
\[
H^{(l+1)} = \sigma\left(\hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2} H^{(l)} W^{(l)}\right)
\]
where \(\hat{A} = A + I\) is the adjacency matrix with added self-loops, \(\hat{D}\) is
its degree matrix, \(H^{(l)}\) denotes the node representations at layer \(l\) (with
\(H^{(0)} = X\)), \(W^{(l)}\) is a learnable weight matrix, and \(\sigma\) is a nonlinearity.
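One propagation step can be written out with dense tensors to connect the formula to the layer (a didactic sketch only; GCNConv itself operates on the sparse edge index and fuses these steps):

```python
import torch

# Dense adjacency of a 4-node undirected cycle
A = torch.tensor([[0., 1., 0., 1.],
                  [1., 0., 1., 0.],
                  [0., 1., 0., 1.],
                  [1., 0., 1., 0.]])

H = torch.eye(4)        # node features H^(0), one-hot for clarity
W = torch.randn(4, 2)   # learnable weight matrix W^(0)

A_hat = A + torch.eye(4)                  # add self-loops: A_hat = A + I
D_hat = A_hat.sum(dim=1)                  # degrees of A_hat
D_inv_sqrt = torch.diag(D_hat.pow(-0.5))  # D_hat^{-1/2} as a diagonal matrix

# H^(1) = D^{-1/2} A_hat D^{-1/2} H W  (activation omitted)
H_next = D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W
print(H_next.shape)
```

Since every node here has degree 3 after adding self-loops, the normalized matrix \(\hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2}\) simply averages each node with its neighbors, which is the intuition behind the normalization.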
its degree matrix. The following model stacks two GCNConv layers with a ReLU activation
and dropout in between:
class GCN(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels):
        super().__init__()
        self.conv1 = GCNConv(in_channels, hidden_channels)
        self.conv2 = GCNConv(hidden_channels, out_channels)

    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.conv2(x, edge_index)
        return x
Training and Evaluation
The training loop follows the standard PyTorch pattern, with the key difference that the loss is computed only on the masked training nodes. Similarly, evaluation is performed on the validation and test masks.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = GCN(dataset.num_node_features, 16, dataset.num_classes).to(device)
data = dataset[0].to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)
model.train()
for epoch in range(200):
    optimizer.zero_grad()
    out = model(data.x, data.edge_index)
    loss = F.cross_entropy(out[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()
    if (epoch + 1) % 50 == 0:
        print(f"Epoch {epoch + 1:03d}, Loss: {loss.item():.4f}")
model.eval()
with torch.no_grad():
    out = model(data.x, data.edge_index)
    pred = out.argmax(dim=1)

for split, mask in [("Train", data.train_mask), ("Val", data.val_mask), ("Test", data.test_mask)]:
    correct = (pred[mask] == data.y[mask]).sum().item()
    total = mask.sum().item()
    print(f"{split} Accuracy: {correct / total:.4f}")
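In practice one usually keeps the parameters from the epoch with the best validation accuracy rather than the final epoch. The bookkeeping reduces to a small tracker; the sketch below is plain Python and independent of PyG (the BestTracker class and the hard-coded accuracy sequence are illustrative, not part of the library):

```python
class BestTracker:
    """Remembers the best validation score and the epoch it occurred at."""

    def __init__(self):
        self.best_score = float("-inf")
        self.best_epoch = -1

    def update(self, epoch, score):
        """Return True when `score` improves on the best seen so far."""
        if score > self.best_score:
            self.best_score = score
            self.best_epoch = epoch
            return True
        return False

tracker = BestTracker()
# Hypothetical per-epoch validation accuracies
for epoch, val_acc in enumerate([0.60, 0.72, 0.71, 0.75, 0.74]):
    if tracker.update(epoch, val_acc):
        pass  # here one would save model.state_dict() for later reuse

print(tracker.best_epoch, tracker.best_score)
```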