A comprehensive implementation comparing Graph Convolutional Networks (GCN) and custom Graph Attention Networks (GAT) for product classification in co-purchase networks. This project systematically explores seven node feature engineering strategies and demonstrates how attention mechanisms improve performance on noisy graph structures.
Traditional machine learning models operate on tabular data or sequences, but many real-world datasets have relational structure that these models cannot capture:
- Social networks: Users connected by friendships
- E-commerce: Products linked by co-purchase behavior
- Citation networks: Papers connected by citations
- Molecular graphs: Atoms bonded in chemical structures
- Transportation: Cities connected by routes
The Challenge: Standard neural networks like CNNs and RNNs are designed for grid-like (images) or sequential (text) data. They cannot naturally handle irregular graph structures where each node has a variable number of neighbors.
The Solution: Graph Neural Networks (GNNs) generalize neural networks to graphs by:
- Message Passing: Nodes exchange information with neighbors
- Aggregation: Each node combines neighbor messages
- Learning: Neural networks learn what messages to send and how to combine them
Result: GNNs can learn representations that capture both node features AND graph structure, making them powerful for tasks like node classification, link prediction, and graph classification.
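The message-passing idea above can be sketched in a few lines of NumPy (a toy illustration only; the project itself uses PyTorch Geometric). Here each node simply replaces its features with the mean of its neighbors' features:

```python
import numpy as np

# Toy undirected graph: 4 nodes, 4 edges
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
n = 4
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

# 2-dimensional node features
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [0.0, 0.0]])

# One round of message passing: every node averages its neighbors' features.
# A real GNN layer additionally applies a learned weight matrix and a
# nonlinearity to the aggregated messages.
deg = A.sum(axis=1, keepdims=True)
H = (A @ X) / deg
print(H[3])  # node 3's only neighbor is node 2, so it receives [1., 1.]
```

Stacking several such rounds lets information flow across multi-hop neighborhoods, which is exactly what the layered GNNs below do.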
Problem: New products arrive daily without proper categorization. Manual labeling is expensive and slow.
GNN Solution:
- Build co-purchase graph from transaction logs
- Train GNN on labeled products
- Automatically classify new products based on which existing products customers buy them with
- Business Impact: Faster time-to-market, improved search/discovery, better inventory management
Problem: Fraudsters create accounts that transact with each other, forming suspicious clusters.
GNN Solution:
- Represent transactions as edges between account nodes
- GNN learns patterns like "accounts connected to known fraudsters are suspicious"
- Detects fraud rings that simple rule-based systems miss
- Business Impact: Reduced fraud losses, lower false positive rates
Problem: Cold-start problem for new users/items with no interaction history.
GNN Solution:
- User-item interaction graph (users connected to items they engaged with)
- GNN propagates information: "users who liked similar items might like this"
- Works even for brand new items by leveraging graph structure
- Business Impact: Better recommendations, increased engagement, higher revenue
Problem: Identify influential users, detect communities, predict information spread.
GNN Solution:
- User friendship/follower graph
- GNN learns to identify influencers based on network position
- Predicts which users will adopt new features based on their connections
- Business Impact: Targeted marketing, viral growth strategies, community management
Problem: Predict disruptions and bottlenecks in complex supplier networks.
GNN Solution:
- Supplier-manufacturer-distributor graph
- GNN predicts which nodes are at risk based on neighbor failures
- Identifies critical nodes whose failure would cascade
- Business Impact: Reduced downtime, proactive risk mitigation, cost savings
Scenario: Amazon's photo category has 7,650 products that need classification into 8 subcategories (cameras, lenses, tripods, bags, etc.).
Available Data:
- Co-purchase graph: 238,162 edges showing which products are bought together
- Bag-of-words features: 745-dim vectors (removed for this experiment)
Challenge: Not all co-purchased products belong to the same category. For example:
- Camera + Camera Bag (noisy edge - different categories)
- Camera + Camera Lens (useful edge - related categories)
Goal: Build a GNN that learns from graph structure to classify products, even when edges contain noise.
Understanding the difference between transductive and inductive learning is crucial for real-world GNN deployment.
Definition: The model is trained on a graph containing ALL nodes (train + validation + test), but only some nodes have labels during training.
Example:
- You have a social network with 10,000 users
- 7,000 are labeled (their interests are known)
- 3,000 are unlabeled (you want to predict their interests)
- All 10,000 users are present in the graph during training
Characteristics:
- ✅ Can achieve very high accuracy (model sees entire graph structure)
- ❌ Cannot handle new nodes that arrive after training
- ❌ Must retrain model when graph grows
- ❌ Not suitable for dynamic graphs
Use Cases: Static graphs where all nodes are known upfront (citation networks, social network analysis snapshots)
Definition: The model learns generalizable patterns that work on unseen nodes/subgraphs.
Example:
- You train on one social network with 10,000 users
- Model learns: "users connected to sports enthusiasts often like sports"
- When 1,000 new users join, you can classify them immediately without retraining
Characteristics:
- ✅ Generalizes to new nodes
- ✅ No retraining needed when graph grows
- ✅ Suitable for production systems with continuous growth
- ❌ May have slightly lower accuracy than transductive
Use Cases: E-commerce (new products daily), social networks (new users), fraud detection (new accounts)
The distinction shows up clearly in feature choice:
| Feature Type | Transductive | Inductive | Why? |
|---|---|---|---|
| One-Hot Encoding | ✅ Yes | ❌ No | Each node has unique ID - new nodes don't have IDs |
| Random (Fixed) | ✅ Yes | ❌ No | Random vectors are node-specific |
| Random (Regenerated) | ❌ No | ✅ Yes | Simulates no persistent features |
| Degree | ✅ Yes | ✅ Yes | Can compute degree for any node |
| Centrality | ✅ Yes | ✅ Yes | Can compute metrics for any node |
| Spectral | ✅ Yes | ⚠️ Partially | Eigenvectors are graph-specific, but structural patterns generalize |
Critical Insight from This Project: The 50.87 percentage point drop in accuracy when switching from transductive random (87.98%) to inductive random (37.11%) demonstrates why feature stability matters for message passing.
All experiments use identical GCN architecture to isolate the impact of features:
- Architecture: 3-layer GCN with 128 hidden dimensions per layer
- Activation: ReLU
- Regularization: Dropout (0.5), Weight decay (5e-4)
- Optimizer: Adam (lr=0.01)
- Training: 200 epochs, evaluate at epoch with best validation accuracy
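Concretely, each GCN layer computes H' = σ(D̂^(-1/2) Â D̂^(-1/2) H W), where Â = A + I adds self-loops. A minimal NumPy sketch of one such layer (illustrative only; the experiments use PyTorch Geometric's GCNConv, and the random weights here stand in for learned ones):

```python
import numpy as np

rng = np.random.default_rng(42)

# Tiny graph with self-loops added (A_hat = A + I)
n, in_dim, out_dim = 4, 8, 3
A = np.zeros((n, n))
for i, j in [(0, 1), (0, 2), (1, 2), (2, 3)]:
    A[i, j] = A[j, i] = 1.0
A_hat = A + np.eye(n)

# Symmetric normalization: D_hat^{-1/2} A_hat D_hat^{-1/2}
d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

# One GCN layer: normalized aggregation, linear transform, ReLU
X = rng.standard_normal((n, in_dim))
W = rng.standard_normal((in_dim, out_dim))
H = np.maximum(A_norm @ X @ W, 0.0)  # ReLU(A_norm X W)
print(H.shape)  # (4, 3)
```

The experiments below swap only `X` (the feature matrix); the aggregation and weights stay architecturally identical.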
Method: Identity matrix - each node gets unique binary vector.
one_hot_features = torch.eye(7650)  # 7,650 × 7,650 identity matrix

Results:
- Dimensionality: 7,650
- Test Accuracy: 92.07%
- Validation Accuracy: 93.03%
- Transductive: Yes
- Inductive: No
Analysis: Highest accuracy but cannot generalize to new products. The model essentially memorizes node-specific patterns rather than learning structural rules.
Method: All nodes receive identical vector [1, 1, 1, 1].
constant_features = torch.ones(7650, 4)  # All nodes identical

Results:
- Dimensionality: 4
- Test Accuracy: 40.16%
- Validation Accuracy: 42.11%
- Transductive: No
- Inductive: Yes
Analysis: With zero node-specific information, the model still achieves 40.16% accuracy (vs. 12.5% random baseline for 8 classes). This proves that graph topology alone contains significant classification signal.
Method: Fixed random Gaussian vectors, normalized to unit length, stable across all 200 epochs.
random_features = torch.randn(7650, 128)
random_features = F.normalize(random_features, p=2, dim=1)
# These features stay the same throughout training

Results:
- Dimensionality: 128
- Test Accuracy: 87.98%
- Validation Accuracy: 89.10%
- Transductive: Yes
- Inductive: No
Analysis: Surprisingly effective despite being random. Fixed random features serve as positional encodings that break symmetries between nodes with similar neighborhoods. The stability allows the GNN to learn complex patterns through message passing.
Method: Regenerate new random vectors at every epoch to simulate unstable features.
for epoch in range(200):
    # Generate NEW random features each epoch
    data.x = torch.randn(7650, 128)
    data.x = F.normalize(data.x, p=2, dim=1)
    # Train for one epoch with these features

Results:
- Dimensionality: 128
- Test Accuracy: 37.11%
- Validation Accuracy: 38.10%
- Transductive: No
- Inductive: Yes
Critical Finding: Catastrophic 50.87 percentage point accuracy drop (87.98% → 37.11%) compared to transductive random. Even worse than constant features (40.16%)!
Why This Happens: Without stable features, the model cannot build persistent representations through message passing. Each epoch starts from scratch. The model only learns from graph structure within a single epoch, which is insufficient.
Key Insight: Feature stability matters more than semantic meaning for GNNs. Random but stable features (88%) vastly outperform semantically empty but unstable features (37%).
Method: Two features - raw degree count and log-transformed degree.
degree_features = torch.stack([
    node_degrees,                # Raw degree
    torch.log(node_degrees + 1)  # Log degree
], dim=1)

Results:
- Dimensionality: 2
- Test Accuracy: 66.03%
- Validation Accuracy: 67.22%
- Transductive: No
- Inductive: Yes
Analysis: Simple but effective. With only 2 features, achieves 25.87 percentage points above constant baseline. The log transformation handles power-law degree distributions common in real networks.
Method: Five graph-theoretic metrics capturing different structural roles.
Features:
1. Degree Centrality: Normalized degree (hub identification)
2. Betweenness Centrality: Bridge nodes connecting communities
3. Closeness Centrality: Distance to all other nodes
4. PageRank: Prestige/importance
5. Clustering Coefficient: Local cohesion

Results:
- Dimensionality: 5
- Test Accuracy: 84.06%
- Validation Accuracy: 84.83%
- Transductive: No
- Inductive: Yes
Analysis: Strong performance with compact, interpretable features. Each centrality captures a complementary aspect of network position. Good balance between accuracy and feature efficiency.
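Assembling these five features is straightforward with NetworkX (a toy graph here, not the Amazon Photo data; the stacking pattern is what matters):

```python
import networkx as nx
import numpy as np

# Small illustrative graph (the real project computes these on the
# 7,650-node co-purchase graph)
G = nx.Graph([(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)])

# One column per centrality metric, one row per node
feats = np.stack([
    np.array(list(nx.degree_centrality(G).values())),
    np.array(list(nx.betweenness_centrality(G).values())),
    np.array(list(nx.closeness_centrality(G).values())),
    np.array(list(nx.pagerank(G).values())),
    np.array(list(nx.clustering(G).values())),
], axis=1)

print(feats.shape)  # (num_nodes, 5)
```

Note that betweenness centrality is expensive on large graphs (roughly O(VE) exact), so approximate variants are often used in production.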
Method: Top-64 eigenvectors of graph Laplacian (L = D - A).
# PyG built-in transform
transform = T.AddLaplacianEigenvectorPE(k=64)
data = transform(data)

Theory:
- Low eigenvalues (first eigenvectors): Capture global structure, communities
- High eigenvalues (later eigenvectors): Capture local symmetries
- Multi-scale encoding: 64 eigenvectors span coarse to fine structural details
Results:
- Dimensionality: 64
- Test Accuracy: 92.07%
- Validation Accuracy: 92.50%
- Transductive: Partially
- Inductive: Partially (patterns generalize)
Analysis: Matches one-hot accuracy (92.07%) with 119× fewer dimensions (64 vs 7,650). Best practical choice:
- Achieves highest accuracy
- Compact representation
- Partially inductive (structural patterns generalize to similar nodes)
- Theoretically grounded in spectral graph theory
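The computation behind the transform can be sketched in NumPy on a toy graph (illustrative; for large graphs a sparse eigensolver is used instead of a dense eigendecomposition):

```python
import numpy as np

# Path graph on 5 nodes; Laplacian L = D - A
n = 5
A = np.zeros((n, n))
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0
L = np.diag(A.sum(axis=1)) - A

# eigh returns eigenvalues in ascending order; the eigenvector for
# eigenvalue 0 is constant on a connected graph, so positional
# encodings typically skip it
eigvals, eigvecs = np.linalg.eigh(L)
k = 2
pe = eigvecs[:, 1:k + 1]  # k smallest non-trivial eigenvectors as node features
print(np.round(eigvals, 3))
```

The low-frequency eigenvectors vary smoothly along the path (capturing "where am I globally?"), which is why they work as structure-aware node features.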
| Feature Type | Dims | Test Acc | Val Acc | Inductive? | Accuracy Drop from Best |
|---|---|---|---|---|---|
| One-Hot Encoding | 7,650 | 92.07% | 93.03% | ❌ No | 0.00% |
| Laplacian Eigenvectors | 64 | 92.07% | 92.50% | ⚠️ Partially | 0.00% |
| Random (Transductive) | 128 | 87.98% | 89.10% | ❌ No | -4.09% |
| Node Centralities | 5 | 84.06% | 84.83% | ✅ Yes | -8.01% |
| Node Degree | 2 | 66.03% | 67.22% | ✅ Yes | -26.04% |
| Constant Features | 4 | 40.16% | 42.11% | ✅ Yes | -51.91% |
| Random (Inductive) | 128 | 37.11% | 38.10% | ✅ Yes | -54.96% |
- Transductive vs Inductive Gap: The 50.87-percentage-point drop between transductive random (87.98%) and inductive random (37.11%) proves feature stability is critical
- Spectral Features Optimal: Laplacian eigenvectors achieve top accuracy with 119× compression and partial inductive capability
- Structure Contains Signal: Constant features achieve 40.16% (vs. 12.5% random), proving graph topology encodes meaningful information
- Stability > Semantics: Fixed random (87.98%) beats unstable features (37.11%) by 50.87 percentage points
- Practical Choice: For production systems with new nodes, use Laplacian eigenvectors (92.07% accuracy, partially inductive) or centralities (84.06%, fully inductive, interpretable)
Standard GCN Problem: Treats all neighbors equally with fixed weights:
h_i^(new) = σ( Σ_j (1/√(d_i·d_j)) · h_j · W )
↑
Fixed weight based only on degree
Challenge in Co-Purchase Networks:
- Camera + Camera Bag: Different categories (noisy edge)
- Camera + Camera Lens: Same category (useful edge)
- GCN averages both equally → noise dilutes signal
GAT Solution: Learn attention weights to identify and downweight noisy connections:
h_i^(new) = σ( Σ_j α_ij · h_j · W )
↑
Learned weight based on feature similarity
Attention Mechanism:
1. Transform features: h'_i = h_i W
2. Compute attention logits: e_ij = LeakyReLU(a^T [h'_i || h'_j])
3. Normalize with softmax: α_ij = exp(e_ij) / Σ_k exp(e_ik)
4. Weighted aggregation: h_i = σ(Σ_j α_ij h'_j)
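Steps 1-4 can be sketched for a single target node and one attention head (an illustrative NumPy version with random weights; the project's actual layer is built on PyG's MessagePassing, and the self-loop term is omitted here for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

d, d_out = 6, 4
W = rng.standard_normal((d, d_out))  # shared linear transform
a = rng.standard_normal(2 * d_out)   # attention vector

h_i = rng.standard_normal(d)             # target node features
neighbors = rng.standard_normal((3, d))  # 3 neighbor feature vectors

# 1. Transform features
hp_i = h_i @ W
hp_j = neighbors @ W

# 2. Attention logits: e_ij = LeakyReLU(a^T [h'_i || h'_j])
logits = np.array([leaky_relu(a @ np.concatenate([hp_i, hp]))
                   for hp in hp_j])

# 3. Softmax over the neighborhood (max-subtraction for stability)
alpha = np.exp(logits - logits.max())
alpha /= alpha.sum()

# 4. Weighted aggregation (output nonlinearity omitted for clarity)
h_new = alpha @ hp_j
print(np.round(alpha, 3))  # attention weights sum to 1
```

Because the weights α depend on the features of both endpoints, the layer can learn to assign near-zero weight to edges whose endpoints look unrelated.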
Multi-Head Design: 4 parallel attention mechanisms per layer. Each head can specialize:
- Head 1: Same-category products
- Head 2: Complementary accessories
- Head 3: Price-range similarity
- Head 4: Brand affinity
Full Model:
Input: One-Hot Features (7,650-dim)
↓
GAT Layer 1: 7650 → 4 heads × 128 = 512-dim (concat)
↓ ELU + Dropout(0.6)
GAT Layer 2: 512 → 4 heads × 128 = 512-dim (concat)
↓ ELU + Dropout(0.6)
GAT Layer 3: 512 → 8 classes
↓
Output: Class probabilities
Implementation: Built from scratch using PyTorch Geometric's MessagePassing base class with:
- LeakyReLU (negative_slope=0.2) for attention logits
- Softmax normalization over neighborhoods
- Self-loops for self-attention
- Concatenation of multi-head outputs
| Model | Test Accuracy | Validation Accuracy | Improvement |
|---|---|---|---|
| GCN Baseline | 92.16% | 93.03% | - |
| Custom GAT | 92.68% | 93.03% | +0.52% |
Why GAT Outperforms:
- Noise Filtering: Attention weights suppress irrelevant co-purchase edges between different categories
- Adaptive Aggregation: Important neighbors (same category) receive higher weights
- Multi-Head Robustness: 4 heads capture diverse relational patterns simultaneously
Attention Visualization (conceptual example):
Target: DSLR Camera (class: Cameras)
Neighbors & Learned Attention Weights:
    Camera Lens      α = 0.35  ← High (strongly related, useful edge)
    Camera Bag       α = 0.28  ← Medium (accessory, different category)
    Memory Card      α = 0.22  ← Medium (accessory)
    Tripod           α = 0.10  ← Low (weakly related)
    Random Book      α = 0.05  ← Lowest (NOISE - filtered out!)
Why Only 0.52% Improvement?
The dataset is relatively clean - most co-purchases are category-consistent. The 0.52% gain demonstrates that:
- Even small improvements matter in high-accuracy regimes (92%+)
- Attention successfully identifies and filters the noisy edges that exist
- Validates the architectural choice for noisy graph problems
When GAT Would Win Bigger: Datasets with more noise (fraud networks, social media, recommendation graphs) would show larger attention benefits.
├── GNN.ipynb # Full implementation with experiments
├── README.md # This comprehensive guide
└── TECHNICAL_GUIDE.md # Step-by-step walkthrough with toy example
# Clone repository
git clone https://github.com/yourusername/graph-neural-networks-node-classification.git
cd graph-neural-networks-node-classification
# Install dependencies
pip install torch torch-geometric networkx numpy scipy matplotlib scikit-learn
# Run notebook
jupyter notebook GNN.ipynb

All experiments use a fixed random seed (42) for reproducibility.
- GNNs Enable Relational Learning: Capture graph structure that traditional models miss
- Feature Engineering is Critical: 92.07% (spectral) vs. 40.16% (constant) - same model, different features
- Stability Matters More Than Semantics: Fixed random (87.98%) beats unstable random (37.11%) by 50.87 points
- Spectral Features Optimal: Match one-hot accuracy with 119× fewer dimensions and partial inductive capability
- Attention Filters Noise: GAT (92.68%) outperforms GCN (92.16%) by learning to suppress irrelevant edges
- Choose Based on Production Needs:
  - Static graphs → One-hot encoding (92.07%)
  - Dynamic graphs → Spectral features (92.07%, partially inductive)
  - Interpretability needed → Centralities (84.06%, explainable)
- Heterogeneous graphs (multiple node/edge types)
- Temporal dynamics (time-evolving co-purchases)
- GraphSAGE comparison (sampling for scalability)
- Edge features (purchase frequency, ratings)
- Attention visualization dashboard
- Kipf & Welling (2017): "Semi-Supervised Classification with Graph Convolutional Networks"
- Veličković et al. (2018): "Graph Attention Networks"
- Hamilton et al. (2017): "Inductive Representation Learning on Large Graphs"
- PyTorch Geometric: https://pytorch-geometric.readthedocs.io/