A comprehensive implementation comparing Graph Convolutional Networks (GCN) and custom Graph Attention Networks (GAT) for product classification in co-purchase networks. This project systematically explores seven node feature engineering strategies and demonstrates how attention mechanisms improve performance on noisy graph structures.
Traditional machine learning models operate on tabular data or sequences, but many real-world datasets have relational structure that these models cannot capture:
- Social networks: Users connected by friendships
- E-commerce: Products linked by co-purchase behavior
- Citation networks: Papers connected by citations
- Molecular graphs: Atoms bonded in chemical structures
- Transportation: Cities connected by routes
The Challenge: Standard neural networks like CNNs and RNNs are designed for grid-like (images) or sequential (text) data. They cannot naturally handle irregular graph structures where each node has a variable number of neighbors.
The Solution: Graph Neural Networks (GNNs) generalize neural networks to graphs by:
- Message Passing: Nodes exchange information with neighbors
- Aggregation: Each node combines neighbor messages
- Learning: Neural networks learn what messages to send and how to combine them
Result: GNNs can learn representations that capture both node features AND graph structure, making them powerful for tasks like node classification, link prediction, and graph classification.
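The message-passing idea above can be sketched in a few lines of NumPy (a toy illustration only; the project itself uses PyTorch Geometric). Here each node simply replaces its features with the mean of its neighbors' features:

```python
import numpy as np

# Toy undirected graph: 4 nodes, 4 edges
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
n = 4
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

# 2-dimensional node features
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [0.0, 0.0]])

# One round of message passing: every node averages its neighbors' features.
# A real GNN layer additionally applies a learned weight matrix and a
# nonlinearity to the aggregated messages.
deg = A.sum(axis=1, keepdims=True)
H = (A @ X) / deg
print(H[3])  # node 3's only neighbor is node 2, so it receives [1., 1.]
```

Stacking several such rounds lets information flow across multi-hop neighborhoods, which is exactly what the layered GNNs below do.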
Problem: New products arrive daily without proper categorization. Manual labeling is expensive and slow.
GNN Solution:
- Build co-purchase graph from transaction logs
- Train GNN on labeled products
- Automatically classify new products based on which existing products customers buy them with
- Business Impact: Faster time-to-market, improved search/discovery, better inventory management
Problem: Fraudsters create accounts that transact with each other, forming suspicious clusters.
GNN Solution:
- Represent transactions as edges between account nodes
- GNN learns patterns like "accounts connected to known fraudsters are suspicious"
- Detects fraud rings that simple rule-based systems miss
- Business Impact: Reduced fraud losses, lower false positive rates
Problem: Cold-start problem for new users/items with no interaction history.
GNN Solution:
- User-item interaction graph (users connected to items they engaged with)
- GNN propagates information: "users who liked similar items might like this"
- Works even for brand new items by leveraging graph structure
- Business Impact: Better recommendations, increased engagement, higher revenue
Problem: Identify influential users, detect communities, predict information spread.
GNN Solution:
- User friendship/follower graph
- GNN learns to identify influencers based on network position
- Predicts which users will adopt new features based on their connections
- Business Impact: Targeted marketing, viral growth strategies, community management
Problem: Predict disruptions and bottlenecks in complex supplier networks.
GNN Solution:
- Supplier-manufacturer-distributor graph
- GNN predicts which nodes are at risk based on neighbor failures
- Identifies critical nodes whose failure would cascade
- Business Impact: Reduced downtime, proactive risk mitigation, cost savings
Scenario: Amazon's photo category has 7,650 products that need classification into 8 subcategories (cameras, lenses, tripods, bags, etc.).
Available Data:
- Co-purchase graph: 238,162 edges showing which products are bought together
- Bag-of-words features: 745-dim vectors (removed for this experiment)
Challenge: Not all co-purchased products belong to the same category. For example:
- Camera + Camera Bag (noisy edge - different categories)
- Camera + Camera Lens (useful edge - related categories)
Goal: Build a GNN that learns from graph structure to classify products, even when edges contain noise.
Understanding the difference between transductive and inductive learning is crucial for real-world GNN deployment.
Definition: The model is trained on a graph containing ALL nodes (train + validation + test), but only some nodes have labels during training.
Example:
- You have a social network with 10,000 users
- 7,000 are labeled (their interests are known)
- 3,000 are unlabeled (you want to predict their interests)
- All 10,000 users are present in the graph during training
Characteristics:
- ✅ Can achieve very high accuracy (model sees entire graph structure)
- ❌ Cannot handle new nodes that arrive after training
- ❌ Must retrain model when graph grows
- ❌ Not suitable for dynamic graphs
Use Cases: Static graphs where all nodes are known upfront (citation networks, social network analysis snapshots)
Definition: The model learns generalizable patterns that work on unseen nodes/subgraphs.
Example:
- You train on one social network with 10,000 users
- Model learns: "users connected to sports enthusiasts often like sports"
- When 1,000 new users join, you can classify them immediately without retraining
Characteristics:
- ✅ Generalizes to new nodes
- ✅ No retraining needed when graph grows
- ✅ Suitable for production systems with continuous growth
- ❌ May have slightly lower accuracy than transductive
Use Cases: E-commerce (new products daily), social networks (new users), fraud detection (new accounts)
The distinction shows up clearly in feature choice:
| Feature Type | Transductive | Inductive | Why? |
|---|---|---|---|
| One-Hot Encoding | ✅ Yes | ❌ No | Each node has unique ID - new nodes don't have IDs |
| Random (Fixed) | ✅ Yes | ❌ No | Random vectors are node-specific |
| Random (Regenerated) | ❌ No | ✅ Yes | Simulates no persistent features |
| Degree | ✅ Yes | ✅ Yes | Can compute degree for any node |
| Centrality | ✅ Yes | ✅ Yes | Can compute metrics for any node |
| Spectral | ✅ Yes | ⚠️ Partially | Eigenvectors are graph-specific, but structural patterns generalize |
Critical Insight from This Project: The 50.87 percentage point drop in accuracy when switching from transductive random (87.98%) to inductive random (37.11%) demonstrates why feature stability matters for message passing.
All experiments use identical GCN architecture to isolate the impact of features:
- Architecture: 3-layer GCN with 128 hidden dimensions per layer
- Activation: ReLU
- Regularization: Dropout (0.5), Weight decay (5e-4)
- Optimizer: Adam (lr=0.01)
- Training: 200 epochs, evaluate at epoch with best validation accuracy
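Concretely, each GCN layer computes H' = σ(D̂^(-1/2) Â D̂^(-1/2) H W), where Â = A + I adds self-loops. A minimal NumPy sketch of one such layer (illustrative only; the experiments use PyTorch Geometric's GCNConv, and the random weights here stand in for learned ones):

```python
import numpy as np

rng = np.random.default_rng(42)

# Tiny graph with self-loops added (A_hat = A + I)
n, in_dim, out_dim = 4, 8, 3
A = np.zeros((n, n))
for i, j in [(0, 1), (0, 2), (1, 2), (2, 3)]:
    A[i, j] = A[j, i] = 1.0
A_hat = A + np.eye(n)

# Symmetric normalization: D_hat^{-1/2} A_hat D_hat^{-1/2}
d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

# One GCN layer: normalized aggregation, linear transform, ReLU
X = rng.standard_normal((n, in_dim))
W = rng.standard_normal((in_dim, out_dim))
H = np.maximum(A_norm @ X @ W, 0.0)  # ReLU(A_norm X W)
print(H.shape)  # (4, 3)
```

The experiments below swap only `X` (the feature matrix); the aggregation and weights stay architecturally identical.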
Method: Identity matrix - each node gets unique binary vector.
one_hot_features = torch.eye(7650)  # 7,650 × 7,650 identity matrix

Results:
- Dimensionality: 7,650
- Test Accuracy: 92.07%
- Validation Accuracy: 93.03%
- Transductive: Yes
- Inductive: No
Analysis: Highest accuracy but cannot generalize to new products. The model essentially memorizes node-specific patterns rather than learning structural rules.
Method: All nodes receive identical vector [1, 1, 1, 1].
constant_features = torch.ones(7650, 4)  # All nodes identical

Results:
- Dimensionality: 4
- Test Accuracy: 40.16%
- Validation Accuracy: 42.11%
- Transductive: No
- Inductive: Yes
Analysis: With zero node-specific information, the model still achieves 40.16% accuracy (vs. 12.5% random baseline for 8 classes). This proves that graph topology alone contains significant classification signal.
Method: Fixed random Gaussian vectors, normalized to unit length, stable across all 200 epochs.
random_features = torch.randn(7650, 128)
random_features = F.normalize(random_features, p=2, dim=1)
# These features stay the same throughout training

Results:
- Dimensionality: 128
- Test Accuracy: 87.98%
- Validation Accuracy: 89.10%
- Transductive: Yes
- Inductive: No
Analysis: Surprisingly effective despite being random. Fixed random features serve as positional encodings that break symmetries between nodes with similar neighborhoods. The stability allows the GNN to learn complex patterns through message passing.
Method: Regenerate new random vectors at every epoch to simulate unstable features.
for epoch in range(200):
    # Generate NEW random features each epoch
    data.x = torch.randn(7650, 128)
    data.x = F.normalize(data.x, p=2, dim=1)
    # Train for one epoch with these features

Results:
- Dimensionality: 128
- Test Accuracy: 37.11%
- Validation Accuracy: 38.10%
- Transductive: No
- Inductive: Yes
Critical Finding: Catastrophic 50.87 percentage point accuracy drop (87.98% → 37.11%) compared to transductive random. Even worse than constant features (40.16%)!
Why This Happens: Without stable features, the model cannot build persistent representations through message passing. Each epoch starts from scratch. The model only learns from graph structure within a single epoch, which is insufficient.
Key Insight: Feature stability matters more than semantic meaning for GNNs. Random but stable features (88%) vastly outperform semantically empty but unstable features (37%).
Method: Two features - raw degree count and log-transformed degree.
degree_features = torch.stack([
    node_degrees,                # Raw degree
    torch.log(node_degrees + 1)  # Log degree
], dim=1)

Results:
- Dimensionality: 2
- Test Accuracy: 66.03%
- Validation Accuracy: 67.22%
- Transductive: No
- Inductive: Yes
Analysis: Simple but effective. With only 2 features, achieves 25.87 percentage points above constant baseline. The log transformation handles power-law degree distributions common in real networks.
Method: Five graph-theoretic metrics capturing different structural roles.
Features:
1. Degree Centrality: Normalized degree (hub identification)
2. Betweenness Centrality: Bridge nodes connecting communities
3. Closeness Centrality: Distance to all other nodes
4. PageRank: Prestige/importance
5. Clustering Coefficient: Local cohesion

Results:
- Dimensionality: 5
- Test Accuracy: 84.06%
- Validation Accuracy: 84.83%
- Transductive: No
- Inductive: Yes
Analysis: Strong performance with compact, interpretable features. Each centrality captures a complementary aspect of network position. Good balance between accuracy and feature efficiency.
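Assembling these five features is straightforward with NetworkX (a toy graph here, not the Amazon Photo data; the stacking pattern is what matters):

```python
import networkx as nx
import numpy as np

# Small illustrative graph (the real project computes these on the
# 7,650-node co-purchase graph)
G = nx.Graph([(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)])

# One column per centrality metric, one row per node
feats = np.stack([
    np.array(list(nx.degree_centrality(G).values())),
    np.array(list(nx.betweenness_centrality(G).values())),
    np.array(list(nx.closeness_centrality(G).values())),
    np.array(list(nx.pagerank(G).values())),
    np.array(list(nx.clustering(G).values())),
], axis=1)

print(feats.shape)  # (num_nodes, 5)
```

Note that betweenness centrality is expensive on large graphs (roughly O(VE) exact), so approximate variants are often used in production.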
Method: Top-64 eigenvectors of graph Laplacian (L = D - A).
# PyG built-in transform
transform = T.AddLaplacianEigenvectorPE(k=64)
data = transform(data)

Theory:
- Low eigenvalues (first eigenvectors): Capture global structure, communities
- High eigenvalues (later eigenvectors): Capture local symmetries
- Multi-scale encoding: 64 eigenvectors span coarse to fine structural details
Results:
- Dimensionality: 64
- Test Accuracy: 92.07%
- Validation Accuracy: 92.50%
- Transductive: Partially
- Inductive: Partially (patterns generalize)
Analysis: Matches one-hot accuracy (92.07%) with 119× fewer dimensions (64 vs 7,650). Best practical choice:
- Achieves highest accuracy
- Compact representation
- Partially inductive (structural patterns generalize to similar nodes)
- Theoretically grounded in spectral graph theory
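The computation behind the transform can be sketched in NumPy on a toy graph (illustrative; for large graphs a sparse eigensolver is used instead of a dense eigendecomposition):

```python
import numpy as np

# Path graph on 5 nodes; Laplacian L = D - A
n = 5
A = np.zeros((n, n))
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0
L = np.diag(A.sum(axis=1)) - A

# eigh returns eigenvalues in ascending order; the eigenvector for
# eigenvalue 0 is constant on a connected graph, so positional
# encodings typically skip it
eigvals, eigvecs = np.linalg.eigh(L)
k = 2
pe = eigvecs[:, 1:k + 1]  # k smallest non-trivial eigenvectors as node features
print(np.round(eigvals, 3))
```

The low-frequency eigenvectors vary smoothly along the path (capturing "where am I globally?"), which is why they work as structure-aware node features.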
| Feature Type | Dims | Test Acc | Val Acc | Inductive? | Accuracy Drop from Best |
|---|---|---|---|---|---|
| One-Hot Encoding | 7,650 | 92.07% | 93.03% | ❌ No | 0.00% |
| Laplacian Eigenvectors | 64 | 92.07% | 92.50% | ⚠️ Partially | 0.00% |
| Random (Transductive) | 128 | 87.98% | 89.10% | ❌ No | -4.09% |
| Node Centralities | 5 | 84.06% | 84.83% | ✅ Yes | -8.01% |
| Node Degree | 2 | 66.03% | 67.22% | ✅ Yes | -26.04% |
| Constant Features | 4 | 40.16% | 42.11% | ✅ Yes | -51.91% |
| Random (Inductive) | 128 | 37.11% | 38.10% | ✅ Yes | -54.96% |
- Transductive vs Inductive Gap: The 50.87-percentage-point drop between transductive random (87.98%) and inductive random (37.11%) proves feature stability is critical
- Spectral Features Optimal: Laplacian eigenvectors achieve top accuracy with 119× compression and partial inductive capability
- Structure Contains Signal: Constant features achieve 40.16% (vs. 12.5% random), proving graph topology encodes meaningful information
- Stability > Semantics: Fixed random (87.98%) beats unstable features (37.11%) by 50.87 percentage points
- Practical Choice: For production systems with new nodes, use Laplacian eigenvectors (92.07% accuracy, partially inductive) or centralities (84.06%, fully inductive, interpretable)
Standard GCN Problem: Treats all neighbors equally with fixed weights:
h_i^(new) = σ( Σ_j (1/√(d_i·d_j)) · h_j · W )
↑
Fixed weight based only on degree
Challenge in Co-Purchase Networks:
- Camera + Camera Bag: Different categories (noisy edge)
- Camera + Camera Lens: Same category (useful edge)
- GCN averages both equally → noise dilutes signal
GAT Solution: Learn attention weights to identify and downweight noisy connections:
h_i^(new) = σ( Σ_j α_ij · h_j · W )
↑
Learned weight based on feature similarity
Attention Mechanism:
1. Transform features: h'_i = h_i W
2. Compute attention logits: e_ij = LeakyReLU(a^T [h'_i || h'_j])
3. Normalize with softmax: α_ij = exp(e_ij) / Σ_k exp(e_ik)
4. Weighted aggregation: h_i = σ(Σ_j α_ij h'_j)
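Steps 1-4 can be sketched for a single target node and one attention head (an illustrative NumPy version with random weights; the project's actual layer is built on PyG's MessagePassing, and the self-loop term is omitted here for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

d, d_out = 6, 4
W = rng.standard_normal((d, d_out))  # shared linear transform
a = rng.standard_normal(2 * d_out)   # attention vector

h_i = rng.standard_normal(d)             # target node features
neighbors = rng.standard_normal((3, d))  # 3 neighbor feature vectors

# 1. Transform features
hp_i = h_i @ W
hp_j = neighbors @ W

# 2. Attention logits: e_ij = LeakyReLU(a^T [h'_i || h'_j])
logits = np.array([leaky_relu(a @ np.concatenate([hp_i, hp]))
                   for hp in hp_j])

# 3. Softmax over the neighborhood (max-subtraction for stability)
alpha = np.exp(logits - logits.max())
alpha /= alpha.sum()

# 4. Weighted aggregation (output nonlinearity omitted for clarity)
h_new = alpha @ hp_j
print(np.round(alpha, 3))  # attention weights sum to 1
```

Because the weights α depend on the features of both endpoints, the layer can learn to assign near-zero weight to edges whose endpoints look unrelated.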
Multi-Head Design: 4 parallel attention mechanisms per layer. Each head can specialize:
- Head 1: Same-category products
- Head 2: Complementary accessories
- Head 3: Price-range similarity
- Head 4: Brand affinity
Full Model:
Input: One-Hot Features (7,650-dim)
↓
GAT Layer 1: 7650 → 4 heads × 128 = 512-dim (concat)
↓ ELU + Dropout(0.6)
GAT Layer 2: 512 → 4 heads × 128 = 512-dim (concat)
↓ ELU + Dropout(0.6)
GAT Layer 3: 512 → 8 classes
↓
Output: Class probabilities
Implementation: Built from scratch using PyTorch Geometric's MessagePassing base class with:
- LeakyReLU (negative_slope=0.2) for attention logits
- Softmax normalization over neighborhoods
- Self-loops for self-attention
- Concatenation of multi-head outputs
| Model | Test Accuracy | Validation Accuracy | Improvement |
|---|---|---|---|
| GCN Baseline | 92.16% | 93.03% | - |
| Custom GAT | 92.68% | 93.03% | +0.52% |
Why GAT Outperforms:
- Noise Filtering: Attention weights suppress irrelevant co-purchase edges between different categories
- Adaptive Aggregation: Important neighbors (same category) receive higher weights
- Multi-Head Robustness: 4 heads capture diverse relational patterns simultaneously
Attention Visualization (conceptual example):
Target: DSLR Camera (class: Cameras)
Neighbors & Learned Attention Weights:
    Camera Lens      α = 0.35  ← High (strongly related, useful edge)
    Camera Bag       α = 0.28  ← Medium (accessory, different category)
    Memory Card      α = 0.22  ← Medium (accessory)
    Tripod           α = 0.10  ← Low (weakly related)
    Random Book      α = 0.05  ← Lowest (NOISE - filtered out!)
Why Only 0.52% Improvement?
The dataset is relatively clean - most co-purchases are category-consistent. The 0.52% gain demonstrates that:
- Even small improvements matter in high-accuracy regimes (92%+)
- Attention successfully identifies and filters the noisy edges that exist
- Validates the architectural choice for noisy graph problems
When GAT Would Win Bigger: Datasets with more noise (fraud networks, social media, recommendation graphs) would show larger attention benefits.
├── GNN.ipynb # Full implementation with experiments
├── README.md # This comprehensive guide
└── TECHNICAL_GUIDE.md # Step-by-step walkthrough with toy example
# Clone repository
git clone https://github.com/yourusername/graph-neural-networks-node-classification.git
cd graph-neural-networks-node-classification
# Install dependencies
pip install torch torch-geometric networkx numpy scipy matplotlib scikit-learn
# Run notebook
jupyter notebook GNN.ipynb

All experiments use a fixed random seed (42) for reproducibility.
- GNNs Enable Relational Learning: Capture graph structure that traditional models miss
- Feature Engineering is Critical: 92.07% (spectral) vs. 40.16% (constant) - same model, different features
- Stability Matters More Than Semantics: Fixed random (87.98%) beats unstable random (37.11%) by 50.87 points
- Spectral Features Optimal: Match one-hot accuracy with 119× fewer dimensions and partial inductive capability
- Attention Filters Noise: GAT (92.68%) outperforms GCN (92.16%) by learning to suppress irrelevant edges
- Choose Based on Production Needs:
  - Static graphs → One-hot encoding (92.07%)
  - Dynamic graphs → Spectral features (92.07%, partially inductive)
  - Interpretability needed → Centralities (84.06%, explainable)
- Heterogeneous graphs (multiple node/edge types)
- Temporal dynamics (time-evolving co-purchases)
- GraphSAGE comparison (sampling for scalability)
- Edge features (purchase frequency, ratings)
- Attention visualization dashboard
- Kipf & Welling (2017): "Semi-Supervised Classification with Graph Convolutional Networks"
- Veličković et al. (2018): "Graph Attention Networks"
- Hamilton et al. (2017): "Inductive Representation Learning on Large Graphs"
- PyTorch Geometric: https://pytorch-geometric.readthedocs.io/