Skip to content

A comprehensive comparison of Graph Convolutional Networks (GCN) and custom Graph Attention Networks (GAT) for product classification in noisy co-purchase networks. Includes node feature augmentation techniques (spectral, degree-based, one-hot encoding) and demonstrates how attention mechanisms filter irrelevant edges on Amazon Photo dataset

Notifications You must be signed in to change notification settings

aaliusama/Graph-neural-networks-node-classification

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 

Repository files navigation

Graph Neural Networks for Node Classification

A comprehensive implementation comparing Graph Convolutional Networks (GCN) and custom Graph Attention Networks (GAT) for product classification in co-purchase networks. This project systematically explores seven node feature engineering strategies and demonstrates how attention mechanisms improve performance on noisy graph structures.


Why Graph Neural Networks?

Traditional machine learning models operate on tabular data or sequences, but many real-world datasets have relational structure that these models cannot capture:

  • Social networks: Users connected by friendships
  • E-commerce: Products linked by co-purchase behavior
  • Citation networks: Papers connected by citations
  • Molecular graphs: Atoms bonded in chemical structures
  • Transportation: Cities connected by routes

The Challenge: Standard neural networks like CNNs and RNNs are designed for grid-like (images) or sequential (text) data. They cannot naturally handle irregular graph structures where each node has a variable number of neighbors.

The Solution: Graph Neural Networks (GNNs) generalize neural networks to graphs by:

  1. Message Passing: Nodes exchange information with neighbors
  2. Aggregation: Each node combines neighbor messages
  3. Learning: Neural networks learn what messages to send and how to combine them

Result: GNNs can learn representations that capture both node features AND graph structure, making them powerful for tasks like node classification, link prediction, and graph classification.


Business Use Cases

1. E-Commerce Product Classification

Problem: New products arrive daily without proper categorization. Manual labeling is expensive and slow.

GNN Solution:

  • Build co-purchase graph from transaction logs
  • Train GNN on labeled products
  • Automatically classify new products based on which existing products customers buy them with
  • Business Impact: Faster time-to-market, improved search/discovery, better inventory management

2. Fraud Detection in Financial Networks

Problem: Fraudsters create accounts that transact with each other, forming suspicious clusters.

GNN Solution:

  • Represent transactions as edges between account nodes
  • GNN learns patterns like "accounts connected to known fraudsters are suspicious"
  • Detects fraud rings that simple rule-based systems miss
  • Business Impact: Reduced fraud losses, lower false positive rates

3. Content Recommendation

Problem: Cold-start problem for new users/items with no interaction history.

GNN Solution:

  • User-item interaction graph (users connected to items they engaged with)
  • GNN propagates information: "users who liked similar items might like this"
  • Works even for brand new items by leveraging graph structure
  • Business Impact: Better recommendations, increased engagement, higher revenue

4. Social Network Analysis

Problem: Identify influential users, detect communities, predict information spread.

GNN Solution:

  • User friendship/follower graph
  • GNN learns to identify influencers based on network position
  • Predicts which users will adopt new features based on their connections
  • Business Impact: Targeted marketing, viral growth strategies, community management

5. Supply Chain Optimization

Problem: Predict disruptions and bottlenecks in complex supplier networks.

GNN Solution:

  • Supplier-manufacturer-distributor graph
  • GNN predicts which nodes are at risk based on neighbor failures
  • Identifies critical nodes whose failure would cascade
  • Business Impact: Reduced downtime, proactive risk mitigation, cost savings

The Amazon Photo Dataset Problem

Scenario: Amazon's photo category has 7,650 products that need classification into 8 subcategories (cameras, lenses, tripods, bags, etc.).

Available Data:

  • Co-purchase graph: 238,162 edges showing which products are bought together
  • Bag-of-words features: 745-dim vectors (removed for this experiment)

Challenge: Not all co-purchased products belong to the same category. For example:

  • Camera + Camera Bag (noisy edge - different categories)
  • Camera + Camera Lens (useful edge - related categories)

Goal: Build a GNN that learns from graph structure to classify products, even when edges contain noise.


Transductive vs. Inductive Learning

Understanding the difference between transductive and inductive learning is crucial for real-world GNN deployment.

Transductive Learning

Definition: The model is trained on a graph containing ALL nodes (train + validation + test), but only some nodes have labels during training.

Example:

  • You have a social network with 10,000 users
  • 7,000 are labeled (their interests are known)
  • 3,000 are unlabeled (you want to predict their interests)
  • All 10,000 users are present in the graph during training

Characteristics:

  • ✅ Can achieve very high accuracy (model sees entire graph structure)
  • ❌ Cannot handle new nodes that arrive after training
  • ❌ Must retrain model when graph grows
  • ❌ Not suitable for dynamic graphs

Use Cases: Static graphs where all nodes are known upfront (citation networks, social network analysis snapshots)

Inductive Learning

Definition: The model learns generalizable patterns that work on unseen nodes/subgraphs.

Example:

  • You train on one social network with 10,000 users
  • Model learns: "users connected to sports enthusiasts often like sports"
  • When 1,000 new users join, you can classify them immediately without retraining

Characteristics:

  • ✅ Generalizes to new nodes
  • ✅ No retraining needed when graph grows
  • ✅ Suitable for production systems with continuous growth
  • ❌ May have slightly lower accuracy than transductive

Use Cases: E-commerce (new products daily), social networks (new users), fraud detection (new accounts)

Key Difference in Features

The distinction shows up clearly in feature choice:

Feature Type Transductive Inductive Why?
One-Hot Encoding ✅ Yes ❌ No Each node has unique ID - new nodes don't have IDs
Random (Fixed) ✅ Yes ❌ No Random vectors are node-specific
Random (Regenerated) ❌ No ✅ Yes Simulates no persistent features
Degree ✅ Yes ✅ Yes Can compute degree for any node
Centrality ✅ Yes ✅ Yes Can compute metrics for any node
Spectral ✅ Yes ⚠️ Partial Eigenvectors are graph-specific but patterns generalize

Critical Insight from This Project: The 50.87 percentage point drop in accuracy when switching from transductive random (87.98%) to inductive random (37.11%) demonstrates why feature stability matters for message passing.


Part 1: Feature Engineering Experiments

All experiments use identical GCN architecture to isolate the impact of features:

  • Architecture: 3-layer GCN with 128 hidden dimensions per layer
  • Activation: ReLU
  • Regularization: Dropout (0.5), Weight decay (5e-4)
  • Optimizer: Adam (lr=0.01)
  • Training: 200 epochs, evaluate at epoch with best validation accuracy

1. One-Hot Encoding (Transductive)

Method: Identity matrix - each node gets unique binary vector.

one_hot_features = torch.eye(7650)  # 7650 × 7650

Results:

  • Dimensionality: 7,650
  • Test Accuracy: 92.07%
  • Validation Accuracy: 93.03%
  • Transductive: Yes
  • Inductive: No

Analysis: Highest accuracy but cannot generalize to new products. The model essentially memorizes node-specific patterns rather than learning structural rules.


2. Constant Features (Baseline)

Method: All nodes receive identical vector [1, 1, 1, 1].

constant_features = torch.ones(7650, 4)  # All nodes identical

Results:

  • Dimensionality: 4
  • Test Accuracy: 40.16%
  • Validation Accuracy: 42.11%
  • Transductive: No
  • Inductive: Yes

Analysis: With zero node-specific information, the model still achieves 40.16% accuracy (vs. 12.5% random baseline for 8 classes). This proves that graph topology alone contains significant classification signal.


3. Random Features (Transductive)

Method: Fixed random Gaussian vectors, normalized to unit length, stable across all 200 epochs.

random_features = torch.randn(7650, 128)
random_features = F.normalize(random_features, p=2, dim=1)
# These features stay the same throughout training

Results:

  • Dimensionality: 128
  • Test Accuracy: 87.98%
  • Validation Accuracy: 89.10%
  • Transductive: Yes
  • Inductive: No

Analysis: Surprisingly effective despite being random. Fixed random features serve as positional encodings that break symmetries between nodes with similar neighborhoods. The stability allows the GNN to learn complex patterns through message passing.


4. Random Features (Inductive)

Method: Regenerate new random vectors at every epoch to simulate unstable features.

for epoch in range(200):
    # Generate NEW random features each epoch
    data.x = torch.randn(7650, 128)
    data.x = F.normalize(data.x, p=2, dim=1)
    # Train for one epoch with these features

Results:

  • Dimensionality: 128
  • Test Accuracy: 37.11%
  • Validation Accuracy: 38.10%
  • Transductive: No
  • Inductive: Yes

Critical Finding: Catastrophic 50.87 percentage point accuracy drop (87.98% → 37.11%) compared to transductive random. Even worse than constant features (40.16%)!

Why This Happens: Without stable features, the model cannot build persistent representations through message passing. Each epoch starts from scratch. The model only learns from graph structure within a single epoch, which is insufficient.

Key Insight: Feature stability matters more than semantic meaning for GNNs. Random but stable features (88%) vastly outperform semantically empty but unstable features (37%).


5. Degree Features

Method: Two features - raw degree count and log-transformed degree.

degree_features = torch.stack([
    node_degrees,                    # Raw degree
    torch.log(node_degrees + 1)      # Log degree
], dim=1)

Results:

  • Dimensionality: 2
  • Test Accuracy: 66.03%
  • Validation Accuracy: 67.22%
  • Transductive: No
  • Inductive: Yes

Analysis: Simple but effective. With only 2 features, achieves 25.87 percentage points above constant baseline. The log transformation handles power-law degree distributions common in real networks.


6. Centrality Features

Method: Five graph-theoretic metrics capturing different structural roles.

Features:
1. Degree Centrality: Normalized degree (hub identification)
2. Betweenness Centrality: Bridge nodes connecting communities
3. Closeness Centrality: Distance to all other nodes
4. PageRank: Prestige/importance
5. Clustering Coefficient: Local cohesion

Results:

  • Dimensionality: 5
  • Test Accuracy: 84.06%
  • Validation Accuracy: 84.83%
  • Transductive: No
  • Inductive: Yes

Analysis: Strong performance with compact, interpretable features. Each centrality captures a complementary aspect of network position. Good balance between accuracy and feature efficiency.


7. Laplacian Eigenvectors (Spectral Features)

Method: Top-64 eigenvectors of graph Laplacian (L = D - A).

# PyG built-in transform
transform = T.AddLaplacianEigenvectorPE(k=64)
data = transform(data)

Theory:

  • Low eigenvalues (first eigenvectors): Capture global structure, communities
  • High eigenvalues (later eigenvectors): Capture local symmetries
  • Multi-scale encoding: 64 eigenvectors span coarse to fine structural details

Results:

  • Dimensionality: 64
  • Test Accuracy: 92.07%
  • Validation Accuracy: 92.50%
  • Transductive: Partially
  • Inductive: Partially (patterns generalize)

Analysis: Matches one-hot accuracy (92.07%) with 119× fewer dimensions (64 vs 7,650). Best practical choice:

  • Achieves highest accuracy
  • Compact representation
  • Partially inductive (structural patterns generalize to similar nodes)
  • Theoretically grounded in spectral graph theory

Feature Engineering Results Summary

Feature Type Dims Test Acc Val Acc Inductive? Accuracy Drop from Best
One-Hot Encoding 7,650 92.07% 93.03% ❌ No 0.00%
Laplacian Eigenvectors 64 92.07% 92.50% ⚠️ Partial 0.00%
Random (Transductive) 128 87.98% 89.10% ❌ No -4.09%
Node Centralities 5 84.06% 84.83% ✅ Yes -8.01%
Node Degree 2 66.03% 67.22% ✅ Yes -26.04%
Constant Features 4 40.16% 42.11% ✅ Yes -51.91%
Random (Inductive) 128 37.11% 38.10% ✅ Yes -54.96%

Key Findings

  1. Transductive vs Inductive Gap: 50.87% accuracy drop between transductive random (87.98%) and inductive random (37.11%) proves feature stability is critical

  2. Spectral Features Optimal: Laplacian eigenvectors achieve top accuracy with 119× compression and partial inductive capability

  3. Structure Contains Signal: Constant features achieve 40.16% (vs 12.5% random), proving graph topology encodes meaningful information

  4. Stability > Semantics: Fixed random (87.98%) beats unstable features (37.11%) by 50.87 percentage points

  5. Practical Choice: For production systems with new nodes, use Laplacian eigenvectors (92.07% accuracy, inductive) or centralities (84.06%, fully inductive, interpretable)


Part 2: Custom Graph Attention Network

Why Attention Mechanisms?

Standard GCN Problem: Treats all neighbors equally with fixed weights:

h_i^(new) = σ( Σ_j (1/√(d_i·d_j)) · h_j · W )
            ↑
      Fixed weight based only on degree

Challenge in Co-Purchase Networks:

  • Camera + Camera Bag: Different categories (noisy edge)
  • Camera + Camera Lens: Same category (useful edge)
  • GCN averages both equally → noise dilutes signal

GAT Solution: Learn attention weights to identify and downweight noisy connections:

h_i^(new) = σ( Σ_j α_ij · h_j · W )
            ↑
      Learned weight based on feature similarity

Architecture: Multi-Head Graph Attention Network

Attention Mechanism:

1. Transform features: h'_i = h_i W
2. Compute attention logits: e_ij = LeakyReLU(a^T [h'_i || h'_j])
3. Normalize with softmax: α_ij = exp(e_ij) / Σ_k exp(e_ik)
4. Weighted aggregation: h_i = σ(Σ_j α_ij h'_j)

Multi-Head Design: 4 parallel attention mechanisms per layer. Each head can specialize:

  • Head 1: Same-category products
  • Head 2: Complementary accessories
  • Head 3: Price-range similarity
  • Head 4: Brand affinity

Full Model:

Input: One-Hot Features (7,650-dim)
    ↓
GAT Layer 1: 7650 → 4 heads × 128 = 512-dim (concat)
    ↓ ELU + Dropout(0.6)
GAT Layer 2: 512 → 4 heads × 128 = 512-dim (concat)
    ↓ ELU + Dropout(0.6)
GAT Layer 3: 512 → 8 classes
    ↓
Output: Class probabilities

Implementation: Built from scratch using PyTorch Geometric's MessagePassing base class with:

  • LeakyReLU (negative_slope=0.2) for attention logits
  • Softmax normalization over neighborhoods
  • Self-loops for self-attention
  • Concatenation of multi-head outputs

Results: GCN vs Custom GAT

Model Test Accuracy Validation Accuracy Improvement
GCN Baseline 92.16% 93.03% -
Custom GAT 92.68% 93.03% +0.52%

Analysis

Why GAT Outperforms:

  1. Noise Filtering: Attention weights suppress irrelevant co-purchase edges between different categories

  2. Adaptive Aggregation: Important neighbors (same category) receive higher weights

  3. Multi-Head Robustness: 4 heads capture diverse relational patterns simultaneously

Attention Visualization (conceptual example):

Target: DSLR Camera (class: Cameras)

Neighbors & Learned Attention Weights:
  Camera Lens       α = 0.35  ← High (same category)
  Camera Bag        α = 0.28  ← High (same category)
  Memory Card       α = 0.22  ← Medium (accessory)
  Tripod           α = 0.10  ← Medium (accessory)
  Random Book      α = 0.05  ← Low (NOISE - filtered out!)

Why Only 0.52% Improvement?

The dataset is relatively clean - most co-purchases are category-consistent. The 0.52% gain demonstrates that:

  • Even small improvements matter in high-accuracy regimes (92%+)
  • Attention successfully identifies and filters the noisy edges that exist
  • Validates the architectural choice for noisy graph problems

When GAT Would Win Bigger: Datasets with more noise (fraud networks, social media, recommendation graphs) would show larger attention benefits.

Repository Structure

├── GNN.ipynb                    # Full implementation with experiments
├── README.md                    # This comprehensive guide
└── TECHNICAL_GUIDE.md           # Step-by-step walkthrough with toy example

Usage

# Clone repository
git clone https://github.com/yourusername/graph-neural-networks-node-classification.git
cd graph-neural-networks-node-classification

# Install dependencies
pip install torch torch-geometric networkx numpy scipy matplotlib scikit-learn

# Run notebook
jupyter notebook GNN.ipynb

All experiments use fixed random seed (42) for reproducibility.


Key Takeaways

  1. GNNs Enable Relational Learning: Capture graph structure that traditional models miss

  2. Feature Engineering is Critical: 92.07% (spectral) vs 40.16% (constant) - same model, different features

  3. Stability Matters More Than Semantics: Fixed random (87.98%) beats unstable random (37.11%) by 50.87 points

  4. Spectral Features Optimal: Match one-hot accuracy with 119× fewer dimensions and partial inductive capability

  5. Attention Filters Noise: GAT (92.68%) outperforms GCN (92.16%) by learning to suppress irrelevant edges

  6. Choose Based on Production Needs:

    • Static graphs → One-hot encoding (92.07%)
    • Dynamic graphs → Spectral features (92.07%, inductive)
    • Interpretability needed → Centralities (84.06%, explainable)

Future Extensions

  • Heterogeneous graphs (multiple node/edge types)
  • Temporal dynamics (time-evolving co-purchases)
  • GraphSAGE comparison (sampling for scalability)
  • Edge features (purchase frequency, ratings)
  • Attention visualization dashboard

References

  • Kipf & Welling (2017): "Semi-Supervised Classification with Graph Convolutional Networks"
  • Veličković et al. (2018): "Graph Attention Networks"
  • Hamilton et al. (2017): "Inductive Representation Learning on Large Graphs"
  • PyTorch Geometric: https://pytorch-geometric.readthedocs.io/

About

A comprehensive comparison of Graph Convolutional Networks (GCN) and custom Graph Attention Networks (GAT) for product classification in noisy co-purchase networks. Includes node feature augmentation techniques (spectral, degree-based, one-hot encoding) and demonstrates how attention mechanisms filter irrelevant edges on Amazon Photo dataset

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published