
Conversation


@grapentt grapentt commented Nov 25, 2025

On-Disk Inductive Preprocessing with DAG-Based Incremental Caching

🎯 Problem Statement

Topological deep learning on large inductive datasets presents a unique challenge: topological structures (simplicial complexes, hypergraphs, cell complexes) are inherently memory-intensive. For datasets with thousands of graphs, traditional in-memory preprocessing causes out-of-memory (OOM) errors before training can even begin.

The bottleneck:

  • Graph dataset (5,000 graphs × 100 nodes) → ~500 MB
  • After topological lifting (e.g., SimplicialCliqueLifting) → ~10-15 GB
  • Result: OOM on systems with < 16GB RAM, limiting accessibility of topological DL

Research workflow friction:

  • Iterating on transform pipelines requires reprocessing from scratch
  • Changing one transform = re-applying ALL previous transforms
  • Result: Wasted hours reprocessing expensive computations

💡 Solution Overview

This PR introduces on-disk inductive preprocessing with DAG-based incremental caching to TopoBench, enabling:

  1. Constant memory training on datasets of any size (O(1) memory vs O(N×D²))
  • Automatic transform reuse through intelligent caching (much faster iteration)
  3. Configurable storage backends optimized for different workflow phases
  4. Parallel preprocessing with transparent multi-worker support (>3x speedup)

🏗️ Architecture

Core Components

1. OnDiskInductivePreprocessor

Stream-to-disk architecture that processes samples sequentially:

for sample in dataset:              # One at a time
    transformed = transform(sample)  # ~50MB memory
    save_to_disk(transformed)        # Free memory immediately
# Result: Constant memory regardless of dataset size
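
A slightly more complete, self-contained sketch of this pattern (the per-sample file layout and helper names below are illustrative assumptions, not the actual TopoBench implementation):

from pathlib import Path

import torch

def preprocess_to_disk(dataset, transform, cache_dir):
    """Stream each sample through the transform and persist it immediately.

    Peak memory stays at roughly one transformed sample, independent of len(dataset).
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    for idx, sample in enumerate(dataset):
        transformed = transform(sample)                           # one sample in memory at a time
        torch.save(transformed, cache_dir / f"sample_{idx}.pt")   # persist, then drop the reference
        del transformed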

Key innovations:

  • Sequential processing with constant memory footprint
  • Persistent caching with unique directory structure
  • Automatic batch loading during training

2. DAG-Based Transform Caching

Treats transform pipelines as a Directed Acyclic Graph (DAG):

Source → [Transform A] → [Transform B] → [Transform C] → Output
         ↓ cached         ↓ cached         ↓ new
         DataTransform_0  DataTransform_1  DataTransform_2

Cache key: {transform_id}_{parameter_hash}

  • transform_id: Position in pipeline (handles duplicate transforms)
  • parameter_hash: Hash of configuration (detects parameter changes)

Impact: When adding Transform C, only C is processed—A and B are reused automatically.
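
A minimal sketch of how such a key can be derived (the hashing scheme is illustrative; the exact key format used by the implementation may differ):

import hashlib
import json

def cache_key(transform_idx, transform_config):
    """Build a '{transform_id}_{parameter_hash}' key for one pipeline stage."""
    # Canonical JSON so that equivalent configs hash to the same key.
    canonical = json.dumps(transform_config, sort_keys=True, default=str)
    param_hash = hashlib.sha256(canonical.encode()).hexdigest()[:12]
    return f"DataTransform_{transform_idx}_{param_hash}"

# Changing any parameter changes the hash, so only that stage (and later ones) is recomputed.
print(cache_key(0, {"lifting": "SimplicialCliqueLifting", "complex_dim": 2}))
print(cache_key(0, {"lifting": "SimplicialCliqueLifting", "complex_dim": 3}))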

3. Dual Storage Backends

Two complementary storage strategies:

| Backend | Use Case | Speed | Storage | Best For |
|---------|----------|-------|---------|----------|
| Files | Development | Fast (no compression) | ~70 MB / 1K samples | Rapid iteration, parallel processing |
| Mmap | Production | Moderate (I/O optimized) | ~15 MB / 1K samples (4-5× compression) | Deployment, storage-constrained systems |

Trade-off explanation:

  • Files backend: Enables a 3-4× parallel speedup and makes the DAG cache benefits clearly visible
  • Mmap backend: Compression creates a bottleneck (limits parallel speedup to ~2×) and hides the DAG speedup behind conversion overhead
  • Recommendation: Use files for development, convert to mmap for production
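
An illustrative sketch of the two storage ideas, assuming one torch.save file per sample for the files backend and a single packed numpy.memmap array for the mmap backend (the real backends may pack and compress differently):

from pathlib import Path

import numpy as np
import torch

def save_files_backend(tensors, cache_dir):
    """Files idea: one uncompressed file per sample; workers can write independently."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    for idx, t in enumerate(tensors):
        torch.save(t, cache_dir / f"sample_{idx}.pt")

def save_mmap_backend(tensors, path):
    """Mmap idea: one packed array on disk, later read lazily via np.memmap."""
    stacked = np.stack([t.numpy() for t in tensors])
    mm = np.memmap(path, dtype=stacked.dtype, mode="w+", shape=stacked.shape)
    mm[:] = stacked          # single sequential write; packing is the serial bottleneck
    mm.flush()
    return stacked.shape     # keep the shape to reopen the memmap later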

🔌 Seamless Integration

Minimal API Changes

The interface remains nearly identical to in-memory preprocessing:

# Before (in-memory)
dataset = MyDataset(...)
preprocessor = Preprocessor(
    dataset=dataset,
    transforms_config=config
)

# After (on-disk) - just add data_dir!
preprocessor = OnDiskInductivePreprocessor(
    dataset=dataset,
    data_dir="./data/cache",      # ← Only addition
    transforms_config=config,      # Same config
    storage_backend="files",       # Optional: choose backend
    num_workers=4                  # Optional: parallel processing
)

Automatic DAG Caching

No manual cache management required:

# Experiment 1: Baseline
config1 = {"clique_lifting": {...}}
preprocessor1 = OnDiskInductivePreprocessor(data_dir="./data", transforms_config=config1)
# Time: 40s, clique_lifting cached

# Experiment 2: Add feature transform (automatic reuse!)
config2 = {"clique_lifting": {...}, "projection": {...}}
preprocessor2 = OnDiskInductivePreprocessor(data_dir="./data", transforms_config=config2)
# Output: "Reusing 1 cached transform(s)!"
# Time: 14s (only processes projection, reuses clique_lifting)

Compatible with Existing Workflows

Works with all TopoBench components:

  • ✅ All liftings (simplicial, hypergraph, cell)
  • ✅ All feature transforms
  • ✅ All models (SCCNN, EDGNN, etc.)
  • ✅ TBDataloader
  • ✅ PyTorch Lightning training
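
A hedged end-to-end sketch of the wiring (the preprocessor call mirrors the snippet above; the plain PyTorch Geometric DataLoader below is a stand-in assumption for TBDataloader, and it assumes the preprocessor exposes a dataset-style interface):

from torch_geometric.loader import DataLoader

# Preprocess once; later runs hit the on-disk cache instead of recomputing.
preprocessor = OnDiskInductivePreprocessor(
    dataset=dataset,
    data_dir="./data/cache",
    transforms_config=config,
    storage_backend="files",
)

# Batches are assembled from disk on the fly, so memory stays flat during training.
loader = DataLoader(preprocessor, batch_size=32, shuffle=True)
for batch in loader:
    out = model(batch)   # any TopoBench model, e.g. SCCNN or EDGNN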

⚖️ Trade-offs

Performance

| Aspect | In-Memory | On-Disk (Files) | On-Disk (Mmap) |
|--------|-----------|-----------------|----------------|
| Memory | O(N×D²) | O(1) constant | O(1) constant |
| Preprocessing | All at once | Stream to disk | Stream + compress |
| Training speed | Baseline | 10-15% slower | 15-20% slower |
| Disk usage | None | ~70 MB / 1K samples | ~15 MB / 1K samples |
| Parallel speedup | N/A | 3-4× (7 workers) | ~2× (compression bottleneck) |
| DAG cache benefit | None | Clear (3× visible) | Partial (hidden by compression) |

When to Use Each Approach

Use on-disk when:

  • ✅ Dataset has > 1,000 graphs
  • ✅ Using topological liftings (memory-intensive)
  • ✅ Available RAM < 8GB
  • ✅ Want persistent caching across experiments

Use in-memory when:

  • ✅ Dataset has < 500 graphs
  • ✅ Abundant RAM (> 16GB)
  • ✅ Need absolute fastest training
  • ✅ No storage constraints

Backend selection:

  • Files: Development, iteration, parallel processing (recommended for experimentation)
  • Mmap: Production deployment, storage-limited systems (recommended for final models)
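
The rules of thumb above can be condensed into a small, purely illustrative helper (thresholds are the ones quoted in this PR):

def choose_preprocessing(num_graphs, available_ram_gb, uses_lifting):
    """Map the guidance above onto a preprocessing choice (illustrative only)."""
    if num_graphs < 500 and available_ram_gb > 16 and not uses_lifting:
        return "in-memory"    # small dataset, plenty of RAM: fastest training
    # Large datasets, topological liftings, or limited RAM: stream to disk.
    # Use the files backend while iterating; convert to mmap for storage-limited deployment.
    return "on-disk"

print(choose_preprocessing(num_graphs=5000, available_ram_gb=8, uses_lifting=True))  # -> "on-disk"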

📊 Benchmark Results

1. Parallel Speedup (Files Backend)

Dataset: 20,000 samples, SimplicialCliqueLifting

[Figure: preprocessing time vs. number of workers (files backend)]

Key findings:

  • 1 worker: 240s (baseline)
  • 2 workers: 154s (1.6× speedup)
  • 4 workers: 91s (2.6× speedup)
  • 7 workers: 72s (3.3× speedup) ✅

Compression overhead (Mmap):

  • Storage: 14.8 MB (4.46× compression ratio)
  • With 1 worker: ~36s processing time
  • Parallel speedup limited to ~2× due to compression bottleneck

2. DAG Cache Reuse (Files Backend)

Dataset: 20,000 samples, incremental transforms

[Figure: preprocessing time for each caching scenario]

Scenario comparison:

  • Initial build: 42.3s (SimplicialCliqueLifting)
  • Cache hit: <0.01s (50000× speedup!)
  • Light extension: 14.2s (add 1 ProjectionSum, 3.0× speedup)
  • Heavy extension: 28.1s (add 2 ProjectionSum, 1.5× speedup)

3. Memory Efficiency

Dataset: Variable sizes (1,000 - 4,000 samples), SimplicialCliqueLifting

[Figure: peak memory usage vs. dataset size, in-memory vs. on-disk]

Memory usage:

  • In-memory: Linear growth (~2-5 MB per 50 samples)
    • 1,000 samples: ~800 MB
    • 2,000 samples: ~1.6 GB
    • 4,000 samples: ~3.2 GB
  • On-disk (both backends): Constant ~50-100 MB regardless of dataset size ✅

Disk usage comparison:

  • Files backend: ~70 MB per 1,000 samples
  • Mmap backend: ~15 MB per 1,000 samples (4-5× compression)
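
A quick back-of-envelope check of these numbers (the per-sample figure is taken from the 1,000-sample in-memory measurement above; the rest follows from it):

per_sample_mb = 800 / 1000          # ~0.8 MB RAM per lifted sample (from the 1,000-sample run)
for n in (1_000, 2_000, 4_000):
    print(f"{n} samples -> ~{n * per_sample_mb / 1000:.1f} GB in memory")
# On-disk, peak RAM stays at ~50-100 MB, while disk grows by ~70 MB (files)
# or ~15 MB (mmap) per 1,000 samples.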

Benchmark scripts:

  • benchmarks/benchmark_comprehensive_pipeline.py - Main benchmark suite
  • benchmarks/configs/test.yaml - Test configuration

Note: The benchmark scripts are included in this PR for validation. Consider removing them before merge or moving them to a separate benchmarks/ directory.


📚 Documentation

User-Facing Documentation

Tutorials created:

  1. Part 1: Getting Started (tutorial_ondisk_inductive_part1_getting_started.ipynb)

    • Basic on-disk preprocessing
    • When to use vs in-memory
  2. Part 2: Advanced Techniques (tutorial_ondisk_inductive_part2_advanced.ipynb)

    • DAG caching workflows
    • Storage backend selection
    • Parallel processing

Summary: This PR makes topological deep learning accessible to researchers with limited computational resources while optimizing workflows for rapid experimentation. The dual storage backend approach provides the right tool for each phase of research—from fast development iteration to efficient production deployment.

@levtelyatnikov levtelyatnikov added the category-b1 label (Submission to TDL Challenge 2025: Mission B, Category 1) on Nov 26, 2025
@levtelyatnikov (Collaborator)

Hi @grapentt! Feel free to modify pyproject.toml file to include any extra dependencies you might need for your submission. Good luck!

@review-notebook-app

Check out this pull request on ReviewNB to see visual diffs and provide feedback on Jupyter Notebooks.



@grapentt grapentt marked this pull request as ready for review November 26, 2025 17:50
