
Conversation


@grapentt grapentt commented Nov 25, 2025

On-Disk Inductive Preprocessing with DAG-Based Incremental Caching

🎯 Problem Statement

Topological deep learning on large inductive datasets presents a unique challenge: topological structures (simplicial complexes, hypergraphs, cell complexes) are inherently memory-intensive. For datasets with thousands of graphs, traditional in-memory preprocessing causes out-of-memory (OOM) errors before training can even begin.

The bottleneck:

  • Graph dataset (5,000 graphs × 100 nodes) → ~500 MB
  • After topological lifting (e.g., SimplicialCliqueLifting) → ~10-15 GB
  • Result: OOM on systems with < 16GB RAM, limiting accessibility of topological DL

Research workflow friction:

  • Iterating on transform pipelines requires reprocessing from scratch
  • Changing one transform = re-applying ALL previous transforms
  • Result: Wasted hours reprocessing expensive computations

💡 Solution Overview

This PR introduces on-disk inductive preprocessing with DAG-based incremental caching to TopoBench, enabling:

  1. Constant memory training on datasets of any size (O(1) memory vs O(N×D²))
  • Automatic transform reuse through intelligent caching (much faster iteration)
  3. Configurable storage backends optimized for different workflow phases
  4. Parallel preprocessing with transparent multi-worker support (>3x speedup)

🏗️ Architecture

Core Components

1. OnDiskInductivePreprocessor

Stream-to-disk architecture that processes samples sequentially:

for sample in dataset:              # One at a time
    transformed = transform(sample)  # ~50MB memory
    save_to_disk(transformed)        # Free memory immediately
# Result: Constant memory regardless of dataset size
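
A slightly more complete, self-contained sketch of this pattern (the per-sample file layout and helper names below are illustrative assumptions, not the actual TopoBench implementation):

from pathlib import Path

import torch

def preprocess_to_disk(dataset, transform, cache_dir):
    """Stream each sample through the transform and persist it immediately.

    Peak memory stays at roughly one transformed sample, independent of len(dataset).
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    for idx, sample in enumerate(dataset):
        transformed = transform(sample)                           # one sample in memory at a time
        torch.save(transformed, cache_dir / f"sample_{idx}.pt")   # persist, then drop the reference
        del transformed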

Key innovations:

  • Sequential processing with constant memory footprint
  • Persistent caching with unique directory structure
  • Automatic batch loading during training

2. DAG-Based Transform Caching

Treats transform pipelines as a Directed Acyclic Graph (DAG):

Source → [Transform A] → [Transform B] → [Transform C] → Output
         ↓ cached         ↓ cached         ↓ new
         DataTransform_0  DataTransform_1  DataTransform_2

Cache key: {transform_id}_{parameter_hash}

  • transform_id: Position in pipeline (handles duplicate transforms)
  • parameter_hash: Hash of configuration (detects parameter changes)

Impact: When adding Transform C, only C is processed—A and B are reused automatically.
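
A minimal sketch of how such a key can be derived (the hashing scheme is illustrative; the exact key format used by the implementation may differ):

import hashlib
import json

def cache_key(transform_idx, transform_config):
    """Build a '{transform_id}_{parameter_hash}' key for one pipeline stage."""
    # Canonical JSON so that equivalent configs hash to the same key.
    canonical = json.dumps(transform_config, sort_keys=True, default=str)
    param_hash = hashlib.sha256(canonical.encode()).hexdigest()[:12]
    return f"DataTransform_{transform_idx}_{param_hash}"

# Changing any parameter changes the hash, so only that stage (and later ones) is recomputed.
print(cache_key(0, {"lifting": "SimplicialCliqueLifting", "complex_dim": 2}))
print(cache_key(0, {"lifting": "SimplicialCliqueLifting", "complex_dim": 3}))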

3. Dual Storage Backends

Two complementary storage strategies:

| Backend | Use Case | Speed | Storage | Best For |
|---------|----------|-------|---------|----------|
| Files | Development | Fast (no compression) | ~70 MB / 1K samples | Rapid iteration, parallel processing |
| Mmap | Production | Moderate (I/O optimized) | ~15 MB / 1K samples (4-5× compression) | Deployment, storage-constrained systems |

Trade-off explanation:

  • Files backend: Enables a 3-4× parallel speedup and makes the DAG cache benefits clearly visible
  • Mmap backend: Compression creates a bottleneck (limits parallel speedup to ~2×) and hides the DAG speedup behind conversion overhead
  • Recommendation: Use files for development, convert to mmap for production
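
An illustrative sketch of the two storage ideas, assuming one torch.save file per sample for the files backend and a single packed numpy.memmap array for the mmap backend (the real backends may pack and compress differently):

from pathlib import Path

import numpy as np
import torch

def save_files_backend(tensors, cache_dir):
    """Files idea: one uncompressed file per sample; workers can write independently."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    for idx, t in enumerate(tensors):
        torch.save(t, cache_dir / f"sample_{idx}.pt")

def save_mmap_backend(tensors, path):
    """Mmap idea: one packed array on disk, later read lazily via np.memmap."""
    stacked = np.stack([t.numpy() for t in tensors])
    mm = np.memmap(path, dtype=stacked.dtype, mode="w+", shape=stacked.shape)
    mm[:] = stacked          # single sequential write; packing is the serial bottleneck
    mm.flush()
    return stacked.shape     # keep the shape to reopen the memmap later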

🔌 Seamless Integration

Minimal API Changes

The interface remains nearly identical to in-memory preprocessing:

# Before (in-memory)
dataset = MyDataset(...)
preprocessor = Preprocessor(
    dataset=dataset,
    transforms_config=config
)

# After (on-disk) - just add data_dir!
preprocessor = OnDiskInductivePreprocessor(
    dataset=dataset,
    data_dir="./data/cache",      # ← Only addition
    transforms_config=config,      # Same config
    storage_backend="files",       # Optional: choose backend
    num_workers=4                  # Optional: parallel processing
)

Automatic DAG Caching

No manual cache management required:

# Experiment 1: Baseline
config1 = {"clique_lifting": {...}}
preprocessor1 = OnDiskInductivePreprocessor(data_dir="./data", transforms_config=config1)
# Time: 40s, clique_lifting cached

# Experiment 2: Add feature transform (automatic reuse!)
config2 = {"clique_lifting": {...}, "projection": {...}}
preprocessor2 = OnDiskInductivePreprocessor(data_dir="./data", transforms_config=config2)
# Output: "Reusing 1 cached transform(s)!"
# Time: 14s (only processes projection, reuses clique_lifting)

Compatible with Existing Workflows

Works with all TopoBench components:

  • ✅ All liftings (simplicial, hypergraph, cell)
  • ✅ All feature transforms
  • ✅ All models (SCCNN, EDGNN, etc.)
  • ✅ TBDataloader
  • ✅ PyTorch Lightning training
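
A hedged end-to-end sketch of the wiring (the preprocessor call mirrors the snippet above; the plain PyTorch Geometric DataLoader below is a stand-in assumption for TBDataloader, and it assumes the preprocessor exposes a dataset-style interface):

from torch_geometric.loader import DataLoader

# Preprocess once; later runs hit the on-disk cache instead of recomputing.
preprocessor = OnDiskInductivePreprocessor(
    dataset=dataset,
    data_dir="./data/cache",
    transforms_config=config,
    storage_backend="files",
)

# Batches are assembled from disk on the fly, so memory stays flat during training.
loader = DataLoader(preprocessor, batch_size=32, shuffle=True)
for batch in loader:
    out = model(batch)   # any TopoBench model, e.g. SCCNN or EDGNN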

⚖️ Trade-offs

Performance

| Aspect | In-Memory | On-Disk (Files) | On-Disk (Mmap) |
|--------|-----------|-----------------|----------------|
| Memory | O(N×D²) | O(1) constant | O(1) constant |
| Preprocessing | All at once | Stream to disk | Stream + compress |
| Training speed | Baseline | 10-15% slower | 15-20% slower |
| Disk usage | None | ~70 MB / 1K samples | ~15 MB / 1K samples |
| Parallel speedup | N/A | 3-4× (7 workers) | ~2× (compression bottleneck) |
| DAG cache benefit | None | Clear (3× visible) | Partial (hidden by compression) |

When to Use Each Approach

Use on-disk when:

  • ✅ Dataset has > 1,000 graphs
  • ✅ Using topological liftings (memory-intensive)
  • ✅ Available RAM < 8GB
  • ✅ Want persistent caching across experiments

Use in-memory when:

  • ✅ Dataset has < 500 graphs
  • ✅ Abundant RAM (> 16GB)
  • ✅ Need absolute fastest training
  • ✅ No storage constraints

Backend selection:

  • Files: Development, iteration, parallel processing (recommended for experimentation)
  • Mmap: Production deployment, storage-limited systems (recommended for final models)
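
The rules of thumb above can be condensed into a small, purely illustrative helper (thresholds are the ones quoted in this PR):

def choose_preprocessing(num_graphs, available_ram_gb, uses_lifting):
    """Map the guidance above onto a preprocessing choice (illustrative only)."""
    if num_graphs < 500 and available_ram_gb > 16 and not uses_lifting:
        return "in-memory"    # small dataset, plenty of RAM: fastest training
    # Large datasets, topological liftings, or limited RAM: stream to disk.
    # Use the files backend while iterating; convert to mmap for storage-limited deployment.
    return "on-disk"

print(choose_preprocessing(num_graphs=5000, available_ram_gb=8, uses_lifting=True))  # -> "on-disk"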

📊 Benchmark Results

1. Parallel Speedup (Files Backend)

Dataset: 20,000 samples, SimplicialCliqueLifting

[Figure: preprocessing time vs. number of workers (files backend)]

Key findings:

  • 1 worker: 240s (baseline)
  • 2 workers: 154s (1.6× speedup)
  • 4 workers: 91s (2.6× speedup)
  • 7 workers: 72s (3.3× speedup) ✅

Compression overhead (Mmap):

  • Storage: 14.8 MB (4.46× compression ratio)
  • With 1 worker: ~36s processing time
  • Parallel speedup limited to ~2× due to compression bottleneck

2. DAG Cache Reuse (Files Backend)

Dataset: 20,000 samples, incremental transforms

[Figure: preprocessing time for each caching scenario]

Scenario comparison:

  • Initial build: 42.3s (SimplicialCliqueLifting)
  • Cache hit: <0.01s (50000× speedup!)
  • Light extension: 14.2s (add 1 ProjectionSum, 3.0× speedup)
  • Heavy extension: 28.1s (add 2 ProjectionSum, 1.5× speedup)

3. Memory Efficiency

Dataset: Variable sizes (1,000 - 4,000 samples), SimplicialCliqueLifting

[Figure: peak memory usage vs. dataset size, in-memory vs. on-disk]

Memory usage:

  • In-memory: Linear growth (~2-5 MB per 50 samples)
    • 1,000 samples: ~800 MB
    • 2,000 samples: ~1.6 GB
    • 4,000 samples: ~3.2 GB
  • On-disk (both backends): Constant ~50-100 MB regardless of dataset size ✅

Disk usage comparison:

  • Files backend: ~70 MB per 1,000 samples
  • Mmap backend: ~15 MB per 1,000 samples (4-5× compression)
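
A quick back-of-envelope check of these numbers (the per-sample figure is taken from the 1,000-sample in-memory measurement above; the rest follows from it):

per_sample_mb = 800 / 1000          # ~0.8 MB RAM per lifted sample (from the 1,000-sample run)
for n in (1_000, 2_000, 4_000):
    print(f"{n} samples -> ~{n * per_sample_mb / 1000:.1f} GB in memory")
# On-disk, peak RAM stays at ~50-100 MB, while disk grows by ~70 MB (files)
# or ~15 MB (mmap) per 1,000 samples.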

Benchmark scripts:

  • benchmarks/benchmark_comprehensive_pipeline.py - Main benchmark suite
  • benchmarks/configs/test.yaml - Test configuration

Note: The benchmark scripts are included in this PR for validation. Consider removing them before merge or moving them to a separate benchmarks/ directory.


📚 Documentation

User-Facing Documentation

Tutorials created:

  1. Part 1: Getting Started (tutorial_ondisk_inductive_part1_getting_started.ipynb)

    • Basic on-disk preprocessing
    • When to use vs in-memory
  2. Part 2: Advanced Techniques (tutorial_ondisk_inductive_part2_advanced.ipynb)

    • DAG caching workflows
    • Storage backend selection
    • Parallel processing

Summary: This PR makes topological deep learning accessible to researchers with limited computational resources while optimizing workflows for rapid experimentation. The dual storage backend approach provides the right tool for each phase of research—from fast development iteration to efficient production deployment.

@levtelyatnikov levtelyatnikov added the category-b1 label (Submission to TDL Challenge 2025: Mission B, Category 1) on Nov 26, 2025
@levtelyatnikov (Collaborator)

Hi @grapentt! Feel free to modify pyproject.toml file to include any extra dependencies you might need for your submission. Good luck!

@review-notebook-app

Check out this pull request on ReviewNB to see visual diffs and provide feedback on Jupyter Notebooks.



@grapentt grapentt marked this pull request as ready for review November 26, 2025 17:50
