Enable CorridorKey 4K inference (4096x2160) on consumer GPUs with 8 GB VRAM.
The original CorridorKey engine OOMs at its native 2048x2048 inference resolution on 8 GB GPUs. Even with Flash Attention enabled (the minimum to avoid the OOM), PyTorch's allocator reserves 9.8 GB and spills into system RAM. This optimization suite reduces that reserved memory to 3.8 GB (a 61% reduction) and cuts median frame time by 58%, enabling full DCI 4K processing on an 8 GB laptop GPU with 4.4 GB of headroom.
- Benchmark Results
- Optimizations Implemented
- Architecture
- Configuration
- Benchmark Methodology
- Output Files
- How to Run
Tested on real 4K green screen footage from Tears of Steel (CC-BY 3.0, Blender Foundation).
Test setup: 100 frames, 4096x2160 OpenEXR 16-bit half-float, linear color space, NVIDIA RTX 4060 Laptop GPU (8 GB).
| Metric | Flash Attention Only (Baseline) | Optimized | V2 (compile) | Baseline vs V2 |
|---|---|---|---|---|
| Total time (100 frames) | 1118.8 s | 785.9 s | 646.5 s | -42.2% |
| Effective FPS | 0.09 fps | 0.13 fps | 0.15 fps | +66.7% |
| Avg frame time | 8423 ms | 5038 ms | 3946 ms | -53.1% |
| Median frame time | 8402 ms | 4979 ms | 3563 ms | -57.6% |
| Min frame time | -- | -- | 3198 ms | -- |
| Metric | Baseline | Optimized | V2 (compile) | Baseline vs V2 |
|---|---|---|---|---|
| Dedicated VRAM (physical) | 8188 MB (maxed out) | 5166 MB | 4916 MB | -40% |
| PyTorch allocator reserved | 9792 MB | 1582 MB | 3794 MB | -61% |
| Shared GPU memory spillover | ~1604 MB | 0 MB | 0 MB | Eliminated |
| Headroom within 8 GB VRAM | 0 MB | 3022 MB | 3272 MB | -- |
The baseline completely saturates the 8 GB GPU and spills ~1.6 GB into shared GPU memory (system RAM accessed over PCIe), which is significantly slower than dedicated VRAM. Both the optimized and v2 configs fit comfortably within dedicated VRAM.
Note on allocator reserved: The `optimized` profile uses aggressive `torch.cuda.empty_cache()` between stages, which returns cached blocks to the CUDA driver (hence the very low 1,582 MB reserved). The `v2` profile disables cache clearing because in-place operations and torch.compile's fused kernels manage memory more efficiently. The allocator reserves more (3,794 MB), but the actual device peak is lower (4,916 MB vs 5,166 MB) and throughput is higher.
The v2 profile incurs a one-time warmup cost as torch.compile traces and compiles Triton kernels:
| | First 5 avg (ms) | Last 5 avg (ms) | Warmup overhead |
|---|---|---|---|
| V2 | 11,735 | 3,456 | +240% |
After warmup, steady-state performance is 3,456 ms/frame (0.29 fps).
The original engine (no optimizations at all) OOMs immediately at 4K on 8 GB GPUs. Flash Attention is the minimum required optimization to avoid the out-of-memory crash. The baseline uses Flash Attention only to serve as the closest proxy to original behavior while remaining runnable.
All footage is from Tears of Steel scene 02_3c, 4096x2160 at 24 fps.
Inputs:
- raw_greenscreen.mp4
- alpha_hint.mp4

Composites:
- Baseline (Flash Attention only): comp_baseline.mp4
- Optimized (all optimizations): comp_optimized.1.mp4
- V2 (optimized + torch.compile): comp_v2.mp4

Alpha mattes:
- Baseline: alpha_baseline.mp4
- Optimized: alpha_optimized.mp4
- V2 (optimized + torch.compile): alpha_v2.mp4
Config flag: flash_attention: True
Profile: All profiles except original
Impact: Required to avoid OOM at 4K on 8 GB GPUs. Without this patch, the global attention blocks materialize a full N x N attention matrix, which does not fit in memory.
CorridorKey uses Meta's Hiera vision transformer as its backbone. Hiera organizes tokens into "mask units" for windowed attention (Stages 0-1), then switches to global attention (Stages 2-3) by setting num_windows = 1.
The problem is in how Hiera constructs its Q/K/V tensors. For global attention, it creates tensors with shape [B, heads, 1, N, head_dim], a 5D tensor where the num_windows dimension is 1. When this 5D non-contiguous tensor is passed to F.scaled_dot_product_attention, PyTorch's SDPA dispatcher silently falls back to the math backend, which materializes the full N x N attention matrix in memory instead of using the memory-efficient FlashAttention kernel.
At 2048x2048 input with Hiera's tokenization, N is large enough that this math-backend fallback consumes too much VRAM for consumer GPUs.
Monkey-patch the forward() method of Hiera's MaskUnitAttention on global-attention blocks (where use_mask_unit_attn == False). The patch:
- Squeezes the `num_windows` dimension from Q/K/V tensors
- Makes them contiguous 4D tensors: `[B, heads, N, head_dim]`
- Passes them to `F.scaled_dot_product_attention`, which now correctly dispatches to FlashAttention/memory-efficient kernels
Windowed attention blocks (Stages 0-1) are left unmodified since they are already efficient.
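The patched forward path can be sketched with a toy module whose `qkv`/`proj`/`heads` attributes mirror Hiera's `MaskUnitAttention`. This is illustrative only, not the actual patch; the key point is building contiguous 4D Q/K/V so SDPA picks a fused kernel:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-in for a global-attention block (attribute names follow
# Hiera's MaskUnitAttention, but this module is illustrative).
class GlobalAttention(nn.Module):
    def __init__(self, dim: int, heads: int):
        super().__init__()
        self.heads = heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        # Build Q/K/V as contiguous 4D tensors [B, heads, N, head_dim]
        # instead of 5D [B, heads, 1, N, head_dim]; the 4D contiguous
        # layout lets SDPA dispatch to Flash/memory-efficient kernels.
        qkv = self.qkv(x).reshape(B, N, 3, self.heads, C // self.heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4).unbind(0)
        out = F.scaled_dot_product_attention(
            q.contiguous(), k.contiguous(), v.contiguous()
        )
        return self.proj(out.transpose(1, 2).reshape(B, N, C))

x = torch.randn(1, 64, 32)
y = GlobalAttention(dim=32, heads=4)(x)
print(y.shape)  # torch.Size([1, 64, 32])
```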
Implementation: CorridorKeyModule/core/optimized_model.py:40-93 (_patch_hiera_global_attention())
Applied in: CorridorKeyModule/core/model_transformer.py:225-230 (during GreenFormer.__init__())
Config flag: tiled_refiner: True, tile_size: 512, tile_overlap: 128
Profile: All profiles except original
Impact: Reduces VRAM usage during the refiner stage by processing the input in small tiles instead of the full 2048x2048 resolution at once.
The CNN Refiner (CNNRefinerModule) takes a 7-channel input (RGB + coarse alpha + coarse FG predictions) at the full 2048x2048 resolution and runs dilated residual convolution blocks to produce additive "delta logits" that sharpen edges. Processing the entire 2048x2048 input at once consumes significant VRAM in the intermediate feature maps.
Replace the standard refiner with TiledCNNRefiner, which processes the input in overlapping tiles:
- Tile size: 512x512 (default)
- Overlap: 128px (default)
- Stride: `tile_size - overlap` = 384px
Each tile is processed independently through the same CNN pipeline. Tile outputs are merged using linear blend weights (ramps from 0 to 1 over the overlap region) to produce seamless results.
This is mathematically lossless because the refiner's receptive field is ~65px (from dilated residual blocks with dilations 1, 2, 4, 8), and the 128px overlap fully covers it. Any pixel's prediction depends only on inputs within 65px, which the overlap guarantees are identical whether processed as part of the full image or a tile.
If the input fits in a single tile (smaller than tile_size), tiling overhead is skipped entirely.
The tiled refiner also deduplicates tile coordinates at image boundaries. The stride-based loop can generate overlapping boundary tiles that map to the same adjusted coordinates after clamping. A set of processed coordinates ensures each unique region is only processed once, eliminating redundant computation.
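The coordinate and blend logic described above can be sketched as follows. This is an illustrative reconstruction (assuming the image is at least one tile wide), not the actual `TiledCNNRefiner` code:

```python
import torch

def tile_coords(size: int, tile: int, overlap: int):
    """Deduplicated top-left offsets along one axis. Boundary tiles are
    clamped so every tile is exactly `tile` long; a set drops the
    duplicates the clamping can produce. Assumes size >= tile."""
    stride = tile - overlap
    seen, coords = set(), []
    for start in range(0, size, stride):
        start = min(start, size - tile)   # clamp at the image boundary
        if start not in seen:
            seen.add(start)
            coords.append(start)
        if start + tile >= size:
            break
    return coords

def blend_weight(tile: int, overlap: int) -> torch.Tensor:
    """2D blend weight that ramps 0 -> 1 over the overlap on every edge,
    so summed tile contributions stay seamless."""
    ramp = torch.ones(tile)
    r = torch.linspace(0, 1, overlap + 2)[1:-1]   # strictly inside (0, 1)
    ramp[:overlap] = r
    ramp[-overlap:] = r.flip(0)
    return ramp[:, None] * ramp[None, :]

coords = tile_coords(2048, 512, 128)
w = blend_weight(512, 128)
print(coords)  # [0, 384, 768, 1152, 1536]
```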
Implementation: CorridorKeyModule/core/optimized_model.py:262-383 (TiledCNNRefiner)
Instantiated in: CorridorKeyModule/core/model_transformer.py:262-276
Config flag: sparse_refiner: True
Profile: optimized, v2, experimental
Impact: Skips CNN refiner processing for tiles where the coarse alpha is uniformly near 0 or near 1. On typical green screen footage, this eliminates 50-70% of tile CNN passes.
In a typical green screen shot, the majority of the frame is either pure background (alpha near 0) or pure foreground (alpha near 1). Only a thin band of edge pixels, typically around hair, fingers, and translucent edges, requires refinement. However, the standard tiled refiner processes every tile identically, spending the same GPU time on uniform-alpha tiles as on complex edge tiles.
Before running the CNN on each tile, check the coarse alpha channel of that tile:
- If `alpha_tile.max() < 0.05`: the tile is pure background, skip it
- If `alpha_tile.min() > 0.95`: the tile is pure foreground, skip it
- Otherwise: the tile contains edges, process normally
Skipped tiles contribute zero delta (the delta accumulator was initialized to zeros) and their blend weight is still accumulated so the weighted average remains correct.
This is quality-safe because the CNN refiner outputs additive "delta logits", which are corrections to the existing prediction. For regions that are already confidently background or foreground, the trained refiner naturally produces near-zero deltas. Skipping these tiles substitutes exactly zero, which closely matches what the refiner would have produced.
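The skip test itself is a one-liner; this sketch uses the thresholds quoted above (the function name is illustrative):

```python
import torch

def should_skip(alpha_tile: torch.Tensor, lo: float = 0.05, hi: float = 0.95) -> bool:
    """Sparse-skip test: uniformly background or foreground tiles
    contribute near-zero delta, so the CNN pass can be skipped."""
    return bool(alpha_tile.max() < lo or alpha_tile.min() > hi)

print(should_skip(torch.zeros(512, 512)))        # True  (pure background)
print(should_skip(torch.full((512, 512), 0.5)))  # False (edge tile)
```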
Implementation: CorridorKeyModule/core/optimized_model.py:368-376 (sparse skip in TiledCNNRefiner.forward())
Profile: All profiles (built into the model)
Impact: Reduces peak activation memory by reusing tensors instead of allocating new ones for logit-space operations.
The original decode-and-refine pipeline allocated separate tensors for each step: decoder logits, upsampled logits, coarse probabilities, refiner deltas, refined logits, and final probabilities. At 2048x2048 resolution with 4 channels (1 alpha + 3 FG), each full-resolution tensor is ~64 MB in FP16. Having multiple overlapping tensor lifetimes inflates peak activation memory.
Rewrite _decode_and_refine() to use in-place operations and explicit del statements:
- In-place residual addition: `alpha_logits_up.add_(delta[:, 0:1])` instead of `alpha_logits_up = alpha_logits_up + delta[:, 0:1]`
- In-place sigmoid: `alpha_logits_up.sigmoid_()` instead of `alpha = torch.sigmoid(alpha_logits_up)`
- In-place refiner scaling: `delta_logits.mul_(refiner_scale)` when scale != 1.0
- Explicit tensor cleanup: `del alpha_logits, fg_logits` after upsample frees the H/4-resolution tensors immediately; `del alpha_coarse, fg_coarse` after building the refiner input; `del delta_logits` after the residual add
These in-place operations require torch.inference_mode() context (which disables autograd version counting).
The refiner_scale parameter is passed directly through forward() instead of using a forward hook, keeping the compiled graph clean.
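A minimal sketch of the in-place pattern under `inference_mode`. Tensor names follow the description above; the shapes and the delta are stand-ins for the real decoder/refiner outputs:

```python
import torch

@torch.inference_mode()
def decode_and_refine(alpha_logits_up, delta, refiner_scale=1.0):
    if refiner_scale != 1.0:
        delta.mul_(refiner_scale)          # in-place refiner scaling
    alpha_logits_up.add_(delta[:, 0:1])    # in-place residual add
    del delta                              # drop the local reference (in the
                                           # real pipeline this frees the tensor)
    return alpha_logits_up.sigmoid_()      # in-place sigmoid -> alpha

logits = torch.zeros(1, 1, 4, 4)           # full-res coarse alpha logits
delta = torch.zeros(1, 4, 4, 4)            # refiner delta (1 alpha + 3 FG)
alpha = decode_and_refine(logits, delta)
print(alpha.mean().item())  # 0.5 (sigmoid of zero logits)
```

Because every step mutates `alpha_logits_up` in place, the returned `alpha` is the same storage as `logits`; no new full-resolution tensor is allocated.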
Implementation: CorridorKeyModule/core/model_transformer.py:291-344 (_decode_and_refine())
Profile: All profiles on CUDA
Impact: Improves CUDA convolution kernel selection and memory access patterns.
PyTorch defaults to NCHW (channels-first) memory layout. NVIDIA's cuDNN and Triton kernels often perform better with NHWC (channels-last) layout because it produces more coalesced memory accesses for convolution and batch normalization operations.
After model loading, convert the model and input tensors to channels_last memory format:
```python
self.model = self.model.to(memory_format=torch.channels_last)
inp_t = inp_t.to(memory_format=torch.channels_last)
```

The logical NCHW shape is unchanged (only the tensor strides differ), but the underlying data is reordered to NHWC; cuDNN and Triton then automatically select kernels optimized for that layout.
Implementation: CorridorKeyModule/base_engine.py:58-60 (model), base_engine.py:237-238 (input tensor)
Profile: All profiles
Impact: Enables in-place operations and reduces per-tensor overhead by disabling autograd bookkeeping.
torch.no_grad() disables gradient computation but still tracks tensor versions for autograd safety. This prevents in-place operations from being used freely and adds overhead to every tensor allocation.
Replace @torch.no_grad() with @torch.inference_mode() on process_frame(). Inference mode:
- Disables gradient computation (same as `no_grad`)
- Disables autograd version counting, allowing safe in-place operations (`.add_()`, `.sigmoid_()`, `.mul_()`)
- Reduces per-tensor metadata overhead
Implementation: CorridorKeyModule/base_engine.py:175 (@torch.inference_mode() decorator on process_frame())
Config flag: compile_submodules: True
Profile: v2, experimental
Requires: Triton (install triton-windows on Windows)
Impact: 34% faster median frame time through Triton kernel fusion. Reduces intermediate tensor materializations.
Eager-mode PyTorch executes each operation individually, launching a separate CUDA kernel for each Conv2d, GroupNorm, ReLU, linear projection, etc. Each kernel launch has dispatch overhead (~5-10 us), and intermediate results are materialized in VRAM between operations even when they could be fused.
Apply torch.compile() to individual sub-modules rather than the full model:
```python
self.model.encoder = torch.compile(self.model.encoder)
self.model.alpha_decoder = torch.compile(self.model.alpha_decoder)
self.model.fg_decoder = torch.compile(self.model.fg_decoder)
self.model.refiner._process_tile = torch.compile(self.model.refiner._process_tile)
```

Why sub-modules, not the full model? The orchestration code in _decode_and_refine() contains control flow (conditional refiner_scale, del statements, in-place ops) and the tiled refiner uses Python loops with dynamic tile coordinates, all of which cause graph breaks in torch.compile. Compiling sub-modules individually keeps each compiled graph clean and break-free.
Why not CUDA graphs? Hiera's forward_intermediates() stores intermediate feature tensors at each stage boundary. CUDA graphs capture and replay GPU operations, but the stored tensors get overwritten on replay, causing RuntimeError: accessing tensor output of CUDAGraphs that has been overwritten. CUDA graphs are explicitly disabled via inductor_config.triton.cudagraphs = False.
What gets fused: Conv-GroupNorm-ReLU chains in the refiner, transformer block internals (LayerNorm-Linear-GELU sequences), element-wise operations, and attention projections. The Triton compiler generates custom GPU kernels that fuse these operation sequences, avoiding intermediate tensor materializations.
Warmup: The first ~5 frames incur 2-3x overhead as Triton traces and compiles kernels. The suppress_errors = True setting ensures graceful fallback to eager mode if any graph fails to compile.
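The compile settings described above can be sketched as follows. `suppress_errors` and the cudagraphs switch match the text; `backend="eager"` is used here only so the sketch runs without Triton (the real setup uses the default inductor backend):

```python
import torch
import torch._dynamo as dynamo
from torch._inductor import config as inductor_config

dynamo.config.suppress_errors = True        # fall back to eager on compile failure
inductor_config.triton.cudagraphs = False   # avoid the CUDA-graph replay issue

# Compile one sub-module of a toy model, leaving the orchestration eager.
model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.ReLU())
model[0] = torch.compile(model[0], backend="eager")
out = model(torch.randn(2, 8))
print(out.shape)  # torch.Size([2, 8])
```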
Implementation: CorridorKeyModule/base_engine.py:66-97 (_compile_submodules())
Config flag: disable_cudnn_benchmark: True
Profile: All profiles except original
Impact: Reduces VRAM used by cuDNN workspace allocations during convolution benchmarking.
When torch.backends.cudnn.benchmark = True (PyTorch's default in many setups), cuDNN runs multiple convolution algorithms on the first call to find the fastest one. Each algorithm trial requires allocating workspace memory, which adds to VRAM usage. On memory-constrained GPUs, this benchmark overhead can push memory usage over the limit.
Set torch.backends.cudnn.benchmark = False. cuDNN will use its default heuristic-selected algorithm instead of benchmarking. The selected algorithm may be slightly slower for specific convolution shapes, but avoids the workspace memory overhead.
Implementation: CorridorKeyModule/base_engine.py:52-54
Config flag: cache_clearing: True
Profile: optimized only (disabled in v2 where in-place ops manage memory)
Impact: Prevents memory accumulation between pipeline stages.
PyTorch's CUDA caching allocator retains freed GPU memory blocks for potential reuse. While this avoids the overhead of repeated cudaMalloc/cudaFree calls, it means memory from one pipeline stage remains "reserved" (from the OS perspective) even after the tensors are freed. When the next stage has a different allocation pattern, it allocates additional memory on top of the cached blocks, inflating total reserved memory.
With the encoder, decoder, and refiner stages each having different tensor shapes and sizes, the caching allocator can accumulate reserved memory across all stages simultaneously.
Call torch.cuda.empty_cache() at two strategic points in the inference pipeline (before and after model inference) at the engine level.
Note: In the v2 profile, cache clearing is disabled. The combination of in-place operations (which reuse existing tensor memory instead of allocating new tensors) and torch.compile's fused kernels (which reduce intermediate allocations) makes cache clearing unnecessary. Disabling it avoids the cudaFree/cudaMalloc overhead on every frame.
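The engine-level pattern can be sketched as below (function name is illustrative; the real hooks live in base_engine.py):

```python
import torch

def process_with_cache_clearing(model, inp, cache_clearing=True):
    """Return cached allocator blocks to the driver before and after
    model inference, so stages with different allocation patterns do
    not accumulate reserved memory."""
    if cache_clearing and torch.cuda.is_available():
        torch.cuda.empty_cache()   # before inference
    out = model(inp)
    if cache_clearing and torch.cuda.is_available():
        torch.cuda.empty_cache()   # after inference
    return out

out = process_with_cache_clearing(lambda t: t * 2, torch.ones(3))
print(out.sum().item())  # 6.0
```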
Implementation: CorridorKeyModule/base_engine.py:242-258 (engine-level cache clearing)
Config flag: token_routing: True
Profile: experimental only
Status: Experimental, disabled by default. Requires fine-tuning for production use.
Route "easy" tokens (solid foreground/background, as determined by the alpha hint mask) to a lightweight LTRM (Lightweight Token Refinement Module) instead of full global self-attention. Only "edge" tokens (uncertain alpha values between configurable thresholds) go through the expensive O(N^2) global attention.
- Edge tokens: Alpha hint between 0.02 and 0.98 (configurable) -> full attention
- Easy tokens: Alpha hint below 0.02 or above 0.98 -> LTRM at O(N) cost
The LTRM architecture: LayerNorm -> Linear expand -> GELU -> DWConv 5x5 -> GELU -> Linear project -> ECA residual gating
The LTRM weights are zero-initialized (fc2 weights = 0), so the module starts as an identity function. This makes it fully compatible with the pretrained checkpoint without any fine-tuning. The model can be loaded and run with token routing enabled, but optimal quality requires fine-tuning the LTRM weights.
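A minimal sketch of the LTRM described above. The layer sequence and zero-initialized `fc2` follow the text; the ECA residual gating is omitted and all names/shapes are illustrative:

```python
import torch
import torch.nn as nn

class LTRM(nn.Module):
    """LayerNorm -> Linear expand -> GELU -> DWConv 5x5 -> GELU ->
    Linear project, as a residual. Zero-init fc2 makes the module an
    identity at load time, so it attaches to a pretrained checkpoint
    without fine-tuning."""
    def __init__(self, dim: int, expand: int = 2):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fc1 = nn.Linear(dim, dim * expand)
        self.dwconv = nn.Conv2d(dim * expand, dim * expand, 5,
                                padding=2, groups=dim * expand)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(dim * expand, dim)
        nn.init.zeros_(self.fc2.weight)   # identity at init
        nn.init.zeros_(self.fc2.bias)

    def forward(self, x, h, w):           # x: [B, N, C] tokens on an h*w grid
        y = self.act(self.fc1(self.norm(x)))
        y = y.transpose(1, 2).reshape(x.shape[0], -1, h, w)
        y = self.act(self.dwconv(y)).flatten(2).transpose(1, 2)
        return x + self.fc2(y)            # O(N) residual refinement

x = torch.randn(1, 16, 8)                 # 16 tokens on a 4x4 grid, dim 8
out = LTRM(8)(x, 4, 4)
print(torch.allclose(out, x))  # True: identity at init
```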
Implementation: CorridorKeyModule/core/optimized_model.py:101-254 (LTRM, ECA, HintBasedTokenRouter)
```
_BaseCorridorKeyEngine (base_engine.py)
    Abstract base class: constructor, checkpoint loading,
    process_frame() pipeline, cuDNN disable, channels_last,
    torch.compile, inference_mode, metrics
    |
    |--- CorridorKeyEngine (inference_engine.py)
    |        Original engine. Uses GreenFormer directly.
    |        Defaults to OptimizationConfig.original() (all opts off)
    |
    |--- OptimizedCorridorKeyEngine (optimized_engine.py)
             Optimized engine. Uses OptimizedGreenFormer.
             Defaults to OptimizationConfig.optimized() (production opts on)
```

```
GreenFormer (model_transformer.py)
    Base model: Hiera backbone, multiscale decoders, CNN refiner
    Handles: FlashAttention patching, tiled refiner, in-place
    decode-and-refine, refiner_scale parameter
    |
    |--- OptimizedGreenFormer (optimized_model.py)
             Extends GreenFormer with token routing machinery
             (LTRM + HintBasedTokenRouter)
             When routing is disabled, delegates entirely to GreenFormer.forward()
```
Optimizations are config-driven, not engine-driven. Both engines accept any OptimizationConfig. The GreenFormer base model handles FlashAttention, tiled refiner, sparse skip, and in-place operations based on the config, so even the "original" CorridorKeyEngine can use these optimizations if given the right config. The _BaseCorridorKeyEngine handles channels_last, inference_mode, torch.compile, and cache clearing at the engine level. The OptimizedCorridorKeyEngine simply defaults to the optimized profile and adds LTRM weight handling.
```
Input (4096x2160 EXR, linear float)
        |
        v
[Lanczos4 resize to 2048x2048]
        |
        v
[Linear -> sRGB conversion]            (if input_is_linear=True)
        |
        v
[ImageNet normalization + alpha hint concat -> 4-channel input]
        |
        v
[channels_last conversion]             (CUDA only)
        |
        v
[torch.cuda.empty_cache()]             (if cache_clearing)
        |
        v
[Hiera Encoder]        Stages 0-1: Windowed attention (efficient)
        |              Stages 2-3: Global attention (FlashAttention patched)
        |              (torch.compile'd if compile_submodules)
        v
[Multiscale Decoder]   Predicts coarse alpha (1ch) + coarse FG (3ch) at H/4
        |              (torch.compile'd if compile_submodules)
        |
        |-- del H/4 tensors (in-place pipeline)
        |
        v
[F.interpolate to full resolution]
        |
        v
[sigmoid -> coarse probs -> cat -> refiner input]
        |
        |-- del coarse probs (in-place pipeline)
        |
        v
[TiledCNNRefiner]      7ch input (RGB + coarse predictions)
        |              512x512 tiles, 128px overlap
        |              Sparse skip: uniform-alpha tiles skipped
        |              Dedup: boundary tiles processed once
        |              (torch.compile'd _process_tile if compile_submodules)
        v
[In-place residual add in logit space]   logits.add_(delta * refiner_scale)
        |
        v
[In-place sigmoid activation]            logits.sigmoid_()
        |
        v
[torch.cuda.empty_cache()]             (if cache_clearing)
        |
        v
[Lanczos4 resize back to 4096x2160]
        |
        v
[Post-processing: despill, premultiply, composite]
        |
        v
Output: alpha, FG (sRGB), processed (linear premul RGBA), comp (sRGB preview)
```
In CorridorKeyModule/backend.py, the system auto-detects the optimal engine:
- CUDA GPU with less than 16 GB VRAM: uses `OptimizedCorridorKeyEngine` with `OptimizationConfig.optimized()`
- CUDA GPU with 16 GB or more VRAM: uses the standard `CorridorKeyEngine` (no optimizations needed)
- Apple Silicon with MLX available: uses the MLX backend
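The selection rule reduces to a simple threshold check; this sketch uses the thresholds above (the function name is illustrative, not the actual backend.py code):

```python
def pick_engine(total_vram_gb: float) -> str:
    """VRAM-based engine selection as described above: constrained GPUs
    get the optimized engine, large GPUs run the stock engine."""
    if total_vram_gb < 16:
        return "OptimizedCorridorKeyEngine + OptimizationConfig.optimized()"
    return "CorridorKeyEngine"

print(pick_engine(8))   # 8 GB laptop GPU -> optimized engine
print(pick_engine(24))  # 24 GB desktop GPU -> standard engine
```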
| Profile | `flash_attention` | `tiled_refiner` | `sparse_refiner` | `disable_cudnn_benchmark` | `cache_clearing` | `compile_submodules` | `token_routing` |
|---|---|---|---|---|---|---|---|
| `original` | off | off | off | off | off | off | off |
| `optimized` (production) | on | on | on | on | on | off | off |
| `v2` | on | on | on | on | off | on | off |
| `experimental` | on | on | on | on | off | on | on |
Profile selection guide:
- `optimized`: Safe, no compilation warmup, works without Triton. Best for interactive use.
- `v2`: Fastest steady-state throughput. Requires Triton (`triton-windows` on Windows). ~10 s warmup on the first frames.
- `experimental`: Includes untrained token routing. For development only.
```shell
# Use a named profile
uv run python corridorkey_cli.py --action run_inference --profile v2

# Individual toggles (override profile settings)
--flash-attention / --no-flash-attention
--tiled-refiner / --no-tiled-refiner
--sparse-refiner / --no-sparse-refiner
--tile-size N              # default: 512
--tile-overlap N           # default: 128
--disable-cudnn-benchmark / --no-disable-cudnn-benchmark
--cache-clearing / --no-cache-clearing
--compile / --no-compile
--token-routing / --no-token-routing
--metrics                  # enable per-stage timing/VRAM reporting
```

```python
from CorridorKeyModule.optimization_config import OptimizationConfig
from CorridorKeyModule.optimized_engine import OptimizedCorridorKeyEngine

# Production config (no compile warmup)
config = OptimizationConfig.optimized()

# V2 config (fastest, requires Triton)
config = OptimizationConfig.v2()

# Custom config
config = OptimizationConfig(
    flash_attention=True,
    tiled_refiner=True,
    sparse_refiner=True,
    tile_size=512,
    tile_overlap=128,
    disable_cudnn_benchmark=True,
    cache_clearing=False,
    compile_submodules=True,
    enable_metrics=True,
)

engine = OptimizedCorridorKeyEngine(
    checkpoint_path="CorridorKeyModule/checkpoints/CorridorKey.pth",
    device="cuda",
    img_size=2048,
    use_refiner=True,
    optimization_config=config,
)

# Process a frame (supports linear EXR input)
result = engine.process_frame(image_rgb, alpha_hint, input_is_linear=True)
# result["alpha"]     -> [H, W, 1] float32 alpha matte
# result["fg"]        -> [H, W, 3] float32 sRGB foreground
# result["processed"] -> [H, W, 4] float32 linear premultiplied RGBA (EXR-ready)
# result["comp"]      -> [H, W, 3] float32 sRGB composite preview
# result["metrics"]   -> PerformanceMetrics (if enable_metrics=True)
```

- Source: Tears of Steel (scene 02_3c)
- License: CC-BY 3.0 (c) Blender Foundation | mango.blender.org
- Format: OpenEXR 16-bit half-float, 4096x2160 (DCI 4K), linear color space
- Frames: First 100 frames (~4.2 seconds at 24 fps)
- Content: Green screen footage with actors, real production footage rather than synthetic test data
Alpha hints were auto-generated using HSV chroma keying:
- Read linear EXR frame
- Convert linear to sRGB (piecewise transfer function)
- Convert sRGB to HSV
- Threshold green hue range (35-85), saturation (>40), value (>30)
- Invert (green = background, non-green = foreground)
- Erode with 7px elliptical kernel (slight under-prediction, which the model handles better)
- Gaussian blur with 21px kernel (produces soft/coarse edges the model expects)
- Save as uint8 PNG
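The thresholding and inversion steps above can be sketched in pure Python (the real script presumably uses OpenCV; the erode and blur steps are omitted here). Hue is scaled to 0-179 and sat/val to 0-255 to match the quoted thresholds:

```python
import colorsys

def hint_mask(rgb_rows):
    """HSV chroma-key thresholding: green pixels become background (0),
    everything else foreground (255). Illustrative sketch only."""
    out = []
    for row in rgb_rows:
        out_row = []
        for r, g, b in row:
            h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
            is_green = 35 <= h * 179 <= 85 and s * 255 > 40 and v * 255 > 30
            out_row.append(0 if is_green else 255)  # inverted: green = background
        out.append(out_row)
    return out

mask = hint_mask([[(40, 200, 60)], [(200, 60, 60)]])  # green row, red row
print(mask)  # [[0], [255]]
```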
Script: tears_of_steel_test/generate_alpha_hints.py
benchmark_4k_vram.py runs each configuration in a separate subprocess to ensure clean GPU state. Supports configurable profile selection:
```shell
uv run python benchmark_4k_vram.py --profile v2           # single profile (default)
uv run python benchmark_4k_vram.py --profile baseline v2  # compare two profiles
uv run python benchmark_4k_vram.py --profile all          # compare all three
uv run python benchmark_4k_vram.py --frames 50            # fewer frames
```

Available profiles: `baseline` (Flash Attention only), `optimized`, `v2` (compile), `all` (run all three).
For each config, it measures:
- Per-frame wall-clock time (ms)
- PyTorch allocator peak allocated/reserved memory
- Device-level GPU memory via `torch.cuda.mem_get_info()` (polled at 25 ms intervals in a background thread)
- Per-stage timing (inference, postprocessing) via `PerformanceMetrics`
Output is written as EXR frame sequences (linear premultiplied RGBA + single-channel alpha) for quality comparison.
Three levels of GPU memory are tracked:
| Level | API | What it shows |
|---|---|---|
| Live tensors | `torch.cuda.max_memory_allocated()` | PyTorch tensor memory only |
| Allocator reserved | `torch.cuda.max_memory_reserved()` | Total memory held by PyTorch's caching allocator (includes freed-but-cached blocks) |
| Device-level | `torch.cuda.mem_get_info()` | Actual GPU memory usage including CUDA context, cuDNN workspace, and driver overhead. Equivalent to what Task Manager shows. Capped at physical VRAM and cannot detect shared memory spillover. |
When allocator reserved exceeds physical VRAM, Windows spills into shared GPU memory (system RAM accessed over PCIe), which is dramatically slower. The benchmark reports this spillover explicitly.
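The background device-memory poller can be sketched as below. `get_free_total` stands in for `torch.cuda.mem_get_info` so the sketch runs without a GPU; the 25 ms interval matches the text, while names and the fake device are illustrative:

```python
import threading
import time

def poll_peak(get_free_total, stop, interval_s=0.025):
    """Sample device free/total memory at a fixed interval and track the
    peak usage seen, until `stop` is set."""
    peak = 0
    while not stop.is_set():
        free_b, total_b = get_free_total()
        peak = max(peak, total_b - free_b)
        time.sleep(interval_s)
    return peak

# Simulated 8 GB device whose usage spikes to 6 GB then drops.
samples = iter([6, 3, 5])
fake = lambda: (8 - next(samples, 4), 8)

stop = threading.Event()
result = {}
t = threading.Thread(target=lambda: result.setdefault("peak", poll_peak(fake, stop)))
t.start(); time.sleep(0.2); stop.set(); t.join()
print(result["peak"])  # 6
```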
```
Output/
  comp_baseline/     # Processed RGBA EXR sequence (baseline)
  comp_optimized/    # Processed RGBA EXR sequence (optimized)
  comp_v2/           # Processed RGBA EXR sequence (v2)
  alpha_baseline/    # Alpha matte EXR sequence (baseline)
  alpha_optimized/   # Alpha matte EXR sequence (optimized)
  alpha_v2/          # Alpha matte EXR sequence (v2)
```
- Processed RGBA: Linear premultiplied RGBA, half-float, PXR24 compression. Ready for compositing in Nuke, Fusion, After Effects, etc.
- Alpha: Single-channel linear float
```
tears_of_steel_test/
  frames/                   # 100 EXR source frames (4096x2160, ~51 MB each)
  alpha_hints/              # 100 PNG alpha hints (4096x2160, auto-generated)
  download_frames.py        # Downloads frames from media.xiph.org
  generate_alpha_hints.py   # HSV chroma key alpha hint generator
```

```shell
uv run python tears_of_steel_test/download_frames.py
```

Downloads the first 100 EXR frames (~5.1 GB) from the Tears of Steel open movie project.

```shell
uv run python tears_of_steel_test/generate_alpha_hints.py
```

Generates coarse alpha hints using HSV chroma keying.
```shell
# Windows
uv pip install triton-windows

# Linux
uv pip install triton
```

```shell
# Run v2 profile (default)
uv run python benchmark_4k_vram.py

# Run specific profile
uv run python benchmark_4k_vram.py --profile optimized

# Compare all profiles
uv run python benchmark_4k_vram.py --profile all

# Fewer frames for a quick test
uv run python benchmark_4k_vram.py --profile v2 --frames 10
```

Processes frames through the selected configuration(s), generates the report at `benchmark_4k_results.md`, and writes output EXR sequences to `Output/`.
| File | Purpose |
|---|---|
| `CorridorKeyModule/optimization_config.py` | `OptimizationConfig` dataclass, profiles (original, optimized, v2, experimental), `PerformanceMetrics` |
| `CorridorKeyModule/base_engine.py` | `_BaseCorridorKeyEngine`: inference_mode, channels_last, torch.compile, cache clearing, refiner_scale |
| `CorridorKeyModule/optimized_engine.py` | `OptimizedCorridorKeyEngine` with LTRM weight handling |
| `CorridorKeyModule/core/optimized_model.py` | FlashAttention patch, `TiledCNNRefiner` (dedup + sparse), LTRM, ECA, TokenRouter |
| `CorridorKeyModule/core/model_transformer.py` | `GreenFormer` model: in-place `_decode_and_refine()`, refiner_scale parameter |
| `CorridorKeyModule/backend.py` | Auto-backend selection based on GPU VRAM |
| `benchmark_4k_vram.py` | 4K benchmark script (configurable profiles) |
| `benchmark_4k_results.md` | Latest benchmark results |