Enable CorridorKey 4K inference (4096x2160) on consumer GPUs with 8 GB VRAM.
The original CorridorKey engine OOMs at its native 2048x2048 inference resolution on 8 GB GPUs. Even with Flash Attention enabled (the minimum to avoid the OOM), PyTorch's allocator reserves 9.8 GB and spills into system RAM. This optimization suite reduces that reserved memory to 3.8 GB (a 61% reduction) and cuts median frame time by 58%, enabling full DCI 4K processing on an 8 GB laptop GPU with 4.4 GB of headroom.
- Benchmark Results
- Optimizations Implemented
- Architecture
- Configuration
- Benchmark Methodology
- Output Files
- How to Run
Tested on real 4K green screen footage from Tears of Steel (CC-BY 3.0, Blender Foundation).
Test setup: 100 frames, 4096x2160 OpenEXR 16-bit half-float, linear color space, NVIDIA RTX 4060 Laptop GPU (8 GB).
| Metric | Flash Attention Only (Baseline) | Optimized | V2 (compile) | Baseline vs V2 |
|---|---|---|---|---|
| Total time (100 frames) | 1118.8 s | 785.9 s | 646.5 s | -42.2% |
| Effective FPS | 0.09 fps | 0.13 fps | 0.15 fps | +66.7% |
| Avg frame time | 8423 ms | 5038 ms | 3946 ms | -53.1% |
| Median frame time | 8402 ms | 4979 ms | 3563 ms | -57.6% |
| Min frame time | -- | -- | 3198 ms | -- |
| Metric | Baseline | Optimized | V2 (compile) | Baseline vs V2 |
|---|---|---|---|---|
| Dedicated VRAM (physical) | 8188 MB (maxed out) | 5166 MB | 4916 MB | -40% |
| PyTorch allocator reserved | 9792 MB | 1582 MB | 3794 MB | -61% |
| Shared GPU memory spillover | ~1604 MB | 0 MB | 0 MB | Eliminated |
| Headroom within 8 GB VRAM | 0 MB | 3022 MB | 3272 MB | -- |
The baseline completely saturates the 8 GB GPU and spills ~1.6 GB into shared GPU memory (system RAM accessed over PCIe), which is significantly slower than dedicated VRAM. Both the optimized and v2 configs fit comfortably within dedicated VRAM.
Note on allocator reserved: The `optimized` profile uses aggressive `torch.cuda.empty_cache()` between stages, which returns cached blocks to the CUDA driver (hence the very low 1,582 MB reserved). The `v2` profile disables cache clearing because in-place operations and torch.compile's fused kernels manage memory more efficiently. The allocator reserves more (3,794 MB), but the actual device peak is lower (4,916 MB vs 5,166 MB) and throughput is higher.
The v2 profile incurs a one-time warmup cost as torch.compile traces and compiles Triton kernels:
| | First 5 avg (ms) | Last 5 avg (ms) | Warmup overhead |
|---|---|---|---|
| V2 | 11,735 | 3,456 | +240% |
After warmup, steady-state performance is 3,456 ms/frame (0.29 fps).
The original engine (no optimizations at all) OOMs immediately at 4K on 8 GB GPUs. Flash Attention is the minimum required optimization to avoid the out-of-memory crash. The baseline uses Flash Attention only to serve as the closest proxy to original behavior while remaining runnable.
All footage is from Tears of Steel scene 02_3c, 4096x2160 at 24 fps.
Inputs:
- raw_greenscreen.mp4
- alpha_hint.mp4

Composites:
- Baseline (Flash Attention only): comp_baseline.mp4
- Optimized (all optimizations): comp_optimized.1.mp4
- V2 (optimized + torch.compile): comp_v2.mp4

Alpha mattes:
- Baseline: alpha_baseline.mp4
- Optimized: alpha_optimized.mp4
- V2 (optimized + torch.compile): alpha_v2.mp4
Config flag: flash_attention: True
Profile: All profiles except original
Impact: Required to avoid OOM at 4K on 8 GB GPUs. Without this patch, the global attention blocks materialize a full N x N attention matrix, which does not fit in memory.
CorridorKey uses Meta's Hiera vision transformer as its backbone. Hiera organizes tokens into "mask units" for windowed attention (Stages 0-1), then switches to global attention (Stages 2-3) by setting num_windows = 1.
The problem is in how Hiera constructs its Q/K/V tensors. For global attention, it creates tensors with shape [B, heads, 1, N, head_dim], a 5D tensor where the num_windows dimension is 1. When this 5D non-contiguous tensor is passed to F.scaled_dot_product_attention, PyTorch's SDPA dispatcher silently falls back to the math backend, which materializes the full N x N attention matrix in memory instead of using the memory-efficient FlashAttention kernel.
At 2048x2048 input with Hiera's tokenization, N is large enough that this math-backend fallback consumes too much VRAM for consumer GPUs.
Monkey-patch the forward() method of Hiera's MaskUnitAttention on global-attention blocks (where use_mask_unit_attn == False). The patch:
- Squeezes the `num_windows` dimension from Q/K/V tensors
- Makes them contiguous 4D tensors: `[B, heads, N, head_dim]`
- Passes them to `F.scaled_dot_product_attention`, which now correctly dispatches to FlashAttention/memory-efficient kernels
Windowed attention blocks (Stages 0-1) are left unmodified since they are already efficient.
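The patched forward path can be sketched with a toy module whose `qkv`/`proj`/`heads` attributes mirror Hiera's `MaskUnitAttention`. This is illustrative only, not the actual patch; the key point is building contiguous 4D Q/K/V so SDPA picks a fused kernel:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-in for a global-attention block (attribute names follow
# Hiera's MaskUnitAttention, but this module is illustrative).
class GlobalAttention(nn.Module):
    def __init__(self, dim: int, heads: int):
        super().__init__()
        self.heads = heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        # Build Q/K/V as contiguous 4D tensors [B, heads, N, head_dim]
        # instead of 5D [B, heads, 1, N, head_dim]; the 4D contiguous
        # layout lets SDPA dispatch to Flash/memory-efficient kernels.
        qkv = self.qkv(x).reshape(B, N, 3, self.heads, C // self.heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4).unbind(0)
        out = F.scaled_dot_product_attention(
            q.contiguous(), k.contiguous(), v.contiguous()
        )
        return self.proj(out.transpose(1, 2).reshape(B, N, C))

x = torch.randn(1, 64, 32)
y = GlobalAttention(dim=32, heads=4)(x)
print(y.shape)  # torch.Size([1, 64, 32])
```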
Implementation: CorridorKeyModule/core/optimized_model.py:40-93 (_patch_hiera_global_attention())
Applied in: CorridorKeyModule/core/model_transformer.py:225-230 (during GreenFormer.__init__())
Config flag: tiled_refiner: True, tile_size: 512, tile_overlap: 128
Profile: All profiles except original
Impact: Reduces VRAM usage during the refiner stage by processing the input in small tiles instead of the full 2048x2048 resolution at once.
The CNN Refiner (CNNRefinerModule) takes a 7-channel input (RGB + coarse alpha + coarse FG predictions) at the full 2048x2048 resolution and runs dilated residual convolution blocks to produce additive "delta logits" that sharpen edges. Processing the entire 2048x2048 input at once consumes significant VRAM in the intermediate feature maps.
Replace the standard refiner with TiledCNNRefiner, which processes the input in overlapping tiles:
- Tile size: 512x512 (default)
- Overlap: 128px (default)
- Stride: `tile_size - overlap` = 384px
Each tile is processed independently through the same CNN pipeline. Tile outputs are merged using linear blend weights (ramps from 0 to 1 over the overlap region) to produce seamless results.
This is mathematically lossless because the refiner's receptive field is ~65px (from dilated residual blocks with dilations 1, 2, 4, 8), and the 128px overlap fully covers it. Any pixel's prediction depends only on inputs within 65px, which the overlap guarantees are identical whether processed as part of the full image or a tile.
If the input fits in a single tile (smaller than tile_size), tiling overhead is skipped entirely.
The tiled refiner also deduplicates tile coordinates at image boundaries. The stride-based loop can generate overlapping boundary tiles that map to the same adjusted coordinates after clamping. A set of processed coordinates ensures each unique region is only processed once, eliminating redundant computation.
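The coordinate and blend logic described above can be sketched as follows. This is an illustrative reconstruction (assuming the image is at least one tile wide), not the actual `TiledCNNRefiner` code:

```python
import torch

def tile_coords(size: int, tile: int, overlap: int):
    """Deduplicated top-left offsets along one axis. Boundary tiles are
    clamped so every tile is exactly `tile` long; a set drops the
    duplicates the clamping can produce. Assumes size >= tile."""
    stride = tile - overlap
    seen, coords = set(), []
    for start in range(0, size, stride):
        start = min(start, size - tile)   # clamp at the image boundary
        if start not in seen:
            seen.add(start)
            coords.append(start)
        if start + tile >= size:
            break
    return coords

def blend_weight(tile: int, overlap: int) -> torch.Tensor:
    """2D blend weight that ramps 0 -> 1 over the overlap on every edge,
    so summed tile contributions stay seamless."""
    ramp = torch.ones(tile)
    r = torch.linspace(0, 1, overlap + 2)[1:-1]   # strictly inside (0, 1)
    ramp[:overlap] = r
    ramp[-overlap:] = r.flip(0)
    return ramp[:, None] * ramp[None, :]

coords = tile_coords(2048, 512, 128)
w = blend_weight(512, 128)
print(coords)  # [0, 384, 768, 1152, 1536]
```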
Implementation: CorridorKeyModule/core/optimized_model.py:262-383 (TiledCNNRefiner)
Instantiated in: CorridorKeyModule/core/model_transformer.py:262-276
Config flag: sparse_refiner: True
Profile: optimized, v2, experimental
Impact: Skips CNN refiner processing for tiles where the coarse alpha is uniformly near 0 or near 1. On typical green screen footage, this eliminates 50-70% of tile CNN passes.
In a typical green screen shot, the majority of the frame is either pure background (alpha near 0) or pure foreground (alpha near 1). Only a thin band of edge pixels, typically around hair, fingers, and translucent edges, requires refinement. However, the standard tiled refiner processes every tile identically, spending the same GPU time on uniform-alpha tiles as on complex edge tiles.
Before running the CNN on each tile, check the coarse alpha channel of that tile:
- If `alpha_tile.max() < 0.05`: the tile is pure background, skip it
- If `alpha_tile.min() > 0.95`: the tile is pure foreground, skip it
- Otherwise: the tile contains edges, process normally
Skipped tiles contribute zero delta (the delta accumulator was initialized to zeros) and their blend weight is still accumulated so the weighted average remains correct.
This is quality-safe because the CNN refiner outputs additive "delta logits", which are corrections to the existing prediction. For regions that are already confidently background or foreground, the trained refiner naturally produces near-zero deltas. Skipping these tiles substitutes exactly zero, which closely matches what the refiner would have produced.
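The skip test itself is a one-liner; this sketch uses the thresholds quoted above (the function name is illustrative):

```python
import torch

def should_skip(alpha_tile: torch.Tensor, lo: float = 0.05, hi: float = 0.95) -> bool:
    """Sparse-skip test: uniformly background or foreground tiles
    contribute near-zero delta, so the CNN pass can be skipped."""
    return bool(alpha_tile.max() < lo or alpha_tile.min() > hi)

print(should_skip(torch.zeros(512, 512)))        # True  (pure background)
print(should_skip(torch.full((512, 512), 0.5)))  # False (edge tile)
```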
Implementation: CorridorKeyModule/core/optimized_model.py:368-376 (sparse skip in TiledCNNRefiner.forward())
Profile: All profiles (built into the model)
Impact: Reduces peak activation memory by reusing tensors instead of allocating new ones for logit-space operations.
The original decode-and-refine pipeline allocated separate tensors for each step: decoder logits, upsampled logits, coarse probabilities, refiner deltas, refined logits, and final probabilities. At 2048x2048 resolution with 4 channels (1 alpha + 3 FG), each full-resolution tensor is ~64 MB in FP16. Having multiple overlapping tensor lifetimes inflates peak activation memory.
Rewrite _decode_and_refine() to use in-place operations and explicit del statements:
- In-place residual addition: `alpha_logits_up.add_(delta[:, 0:1])` instead of `alpha_logits_up = alpha_logits_up + delta[:, 0:1]`
- In-place sigmoid: `alpha_logits_up.sigmoid_()` instead of `alpha = torch.sigmoid(alpha_logits_up)`
- In-place refiner scaling: `delta_logits.mul_(refiner_scale)` when scale != 1.0
- Explicit tensor cleanup: `del alpha_logits, fg_logits` after upsample frees the H/4-resolution tensors immediately; `del alpha_coarse, fg_coarse` after building the refiner input; `del delta_logits` after the residual add
These in-place operations require torch.inference_mode() context (which disables autograd version counting).
The refiner_scale parameter is passed directly through forward() instead of using a forward hook, keeping the compiled graph clean.
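A minimal sketch of the in-place pattern under `inference_mode`. Tensor names follow the description above; the shapes and the delta are stand-ins for the real decoder/refiner outputs:

```python
import torch

@torch.inference_mode()
def decode_and_refine(alpha_logits_up, delta, refiner_scale=1.0):
    if refiner_scale != 1.0:
        delta.mul_(refiner_scale)          # in-place refiner scaling
    alpha_logits_up.add_(delta[:, 0:1])    # in-place residual add
    del delta                              # drop the local reference (in the
                                           # real pipeline this frees the tensor)
    return alpha_logits_up.sigmoid_()      # in-place sigmoid -> alpha

logits = torch.zeros(1, 1, 4, 4)           # full-res coarse alpha logits
delta = torch.zeros(1, 4, 4, 4)            # refiner delta (1 alpha + 3 FG)
alpha = decode_and_refine(logits, delta)
print(alpha.mean().item())  # 0.5 (sigmoid of zero logits)
```

Because every step mutates `alpha_logits_up` in place, the returned `alpha` is the same storage as `logits`; no new full-resolution tensor is allocated.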
Implementation: CorridorKeyModule/core/model_transformer.py:291-344 (_decode_and_refine())
Profile: All profiles on CUDA
Impact: Improves CUDA convolution kernel selection and memory access patterns.
PyTorch defaults to NCHW (channels-first) memory layout. NVIDIA's cuDNN and Triton kernels often perform better with NHWC (channels-last) layout because it produces more coalesced memory accesses for convolution and batch normalization operations.
After model loading, convert the model and input tensors to channels_last memory format:
```python
self.model = self.model.to(memory_format=torch.channels_last)
inp_t = inp_t.to(memory_format=torch.channels_last)
```

The logical NCHW shape is unchanged (only the tensor strides differ), but the underlying data is reordered to NHWC; cuDNN and Triton then automatically select kernels optimized for that layout.
Implementation: CorridorKeyModule/base_engine.py:58-60 (model), base_engine.py:237-238 (input tensor)
Profile: All profiles
Impact: Enables in-place operations and reduces per-tensor overhead by disabling autograd bookkeeping.
torch.no_grad() disables gradient computation but still tracks tensor versions for autograd safety. This prevents in-place operations from being used freely and adds overhead to every tensor allocation.
Replace @torch.no_grad() with @torch.inference_mode() on process_frame(). Inference mode:
- Disables gradient computation (same as `no_grad`)
- Disables autograd version counting, allowing safe in-place operations (`.add_()`, `.sigmoid_()`, `.mul_()`)
- Reduces per-tensor metadata overhead
Implementation: CorridorKeyModule/base_engine.py:175 (@torch.inference_mode() decorator on process_frame())
Config flag: compile_submodules: True
Profile: v2, experimental
Requires: Triton (install triton-windows on Windows)
Impact: 34% faster median frame time through Triton kernel fusion. Reduces intermediate tensor materializations.
Eager-mode PyTorch executes each operation individually, launching a separate CUDA kernel for each Conv2d, GroupNorm, ReLU, linear projection, etc. Each kernel launch has dispatch overhead (~5-10 us), and intermediate results are materialized in VRAM between operations even when they could be fused.
Apply torch.compile() to individual sub-modules rather than the full model:
```python
self.model.encoder = torch.compile(self.model.encoder)
self.model.alpha_decoder = torch.compile(self.model.alpha_decoder)
self.model.fg_decoder = torch.compile(self.model.fg_decoder)
self.model.refiner._process_tile = torch.compile(self.model.refiner._process_tile)
```

Why sub-modules, not the full model? The orchestration code in _decode_and_refine() contains control flow (conditional refiner_scale, del statements, in-place ops) and the tiled refiner uses Python loops with dynamic tile coordinates, all of which cause graph breaks in torch.compile. Compiling sub-modules individually keeps each compiled graph clean and break-free.
Why not CUDA graphs? Hiera's forward_intermediates() stores intermediate feature tensors at each stage boundary. CUDA graphs capture and replay GPU operations, but the stored tensors get overwritten on replay, causing RuntimeError: accessing tensor output of CUDAGraphs that has been overwritten. CUDA graphs are explicitly disabled via inductor_config.triton.cudagraphs = False.
What gets fused: Conv-GroupNorm-ReLU chains in the refiner, transformer block internals (LayerNorm-Linear-GELU sequences), element-wise operations, and attention projections. The Triton compiler generates custom GPU kernels that fuse these operation sequences, avoiding intermediate tensor materializations.
Warmup: The first ~5 frames incur 2-3x overhead as Triton traces and compiles kernels. The suppress_errors = True setting ensures graceful fallback to eager mode if any graph fails to compile.
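The compile settings described above can be sketched as follows. `suppress_errors` and the cudagraphs switch match the text; `backend="eager"` is used here only so the sketch runs without Triton (the real setup uses the default inductor backend):

```python
import torch
import torch._dynamo as dynamo
from torch._inductor import config as inductor_config

dynamo.config.suppress_errors = True        # fall back to eager on compile failure
inductor_config.triton.cudagraphs = False   # avoid the CUDA-graph replay issue

# Compile one sub-module of a toy model, leaving the orchestration eager.
model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.ReLU())
model[0] = torch.compile(model[0], backend="eager")
out = model(torch.randn(2, 8))
print(out.shape)  # torch.Size([2, 8])
```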
Implementation: CorridorKeyModule/base_engine.py:66-97 (_compile_submodules())
Config flag: disable_cudnn_benchmark: True
Profile: All profiles except original
Impact: Reduces VRAM used by cuDNN workspace allocations during convolution benchmarking.
When torch.backends.cudnn.benchmark = True (PyTorch's default in many setups), cuDNN runs multiple convolution algorithms on the first call to find the fastest one. Each algorithm trial requires allocating workspace memory, which adds to VRAM usage. On memory-constrained GPUs, this benchmark overhead can push memory usage over the limit.
Set torch.backends.cudnn.benchmark = False. cuDNN will use its default heuristic-selected algorithm instead of benchmarking. The selected algorithm may be slightly slower for specific convolution shapes, but avoids the workspace memory overhead.
Implementation: CorridorKeyModule/base_engine.py:52-54
Config flag: cache_clearing: True
Profile: optimized only (disabled in v2 where in-place ops manage memory)
Impact: Prevents memory accumulation between pipeline stages.
PyTorch's CUDA caching allocator retains freed GPU memory blocks for potential reuse. While this avoids the overhead of repeated cudaMalloc/cudaFree calls, it means memory from one pipeline stage remains "reserved" (from the OS perspective) even after the tensors are freed. When the next stage has a different allocation pattern, it allocates additional memory on top of the cached blocks, inflating total reserved memory.
With the encoder, decoder, and refiner stages each having different tensor shapes and sizes, the caching allocator can accumulate reserved memory across all stages simultaneously.
Call torch.cuda.empty_cache() at two strategic points in the inference pipeline (before and after model inference) at the engine level.
Note: In the v2 profile, cache clearing is disabled. The combination of in-place operations (which reuse existing tensor memory instead of allocating new tensors) and torch.compile's fused kernels (which reduce intermediate allocations) makes cache clearing unnecessary. Disabling it avoids the cudaFree/cudaMalloc overhead on every frame.
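The engine-level pattern can be sketched as below (function name is illustrative; the real hooks live in base_engine.py):

```python
import torch

def process_with_cache_clearing(model, inp, cache_clearing=True):
    """Return cached allocator blocks to the driver before and after
    model inference, so stages with different allocation patterns do
    not accumulate reserved memory."""
    if cache_clearing and torch.cuda.is_available():
        torch.cuda.empty_cache()   # before inference
    out = model(inp)
    if cache_clearing and torch.cuda.is_available():
        torch.cuda.empty_cache()   # after inference
    return out

out = process_with_cache_clearing(lambda t: t * 2, torch.ones(3))
print(out.sum().item())  # 6.0
```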
Implementation: CorridorKeyModule/base_engine.py:242-258 (engine-level cache clearing)
Config flag: token_routing: True
Profile: experimental only
Status: Experimental, disabled by default. Requires fine-tuning for production use.
Route "easy" tokens (solid foreground/background, as determined by the alpha hint mask) to a lightweight LTRM (Lightweight Token Refinement Module) instead of full global self-attention. Only "edge" tokens (uncertain alpha values between configurable thresholds) go through the expensive O(N^2) global attention.
- Edge tokens: Alpha hint between 0.02 and 0.98 (configurable) -> full attention
- Easy tokens: Alpha hint below 0.02 or above 0.98 -> LTRM at O(N) cost
The LTRM architecture: LayerNorm -> Linear expand -> GELU -> DWConv 5x5 -> GELU -> Linear project -> ECA residual gating
The LTRM weights are zero-initialized (fc2 weights = 0), so the module starts as an identity function. This makes it fully compatible with the pretrained checkpoint without any fine-tuning. The model can be loaded and run with token routing enabled, but optimal quality requires fine-tuning the LTRM weights.
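A minimal sketch of the LTRM described above. The layer sequence and zero-initialized `fc2` follow the text; the ECA residual gating is omitted and all names/shapes are illustrative:

```python
import torch
import torch.nn as nn

class LTRM(nn.Module):
    """LayerNorm -> Linear expand -> GELU -> DWConv 5x5 -> GELU ->
    Linear project, as a residual. Zero-init fc2 makes the module an
    identity at load time, so it attaches to a pretrained checkpoint
    without fine-tuning."""
    def __init__(self, dim: int, expand: int = 2):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fc1 = nn.Linear(dim, dim * expand)
        self.dwconv = nn.Conv2d(dim * expand, dim * expand, 5,
                                padding=2, groups=dim * expand)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(dim * expand, dim)
        nn.init.zeros_(self.fc2.weight)   # identity at init
        nn.init.zeros_(self.fc2.bias)

    def forward(self, x, h, w):           # x: [B, N, C] tokens on an h*w grid
        y = self.act(self.fc1(self.norm(x)))
        y = y.transpose(1, 2).reshape(x.shape[0], -1, h, w)
        y = self.act(self.dwconv(y)).flatten(2).transpose(1, 2)
        return x + self.fc2(y)            # O(N) residual refinement

x = torch.randn(1, 16, 8)                 # 16 tokens on a 4x4 grid, dim 8
out = LTRM(8)(x, 4, 4)
print(torch.allclose(out, x))  # True: identity at init
```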
Implementation: CorridorKeyModule/core/optimized_model.py:101-254 (LTRM, ECA, HintBasedTokenRouter)
```
_BaseCorridorKeyEngine (base_engine.py)
    Abstract base class: constructor, checkpoint loading,
    process_frame() pipeline, cuDNN disable, channels_last,
    torch.compile, inference_mode, metrics
    |
    |--- CorridorKeyEngine (inference_engine.py)
    |        Original engine. Uses GreenFormer directly.
    |        Defaults to OptimizationConfig.original() (all opts off)
    |
    |--- OptimizedCorridorKeyEngine (optimized_engine.py)
             Optimized engine. Uses OptimizedGreenFormer.
             Defaults to OptimizationConfig.optimized() (production opts on)
```

```
GreenFormer (model_transformer.py)
    Base model: Hiera backbone, multiscale decoders, CNN refiner
    Handles: FlashAttention patching, tiled refiner, in-place
    decode-and-refine, refiner_scale parameter
    |
    |--- OptimizedGreenFormer (optimized_model.py)
             Extends GreenFormer with token routing machinery
             (LTRM + HintBasedTokenRouter)
             When routing is disabled, delegates entirely to GreenFormer.forward()
```
Optimizations are config-driven, not engine-driven. Both engines accept any OptimizationConfig. The GreenFormer base model handles FlashAttention, tiled refiner, sparse skip, and in-place operations based on the config, so even the "original" CorridorKeyEngine can use these optimizations if given the right config. The _BaseCorridorKeyEngine handles channels_last, inference_mode, torch.compile, and cache clearing at the engine level. The OptimizedCorridorKeyEngine simply defaults to the optimized profile and adds LTRM weight handling.
```
Input (4096x2160 EXR, linear float)
        |
        v
[Lanczos4 resize to 2048x2048]
        |
        v
[Linear -> sRGB conversion]            (if input_is_linear=True)
        |
        v
[ImageNet normalization + alpha hint concat -> 4-channel input]
        |
        v
[channels_last conversion]             (CUDA only)
        |
        v
[torch.cuda.empty_cache()]             (if cache_clearing)
        |
        v
[Hiera Encoder]        Stages 0-1: Windowed attention (efficient)
        |              Stages 2-3: Global attention (FlashAttention patched)
        |              (torch.compile'd if compile_submodules)
        v
[Multiscale Decoder]   Predicts coarse alpha (1ch) + coarse FG (3ch) at H/4
        |              (torch.compile'd if compile_submodules)
        |
        |-- del H/4 tensors (in-place pipeline)
        |
        v
[F.interpolate to full resolution]
        |
        v
[sigmoid -> coarse probs -> cat -> refiner input]
        |
        |-- del coarse probs (in-place pipeline)
        |
        v
[TiledCNNRefiner]      7ch input (RGB + coarse predictions)
        |              512x512 tiles, 128px overlap
        |              Sparse skip: uniform-alpha tiles skipped
        |              Dedup: boundary tiles processed once
        |              (torch.compile'd _process_tile if compile_submodules)
        v
[In-place residual add in logit space]   logits.add_(delta * refiner_scale)
        |
        v
[In-place sigmoid activation]            logits.sigmoid_()
        |
        v
[torch.cuda.empty_cache()]             (if cache_clearing)
        |
        v
[Lanczos4 resize back to 4096x2160]
        |
        v
[Post-processing: despill, premultiply, composite]
        |
        v
Output: alpha, FG (sRGB), processed (linear premul RGBA), comp (sRGB preview)
```
In CorridorKeyModule/backend.py, the system auto-detects the optimal engine:
- CUDA GPU with less than 16 GB VRAM: uses `OptimizedCorridorKeyEngine` with `OptimizationConfig.optimized()`
- CUDA GPU with 16 GB or more VRAM: uses the standard `CorridorKeyEngine` (no optimizations needed)
- Apple Silicon with MLX available: uses the MLX backend
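The selection rule reduces to a simple threshold check; this sketch uses the thresholds above (the function name is illustrative, not the actual backend.py code):

```python
def pick_engine(total_vram_gb: float) -> str:
    """VRAM-based engine selection as described above: constrained GPUs
    get the optimized engine, large GPUs run the stock engine."""
    if total_vram_gb < 16:
        return "OptimizedCorridorKeyEngine + OptimizationConfig.optimized()"
    return "CorridorKeyEngine"

print(pick_engine(8))   # 8 GB laptop GPU -> optimized engine
print(pick_engine(24))  # 24 GB desktop GPU -> standard engine
```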
| Profile | `flash_attention` | `tiled_refiner` | `sparse_refiner` | `disable_cudnn_benchmark` | `cache_clearing` | `compile_submodules` | `token_routing` |
|---|---|---|---|---|---|---|---|
| `original` | off | off | off | off | off | off | off |
| `optimized` (production) | on | on | on | on | on | off | off |
| `v2` | on | on | on | on | off | on | off |
| `experimental` | on | on | on | on | off | on | on |
Profile selection guide:
- `optimized`: Safe, no compilation warmup, works without Triton. Best for interactive use.
- `v2`: Fastest steady-state throughput. Requires Triton (`triton-windows` on Windows). ~10 s warmup on the first frames.
- `experimental`: Includes untrained token routing. For development only.
```shell
# Use a named profile
uv run python corridorkey_cli.py --action run_inference --profile v2

# Individual toggles (override profile settings)
--flash-attention / --no-flash-attention
--tiled-refiner / --no-tiled-refiner
--sparse-refiner / --no-sparse-refiner
--tile-size N              # default: 512
--tile-overlap N           # default: 128
--disable-cudnn-benchmark / --no-disable-cudnn-benchmark
--cache-clearing / --no-cache-clearing
--compile / --no-compile
--token-routing / --no-token-routing
--metrics                  # enable per-stage timing/VRAM reporting
```

```python
from CorridorKeyModule.optimization_config import OptimizationConfig
from CorridorKeyModule.optimized_engine import OptimizedCorridorKeyEngine

# Production config (no compile warmup)
config = OptimizationConfig.optimized()

# V2 config (fastest, requires Triton)
config = OptimizationConfig.v2()

# Custom config
config = OptimizationConfig(
    flash_attention=True,
    tiled_refiner=True,
    sparse_refiner=True,
    tile_size=512,
    tile_overlap=128,
    disable_cudnn_benchmark=True,
    cache_clearing=False,
    compile_submodules=True,
    enable_metrics=True,
)

engine = OptimizedCorridorKeyEngine(
    checkpoint_path="CorridorKeyModule/checkpoints/CorridorKey.pth",
    device="cuda",
    img_size=2048,
    use_refiner=True,
    optimization_config=config,
)

# Process a frame (supports linear EXR input)
result = engine.process_frame(image_rgb, alpha_hint, input_is_linear=True)
# result["alpha"]     -> [H, W, 1] float32 alpha matte
# result["fg"]        -> [H, W, 3] float32 sRGB foreground
# result["processed"] -> [H, W, 4] float32 linear premultiplied RGBA (EXR-ready)
# result["comp"]      -> [H, W, 3] float32 sRGB composite preview
# result["metrics"]   -> PerformanceMetrics (if enable_metrics=True)
```

- Source: Tears of Steel (scene 02_3c)
- License: CC-BY 3.0 (c) Blender Foundation | mango.blender.org
- Format: OpenEXR 16-bit half-float, 4096x2160 (DCI 4K), linear color space
- Frames: First 100 frames (~4.2 seconds at 24 fps)
- Content: Green screen footage with actors, real production footage rather than synthetic test data
Alpha hints were auto-generated using HSV chroma keying:
- Read linear EXR frame
- Convert linear to sRGB (piecewise transfer function)
- Convert sRGB to HSV
- Threshold green hue range (35-85), saturation (>40), value (>30)
- Invert (green = background, non-green = foreground)
- Erode with 7px elliptical kernel (slight under-prediction, which the model handles better)
- Gaussian blur with 21px kernel (produces soft/coarse edges the model expects)
- Save as uint8 PNG
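The thresholding and inversion steps above can be sketched in pure Python (the real script presumably uses OpenCV; the erode and blur steps are omitted here). Hue is scaled to 0-179 and sat/val to 0-255 to match the quoted thresholds:

```python
import colorsys

def hint_mask(rgb_rows):
    """HSV chroma-key thresholding: green pixels become background (0),
    everything else foreground (255). Illustrative sketch only."""
    out = []
    for row in rgb_rows:
        out_row = []
        for r, g, b in row:
            h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
            is_green = 35 <= h * 179 <= 85 and s * 255 > 40 and v * 255 > 30
            out_row.append(0 if is_green else 255)  # inverted: green = background
        out.append(out_row)
    return out

mask = hint_mask([[(40, 200, 60)], [(200, 60, 60)]])  # green row, red row
print(mask)  # [[0], [255]]
```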
Script: tears_of_steel_test/generate_alpha_hints.py
benchmark_4k_vram.py runs each configuration in a separate subprocess to ensure clean GPU state. Supports configurable profile selection:
```shell
uv run python benchmark_4k_vram.py --profile v2           # single profile (default)
uv run python benchmark_4k_vram.py --profile baseline v2  # compare two profiles
uv run python benchmark_4k_vram.py --profile all          # compare all three
uv run python benchmark_4k_vram.py --frames 50            # fewer frames
```

Available profiles: `baseline` (Flash Attention only), `optimized`, `v2` (compile), `all` (run all three).
For each config, it measures:
- Per-frame wall-clock time (ms)
- PyTorch allocator peak allocated/reserved memory
- Device-level GPU memory via `torch.cuda.mem_get_info()` (polled at 25 ms intervals in a background thread)
- Per-stage timing (inference, postprocessing) via `PerformanceMetrics`
Output is written as EXR frame sequences (linear premultiplied RGBA + single-channel alpha) for quality comparison.
Three levels of GPU memory are tracked:
| Level | API | What it shows |
|---|---|---|
| Live tensors | `torch.cuda.max_memory_allocated()` | PyTorch tensor memory only |
| Allocator reserved | `torch.cuda.max_memory_reserved()` | Total memory held by PyTorch's caching allocator (includes freed-but-cached blocks) |
| Device-level | `torch.cuda.mem_get_info()` | Actual GPU memory usage including CUDA context, cuDNN workspace, and driver overhead. Equivalent to what Task Manager shows. Capped at physical VRAM and cannot detect shared memory spillover. |
When allocator reserved exceeds physical VRAM, Windows spills into shared GPU memory (system RAM accessed over PCIe), which is dramatically slower. The benchmark reports this spillover explicitly.
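The background device-memory poller can be sketched as below. `get_free_total` stands in for `torch.cuda.mem_get_info` so the sketch runs without a GPU; the 25 ms interval matches the text, while names and the fake device are illustrative:

```python
import threading
import time

def poll_peak(get_free_total, stop, interval_s=0.025):
    """Sample device free/total memory at a fixed interval and track the
    peak usage seen, until `stop` is set."""
    peak = 0
    while not stop.is_set():
        free_b, total_b = get_free_total()
        peak = max(peak, total_b - free_b)
        time.sleep(interval_s)
    return peak

# Simulated 8 GB device whose usage spikes to 6 GB then drops.
samples = iter([6, 3, 5])
fake = lambda: (8 - next(samples, 4), 8)

stop = threading.Event()
result = {}
t = threading.Thread(target=lambda: result.setdefault("peak", poll_peak(fake, stop)))
t.start(); time.sleep(0.2); stop.set(); t.join()
print(result["peak"])  # 6
```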
```
Output/
  comp_baseline/     # Processed RGBA EXR sequence (baseline)
  comp_optimized/    # Processed RGBA EXR sequence (optimized)
  comp_v2/           # Processed RGBA EXR sequence (v2)
  alpha_baseline/    # Alpha matte EXR sequence (baseline)
  alpha_optimized/   # Alpha matte EXR sequence (optimized)
  alpha_v2/          # Alpha matte EXR sequence (v2)
```
- Processed RGBA: Linear premultiplied RGBA, half-float, PXR24 compression. Ready for compositing in Nuke, Fusion, After Effects, etc.
- Alpha: Single-channel linear float
```
tears_of_steel_test/
  frames/                   # 100 EXR source frames (4096x2160, ~51 MB each)
  alpha_hints/              # 100 PNG alpha hints (4096x2160, auto-generated)
  download_frames.py        # Downloads frames from media.xiph.org
  generate_alpha_hints.py   # HSV chroma key alpha hint generator
```

```shell
uv run python tears_of_steel_test/download_frames.py
```

Downloads the first 100 EXR frames (~5.1 GB) from the Tears of Steel open movie project.

```shell
uv run python tears_of_steel_test/generate_alpha_hints.py
```

Generates coarse alpha hints using HSV chroma keying.
```shell
# Windows
uv pip install triton-windows

# Linux
uv pip install triton
```

```shell
# Run v2 profile (default)
uv run python benchmark_4k_vram.py

# Run specific profile
uv run python benchmark_4k_vram.py --profile optimized

# Compare all profiles
uv run python benchmark_4k_vram.py --profile all

# Fewer frames for a quick test
uv run python benchmark_4k_vram.py --profile v2 --frames 10
```

Processes frames through the selected configuration(s), generates the report at `benchmark_4k_results.md`, and writes output EXR sequences to `Output/`.
| File | Purpose |
|---|---|
| `CorridorKeyModule/optimization_config.py` | `OptimizationConfig` dataclass, profiles (original, optimized, v2, experimental), `PerformanceMetrics` |
| `CorridorKeyModule/base_engine.py` | `_BaseCorridorKeyEngine`: inference_mode, channels_last, torch.compile, cache clearing, refiner_scale |
| `CorridorKeyModule/optimized_engine.py` | `OptimizedCorridorKeyEngine` with LTRM weight handling |
| `CorridorKeyModule/core/optimized_model.py` | FlashAttention patch, `TiledCNNRefiner` (dedup + sparse), LTRM, ECA, TokenRouter |
| `CorridorKeyModule/core/model_transformer.py` | `GreenFormer` model: in-place `_decode_and_refine()`, refiner_scale parameter |
| `CorridorKeyModule/backend.py` | Auto-backend selection based on GPU VRAM |
| `benchmark_4k_vram.py` | 4K benchmark script (configurable profiles) |
| `benchmark_4k_results.md` | Latest benchmark results |