Training pipeline for Prism, a neural renderer that transforms game graphics toward photorealism in real-time.
Status: training in progress; first results are promising.
Prism takes a game's G-buffer (color, depth, motion vectors) and outputs a photorealistic version of the same scene. Not just upscaling -- full style transfer: lighting, materials, color grading, atmospheric effects.
Think DLSS Ray Reconstruction, but open source and running on a single mid-range GPU.
U-Net + Transformer (2.5M params, ~12ms at 1080p)
Game G-buffer (540p)
-> Conv Encoder: 540p -> 270p -> 135p -> 68p
-> 4x Windowed Transformer Blocks at 68p (global scene understanding)
-> Temporal GRU at 68p (frame-to-frame coherence)
-> Conv Decoder with skip connections: 68p -> 135p -> 270p -> 540p
-> PixelShuffle: 540p -> 1080p photorealistic output
- Encoder downsamples 8x so the transformer processes only 8K tokens (not 518K pixels)
- Windowed self-attention (8x8) provides global context for style transfer
- Temporal GRU at bottleneck resolution maintains frame coherence (~free)
- PixelShuffle upscales 2x or 3x (supports all DLSS/FSR presets)
- Inference: 12.4ms on RTX 5060 Ti using Vulkan cooperative vectors (tensor cores)
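A minimal PyTorch sketch of this encoder / windowed-transformer / temporal-GRU / decoder layout. All module names, channel widths, and the simplified GRU gating are illustrative stand-ins, not the actual Prism code; the point is the data flow: 8x conv downsample, attention over 8x8 windows at the bottleneck, a recurrent hidden state carried between frames, skip connections on the way up, and a PixelShuffle head for the 2x output.

```python
# Illustrative sketch of the Prism layout: conv encoder (8x down),
# 8x8 windowed attention + GRU-style state at the bottleneck,
# conv decoder with skips, and a 2x PixelShuffle head.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Bottleneck(nn.Module):
    """Self-attention over 8x8 windows plus a simplified conv-GRU temporal gate."""
    def __init__(self, ch, window=8):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(ch, num_heads=4, batch_first=True)
        self.norm = nn.LayerNorm(ch)
        self.gate = nn.Conv2d(ch * 2, ch * 2, 1)  # update gate + candidate

    def forward(self, x, hidden):
        b, c, h, w = x.shape
        win = self.window
        x = F.pad(x, (0, (-w) % win, 0, (-h) % win))  # make H, W divisible
        hp, wp = x.shape[-2:]
        # split into non-overlapping windows -> (B * windows, win*win, C)
        t = x.view(b, c, hp // win, win, wp // win, win)
        t = t.permute(0, 2, 4, 3, 5, 1).reshape(-1, win * win, c)
        t = t + self.attn(self.norm(t), self.norm(t), self.norm(t))[0]
        # merge windows back to (B, C, H, W)
        t = t.view(b, hp // win, wp // win, win, win, c)
        x = t.permute(0, 5, 1, 3, 2, 4).reshape(b, c, hp, wp)[:, :, :h, :w]
        # temporal mixing with the previous frame's hidden state
        if hidden is None:
            hidden = torch.zeros_like(x)
        z, n = self.gate(torch.cat([x, hidden], 1)).chunk(2, 1)
        hidden = torch.sigmoid(z) * hidden + (1 - torch.sigmoid(z)) * torch.tanh(n)
        return hidden, hidden

class PrismSketch(nn.Module):
    def __init__(self, in_ch=6, base=16):  # color(3) + depth(1) + motion(2)
        super().__init__()
        chs = [base, base * 2, base * 4]
        self.enc = nn.ModuleList()
        c = in_ch
        for ch in chs:  # three stride-2 stages: 8x total downsample
            self.enc.append(nn.Sequential(nn.Conv2d(c, ch, 3, 2, 1), nn.ReLU()))
            c = ch
        self.bottleneck = Bottleneck(c)
        self.dec = nn.ModuleList()
        for skip in reversed(chs[:-1]):
            self.dec.append(nn.Sequential(
                nn.ConvTranspose2d(c, skip, 4, 2, 1), nn.ReLU()))
            c = skip
        # back to input resolution, then a 2x PixelShuffle head
        self.up = nn.Sequential(nn.ConvTranspose2d(c, base, 4, 2, 1), nn.ReLU())
        self.head = nn.Sequential(nn.Conv2d(base, 3 * 4, 3, 1, 1), nn.PixelShuffle(2))

    def forward(self, x, hidden=None):
        skips = []
        for stage in self.enc:
            x = stage(x)
            skips.append(x)
        x, hidden = self.bottleneck(x, hidden)
        for stage, skip in zip(self.dec, reversed(skips[:-1])):
            x = stage(x) + skip
        return self.head(self.up(x)), hidden

model = PrismSketch()
out, hidden = model(torch.randn(1, 6, 64, 64))           # frame 1
out2, hidden = model(torch.randn(1, 6, 64, 64), hidden)  # frame 2 reuses state
```

The hidden state returned each frame is what the "streaming temporal" bullets below refer to: it is simply fed back in on the next call, with no backpropagation through time.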
GAN training with multi-scale PatchDiscriminator:
# Basic training
python train.py --data path/to/dataset --device cuda:0 --model balanced --amp --wandb
# Resume from checkpoint
python train.py --data path/to/dataset --resume checkpoints/checkpoint_ep50.pth --amp --wandb
- Mixed real + synthetic data (50/50 per batch)
- Discriminator only sees real video as "real" -- forces generator to learn photorealistic style
- Game-style augmentation on real video input (flat lighting, color banding, sharpening) -- teaches the model to reverse game artifacts
- Streaming temporal -- hidden state flows between frames (no BPTT, just forward)
- Both 2x and 3x PixelShuffle trained jointly
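A sketch of the adversarial labeling scheme described above: only genuine video frames are shown to the discriminator as "real", while generator output is always "fake", which pushes the generator toward the real-footage distribution regardless of whether its input came from a game. The `PatchD` module, loss weights, and the stand-in generator in the demo are illustrative, not the actual training code.

```python
# Sketch: discriminator only sees real video as "real".
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchD(nn.Module):
    """Tiny PatchGAN head, scored at two scales (multi-scale PatchDiscriminator)."""
    def __init__(self, ch=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, ch, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(ch, ch * 2, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(ch * 2, 1, 4, 1, 1))  # per-patch real/fake logits

    def forward(self, x):
        return [self.net(x), self.net(F.avg_pool2d(x, 2))]

def train_step(gen, disc, opt_g, opt_d, batch):
    inp, target, is_real = batch["color"], batch["ground_truth"], batch["is_real"]
    fake = gen(inp)
    # --- discriminator: only real-video targets count as "real" ---
    opt_d.zero_grad()
    d_loss = 0.0
    for logits in (disc(target[is_real]) if is_real.any() else []):
        d_loss = d_loss + F.binary_cross_entropy_with_logits(
            logits, torch.ones_like(logits))
    for logits in disc(fake.detach()):  # generator output is always "fake"
        d_loss = d_loss + F.binary_cross_entropy_with_logits(
            logits, torch.zeros_like(logits))
    d_loss.backward()
    opt_d.step()
    # --- generator: reconstruction + fool the discriminator ---
    opt_g.zero_grad()
    g_loss = F.l1_loss(fake, target)
    for logits in disc(fake):
        g_loss = g_loss + 0.01 * F.binary_cross_entropy_with_logits(
            logits, torch.ones_like(logits))
    g_loss.backward()
    opt_g.step()
    return float(d_loss), float(g_loss)

# demo with a stand-in 2x generator (upsample + conv), illustrative only
torch.manual_seed(0)
gen = nn.Sequential(nn.Upsample(scale_factor=2), nn.Conv2d(3, 3, 3, 1, 1))
disc = PatchD()
d_loss, g_loss = train_step(
    gen, disc,
    torch.optim.Adam(gen.parameters(), 1e-4),
    torch.optim.Adam(disc.parameters(), 1e-4),
    {"color": torch.randn(2, 3, 32, 32),
     "ground_truth": torch.randn(2, 3, 64, 64),
     "is_real": torch.tensor([True, False])},
)
```

Note that the synthetic sample's ground truth is never labeled "real" here: it still contributes to the L1 term, but only real footage defines the style target.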
| Preset | Params | Transformer | Target Speed |
|---|---|---|---|
| fast | 590K | 2 blocks | ~3ms @ 1080p |
| balanced | 2.5M | 4 blocks | ~12ms @ 1080p |
| quality | 6.4M | 6 blocks | ~20ms @ 1080p |
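The table maps directly to a config lookup. A hypothetical helper (`pick_preset` is not part of the repo) that picks the largest preset fitting a per-frame budget, using the numbers above:

```python
# Preset numbers mirror the table above; pick_preset is a hypothetical helper.
PRESETS = {
    "fast":     {"params": 590_000,   "blocks": 2, "target_ms": 3.0},
    "balanced": {"params": 2_500_000, "blocks": 4, "target_ms": 12.0},
    "quality":  {"params": 6_400_000, "blocks": 6, "target_ms": 20.0},
}

def pick_preset(budget_ms: float) -> str:
    """Largest preset whose target frame time fits the budget (fallback: fast)."""
    fitting = [n for n, p in PRESETS.items() if p["target_ms"] <= budget_ms]
    return max(fitting, key=lambda n: PRESETS[n]["params"]) if fitting else "fast"
```

For example, a 60fps target leaves roughly 16.6ms per frame, which selects `balanced`.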
Generate training data from video files:
# From video files (extracts frames + Depth Anything V2 + RAFT optical flow)
python generate_dataset.py --videos path/to/videos --output data/dataset --device cuda:0
# Batch generation from multiple sources
python generate_v3.py --device cuda:0 --output data/dataset
Each sample contains:
- color [3, 256, 256] -- render-resolution input (with game-style degradation for real video)
- depth [1, 256, 256] -- monocular depth (Depth Anything V2)
- motion_vectors [2, 256, 256] -- optical flow (RAFT)
- ground_truth [3, 512, 512] or [3, 768, 768] -- target output (2x or 3x)
- is_real -- True for real video, False for game/synthetic
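The sample layout above can be sanity-checked programmatically. A numpy sketch (dtypes and the helper functions are assumptions, not the dataset's actual on-disk format):

```python
# Sketch of the per-sample layout described above, with a shape check.
import numpy as np

def make_sample(res=256, scale=2, is_real=True):
    """Build an empty sample; scale=3 would carry a 768x768 ground truth."""
    return {
        "color": np.zeros((3, res, res), np.float32),           # degraded input
        "depth": np.zeros((1, res, res), np.float32),           # Depth Anything V2
        "motion_vectors": np.zeros((2, res, res), np.float32),  # RAFT flow
        "ground_truth": np.zeros((3, res * scale, res * scale), np.float32),
        "is_real": is_real,
    }

def check_sample(s):
    """Verify channel counts and that the target is a clean 2x or 3x upscale."""
    res = s["color"].shape[-1]
    scale = s["ground_truth"].shape[-1] // res
    assert scale in (2, 3), "ground truth must be a 2x or 3x target"
    assert s["depth"].shape == (1, res, res)
    assert s["motion_vectors"].shape == (2, res, res)
    return scale
```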
For best results, use diverse video covering:
- Nature (forests, water, mountains, caves)
- Urban (cities, streets, buildings)
- Medieval/fantasy (castles, villages -- directly game-relevant)
- Sci-fi/industrial (spaceships, factories)
- Action/FPS POV (first-person, fast motion)
- Faces (close-up, medium, far -- from both real and game sources)
- VFX/effects (particles, lighting, fire)
With --wandb, training logs:
- Loss curves (L1, adversarial, temporal, discriminator)
- Fixed sample comparisons every epoch: Nearest | Bilinear | Prism | GT (both 2x and 3x)
- Model architecture info
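A sketch of how a fixed-sample comparison strip (Nearest | Bilinear | Prism | GT) could be assembled before logging. The hand-rolled numpy upsamplers are illustrative stand-ins, and the actual logging call is omitted:

```python
# Assemble a Nearest | Bilinear | Prism | GT comparison strip (2x case).
import numpy as np

def nearest_2x(img):  # img: (H, W, 3)
    return img.repeat(2, axis=0).repeat(2, axis=1)

def bilinear_2x(img):
    h, w, _ = img.shape
    ys = np.linspace(0, h - 1, 2 * h)
    xs = np.linspace(0, w - 1, 2 * w)
    y0, x0 = ys.astype(int), xs.astype(int)
    y1, x1 = np.minimum(y0 + 1, h - 1), np.minimum(x0 + 1, w - 1)
    fy, fx = (ys - y0)[:, None, None], (xs - x0)[None, :, None]
    top = img[y0][:, x0] * (1 - fx) + img[y0][:, x1] * fx
    bot = img[y1][:, x0] * (1 - fx) + img[y1][:, x1] * fx
    return top * (1 - fy) + bot * fy

def comparison_strip(lowres, prism_out, gt):
    """Concatenate the four panels side by side at output resolution."""
    return np.concatenate(
        [nearest_2x(lowres), bilinear_2x(lowres), prism_out, gt], axis=1)

lowres = np.random.rand(8, 8, 3)
strip = comparison_strip(lowres, np.zeros((16, 16, 3)), np.zeros((16, 16, 3)))
```

Keeping the sample fixed across epochs makes the logged strips directly comparable over the course of training.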
See prism-inference for the Vulkan compute inference engine with cooperative vectors (tensor cores).
MIT