Code of paper: ReTiDe: Real-Time Denoising for Energy-Efficient Motion Picture Processing with FPGAs


ReTiDe: Real-Time Denoising for Energy-Efficient Motion Picture Processing with FPGAs

Changhong Li, Clement Bled, Rosa Fernandez, Shreejith Shanker.

Conference Homepage: CVMP 2025

Figure 0

Abstract

Denoising is a core operation in modern video pipelines. In codecs, in-loop filters suppress sensor noise and quantisation artefacts to improve rate-distortion performance; in cinema post-production, denoisers are used for restoration, grain management, and plate clean-up. However, state-of-the-art deep denoisers are computationally intensive and, at scale, are typically deployed on GPUs, incurring high power and cost for real-time, high-resolution streams. ReTiDe (Real-Time Denoise) is a hardware-accelerated denoising system that serves inference on data-centre Field Programmable Gate Arrays (FPGAs). A compact convolutional model is quantised (post-training quantisation plus quantisation-aware fine-tuning) to INT8 and compiled for AMD Deep Learning Processor Unit (DPU)-based FPGAs. A client-server integration offloads computation from the host CPU/GPU to a networked FPGA service, while remaining callable from existing workflows, e.g., NUKE, without disrupting artist tooling. On representative benchmarks, ReTiDe delivers 37.71× Giga Operations Per Second (GOPS) throughput and 5.29× higher energy efficiency than prior FPGA denoising accelerators, with negligible degradation in Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM). These results indicate that specialised accelerators can provide practical, scalable denoising for both encoding pipelines and post-production, reducing energy per frame without sacrificing quality or workflow compatibility.

Dataset

The datasets used for benchmarking are BSD68 and Urban100 for grayscale denoising, and BSD100 for colour denoising (see the result tables below).

Architecture

  • The integration workflow of Vitis AI with NUKE is illustrated in the figure below. The AMD Vitis AI toolchain provides an end-to-end workflow for deploying quantised neural networks, bridging the gap between machine-learning frameworks and FPGA-based deployment. We also provide generic client-server interfaces for integrating other software. Starting from FP32 model descriptions written in popular frameworks such as TensorFlow and PyTorch, the toolchain performs model quantisation and operator conversion. The converted models are then accelerated on the Deep Learning Processor Unit (DPU), which is specifically designed for convolutional and matrix-intensive workloads. By mapping computation-intensive kernels directly onto dedicated hardware engines, the toolchain reduces CPU overhead while maximising parallelism and memory-bandwidth utilisation. This hardware-software co-design significantly improves inference throughput while reducing power consumption. Figure 1

  • ReTiDe is deployed on the AMD Alveo U50 server-class FPGA accelerator card. The figure below demonstrates the processing flow from the original noisy image (σ = 50) to the denoised image, with segmentation and the corresponding parallelisation. Upon receiving denoising requests from remote or local hosts, the incoming image or video stream is first processed by a pre-processor, which segments and batches the media into standardised input formats for the model. This also enables parallel processing across multiple threads and DPU units.
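The pre-processor's segmentation step can be sketched as below: split a frame into fixed-size tiles (zero-padding the borders) so the tiles can be batched across threads and DPU cores, then stitch the denoised tiles back together. The tile size and padding policy here are illustrative assumptions, not the paper's exact configuration.

```python
def tile_frame(frame, tile=64):
    """Split a 2-D frame (list of rows) into (row, col, tile) records,
    zero-padding right and bottom edges to a full tile."""
    h, w = len(frame), len(frame[0])
    tiles = []
    for y in range(0, h, tile):
        for x in range(0, w, tile):
            block = [row[x:x + tile] + [0] * (tile - len(row[x:x + tile]))
                     for row in frame[y:y + tile]]
            while len(block) < tile:          # zero-pad the bottom edge
                block.append([0] * tile)
            tiles.append((y, x, block))
    return tiles, (h, w)

def stitch(tiles, shape):
    """Reassemble processed tiles into a frame, cropping the padding."""
    h, w = shape
    out = [[0] * w for _ in range(h)]
    for y, x, block in tiles:
        for dy in range(min(len(block), h - y)):
            for dx in range(min(len(block[dy]), w - x)):
                out[y + dy][x + dx] = block[dy][dx]
    return out

# A 100x70 synthetic frame splits into a 2x2 grid of 64x64 tiles and
# round-trips losslessly through tiling and stitching.
frame = [[(x + y) % 256 for x in range(100)] for y in range(70)]
tiles, shape = tile_frame(frame)
restored = stitch(tiles, shape)
```

In the real pipeline each tile batch would be sent through the denoiser between these two steps; the identity round trip above only checks that segmentation is lossless.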

Figure 2
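The client-server offload pattern described above can be sketched as a length-prefixed TCP round trip: the host serialises a payload, sends it to a networked service, and receives the denoised bytes back. The framing, port, and identity "denoise" stand-in are assumptions for illustration; the real service would run DPU inference on the FPGA side.

```python
import socket
import struct
import threading

ready = threading.Event()

def serve_once(port):
    """Toy server: accepts one request and echoes the payload back.
    A real service would run DPU inference instead of the identity op."""
    with socket.socket() as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind(("127.0.0.1", port))
        srv.listen(1)
        ready.set()
        conn, _ = srv.accept()
        with conn:
            (n,) = struct.unpack("!I", conn.recv(4))
            payload = b""
            while len(payload) < n:
                payload += conn.recv(n - len(payload))
            denoised = payload                 # placeholder for FPGA inference
            conn.sendall(struct.pack("!I", len(denoised)) + denoised)

def denoise_remote(payload, port):
    """Client call as it might be wrapped for host software such as NUKE."""
    with socket.socket() as cli:
        cli.connect(("127.0.0.1", port))
        cli.sendall(struct.pack("!I", len(payload)) + payload)
        (n,) = struct.unpack("!I", cli.recv(4))
        out = b""
        while len(out) < n:
            out += cli.recv(n - len(out))
        return out

PORT = 50507  # arbitrary local port for the demo
t = threading.Thread(target=serve_once, args=(PORT,))
t.start()
ready.wait()
result = denoise_remote(b"\x01\x02\x03" * 100, PORT)
t.join()
```

Because the service is addressed over the network, the same client code works whether the accelerator is local or in a remote data centre, which is what keeps existing artist tooling unchanged.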

Result

We benchmarked our model on both colour and grayscale datasets; the results are reported below.
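The PSNR figures in the tables follow the standard definition for 8-bit images, PSNR = 10 · log10(MAX² / MSE) with MAX = 255. A minimal pure-Python sketch:

```python
import math

def psnr(ref, test, peak=255.0):
    """Peak signal-to-noise ratio between two equal-length pixel sequences."""
    mse = sum((a - b) ** 2 for a, b in zip(ref, test)) / len(ref)
    if mse == 0:
        return math.inf                       # identical images
    return 10 * math.log10(peak ** 2 / mse)

# Toy 4-pixel example (illustrative values, not from the paper's datasets).
clean = [100, 120, 140, 160]
noisy = [101, 119, 141, 158]
print(round(psnr(clean, noisy), 2))           # → 45.7
```

Higher is better; a fraction of a dB separates the FP32 and INT8 rows below, which is what "negligible degradation" refers to.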

PSNR (dB) Comparison of Various Algorithms for Grayscale Image Denoising

Figure 3

| Type | Method | BSD68 σ=15 | BSD68 σ=25 | BSD68 σ=50 | Urban100 σ=15 | Urban100 σ=25 | Urban100 σ=50 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| FP32 | BM3D | 30.95 | 25.32 | 24.89 | 31.91 | 29.06 | 24.45 |
| FP32 | FFDNet | 31.45 | 28.96 | 25.16 | 33.76 | 31.41 | 28.09 |
| FP32 | IRCNN | 31.46 | 28.79 | 25.11 | 33.08 | 29.62 | 24.53 |
| FP32 | DnCNN-20 | 31.60 | 29.14 | 26.20 | 33.76 | 30.19 | 19.34 |
| FP32 | SwinIR | 31.76 | 29.10 | 25.40 | 33.44 | 30.43 | 25.47 |
| FP32 | ReTiDe | 31.48 | 29.09 | 26.20 | 33.25 | 30.60 | 26.55 |
| QNNs | L-DnCNN | 31.44 | 29.01 | 26.08 | – | – | – |
| QNNs | ReTiDe (P) | 29.92 | 28.35 | 26.73 | 29.92 | 28.35 | 25.52 |
| QNNs | ReTiDe (Q) | 30.94 | 29.23 | 26.73 | 30.20 | 28.46 | 25.61 |

PSNR (dB) and SSIM Comparison of Popular Denoisers for Colour Denoising on BSD100

| Type | Method | Blind/Nonblind | σ=5 PSNR | σ=5 SSIM | σ=15 PSNR | σ=15 SSIM | σ=25 PSNR | σ=25 SSIM | σ=35 PSNR | σ=35 SSIM | σ=45 PSNR | σ=45 SSIM |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FP32 | BM3D | Nonblind | 39.85 | 0.98 | 33.17 | 0.9223 | 30.16 | 0.8598 | 28.17 | 0.8007 | 26.62 | 0.7470 |
| FP32 | DnCNN | Blind | 39.72 | 0.9728 | 33.46 | 0.9245 | 30.56 | 0.8711 | 28.62 | 0.8217 | 27.11 | 0.7766 |
| FP32 | FFDNet | Nonblind | 39.84 | 0.9788 | 33.65 | 0.9265 | 31.00 | 0.8772 | 29.37 | 0.8337 | 28.23 | 0.7958 |
| FP32 | IRCNN | Nonblind | 39.95 | 0.9789 | 33.41 | 0.9234 | 30.45 | 0.8678 | 28.43 | 0.8114 | 26.87 | 0.7578 |
| FP32 | ReTiDe | Blind | 39.46 | 0.9761 | 33.27 | 0.9205 | 30.65 | 0.8682 | 29.03 | 0.8224 | 27.89 | 0.7826 |
| QNNs | ReTiDe (P) | Blind | 32.94 | 0.8943 | 30.38 | 0.8414 | 28.95 | 0.8149 | 27.96 | 0.7941 | 27.06 | 0.7713 |
| QNNs | ReTiDe (Q) | Blind | 33.22 | 0.9425 | 31.03 | 0.9008 | 29.37 | 0.8576 | 28.14 | 0.8164 | 27.14 | 0.7811 |

Figure 4

Performance Comparison Across Platforms: Frequency, Throughput, Power, and Energy Efficiency

| Method | Platform | Throughput (GOPS) | Power (W) | Energy Eff. (GOPS/W) |
| --- | --- | --- | --- | --- |
| L-DnCNN | i7-7700HQ CPU | 29.5 | 45 | 0.66 |
| TNet-mini | i5-12400F CPU | 164.3 | 65 | 2.53 |
| ReTiDe | U7-265K CPU | 770.2 | 42.1 | 18.30 |
| L-DnCNN | GTX 1070 GPU | 1,066.7 | 115 | 9.28 |
| TNet-mini | RTX 2080 Ti GPU | 1,785.7 | 250 | 7.14 |
| ReTiDe | A4000 GPU | 8,285.5 | 236.3 | 35.06 |
| L-DnCNN | MZU03A-EG FPGA | 41.8 | 2.4 | 17.18 |
| TNet-mini | MZU03A-EG FPGA | 99.3 | 2.6 | 38.51 |
| ReTiDe | Alveo U50 FPGA | 3,746.1 | 18.4 | 203.59 |
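The energy-efficiency column is simply throughput per watt. Reproducing the ReTiDe Alveo U50 row from the table, and the 5.29× gain over the best prior FPGA entry (TNet-mini on the MZU03A-EG) quoted in the abstract:

```python
def energy_efficiency(gops, watts):
    """Energy efficiency in GOPS/W: throughput divided by power draw."""
    return gops / watts

retide_u50 = energy_efficiency(3746.1, 18.4)   # ReTiDe on the Alveo U50
prior_best = 38.51                             # TNet-mini on the MZU03A-EG FPGA

print(round(retide_u50, 2))                    # → 203.59
print(round(retide_u50 / prior_best, 2))       # → 5.29
```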

Citation

  • If you find this work useful, please cite it as follows.
@article{li2025retide,
  title={ReTiDe: Real-Time Denoising for Energy-Efficient Motion Picture Processing with FPGAs},
  author={Li, Changhong and Bled, Cl{\'e}ment and Fernandez, Rosa and Shanker, Shreejith},
  journal={arXiv preprint arXiv:2510.03812},
  year={2025}
}

Demo

Acknowledgement

This work was funded by the Horizon CL4 2022 EU project Emerald (grant no. 101119800).
