This repository contains a summary of the research paper Photorealistic Style Transfer via Wavelet Transforms by Jaejun Yoo, Youngjung Uh, Sanghyuk Chun, Byeongkyu Kang and Jung-Woo Ha, along with a PyTorch implementation of the model obtained from the authors' official repository. This repository is my own exploration of the paper and is not an official implementation. All pretrained models and weights are obtained from the official repository.
This research paper focuses on style transfer between two images. Existing style transfer methods suffer from spatial distortions or unrealistic artifacts. The paper introduces a network architecture that enhances photorealism, proposing a wavelet-corrected transfer based on whitening and coloring transforms. The method also extends to video stylization.
A photorealistic style transfer method needs to:
- apply the reference style, and
- preserve the details of the content image.

To achieve both, the paper proposes a wavelet-corrected transfer based on whitening and coloring transforms.
The whitening and coloring transform (WCT) comprises two components. The whitening transform removes the correlation between features:

$$\hat{f}_c = E_c D_c^{-\frac{1}{2}} E_c^\top f_c$$

where $f_c$ is the mean-subtracted content feature map and $E_c D_c E_c^\top$ is the eigendecomposition of its covariance matrix $f_c f_c^\top$. The coloring transform re-introduces the correlation between features:

$$\hat{f}_{cs} = E_s D_s^{\frac{1}{2}} E_s^\top \hat{f}_c$$

where $E_s D_s E_s^\top$ is the eigendecomposition of the covariance matrix $f_s f_s^\top$ of the mean-subtracted style feature map $f_s$, whose mean is added back afterwards.
In style transfer, whitening normalizes the covariances and variances of the content features to unit values, and coloring re-introduces the covariances and variances of the style to the whitened features, thus generating the stylized image. The issue is that the process is expensive: the required eigendecompositions have a time complexity that is cubic in the number of feature channels.
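Below is a minimal PyTorch sketch of the two transforms, operating on features flattened to shape (C, H×W). It is illustrative only and differs from the official implementation in details such as regularization.

```python
import torch

def wct(fc: torch.Tensor, fs: torch.Tensor) -> torch.Tensor:
    """Whitening-coloring transform sketch. fc, fs: (C, H*W) content/style features."""
    # Center both feature sets; the style mean is re-added at the end.
    fc = fc - fc.mean(dim=1, keepdim=True)
    fs_mean = fs.mean(dim=1, keepdim=True)
    fs = fs - fs_mean

    # Whitening: decorrelate the content features via eigendecomposition.
    Dc, Ec = torch.linalg.eigh(fc @ fc.t() / (fc.size(1) - 1))
    whitened = Ec @ torch.diag(Dc.clamp(min=1e-8).pow(-0.5)) @ Ec.t() @ fc

    # Coloring: re-introduce the style covariance.
    Ds, Es = torch.linalg.eigh(fs @ fs.t() / (fs.size(1) - 1))
    colored = Es @ torch.diag(Ds.clamp(min=1e-8).pow(0.5)) @ Es.t() @ whitened

    return colored + fs_mean
```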
PhotoWCT replaced the upsampling layers of the VGG decoder with unpooling layers in order to compensate for information loss during encoding. However, it could not recover the information lost by max pooling in the VGG encoder, and it required additional post-processing steps.
Haar wavelet pooling has four kernels, $\{LL^\top, LH^\top, HL^\top, HH^\top\}$, where the low-pass and high-pass filters are

$$L^\top = \frac{1}{\sqrt{2}}\begin{bmatrix}1 & 1\end{bmatrix}, \qquad H^\top = \frac{1}{\sqrt{2}}\begin{bmatrix}-1 & 1\end{bmatrix}$$

Therefore the output of the Haar wavelet pooling operation has four channels. The low-pass filter captures smooth surfaces and textures, while the high-pass filters extract vertical, horizontal and diagonal edge information. The outputs of the four kernels are denoted LL, LH, HL and HH.
The signal can be exactly reconstructed using the mirror operation (wavelet unpooling), so no information is lost, unlike max pooling, which has no inverse operation.
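A minimal PyTorch sketch of Haar wavelet pooling and unpooling built from the four kernels above (illustrative, not the official implementation); the final assertion verifies the exact-reconstruction property:

```python
import torch
import torch.nn.functional as F

# 1-D Haar filters: low-pass and high-pass (orthonormal).
low = torch.tensor([1.0, 1.0]) / 2 ** 0.5
high = torch.tensor([-1.0, 1.0]) / 2 ** 0.5

# The four 2x2 kernels as outer products of the 1-D filters.
kernels = torch.stack([
    torch.outer(low, low),    # LL: smooth surfaces and textures
    torch.outer(low, high),   # LH: high-pass across width
    torch.outer(high, low),   # HL: high-pass across height
    torch.outer(high, high),  # HH: diagonal edges
]).unsqueeze(1)               # shape: (4, 1, 2, 2)

def wavelet_pool(x):
    """x: (N, C, H, W) -> four components, each (N, C, H/2, W/2)."""
    n, c, h, w = x.shape
    out = F.conv2d(x.reshape(n * c, 1, h, w), kernels, stride=2)
    return out.reshape(n, c, 4, h // 2, w // 2).unbind(dim=2)

def wavelet_unpool(ll, lh, hl, hh):
    """Mirror operation: exact reconstruction via transposed convolution."""
    n, c, h, w = ll.shape
    comps = torch.stack([ll, lh, hl, hh], dim=2).reshape(n * c, 4, h, w)
    return F.conv_transpose2d(comps, kernels, stride=2).reshape(n, c, 2 * h, 2 * w)

# Pooling followed by unpooling recovers the input exactly.
x = torch.randn(1, 3, 8, 8)
assert torch.allclose(wavelet_unpool(*wavelet_pool(x)), x, atol=1e-6)
```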
The model architecture is an improvement over PhotoWCT: the max pooling and unpooling layers are replaced with wavelet pooling and unpooling. An ImageNet pre-trained VGG-19 is used as the encoder. The high-frequency components are passed directly to the decoder, while only the low-frequency component is passed on to the next encoding layer.
WCT performs style transfer with arbitrary styles by matching the correlations between the content and style features in the VGG feature domain. The content features are projected onto the eigenspace of the style features using SVD, and the transferred features are then passed to the decoder to obtain the stylized image.
Previous approaches used multi-level stylization. Here, features are progressively transformed within a single forward pass: WCT is applied sequentially at each scale within a single encoder-decoder network. The training procedure is simple, and it avoids amplifying errors through recursively encoding and decoding the signal in the VGG network, as sketched below.
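A hypothetical structural sketch of this single-pass design; `enc_blocks`, `dec_blocks` and `wct` are illustrative placeholders, not the authors' API:

```python
def stylize(content, style, enc_blocks, dec_blocks, wct):
    """Single forward pass: WCT at each scale, high frequencies skipped ahead."""
    skips = []
    c, s = content, style
    for enc in enc_blocks:
        c, c_high = enc(c)  # LL stays on the main path; (LH, HL, HH) skip ahead
        s, _ = enc(s)       # the style image is encoded by the same network
        c = wct(c, s)       # one transform per scale, no recursive re-encoding
        skips.append(c_high)
    for dec, high in zip(dec_blocks, reversed(skips)):
        c = dec(c, high)    # wavelet-unpool using the stored components
    return c
```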
PhotoWCT suffers from the loss of spatial information caused by max pooling. Since the low-frequency component captures textures and the high-frequency components capture edges, applying WCT to these components individually allows them to be stylized independently: if the style is applied only to the low-frequency component, the edges remain unchanged. Using only the low-frequency component is equivalent to average pooling up to a constant factor, as the check below demonstrates.
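A quick numerical check of this equivalence, assuming the orthonormal Haar filters above (the factor of 2 comes from the 1/√2 normalization):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 4, 4)
ll_kernel = torch.full((1, 1, 2, 2), 0.5)  # LL = outer(L, L) with L = [1, 1]/sqrt(2)
ll = F.conv2d(x, ll_kernel, stride=2)      # LL response at stride 2
avg = F.avg_pool2d(x, kernel_size=2)       # ordinary 2x2 average pooling
assert torch.allclose(ll, 2 * avg, atol=1e-6)
```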
The paper compares stylization results using other pooling variants: split pooling and learnable pooling. Split pooling can carry the whole information. Learnable pooling is a trainable convolutional layer, and it does not represent the content faithfully.
Concatenation was adopted instead of summing to achieve better reconstruction: the four feature components from the corresponding scale and the feature output before wavelet pooling are concatenated. This produces better results at the cost of additional parameters. Summing produces a more stylized output, while concatenation produces a clearer image.
Since wavelet pooling and unpooling are invertible operations, a multi-level strategy can be adopted to increase the contrast of the transferred style, producing more vivid results.
Based on the provided results, it is clear that the new model produces higher-quality photorealistic images than previous methods. It also produces better results for video stylization, and its memory utilization and runtime are low compared to previous approaches.
1. Clone the repository:

   ```bash
   git clone https://github.com/Warren-SJ/Photorealistic-Style-Transfer.git
   ```

2. Navigate to the cloned directory:

   ```bash
   cd Photorealistic-Style-Transfer
   ```

3. Install the required packages:

   ```bash
   pip install -r requirements.txt
   ```

4. Start the application:

   ```bash
   python app.py
   ```
Follow the instructions in the application to upload your content and style images. The application will generate the stylized image in the specified output directory.
Note: a CUDA-enabled GPU is required to run the application.





