Hi, @lightChaserX.
Thank you for your incredible work on this project. The paper is truly impressive and offers many useful insights.
I have a few questions regarding some implementation details:
Q1. Section 3.3 of the paper mentions that the cropped patch predictions are merged in the latent space. Is the latent decoder the one from the pre-trained Stable Diffusion 1.5 model? Does it support high-resolution decoding, given that the original P3M images are mostly 1080p? If you designed your own decoder, could you share some details about its implementation?
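For context on why I'm asking: the stock SD 1.5 VAE is fully convolutional with a spatial downsampling factor of 8, so decoding a 1080p result directly would imply a 135 x 240 latent. The factor-8 assumption below is based on the stock SD 1.5 autoencoder, not on anything stated in the paper:

```python
def latent_hw(img_h, img_w, factor=8):
    """Latent spatial size implied by decoding an image of the given resolution
    with a VAE whose downsampling factor is `factor` (8 for stock SD 1.5)."""
    assert img_h % factor == 0 and img_w % factor == 0, "dims must divide the VAE factor"
    return img_h // factor, img_w // factor

print(latent_hw(1080, 1920))  # -> (135, 240)
```

So my question is essentially whether the stock decoder is run once at this full latent resolution, or whether something else handles the high-resolution case.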
Q2. Regarding the blend operation in Figure 3 of the paper, could you clarify how it is implemented? I couldn't find it specified in either the paper or the supplementary materials.
Q3. The supplementary materials mention that all images are randomly cropped to 256 x 256. Why not use Stable Diffusion 1.5's default training resolution of 512 x 512? What advantages do lower-resolution patches have over higher-resolution ones?
I appreciate your time and look forward to your response.
Thank you!