Why do the mask and position images need to be fed into the model twice?

Hi author,
     Thanks for sharing your excellent work.
     I have one question about the visual encoding module. The mask and position images are fed into both the visual encoding module and the original FLUX Fill model, so the model receives them twice. Is there a reason for this? Thanks.