How should I interpret your output? I noticed that it has three channels. I assume this is not object localization.