Can we use text alone as input to enforce the joint learning of image appearance, spatial relationships, and geometry in a unified network?