Can we use text alone as input to enforce the joint learning of image appearance, spatial relationships, and geometry in a unified network?