Hello! Could you please elaborate on how the zero-shot generalization capability for cross-scene tasks is manifested?