Embedding similarity

The similarities between embeddings of text, video, audio, etc. are not high, usually around 0.1 to 0.3. Given such low values, how do we know how relevant the embeddings are to each other? Can this encoder be trusted for downstream tasks such as semantic search in video? If so, what is the appropriate way to use these embeddings?
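
For concreteness, here is a minimal sketch of the kind of comparison I mean. It assumes the score is cosine similarity and that search works by ranking candidates against a query rather than by thresholding the raw value. The function names and toy vectors are hypothetical and are not output from any particular encoder.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, candidates):
    """Return (indices sorted from most to least similar, raw scores)."""
    scores = [cosine_similarity(query, c) for c in candidates]
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return order, scores


# Hypothetical toy vectors standing in for one text-query embedding
# and three video-clip embeddings (real embeddings have far more dimensions).
text_query = [0.2, 0.1, 0.9]
video_clips = [
    [0.9, 0.1, 0.3],
    [0.1, 0.2, 0.8],
    [0.5, 0.9, 0.1],
]

order, scores = rank_by_similarity(text_query, video_clips)
for idx in order:
    print(f"clip {idx}: {scores[idx]:.3f}")
```

In this setup only the relative order of the scores is used, which is part of what I am unsure about: whether a small absolute value such as 0.2 still carries meaning as long as the ranking is sensible.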