Hi!
While reading your source code, I noticed that you set vocab_size = self.codebook_size + 1000 + 1 in the token embedding stage. Why not directly set vocab_size = self.codebook_size? What do the extra 1001 embeddings represent? Are they the embeddings of the 1000 class labels plus the mask token? And is it correct to say that, when there is no class condition, vocab_size should instead be set to self.codebook_size + 1?
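To make my question concrete, here is a minimal sketch of how I currently read the vocabulary layout. All the names, the 16384 codebook size, and the index ranges below are my own assumptions for illustration, not taken from your code:

```python
import torch
import torch.nn as nn

codebook_size = 16384   # VQ codebook entries (placeholder value, my assumption)
num_classes = 1000      # my guess for the "+1000": ImageNet class labels
num_special = 1         # my guess for the "+1": one mask / unconditional token

vocab_size = codebook_size + num_classes + num_special
tok_emb = nn.Embedding(vocab_size, 768)

# Under this reading, the index ranges would be:
#   [0, codebook_size)                    -> image tokens from the VQ codebook
#   [codebook_size, codebook_size + 1000) -> class-condition tokens
#   codebook_size + 1000                  -> the single extra special token
class_id = 7
class_token = torch.tensor([codebook_size + class_id])
class_embedding = tok_emb(class_token)   # shape: (1, 768)
```

Is this the intended partition of the embedding table?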
Looking forward to your reply!