Your intrinsic reward calculation has a bug.
In your code: intrinsic_reward = self.alpha*(predict_next_feature - mu).pow(2).sum(1)+ (1 - self.alpha)*torch.mean(torch.sqrt(torch.clip(abs(predict_next_feature ** 2 - mu ** 2) / (B2 - mu ** 2),1e-3,1)), axis=1)
- the first term, (predict_next_feature - mu).pow(2), is reduced over the feature dimension with sum(1)
- the second term, torch.sqrt(torch.clip(abs(predict_next_feature ** 2 - mu ** 2) / (B2 - mu ** 2), 1e-3, 1)), is reduced with torch.mean(..., axis=1)
Your target and prediction networks have 512 output dimensions, so with this mixed reduction the second term is roughly 512 times smaller than the first. If both terms used the same reduction (sum() for both, for example), reproducing this calculation would require alpha1 = 0.9 and alpha2 = 0.1 / 512 rather than alpha = 0.9 with 1 - alpha = 0.1 for the second term, assuming the DRND paper really intends the intrinsic reward to be computed this way.
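For reference, here is a minimal sketch (not your actual code) of a consistent-reduction version, assuming predict_next_feature, mu and B2 are all [batch, 512] tensors and that mean() is the intended reduction for both terms:

```python
import torch

def intrinsic_reward(predict_next_feature, mu, B2, alpha=0.9):
    # First term: squared prediction error against the mean of the target networks,
    # reduced with mean() over the 512 feature dimensions.
    first = (predict_next_feature - mu).pow(2).mean(dim=1)
    # Second term: pseudo-count style bonus, reduced the same way as the first term.
    second = torch.sqrt(
        torch.clip((predict_next_feature ** 2 - mu ** 2).abs() / (B2 - mu ** 2), 1e-3, 1)
    ).mean(dim=1)
    # With identical reductions, alpha and 1 - alpha weight the two terms directly.
    return alpha * first + (1 - alpha) * second
```

If the current sum()/mean() mixture is kept instead, the effective weight on the second term is (1 - alpha) / 512, which is what the alpha1 / alpha2 suggestion above is compensating for.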