
Intrinsic reward problem - mu using .sum() but B2 using .mean() - alpha hyperparameter? #1


Description

@CVHvn

Your intrinsic reward calculation has a bug.

In your code: intrinsic_reward = self.alpha * (predict_next_feature - mu).pow(2).sum(1) + (1 - self.alpha) * torch.mean(torch.sqrt(torch.clip(abs(predict_next_feature ** 2 - mu ** 2) / (B2 - mu ** 2), 1e-3, 1)), axis=1)

  • the first term, (predict_next_feature - mu).pow(2), is reduced with .sum(1)
  • the second term, torch.sqrt(torch.clip(abs(predict_next_feature ** 2 - mu ** 2) / (B2 - mu ** 2), 1e-3, 1)), is reduced with torch.mean(..., axis=1)

Your target and prediction networks have 512 output dimensions, so the second term is roughly 512 times smaller than the first. If we used the same reduction for both terms, matching the current behavior would require weights like alpha1 = 0.9 and alpha2 = 0.1 / 512 (up to an overall scale) instead of alpha = 0.9 (with 1 - alpha = 0.1 on the second term), assuming the DRND paper is in fact using your intrinsic reward calculation. A sketch of the mismatch is below.
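A minimal sketch of the scale mismatch, assuming predict_next_feature, mu, and B2 are (batch, 512) tensors, with mu and B2 the first and second moments of the DRND target ensemble outputs (random values stand in for real network outputs here). The consistent-reduction version at the end is one possible fix, not necessarily the author's intended one:

```python
import torch

# Assumed shapes matching the issue: batch of 8, 512-dim features.
batch, dim = 8, 512
predict_next_feature = torch.randn(batch, dim)
mu = torch.randn(batch, dim)
B2 = mu ** 2 + torch.rand(batch, dim) + 1e-2  # keep B2 - mu**2 positive

alpha = 0.9

# Current calculation: first term reduced with .sum(1), second with mean.
term1_sum = (predict_next_feature - mu).pow(2).sum(1)
term2_mean = torch.mean(
    torch.sqrt(torch.clip(abs(predict_next_feature ** 2 - mu ** 2) / (B2 - mu ** 2), 1e-3, 1)),
    axis=1,
)

# The first term is larger by a factor roughly proportional to dim (~512 here),
# so the 0.9 / 0.1 weighting does not reflect the terms' actual contributions.
print((term1_sum.mean() / term2_mean.mean()).item())

# Consistent alternative (an assumption, not the author's confirmed intent):
# reduce both terms with mean so alpha keeps its intended 0.9 / 0.1 weighting.
term1_mean = (predict_next_feature - mu).pow(2).mean(1)
intrinsic_reward = alpha * term1_mean + (1 - alpha) * term2_mean
```

Equivalently, one could keep .sum(1) for both terms and rescale the second weight to (1 - alpha) / dim; up to the overall reward scale, that matches the mean-mean version above.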
