Your intrinsic reward calculation has a bug.
In your code: intrinsic_reward = self.alpha*(predict_next_feature - mu).pow(2).sum(1)+ (1 - self.alpha)*torch.mean(torch.sqrt(torch.clip(abs(predict_next_feature ** 2 - mu ** 2) / (B2 - mu ** 2),1e-3,1)), axis=1)
- the first term, (predict_next_feature - mu).pow(2), is reduced over the feature dimension with sum(1)
- the second term, torch.sqrt(torch.clip(abs(predict_next_feature ** 2 - mu ** 2) / (B2 - mu ** 2), 1e-3, 1)), is reduced with torch.mean(..., axis=1)
Your target and prediction networks have 512 output dimensions, so with this mixed reduction the second term is roughly 512 times smaller than the first. If both terms used the same reduction (sum() for both, for example), reproducing this calculation would require alpha1 = 0.9 and alpha2 = 0.1 / 512 rather than alpha = 0.9 with 1 - alpha = 0.1 for the second term, assuming the DRND paper really intends the intrinsic reward to be computed this way.
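For reference, here is a minimal sketch (not your actual code) of a consistent-reduction version, assuming predict_next_feature, mu and B2 are all [batch, 512] tensors and that mean() is the intended reduction for both terms:

```python
import torch

def intrinsic_reward(predict_next_feature, mu, B2, alpha=0.9):
    # First term: squared prediction error against the mean of the target networks,
    # reduced with mean() over the 512 feature dimensions.
    first = (predict_next_feature - mu).pow(2).mean(dim=1)
    # Second term: pseudo-count style bonus, reduced the same way as the first term.
    second = torch.sqrt(
        torch.clip((predict_next_feature ** 2 - mu ** 2).abs() / (B2 - mu ** 2), 1e-3, 1)
    ).mean(dim=1)
    # With identical reductions, alpha and 1 - alpha weight the two terms directly.
    return alpha * first + (1 - alpha) * second
```

If the current sum()/mean() mixture is kept instead, the effective weight on the second term is (1 - alpha) / 512, which is what the alpha1 / alpha2 suggestion above is compensating for.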