https://arxiv.org/abs/2306.17693
to implement. Explores high-uncertainty regions by using an ensemble of policy heads with a shared torso. A random head generates the on-policy trajectory, and the loss is computed by averaging contributions over heads, where each head is independently included with probability p.