Thompson sampling for improved exploration in GFlowNets

https://arxiv.org/abs/2306.17693

Status: to implement. Explores high-uncertainty regions by using an ensemble of policy heads with a shared torso. For each trajectory, one head is chosen at random and generates the trajectory on-policy. The loss is then averaged over the heads, with each head independently included in the average with probability p.
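
A minimal sketch of the head selection and the masked loss averaging described above, in plain Python. The names (`sample_acting_head`, `bootstrap_mask`, `masked_mean_loss`) and the per-head loss values are hypothetical placeholders, not taken from the paper. It also assumes the average is taken over the included heads only, and that an empty mask contributes zero loss; the paper may normalize differently.

```python
import random


def sample_acting_head(num_heads, rng):
    """Pick one head uniformly at random to roll out the trajectory."""
    return rng.randrange(num_heads)


def bootstrap_mask(num_heads, p, rng):
    """Include each head independently with probability p."""
    return [rng.random() < p for _ in range(num_heads)]


def masked_mean_loss(per_head_losses, mask):
    """Average the losses of the included heads.

    Assumption: if no head is included, the trajectory contributes 0.
    """
    included = [loss for loss, keep in zip(per_head_losses, mask) if keep]
    if not included:
        return 0.0
    return sum(included) / len(included)


rng = random.Random(0)
num_heads, p = 4, 0.5

# 1. A random head acts as the sampling policy for this trajectory.
acting_head = sample_acting_head(num_heads, rng)

# 2. Each head would compute its own loss on that trajectory
#    (placeholder numbers stand in for real per-head losses).
per_head_losses = [0.8, 1.2, 0.5, 2.0]

# 3. Only a random subset of heads is trained on this trajectory.
mask = bootstrap_mask(num_heads, p, rng)
loss = masked_mean_loss(per_head_losses, mask)
```

In a real implementation the per-head losses would come from policy heads on top of the shared torso, so gradients from every included head flow into the shared parameters.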