Data sampling in BPTT

Hi Sebastian,

I am working on a project that implements BPTT. I see that in your implementation the states used for policy updates are sampled from the replay buffer. Given the RL objective J(\theta) = E_{s ~ initial_dist}[V(s)], shouldn't the states be sampled from the initial state distribution instead?
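To make the question concrete, here is a minimal sketch of the two options I have in mind. It is not taken from your code: the function names, the Gaussian initial distribution, and the toy replay buffer are all assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, batch_size = 3, 4

# Hypothetical stand-in for the environment's initial state distribution
# (assumed standard normal here, purely for illustration).
def sample_initial_states(n):
    """Draw n start states from the assumed initial distribution."""
    return rng.normal(size=(n, state_dim))

# Hypothetical replay buffer: states visited during earlier rollouts.
replay_buffer = rng.normal(loc=2.0, size=(100, state_dim))

def sample_buffer_states(n):
    """Draw n start states uniformly at random from the replay buffer."""
    idx = rng.integers(0, len(replay_buffer), size=n)
    return replay_buffer[idx]

# Option A: start states drawn as in J(theta) = E_{s ~ initial_dist}[V(s)].
starts_from_initial = sample_initial_states(batch_size)

# Option B: start states drawn from the replay buffer, which is how I read
# the implementation.
starts_from_buffer = sample_buffer_states(batch_size)
```

With option B the policy gradient is taken over states the agent has already visited, which is a different distribution from initial_dist, and that difference is what I am asking about.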

Thanks for your wonderful code!

Shenao