Thanks for your excellent work.
I have a question about your Algorithm 1 (the TreeRL main process). New trajectories are sampled from forking points and appended to $\mathcal{T}$, but you actually choose $L=1$ (at most $2$) in your experiments, which means all forking points are sampled from the originally initialized trajectory $Y$.
I wonder about the effectiveness of the advantage/reward estimation for these forked trajectories, since the process supervision will be too sparse for them.
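For concreteness, here is a toy sketch of how I currently read the sampling loop under $L=1$; the function names and the toy step generation are my own placeholders, not your implementation, so please correct me if I have misread the algorithm:

```python
import random

# Toy illustration of my reading of the L = 1 setting: every fork branches
# off the initially sampled trajectory Y, so the tree has depth at most 2
# and the forked branches only pass through a single shared prefix of Y.
# All names below are hypothetical placeholders, not the paper's code.

def sample_trajectory(prefix, max_steps=8):
    """Continue a trajectory with random toy 'steps'."""
    steps = list(prefix)
    while len(steps) < max_steps:
        steps.append(random.choice("abcd"))
    return steps

def build_tree(num_forks=4):
    Y = sample_trajectory([])                    # originally initialized trajectory
    trajectories = [Y]
    for _ in range(num_forks):
        # With L = 1, the forking point is always chosen on Y itself.
        fork_step = random.randrange(1, len(Y))
        branch = sample_trajectory(Y[:fork_step])
        trajectories.append(branch)
    return trajectories

if __name__ == "__main__":
    for t in build_tree():
        print("".join(t))
```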
Looking forward to your reply.