- [ ] Run the train loop
- [ ] Run inference to get e.g. 1000 inference samples (e.g. with proxy reward and prior reward)
- [ ] Get the proxy value cut-off using `get_proxy_value_cutoff`
- [ ] Run the model twice: once with the soft-optim likelihood + tiny KL, and once with the normal likelihood + tiny KL (see the sketch below)
- [ ] Show the "breaks rules" stats for both runs
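
A minimal sketch of how these steps might wire together. Only `get_proxy_value_cutoff` is named in the checklist; `train_loop`, `run_inference`, `run_model`, `breaks_rules_stats`, their arguments, and the quantile-based cutoff are hypothetical placeholders, not the actual implementation.

```python
# Hypothetical wiring of the checklist above; all helpers except
# get_proxy_value_cutoff (named in the checklist) are placeholders.
import numpy as np


def get_proxy_value_cutoff(proxy_rewards, quantile=0.95):
    # One plausible choice (assumption): cap the proxy at an empirical
    # quantile of the inference-time proxy rewards.
    return float(np.quantile(np.asarray(proxy_rewards), quantile))


def run_experiment(train_loop, run_inference, run_model, breaks_rules_stats,
                   n_samples=1000, tiny_kl=0.01):
    # 1. Run the train loop.
    policy = train_loop()

    # 2. Draw ~1000 inference samples, each scored with proxy and prior reward.
    samples = run_inference(policy, n_samples=n_samples)
    proxy_rewards = [s["proxy_reward"] for s in samples]

    # 3. Choose the proxy value cut-off from those samples.
    cutoff = get_proxy_value_cutoff(proxy_rewards)

    # 4. Run the model twice: soft-optim likelihood vs. normal likelihood,
    #    both with a tiny KL penalty.
    soft_run = run_model(policy, likelihood="soft_optim",
                         cutoff=cutoff, kl_coef=tiny_kl)
    normal_run = run_model(policy, likelihood="normal", kl_coef=tiny_kl)

    # 5. Compare the "breaks rules" stats for both runs.
    return {"soft_optim": breaks_rules_stats(soft_run),
            "normal": breaks_rules_stats(normal_run)}
```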