High Policy Loss in SAC_CQL

`policy_loss` in SAC_CQL is significantly higher than in the official implementation when trained on the `hopper-expert-v0` dataset from D4RL.
https://github.com/waffoo/accel/blob/af3f511ea816b2dd80346fe5a0b5e2b395c190ad/accel/agents/sac_cql.py#L261

With the author's implementation, `policy_loss` drops below -350, whereas with accel it does not even reach -300, which leads to slower and less stable learning.
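
For reference, here is a minimal sketch of how the actor loss is usually defined in standard SAC, which may help when comparing the two implementations. It is not taken from accel or from the author's code. The function name `sac_policy_loss` and its arguments are hypothetical, and plain Python lists stand in for tensors.

```python
def sac_policy_loss(log_pi, q1, q2, alpha):
    """Standard SAC actor loss: mean(alpha * log_pi - min(q1, q2)).

    Hypothetical helper for illustration only; not the accel or official CQL code.
    log_pi: log-probabilities of actions sampled from the current policy
    q1, q2: the two critics' Q-values for those same actions
    alpha:  entropy temperature
    """
    per_sample = [
        alpha * lp - min(a, b)  # clipped double-Q: take the smaller critic value
        for lp, a, b in zip(log_pi, q1, q2)
    ]
    return sum(per_sample) / len(per_sample)


# Made-up numbers: with Q-values around 350, the loss is dominated by -min(Q).
loss = sac_policy_loss(
    log_pi=[-1.0, -2.0],
    q1=[350.0, 360.0],
    q2=[355.0, 340.0],
    alpha=0.2,
)
print(loss)
```

Under this definition, `policy_loss` is roughly the negative of the critics' Q-value for policy actions, so a gap between -350 and -300 would suggest that the learned Q-values (or `alpha`) differ between the two implementations, not only the actor update.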