Tracking model performance across confidence thresholds

It'd be useful to have a standard way to sweep the model confidence and plot `box_precision`/recall against it, to gauge where to set `score_threshold`. I knocked up a very simple script that does this over a validation dataset.

<img width="1200" height="800" alt="Image" src="https://github.com/user-attachments/assets/c15ccc76-7e89-4a2f-a800-178c1f80ed24" />

Would it make sense to add this, or something similar, as a `script`?
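
For discussion, here is a rough sketch of what the core of such a sweep could look like. It is not the script that produced the plot above, and it does not call any library API: `sweep_thresholds`, its arguments, and the toy data are all hypothetical names made up for illustration. It assumes each predicted box has already been matched against ground truth at a fixed IoU, so that every prediction carries a confidence score and a hit/miss flag.

```python
from typing import List, Tuple


def sweep_thresholds(
    scores: List[float],
    is_true_positive: List[bool],
    num_ground_truth: int,
    thresholds: List[float],
) -> List[Tuple[float, float, float]]:
    """Return (threshold, precision, recall) for each candidate threshold.

    Assumption: is_true_positive[i] says whether prediction i was matched
    to a ground-truth box at some fixed IoU before calling this function.
    """
    results = []
    for t in thresholds:
        # Keep only predictions at or above the current confidence threshold.
        kept = [hit for score, hit in zip(scores, is_true_positive) if score >= t]
        true_positives = sum(kept)
        # With no predictions left, precision is undefined; 1.0 is one common convention.
        precision = true_positives / len(kept) if kept else 1.0
        recall = true_positives / num_ground_truth if num_ground_truth else 0.0
        results.append((t, precision, recall))
    return results


# Toy data: five predictions against four ground-truth boxes.
scores = [0.95, 0.80, 0.60, 0.40, 0.20]
hits = [True, True, False, True, False]
curve = sweep_thresholds(scores, hits, num_ground_truth=4, thresholds=[0.1, 0.5, 0.9])

for threshold, precision, recall in curve:
    print(f"threshold={threshold:.2f} precision={precision:.3f} recall={recall:.3f}")
```

The returned tuples can be handed to any plotting library to draw precision and recall against the threshold. A real script would also need to do the box matching on the validation set and might want per-class curves.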