It'd be useful to have a standard way to sweep the model's confidence threshold and plot box_precision/box_recall at each value, to gauge where to set score_threshold. I knocked up a very simple script that does this over a validation dataset.
Could this, or something similar, be added as a script?
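
For illustration, a sweep of this kind could look roughly like the sketch below. It is not the script mentioned above and does not use the library's own evaluation code. The helper names (`iou`, `count_matches`, `sweep`), the IoU threshold of 0.5, the greedy matching, and the input layout (per-image lists of scored predictions and ground-truth boxes) are all assumptions made for the example, so the numbers may differ from the library's box_precision/box_recall.

```python
from typing import Dict, List, Sequence, Tuple

Box = Tuple[float, float, float, float]      # (x_min, y_min, x_max, y_max)
Pred = Tuple[Box, float]                     # (box, confidence score)
Image = Tuple[List[Pred], List[Box]]         # (predictions, ground-truth boxes)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def count_matches(preds: List[Pred], truths: List[Box],
                  score_threshold: float, iou_threshold: float) -> Tuple[int, int]:
    """Return (true positives, number of kept predictions) for one image."""
    # Keep predictions at or above the threshold, highest score first.
    kept = sorted((p for p in preds if p[1] >= score_threshold),
                  key=lambda p: p[1], reverse=True)
    unmatched = list(truths)
    tp = 0
    for box, _score in kept:
        # Greedy matching (an assumption): take the best remaining ground truth.
        best = max(unmatched, key=lambda t: iou(box, t), default=None)
        if best is not None and iou(box, best) >= iou_threshold:
            unmatched.remove(best)
            tp += 1
    return tp, len(kept)


def sweep(images: Sequence[Image], thresholds: Sequence[float],
          iou_threshold: float = 0.5) -> List[Dict[str, float]]:
    """Compute precision and recall over the dataset at each score threshold."""
    n_truth = sum(len(truths) for _preds, truths in images)
    rows = []
    for t in thresholds:
        tp = n_pred = 0
        for preds, truths in images:
            matched, kept = count_matches(preds, truths, t, iou_threshold)
            tp += matched
            n_pred += kept
        # Convention chosen for this sketch: precision is 1.0 when nothing is predicted.
        precision = tp / n_pred if n_pred else 1.0
        recall = tp / n_truth if n_truth else 0.0
        rows.append({"score_threshold": t,
                     "box_precision": precision,
                     "box_recall": recall})
    return rows


# Tiny made-up example: one image, three predictions, two ground-truth boxes.
demo_images: List[Image] = [(
    [((0, 0, 10, 10), 0.9), ((20, 20, 30, 30), 0.4), ((50, 50, 60, 60), 0.2)],
    [(0, 0, 10, 10), (20, 20, 30, 30)],
)]
rows = sweep(demo_images, thresholds=[0.1, 0.3, 0.5, 0.95])
for row in rows:
    print(row)
```

The resulting rows can be handed to any plotting library (for example, one line each for box_precision and box_recall against score_threshold) to see where the two curves trade off.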