Retraining should be its own module with a fixed input format, possibly derived from the parquet files or from a hybrid of those and some other format. This would let us update the training set with records newly added to Neotoma (for example).
I still need to dig through all the files and notebooks to see exactly how training is done now, but this would be very helpful: a first pass over some broader sets of articles seems to show a number of false positives.
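
As a rough sketch of what a "set format" for retraining could look like, the snippet below defines one record type and a merge step that folds new records into an existing training set. Everything in it is an assumption for illustration: the `TrainingRecord` fields, the `merge_training_sets` function, and the idea that a document id is the deduplication key are all hypothetical and not taken from the current code or the parquet schema. It uses only the standard library; in practice the existing records would presumably be loaded from the parquet files first (for instance with `pandas.read_parquet`).

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class TrainingRecord:
    # Hypothetical schema; the real parquet columns may differ.
    doc_id: str              # assumed unique identifier for an article
    text: str                # text the model is trained on
    label: int               # 1 = relevant, 0 = not relevant
    source: str = "parquet"  # where the record came from


def merge_training_sets(
    existing: Iterable[TrainingRecord],
    new: Iterable[TrainingRecord],
) -> List[TrainingRecord]:
    """Combine two record sets; a new record replaces an existing one with the same doc_id."""
    merged = {rec.doc_id: rec for rec in existing}
    for rec in new:
        merged[rec.doc_id] = rec  # newer label wins
    # Sort so the output order is deterministic between runs.
    return sorted(merged.values(), key=lambda rec: rec.doc_id)


if __name__ == "__main__":
    existing = [
        TrainingRecord("a", "pollen core from lake sediments", 1),
        TrainingRecord("b", "unrelated article", 0),
    ]
    new = [
        # Hypothetical example of a record added after the original training run.
        TrainingRecord("c", "new fossil record", 1, source="neotoma"),
    ]
    for rec in merge_training_sets(existing, new):
        print(rec.doc_id, rec.label, rec.source)
```

With a format like this, the false positives from the broader first pass could be fed back in as records with `label=0`, assuming they are reviewed by hand first.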