From your paper:

This is now pretty easy... with pyserini on PyPI.
But the real point of this issue is this: currently, as I understand it, the input to evaluation is a file. Can we make it so that we can compute evaluation metrics directly from in-memory data structures?
The question is, what should the in-memory data structures look like? A pandas DataFrame with the standard TREC output format columns? A dictionary to support random access by qid? Something else?
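To make the dictionary option concrete, here is a rough sketch of what I have in mind. Everything in it is an assumption for discussion, not an existing API: the `Run` and `Qrels` aliases, the `precision_at_k` helper, and the tie-breaking rule are all hypothetical.

```python
from typing import Dict

# Hypothetical in-memory layouts, keyed by qid for random access.
Run = Dict[str, Dict[str, float]]   # qid -> {docid: score}
Qrels = Dict[str, Dict[str, int]]   # qid -> {docid: relevance grade}


def precision_at_k(run: Run, qrels: Qrels, k: int = 10) -> Dict[str, float]:
    """Per-query precision@k computed directly from dicts, with no files involved."""
    results = {}
    for qid, doc_scores in run.items():
        judged = qrels.get(qid, {})
        # Rank by descending score; ties are broken by docid here, which is
        # an arbitrary choice made only to keep this sketch deterministic.
        ranked = sorted(doc_scores, key=lambda d: (-doc_scores[d], d))[:k]
        # Unjudged documents count as non-relevant.
        hits = sum(1 for d in ranked if judged.get(d, 0) > 0)
        results[qid] = hits / k
    return results


run = {"q1": {"d1": 2.5, "d2": 1.7, "d3": 0.3}}
qrels = {"q1": {"d1": 1, "d3": 1}}
scores = precision_at_k(run, qrels, k=2)
print(scores)  # {'q1': 0.5}
```

A DataFrame with the TREC columns could presumably be grouped by qid into this shape, so the two options need not be mutually exclusive.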
If we can converge on something, I can even try to volunteer some of my students to contribute to this effort... :)