We should integrate TIREx Tracker with [HF Evaluate](https://huggingface.co/docs/evaluate/index) to simplify evaluating Hugging Face models. The guide on [creating and sharing a new evaluation](https://huggingface.co/docs/evaluate/creating_and_sharing) may be a good starting point. Alternatively, we could implement a Hugging Face LightEval [metric](https://huggingface.co/docs/lighteval/adding-a-new-metric), since LightEval seems to be better supported.
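
To make the idea more concrete, here is a rough sketch of one possible shape for the integration: a wrapper that mirrors the `compute(predictions=..., references=...)` calling convention of HF Evaluate and attaches resource measurements to the returned scores. It uses only the standard library. `tracking_stub`, `TrackedMetric`, and the `resources` key are hypothetical names, and the stub (wall-clock time plus `tracemalloc` peak) only stands in for the real TIREx Tracker. Which region should actually be tracked (metric computation or model inference) is still open.

```python
import time
import tracemalloc
from contextlib import contextmanager
from typing import Any, Callable, Dict, Iterator, List


@contextmanager
def tracking_stub() -> Iterator[Dict[str, float]]:
    """Hypothetical stand-in for TIREx Tracker, NOT its real API.

    Records wall-clock time and peak Python heap usage of the wrapped block.
    """
    measurements: Dict[str, float] = {}
    tracemalloc.start()
    start = time.perf_counter()
    try:
        yield measurements
    finally:
        measurements["elapsed_seconds"] = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        measurements["peak_traced_bytes"] = float(peak)


class TrackedMetric:
    """Wraps a metric function and adds resource measurements to its result."""

    def __init__(self, compute_fn: Callable[..., Dict[str, Any]]) -> None:
        self._compute_fn = compute_fn

    def compute(self, predictions: List[Any], references: List[Any]) -> Dict[str, Any]:
        # Everything inside the "with" block is measured by the stub.
        with tracking_stub() as measurements:
            scores = self._compute_fn(predictions=predictions, references=references)
        return {**scores, "resources": measurements}


def accuracy(predictions: List[Any], references: List[Any]) -> Dict[str, float]:
    """Toy metric used only to exercise the wrapper."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return {"accuracy": correct / len(references)}


result = TrackedMetric(accuracy).compute(
    predictions=[1, 0, 1, 1],
    references=[1, 0, 0, 1],
)
print(result["accuracy"])  # 0.75
```

In a real integration, the stub would be replaced by the tracker itself and the wrapper by a proper HF Evaluate module or LightEval metric; the sketch only illustrates where the measurements could be attached.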