Terms
- I have searched open and closed issues
- I agree to follow Wikimedia's Code of Conduct
Issue
I think it would be interesting to evaluate the performance of the pipeline at different stages.
- How good is the retrieval? (a rough sketch of how this could be measured follows below)
- How do different embedding models compare?
- What is the best number of context passages to pass to the model?
- Which model answers questions best?
  - Uses the actual facts from the context
  - Fewest hallucinations
  - Best phrasing
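A minimal sketch of the retrieval side, assuming a small hand-labeled set of question/relevant-passage pairs and a placeholder `retrieve(question, embedding_model, k)` function (both hypothetical, not the actual pipeline API), could measure recall@k per embedding model and per k:

```python
# Hypothetical sketch: recall@k for different embedding models and context counts.
# GOLD, retrieve() and the model names are placeholders/assumptions.

GOLD = [
    # (question, set of IDs of passages that actually answer it)
    ("Who founded Wikidata?", {"Q2013#founder"}),
    ("When was Douglas Adams born?", {"Q42#dob"}),
]

def retrieve(question: str, embedding_model: str, k: int) -> list[str]:
    """Placeholder for the pipeline's retrieval step; returns the top-k passage IDs."""
    raise NotImplementedError

def recall_at_k(embedding_model: str, k: int) -> float:
    """Fraction of questions with at least one relevant passage in the top k."""
    hits = 0
    for question, relevant_ids in GOLD:
        retrieved = set(retrieve(question, embedding_model, k))
        if retrieved & relevant_ids:
            hits += 1
    return hits / len(GOLD)

if __name__ == "__main__":
    for model in ("bge-small-en", "multilingual-e5-large"):
        for k in (1, 3, 5, 10):
            print(f"{model} recall@{k}: {recall_at_k(model, k):.2f}")
```

Sweeping over k would also give a first answer to the "how many contexts" question, at least from the retrieval side.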
For the last GB&C, Silvan and I implemented something very simple but conceptually similar for the askwikidata prototype:
https://github.com/rti/askwikidata/blob/main/eval.py
There are also frameworks such as Ragas that might help: https://docs.ragas.io/en/latest/getstarted/evaluation.html#metrics
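For the answer-quality questions (faithfulness to the context, hallucinations), a rough sketch following the linked Ragas quickstart could look like the snippet below. The column names and metric imports match the quickstart at the time of writing but may differ across Ragas versions, and the metrics need an LLM configured (an OpenAI key by default); the example row is made up.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# Each row: the question, the pipeline's answer, and the contexts it was given.
# Real rows would come from running the pipeline over an evaluation set.
rows = {
    "question": ["Who founded Wikidata?"],
    "answer": ["Wikidata was created by Wikimedia Deutschland."],
    "contexts": [["Wikidata is a project of Wikimedia Deutschland, launched in 2012."]],
}

dataset = Dataset.from_dict(rows)

# faithfulness  ~ "uses the actual facts from the context / fewest hallucinations"
# answer_relevancy ~ "actually answers the question, well phrased"
result = evaluate(dataset, metrics=[faithfulness, answer_relevancy])
print(result)
```

Running this per model (or per number of contexts) would give comparable scores for the last pipeline stage, similar in spirit to the askwikidata eval.py linked above.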