
Eval full pipeline #29

@rti

Description


I think it would be worthwhile to evaluate the performance of the pipeline at its different stages:

  • How good is the retrieval?
    • How do different embedding models compare?
  • How many retrieved contexts should we pass to the model?
  • Which model answers questions best?
    • Picks up the actual facts from the context
    • Produces the fewest hallucinations
    • Phrases answers best
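The retrieval stage can be scored without involving a generator at all: given hand-labelled gold passages per question, count how often a relevant passage shows up in the top-k results. A minimal sketch, with illustrative passage ids rather than data from the actual pipeline:

```python
# Minimal retrieval evaluation: hit rate @ k against gold passages.
# Running the same harness with results from different embedding
# models gives a direct comparison between them.

def hit_rate_at_k(retrieved: list[list[str]], gold: list[set[str]], k: int) -> float:
    """Fraction of questions where at least one gold passage
    appears among the top-k retrieved passage ids."""
    hits = 0
    for docs, relevant in zip(retrieved, gold):
        if any(d in relevant for d in docs[:k]):
            hits += 1
    return hits / len(retrieved)

# Example: two questions, top-3 retrieved passage ids each.
retrieved = [["p1", "p7", "p3"], ["p9", "p2", "p5"]]
gold = [{"p3"}, {"p4"}]
print(hit_rate_at_k(retrieved, gold, k=3))  # 0.5
```

Sweeping k in the same loop also answers the "how many contexts" question, at least on the retrieval side.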

For the last GB&C, Silvan and I implemented something very simple but conceptually similar for the askwikidata prototype:
https://github.com/rti/askwikidata/blob/main/eval.py
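In the same simple spirit, answer quality can be approximated by checking whether expected fact strings appear in the generated answer. This sketch is not taken from the linked eval.py; the data and function name are hypothetical:

```python
# Crude answer check: what fraction of expected fact strings
# appear (case-insensitively) in the generated answer?

def facts_covered(answer: str, expected_facts: list[str]) -> float:
    """Fraction of expected facts found verbatim in the answer."""
    lowered = answer.lower()
    found = sum(1 for fact in expected_facts if fact.lower() in lowered)
    return found / len(expected_facts)

answer = "Berlin has been the capital of Germany since 1990."
print(facts_covered(answer, ["Berlin", "1990"]))  # 1.0
print(facts_covered(answer, ["Berlin", "Bonn"]))  # 0.5
```

String matching misses paraphrases, so this is a lower bound on correctness, but it is cheap and deterministic.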

There are also frameworks such as Ragas that might help: https://docs.ragas.io/en/latest/getstarted/evaluation.html#metrics
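Ragas scores metrics like faithfulness with an LLM judge, which needs API access. A crude offline proxy for the hallucination criterion, with illustrative names and an arbitrary overlap threshold, is to flag answer sentences that share too few words with the retrieved context:

```python
import re

def unsupported_ratio(answer: str, context: str, threshold: float = 0.6) -> float:
    """Fraction of answer sentences whose word overlap with the
    context falls below `threshold` -- a rough hallucination proxy."""
    ctx_words = set(re.findall(r"\w+", context.lower()))
    sentences = [s for s in re.split(r"[.!?]", answer) if s.strip()]
    unsupported = 0
    for sentence in sentences:
        words = set(re.findall(r"\w+", sentence.lower()))
        if not words:
            continue
        overlap = len(words & ctx_words) / len(words)
        if overlap < threshold:
            unsupported += 1
    return unsupported / len(sentences)

context = "Berlin is the capital of Germany."
answer = "Berlin is the capital of Germany. The moon is made of cheese."
print(unsupported_ratio(answer, context))  # 0.5
```

An LLM-based judge, as Ragas uses, would be far more robust; this proxy is only useful as a free sanity check.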
