
Eval full pipeline #29

@rti

Description


I think it would be worthwhile to evaluate the performance of the pipeline at its different stages:

  • How good is the retrieval?
    • How do different embedding models compare?
  • How many retrieved contexts should we pass to the model?
  • Which model answers questions best?
    • Picks up the actual facts from the context
    • Produces the fewest hallucinations
    • Phrases answers best
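The retrieval stage can be scored without involving a generator at all: given hand-labelled gold passages per question, count how often a relevant passage shows up in the top-k results. A minimal sketch, with illustrative passage ids rather than data from the actual pipeline:

```python
# Minimal retrieval evaluation: hit rate @ k against gold passages.
# Running the same harness with results from different embedding
# models gives a direct comparison between them.

def hit_rate_at_k(retrieved: list[list[str]], gold: list[set[str]], k: int) -> float:
    """Fraction of questions where at least one gold passage
    appears among the top-k retrieved passage ids."""
    hits = 0
    for docs, relevant in zip(retrieved, gold):
        if any(d in relevant for d in docs[:k]):
            hits += 1
    return hits / len(retrieved)

# Example: two questions, top-3 retrieved passage ids each.
retrieved = [["p1", "p7", "p3"], ["p9", "p2", "p5"]]
gold = [{"p3"}, {"p4"}]
print(hit_rate_at_k(retrieved, gold, k=3))  # 0.5
```

Sweeping k in the same loop also answers the "how many contexts" question, at least on the retrieval side.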

For the last GB&C, Silvan and I implemented something very simple but conceptually similar for the askwikidata prototype:
https://github.com/rti/askwikidata/blob/main/eval.py
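In the same simple spirit, answer quality can be approximated by checking whether expected fact strings appear in the generated answer. This sketch is not taken from the linked eval.py; the data and function name are hypothetical:

```python
# Crude answer check: what fraction of expected fact strings
# appear (case-insensitively) in the generated answer?

def facts_covered(answer: str, expected_facts: list[str]) -> float:
    """Fraction of expected facts found verbatim in the answer."""
    lowered = answer.lower()
    found = sum(1 for fact in expected_facts if fact.lower() in lowered)
    return found / len(expected_facts)

answer = "Berlin has been the capital of Germany since 1990."
print(facts_covered(answer, ["Berlin", "1990"]))  # 1.0
print(facts_covered(answer, ["Berlin", "Bonn"]))  # 0.5
```

String matching misses paraphrases, so this is a lower bound on correctness, but it is cheap and deterministic.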

There are also frameworks such as Ragas that might help: https://docs.ragas.io/en/latest/getstarted/evaluation.html#metrics
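Ragas scores metrics like faithfulness with an LLM judge, which needs API access. A crude offline proxy for the hallucination criterion, with illustrative names and an arbitrary overlap threshold, is to flag answer sentences that share too few words with the retrieved context:

```python
import re

def unsupported_ratio(answer: str, context: str, threshold: float = 0.6) -> float:
    """Fraction of answer sentences whose word overlap with the
    context falls below `threshold` -- a rough hallucination proxy."""
    ctx_words = set(re.findall(r"\w+", context.lower()))
    sentences = [s for s in re.split(r"[.!?]", answer) if s.strip()]
    unsupported = 0
    for sentence in sentences:
        words = set(re.findall(r"\w+", sentence.lower()))
        if not words:
            continue
        overlap = len(words & ctx_words) / len(words)
        if overlap < threshold:
            unsupported += 1
    return unsupported / len(sentences)

context = "Berlin is the capital of Germany."
answer = "Berlin is the capital of Germany. The moon is made of cheese."
print(unsupported_ratio(answer, context))  # 0.5
```

An LLM-based judge, as Ragas uses, would be far more robust; this proxy is only useful as a free sanity check.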
