A small evaluation toolkit for text-to-SPARQL systems. It runs a configurable set of metrics over JSONL datasets and can execute queries against local RDF files or SPARQL endpoints.
- Metrics for query exact match, token overlap, answer-set quality, BLEU/ROUGE, CodeBLEU, and more.
- Execution backends for local RDF (RDFLib) and remote SPARQL endpoints.
- Pluggable LLM-based judging via an Ollama backend.
- Python API for quick experiments.
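Several of the metric families above (token overlap, answer-set precision/recall/F1) boil down to set-overlap precision, recall, and F1. A stdlib-only sketch of that computation (illustrative only, not the library's implementation; `set_prf` is a hypothetical helper name):

```python
def set_prf(golden: set, generated: set) -> dict:
    """Precision/recall/F1 over two unordered sets (tokens or answer rows)."""
    if not golden and not generated:
        # Both empty: conventionally treated as a perfect match.
        return {"precision": 1.0, "recall": 1.0, "f1": 1.0}
    overlap = len(golden & generated)
    precision = overlap / len(generated) if generated else 0.0
    recall = overlap / len(golden) if golden else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Token-overlap example: naive whitespace tokenization of two queries.
scores = set_prf(set("SELECT ?x WHERE".split()), set("SELECT ?y WHERE".split()))
```

The answer-set variants apply the same arithmetic to query result rows instead of tokens.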
The package is available on PyPI and can be installed directly with pip:
```
pip install t2s-metrics
```

For development (editable install), you can use:

```
uv venv
source .venv/bin/activate
uv pip install -r requirements.txt
uv pip install -e .
```

Input evaluation files must be JSON Lines (`.jsonl`) with one object per line.
Each object must include:
- `id` (string): unique query/case identifier
- `golden` (string): reference SPARQL query
- `generated` (string): system-generated SPARQL query
- `order_matters` (boolean): whether answer order must be preserved
This is exactly what `JsonlEval` expects in `t2smetrics/core/eval.py`.
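For quick sanity checks of input files, a minimal stdlib validator of this schema could look like the following (a sketch; `validate_line` and `REQUIRED` are illustrative names, not part of the library):

```python
import json

# Required fields and their expected JSON types, per the schema above.
REQUIRED = {"id": str, "golden": str, "generated": str, "order_matters": bool}

def validate_line(line: str) -> dict:
    """Parse one JSONL line and check the four required fields and their types."""
    obj = json.loads(line)
    for field, expected_type in REQUIRED.items():
        if field not in obj:
            raise ValueError(f"missing field: {field}")
        if not isinstance(obj[field], expected_type):
            raise ValueError(f"field {field} must be {expected_type.__name__}")
    return obj

record = validate_line('{"id": "ex:1", "golden": "SELECT * WHERE { ?s ?p ?o }", '
                       '"generated": "SELECT * WHERE { ?s ?p ?o }", "order_matters": false}')
```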
Example (from `datasets/ck25/eval/AIFB.jsonl`):

```json
{"id": "ck25:1-en", "golden": "PREFIX pv: <http://ld.company.org/prod-vocab/>\nSELECT DISTINCT ?result\nWHERE\n{\n <http://ld.company.org/prod-instances/empl-Karen.Brant%40company.org> pv:memberOf ?result .\n ?result a pv:Department .\n}\n", "generated": "SELECT ?department WHERE { ?person :name \"Ms. Brant\"; :worksIn ?department. }", "order_matters": false}
```

The library supports two execution backend families:
- Local RDF file execution with `RDFLibBackend`
- Remote SPARQL endpoint execution with `SparqlEndpointBackend`

`SparqlEndpointBackend` is generic SPARQL 1.1 and works with endpoints such as QLever and Corese (and also GraphDB, Fuseki, Virtuoso, Blazegraph, etc.).
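For context, the SPARQL 1.1 Protocol sends a SELECT query over HTTP with the query text in a `query` parameter. A stdlib sketch of building such a request (the helper name is illustrative; the library's backend handles this for you):

```python
from urllib.parse import urlencode
from urllib.request import Request

def build_select_request(endpoint: str, query: str) -> Request:
    """Build a SPARQL 1.1 Protocol GET request asking for JSON results.

    Assumes `endpoint` carries no query string of its own.
    """
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})

req = build_select_request("http://localhost:8886/",
                           "SELECT * WHERE { ?s ?p ?o } LIMIT 1")
```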
```python
from t2smetrics.execution.rdflib_backend import RDFLibBackend
from t2smetrics.execution.sparql_endpoint_backend import SparqlEndpointBackend

# Option 1: local KG file
local_backend = RDFLibBackend("./datasets/example/kg/example.ttl")

# Option 2: remote endpoint (e.g., QLever/Corese)
endpoint_backend = SparqlEndpointBackend("http://localhost:8886/")
```

For LLM-based metrics (for example LLMJudge), the library currently provides `OllamaBackend` for local inference.
```python
from t2smetrics.llm.ollama_backend import OllamaBackend

llm_backend = OllamaBackend(model="gemma3:4b")
```

The LLM layer is extensible via `LLMBackend` (`t2smetrics/llm/base.py`).
To plug in another provider, implement `judge(prompt: str, timeout: int = 30) -> dict` and return a dictionary with a numeric score (recommended in [0, 1]).
```python
from t2smetrics.llm.base import LLMBackend

class MyLLMBackend(LLMBackend):
    def judge(self, prompt: str, timeout: int = 30) -> dict:
        # Call your provider/client here
        return {"score": 0.85, "raw": "optional provider response"}
```

Then pass your backend to `Experiment(..., llm_backend=...)`.
A minimal experiment with two text metrics:

```python
from t2smetrics.core.experiment import Experiment
from t2smetrics.core.eval import JsonlEval
from t2smetrics.metrics.text_metrics import Bleu
from t2smetrics.metrics.token import TokenF1

jsonl_eval = JsonlEval("./datasets/example/eval/example.jsonl")
metrics = [Bleu(), TokenF1()]

experiment = Experiment(jsonl_eval, metrics)
_, summary = experiment.run()

print("\n=== SUMMARY ===")
for k, v in summary.items():
    print(f"{k}: {v:.4f}")
```

A full experiment with execution and LLM backends and the complete metric set:

```python
from t2smetrics.core.experiment import Experiment
from t2smetrics.core.eval import JsonlEval
from t2smetrics.execution.rdflib_backend import RDFLibBackend
from t2smetrics.llm.ollama_backend import OllamaBackend
from t2smetrics.metrics.answer_set.f1 import AnswerSetF1
from t2smetrics.metrics.answer_set.precision import AnswerSetPrecision
from t2smetrics.metrics.answer_set.precision_qald import PrecisionQALD
from t2smetrics.metrics.answer_set.recall import AnswerSetRecall
from t2smetrics.metrics.answer_set.recall_qald import RecallQALD
from t2smetrics.metrics.exact import QueryExactMatch
from t2smetrics.metrics.codebleu.codebleu import CodeBLEU
from t2smetrics.metrics.answer_set.f1_qald import F1QALD
from t2smetrics.metrics.answer_set.f1_spinach import F1Spinach
from t2smetrics.metrics.answer_set.mrr import MRR
from t2smetrics.metrics.answer_set.hit_at_k import HitAtK
from t2smetrics.metrics.answer_set.ndcg import NDCG
from t2smetrics.metrics.answer_set.p_at_k import PrecisionAtK
from t2smetrics.metrics.distance import (
    LevenshteinDistance,
    JaccardSimilarity,
    CosineSimilarity,
    EuclideanDistance,
)
from t2smetrics.metrics.llm_judge import LLMJudge
from t2smetrics.metrics.text_metrics import Bleu, RougeN, Meteor, SPBleu
from t2smetrics.metrics.uri.uri_hallucination import URIHallucination
from t2smetrics.metrics.query_execution import QueryExecution
from t2smetrics.metrics.token import SPF1, TokenRecall, TokenPrecision, TokenF1

jsonl_eval = JsonlEval("./datasets/example/eval/example.jsonl")
execution_backend = RDFLibBackend("./datasets/example/kg/example.ttl")
llm_backend = OllamaBackend()

metrics = [
    AnswerSetPrecision(),
    AnswerSetRecall(),
    AnswerSetF1(),
    Bleu(),
    SPBleu(),
    CodeBLEU(),
    CosineSimilarity(),
    EuclideanDistance(),
    F1QALD(),
    PrecisionQALD(),
    RecallQALD(),
    F1Spinach(),
    HitAtK(k=5),
    JaccardSimilarity(),
    LLMJudge(),
    LevenshteinDistance(),
    MRR(),
    Meteor(),
    NDCG(),
    PrecisionAtK(k=1),
    QueryExecution(),
    QueryExactMatch(),
    RougeN(1),
    RougeN(2),
    RougeN(3),
    RougeN(4),
    TokenF1(),
    SPF1(),
    TokenPrecision(),
    TokenRecall(),
    URIHallucination(),
]

experiment = Experiment(
    jsonl_eval=jsonl_eval,
    metrics=metrics,
    execution_backend=execution_backend,
    llm_backend=llm_backend,
    verbose=True,
)

results, summary = experiment.run()

print("=== PER QUERY RESULTS ===")
for r in results:
    print(r)

print("\n=== SUMMARY ===")
for k, v in summary.items():
    print(f"{k}: {v:.4f}")
```

For a complete run over multiple systems and export of aggregated metrics to JSON, see `t2smetrics/run_text2sparql.py`.
Typical workflow:

- Choose a dataset folder (for example `datasets/ck25`).
- Put input files under `datasets/<dataset>/eval/*.jsonl`.
- Start your SPARQL endpoint (for example QLever/Corese).
- Set the endpoint URL in the script (example: `http://localhost:8886/`).
- Run:

```
python -m t2smetrics.run_text2sparql
```

The script writes timestamped summary files under:

```
datasets/<dataset>/results/<dataset>-YYYYMMDD-HHMMSS.json
```
These result files are then directly consumable by the dashboard.
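To post-process a results file outside the dashboard, plain `json` is enough. The sketch below assumes the summary is a flat mapping of metric names to values (an assumption; inspect your actual output files, whose structure may differ) and uses an inline stand-in rather than a real results file:

```python
import json
import os
import tempfile

# Inline stand-in for a datasets/<dataset>/results/*.json file (structure assumed).
example = {"Bleu": 0.41, "TokenF1": 0.73, "QueryExactMatch": 0.25}

with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as fh:
    json.dump(example, fh)
    path = fh.name

# Read the summary back and pick the highest-scoring metric.
with open(path) as fh:
    summary = json.load(fh)
os.unlink(path)

best_metric = max(summary, key=summary.get)
```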
The dashboard reads JSON result files (generated in datasets/*/results/*.json)
and serves an interactive UI (Radar, Bar, Correlation Heatmap, Parallel Coordinates,
Scatter Matrix).
Launch with auto-discovery:

```
python -m t2smetrics.cli dashboard
```

Launch with explicit files:

```
python -m t2smetrics.cli dashboard \
  datasets/ck25/results/ck25-20260306-133227.json \
  datasets/db25/results/db25-20260306-132100.json
```

Then open:

```
http://127.0.0.1:8050
```
Build a source distribution and wheel with:

```
python setup.py sdist bdist_wheel
```

There are no automated tests yet. If you add tests, run them with:

```
python -m pytest
```

t2s-metrics is provided under the terms of the GNU Affero General Public License 3.0 (AGPL-3.0).
This repository also redistributes several third-party contributions under their original licenses.
t2s-metrics reuses the CK25 Corporate Knowledge Reference Dataset for Benchmarking Text-2-SPARQL QA Approaches, which we modified to meet this toolkit's file format requirements (JSONL).
The modified version is redistributed in the directory `datasets/ck25` under the terms of the Creative Commons Attribution 4.0 International license (CC-BY-4.0).
t2s-metrics reuses the QCan software for canonicalising SPARQL queries.
QCan is written in Java. In this repository, we distribute the compiled jar of QCan v1.1, `third_party_lib/qcan-1.1-jar-with-dependencies.jar`, under the terms of the Apache 2.0 license.
