Description
First of all, thank you for the great work on "BEAVER: An Enterprise Benchmark for Text-to-SQL". This dataset addresses a critical gap in enterprise-level Text-to-SQL research.
However, upon checking the repository, I noticed that the evaluation scripts corresponding to the metrics defined in Section 3.2 of the paper are missing. While the dataset is available, the lack of a standardized evaluation codebase makes it difficult for the community to fairly compare against the results reported in the paper.
Specific Missing Components
Implementing these metrics manually might lead to discrepancies due to implementation details (e.g., case sensitivity, handling of set order in SQL results, the exact logic for "Perfect-recall"). We urgently need the official implementation for the following (minimal sketches of our current interpretation follow the list, to show where ambiguity arises):
- Table Retrieval Metrics: Specifically the logic for Perfect-recall (PR) @ top-k.
- Column Mapping Metrics: The implementation for Exact Score and F1 Score calculation for (topic phrase, column name) pairs.
- SQL Execution Accuracy: The driver script used to execute the predicted SQL against the gold SQL on the databases (e.g., how are results compared? Is the comparison order-agnostic for queries without ORDER BY? How are timeouts handled?).
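
For reference, here is a minimal sketch of how we currently interpret Perfect-recall (PR) @ top-k. Everything in it is our own assumption (Python, case-insensitive table-name matching, PR@k = 1 only when every gold table appears in the top-k retrieved list), not the paper's definition:

```python
from typing import Iterable, List, Set, Tuple

def perfect_recall_at_k(gold_tables: Set[str], retrieved_tables: List[str], k: int) -> float:
    """Return 1.0 if every gold table appears in the top-k retrieved tables, else 0.0.

    Assumption: case-insensitive comparison of table names -- exactly the kind of
    detail we would like the official script to pin down.
    """
    top_k = {t.lower() for t in retrieved_tables[:k]}
    gold = {t.lower() for t in gold_tables}
    return 1.0 if gold <= top_k else 0.0

def mean_perfect_recall_at_k(examples: Iterable[Tuple[Set[str], List[str]]], k: int) -> float:
    """Average PR@k over (gold_tables, retrieved_tables) pairs."""
    scores = [perfect_recall_at_k(gold, retrieved, k) for gold, retrieved in examples]
    return sum(scores) / len(scores) if scores else 0.0
```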
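
Similarly, a sketch of how we currently compute the Exact Score and F1 Score over (topic phrase, column name) pairs; the normalization (lower-casing, whitespace stripping) and the set-level treatment of pairs are assumptions on our side:

```python
from typing import Set, Tuple

Pair = Tuple[str, str]  # (topic phrase, column name)

def _normalize(pairs: Set[Pair]) -> Set[Pair]:
    # Assumption: case-insensitive, whitespace-stripped comparison of both elements.
    return {(phrase.strip().lower(), column.strip().lower()) for phrase, column in pairs}

def exact_score(gold: Set[Pair], pred: Set[Pair]) -> float:
    """1.0 if the predicted pair set matches the gold pair set exactly, else 0.0."""
    return 1.0 if _normalize(gold) == _normalize(pred) else 0.0

def f1_score(gold: Set[Pair], pred: Set[Pair]) -> float:
    """Set-level F1 over (topic phrase, column name) pairs."""
    gold_n, pred_n = _normalize(gold), _normalize(pred)
    if not gold_n and not pred_n:
        return 1.0
    if not gold_n or not pred_n:
        return 0.0
    true_positives = len(gold_n & pred_n)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_n)
    recall = true_positives / len(gold_n)
    return 2 * precision * recall / (precision + recall)
```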
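
And for execution accuracy, the comparison we are using today; the choice of sqlite3 as the engine, the multiset comparison for unordered queries, and treating a failed prediction as a mismatch are all guesses that may differ from the official driver:

```python
import sqlite3
from collections import Counter

def execution_match(pred_sql: str, gold_sql: str, db_path: str,
                    order_sensitive: bool = False) -> bool:
    """Execute both queries on the same database and compare their results.

    Assumptions (exactly what we hope the official driver pins down):
      * sqlite3 as the execution engine,
      * multiset (order-agnostic) row comparison unless the gold query is ordered,
      * a predicted query that raises an error counts as a mismatch,
      * no wall-clock timeout here; handling of long-running queries is left open.
    """
    def run(sql: str):
        conn = sqlite3.connect(db_path)
        try:
            return conn.execute(sql).fetchall()
        finally:
            conn.close()

    try:
        pred_rows = run(pred_sql)
    except sqlite3.Error:
        return False  # failed predicted query counts as incorrect
    gold_rows = run(gold_sql)

    if order_sensitive:
        return pred_rows == gold_rows
    # Compare as multisets of rows so row order does not matter.
    return Counter(map(tuple, pred_rows)) == Counter(map(tuple, gold_rows))
```

Even small differences here (e.g., float rounding, NULL handling, or how ORDER BY is detected) can shift execution accuracy by a few points, which is why the official driver matters.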
Why this is important
Since BEAVER is proposed as a benchmark, consistent evaluation logic is paramount. Without the official scripts, future work cannot reliably claim improvements over the baselines established in the paper.
Could you please release the evaluation/ folder or the relevant scripts used to generate the results in Tables 2, 3, and 4?
Thank you for your contribution!