Description
First of all, thank you for the great work on "BEAVER: An Enterprise Benchmark for Text-to-SQL". This dataset addresses a critical gap in enterprise-level Text-to-SQL research.
However, upon checking the repository, I noticed that the evaluation scripts corresponding to the metrics defined in Section 3.2 of the paper are missing. While the dataset is available, the lack of a standardized evaluation codebase makes it difficult for the community to fairly compare against the results reported in the paper.
Specific Missing Components
Implementing these metrics manually might lead to discrepancies due to implementation details (e.g., case sensitivity, handling of set order in SQL results, the exact logic for "Perfect-recall"). We urgently need the official implementation for the following (minimal sketches of our current interpretation follow the list, to show where ambiguity arises):
- Table Retrieval Metrics: Specifically the logic for Perfect-recall (PR) @ top-k.
- Column Mapping Metrics: The implementation for Exact Score and F1 Score calculation for (topic phrase, column name) pairs.
- SQL Execution Accuracy: The driver script used to execute the predicted SQL against the gold SQL on the databases (e.g., how are results compared? Is the comparison order-agnostic for queries without ORDER BY? How are timeouts handled?).
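
For reference, here is a minimal sketch of how we currently interpret Perfect-recall (PR) @ top-k. Everything in it is our own assumption (Python, case-insensitive table-name matching, PR@k = 1 only when every gold table appears in the top-k retrieved list), not the paper's definition:

```python
from typing import Iterable, List, Set, Tuple

def perfect_recall_at_k(gold_tables: Set[str], retrieved_tables: List[str], k: int) -> float:
    """Return 1.0 if every gold table appears in the top-k retrieved tables, else 0.0.

    Assumption: case-insensitive comparison of table names -- exactly the kind of
    detail we would like the official script to pin down.
    """
    top_k = {t.lower() for t in retrieved_tables[:k]}
    gold = {t.lower() for t in gold_tables}
    return 1.0 if gold <= top_k else 0.0

def mean_perfect_recall_at_k(examples: Iterable[Tuple[Set[str], List[str]]], k: int) -> float:
    """Average PR@k over (gold_tables, retrieved_tables) pairs."""
    scores = [perfect_recall_at_k(gold, retrieved, k) for gold, retrieved in examples]
    return sum(scores) / len(scores) if scores else 0.0
```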
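
Similarly, a sketch of how we currently compute the Exact Score and F1 Score over (topic phrase, column name) pairs; the normalization (lower-casing, whitespace stripping) and the set-level treatment of pairs are assumptions on our side:

```python
from typing import Set, Tuple

Pair = Tuple[str, str]  # (topic phrase, column name)

def _normalize(pairs: Set[Pair]) -> Set[Pair]:
    # Assumption: case-insensitive, whitespace-stripped comparison of both elements.
    return {(phrase.strip().lower(), column.strip().lower()) for phrase, column in pairs}

def exact_score(gold: Set[Pair], pred: Set[Pair]) -> float:
    """1.0 if the predicted pair set matches the gold pair set exactly, else 0.0."""
    return 1.0 if _normalize(gold) == _normalize(pred) else 0.0

def f1_score(gold: Set[Pair], pred: Set[Pair]) -> float:
    """Set-level F1 over (topic phrase, column name) pairs."""
    gold_n, pred_n = _normalize(gold), _normalize(pred)
    if not gold_n and not pred_n:
        return 1.0
    if not gold_n or not pred_n:
        return 0.0
    true_positives = len(gold_n & pred_n)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_n)
    recall = true_positives / len(gold_n)
    return 2 * precision * recall / (precision + recall)
```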
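
And for execution accuracy, the comparison we are using today; the choice of sqlite3 as the engine, the multiset comparison for unordered queries, and treating a failed prediction as a mismatch are all guesses that may differ from the official driver:

```python
import sqlite3
from collections import Counter

def execution_match(pred_sql: str, gold_sql: str, db_path: str,
                    order_sensitive: bool = False) -> bool:
    """Execute both queries on the same database and compare their results.

    Assumptions (exactly what we hope the official driver pins down):
      * sqlite3 as the execution engine,
      * multiset (order-agnostic) row comparison unless the gold query is ordered,
      * a predicted query that raises an error counts as a mismatch,
      * no wall-clock timeout here; handling of long-running queries is left open.
    """
    def run(sql: str):
        conn = sqlite3.connect(db_path)
        try:
            return conn.execute(sql).fetchall()
        finally:
            conn.close()

    try:
        pred_rows = run(pred_sql)
    except sqlite3.Error:
        return False  # failed predicted query counts as incorrect
    gold_rows = run(gold_sql)

    if order_sensitive:
        return pred_rows == gold_rows
    # Compare as multisets of rows so row order does not matter.
    return Counter(map(tuple, pred_rows)) == Counter(map(tuple, gold_rows))
```

Even small differences here (e.g., float rounding, NULL handling, or how ORDER BY is detected) can shift execution accuracy by a few points, which is why the official driver matters.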
Why this is important
Since BEAVER is proposed as a benchmark, consistent evaluation logic is paramount. Without the official scripts, future work cannot reliably claim improvements over the baselines established in the paper.
Could you please release the evaluation/ folder or the relevant scripts used to generate the results in Tables 2, 3, and 4?
Thank you for your contribution!