Wikisql #113

Draft

Moneebah-Conrad wants to merge 11 commits into main from wikisql

Conversation

@Moneebah-Conrad

This PR corresponds to the following Task

This draft PR adds support for the WikiSQL dataset to the current solution so that it can be used for internal benchmarking and similar purposes.

For Reviewers (Optional)

  • All of the new scripts have descriptive comments and include their usage command.
  • Start by executing the 3 scripts in server/preprocess/wikisql in the following order: configure_wiki.py -> convert_wiki.py -> prepare_wiki.py
  • configure_wiki.py downloads the dataset from the original GitHub repository and organizes it in a format similar to that of the bird dataset.
  • convert_wiki.py converts the differently formatted SQL queries into standard, executable SQL and saves them into a {dataset}.json file.
  • prepare_wiki.py adds the missing files so that the format matches that of the bird dataset (it adds the description CSV, {dataset}_tables.json, and processed_test.json).
  • If process_dataset_sequentially is stuck for an extremely long time, check the logs: 9 times out of 10 it is either a quota-exhausted issue or an "error generating result by gemini" error stemming from the get_prompt and format_schema functions.
  • Added a 4th script, test_predictions.py, as a 'control'. The text2sql solution should perform similarly to running this single script, which calls Gemini once with one prompt containing the question and schema information and asks it to generate the SQL (over 200 queries the accuracy was ~73%).
  • Added a 5th script, evaluate_wiki.py, as an execution-accuracy test: test.json/dev.json contains a key-value pair holding the execution_result for the gold SQL. Comparing it with the predicted SQL's result yields a percentage accuracy (depending on how many queries were processed) as well as detailed eval results in a JSON file.
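For context on the conversion step, WikiSQL encodes each query as a logical form ({"sel", "agg", "conds"}) rather than a SQL string; the aggregation and condition operator tables below come from the original WikiSQL repository. This is a minimal sketch of what a logical-form-to-SQL conversion looks like, not the actual convert_wiki.py code (the function name and quoting details are illustrative):

```python
# Sketch of a WikiSQL logical-form -> executable SQL conversion.
# WikiSQL encodes a query as {"sel": col_idx, "agg": agg_idx,
# "conds": [[col_idx, op_idx, value], ...]}.

AGG_OPS = ["", "MAX", "MIN", "COUNT", "SUM", "AVG"]  # from the WikiSQL repo
COND_OPS = ["=", ">", "<", "OP"]

def wikisql_to_sql(query: dict, table_name: str, columns: list) -> str:
    """Render a WikiSQL logical form as an executable SQL string."""
    sel_col = '"{}"'.format(columns[query["sel"]])
    agg = AGG_OPS[query["agg"]]
    select = f"{agg}({sel_col})" if agg else sel_col

    where_parts = []
    for col_idx, op_idx, value in query.get("conds", []):
        col = '"{}"'.format(columns[col_idx])
        # Quote string literals (escaping single quotes); leave numbers bare.
        if isinstance(value, (int, float)):
            lit = str(value)
        else:
            lit = "'{}'".format(str(value).replace("'", "''"))
        where_parts.append(f"{col} {COND_OPS[op_idx]} {lit}")

    sql = f"SELECT {select} FROM {table_name}"
    if where_parts:
        sql += " WHERE " + " AND ".join(where_parts)
    return sql
```

For example, `{"sel": 0, "agg": 0, "conds": [[2, 0, "Guard"]]}` over columns `["Player", "No.", "Position"]` renders as `SELECT "Player" FROM table_1 WHERE "Position" = 'Guard'`.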
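The single-prompt 'control' approach in test_predictions.py can be sketched as a plain prompt builder; the template wording below is an assumption, not the script's actual prompt, and the Gemini API call itself is omitted:

```python
def build_text2sql_prompt(question: str, table_name: str, columns: list) -> str:
    """Build a single-shot prompt asking the model for one SQL query.
    The template wording is illustrative, not the actual script's prompt."""
    schema = ", ".join('"{}"'.format(c) for c in columns)
    return (
        "You are a text-to-SQL assistant.\n"
        f"Table: {table_name}\n"
        f"Columns: {schema}\n"
        f"Question: {question}\n"
        "Return only a single executable SQL query, with no explanation."
    )
```

The resulting string would then be sent to Gemini in a single call, and the returned SQL compared against the gold query's execution result.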
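The execution-accuracy check in evaluate_wiki.py can be sketched as: execute the predicted SQL, then compare its result set against the stored execution_result of the gold SQL. The helper names, the SQLite backend, and the order-insensitive multiset comparison are assumptions about the approach, not the script's actual code:

```python
import sqlite3
from collections import Counter

def exec_sql(db_path, sql):
    """Execute SQL against a SQLite database; return rows, or None on error."""
    try:
        with sqlite3.connect(db_path) as conn:
            return conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None

def execution_match(pred_rows, gold_rows):
    """Order-insensitive multiset comparison of two result sets."""
    if pred_rows is None or gold_rows is None:
        return False
    return Counter(map(tuple, pred_rows)) == Counter(map(tuple, gold_rows))

def execution_accuracy(examples, db_path):
    """examples: list of dicts with 'predicted_sql' and 'execution_result'
    keys (the key names are assumptions about the test.json/dev.json format)."""
    hits = sum(
        execution_match(exec_sql(db_path, ex["predicted_sql"]),
                        ex["execution_result"])
        for ex in examples
    )
    return hits / len(examples) if examples else 0.0
```

A multiset (rather than ordered) comparison avoids penalizing predictions that differ from the gold SQL only in row ordering, which is a common convention for execution-accuracy metrics.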

Please refer to the following screenshots:

  • After running configure_wiki.py, which downloads the dataset and organizes it into the relevant folders:
[Screenshot 2025-05-07 at 3:17:53 PM]
  • Final result after executing all scripts and process_dataset_sequentially:
[image]
  • Eval results are stored in this folder, and the version updates each time you run the script:
[Screenshot 2025-05-07 at 4:18:56 PM]
