Composable pipeline for curating large-scale text-to-SQL corpora by extending database schemas, synthesising natural-language questions, and validating SQL programs with LLMs.
The dataset is available on Hugging Face Datasets at `trl-lab/SQaLe-text-to-SQL-dataset`.
Link to the associated paper: https://openreview.net/forum?id=6PsKDjgoEy
- Automates schema growth, question generation, SQL synthesis, and quality annotations.
- Runs locally or on Slurm clusters with GPU-backed vLLM inference.
- Ships reusable ReFoRCE text-to-SQL components plus utility scripts for metadata, analysis, and Hugging Face uploads.
- Produces JSONL artefacts ready for downstream fine-tuning or evaluation.
- `extend_schemas.py` – grows SQLite schemas with LLM assistance and on-the-fly validation.
- `create_questions.py` – batches vLLM calls to craft semi-synthetic NL questions per schema.
- `create_sql_queries.py` – generates executable SQL via the ReFoRCE voting pipeline.
- `add_metadata.py` / `augment_questions_vllm.py` – enrich or paraphrase dataset entries.
- `ReFoRCE/` – lightweight fork of the ReFoRCE text-to-SQL execution engine.
- `data/` – example schemas, generated triplets, and analysis helpers.
- `jobs/` – Slurm templates for running each phase on H100-class GPUs.
- `util/` – helper scripts (dataset splits, HF pushes, stats).
- Clone & environment:

  ```bash
  git clone https://github.com/trl-lab/text-to-sql-dataset.git
  cd text-to-sql-dataset
  conda env create -f conda-env.yml
  conda activate sqale
  ```
- Provision credentials (as needed):
  - Hugging Face: `export HF_TOKEN=...` for dataset uploads.
A CUDA-enabled GPU is required for the default vLLM models listed in the job scripts.
The pipeline is modular; individual stages can be rerun or swapped. Paths below assume you stay in the repo root.
- Convert Schemapile into CREATE TABLE seeds

  ```bash
  python convert_to_create_statement.py
  ```

  - Downloads `trl-lab/schemapile` via `datasets` and rewrites each entry into executable SQLite `CREATE TABLE` files under `data/datasets/` and `data/statements/`.
  - Validates every schema by building a temporary SQLite database; failed conversions are skipped, with the error logged to stdout (see the sketch below).
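  A minimal sketch of the validation idea, not the script's exact logic: compile each generated DDL file into a throwaway SQLite database and skip anything that fails.

  ```python
  # Sketch: validate generated CREATE TABLE files by compiling them into a
  # throwaway in-memory SQLite database; failures are logged and skipped.
  import sqlite3
  from pathlib import Path

  for ddl_file in Path("data/statements").glob("*.sql"):
      conn = sqlite3.connect(":memory:")
      try:
          conn.executescript(ddl_file.read_text())
      except sqlite3.Error as exc:
          print(f"skipping {ddl_file.name}: {exc}")
      finally:
          conn.close()
  ```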
- Extend baseline schemas

  ```bash
  python extend_schemas.py \
    --folder data/test/statements \
    --out-dir data/test/extended \
    --model Qwen/Qwen3-32B-FP8
  ```

  - Adds 15–25 coherent tables per iteration until a target size is reached (one iteration is sketched below).
  - Every batch is executed against SQLite to ensure syntactic and referential correctness.
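  Roughly, one growth iteration looks like the following; `proposed_batch` stands in for the model's output, and the foreign-key walk illustrates the referential check rather than the script's exact implementation.

  ```python
  # Sketch of one growth iteration: apply an LLM-proposed batch of tables,
  # then verify every foreign key points at a table that actually exists.
  import sqlite3

  conn = sqlite3.connect(":memory:")
  conn.executescript("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);")

  proposed_batch = """
  CREATE TABLE orders (
      id INTEGER PRIMARY KEY,
      user_id INTEGER NOT NULL REFERENCES users(id)
  );
  """
  conn.executescript(proposed_batch)  # raises sqlite3.Error on invalid DDL

  tables = {row[0] for row in conn.execute(
      "SELECT name FROM sqlite_master WHERE type = 'table'")}
  for table in tables:
      for fk in conn.execute(f"PRAGMA foreign_key_list({table!r})"):
          assert fk[2] in tables, f"{table} references missing table {fk[2]}"
  ```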
- Seed exemplar questions (optional but recommended)

  ```bash
  python create_examples.py
  ```

  - Merges Spider/BIRD prompts with join counts into `data/examples.csv`, guiding later prompting diversity.
  - This can also be replaced with a custom CSV of your own question templates (see the sketch below).
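  If you supply your own CSV, the shape below is the idea; the column names (`question`, `join_count`) are assumptions here, so check `create_questions.py` for the exact header it expects.

  ```python
  # Write a custom examples file. Column names are illustrative assumptions;
  # verify the header that create_questions.py actually reads.
  import csv

  rows = [
      {"question": "List the names of all customers.", "join_count": 0},
      {"question": "Which customers placed orders over $100?", "join_count": 1},
  ]
  with open("data/examples.csv", "w", newline="") as f:
      writer = csv.DictWriter(f, fieldnames=["question", "join_count"])
      writer.writeheader()
      writer.writerows(rows)
  ```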
- Generate NL questions

  ```bash
  python create_questions.py \
    --schema_folder data/test/extended \
    --example_questions_file data/examples.csv \
    --model Qwen/Qwen3-32B-FP8 \
    --out data/questions.jsonl
  ```

  - Uses vLLM to request multiple questions per schema while enforcing a join-count distribution (a quick check is sketched below).
  - Automatically retries malformed outputs and records prompt metadata.
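  A quick way to sanity-check the enforced join-count distribution afterwards; the `join_count` key is an assumption, so inspect one line of the JSONL for the real field names.

  ```python
  # Inspect the join-count distribution of the generated questions.
  # The `join_count` key is an assumption; check the JSONL's actual keys.
  import json
  from collections import Counter

  with open("data/questions.jsonl") as f:
      records = [json.loads(line) for line in f]

  print(len(records), "questions")
  print(Counter(rec.get("join_count") for rec in records))
  ```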
- Synthesize SQL with ReFoRCE

  ```bash
  python create_sql_queries.py \
    --questions_file data/questions.jsonl \
    --model Qwen/Qwen3-32B-FP8 \
    --batch_size 64 \
    --out data/semi_synthetic_dataset.jsonl
  ```

  - Generates candidates, executes them inside an in-memory SQLite clone, and runs a judge pass for final selection (see the sketch below).
  - Output records include `prompt`, `sql_statement`, `schema`, and `cmd_type` fields.
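  Because each record carries its schema, triplets can be re-verified the same way the pipeline does, by replaying them against an in-memory SQLite clone. A sketch, assuming `schema` holds the full DDL text:

  ```python
  # Spot-check that every triplet executes: rebuild each record's schema in
  # an in-memory SQLite database, then run its SQL statement against it.
  import json
  import sqlite3

  with open("data/semi_synthetic_dataset.jsonl") as f:
      for line in f:
          rec = json.loads(line)
          conn = sqlite3.connect(":memory:")
          conn.executescript(rec["schema"])   # recreate the database
          conn.execute(rec["sql_statement"])  # should run without error
          conn.close()
  ```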
- Add structural metadata

  ```bash
  python add_metadata.py \
    --input data/semi_synthetic_dataset.jsonl \
    --output data/combined_with_metadata.jsonl
  ```

  - Appends token counts, join counts, table/column statistics, and normalises field names (illustrated below).
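  For intuition, the appended counts can be derived along these lines; the script's actual tokenisation and parsing may differ.

  ```python
  # Illustration of the kind of metadata added per record; a sketch, not
  # the script's implementation (its tokenisation may differ).
  import re

  sql = "SELECT c.name FROM customers c JOIN orders o ON o.customer_id = c.id"
  join_count = len(re.findall(r"\bJOIN\b", sql, flags=re.IGNORECASE))
  token_count = len(sql.split())  # crude whitespace tokenisation
  print(join_count, token_count)  # -> 1 12
  ```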
- (Optional) Question paraphrases

  ```bash
  python augment_questions_vllm.py \
    --input data/combined_with_metadata.jsonl \
    --output data/combined_with_paraphrases.jsonl \
    --model Qwen/Qwen3-14B-FP8 \
    --num-alternatives 3
  ```

  - Generates faithful rewrites per entry, updating token counts and tagging augmentation provenance (see below).
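  Downstream, the provenance tag lets you separate originals from rewrites; the `augmentation` field name is an assumption here, so check the script's output keys.

  ```python
  # Split originals from paraphrases using the provenance tag. The field
  # name `augmentation` is an assumption; verify against the output keys.
  import json

  with open("data/combined_with_paraphrases.jsonl") as f:
      records = [json.loads(line) for line in f]

  originals = [r for r in records if not r.get("augmentation")]
  paraphrases = [r for r in records if r.get("augmentation")]
  print(len(originals), "originals,", len(paraphrases), "paraphrases")
  ```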
- Publish to Hugging Face

  ```bash
  python util/push_to_hf.py \
    --jsonl data/combined_with_paraphrases.jsonl \
    --repo-id your-org/text-to-sql \
    --split-name train \
    --private
  ```
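  Once pushed, the split loads back through the standard `datasets` API (same repo id as above):

  ```python
  # Load the published split back with the Hugging Face datasets library.
  # For private repos, pass token="hf_..." or keep HF_TOKEN exported.
  from datasets import load_dataset

  ds = load_dataset("your-org/text-to-sql", split="train")
  print(ds[0]["sql_statement"])
  ```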
- `data/test/statements/` – initial schema seeds (SQLite DDL).
- `data/test/extended/` – schema extensions per growth iteration.
- `data/questions.jsonl` – generated NL questions with schema context.
- `data/semi_synthetic_dataset.jsonl` – question/SQL/schema triplets.
- `data/combined_with_metadata.jsonl` – enriched records ready for release.
- `data/complete/` – sharded outputs from large Slurm runs.
- Templates in `jobs/*.job` load the environment module stack, create the Conda env, and call the relevant script.
- Adjust `--array`, `--start_index`, and `--end_index` to chunk workloads across GPUs (see the sketch below).
- Redirected logs live in `slurm_output/`.
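The chunking arithmetic is straightforward: each array task takes a contiguous slice of the input. A sketch, with an illustrative chunk size:

```python
# Map a Slurm array task id to --start_index/--end_index bounds.
# CHUNK is illustrative; size it to balance GPUs against the input.
import os

CHUNK = 500
task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", 0))
start_index = task_id * CHUNK
end_index = start_index + CHUNK
print(f"--start_index {start_index} --end_index {end_index}")
```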
If you find SQaLe useful in your research, please cite the following paper:
@inproceedings{wolff2025sqale,
title={{SQ}aLe: A large text-to-{SQL} corpus grounded in real schemas},
author={Cornelius Wolff and Daniel Gomm and Madelon Hulsebos},
booktitle={EurIPS 2025 Workshop: AI for Tabular Data},
year={2025},
url={https://openreview.net/forum?id=6PsKDjgoEy}
}
This repository is licensed under the GNU GPLv3. See LICENSE for details.