OrQA (Open Data Retrieval and Question Answering) is a workflow for generating new benchmark datasets to evaluate retrieval and tabular question answering models on Open Data.
The workflow is composed of four main stages:
- Crawling data and metadata from the desired Open Data endpoint
- Searching for candidate related tables
- Evaluating the previously found pairs
- Generating questions and corresponding SQL queries
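The four stages above are run as separate scripts (the index-building script supports the search stage). As a rough sketch, a driver that chains them in order might look like the following; the script names and argument order come from the commands shown later in this README, while the function names and the use of `subprocess` are illustrative assumptions, not OrQA's actual code:

```python
import subprocess
import sys

# OrQA workflow scripts, in execution order (names from the scripts folder).
STAGES = [
    "orqa_0_open_data_crawler.py",
    "orqa_1_create_blend_index.py",
    "orqa_2_search_candidates.py",
    "orqa_3_evaluation.py",
    "orqa_4_generate_questions.py",
]

def build_commands(portal, start, end, api_url=None):
    """Build one command line per stage for the given portal and package range."""
    commands = []
    for script in STAGES:
        cmd = [sys.executable, script, portal, str(start), str(end)]
        # Only the crawler stage takes the Open Data API endpoint URL.
        if script.endswith("crawler.py") and api_url:
            cmd.append(api_url)
        commands.append(cmd)
    return commands

def run_pipeline(portal, start, end, api_url):
    """Run every stage in order, stopping on the first failure."""
    for cmd in build_commands(portal, start, end, api_url):
        subprocess.run(cmd, check=True)
```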
All scripts needed to run your own experiments are located in the scripts folder.
OrQA is built on top of Ollama and LiteLLM.
You will need to manually install Ollama before running the scripts.
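LiteLLM exposes the models it manages through an OpenAI-compatible HTTP endpoint, which is presumably how the evaluation and generation scripts reach them. A minimal sketch of building such a request is shown below; the port (4000, LiteLLM's default), the model alias, and the helper name are assumptions for illustration, not values taken from OrQA's scripts:

```python
import json
import urllib.request

def build_chat_request(prompt, model="ollama/llama3",
                       base_url="http://localhost:4000"):
    """Build an OpenAI-style chat completion request for a LiteLLM proxy."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Sending the request requires a running proxy:
# with urllib.request.urlopen(build_chat_request("Generate a question")) as r:
#     print(json.load(r))
```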
Install the required Python packages via Conda:
```shell
$ conda env create -f environment.yml
```
and manually install the LiteLLM proxy:
```shell
$ pip install 'litellm[proxy]'
```
Before running the evaluation and generation scripts, start the Ollama server:
```shell
$ ollama serve
```
Then, launch LiteLLM:
```shell
(orqa) $ litellm --config litell_config.yml
```
Use the following commands to create a new dataset from the first 1000 available packages on the Canadian Open Data portal:
```shell
(orqa) $ python orqa_0_open_data_crawler.py CAN 0 1000 https://open.canada.ca/data/api/action
(orqa) $ python orqa_1_create_blend_index.py CAN 0 1000
(orqa) $ python orqa_2_search_candidates.py CAN 0 1000
(orqa) $ python orqa_3_evaluation.py CAN 0 1000
(orqa) $ python orqa_4_generate_questions.py CAN 0 1000
```
At this stage, customization of the workflow (such as selecting different models for question generation) is not yet available via command-line arguments or external config files; these settings must be changed directly in the scripts.
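The endpoint passed to the crawler above is a CKAN Action API, the interface used by the Canadian Open Data portal. Assuming the crawler pages through CKAN's standard `package_search` action with its `start`/`rows` parameters (a common approach, not necessarily OrQA's exact implementation), fetching a range of package metadata could be sketched as:

```python
import json
import urllib.parse
import urllib.request

def package_search_url(api_base, start, rows):
    """Build a CKAN package_search URL for one page of package metadata."""
    query = urllib.parse.urlencode({"start": start, "rows": rows})
    return f"{api_base}/package_search?{query}"

def crawl_packages(api_base, first, last, page_size=100):
    """Yield package metadata dicts for packages in the range [first, last)."""
    for start in range(first, last, page_size):
        rows = min(page_size, last - start)
        url = package_search_url(api_base, start, rows)
        with urllib.request.urlopen(url) as resp:
            yield from json.load(resp)["result"]["results"]
```

For example, `crawl_packages("https://open.canada.ca/data/api/action", 0, 1000)` would page through the first 1000 packages of the Canadian portal.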
The dataset folder contains a first dataset version generated with the OrQA workflow: it comprises 1,000 questions created from Canadian and UK Open Data tables.
