EureQA is an Extractive Question Answering model based on the ALBERT Base v2 architecture and fine-tuned on the SQuAD 2.0 dataset.
- Install miniforge.
- Create the conda environment: `conda env create -f environment.yml`
- Activate the `eureqa` environment: `conda activate eureqa`
- Run the following command in the `eureqa` environment: `python app.py`
- Open the address in the console.
Alternatively, you can try the online demo on Hugging Face Spaces.
EureQA is a transformers model, based on ALBERT and fine-tuned on SQuAD 2.0, an extractive question answering dataset. This means it was fine-tuned to answer questions given some context. It can also detect when the context contains no answer to the provided question.
Since EureQA is based on ALBERT, it shares layer weights across its Transformer, meaning all layers have the same parameters. Repeating layers keeps the memory footprint small; however, the computational cost remains similar to a BERT-like architecture with the same number of hidden layers, since the model still has to iterate through the same number of (repeating) layers.
This model has the following configuration:
- 12 repeating layers
- 128 embedding dimension
- 768 hidden dimension
- 12 attention heads
- 11M parameters
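These numbers can be verified from the pretrained backbone's configuration. A quick sketch (the printed parameter count is for the ALBERT backbone and may differ slightly from the rounded 11M figure):

```python
from transformers import AlbertConfig, AlbertModel

# Inspect the backbone configuration used by EureQA (albert/albert-base-v2).
config = AlbertConfig.from_pretrained("albert/albert-base-v2")
print(config.num_hidden_layers)    # 12 repeating layers
print(config.num_hidden_groups)    # 1 group, i.e. all layers share the same weights
print(config.embedding_size)       # 128-dimensional embeddings
print(config.hidden_size)          # 768-dimensional hidden states
print(config.num_attention_heads)  # 12 attention heads

# Weight sharing keeps the backbone at roughly 11M parameters.
model = AlbertModel.from_pretrained("albert/albert-base-v2")
print(f"{sum(p.numel() for p in model.parameters()):,} parameters")
```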
You can use this model for extractive question answering on your own data. Since the model was fine-tuned on the SQuAD 2.0 dataset, it might perform worse on out-of-domain (OOD) data.
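For example, the model can be loaded with the transformers question-answering pipeline. The checkpoint path below is a placeholder rather than an official identifier; point it at the actual EureQA weights (a local directory or a Hub repository):

```python
from transformers import pipeline

# Placeholder checkpoint path; replace with the actual EureQA weights.
qa = pipeline("question-answering", model="path/to/eureqa")

context = (
    "ALBERT shares weights across its transformer layers, "
    "which keeps its memory footprint small."
)

# An answerable question returns a span extracted from the context.
print(qa(question="How does ALBERT keep its memory footprint small?",
         context=context, handle_impossible_answer=True))

# With handle_impossible_answer=True the pipeline may return an empty
# answer when the context does not support the question.
print(qa(question="Who invented the telephone?",
         context=context, handle_impossible_answer=True))
```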
The model was trained and evaluated on the SQuAD 2.0 dataset.
Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
SQuAD2.0 combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. To do well on SQuAD2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering.
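For reference, the dataset can be loaded from the Hugging Face datasets hub (a small sketch; split sizes are approximate):

```python
from datasets import load_dataset

squad_v2 = load_dataset("squad_v2")
print(squad_v2)  # train: ~130k examples, validation: ~12k examples

example = squad_v2["validation"][0]
print(example["question"])
print(example["context"][:200])
# 'answers' holds {'text': [...], 'answer_start': [...]};
# empty lists mark unanswerable questions.
print(example["answers"])
```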
Training is performed using the `run_qa.py` script from the Hugging Face Transformers examples:
```bash
python run_qa.py \
--model_name_or_path albert/albert-base-v2 \ # Pretrained model identifier on the Huggingface model hub
--dataset_name squad_v2 \ # Dataset name on the Huggingface datasets hub
--version_2_with_negative \ # Indicates the answer might be absent in the context
--do_train \ # Perform model training
--num_train_epochs 2 \ # Number of training epochs
--per_device_train_batch_size 12 \ # Train batch size
--learning_rate 3e-5 \ # Learning rate
--max_seq_length 384 \ # Maximum sequence length (in tokens) passed to the model. This includes both question and context. Longer sequences are truncated, with the overflow becoming the context of another sequence
--doc_stride 128 \ # Stride to use when splitting long sequences into several ones
--bf16 True \ # Use bf16 during training. This can speed up training on some GPUs
--torch_compile \ # Compile the model with `torch.compile`. This can speed up training and inference
--do_eval \ # Perform model evaluation
--per_device_eval_batch_size 64 \ # Evaluation batch size. Since gradients are not stored during evaluation, it can be larger than the train batch size
--eval_strategy "steps" \ # Evaluate the model every N steps
--eval_steps 2000 \ # Steps at which the model should be evaluated
--eval_on_start True \ # Evaluate the model before the training (evaluates the pretrained model with initialized head)
--bf16_full_eval \ # Use bf16 during evaluation
--logging_steps 2000 \ # Steps at which the logging should be performed
--save_steps 2000 \ # Steps at which the model should be saved
--output_dir ./tmp/debug_squad2/ # Directory to save the model and training info to
```

The following hyperparameters were used during training:
- Learning rate: 3e-05
- Train batch size: 12
- Eval batch size: 64
- Seed: 42
- Optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
- Learning rate scheduler: linear
- Epochs: 2.0
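The max_seq_length of 384 and doc_stride of 128 control how long contexts are windowed: each question and context pair is packed into 384-token sequences, and contexts that do not fit are split into overlapping chunks. A simplified sketch of that tokenization (roughly what `run_qa.py` does internally, minus the bookkeeping for answer positions):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("albert/albert-base-v2")

question = "What does ALBERT share across its layers?"
context = "ALBERT shares parameters across its transformer layers. " * 80  # artificially long

# truncation="only_second" truncates only the context, never the question.
# return_overflowing_tokens=True turns the truncated overflow into extra
# windows of max_length tokens that overlap the previous window by `stride` tokens.
encoded = tokenizer(
    question,
    context,
    max_length=384,
    stride=128,
    truncation="only_second",
    return_overflowing_tokens=True,
)
print(len(encoded["input_ids"]))  # number of windows produced from this single example
```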
The training loss is 0.9652.
| Evaluation subset | Exact Match | F1 score |
|---|---|---|
| Answer in the context | 75.0 | 81.4 |
| No answer in the context | 82.1 | 82.1 |
| Overall | 78.6 | 81.8 |
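These Exact Match and F1 scores follow the official SQuAD 2.0 evaluation and can be computed with the evaluate library's squad_v2 metric. A sketch with a single dummy prediction (real usage feeds the model's predictions for the whole validation set):

```python
import evaluate

squad_v2_metric = evaluate.load("squad_v2")

# Each prediction needs an id, the predicted text ("" means "no answer"),
# and a probability that the question is unanswerable.
predictions = [
    {"id": "example-0", "prediction_text": "France", "no_answer_probability": 0.0}
]
references = [
    {"id": "example-0", "answers": {"text": ["France"], "answer_start": [159]}}
]
print(squad_v2_metric.compute(predictions=predictions, references=references))
# {'exact': 100.0, 'f1': 100.0, ...}
```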
Comparison with the current SOTA single models (base variants):
| Model | Exact Match | F1 score |
|---|---|---|
| BERT | 73.7 | 76.3 |
| RoBERTa | 80.5 | 83.7 |
| ELECTRA | 80.5 | 83.3 |
| DeBERTa v3 | 85.4 | 88.4 |
| ALBERT (original) | 79.3 | 82.1 |
| EureQA (ours) | 78.6 | 81.8 |
You can find the training log on ClearML (you must be registered on clear.ml).