NLP-CISUC/Benchmark-ScientificQA-RAG-pt-PT

Benchmarking RAG for Scientific QA in European Portuguese

Code and data for the paper:

Benchmarking Retrieval-Augmented Generation for Scientific Knowledge QA in European Portuguese
José Matos, Catarina Silva, Hugo Gonçalo Oliveira
3rd International Workshop on Natural Scientific Language Processing (NSLP 2026)

Overview

We evaluate the impact of Retrieval-Augmented Generation (RAG) on scientific question answering in European Portuguese (pt-PT). Five instruction-tuned small language models (4–12B) are benchmarked across 16 RAG configurations varying retriever language specialization, reranking, and number of retrieved chunks.

Project Structure

├── rag-adapter/                  # RAG proxy server (for use with lm-eval-harness) and vector DB creation
│   ├── create_vectordb.py        # Indexes corpus into ChromaDB
│   ├── precompute_retrieval.py   # Pre-computes top-30 chunks per question
│   └── rag_proxy.py              # Proxy server that injects retrieved context
├── lm-evaluation-harness/        # Clone of the lm-evaluation-harness repo (optional)
├── results/                      # Raw evaluation outputs
├── eval_multiple_models.py       # Main evaluation script (used for reported results)
├── full_experiment_results.csv   # All results in a single CSV
└── plots_from_csv.ipynb          # Generates paper figures from CSV
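
As a rough illustration of what a context-injecting proxy does, the sketch below prepends retrieved chunks to the question before it reaches the model. The function name and prompt wording are hypothetical and not taken from rag_proxy.py:

```python
def build_rag_prompt(question: str, chunks: list[str]) -> str:
    """Prepend retrieved chunks to the question as grounding context."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Responda à pergunta com base no contexto seguinte.\n\n"
        f"Contexto:\n{context}\n\n"
        f"Pergunta: {question}\n"
        "Resposta:"
    )

# Hypothetical usage: one question, two retrieved chunks
prompt = build_rag_prompt(
    "Qual é a unidade SI da força?",
    ["A força mede-se em newton (N).", "1 N = 1 kg·m/s²."],
)
```

In the actual pipeline, the proxy applies this augmentation transparently so that lm-eval-harness can query it like a plain model server.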

Quick Start

1. Create vector DBs

Run once per embedding model to create separate ChromaDB instances:

...
# Embedding model (run this script once per encoder to create separate DBs).
# Options:
#   "sentence-transformers/paraphrase-multilingual-mpnet-base-v2"  (multilingual)
#   "PORTULAN/serafim-335m-portuguese-pt-sentence-encoder-ir"      (Portuguese)

model_name = "sentence-transformers/paraphrase-multilingual-mpnet-base-v2"
...

python rag-adapter/create_vectordb.py
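
The retrieval itself runs against ChromaDB, but the core step that precompute_retrieval.py performs per question (keeping the top 30 most similar chunks) can be sketched dependency-free with cosine similarity. Function names and the embedding vectors are illustrative, not taken from the repo:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 30) -> list[int]:
    """Return indices of the k chunks most similar to the query vector."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return ranked[:k]

# Toy example with 2-d "embeddings": chunk 1 matches the query exactly
best = top_k([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0], [0.7, 0.7]], k=2)  # → [1, 2]
```

Pre-computing the top 30 once lets every RAG configuration (which only varies how many of those chunks are injected, and whether they are reranked) reuse the same retrieval results.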

2. Run evals

# Start the model server (e.g., via vLLM)
# Start the RAG proxy server
python rag-adapter/rag_proxy.py

# Run the evaluation grid (configure the model and RAG settings in the script first)
python eval_multiple_models.py

Configure the model under evaluation and RAG settings in eval_multiple_models.py.
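
The 16 RAG configurations follow from crossing three factors (2 retrievers x 2 reranking settings x 4 chunk counts). The exact factor values below are assumptions for illustration, not read from the script:

```python
import itertools

# Assumed factor levels; the paper varies these three dimensions,
# but the specific values here are illustrative.
retrievers = ["multilingual-mpnet", "serafim-pt"]  # retriever language specialization
reranking = [False, True]                          # with / without reranker
num_chunks = [1, 3, 5, 10]                         # retrieved chunks injected

configs = list(itertools.product(retrievers, reranking, num_chunks))
print(len(configs))  # 2 x 2 x 4 = 16 RAG configurations
```

Each of the five models is then evaluated once per configuration (plus, presumably, a no-RAG baseline).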

Models

Model                  HuggingFace ID
AMALIA                 amaliallm.pt (not yet publicly available)
EuroLLM-9B-Instruct    utter-project/EuroLLM-9B-Instruct
Qwen3-8B               Qwen/Qwen3-8B
Gemma-3-4B-IT          google/gemma-3-4b-it
Gemma-3-12B-IT         google/gemma-3-12b-it

Data

  • Evaluation: Global MMLU Lite — Portuguese test split (400 questions)
  • Knowledge base: CorEGe-PT — 32,000+ Portuguese scientific documents

Hardware

All experiments were run on a single NVIDIA RTX A6000 GPU.

Citation

@inproceedings{matos2026benchmarking,
  title={Benchmarking Retrieval-Augmented Generation for Scientific Knowledge QA in European Portuguese},
  author={Matos, Jos{\'e} and Silva, Catarina and Oliveira, Hugo Gon{\c{c}}alo},
  booktitle={Proceedings of the 3rd Workshop on Natural Scientific Language Processing (NSLP) at LREC 2026},
  year={2026},
}

More

See the paper for the full methodology, results, and additional details.
