NLP-CISUC/Benchmark-ScientificQA-RAG-pt-PT

Benchmarking RAG for Scientific QA in European Portuguese

Code and data for the paper:

Benchmarking Retrieval-Augmented Generation for Scientific Knowledge QA in European Portuguese
José Matos, Catarina Silva, Hugo Gonçalo Oliveira
3rd International Workshop on Natural Scientific Language Processing (NSLP 2026)

Overview

We evaluate the impact of Retrieval-Augmented Generation (RAG) on scientific question answering in European Portuguese (pt-PT). Five instruction-tuned small language models (4–12B) are benchmarked across 16 RAG configurations varying retriever language specialization, reranking, and number of retrieved chunks.

Project Structure

├── rag-adapter/                  # RAG proxy server (for use with lm-eval-harness) and vector DB creation
│   ├── create_vectordb.py        # Indexes corpus into ChromaDB
│   ├── precompute_retrieval.py   # Pre-computes top-30 chunks per question
│   └── rag_proxy.py              # Proxy server that injects retrieved context
├── lm-evaluation-harness/        # Clone of the lm-evaluation-harness repo (optional)
├── results/                      # Raw evaluation outputs
├── eval_multiple_models.py       # Main evaluation script (used for reported results)
├── full_experiment_results.csv   # All results in a single CSV
└── plots_from_csv.ipynb          # Generates paper figures from CSV
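
As a rough illustration of what a context-injecting proxy does, the sketch below prepends retrieved chunks to the question before it reaches the model. The function name and prompt wording are hypothetical and not taken from rag_proxy.py:

```python
def build_rag_prompt(question: str, chunks: list[str]) -> str:
    """Prepend retrieved chunks to the question as grounding context."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Responda à pergunta com base no contexto seguinte.\n\n"
        f"Contexto:\n{context}\n\n"
        f"Pergunta: {question}\n"
        "Resposta:"
    )

# Hypothetical usage: one question, two retrieved chunks
prompt = build_rag_prompt(
    "Qual é a unidade SI da força?",
    ["A força mede-se em newton (N).", "1 N = 1 kg·m/s²."],
)
```

In the actual pipeline, the proxy applies this augmentation transparently so that lm-eval-harness can query it like a plain model server.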

Quick Start

1. Create vector DBs

Run once per embedding model to create separate ChromaDB instances:

...
# Embedding model (run this script once per encoder to create separate DBs).
# Options:
#   "sentence-transformers/paraphrase-multilingual-mpnet-base-v2"  (multilingual)
#   "PORTULAN/serafim-335m-portuguese-pt-sentence-encoder-ir"      (Portuguese)

model_name = "sentence-transformers/paraphrase-multilingual-mpnet-base-v2"
...

python rag-adapter/create_vectordb.py
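
The retrieval itself runs against ChromaDB, but the core step that precompute_retrieval.py performs per question (keeping the top 30 most similar chunks) can be sketched dependency-free with cosine similarity. Function names and the embedding vectors are illustrative, not taken from the repo:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 30) -> list[int]:
    """Return indices of the k chunks most similar to the query vector."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return ranked[:k]

# Toy example with 2-d "embeddings": chunk 1 matches the query exactly
best = top_k([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0], [0.7, 0.7]], k=2)  # → [1, 2]
```

Pre-computing the top 30 once lets every RAG configuration (which only varies how many of those chunks are injected, and whether they are reranked) reuse the same retrieval results.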

2. Run evals

# Start the model server (e.g., via vLLM)
# Start the RAG proxy server
python rag-adapter/rag_proxy.py

# Run the evaluation grid (configure the model and RAG settings in the script first)
python eval_multiple_models.py

Configure the model under evaluation and RAG settings in eval_multiple_models.py.
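
The 16 RAG configurations follow from crossing three factors (2 retrievers x 2 reranking settings x 4 chunk counts). The exact factor values below are assumptions for illustration, not read from the script:

```python
import itertools

# Assumed factor levels; the paper varies these three dimensions,
# but the specific values here are illustrative.
retrievers = ["multilingual-mpnet", "serafim-pt"]  # retriever language specialization
reranking = [False, True]                          # with / without reranker
num_chunks = [1, 3, 5, 10]                         # retrieved chunks injected

configs = list(itertools.product(retrievers, reranking, num_chunks))
print(len(configs))  # 2 x 2 x 4 = 16 RAG configurations
```

Each of the five models is then evaluated once per configuration (plus, presumably, a no-RAG baseline).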

Models

Model                  HuggingFace ID
AMALIA                 amaliallm.pt (not yet publicly available)
EuroLLM-9B-Instruct    utter-project/EuroLLM-9B-Instruct
Qwen3-8B               Qwen/Qwen3-8B
Gemma-3-4B-IT          google/gemma-3-4b-it
Gemma-3-12B-IT         google/gemma-3-12b-it

Data

  • Evaluation: Global MMLU Lite — Portuguese test split (400 questions)
  • Knowledge base: CorEGe-PT — 32,000+ Portuguese scientific documents

Hardware

All experiments were run on a single NVIDIA RTX A6000 GPU.

Citation

@inproceedings{matos2026benchmarking,
  title={Benchmarking Retrieval-Augmented Generation for Scientific Knowledge QA in European Portuguese},
  author={Matos, Jos{\'e} and Silva, Catarina and Oliveira, Hugo Gon{\c{c}}alo},
  booktitle={Proceedings of the 3rd Workshop on Natural Scientific Language Processing (NSLP) at LREC 2026},
  year={2026},
}

More

See the paper for the full methodology, results, and additional details.
