Code and data for the paper:
**Benchmarking Retrieval-Augmented Generation for Scientific Knowledge QA in European Portuguese**
José Matos, Catarina Silva, Hugo Gonçalo Oliveira
3rd International Workshop on Natural Scientific Language Processing (NSLP 2026)
We evaluate the impact of Retrieval-Augmented Generation (RAG) on scientific question answering in European Portuguese (pt-PT). Five instruction-tuned small language models (4–12B) are benchmarked across 16 RAG configurations varying retriever language specialization, reranking, and number of retrieved chunks.
```
├── rag-adapter/                  # RAG proxy server (for use with lm-eval-harness) and vector DB creation
│   ├── create_vectordb.py        # Indexes corpus into ChromaDB
│   ├── precompute_retrieval.py   # Pre-computes top-30 chunks per question
│   └── rag_proxy.py              # Proxy server that injects retrieved context
├── lm-evaluation-harness/        # Clone of the lm-evaluation-harness repo (optional)
├── results/                      # Raw evaluation outputs
├── eval_multiple_models.py       # Main evaluation script (used for reported results)
├── full_experiment_results.csv   # All results in a single CSV
└── plots_from_csv.ipynb          # Generates paper figures from CSV
```
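The core of `rag_proxy.py` is injecting retrieved chunks into the prompt before forwarding the request to the model server. A minimal sketch of that injection step, with hypothetical function and parameter names (the real proxy may format the context differently):

```python
def build_rag_prompt(question: str, chunks: list[str], max_chunks: int = 5) -> str:
    """Illustrative sketch: prepend up to max_chunks retrieved passages as
    numbered context, then append the question. Names and the prompt template
    are assumptions, not taken from rag_proxy.py."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks[:max_chunks]))
    return (
        "Responda à pergunta com base no contexto seguinte.\n\n"
        f"Contexto:\n{context}\n\n"
        f"Pergunta: {question}\n"
    )
```

Varying `max_chunks` is how the "number of retrieved chunks" axis of the experiment grid would be controlled at the proxy level.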
Run once per embedding model to create separate ChromaDB instances:

```python
...
# Embedding model: run this script once per encoder to create separate DBs
# Options:
#   "sentence-transformers/paraphrase-multilingual-mpnet-base-v2" (multilingual)
#   "PORTULAN/serafim-335m-portuguese-pt-sentence-encoder-ir" (Portuguese)
model_name = "sentence-transformers/paraphrase-multilingual-mpnet-base-v2"
...
```
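Downstream, `precompute_retrieval.py` ranks corpus chunks by embedding similarity to each question and keeps the top 30. A pure-Python sketch of that top-k cosine ranking, under the assumption that similarity is cosine over dense embeddings (the real script queries ChromaDB rather than computing this by hand):

```python
import math

def top_k_chunks(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 30) -> list[int]:
    """Return indices of the k chunk embeddings most cosine-similar to the
    query embedding. Hypothetical helper, not the repo's actual API."""
    def cos(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)

    ranked = sorted(range(len(chunk_vecs)),
                    key=lambda i: cos(query_vec, chunk_vecs[i]),
                    reverse=True)
    return ranked[:k]
```

Pre-computing these rankings once per embedding model lets the evaluation grid reuse them across configurations instead of re-querying the vector DB.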
___
```shell
# Create the vector database
python rag-adapter/create_vectordb.py

# Start the model server (e.g., via vLLM)

# Start the RAG proxy server
python rag-adapter/rag_proxy.py

# Run the evaluation grid
python eval_multiple_models.py
```

Configure the model under evaluation and the RAG settings in eval_multiple_models.py before running.
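The 16 RAG configurations come from crossing the three factors named above: retriever language specialization, reranking, and number of retrieved chunks. A sketch of how such a grid could be enumerated; the split 2 retrievers × 2 reranking settings × 4 chunk counts, and the specific chunk counts, are assumptions rather than values read from the repo:

```python
from itertools import product

# Hypothetical grid: 2 retrievers x with/without reranking x 4 chunk counts = 16.
retrievers = [
    "sentence-transformers/paraphrase-multilingual-mpnet-base-v2",
    "PORTULAN/serafim-335m-portuguese-pt-sentence-encoder-ir",
]
rerank_options = [False, True]
chunk_counts = [1, 3, 5, 10]  # assumed values for illustration

configs = [
    {"retriever": r, "rerank": rr, "k": k}
    for r, rr, k in product(retrievers, rerank_options, chunk_counts)
]
print(len(configs))  # -> 16
```

Each configuration is then evaluated for every model in the table below, which is what eval_multiple_models.py iterates over.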
| Model | HuggingFace ID |
|---|---|
| AMALIA | amaliallm.pt (not yet publicly available) |
| EuroLLM-9B-Instruct | utter-project/EuroLLM-9B-Instruct |
| Qwen3-8B | Qwen/Qwen3-8B |
| Gemma-3-4B-IT | google/gemma-3-4b-it |
| Gemma-3-12B-IT | google/gemma-3-12b-it |
- Evaluation: Global MMLU Lite — Portuguese test split (400 questions)
- Knowledge base: CorEGe-PT — 32,000+ Portuguese scientific documents
All experiments were run on a single NVIDIA RTX A6000 GPU.
@inproceedings{matos2026benchmarking,
title={Benchmarking Retrieval-Augmented Generation for Scientific Knowledge QA in European Portuguese},
author={Matos, Jos{\'e} and Silva, Catarina and Oliveira, Hugo Gon{\c{c}}alo},
booktitle={Proceedings of the 3rd Workshop on Natural Scientific Language Processing (NSLP) at LREC 2026},
year={2026},
}

See the paper for the full methodology, results, and additional details.