This repository contains the official implementation for the paper "Assessment of Pre-Trained Models Across Languages and Grammars" by Alberto Muñoz-Ortiz, David Vilares, and Carlos Gómez-Rodríguez, presented at IJCNLP-AACL 2023 in Nusa Dua, Bali, Indonesia.
This project evaluates the performance of various pre-trained language models (such as BERT, XLM-R, and CANINE) across different languages and grammatical formalisms (constituency and dependency parsing).
The repository is organized as follows:
- src/: Core logic and model implementations.
- scripts/: Entry-point scripts for training, evaluation, and plotting.
- notebooks/: Jupyter notebooks for data analysis and visualization.
- data/: Directory for storing datasets and intermediate scores.
- results/: Output logs and evaluation results.
- config/: Model and training configurations.
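As a quick sanity check after cloning, the layout above can be verified with a few lines of Python. This is only a sketch: the directory names come from the list above, while `EXPECTED_DIRS` and `missing_dirs` are hypothetical helpers that are not part of the repository. Note that `data/` and `results/` may legitimately be absent in a fresh clone if they are only created when the scripts run.

```python
from pathlib import Path

# Directory names taken from the layout listed above.
EXPECTED_DIRS = ["src", "scripts", "notebooks", "data", "results", "config"]


def missing_dirs(repo_root="."):
    """Return the expected top-level directories that are absent under repo_root."""
    root = Path(repo_root)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs()
    if missing:
        print("Missing directories:", ", ".join(missing))
    else:
        print("Repository layout looks complete.")
```

Run it from the repository root; an empty result from `missing_dirs()` means every listed directory is present.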
- Clone the repository:

      git clone https://github.com/amunozo/multilingual-assessment.git
      cd multilingual-assessment

- Create a virtual environment (optional but recommended):

      python -m venv venv
      source venv/bin/activate  # On Windows: venv\Scripts\activate

- Install dependencies:

      pip install -r requirements.txt

To train the models for dependency parsing:

    python scripts/train.py

To evaluate the trained models:

    python scripts/eval.py

To generate plots from the evaluation scores:

    python scripts/plot.py

The main findings of the paper show how different subword tokenization strategies and model architectures impact the cross-lingual transferability of grammatical knowledge. For detailed results, please refer to our paper.
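The three usage steps (train, evaluate, plot) can also be chained from a single Python driver. The sketch below is an assumption rather than something shipped with the repository: `PIPELINE`, `build_commands`, `run_pipeline`, and the `--dry-run` flag are hypothetical names introduced for illustration, and it assumes the three scripts run without extra arguments, as in the commands shown here.

```python
import subprocess
import sys

# Script paths as given in the usage steps; the order is train, evaluate, plot.
PIPELINE = ["scripts/train.py", "scripts/eval.py", "scripts/plot.py"]


def build_commands(python=sys.executable):
    """Build one command per pipeline stage, using the given interpreter."""
    return [[python, script] for script in PIPELINE]


def run_pipeline(dry_run=False):
    """Run each stage in order and stop at the first failure."""
    for cmd in build_commands():
        print("Running:", " ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code,
            # so evaluation never runs after a failed training step.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical flag for this sketch only: --dry-run prints the commands
    # without executing them.
    run_pipeline(dry_run="--dry-run" in sys.argv[1:])
```

Using `sys.executable` keeps every stage inside the active virtual environment, which avoids accidentally picking up a system-wide `python`.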
If you use this code or our findings in your research, please cite:
@inproceedings{munoz-ortiz-etal-2023-assessment,
title = "Assessment of Pre-Trained Models Across Languages and Grammars",
author = "Mu{\~n}oz-Ortiz, Alberto and
Vilares, David and
G{\'o}mez-Rodr{\'i}guez, Carlos",
booktitle = "Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = nov,
year = "2023",
address = "Nusa Dua, Bali",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.ijcnlp-main.23",
pages = "343--358",
}

For any questions or issues, please contact the main author: Alberto Muñoz-Ortiz - alberto.munoz.ortiz@udc.es
We acknowledge the European Research Council (ERC), which has funded this research under the Horizon Europe research and innovation programme (SALSA, grant agreement No 101100615), ERDF/MICINN-AEI (SCANNER-UDC, PID2020-113230RB-C21), Xunta de Galicia (ED431C 2020/11), grant FPI 2021 (PID2020-113230RB-C21) funded by MCIN/AEI/10.13039/501100011033, and Centro de Investigación de Galicia "CITIC", funded by the Xunta de Galicia through the collaboration agreement between the Consellería de Cultura, Educación, Formación Profesional e Universidades and the Galician universities for the reinforcement of the research centres of the Galician University System (CIGUS).