Assessment of Pre-Trained Models Across Languages and Grammars

License: MIT | Python 3.8+ | Paper

This repository contains the official implementation for the paper "Assessment of Pre-Trained Models Across Languages and Grammars" by Alberto Muñoz-Ortiz, David Vilares, and Carlos Gómez-Rodríguez, presented at IJCNLP-AACL 2023 in Nusa Dua, Bali, Indonesia.


Overview

This project evaluates the performance of several pre-trained language models (such as BERT, XLM-R, and CANINE) across different languages and two grammar formalisms, constituency and dependency parsing.
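
The snippet below is a minimal, illustrative sketch (not code from this repository) of loading the encoder families named above with Hugging Face transformers to obtain contextual representations; the checkpoint names are assumptions and may differ from the ones used in the paper.

    # Minimal sketch (not code from this repository): loading the encoders mentioned
    # above with Hugging Face transformers. Checkpoint names are assumptions and may
    # differ from the ones used in the paper.
    from transformers import AutoTokenizer, AutoModel

    checkpoints = {
        "mBERT": "bert-base-multilingual-cased",
        "XLM-R": "xlm-roberta-base",
        "CANINE": "google/canine-s",
    }

    sentence = "Pre-trained models encode grammar differently across languages."

    for name, ckpt in checkpoints.items():
        tokenizer = AutoTokenizer.from_pretrained(ckpt)
        model = AutoModel.from_pretrained(ckpt)
        outputs = model(**tokenizer(sentence, return_tensors="pt"))
        # Contextual representations that the parsing experiments build on.
        print(name, outputs.last_hidden_state.shape)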

Project Structure

The repository is organized as follows:

  • src/: Core logic and model implementations.
  • scripts/: Entry-point scripts for training, evaluation, and plotting.
  • notebooks/: Jupyter notebooks for data analysis and visualization.
  • data/: Directory for storing datasets and intermediate scores.
  • results/: Output logs and evaluation results.
  • config/: Model and training configurations.

Installation

  1. Clone the repository:

    git clone https://github.com/amunozo/multilingual-assessment.git
    cd multilingual-assessment
  2. Create a virtual environment (optional but recommended):

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies (a quick import check follows this list):

    pip install -r requirements.txt
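
As a quick check that the environment is usable, the following snippet imports the main deep-learning dependencies. It assumes requirements.txt installs PyTorch and Hugging Face transformers, so adjust it if the pinned dependencies differ.

    # Optional sanity check (not part of the repository). Assumes requirements.txt
    # installs PyTorch and Hugging Face transformers; adjust if the pinned
    # dependencies differ.
    import torch
    import transformers

    print("torch:", torch.__version__)
    print("transformers:", transformers.__version__)
    print("CUDA available:", torch.cuda.is_available())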

Usage

Training

To train the models for dependency parsing:

python scripts/train.py
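
scripts/train.py encapsulates the actual training recipe. As an illustration only, the sketch below shows one common way to train a token-level head on top of a frozen pre-trained encoder (dependency parsing cast as sequence labeling); the checkpoint name, label count, and data are hypothetical placeholders and do not reproduce the repository's setup.

    # Illustrative only: this is NOT scripts/train.py. A minimal sketch of training a
    # token-level classifier on top of a frozen pre-trained encoder, one common way
    # to cast dependency parsing as sequence labeling. Checkpoint, label count, and
    # data below are hypothetical placeholders.
    import torch
    from torch import nn
    from transformers import AutoTokenizer, AutoModel

    encoder_name = "bert-base-multilingual-cased"  # assumed checkpoint
    tokenizer = AutoTokenizer.from_pretrained(encoder_name)
    encoder = AutoModel.from_pretrained(encoder_name)
    encoder.requires_grad_(False)  # probe frozen representations

    num_labels = 50  # hypothetical size of the label set
    head = nn.Linear(encoder.config.hidden_size, num_labels)
    optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    # Toy batch: in a real setup, labels come from treebank annotations and
    # padding positions would be masked out of the loss.
    sentences = ["The cat sleeps .", "Dogs bark loudly ."]
    batch = tokenizer(sentences, return_tensors="pt", padding=True)
    labels = torch.randint(0, num_labels, batch["input_ids"].shape)

    hidden = encoder(**batch).last_hidden_state        # (batch, seq, hidden)
    logits = head(hidden)                              # (batch, seq, num_labels)
    loss = loss_fn(logits.reshape(-1, num_labels), labels.reshape(-1))
    loss.backward()
    optimizer.step()
    print(f"toy loss: {loss.item():.4f}")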

Evaluation

To evaluate the trained models:

python scripts/eval.py
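
scripts/eval.py contains the repository's evaluation logic. For reference, the sketch below shows how the standard dependency-parsing metrics, UAS and LAS, are computed from gold and predicted (head, label) pairs; it is illustrative and not taken from the repository.

    # Illustrative only: this is NOT scripts/eval.py. Standard dependency-parsing
    # metrics computed from gold and predicted (head index, dependency label) pairs.
    def uas_las(gold, pred):
        """Return (UAS, LAS) for one sentence given per-token (head, label) pairs."""
        assert len(gold) == len(pred)
        correct_heads = sum(g[0] == p[0] for g, p in zip(gold, pred))
        correct_both = sum(g == p for g, p in zip(gold, pred))
        return correct_heads / len(gold), correct_both / len(gold)

    # Toy example: the last token gets the right head but the wrong label.
    gold = [(2, "det"), (0, "root"), (2, "punct")]
    pred = [(2, "det"), (0, "root"), (2, "obj")]
    uas, las = uas_las(gold, pred)
    print(f"UAS={uas:.2f}  LAS={las:.2f}")  # UAS=1.00  LAS=0.67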

Plotting Results

To generate plots from the evaluation scores:

python scripts/plot.py
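
scripts/plot.py produces the figures from the stored scores. As a rough illustration only, the sketch below plots per-language scores with matplotlib; the languages and numbers are placeholders, not results from the paper.

    # Illustrative only: this is NOT scripts/plot.py. A minimal matplotlib sketch of
    # per-language scores; languages and numbers are placeholders, not paper results.
    import matplotlib.pyplot as plt

    languages = ["English", "Basque", "Turkish", "Chinese"]   # hypothetical subset
    scores = {
        "mBERT": [85.0, 72.0, 60.0, 70.0],   # placeholder values
        "XLM-R": [87.0, 76.0, 64.0, 73.0],   # placeholder values
    }

    fig, ax = plt.subplots(figsize=(6, 3))
    for model_name, values in scores.items():
        ax.plot(languages, values, marker="o", label=model_name)
    ax.set_ylabel("LAS")
    ax.set_title("Example: per-language dependency parsing scores")
    ax.legend()
    fig.tight_layout()
    fig.savefig("example_scores.png")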

Results

The main findings of the paper show how different subword tokenization strategies and model architectures impact the cross-lingual transferability of grammatical knowledge. For detailed results, please refer to our paper.

Citation

If you use this code or our findings in your research, please cite:

@inproceedings{munoz-ortiz-etal-2023-assessment,
    title = "Assessment of Pre-Trained Models Across Languages and Grammars",
    author = "Mu{\~n}oz-Ortiz, Alberto  and
      Vilares, David  and
      G{\'o}mez-Rodr{\'i}guez, Carlos",
    booktitle = "Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = nov,
    year = "2023",
    address = "Nusa Dua, Bali",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.ijcnlp-main.23",
    pages = "343--358",
}

Contact

For any questions or issues, please contact the main author: Alberto Muñoz-Ortiz - alberto.munoz.ortiz@udc.es

Acknowledgments

We acknowledge the European Research Council (ERC), which has funded this research under the Horizon Europe research and innovation programme (SALSA, grant agreement No 101100615), ERDF/MICINN-AEI (SCANNER-UDC, PID2020-113230RB-C21), Xunta de Galicia (ED431C 2020/11), grant FPI 2021 (PID2020-113230RB-C21) funded by MCIN/AEI/10.13039/501100011033, and Centro de Investigación de Galicia "CITIC", funded by the Xunta de Galicia through the collaboration agreement between the Consellería de Cultura, Educación, Formación Profesional e Universidades and the Galician universities for the reinforcement of the research centres of the Galician University System (CIGUS).
