This is a repository of a lexical-semantic dataset in 15 natural languages.

nexuslinguarum/MultiLexBATS

MultiLexBATS

The Multilingual Lexical BATS (MultiLexBATS) dataset comprises lexical-semantic relations in 15 natural languages, listed in the table below.

The dataset folder contains a folder "all_languages" with ID- and English-aligned columns for each included language. MultiLexBATS is also provided as individual language files: the dataset folder holds one folder per language, containing all relation files for that language.

All scripts used in these experiments, e.g. for running statistics on the dataset (folder "stats") or for querying generative pre-trained transformers, i.e. BLOOM via the HuggingFace Inference API, are provided in the "scripts" folder.
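As a rough illustration of what such a query can look like, the sketch below builds a text-generation request for BLOOM using only the Python standard library. It is not taken from the "scripts" folder. The endpoint URL, the payload fields (`inputs`, `max_new_tokens`, `return_full_text`), the default of five new tokens, and the helper names `build_request` and `query_bloom` are assumptions for illustration; the repository's scripts may use a different client or different settings.

```python
import json
import urllib.request

# Assumption: hosted endpoint for the BLOOM model; the actual scripts
# in this repository may target a different URL or use a client library.
API_URL = "https://api-inference.huggingface.co/models/bigscience/bloom"


def build_request(prompt, token, max_new_tokens=5):
    """Build (but do not send) a POST request asking the model to continue `prompt`."""
    payload = {
        "inputs": prompt,
        # Assumed generation settings: a short continuation, prompt not echoed back.
        "parameters": {"max_new_tokens": max_new_tokens, "return_full_text": False},
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def query_bloom(prompt, token):
    """Send the request and return the decoded JSON response (requires network access)."""
    with urllib.request.urlopen(build_request(prompt, token)) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping request construction separate from the network call makes it possible to inspect the prompt and payload without an API token.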

For the languages that correspond to MATS languages, we utilised the same templates as in MATS. For all other languages, first-language speakers created the templates. For languages with no direct equivalent of the English template, first-language speakers proposed several templates, all of which we tested. Please consult the final paper to see which templates performed best in the experiments.

| Language | Prompt |
| --- | --- |
| EN | ``<a>'' is to ``<b>'' as ``<c>'' is to ``<d>''. |
| AL | ``<a>'' është për ``<b>'' ashtu si ``<c>'' për ``<d>''. |
| BM | ``<a>'' ye ``<b>'' ye i n’a fɔ ``<c>'' ye ``<d>'' ye |
| DE | ``<a>'' ist so zu ``<b>'' wie ``<c>'' zu <d> ist. |
| EL | το ``<a>'' είναι προς το ``<b>'' ό,τι το ``<c>'' προς το ``<d>''. |
| ES | ``<a>'' es a``<b>''como ``<c>'' es a ``<d>''. |
| FR | ``<a>'' est à ``<b>'' ce que ``<c>'' est à ``<d>''. |
| HE | ``<a>'' ל ``<b>'' כ ``<c>'' ל ``<d>'' |
| HR1 | ``<a>'' je za ``<b>'' kao što je ``<c>'' za ``<d>''. |
| HR2 | Riječ ``<a>'' je riječi ``<b>'' jednako što je riječ ``<c>'' riječi ``<d>''. |
| HR3 | Odnos između riječi ``<a>'' i ``<b>'' jednak je odnosu između riječi ``<c>'' i ``<d>''. |
| IT | ``<a>'' sta a ``<b>'' come ``<c>'' sta a ``<d>''. |
| LT | ``<a>'' yra ``<b>'' taip, kaip ``<c>'' yra ``<d>''. |
| MK1 | ``<a>'' е за ``<b>'' исто што и ``<c>'' за ``<d>''. |
| MK2 | Зборот ``<a>'' за зборот ``<b>'' е исто што и зборот ``<c>'' за зборот ``<d>''. |
| MK3 | Односот меѓу зборовите ``<a>'' и ``<b>'' е еднаков со односот меѓу зборовите ``<c>'' и ``<d>''. |
| PT | ``<a>'' está para ``<b>'' assim como ``<c>'' está para ``<d>''. |
| RO | ``<a>'' este pentru ``<b>'' cum ``<c>'' este pentru ``<d>''. |
| SK1 | Slovo ``<a>'' sa má k slovu ``<b>'' ako slovo ``<c>'' k slovu ``<d>''. |
| SK2 | Vzťah medzi slovami ``<a>'' a ``<b>'' je rovnaký ako medzi ``<c>'' a ``<d>''. |
| SK3 | ``<a>'' sa má k ``<b>'' ako ``<c>'' k ``<d>''. |
| SL1 | Beseda ``<a>'' je besedi ``<b>'' enako, kot je beseda ``<c>'' besedi ``<d>''. |
| SL2 | Beseda ``<a>'' je besedi ``<b>'' enako, kot je besedi ``<d>'' beseda ``<c>''. |
| SL3 | ``<a>'' in ``<b>'' sta kot ``<c>'' in ``<d>''. |
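The templates above use the placeholders `<a>`, `<b>`, `<c>` and `<d>`. A minimal sketch of how a template might be turned into a prompt is given below. The helper name `fill_template`, the example words, and the idea of cutting the prompt just before `<d>` for analogy completion are assumptions for illustration, not necessarily what the scripts in this repository do.

```python
# English template copied verbatim from the table above.
EN_TEMPLATE = "``<a>'' is to ``<b>'' as ``<c>'' is to ``<d>''."


def fill_template(template, a, b, c, d=None):
    """Substitute words for the <a>, <b>, <c> (and optionally <d>) placeholders.

    If `d` is None, return only the text before the <d> slot, so that a
    generative model can be asked to complete the analogy (an assumed setup).
    """
    filled = template.replace("<a>", a).replace("<b>", b).replace("<c>", c)
    if d is None:
        # Keep everything up to, but not including, the <d> placeholder.
        return filled[: filled.index("<d>")]
    return filled.replace("<d>", d)


# Hypothetical word pairs, chosen only to illustrate the substitution.
full = fill_template(EN_TEMPLATE, "dog", "animal", "rose", "flower")
prefix = fill_template(EN_TEMPLATE, "dog", "animal", "rose")
print(full)
print(prefix)
```

Cutting before `<d>` only works for templates in which `<d>` is the last slot. In a template such as SL2, where `<d>` precedes `<c>`, the prefix would lose the `<c>` word, so a different strategy would be needed there.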

Detailed results on analogy completion with the above templates as well as translation prediction with XLM-R and mBERT are provided in the LREC submission.

The detailed results achieved on analogy completion with BLOOM are reported in the following table, whereas the LREC submission includes only an overview graphic.

| Language | L01 hypern - animals | L02 hypern - misc | L03 hyponyms - misc | L04 meronyms - substance | L05 meronyms - member | L06 meronyms - part | L07 synonyms - intensity | L08 synonyms - exact | L09 antonyms - gradable | L10 antonyms - binary | Total |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| EN | 0.70 | 0.60 | 0.30 | 0.40 | 0.26 | 0.23 | 0.23 | 0.27 | 0.40 | 0.73 | 0.41 |
| AL | 0.33 | 0.50 | 0.37 | 0.27 | 0.07 | 0.10 | 0.07 | 0.10 | 0.10 | 0.07 | 0.20 |
| BM | 0.13 | 0.13 | 0.17 | 0.23 | 0.03 | 0.23 | 0.07 | 0.10 | 0.03 | 0.03 | 0.12 |
| DE | 0.47 | 0.57 | 0.13 | 0.30 | 0.10 | 0.33 | 0.03 | 0.13 | 0.07 | 0.27 | 0.24 |
| EL | 0.40 | 0.17 | 0.13 | 0.07 | 0.07 | 0.10 | 0.10 | 0.13 | 0.13 | 0.33 | 0.16 |
| ES | 0.90 | 0.53 | 0.27 | 0.33 | 0.20 | 0.20 | 0.10 | 0.13 | 0.30 | 0.47 | 0.34 |
| FR | 0.77 | 0.50 | 0.33 | 0.57 | 0.17 | 0.20 | 0.00 | 0.23 | 0.23 | 0.50 | 0.35 |
| HE | 0.17 | 0.17 | 0.10 | 0.10 | 0.03 | 0.10 | 0.10 | 0.07 | 0.13 | 0.10 | 0.11 |
| HR | 0.40 | 0.30 | 0.03 | 0.33 | 0.07 | 0.13 | 0.03 | 0.17 | 0.03 | 0.13 | 0.16 |
| IT | 0.60 | 0.67 | 0.17 | 0.23 | 0.07 | 0.13 | 0.13 | 0.13 | 0.17 | 0.47 | 0.28 |
| LT | 0.40 | 0.37 | 0.03 | 0.03 | 0.07 | 0.10 | 0.10 | 0.07 | 0.07 | 0.10 | 0.13 |
| MK | 0.37 | 0.47 | - | 0.10 | 0.03 | - | 0.10 | 0.07 | - | 0.23 | 0.20 |
| PT | 0.60 | 0.57 | 0.17 | 0.36 | 0.10 | 0.33 | 0.13 | 0.20 | 0.40 | 0.70 | 0.36 |
| SL | 0.36 | 0.33 | 0.07 | 0.30 | 0.10 | 0.17 | 0.10 | 0.07 | 0.10 | 0.13 | 0.17 |
| SK | 0.33 | 0.47 | 0.13 | 0.13 | 0.07 | 0.10 | 0.10 | 0.23 | 0.17 | 0.23 | 0.20 |
| RO | 0.27 | 0.57 | 0.13 | 0.13 | 0.03 | 0.07 | 0.17 | 0.13 | 0.10 | 0.10 | 0.17 |
| Total | 0.45 | 0.43 | 0.17 | 0.24 | 0.09 | 0.17 | 0.10 | 0.14 | 0.16 | 0.29 | 0.22 |
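For readers who want to sanity-check the table, the sketch below recomputes the per-language "Total" for three rows. The text does not state how "Total" was computed. For the rows checked here (EN, DE, MK), it is consistent with an unweighted mean over the available relation scores, with "-" entries skipped. That reading is an observation from the numbers, not a documented definition, and the helper name `row_mean` is hypothetical.

```python
from statistics import mean

# Scores copied from the table above (L01..L10); None marks a "-" entry.
rows = {
    "EN": [0.70, 0.60, 0.30, 0.40, 0.26, 0.23, 0.23, 0.27, 0.40, 0.73],
    "DE": [0.47, 0.57, 0.13, 0.30, 0.10, 0.33, 0.03, 0.13, 0.07, 0.27],
    "MK": [0.37, 0.47, None, 0.10, 0.03, None, 0.10, 0.07, None, 0.23],
}

# "Total" values reported in the table for the same rows.
reported_total = {"EN": 0.41, "DE": 0.24, "MK": 0.20}


def row_mean(scores):
    """Unweighted mean over the available scores, rounded to two decimals."""
    available = [s for s in scores if s is not None]
    return round(mean(available), 2)


for language, scores in rows.items():
    print(language, row_mean(scores), reported_total[language])
```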
