UniversalCEFR is an initiative to compile and index open, publicly accessible datasets based on the Common European Framework of Reference for Languages (CEFR) to enable open research in language proficiency assessment. Datasets indexed in UniversalCEFR are standardised into a unified data format to support consistent processing, analysis, and modeling across tasks and languages.
The full paper is available on arXiv: https://arxiv.org/abs/2506.01419
The full data directory, with information on each indexed dataset, is available in this repo: universalcefr-data-directory
If you're interested in a specific dataset or group of datasets from UniversalCEFR, you can access or download the transformed, standardised versions through the UniversalCEFR Hugging Face org: https://huggingface.co/UniversalCEFR
If you use any datasets indexed in UniversalCEFR, please cite the original dataset papers they are associated with. You can find them in the data directory repo.
Note that a few datasets in UniversalCEFR (EFCAMDAT, APA-LHA, and DEPlain) are not directly available from the UniversalCEFR Hugging Face org, because they require users to agree to their Terms of Use before using them for non-commercial research. Once you've done this and obtained the raw data, you can use the preprocessing Python scripts in the universal-cefr-experiments repository to transform the raw version into the UniversalCEFR version.
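To give a rough idea of what transforming a raw dataset into a standardised version can involve, here is a minimal Python sketch. It is not the actual preprocessing script from universal-cefr-experiments: the field names (`text`, `cefr_level`, `lang`, `source_name`), the raw keys (`text`, `level`), and the helper names are assumptions made for illustration only. Please check the data directory and the experiments repository for the real schema and scripts.

```python
from dataclasses import dataclass, asdict

# The six CEFR levels, from beginner (A1) to proficient (C2).
CEFR_LEVELS = ("A1", "A2", "B1", "B2", "C1", "C2")


@dataclass
class UnifiedRecord:
    """Hypothetical unified record; field names are illustrative assumptions."""
    text: str
    cefr_level: str
    lang: str
    source_name: str


def normalise_level(raw_label: str) -> str:
    """Trim and upper-case a raw label, then check it is a known CEFR level."""
    label = raw_label.strip().upper()
    if label not in CEFR_LEVELS:
        raise ValueError(f"Unrecognised CEFR level: {raw_label!r}")
    return label


def to_unified(raw: dict, source_name: str, lang: str) -> UnifiedRecord:
    """Map one raw row (assumed keys: 'text', 'level') to the unified record."""
    return UnifiedRecord(
        text=raw["text"].strip(),
        cefr_level=normalise_level(raw["level"]),
        lang=lang,
        source_name=source_name,
    )


if __name__ == "__main__":
    raw_row = {"text": "  I like to read books.  ", "level": "a2"}
    record = to_unified(raw_row, source_name="example-corpus", lang="en")
    print(asdict(record))
```

Keeping label normalisation in its own function means every source dataset goes through the same validation, so a typo or a non-CEFR label in the raw data raises an error early instead of ending up in the standardised output.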
UniversalCEFR started as a collaboration between researchers around the world who are interested in 1) building more open and accessible language proficiency assessment data resources that are also 2) standardised for maximum machine readability.
- Joseph Marvin Imperial (University of Bath, UK and National University Philippines)
- Abdullah Barayan (Cardiff University, UK)
- Regina Stodden (Bielefeld University, Germany)
- Rodrigo Wilkens (University of Exeter, UK)
- Ricardo Muñoz Sánchez (University of Gothenburg, Sweden)
- Lingyun Gao (UCLouvain, Belgium)
- Melissa Torgbi (University of Bath, UK)
- Dawn Knight (Cardiff University, UK)
- Gail Forey (University of Bath, UK)
- Reka R. Jablonkai (University of Bath, UK)
- Ekaterina Kochmar (MBZUAI, UAE)
- Robert Reynolds (Brigham Young University, USA)
- Eugénio Ribeiro (INESC-ID Lisboa and Instituto Universitário de Lisboa, Portugal)
- Horacio Saggion (Universitat Pompeu Fabra, Spain)
- Elena Volodina (University of Gothenburg, Sweden)
- Sowmya Vajjala (National Research Council, Canada)
- Thomas François (UCLouvain, Belgium)
- Fernando Alva-Manchego (Cardiff University, UK)
- Harish Tayyar Madabushi (University of Bath, UK)
We want to grow this community of researchers, language experts, and educators to further advance openly accessible CEFR/language proficiency assessment datasets for all.
If you're interested in this direction, or have open datasets you want to add to UniversalCEFR for better exposure and utility to researchers, please fill out this form.
When we index your dataset in UniversalCEFR, we will cite you and the paper or project the dataset came from across the UniversalCEFR platforms.
For questions, concerns, clarifications, and issues, please contact Joseph Marvin Imperial (jmri20@bath.ac.uk).
Please use the following information when citing UniversalCEFR:
BibTeX Format:
@article{imperial2025universalcefr,
title = {{UniversalCEFR: Enabling Open Multilingual Research on Language Proficiency Assessment}},
author = {Joseph Marvin Imperial and Abdullah Barayan and Regina Stodden and Rodrigo Wilkens
and Ricardo Muñoz Sánchez and Lingyun Gao and Melissa Torgbi and Dawn Knight and Gail Forey
and Reka R. Jablonkai and Ekaterina Kochmar and Robert Reynolds and Eugénio Ribeiro and
Horacio Saggion and Elena Volodina and Sowmya Vajjala and Thomas François and
Fernando Alva-Manchego and Harish Tayyar Madabushi},
journal = {arXiv preprint arXiv:2506.01419},
year = {2025},
url = {https://arxiv.org/abs/2506.01419}}
APA Format:
Imperial, J. M., Barayan, A., Stodden, R., Wilkens, R., Muñoz Sánchez, R., Gao, L., Torgbi, M., Knight, D., Forey, G., Jablonkai, R. R., Kochmar, E., Reynolds, R., Ribeiro, E., Saggion, H., Volodina, E., Vajjala, S., François, T., Alva-Manchego, F., & Tayyar Madabushi, H. (2025). UniversalCEFR: Enabling Open Multilingual Research on Language Proficiency Assessment. arXiv. https://arxiv.org/abs/2506.01419