UniversalCEFR

UniversalCEFR: Enabling Open Multilingual Research on Language Proficiency Assessment

UniversalCEFR is an initiative to compile and index open, publicly accessible datasets based on the CEFR (Common European Framework of Reference) to enable open research on language proficiency assessment. Datasets indexed in UniversalCEFR are standardized into a unified data format to support consistent processing, analysis, and modeling across tasks and languages.

The full paper is available on arXiv: https://arxiv.org/abs/2506.01419

📕 Accessing UniversalCEFR

The full informative data directory is listed in this repo: universalcefr-data-directory

If you're interested in a specific dataset or group of datasets from UniversalCEFR, you can access or download the transformed, standardised versions through the UniversalCEFR Hugging Face org: https://huggingface.co/UniversalCEFR
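As a minimal sketch of the download step, assuming the standardised datasets can be read with the Hugging Face `datasets` library: the org name `UniversalCEFR` comes from the URL above, while the helper names and the dataset name `example_dataset` are hypothetical placeholders. Check the org page for the actual dataset names.

```python
ORG = "UniversalCEFR"  # org name taken from the Hugging Face URL above


def repo_id(dataset_name: str) -> str:
    """Build the '<org>/<dataset>' identifier used on the Hugging Face Hub."""
    name = dataset_name.strip()
    if not name or "/" in name:
        raise ValueError(f"expected a bare dataset name, got {dataset_name!r}")
    return f"{ORG}/{name}"


def load_universalcefr(dataset_name: str):
    """Download one indexed dataset (needs `pip install datasets` and network access)."""
    # Imported lazily so that repo_id() works without the library installed.
    from datasets import load_dataset
    return load_dataset(repo_id(dataset_name))


if __name__ == "__main__":
    # "example_dataset" is a placeholder, not a real dataset name.
    print(repo_id("example_dataset"))
```

Calling `load_universalcefr("<dataset-name>")` with a real name from the org page should return the dataset splits, provided the dataset is stored in a format the library can detect automatically.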

If you use any datasets indexed in UniversalCEFR, please cite the original dataset papers they are associated with. You can find them in the data directory repo.

Note that a few datasets in UniversalCEFR (EFCAMDAT, APA-LHA, and DEPlain) are not directly available from the UniversalCEFR Hugging Face org because they require users to agree to their Terms of Use before using them for non-commercial research. Once you've done this, you can use the preprocessing Python scripts in the universal-cefr-experiments repository to transform the raw version into the UniversalCEFR version.
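To give a rough idea of what such a transformation involves, here is a small sketch that normalises a raw proficiency label and wraps a text in a flat record. It is not the actual preprocessing script: the function names and the field names `text`, `lang`, and `cefr_level` are assumptions made for illustration, and the official schema is the one documented in the data directory repo. Only the six CEFR levels themselves are standard.

```python
CEFR_LEVELS = ("A1", "A2", "B1", "B2", "C1", "C2")  # the six standard CEFR levels


def normalise_level(raw: str) -> str:
    """Map a raw label such as ' b2 ' to its canonical upper-case form."""
    level = raw.strip().upper()
    if level not in CEFR_LEVELS:
        raise ValueError(f"not a CEFR level: {raw!r}")
    return level


def to_unified(text: str, lang: str, raw_level: str) -> dict:
    """Build one record; the field names are illustrative, not the official schema."""
    return {
        "text": text.strip(),
        "lang": lang.strip().lower(),
        "cefr_level": normalise_level(raw_level),
    }


if __name__ == "__main__":
    print(to_unified("  Hello, my name is Ana. ", "EN", "a1"))
```

Rejecting unknown labels with an error, rather than silently passing them through, keeps mislabelled rows from entering the standardised data unnoticed.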

🤝 Join the UniversalCEFR Initiative

Initiators and Collaborators

UniversalCEFR started as a collaboration between researchers around the world who are interested in building language proficiency assessment data resources that are 1) more open and accessible and 2) standardised for maximum machine readability.

Joseph Marvin Imperial (University of Bath, UK and National University Philippines)
Abdullah Barayan (Cardiff University, UK)
Regina Stodden (Bielefeld University, Germany)
Rodrigo Wilkens (University of Exeter, UK)
Ricardo Muñoz Sánchez (University of Gothenburg, Sweden)
Lingyun Gao (UCLouvain, Belgium)
Melissa Torgbi (University of Bath, UK)
Dawn Knight (Cardiff University, UK)
Gail Forey (University of Bath, UK)
Reka R. Jablonkai (University of Bath, UK)
Ekaterina Kochmar (MBZUAI, UAE)
Robert Reynolds (Brigham Young University, USA)
Eugénio Ribeiro (INESC-ID Lisboa and Instituto Universitário de Lisboa, Portugal)
Horacio Saggion (Universitat Pompeu Fabra, Spain)
Elena Volodina (University of Gothenburg, Sweden)
Sowmya Vajjala (National Research Council, Canada)
Thomas François (UCLouvain, Belgium)
Fernando Alva-Manchego (Cardiff University, UK)
Harish Tayyar Madabushi (University of Bath, UK)

How to Join?

We want to grow this community of researchers, language experts, and educators to further advance openly accessible CEFR/language proficiency assessment datasets for all.

If you're interested in this direction, or have one or more open datasets you want to add to UniversalCEFR for better exposure and utility to researchers, please fill out this form.

When we index your dataset to UniversalCEFR, we will cite you and the paper/project from which the dataset came across the UniversalCEFR platforms.

Contact

For questions, concerns, clarifications, and issues, please contact Joseph Marvin Imperial (jmri20@bath.ac.uk).

📜 Reference

Please use the following information when citing UniversalCEFR:

BibTeX Format:

@article{imperial2025universalcefr,
  title = {{UniversalCEFR: Enabling Open Multilingual Research on Language Proficiency Assessment}},
  author = {Joseph Marvin Imperial and Abdullah Barayan and Regina Stodden and Rodrigo Wilkens 
    and Ricardo Muñoz Sánchez and Lingyun Gao and Melissa Torgbi and Dawn Knight and Gail Forey 
    and Reka R. Jablonkai and Ekaterina Kochmar and Robert Reynolds and Eugénio Ribeiro and 
    Horacio Saggion and Elena Volodina and Sowmya Vajjala and Thomas François and 
    Fernando Alva-Manchego and Harish Tayyar Madabushi},
  journal = {arXiv preprint arXiv:2506.01419},
  year = {2025},
  url = {https://arxiv.org/abs/2506.01419}}

APA Format:

Imperial, J. M., Barayan, A., Stodden, R., Wilkens, R., Muñoz Sánchez, R., Gao, L., Torgbi, M., Knight, D., Forey, G., Jablonkai, R. R., Kochmar, E., Reynolds, R., Ribeiro, E., Saggion, H., Volodina, E., Vajjala, S., François, T., Alva-Manchego, F., & Tayyar Madabushi, H. (2025). UniversalCEFR: Enabling Open Multilingual Research on Language Proficiency Assessment. arXiv. https://arxiv.org/abs/2506.01419

