[ACL'24 Findings] Teaching Large Language Models an Unseen Language on the Fly

luciusssss/ZhuangBench

Teaching Large Language Models an Unseen Language on the Fly

Data and code for the following papers:

ACL'24 Findings (Full-Length Paper): Teaching Large Language Models an Unseen Language on the Fly

ICLR'24 Tiny Paper: Can LLMs Learn a New Language on the Fly? A Case Study on Zhuang

Project Website

Dataset

We present ZhuangBench, a collection of NLP resources for Zhuang (壮语), a low-resource language spoken in China.

It consists of a Zhuang-Chinese dictionary, a Zhuang-Chinese parallel corpus, and a Zhuang-Chinese machine translation test set.

Important: Preventing Test Set Contamination

We encrypted the source files of ZhuangBench in data.zip to prevent test set contamination. The password is zhuangbench.
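For readers who prefer to unpack the archive from a script, here is a minimal sketch using Python's standard `zipfile` module. The helper name `extract_encrypted_zip` and the output folder `data` are hypothetical. The README does not say which encryption scheme the archive uses: `zipfile` can only decrypt the legacy ZIP encryption, so if `data.zip` turns out to be AES-encrypted, a separate tool such as 7-Zip would be needed.

```python
import zipfile
from pathlib import Path


def extract_encrypted_zip(zip_path, out_dir, password):
    """Extract a password-protected ZIP archive and return its member names.

    Hypothetical helper, not part of the repository's code.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        # zipfile expects the password as bytes, not str.
        archive.extractall(path=out_dir, pwd=password.encode("utf-8"))
        return archive.namelist()


# Example call (archive name and password as stated above;
# "data" is an assumed output folder):
# extract_encrypted_zip("data.zip", "data", "zhuangbench")
```

Keeping the files inside an encrypted archive, and extracting them only locally, keeps the plain-text test set out of web crawls.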

List of files:

  • dictionary_za2zh.jsonl: Zhuang-Chinese dictionary.
  • dictionary_zh2za.jsonl: Chinese-Zhuang dictionary.
  • parallel_corpus.json: Zhuang-Chinese parallel corpus.
  • test_translation_set.json: Zhuang-Chinese machine translation test set.
  • preprocessed/dictionary_za2zh_web+giza.jsonl: Zhuang-Chinese dictionary augmented with BLI from Giza++.
  • preprocessed/dictionary_zh2za_web+giza+synonym.jsonl: Chinese-Zhuang dictionary augmented with BLI from Giza++ and synonyms.
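The files listed above come in two formats, `.jsonl` and `.json`. The following is a minimal sketch of how they might be loaded once extracted. The helper names `load_jsonl` and `load_json` are hypothetical, and the sketch makes no assumption about the field names inside each record, since the README does not document them.

```python
import json


def load_jsonl(path):
    """Read a JSON Lines file: one JSON value per non-empty line."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records


def load_json(path):
    """Read a file that holds a single JSON document."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Example calls (paths are assumptions about where the archive was extracted):
# dictionary = load_jsonl("data/dictionary_za2zh.jsonl")
# corpus = load_json("data/parallel_corpus.json")
```

Opening the files with an explicit `encoding="utf-8"` avoids platform-dependent defaults, which matters here because the data contains Chinese text.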

Beta Version

Our ICLR'24 Tiny Paper uses a beta version of the dataset, ZhuangBench-Beta. We provide the data in data-beta-version.zip (password: zhuangbench-beta). This data is for archival purposes only. We recommend using the newer data in data.zip, which is larger and includes typo corrections.

Code

We provide the code for DiPMT++ to reproduce the results in the paper.

Install the dependencies:

pip install -r requirements.txt

Use the scripts in ./scripts to run the LLMs and evaluate the results.

License

The license for the code and data is MIT.

Citation

@inproceedings{zhang2024teaching,
  title={Teaching Large Language Models an Unseen Language on the Fly},
  author={Zhang, Chen and Liu, Xiao and Lin, Jiuheng and Feng, Yansong},
  booktitle={Findings of the Association for Computational Linguistics: ACL 2024},
  pages={8783--8800},
  year={2024}
}
@inproceedings{zhang2024can,
  title={Can {LLM}s Learn a New Language on the Fly? A Case Study on Zhuang},
  author={Chen Zhang and Mingxu Tao and Quzhe Huang and Zhibin Chen and Yansong Feng},
  booktitle={The Second Tiny Papers Track at ICLR 2024},
  year={2024},
}
