LUMA is a multimodal dataset designed for benchmarking multimodal learning and multimodal uncertainty quantification. This dataset includes audio, text, and image modalities, enabling researchers to study uncertainty quantification in multimodal classification settings.
The paper was presented at the SIGIR 2025 conference.
LUMA consists of:
- Audio Modality: WAV files of people pronouncing the class labels of the selected 50 classes.
- Text Modality: Short text passages about the class labels, generated using large language models.
- Image Modality: Images from a 50-class subset of the CIFAR-10/100 datasets, as well as generated images from the same distribution.
The dataset allows controlled injection of uncertainties, facilitating the study of uncertainty quantification in multimodal data.
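To make the idea of controlled uncertainty injection concrete, here is a minimal sketch of one common kind, label noise, where a chosen fraction of labels is flipped to a different class. This is an illustration only and not the logic used by the LUMA tooling; the function name `inject_label_noise` and its parameters are assumptions made for this example.

```python
import random


def inject_label_noise(labels, num_classes, noise_ratio, seed=0):
    """Return a copy of `labels` with a fixed fraction flipped to another class.

    Hypothetical helper for illustration; not part of the LUMA code base.
    """
    rng = random.Random(seed)  # seeded so the injected noise is reproducible
    noisy = list(labels)
    n_flip = round(noise_ratio * len(noisy))
    # Pick distinct positions to corrupt, then replace each label
    # with a uniformly chosen class that differs from the original.
    for idx in rng.sample(range(len(noisy)), n_flip):
        candidates = [c for c in range(num_classes) if c != noisy[idx]]
        noisy[idx] = rng.choice(candidates)
    return noisy


# Toy labels over 50 classes, matching the number of classes in LUMA.
labels = [i % 50 for i in range(1000)]
noisy = inject_label_noise(labels, num_classes=50, noise_ratio=0.1)
changed = sum(a != b for a, b in zip(labels, noisy))
print(f"{changed} of {len(labels)} labels were flipped")
```

Because the noise ratio and the seed are explicit parameters, the same corrupted labels can be regenerated exactly, which is what makes the amount of uncertainty "controlled".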
Prerequisites:
- Anaconda / Miniconda
- Git
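Before starting the installation, it can help to confirm that these tools are on the `PATH`. The following is a small sketch using only the Python standard library; the helper name `find_missing_tools` is an assumption made for this example, and it assumes that the executables are called `conda` and `git`.

```python
import shutil

# Executable names assumed for the tools listed above.
REQUIRED_TOOLS = ["conda", "git"]


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required tools were found.")
```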
Clone the repository and navigate into the project directory:
git clone https://github.com/bezirganyan/LUMA.git
cd LUMA
Install and activate the conda environment:
conda env create -f environment.yml
conda activate luma_env
Make sure you have git-lfs installed (https://git-lfs.com). If you followed the previous steps, conda will have installed it automatically. Then run:
git lfs install
Download the dataset into the data folder. You can choose a different folder name if you update the configuration files accordingly; data is the folder used in the default configurations.
git clone https://huggingface.co/datasets/bezirganyan/LUMA data
The provided Python tool allows compiling different versions of the dataset with various amounts and types of uncertainties.
To compile the dataset with specified uncertainties, create or edit a configuration file similar to the files in the cfg directory, and run:
python compile_dataset.py -c <your_yaml_config_file>
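When several dataset variants are needed, the command above can be generated for each configuration file in turn. The sketch below only builds and prints the commands (a dry run). It assumes that the configuration files in the cfg directory end in `.yml` or `.yaml`, and the helper names are hypothetical; only the `-c` flag of `compile_dataset.py` comes from the instructions above.

```python
import sys
from pathlib import Path


def build_compile_command(config_path):
    """Build the argument list for compiling one dataset variant."""
    return [sys.executable, "compile_dataset.py", "-c", str(config_path)]


def commands_for_configs(cfg_dir="cfg"):
    """Build one command per YAML file found in `cfg_dir` (extension is assumed)."""
    cfg = Path(cfg_dir)
    configs = sorted(cfg.glob("*.yml")) + sorted(cfg.glob("*.yaml"))
    return [build_compile_command(path) for path in configs]


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for command in commands_for_configs():
        print(" ".join(command))
```

To actually run a command, pass it to `subprocess.run(command, check=True)` from the repository root, so that a failing compilation stops the loop instead of being silently ignored.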
After compiling the dataset, you can use the LUMADataset class from the dataset.py file. An example of its usage can be found in the run_baselines.py file.
If you want all the data without sampling, noise, or alignment (for example, to perform your own alignment or to use the unaligned data for other tasks), run the following command:
python get_unprocessed_data.py
If you use the dataset, please cite our paper with:
@inproceedings{luma_dataset2025,
title={LUMA: A Benchmark Dataset for Learning from Uncertain and Multimodal Data},
author={Grigor Bezirganyan and Sana Sellami and Laure Berti-Équille and Sébastien Fournier},
booktitle={Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval},
year={2025}
}
