LUMA: Learning from Uncertain and Multimodal Data

Overview

LUMA is a multimodal dataset designed for benchmarking multimodal learning and multimodal uncertainty quantification. This dataset includes audio, text, and image modalities, enabling researchers to study uncertainty quantification in multimodal classification settings.
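As an illustration of what uncertainty quantification in a multimodal classifier can look like, the sketch below averages per-modality class probabilities and scores the result with Shannon entropy. It is a generic example and not the method used in the LUMA paper or its baselines; the function names `predictive_entropy` and `average_fusion` are hypothetical.

```python
import math


def predictive_entropy(probs):
    """Shannon entropy (in nats) of one categorical predictive distribution."""
    # Zero-probability classes contribute nothing, so they are skipped.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def average_fusion(per_modality_probs):
    """Late fusion: average the class probabilities across modalities."""
    n_modalities = len(per_modality_probs)
    return [sum(column) / n_modalities for column in zip(*per_modality_probs)]


# Hypothetical predictions for one sample over three classes.
audio = [0.7, 0.2, 0.1]
text = [0.6, 0.3, 0.1]
image = [0.1, 0.1, 0.8]  # the image modality disagrees with the other two

fused = average_fusion([audio, text, image])
print(fused, predictive_entropy(fused))
```

Higher entropy on the fused prediction indicates that the modalities disagree or that each is individually unsure, which is one simple signal such a benchmark can be used to evaluate.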

The paper was presented at the SIGIR 2025 conference. (Link to the pre-print, slides)

Dataset Summary

LUMA consists of:

Audio Modality: WAV files of people pronouncing the labels of the 50 selected classes.
Text Modality: Short text passages about the class labels, generated using large language models.
Image Modality: Images from a 50-class subset of the CIFAR-10/100 datasets, as well as generated images from the same distribution.

The dataset allows controlled injection of uncertainties, facilitating the study of uncertainty quantification in multimodal data.
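To make "controlled injection" concrete, here is a minimal sketch of one common kind of uncertainty: flipping a fixed fraction of labels to a different class. It assumes integer class labels and is not LUMA's own implementation (that lives in compile_dataset.py and the configuration files); `inject_label_noise` and its parameters are hypothetical names.

```python
import random


def inject_label_noise(labels, num_classes, noise_rate, seed=0):
    """Return a copy of `labels` with a fraction `noise_rate` flipped.

    Each selected label is replaced by a different class chosen uniformly
    at random, so exactly round(noise_rate * len(labels)) entries change.
    """
    rng = random.Random(seed)  # a fixed seed keeps the corruption reproducible
    noisy = list(labels)
    n_flip = round(noise_rate * len(noisy))
    for i in rng.sample(range(len(noisy)), n_flip):
        alternatives = [c for c in range(num_classes) if c != noisy[i]]
        noisy[i] = rng.choice(alternatives)
    return noisy


clean = [i % 50 for i in range(200)]  # 200 samples over 50 classes
noisy = inject_label_noise(clean, num_classes=50, noise_rate=0.25, seed=1)
print(sum(a != b for a, b in zip(clean, noisy)))  # number of flipped labels
```

Fixing the seed and the noise rate is what makes the injection "controlled": the same corrupted dataset can be regenerated and compared across models.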

Getting Started

Prerequisites

Anaconda / Miniconda
Git

Installation

Clone the repository and navigate into the project directory:

git clone https://github.com/bezirganyan/LUMA.git 
cd LUMA

Install and activate the conda environment:

conda env create -f environment.yml
conda activate luma_env

Make sure git-lfs is installed (https://git-lfs.com); if you followed the previous steps, conda has already installed it. Then run:

git lfs install

Download the dataset into the data folder. This is the folder the default configurations expect; if you choose another folder name, update the config files accordingly:

git clone https://huggingface.co/datasets/bezirganyan/LUMA data
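If git-lfs was not active during the clone, large files are left as small text pointers instead of real data. The sketch below scans a folder for such pointers. It relies on the Git LFS pointer format, in which a pointer file begins with a `version https://git-lfs.github.com/spec/v1` line and stays under 1024 bytes; the helper name `find_lfs_pointers` is hypothetical and not part of this repository.

```python
from pathlib import Path

# First line of every Git LFS pointer file.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def find_lfs_pointers(root):
    """Return files under `root` that are still Git LFS pointers."""
    root = Path(root)
    pointers = []
    for path in sorted(root.rglob("*")):
        # Skip Git's own metadata directory.
        if ".git" in path.relative_to(root).parts:
            continue
        # Pointer files are tiny, so larger files need not be opened.
        if path.is_file() and path.stat().st_size < 1024:
            with path.open("rb") as fh:
                if fh.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX:
                    pointers.append(path)
    return pointers


# Example usage after cloning the dataset into the default folder:
#     missing = find_lfs_pointers("data")
#     if missing:
#         print(f"{len(missing)} files were not fetched; try `git lfs pull` inside data")
```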

Usage

The provided Python tool allows compiling different versions of the dataset with various amounts and types of uncertainties.

To compile the dataset with specified uncertainties, create or edit a configuration file similar to those in the cfg directory, and run:

python compile_dataset.py -c <your_yaml_config_file>
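When several dataset variants are needed, the command above can be repeated once per configuration file. The following is a minimal sketch, assuming the configuration files end in .yml or .yaml and that it is run from the repository root; `build_compile_commands` and `compile_all` are hypothetical helpers, and only the `-c` flag shown above is used.

```python
import subprocess
import sys
from pathlib import Path


def build_compile_commands(cfg_dir="cfg", script="compile_dataset.py"):
    """Build one `python compile_dataset.py -c <config>` command per YAML file."""
    configs = sorted(
        p for p in Path(cfg_dir).iterdir() if p.suffix in {".yml", ".yaml"}
    )
    # sys.executable reuses the interpreter of the active conda environment.
    return [[sys.executable, script, "-c", str(cfg)] for cfg in configs]


def compile_all(cfg_dir="cfg"):
    """Run the compile step for every configuration, stopping on the first failure."""
    for cmd in build_compile_commands(cfg_dir):
        print("Running:", " ".join(cmd))
        subprocess.run(cmd, check=True)


# Example usage from the repository root:
#     compile_all("cfg")
```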

Usage in Deep Learning models

After compiling the dataset, you can use the LUMADataset class from dataset.py. A usage example can be found in run_baselines.py.

Unprocessed & Unaligned data

To get all the data without sampling, noise, or alignment (for example, to perform your own alignment or to use the unaligned data for other tasks), run:

python get_unprocessed_data.py

If you use the dataset, please cite our paper with:

@inproceedings{luma_dataset2025,
  title={LUMA: A Benchmark Dataset for Learning from Uncertain and Multimodal Data}, 
  author={Grigor Bezirganyan and Sana Sellami and Laure Berti-Équille and Sébastien Fournier},
  booktitle={Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  year={2025}
}

