dataloader-benchmarks

This is a benchmarking suite for data loading libraries.

Citation

This repository contains the code and experiments accompanying the paper:

@inproceedings{ofeidis2024overview,
  title={An overview of the data-loader landscape: Comparative performance analysis},
  author={Ofeidis, Iason and Kiedanski, Diego and Tassiulas, Leandros},
  booktitle={2024 IEEE International Conference on Big Data (BigData)},
  pages={360--367},
  year={2024},
  organization={IEEE}
}

If you use this repository in your research or find it helpful, please cite our work.

Set up

Environment variables

Create a .env file with the following information:

DOCKER_NAME=<name of org>/<name of container>:<version>
DYNACONF_AWS_ACCESS_KEY_ID=<aws id>
DYNACONF_AWS_SECRET_ACCESS_KEY=<aws secret>
DYNACONF_BUCKET_NAME=<bucket name in aws> # needs to exist before running experiments
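
Before building the container, it can help to confirm that the .env file defines all four variables. The following is a minimal sketch and not part of the repository: the helper names `parse_env` and `missing_keys` are hypothetical, the sample values are placeholders, and the parser assumes only the simple `KEY=value # comment` form shown above (values containing `#` are not supported).

```python
REQUIRED_KEYS = (
    "DOCKER_NAME",
    "DYNACONF_AWS_ACCESS_KEY_ID",
    "DYNACONF_AWS_SECRET_ACCESS_KEY",
    "DYNACONF_BUCKET_NAME",
)


def parse_env(text):
    """Parse simple KEY=value lines, ignoring blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        # Assumption: "#" always starts a comment, so drop everything after it.
        line = raw_line.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        values[key.strip()] = value.strip()
    return values


def missing_keys(text):
    """Return the required keys that are absent or empty in the .env text."""
    values = parse_env(text)
    return [key for key in REQUIRED_KEYS if not values.get(key)]


# Hypothetical sample content; real values come from your own .env file.
sample = (
    "DOCKER_NAME=myorg/bench:latest\n"
    "DYNACONF_BUCKET_NAME=my-bucket  # needs to exist\n"
)
print(missing_keys(sample))
```

To check a real file, pass its contents instead of the sample, for example `missing_keys(open(".env").read())`.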

Running locally

1. Clone this repository.
2. Build the Docker container: ./scripts/build.sh
3. Run the container: ./scripts/run.sh
4. Run all the experiments: ./experiments/run_all.sh

Running on AWS

Create the file ~/.aws/credentials with the following content:

[default]
aws_access_key_id = <aws id> 
aws_secret_access_key = <aws secret>
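
The credentials file uses INI syntax, so its shape can be checked with Python's standard `configparser` module before launching anything. This is a rough sketch under that assumption; `check_credentials` is a hypothetical helper, and the example content is a placeholder rather than a real key.

```python
import configparser

REQUIRED_OPTIONS = ("aws_access_key_id", "aws_secret_access_key")


def check_credentials(text, profile="default"):
    """Return the required options that are missing or empty in the profile."""
    # Interpolation is disabled so that a "%" in a secret is read literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    if not parser.has_section(profile):
        return list(REQUIRED_OPTIONS)
    return [
        option
        for option in REQUIRED_OPTIONS
        if not parser.get(profile, option, fallback="").strip()
    ]


# Placeholder content with an empty secret, to show what gets reported.
example = """
[default]
aws_access_key_id = EXAMPLEKEYID
aws_secret_access_key =
"""
print(check_credentials(example))
```

For the real file, read `~/.aws/credentials` (for instance via `pathlib.Path.home() / ".aws" / "credentials"`) and pass its text to the function. An empty result only means the file is well formed, not that the keys are valid for the bucket.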

1. Make sure that an S3 bucket exists with the name defined in the .env file (DYNACONF_BUCKET_NAME) and that it is accessible with the credentials provided.
2. Download the get_ecr script, which fetches the latest Docker image: wget https://raw.githubusercontent.com/kiedanski/dataloader-benchmarks/main/scripts/get_ecr.sh && chmod +x get_ecr.sh
3. Pull the latest Docker image to the local machine: ./get_ecr.sh
4. Download the run script: wget https://raw.githubusercontent.com/kiedanski/dataloader-benchmarks/main/scripts/run.sh && chmod +x run.sh
5. Execute the run script to enter the Docker container: ./run.sh
6. Inside the container, run all the experiments: ./experiments/run_all.sh

Collecting results and plotting

Inside the container, run:

python src/plots/download_results.py
python src/plots/generate_plots.py

Implemented Libraries and Datasets

		PyTorch	FFCV	Hub	Deep Lake	TorchData	WebDataset	Squirrel	NVIDIA DALI
CIFAR-10	default	✅	✅	✅	✅	✅	✅	✅	✅
	remote	❌	✅	✅	✅	❌	✅	❓	❌
	filtering	✅	❓	✅	✅	✅	✅	❓	❌
	multi-gpu	✅	✅	❌	✅	✅	✅	✅	✅
RANDOM	default	✅	✅	✅	✅	✅	✅	✅	✅
	remote	❌	✅	✅	✅	❌	✅	❓	❌
	filtering	✅	❓	✅	✅	✅	✅	❓	❌
	multi-gpu	✅	✅	❌	✅	✅	✅	✅	✅
COCO	default	✅	❌	✅	✅	✅	✅	✅	✅
	remote	❌	❌	✅	✅	❌	✅	❓	❌
	filtering	✅	❌	✅	✅	✅	✅	❓	❌
	multi-gpu	✅	❌	❌	✅	✅	✅	✅	✅
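
When choosing which libraries to include in an experiment, the table above can be queried programmatically. The snippet below is an illustrative transcription of it, not code from the repository: `SUPPORT` and `supported` are hypothetical names, dataset and library names use their usual capitalization, and ❓ is treated as "unknown" because the table does not define it.

```python
LIBRARIES = [
    "PyTorch", "FFCV", "Hub", "Deep Lake",
    "TorchData", "WebDataset", "Squirrel", "NVIDIA DALI",
]

# One flag per library, in LIBRARIES order: y = ✅, n = ❌, ? = ❓ (unknown).
SUPPORT = {
    ("CIFAR-10", "default"): "yyyyyyyy",
    ("CIFAR-10", "remote"): "nyyyny?n",
    ("CIFAR-10", "filtering"): "y?yyyy?n",
    ("CIFAR-10", "multi-gpu"): "yynyyyyy",
    ("RANDOM", "default"): "yyyyyyyy",
    ("RANDOM", "remote"): "nyyyny?n",
    ("RANDOM", "filtering"): "y?yyyy?n",
    ("RANDOM", "multi-gpu"): "yynyyyyy",
    ("COCO", "default"): "ynyyyyyy",
    ("COCO", "remote"): "nnyyny?n",
    ("COCO", "filtering"): "ynyyyy?n",
    ("COCO", "multi-gpu"): "ynnyyyyy",
}


def supported(dataset, mode):
    """List the libraries marked ✅ for the given dataset and mode."""
    flags = SUPPORT[(dataset, mode)]
    return [lib for lib, flag in zip(LIBRARIES, flags) if flag == "y"]


print(supported("CIFAR-10", "remote"))
# ['FFCV', 'Hub', 'Deep Lake', 'WebDataset']
```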
