URT is a tool for automatically downloading datasets from diverse sources and converting them to the standardized BIDS format. It supports seamless downloads of whole dataset collections with one simple command.
Currently it supports Synapse, TCIA and OpenNeuro as sources, but the tool is built with modularity and extensibility in mind, so new sources are easy to integrate. Adding new datasets from already implemented sources is easy and only requires some metadata about the format and source of the dataset. For a subset of all downloadable datasets a conversion mapping exists, enabling automatic download and conversion to the BIDS format using the Bidscoin library. The collection of automatically convertible datasets is expected to grow in the future to relieve researchers from the manual labor involved in this process.
All built-in downloaders emphasize fault tolerance and data integrity.
- Unified Retrieval Tool (URT)
- Prerequisites
- Usage
- Supported Datasets
- Accessing restricted datasets
- Architecture Details
- Known Problems
- Changelog
It is strongly suggested to use URT with docker or singularity to avoid dependency conflicts. This also reduces the number of prerequisites. Nevertheless, the tool can be used without container software as well.
For the basic usage it is highly recommended to use conda or mamba for managing the environment. The tool officially supports Linux, but other operating systems like macOS or Windows (with WSL) might work as well.
- Conda or Mamba
- OPTIONAL: aws cli (required for downloads from OpenNeuro)
- OPTIONAL: aspera-cli (required for downloads from TCIA which are stored as NIfTI)
First download the URT repo

```bash
git clone https://github.com/LuxImagingAI/URT.git
```

and navigate to the URT directory:
```bash
cd URT
```

The environment.yaml file can be used to automatically set up an environment with all the required dependencies, which avoids dependency conflicts. Use
```bash
conda env create -f environment.yaml
```

to create the environment and activate it via:
```bash
conda activate URT
```

The environment contains all required dependencies for downloading datasets from Synapse, downloading DICOM datasets from TCIA and performing the BIDS conversion. If datasets from OpenNeuro or NIfTI datasets from TCIA are required, the optional dependencies need to be installed as well. Unfortunately it is not possible to include them in the conda environment. Docker/singularity is the preferred way if you want to avoid installing these dependencies.
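If you need the optional dependencies anyway, one possible way to install them is sketched below (this assumes pip and Ruby's gem are available; consult the official aws-cli and aspera-cli documentation for the recommended installation method on your system):

```bash
# One possible way to install the optional dependencies.
# Assumption: pip and Ruby's gem are available on your system.
pip install awscli        # needed for downloads from OpenNeuro
gem install aspera-cli    # needed for TCIA datasets stored as NIfTI
```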
Docker version 4.24 or newer
Singularity version 3.8.1 or newer
--dataset:
The name of the dataset which should be downloaded.
Alternatively a .yaml file containing a list of datasets for batch-processing of multiple datasets can be given. An example can be seen in "config/dataset_list.yaml" (a sketch is also shown after this argument list).
A list of the form "[DATASET_1, DATASET_2, ...]" can be given as well.
--output_dir:
The output directory for the data.
Default: "./output"
--temp_dir:
The directory where the data is stored temporarily, i.e. during the download (before compression). Can be useful in situations where not much space is left on the output drive.
Default: "./temp"
--cache_dir:
Directory used for the log output and the HTTP cache.
Default: "~/.cache/URT"
--credentials:
Path to credentials.yaml file containing the credentials. The credentials file is only needed in cases where datasets with limited access are downloaded.
Default: "config/credentials.yaml"
--bids:
If this argument is given, the dataset(s) will be converted to BIDS after the download (provided the required conversion information is available in datasets/datasets.yaml).
Default: False
--compress:
If this argument is added, the output folder will be compressed automatically.
Default: False
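The batch file mentioned for "--dataset" is a plain YAML list of dataset names; a minimal sketch is shown below. The shipped "config/dataset_list.yaml" is authoritative, and the dataset names here are only placeholders.

```yaml
# Hypothetical sketch of a dataset list for batch processing.
# The shipped config/dataset_list.yaml is authoritative; names below are placeholders.
- Brain-Tumor-Progression
- Brats-2023-GLI
- BTC_preop
```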
The following command will start URT with the given arguments.
```bash
python3 URT.py --dataset DATASET [--output_dir OUTPUT_DIR] [--temp_dir TEMP_DIR] [--cache_dir CACHE_DIR] [--credentials CREDENTIALS_FILE] [--bids] [--compress]
```

URT will choose the appropriate downloader for the given collection (based on datasets/datasets.yaml). If the collection cannot be found, URT falls back to the NBIA REST API (TCIA) and attempts to download the collection from there. BIDS conversion is not possible in this case.
If datasets from OpenNeuro or from TCIA via Aspera are downloaded, make sure that the additional dependencies are installed.
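As an illustration, the following call downloads two datasets, converts them to BIDS and compresses the output. The dataset identifiers used here are assumptions; check "datasets/datasets.yaml" for the exact names.

```bash
# Illustrative call: download two datasets, convert them to BIDS, compress the output.
# Dataset identifiers are assumptions; check datasets/datasets.yaml for the exact names.
python3 URT.py --dataset "[Brain-Tumor-Progression, BTC_preop]" --bids --compress
```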
By using docker you can avoid the installation of any dependencies and achieve higher reproducibility.
The container can be started by executing:
```bash
docker run -v ./output:/URT/output -v ./temp:/URT/temp -v ./cache:/URT/cache [-v ./config:/URT/config] --platform=linux/amd64 imagingai/urt:latest --dataset DATASET [--bids] [--compress]
```

In the case of docker, the output directory, temporary directory and cache directory can be changed by modifying the mounted volumes in the docker run command. E.g. replacing "./output:/URT/output" by "~/output:/URT/output" will move the output folder to the home directory.
Warning: The same restrictions concerning the private repo apply here.
The usage of docker compose is supported as well. Start docker compose by executing:
```bash
docker compose up
```

The arguments and volumes can be changed in the compose.yaml file.
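As a rough orientation, a compose.yaml mirroring the docker run command above might look like the following sketch; the file shipped with the repository is authoritative.

```yaml
# Hypothetical sketch mirroring the docker run command above.
# The compose.yaml shipped with the repository is authoritative.
services:
  urt:
    image: imagingai/urt:latest
    platform: linux/amd64
    volumes:
      - ./output:/URT/output
      - ./temp:/URT/temp
      - ./cache:/URT/cache
      # - ./config:/URT/config   # only needed for restricted datasets
    command: ["--dataset", "DATASET", "--bids", "--compress"]
```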
Singularity is supported as well. The following command can be used to pull the docker image from dockerhub, convert it to the singularity image format .sif and run it:
```bash
singularity run --cleanenv --writable-tmpfs --no-home --bind ./output:/URT/output --bind ./temp:/URT/temp --bind ./cache:/URT/cache [--bind ./config:/URT/config] docker://imagingai/urt:latest --dataset DATASET [--bids] [--compress]
```

Similar to docker, the output folder can be changed by changing the path of the mounted directories.
Warning: Unfortunately some datasets require a registration and access request on the respective website. If you want to download one or several of these datasets, make sure that you have access permissions and that your credentials are entered in the credentials.yaml file (as described in this chapter).
| Acronym | Name | Source | Downloader | Format | BIDS support | Access |
|---|---|---|---|---|---|---|
| BTUP | Brain-Tumor-Progression | TCIA | TciaDownloader | DICOM | Yes | Limited |
| BGBM | Burdenko-GBM-Progression | TCIA | TciaDownloader | DICOM | Yes | Limited |
| RIDN | RIDER Neuro MRI | TCIA | TciaDownloader | DICOM | Yes | Limited |
| QBDM | QIN-BRAIN-DSC-MRI | TCIA | TciaDownloader | DICOM | Yes | Limited |
| UPDG | UCSF-PDGM | TCIA | AsperaDownloader | NIfTI | No | Open |
| BRAG | Brats-2023-GLI | Synapse | SynapseDownloader | NIfTI | Yes | Limited |
| BRSA | Brats-2023-SSA | Synapse | SynapseDownloader | NIfTI | No | Limited |
| BTC1 | BTC_preop | OpenNeuro | AwsDownloader | BIDS | Yes (already in BIDS) | Open |
| BTC2 | BTC_postop | OpenNeuro | AwsDownloader | BIDS | Yes (already in BIDS) | Open |
Some datasets require access permissions from the respective website for the download. If you want to download one of these, make sure that you indeed have the permissions to do so and enter your credentials in the credentials.yaml file ("config/credentials.yaml").
If you are using docker or singularity you need to mount the folder containing the .yaml file to "/URT/config" inside the container, as shown in the docker example and singularity example.
OpenNeuro generally does not require any access permissions, while TCIA and Synapse usually do.
If you want to access restricted datasets from TCIA, make sure that your TCIA account has the required permissions. Afterwards enter the credentials into the credentials.yaml file as shown in the example file:
```yaml
TCIA:
  user: USERNAME
  password: PASSWORD
```

Synapse requires the user to create an access token for accessing restricted datasets through a user account. As with TCIA, make sure that you indeed have permissions to access the dataset. Create a personal access token by logging in to Synapse and navigating to Your Account / Account Settings / Personal Access Token / Manage Personal Access Tokens. Afterwards add the access token to the credentials.yaml file:
```yaml
Synapse:
  token: ACCESS_TOKEN
```

Synapse provides instructions for generating an access token.
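A complete credentials.yaml covering both sources would then look like this; since the credentials file supports separate credentials for every downloader, only the sections you actually need have to be present.

```yaml
# Example config/credentials.yaml covering both sources.
# Only the sections for the sources you actually use need to be present.
TCIA:
  user: USERNAME
  password: PASSWORD
Synapse:
  token: ACCESS_TOKEN
```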
URT consists of a collection of downloaders, bidsmaps and modules.
When given the name of a dataset, URT checks its datasets file ("datasets/datasets.yaml") for the required metadata. If it is available, URT attempts to download the dataset. If not, it falls back to the TciaDownloader, which supports DICOM dataset downloads from The Cancer Imaging Archive, and searches for the unknown dataset there. If the dataset is not found, URT skips it and attempts to download the next dataset in the list (if there is any).
If the "--bids" option is chosen then URT will attempt to convert the dataset to the BIDS standard by using the Bidscoin library and the correct bidsmap in "datasets/bidsmaps". If none is given then URT will not attempt to download the dataset in the first place and will warn the user about the issue.
In order to make URT as modular as possible, it supports the application of arbitrary modules after the download and BIDS conversion. Modules are functions which are given the path of the dataset and can thus implement any desired postprocessing of the dataset.
Adding datasets for download to URT is a simple process. It only requires adding the relevant metadata, such as the dataset format, the source and the downloader to be used, to "datasets/datasets.yaml". If you want to add a new dataset, the easiest way is to copy the entry of an existing dataset with the same source and downloader and modify it.
Adding a dataset for automatic BIDS conversion is a more involved process, because creating the bidsmap that maps the dataset to the BIDS format requires domain knowledge. Since URT uses the Bidscoin library, the information required for creating bidsmaps can be found in its documentation. Bidsmaps are stored in "datasets/bidsmaps" and must have the same name as the dataset in the "datasets/datasets.yaml" file.
If you only want to add a dataset for download, make sure that there is no "bids" entry in the .yaml file for that dataset, since URT will otherwise try to find the bidsmap and convert the dataset.
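A minimal sketch of such a dataset entry is shown below; the key names are assumptions, so copy an existing entry from "datasets/datasets.yaml" to get the authoritative structure.

```yaml
# Hypothetical entry in datasets/datasets.yaml; the key names are assumptions.
# Copy an existing entry from the file for the authoritative structure.
My-New-Dataset:
  downloader: TciaDownloader   # which downloader class to use
  format: DICOM                # format of the source data
  # bids: ...                  # omit this key if no bidsmap exists (download only)
```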
Downloaders are stored in the "downloader" folder in URT. If you want to add a downloader for a new source, add a new class inheriting from the "Downloader" class to the folder. The filename should be the same as the class name. The downloaders are referenced by their name in URT, thus the new downloader will automatically be used if a dataset references the downloader's name in the datasets.yaml file.
Modules are functions in the "utils/Modules.py" file which can be called by URT after the download (and BIDS conversion, if chosen) to do arbitrary postprocessing on the datasets.
Modules are executed by adding the "modules" key to the dataset entry in the "datasets.yaml" file. "modules" contains a list of modules which are executed in the specified order. Each module is defined by a "name" and a "data" key: the "name" key identifies which module is to be executed and the "data" key contains arbitrary data which is passed to the module as input.
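For illustration, such a "modules" entry might look like the following sketch; the module names and data shown are placeholders.

```yaml
# Hypothetical "modules" entry for a dataset in datasets.yaml.
# Module names and data are placeholders; see utils/Modules.py for real modules.
My-Dataset:
  modules:
    - name: some_module         # function name in utils/Modules.py
      data: {option: value}     # arbitrary input passed to the module
    - name: another_module
      data: null
```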
Examples for modules can be found in the "utils/Modules.py" file.
- The synapseclient library sometimes seems to get stuck when run in a docker container
Only the last version updates are indicated here. The full changelog can be found in the CHANGELOG.md.
- Bugfix: replaced shutil.rename by explicit copy and removal to avoid problems with distributed filesystems
- Credentials.yaml is automatically generated if none exists
- Error reports more concise
- Fixed error in Dockerfile
- Changed subprocess calls (better error handling)
- Downgraded to Bidscoin 4.3.0 (4.3.2 causes BIDS conversion of Brats to fail)
- Reduced size of docker container (from ~3GB to ~1GB)
- Password and username are now censored in logfiles
- Bidsmaps for Brain-Tumor-Progression, QIN-Brain-DSC-MRI
- Bidsmap for RIDER NEURO MRI
- Changed URL for token request of the TCIA downloader (previous URL does not work anymore)
- Switched to Bidscoin version 4.3.2
- Perfusion now under "extra_data"
- Automatic creation of dseg.tsv for supported datasets
- Support for datasets: Brats-2023-GLI, Brats-2023-SSA, Brain-Tumor-Progression, Burdenko-GBM-Progression, EGD
- Modules: allow arbitrary modifications of the downloaded datasets
- Credentials file supporting separate credentials for every downloader
- Readme.md: additional information about URT, its architecture, its usage and supported datasets.
- Detection for already downloaded and corrupted datasets
- Automatic detection of compressed and uncompressed datasets: avoids re-downloads e.g. if compressed dataset is available and uncompressed is supposed to be downloaded
- SynapseDownloader for downloading datasets from Synapse via synapseclient
- Readme.md: error in docker and singularity command (mounting collections.yaml)
- Minor changes in logger
- Increased modularity of the downloaders
- Downloader class: every downloader has to inherit from "Downloader"
- Numerous enhancements concerning readability and simplicity
- Manual download represented by the "Manual" downloader class instead of "none"
- Changed format of datasets in datasets.yaml
- Attempt to move errors to the beginning of the process to avoid disruptions during the download process
- Download and conversion of multiple datasets more robust
- Ruby delivered with conda environment
- Username and password are not given as argument anymore
- Duplicate code: opening dataset.yaml (now opened once during object initialization)
