Julien Guinot*,1,2, Elio Quinton2, György Fazekas1
1 Centre for Digital Music, Queen Mary University of London, U.K.
2 Music & Audio Machine Learning Lab, Universal Music Group, London, U.K.
*Correspondence to j.guinot@qmul.ac.uk
Multimodal contrastive models have achieved strong performance in text-audio retrieval and zero-shot settings, but improving joint embedding spaces remains an active research area. Less attention has been given to making these systems controllable and interactive for users. In text-music retrieval, the ambiguity of freeform language creates a many-to-many mapping, often resulting in inflexible or unsatisfying results.
We introduce Generative Diffusion Retriever (GDR), a novel framework that leverages diffusion models to generate queries in a retrieval-optimized latent space. This enables controllability through generative tools such as negative prompting and denoising diffusion implicit models (DDIM) inversion, opening a new direction in retrieval control. GDR improves retrieval performance over contrastive teacher models and supports retrieval in audio-only latent spaces using non-jointly trained encoders. Finally, we demonstrate that GDR enables effective post-hoc manipulation of retrieval behavior, enhancing interactive control for text-music retrieval tasks.
- Generative Retrieval: Novel diffusion-based approach for text-music retrieval that generates queries in audio latent space
- Controllable: Enables negative prompting and DDIM inversion for interactive retrieval refinement
- Flexible Encoders: Works with non-jointly trained audio and text encoders
- Improved Performance: Outperforms contrastive teacher models on in-domain datasets
- Interactive: Enables post-hoc manipulation of retrieval behavior for better user control
Instead of encoding text queries and audio keys in a joint embedding space (top), GD-Retriever generates queries in the audio space directly through conditioning on a text query (bottom). This approach enables generative controllability mechanisms that are not available in traditional contrastive retrieval systems.
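Once a query embedding has been generated in the audio space, retrieval itself reduces to nearest-neighbour search over the audio keys. A minimal NumPy sketch of that final step, with the diffusion model that would produce `query_embedding` elided and all names illustrative:

```python
import numpy as np

def retrieve(query_embedding, audio_keys, k=5):
    """Rank audio keys by cosine similarity to a generated query embedding.

    query_embedding: (d,) vector produced by the diffusion model from text.
    audio_keys: (n, d) pre-extracted audio embeddings (e.g. CLAP features).
    """
    q = query_embedding / np.linalg.norm(query_embedding)
    keys = audio_keys / np.linalg.norm(audio_keys, axis=1, keepdims=True)
    sims = keys @ q                  # (n,) cosine similarities
    top = np.argsort(-sims)[:k]      # indices of the k best matches
    return top, sims[top]

# Toy example: four audio keys in a 3-d latent space.
keys = np.array([[1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.7, 0.7, 0.0],
                 [0.0, 0.0, 1.0]])
idx, scores = retrieve(np.array([1.0, 0.1, 0.0]), keys, k=2)
```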
- Clone the repository:
git clone https://github.com/your-username/Diff-GAR.git
cd Diff-GAR
- Create a conda environment:
conda create -n diffgar python=3.10
conda activate diffgar
- Install the required dependencies:
pip install -r requirements.txt
The codebase is designed to be modular and flexible. Most configuration is handled through YAML files in the config/ directory. You only need to modify paths if you're using the same datasets as in the paper or if you want to use your own datasets.
If you're using the datasets from the paper (SongDescriber, MusicCaps, PrivateCaps), you'll need to update the placeholder paths in the configuration files. The main files to modify are:
- Evaluation utilities: `diffgar/evaluation/utils.py`
- Data loaders: `diffgar/dataloading/dataloaders.py`
- Configuration files: `config/train_ldm/data/local/`
- Extract features config: `config/extract_features/`
For custom datasets, you can:
- Create your own data loading scripts following the existing patterns
- Modify the configuration files to point to your data
- Use the existing evaluation framework with your own data
The codebase supports various audio-text encoders (CLAP, MULE, MusCALL) and can be easily extended to work with new datasets and encoders.
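As a starting point for a custom dataset, a loader can be as simple as pairing pre-extracted feature paths with captions. The sketch below is hypothetical: the metadata format, field names, and return types are illustrative assumptions, and should be adapted to the actual patterns in `diffgar/dataloading/dataloaders.py`.

```python
import json
import tempfile
from pathlib import Path

class CaptionedAudioDataset:
    """Hypothetical minimal loader pairing feature file paths with captions.

    Assumes a metadata JSON of the form:
        [{"features": "track_0001.npy", "caption": "upbeat jazz piano"}, ...]
    Adapt field names and return types to the existing dataloader patterns.
    """

    def __init__(self, metadata_path):
        self.items = json.loads(Path(metadata_path).read_text())

    def __len__(self):
        return len(self.items)

    def __getitem__(self, i):
        item = self.items[i]
        return item["features"], item["caption"]

# Demo with a throwaway metadata file.
meta = [{"features": "track_0001.npy", "caption": "upbeat jazz piano"}]
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(meta, f)
dataset = CaptionedAudioDataset(f.name)
```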
To train a GD-Retriever model:
python train_ldm.py --config config/train_ldm/model/encoder_pair/clap/unet/train_ldm_sample_pred_base.yaml
To evaluate retrieval performance:
python eval_retrieval.py --task song_describer --model_name your_model_name --model_step 50000
To evaluate fidelity and diversity:
python eval_fidelity.py --task song_describer --model_name your_model_name --model_step 50000
To extract features using pre-trained encoders:
python extract_dataset.py --config config/extract_features/clap/extract_songdescriber.yaml
GD-Retriever consists of:
- Audio Encoder: CLAP, MULE, or MusCALL for audio feature extraction
- Text Encoder: T5 or CLAP text encoder for text feature extraction
- Diffusion Model: UNet or MLP-based diffusion model for generating audio embeddings
- Retrieval Head: Cross-modal similarity computation
The model is trained on latent sequences of length T=64 (one minute of audio) with a batch size of 256 for 100k steps, using the AdamW optimizer with linear warmup and cosine decay.
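The warmup-then-cosine learning-rate schedule mentioned above can be sketched as a pure function of the step count. The 100k total steps match the training run; the warmup length and base learning rate below are illustrative placeholders, not values from the paper:

```python
import math

def lr_at_step(step, total_steps=100_000, warmup_steps=5_000, base_lr=1e-4):
    """Linear warmup to base_lr, then cosine decay to zero.

    warmup_steps and base_lr are illustrative, not taken from the paper.
    """
    if step < warmup_steps:
        # Linear ramp from 0 to base_lr over the warmup phase.
        return base_lr * step / warmup_steps
    # Cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```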
The following datasets were used for the paper:
| Dataset | #tracks | #captions | Hours | Training | Eval |
|---|---|---|---|---|---|
| SongDescriber | 0.7k | 1.1k | 23.3 | ❌ | ✅ |
| MusicCaps | 5.5k | 5.5k | 15.3 | ❌ | ✅ |
| PrivateCaps | 251k | 251k | 12.5k | ✅ | ✅ |
However, any text-music / text-audio dataset should be supported.
GD-Retriever outperforms contrastive teacher models on in-domain datasets while maintaining competitive performance on out-of-domain evaluation sets.
| Model | Metric | PC | SD | MC |
|---|---|---|---|---|
| CLAP | R@1 ↑ | 2.2 | 3.1 | 3.8 |
| | R@5 ↑ | 7.2 | 13.7 | 12.9 |
| | R@10 ↑ | 12.3 | 23.2 | 19.5 |
| | MedR (%) ↓ | 3.7 | 4.0 | 1.4 |
| GDR-CLAP | R@1 ↑ | 6.9 | 4.7 | 2.7 |
| | R@5 ↑ | 17.1 | 15.3 | 7.6 |
| | R@10 ↑ | 22.9 | 24.7 | 11.5 |
| | MedR (%) ↓ | 1.6 | 3.8 | 2.9 |
| MusCALL | R@1 ↑ | 10.1 | 3.6 | 1.0 |
| | R@5 ↑ | 26.2 | 13.6 | 3.9 |
| | R@10 ↑ | 35.1 | 22.0 | 7.0 |
| | MedR (%) ↓ | 0.4 | 4.2 | 5.1 |
| GDR-MusCALL | R@1 ↑ | 10.8 | 5.1 | 1.8 |
| | R@5 ↑ | 25.1 | 16.9 | 6.4 |
| | R@10 ↑ | 33.3 | 25.5 | 9.9 |
| | MedR (%) ↓ | 0.6 | 3.5 | 3.4 |
PC: PrivateCaps, SD: SongDescriber, MC: MusicCaps
GD-Retriever enables effective negative prompting for retrieval refinement, maintaining semantic similarity to the original query while distancing from negative attributes.
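Negative prompting of this kind is typically implemented with a classifier-free-guidance-style combination of noise predictions, steering the denoising direction toward the positive prompt and away from the negative one. This is a generic sketch of that mechanism, not necessarily the exact GDR formulation, and the guidance scale is illustrative:

```python
def negative_guided(eps_pos, eps_neg, scale=2.0):
    """Combine positive- and negative-conditioned noise predictions.

    Moves the denoising direction toward the positive prompt and away
    from the negative one; scale > 1 strengthens the push.
    """
    return [n + scale * (p - n) for p, n in zip(eps_pos, eps_neg)]

# With orthogonal toy predictions, the result overshoots the positive
# direction and points away from the negative one.
combined = negative_guided([1.0, 0.0], [0.0, 1.0], scale=2.0)
```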
DDIM inversion allows fine-grained control over retrieval queries, enabling users to refine specific attributes while preserving semantic similarity to the original prompt.
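DDIM inversion runs the deterministic (eta = 0) sampler in reverse, mapping a latent back toward noise so it can be re-denoised under an edited condition. One inversion step follows the standard DDIM algebra; the sketch below is generic, with illustrative variable names and schedule values:

```python
import math

def ddim_inversion_step(x_t, eps, alpha_bar_t, alpha_bar_next):
    """One deterministic DDIM step run in reverse (eta = 0).

    Maps the latent x_t to the next, noisier latent using the model's
    noise prediction eps and the cumulative alpha schedule.
    """
    a_t = math.sqrt(alpha_bar_t)
    a_next = math.sqrt(alpha_bar_next)
    out = []
    for x, e in zip(x_t, eps):
        # Predict the clean latent, then re-noise it at the next level.
        x0_pred = (x - math.sqrt(1 - alpha_bar_t) * e) / a_t
        out.append(a_next * x0_pred + math.sqrt(1 - alpha_bar_next) * e)
    return out

# Starting from a clean latent (alpha_bar_t = 1), one inversion step mixes
# in the predicted noise according to alpha_bar_next.
noisier = ddim_inversion_step([1.0, 0.0], [0.0, 1.0], 1.0, 0.25)
```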
GD-Retriever can work with non-jointly trained encoders, demonstrating flexibility in encoder selection:
| Model | Text Encoder | R@1 | R@5 | R@10 | MR |
|---|---|---|---|---|---|
| GDR-CLAP | T5 | 8.1 | 21.1 | 29.2 | 0.8 |
| | CLAP | 6.9 | 17.1 | 22.9 | 1.6 |
| GDR-MULE | T5 | 7.6 | 18.5 | 25.3 | 1.6 |
Results on PrivateCaps dataset
This repository is organized as follows:
- config/: Configuration files for training, evaluation, and feature extraction
- diffgar/: Main package containing:
  - dataloading/: Data loading utilities and datasets
  - evaluation/: Evaluation scripts and metrics
  - models/: Model implementations (CLAP, MULE, MusCALL, diffusion models)
- train_ldm.py: Main training script
- eval_retrieval.py: Retrieval evaluation script
- eval_fidelity.py: Fidelity and diversity evaluation script
- extract_dataset.py: Feature extraction script
If you use this code in your research, please cite our paper:
Citation key coming soon - paper not yet published
This project is licensed under the MIT License - see the LICENSE file for details.
We thank the authors of CLAP and MusCALL for providing the pre-trained audio-text encoders used in this work. MULE weights and code were taken from the MuLOOC repository, which accompanies the paper Leave-One-EquiVariant: Alleviating invariance-related information loss in contrastive music representations.
For questions and issues, please open an issue on GitHub or contact j.guinot@qmul.ac.uk.




