An AI model that generates MIDI music based on the emotional content of artwork using CLIP-based image encoding.
This project uses the following external dependencies:
- midi-emotion - For MIDI music generation based on emotions
- Clone the repository with submodules:

  ```bash
  git clone --recursive https://github.com/vincentamato/aria.git
  cd aria
  ```

- Initialize and update submodules (if cloned without `--recursive`):

  ```bash
  git submodule init
  git submodule update
  ```

- Install the package and dependencies:

  ```bash
  pip install -e .  # This will also install the local midi-emotion package
  ```

- Download model files:
  - Visit ARIA on Hugging Face
  - Download the following files and place them in the corresponding directories:

    ```
    models/
    ├── continuous_concat/   # For continuous vector concatenation
    │   ├── model.pt
    │   ├── mappings.pt
    │   └── model_config.pt
    ├── continuous_token/    # For continuous vector prepending
    │   ├── model.pt
    │   ├── mappings.pt
    │   └── model_config.pt
    └── discrete_token/      # For discrete emotion tokens
        ├── model.pt
        ├── mappings.pt
        └── model_config.pt
    ```

  - Also download `image_encoder.pt` for the CLIP-based image emotion model (a quick layout check is sketched below)
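Before running generation, it can help to verify the downloads landed where expected. The snippet below is a minimal sketch, assuming the `models/` layout shown above and that `image_encoder.pt` sits directly under `models/`; adjust the paths to wherever you actually placed the files.

```python
from pathlib import Path

# Assumed layout: models/<conditioning>/{model.pt, mappings.pt, model_config.pt}
# plus models/image_encoder.pt -- adjust if you placed the files elsewhere.
MODELS_DIR = Path("models")
CONDITIONINGS = ["continuous_concat", "continuous_token", "discrete_token"]
REQUIRED = ["model.pt", "mappings.pt", "model_config.pt"]

expected = [MODELS_DIR / c / f for c in CONDITIONINGS for f in REQUIRED]
expected.append(MODELS_DIR / "image_encoder.pt")

missing = [p for p in expected if not p.exists()]
if missing:
    print("Missing model files:")
    for p in missing:
        print(f"  {p}")
else:
    print("All expected model files found.")
```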
ARIA uses two main components (see the sketch after this list):
- A CLIP-based image encoder that extracts emotional content (valence and arousal) from artwork
- A music generation model (midi-emotion) that creates MIDI music based on these emotions
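Conceptually, generation is a two-stage pipeline: the image encoder maps artwork to a valence-arousal estimate, and that estimate conditions the MIDI generator. The sketch below uses illustrative placeholder names (`image_encoder.predict_emotion`, `midi_generator.generate`); the real wiring lives in `generate.py`.

```python
from pathlib import Path

def run_pipeline(image_path: str, image_encoder, midi_generator, out_dir: str = "output") -> Path:
    """Two-stage sketch: artwork -> (valence, arousal) -> MIDI file.

    `image_encoder` and `midi_generator` are placeholders for the CLIP-based
    emotion model and the midi-emotion generator, not ARIA's actual classes.
    """
    # Stage 1: estimate the emotional content of the artwork.
    valence, arousal = image_encoder.predict_emotion(image_path)

    # Stage 2: condition MIDI generation on the estimated emotion.
    midi_path = midi_generator.generate(valence=valence, arousal=arousal, out_dir=out_dir)
    return Path(midi_path)
```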
ARIA supports different ways of incorporating emotional information into the music generation process:
- Continuous Concat (Default): Embeds emotions as continuous vectors and concatenates them to all tokens in the sequence. This provides consistent emotional influence throughout the generation.
- Continuous Token: Embeds emotions as continuous vectors and prepends them to the sequence, so the emotional information is provided only at the start of generation (both continuous modes are illustrated in the sketch below).
- Discrete Token: Quantizes emotions into discrete bins and uses them as special tokens. Useful when you want more distinct emotional categories.
- None: Generates music without emotional conditioning. Use this for baseline comparison or when you want purely structural music generation.
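The difference between the two continuous modes is where the emotion embedding enters the token sequence. Below is a minimal PyTorch sketch of the idea; the tensor shapes and the `emotion_proj` layer are illustrative, not the midi-emotion implementation (which sizes the embeddings so everything matches the model width).

```python
import torch
import torch.nn as nn

batch, seq_len, d_model = 1, 16, 512
token_emb = torch.randn(batch, seq_len, d_model)   # embedded MIDI tokens
emotion = torch.tensor([[0.4, -0.2]])              # (valence, arousal) pair

emotion_proj = nn.Linear(2, d_model)               # illustrative projection layer
emotion_emb = emotion_proj(emotion)                # (batch, d_model)

# continuous_concat: the emotion vector is attached to every position,
# so the condition influences the entire sequence.
concat_input = torch.cat(
    [token_emb, emotion_emb.unsqueeze(1).expand(-1, seq_len, -1)],
    dim=-1,
)  # (batch, seq_len, 2 * d_model)

# continuous_token: the emotion vector is prepended as one extra "token",
# so the condition is only seen at the start of the sequence.
prepend_input = torch.cat([emotion_emb.unsqueeze(1), token_emb], dim=1)  # (batch, seq_len + 1, d_model)
```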
Generate music from an image using the following command:
```bash
python src/models/aria/generate.py \
    --image path/to/your/image.jpg \
    --image_model_checkpoint path/to/image/model.pt \
    --midi_model_dir path/to/midi_emotion/model \
    --conditioning continuous_token \
    --out_dir output
```

- `--image`: Path to the input image
- `--image_model_checkpoint`: Path to the CLIP-based image emotion model checkpoint
- `--midi_model_dir`: Path to the midi-emotion model directory
- `--conditioning`: Type of emotion conditioning (choices: `none`, `discrete_token`, `continuous_token`, `continuous_concat`)
- `--out_dir`: Directory to save generated MIDI (default: `output`)
- `--gen_len`: Length of generation in tokens (default: 512)
- `--temperature`: Temperatures for sampling as `[note_temp, rest_temp]` (default: `[1.2, 1.2]`)
- `--top_k`: Top-k sampling, `-1` to disable (default: `-1`)
- `--top_p`: Top-p (nucleus) sampling threshold (default: 0.7); see the sampling sketch below
- `--min_instruments`: Minimum number of instruments required (default: 1)
- `--cpu`: Force CPU inference
- `--batch_size`: Number of samples to generate (default: 1)
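`--temperature` and `--top_p` apply standard temperature scaling and nucleus (top-p) filtering to the next-token distribution, with `note_temp` and `rest_temp` acting as separate temperatures for note and rest tokens. A generic sketch of one such sampling step (not ARIA's exact code):

```python
import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 1.2, top_p: float = 0.7) -> int:
    """Temperature + nucleus (top-p) sampling over a 1-D vector of logits."""
    probs = torch.softmax(logits / temperature, dim=-1)

    # Keep the smallest set of tokens whose cumulative probability exceeds top_p.
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    drop = cumulative > top_p
    drop[1:] = drop[:-1].clone()   # shift so the first token over the threshold is kept
    drop[0] = False                # always keep the most likely token
    sorted_probs[drop] = 0.0
    sorted_probs = sorted_probs / sorted_probs.sum()

    choice = torch.multinomial(sorted_probs, num_samples=1)
    return int(sorted_idx[choice])
```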
The model will output:
- Predicted emotional values (valence and arousal)
- Path to the generated MIDI file (which can be inspected as in the sketch below)
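To sanity-check a result, the generated file can be opened with any MIDI library. For example, with `pretty_midi` (not a dependency of this project; the output path below is just a placeholder):

```python
import pretty_midi

# Replace with the path printed by generate.py for your run.
midi = pretty_midi.PrettyMIDI("output/generated.mid")

print(f"Duration: {midi.get_end_time():.1f} s")
for instrument in midi.instruments:
    name = pretty_midi.program_to_instrument_name(instrument.program)
    print(f"{name}: {len(instrument.notes)} notes")
```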
This project incorporates the following open-source works:
- midi-emotion by Serkan Sulun et al. (https://github.com/serkansulun/midi-emotion)