RAVE models encode audio-domain data into highly compressed latent representations. The scripts in this repository build on the idea that, based on statistical information retrieved from these encodings, a speculative compositional practice can be established inside the latent space of the models. It is derived from improvisation tactics developed empirically in Latent Jamming.
The process consists of three phases:
- audio data encoding: 1-n audio files are encoded into arrays in the shape of a model's latent space.
- generation of synthetic data: 1-n encodings are evaluated for their data distribution. Based on the results, arrays of synthetic data are generated, which are used to populate preset patterns and to apply alterations to these patterns.
- decoding of synthetic data into audio data: 1-n generated/synthetic data arrays are decoded and upsampled back to the audio domain using the same model as in the encoding step.
Author: Martin Heinze | marts~. Year: 2026
Create and activate a new conda environment, e.g.:

```shell
conda create -n unravel python=3.14
conda activate unravel
```

Clone this repository and install the requirements:

```shell
git clone https://github.com/devstermarts/UnRAVEl.git
cd UnRAVEl
pip install -r requirements.txt
```

FFmpeg is required. Install it with:

```shell
conda install ffmpeg -y
```

encoder.py takes one or many audio files and encodes them to .npy arrays using a .ts model. For best results, the data should be similar to (or the same as) what the model has seen during training, since we want to generate synthetic data in a value range the decoder can handle well.
```shell
python encoder.py \
--input /path/to/your/audio/files \ # or path to a single audio file
--extensions .opus \ # pick audio file extension
--model /path/to/your/pretrained/model.ts \
--output # default is './_encoder-output'
```

generator.py analyzes one or many .npy arrays for value distribution and correlation and creates arrays of synthetic data to populate patterns (see below). These can be exported as a 'latent audio' file: essentially a multi-channel (= latent dimensions), double-precision (= for values outside the -1/+1 boundary) .wav file at e.g. 21 Hz resolution (if the data source was 44.1 kHz) for use with the ch4ns0n/ch8ns0n components inside Pure Data. Alternatively, the export writes a .npy file that can be transformed using the decoder.py script.
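To make the 'latent audio' format concrete, here is a minimal sketch of how such a file could be written with only the standard library and numpy. This is an illustration, not the actual export code in generator.py; the function name `write_latent_wav` is hypothetical, and since a wav header stores an integer sample rate, the true latent rate of 44100 / 2048 ≈ 21.5 Hz is approximated as 21.

```python
import struct
import numpy as np

def write_latent_wav(path, latents, latent_rate=21):
    """Write a (frames, channels) float64 array as a WAVE_FORMAT_IEEE_FLOAT wav.

    Each channel carries one latent dimension; double precision preserves
    values outside the -1/+1 boundary.
    """
    frames, channels = latents.shape
    data = np.ascontiguousarray(latents, dtype="<f8").tobytes()
    block_align = channels * 8              # bytes per frame (8 per float64)
    byte_rate = latent_rate * block_align
    with open(path, "wb") as f:
        f.write(b"RIFF" + struct.pack("<I", 36 + len(data)) + b"WAVE")
        # fmt chunk: format tag 3 = IEEE float, 64 bits per sample
        f.write(b"fmt " + struct.pack("<IHHIIHH", 16, 3, channels,
                                      latent_rate, byte_rate, block_align, 64))
        f.write(b"data" + struct.pack("<I", len(data)) + data)

# e.g. 21 latent frames of an 8-dimensional model
rng = np.random.default_rng(123)
write_latent_wav("pattern.wav", rng.standard_normal((21, 8)))
```

The resulting file plays back as (inaudibly slow) multi-channel audio, which is what lets Pure Data components read the latent trajectory like an ordinary sound file.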
```shell
python generator.py \
--input /path/to/your/numpy/files \ # or path to a single .npy file
--seed 123 \ # seed for random generator, required for reproducible patterns
--iqr \ # use interquartile range to calculate value distribution
--num_latents 21 \ # length of synthetic latent pattern
--num_loops 4 \ # number of loops/repetitions used in pattern logic
--sample_rate 44100 \ # must be the sample rate the model was trained on
--num_files 3 \ # number of files to be generated, default is 1
--distribution correlation-based \ # distribution method to sample from
--pattern swapper \ # pattern generator, select from multiple patterns (or add your own)
--embeddings \ # create .npy files instead of 'latent audio' files
--output /path/to/generated/files # default is './_generator-output'
```

Encodings are analyzed for their mean and standard deviation (both per latent dimension) and for the correlation between latent dimensions. When the --iqr flag is used, the lower and upper bounds of the interquartile range are determined instead, along with the median and deviation. When generating synthetic data, these values inform the distribution types (normal, uniform, iqr-uniform, correlation-based). Synthetic data in normal distribution (with and without --iqr) as well as in correlation-based distribution should be the most truthful in relation to the model's capabilities.
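The analysis and sampling step can be sketched roughly as follows. This is a minimal numpy illustration under stated assumptions, not the actual generator.py logic; the function names are hypothetical, and correlation-based sampling would additionally use the covariance between dimensions (e.g. via a multivariate normal).

```python
import numpy as np

def analyze(z, iqr=False):
    """Per-dimension statistics of an encoding of shape (frames, dims)."""
    if iqr:
        lo, med, hi = np.percentile(z, [25, 50, 75], axis=0)
        return {"lower": lo, "median": med, "upper": hi}
    return {"mean": z.mean(axis=0), "std": z.std(axis=0),
            "corr": np.corrcoef(z, rowvar=False)}

def sample_normal(stats, num_latents, rng):
    """Synthetic data drawn per dimension from a normal distribution."""
    return rng.normal(stats["mean"], stats["std"],
                      size=(num_latents, stats["mean"].shape[0]))

def sample_iqr_uniform(stats, num_latents, rng):
    """Synthetic data drawn uniformly between the IQR bounds."""
    return rng.uniform(stats["lower"], stats["upper"],
                       size=(num_latents, stats["lower"].shape[0]))

rng = np.random.default_rng(123)
z = rng.standard_normal((512, 8))   # stand-in for a real encoding
synth = sample_normal(analyze(z), num_latents=21, rng=rng)
print(synth.shape)                  # (21, 8)
```

Sampling per dimension from the encoding's own statistics is what keeps the synthetic values inside the range the decoder has learned to handle.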
Latent embeddings created with RAVE are highly compressed representations of audio-domain data. At a sample rate of 44.1 kHz, one second of audio corresponds to roughly 21 latent data points (44100 / 2048 ≈ 21.5) times the number of latent dimensions in the model's configuration. The resolution in the audio domain is high enough to be more or less irrelevant for tempo considerations; with the low resolution in latent space, however, the achievable tempi are limited by:

```
60 * (model sample rate / model compression) / latent data points
```

For example:

```
60 * (44100 / 2048) / 11 = 117.45 BPM
```

This leads to the following quantized tempi (in quarter notes; eighth notes for double time), achievable by looping k latent data points (in the script: --num_latents and --num_loops):
| k | Tempo BPM | Double time |
|---|---|---|
| 1 | 1291.99 | - |
| 2 | 645.99 | 1291.98 |
| 3 | 430.66 | - |
| 4 | 322.99 | 645.98 |
| 5 | 258.39 | - |
| 6 | 215.33 | 430.66 |
| 7 | 184.57 | - |
| 8 | 161.49 | 322.98 |
| 9 | 143.55 | - |
| 10 | 129.19 | 258.38 |
| 11 | 117.45 | - |
| 12 | 107.66 | 215.32 |
| 13 | 99.38 | - |
| 14 | 92.28 | 184.56 |
| 15 | 86.13 | - |
| 16 | 80.74 | 161.48 |
| 17 | 75.99 | - |
| 18 | 71.77 | 143.54 |
| 19 | 67.99 | - |
| 20 | 64.59 | 129.18 |
| ... | ... | ... |
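The table above can be reproduced with a few lines of arithmetic (note that its values are truncated, not rounded, to two decimals):

```python
import math

SAMPLE_RATE = 44100   # model sample rate
COMPRESSION = 2048    # model compression ratio

def tempo_bpm(k):
    """Tempo when looping k latent data points, in quarter notes."""
    return 60 * (SAMPLE_RATE / COMPRESSION) / k

for k in (1, 4, 11):
    print(k, math.floor(tempo_bpm(k) * 100) / 100)
# 1 1291.99
# 4 322.99
# 11 117.45
```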
The script comes with the following proof-of-concept patterns:

| Pattern | Details | Audio example |
|---|---|---|
| fibo | Repeats a given array of shape (data points, latent dimensions) along the Fibonacci series of integers, where 1 corresponds to the first row in the array, 2 to the first two rows, ..., 8 to rows 0-7, and so on. Note that the above tempo considerations do not apply to this pattern type. | 1 2 |
| orale | Approximation of a standard sequence in electronic music: builds an array using a 3:1 scheme in which the original array is repeated three times and a fourth time with subtle changes applied to its values. This sequence is then repeated and altered again in the same 3:1 scheme. | 1 2 |
| blender | Blends two arrays into one another by replacing single data points sequentially after --num_loops repetitions, starting with the first value in the first dimension, followed by the first value in the second dimension and so on, until the last value in the last dimension has been reached. The pattern can be inverted, mirrored or mirror-inverted (implemented, not active). | 1 2 |
| swapper | Swaps the values of randomly picked data points in two arrays of the same size and repeats the altered array --num_loops times. | 1 2 |
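As an illustration of the pattern idea, the swapper logic could be sketched like this. This is a hypothetical minimal version with numpy; the shipped pattern generators may be more elaborate.

```python
import numpy as np

def swapper(a, b, num_swaps, num_loops, rng):
    """Swap values at randomly picked positions between two equally shaped
    arrays, then repeat the altered first array num_loops times."""
    a, b = a.copy(), b.copy()
    frames, dims = a.shape
    for _ in range(num_swaps):
        i, j = rng.integers(frames), rng.integers(dims)
        a[i, j], b[i, j] = b[i, j], a[i, j]
    return np.tile(a, (num_loops, 1))

rng = np.random.default_rng(123)
a = rng.standard_normal((21, 8))
b = rng.standard_normal((21, 8))
pattern = swapper(a, b, num_swaps=16, num_loops=4, rng=rng)
print(pattern.shape)   # (84, 8)
```

Because both source arrays are drawn from the same encoding statistics, the swapped values stay inside the decoder's comfortable range while still perturbing the loop.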
decoder.py decodes one or many .npy arrays back to the audio domain using the same .ts model that was used in the encoding step.

```shell
python decoder.py \
--input /path/to/your/numpy/files \ # or path to a single .npy file
--model /path/to/your/pretrained/model.ts \ # array shape must match model's expected shape
--output /path/to/decoded/audio/files # default is './_decoder-output'
```

inspector.py inspects one or many models' state_dict and checks for other parameters like sample rate, number of latent dimensions and architecture. Details can be stored to a .csv file using the --save and --output flags. With --dict and --attr, the contents of dedicated items from the state_dict and the model's attributes (methods, properties, data attributes) can be retrieved through the CLI. The script generally works with other architectures, e.g. vschaos2, MSPrior or AFTER, but it is not explicitly written for these model types; it partly works with SAT autoencoders for quick inspection. The script has informative value only: it does not create any data for use in the other scripts, but it can be handy for debugging.
```shell
python inspector.py \
--input /path/to/your/pretrained/model.ts \ # or path containing multiple models
--extensions .ts \ # model file name extension
--save \ # stores model state_dict entries and shapes to .csv
--output /path/to/output \ # default is './_inspector-output'
--dict \ # displays all params in state_dict and enters detail request mode in CLI
--attr \ # displays all attributes in model and enters detail request mode in CLI
--config # displays full model config as used in training (only SAT checkpoint files)
```

- Python: 3.14
- Torch/Torchaudio: 2.10
Tested with self-trained models using RAVE <= v2.2.2. Inspection works, more or less, with vschaos2, MSPrior, AFTER and Stable Audio autoencoders.
- Built on RAVE
UnRAVEl © 2026 by Martin Heinze is licensed under CC BY-SA 4.0