This repository contains an implementation of Myna-RPE, an extension of the Myna masked contrastive learning framework for audio representation learning.
Myna-RPE introduces Relative Positional Embeddings (RPE), including 1D and 2D ALiBi and RoPE variants, enabling end-to-end track-level embeddings from full-length mel-spectrograms without chunking or aggregation.
This code reproduces the methods and experiments described in:
Relative Positional Embeddings for Track-Level Representations in Masked Contrastive Learning
Modern self-supervised music models typically train on short fixed-length segments and later average multiple embeddings at inference time — losing long-range structure.
Myna-RPE removes this limitation by adapting Relative Positional Embedding algorithms to ViT-based audio encoders under patchout masking.
This enables models that can:
- Process entire tracks in a single forward pass
- Extrapolate to spectrograms longer than those seen during training
- Preserve global structure across time and frequency
- Improve downstream MIR performance on multiple benchmarks
Myna-RPE is fully modular — each positional embedding scheme can be toggled independently.
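As an illustration only, such a toggle might be exposed through a training flag; the flag and option names below are hypothetical and may differ from this repository's actual CLI:

```python
import argparse

# Hypothetical CLI sketch: the actual flag names in this repository may differ.
parser = argparse.ArgumentParser()
parser.add_argument("--pos-emb",
                    choices=["alibi_1d", "alibi_2d", "rope", "sinusoidal"],
                    default="alibi_2d",
                    help="Positional embedding scheme used by the encoder.")
parser.add_argument("--freq-embed", action="store_true",
                    help="Add learned frequency embeddings on top of a 1D scheme.")
args = parser.parse_args()
```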
- Full PyTorch implementation of:
  - 2D ALiBi
  - 1D ALiBi
  - Learned frequency embeddings
  - RoPE positional embeddings
  - Sinusoidal positional embeddings
  - Patchout-aware RPE (coordinates preserved after patchout)
- Myna-style AST encoder with CLS token
- Contrastive pretraining with InfoNCE
- Mel-spectrogram preprocessing pipeline
- Training scripts for MTG-Jamendo Top-50 Tags
- Evaluation scripts for:
  - GTZAN (genre)
  - GiantSteps Key (key detection)
  - EmoMusic (emotion regression)
- Basic scripts to interface with the Spotify API
| Model | GTZAN Acc (%) | GiantSteps Acc (%) | EmoMusic Arousal | EmoMusic Valence | Avg |
|---|---|---|---|---|---|
| 1D ALiBi + F-Embed | 74.87 | 82.57 | 59.37 | 43.45 | – |
| 2D ALiBi | 78.39 | 76.50 | 68.06 | 44.05 | ≈66% |
2D ALiBi improves over the 1D variant on genre classification and emotion prediction, while 1D ALiBi with frequency embeddings remains stronger on key detection.
Myna-RPE builds on the Myna contrastive learning framework (itself extending CLMR).
The goal is to learn discriminative audio representations via self-supervised contrastive learning.
For each batch:
- Convert a track to a mel-spectrogram.
- Sample two random fixed-length chunks → positive pair.
- Treat chunks from different tracks as negative pairs.
- Patchify the spectrogram and apply patchout masking.
- Feed tokens into a ViT encoder.
- Apply InfoNCE to make positives similar & negatives dissimilar.
This trains the encoder to model high-level musical structure while remaining robust to masking and augmentation.
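For reference, here is a minimal sketch of an InfoNCE objective over paired chunk embeddings; the function name and shapes are illustrative, not the repository's exact implementation:

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """Contrastive loss over a batch of positive pairs (z1[i], z2[i]).

    z1, z2: (B, D) embeddings of two chunks from the same track.
    Chunks from the other tracks in the batch act as negatives.
    """
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature                      # (B, B) similarities
    targets = torch.arange(z1.size(0), device=z1.device)    # positives on the diagonal
    # Symmetric cross-entropy over rows and columns.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```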
The encoder is a modified Audio Spectrogram Transformer (from Myna) featuring:
- A prepended CLS token for global pooling
- 16×16 non-overlapping patches
- Patchout (random token dropout; see the sketch after this list)
- Standard Transformer encoder layers
- No mean pooling; only the CLS token is used
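Here is a minimal sketch of patchout that keeps each surviving patch's grid coordinates, which is what makes relative biases computable after masking; the names and keep ratio are illustrative, not the repository's exact code:

```python
import torch

def patchout(tokens, coords, keep_ratio=0.5):
    """Randomly drop patch tokens while preserving their grid coordinates.

    tokens: (B, N, D) patch embeddings (CLS token excluded).
    coords: (B, N, 2) integer (time, freq) patch indices.
    Returns surviving tokens plus their original coordinates, so relative
    position biases can still be computed on the full spectrogram grid.
    """
    B, N, D = tokens.shape
    n_keep = max(1, int(N * keep_ratio))
    # Independent random permutation per example; keep the first n_keep tokens.
    idx = torch.rand(B, N, device=tokens.device).argsort(dim=1)[:, :n_keep]
    kept_tokens = torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, D))
    kept_coords = torch.gather(coords, 1, idx.unsqueeze(-1).expand(-1, -1, 2))
    return kept_tokens, kept_coords
```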
The key innovation:
Absolute positional embeddings are replaced with Relative Positional Embeddings, allowing the model to process arbitrary-length spectrograms — including full songs.
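As one concrete instance of such a relative scheme, the sketch below computes a 2D ALiBi-style additive attention bias from surviving patch coordinates. It uses the standard ALiBi slope schedule and Manhattan distance over the patch grid; the repository's exact variant (for example, how the time and frequency axes are weighted, or how the CLS token is handled) may differ:

```python
import torch

def alibi_slopes(n_heads):
    # Geometric slope schedule from the ALiBi paper (assumes a power-of-two head count).
    start = 2 ** (-8 / n_heads)
    return torch.tensor([start ** (i + 1) for i in range(n_heads)])

def alibi_2d_bias(coords, n_heads):
    """Per-head additive attention bias from 2D patch coordinates.

    coords: (N, 2) (time, freq) indices of the patches that survived patchout.
    Returns: (n_heads, N, N) bias added to attention logits before softmax.
    """
    dist = (coords[:, None, :] - coords[None, :, :]).abs().sum(-1)   # Manhattan distance
    return -alibi_slopes(n_heads)[:, None, None] * dist.float()
```

Because the bias depends only on coordinate differences, the same encoder weights apply to spectrograms of any length, which is what enables full-track inference and length extrapolation.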
Several promising research directions extend naturally from Myna-RPE:
Building on the techniques described in “Convex Hull and K-Means Loss for Self-Supervised Representations” (Engineering Applications of Artificial Intelligence, 2024, https://doi.org/10.1016/j.engappai.2024.108612), Myna-RPE can incorporate additional geometric constraints on the embedding space.
These losses encourage:
- K-Means Loss: tighter, more coherent clusters of songs or track segments
- Convex Hull Loss: embeddings that expand to better capture diversity within musical categories
This could improve downstream retrieval, tagging, and clustering tasks beyond contrastive-only training.
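As a rough illustration of the k-means-style term, here is a generic formulation; this is not the loss from the cited paper, and `kmeans_compactness_loss` is a hypothetical name:

```python
import torch

def kmeans_compactness_loss(embeddings, n_clusters=16, n_iters=10):
    """Generic k-means-style compactness term (illustrative only).

    Runs a few Lloyd iterations on the batch embeddings (assumes
    batch size >= n_clusters) and penalizes the squared distance of each
    embedding to its assigned centroid, encouraging tighter clusters.
    """
    with torch.no_grad():  # the clustering step itself carries no gradient
        centroids = embeddings[torch.randperm(embeddings.size(0))[:n_clusters]].clone()
        for _ in range(n_iters):
            assign = torch.cdist(embeddings, centroids).argmin(dim=1)
            for k in range(n_clusters):
                mask = assign == k
                if mask.any():
                    centroids[k] = embeddings[mask].mean(dim=0)
    return ((embeddings - centroids[assign]) ** 2).sum(dim=1).mean()
```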
Instead of sampling fixed-length chunks, an extended Myna-RPE could:
- Randomly sample variable-length excerpts
- Mix short phrases, mid-length clips, and near-full sections
- Train encoders to adapt seamlessly to the natural duration variability in music
Because RPE allows arbitrary sequence length, the model can generalize across varying temporal scales without modification.
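A minimal sketch of such variable-length sampling is shown below; the frame counts are placeholders, and batching excerpts of different lengths would additionally require padding or per-length bucketing:

```python
import random

def sample_excerpt(mel, min_frames=256, max_frames=4096):
    """Sample a random variable-length excerpt from a full-track mel-spectrogram.

    mel: (n_mels, T) spectrogram tensor. The excerpt length is drawn uniformly
    and capped at the track length, so batches mix short phrases with near-full sections.
    """
    T = mel.size(-1)
    length = min(T, random.randint(min_frames, max_frames))
    start = random.randint(0, T - length)
    return mel[..., start:start + length]
```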
A geometry-aware embedding space opens the door to playlist and collection modeling:
- Treat a playlist as a convex hull enclosing the embeddings of its songs
- Measure whether a new track lies inside, near, or outside the playlist hull (see the sketch below)
- Generate playlists by:
- selecting songs whose embeddings fill out a target hull shape
- expanding a playlist’s convex hull and sampling points near its boundary
- using hull interpolation to create thematic transitions between playlists
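For the membership test above, a point lies in the convex hull of a set of embeddings exactly when it can be written as a convex combination of them, which can be checked with a small feasibility LP even in high dimensions. The sketch below uses SciPy and hypothetical names; measuring distance to the hull (“near” the hull) would need an additional step:

```python
import numpy as np
from scipy.optimize import linprog

def in_convex_hull(point, hull_points):
    """Check whether `point` lies in the convex hull of `hull_points`.

    point: (D,) embedding of a candidate track.
    hull_points: (N, D) embeddings of the songs in a playlist.
    Feasible iff some convex combination of the playlist embeddings equals `point`.
    """
    n = hull_points.shape[0]
    # Equality constraints: weights combine hull_points into `point` and sum to 1.
    A_eq = np.vstack([hull_points.T, np.ones((1, n))])
    b_eq = np.concatenate([point, [1.0]])
    res = linprog(c=np.zeros(n), A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * n)
    return res.success
```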
All models were trained in parallel, each for 512 epochs on a single A100 GPU. Compute was generously provided by Yonsei University's AI compute cluster (thank you!).
MIT License