A high-accuracy machine learning system that identifies songs from short audio clips using deep learning and vector similarity search.
This project demonstrates how to build a music identification system without relying on traditional audio fingerprinting databases.
By combining YAMNet audio embeddings, FAISS vector search, and an ensemble voting strategy, the system achieves 93 percent accuracy on real-world audio clips.
Key achievement: Identifies songs in approximately 10 milliseconds using only 8 seconds of audio.
- 93 percent identification accuracy on thousands of songs
- Average query latency of approximately 10 milliseconds
- Robust to noise, compression, and partial audio clips
- Uses YAMNet for audio embeddings
- Uses FAISS for efficient similarity search
- Voting ensemble across multiple clips per song
- Scalable architecture suitable for large catalogs
- Pre-trained model included
- Dataset: Free Music Archive (FMA Medium)
- Total duration: ~30 hours
- License: Creative Commons
- Source: https://github.com/mdeff/fma
| Metric | Value |
|---|---|
| Dataset size | 5,564 songs |
| Accuracy | 93 percent |
| Query speed | ~10 ms |
| Minimum audio length | 8 seconds |
| Embedding dimension | 521 |
| Index type | FAISS IndexFlatL2 |
| Model size | ~50 MB |
Accuracy breakdown:
- Single clip (8s): 70–80 percent
- Voting method (3 clips): 93 percent
- Accuracy holds up on noisy audio
- Accuracy holds up on compressed audio
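The jump from 70–80 percent to 93 percent comes from voting across three independent clips. A minimal sketch of one plausible voting rule (the function name and the smallest-distance tie-break are assumptions, not taken from the project code):

```python
from collections import Counter

def vote(predictions):
    """Majority vote across per-clip matches.

    predictions: list of (track_id, l2_distance) tuples, one per clip.
    Returns the track ID named most often; ties are broken by the
    smallest distance among the tied tracks (an assumed tie-break rule).
    """
    counts = Counter(track_id for track_id, _ in predictions)
    top = max(counts.values())
    tied = {t for t, c in counts.items() if c == top}
    # Among tied tracks, prefer the one with the closest single match.
    return min((d, t) for t, d in predictions if t in tied)[1]

# Three 8-second clips queried from the same song:
print(vote([("song_42", 0.8), ("song_42", 1.1), ("song_07", 0.9)]))  # song_42
```

A single 8-second clip can land on a confusable section (an intro, a quiet bridge), so agreement between clips filters out most single-clip mistakes.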
Core technologies:
- YAMNet (Google audio embedding model)
- FAISS (vector similarity search)
- TensorFlow
- Librosa
- NumPy
Development environment:
- Python 3.8+
- Jupyter Notebook
- Google Colab for training
```
Audio input (8-second clip)
        ↓
Librosa load and resample (16 kHz)
        ↓
YAMNet embedding extraction (521-dimensional vectors)
        ↓
FAISS similarity search (L2 distance)
        ↓
Top-K matching tracks with similarity scores
```
- Audio clips are loaded and resampled to 16 kHz.
- YAMNet extracts frame-level audio embeddings.
- Embeddings are averaged to form a fixed-length vector.
- FAISS performs nearest-neighbor search using L2 distance.
- A voting strategy across multiple clips improves robustness.
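The averaging and search steps can be sketched in NumPy. Here `clip_vector` stands in for the YAMNet pooling step (frame embeddings are simulated rather than extracted), and `search` performs the exact L2 nearest-neighbor computation that FAISS `IndexFlatL2` runs, just without FAISS's optimizations; the dimension follows the metrics table above:

```python
import numpy as np

DIM = 521  # embedding size reported in the metrics table

def clip_vector(frame_embeddings):
    """Average frame-level embeddings into one fixed-length vector."""
    return np.asarray(frame_embeddings).mean(axis=0)

def search(index_vectors, query, k=5):
    """Exact L2 nearest-neighbor search: the same computation
    FAISS IndexFlatL2 performs, written out with NumPy."""
    dists = np.linalg.norm(index_vectors - query, axis=1)
    order = np.argsort(dists)[:k]
    return order, dists[order]

rng = np.random.default_rng(0)
catalog = rng.normal(size=(100, DIM)).astype(np.float32)      # 100 indexed songs
frames = catalog[7] + rng.normal(scale=0.01, size=(16, DIM))  # noisy frames of song 7
ids, dists = search(catalog, clip_vector(frames), k=3)
print(ids[0])  # 7
```

Averaging the frames before searching is what keeps queries fast: each 8-second clip becomes a single 521-dimensional vector, so one query is one distance computation per indexed song.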
For each song, embeddings are extracted from:
- Start of the song (0–8 seconds)
- Middle segment
- End segment
The final representation is the average of all embeddings, improving robustness and accuracy.
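The three-segment indexing strategy can be sketched as follows; `embed` stands in for YAMNet embedding extraction, and the helper names and middle/end window placement are assumptions (the sample rate and 8-second window follow the numbers above):

```python
import numpy as np

SR = 16_000    # 16 kHz, the rate the audio is resampled to
CLIP = 8 * SR  # 8-second windows, per the metrics table

def song_segments(audio):
    """Return the start, middle, and end 8-second windows of a song."""
    mid = max((len(audio) - CLIP) // 2, 0)
    end = max(len(audio) - CLIP, 0)
    return [audio[i:i + CLIP] for i in (0, mid, end)]

def index_vector(audio, embed):
    """Average the embeddings of the three segments into the single
    fixed-length vector stored in the FAISS index."""
    return np.mean([embed(seg) for seg in song_segments(audio)], axis=0)

# Toy check with a 30-second "song" and a stub embedding (mean amplitude):
song = np.linspace(0.0, 1.0, 30 * SR)
vec = index_vector(song, embed=lambda seg: np.array([seg.mean()]))
print(vec.shape)  # (1,)
```

Sampling from three positions means a query clip taken from anywhere in the song stays close to the stored vector, which is why this improves robustness over indexing the intro alone.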
```bash
git clone https://github.com/yourusername/shazam-clone.git
cd shazam-clone
pip install -r requirements.txt
```