A high-accuracy machine learning system that identifies songs from short audio clips using deep learning and vector similarity search.
This project demonstrates how to build a music identification system without relying on traditional audio fingerprinting databases.
By combining YAMNet audio embeddings, FAISS vector search, and an ensemble voting strategy, the system achieves 93 percent accuracy on real-world audio clips.
Key achievement: Identifies songs in approximately 10 milliseconds using only 8 seconds of audio.
- 93 percent identification accuracy on thousands of songs
- Average query latency of approximately 10 milliseconds
- Robust to noise, compression, and partial audio clips
- Uses YAMNet for audio embeddings
- Uses FAISS for efficient similarity search
- Voting ensemble across multiple clips per song
- Scalable architecture suitable for large catalogs
- Pre-trained model included
- Dataset: Free Music Archive (FMA Medium)
- Total duration: ~30 hours
- License: Creative Commons
- Source: https://github.com/mdeff/fma
| Metric | Value |
|---|---|
| Dataset size | 5,564 songs |
| Accuracy | 93 percent |
| Query speed | ~10 ms |
| Minimum audio length | 8 seconds |
| Embedding dimension | 521 |
| Index type | FAISS IndexFlatL2 |
| Model size | ~50 MB |
Accuracy breakdown:
- Single clip (8s): 70–80 percent
- Voting method (3 clips): 93 percent
- Accuracy holds up on noisy audio
- Accuracy holds up on compressed audio
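The jump from 70–80 percent to 93 percent comes from voting across three independent clips. A minimal sketch of one plausible voting rule (the function name and the smallest-distance tie-break are assumptions, not taken from the project code):

```python
from collections import Counter

def vote(predictions):
    """Majority vote across per-clip matches.

    predictions: list of (track_id, l2_distance) tuples, one per clip.
    Returns the track ID named most often; ties are broken by the
    smallest distance among the tied tracks (an assumed tie-break rule).
    """
    counts = Counter(track_id for track_id, _ in predictions)
    top = max(counts.values())
    tied = {t for t, c in counts.items() if c == top}
    # Among tied tracks, prefer the one with the closest single match.
    return min((d, t) for t, d in predictions if t in tied)[1]

# Three 8-second clips queried from the same song:
print(vote([("song_42", 0.8), ("song_42", 1.1), ("song_07", 0.9)]))  # song_42
```

A single 8-second clip can land on a confusable section (an intro, a quiet bridge), so agreement between clips filters out most single-clip mistakes.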
Core technologies:
- YAMNet (Google audio embedding model)
- FAISS (vector similarity search)
- TensorFlow
- Librosa
- NumPy
Development environment:
- Python 3.8+
- Jupyter Notebook
- Google Colab for training
```
Audio input (8-second clip)
        ↓
Librosa load and resample (16 kHz)
        ↓
YAMNet embedding extraction (521-dimensional vectors)
        ↓
FAISS similarity search (L2 distance)
        ↓
Top-K matching tracks with similarity scores
```
- Audio clips are loaded and resampled to 16 kHz.
- YAMNet extracts frame-level audio embeddings.
- Embeddings are averaged to form a fixed-length vector.
- FAISS performs nearest-neighbor search using L2 distance.
- A voting strategy across multiple clips improves robustness.
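The averaging and search steps can be sketched in NumPy. Here `clip_vector` stands in for the YAMNet pooling step (frame embeddings are simulated rather than extracted), and `search` performs the exact L2 nearest-neighbor computation that FAISS `IndexFlatL2` runs, just without FAISS's optimizations; the dimension follows the metrics table above:

```python
import numpy as np

DIM = 521  # embedding size reported in the metrics table

def clip_vector(frame_embeddings):
    """Average frame-level embeddings into one fixed-length vector."""
    return np.asarray(frame_embeddings).mean(axis=0)

def search(index_vectors, query, k=5):
    """Exact L2 nearest-neighbor search: the same computation
    FAISS IndexFlatL2 performs, written out with NumPy."""
    dists = np.linalg.norm(index_vectors - query, axis=1)
    order = np.argsort(dists)[:k]
    return order, dists[order]

rng = np.random.default_rng(0)
catalog = rng.normal(size=(100, DIM)).astype(np.float32)      # 100 indexed songs
frames = catalog[7] + rng.normal(scale=0.01, size=(16, DIM))  # noisy frames of song 7
ids, dists = search(catalog, clip_vector(frames), k=3)
print(ids[0])  # 7
```

Averaging the frames before searching is what keeps queries fast: each 8-second clip becomes a single 521-dimensional vector, so one query is one distance computation per indexed song.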
For each song, embeddings are extracted from:
- Start of the song (0–8 seconds)
- Middle segment
- End segment
The final representation is the average of all embeddings, improving robustness and accuracy.
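The three-segment indexing strategy can be sketched as follows; `embed` stands in for YAMNet embedding extraction, and the helper names and middle/end window placement are assumptions (the sample rate and 8-second window follow the numbers above):

```python
import numpy as np

SR = 16_000    # 16 kHz, the rate the audio is resampled to
CLIP = 8 * SR  # 8-second windows, per the metrics table

def song_segments(audio):
    """Return the start, middle, and end 8-second windows of a song."""
    mid = max((len(audio) - CLIP) // 2, 0)
    end = max(len(audio) - CLIP, 0)
    return [audio[i:i + CLIP] for i in (0, mid, end)]

def index_vector(audio, embed):
    """Average the embeddings of the three segments into the single
    fixed-length vector stored in the FAISS index."""
    return np.mean([embed(seg) for seg in song_segments(audio)], axis=0)

# Toy check with a 30-second "song" and a stub embedding (mean amplitude):
song = np.linspace(0.0, 1.0, 30 * SR)
vec = index_vector(song, embed=lambda seg: np.array([seg.mean()]))
print(vec.shape)  # (1,)
```

Sampling from three positions means a query clip taken from anywhere in the song stays close to the stored vector, which is why this improves robustness over indexing the intro alone.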
```bash
git clone https://github.com/yourusername/shazam-clone.git
cd shazam-clone
pip install -r requirements.txt
```