🎧 Audio Feature Comparison – Custom Algorithm vs Spotify
This project compares the output of a custom audio feature extraction algorithm with Spotify’s official features, using a dataset of music tracks. It helps evaluate how close the custom model is to Spotify’s ground truth and highlights areas of strength and areas for improvement.
The feature extraction algorithm lives in the extract_feature function in soundcloud_pipeline.py.
🔩 Infra
✅ soundcloud_pipeline.py
An automated pipeline that searches for songs on SoundCloud, downloads them with yt-dlp, extracts their audio features with librosa, and compares them to Spotify's official metadata. It tracks progress with a checkpoint system and saves results for later evaluation.
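For orientation, here is a minimal, numpy-only sketch of the kind of computation a feature extractor like extract_feature performs. This is illustrative only: the real implementation uses librosa, and the function and feature names below are assumptions, not the pipeline's actual API.

```python
import numpy as np

def extract_features_sketch(samples: np.ndarray, sr: int) -> dict:
    """Toy stand-in for extract_feature in soundcloud_pipeline.py.

    Computes two crude proxies for Spotify-style features; the real
    pipeline uses librosa and covers many more features.
    """
    # Energy proxy: root-mean-square amplitude, clipped to [0, 1].
    rms = float(np.sqrt(np.mean(samples ** 2)))
    # Zero-crossing rate: a rough brightness/noisiness proxy.
    zcr = float(np.mean(np.abs(np.diff(np.sign(samples))) > 0))
    return {"energy": min(rms, 1.0), "zero_crossing_rate": zcr}

# One second of a 440 Hz sine wave at a 22,050 Hz sample rate.
sr = 22050
t = np.arange(sr) / sr
features = extract_features_sketch(np.sin(2 * np.pi * 440 * t), sr)
```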
📁 Dataset Files
✅ music_info_cleaned.csv
A cleaned version of the original Music_Info.csv.
Only the first 1,000 songs (the ones used for testing and evaluation) were cleaned.
✅ music_info_cleaned_fixed.csv
A fully cleaned version of the entire Music_Info.csv.
Note: Only the first 1,000 songs were experimentally validated — the rest may still contain edge cases.
🧹 Cleaning Criteria
From the initial 1,000 songs downloaded:
- 165 songs had issues and were filtered out.
- Songs were excluded if their titles contained "remastered", "live", "mono", "album version", "slowed + reverb", etc.
- Of the 165:
  - 35 songs were recoverable after correcting filename quirks (e.g., replacing slashes like AC/DC → AC DC).
  - The remaining 130 were removed due to inconsistencies.
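The title-based exclusion can be mirrored by a small helper like this (hypothetical code; the actual filtering was done during dataset preparation, and naive substring matching can over-match, e.g. "live" inside "Alive"):

```python
# Markers taken from the cleaning criteria above; the list is not exhaustive.
BAD_MARKERS = ("remastered", "live", "mono", "album version", "slowed + reverb")

def is_clean_title(title: str) -> bool:
    """Return True if the title contains none of the excluded markers."""
    lowered = title.lower()
    return not any(marker in lowered for marker in BAD_MARKERS)

kept = is_clean_title("Back in Black")            # no marker present
dropped = is_clean_title("Thunderstruck (Live)")  # contains "live"
```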
✅ clean_dataset.py
This script replaces every '/' in the artist or track name with a space.
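A sketch of the transformation (the exact column names in Music_Info.csv are an assumption here):

```python
# '/' in an artist or track name breaks the downloaded file path,
# so clean_dataset.py replaces it with a space, e.g. "AC/DC" -> "AC DC".
def clean_field(value: str) -> str:
    return value.replace("/", " ")

row = {"artist": "AC/DC", "name": "Back in Black"}
cleaned = {key: clean_field(value) for key, value in row.items()}
```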
📊 Summary Files
✅ og_summary.csv
A summary of comparisons between the Spotify audio features and the custom algorithm for the initial 1,000 songs.
✅ og_audio_feature_cache.csv
Contains audio features generated by the custom algorithm for each track.
Used to avoid recomputation and enable quick lookup by file path.
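Loading the cache might look roughly like this (the file_path key column and the feature columns are assumptions; check the CSV header for the real schema):

```python
import csv
import io

# Stand-in for og_audio_feature_cache.csv contents (schema assumed).
sample_csv = "file_path,tempo,energy\nsongs/track1.mp3,120.0,0.8\n"

def load_cache(fp) -> dict:
    """Index cached rows by file path for quick lookup before re-analysis."""
    return {row["file_path"]: row for row in csv.DictReader(fp)}

cache = load_cache(io.StringIO(sample_csv))
already_done = "songs/track1.mp3" in cache  # skip recomputation if True
```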
🚀 How to Use
- Choose songs to analyze
Open app.py and set start_index and end_index to 1001 or higher (since the model was tuned on the first 1,000 songs).
Example:
from soundcloud_pipeline import SoundCloudPipeline, compare_result
pipeline = SoundCloudPipeline(start_index=1001, end_index=1030)
pipeline.download_songs()
- Download and clean audio
To avoid wasting time analyzing irrelevant songs:
1. Comment out the analyze_songs() call in app.py.
2. Run the script to download all candidate songs using pipeline.download_songs().
3. Manually inspect the downloaded songs and delete any that don't meet the bar (e.g., remastered, live versions, slowed + reverb).
4. Comment out download_songs() and uncomment analyze_songs() to extract features and run comparisons using pipeline.analyze_songs().
- Analyze the audio
This will:
- Run the custom feature extractor
- Compare results to Spotify’s ground truth
- Save a .csv per song to the comparisons/ folder
✅ You should ideally test with 30+ clean songs.
- Analyze performance
Run comparison_analysis.py to compute evaluation metrics; the output is written to a_summary.csv:
- Mean: the average error — tells how far off the predictions are on average.
- Median: the middle error — shows typical performance, less affected by outliers.
- Standard Deviation (std): how much the errors vary — low = consistent, high = unstable.
- Max / Min: the best and worst errors observed — shows the range of performance.
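The four statistics can be reproduced with the standard library; here they are applied to an invented list of per-song absolute errors for a single feature:

```python
import statistics

# Toy per-song absolute errors for one feature (values invented).
errors = [0.05, 0.10, 0.02, 0.40, 0.08]

summary = {
    "mean": statistics.mean(errors),      # average error
    "median": statistics.median(errors),  # typical error, outlier-resistant
    "std": statistics.stdev(errors),      # consistency of the errors
    "max": max(errors),                   # worst case
    "min": min(errors),                   # best case
}
```

Note how the single 0.40 outlier pulls the mean (≈0.13) well above the median (0.08), which is exactly why both are reported.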
🔬 Evaluation Goal
These metrics help you:
- Identify which features are well-estimated
- Spot features that are consistently biased
- Flag features that are unstable and need tuning
🧪 Encouragement to Collaborate
There are ~900 comparisons total, but only a subset is included in this repo.
Feel free to pick 20–30 random songs and run comparisons yourself.
Let me know if you notice anything significantly off — testing edge cases is encouraged.