# ChordShot: Image-Based Context-Aware Music Generation
ChordShot is a research-oriented project that explores how visual information from images can be translated into meaningful musical compositions. The system analyzes an image to understand its scene, dominant colors, and objects, and then generates a short piece of music that reflects the emotional and contextual characteristics of the visual input.
This repository contains the complete implementation of the system described in the accompanying research paper, which is included in the repo for reference.
## Motivation

Music generation systems typically rely on textual prompts or symbolic inputs. However, images naturally convey rich emotional and contextual information that is difficult to express explicitly in text. ChordShot investigates whether visual cues such as environment type, color tone, and objects can be used as an alternative creative input for music generation.
The goal of this project is not to replace human composition, but to study cross-modal alignment between vision and sound, and to understand how visual semantics can influence musical structure, mood, and instrumentation.
## How It Works

Given a single image, the system:
- Classifies the scene (e.g., indoor, urban, or natural outdoor scenes)
- Extracts dominant colors to infer emotional tone
- Detects objects present in the image
- Maps visual features to musical attributes
- Generates a 30-second music clip using a transformer-based music generation model
The entire process is automatic and requires no manual prompt engineering from the user.
## Pipeline

The pipeline is divided into four main components; hedged code sketches of the first three follow the list.

1. **Scene Classification.** Uses traditional computer vision features (DAISY + HOG) and SVM classifiers trained on the Scene-15 dataset.
2. **Dominant Color Analysis.** Applies K-Means clustering to identify the most prominent colors in the image, which are then associated with affective cues.
3. **Object Detection.** Uses a pretrained YOLOv8 model to identify semantically meaningful objects that help refine musical instrumentation and texture.
4. **Music Generation.** Visual features are converted into a structured textual description, which conditions the MusicGen-Small model to synthesize audio.
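This README describes these steps only at a high level; the sketches below show one plausible way the first three could be implemented. All function names, parameter values, and thresholds are illustrative assumptions, not the repo's actual code.

A scene-classification sketch, assuming scikit-image descriptors and a scikit-learn SVM:

```python
# Illustrative only: pool DAISY + HOG descriptors into one feature vector
# and classify with a linear SVM (e.g., trained on Scene-15).
import numpy as np
from skimage import color, io, transform
from skimage.feature import daisy, hog
from sklearn.svm import LinearSVC

def scene_descriptor(path, size=(256, 256)):
    gray = transform.resize(color.rgb2gray(io.imread(path)), size)  # assumes RGB input
    hog_vec = hog(gray, orientations=9, pixels_per_cell=(32, 32),
                  cells_per_block=(2, 2))
    d = daisy(gray, step=32, radius=24)                  # grid of local descriptors
    daisy_vec = d.reshape(-1, d.shape[-1]).mean(axis=0)  # average-pool the grid
    return np.concatenate([hog_vec, daisy_vec])

# Training stacks descriptors for labeled Scene-15 images:
# clf = LinearSVC().fit(np.stack([scene_descriptor(p) for p in paths]), labels)
```

Dominant-color extraction with K-Means over pixels (the cluster count and resize resolution are assumptions):

```python
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

def dominant_colors(path, k=5):
    pixels = np.asarray(Image.open(path).convert("RGB").resize((128, 128))).reshape(-1, 3)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pixels)
    counts = np.bincount(km.labels_, minlength=k)      # pixels per cluster
    order = np.argsort(counts)[::-1]                   # most prominent first
    return km.cluster_centers_[order].astype(int)      # k RGB triples
```

Object detection through the Ultralytics YOLOv8 API, using the `yolov8m.pt` weights shipped in the repo (the confidence threshold is an assumption):

```python
from ultralytics import YOLO

model = YOLO("yolov8m.pt")

def detect_objects(path, conf=0.4):
    result = model(path, conf=conf)[0]                 # first (only) image
    return sorted({result.names[int(c)] for c in result.boxes.cls})
```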
## Project Structure

```
ChordShot/
│
├── app.py
│       Main application entry point. Handles image upload, feature extraction,
│       prompt construction, and music generation.
│
├── music_gen.py
│       Core logic for generating music using the MusicGen model based on
│       image-derived semantic prompts.
│
├── image_features.json
│       Stores extracted visual features (scene label, dominant colors,
│       detected objects) for a given input image.
│
├── generated_music.wav
├── music_from_image.wav
├── musicgen_output.wav
│       Example audio outputs generated by the system during experimentation
│       and testing.
│
├── models/
│       Contains pretrained and serialized models used in the pipeline.
│
├── yolov8m.pt
├── yolov8l.pt
│       Pretrained YOLOv8 object detection weights (medium and large variants).
│
├── static/
│       Static assets used by the web interface (CSS, images, frontend resources).
│
├── templates/
│       HTML templates for the Flask-based user interface.
│
├── Reviews/
│       Project review and presentation PDFs used during evaluations.
│
├── requirements.txt
│       Python dependencies required to run the project.
│
└── README.md
        Project documentation.
```
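The exact schema of `image_features.json` isn't documented here; a hypothetical example of the kind of record it holds (all values illustrative):

```json
{
  "scene": "natural",
  "dominant_colors": [[34, 87, 52], [120, 144, 156], [210, 205, 190]],
  "objects": ["tree", "bench", "dog"]
}
```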
## Design Approach

Instead of feeding raw image data directly into a generative model, ChordShot follows an interpretable intermediate step. Visual features are mapped to three musical attributes:
- tempo (slow / moderate / fast)
- mood (calm, energetic, ambient)
- instrumentation (acoustic, electronic, atmospheric)
These attributes are combined into a natural-language description that reflects the image context. This description is then passed to the MusicGen-Small model, which generates a waveform of approximately 30 seconds at its native 32 kHz sampling rate.
This design choice makes the system more interpretable and easier to modify or extend.
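A minimal sketch of this step, assuming the `audiocraft` package; the mapping tables, prompt wording, and output file name are illustrative assumptions (the repo's actual logic lives in `music_gen.py`):

```python
# Illustrative sketch: heuristic visual-to-musical mapping, prompt
# construction, and MusicGen-Small synthesis via the audiocraft API.
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

# Hypothetical heuristics: scene type drives mood, tempo, and instrumentation.
ATTRS_BY_SCENE = {
    "natural": ("calm", "slow", "acoustic"),
    "urban": ("energetic", "fast", "electronic"),
    "indoor": ("ambient", "moderate", "atmospheric"),
}

def build_prompt(scene, objects):
    mood, tempo, instr = ATTRS_BY_SCENE.get(scene, ("ambient", "moderate", "atmospheric"))
    obj_part = f", featuring hints of {', '.join(objects)}" if objects else ""
    return f"{mood} {instr} track with a {tempo} tempo, evoking a {scene} scene{obj_part}"

model = MusicGen.get_pretrained("facebook/musicgen-small")
model.set_generation_params(duration=30)               # ~30-second clip

wav = model.generate([build_prompt("natural", ["tree", "dog"])])
# wav: tensor of shape [batch, channels, samples] at model.sample_rate (32 kHz)
audio_write("generated_music", wav[0].cpu(), model.sample_rate, strategy="loudness")
```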
## Requirements

- Python 3.9+
- PyTorch
- FFmpeg (for audio handling)
## Installation and Usage

Install the dependencies:

```bash
pip install -r requirements.txt
```

Run the application on an image:

```bash
python app.py --image path/to/image.jpg
```

The generated music will be saved as a `.wav` file in the output directory.
## Results

- Scene classification achieved around 76% accuracy on the Scene-15 dataset.
- Generated music generally aligns well with the perceived mood of the input image.
- Users reported stronger emotional consistency when color and object information were both included, compared to using scene context alone.
These observations are discussed in more detail in the paper included in this repository.
## Limitations

- Visual-to-music mappings are currently heuristic-based.
- MusicGen processes text prompts, not images directly.
- Output duration and audio quality are limited by the chosen model.
- Real-time performance depends on hardware capability.
## Future Work

- Learning visual–musical mappings using multimodal training
- Supporting longer and higher-quality compositions
- Adding user feedback and control mechanisms
- Exploring real-time and interactive applications
- Investigating direct image-to-audio conditioning models
## Paper

The full paper describing the methodology, experiments, and analysis is available in this repository: `/paper/ChordShot_Paper.pdf`
## Team

- Varun M, Final year CSE, CAHCET
- Vishaal K R, Final year CSE, CAHCET
- Sujithkumar P, Final year CSE, CAHCET

**Project Guide:** Dr. K. Abrar Ahmed, Department of Computer Science and Engineering, C. Abdul Hakeem College of Engineering and Technology