ChordShot

Image-Based Context-Aware Music Generation

ChordShot is a research-oriented project that explores how visual information from images can be translated into meaningful musical compositions. The system analyzes an image to understand its scene, dominant colors, and objects, and then generates a short piece of music that reflects the emotional and contextual characteristics of the visual input.

This repository contains the complete implementation of the system described in the accompanying research paper, which is included in the repo for reference.


Project Motivation

Music generation systems typically rely on textual prompts or symbolic inputs. However, images naturally convey rich emotional and contextual information that is difficult to express explicitly in text. ChordShot investigates whether visual cues such as environment type, color tone, and objects can be used as an alternative creative input for music generation.

The goal of this project is not to replace human composition, but to study cross-modal alignment between vision and sound, and to understand how visual semantics can influence musical structure, mood, and instrumentation.


What ChordShot Does

Given a single image, the system:

  1. Classifies the scene (e.g., indoor, urban, or natural outdoor settings)
  2. Extracts dominant colors to infer emotional tone
  3. Detects objects present in the image
  4. Maps visual features to musical attributes
  5. Generates a 30-second music clip using a transformer-based music generation model

The entire process is automatic and requires no manual prompt engineering from the user.
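Step 2 above (dominant-color extraction) can be sketched with scikit-learn's KMeans. The cluster count, synthetic pixel data, and ranking-by-population step here are illustrative assumptions, not the repository's actual settings:

```python
import numpy as np
from sklearn.cluster import KMeans

def dominant_colors(pixels, k=3, seed=0):
    """Cluster RGB pixels and return the k cluster centres, most populous first."""
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(pixels)
    counts = np.bincount(km.labels_, minlength=k)
    order = np.argsort(counts)[::-1]          # largest cluster first
    return km.cluster_centers_[order].astype(int)

# Synthetic "image": mostly dark blue pixels with a smaller warm-orange region.
rng = np.random.default_rng(0)
img = np.vstack([
    np.full((900, 3), (20, 30, 120)) + rng.integers(-5, 5, (900, 3)),
    np.full((100, 3), (230, 140, 40)) + rng.integers(-5, 5, (100, 3)),
])
colors = dominant_colors(img, k=2)
print(colors[0])  # dominant colour ≈ [20 30 120], a cool/dark tone
```

The ordered cluster centres can then be matched against a colour-to-affect lookup (e.g., cool tones → calm) before prompt construction.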


System Overview

The pipeline is divided into four main components:

  • Scene Classification: Uses traditional computer-vision features (DAISY + HOG) and SVM classifiers trained on the Scene-15 dataset.

  • Dominant Color Analysis: Applies K-Means clustering to identify the most prominent colors in the image, which are then associated with affective cues.

  • Object Detection: Uses a pretrained YOLOv8 model to identify semantically meaningful objects that help refine musical instrumentation and texture.

  • Music Generation: Visual features are converted into a structured textual description, which conditions the MusicGen-Small model to synthesize audio.
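As an illustration of the first component, a DAISY + HOG feature extractor feeding a linear SVM might look like the sketch below. The descriptor parameters and the synthetic two-class "textures" are stand-ins for the repository's actual Scene-15 training setup:

```python
import numpy as np
from skimage.feature import daisy, hog
from sklearn.svm import SVC

def scene_features(gray):
    """Concatenate DAISY and HOG descriptors for one grayscale image."""
    d = daisy(gray, step=32, radius=24, rings=2, histograms=4, orientations=4).ravel()
    h = hog(gray, orientations=8, pixels_per_cell=(32, 32), cells_per_block=(1, 1))
    return np.concatenate([d, h])

# Toy two-class demo on synthetic 128x128 images (stand-in for Scene-15).
rng = np.random.default_rng(0)
smooth = [rng.normal(0.5, 0.02, (128, 128)) for _ in range(8)]
striped = [np.tile(np.sin(np.linspace(0, 20, 128)), (128, 1)) * 0.5 + 0.5
           + rng.normal(0, 0.02, (128, 128)) for _ in range(8)]
X = np.array([scene_features(im) for im in smooth + striped])
y = np.array([0] * 8 + [1] * 8)

clf = SVC(kernel="linear").fit(X, y)
print(clf.score(X, y))
```

On real Scene-15 images the descriptors would be computed the same way, with the SVM trained per scene category.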


Repository Structure

ChordShot/
│
├── app.py
│   Main application entry point. Handles image upload, feature extraction,
│   prompt construction, and music generation.
│
├── music_gen.py
│   Core logic for generating music using the MusicGen model based on
│   image-derived semantic prompts.
│
├── image_features.json
│   Stores extracted visual features (scene label, dominant colors,
│   detected objects) for a given input image.
│
├── generated_music.wav
├── music_from_image.wav
├── musicgen_output.wav
│   Example audio outputs generated by the system during experimentation
│   and testing.
│
├── models/
│   Contains pretrained and serialized models used in the pipeline.
│
├── yolov8m.pt
├── yolov8l.pt
│   Pretrained YOLOv8 object detection weights (medium and large variants).
│
├── static/
│   Static assets used by the web interface (CSS, images, frontend resources).
│
├── templates/
│   HTML templates for the Flask-based user interface.
│
├── Reviews/
│   Contains project review and presentation PDFs used during evaluations.
│
├── requirements.txt
│   Python dependencies required to run the project.
│
├── README.md
│   Project documentation.

How Music Is Generated

Instead of feeding raw image data directly into a generative model, ChordShot follows an interpretable intermediate step.

Visual features are mapped to three musical attributes:

  • tempo (slow / moderate / fast)
  • mood (calm, energetic, or ambient)
  • instrumentation (acoustic, electronic, or atmospheric)

These attributes are combined into a natural-language description that reflects the image context. This description is then passed to the MusicGen-Small model, which generates a waveform of approximately 30 seconds at a 16 kHz sampling rate.
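A minimal sketch of this prompt-construction step is shown below. The mapping tables and function name are illustrative assumptions, not the repository's actual rules:

```python
# Hypothetical lookup tables; the real project may use richer heuristics.
MOOD_BY_SCENE = {"forest": "calm", "street": "energetic", "bedroom": "ambient"}
TEMPO_BY_MOOD = {"calm": "slow", "energetic": "fast", "ambient": "moderate"}

def build_prompt(scene, colors, objects):
    """Turn extracted visual features into a MusicGen-style text description."""
    mood = MOOD_BY_SCENE.get(scene, "ambient")
    tempo = TEMPO_BY_MOOD[mood]
    palette = ", ".join(colors)
    things = ", ".join(objects) if objects else "no distinct objects"
    return (f"A {tempo}-tempo, {mood} instrumental piece inspired by a "
            f"{scene} scene with {palette} tones, featuring {things}.")

print(build_prompt("forest", ["green", "brown"], ["tree", "bird"]))
# A slow-tempo, calm instrumental piece inspired by a forest scene
# with green, brown tones, featuring tree, bird.
```

The resulting string would then condition MusicGen-Small (e.g., via Audiocraft's `MusicGen.get_pretrained("facebook/musicgen-small")`) to synthesize the audio clip.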

This design choice makes the system more interpretable and easier to modify or extend.


Running the Project

Requirements

  • Python 3.9+
  • PyTorch
  • FFmpeg (for audio handling)

Installation

pip install -r requirements.txt

Example Usage

python app.py --image path/to/image.jpg

The generated music will be saved as a .wav file in the output directory.


Results and Observations

  • Scene classification achieved around 76% accuracy on the Scene-15 dataset.
  • Generated music generally aligns well with the perceived mood of the input image.
  • Users reported stronger emotional consistency when color and object information were both included, compared to using scene context alone.

These observations are discussed in more detail in the paper included in this repository.


Limitations

  • Visual-to-music mappings are currently heuristic-based.
  • MusicGen processes text prompts, not images directly.
  • Output duration and audio quality are limited by the chosen model.
  • Real-time performance depends on hardware capability.

Future Directions

  • Learning visual–musical mappings using multimodal training
  • Supporting longer and higher-quality compositions
  • Adding user feedback and control mechanisms
  • Exploring real-time and interactive applications
  • Investigating direct image-to-audio conditioning models

Paper

The full paper describing the methodology, experiments, and analysis is available in this repository:

/paper/ChordShot_Paper.pdf

Authors

  • Varun M - Final year CSE, CAHCET
  • Vishaal K R - Final year CSE, CAHCET
  • Sujithkumar P - Final year CSE, CAHCET

Project Guide: Dr. K. Abrar Ahmed, Department of Computer Science and Engineering, C. Abdul Hakeem College of Engineering and Technology
