
# Think-with-Sound (TwS)

[Paper](https://arxiv.org/abs/2509.21749) · Project Page

Thinking with Sound: Audio Chain-of-Thought Enables Multimodal Reasoning in Large Audio-Language Models

Zhen Xiong¹, Yujun Cai², Zhecheng Li³, Junsong Yuan⁴, Yiwei Wang⁵

¹USC, ²UQ, ³UCSD, ⁴UB, ⁵UC Merced

This repository contains the implementation of TwS (Thinking-with-Sound), a training-free framework that enables Large Audio-Language Models (LALMs) to perform multi-step reasoning by interleaving linguistic analysis with dynamic audio manipulation.

## 🔥 Highlights

- **Agentic tool-using framework**: no additional training required
- **Model-agnostic pipeline**: does not rely on any specific model architecture
- **Scales with model size**: larger LALMs benefit more from the TwS framework

## 🎯 Key Results

| Model | MELD Clean | MELD-Hard1k Baseline | MELD-Hard1k + TwS | Δ |
|---|---|---|---|---|
| Qwen2.5-Omni (3B) | 50.18% | 27.44% | 52.17% | +24.73% |
| Qwen2.5-Omni (7B) | 47.65% | 12.36% | 48.97% | +36.61% |
| Voxtral (24B) | 51.62% | 24.55% | 49.49% | +24.94% |

**Key finding:** Baseline LALMs suffer >50% performance degradation on perturbed audio. TwS recovers performance to near-clean-audio levels.

## 🏗️ Overview

### 🧠 Interleaved reasoning

TwS alternates between steps of linguistic reasoning and dynamic audio manipulation, so the model can re-listen to transformed audio before committing to an answer.
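The alternation can be sketched as a simple loop: the model generates text, and whenever that text requests an operator, the engine applies it to the audio and continues reasoning on the result. All names below (`interleaved_loop`, the `CALL <op>` convention, the stub model) are illustrative assumptions; the actual engine lives in `framework/tws_engine.py`.

```python
import re

def interleaved_loop(audio, model_generate_fn, operators, max_steps=5):
    """Alternate between model text generation and audio manipulation
    (a sketch; not the repository's actual control flow)."""
    transcript = []
    for _ in range(max_steps):
        reply = model_generate_fn(audio, "\n".join(transcript))
        transcript.append(reply)
        # Look for a tool call such as "CALL denoise" in the reply
        match = re.search(r"CALL (\w+)", reply)
        if match is None:                  # no tool call -> final answer
            return reply, transcript
        op = operators[match.group(1)]     # apply the requested operator
        audio = op(audio)                  # reasoning continues on new audio
    return transcript[-1], transcript

# Stub model: first asks for denoising, then answers.
def stub_model(audio, history):
    return "CALL denoise" if "CALL" not in history else "ANSWER: joy"

answer, steps = interleaved_loop(
    [0.1, 0.2], stub_model,
    {"denoise": lambda a: [x * 0.5 for x in a]})
print(answer)  # -> "ANSWER: joy"
```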

### 🔧 Audio Operators (21 operators across 4 categories)

- **Denoising (4 ops)**: spectral/Wiener denoising, echo cancellation
- **Enhancement (5 ops)**: spectral enhancement, compression, harmonic separation
- **Normalization (6 ops)**: amplitude/loudness normalization, pre-emphasis
- **Analysis (6 ops)**: spectral analysis, pitch tracking, energy/temporal analysis
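As a feel for what a Denoising-category operator does, here is a minimal spectral-subtraction sketch in NumPy: estimate a stationary noise floor, subtract it from the magnitude spectrum, and resynthesize with the original phase. This is an assumption about the general technique, not the repository's implementation (see `operators/` for the real ones).

```python
import numpy as np

def spectral_denoise(audio, noise_floor=0.1):
    """Crude spectral subtraction: remove an estimated noise floor
    from the magnitude spectrum, keep the original phase."""
    spectrum = np.fft.rfft(audio)
    mag, phase = np.abs(spectrum), np.angle(spectrum)
    noise = noise_floor * mag.mean()        # stationary-noise estimate
    mag = np.maximum(mag - noise, 0.0)      # subtract, clip at zero
    return np.fft.irfft(mag * np.exp(1j * phase), n=len(audio))

clean = spectral_denoise(np.sin(np.linspace(0, 8 * np.pi, 1024)))
```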

## 📦 Installation

```shell
# Clone the repository
git clone https://github.com/eric2i/Think-with-Sound.git
cd Think-with-Sound
```

## 🚀 Quick Start

```python
import librosa

from framework.tws_engine import TwSEngine
from framework.prompt_templates import PromptTemplates, MELD_EMOTIONS
from operators import OperatorRegistry

# Load audio
audio, sr = librosa.load('path/to/audio.wav')

# Initialize components
registry = OperatorRegistry()
engine = TwSEngine(registry, max_steps=5)
templates = PromptTemplates(registry)

# Generate prompt
prompt = templates.generate_init_prompt(
    instruction="Classify the emotion in this audio",
    emotion_categories=MELD_EMOTIONS,  # classification categories
    use_tws=True
)

# Run TwS inference
result = engine.run_inference(
    audio=audio,
    sr=sr,
    initial_prompt=prompt,
    model_generate_fn=your_model.generate  # any instruction-following LALM wrapper
)

print(f"Predicted emotion: {result['final_answer']}")
print(f"Tools called: {[tc['tool'] for tc in result['tool_calls_made']]}")
```
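The engine only needs `model_generate_fn` to be a text-returning callable; how you wrap your model is up to you. The exact signature below (`prompt`, optional `audio`/`sr`) is an assumption, so check `framework/tws_engine.py` for the interface your version expects. `EchoModel` is a hypothetical stand-in used purely for illustration.

```python
class EchoModel:
    """Hypothetical stand-in for a real LALM: returns a canned answer.
    A real wrapper would tokenize the prompt + audio and decode text."""
    def generate(self, prompt, audio=None, sr=None):
        return "FINAL ANSWER: neutral"

def make_generate_fn(model):
    """Adapt a model object into the callable shape the engine consumes."""
    def generate(prompt, audio=None, sr=None):
        return model.generate(prompt, audio=audio, sr=sr)
    return generate

fn = make_generate_fn(EchoModel())
print(fn("Classify the emotion"))  # -> "FINAL ANSWER: neutral"
```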

## 📄 Citation

If you find this work useful, please cite:

```bibtex
@article{xiong2025thinking,
  title={Thinking with Sound: Audio Chain-of-Thought Enables Multimodal Reasoning in Large Audio-Language Models},
  author={Xiong, Zhen and Cai, Yujun and Li, Zhecheng and Yuan, Junsong and Wang, Yiwei},
  journal={arXiv preprint arXiv:2509.21749},
  year={2025}
}
```
