# Thinking with Sound: Audio Chain-of-Thought Enables Multimodal Reasoning in Large Audio-Language Models
Zhen Xiong¹, Yujun Cai², Zhecheng Li³, Junsong Yuan⁴, Yiwei Wang⁵
¹USC, ²UQ, ³UCSD, ⁴UB, ⁵UC Merced
This repository contains the implementation of TwS (Thinking-with-Sound), a training-free framework that enables Large Audio-Language Models (LALMs) to perform multi-step reasoning by interleaving linguistic analysis with dynamic audio manipulation.
- Agentic Tool-using Framework: No additional training required
- Model Agnostic Pipeline: Our framework does not rely on any specific model architecture
- Scales with Model Size: Larger LALMs benefit more from the TwS framework
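Model-agnosticism here means TwS only needs a text-in/text-out callable for the backbone model. A minimal sketch of such an adapter — the callable contract and the `chat` method below are hypothetical, so check the repo for the exact `model_generate_fn` signature:

```python
from typing import Callable

def make_generate_fn(backend) -> Callable[[str], str]:
    """Wrap any backend exposing a chat(prompt) -> str method (hypothetical
    interface) into the plain prompt -> text callable TwS expects."""
    def generate(prompt: str) -> str:
        return backend.chat(prompt)
    return generate

class DummyBackend:
    """Stand-in for a real LALM client, used here only for illustration."""
    def chat(self, prompt: str) -> str:
        return "neutral"

generate_fn = make_generate_fn(DummyBackend())
print(generate_fn("Classify the emotion in this audio"))
```

Swapping models then amounts to swapping the backend object behind the adapter, with no change to the TwS loop itself.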
| Model | MELD Clean | MELD-Hard1k Baseline | MELD-Hard1k + TwS | Δ (TwS − Baseline) |
|---|---|---|---|---|
| Qwen2.5-Omni (3B) | 50.18% | 27.44% | 52.17% | +24.73% |
| Qwen2.5-Omni (7B) | 47.65% | 12.36% | 48.97% | +36.61% |
| Voxtral (24B) | 51.62% | 24.55% | 49.49% | +24.94% |
Key Finding: Under MELD-Hard1k perturbations, baseline LALMs lose up to 74% of their clean-audio accuracy (relative). TwS recovers performance to near, or even above, clean-audio levels.
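The relative degradation behind this finding can be recomputed directly from the table above (the printed percentages are derived from those numbers, not additional results):

```python
# Relative accuracy drop (clean -> MELD-Hard1k baseline), from the table above.
table = {
    "Qwen2.5-Omni (3B)": (50.18, 27.44, 52.17),  # clean, hard, hard + TwS
    "Qwen2.5-Omni (7B)": (47.65, 12.36, 48.97),
    "Voxtral (24B)": (51.62, 24.55, 49.49),
}
drops = {model: round(100 * (clean - hard) / clean, 1)
         for model, (clean, hard, _) in table.items()}
for model, drop in drops.items():
    print(f"{model}: -{drop}% relative accuracy under MELD-Hard1k noise")
```

The 7B model degrades the most in relative terms (74.1%), which is also where TwS yields its largest absolute recovery.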
TwS alternates between linguistic reasoning and audio manipulation, drawing on a toolkit of 21 audio operators in four categories:
- Denoising (4 ops): Spectral/Wiener denoising, echo cancellation
- Enhancement (5 ops): Spectral enhancement, compression, harmonic separation
- Normalization (6 ops): Amplitude/loudness normalization, pre-emphasis
- Analysis (6 ops): Spectral analysis, pitch tracking, energy/temporal analysis
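To give a flavor of what a denoising operator can look like, here is a toy spectral-subtraction denoiser in NumPy. This is an illustrative sketch, not the repo's spectral denoising implementation: it estimates a noise magnitude profile from an assumed noise-only lead-in and subtracts it frame by frame.

```python
import numpy as np

def spectral_subtract(audio: np.ndarray, sr: int, noise_sec: float = 0.25,
                      frame: int = 512, hop: int = 128) -> np.ndarray:
    """Toy spectral-subtraction denoiser (illustrative only).

    Estimates a noise magnitude profile from the first `noise_sec` seconds
    (assumed noise-only), subtracts it from each STFT frame's magnitude,
    and reconstructs the signal by overlap-add with the original phase."""
    win = np.hanning(frame)
    # Windowed STFT frames
    n = 1 + (len(audio) - frame) // hop
    frames = np.stack([audio[i * hop:i * hop + frame] * win for i in range(n)])
    spec = np.fft.rfft(frames, axis=1)
    mag, phase = np.abs(spec), np.angle(spec)
    # Noise profile: average magnitude over the leading (noise-only) frames
    noise_frames = max(1, int(noise_sec * sr / hop))
    noise_mag = mag[:noise_frames].mean(axis=0)
    # Subtract the profile, clamping at zero to avoid negative magnitudes
    clean_mag = np.maximum(mag - noise_mag, 0.0)
    clean = np.fft.irfft(clean_mag * np.exp(1j * phase), n=frame, axis=1)
    # Overlap-add resynthesis with window-energy normalization
    out = np.zeros(len(audio))
    norm = np.zeros(len(audio))
    for i in range(n):
        out[i * hop:i * hop + frame] += clean[i] * win
        norm[i * hop:i * hop + frame] += win ** 2
    return out / np.maximum(norm, 1e-8)
```

In TwS, operators like this are not applied blindly: the model decides during reasoning when a manipulation is warranted and re-listens to the transformed audio.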
## Installation

```bash
# Clone the repository
git clone https://github.com/eric2i/Think-with-Sound.git
cd Think-with-Sound
```

## Quick Start

```python
import librosa

from framework.tws_engine import TwSEngine
from framework.prompt_templates import PromptTemplates, MELD_EMOTIONS
from operators import OperatorRegistry

# Load audio
audio, sr = librosa.load('path/to/audio.wav')

# Initialize components
registry = OperatorRegistry()
engine = TwSEngine(registry, max_steps=5)
templates = PromptTemplates(registry)

# Generate prompt
prompt = templates.generate_init_prompt(
    instruction="Classify the emotion in this audio",
    emotion_categories=MELD_EMOTIONS,  # classification categories
    use_tws=True,
)

# Run TwS inference
result = engine.run_inference(
    audio=audio,
    sr=sr,
    initial_prompt=prompt,
    model_generate_fn=your_model.generate,  # any instruction-following LALM wrapper
)

print(f"Predicted emotion: {result['final_answer']}")
print(f"Tools called: {[tc['tool'] for tc in result['tool_calls_made']]}")
```

## Citation

If you find this work useful, please cite:
```bibtex
@article{xiong2025thinking,
  title={Thinking with Sound: Audio Chain-of-Thought Enables Multimodal Reasoning in Large Audio-Language Models},
  author={Xiong, Zhen and Cai, Yujun and Li, Zhecheng and Yuan, Junsong and Wang, Yiwei},
  journal={arXiv preprint arXiv:2509.21749},
  year={2025}
}
```