# Thinking with Sound: Audio Chain-of-Thought Enables Multimodal Reasoning in Large Audio-Language Models
Zhen Xiong¹, Yujun Cai², Zhecheng Li³, Junsong Yuan⁴, Yiwei Wang⁵
¹USC, ²UQ, ³UCSD, ⁴UB, ⁵UC Merced
This repository contains the implementation of TwS (Thinking-with-Sound), a training-free framework that enables Large Audio-Language Models (LALMs) to perform multi-step reasoning by interleaving linguistic analysis with dynamic audio manipulation.
- Agentic Tool-using Framework: No additional training required
- Model Agnostic Pipeline: Our framework does not rely on any specific model architecture
- Scales with Model Size: Larger LALMs benefit more from the TwS framework
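Model-agnosticism here means TwS only needs a text-in/text-out callable for the backbone model. A minimal sketch of such an adapter — the callable contract and the `chat` method below are hypothetical, so check the repo for the exact `model_generate_fn` signature:

```python
from typing import Callable

def make_generate_fn(backend) -> Callable[[str], str]:
    """Wrap any backend exposing a chat(prompt) -> str method (hypothetical
    interface) into the plain prompt -> text callable TwS expects."""
    def generate(prompt: str) -> str:
        return backend.chat(prompt)
    return generate

class DummyBackend:
    """Stand-in for a real LALM client, used here only for illustration."""
    def chat(self, prompt: str) -> str:
        return "neutral"

generate_fn = make_generate_fn(DummyBackend())
print(generate_fn("Classify the emotion in this audio"))
```

Swapping models then amounts to swapping the backend object behind the adapter, with no change to the TwS loop itself.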
| Model | MELD Clean | MELD-Hard1k Baseline | MELD-Hard1k + TwS | Δ (TwS − Baseline) |
|---|---|---|---|---|
| Qwen2.5-Omni (3B) | 50.18% | 27.44% | 52.17% | +24.73% |
| Qwen2.5-Omni (7B) | 47.65% | 12.36% | 48.97% | +36.61% |
| Voxtral (24B) | 51.62% | 24.55% | 49.49% | +24.94% |
Key Finding: Under MELD-Hard1k perturbations, baseline LALMs lose up to 74% of their clean-audio accuracy (relative). TwS recovers performance to near, or even above, clean-audio levels.
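The relative degradation behind this finding can be recomputed directly from the table above (the printed percentages are derived from those numbers, not additional results):

```python
# Relative accuracy drop (clean -> MELD-Hard1k baseline), from the table above.
table = {
    "Qwen2.5-Omni (3B)": (50.18, 27.44, 52.17),  # clean, hard, hard + TwS
    "Qwen2.5-Omni (7B)": (47.65, 12.36, 48.97),
    "Voxtral (24B)": (51.62, 24.55, 49.49),
}
drops = {model: round(100 * (clean - hard) / clean, 1)
         for model, (clean, hard, _) in table.items()}
for model, drop in drops.items():
    print(f"{model}: -{drop}% relative accuracy under MELD-Hard1k noise")
```

The 7B model degrades the most in relative terms (74.1%), which is also where TwS yields its largest absolute recovery.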
TwS alternates between linguistic reasoning and audio manipulation, drawing on a toolkit of 21 audio operators in four categories:
- Denoising (4 ops): Spectral/Wiener denoising, echo cancellation
- Enhancement (5 ops): Spectral enhancement, compression, harmonic separation
- Normalization (6 ops): Amplitude/loudness normalization, pre-emphasis
- Analysis (6 ops): Spectral analysis, pitch tracking, energy/temporal analysis
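To give a flavor of what a denoising operator can look like, here is a toy spectral-subtraction denoiser in NumPy. This is an illustrative sketch, not the repo's spectral denoising implementation: it estimates a noise magnitude profile from an assumed noise-only lead-in and subtracts it frame by frame.

```python
import numpy as np

def spectral_subtract(audio: np.ndarray, sr: int, noise_sec: float = 0.25,
                      frame: int = 512, hop: int = 128) -> np.ndarray:
    """Toy spectral-subtraction denoiser (illustrative only).

    Estimates a noise magnitude profile from the first `noise_sec` seconds
    (assumed noise-only), subtracts it from each STFT frame's magnitude,
    and reconstructs the signal by overlap-add with the original phase."""
    win = np.hanning(frame)
    # Windowed STFT frames
    n = 1 + (len(audio) - frame) // hop
    frames = np.stack([audio[i * hop:i * hop + frame] * win for i in range(n)])
    spec = np.fft.rfft(frames, axis=1)
    mag, phase = np.abs(spec), np.angle(spec)
    # Noise profile: average magnitude over the leading (noise-only) frames
    noise_frames = max(1, int(noise_sec * sr / hop))
    noise_mag = mag[:noise_frames].mean(axis=0)
    # Subtract the profile, clamping at zero to avoid negative magnitudes
    clean_mag = np.maximum(mag - noise_mag, 0.0)
    clean = np.fft.irfft(clean_mag * np.exp(1j * phase), n=frame, axis=1)
    # Overlap-add resynthesis with window-energy normalization
    out = np.zeros(len(audio))
    norm = np.zeros(len(audio))
    for i in range(n):
        out[i * hop:i * hop + frame] += clean[i] * win
        norm[i * hop:i * hop + frame] += win ** 2
    return out / np.maximum(norm, 1e-8)
```

In TwS, operators like this are not applied blindly: the model decides during reasoning when a manipulation is warranted and re-listens to the transformed audio.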
## Installation

```bash
# Clone the repository
git clone https://github.com/eric2i/Think-with-Sound.git
cd Think-with-Sound
```

## Quick Start

```python
import librosa

from framework.tws_engine import TwSEngine
from framework.prompt_templates import PromptTemplates, MELD_EMOTIONS
from operators import OperatorRegistry

# Load audio
audio, sr = librosa.load('path/to/audio.wav')

# Initialize components
registry = OperatorRegistry()
engine = TwSEngine(registry, max_steps=5)
templates = PromptTemplates(registry)

# Generate prompt
prompt = templates.generate_init_prompt(
    instruction="Classify the emotion in this audio",
    emotion_categories=MELD_EMOTIONS,  # classification categories
    use_tws=True,
)

# Run TwS inference
result = engine.run_inference(
    audio=audio,
    sr=sr,
    initial_prompt=prompt,
    model_generate_fn=your_model.generate,  # any instruction-following LALM wrapper
)

print(f"Predicted emotion: {result['final_answer']}")
print(f"Tools called: {[tc['tool'] for tc in result['tool_calls_made']]}")
```

## Citation

If you find this work useful, please cite:
```bibtex
@article{xiong2025thinking,
  title={Thinking with Sound: Audio Chain-of-Thought Enables Multimodal Reasoning in Large Audio-Language Models},
  author={Xiong, Zhen and Cai, Yujun and Li, Zhecheng and Yuan, Junsong and Wang, Yiwei},
  journal={arXiv preprint arXiv:2509.21749},
  year={2025}
}
```