WoW-Bench: Evaluating Fine-Grained Acoustic Perception in Audio-Language Models via Marine Mammal Vocalizations

1Carnegie Mellon University
2Seoul National University
3NVIDIA
*This work was done at Seoul National University

Abstract

Large audio language models (LALMs) extend language understanding into the auditory domain, yet their ability to perform low-level listening, such as pitch and duration detection, remains underexplored. Low-level listening is critical for real-world, out-of-distribution tasks where models must reason about unfamiliar sounds based on fine-grained acoustic cues. To address this gap, we introduce the World-of-Whale benchmark (WoW-Bench) to evaluate low-level auditory perception and cognition using marine mammal vocalizations. WoW-Bench is composed of a Perception benchmark for categorizing novel sounds and a Cognition benchmark, inspired by Bloom's taxonomy, that assesses the abilities to remember, understand, apply, and analyze sound events. For the Cognition benchmark, we additionally introduce distractor questions to evaluate whether models are truly solving problems through listening rather than relying on other heuristics. Experiments with state-of-the-art LALMs show performance far below human levels, indicating a need for stronger auditory grounding in LALMs.

Overview


The World-of-Whale (WoW) benchmark evaluates the low-level listening capabilities of LALMs using marine mammal vocalizations, which are rarely represented in conventional datasets and span a broad acoustic range.

It is composed of a Perception benchmark, which assesses the perceptual generalization of LALMs by evaluating their ability to categorize sounds into less familiar classes, and a Cognition benchmark, which assesses whether models can cognitively process fine-grained acoustic characteristics and perceived events through low-level listening. It also contains distractor questions to evaluate whether models are truly solving tasks through listening rather than relying on shallow heuristics or linguistic priors.
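The distractor questions pair each original item with a reworded variant (as in the Apply-Frequency example below, where "The first sound" becomes "Sound 1"). A minimal sketch of how such paired multiple-choice responses might be scored, assuming answers are recorded as option letters keyed by question ID; the function names and data layout here are illustrative, not the authors' evaluation code:

```python
def accuracy(predictions: dict, gold: dict) -> float:
    """Fraction of multiple-choice questions answered correctly.

    predictions/gold map question IDs to option letters ("A".."D").
    """
    assert predictions.keys() == gold.keys()
    return sum(predictions[q] == gold[q] for q in gold) / len(gold)


def distractor_consistency(pred_orig: dict, pred_distr: dict, gold: dict) -> float:
    """Fraction of questions answered correctly on BOTH the original
    wording and its distractor paraphrase. A model that truly solves the
    task by listening should keep the correct answer when only the surface
    wording of the options changes."""
    return sum(
        pred_orig[q] == gold[q] and pred_distr[q] == gold[q] for q in gold
    ) / len(gold)


# Toy usage: the model gets q1 right under both wordings but flips on q2.
gold = {"q1": "A", "q2": "C"}
pred_orig = {"q1": "A", "q2": "C"}
pred_distr = {"q1": "A", "q2": "B"}
print(accuracy(pred_orig, gold))                        # 1.0
print(distractor_consistency(pred_orig, pred_distr, gold))  # 0.5
```

Comparing plain accuracy against this paired consistency exposes models whose correct answers come from linguistic priors rather than the audio itself.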

Example Questions

Audio Options


Question: Which type of vocalization is most likely identified in the sound recording?
Type: Perception/Vocalization
A. Continuous modulated tones
B. High-pitched whistles
C. Single moan
D. Series of rapid pulsed clicks


Question: Based on the acoustic characteristics of the sound, which of the following best describes the main feature of the recording?
Type: Cognition/Understand
A. Repetitive short broadband bursts.
B. A continuous low-frequency tone around 400 Hz.
C. Modulated mid-frequency tones primarily between 1–4 kHz.
D. A high-pitched modulating tone primarily above 8 kHz.


Question: Given the following sound sequence: The first sound occurs before the first silence, the second sound occurs after the first silence, and the third sound occurs after the second silence. Which sound is most dominant in higher frequencies?
Type: Cognition/Apply-Frequency
A. The first sound
B. The second sound
C. The third sound
D. All the sounds are identical in frequency range


Question: Given the following sound sequence: Sound 1 occurs before the first silence, Sound 2 occurs after the first silence, and Sound 3 occurs after the second silence. Which sound is most dominant in higher frequencies?
Type: Distractor: Cognition/Apply-Frequency
A. Sound 1
B. Sound 2
C. Sound 3
D. All the same

Example questions across all tasks, each paired with a spectrogram of the corresponding input audio.


Experimental Results

Acknowledgements

The audio recordings and associated metadata used in this work were sourced from the Watkins Marine Mammal Sound Database, Woods Hole Oceanographic Institution, and the New Bedford Whaling Museum. We gratefully acknowledge the New Bedford Whaling Museum for granting permission to use the database for research purposes.

BibTeX

@article{kim2025wow,
  title={WoW-Bench: Evaluating Fine-Grained Acoustic Perception in Audio-Language Models via Marine Mammal Vocalizations},
  author={Kim, Jaeyeon and Yun, Heeseung and Woo, SangHoon and Yang, Chao-Han Huck and Kim, Gunhee},
  journal={arXiv preprint arXiv:2508.20976},
  year={2025}
}