Large audio language models (LALMs) extend language understanding into the auditory domain,
yet their ability to perform low-level listening, such as pitch and duration detection,
remains underexplored. However, low-level listening is critical for real-world,
out-of-distribution tasks where models must reason about unfamiliar sounds based on
fine-grained acoustic cues. To address this gap, we introduce the World-of-Whale benchmark
(WoW-Bench) to evaluate low-level auditory perception and cognition using marine mammal
vocalizations. WoW-bench is composed of a Perception benchmark for categorizing novel sounds
and a Cognition benchmark, inspired by Bloom's taxonomy, to assess the abilities to
remember, understand, apply, and analyze sound events. For the Cognition benchmark, we
additionally introduce distractor questions to evaluate whether models are truly solving
problems through listening rather than relying on other heuristics. Experiments with
state-of-the-art LALMs show performance far below human level, indicating the need for
stronger auditory grounding in LALMs.