ViSAGe: Video-to-Spatial Audio Generation

Jaeyeon Kim, Heeseung Yun, Gunhee Kim

Vision and Learning Lab, Seoul National University
ICLR 2025

Abstract

Spatial audio is essential for enhancing the immersiveness of audio-visual experiences, yet its production typically demands complex recording systems and specialized expertise. In this work, we address the novel problem of generating first-order ambisonics, a widely used spatial audio format, directly from silent videos. To support this task, we introduce YT-Ambigen, a dataset comprising 102K 5-second YouTube video clips paired with corresponding first-order ambisonics. We also propose new evaluation metrics to assess the spatial aspect of generated audio based on audio energy maps and saliency metrics. Furthermore, we present Video-to-Spatial Audio Generation (ViSAGe), an end-to-end framework that generates first-order ambisonics from silent video frames by leveraging CLIP visual features and autoregressive neural audio codec modeling with both directional and visual guidance. Experimental results demonstrate that ViSAGe produces plausible and coherent first-order ambisonics, outperforming two-stage approaches consisting of video-to-audio generation and audio spatialization. Qualitative examples further illustrate that ViSAGe generates temporally aligned high-quality spatial audio that adapts to viewpoint changes.

Overview

Given a silent video and the camera direction, the model generates the corresponding first-order ambisonics. The camera direction gives a cue about where the visual event occurs in the 3D sound field.
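
For readers unfamiliar with the output format, the sketch below shows how a mono signal at a fixed direction maps to the four first-order ambisonics channels. It assumes the common AmbiX convention (ACN channel order W, Y, Z, X with SN3D normalization), with azimuth measured anti-clockwise from the front and elevation measured from the horizontal plane. These angle conventions and the helper name `encode_foa` are assumptions for illustration; they are not taken from ViSAGe and differ from the camera direction convention used in the demos.

```python
import math


def encode_foa(mono, azimuth_deg, elevation_deg):
    """Encode a mono signal as first-order ambisonics (AmbiX: ACN order, SN3D).

    Illustrative helper only; not the ViSAGe model or its preprocessing.
    azimuth_deg: anti-clockwise from the front; elevation_deg: from the horizon.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    gains = (
        1.0,                          # W: omnidirectional
        math.sin(az) * math.cos(el),  # Y: left-right
        math.sin(el),                 # Z: up-down
        math.cos(az) * math.cos(el),  # X: front-back
    )
    # One list of samples per channel, in W, Y, Z, X order.
    return [[g * s for s in mono] for g in gains]


# A source straight ahead excites only the W and X channels.
w, y, z, x = encode_foa([1.0, 0.5], azimuth_deg=0.0, elevation_deg=0.0)
```

A generative model such as ViSAGe has to produce all four channels jointly from video, which is a harder problem than this fixed-direction encoding of a known source.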

We propose YT-Ambigen, a new dataset comprising YouTube videos paired with first-order ambisonics, tailored for audio generation.

We present Video-to-Spatial Audio Generation (ViSAGe), an end-to-end framework designed to generate spatial audio using CLIP features, patchwise energy maps, and neural audio codecs.

ViSAGe outperforms two-stage approaches, which separately handle video-to-audio generation and audio spatialization, across all metrics.

Qualitative examples of generated first-order ambisonics.

Guidelines for Listening Demo Videos

Please use your headphones 🎧 for the best audio experience.
Camera Direction Convention: (elevation, azimuth)
- Elevation: Values range from 0 to 90 and represent the angle from the vertical (z-axis). Each unit corresponds to 2 degrees, so the range covers 0 to 180 degrees.
- Azimuth: Values range from 0 to 180 and represent the anti-clockwise angle from the y-axis (back side). Each unit corresponds to 2 degrees, so the range covers 0 to 360 degrees.
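
The convention above amounts to doubling each value to obtain degrees. A minimal sketch follows. The helper name `camera_direction_to_degrees` is hypothetical, and the horizon-relative angle is a derived convenience that the page itself does not define.

```python
def camera_direction_to_degrees(elevation, azimuth):
    """Convert an (elevation, azimuth) pair given in 2-degree units to degrees.

    Hypothetical helper for reading the demo labels; not part of any release.
    """
    if not 0 <= elevation <= 90:
        raise ValueError("elevation must be in [0, 90]")
    if not 0 <= azimuth <= 180:
        raise ValueError("azimuth must be in [0, 180]")
    polar_deg = 2 * elevation           # angle from the vertical z-axis
    azimuth_deg = 2 * azimuth           # anti-clockwise angle from the y-axis (back side)
    above_horizon_deg = 90 - polar_deg  # derived: positive means above the horizon
    return polar_deg, azimuth_deg, above_horizon_deg


print(camera_direction_to_degrees(62, 87))
```

For the camera direction (62, 87), this gives 124 degrees from the vertical (34 degrees below the horizon) and an azimuth of 174 degrees, which is nearly opposite the back side.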

Examples from YT-Ambigen

Camera Direction: (37, 75), Front Left

Camera Direction: (47, 64), Front Left

Camera Direction: (62, 87), Front Low

Camera Direction: (22, 80), Up Right

ViSAGe Generation Results

We also provide original panoramic videos. Feel free to rotate the video to assess the fidelity of the generated three-dimensional sound field.

Panels: Ground Truth, Ground Truth Panorama, ViSAGe, ViSAGe Panorama

Camera Direction: (51, 8), Back

Camera Direction: (45, 97), Front Left

Camera Direction: (63, 90), Front Low

Camera Direction: (52, 71), Front Right

Examples for Moving or Non-centering Objects

Panels: Ground Truth, ViSAGe, Ground Truth, ViSAGe

Camera Direction: (30, 64), Upper Front-Right

Camera Direction: (49, 78), Front

Camera Direction: (35, 133), Upper Left

Camera Direction: (44, 95), Front

Comparison with Baselines (Panoramic Videos)

Panels: ViSAGe, SpecVQGAN + Ambi Enc., SpecVQGAN + Audio Spatial., Ground Truth

Camera Direction: (44, 49), Front Right

Camera Direction: (45, 92), Front

Panels: ViSAGe, Diff-foley + Ambi Enc., Diff-foley + Audio Spatial., Ground Truth

Camera Direction: (43, 160), Back Left

Camera Direction: (45, 33), Back Right

BibTeX

@inproceedings{kim2025visage, 
  title={ViSAGe: Video-to-Spatial Audio Generation}, 
  author={Jaeyeon Kim and Heeseung Yun and Gunhee Kim}, 
  booktitle={ICLR}, 
  year={2025}
}