ViSAGe: Video-to-Spatial Audio Generation

Vision and Learning Lab, Seoul National University
ICLR 2025

Abstract

Spatial audio is essential for enhancing the immersiveness of audio-visual experiences, yet its production typically demands complex recording systems and specialized expertise. In this work, we address the novel problem of generating first-order ambisonics, a widely used spatial audio format, directly from silent videos. To support this task, we introduce YT-Ambigen, a dataset comprising 102K 5-second YouTube video clips paired with corresponding first-order ambisonics. We also propose new evaluation metrics to assess the spatial aspect of generated audio based on audio energy maps and saliency metrics. Furthermore, we present Video-to-Spatial Audio Generation (ViSAGe), an end-to-end framework that generates first-order ambisonics from silent video frames by combining CLIP visual features with autoregressive neural audio codec modeling under both directional and visual guidance. Experimental results demonstrate that ViSAGe produces plausible and coherent first-order ambisonics, outperforming two-stage approaches consisting of video-to-audio generation and audio spatialization. Qualitative examples further illustrate that ViSAGe generates temporally aligned high-quality spatial audio that adapts to viewpoint changes.

Overview

Guidelines for Listening Demo Videos

  • Please use your headphones 🎧 for the best audio experience.
  • Camera Direction Convention: (elevation, azimuth)
    • Elevation: Values range from 0 to 90, where each unit corresponds to 2 degrees, giving the angle (0 to 180 degrees) from the vertical z-axis.
    • Azimuth: Values range from 0 to 180, where each unit corresponds to 2 degrees, giving the anti-clockwise angle (0 to 360 degrees) from the y-axis (back side).
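The convention above can be sketched as a small conversion helper. This is an illustrative function (the name `units_to_degrees` is ours, not from the paper): it simply scales each unit by 2 degrees.

```python
def units_to_degrees(elevation: float, azimuth: float) -> tuple[float, float]:
    """Convert the page's (elevation, azimuth) units to degrees.

    Each unit is 2 degrees: elevation 0-90 units spans 0-180 degrees
    from the vertical z-axis; azimuth 0-180 units spans 0-360 degrees
    anti-clockwise from the y-axis (back side).
    """
    return 2 * elevation, 2 * azimuth


# Example: a camera direction of (45, 90) is level with the horizon
# (90 degrees from vertical) and faces the front (180 degrees from back).
print(units_to_degrees(45, 90))  # → (90, 180)
```

For instance, the "Front" examples on this page cluster around azimuth values near 90 units (180 degrees from the back), while "Back" examples sit near 0 or 180 units.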

Examples from YT-Ambigen

Camera Direction: (37, 75), Front Left

Camera Direction: (47, 64), Front Left

Camera Direction: (62, 87), Front Low

Camera Direction: (22, 80), Up Right

ViSAGe Generation Results

  • We also provide the original panoramic videos. Feel free to rotate the video to assess the fidelity of the generated three-dimensional sound field.

Ground Truth

Ground Truth Panorama

ViSAGe

ViSAGe Panorama

Camera Direction: (51, 8), Back

Camera Direction: (45, 97), Front Left

Camera Direction: (63, 90), Front Low

Camera Direction: (52, 71), Front Right

Examples for Moving or Non-centering Objects

Ground Truth

ViSAGe

Ground Truth

ViSAGe

Camera Direction: (30, 64), Upper Front-Right

Camera Direction: (49, 78), Front

Camera Direction: (35, 133), Upper Left

Camera Direction: (44, 95), Front

Comparison with Baselines (Panoramic Videos)

ViSAGe

SpecVQGAN + Ambi Enc.

SpecVQGAN + Audio Spatial.

Ground Truth

Camera Direction: (44, 49), Front Right

Camera Direction: (45, 92), Front

ViSAGe

Diff-foley + Ambi Enc.

Diff-foley + Audio Spatial.

Ground Truth

Camera Direction: (43, 160), Back Left

Camera Direction: (45, 33), Back Right

Examples of Saliency Metrics

Original Audio Energy Map

Modified for Spherical Evaluation

Ground Truth Audio Energy Map

Generated Audio Energy Map

CC: -0.087, AUC: 0.443

CC: 0.337, AUC: 0.640

CC: 0.632, AUC: 0.833

CC: 0.942, AUC: 0.998
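The CC and AUC scores shown above compare a generated audio energy map against the ground-truth one. The sketch below is our illustrative approximation, not the paper's exact implementation: it beamforms first-order ambisonics (assuming a simple W/X/Y/Z channel layout; real recordings may use ACN/SN3D ordering) onto a spherical grid to build an energy map, then scores a generated map with Pearson CC and a simplified saliency-style AUC that treats the top ground-truth cells as positives.

```python
import numpy as np


def foa_energy_map(w, x, y, z, n_elev=16, n_azim=32):
    """Estimate a spherical audio energy map from first-order ambisonics.

    w, x, y, z are 1-D arrays of FOA channel samples (hypothetical W/X/Y/Z
    layout). For each grid direction, form a first-order beam and average
    its squared output over time.
    """
    elev = np.linspace(0, np.pi, n_elev)                    # polar angle from +z
    azim = np.linspace(0, 2 * np.pi, n_azim, endpoint=False)
    energy = np.empty((n_elev, n_azim))
    for i, th in enumerate(elev):
        for j, ph in enumerate(azim):
            dx = np.sin(th) * np.cos(ph)
            dy = np.sin(th) * np.sin(ph)
            dz = np.cos(th)
            s = w + dx * x + dy * y + dz * z                # beam toward (th, ph)
            energy[i, j] = np.mean(s ** 2)
    return energy


def cc(gen_map, gt_map):
    """Pearson linear correlation coefficient between two energy maps."""
    g = (gen_map - gen_map.mean()) / (gen_map.std() + 1e-8)
    t = (gt_map - gt_map.mean()) / (gt_map.std() + 1e-8)
    return float((g * t).mean())


def auc(gen_map, gt_map, pct=90):
    """Simplified saliency-style AUC: cells in the top `pct` percentile of
    the ground-truth map are positives; score how well the generated map
    ranks them above the rest (Mann-Whitney formulation)."""
    labels = (gt_map >= np.percentile(gt_map, pct)).ravel()
    pos = gen_map.ravel()[labels]
    neg = gen_map.ravel()[~labels]
    greater = np.sum(pos[:, None] > neg[None, :])
    ties = np.sum(pos[:, None] == neg[None, :])
    return float((greater + 0.5 * ties) / (pos.size * neg.size))
```

With identical generated and ground-truth maps, CC approaches 1 and AUC equals 1, matching the intuition behind the best-scoring example above; an unrelated map drives CC toward 0 and AUC toward 0.5.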

BibTeX

@inproceedings{kim2025visage, 
  title={ViSAGe: Video-to-Spatial Audio Generation}, 
  author={Jaeyeon Kim and Heeseung Yun and Gunhee Kim}, 
  booktitle={ICLR}, 
  year={2025}
}