
Self-Supervised Generation of Spatial Audio for 360° Video

Pedro Morgado
University of California, San Diego∗

Nuno Vasconcelos
University of California, San Diego

Timothy Langlois
Adobe Research, Seattle

Oliver Wang
Adobe Research, Seattle

Abstract

We introduce an approach to convert mono audio recorded by a 360° video camera into spatial audio, a representation of the distribution of sound over the full viewing sphere. Spatial audio is an important component of immersive 360° video viewing, but spatial audio microphones are still rare in current 360° video production. Our system consists of end-to-end trainable neural networks that separate individual sound sources and localize them on the viewing sphere, conditioned on multi-modal analysis of audio and 360° video frames. We introduce several datasets, including one filmed ourselves, and one collected in-the-wild from YouTube, consisting of 360° videos uploaded with spatial audio. During training, ground-truth spatial audio serves as self-supervision and a mixed down mono track forms the input to our network. Using our approach, we show that it is possible to infer the spatial location of sound sources based only on 360° video and a mono audio track.

1 Introduction

360° video provides viewers an immersive viewing experience where they are free to look in any direction, either by turning their heads with a Head-Mounted Display (HMD), or by mouse-control while watching the video in a browser (e.g., YouTube). Capturing 360° video involves filming the scene with multiple cameras and stitching the result together. While early systems relied on expensive rigs with carefully mounted cameras, recent consumer-level devices combine multiple lenses in a small fixed-body frame that enables automatic stitching, allowing 360° video to be recorded with a single push of a button.

As humans rely on audio localization cues for full scene awareness, spatial audio is a crucial component of 360° video. Spatial audio enables viewers to experience sound in all directions, while adjusting the audio in real time to match the viewing position. This gives users a more immersive experience, as well as providing cues about which part of the scene might have interesting content to look at. However, unlike 360° video, producing spatial audio content still requires a moderate degree of expertise. Most consumer-level 360° cameras only record mono audio, and syncing an external spatial audio microphone can be expensive and technically challenging. As a consequence, while most video platforms (e.g., YouTube and Facebook) support spatial audio, it is often ignored by content creators, and at the time of submission, a random polling of 1000 YouTube 360° videos yielded less than 5% with spatial audio.

In order to close this gap between the audio and visual experiences, we introduce three main contributions: (1) we formalize the 360° spatialization problem; (2) design the first 360° spatialization procedure; and (3) collect two datasets and propose an evaluation protocol to benchmark ours and future algorithms.

∗Contact author: [email protected]

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.


Figure 1: Architecture overview. Our approach is composed of four main blocks. The input video and audio signals are fed into the analysis block (a), which extracts high-level features. The separation block (b) then learns k time-frequency attenuation maps a^i(t, ω) to modulate the input STFT and produce modified waveforms f^i(t). The localization block (c) computes a set of linear transform weights w^i(t) that localize each source. In the ambisonics generation step (d), localization weights are then combined with the separated sound sources to produce the final spatial audio output.

360° spatialization aims to upconvert a single mono recording into spatial audio guided by the full 360° view video. More specifically, we seek to generate spatial audio in the form of a popular encoding format called first-order ambisonics (FOA), given the mono audio and corresponding 360° video as inputs. In addition to formulating the 360° spatialization task, we design the first data-driven system to upgrade mono audio using self-supervision from 360° videos recorded with spatial audio. The proposed procedure is based on a novel neural network architecture that disentangles two fundamental challenges in audio spatialization: the separation of sound sources from a mixed audio input and the respective localization of these sources. In order to train and validate our approach, we introduce two 360° video datasets with spatial audio, one recorded by ourselves in a constrained domain, and a large-scale dataset collected in-the-wild from YouTube. During training, the captured spatial audio serves as ground truth, with a mixed down mono version provided as input to our system. Experiments conducted on both datasets show that the proposed neural network can generate plausible spatial audio for 360° video. We further validate each component of the proposed architecture and show its superiority over a state-of-the-art, but domain-independent, baseline architecture.

In the interest of reproducibility, code, data and trained models will be made available to the community at https://pedro-morgado.github.io/spatialaudiogen.

2 Related Work

To the best of our knowledge, we propose the first system for audio spatialization. In addition to spatial audio, the fields most related to our work are self-supervised learning, audio generation, source separation and audio-visual cross-modal learning, which we now briefly describe.

Spatial audio Artificial environments, such as those rendered by game engines, can play sounds from any location in the video. This capability requires recording sound sources separately and mixing them according to the desired scene configuration (i.e., the positions of each source relative to the user). In a real world recording, however, sound sources cannot be recorded separately. In this case, where sound sources are naturally mixed, spatial audio is often encoded using Ambisonics [13, 9, 30].

Ambisonics aim to approximate the sound pressure field at a single point in space using a spherical harmonic decomposition. More specifically, an audio signal f(θ, t) arriving from direction θ = (ϕ, ϑ) (where ϕ is the zenith angle and ϑ the azimuth angle) at time t is represented by a truncated spherical harmonic expansion of order N

$$f(\boldsymbol{\theta}, t) = \sum_{n=0}^{N} \sum_{m=-n}^{n} Y_n^m(\varphi, \vartheta)\,\phi_n^m(t), \qquad (1)$$

where Y_n^m(ϕ, ϑ) is the real spherical harmonic of order n and degree m, and φ_n^m(t) are the coefficients of the expansion. For ease of notation, Y_n^m and φ_n^m can be stacked into vectors y_N and φ_N, and (Eq. 1) written as f(θ, t) = y_N^T(θ) φ_N(t).


In a controlled environment, sound sources with known locations can be synthetically encoded into ambisonics using their spherical harmonic projection. More specifically, given a set of k audio signals s_1(t), ..., s_k(t) originating from directions θ_1, ..., θ_k,

$$\boldsymbol{\phi}_N(t) = \sum_{i=1}^{k} \boldsymbol{y}_N(\boldsymbol{\theta}_i)\, s_i(t). \qquad (2)$$

For ambisonics playback, φ_N is then decoded into a set of speaker or headphone signals in order to provide a plane-wave reconstruction of the sound field. In sum, the coefficients φ_N, also known as ambisonic channels, are sufficient to encode and reproduce spatial audio. Hence, our goal is to generate φ_N from non-spatial audio and the corresponding video.
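As a concrete illustration of (Eq. 2), the following is a minimal sketch of synthetic first-order encoding, assuming the ambiX (ACN/SN3D) channel convention with unit-gain harmonics; the exact normalization used by a given toolchain may differ.

```python
import numpy as np

def foa_harmonics(azimuth, elevation):
    """First-order real spherical harmonics y_N(theta) for one direction,
    under the ambiX (ACN/SN3D) convention: channels ordered W, Y, Z, X."""
    w = np.ones_like(azimuth)
    y = np.sin(azimuth) * np.cos(elevation)
    z = np.sin(elevation)
    x = np.cos(azimuth) * np.cos(elevation)
    return np.stack([w, y, z, x], axis=0)                # shape (4,)

def encode_foa(sources, azimuths, elevations):
    """Synthetic FOA encoding of k point sources (Eq. 2):
    phi_N(t) = sum_i y_N(theta_i) * s_i(t)."""
    phi = np.zeros((4, sources.shape[1]))
    for s, az, el in zip(sources, azimuths, elevations):
        phi += foa_harmonics(az, el)[:, None] * s[None, :]
    return phi                                           # (4, T): W, Y, Z, X

# Example: two 1-second tones at 48 kHz placed to the left and right.
t = np.arange(48000) / 48000.0
s = np.stack([np.sin(2 * np.pi * 440 * t), np.sin(2 * np.pi * 220 * t)])
foa = encode_foa(s, azimuths=[np.pi / 2, -np.pi / 2], elevations=[0.0, 0.0])
```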

Self-supervised learning Neural networks have been successfully trained through self-supervision for tasks such as image super-resolution [10, 27] and image colorization [20, 46]. In the audio domain, self-supervision has also enabled the detection of sound-video misalignment [37] and audio super-resolution [31]. Inspired by these approaches, we propose a self-supervised technique for audio spatialization. We show that the generation of ambisonic audio can be learned using a dataset of 360° video with spatial audio collected in-the-wild without additional human intervention.

Generative models Recent advances in generative models such as Generative Adversarial Networks (GANs) [14] or Variational Auto-Encoders (VAEs) [29] have enabled the generation of complex patterns, such as images [14] or text [23]. In the audio domain, WaveNet [36] has demonstrated the ability to produce high fidelity audio samples of both speech and music, by generating a waveform from scratch on a sample-by-sample basis. Furthermore, neural networks have also outperformed prior solutions to audio super-resolution [31] (e.g., converting from 4kHz to 16kHz audio) using a U-Net encoder-decoder architecture, and have enabled "automatic-Foley" type applications [41, 38], i.e., generating sounds that correspond to image features, and vice-versa. In this work, instead of generating audio from scratch, our goal is to augment the input audio channels so as to introduce spatial information. Thus, unlike WaveNet, efficient audio generation can be achieved without sacrificing audio fidelity, by transforming the input audio. We also demonstrate the advantages of our approach, inspired by the ambisonics encoding process in controlled environments, over a generic U-Net architecture for spatial audio generation.

Source separation Source separation is a classic problem with an extensive literature. While early methods posed the problem as independent component analysis and focused on maximizing the statistical independence of the extracted signals [24, 7, 6, 2], recent approaches focus on data-driven solutions. For example, [19] proposes a recurrent neural network for monaural separation of two speakers, [1, 12, 11] seek to isolate sound sources by leveraging synchronized visual information in addition to the audio input, and [44] studies a wide range of frequency-based separation methods. Similarly to recent trends, we rely on neural networks guided by cross-modal video analysis. However, instead of only separating human speakers [44] or musical instruments [47], we aim to separate multiple unidentified types of sound sources. Also, unlike previous algorithms, no explicit supervision is available to learn the separation block.

Source localization Sound source localization is a mature area of signal processing and robotics research [3, 35, 34, 42]. However, unlike the proposed 360° spatialization problem, these works rely on microphone arrays using beamforming techniques [43] or on binaural audio and HRTF cues similar to those used by humans [18]. Furthermore, the need for carefully calibrated microphones limits the applicability of these techniques to videos collected in-the-wild.

Cross visual-audio analysis Cross-modal analysis has been extensively studied in the vision and graphics community, due to the inherently paired nature of video and audio. For example, [4] learns audio feature representations in an unsupervised setting by leveraging synchronized video. [22] segments and localizes dominant sound sources using clustering of video and sound features. Other methods correlate repeated motions with sounds to identify sound sources, such as the strumming of a guitar, using for example canonical correlation analysis [25, 26], joint embedding spaces [41, 38] or other temporal features [5].


3 Method

In this section, we define the 360° spatialization task to upconvert common audio recordings to support spatial audio playback. We then introduce a deep learning architecture to address this task, and two datasets to train the proposed architecture.

3.1 Audio spatialization

The goal of 360° spatialization is to generate ambisonic channels φ_N(t) from non-spatial audio i(t) and corresponding video v(t). To handle the most common audio formats supported by commercial 360° cameras and video viewing platforms (e.g., YouTube and Facebook), we upgrade monaural recordings (mono) into first-order ambisonics (FOA). FOA consists of four channels that store the first-order coefficients, φ^0_0, φ^{-1}_1, φ^0_1 and φ^1_1, of the spherical harmonic expansion in (Eq. 1). For ease of notation, we refer to these tracks as φ_w, φ_y, φ_z and φ_x, respectively.

Self-supervised audio spatialization Converting mono to FOA ideally requires learning from videos with paired mono and ambisonics recordings, which are difficult to collect in-the-wild. In order to learn from self-supervision, we assume that monaural audio is recorded with an omni-directional microphone. Under this assumption, mono is equivalent to zeroth-order ambisonics (up to an amplitude scale) and, as a consequence, the upconversion only requires the synthesis of the missing higher-order channels. More specifically, we learn to predict the first-order components φ_x(t), φ_y(t), φ_z(t) from the (surrogate) mono audio i(t) = φ_w(t) and video input v(t). Note that the proposed framework is also applicable to other conversion scenarios, e.g., FOA to second-order ambisonics (SOA), simply by changing the number of input and output audio tracks (see Sec. 5).

3.2 Architecture

Audio spatialization requires solving two fundamental problems: source separation and localization. In controlled environments, where the separated sound sources s_i(t) and respective localization θ_i are known in advance, ambisonics can be generated using (Eq. 2). However, since s_i(t) and θ_i are not known in practice, we design dedicated modules to isolate sources from the mixed audio input and localize them in the video. Also, because audio and video are complementary for identifying each source, both the separation and localization modules are guided by a multi-modal audio-visual analysis module. A schematic description of our architecture is shown in Fig. 1. We now describe each component. Details of network architectures are provided in Appendix A.

Audio and visual analysis Audio features are extracted in the time-frequency domain, which has produced successful audio representations for tasks such as audio classification [17] and speaker identification [33]. More specifically, we extract a sequence of short-term Fourier transforms (STFTs) computed on 25ms segments of the input audio with 25% hop size and multiplied by Hann window functions. Then, we apply a (two-dimensional) CNN encoder to the audio spectrogram, which progressively reduces the spectrogram dimensionality and extracts high-level features.
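A minimal sketch of this audio front-end is shown below, assuming a 48 kHz sample rate and a real/imaginary packing of the complex STFT; both choices are illustrative rather than fixed by the text above.

```python
import numpy as np
from scipy import signal

def audio_spectrogram(mono, sr=48000):
    """STFT of the mono input over 25 ms Hann windows with a 25% hop,
    returned as a (2, freq, time) array of real/imaginary parts that a
    2-D CNN encoder can consume."""
    nperseg = int(0.025 * sr)                 # 25 ms analysis window
    hop = max(1, nperseg // 4)                # hop = 25% of the window length
    _, _, Z = signal.stft(mono, fs=sr, window='hann',
                          nperseg=nperseg, noverlap=nperseg - hop)
    return np.stack([Z.real, Z.imag], axis=0)

# Example: 0.6 s of mono audio at 48 kHz -> time-frequency input features.
x = np.random.randn(int(0.6 * 48000))
feats = audio_spectrogram(x)
```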

Video features are extracted using a two-stream network, based on ResNet-18 [16], to encode both appearance (RGB frames) and motion (optical flow predicted by FlowNet2 [21]). Both streams are initialized with weights pre-trained on ImageNet [8] for classification, and fine-tuned on our task.

A joint audio-visual representation is then obtained by merging the three feature maps (audio, RGB and flow) produced at each time t. Since audio features are extracted at a higher frame rate than video features, we first synchronize the audio and video feature maps by nearest neighbor up-sampling of the video features. Each feature map is then projected into a feature vector (1024 dimensions for audio and 512 for RGB and flow), and the outputs are concatenated and fed to the separation and localization modules.
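The fusion step can be sketched as follows, assuming feature tensors of shape (batch, channels, time); the projection layers that produce the 1024-d and 512-d vectors are described in Appendix A of the paper and omitted here.

```python
import torch
import torch.nn.functional as F

def fuse_features(audio_feats, rgb_feats, flow_feats):
    """Multi-modal fusion sketch: per-frame RGB (512-d) and flow (512-d)
    features are upsampled to the audio feature rate by nearest-neighbor
    interpolation and concatenated with the 1024-d audio features."""
    t_audio = audio_feats.shape[-1]
    rgb_up = F.interpolate(rgb_feats, size=t_audio, mode='nearest')
    flow_up = F.interpolate(flow_feats, size=t_audio, mode='nearest')
    return torch.cat([audio_feats, rgb_up, flow_up], dim=1)  # (B, 2048, T_audio)
```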

Audio separation Although the number of sources may vary, it is often small in practice. Furthermore, psychoacoustic studies have shown that humans can only distinguish a small number of simultaneous sources (three according to [39]). We thus assume an upper bound of k simultaneous sources, and implement a separation network that extracts k audio tracks f^i(t) from the input audio i(t). The separation module takes the form of a U-Net decoder that progressively restores the STFT dimensionality through a series of transposed convolutions and skip connections from the audio analysis stage of equivalent resolution. Furthermore, to visually guide the separation module, we concatenate the multi-modal features to the lowest resolution layer of the audio encoder. In the last up-sampling layer, we produce k sigmoid activated maps a^i(t, ω), which are used to modulate the STFT of the mono input Φ(t, ω). The STFT of the i-th source Φ^i(t, ω) is thus obtained through the soft-attention mechanism Φ^i(t, ω) = a^i(t, ω) · Φ(t, ω), and the separated audio track f^i(t) is reconstructed as the inverse STFT of Φ^i(t, ω) using an overlap-add method.
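A simplified version of this soft-attention separation, with the U-Net replaced by externally supplied masks, might look like the following sketch; the window parameters and the assumption that the masks are already aligned with the STFT grid are illustrative.

```python
import numpy as np
from scipy import signal

def separate_sources(mono, masks, sr=48000, nperseg=1200, hop=300):
    """Apply k sigmoid attenuation maps a_i(t, w) to the STFT of the mono
    input and reconstruct each separated waveform with the inverse STFT
    (overlap-add). `masks` has shape (k, freq, time), values in [0, 1]."""
    _, _, Phi = signal.stft(mono, fs=sr, window='hann',
                            nperseg=nperseg, noverlap=nperseg - hop)
    separated = []
    for a in masks:
        Phi_i = a * Phi                              # soft attention on the STFT
        _, f_i = signal.istft(Phi_i, fs=sr, window='hann',
                              nperseg=nperseg, noverlap=nperseg - hop)
        separated.append(f_i)
    return np.stack(separated)                       # (k, T) waveforms
```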

Localization To localize the sounds f^i(t) extracted by the separation network, we implement a module that generates, at each time t, the localization weights w^i(t) = (w^i_x(t), w^i_y(t), w^i_z(t)) associated with each of the k sources, through a series of fully-connected layers applied to the multi-modal feature vectors of the analysis stage. In a parallel to the encoding mechanism of (Eq. 2) used in controlled environments, w^i(t) can be interpreted as the spherical harmonics y_N(θ^i(t)) evaluated at the predicted position of the i-th source θ^i(t).

Ambisonic generation Given the localization weights w^i(t) and separated waveforms f^i(t), the first-order ambisonic channels φ(t) = (φ_x(t), φ_y(t), φ_z(t)) are generated by

$$\boldsymbol{\phi}(t) = \sum_{i=1}^{k} \boldsymbol{w}^i(t)\, f^i(t).$$

In summary, we split the generation task into two components: generating the attenuation maps a^i(t, ω) for source separation, and the localization weights w^i(t). As audio is not generated from scratch, but through a transformation of the original input inspired by the encoding framework of (Eq. 2), we are able to achieve fast deployment speeds with high quality results.
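Putting the two outputs together, the generation step reduces to a weighted sum per channel, for example as below; the w, x, y, z channel ordering and the upsampling of w^i(t) to the audio rate are assumptions.

```python
import numpy as np

def generate_foa(weights, sources, mono):
    """Ambisonic generation step: combine per-source localization weights
    w_i(t) with the separated waveforms f_i(t),
        phi_p(t) = sum_i w_p^i(t) * f_i(t)  for p in {x, y, z},
    and keep the input mono track as the zeroth-order channel phi_w.
    `weights` is (k, 3, T), `sources` is (k, T), `mono` is (T,)."""
    phi_xyz = np.einsum('kpt,kt->pt', weights, sources)       # (3, T): x, y, z
    return np.concatenate([mono[None, :], phi_xyz], axis=0)   # (4, T): w, x, y, z
```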

3.3 Evaluation metrics

Let φ(t) and φ̂(t) be the ground-truth and predicted ambisonics, and Φ(t, ω) and Φ̂(t, ω) their respective STFTs. We now discuss several metrics used for evaluating the generated signals φ̂(t).

STFT distance Our network is trained end-to-end to minimize errors between STFTs, i.e.,

$$\mathrm{MSE}_{\mathrm{stft}} = \sum_{p\in\{x,y,z\}} \sum_{t} \sum_{\omega} \big\|\Phi_p(t,\omega) - \hat{\Phi}_p(t,\omega)\big\|^2, \qquad (3)$$

where ‖·‖ is the Euclidean complex norm. MSE_stft has well-defined and smooth partial derivatives and, thus, is a suitable loss function. Furthermore, unlike the Euclidean distance between raw waveforms, the STFT loss is able to separate the signal into its frequency components, which enables the network to learn the easier parts of the spectrum without distraction from other errors.
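A minimal PyTorch sketch of this loss, assuming complex STFT tensors of shape (batch, 3, freq, time) and a mean over the batch, could be:

```python
import torch

def stft_loss(pred_stft, true_stft):
    """STFT distance of (Eq. 3): squared complex error summed over the
    x, y, z channels, frequency bins and time frames."""
    err = pred_stft - true_stft
    return (err.real ** 2 + err.imag ** 2).sum(dim=(1, 2, 3)).mean()
```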

Log-spectral distance (LSD) Distances that only compare the smoothed spectral behavior of audio signals are widely used throughout the audio literature. We use the log-spectral distance [15] between Φ(t, ω) and Φ̂(t, ω), which measures the distance in dB between the two spectrograms using

$$\mathrm{LSD} = \sum_{p\in\{x,y,z\}} \sum_{t} \sqrt{\frac{1}{K} \sum_{\omega=1}^{K} \left( 10 \log_{10} \left| \frac{\hat{\Phi}_p(t,\omega)}{\Phi_p(t,\omega)} \right| \right)^2 }. \qquad (4)$$
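A possible NumPy implementation of (Eq. 4), with a small epsilon added to avoid division by zero (an assumption not stated above), is:

```python
import numpy as np

def log_spectral_distance(pred_stft, true_stft, eps=1e-8):
    """Log-spectral distance of (Eq. 4) between predicted and ground-truth
    spectrograms of shape (3, freq, time), i.e. channels x, y, z."""
    ratio = (np.abs(pred_stft) + eps) / (np.abs(true_stft) + eps)
    per_frame = np.sqrt(np.mean((10.0 * np.log10(ratio)) ** 2, axis=1))  # RMS over K bins
    return per_frame.sum()   # sum over channels and time frames
```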

Envelope distance (ENV) Due to the high-frequency nature of audio and the human insensitivity to phase differences, frame-by-frame comparison of raw waveforms does not capture the perceptual similarity of two audio signals. Instead, we measure the Euclidean distance between the envelopes of φ(t) and φ̂(t), where the envelope of an audio wave is computed using the Hilbert transform method [40].
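For instance, using the analytic-signal (Hilbert transform) envelope, the metric might be computed as follows; the (3, T) channel layout is an assumption.

```python
import numpy as np
from scipy.signal import hilbert

def envelope_distance(pred, true):
    """Envelope distance (ENV): Euclidean distance between the envelopes of
    the predicted and ground-truth ambisonic channels, each of shape (3, T),
    where the envelope is the magnitude of the analytic signal."""
    env_pred = np.abs(hilbert(pred, axis=-1))
    env_true = np.abs(hilbert(true, axis=-1))
    return np.sqrt(((env_pred - env_true) ** 2).sum())
```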

Earth Mover's Distance (EMD) Ambisonics model the sound field f(θ, t) over all directions θ. The energy of the sound field measured over a small window w_t around time t along direction θ is

$$E(\boldsymbol{\theta}, t) = \sqrt{\frac{1}{T} \sum_{\tau \in w_t} f(\boldsymbol{\theta}, \tau)^2} = \sqrt{\frac{1}{T} \sum_{\tau \in w_t} \left( \boldsymbol{y}_N^T(\boldsymbol{\theta})\, \boldsymbol{\phi}_N(\tau) \right)^2}. \qquad (5)$$

Thus, E(θ, t) represents the directional energy map of φ(t). In order to measure the localization accuracy of the generated spatial audio, we propose to compute the EMD [32] between the energy maps E(θ, t) associated with φ(t) and φ̂(t). In practice, we uniformly sample the maps E(θ, t) over the sphere, normalize the sampled map so that Σ_i E(θ_i, t) = 1, and measure the distance between samples over the sphere's surface using cosine (angular) distances for the EMD calculation.
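A sketch of the directional energy map of (Eq. 5) for an FOA signal is given below; the ambiX w, y, z, x channel order, the 0.1 s window and the SN3D-style first-order harmonics are assumptions rather than details taken from the paper.

```python
import numpy as np

def directional_energy(foa, directions, win=4800):
    """Directional energy map E(theta, t) of (Eq. 5) for an FOA signal of
    shape (4, T) evaluated at sampled unit directions of shape (D, 3)."""
    # First-order real spherical harmonics evaluated at each direction.
    x, y, z = directions[:, 0], directions[:, 1], directions[:, 2]
    Y = np.stack([np.ones_like(x), y, z, x], axis=1)         # (D, 4)
    field = Y @ foa                                          # plane-wave signal per direction, (D, T)
    n_win = field.shape[1] // win
    frames = field[:, :n_win * win].reshape(field.shape[0], n_win, win)
    return np.sqrt((frames ** 2).mean(axis=2))               # (D, n_win) energy map
```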


Figure 2: Representative images. Example video frames from each dataset (REC-STREET, YT-ALL, YT-MUSIC and YT-CLEAN), together with the counts of YouTube content categories (People & Blogs, Music, Travel & Events, Entertainment, Science & Technology, Film & Animation, Sports, Autos & Vehicles) for YT-ALL, YT-MUSIC and YT-CLEAN.

3.4 Datasets

To train our model, we collected two datasets of 360° videos with FOA audio. The first dataset, denoted REC-STREET, was recorded by us using a Theta V 360° camera with an attached TA-1 spatial audio microphone. REC-STREET consists of 43 videos of outdoor street scenes, totaling 3.5 hours and 123k training samples (0.1s each). Due to the consistency of capture hardware and scene content, the audio of REC-STREET videos is relatively easy to spatialize.

The second dataset, denoted YT-ALL, was collected in-the-wild by scraping 360° videos from YouTube using queries related to spatial audio, e.g., spatial audio, ambisonics, and ambix. To clean the search results, we automatically removed videos that did not contain valid ambisonics, as described by YouTube's format, keeping only videos containing all 4 channels or with only the Z channel missing (a common spatial audio capture scenario). Finally, we performed a manual curation to remove videos containing 1) still images, 2) computer generated content, or 3) post-processed and non-visually indicated sounds such as background music or voice-overs. During this pruning process, 799 videos were removed, resulting in 1146 valid videos totaling 113.1 hours of content (3976k training samples). YT-ALL was further separated into live musical performances, YT-MUSIC (397 videos), and videos with a small number of super-imposed sources which could be localized in the image, YT-CLEAN (496 videos). Upgrading YT-MUSIC videos into spatial audio is especially challenging due to the large number of mixed sources (voices and instruments). We also identified 489 videos that were recorded with a "horizontal" spatial audio microphone (i.e., only containing the φ_w(t), φ_x(t) and φ_y(t) channels). In this case, we simply ignore the Z channel φ_z(t) when computing each metric, including the STFT loss. Fig. 2 shows illustrative video frames and summarizes the most common categories for each dataset.

4 Evaluation

For our experiments, we randomly sample three partitions, each containing 75% of all videos for training and 25% for testing. Networks are trained to generate audio at 48kHz from input mono audio processed at 48kHz and video at 10Hz. Each training sample consists of a chunk of 0.6s of mono audio and a single frame of RGB and flow, which are used to predict 0.1s of spatial audio at the center of the 0.6s input window. To make the model more robust and remove any bias to content in the center, we augment the datasets during training by randomly rotating both video and spatial audio around the vertical (z) axis. Spatial audio can be rotated by multiplying the ambisonic channels with the appropriate rotation matrix as described in [30], and video frames (in equirectangular format) can be rotated using horizontal translations with wrapping. Networks are trained by back-propagation using the Adam optimizer [28] for 150k iterations (roughly two days) with parameters β1 = 0.9, β2 = 0.999 and ε = 1e−8, batch size of 32, learning rate of 1e−4 and weight decay of 0.0005. During evaluation, we predict a chunk of 0.1s for each second of the test video, and average the results across all chunks. Also, to avoid bias towards longer videos, all evaluation metrics are computed for each video separately, and averaged across videos.
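The rotation augmentation can be sketched as follows; the w, y, z, x channel order and the sign convention that ties the audio rotation to the equirectangular roll are assumptions.

```python
import numpy as np

def rotate_z(foa, frame, angle):
    """Random augmentation: rotate spatial audio and 360 video around the
    vertical (z) axis by `angle` radians. `foa` is (4, T) in w, y, z, x
    order; `frame` is an equirectangular image (H, W, C), rotated by a
    horizontal roll with wrapping."""
    w, y, z, x = foa
    x_r = np.cos(angle) * x - np.sin(angle) * y      # first-order channels rotate
    y_r = np.sin(angle) * x + np.cos(angle) * y      # like Cartesian coordinates
    foa_r = np.stack([w, y_r, z, x_r])
    shift = int(angle / (2 * np.pi) * frame.shape[1])
    frame_r = np.roll(frame, shift, axis=1)          # horizontal translation with wrap
    return foa_r, frame_r
```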


                  REC-STREET             YT-CLEAN               YT-MUSIC               YT-ALL
                  STFT   ENV    EMD      STFT   ENV    EMD      STFT   ENV    EMD      STFT   ENV    EMD
SPATIAL PRIOR     0.187  0.958  0.492    1.394  2.063  1.478    4.652  4.355  3.479    2.691  3.394  2.246
U-NET BASELINE    0.180  0.935  0.449    1.361  2.039  1.403    4.338  4.678  2.855    2.658  3.239  2.137
OURS-NOVIDEO      0.178  0.973  0.450    1.370  2.081  1.428    4.220  4.591  2.654    2.635  3.200  2.117
OURS-NORGB        0.158  0.779  0.425    1.339  1.847  1.405    3.664  3.569  2.432    2.546  2.907  2.063
OURS-NOFLOW       0.172  0.784  0.440    1.349  1.778  1.402    3.615  3.467  2.403    2.455  2.665  2.023
OURS-NOSEP        0.152  0.790  0.422    1.381  1.773  1.415    3.627  3.602  2.447    2.435  2.694  2.050
OURS-FULL         0.158  0.767  0.419    1.379  1.776  1.417    3.524  3.366  2.350    2.447  2.649  2.019

Table 1: Quantitative comparisons. We report three quality metrics (Sec. 3.3): Envelope distance (ENV), Log-spectral distance (LSD), and earth-mover's distance (EMD), on test videos from different datasets (Sec. 3.4). Lower is better. All results within 0.01 of the top performer are shown in bold.

Figure 3: Qualitative Results. Comparison between predicted (Ours) and recorded (GT) FOA. Spatial audio is visualized as a color overlay over the frame, with darker red indicating locations with higher audio energy.

Real time performance The proposed procedure can generate 1s of spatial audio at a 48kHz sampling rate in 103ms, using a single 12GB Titan Xp GPU (3840 cores running at 1.6GHz).

Baselines Since spatial audio generation is a novel task, no established methods exist for comparison purposes. Instead, we ablate our architecture to determine the relevance of each component, and compare it to the prior spatial distribution of audio content and to a popular, domain-independent baseline architecture. Quantitative results are shown in Table 1.

To determine the role of the visual input, we remove the RGB encoder (NORGB), the flow encoder (NOFLOW), or both (NOVIDEO). We also remove the separation block entirely (NOSEP), and multiply the localization weights with the input mono directly. The results indicate that the network relies heavily on visual features, with NOVIDEO being one of the worst performers overall. Interestingly, most methods performed well on REC-STREET and YT-CLEAN. However, the visual encoder and separation block are necessary for more complex videos, as in YT-MUSIC and YT-ALL.

Since the main sound sources in 360° videos often appear in the center, we validate the need for a complex model by directly using the prior distribution of audio content (SPATIAL-PRIOR). We compute the spatial prior E(θ) by averaging the energy maps E(θ, t) of (Eq. 5) over all videos in the training set. Then, to induce the same distribution on test videos, we decompose E(θ) into its spherical harmonic coefficients (c_w, c_x, c_y, c_z) and upconvert the input mono using (φ_w(t), φ_x(t), φ_y(t), φ_z(t)) = (1, c_x/c_w, c_y/c_w, c_z/c_w) i(t). As shown in Table 1, relying solely on the prior distribution is not enough for accurate ambisonic conversion.
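The SPATIAL-PRIOR baseline therefore amounts to a fixed per-channel gain, as in the following sketch (the prior coefficients are assumed to have been estimated from the training set beforehand):

```python
import numpy as np

def spatial_prior_baseline(mono, prior_coeffs):
    """SPATIAL-PRIOR baseline: upconvert mono with fixed gains derived from
    the training-set energy prior, (phi_w, phi_x, phi_y, phi_z) =
    (1, c_x/c_w, c_y/c_w, c_z/c_w) * i(t).
    `prior_coeffs` = (c_w, c_x, c_y, c_z) from the average energy map."""
    c_w, c_x, c_y, c_z = prior_coeffs
    gains = np.array([1.0, c_x / c_w, c_y / c_w, c_z / c_w])
    return gains[:, None] * mono[None, :]          # (4, T): w, x, y, z
```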

We finally compare to a popular encoder-decoder U-NET architecture, which has been successfully applied to audio tasks such as audio super-resolution [31]. This network consists of a number of convolutional downsampling layers that progressively reduce the dimension of the signal, distilling higher level features, followed by a number of upsampling layers that restore the signal's resolution. In each upsampling layer, a skip connection is added from the encoding layer of equivalent resolution. To generate spatial audio, we modify the U-NET architecture by setting the number of units in the output layer to the number of ambisonic channels, and concatenate video features to the U-Net bottleneck (i.e., the lowest resolution layer). Our approach significantly outperforms the U-NET architecture, which demonstrates the importance of an architecture tailored to the task of spatial audio generation.


Figure 4: Comparisons. Predicted FOA produced by different procedures (Ground-truth, U-NET, NOAUDIO, NOSEP, OURS).

Figure 5: Mono recordings. Predicted FOA on videos recorded with a real mono microphone (unknown FOA).

Figure 6: User studies. Percentage of videos labeled as "Real" when viewed with audio generated by various methods (GT, OURS, U-NET and MONO) under two viewing experiences (using an HMD device, and in-browser viewing): GT 84.4% / 72.5%, OURS 62.2% / 55.0%, U-NET 31.1% / 40.4%, MONO 22.2% / 35.8% (HMD / in-browser). Error bars represent Wilson score intervals [45] for a 95% confidence level.

Qualitative results Designing robust metrics for comparing spatial audio is an open problem, and we found that only so much can be determined by these metrics alone. For example, fully flat predictions can have a similar EMD to a mis-placed prediction, but be perceptually much worse. Therefore, we also rely on qualitative evaluation and a user study. Fig. 3 shows illustrative examples of the spatial audio output of our network, and Fig. 4 shows a comparison with other baselines. To depict spatial audio, we overlay the directional energy map E(θ, t) of the predicted ambisonics (Eq. 5) over the video frame at time t. As can be seen in most of these examples, our network generates spatial audio that has a similar spatial distribution of energy to the ground truth. Furthermore, due to the form of the audio generator, the sound fidelity of the original mono input is carried over to the synthesized audio. These and other examples, together with the predicted spatial audio, are provided in the Supp. material.

The results shown in Table 1 and Fig. 3 use videos recorded with ambisonic microphones and converted to mono audio. To validate whether our method extends to real mono microphones, we scraped additional videos from YouTube that were not recorded with ambisonics, and show that we can still generate convincing spatial audio (see Fig. 5 and Supp. material).

User study The real criterion for success is whether viewers believe that the generated audio is correctly spatialized. To evaluate this, we conducted a "real vs. fake" user study, where participants were shown a 360° video and asked to decide whether the perceived location of the audio matches the location of its sources in the video (real) or not (fake). Two studies were conducted in different viewing environments: a popular in-browser 360° video viewing platform (YouTube), and a head-mounted display (HMD) in a controlled environment. We recruited 32 participants from Amazon Mechanical Turk for the in-browser study. For the HMD study, we recruited 9 participants (aged between 20 and 32, 1 female) through an engineering school email list of a large university. In both cases, participants were asked to have normal hearing, and to listen to the audio using headphones. In the HMD study, participants were asked to wear a KAMLE VR headset. To familiarize participants with the spatial audio experience, each participant was first asked to watch two versions of a pre-selected video with and without correct spatial audio. After the practice round, participants watched 20 randomly selected videos whose audio was generated by one of four methods: GT, the original ground-truth recorded spatial audio; MONO, just the mono track (no spatialization); U-NET, the baseline method; and OURS, the result of our full method. After each video, participants were asked to decide whether its audio was real or fake. In total, 280 clips per method were watched in the in-browser study, and 45 per method in the HMD study.

The results of both studies, shown in Fig. 6, support several conclusions. First, our approach outperforms the U-NET baseline and MONO by statistically significant margins in both studies.


Figure 7: Limitations (Ground truth vs. Ours). Our algorithm predicts the wrong people as the source of laughter in a room full of people (top), and the wrong violin as the one currently playing in the live performance (right).

        MONO → FOA    FOA → SOA
ENV     1.870         0.333
LSD     3.228         0.513
EMD     1.400         0.232

Figure 8: Higher order ambisonics. (Top) Examples from our synthetic FOA to SOA conversion experiment. (Bottom) Comparison between the mono to FOA and FOA to SOA conversion tasks.

Second, in comparison to in-browser video platforms, HMD devices offer a more realistic viewing experience, which enables non-spatial audio to be identified more easily. Thus, participants were convinced by the ambisonics predicted by our approach at higher rates while wearing an HMD device (62% HMD vs. 55% in-browser). Finally, spatial audio may not always be experienced easily, e.g., when the video does not contain clean sound sources. As a consequence, even videos with GT ambisonics were misclassified in both studies at a significant rate.
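For reference, the Wilson score interval used for the error bars in Fig. 6 can be computed as in the sketch below (z = 1.96 for a 95% confidence level); the example figures are taken from the HMD study above.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval [45] for a binomial proportion, as used for
    the 95% confidence error bars in Fig. 6."""
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2))
    return center - half, center + half

# Example: 62.2% of 45 HMD clips labeled "Real" for OURS.
lo, hi = wilson_interval(round(0.622 * 45), 45)
```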

5 Discussion

Limitations We observe several cases where sound sources are not correctly separated or localized. This occurs with challenging examples, such as those with many overlapping sources, reverberant environments which are hard to separate, or cases where there is an ambiguous mapping from visual appearance to sound source (such as multiple, similar looking cars). Fig. 7 shows a few examples. While general purpose spatial audio generation is still an open problem, we provide a first approach. We hope that future advances in audio-visual analysis and audio generation will enable more robust solutions. Also, while the total amount of content (in hours) is on par with other video datasets, the number of videos is still low, due to the limited number of 360° videos with spatial audio available from online sources. As this number increases, our method should also improve significantly.

Future work Although hardware trends are changing and we are beginning to see commercial cameras that include spatial audio microphone arrays capable of recording FOA, we believe that up-converting to spatial audio will remain relevant for a number of reasons. Besides the spatialization of legacy recordings with only mono or stereo audio, our method can be used to further increase the ambisonics spatial resolution, for example by up-converting first-order into second-order ambisonics (SOA). Unfortunately, ground-truth SOA recordings are difficult to collect in-the-wild, since SOA microphones are rare and expensive. Instead, to demonstrate future potential, we applied our approach to the FOA to SOA conversion task, using a small synthetic dataset where pre-recorded sounds are placed at chosen locations, which move over time in random trajectories. These are accompanied by an artificially constructed video consisting of a random background image with identifying icons synchronized with the sound locations (see Fig. 8). The results shown in Fig. 8 indicate that converting FOA into SOA may be significantly easier than converting ZOA (mono) into FOA. This is because FOA signals already contain substantial spatial information and partially separated sounds. Given these findings, a promising area for future work is to synthesize a realistic large-scale SOA dataset for learning to convert FOA into higher-order ambisonics in order to support a more realistic viewing experience.

Conclusion We presented the first approach for up-converting conventional mono recordings into spatial audio given a 360° video, and introduced an end-to-end trainable network tailored to this task. We also demonstrated the benefits of each component of our network and showed that the proposed generator performs substantially better than a domain-independent baseline.

Acknowledgments This work was partially funded by graduate fellowship SFRH/BD/109135/2015 from the Portuguese Ministry of Sciences and Education and NRI Grant IIS-1637941.


References

[1] T. Afouras, J. S. Chung, and A. Zisserman. The conversation: Deep audio-visual speech enhancement. In Interspeech, 2018.
[2] S. Amari, A. Cichocki, and H. H. Yang. A new learning algorithm for blind signal separation. In Advances in Neural Information Processing Systems (NIPS), 1996.
[3] S. Argentieri, P. Danès, and P. Souères. A survey on sound source localization in robotics: From binaural to array processing methods. Computer Speech & Language, 34(1):87–112, 2015.
[4] Y. Aytar, C. Vondrick, and A. Torralba. SoundNet: Learning sound representations from unlabeled video. In Advances in Neural Information Processing Systems (NIPS), 2016.
[5] Z. Barzelay and Y. Y. Schechner. Harmony in motion. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2007.
[6] A. J. Bell and T. J. Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6):1129–1159, 1995.
[7] P. Comon. Independent component analysis, a new concept? Signal Processing, 36(3):287–314, 1994.
[8] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2009.
[9] G. Dickins and R. Kennedy. Towards optimal soundfield representation. In Audio Engineering Society Convention, 1999.
[10] C. Dong, C. C. Loy, K. He, and X. Tang. Learning a deep convolutional network for image super-resolution. In European Conference on Computer Vision (ECCV), 2014.
[11] A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. Freeman, and M. Rubinstein. Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation. In ACM SIGGRAPH, 2018.
[12] A. Gabbay, A. Shamir, and S. Peleg. Visual speech enhancement using noise-invariant training. In Interspeech, 2018.
[13] M. A. Gerzon. Periphony: With-height sound reproduction. J. Audio Eng. Soc., 21(1):2–10, 1973.
[14] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems (NIPS), 2014.
[15] A. Gray and J. Markel. Distance measures for speech processing. IEEE Transactions on Acoustics, Speech, and Signal Processing, 24(5):380–391, 1976.
[16] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision (ECCV), 2016.
[17] S. Hershey, S. Chaudhuri, D. P. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold, et al. CNN architectures for large-scale audio classification. In IEEE International Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2017.
[18] J. Hornstein, M. Lopes, J. Santos-Victor, and F. Lacerda. Sound localization for humanoid robots - building audio-motor maps based on the HRTF. In IEEE/RSJ International Conf. on Intelligent Robots and Systems, 2006.
[19] P.-S. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis. Deep learning for monaural speech separation. In IEEE International Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2014.
[20] S. Iizuka, E. Simo-Serra, and H. Ishikawa. Let there be color!: Joint end-to-end learning of global and local image priors for automatic image colorization with simultaneous classification. ACM Transactions on Graphics (TOG), 35(4):110, 2016.
[21] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. FlowNet 2.0: Evolution of optical flow estimation with deep networks. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2017.
[22] H. Izadinia, I. Saleemi, and M. Shah. Multimodal analysis for identification and segmentation of moving-sounding objects. IEEE Transactions on Multimedia, 15(2):378–390, 2013.
[23] R. Jozefowicz, O. Vinyals, M. Schuster, N. Shazeer, and Y. Wu. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016.
[24] C. Jutten and J. Herault. Blind separation of sources, part I: An adaptive algorithm based on neuromimetic architecture. Signal Processing, 24(1), 1991.
[25] E. Kidron, Y. Y. Schechner, and M. Elad. Pixels that sound. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2005.
[26] E. Kidron, Y. Y. Schechner, and M. Elad. Cross-modal localization via sparsity. IEEE Trans. on Signal Processing, 55(4):1390–1404, 2007.
[27] J. Kim, J. Kwon Lee, and K. Mu Lee. Accurate image super-resolution using very deep convolutional networks. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2016.
[28] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
[29] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations (ICLR), 2014.
[30] M. Kronlachner. Spatial transformations for the alteration of ambisonic recordings. Master's thesis, Graz University of Technology, 2014.
[31] V. Kuleshov, S. Z. Enam, and S. Ermon. Audio super resolution using neural networks. In Workshops at International Conference on Learning Representations (ICLR), 2017.
[32] E. Levina and P. Bickel. The earth mover's distance is the Mallows distance: Some insights from statistics. In IEEE International Conference on Computer Vision (ICCV), 2001.
[33] A. Nagrani, J. S. Chung, and A. Zisserman. VoxCeleb: A large-scale speaker identification dataset. In Interspeech, 2017.
[34] K. Nakadai, H. G. Okuno, and H. Kitano. Real-time sound source localization and separation for robot audition. In International Conference on Spoken Language Processing, 2002.
[35] K. Nakamura, K. Nakadai, F. Asano, and G. Ince. Intelligent sound source localization and its application to multimodal human tracking. In IEEE/RSJ International Conf. on Intelligent Robots and Systems (IROS), 2011.
[36] A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. WaveNet: A generative model for raw audio. In 9th ISCA Speech Synthesis Workshop, 2016.
[37] A. Owens and A. A. Efros. Audio-visual scene analysis with self-supervised multisensory features. In European Conference on Computer Vision (ECCV), 2018.
[38] A. Owens, P. Isola, J. McDermott, A. Torralba, E. H. Adelson, and W. T. Freeman. Visually indicated sounds. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2016.
[39] O. Santala and V. Pulkki. Directional perception of distributed sound sources. The Journal of the Acoustical Society of America, 129(3):1522–1530, 2011.
[40] J. O. Smith. Mathematics of the Discrete Fourier Transform (DFT): with Audio Applications, chapter The Analytic Signal and Hilbert Transform Filters. Julius Smith, 2007.
[41] M. Soler, J.-C. Bazin, O. Wang, A. Krause, and A. Sorkine-Hornung. Suggesting sounds for images from video collections. In European Conference on Computer Vision (ECCV), 2016.
[42] N. Strobel, S. Spors, and R. Rabenstein. Joint audio-video object localization and tracking. IEEE Signal Processing Magazine, 18(1):22–31, 2001.
[43] J.-M. Valin, F. Michaud, and J. Rouat. Robust localization and tracking of simultaneous moving sound sources using beamforming and particle filtering. Robotics and Autonomous Systems, 55(3):216–228, 2007.
[44] D. Wang and J. Chen. Supervised speech separation based on deep learning: An overview. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2018.
[45] E. B. Wilson. Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association, 22(158):209–212, 1927.
[46] R. Zhang, P. Isola, and A. A. Efros. Colorful image colorization. In European Conference on Computer Vision (ECCV), 2016.
[47] H. Zhao, C. Gan, A. Rouditchenko, C. Vondrick, J. McDermott, and A. Torralba. The sound of pixels. In European Conference on Computer Vision (ECCV), 2018.
