
arXiv:2004.12031v3 [cs.LG] 6 May 2020

Self-supervised Learning of Visual Speech Features with Audiovisual Speech Enhancement

Zakaria Aldeneh†, Anushree Prasanna Kumar‡, Barry-John Theobald‡, Erik Marchi‡, Sachin Kajarekar‡, Devang Naik‡, Ahmed Hussen Abdelaziz‡

† University of Michigan, Ann Arbor, MI, USA
‡ Apple, Cupertino, CA, USA

[email protected], {ak_26, barryjohn_theobald, emarchi, skajarekar, naik.d, ahussenabdelaziz}@apple.com

Abstract

We present an introspection of an audiovisual speech enhancement model. In particular, we focus on interpreting how a neural audiovisual speech enhancement model uses visual cues to improve the quality of the target speech signal. We show that visual features provide not only high-level information about speech activity, i.e. speech vs. no speech, but also fine-grained visual information about the place of articulation. An interesting byproduct of this finding is that the learned visual embeddings can be used as features for other visual speech applications. We demonstrate the effectiveness of the learned visual representations for classifying visemes (the visual analogy to phonemes). Our results provide insight into important aspects of audiovisual speech enhancement and demonstrate how such models can be used for self-supervision tasks for visual speech applications.

Index Terms: audiovisual speech enhancement, lip reading, viseme classification, self-supervised learning

1. Introduction

The goal of monaural (single-channel) speech enhancement is to improve the quality and intelligibility of speech when the audio is recorded in a noisy environment from a single microphone. Enhancement models attenuate additive noise from a speech signal and can be used as pre-processors for various downstream applications, including automatic speech recognition (ASR) and speaker verification [1–5].

Previous research has shown that acoustic models used for speech enhancement benefit from the addition of visual cues [6–10]. Although neural audiovisual speech enhancement models have shown promising results, it is unclear how visual cues are utilized [8–11]. One hypothesis is that visual cues provide only high-level information about speech activity, i.e. speech vs. no speech, depending on whether the lips are moving or not. An alternative hypothesis is that visual cues provide fine-grained information about what is being articulated. The aim of our work is to interpret how visual cues are used by audiovisual speech enhancement models. Such analysis is not only necessary for understanding the mechanism by which an audiovisual enhancer uses visual cues, but also for understanding the performance gains obtained from the addition of visual cues.

We study the performance of audio-only speech enhancement models as a function of what is being articulated, where we use visemes as the basic unit of analysis. A viseme consists of a cluster of phonemes that share the same place of articulation, and so visemes represent visually indistinguishable phonemes [12]. For example, the phonemes /uh/ and /w/ both map to a rounded vowel viseme, while phonemes /b/ and /m/ map to a viseme representing bilabial consonants. We hypothesize that enhancement performance will vary depending on what is being said since certain sounds are more visually prominent than others. Given the per-viseme audio-only enhancement performance, we then quantify the performance gains obtained from the addition of visual cues to the enhancement model.

We also hypothesize that the visual representations implicitly learned by the audiovisual model can be used for other visual speech tasks. We show that these visual representations can be used to discriminate visemes during continuous speech, e.g. rounding lips, stretching lips, and visible teeth. Our results show that audiovisual speech enhancement can be used as a self-supervision task for learning meaningful visual speech representations without relying on manual annotations.

2. Related Work

Past work has shown that the introduction of visual cues can improve the robustness of various speech processing applications. These include automatic speech recognition (ASR), speaker recognition and diarization, speech enhancement, and emotion recognition [13–16]. This section covers recent relevant works on audiovisual speech enhancement, audiovisual source separation, and speech-driven multi-modal self-supervised learning.

Gabbay et al. introduced an audiovisual speech enhancement model based on an encoder-decoder architecture [8]. The model they introduced takes in the log Mel-scale spectrogram representation of the speech segment and the corresponding gray-scale video frames containing the lips of the target speaker, and outputs an enhanced version of the input spectrogram. Although promising, the models were all speaker-dependent (i.e. the same target speakers were used during the training and testing phases), which limits their generalizability.

Ephrat et al. introduced a speaker-independent approach for audiovisual speech enhancement [10]. Their models take in, as input, a complex-valued spectrogram of a three-second segment of speech and the corresponding video frames cropped to contain only the face, and output complex masks, which are used to reconstruct an enhanced speech signal. Concurrently, Afouras et al. introduced an audiovisual speech enhancement approach to model both the magnitude and the phase components of the short-time Fourier transform (STFT) representation of the input signal [9]. In a follow-up study, they extended their original approach to make the enhancement model robust to partial occlusions by conditioning it on both visual cues and speaker representations [11].

Figure 1: The audiovisual speech enhancement system takes as input a mixture speech signal and the corresponding video containing the face of the speaker, and outputs a clean speech signal. The neural model (Figure 2) takes in two inputs, the mixture spectrogram and a video of the mouth region, and outputs an ideal ratio mask (IRM). The IRM is multiplied with the noisy spectrogram to give a clean spectrogram, which is used along with the noisy phase to reconstruct the clean speech signal.


Previous research looked at using the correlation between acoustic and visual modalities as a self-supervisory training signal [17–21]. Chung and Zisserman [17] trained a model to determine whether the acoustic and visual modalities of a speaker are temporally aligned or not. They created training examples by randomly introducing shifts in one of the modalities. The authors showed that the learned visual representations can be used as features for lip reading applications. Owens and Efros [18] trained a more general model to predict whether the visual and acoustic streams of a scene are temporally aligned or not, and then used the learned representations for three tasks: source localization, audiovisual action recognition, and audio source separation. To the best of our knowledge, our work in this paper is the first to investigate audiovisual speech enhancement as a self-supervisory signal for learning strong visual features.

In contrast to previous research, our work focuses on interpreting how audiovisual enhancers use visual cues to enhance the target signal. In particular, the novelty of our work is two-fold. Firstly, we show that the performance of audiovisual speech enhancers varies depending on what is being said, and we show that performance gains vary as a function of the place of articulation. Secondly, we demonstrate the effectiveness of the audiovisual speech enhancement task as a self-supervised way of learning useful visual embeddings that encode information about the place of articulation. In the self-supervised paradigm, we learn meaningful representations by training a model to solve a task with labels obtained from the data itself as opposed to labels obtained from manual annotation [22].

3. Audiovisual Speech Enhancement Model

Our architecture is shown in Figure 1. The neural enhancer receives two inputs: the squared magnitude of the STFT (i.e. power spectrum) of the noisy speech segment, and a video segment containing the corresponding pose-normalized gray-scale mouth images of dimension w × h × t. To produce an enhanced version of the input speech, the model predicts an ideal ratio mask (IRM), which we write as:

\mathrm{IRM}(m, f) = \frac{|S(m, f)|^2}{|S(m, f)|^2 + |N(m, f)|^2}

where |S(m, f)|^2 and |N(m, f)|^2 represent the power spectra of the speech and noise signals at frame m and frequency bin f. Element-wise multiplying an IRM by the power spectrum of the noisy signal gives an optimal estimate, in the sense of the minimum mean square error (MMSE), of the power spectrum of the clean signal [23].

Figure 2: The audiovisual enhancement network consists of three sub-networks: (1) the audio encoder; (2) the video encoder; and (3) the mask predictor. The audio encoder induces a fixed-size embedding from a spectrogram segment and the video encoder induces a fixed-size embedding from a sequence of gray-scale images of the mouth region. The mask predictor outputs an ideal ratio mask (IRM) given the concatenated multimodal embedding.
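As a concrete illustration of this definition, the following NumPy sketch computes the IRM from the clean-speech and noise power spectra and applies it to the noisy power spectrum. The STFT parameters mirror those described later in Section 4.2; the helper functions and their names are illustrative assumptions, not the authors' implementation.

```python
# Minimal IRM sketch (an assumption of the procedure, not the paper's code).
import numpy as np
from scipy.signal import stft

def power_spectrum(x, fs=16000, win_ms=25, hop_ms=10):
    """Squared-magnitude STFT (power spectrum), frames along axis 0."""
    nper = int(fs * win_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    _, _, X = stft(x, fs=fs, window="hamming", nperseg=nper, noverlap=nper - hop)
    return np.abs(X.T) ** 2            # shape: (frames m, frequency bins f)

def ideal_ratio_mask(speech, noise, eps=1e-8):
    """IRM(m, f) = |S|^2 / (|S|^2 + |N|^2), per time-frequency bin."""
    S2 = power_spectrum(speech)
    N2 = power_spectrum(noise)
    return S2 / (S2 + N2 + eps)

# Element-wise multiplying the IRM with the noisy power spectrum yields the
# MMSE-optimal estimate of the clean power spectrum:
# clean_est = ideal_ratio_mask(speech, noise) * power_spectrum(speech + noise)
```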

3.1. Audiovisual Neural Model

The audiovisual neural model, shown in Figure 2, consists of three sub-networks: the audio encoder, the video encoder, and the mask predictor.

The audio encoder induces an embedding given the acoustic representation of the noisy input speech. We experiment with both fully-connected- and LSTM-based audio encoders in this work. The video encoder is based on the VGG-M architecture [17], which consists of a series of conv-pool layers followed by a series of fully-connected layers. The video encoder induces an embedding given the video representation. Finally, the mask predictor outputs an IRM given the concatenated multimodal embedding. The mask predictor consists of a series of fully-connected layers followed by a fully-connected linear regression layer.
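To make the composition of the three sub-networks concrete, here is a hedged PyTorch sketch of the fully-connected variant. The layer widths follow the architecture details given in Section 4.2, while the class name, input dimensionalities, and the injected video encoder are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class MaskPredictorNet(nn.Module):
    """Sketch: audio encoder + video encoder + mask predictor (FC variant)."""
    def __init__(self, audio_dim, video_emb_dim, mask_dim, video_encoder):
        super().__init__()
        # Audio encoder: three 512-dimensional fully-connected layers.
        self.audio_encoder = nn.Sequential(
            nn.Linear(audio_dim, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
        )
        # Video encoder: a VGG-M-style network (see the sketch in Section 4.2).
        self.video_encoder = video_encoder
        # Mask predictor: three 512-unit linear blocks plus a linear regression layer.
        self.mask_predictor = nn.Sequential(
            nn.Linear(512 + video_emb_dim, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, mask_dim),   # one IRM value per frequency bin
        )

    def forward(self, noisy_power_spec, mouth_frames):
        a = self.audio_encoder(noisy_power_spec)   # (batch, 512)
        v = self.video_encoder(mouth_frames)       # (batch, video_emb_dim)
        return self.mask_predictor(torch.cat([a, v], dim=-1))
```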

4. Experimental Setup

4.1. Dataset

We use an in-house audiovisual corpus containing around 68 hours (39,097 utterances) of speech from 600 gender-balanced speakers. The utterances are queries for a digital assistant spoken in English with an American accent. The audio is sampled at 16 kHz using a 16-bit PCM encoding. The video has a frame rate of 60 Hz and a resolution of 720 × 1280. We randomly split the dataset into gender-stratified partitions using an 80/10/10 rule. The resulting splits consist of 480 speakers (29,415 utterances, 52 hours) for training, 60 speakers (4,650 utterances, 8 hours) for validation, and 60 speakers (5,032 utterances, 8 hours) for testing.

4.2. Details

Mixed utterances for training are created on-the-fly by mixing the target utterance with a random utterance from a different speaker in the training set. The mixtures used for the validation and test sets are fixed and are created using speakers from their respective partitions. This ensures that the model does not see any of the validation or test speakers during training. All of the samples were mixed at a signal-to-noise ratio (SNR) of 0 dB.

Table 1: Enhancement performance obtained for each setup and loss function. SNR: signal-to-noise ratio in decibels (dB), PESQ: perceptual evaluation of speech quality, A: audio-only enhancer, AV: audiovisual enhancer, FC: fully connected, LSTM: long short-term memory.

Audio Encoder   Loss         SNR (A)   SNR (AV)   PESQ (A)   PESQ (AV)
FC              MSE          4.21      7.53       2.70       2.88
FC              MAE          4.28      8.10       2.56       2.87
FC              MAE+Cosine   4.62      8.07       2.67       2.90
LSTM            MSE          4.61      8.42       2.67       2.92
LSTM            MAE          4.63      8.56       2.58       2.90
LSTM            MAE+Cosine   5.17      8.87       2.73       2.95
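A minimal sketch of the on-the-fly mixing is shown below, assuming the interfering utterance is simply rescaled so that the resulting mixture has the target SNR (0 dB by default); this rescaling procedure is an assumption, not the authors' code.

```python
import numpy as np

def mix_at_snr(target, interferer, snr_db=0.0, eps=1e-12):
    """Scale `interferer` so that 10*log10(P_target / P_interferer) == snr_db."""
    n = min(len(target), len(interferer))
    target, interferer = target[:n], interferer[:n]
    p_t = np.mean(target ** 2)
    p_i = np.mean(interferer ** 2)
    scale = np.sqrt(p_t / ((p_i + eps) * 10 ** (snr_db / 10)))
    return target + scale * interferer
```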

Features: We train our model using 200 ms audiovisual segments. Audio features represent the squared magnitude of the STFT of the noisy input signal, extracted using a 25 ms Hamming window with a hop size of 10 ms. Visual features represent a sequence of 128 × 96 cropped gray-scale images of the mouth region for the target speaker, extracted using dlib [24].
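For illustration, the snippet below sketches one way to obtain 128 × 96 gray-scale mouth crops with dlib's standard 68-point landmark model (landmarks 48-67 cover the mouth). The pose-normalization step and the exact cropping margins used in the paper are not specified, so they are assumptions here.

```python
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def crop_mouth(frame_bgr, out_w=128, out_h=96, margin=0.4):
    """Return a gray-scale mouth crop, or None if no face is detected."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    # Mouth landmarks in the 68-point model are indices 48..67.
    pts = np.array([(shape.part(i).x, shape.part(i).y) for i in range(48, 68)])
    x0, y0 = pts.min(axis=0)
    x1, y1 = pts.max(axis=0)
    dx, dy = int(margin * (x1 - x0)), int(margin * (y1 - y0))
    crop = gray[max(0, y0 - dy):y1 + dy, max(0, x0 - dx):x1 + dx]
    return cv2.resize(crop, (out_w, out_h))
```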

Training: The neural network was trained with the ADAM optimizer using a learning rate of 1e-4 for a maximum of 100 epochs. We monitor the validation performance during training and apply early stopping when the validation loss converges. The loss functions that we use are described in Section 5.1.

Architecture: The video encoder is based on the VGG-M architecture [17] and consists of five convolutional blocks followed by three linear blocks. Each convolutional block consists of a 3×3 convolution layer, followed by batch normalization, a ReLU non-linearity, and 2×2 max-pooling. We use 96 filters in the first convolution layer and 128 filters in each of the subsequent convolution layers. Each linear block consists of a linear layer followed by a ReLU non-linearity. We use widths of 1024, 512, and 256 for the first, second, and third linear layers, respectively. The audio encoder consists of three 512-dimensional fully-connected (or LSTM) layers. Finally, the mask predictor is made up of three linear blocks, each with a width of 512. The hyper-parameters for the audio encoder and the mask predictor were chosen based on validation performance.
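The video encoder description above translates into roughly the following PyTorch sketch; the number of stacked input frames and the use of a lazy first linear layer are assumptions made to keep the example self-contained.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    """3x3 conv -> batch norm -> ReLU -> 2x2 max-pool."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
    )

class VideoEncoder(nn.Module):
    """VGG-M-style encoder: five conv blocks, then linear blocks of 1024/512/256."""
    def __init__(self, in_frames=12, emb_dim=256):
        super().__init__()
        self.convs = nn.Sequential(
            conv_block(in_frames, 96),   # 96 filters in the first block
            conv_block(96, 128),         # 128 filters in the remaining blocks
            conv_block(128, 128),
            conv_block(128, 128),
            conv_block(128, 128),
        )
        self.linears = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(1024), nn.ReLU(inplace=True),
            nn.Linear(1024, 512), nn.ReLU(inplace=True),
            nn.Linear(512, emb_dim), nn.ReLU(inplace=True),
        )

    def forward(self, mouth_frames):     # (batch, frames, height, width)
        return self.linears(self.convs(mouth_frames))
```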

Metrics: Three metrics are used to evaluate the performance of the audiovisual enhancer: mean absolute error (MAE), SNR, and perceptual evaluation of speech quality (PESQ). SNR is used for measuring the background noise reduction, while PESQ is used for measuring the perceptual speech quality [25]. We use SNR and PESQ for measuring the performance of the enhanced reconstructed signals at the utterance level and use MAE for measuring the performance of the predicted IRMs at the 200 ms segment level. We use the ground-truth clean reconstructed signals as reference when computing SNR and PESQ.
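For reference, the SNR against the clean reconstructed signal can be computed as in the short sketch below; this is a generic formulation and not necessarily the exact implementation used in the paper (PESQ follows ITU-T P.862 and is typically computed with a dedicated implementation).

```python
import numpy as np

def snr_db(reference, estimate, eps=1e-12):
    """SNR in dB of `estimate` relative to the clean `reference` signal."""
    noise = reference - estimate
    return 10 * np.log10(np.sum(reference ** 2) / (np.sum(noise ** 2) + eps))
```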

5. Results and Analysis

5.1. Baseline

The goal is to build a strong baseline to be used for further analysis. We fix the visual encoder and study how changes to the audio encoder and the loss function affect the audiovisual enhancement performance. For the audio encoder, we compare the performance of fully-connected (FC)-based encoders to that of LSTM-based encoders. We compare three regression-based loss functions: mean squared error (MSE), MAE, and a hybrid loss function that combines MAE with the cosine distance.

MSE is a common loss function used in regression problems. Minimizing the MSE is equivalent to maximizing the log-likelihood of data with a unimodal Gaussian distribution. Upon further inspection of the distribution of the training targets, i.e. the IRMs, we find that it does not resemble a unimodal Gaussian. Instead, the distribution of our training targets is bimodal, with a very large peak at zero (sparse labels) and a second smaller peak at one. The MSE solution in this case, which is the conditional mean of the distribution, will be between the two peaks, shifted toward the higher peak at zero. This results in predicting blurry masks, which is consistent with observations about using the MSE loss in computer vision applications [22].

Using the MAE loss function can mitigate some of the limitations incurred from using the MSE loss function by encouraging the prediction of sharper IRMs [9, 11]. One remaining limitation with using both MSE and MAE loss functions is the assumption that the individual components of the IRM vector are statistically independent. To address this limitation, we propose using a joint loss function that combines MAE with the cosine loss function. The cosine loss measures the distance between two entire vectors instead of measuring the distance between individual vector components. The cosine distance, however, cannot be used as a standalone loss, as it minimizes the angle between two vectors irrespective of their magnitudes. This can result in IRM vectors with magnitudes beyond the masks' boundaries, i.e. zero and one. Therefore, we use the following hybrid loss of the MAE and cosine distance to optimize the angle between the ground truth and inferred IRM vectors while bounding their magnitude values to be between zero and one:

\mathcal{L}_{\mathrm{hybrid}} = \mathcal{L}_{\mathrm{MAE}} + \alpha \, \mathcal{L}_{\cos}

where α is a trade-off parameter that we set to 0.5 in our experiments.
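A hedged PyTorch sketch of this hybrid loss is shown below; the reduction over the batch is an assumption.

```python
import torch
import torch.nn.functional as F

def hybrid_loss(pred_irm, true_irm, alpha=0.5):
    """MAE plus alpha times the cosine distance between IRM vectors."""
    mae = F.l1_loss(pred_irm, true_irm)
    # Cosine distance = 1 - cosine similarity, averaged over the batch.
    cos = 1.0 - F.cosine_similarity(pred_irm, true_irm, dim=-1).mean()
    return mae + alpha * cos
```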

Table 1 gives a summary of the results obtained from our baseline experiment. The results show that considerable gains are achieved using an audio-only enhancer (columns labeled A), which was not expected a priori. One reason for this is that although the target mixture for the noisy signal was 0 dB SNR, mixtures of 1 dB emerged due to short pauses in the target and background speech. This 1 dB difference between target and background acoustic speech gives the network a clue for enhancing the target speaker, even without visual cues. That said, the results show that the addition of visual cues still provides an improvement in performance for all setups. The results also show that using an LSTM-based audio encoder yields better performance compared to FC-based encoders. This can be attributed to the temporal modeling of LSTMs. Finally, the results show that using the proposed hybrid loss function gives improvements over using MSE for the majority of the setups.

5.2. Viseme-specific Relative Improvements

In this section, we investigate whether the visual features improve the speech enhancement model by simply providing it with voice activity features, i.e. speech/silence, or by providing the model with more fine-grained information about what is being articulated. We compare the per-viseme improvements of the audio-only and audiovisual speech enhancement models in terms of the MAE between the inferred and ground truth IRMs.

Table 2: The phoneme-viseme mapping used in our work and the enhancement improvements gained per viseme due to the addition of visual cues. %Δ denotes the percentage decrease in MAE (higher is better).

Viseme cluster                  Viseme   Phonemes                        %Δ
Lip-rounded vowels, level 1     /V1/     /aa/ /ah/ /ao/ /aw/ /er/ /oy/   26.6
Silence                         /SIL/    /sil/ /sp/                      25.0
Bilabial                        /P/      /p/ /b/ /m/                     23.1
Lip-stretched vowels, level 1   /V3/     /ae/ /eh/ /ey/ /ay/ /y/         20.0
Palato-alveolar                 /SH/     /sh/ /zh/ /ch/ /jh/             19.0
Alveolar semivowels             /L/      /l/ /el/ /r/                    17.6
Lip-rounded vowels, level 2     /V2/     /uw/ /uh/ /ow/ /w/              16.7
Velar                           /G/      /g/ /ng/ /k/ /hh/               15.4
Alveolar fricative              /Z/      /z/ /s/                         14.3
Alveolar                        /T/      /t/ /d/ /n/ /en/                13.3
Dental                          /TH/     /th/ /dh/                       13.3
Labio-dental                    /F/      /f/ /v/                         13.3
Lip-stretched vowels, level 2   /V4/     /ih/ /iy/                       12.5

The per-viseme performance is obtained using three steps: First, we apply an in-house ASR model to all test utterances to estimate frame-phoneme alignments. Next, we cluster the phonemes into visemes following the phoneme-viseme mapping in Table 2, which is a modified version of the mapping used in [12]. Finally, we compute the MAE between the predicted mask and the IRM in the test set and report the per-viseme performance.
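This three-step procedure can be sketched as follows; the phoneme-to-viseme lookup is abbreviated from Table 2, and the data layout (one phoneme label per mask frame) is an assumption.

```python
from collections import defaultdict
import numpy as np

# Partial phoneme-to-viseme map following Table 2 (abbreviated here).
PHONEME_TO_VISEME = {"p": "/P/", "b": "/P/", "m": "/P/",
                     "sh": "/SH/", "zh": "/SH/", "ch": "/SH/", "jh": "/SH/",
                     "sil": "/SIL/", "sp": "/SIL/"}

def per_viseme_mae(frame_phonemes, pred_masks, true_masks):
    """Average MAE between predicted and ground-truth IRMs, grouped by viseme."""
    errors = defaultdict(list)
    for phone, pred, true in zip(frame_phonemes, pred_masks, true_masks):
        viseme = PHONEME_TO_VISEME.get(phone)
        if viseme is not None:
            errors[viseme].append(np.mean(np.abs(pred - true)))
    return {v: float(np.mean(e)) for v, e in errors.items()}
```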

The results in Table 2 show that the addition of visual cues results in improvements for all visemes. The results also show that the improvement is not the same for all visemes. As shown in Table 2, the performance gains vary based on what is being articulated. For instance, we see improvements for the viseme /SH/, which sounds like noise acoustically but is easy to classify visually. We also see different gains for the phoneme /m/, which is mapped to the viseme cluster /P/, and phoneme /n/, which is mapped to the viseme cluster /T/. Both of these phonemes sound similar acoustically but look different visually.

5.3. Viseme Classification with Learned Visual Embeddings

In this section, we investigate whether or not we can use audiovisual speech enhancement as a self-supervised task for learning meaningful visual representations that can be used in other visual speech applications. Given the trained audiovisual speech enhancement model from our previous experiment, we disconnect the video encoder and use it as a general feature extractor. We use these extracted features to train a logistic regression model for viseme classification. For training the logistic regression model, we further split the test set used for evaluating the audiovisual speech enhancement model into training, validation, and test sets following a speaker-independent 80/10/10 split rule. This approach ensures two things: (1) speakers used for training the enhancement models are different from those used in analysis; and (2) the logistic regression model is trained, validated, and tested on speaker-independent partitions. The C hyper-parameter of the logistic regression model was tuned using the validation set. The performance is evaluated in terms of recall per viseme.

Table 3: Viseme classification performance obtained using visual embeddings extracted from the video encoder in the enhancement model.

Viseme     Recall (%)
/SIL/      84.3
/SH/       68.7
/P/        65.1
/F/        50.0
/Z/        43.1
/V1/       39.6
/V3/       31.1
/V2/       28.5
/V4/       24.0
/L/        20.6
/TH/       19.0
/G/        9.5
/T/        4.2
Average    33.5
Chance     7.7
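A sketch of this probing setup with scikit-learn is shown below; the embedding extraction helper and variable names are illustrative assumptions, not the authors' code.

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

def extract_embeddings(video_encoder, segments):
    """Run the frozen video encoder over (N, frames, H, W) mouth segments."""
    video_encoder.eval()
    with torch.no_grad():
        return video_encoder(torch.as_tensor(segments, dtype=torch.float32)).numpy()

def probe_visemes(video_encoder, train_segs, train_y, test_segs, test_y, C=1.0):
    X_train = extract_embeddings(video_encoder, train_segs)
    X_test = extract_embeddings(video_encoder, test_segs)
    clf = LogisticRegression(C=C, max_iter=1000).fit(X_train, train_y)
    preds = clf.predict(X_test)
    # Per-viseme recall; the macro average corresponds to unweighted accuracy.
    per_viseme = recall_score(test_y, preds, average=None)
    macro = recall_score(test_y, preds, average="macro")
    return per_viseme, macro
```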

Table 3 shows the viseme classification performance obtained when using the visual embeddings as features for a simple logistic regression viseme classifier. We find that the visual embeddings are discriminative toward visemes, giving an overall unweighted accuracy of 33.5%, where 7.7% is the chance performance. We find that our classifier predicts apparent visemes, such as /Z/, /F/, /P/, and /SH/, relatively accurately compared to predicting visemes articulated more towards the back of the mouth, such as /T/ and /G/. The trends that we observe for viseme prediction performance using visual embeddings are similar to those observed in viseme classification tasks. As a benchmark, we were able to obtain an unweighted accuracy of 49.2% using a separate VGG-M neural network trained from scratch specifically to detect visemes, which suggests that our self-supervised visual features were able to close a large proportion of the performance gap. This demonstrates the efficacy of audiovisual speech enhancement as a self-supervised task for learning strong visual features.

6. Conclusion

In this paper, we shed light on how an audiovisual speech enhancement model utilizes visual cues to improve the quality and intelligibility of a target speech signal. We showed that the performance of enhancement models varies depending on what is being articulated, and we showed that the gains from adding visual cues are not uniform across what is being articulated. Further, we demonstrated the effectiveness of audiovisual speech enhancement as a self-supervision task for learning meaningful visual representations for visual speech applications.

7. References

[1] S. Sivasankaran, A. A. Nugraha, E. Vincent, J. A. Morales-Cordovilla, S. Dalmia, I. Illina, and A. Liutkus, "Robust ASR using neural network based speech enhancement and feature simulation," in 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU). IEEE, 2015, pp. 482–489.

[2] A. H. Moore, P. P. Parada, and P. A. Naylor, "Speech enhancement for robust automatic speech recognition: Evaluation using a baseline system and instrumental measures," Computer Speech & Language, vol. 46, pp. 574–584, 2017.

[3] S. Shon, H. Tang, and J. Glass, "VoiceID loss: Speech enhancement for speaker verification," arXiv preprint arXiv:1904.03601, 2019.

[4] K. Tan and D. Wang, "A convolutional recurrent neural network for real-time speech enhancement," in Interspeech, 2018, pp. 3229–3233.

[5] N. Mamun, S. Khorram, and J. H. Hansen, "Convolutional neural network-based speech enhancement for cochlear implant recipients," arXiv preprint arXiv:1907.02526, 2019.

[6] A. H. Abdelaziz, S. Zeiler, and D. Kolossa, "Twin-HMM-based audio-visual speech enhancement," in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, May 2013, pp. 3726–3730.

[7] S. Zeiler, H. Meutzner, A. H. Abdelaziz, and D. Kolossa, "Introducing the turbo-twin-HMM for audio-visual speech enhancement," in INTERSPEECH, 2016.

[8] A. Gabbay, A. Shamir, and S. Peleg, "Visual speech enhancement," arXiv preprint arXiv:1711.08789, 2017.

[9] T. Afouras, J. S. Chung, and A. Zisserman, "The conversation: Deep audio-visual speech enhancement," arXiv preprint arXiv:1804.04121, 2018.

[10] A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. T. Freeman, and M. Rubinstein, "Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation," arXiv preprint arXiv:1804.03619, 2018.

[11] T. Afouras, J. S. Chung, and A. Zisserman, "My lips are concealed: Audio-visual speech enhancement through obstructions," arXiv preprint arXiv:1907.04975, 2019.

[12] C. Neti, G. Potamianos, J. Luettin, I. Matthews, H. Glotin, D. Vergyri, J. Sison, and A. Mashari, "Audio visual speech recognition," IDIAP, Tech. Rep., 2000.

[13] A. H. Abdelaziz, "Comparing fusion models for DNN-based audiovisual continuous speech recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 3, pp. 475–484, 2017.

[14] T. Afouras, J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman, "Deep audio-visual speech recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.

[15] A. H. Abdelaziz, "Improving acoustic modeling using audio-visual speech," in 2017 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2017, pp. 1081–1086.

[16] J. S. Chung, B.-J. Lee, and I. Han, "Who said that?: Audio-visual speaker diarisation of real-world meetings," arXiv preprint arXiv:1906.10042, 2019.

[17] J. S. Chung and A. Zisserman, "Out of time: Automated lip sync in the wild," in Asian Conference on Computer Vision. Springer, 2016, pp. 251–263.

[18] A. Owens and A. A. Efros, "Audio-visual scene analysis with self-supervised multisensory features," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 631–648.

[19] H. Zhao, C. Gan, A. Rouditchenko, C. Vondrick, J. McDermott, and A. Torralba, "The sound of pixels," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 570–586.

[20] H. Zhao, C. Gan, W.-C. Ma, and A. Torralba, "The sound of motions," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 1735–1744.

[21] C. Gan, D. Huang, H. Zhao, J. B. Tenenbaum, and A. Torralba, "Music gesture for visual sound separation," arXiv preprint arXiv:2004.09476, 2020.

[22] R. Zhang, P. Isola, and A. A. Efros, "Colorful image colorization," in European Conference on Computer Vision. Springer, 2016, pp. 649–666.

[23] P. C. Loizou, Speech Enhancement: Theory and Practice. CRC Press, 2013.

[24] D. E. King, "Dlib-ml: A machine learning toolkit," Journal of Machine Learning Research, vol. 10, no. Jul, pp. 1755–1758, 2009.

[25] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, "Perceptual evaluation of speech quality (PESQ), a new method for speech quality assessment of telephone networks and codecs," in 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 01CH37221), vol. 2. IEEE, 2001, pp. 749–752.