Unsupervised End-to-End Learning of Discrete Linguistic Units for Voice Conversion

Andy T. Liu¹, Po-chun Hsu¹, Hung-yi Lee¹

¹College of Electrical Engineering and Computer Science, National Taiwan University
{r07942089, r07942095, hungyilee}@ntu.edu.tw

Abstract
We present an unsupervised end-to-end training scheme where we discover discrete subword units from speech without using any labels. The discrete subword units are learned under an ASR-TTS autoencoder reconstruction setting, where an ASR-Encoder is trained to discover a set of common linguistic units given a variety of speakers, and a TTS-Decoder is trained to project the discovered units back to the designated speech. We propose a discrete encoding method, Multilabel-Binary Vectors (MBV), to make the ASR-TTS autoencoder differentiable. We found that the proposed encoding method offers automatic extraction of speech content from speaker style, and is sufficient to cover the full linguistic content of a given language. Therefore, the TTS-Decoder can synthesize speech with the same content as the input of the ASR-Encoder but with different speaker characteristics, which achieves voice conversion (VC). We further improve the quality of VC using adversarial training, where we train a TTS-Patcher that augments the output of the TTS-Decoder. Objective and subjective evaluations show that the proposed approach offers strong VC results as it eliminates speaker identity while preserving content within speech. In the ZeroSpeech 2019 Challenge, we achieved outstanding performance in terms of low bitrate.
Index Terms: acoustic unit discovery, voice conversion, speech disentangled representation, adversarial training

1. Introduction
Although human speech inherently carries linguistic features that represent textual information, modern text-to-speech training pipelines still require parallel speech and text transcription pairs [1][2][3]. Parallel speech and transcripts may not always be available and are costly to acquire, whereas human speech alone can be easily gathered. In this work, we focus on exploiting unlabeled speech to discover discrete linguistic units, where a machine learns to uncover the linguistic features hidden in human utterances without any supervision. We use these discovered linguistic units for voice conversion (VC) and achieve outstanding results.

Embedding audio signals into latent representations has been a well-studied practice [4][5][6][7][8][9][10]. Previous studies have also attempted to encode speech content into various representations for voice conversion, including the use of disentangled autoencoders [11], VAEs [12], or GANs [13]. Continuous vectors are the most common approach; however, when it comes to encoding speech content, they may not be the best choice, as they are unlike the discrete phonemes that we often use to represent human language. Other works [14][15][16] also attempt to learn discrete representations from audio, but they did not apply the learned units to VC.

In VC, the state-of-the-art approach [11] requires an additional loss in a GAN framework to guarantee the disentanglement of the learned encodings, whereas the proposed approach naturally possesses the ability to separate speaker style from speech content. In [17], VC is achieved through multiple losses together with a K-way quantized embedding space and an autoregressive WaveNet; in contrast, the proposed encoding space does not require additional training constraints.

Figure 1: The ASR-TTS autoencoder framework where MBV are learned as discrete linguistic representations.


In this work, we use an ASR-TTS autoencoder to discover discrete linguistic units from speech, without any alignment, text label, or parallel data provided. The ASR-Encoder learns to encode speech from different speakers into a common, small set of linguistic symbols. This finite set of linguistic symbols is represented by Multilabel-Binary Vectors (MBV), vectors consisting of an arbitrary number of zeros and ones. The proposed MBV method is differentiable, hence allowing backpropagation of gradients and end-to-end training in an autoencoder reconstruction setting. While the ASR-Encoder learns a many-to-one mapping from speech to discrete subword units, a TTS-Decoder learns a one-to-many mapping from discrete subword units back to speech. The discrete nature of MBV allows it to innately separate linguistic content from speech, removing speaker characteristics. Given an utterance of a source speaker, we are able to encode its speech content using the ASR-Encoder and perform voice conversion, generating speech with the same linguistic content but the style of a target speaker using the TTS-Decoder.

In training, the TTS-Decoder is always given an encoding from an arbitrary speaker and is trained to decode it back to the original speech of that particular speaker. At inference time, the TTS-Decoder has to take encodings of a source speaker and decode them to a target speaker, an encoding-speaker pair that it never observed during training. Although voice conversion is already feasible at this point, we propose to use additional adversarial training to compensate for the training-testing inconsistency. Under the WGAN [18] setting, we train a TTS-Patcher in place of a generator.



Figure 2: Target guided adversarial learning to improve voice conversion performance. We annotate the adversarial step in orange and the target guided step in blue.

The TTS-Patcher learns to generate a mask that augments the output of the TTS-Decoder. Furthermore, we use a target driven reconstruction loss to guide the generator's update. As a result, the quality of voice conversion is improved. In the ZeroSpeech 2019 Challenge [19], we achieve 2nd place in terms of low bitrate under a strong dimension constraint on the Surprise Language [20][21] leaderboard. We further show that the proposed method is capable of generating high quality, intelligible speech when the dimension constraint is removed.

2. Proposed method
2.1. Discrete linguistic units discovery

We present an unsupervised end-to-end learning framework for discrete linguistic unit discovery. In this stage, we learn a common set of discrete encodings for all speakers. Let x ∈ X be an acoustic feature sequence, where X is the set of all such sequences from all source and target speakers, and x is a fixed length-r segment randomly sampled from X. Let y ∈ Y be a speaker, where Y is the group of all source and target speakers who produce the sequence collection X. In a training pair (x, y) ∈ (X, Y), the sequence x is produced by speaker y.

2.1.1. Learning Multilabel-Binary Vectors

In Figure 1, an ASR-TTS autoencoder framework is used to learn discrete linguistic units. The ASR-Encoder is trained to map an input acoustic feature sequence x ∈ X to a latent discrete encoding representation:

z = ASR(x), (1)

where we propose to use Multilabel-Binary Vectors (MBV) to represent the z generated in (1); the discrete encoding z is designed to represent the linguistic content of the input speech x. We define MBV as:

z = [e_1, e_2, ..., e_n] ∈ R^n,  e_i ∈ {0, 1}  (2)

where z is an n-dimensional binary vector consisting of an arbitrary number of zeros and ones. The proposed MBV encoding method is differentiable, and allows end-to-end training with direct gradient backpropagation in an autoencoder framework. To obtain the binarized differentiable vector z, we linearly project a continuous output vector into an R^{n×2} space, where n is the dimension of the desired MBV. The dual-channel projection allows each dimension of an MBV to symbolize an arbitrary attribute's presence. We then apply the categorical reparameterization trick with Gumbel-Softmax [22] on the dual channels, which is equivalent to asking the model to predict whether an attribute is observed in a given input. Since the two channels are linked together by Gumbel-Softmax, simply picking one of them is sufficient; hence we select the first channel as our MBV encoding z. We illustrate the process of MBV encoding in Figure 1 (b). Former approaches also use Gumbel-Softmax on the output layer to obtain one-hot vectors [22]. However, the resulting one-hot vectors are insufficient to represent content in speech, as the one-hot encoding space is too sparse for machines to learn linguistic meanings, which we verified in our experiments.
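For clarity, the sketch below shows one way the dual-channel Gumbel-Softmax binarization could be implemented in PyTorch; the module name MBVEncoder and the layer sizes are illustrative assumptions, not the authors' released code.

```python
# A minimal sketch of MBV binarization with a dual-channel Gumbel-Softmax,
# assuming PyTorch; names and dimensions are illustrative, not the released code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MBVEncoder(nn.Module):
    def __init__(self, hidden_dim: int, n: int):
        super().__init__()
        # Linear projection into an (n x 2) space: two logits per MBV dimension.
        self.proj = nn.Linear(hidden_dim, n * 2)
        self.n = n

    def forward(self, h: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
        # h: (batch, time, hidden_dim) continuous ASR-Encoder outputs.
        logits = self.proj(h).view(h.size(0), h.size(1), self.n, 2)
        # Gumbel-Softmax over the two channels; hard=True gives discrete 0/1
        # samples in the forward pass with straight-through gradients.
        two_channel = F.gumbel_softmax(logits, tau=tau, hard=True, dim=-1)
        # The two channels sum to one, so keeping the first channel is enough.
        z = two_channel[..., 0]            # (batch, time, n), entries in {0, 1}
        return z

# Example: encode a batch of 16-frame, 256-dim features into 1024-dim MBVs.
mbv = MBVEncoder(hidden_dim=256, n=1024)
z = mbv(torch.randn(4, 16, 256))
```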

2.1.2. Voice reconstruction and conversion

Given the ASR-Encoder's output, the TTS-Decoder is trained to generate an output acoustic feature sequence defined as:

X(x, y) = TTS(ASR(x), y), (3)

where X(x, y) is a reconstruction of x from z = ASR(x), given the speaker identity y ∈ Y. Mean Absolute Error (MAE) is employed as the ASR-TTS autoencoder reconstruction loss, as MAE is reported to generate sharper outputs than Mean Square Error [23]. The reconstruction loss is given by:

L_rec(θ_asr, θ_tts) = E_{(x,y)∼(X,Y)} ‖x − X(x, y)‖_1,  (4)

where θ_asr and θ_tts are the parameters of the ASR-Encoder and TTS-Decoder, respectively. We uniformly sample (x, y) for training in (4). Because the speaker identity is provided to the TTS-Decoder, the proposed MBV encoding z is able to learn an abstract space that is invariant to speaker identity and only encodes the content of speech, without using any form of linguistic supervision.
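A minimal sketch of the reconstruction objective in (4), assuming PyTorch modules asr_encoder and tts_decoder with the interfaces shown; this is illustrative, not the authors' implementation.

```python
# Sketch of one autoencoder reconstruction step (Eq. 4), assuming PyTorch
# modules `asr_encoder` and `tts_decoder`; interfaces are illustrative.
import torch
import torch.nn.functional as F

def reconstruction_step(asr_encoder, tts_decoder, optimizer, x, speaker_ids):
    """x: (batch, frames, feat_dim) spectrogram segments; speaker_ids: (batch,)."""
    z = asr_encoder(x)                      # discrete MBV encodings
    x_hat = tts_decoder(z, speaker_ids)     # reconstruct the same speaker
    loss = F.l1_loss(x_hat, x)              # Mean Absolute Error
    optimizer.zero_grad()
    loss.backward()                         # gradients pass through Gumbel-Softmax
    optimizer.step()
    return loss.item()
```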

At inference time, given source speech x_s and a target speaker y_t, the TTS-Decoder can generate the voice of the target speaker y_t using the linguistic content ASR(x_s) from x_s:

X(x_s, y_t) = TTS(ASR(x_s), y_t),  (5)

where X(x_s, y_t) is the output of the TTS-Decoder, which has the linguistic content of x_s but the style of speaker y_t.
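Continuing the sketch above, conversion in (5) only swaps the speaker identity passed to the decoder; the helper below is an assumption for illustration, not released code.

```python
# Voice conversion (Eq. 5): encode the source utterance, decode with the
# target speaker's identity. Assumes the same illustrative modules as above.
import torch

@torch.no_grad()
def convert(asr_encoder, tts_decoder, x_src, target_speaker_id):
    z = asr_encoder(x_src)                    # speaker-invariant content
    return tts_decoder(z, target_speaker_id)  # target speaker's style
```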

2.2. Target guided adversarial learning

With the learned speaker-invariant ASR-Encoder mapping in (1), we successfully represent speech content with discrete binary MBVs. In this section, we describe how adversarial training is used to boost VC quality based on the discrete linguistic units learned in Section 2.1. We train a TTS-Patcher in an unsupervised manner, in which the TTS-Patcher generates a spectrum mask that residually augments the output of equation (5), resulting in a more precise voice conversion result.

We define two sets of speakers, a source speaker set and a target speaker set, where we aim to convert the speech of a source speaker into a target speaker's style while preserving its content. Let x_s ∈ X_S and x_t ∈ X_T be sequences, where X_S and X_T are the sets of sequences from source speakers and target speakers, respectively. Let y_s ∈ Y_S be a speaker from the set of all source speakers Y_S who produce X_S, and let y_t ∈ Y_T be a speaker from the set of all target speakers Y_T who produce X_T.


2.2.1. Adversarial learning step

In Figure 2, a TTS-Patcher is trained as a generator under an adversarial learning setting. The TTS-Patcher takes ASR(x_s) and y_t as input, and generates a spectrum mask P(x_s, y_t) with values ranging from zero to one, which modifies X(x_s, y_t) through a residual augmentation:

X′(x_s, y_t) = (P(x_s, y_t) ⊗ X(x_s, y_t)) ⊕ X(x_s, y_t).  (6)

The ⊕ and ⊗ symbols indicate element-wise addition and multiplication, respectively. In (6), x_s is a randomly sampled input speech segment from source speaker y_s, and y_t is a randomly sampled target speaker. X(x_s, y_t) is the voice conversion utterance obtained in (5), P(x_s, y_t) is the output of the TTS-Patcher, and X′(x_s, y_t) is the augmented spectrum.
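The residual augmentation in (6) is a single element-wise operation; the sketch below shows it in PyTorch, where tts_patcher is an assumed module that outputs values in [0, 1].

```python
# Sketch of the residual mask augmentation in Eq. (6), assuming PyTorch;
# `tts_patcher` is an assumed module whose output lies in [0, 1].
import torch

def patch_output(tts_patcher, z_src, target_speaker_id, x_converted):
    mask = tts_patcher(z_src, target_speaker_id)  # P(x_s, y_t), same shape as x_converted
    return mask * x_converted + x_converted       # (P ⊗ X) ⊕ X
```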

A discriminator D is trained to distinguish whether an input acoustic feature sequence is real or reconstructed by the machine. Since the naive GAN [24] is notoriously hard to train, we instead minimize the Wasserstein-1 distance between the real and fake distributions, as proposed in the WGAN [18] formulation:

L_WGAN = E_{x_t∼X_T}[D(x_t)] − E_{x_s∼X_S, y_t∼Y_T}[D(X′(x_s, y_t))].  (7)

The discriminator computes the Wasserstein-1 distance between two distributions: real data x_t sampled from the target speaker set X_T, and the augmented voice conversion outputs from (6). We use the alternative WGAN-GP [25] formulation to enforce the 1-Lipschitz constraint required by D, where weight clipping is replaced with a gradient penalty. On the last layer of the discriminator, we attach an additional layer that learns a classifier to predict the speaker from a given speech segment. This allows the discriminator to consider the input spectrum's fidelity and speaker identity at the same time [26].
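A sketch of a WGAN-GP critic loss with an auxiliary speaker classifier is given below; the discriminator returning a (score, speaker_logits) pair and the gradient-penalty weight of 10 are assumptions for illustration, not the authors' exact implementation.

```python
# Sketch of a WGAN-GP critic loss with an auxiliary speaker classifier,
# assuming PyTorch; the discriminator is assumed to return (score, speaker_logits).
import torch
import torch.nn.functional as F

def critic_loss(discriminator, x_real, x_fake, real_speaker_ids, gp_weight=10.0):
    score_real, spk_logits = discriminator(x_real)
    score_fake, _ = discriminator(x_fake.detach())
    # Negative Wasserstein estimate: the critic maximizes E[D(real)] - E[D(fake)].
    w_loss = score_fake.mean() - score_real.mean()
    # Gradient penalty on random interpolates between real and fake samples.
    eps = torch.rand(x_real.size(0), 1, 1, device=x_real.device)
    x_hat = (eps * x_real + (1 - eps) * x_fake.detach()).requires_grad_(True)
    score_hat, _ = discriminator(x_hat)
    grads = torch.autograd.grad(score_hat.sum(), x_hat, create_graph=True)[0]
    gp = ((grads.flatten(1).norm(2, dim=1) - 1) ** 2).mean()
    # Auxiliary classification loss ties the critic's judgment to speaker identity.
    cls = F.cross_entropy(spk_logits, real_speaker_ids)
    return w_loss + gp_weight * gp + cls
```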

2.2.2. Target guided training step

The decoupled learning [27] of the ASR-TTS autoencoder and the TTS-Patcher stabilizes the GAN [24] training process. However, we found that under the adversarial learning scheme, the TTS-Patcher can easily learn to deceive the discriminator by over-adding style, which greatly compromises the original speech content. This is caused by the discriminator's inability to discriminate utterances with incorrect or ambiguous content; the discriminator only learns to focus on speaker style. As a solution, we propose an additional target guided training step: we apply an additional reconstruction loss after every adversarial step, as shown in Figure 2. Instead of converting x_s, the ASR-Encoder now takes a segment of target speech x_t as input, and equation (6) becomes:

X′(x_t, y_t) = (P(x_t, y_t) ⊗ X(x_t, y_t)) ⊕ X(x_t, y_t),  (8)

and we minimize the MAE between x_t and X′(x_t, y_t):

L_rec(θ_p) = E_{(x_t,y_t)∼(X_T,Y_T)} ‖x_t − X′(x_t, y_t)‖_1,  (9)

where θ_p denotes the parameters of the TTS-Patcher. This loss effectively guides the TTS-Patcher's update under the adversarial setting, as the added style is regularized to preserve intelligibility.
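A sketch of the target guided step in (8) and (9), reusing the illustrative modules from the earlier sketches; only the TTS-Patcher parameters are updated here, which is an assumption consistent with θ_p being the only argument of the loss.

```python
# Sketch of the target guided reconstruction step (Eqs. 8-9), assuming the
# same illustrative modules as above; only TTS-Patcher parameters are updated.
import torch
import torch.nn.functional as F

def target_guided_step(asr_encoder, tts_decoder, tts_patcher, patcher_optim,
                       x_tgt, tgt_speaker_ids):
    with torch.no_grad():
        z = asr_encoder(x_tgt)                   # content of the target speech
        x_dec = tts_decoder(z, tgt_speaker_ids)  # decode back to the same speaker
    mask = tts_patcher(z, tgt_speaker_ids)
    x_aug = mask * x_dec + x_dec                 # Eq. (8)
    loss = F.l1_loss(x_aug, x_tgt)               # Eq. (9)
    patcher_optim.zero_grad()
    loss.backward()
    patcher_optim.step()
    return loss.item()
```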

3. Implementation
The ASR-Encoder is inspired by the CBHG module [1], where the linear output of the ASR-Encoder is fed to the MBV encoding module. We add noise in training by adding dropout layers in the ASR-Encoder, as suggested in [28]. The TTS-Decoder and TTS-Patcher have identical model architectures, where we use pixel shuffle layers to generate high resolution spectra [29]. We add a speaker embedding to the feature maps of all layers, where a distinct embedding is learned for each layer, as different information may be needed at each layer. The discriminator consists of 2D-convolution blocks for temporal texture capturing, and convolutional projection layers followed by fully-connected output layers. We trained the network using the Adam [30] optimizer and a batch size of 16. In the discrete linguistic unit discovery stage, we train the ASR-TTS autoencoder for 200k mini-batches. In the target guided adversarial learning stage, we train the model for 50k mini-batches; in each batch we run one adversarial learning step, consisting of 5 iterations of discriminator updates and 1 iteration of generator update, followed by a target guided reconstruction step.
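The two-stage schedule described above could be organized as in the sketch below, which reuses the step functions from the earlier sketches; sample_batch, adversarial_discriminator_step, and adversarial_generator_step are assumed helpers, and everything beyond the 200k/50k counts and the 5:1 update ratio stated in the text is an assumption.

```python
# Sketch of the two-stage training schedule; step counts and the 5:1 ratio
# follow the text, all helper names and objects are assumptions.

def train(models, optims, data):
    # Stage 1: discrete linguistic unit discovery (ASR-TTS autoencoder).
    for _ in range(200_000):
        x, y = sample_batch(data.all_speakers, batch_size=16)
        reconstruction_step(models.asr, models.tts, optims.autoencoder, x, y)

    # Stage 2: target guided adversarial learning of the TTS-Patcher.
    for _ in range(50_000):
        for _ in range(5):                                    # 5 discriminator updates
            adversarial_discriminator_step(models, optims, data)
        adversarial_generator_step(models, optims, data)      # 1 generator update
        x_t, y_t = sample_batch(data.target_speakers, batch_size=16)
        target_guided_step(models.asr, models.tts, models.patcher,
                           optims.patcher, x_t, y_t)
```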

We train and evaluate our model using the ZeroSpeech 2019 English dataset [19]. In particular, we use the "Train Unit Dataset" as our source speaker set X_S, the "Train Voice Dataset" as our target speaker set X_T, and we evaluate models with the "Test Dataset". We use log-magnitude spectrograms as acoustic features; the detailed settings are in Table 1. During training, our model processes 128 consecutive overlapping frames of spectrogram, uniformly sampled from the dataset. At inference time, for a given input with more than 128 frames, the model processes them as segments and concatenates the outputs along the time axis. Source code is publicly available at https://github.com/andi611/ZeroSpeech-TTS-without-T.
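Segment-wise inference could look like the sketch below, which reuses the convert helper sketched earlier; chunking into non-overlapping 128-frame segments is an assumption made for illustration.

```python
# Sketch of segment-wise inference on utterances longer than 128 frames,
# assuming the `convert` helper from the earlier sketch; non-overlapping
# chunking is an assumption.
import torch

@torch.no_grad()
def convert_long(asr_encoder, tts_decoder, spec, target_speaker_id, seg_len=128):
    # spec: (frames, feat_dim) log-magnitude spectrogram of one utterance.
    outputs = []
    for start in range(0, spec.size(0), seg_len):
        seg = spec[start:start + seg_len].unsqueeze(0)   # (1, <=128, feat)
        outputs.append(convert(asr_encoder, tts_decoder, seg, target_speaker_id))
    return torch.cat(outputs, dim=1).squeeze(0)          # concatenate on the time axis
```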

Table 1: Acoustic feature preprocessing settings

Pre-emphasis    0.97       Sample rate    16 kHz
Frame length    50 ms      Mel-spec       80
Frame shift     12.5 ms    Linear-spec    1024
Window type     Hann       Vocoder        Griffin-Lim
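The Table 1 settings could map to feature extraction and Griffin-Lim resynthesis roughly as sketched below with librosa; the FFT size, normalization, and the omission of the mel projection are assumptions, not the authors' exact pipeline.

```python
# Sketch of log-magnitude spectrogram extraction and Griffin-Lim resynthesis
# following the Table 1 settings, assuming librosa; the FFT size and
# normalization details are assumptions.
import numpy as np
import librosa

SR = 16000
N_FFT = 2048                     # assumed FFT size for ~1024 linear-frequency bins
HOP = int(0.0125 * SR)           # 12.5 ms frame shift
WIN = int(0.050 * SR)            # 50 ms frame length

def wav_to_logspec(wav: np.ndarray) -> np.ndarray:
    wav = np.append(wav[0], wav[1:] - 0.97 * wav[:-1])   # pre-emphasis 0.97
    stft = librosa.stft(wav, n_fft=N_FFT, hop_length=HOP,
                        win_length=WIN, window="hann")
    return np.log(np.abs(stft) + 1e-6)                   # log-magnitude spectrogram

def logspec_to_wav(logspec: np.ndarray) -> np.ndarray:
    mag = np.exp(logspec)
    return librosa.griffinlim(mag, hop_length=HOP, win_length=WIN)
```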

4. Experiments
We compare the proposed MBV encodings with one-hot encodings, continuous encodings, and continuous encodings with an additional loss [11], all under the same autoencoder training setting described in Section 3.

4.1. Degree of disentanglement

To evaluate the degree of disentanglement of the proposed MBV with respect to speaker characteristics, we trained a speaker verification classifier that takes a spectrum as input and predicts speaker identity. We encode speech from a source speaker and convert it to a target speaker. With the pre-trained classifier, we measure target speaker classification accuracy on the converted results. A disentangled representation should produce voice similar to the target speakers and lead to higher classification accuracy. The results are shown in Table 2, where the Dim column indicates the dimension n of the encoding vectors z ∈ R^n. One-hot encodings are insufficient to encode speech, resulting in poor conversion results. Continuous encodings are incapable of disentangling content from style, resulting in lower classification accuracy. In contrast, the proposed MBV encodings have the ability to preserve speech content while removing speaker style. Although both one-hot vectors and MBV are discrete, each dimension of a one-hot vector corresponds to a linguistic unit, while each dimension of an MBV may correspond to a pronunciation attribute. This makes MBV more data efficient than one-hot vectors.
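The evaluation itself reduces to scoring converted utterances with the pre-trained classifier, as in the short sketch below; the classifier interface and batch layout are assumptions.

```python
# Sketch of the disentanglement evaluation: a pre-trained speaker classifier
# scores converted utterances against the intended target speaker.
import torch

@torch.no_grad()
def target_accuracy(speaker_classifier, converted_specs, target_ids):
    # converted_specs: (N, frames, feat); target_ids: (N,) intended target speakers.
    logits = speaker_classifier(converted_specs)
    predictions = logits.argmax(dim=-1)
    return (predictions == target_ids).float().mean().item()
```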



Table 2: Comparison of different latent representations.

Types of encodings              Dim     Acc
One-hot                         1024    43.3%
Continuous                      1024    84.1%
Continuous                      128     79.9%
Continuous (with add'l loss)    1024    78%
Continuous (with add'l loss)    128     81.3%
Ours (MBV)                      1024    92.3%
Ours (MBV)                      128     93.9%


4.2. Subjective and objective evaluation

We perform a subjective human evaluation on the converted voices. We asked 20 subjects to grade each method on a 1 to 5 scale under two measures: the naturalness of the speech and the similarity of the speaker characteristics to the target speaker. Table 3 shows the results of our evaluation: the proposed method yields a significant increase in target similarity with a slight degradation of naturalness. We achieve speech intelligibility comparable to ordinary continuous methods, while achieving better voice conversion quality with more disentanglement (Table 2). The subjective and objective evaluations suggest that the proposed MBV method eliminates speaker identity while preserving content within speech, and is suitable for voice conversion.

Table 3: Results of subjective human evaluation. All methods used an encoding dimension of 1024 unless specified otherwise.

Types of encodings              Naturalness    Similarity
Continuous                      3.80           2.14
Continuous (with add'l loss)    3.21           2.58
Ours (MBV with dim 6)           1.61           1.51
Ours (MBV)                      3.36           3.06
Ours (with adv. training)       2.57           3.15

4.3. Encoding dimension analysis

We use several objective measures to determine the quality of an encoding; these measurements are shown in Table 4. The CER column is the Character Error Rate (CER) of a pre-trained ASR on the converted speech, where we use the ASR results of the real input voice as ground truth; this measures the intelligibility of the converted speech. The BR column measures the average bitrate (amount of information) that the encodings carry with respect to the testing set, as suggested in [19]. The ABX column measures the machine ABX score, which indicates the quality of the encoding [31][19]. The distinct column indicates the number of unique symbols used to encode speech in the test set. Lower values indicate better performance for all the measures described above. In Table 4, we compare the proposed method with other approaches alongside the baseline model [32][33] demonstrated in [19]. Compared to other approaches, the proposed method achieves lower BR and distinct values with comparable ABX scores. Due to the discrete and differentiable nature of MBV, the proposed method can be used in other unsupervised end-to-end clustering or classification tasks, where other approaches may fail to generalize.
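The official bitrate measure is defined by the challenge scripts in [19]; as a rough illustration only, the sketch below estimates the bitrate of a discrete encoding as the symbol rate times the empirical symbol entropy, which is one common formulation and an assumption here, not the challenge's reference implementation.

```python
# Rough illustration of a bitrate estimate for a discrete encoding: symbol
# rate times empirical symbol entropy. The official measure is defined by the
# challenge scripts in [19]; this formulation is an assumption for clarity.
from collections import Counter
import math

def estimate_bitrate(symbol_sequences, total_duration_seconds):
    """symbol_sequences: list of lists of hashable symbols (e.g., MBV tuples)."""
    counts = Counter(s for seq in symbol_sequences for s in seq)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return (total / total_duration_seconds) * entropy   # bits per second
```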

Table 4: Performance of different encoding dimensions.

Method      Dim     CER     BR       ABX     distinct
Baseline    200     1.000   71.98    35.90   65
Cont.       1024    0.036   138.45   31.83   16849
Cont.       128     0.040   138.45   33.96   16849
Ours        1024    0.196   138.45   32.02   16849
Ours        512     0.313   138.45   32.82   16849
Ours        256     0.430   138.45   32.52   16849
Ours        128     0.629   138.45   31.58   16849
Ours        64      0.717   138.35   32.57   16772
Ours        32      0.797   134.80   31.82   14591
Ours        16      0.887   105.96   35.62   3723
Ours        8       0.998   61.79    38.10   146
Ours        7       0.998   55.97    37.71   94
Ours        6       1.000   48.78    39.60   51
Ours        5       1.000   41.32    41.79   28

Figure 3: Partial results of the ZeroSpeech Challenge.


4.4. The Zero Resource Speech Challenge competition

We competed with other teams in the ZeroSpeech 2019 Challenge [19] at a global scale. We use the proposed method with a dimension of 6 to achieve an extremely low bitrate, and were able to encode a whole language with fewer than 64 distinct units; human evaluation also suggests that the produced speech is still acceptable (Table 3). On the Surprise dataset [20][21] leaderboard, the proposed method ranks 2nd in terms of low bitrate, while achieving a higher Mean Opinion Score (MOS) and lower CER than the 1st place team, as shown in Figure 3. Although the proposed approach does not achieve a high MOS, as this is an inevitable trade-off with extremely low bitrate, Table 3 shows that with a larger encoding dimension we were able to generate outstanding voice-converted speech.

5. Conclusions
We proposed to use Multilabel-Binary Vectors to represent the content of human speech, as their discrete nature offers a strong extraction of speaker-independent representations. We show that these discrete units naturally possess the ability to disentangle speech content and style, which makes them extremely suitable for voice conversion tasks. We also show that these discrete units indeed produce better style disentanglement than ordinary settings, and finally we were able to improve voice conversion results through the addition of residually augmented signals.


6. References
[1] Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio et al., "Tacotron: Towards end-to-end speech synthesis," arXiv preprint arXiv:1703.10135, 2017.
[2] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan et al., "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4779–4783.
[3] W. Ping, K. Peng, A. Gibiansky, S. O. Arik, A. Kannan, S. Narang, J. Raiman, and J. Miller, "Deep Voice 3: Scaling text-to-speech with convolutional sequence learning," arXiv preprint arXiv:1710.07654, 2017.
[4] S.-F. Huang, Y.-C. Chen, H.-y. Lee, and L.-S. Lee, "Improved audio embeddings by adjacency-based clustering with applications in spoken term detection," CoRR, vol. abs/1811.02775, 2018.
[5] W. He, W. Wang, and K. Livescu, "Multi-view recurrent neural acoustic word embeddings," arXiv preprint arXiv:1611.04496, 2016.
[6] S. Settle and K. Livescu, "Discriminative acoustic word embeddings: Recurrent neural network-based approaches," 2016 IEEE Spoken Language Technology Workshop (SLT), pp. 503–510, 2016.
[7] Y.-A. Chung, C.-C. Wu, C.-H. Shen, H.-y. Lee, and L.-S. Lee, "Audio Word2Vec: Unsupervised learning of audio segment representations using sequence-to-sequence autoencoder," in INTERSPEECH, 2016.
[8] K. Levin, K. Henry, A. Jansen, and K. Livescu, "Fixed-dimensional acoustic embeddings of variable-length segments in low-resource settings," in 2013 IEEE Workshop on Automatic Speech Recognition and Understanding. IEEE, 2013, pp. 410–415.
[9] W.-N. Hsu, S. Zhang, and J. R. Glass, "Unsupervised learning of disentangled and interpretable representations from sequential data," in NIPS, 2017.
[10] A. Jansen, M. Plakal, R. Pandya, D. P. W. Ellis, S. Hershey, J. Liu, R. C. Moore, and R. A. Saurous, "Unsupervised learning of semantic audio representations," 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 126–130, 2018.
[11] J.-c. Chou, C.-c. Yeh, H.-y. Lee, and L.-s. Lee, "Multi-target voice conversion without parallel data by adversarially learning disentangled audio representations," arXiv preprint arXiv:1804.02812, 2018.
[12] C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, and H.-M. Wang, "Voice conversion from non-parallel corpora using variational auto-encoder," in 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA). IEEE, 2016, pp. 1–6.
[13] Y. Gao, R. Singh, and B. Raj, "Voice impersonation using generative adversarial networks," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 2506–2510.
[14] L. Badino, C. Canevari, L. Fadiga, and G. Metta, "An auto-encoder based approach to unsupervised learning of subword units," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 7634–7638.
[15] C.-T. Chung, C.-Y. Tsai, H.-H. Lu, C.-H. Liu, H.-y. Lee, and L.-S. Lee, "An iterative deep learning framework for unsupervised discovery of speech features and linguistic units with applications on spoken term detection," 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 245–251, 2015.
[16] J. Chorowski, R. J. Weiss, S. Bengio, and A. v. d. Oord, "Unsupervised speech representation learning using WaveNet autoencoders," arXiv preprint arXiv:1901.08810, 2019.
[17] A. van den Oord, O. Vinyals et al., "Neural discrete representation learning," in Advances in Neural Information Processing Systems, 2017, pp. 6306–6315.
[18] M. Arjovsky, S. Chintala, and L. Bottou, "Wasserstein GAN," arXiv preprint arXiv:1701.07875, 2017.
[19] E. Dunbar, R. Algayres, J. Karadayi, M. Bernard, J. Benjumea, X.-N. Cao, L. Miskic, C. Dugrain, L. Ondel, A. W. Black, L. Besacier, S. Sakti, and E. Dupoux, "The Zero Resource Speech Challenge 2019: TTS without T," in 20th Annual Conference of the International Speech Communication Association (INTERSPEECH 2019): Crossroads of Speech and Language, 2019, submitted. [Online]. Available: https://zerospeech.com/2019/
[20] S. Sakti, R. Maia, S. Sakai, T. Shimizu, and S. Nakamura, "Development of HMM-based Indonesian speech synthesis," Proc. Oriental COCOSDA, pp. 215–220, November 2008.
[21] S. Sakti, E. Kelana, H. Riza, S. Sakai, K. Markov, and S. Nakamura, "Development of Indonesian large vocabulary continuous speech recognition system within the A-STAR project," Proc. Technologies and Corpora for Asia-Pacific Speech Translation (TCAST), pp. 19–24, January 2008.
[22] E. Jang, S. Gu, and B. Poole, "Categorical reparameterization with Gumbel-Softmax," arXiv preprint arXiv:1611.01144, 2016.
[23] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, "Image-to-image translation with conditional adversarial networks," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[24] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2014, pp. 2672–2680.
[25] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, "Improved training of Wasserstein GANs," in Advances in Neural Information Processing Systems, 2017, pp. 5767–5777.
[26] A. Odena, C. Olah, and J. Shlens, "Conditional image synthesis with auxiliary classifier GANs," in Proceedings of the 34th International Conference on Machine Learning, Volume 70. JMLR.org, 2017, pp. 2642–2651.
[27] Z. Zhang, Y. Song, and H. Qi, "Decoupled learning for conditional adversarial networks," in 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2018, pp. 700–708.
[28] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, "Image-to-image translation with conditional adversarial networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1125–1134.
[29] W. Shi, J. Caballero, F. Huszar, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, "Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1874–1883.
[30] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[31] T. Schatz, V. Peddinti, F. Bach, A. Jansen, H. Hermansky, and E. Dupoux, "Evaluating speech features with the Minimal-Pair ABX task: Analysis of the classical MFC/PLP pipeline," in INTERSPEECH 2013: 14th Annual Conference of the International Speech Communication Association, Lyon, France, Aug. 2013, pp. 1–5.
[32] L. Ondel, L. Burget, and J. Cernocky, "Variational inference for acoustic unit discovery," Procedia Computer Science, vol. 81, pp. 80–86, 2016.


[33] Z. Wu, O. Watts, and S. King, "Merlin: An open source neural network speech synthesis system," in 9th ISCA Speech Synthesis Workshop, Sep. 2016, pp. 218–223.