
DEEP NEURAL NETWORK-GUIDED UNIT SELECTION SYNTHESIS

Thomas Merritt1, Robert A.J. Clark1∗, Zhizheng Wu1, Junichi Yamagishi1,2, Simon King1

1 The Centre for Speech Technology Research, University of Edinburgh, United Kingdom
2 National Institute of Informatics, Japan
* Now at Google

[email protected]

ABSTRACT

Vocoding of speech is a standard part of statistical parametric speech synthesis systems. It imposes an upper bound on the naturalness that can possibly be achieved. Hybrid systems using parametric models to guide the selection of natural speech units can combine the benefits of robust statistical models with the high level of naturalness of waveform concatenation. Existing hybrid systems use Hidden Markov Models (HMMs) as the statistical model. This paper demonstrates that the superiority of Deep Neural Network (DNN) acoustic models over HMMs in conventional statistical parametric speech synthesis also carries over to hybrid synthesis. We compare various DNN and HMM hybrid configurations, guiding the selection of waveform units in either the vocoder parameter domain, or in the domain of embeddings (bottleneck features).

Index Terms— speech synthesis, hybrid synthesis, deep neural networks, embedding, unit selection

1. INTRODUCTION

Statistical parametric speech synthesis (SPSS) systems were originally proposed in order to offer more flexibility (e.g., adaptability to target speech) than is possible with unit selection synthesis. However, during the process of extracting and modelling speech parameters, followed by resynthesis, the naturalness of the speech is substantially reduced. As a consequence, these systems are consistently rated as less natural than unit selection, as we see in the results of many Blizzard Challenges [1, 2, 3, 4].

During previous investigations, some of the hypothesised explanations for the reduced quality of SPSS systems have been formally tested [5, 6, 7, 8]. The finding that across-linguistic-context averaging was harmful motivated us to build an HMM system which performed no such averaging [9]. In [6, 7, 8], we also identified the parametrisation step (i.e., vocoding) as introducing large degradations in quality, even before any modelling has taken place.

In order to increase the quality of speech above the ceiling imposed by vocoding, we conducted the current investigation. Our starting point is a prototypical unit selection system (Festival’s Multisyn engine), in which very little processing of the speech waveforms is performed. We present an investigation into the effectiveness of hybrid synthesis (unit selection guided by SPSS) within the Multisyn framework.

2. PRIOR WORK

2.1. Unit selection

Unit selection synthesis is usually described as an optimisation problem: to find the sequence of units (diphones, in Multisyn) that minimises the sum of target costs and join costs [10], which involves trading off how well a candidate unit meets a required specification against how well it concatenates with neighbouring units. By defining the join cost to be zero for units that are contiguous in the database, unit selection effectively uses relatively large units of variable size.
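To make this formulation concrete, the following Python sketch performs a dynamic-programming (Viterbi-style) search over candidate units under an additive, weighted target-plus-join cost; the join cost callable is assumed to return zero for units that are contiguous in the database. All names here (select_units, target_cost, join_cost, w_target, w_join) are hypothetical illustrations, not Multisyn's actual interfaces.

```python
import numpy as np

def select_units(candidates, target_cost, join_cost, w_target=1.0, w_join=1.0):
    """Viterbi-style search for the lowest-cost unit sequence.

    candidates: list (one entry per target position) of lists of candidate units.
    target_cost(i, u): mismatch between target position i and candidate u.
    join_cost(u_prev, u): concatenation cost; assumed to be 0.0 when u_prev
    and u are contiguous in the database, so contiguous runs are favoured.
    """
    n = len(candidates)
    # cost[i][j]: best cumulative cost of any path ending in candidate j at position i
    cost = [[w_target * target_cost(0, u) for u in candidates[0]]]
    back = [[-1] * len(candidates[0])]
    for i in range(1, n):
        row_cost, row_back = [], []
        for u in candidates[i]:
            prev_costs = [cost[i - 1][k] + w_join * join_cost(candidates[i - 1][k], u)
                          for k in range(len(candidates[i - 1]))]
            best = int(np.argmin(prev_costs))
            row_back.append(best)
            row_cost.append(prev_costs[best] + w_target * target_cost(i, u))
        cost.append(row_cost)
        back.append(row_back)
    # Trace back the best path from the final position
    j = int(np.argmin(cost[-1]))
    path = [j]
    for i in range(n - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    path.reverse()
    return [candidates[i][j] for i, j in zip(range(n), path)]
```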

Standard unit selection systems typically use mismatches between the linguistic specifications of the target and candidate units to compute a target cost. Distances between acoustic features are used to compute the join cost [11, 12, 13].

Whilst speech within contiguous regions found in the database is effectively ‘perfectly natural’, unit selection speech generally suffers from concatenation artefacts. A variety of hybrid synthesis systems have been proposed to solve this problem by employing statistical models to predict the acoustic properties of speech, and then selecting units from the database that match [14, 15, 16].

2.2. Hybrid synthesis

Hybrid synthesis systems thus use statistical models (usually by generating speech parameter trajectories) as the basis of the target cost function [15, 16, 14]. An extension of this approach is ‘multiform’ synthesis, in which some types of units are generated via vocoding whilst others are retrieved from the speech database [17, 18, 19, 20, 21], although this is outside the scope of our current investigation. Hybrid systems have performed very well in Blizzard Challenges [1, 2, 3, 4].

HMMs are the preferred statistical models in hybrid systems’ target cost functions, despite recent and compelling evidence that DNNs are superior to the regression tree employed in HMM systems [22, 23, 24]. One exception is the investigation in [21], where a bidirectional recurrent neural network (RNN) provides a prosodic target. However, the authors in [21] did not use this system to synthesise exclusively using concatenation, and instead used it in a multiform setup combined with modelled prosody. Previous investigations examined the use of RNNs for extracting prosodic information from Mandarin, which is then fed into a second RNN that outputs prosodic information [25]. The details of how this is used to inform the selection of units are not made clear, and it was not formally tested.

In [9], context embeddings (which can also be called ‘bottleneck features’ when they are derived using a hidden layer of a feed-forward neural network [22]) were used to select rich-context HMM models (models which are trained only on samples where the linguistic contexts exactly match), in order to select better models for synthesis in the inevitable event that contexts seen at synthesis-time were not observed in the training data [26]. This outperformed conventional HMM synthesis, which provides a motivation to use these context embeddings to select units in a hybrid system. We use distance in embedding space (or distance in speech parameter space in some cases) to measure the mismatch between the target and a candidate unit.

3. MULTISYN

Multisyn is a general purpose unit selection framework enabling simple implementation of unit selection synthesis within the Festival toolkit [27, 28, 12]. Festival’s Multisyn is used as one of the baselines for the Blizzard Challenge, and forms the basis of our hybrid unit selection systems. The unit size used in all systems reported here is the diphone. Although gains have been demonstrated using other sized units [14], this is outside the scope of the current investigation.

The Multisyn target cost function is a simple weighted sum of mismatches in selected linguistic features. The default weights were left unchanged for the baseline system used in this investigation, but the relative weight of the target cost compared to the join cost was manually tuned, for consistency with the hybrid systems to which it was compared.

The join cost for Multisyn is a sum of distances between 12 MFCCs, f0 and energy from the frames either side of the join [27, 28]. This default join cost was used in all systems.
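As a rough sketch (not Multisyn's exact weighting or normalisation), such a join cost can be computed as a distance between the feature vectors of the frames either side of the join, assuming each frame is represented as 12 MFCCs followed by f0 and energy:

```python
import numpy as np

def join_cost(left_unit_last_frame, right_unit_first_frame):
    """Distance between the two frames either side of a candidate join.
    Each frame is assumed to be a 14-dimensional vector:
    [mfcc_1 .. mfcc_12, f0, energy]."""
    left = np.asarray(left_unit_last_frame, dtype=float)
    right = np.asarray(right_unit_first_frame, dtype=float)
    return float(np.linalg.norm(left - right))
```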

Before performing the search, to minimise the number of join and target costs required to be computed, it is necessary to pre-select a shortlist of candidates for each target position. The default pre-selection method in Multisyn returns candidates with matching diphone identity. In the event that this list is empty, a back-off scheme is invoked which uses manually-written phone substitution rules. Again, this default scheme was left in place, although it may be possible in future to use distance in embedding or speech parameter spaces in the back-off procedure.
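The pre-selection and back-off behaviour described above might be sketched as follows, with hypothetical data structures (unit_index, substitutions) standing in for Multisyn's internal representations:

```python
def preselect_candidates(target_diphone, unit_index, substitutions):
    """Return database units matching the target diphone identity, backing off
    to phone-substitution alternatives when no exact match exists.

    unit_index: dict mapping diphone name -> list of candidate units.
    substitutions: dict mapping diphone name -> ordered list of substitute
    diphone names derived from manually-written phone substitution rules.
    """
    candidates = unit_index.get(target_diphone, [])
    if candidates:
        return candidates
    # Back-off: try substitute diphones in order of preference
    for alternative in substitutions.get(target_diphone, []):
        candidates = unit_index.get(alternative, [])
        if candidates:
            return candidates
    return []  # caller must handle the case where no candidates are found
```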

4. PROPOSED HYBRID TARGET COST

The context embeddings derived from a neural network, or alternatively the actual speech parameters predicted at the output of the network, can be thought of as a non-linear projection of the input linguistic features. The projection is learned in a supervised manner, according to whatever optimisation criterion is used to train the network. It is this supervision from acoustic information that makes these DNN-derived features more powerful than the purely linguistic feature-based function used as standard in Multisyn. For example, linguistic features that are not predictive of acoustic properties will be discarded.

The motivation for using a DNN – that, crucially, has been trained to perform parametric speech synthesis – to provide the embeddings (rather than some other method) comes from the universally positive reports of DNN synthesis in recent literature.

Multisyn operates on diphone units, but the synthesis DNN we used operates on phone units. To map between these, we divided each phone into 4 sections. The features being used for the target cost (either context embeddings from a DNN [22], or output speech parameters from the neural network or an HMM) are gathered together across all frames within each of these 4 regions, from which we compute the mean and variance per section. The variance is floored at 1% of the global variance per feature (the floor value was chosen via informal listening). This is done in the same way for both candidate and target.
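A minimal sketch of this per-section statistics computation, assuming the frame-level features (embeddings or speech parameters) for one phone are available as a NumPy array; the function and argument names are illustrative only:

```python
import numpy as np

def section_statistics(frames, global_var, n_sections=4, floor_ratio=0.01):
    """Mean and floored variance per sub-phone section.

    frames: (n_frames, n_dims) features for one phone (assumes n_frames >= n_sections).
    global_var: (n_dims,) per-feature variance over the whole database,
    used to floor each section's variance at floor_ratio * global_var.
    """
    frames = np.asarray(frames, dtype=float)
    stats = []
    for section in np.array_split(frames, n_sections):  # 4 roughly equal regions
        mu = section.mean(axis=0)
        var = np.maximum(section.var(axis=0), floor_ratio * np.asarray(global_var))
        stats.append((mu, var))
    return stats
```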

The Kullback-Leibler divergence (KLD) [29] is computed for each of the 4 sub-phone regions individually. The use of KLD in embedding space follows on from our previous work on ‘rich-context’ modelling [9].

The KLD between distribution f of the features computed for the frames corresponding to a given section in the test sentence, and distribution g, is:

$$ D_{KL}(f \,\|\, g) = \frac{1}{2}\left[ \log\frac{|\Sigma_g|}{|\Sigma_f|} + \mathrm{Tr}\!\left[\Sigma_g^{-1}\Sigma_f\right] - d + (\mu_f - \mu_g)^{T}\Sigma_g^{-1}(\mu_f - \mu_g) \right] \quad (1) $$

where µ and Σ are the mean and covariance and d is the dimensionality of the feature vector. The KLD for each of the 4 sections comprising a diphone is summed together to give the final divergence score. The average of $D_{KL}(f\,\|\,g)$ and $D_{KL}(g\,\|\,f)$ was used in order to make the measure symmetrical.
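For illustration, Eq. (1) and its symmetrised average can be computed as below, here specialised to diagonal covariances (i.e., the per-feature variances computed per section above); the full-covariance case follows the same formula with determinants and matrix inverses.

```python
import numpy as np

def gaussian_kld(mu_f, var_f, mu_g, var_g):
    """D_KL(f || g) between two diagonal-covariance Gaussians, i.e. Eq. (1)
    with Sigma_f, Sigma_g restricted to diagonal matrices."""
    mu_f, var_f = np.asarray(mu_f, float), np.asarray(var_f, float)
    mu_g, var_g = np.asarray(mu_g, float), np.asarray(var_g, float)
    d = mu_f.size
    log_det_ratio = np.sum(np.log(var_g)) - np.sum(np.log(var_f))  # log |Sigma_g|/|Sigma_f|
    trace_term = np.sum(var_f / var_g)                             # Tr[Sigma_g^-1 Sigma_f]
    mahalanobis = np.sum((mu_f - mu_g) ** 2 / var_g)               # quadratic term
    return 0.5 * (log_det_ratio + trace_term - d + mahalanobis)

def symmetric_kld(mu_f, var_f, mu_g, var_g):
    """Average of D_KL(f||g) and D_KL(g||f), making the measure symmetrical."""
    return 0.5 * (gaussian_kld(mu_f, var_f, mu_g, var_g)
                  + gaussian_kld(mu_g, var_g, mu_f, var_f))
```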

The SPSS-derived target cost function used in this investigation is designed to be independent of phoneme duration, relying on the target cost to select candidates with suitable durations. However, work on explicit control of duration may be fruitful in the future.


Table 1. Conditions included in the listening test

ID  Description
N   Natural speech
M   Multisyn
LE  Multisyn with target cost derived from context embedding from 2nd layer of 6-layer DNN (as in [9])
HE  Multisyn with target cost derived from context embedding from 5th layer of 6-layer DNN
NP  Multisyn with target cost derived from output of stacked bottleneck DNN system [22]
HP  Multisyn with target cost derived from output of HTS demo with GV [30]

5. EXPERIMENTS

5.1. Implementation

The systems shown in Table 1 were constructed in order to test the effectiveness of speech parameter trajectories (from HMMs or DNNs) and context embedding trajectories (from DNNs) for computing the target cost. As previously stated, the only components that differ between systems are the target cost and the relative weight between target and join costs (tuned by informal listening).

Systems LE and HE use 32-dimensional context embedding features generated by a DNN similar to that described in [22, 31]. These come from the 2nd (lower layer of the DNN, closer to the linguistic input) or 5th (higher layer of the DNN, closer to the speech parameters output) layer of a 6-layer feed-forward DNN, respectively. These systems are successors to the rich-context system described in [9], but instead of generating the speech using a vocoder, they perform unit selection and concatenation.
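Conceptually, such an embedding is just the activation of a chosen hidden layer of the trained synthesis DNN for a given frame of linguistic input. The sketch below shows this for a generic feed-forward network in plain NumPy; the parameter format, tanh nonlinearity and layer indexing are assumptions for illustration rather than the exact architecture of [22, 31].

```python
import numpy as np

def context_embedding(linguistic_features, weights, biases, embed_layer):
    """Forward-propagate one frame of linguistic features through a trained
    feed-forward synthesis DNN and return an intermediate layer's activations.

    weights, biases: per-layer parameters of the trained DNN (loading not shown).
    embed_layer: 0-based index of the hidden layer used as the embedding,
    e.g. 1 for the 2nd layer (as in system LE) or 4 for the 5th layer (system HE).
    """
    h = np.asarray(linguistic_features, dtype=float)
    for i, (W, b) in enumerate(zip(weights, biases)):
        h = h @ W + b
        if i < len(weights) - 1:
            h = np.tanh(h)  # hidden-layer nonlinearity (assumed)
        if i == embed_layer:
            return h  # per-frame context embedding
    return h  # final layer output: predicted speech parameters
```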

System NP uses the speech parameters output from the final layer of the stacked bottleneck DNN system presented in [22, 31]. The speech parameters form an 86-dimensional vector (60th-order mel-generalised cepstrum, 25 band aperiodicities, f0).

System HP was included to represent a conventional HMM-guided hybrid system. This system uses the parameters generated by HMMs trained using the HTS demo recipe [30], including GV, to compute the target cost in much the same way as system NP makes use of the generated speech parameters from the stacked bottleneck DNN system. The comparison between systems HP and NP will tell us whether the gains offered by DNNs in SPSS carry over to the hybrid scenario.

Informal listening to the speech generated via vocoding from the speech parameters of systems NP and HP was conducted, in order to confirm that these systems generate speech of the quality expected. This vocoder-generated speech was not evaluated formally in the listening test reported below. The relative target cost vs join cost weight for all systems (M, LE, HE, NP and HP) was tuned by informal listening using a few listeners.

2400 sentences from a male speaker of British English [32] were used as the training set for the HMMs and DNNs, and as the unit database in all systems. The text of 20 unseen Herald newspaper sentences was used as a development set for tuning the target cost weight of each system. An additional 90 unseen Herald news sentences were then used for the listening test. For DNN synthesis (required to produce the embeddings or speech parameters of systems LE, HE and NP), durations predicted by the HMM system were used; note that the durations of the final hybrid synthetic speech are determined by the unmodified natural durations of the candidate diphone units selected from the database. Before conducting the listening test, all utterances were volume normalised according to [33].

5.2. Experimental setup

The listening test followed the MUSHRA paradigm [34], comprising the systems shown in Table 1, with the same setup as in [7]. In a MUSHRA test, versions of a single sentence generated under all conditions are presented side-by-side to the listener, allowing direct comparisons in naturalness to be made. The listener is required to rate the systems between 0 (completely unnatural) and 100 (completely natural). This paradigm was originally designed to evaluate audio codecs, and we find that it is effective at prising apart relative differences between multiple systems because listeners have knowledge of the full range of those systems before making their judgements. System N acts as a hidden (i.e., not labelled in the test) upper anchor. Listeners are instructed that there is one system that must be given a rating of 100. MUSHRA usually also includes a lower hidden anchor; it is unclear what should be used for this in the case of synthetic speech, so no lower anchor was included in this test (as was also the case in [7] and [9]).

This test was conducted with 30 listeners, with each listener rating 30 screens. Each screen presented 6 stimuli at once: a single sentence under all 6 conditions. The 30 listeners were split into 3 groups of 10 listeners, and each group was presented with a disjoint set of 30 sentences; thus 90 different sentences were used. The stimuli played to listeners, along with the listener responses, can be found at [35].

6. RESULTS

Figures 1 and 2 show the listeners’ responses from the MUSHRA test, in terms of the absolute values of their scores, and in terms of the rank order of systems derived from these scores, respectively. The dashed green lines added to the box plots show mean values. All tests for significant differences used Holm-Bonferroni correction due to the large number of condition pairs to compare.

All conditions are significantly different from each other in terms of absolute value, except between: M and HP, LE and HE, LE and NP, and HE and NP. Significant differences are in agreement using a t-test and a Wilcoxon signed-rank test at a p value of 0.01.
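The pairwise comparisons described here can be sketched as follows, using SciPy's paired t-test and Wilcoxon signed-rank test together with a manual Holm-Bonferroni step-down correction; the assumed data layout (scores aligned across conditions per stimulus/listener) is an illustration only.

```python
from itertools import combinations
import numpy as np
from scipy.stats import ttest_rel, wilcoxon

def pairwise_significance(scores, alpha=0.01):
    """Holm-Bonferroni-corrected pairwise tests between listening-test conditions.

    scores: dict mapping condition ID (e.g. 'M', 'HP') to an array of MUSHRA
    scores, aligned across conditions (same stimulus/listener in each position).
    Returns, per test, a dict mapping each condition pair to True/False (significant).
    """
    pairs = list(combinations(sorted(scores), 2))
    results = {}
    for name, test in (("t-test", ttest_rel), ("wilcoxon", wilcoxon)):
        pvals = np.array([test(scores[a], scores[b]).pvalue for a, b in pairs])
        m = len(pvals)
        significant = np.zeros(m, dtype=bool)
        # Holm step-down: compare the k-th smallest p-value against alpha / (m - k)
        for rank, idx in enumerate(np.argsort(pvals)):
            if pvals[idx] <= alpha / (m - rank):
                significant[idx] = True
            else:
                break  # all larger p-values are also non-significant
        results[name] = dict(zip(pairs, significant))
    return results
```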


Fig. 1. Boxplot of absolute values from the MUSHRA test (scores 0–100; conditions N, M, LE, HE, HP, NP).

Fig. 2. Boxplot of the rank order from the MUSHRA test (ranks 1–6; conditions N, M, LE, HE, HP, NP).

All conditions are significantly different from each other in terms of rank order, except between: M and HP, and LE and HE. These significant differences are in agreement using a Mann-Whitney U test and a Wilcoxon signed-rank test at a p value of 0.01. There is a disagreement in statistical significance between conditions LE and NP, with the Mann-Whitney U test finding this difference in ranking to be statistically significant whereas the Wilcoxon signed-rank test does not.

6.1. Comparison to baseline system M

We can see that the ‘trajectory tiling’ approach to unit selection described in [14], and implemented in our systems LE, HE, HP and NP, is generally effective, with all systems performing at least as well as the baseline, and significantly better in all cases where a DNN was used as the parametric model. We were not able to obtain significant improvements over the baseline with HMM-generated speech parameter trajectories (system HP).

6.2. DNNs vs HMMs

The use of deep neural networks in systems LE, HE and NP provides significant improvements over both the baseline (M) and the HMM-driven hybrid system (HP). This demonstrates that the gains found in SPSS systems when moving from HMMs plus regression trees to DNNs transfer over to the hybrid unit selection paradigm.

7. CONCLUSIONS & FUTURE WORK

We have proposed to use deep neural networks to guide unit selection systems, and have presented an experimental comparison of several different configurations of hybrid unit selection, all implemented within Festival’s Multisyn framework. We found that the use of a DNN to generate features for use in the target cost was more effective than using an HMM, whether using the speech parameters generated at the output of the DNN or using context embeddings from a bottleneck layer.

In this investigation, only the target cost function was modified. However, further increases might be obtained in future work by improving the join cost function [36].

Although we found no significant differences between the use of speech parameters from a DNN and the use of context embeddings, there is perhaps more consistency in listener judgements for the embedding-based systems (LE, HE) than for the DNN speech parameter-based system (NP).

The context embedding features discussed here could be used elsewhere in the unit selection system: as a back-off function replacing the manually-written phone substitution rules, or to perform the initial pre-selection of units instead of the current pre-selection by matching diphone identity.

Investigating different types of neural network for use in this hybrid framework is left as future work, but we expect that any improvement in parametric synthesis would carry over to the hybrid method for waveform generation. For example, mixture density networks (MDNs) might be used to directly produce a likelihood-based target cost instead of the KLD-based approach used here. Recurrent neural networks (RNNs), which are more powerful sequence models, might also be used to generate the target trajectories.

Finally, the systems investigated here made use of uniform subsections of diphones (the waveform concatenation unit) in conjunction with phoneme-level acoustic models. However, investigating the use of diphone-level acoustic models is of interest, as this would allow state-sized representations of speech to be used instead and may further improve performance.

8. ACKNOWLEDGEMENTS

Thanks to Oliver Watts for his suggestions for tuning the target cost weight and to Gustav Eje Henter for help with the MUSHRA test implementation and analysis. This research was supported by EPSRC Programme Grant EP/I031022/1, Natural Speech Technology (NST). The NST research data collection may be accessed at http://datashare.is.ed.ac.uk/handle/10283/786.


9. REFERENCES

[1] Simon King and Vasilis Karaiskos, “The Blizzard Challenge 2011,” in Proc. Blizzard Challenge, 2011.
[2] Simon King and Vasilis Karaiskos, “The Blizzard Challenge 2012,” in Proc. Blizzard Challenge, 2012.
[3] Simon King and Vasilis Karaiskos, “The Blizzard Challenge 2013,” in Proc. Blizzard Challenge, 2013.
[4] Simon King, “Measuring a decade of progress in text-to-speech,” Loquens, vol. 1, no. 1, 2014.
[5] Thomas Merritt and Simon King, “Investigating the shortcomings of HMM synthesis,” in Proc. 8th ISCA Speech Synthesis Workshop, 2013, pp. 165–170.
[6] Thomas Merritt, Tuomo Raitio, and Simon King, “Investigating source and filter contributions, and their interaction, to statistical parametric speech synthesis,” in Proc. Interspeech, 2014, pp. 1509–1513.
[7] Gustav Eje Henter, Thomas Merritt, Matt Shannon, Catherine Mayo, and Simon King, “Measuring the perceptual effects of modelling assumptions in speech synthesis using stimuli constructed from repeated natural speech,” in Proc. Interspeech, 2014, pp. 1504–1508.
[8] Thomas Merritt, Javier Latorre, and Simon King, “Attributing modelling errors in HMM synthesis by stepping gradually from natural to modelled speech,” in Proc. ICASSP, 2015.
[9] Thomas Merritt, Junichi Yamagishi, Zhizheng Wu, Oliver Watts, and Simon King, “Deep neural network context embeddings for model selection in rich-context HMM synthesis,” in Proc. Interspeech, 2015.
[10] Andrew J. Hunt and Alan W. Black, “Unit selection in a concatenative speech synthesis system using a large speech database,” in Proc. ICASSP, 1996, pp. 373–376.
[11] Alan W. Black and Paul A. Taylor, “Automatically clustering similar units for unit selection in speech synthesis,” in Proc. Eurospeech, 1997.
[12] Paul Taylor, Alan W. Black, and Richard Caley, “The architecture of the Festival speech synthesis system,” in The Third ESCA/COCOSDA Workshop on Speech Synthesis, 1998.
[13] Paul Taylor, “The target cost formulation in unit selection speech synthesis,” in Proc. Interspeech, 2006, pp. 2038–2041.
[14] Yao Qian, Frank K. Soong, and Zhi-Jie Yan, “A unified trajectory tiling approach to high quality speech rendering,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 2, pp. 280–290, 2013.
[15] Zhen-Hua Ling, Heng Lu, Guo-Ping Hu, Li-Rong Dai, and Ren-Hua Wang, “The USTC system for Blizzard Challenge 2008,” in Proc. Blizzard Challenge, 2008.
[16] Zhi-Jie Yan, Yao Qian, and Frank K. Soong, “Rich-context unit selection (RUS) approach to high quality TTS,” in Proc. ICASSP, 2010, pp. 4798–4801.
[17] Vincent Pollet and Andrew Breen, “Synthesis by generation and concatenation of multiform segments,” in Proc. Interspeech, 2008, pp. 1825–1828.
[18] Alexander Sorin, Slava Shechtman, and Vincent Pollet, “Uniform speech parameterization for multi-form segment synthesis,” in Proc. Interspeech, 2011, pp. 337–340.
[19] Alexander Sorin, Slava Shechtman, and Vincent Pollet, “Psychoacoustic segment scoring for multi-form speech synthesis,” in Proc. Interspeech, 2012, pp. 2214–2217.
[20] Alexander Sorin, Slava Shechtman, and Vincent Pollet, “Refined inter-segment joining in multi-form speech synthesis,” in Proc. Interspeech, 2014, pp. 790–794.
[21] Raul Fernandez, Asaf Rendel, Bhuvana Ramabhadran, and Ron Hoory, “Using deep bidirectional recurrent neural networks for prosodic-target prediction in a unit-selection text-to-speech system,” in Proc. Interspeech, 2015.
[22] Zhizheng Wu, Cassia Valentini-Botinhao, Oliver Watts, and Simon King, “Deep neural networks employing multi-task learning and stacked bottleneck features for speech synthesis,” in Proc. ICASSP, 2015.
[23] Zhen-Hua Ling, Shi-Yin Kang, Heiga Zen, Andrew Senior, Mike Schuster, Xiao-Jun Qian, Helen M. Meng, and Li Deng, “Deep learning for acoustic modeling in parametric speech generation: A systematic review of existing techniques and future trends,” IEEE Signal Processing Magazine, vol. 32, no. 3, pp. 35–52, 2015.
[24] Heiga Zen, “Acoustic modeling in statistical parametric speech synthesis - from HMM to LSTM-RNN,” in Proc. MLSLP, 2015.
[25] Sin-Horng Chen, Shaw-Hwa Hwang, and Yih-Ru Wang, “An RNN-based prosodic information synthesizer for Mandarin text-to-speech,” IEEE Transactions on Speech and Audio Processing, vol. 6, no. 3, pp. 226–239, 1998.
[26] Zhi-Jie Yan, Yao Qian, and Frank K. Soong, “Rich context modeling for high quality HMM-based TTS,” in Proc. Interspeech, 2009, pp. 1755–1758.
[27] Robert A. J. Clark, Korin Richmond, and Simon King, “Festival 2 - build your own general purpose unit selection speech synthesiser,” in Proc. SSW5, 2004.
[28] Robert A. J. Clark, Korin Richmond, and Simon King, “Multisyn: Open-domain unit selection for the Festival speech synthesis system,” Speech Communication, vol. 49, no. 4, pp. 317–330, 2007.
[29] John R. Hershey and Peder A. Olsen, “Approximating the Kullback-Leibler divergence between Gaussian mixture models,” in Proc. ICASSP, 2007.
[30] Heiga Zen, Takashi Nose, Junichi Yamagishi, Shinji Sako, Takashi Masuko, Alan W. Black, and Keiichi Tokuda, “The HMM-based speech synthesis system (HTS) version 2.0,” in Proc. SSW6, 2007, pp. 294–299.
[31] Zhizheng Wu and Simon King, “Minimum trajectory error training for deep neural networks, combined with stacked bottleneck features,” in Proc. Interspeech, 2015.
[32] Martin Cooke, Catherine Mayo, and Cassia Valentini-Botinhao, “Hurricane natural speech corpus [sound],” LISTA Consortium, doi:10.7488/ds/140, 2013.
[33] International Telecommunication Union, Telecommunication Standardization Sector, Geneva, Switzerland, Objective measurement of active speech level, March 2011.
[34] International Telecommunication Union Radiocommunication Assembly, Geneva, Switzerland, Method for the subjective assessment of intermediate quality level of coding systems, March 2003.
[35] Thomas Merritt, Robert A. J. Clark, Zhizheng Wu, Junichi Yamagishi, and Simon King, “Listening test materials for ‘Deep neural network-guided unit selection synthesis’, 2016 [dataset],” University of Edinburgh, The Centre for Speech Technology Research (CSTR), doi:10.7488/ds/1313.
[36] Alistair D. Conkie and Stephen Isard, “Optimal coupling of diphones,” in Progress in Speech Synthesis, pp. 293–304, Springer, 1997.