
Advances in Computer Science and Engineering
© 2014 Pushpa Publishing House, Allahabad, India
Published Online: July 2014
Available online at http://pphmj.com/journals/acse.htm
Volume 12, Number 2, 2014, Pages 101-117

Received: April 16, 2014; Accepted: June 17, 2014 Keywords and phrases: voice conversion, restricted Boltzmann machine, sparse representation, parallel dictionary learning.

PARALLEL DICTIONARY LEARNING USING A JOINT DENSITY RESTRICTED BOLTZMANN MACHINE FOR

SPARSE-REPRESENTATION-BASED VOICE CONVERSION

Toru Nakashika, Tetsuya Takiguchi and Yasuo Ariki

Graduate School of System Informatics, Kobe University, 1-1 Rokkodai, Kobe, Japan
e-mail: [email protected], [email protected], [email protected]

Abstract

In voice conversion, sparse-representation-based methods have recently been garnering attention because they are, relatively speaking, not affected by over-fitting or over-smoothing problems. In these approaches, voice conversion is achieved by estimating a sparse vector that determines which dictionaries of the target speaker should be used, calculated from the matching of the input vector and dictionaries of the source speaker. The sparse-representation-based voice conversion methods can be broadly divided into two approaches: (1) an approach that uses raw acoustic features in the training data as parallel dictionaries, and (2) an approach that trains parallel dictionaries from the training data. Our approach belongs to the latter; we systematically estimate the parallel dictionaries using a restricted Boltzmann machine, a fundamental technology commonly used in


deep learning. Through voice conversion experiments, we confirmed the high performance of our method by comparing it with the conventional Gaussian mixture model (GMM)-based approach and a non-negative matrix factorization (NMF)-based approach, which is based on sparse representation.

1. Introduction

In recent years, voice conversion (VC), a technique used to change specific information in the speech of a source speaker to that of a target speaker while retaining linguistic information, has been garnering much attention in speech signal processing. VC techniques have been applied to various tasks, such as speech enhancement [1], emotion conversion [2], speaking assistance [3], and other applications [4, 5]. Most of the related work in VC focuses not on F0 conversion but on the conversion of spectral features, and we follow that convention in this report as well.

Various statistical approaches to VC have been studied so far, including those discussed in [6, 7]. Among these approaches, the joint density Gaussian mixture model (JD-GMM)-based mapping method [8] is widely used, and a number of improvements have been proposed. Toda et al. [9] introduced dynamic features and the global variance (GV) of the converted spectra over a time sequence. Helander et al. [10] proposed transforms based on partial least squares (PLS). There have also been approaches that do not require parallel data since they use a GMM adaptation technique [11], eigen-voice GMM [12, 13] or probabilistic integration model [14].

However, it is reported that the GMM-based approaches have some problems related to over-fitting and over-smoothing [10, 15-17]. The problems come from the linear conversion that aggregates adjacent data points to each Gaussian mode. Sparse-representation-based approaches using non-negative matrix factorization (NMF) [18] or unit selection (US) [16] have, therefore, been proposed to alleviate such problems.

In sparse-representation-based voice conversion, we obtain the target speaker's speech by estimating a sparse vector that determines which dictionaries should be used. The sparse vector is calculated by matching the source speaker's input vector against the dictionaries. Therefore, the pairs of dictionaries (parallel dictionaries) must be trained (or prepared) beforehand. In this approach, the converted speech quality mainly depends on the accuracy of creating the parallel dictionaries and of selecting the appropriate dictionaries (estimating a sparse vector). Exemplar-based voice conversion using NMF [19] uses bases that are exactly the same as the spectra in the parallel training data. Therefore, although it is not affected by the errors associated with creating dictionaries, its conversion accuracy is degraded by the errors that occur when estimating the sparse vectors (a mismatch of the activity matrices of the source speaker and the target speaker). Another NMF-based voice conversion method, in which the parallel dictionaries are trained from the training data so as to match their activity matrices instead of using the exemplars without change (referred to as "spectral mapping" (SM)), has also been proposed [20]. This approach, in contrast, reduces the errors caused by mismatched activity matrices; however, it introduces errors in the training of the parallel dictionaries.

The above-mentioned voice conversion methods are based on linear functions. Since our vocal tracts have non-linear shapes, a non-linear function is a better representation for capturing the non-linear relationships between a source speaker's speech and a target speaker's speech. Several voice conversion methods based on non-linear functions can be found in [21] by Desai et al. (they employ multi-layer neural networks (NNs)), in [22] by Chen et al. (they use a restricted Boltzmann machine (RBM)), and in [23] by Wu et al. (they use a conditional restricted Boltzmann machine (CRBM [24])). Our earlier works also employed speaker-dependent RBMs or CRBMs to capture speaker-specific features [25, 26]. It has been reported that these graphical models are better at representing the distribution of high-dimensional observations with cross-dimension correlations than GMM in speech synthesis [15] and in speech recognition [27]. Since Hinton et al. introduced an effective training algorithm for the deep belief network (DBN) in 2006 [28], the use of deep learning has rapidly spread in the field of signal processing, including speech signal processing. An RBM (or DBN) has been used, for example, for


hand-written character recognition [28], 3-D object recognition [29], machine transliteration [30], and so on.

In this paper, we describe a non-linear voice conversion method that utilizes a joint density RBM of a source speaker and a target speaker in order to effectively train parallel dictionaries in the framework of a sparse-representation approach. An RBM is a bi-directional probabilistic model that consists of a visible layer and a hidden layer, characterized in that there are no connections among the units in the same layer, but there are connections between the units in different layers. These connection weights are trained in an unsupervised manner. In our approach, an RBM is fed a concatenated vector (parallel data) of the source speaker's and the target speaker's acoustic features, such as MFCC, that are aligned beforehand. By feeding such vectors, the RBM learns co-occurrences of the source speaker's features and the target speaker's features through the hidden units (the estimated weights act as filters of the parallel dictionaries). Our approach is similar to the work of Chen et al. [22]. While their approach used an RBM instead of a GMM to capture the joint distribution, our approach trains parallel dictionaries for sparse-representation-based voice conversion. Our earlier works [25, 26] also used RBMs (or DBNs), one per speaker, to extract speaker-specific features, whereas our present approach feeds a concatenated vector of both speakers into the visible layer.

This paper is organized as follows: In Section 2, voice conversion based on sparse representation is described. In Section 3, we review the RBM, which is a fundamental probabilistic model in our method. In Section 4, the proposed method that uses an RBM for training parallel dictionaries is described. In Section 5, the performance of the proposed method is evaluated, and we conclude and discuss future work in Section 6.

2. Sparse-representation-based Voice Conversion

In voice conversion, the system generally converts an acoustic vector of the source speaker, $\mathbf{x} \in \mathbb{R}^D$, into the target speaker's vector $\mathbf{y} \in \mathbb{R}^D$. In sparse-representation-based voice conversion, $K$ pairs of the source speaker's dictionaries $\mathbf{D}_x \in \mathbb{R}^{D \times K}$ and the target speaker's dictionaries $\mathbf{D}_y \in \mathbb{R}^{D \times K}$ (parallel dictionaries) are trained beforehand. Given an input vector of the source speaker, the converted target speaker's vector $\mathbf{y}$ is obtained using the trained parallel dictionaries, as shown in Figure 1. First, we calculate a sparse vector $\boldsymbol{\alpha} \in \mathbb{R}^K$, $\boldsymbol{\alpha} \succeq \mathbf{0}$, that satisfies

$$\mathbf{x} \approx f(\mathbf{D}_x \boldsymbol{\alpha}), \qquad (1)$$

and then we obtain the target speaker's vector as follows:

$$\mathbf{y} \approx f(\mathbf{D}_y \boldsymbol{\alpha}), \qquad (2)$$

where $f(\cdot)$ indicates an arbitrary function.

Figure 1. Voice conversion based on sparse representation.

Most sparse-representation-based approaches use training exemplars without change as the parallel dictionaries $\mathbf{D}_x, \mathbf{D}_y$ [19, 31, 32]. For the calculation of the sparse vector $\boldsymbol{\alpha}$, various approaches can be used, such as $L_1$-norm minimization [31], the K-nearest-neighbors algorithm [32], and sparse non-negative matrix factorization (NMF) [19]. In [20], another approach using sparse NMF has been proposed that trains the parallel dictionaries so that the sparse vectors for the source and the target are the same. Furthermore, all of the above-mentioned sparse-representation-based voice conversion methods are based on linear functions ($f(\cdot)$ is a linear function in equations (1) and (2)).
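As an illustrative sketch (not the paper's exact algorithm), the linear sparse-representation conversion of equations (1) and (2) can be written in a few lines of NumPy. The activation estimate below uses the multiplicative NMF update of Lee and Seung [18]; the dictionary sizes and helper names are toy values chosen for illustration only:

```python
import numpy as np

def estimate_activation(x, Dx, n_iter=500, eps=1e-12):
    """Estimate a non-negative activation vector alpha such that
    x ~= Dx @ alpha, via multiplicative NMF updates (Lee & Seung [18])."""
    alpha = np.ones(Dx.shape[1])
    for _ in range(n_iter):
        # Multiplicative update: decreases ||x - Dx @ alpha||^2, keeps alpha >= 0
        alpha *= (Dx.T @ x) / (Dx.T @ (Dx @ alpha) + eps)
    return alpha

def convert_frame(x, Dx, Dy):
    """Linear sparse-representation VC: match x against the source
    dictionary (Eq. (1)), then reuse alpha with the target one (Eq. (2))."""
    alpha = estimate_activation(x, Dx)
    return Dy @ alpha
```

In the exemplar-based variant [19], the columns of $\mathbf{D}_x$ and $\mathbf{D}_y$ are simply aligned training frames; in the trained-dictionary variant [20], the columns are learned from the parallel data.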


Figure 2. Graphical representation of an RBM.

3. Restricted Boltzmann Machine

A restricted Boltzmann machine (RBM) is an undirected graphical model that defines the distribution of visible variables with binary hidden (latent) variables, as shown in Figure 2. The RBM was originally introduced as a method of representing binary-valued data [33]; a variant known as the Gaussian-Bernoulli RBM (GBRBM) [34] later came to be used to deal with real-valued data (such as acoustic features). However, it has been reported that the original GBRBM had some difficulties because the training of the parameters was unstable [28, 35, 36]. Later, an improved learning method for the GBRBM was proposed by Cho et al. [37] to overcome these difficulties.

In the literature on the improved GBRBM [37], the joint probability $p(\mathbf{v}, \mathbf{h})$ of real-valued visible units $\mathbf{v} = [v_1, \ldots, v_I]^T$, $v_i \in \mathbb{R}$, and binary-valued hidden units $\mathbf{h} = [h_1, \ldots, h_J]^T$, $h_j \in \{0, 1\}$, is defined as follows:

$$p(\mathbf{v}, \mathbf{h}) = \frac{1}{Z} e^{-E(\mathbf{v}, \mathbf{h})}, \qquad (3)$$

$$E(\mathbf{v}, \mathbf{h}) = \frac{1}{2} \left\| \frac{\mathbf{v} - \mathbf{b}}{\boldsymbol{\sigma}} \right\|^2 - \mathbf{c}^T \mathbf{h} - \left( \frac{\mathbf{v}}{\boldsymbol{\sigma}^2} \right)^T \mathbf{W} \mathbf{h}, \qquad (4)$$

$$Z = \sum_{\mathbf{v}, \mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})}, \qquad (5)$$

where $\| \cdot \|$ denotes the $L_2$ norm. $\mathbf{W} \in \mathbb{R}^{I \times J}$, $\boldsymbol{\sigma} \in \mathbb{R}^{I \times 1}$, $\mathbf{b} \in \mathbb{R}^{I \times 1}$, and $\mathbf{c} \in \mathbb{R}^{J \times 1}$ are model parameters of the GBRBM, indicating the weight matrix between visible units and hidden units, the standard deviations associated with the Gaussian visible units, a bias vector of the visible units, and a bias vector of the hidden units, respectively. The fraction bars in equation (4) denote element-wise division.

Because there are no connections between visible units or between hidden units, the conditional probabilities $p(\mathbf{h} \mid \mathbf{v})$ and $p(\mathbf{v} \mid \mathbf{h})$ form simple equations as follows:

$$p(h_j = 1 \mid \mathbf{v}) = \mathcal{S}\left( c_j + \left( \frac{\mathbf{v}}{\boldsymbol{\sigma}^2} \right)^T \mathbf{W}_{:j} \right), \qquad (6)$$

$$p(v_i \mid \mathbf{h}) = \mathcal{N}\left( v_i \mid b_i + \mathbf{W}_{i:} \mathbf{h}, \, \sigma_i^2 \right), \qquad (7)$$

where $\mathbf{W}_{:j}$ and $\mathbf{W}_{i:}$ denote the $j$th column vector and the $i$th row vector of $\mathbf{W}$, respectively, and $\mathcal{S}(\cdot)$ and $\mathcal{N}(\cdot \mid \mu, \sigma^2)$ indicate an element-wise sigmoid function and the Gaussian probability density function with mean $\mu$ and variance $\sigma^2$.
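As a minimal sketch of these two conditionals (the parameter shapes follow the definitions above; the function names are ours, not the paper's):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def p_hidden_given_visible(v, W, c, sigma2):
    """Eq. (6): probability that each binary hidden unit is on,
    sigmoid(c_j + (v / sigma^2)^T W_{:j}), vectorized over j."""
    return sigmoid(c + W.T @ (v / sigma2))

def mean_visible_given_hidden(h, W, b):
    """Mean of Eq. (7): each visible unit is Gaussian with
    mean b_i + W_{i:} h and variance sigma_i^2."""
    return b + W @ h
```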

For parameter estimation, the log-likelihood of a collection of visible units, $\mathcal{L} = \log \prod_n p(\mathbf{v}_n)$, is used as the evaluation function. Differentiating partially with respect to each parameter, we obtain:

$$\frac{\partial \mathcal{L}}{\partial W_{ij}} = \left\langle \frac{v_i h_j}{\sigma_i^2} \right\rangle_{\mathrm{data}} - \left\langle \frac{v_i h_j}{\sigma_i^2} \right\rangle_{\mathrm{model}}, \qquad (8)$$

$$\frac{\partial \mathcal{L}}{\partial b_i} = \left\langle \frac{v_i}{\sigma_i^2} \right\rangle_{\mathrm{data}} - \left\langle \frac{v_i}{\sigma_i^2} \right\rangle_{\mathrm{model}}, \qquad (9)$$

$$\frac{\partial \mathcal{L}}{\partial c_j} = \langle h_j \rangle_{\mathrm{data}} - \langle h_j \rangle_{\mathrm{model}}, \qquad (10)$$

where $\langle \cdot \rangle_{\mathrm{data}}$ and $\langle \cdot \rangle_{\mathrm{model}}$ indicate expectations over the input data and the inner model, respectively. However, it is generally difficult to compute the second term, so, typically, the expectation over the reconstructed data, $\langle \cdot \rangle_{\mathrm{recon}}$, computed with equations (6) and (7), is used instead [28].

In the improved GBRBM, the variance parameter $\sigma_i^2$ is reparameterized as $\sigma_i^2 = e^{z_i}$ so as to constrain the variance to a positive value and provide stability in training the parameters. Under this modification, the gradient with respect to $z_i$ becomes

$$\frac{\partial \mathcal{L}}{\partial z_i} = \left\langle e^{-z_i} \left( \frac{(v_i - b_i)^2}{2} - v_i \mathbf{W}_{i:} \mathbf{h} \right) \right\rangle_{\mathrm{data}} - \left\langle e^{-z_i} \left( \frac{(v_i - b_i)^2}{2} - v_i \mathbf{W}_{i:} \mathbf{h} \right) \right\rangle_{\mathrm{model}}. \qquad (11)$$

Using equations (8), (9), (10), and (11), each parameter can be updated by stochastic gradient ascent on the log-likelihood with a fixed learning rate, starting from initial values. That is,

$$\Theta^{(\mathrm{new})} = \Theta^{(\mathrm{old})} + \gamma \frac{\partial \mathcal{L}}{\partial \Theta}, \qquad (12)$$

where $\gamma$ indicates the learning rate.
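As a simplified sketch of one training step, with the variances held fixed (i.e., omitting the $z$-gradient of equation (11)), a single-reconstruction contrastive-divergence (CD-1) update of equations (8)-(10) might look as follows; the function and variable names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def cd1_step(v0, W, b, c, sigma2, lr=0.01):
    """One CD-1 update of Eqs. (8)-(10): the intractable <.>_model terms
    are replaced by statistics of a one-step reconstruction <.>_recon."""
    ph0 = sigmoid(c + W.T @ (v0 / sigma2))            # Eq. (6) on the data
    h0 = (rng.random(ph0.shape) < ph0).astype(float)  # sample binary hiddens
    v1 = b + W @ h0                                   # mean of Eq. (7)
    ph1 = sigmoid(c + W.T @ (v1 / sigma2))            # Eq. (6) on the recon
    # Gradient ascent: data statistics minus reconstruction statistics
    W += lr * (np.outer(v0 / sigma2, ph0) - np.outer(v1 / sigma2, ph1))
    b += lr * ((v0 - v1) / sigma2)
    c += lr * (ph0 - ph1)
    return W, b, c
```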

Figure 3. A joint density RBM for voice conversion. Each hidden unit connects with the dictionaries of the source and the target speaker.


4. A Joint Density RBM for Parallel Dictionary Learning and Iterative Estimation Algorithm

Our voice conversion system uses a joint density RBM to train the parallel dictionaries $\mathbf{D}_x, \mathbf{D}_y$ and to estimate the sparse vectors $\boldsymbol{\alpha}$ at the same time, as shown in Figure 3. As discussed in the previous section, an RBM is a two-layer network that consists of a visible layer and a hidden layer, characterized in that bi-directional connections exist only between visible and hidden units. As shown in Figure 3, an RBM fed a concatenated vector of the source speaker's features $\mathbf{x}$ and the target speaker's features $\mathbf{y}$ can be regarded as a network where a dictionary-selection weight $\alpha_i$ connects to both $\mathbf{x}$ and $\mathbf{y}$ with the weights of the $i$th dictionaries $\mathbf{D}_x^i$ and $\mathbf{D}_y^i$, respectively.

Given parallel training data $(\mathbf{x}, \mathbf{y})$, we define a joint probability of $\mathbf{x}, \mathbf{y}, \boldsymbol{\alpha}$ as follows:

$$p(\mathbf{x}, \mathbf{y}, \boldsymbol{\alpha}; \mathbf{D}_x, \mathbf{D}_y) = \frac{1}{Z} e^{-E(\mathbf{x}, \mathbf{y}, \boldsymbol{\alpha}; \mathbf{D}_x, \mathbf{D}_y)}, \qquad (13)$$

$$E(\mathbf{x}, \mathbf{y}, \boldsymbol{\alpha}; \mathbf{D}_x, \mathbf{D}_y) = \frac{\| \mathbf{x} - \mathbf{b}_x \|^2}{2 \boldsymbol{\sigma}_x^2} + \frac{\| \mathbf{y} - \mathbf{b}_y \|^2}{2 \boldsymbol{\sigma}_y^2} - \left( \frac{\mathbf{x}}{\boldsymbol{\sigma}_x^2} \right)^T \mathbf{D}_x \boldsymbol{\alpha} - \left( \frac{\mathbf{y}}{\boldsymbol{\sigma}_y^2} \right)^T \mathbf{D}_y \boldsymbol{\alpha} - \mathbf{c}^T \boldsymbol{\alpha}, \qquad (14)$$

where $Z = \sum_{\mathbf{x}, \mathbf{y}, \boldsymbol{\alpha}} e^{-E(\mathbf{x}, \mathbf{y}, \boldsymbol{\alpha})}$ indicates a normalization term, and $\mathbf{c}$ is a bias parameter vector of the dictionary-selection weights. $\mathbf{b}_x$ and $\boldsymbol{\sigma}_x$ indicate bias and deviation parameters of the source speaker's acoustic features, respectively, and $\mathbf{b}_y$ and $\boldsymbol{\sigma}_y$ indicate bias and deviation parameters of the target speaker's features, respectively. The dictionaries $\mathbf{D}_x, \mathbf{D}_y$ (and the other parameters) can be estimated by maximizing the likelihood $\mathcal{L} = p(\mathbf{x}, \mathbf{y}) = \sum_{\boldsymbol{\alpha}} p(\mathbf{x}, \mathbf{y}, \boldsymbol{\alpha})$ from equation (13). By partially differentiating the likelihood with respect to each dictionary, we obtain:


$$\frac{\partial \mathcal{L}}{\partial D_x^{dk}} = \left\langle \frac{x_d \alpha_k}{\sigma_{x,d}^2} \right\rangle_{\mathrm{data}} - \left\langle \frac{x_d \alpha_k}{\sigma_{x,d}^2} \right\rangle_{\mathrm{model}}, \qquad (15)$$

$$\frac{\partial \mathcal{L}}{\partial D_y^{dk}} = \left\langle \frac{y_d \alpha_k}{\sigma_{y,d}^2} \right\rangle_{\mathrm{data}} - \left\langle \frac{y_d \alpha_k}{\sigma_{y,d}^2} \right\rangle_{\mathrm{model}}. \qquad (16)$$

As discussed in Section 3, we make practical use of an approximation method (contrastive divergence) to calculate the second terms of equations (15) and (16). For the other parameters, we can derive the gradients in the same way as in equations (9), (10), and (11).
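Under the same contrastive-divergence approximation, one update of the parallel dictionaries can be sketched as follows. This is a toy illustration with the biases and variances held fixed; the function and variable names are ours, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def dict_cd1_step(x, y, Dx, Dy, c, bx, by, sx2, sy2, lr=0.01):
    """One CD-1 update of Eqs. (15)-(16): reconstruction statistics
    stand in for the intractable model term."""
    # Selection-weight probabilities from the parallel data pair
    a0 = sigmoid(Dx.T @ (x / sx2) + Dy.T @ (y / sy2) + c)
    s0 = (rng.random(a0.shape) < a0).astype(float)     # sample selections
    x1, y1 = Dx @ s0 + bx, Dy @ s0 + by                # mean reconstructions
    a1 = sigmoid(Dx.T @ (x1 / sx2) + Dy.T @ (y1 / sy2) + c)
    Dx += lr * (np.outer(x / sx2, a0) - np.outer(x1 / sx2, a1))  # Eq. (15)
    Dy += lr * (np.outer(y / sy2, a0) - np.outer(y1 / sy2, a1))  # Eq. (16)
    return Dx, Dy
```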

When it comes to conversion, we estimate the target speaker's vector $\mathbf{y}$ by repeating the forward and backward inference of an RBM, as shown in Figure 4. Given an initial vector $\mathbf{y}_0$, we first calculate the expectation values of the dictionary-selection weights $\boldsymbol{\alpha}$ using the probability that each dictionary is selected:

$$p(\boldsymbol{\alpha} = \mathbf{1} \mid \mathbf{x}, \mathbf{y}) = \mathcal{S}\left( \mathbf{D}_x^T \frac{\mathbf{x}}{\boldsymbol{\sigma}_x^2} + \mathbf{D}_y^T \frac{\mathbf{y}}{\boldsymbol{\sigma}_y^2} + \mathbf{c} \right). \qquad (17)$$

Input: Dictionaries $\mathbf{D}_x, \mathbf{D}_y$, a source speaker's vector $\mathbf{x}$, and an initial target vector $\mathbf{y}_0$

Output: Estimated target speaker's vector $\hat{\mathbf{y}}$

Initialize: Set the initial values as $\hat{\mathbf{y}} = \mathbf{y}_0$.

Repeat the following updates $R$ times:

1. $\hat{\boldsymbol{\alpha}} = E_{\boldsymbol{\alpha}}\{ \boldsymbol{\alpha} \mid \mathbf{x}, \hat{\mathbf{y}} \} = \mathcal{S}\left( \mathbf{D}_x^T \frac{\mathbf{x}}{\boldsymbol{\sigma}_x^2} + \mathbf{D}_y^T \frac{\hat{\mathbf{y}}}{\boldsymbol{\sigma}_y^2} + \mathbf{c} \right)$

2. $\hat{\mathbf{y}} = E_{\mathbf{y}}\{ \mathbf{y} \mid \hat{\boldsymbol{\alpha}} \} = \mathbf{D}_y \hat{\boldsymbol{\alpha}} + \mathbf{b}_y$

Figure 4. Iterative estimation algorithm of the target vector using a joint density RBM.


Secondly, the expectation values of $\mathbf{y}$ are calculated using $\boldsymbol{\alpha}$. The conditional probability of $\mathbf{y}$ is given from the backward inference of an RBM as follows:

$$p(\mathbf{y} \mid \boldsymbol{\alpha}) = \mathcal{N}\left( \mathbf{y} \mid \mathbf{D}_y \boldsymbol{\alpha} + \mathbf{b}_y, \, \boldsymbol{\sigma}_y^2 \right). \qquad (18)$$

Repeating the above-mentioned procedures (estimation of $\boldsymbol{\alpha}$ and $\mathbf{y}$) $R$ times, we iteratively obtain the converted vector $\mathbf{y}$. Although several approaches for determining the initial vector $\mathbf{y}_0$ can be considered (e.g., using the source feature vector $\mathbf{x}$, or estimated values obtained from another method such as GMM), we use a zero vector $\mathbf{0}$ in this paper.
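The iterative estimation of Figure 4 reduces to a few lines once the parameters are trained; this sketch alternates equation (17) with the mean of equation (18), using the zero-vector initialization described above (names are ours):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def jdrbm_convert(x, Dx, Dy, c, by, sx2, sy2, R=5):
    """Alternate Eq. (17) (dictionary-selection weights) with the mean of
    Eq. (18) (target reconstruction), starting from y = 0."""
    y_hat = np.zeros_like(by)
    for _ in range(R):
        alpha = sigmoid(Dx.T @ (x / sx2) + Dy.T @ (y_hat / sy2) + c)
        y_hat = Dy @ alpha + by
    return y_hat
```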

Similar to SMNMF (spectral mapping NMF [20]), our voice conversion method optimizes the likelihood of the training data as well as the likelihood of the dictionary-selection vector, as shown in equation (13). The most obvious difference is that our approach uses a non-linear function for estimating the dictionary-selection vector, as in equation (17), while SMNMF uses a linear function. Furthermore, while SMNMF is restricted to non-negative input values, our approach can accept real values without constraints. In particular, MFCC features, whose distribution tends to be unimodal, fit well with our approach, which assumes Gaussian-distributed inputs.

5. Experiments

5.1. Conditions

In our experiments, we conducted voice conversion using the ATR Japanese speech database [38], comparing our method (joint density restricted Boltzmann machine, or "JDRBM") with the conventional sparse-representation-based voice conversion methods that use exemplar-based NMF [19] ("EXNMF") and spectral-mapping-based NMF [20] ("SMNMF") and, for reference, the well-known GMM-based approach (64 mixtures). From this database, we selected a male speaker (identified as "MMY") and a female speaker ("FTK") as the source and target speakers, respectively. As acoustic feature vectors for our approach and GMM, we calculated 24-dimensional MFCC features from STRAIGHT spectra [39], using filter theory [40] to decode the MFCC back to STRAIGHT spectra in the synthesis stage. For the NMF-based approaches, we used 513-dimensional vectors of STRAIGHT spectra. The parallel data of the source/target speakers, aligned by dynamic programming, were created from 216 word utterances (58,426 frames) in the dataset and used for the training of each method. For the objective test, 25 sentences (about 100 sec. long) that were not included in the training data were arbitrarily selected from the database. The RBM was trained using gradient descent with a learning rate of 0.01 and a momentum of 0.9 for 400 epochs. We varied the number of hidden units over 96, 192, and 384, and evaluated the performance of each (referred to as "JDRBM-96", "JDRBM-192", and "JDRBM-384", respectively). We obtained the converted vector with $R = 5$ iterations (already converged). For spectral-mapping-based NMF, we set the number of bases to 1,000 ("SMNMF-1000") and 2,500 ("SMNMF-2500"). For exemplar-based NMF, we compared the case where all training frames were used ("EXNMF-58426") and the case where 1,000 frames were arbitrarily chosen from the training data ("EXNMF-1000").

For the objective evaluation, we used the SDIR (spectral distortion improvement ratio) to measure how much closer the converted spectra are to the target spectra than the source spectra are. The SDIR is defined as follows:

$$\mathrm{SDIR} \, [\mathrm{dB}] = 10 \log_{10} \frac{\sum_d | X_t(d) - X_s(d) |^2}{\sum_d | X_t(d) - \hat{X}(d) |^2}, \qquad (19)$$

where $X_t(d)$, $X_s(d)$, and $\hat{X}(d)$ denote the $d$th original target spectra, the source spectra, and the converted spectra (spectra obtained from the converted MFCC), respectively. The larger the value of the SDIR, the greater the improvement in the converted spectra. We calculated the SDIR for each frame in the training data and averaged the SDIR values for the final evaluation.
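Equation (19) amounts to a ratio of squared spectral distances expressed in dB; a direct frame-wise version (before averaging over frames) might look as follows:

```python
import numpy as np

def sdir_db(target, source, converted):
    """Eq. (19): spectral distortion improvement ratio in dB.
    Larger values mean the converted spectrum is closer to the target
    than the unprocessed source spectrum is."""
    num = np.sum(np.abs(target - source) ** 2)
    den = np.sum(np.abs(target - converted) ** 2)
    return 10.0 * np.log10(num / den)
```

For example, if the converted spectrum halves the distance to the target relative to the source, the SDIR is about $10 \log_{10} 4 \approx 6$ dB.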


Table 1. Performance of each method

Method         SDIR (dB)
JDRBM-96       5.21
JDRBM-192      5.32
JDRBM-384      4.45
SMNMF-1000     5.14
SMNMF-2500     4.68
EXNMF-1000     4.91
EXNMF-58426    5.23
GMM-64         4.11

(JDRBM: joint density RBM, proposed; SMNMF: spectral mapping NMF; EXNMF: exemplar-based NMF; GMM: Gaussian mixture model.)

Figure 5. Examples of voice conversion using the utterance "ureshii hazuga yukkuri netemo irarenai". (a) Spectrogram of the male source speaker, (b) spectrogram of the female target speaker, (c) converted spectrogram obtained by JDRBM-192, and (d) expectation values of the estimated $\boldsymbol{\alpha}$ (vertical and horizontal axes indicate the index and the time, respectively).

5.2. Results and discussion

We summarize the experimental results of each method in Table 1. As shown in Table 1, our approach ("JDRBM-192") outperformed the other methods. The differences between our approach and SMNMF are the types of input features and the gate functions in equations (1) and (2). The improvement is attributed to the fact that our approach, which inputs real-valued data and uses non-linear gate functions, is able to represent the input features better than SMNMF. Comparing within our own methods, we can see that the performance degrades when the number of hidden units is too large ("JDRBM-384") or too small ("JDRBM-96"). This is because a JDRBM with too few hidden units could not represent the input data sufficiently, while a JDRBM with too many hidden units was over-fitted.

Figure 5 shows an example of a spectrogram converted using our method "JDRBM-192", along with the spectrograms of the source speech and the target speech, and the expected values of the dictionary-selection vectors. In particular, as shown in Figure 5(d), even though our approach does not explicitly impose any sparsity constraints, the estimated $\boldsymbol{\alpha}$ is very sparse (almost all of its values are zero). This is because an RBM naturally makes the hidden units sparse during parameter estimation, so that the hidden units do not capture information that is redundant between them.

6. Conclusion

This paper presented a voice conversion method that uses a joint density RBM as an alternative tool within the sparse-representation-based framework, where only a few dictionaries are used to generate the converted voice. Our proposed method demonstrated better performance than the conventional sparse-representation-based approaches (spectral-mapping NMF and exemplar-based NMF) and the well-known GMM-based approach. In the future, we will study a method that automatically determines the appropriate number of hidden units, since this number appears to be important. Furthermore, we will extend our method to a deeper architecture, such as a deep Boltzmann machine, so that it captures more complex information in the data.

References

[1] A. Kain and M. W. Macon, Spectral voice conversion for text-to-speech synthesis, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1998, pp. 285-288.


[2] C. Veaux and X. Robet, Intonation conversion from neutral to expressive speech, Proc. Interspeech, 2011, pp. 2765-2768.

[3] K. Nakamura, T. Toda, H. Saruwatari and K. Shikano, Speech Communication 54 (2012), 134-146.

[4] L. Deng, A. Acero, L. Jiang, J. Droppo and X. Huang, High-performance robust speech recognition using stereo training data, IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2001, pp. 301-304.

[5] A. Kunikoshi, Y. Qiao, N. Minematsu and K. Hirose, Speech generation from hand gestures based on space mapping, Proc. Interspeech, 2009, pp. 308-311.

[6] R. Gray, IEEE ASSP Magazine 1 (1984), 4-29.

[7] H. Valbret, E. Moulines and J.-P. Tubach, Speech Communication 11 (1992), 175-187.

[8] Y. Stylianou, O. Cappé and E. Moulines, IEEE Transactions on Speech and Audio Processing 6 (1998), 131-142.

[9] T. Toda, A. W. Black and K. Tokuda, IEEE Transactions on Audio, Speech, and Language Processing 15 (2007), 2222-2235.

[10] E. Helander, T. Virtanen, J. Nurminen and M. Gabbouj, IEEE Transactions on Audio, Speech, and Language Processing 18 (2010), 912-921.

[11] C.-H. Lee and C.-H. Wu, Map-based adaptation for speech conversion using adaptation data selection and non-parallel training, Proc. Interspeech, 2006, pp. 2254-2257.

[12] T. Toda, Y. Ohtani and K. Shikano, 2006.

[13] D. Saito, K. Yamamoto, N. Minematsu and K. Hirose, One-to-many voice conversion based on tensor representation of speaker space, Proc. Interspeech, 2011, pp. 653-656.

[14] D. Saito, S. Watanabe, A. Nakamura and N. Minematsu, Probabilistic integration of joint density model and speaker model for voice conversion, Proc. Interspeech, 2010, pp. 1728-1731.

[15] Z.-H. Ling, L. Deng and D. Yu, IEEE Transactions on Audio, Speech, and Language Processing (2013), 2129-2139.

[16] Z. Wu, T. Virtanen, T. Kinnunen, E. S. Chng and H. Li, Exemplar-based unit selection for voice conversion utilizing temporal information, Proc. Interspeech, 2013, pp. 3057-3061.


[17] H.-T. Hwang, Y. Tsao, H.-M. Wang, Y.-R. Wang and S.-H. Chen, Alleviating the Over-Smoothing Problem in GMM-Based Voice Conversion with Discriminative Training, Proc. Interspeech, 2013, pp. 3062-3066.

[18] D. D. Lee and H. S. Seung, Algorithms for non-negative matrix factorization, Advances in Neural Information Processing Systems, 2000, pp. 556-562.

[19] R. Takashima, T. Takiguchi and Y. Ariki, Exemplar-based voice conversion in noisy environment, Spoken Language Technology Workshop (SLT) (2012), 313-317.

[20] R. Takashima, R. Aihara, T. Takiguchi and Y. Ariki, Noise-Robust Voice Conversion Based on Spectral Mapping on Sparse Space, SSW8 (2013), 71-75.

[21] S. Desai, E. V. Raghavendra, B. Yegnanarayana, A. W. Black and K. Prahallad, Voice conversion using artificial neural networks, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2009, pp. 3893-3896.

[22] L.-H. Chen, Z.-H. Ling, Y. Song and L.-R. Dai, Joint Spectral Distribution Modeling Using Restricted Boltzmann Machines for Voice Conversion, Proc. Interspeech, 2013, pp. 3052-3056.

[23] Z. Wu, E. S. Chng and H. Li, Conditional restricted Boltzmann machine for voice conversion, IEEE China Summit and International Conference on Signal and Information Processing (ChinaSIP), 2013.

[24] G. W. Taylor, G. E. Hinton and S. T. Roweis, Modeling human motion using binary latent variables, Advances in Neural Information Processing Systems, 2006, pp. 1345-1352.

[25] T. Nakashika, R. Takashima, T. Takiguchi and Y. Ariki, Voice Conversion in High-order Eigen Space Using Deep Belief Nets, Proc. Interspeech, 2013, pp. 369-372.

[26] T. Nakashika, T. Takiguchi and Y. Ariki, 113 (2013), 83-88.

[27] A.-R. Mohamed, G. E. Dahl and G. Hinton, IEEE Transactions on Audio, Speech, and Language Processing 20 (2012), 14-22.

[28] G. E. Hinton, S. Osindero and Y.-W. Teh, Neural Computation 18 (2006), 1527-1554.

[29] V. Nair and G. E. Hinton, 3D Object Recognition with Deep Belief Nets, NIPS, (2009), 1339-1347.

[30] T. Deselaers, S. Hasan, O. Bender and H. Ney, A deep learning approach to machine transliteration, Fourth Workshop on Statistical Machine Translation, (2009), 233-241.


[31] D. L. Donoho, IEEE Transactions on Information Theory 52 (2006), 1289-1306.

[32] H. Chang, D.-Y. Yeung and Y. Xiong, Super-resolution through neighbor embedding, Computer Vision and Pattern Recognition (2004), 275-282.

[33] Y. Freund and D. Haussler, Unsupervised learning of distributions of binary vectors using two layer networks, Computer Research Laboratory, 1994.

[34] G. E. Hinton and R. R. Salakhutdinov, Science 313 (2006), 504-507.

[35] A. Krizhevsky and G. Hinton, Computer Science Department, University of Toronto, Tech. Rep., 2009.

[36] G. Hinton, A Practical Guide to Training Restricted Boltzmann Machines, Tech. Rep. Department of Computer Science, University of Toronto, 2010.

[37] K. Cho, A. Ilin and T. Raiko, Artificial neural networks and machine learning-ICANN 2011, Springer, 2011, pp. 10-17.

[38] A. Kurematsu, K. Takeda, Y. Sagisaka, S. Katagiri, H. Kuwabara and K. Shikano, Speech Communication 9 (1990), 357-363.

[39] H. Kawahara, M. Morise, T. Takahashi, R. Nisimura, T. Irino and H. Banno, TANDEM-STRAIGHT: A temporally stable power spectral representation for periodic signals and applications to interference-free spectrum, F0, and aperiodicity estimation, ICASSP (2008), 3933-3936.

[40] B. Milner and X. Shao, Speech reconstruction from mel-frequency cepstral coefficients using a source-filter model, Proc. Interspeech, 2002, pp. 2421-2424.