
SVSGAN: SINGING VOICE SEPARATION VIA GENERATIVE ADVERSARIAL NETWORK

Zhe-Cheng Fan, Yen-Lin Lai, Jyh-Shing R. Jang

Department of Computer Science and Information Engineering, National Taiwan University, Taiwan

ABSTRACT

Separating two sources from an audio mixture is an important task with many applications. It is a challenging problem since only one signal channel is available for analysis. In this paper, we propose a novel framework for singing voice separation using the generative adversarial network (GAN) with a time-frequency masking function. The mixture spectra are considered to be a distribution and are mapped to the clean spectra, which are also considered to be a distribution. The approximation between the distributions of mixture spectra and clean spectra is performed during the adversarial training process. In contrast with current deep learning approaches for source separation, the parameters of the proposed framework are first initialized in a supervised setting and then optimized by the training procedure of GAN in an unsupervised setting. Experimental results on three datasets (MIR-1K, iKala and DSD100) show that performance can be improved by the proposed framework consisting of conventional networks.

Index Terms— Singing voice separation, music source separation, deep learning, generative adversarial network

1. INTRODUCTION

Monaural source separation is important to various music applications and is sometimes used as a pre-processing step of music signal analysis. For instance, leading instrument detection [1, 2] separates a leading instrument from its accompaniments. Singing pitch estimation [3–5] can be improved by first separating vocals from background music. Cover song identification [6] is also based on leading instrument or vocal pitch features, estimated using a separated singing voice.

Several approaches have been proposed for singing voice separation. Rafii and Pardo proposed the REPET system [7] to separate voice and music by extracting the repeating musical structure. The assumption of low rank and sparsity of music has been used for matrix decomposition [8–11]. The widely used non-negative matrix factorization (NMF) is applied by learning non-negative reconstruction bases and weights for singing voice separation [12]. Moreover, a complex NMF model [13] has been proposed for jointly estimating the spectrogram and the source phase.

With the development of deep learning, Maas et al. [14] used recurrent neural networks (RNN) to recover a clean voice.

Fig. 1: Block diagram of the proposed framework.

Huang et al. [15] then proposed a deep RNN with discriminative training to reconstruct vocals from background music. Training multi-context networks [16, 17] with different inputs combined at the layer level was proposed to improve audio separation performance. Deep clustering [18] is also used for music separation. Post-processing with a Wiener filter at the output of neural networks and data augmentation [19] have been proposed to separate vocals and instruments. All of these deep learning techniques use multiple non-linear layers to learn the optimal hidden representations from data in a supervised setting.

Generative adversarial networks (GAN) are a new generative model of deep learning [20], which has been successfully used in the field of computer vision to generate realistic images. In the field of source separation, Pascual et al. [21] proposed the use of GAN for speech enhancement, operating in the waveform domain with the aim of generating clean speech waveforms. This paper proposes a novel framework for singing voice separation via GAN (SVSGAN) which operates in the frequency domain and uses a conditional version of GAN. To our knowledge, this is the first proposed framework to use adversarial learning to perform singing voice separation. We regard each spectrum as a sample vector coming from a distribution of spectra. Non-linear mapping between the distribution of the mixture spectra and that of the clean spectra is performed during the adversarial training process. Before adversarial training, the generator parameters are first initialized with joint optimization in a supervised setting and then optimized by the SVSGAN training process in an unsupervised setting.


Fig. 2: The proposed SVSGAN framework, which consists of two conventional DNNs: generator G and discriminator D. Each spectrum is considered to be a sample vector coming from a distribution of spectra.

Finally, the time-frequency masking function is formed from the generator outputs. The block diagram of the proposed framework is shown in Fig. 1.

The remainder of this paper is organized as follows: Section 2 gives an overview of GAN. Section 3 presents the details of the proposed model, including parameter initialization and the adversarial training process. Section 4 presents the experimental settings and results using the MIR-1K, iKala and DSD100 datasets. We conclude the paper in Section 5.

2. GENERATIVE ADVERSARIAL NETWORKS

Goodfellow et al. [20] proposed adversarial learning models that learn to map samples z from one distribution to samples x from another distribution. A GAN consists of a generative model G and a discriminative model D, which compete in a two-player min-max game. G aims to imitate the real data distribution while D is a binary classifier that tries to accurately distinguish real data from generated data. Within this min-max game, the generator and the discriminator can be trained jointly by optimizing the following objective function:

\min_G \max_D V(G, D) = \mathbb{E}_{x \sim P_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim P_G(z)}[\log(1 - D(G(z)))],   (1)

where x is real data sampled from the distribution P_data and G(z) stands for artificial data sampled from the distribution P_G. It is shown in [20] that sufficient training data and epochs allow the distribution P_G to converge to the distribution P_data.

To get more mapping information, we use a conditional extension of GAN (CGAN) [22], which is augmented with side information. Given a context vector y as side information, the generator G(z, y) tries to synthesize realistic data under the control of y. Similarly, the CGAN model allows the output of the discriminative model D(x, y) to be controlled by the context vector y. The objective function becomes the following:

\min_G \max_D V_{\mathrm{CGAN}}(G, D) = \mathbb{E}_{x, y \sim P_{\mathrm{data}}(x, y)}[\log D(x, y)] + \mathbb{E}_{z \sim P_G(z),\, y \sim P_{\mathrm{data}}(y)}[\log(1 - D(G(z, y), y))].   (2)

In this work, we adjust the input of CGAN, which is discussed in Section 3.
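As a concrete illustration of Eq. 2, the following minimal PyTorch sketch (our own, not code from the paper) evaluates the two expectation terms of the conditional objective on a mini-batch, concatenating the condition y to the inputs of both G and D; the toy network sizes, batch size and variable names are placeholder assumptions.

```python
import torch
import torch.nn as nn

# Toy conditional generator and discriminator; all sizes and names are placeholders.
dim_z, dim_x, dim_y = 8, 16, 4
G = nn.Sequential(nn.Linear(dim_z + dim_y, 32), nn.ReLU(), nn.Linear(32, dim_x))
D = nn.Sequential(nn.Linear(dim_x + dim_y, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())

batch = 64
x = torch.randn(batch, dim_x)   # "real" data, x, y ~ P_data(x, y)
y = torch.randn(batch, dim_y)   # side information (condition)
z = torch.randn(batch, dim_z)   # generator input, z ~ P_G(z)

eps = 1e-8                      # numerical guard inside the logs
x_fake = G(torch.cat([z, y], dim=1))            # G(z, y)

# The two expectation terms of V_CGAN(G, D) in Eq. 2, estimated on the mini-batch.
v_cgan = (torch.log(D(torch.cat([x, y], dim=1)) + eps).mean()
          + torch.log(1.0 - D(torch.cat([x_fake, y], dim=1)) + eps).mean())

# D performs gradient ascent on V (descent on -V); G performs descent on V
# (in practice the "log D" trick of [20] is often used for the generator instead).
loss_D, loss_G = -v_cgan, v_cgan
```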

3. PROPOSED WORK

3.1. Model of Singing Voice Separation GAN (SVSGAN)

The SVSGAN architecture consists of two conventional deep neural networks (DNNs): generator G and discriminator D, as shown in Fig. 2. We use magnitude spectra as features and take each spectrum as a sample vector from the spectra distribution. Non-linear mapping is performed between the input mixture spectrum and the output clean spectrum, which consists of the vocal part and the background music part. Generator G takes mixture spectra as input and generates realistic vocal and background music spectra, while discriminator D distinguishes the clean spectra from the generated spectra.

Given that the magnitude spectra are obtained from the time-domain audio signals using the short-time Fourier transform (STFT), the output targets y_1 and y_2 of the network are the magnitude spectra of the different sources. After training, the network's output predictions, which are also magnitude spectra, are ỹ_1 and ỹ_2. The time-frequency masking function, called a soft time-frequency mask, can smooth the source separation results and is used here. The time-frequency mask can be defined as:

m(f) = \frac{|\tilde{y}_1(f)|}{|\tilde{y}_1(f)| + |\tilde{y}_2(f)|},   (3)

where f = 1, 2, ..., F stands for the different frequencies. After the time-frequency mask is calculated, it is applied to the spectra z of the mixture signals to estimate the predicted separation spectra s_1 and s_2, corresponding to source 1 and source 2, defined as:

s_1(f) = m(f)\, z(f), \quad s_2(f) = (1 - m(f))\, z(f),   (4)

where f = 1, 2, ..., F stands for the different frequencies. However, based on [15], joint optimization is used to achieve better results. Accordingly, instead of training the network for the time-frequency mask itself, we train it with the time-frequency masking function. As shown in the left part of Fig. 2, the time-frequency masking function is regarded as an additional layer at the network output, defined as:

\hat{y}_1 = \frac{|\tilde{y}_1|}{|\tilde{y}_1| + |\tilde{y}_2|} \otimes z, \quad \hat{y}_2 = \frac{|\tilde{y}_2|}{|\tilde{y}_1| + |\tilde{y}_2|} \otimes z,   (5)

where ⊗ stands for element-wise multiplication. ŷ_1 and ŷ_2 are the estimated spectra, which can be transformed into time-domain signals using the inverse short-time Fourier transform (ISTFT) with the phase information. In this way, the network and the time-frequency masking function are jointly optimized. In our proposed framework, the final separated output spectra are based on Eq. 5.
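For illustration, Eqs. 3–5 can be implemented as a deterministic masking layer on top of the two network outputs. The sketch below is our own PyTorch rendering under that reading; the function name soft_mask_layer and the small epsilon guard are our assumptions.

```python
import torch

def soft_mask_layer(y1_tilde, y2_tilde, z, eps=1e-8):
    """Time-frequency masking layer of Eqs. 3-5.

    y1_tilde, y2_tilde: raw network outputs (magnitude spectra), shape (batch, F)
    z:                  mixture magnitude spectrum, shape (batch, F)
    Returns the masked estimates (y1_hat, y2_hat); eps avoids division by zero.
    """
    denom = y1_tilde.abs() + y2_tilde.abs() + eps
    m = y1_tilde.abs() / denom      # Eq. 3: soft mask for source 1
    y1_hat = m * z                  # Eqs. 4-5: element-wise masking of the mixture spectrum
    y2_hat = (1.0 - m) * z
    return y1_hat, y2_hat
```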

3.2. Training Objective Functions

Before adversarial training, the parameters of the generator G are initialized by optimizing Eq. 5 in a supervised setting. The training objective J is the mean squared error (MSE), defined as follows:

J = \|\hat{y}_1 - y_1\|^2 + \|\hat{y}_2 - y_2\|^2.   (6)

After parameter initialization, the generator G provides basic performance for singing voice separation to serve as the baseline for our experiments.
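Continuing the sketch above, one possible form of the supervised initialization of Eq. 6 is shown below; the generator size follows the three 1024-unit hidden layers reported in Section 4.2, while the output ReLU, the Adam optimizer and its learning rate are our assumptions rather than details given in the paper.

```python
import torch
import torch.nn as nn

F_BINS = 513  # e.g. a 1024-point STFT yields 513 magnitude bins (assumption for illustration)

# Generator G: maps a mixture magnitude spectrum to the concatenation of two source spectra.
generator = nn.Sequential(
    nn.Linear(F_BINS, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 2 * F_BINS), nn.ReLU(),   # output ReLU keeps magnitudes non-negative (assumption)
)
opt_G = torch.optim.Adam(generator.parameters(), lr=1e-4)  # optimizer choice is an assumption

def pretrain_step(z, y1, y2):
    """One supervised step on Eq. 6: J = ||y1_hat - y1||^2 + ||y2_hat - y2||^2."""
    out = generator(z)
    y1_tilde, y2_tilde = out[:, :F_BINS], out[:, F_BINS:]
    # Joint optimization: gradients flow through the time-frequency masking layer (Eq. 5).
    y1_hat, y2_hat = soft_mask_layer(y1_tilde, y2_tilde, z)
    loss = ((y1_hat - y1) ** 2).sum(dim=1).mean() + ((y2_hat - y2) ** 2).sum(dim=1).mean()
    opt_G.zero_grad()
    loss.backward()
    opt_G.step()
    return loss.item()
```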

To fit the input of generator G, the training objective function of SVSGAN, obtained by adjusting Eq. 2, is defined as follows:

\min_G \max_D V_{\mathrm{SVSGAN}}(G, D) = \mathbb{E}_{z, s_c \sim P_{\mathrm{data}}(z, s_c)}[\log D(s_c, z)] + \mathbb{E}_{z \sim P_G(z)}[\log(1 - D(G(z), z))],   (7)

where s_c is the concatenation of y_1 and y_2, and the output of G(z) is the predicted spectra consisting of the concatenation of ŷ_1 and ŷ_2, generated from the input spectra z. The output of discriminator D is controlled by the augmented input spectra z. In this way, the SVSGAN not only approximates the mapping between the distributions of input spectra and output spectra but also learns the general structure of the spectra.

Fig. 3: SVSGAN training process, where “B.g.m.” stands for background music. The parameters of generator G are first initialized. Discriminator D returns “fake” when the input contains predicted spectra and returns “real” when the input contains clean spectra.

In addition, we use the log D trick [20] as the objective function for generator G.

Note that better separation results may be obtained using more sophisticated training objective functions and more powerful neural networks, such as RNNs or CNNs. However, we use a basic neural network architecture and the MSE as the training objective in order to investigate the degree of performance improvement provided by GAN.
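A hypothetical rendering of one adversarial update of Eq. 7 is sketched below, reusing generator, opt_G, soft_mask_layer and F_BINS from the previous sketches. The discriminator size follows the three 512-unit hidden layers reported in Section 4.2, the generator update uses the log D trick, and D is conditioned on the mixture spectrum z; the optimizer, learning rate and label tensors are our assumptions.

```python
import torch
import torch.nn as nn

# Discriminator D: input is [two source spectra, mixture spectrum z] -> probability of "real".
discriminator = nn.Sequential(
    nn.Linear(3 * F_BINS, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, 1), nn.Sigmoid(),
)
opt_D = torch.optim.Adam(discriminator.parameters(), lr=1e-4)  # optimizer choice is an assumption
bce = nn.BCELoss()

def svsgan_step(z, y1, y2):
    """One adversarial update of Eq. 7 (generator, opt_G, soft_mask_layer as defined above)."""
    out = generator(z)
    y1_hat, y2_hat = soft_mask_layer(out[:, :F_BINS], out[:, F_BINS:], z)
    fake_pair = torch.cat([y1_hat, y2_hat, z], dim=1)  # corresponds to D(G(z), z)
    real_pair = torch.cat([y1, y2, z], dim=1)          # corresponds to D(s_c, z), s_c = [y1, y2]

    real_lbl = torch.ones(z.size(0), 1)
    fake_lbl = torch.zeros(z.size(0), 1)

    # Discriminator: maximize log D(s_c, z) + log(1 - D(G(z), z)).
    loss_D = (bce(discriminator(real_pair), real_lbl)
              + bce(discriminator(fake_pair.detach()), fake_lbl))
    opt_D.zero_grad()
    loss_D.backward()
    opt_D.step()

    # Generator: "log D" trick -- maximize log D(G(z), z) instead of minimizing log(1 - D(...)).
    loss_G = bce(discriminator(fake_pair), real_lbl)
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()
    return loss_D.item(), loss_G.item()
```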

4. EXPERIMENTS

4.1. Dataset & Settings

The proposed framework is evaluated using the MIR-1K dataset [23], the iKala dataset [11] and the Demixing Secret Database (DSD100) [24]. The MIR-1K dataset consists of 1,000 song clips lasting 4 to 13 seconds with a sample rate of 16,000 Hz. These clips are recorded from 110 Chinese popular karaoke songs performed by both male and female amateurs. The iKala dataset consists of 352 30-second song clips with a sample rate of 44,100 Hz. These clips are recorded from Chinese popular songs performed by professional singers. Only 252 song clips are released as a public subset for evaluation. Each song clip in these two datasets is a stereo recording, with one channel for the singing voice and the other for the background music. Manual annotations of the pitch contours are provided. In the experimental settings, we randomly select one-fourth of the song clips as training data and use the remaining song clips for testing.

The DSD100 dataset is taken from a subtask called MUS from the Signal Separation Evaluation Campaign (SiSEC).

MIR-1K Dataset
Model              SDR    SAR    SIR
DNN (baseline)      6.57  10.14   9.84
SVSGAN (V+B)        6.69  10.32   9.86
SVSGAN (V+M)        6.73  10.28   9.96
SVSGAN (V+B+M)      6.78  10.29  10.07
IBM (upper bound)  13.92  14.80  21.96

iKala Dataset
Model              SDR    SAR    SIR
DNN (baseline)      9.74  11.72  14.99
SVSGAN (V+B)       10.15  12.48  14.72
SVSGAN (V+M)       10.22  12.78  14.41
SVSGAN (V+B+M)     10.32  12.87  14.54
IBM (upper bound)  12.30  14.10  23.70

Table 1: Vocal results (in dB) of the conventional DNN and SVSGANs on the MIR-1K and iKala datasets. “IBM” represents the ideal binary mask. Some examples of singing voice separation are provided at http://mirlab.org/demo/svsgan.

It consists of Dev and Test parts, each with 50 songs with a sample rate of 44,100 Hz. Each song provides four sources: bass, drums, other and vocals, and the mixture is semi-professionally engineered. The average duration of these songs is 4 minutes and 10 seconds, and the dataset includes a wide variety of music genres.

To reduce the computational cost, all song clips from the iKala and DSD100 datasets are downsampled to 22,050 Hz. We use the STFT to compute magnitude spectra with a 1024-point window size and a 256-point hop size. Performance is measured in terms of the source-to-distortion ratio (SDR), source-to-interference ratio (SIR), and source-to-artifact ratio (SAR), calculated with the Blind Source Separation (BSS) Eval toolbox v3.0 [25]. For the iKala and MIR-1K datasets, overall performance is reported as the weighted means of SDR, SAR and SIR. For the DSD100 dataset, overall performance is reported as median SDR values on the Test part.
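The preprocessing described above can be approximated with standard Python audio tools; the sketch below uses librosa as a stand-in for the authors' STFT pipeline (the BSS Eval toolbox v3.0 itself is a MATLAB package, for which mir_eval offers a Python counterpart), and the file path is a placeholder.

```python
import numpy as np
import librosa

# Load and downsample to 22,050 Hz (the iKala / DSD100 setting described above).
y_mix, sr = librosa.load("mixture.wav", sr=22050)   # "mixture.wav" is a placeholder path

# Magnitude spectra with a 1024-point window and a 256-point hop.
stft = librosa.stft(y_mix, n_fft=1024, hop_length=256, win_length=1024)
mag, phase = np.abs(stft), np.angle(stft)

# A separated magnitude spectrum would replace `mag` below; combined with the mixture
# phase, ISTFT gives the time-domain estimate (here it simply reconstructs the mixture).
estimate = librosa.istft(mag * np.exp(1j * phase), hop_length=256, win_length=1024)

# SDR / SIR / SAR can then be computed with mir_eval's BSS Eval implementation, e.g.
#   sdr, sir, sar, perm = mir_eval.separation.bss_eval_sources(references, estimates)
# where `references` and `estimates` are arrays of shape (n_sources, n_samples).
```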

4.2. Experimental Results

To compare the performance of the conventional DNN and the SVSGANs, we construct a conventional DNN consisting of 3 hidden layers, each with 1024 neurons, denoted as DNN (baseline). The architecture of generator G in the SVSGANs is identical to the baseline and is combined with a discriminator D consisting of 3 hidden layers, each with 512 neurons. The difference between the SVSGANs is the input spectra of discriminator D, as shown in Table 1, where “V” stands for the vocal spectra, “B” for the background music spectra, and “M” for the mixture spectra. Comparing DNN (baseline) to the SVSGANs, the results on the iKala and MIR-1K datasets show that the SVSGANs improve performance in terms of SDR and SAR. Comparing the different SVSGAN architectures, SVSGAN (V+B) represents the results of the original GAN architecture, while SVSGAN (V+M) and SVSGAN (V+B+M) represent the results of the conditional GAN. SVSGAN (V+M) is found to provide better results, indicating that when the input of discriminator D contains the mixture spectra, SVSGAN (V+M) not only learns the mapping from the distribution of mixture spectra to the distribution of clean spectra but also learns the general structure of the mixture spectra. Comparing SVSGAN (V+M) to SVSGAN (V+B+M), which has more inputs to the discriminator, suggests that increasing the number of inputs to discriminator D improves performance.

Fig. 4: Vocal results on the Test part of the DSD100 dataset, sorted by the median value of each submission.

Fig. 4 compares submissions on the Test part of the DSD100 dataset. The DNN (baseline) and SVSGAN (V+B+M) are the same as those evaluated on the iKala and MIR-1K datasets. Since we only trained the model on the Dev part of the dataset without additional augmented datasets, such as MedleyDB [26], SVSGAN (V+B+M) does not outperform all other submissions. However, the result still shows that singing voice separation can be improved by adversarial learning on this dataset featuring a wide variety of music genres.

5. CONCLUSIONS & FUTURE WORK

This paper proposes a singing voice separation model with a time-frequency masking function for monaural recordings using a generative adversarial framework. The framework consists of two conventional neural networks in a conditional GAN setting and is shown to potentially enhance source separation performance. Possible future work involves three directions. First, we will incorporate additional augmented data in the adversarial training process to achieve better performance. Next, we will seek to improve generator G and discriminator D using more powerful neural networks, such as CNNs or RNNs. Finally, we will explore the use of the Wasserstein GAN [27] to achieve better performance. Future work will also include further comparisons between SVSGANs and other competitive approaches.

6. REFERENCES

[1] J. L. Durrieu, B. David, and G. Richard, “A musically motivated mid-level representation for pitch estimation and musical audio source separation,” IEEE J. Sel. Topics Signal Process., vol. 5, no. 6, pp. 1180–1191, Oct 2011.

[2] S. Uhlich, F. Giron, and Y. Mitsufuji, “Deep neural network based instrument extraction from music,” in Proc. IEEE Int. Conf. Acoust., Speech and Signal Process. (ICASSP), 2015, pp. 2135–2139.

[3] C. L. Hsu, D. Wang, J.-S. R. Jang, and K. Hu, “A tandem algorithm for singing pitch extraction and voice separation from music accompaniment,” IEEE/ACM Trans. Audio, Speech, Language Processing, vol. 20, no. 5, pp. 1482–1491, July 2012.

[4] Z.-C. Fan, J.-S. R. Jang, and C.-L. Lu, “Singing voice separation and pitch extraction from monaural polyphonic audio music via DNN and adaptive pitch tracking,” in Proc. IEEE Int. Conf. Multimedia Big Data (BigMM), 2016, pp. 178–185.

[5] Y. Ikemiya, K. Itoyama, and K. Yoshii, “Singing voice separation and vocal F0 estimation based on mutual combination of robust principal component analysis and subharmonic summation,” IEEE/ACM Trans. Audio, Speech, Language Processing, vol. 24, no. 11, pp. 2084–2095, Nov 2016.

[6] J. Serra, E. Gomez, and P. Herrera, Audio Cover Song Identification and Similarity: Background, Approaches, Evaluation, and Beyond, pp. 307–332, Springer Berlin Heidelberg, Berlin, Heidelberg, 2010.

[7] Z. Rafii and B. Pardo, “REpeating Pattern Extraction Technique (REPET): A simple method for music/voice separation,” IEEE Trans. Audio, Speech, Language Process., vol. 21, no. 1, pp. 73–84, Jan 2013.

[8] P.-S. Huang, S. D. Chen, P. Smaragdis, and M. Hasegawa-Johnson, “Singing-voice separation from monaural recordings using robust principal component analysis,” in Proc. IEEE Int. Conf. Acoust., Speech and Signal Process. (ICASSP), 2012, pp. 57–60.

[9] P. Sprechmann, A. M. Bronstein, and G. Sapiro, “Real-time online singing voice separation from monaural recordings using robust low-rank modeling,” in Proc. Int. Soc. Music Info. Retrieval Conf. (ISMIR), 2012, pp. 67–72.

[10] Y.-H. Yang, “Low-rank representation of both singing voice and music accompaniment via learned dictionaries,” in Proc. Int. Soc. Music Info. Retrieval Conf. (ISMIR), 2013, pp. 427–432.

[11] T.-S. Chan, T.-C. Yeh, Z.-C. Fan, H.-W. Chen, L. Su, Y.-H. Yang, and R. Jang, “Vocal activity informed singing voice separation with the iKala dataset,” in Proc. IEEE Int. Conf. Acoust., Speech and Signal Process. (ICASSP), 2015, pp. 718–722.

[12] B. Zhu, W. Li, R. Li, and X. Xue, “Multi-stage non-negative matrix factorization for monaural singing voice separation,” IEEE/ACM Trans. Audio, Speech, Language Processing, vol. 21, no. 10, pp. 2096–2107, Oct 2013.

[13] P. Magron, R. Badeau, and B. David, “Complex NMF under phase constraints based on signal modeling: Application to audio source separation,” in Proc. IEEE Int. Conf. Acoust., Speech and Signal Process. (ICASSP), 2016, pp. 46–50.

[14] A. L. Maas, Q. V. Le, T. M. O’Neil, O. Vinyals, P. Nguyen, and A. Y. Ng, “Recurrent neural networks for noise reduction in robust ASR,” in Proc. Interspeech, 2012, pp. 22–25.

[15] P.-S. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis, “Singing-voice separation from monaural recordings using deep recurrent neural networks,” in Proc. Int. Soc. Music Info. Retrieval Conf. (ISMIR), 2014, pp. 477–482.

[16] X.-L. Zhang and D. Wang, “Multi-resolution stacking for speech separation based on boosted DNN,” in Proc. Interspeech, 2015, pp. 1745–1749.

[17] X.-L. Zhang and D. Wang, “A deep ensemble learning method for monaural speech separation,” IEEE/ACM Trans. Audio, Speech, Language Processing, vol. 24, no. 5, pp. 967–977, Mar 2016.

[18] Y. Luo, Z. Chen, J. R. Hershey, J. L. Roux, and N. Mesgarani, “Deep clustering and conventional networks for music separation: Stronger together,” in Proc. IEEE Int. Conf. Acoust., Speech and Signal Process. (ICASSP), 2017, pp. 61–65.

[19] S. Uhlich, M. Porcu, F. Giron, M. Enenkl, T. Kemp, N. Takahashi, and Y. Mitsufuji, “Improving music source separation based on deep neural networks through data augmentation and network blending,” in Proc. IEEE Int. Conf. Acoust., Speech and Signal Process. (ICASSP), 2017, pp. 261–265.

[20] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems (NIPS), 2014, pp. 2672–2680.

[21] S. Pascual, A. Bonafonte, and J. Serra, “SEGAN: Speech enhancement generative adversarial network,” in Proc. Interspeech, 2017.

[22] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” arXiv:1611.07004, 2016.

[23] C.-L. Hsu and J.-S. R. Jang, “On the improvement of singing voice separation for monaural recordings using the MIR-1K dataset,” IEEE Trans. Audio, Speech, Language Process., vol. 18, no. 2, pp. 310–319, Feb 2010.

[24] “SiSEC MUS Homepage,” 2016, [Online] https://sisec.inria.fr/sisec-2016/2016-professionally-produced-music-recordings/.

[25] E. Vincent, R. Gribonval, and C. Fevotte, “Performance measurement in blind audio source separation,” IEEE Trans. Audio, Speech, Language Process., vol. 14, no. 4, pp. 1462–1469, July 2006.

[26] R. M. Bittner, J. Salamon, M. Tierney, M. Mauch, C. Cannam, and J. P. Bello, “MedleyDB: A multitrack dataset for annotation-intensive MIR research,” in Proc. Int. Soc. Music Info. Retrieval Conf. (ISMIR), 2014, pp. 155–160.

[27] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein GAN,” arXiv:1701.07875, 2017.