
Page 1: Singing Voice Separation - University of Rochester

Singing Voice Separation

Christos Benetatos, Ge Zhu, Yoon mo Yang

Page 2

Motivations

● Real-world applications
  ○ Automatic speech recognition (source separation)
  ○ Chord recognition and main melody extraction

● Commercial applications
  ○ Query-by-humming (e.g. Soundhound)
  ○ Automatic pitch correction (e.g. Autotune)
  ○ Singing synthesis (e.g. Vocaloid)

Page 3

Background and Challenges

● Background
  ○ Pop music around the world is built around singing
  ○ The singing voice is treated as the most expressive instrument
  ○ It carries two kinds of information: sound and words

● Challenges
  ○ For monaural recordings, only single-channel information is available
  ○ Conventional approaches struggle with polyphonic material
  ○ DNN-based approaches depend heavily on the available datasets

Page 4

Low-Rank Approximation Methods for Voice Separation

Page 5

What does it mean that the music background is a low-rank signal? A musician's intuitive explanation

● Low-rank (context level)
  ○ In voice/lyrics-centered music, where the instruments only support the singing, a trained musician can often predict the instrumental part for the whole piece after hearing the first x seconds. The smaller x is, the lower the rank of the musical (not voice) signal. This rarely happens in pure instrumental or classical music.

● Low-rank (note spectrogram level)
  ○ In a piece with piano accompaniment, all repetitions of a note sound the same: if you hear a C5, every C5 repetition sounds alike. No surprises. So the entire piano backing can be described as a combination of at most 88 spectral elements (the number of keys).
  ○ A virtuoso singer, however, has far more freedom and variety. We rarely hear the same passage, or the same timbre, twice. Describing the voice takes many more than 88 elements, so it stops being a low-rank signal.
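This intuition is easy to check numerically. Below is a minimal numpy sketch (a toy example added for illustration, not from the slides): a synthetic "spectrogram" whose accompaniment frames repeat three note templates is exactly rank 3, while adding sparse, non-repeating "voice" activity drives the rank far up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "magnitude spectrogram": 64 frequency bins x 200 frames.
# Accompaniment: every frame is one of 3 repeated note templates -> rank <= 3.
templates = rng.random((64, 3))
accompaniment = templates[:, rng.integers(0, 3, size=200)]

# "Voice": sparse, non-repeating activity (random bins active per frame).
voice = np.where(rng.random((64, 200)) < 0.05, rng.random((64, 200)), 0.0)

mixture = accompaniment + voice

# The singular values reveal the structure: the accompaniment alone has only
# 3 numerically non-zero singular values; the mixture has many more.
s_acc = np.linalg.svd(accompaniment, compute_uv=False)
s_mix = np.linalg.svd(mixture, compute_uv=False)

rank_acc = int(np.sum(s_acc > 1e-10 * s_acc[0]))
rank_mix = int(np.sum(s_mix > 1e-10 * s_mix[0]))
print(rank_acc, rank_mix)  # accompaniment rank is 3; the mixture rank is much higher
```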

Page 6

Main categories of Low-Rank Methods

Page 7

Dictionary Methods

NMF [1]

● Works well when the signal contains only music

● Low rank assumption for both accompaniment and voice

● It is more difficult to summarize the voice with a small number of spectral templates

Lp-Norm NMF [2]

● Voice is a sparse signal
● Voice = NMF reconstruction error
● Implicitly control the voice by controlling the sparsity of the error

[1] P. Smaragdis and J. C. Brown, "Non-negative matrix factorization for polyphonic music transcription," in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2003.
[2] T. Nakamura and H. Kameoka, "Lp-norm non-negative matrix factorization and its application to singing voice enhancement," in IEEE International Conference on Acoustics, Speech and Signal Processing, Brisbane, QLD, Australia, Apr. 2015.
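As a concrete reference point, here is a minimal sketch of plain NMF [1] with multiplicative updates for the squared error (the Lp-norm variant of [2] changes how the residual is penalized; the matrix sizes and data here are hypothetical):

```python
import numpy as np

def nmf(V, k, n_iter=500, eps=1e-9):
    """Plain NMF: factor a non-negative F x T matrix V into W (F x k
    dictionary of spectral templates) times H (k x T activations),
    using multiplicative updates for the squared Frobenius error."""
    rng = np.random.default_rng(0)
    F, T = V.shape
    W = rng.random((F, k)) + eps
    H = rng.random((k, T)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)  # update activations
        W *= (V @ H.T) / (W @ H @ H.T + eps)  # update templates
    return W, H

# Toy non-negative "spectrogram" that truly has rank 2:
rng = np.random.default_rng(1)
V = rng.random((30, 2)) @ rng.random((2, 50))
W, H = nmf(V, k=2)
rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(rel_err < 0.1)  # a rank-2 dictionary reconstructs V closely
```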

Page 8

Low Rank + Sparse Decomposition

Figure from http://www.ihes.fr/~comdev/liens/Chaire_Schlumberger/candes.pdf

Robust Principal Component Analysis

● Method to separate low-rank background from sparse foreground

● Applications to video background extraction (moving objects are sparse noise)

Assumptions:

● Low-rank background is not sparse
● Sparse foreground is not low-rank
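A minimal sketch of this decomposition, solved by an ADMM-style augmented-Lagrangian iteration (the fixed λ and the μ heuristic follow the RPCA literature; the toy matrices are my own example, not the slides'):

```python
import numpy as np

def shrink(X, tau):
    """Soft thresholding: proximal operator of the L1 norm."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svt(X, tau):
    """Singular value thresholding: proximal operator of the nuclear norm."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ (shrink(s, tau)[:, None] * Vt)

def rpca(M, n_iter=500):
    """Split M into low-rank L plus sparse S by Principal Component Pursuit."""
    n1, n2 = M.shape
    lam = 1.0 / np.sqrt(max(n1, n2))          # the non-tunable lambda
    mu = n1 * n2 / (4.0 * np.abs(M).sum())    # common step-size heuristic
    L = np.zeros_like(M); S = np.zeros_like(M); Y = np.zeros_like(M)
    for _ in range(n_iter):
        L = svt(M - S + Y / mu, 1.0 / mu)
        S = shrink(M - L + Y / mu, lam / mu)
        Y = Y + mu * (M - L - S)
    return L, S

# Toy data: a rank-2 "background" plus 5%-sparse "foreground" spikes.
rng = np.random.default_rng(0)
L0 = rng.standard_normal((50, 2)) @ rng.standard_normal((2, 50))
S0 = np.where(rng.random((50, 50)) < 0.05,
              5.0 * np.sign(rng.standard_normal((50, 50))), 0.0)
L, S = rpca(L0 + S0)
rel_err = np.linalg.norm(L - L0) / np.linalg.norm(L0)
print(rel_err < 0.1)  # the low-rank part is recovered almost exactly
```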

Page 9

Norms and Convexity Background

∝ : proportional to

≈ : indicator of (or measure of)

● Sparsity

○ Sparsity(A) ≈ number of zero elements of A
○ L0 norm = number of non-zero elements of A (a non-convex "norm")
○ So, maximizing sparsity = minimizing the L0 norm
○ L1 norm = sum of the absolute values of all elements of A
○ We can use the L1 norm as a convex approximation of L0

● Rank

○ Rank(A) ≈ number of non-zero singular values of A (a non-convex function)

○ L* (nuclear) norm = sum of the singular values of A; the non-zero eigenvalues of AA* are the squared non-zero singular values, so Rank(AA*) = Rank(A)
○ So, we can use the L* norm as a convex approximation of the rank function
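A tiny numeric illustration of these four quantities on a toy matrix (my own example):

```python
import numpy as np

A = np.array([[3.0, 0.0, 0.0],
              [0.0, 0.0, 0.0],
              [0.0, 0.0, 4.0]])

l0 = int(np.count_nonzero(A))                 # "L0 norm": # of non-zero entries
l1 = float(np.abs(A).sum())                   # L1 norm: convex surrogate of L0
rank = int(np.linalg.matrix_rank(A))          # rank: # of non-zero singular values
nuclear = float(np.linalg.svd(A, compute_uv=False).sum())  # L* (nuclear) norm
print(l0, l1, rank, nuclear)  # 2 7.0 2 7.0
```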

Page 10

RPCA as an optimization problem using Principal Component Pursuit (PCP)

Non-convex PCP problem:

  minimize Rank(L) + λ‖S‖0   subject to   M = L + S

Convex PCP relaxation:

  minimize ‖L‖* + λ‖S‖1   subject to   M = L + S

λ is not a tunable parameter. For an n1 × n2 matrix M, there is a proof that if λ = 1/√max(n1, n2), then under the assumptions above PCP recovers L and S exactly with high probability.

Page 11

Why not just regular PCA ?

● Sensitive to outliers / breaks down with heavily corrupted data
● Like NMF, it also uses the low-rank assumption for the total voice-plus-music signal

(Figure: PCA vs. RPCA)

Page 12

Po-Sen Huang et al. 2012 [1]: “Singing-voice separation from monaural recordings using robust principal component analysis”

[1]. P.-S. Huang, S. D. Chen, P. Smaragdis, and M. Hasegawa-Johnson, “Singing-voice separation from monaural recordings using robust principal component analysis,” in IEEE International Conference on Acoustics, Speech and Signal Processing, Kyoto, Japan, Mar. 2012.

Page 13

Overall Architecture and parameters

● Overall architecture
● Tunable λ

Page 14

Time-Frequency Masking

● Time-Frequency Masking:
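The binary mask used here can be sketched as follows (a small illustration with hypothetical magnitudes; the gain factor is the paper's tunable threshold): a time-frequency bin is assigned to the voice when the sparse RPCA output dominates the low-rank one.

```python
import numpy as np

def binary_mask(S_mag, L_mag, gain=1.0):
    """Assign a time-frequency bin to the voice when the sparse (S) magnitude
    exceeds gain times the low-rank (L) magnitude."""
    return (S_mag > gain * L_mag).astype(float)

# Hypothetical 2x2 magnitude patches of the two RPCA outputs.
S_mag = np.array([[0.9, 0.1], [0.5, 0.7]])   # sparse component (voice-like)
L_mag = np.array([[0.2, 0.8], [0.6, 0.1]])   # low-rank component (music-like)
mask = binary_mask(S_mag, L_mag)
voice_est = mask * (S_mag + L_mag)           # mask applied to the mixture magnitude
print(mask.tolist())  # [[1.0, 0.0], [0.0, 1.0]]
```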

Page 15

Evaluation and Results

● Evaluation metrics: BSS Eval SDR, SIR, and SAR (reported as the global, length-weighted GNSDR, GSIR, and GSAR)

Page 16

Po-Sen Huang et al 2014 [1]“Singing-voice Separation From Monaural Recordings Using Deep Recurrent Neural Networks”

● From monaural recordings, in a supervised setting
● DRNNs with different temporal connections
● Jointly optimizing the networks for multiple source signals by including the separation step
● Different discriminative training objectives
● Proposed framework:

[1] P.-S. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis, "Singing-voice separation from monaural recordings using deep recurrent neural networks," in Proc. of the International Society for Music Information Retrieval Conference (ISMIR), 2014.

Page 17

● An RNN is a DNN with layers that introduce memory from the past

● Black: hidden states, White: input frames, Grey: output frames

● Weakness of RNNs: they lack hierarchical processing of the input at the current time step
  ○ Deep recurrent neural networks address this.

Deep Recurrent Neural Network Architectures

Page 18

Proposed Model Architecture: Joint Training via Time-Frequency Masking

● Magnitude spectra as features
● Separate one of the sources from a mixture, rather than learning one of the sources as the target
● A time-frequency masking technique enforces the constraint that the sum of the prediction results equals the original mixture
● The masking can be viewed as a layer: jointly train the network with the time-frequency masking function as an extra layer added to the original output of the network
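The masking layer described above can be sketched as follows (toy frames, my own example): the two raw network outputs are rescaled so that they always sum to the mixture.

```python
import numpy as np

def soft_mask_layer(y1_hat, y2_hat, z):
    """Deterministic time-frequency masking layer: rescale the two network
    outputs so that their sum equals the mixture magnitude z."""
    denom = np.abs(y1_hat) + np.abs(y2_hat) + 1e-12
    y1 = np.abs(y1_hat) / denom * z
    y2 = np.abs(y2_hat) / denom * z
    return y1, y2

z = np.array([1.0, 2.0, 3.0])          # mixture magnitude frame
y1_hat = np.array([0.5, 1.0, 2.0])     # raw network output, source 1
y2_hat = np.array([0.5, 3.0, 1.0])     # raw network output, source 2
y1, y2 = soft_mask_layer(y1_hat, y2_hat, z)
print(np.allclose(y1 + y2, z))  # True: the outputs always sum to the mixture
```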

Page 19

Discriminative Training Objectives

● The mean squared error
● Generalized KL divergence
● Discriminative objective functions to obtain a high SIR:
  ○ Increase the similarity between the prediction and its target
  ○ Decrease the similarity between the prediction and the targets of the other sources
● γ is a constant chosen based on performance
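Putting the bullets together, the discriminative objective takes the following form (a reconstruction consistent with the description above: one reconstruction term per source, minus a γ-weighted penalty on similarity to the other source's target):

```latex
J_{\mathrm{dis}} = \lVert \hat{y}_1 - y_1 \rVert^2 + \lVert \hat{y}_2 - y_2 \rVert^2
  - \gamma \left( \lVert \hat{y}_1 - y_2 \rVert^2 + \lVert \hat{y}_2 - y_1 \rVert^2 \right)
```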

Page 20

Results and Conclusion

● Results with unsupervised and supervised settings

● 2.30–2.48 dB GNSDR gain and 4.32–5.42 dB GSIR gain compared to RNMF

● To further enhance the results:
  ○ Jointly optimizing a soft mask function with the networks
  ○ The discriminative training criteria

● Demo: https://sites.google.com/site/deeplearningsourceseparation/

Page 21

Yi Luo et al. 2017: “Deep Clustering and Conventional Networks for Music Separation: Stronger Together”

Conventional regression-based networks:

● Supervised mask-inference-based method
● Increases separation between sources

Deep clustering:

● Unsupervised method to solve general audio separation problem with multiple sources of same type and arbitrary number of sources

● Reduce within-source variance

Demo Page: http://danetapi.com/chimera

Page 22

Deep Clustering Intuition: Deep (learning techniques to derive embedding features for performing efficient) Clustering

Traditional spectral clustering (in contrast to central clustering):

● Spectral decomposition of the original feature signal
● Map the feature matrix into a different dimensional space based on the spectrum
● In the mapped space, perform simple central clustering

Deep clustering:

● Use a neural network to learn embedding features automatically, then run a central clustering algorithm on the embeddings

John Hershey et al., “Deep Clustering: Discriminative Embeddings for Segmentation and Separation”

Page 23

Details

Partition-based training:

● Reference label indicator: Y = {y(n,c)} (maps element n to class c)
● Then A = YYᵀ is an ideal affinity matrix representing the partition.

Training objective:

● Embeddings enable accurate clustering based on labels

Objective function: make the affinity matrix of the learned embeddings match the ideal affinity matrix A.

Page 24

Details

Cost function: C(V) = ‖VVᵀ − YYᵀ‖F², where the rows of V are the per-bin embeddings.

Test:

● After computing V on test signals, cluster rows of V using k-means.
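The cost above can be computed without ever forming the huge N × N affinity matrices; a minimal numpy sketch (the toy labels and embeddings are my own example):

```python
import numpy as np

def dc_loss(V, Y):
    """Deep clustering cost ||V V^T - Y Y^T||_F^2, expanded so that only small
    D x D and C x C Gram matrices are formed (N can be very large)."""
    return (np.linalg.norm(V.T @ V) ** 2
            - 2.0 * np.linalg.norm(V.T @ Y) ** 2
            + np.linalg.norm(Y.T @ Y) ** 2)

# 4 time-frequency bins, 2 sources: bins 0,1 belong to source 1; bins 2,3 to source 2.
Y = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)

rng = np.random.default_rng(0)
V_bad = rng.random((4, 3))        # arbitrary 3-dimensional embeddings
direct = np.linalg.norm(V_bad @ V_bad.T - Y @ Y.T) ** 2  # brute-force check
print(np.isclose(dc_loss(V_bad, Y), direct))  # True: the expansion matches
print(np.isclose(dc_loss(Y, Y), 0.0))         # True: ideal embeddings give zero cost
```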

Page 25

Deep Clustering

Advantages:

● Partition-based (no labels) instead of class-based (labels required).
● Helps solve the permutation problem (by using a permutation-independent embedding).

Disadvantages:

● Embedding dimension has to be tuned
● Requires post-processing

Page 26

Singing Voice Separation Task

Conventional NN:

● Output soft mask

Deep Clustering:

● Output embeddings
● Post-process embeddings to get a soft mask

Page 27

Singing Voice Separation

Chimera Networks:

● Deep clustering head

● Conventional NN head

● Globally: the whole network is trained with a weighted combination of the two heads' losses, L = α·L_DC + (1 − α)·L_MI

Page 28

Results

● Won 1st place in MIREX 2016 and also outperformed the best systems from past years.

(SDRi: improvement of SDR with respect to that of the mixture)

(iKala dataset: not used in training)

● Demo: http://www.merl.com/demos/deep-clustering