
Dublin Institute of Technology
ARROW@DIT

Conference papers Audio Research Group

2011-07-01

Upmixing from Mono: a Source Separation Approach
Derry Fitzgerald
Dublin Institute of Technology, [email protected]

This Conference Paper is brought to you for free and open access by the Audio Research Group at ARROW@DIT. It has been accepted for inclusion in Conference papers by an authorized administrator of ARROW@DIT. For more information, please contact [email protected], [email protected].

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 License

Recommended Citation
Fitzgerald, D.: Upmixing from Mono: a Source Separation Approach. 17th International Conference on Digital Signal Processing, 6-8 July, 2011, Corfu, Greece

UPMIXING FROM MONO - A SOURCE SEPARATION APPROACH

Derry FitzGerald

Audio Research Group
Dublin Institute of Technology

Kevin St., Dublin 2, Ireland

ABSTRACT
We present a system for upmixing mono recordings to stereo through the use of sound source separation techniques. The use of sound source separation has the advantage of allowing sources to be placed at distinct points in the stereo field, resulting in more natural sounding upmixes. The system separates an input signal into a number of sources, which can then be imported into a digital audio workstation for upmixing to stereo. Considerations to be taken into account when upmixing are discussed, and a brief overview of the various sound source separation techniques used in the system is given. The effectiveness of the proposed system is then demonstrated on real-world mono recordings.

Index Terms— Upmixing, Sound source separation, vocal separation, percussion separation, pitched instrument separation, Non-negative Tensor Factorisation, Non-negative partial cofactorisation.

1. INTRODUCTION

Since the first issuing of commercial stereo gramophone recordings in the late 1950s, there have been numerous attempts to create stereo recordings from material originally issued as monophonic or single channel recordings. This typically involved taking two copies of the signal, delaying one copy by 30-40 ms, and then high-pass filtering one copy and low-pass filtering the other. This was the basis of the Duophonic system used by Capitol Records to create pseudo-stereo recordings. Alternatively, approaches based on comb-filtering have been proposed [1], [2].

However, a notable problem with these recordings is that while they can give the appearance of spread or width to a monophonic recording, they do not allow for the placement of individual sources in the original recording at distinct points in the stereo field. The ability to do this would result in a much more natural sounding conversion or upmixing of mono recordings to stereo, or indeed 5.1 surround sound.

To this end, it is proposed to use sound source separation techniques to separate out sources or instruments from a mono recording. The separated source signals can then be used to create a stereo upmix of the mono recording. Sound source separation techniques have previously been used for the creation of stereo to 5.1 upmixes [3], but little or no work has been done on mono to stereo. The state of the art in single channel sound source separation has advanced considerably in recent years, with large amounts of work focusing on approaches using spectrogram factorisation in particular. Nevertheless, given the difficulty of the problem, there will typically still be imperfections in the separation, both in terms of residual traces of other sources, and in artifacts due to the separation process. This can be a limiting factor when attempting to use these separations in the context of new musical pieces, but, as will be discussed later, this is not as much of a problem when using the separated sources for the purposes of upmixing from mono to stereo.

This research was funded under the Science Foundation Ireland Stokes Lectureship scheme.

The focus in this paper is on the use of blind source separation techniques to separate the signals for upmixing, where there is little or no information provided to the system about the nature of the sources to be separated. It is felt that this will result in a more general system which is capable of dealing with very different types of music, without recourse to optimising the techniques for each style.

There are a number of considerations to be taken into account when attempting to create a successful upmix from mono to stereo using sound source separation techniques. Firstly, the upmix should be free from any audible artifacts. This can be achieved by ensuring that no information from the original signal is lost at any point in separating the sources, so that the separated signals summed together fully reconstitute the original mono signal. In this case, any artifacts in the individual separations will usually be masked by the other sources, provided that the chosen pan positions for the separated sources are not too extreme. This has ramifications for the stereo width of the upmix, and is discussed in greater detail in section 3.

Secondly, another important consideration is the stability of the pan position of the separated sources. It is desirable to create upmixes in which the pan position of the sources remains fixed throughout their appearance in the piece. If the sources drift gradually out of position or occasionally jump position, then this can be distracting for the listener, particularly for those using headphones.

The principal reason for a source to move position is incorrect separation of the sources, where, for example, part of a vocal track has not been correctly separated from the percussion instruments, and so where the incorrect separation occurs, the vocal moves towards the position of the percussion instruments. This would be particularly noticeable if the sources were on opposite sides of the stereo field, such as the vocals panned hard right and the percussion instruments panned hard left. In cases such as this, the easiest way to ameliorate the problem is either to put both sources in the same position, or to put the sources in positions close to each other, such as putting the drums hard left and the vocals mid-left. This again can impose limitations on the positioning of sources in the stereo upmix.

Taking these considerations into account, it can be seen that the separation of the various sources does not have to be perfect in order to achieve a realistic upmix from mono to stereo. Due to the effects of masking, most of the issues relating to the residual presence of other sources and artifacts due to separation can be overcome if due care is taken when reconstructing the sources. Therefore, the principal criterion for achieving a realistic upmix is that the sources are sufficiently separated to allow directionality to be imposed on them. This is a considerably less onerous requirement than achieving separations with minimal artifacts and bleed from other sources.

Finally, there is the issue as to whether to aim to recreate the original mono mix, but with enhanced width, or whether to attempt to remix the material by increasing or decreasing the gains of the individual sources. The approach taken here is to solely enhance width, while leaving the overall balance of the sources the same. It is felt that this results in upmixes which are more faithful to the original source material, though arguments could be made for remixing to reflect the tastes of modern listeners.

2. UPMIXING SYSTEM OVERVIEW

In recent years much research has been carried out on the topic of single channel sound source separation, which concerns itself with the extraction of individual sources or instruments from a single channel mixture of instruments. A wide range of different techniques have been used in attempting to solve this problem, some involving the use of training data to generate models of the sources to be separated, some using other prior knowledge such as pitch information about the sources, and others which attempt to perform the separations in a blind manner without any prior knowledge about the sources. However, for the purposes of this paper, this research on single channel separation can be broken into three broad categories, distinguished by the types of sources to be separated. These are outlined below.

The first category is research related to the separation of pitched instruments from other pitched instruments. In the context of upmixing, it is clear that this is a necessary part of any upmixing system, as it will allow pitched sources in a mixture to be placed at different points in the stereo field, such as in a recording containing piano, guitar and bass guitar. Techniques used for this purpose include techniques based on non-negative matrix and tensor factorisations [4, 5, 6], and others based on common trajectories of harmonics [7, 8].

The second category attempts to extract percussion instruments occurring in mixtures of other instruments including both pitched instruments and voice. This is necessary for upmixing in order to allow the placement of percussion sources at points in the stereo field. Again, a number of techniques used here are based on non-negative matrix and tensor factorisations [9, 10, 11], while others use template matching and adaptation [12, 13]. Other techniques make use of simple heuristics based on characteristics of both pitched and percussive instruments [14, 15].

The third category deals with the problem of extracting vocals from mixtures of instruments including both pitched instruments and percussion instruments. Again, it is evident that any upmixing system based on sound source separation requires the ability to place the vocal at a given place in the stereo field. Techniques used here include Bayesian approaches and matrix factorisation techniques [16, 17, 18, 19].

Fig. 1. Upmixing system flowchart

In order to create a separation-based upmixing system, techniques and algorithms from all of these categories will need to be used. The upmixing system proposed takes methods from all of these categories and applies them successively to generate the separated source signals. Figure 1 outlines the order in which the algorithms are applied to decompose the original signal. Firstly, a vocal separation algorithm is employed to separate the vocals from the other instrumentation in the track. Where vocal harmonies are present the vocals can be further processed using the pitch separation algorithm to separate to some extent the harmonies and add stereo width.

The signal containing the remaining instrumentation, including both pitched and percussion instruments, will then be split into separate percussion and pitched instrument signals, through use of techniques from the second category. An optional stage, depending on the upmixing requirements, includes the further decomposition of the percussion signal into sources containing separated percussion instruments, to allow further flexibility in the positioning of the percussion sources. Finally, the pitched instruments are further decomposed to separate the pitched instruments from each other. The proposed system is quite computationally intensive, and it can take between half an hour and an hour to process a 3 minute pop song. Upon completion of the separation process, the separated source signals can then be imported into a digital multitrack editor to create a stereo upmix of the original material by panning the separated signals to various points in the stereo field.

The following sections deal with details of the upmixing separation system. Section 3 discusses how the sources are reconstructed, as this is common to all the source separation techniques used. The following three sections then describe the algorithms deployed for vocal separation, percussion separation and pitched instrument separation. However, the techniques will not be presented in the order shown in Figure 1, as the technique used for vocal separation makes use of the algorithm for separating pitched and percussion sources, as well as that for separating pitched sources. Therefore, it is necessary to explain both of these algorithms in brief before proceeding to describe the vocal separation algorithm.

3. SIGNAL RECONSTRUCTION

The upmixing system makes use of a number of source separation algorithms, based on widely different techniques, to estimate separated source spectrograms. Regardless of the separation method, the same method is used to reconstruct the separated source signals. Rather than apply the original phase information to the separated mixture spectrograms, these spectrograms are instead used to create masks which are applied to the original complex valued spectrogram Y generated from the input signal:

Y_k = Y ⊗ (Q_k^2 / (Q_1^2 + · · · + Q_z^2))   (1)

where Y_k is the complex-valued spectrogram of the kth of z sources, Q_k is the estimated magnitude spectrogram of the kth source, ⊗ denotes elementwise multiplication, and all other operations are also performed elementwise. In effect, these masks perform a version of Wiener filtering on the input signal.

There are a number of reasons for the use of the above approach. Firstly, it has been observed to give more natural sounding separations than applying the original phase to the estimated spectrograms. Secondly, it ensures that the separated signals sum together to yield the original input signal. Given the multipass nature of the upmixing process, where separated signals are typically passed through another separation stage, the Wiener filtering approach ensures that no information from the original signal is lost at any point. The final set of separated signals obtained from the various stages will sum together to yield the original input signal. This has important effects on the perceptual quality of the final upmix created from the separated signals.
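Equation (1) and the reconstruction property described above can be sketched in a few lines of NumPy. This is an illustration rather than the paper's implementation; the small eps guard against division by zero is an added assumption:

```python
import numpy as np

def wiener_reconstruct(Y, Qs, eps=1e-12):
    """Mask the complex mixture spectrogram Y with the squared magnitude
    estimates Qs[k], following Eq. (1); eps guards against division by zero."""
    Q2 = np.stack([Q ** 2 for Q in Qs])     # Q_k^2 for each source
    denom = Q2.sum(axis=0) + eps            # Q_1^2 + ... + Q_z^2
    return [Y * (q2 / denom) for q2 in Q2]  # Y_k = Y (x) Q_k^2 / denom

# toy check that the separated spectrograms sum back to the mixture
rng = np.random.default_rng(0)
Y = rng.standard_normal((64, 100)) + 1j * rng.standard_normal((64, 100))
Qs = [np.abs(rng.standard_normal((64, 100))) for _ in range(3)]
Yks = wiener_reconstruct(Y, Qs)
```

Because the masks sum to (almost exactly) one in every time-frequency bin, the separated spectrograms sum back to the mixture, which is the property the multipass upmixing process relies on.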

Regardless of the techniques used, there will be artifacts in the separated signal. These will often be quite noticeable when the separated signals are played in isolation. However, if care is taken in the upmix, when played in conjunction with the other separated signals, the human auditory system will reintegrate the artifacts to their correct sources, yielding a stereo signal where no artifacts are audible.

However, it should be noted that in certain situations, if the chosen source position is too extreme, such as panning hard left or hard right, artifacts will become noticeable in the separation. This is because the positioning is too extreme for the human auditory system to successfully reintegrate the artifacts present in the other sources with the correct source. This varies from source to source, depending on the separation quality, and can sometimes result in limitations on the stereo width of the upmix for certain sources. The audible presence of artifacts can usually be avoided by panning the desired source as far as possible towards the desired location while listening to ensure no artifacts can be heard. For sources panned to centre positions this is not an issue, unless significant gain changes are involved.

Another by-product of the reconstruction technique is that it ensures that there will be no phase issues between the recovered signals. This means that the effect of source panning is created solely using differences in intensity. It should be noted that this also applies to using the original mixture phase with the separated source spectrograms. The lack of phase issues also ensures that the upmixes are fully mono compatible, i.e. a mono version of the upmix is equivalent to the original mono signal.
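The mono-compatibility claim can be checked directly: if the separated signals sum exactly to the original mono signal and each is panned with left/right gains that sum to one, the two channels fold back down to the original. The constant-sum pan law below is a deliberately simple assumption for illustration; many workstations default to constant-power panning, under which the fold-down would no longer be exact:

```python
import numpy as np

rng = np.random.default_rng(1)
# stand-ins for separated signals that sum exactly to the mono mix
sources = [rng.standard_normal(1000) for _ in range(3)]
mono = sum(sources)

# intensity panning: per-source left gain g, right gain 1 - g, so gL + gR = 1
pans = [0.8, 0.5, 0.2]  # toy convention: 1.0 = hard left, 0.0 = hard right
left = sum(g * s for g, s in zip(pans, sources))
right = sum((1 - g) * s for g, s in zip(pans, sources))
```

Summing `left` and `right` recovers `mono` exactly, since the phase-coherent sources pass through unchanged overall.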

4. DRUM SEPARATION USING MEDIAN FILTERING

Drum separation is performed using a median filtering based approach [20]. The underlying idea behind this method is that percussion instruments can be regarded as forming vertical ridges in spectrograms, while the harmonics of pitched instruments form horizontal ridges in the spectrogram. In the case of vertical ridges associated with percussion instruments, peaks associated with pitched instruments can be regarded as outliers. Therefore, removing these outliers will reduce the presence of pitched instruments in the spectrogram. Similarly, the transients associated with the onset of percussion instruments will be outliers in the horizontal ridges associated with the pitched instruments, and removing these outliers will reduce the effects of percussion instruments in the spectrogram.

A simple technique for the removal of outliers is the use of median filtering, where a given sample is replaced by the median of the values taken from a window around the sample. Where x(n) is the input vector, the output after median filtering y(n) is defined as:

y(n) = median{x(n − k : n + k)}, k = (l − 1)/2   (2)

where l is the length of the median filter, and l is odd. For even lengths, the mean of the two middle values in the sorted list is used. For an input magnitude spectrogram X, let X_i denote the ith time frame and X_h denote the hth frequency slice containing the values of the hth frequency bin across time. Then, percussion-enhanced frames and harmonic-enhanced slices can be obtained from:

P_i = M{X_i, l_perc}   (3)

H_h = M{X_h, l_harm}   (4)

where M denotes median filtering, and where l_perc and l_harm are the lengths of the percussion-enhancing and harmonic-enhancing median filters respectively. The percussion-enhanced frames P_i are then combined to yield a percussion-enhanced spectrogram P. Similarly, the frequency slices H_h are combined to yield a harmonic-enhanced spectrogram H. These spectrograms can then be used to generate masks which are applied to the original complex valued spectrogram before inversion to the time domain to recover the separated sources in the manner described in Section 3.
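Equations (3) and (4) amount to two median filters with perpendicular windows over the magnitude spectrogram. The sketch below uses SciPy on a toy spectrogram with one horizontal ridge (a harmonic) and one vertical ridge (a drum hit); the filter lengths are illustrative assumptions, not the settings from [20]:

```python
import numpy as np
from scipy.ndimage import median_filter

def enhance(X, l_perc=17, l_harm=17):
    """Sketch of Eqs. (3)-(4): X is a magnitude spectrogram (freq x time)."""
    # median over frequency within each frame: percussion-enhanced P
    P = median_filter(X, size=(l_perc, 1), mode='nearest')
    # median over time within each frequency bin: harmonic-enhanced H
    H = median_filter(X, size=(1, l_harm), mode='nearest')
    return P, H

X = np.zeros((64, 64))
X[20, :] = 1.0   # pitched harmonic: constant frequency across time
X[:, 40] = 1.0   # percussion onset: broadband energy in one frame
P, H = enhance(X)
```

P and H would then be converted to masks as in Section 3 and applied to the original complex spectrogram: the harmonic ridge survives only in H, the percussive ridge only in P.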

5. PITCHED INSTRUMENT SEPARATION

The algorithm for pitched instrument separation is based on the tensor factorisation approaches described in [21] and [22]. This is a general separation algorithm, capable of separating pitched and percussive instruments from n-channel mixtures, though it is not effective at separating vocals from pitched instruments. Here we deal with a slightly simplified single channel case of these algorithms, where shifts in time have been eliminated from the model.

In the following, tensors are denoted by upper-case calligraphic letters such as A. Contracted tensor multiplication is denoted by 〈AB〉{a,b}, where A and B are the tensors to be multiplied. Here, contracted tensor multiplication takes place on the dimensions a and b of A and B respectively. Indexing of elements within a tensor is notated by A(i, j) as opposed to using subscripts, following the conventions used in the Tensor Toolbox for Matlab, which was used in the implementation of the algorithm [23]. Given an input spectrogram X of size n × m, then X is modelled as:

X ≈ X̂ = Σ_{k=1}^{K} 〈〈〈FH〉{2,1}W〉{3,1}S〉{2,1} + Σ_{l=1}^{L} 〈BC〉{2,1}   (5)

The first right-hand side term models pitched instruments, and the second unpitched or percussion instruments. K denotes the number of pitched instruments and L denotes the number of unpitched instruments. As all tensors are instrument-specific, for ease of notation, the subscripts k and l are implicit in all tensors within the respective summations.

The pitched instrument model is a source-filter model where each source is modelled as a weighted sum of sinusoids. These are then filtered by a filter which attempts to mimic the formant structure of the instrument. Here the formant filter is modelled by F, an n × n diagonal matrix. H contains a dictionary of sinusoids, where H(:, i, j) contains the frequency spectrum of a sinusoid with frequency equal to the jth harmonic of the ith note of the instrument. H is a tensor of size n × z_k × h_k, where z_k and h_k are respectively the number of allowable notes and the number of harmonics used to model the kth instrument. W contains a set of weights of size h_k which describe how the sinusoids that make up a note played by an instrument are weighted to approximate the instrument timbre. S is a tensor of size z_k × m which contains the activations of the z_k notes played by the instrument.

The unpitched instruments are modelled as the product of a source frequency basis function B of size n × 1 with a time activation basis function C of size 1 × m. For separation of pitched instruments only, this part can obviously be eliminated from consideration. However, as this part of the model is used as part of the vocal separation algorithm, it is included here for completeness.
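For readers who prefer code to contracted-product notation, the pitched-instrument term of Eq. (5) for a single instrument collapses to one einsum over the dimensions given above. The sizes and random contents below are illustrative only:

```python
import numpy as np

# illustrative sizes (assumptions): n freq bins, z notes, h harmonics, m frames
n, z, h, m = 32, 5, 4, 20
rng = np.random.default_rng(2)

F = np.diag(rng.random(n))   # formant filter, n x n diagonal
H = rng.random((n, z, h))    # sinusoid dictionary, n x z x h
W = rng.random(h)            # harmonic weights, length h
S = rng.random((z, m))       # note activations, z x m
B = rng.random((n, 1))       # unpitched frequency basis function
C = rng.random((1, m))       # unpitched time activations

# <<<F H>{2,1} W>{3,1} S>{2,1}: contract F with H over frequency, then the
# harmonic dimension with W, then the note dimension with S
X_pitched = np.einsum('af,fzh,h,zm->am', F, H, W, S)
X_hat = X_pitched + B @ C    # Eq. (5) with K = L = 1
```

With all factors non-negative, the model output is non-negative, as the multiplicative update scheme requires.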

Multiplicative update equations can be derived using the generalised Kullback-Leibler divergence as a cost function, in a manner similar to that shown in [21]. The individual source spectrograms can then be recovered by combining the tensors associated with the kth source to yield a source spectrogram X_k. These source spectrograms are then used to create Wiener filters which are applied to the original complex-valued spectrogram in the manner described in section 3.

6. VOCAL SEPARATION

The vocal separation algorithm used is the algorithm described in [24]. This algorithm makes use of the drum separation technique described in section 4 and the tensor factorisation technique described in section 5, as well as a further cofactorisation stage. The vocal separation algorithm takes advantage of the fact that for magnitude spectrograms obtained from large window sizes with high frequency resolution, vocals will appear as locally broadband noise, while for low frequency resolution the vocals will appear as harmonic in nature. This is in contrast to percussion instruments, which will appear as broadband noise regardless of what frequency resolution is used, and to pitched instruments, which will appear as harmonic regardless of the frequency resolution used.

It follows then that using the median filter based separation algorithm at high frequency resolution, such as with an FFT size of 16384 samples, on a signal containing drums, vocals and pitched instruments will result in two signals, one containing mainly drums and vocals, and another containing mainly pitched instruments. The drums and vocals signal can then again be processed by the median filtering separation algorithm, this time at low frequency resolution, yielding a separated drum signal and a separated vocal signal. However, it was observed that using a Constant Q transform [25] typically gave better separations than using a low frequency resolution STFT.
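The two-pass procedure just described can be sketched end-to-end with STFTs at two resolutions, combining the median filtering of Section 4 with the Wiener-style masks of Section 3. This is a simplified stand-in: the paper prefers a constant-Q transform [25] for the second pass, and the filter length and low-resolution FFT size here are illustrative assumptions:

```python
import numpy as np
from scipy.ndimage import median_filter
from scipy.signal import istft, stft

def hp_split(x, fs, nfft, l=17):
    """One harmonic/percussive pass: median filtering plus complementary
    Wiener-style masks. Returns (harmonic, percussive) time signals."""
    _, _, Z = stft(x, fs, nperseg=nfft)
    mag = np.abs(Z)
    P = median_filter(mag, size=(l, 1), mode='nearest')  # percussion-enhanced
    H = median_filter(mag, size=(1, l), mode='nearest')  # harmonic-enhanced
    mask_h = H ** 2 / (H ** 2 + P ** 2 + 1e-12)
    _, xh = istft(Z * mask_h, fs)
    _, xp = istft(Z * (1.0 - mask_h), fs)
    return xh[:len(x)], xp[:len(x)]

fs = 44100
x = np.random.default_rng(3).standard_normal(2 * fs)  # stand-in for a mono mix

# pass 1, high resolution: vocals look broadband, so they stay with the drums
pitched, drums_and_vox = hp_split(x, fs, nfft=16384)

# pass 2, low resolution: vocals now look harmonic and separate from the drums
vocals, drums = hp_split(drums_and_vox, fs, nfft=1024)
```

Because the two masks in each pass sum to one, the outputs of every pass sum back to its input, so no information is discarded between stages.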

After this vocal separation has been obtained, there will still be artifacts in the separated vocal. In the case of a typical pop song this would include elements of the drums and often some trace of the bass guitar. To reduce the effects of these artifacts, the separated vocal is then passed through the tensor factorisation based algorithm described in section 5, resulting in an improved vocal separation.

Further improvements can be obtained by using this separated vocal to perform a partial cofactorisation on the original input signal. Partial non-negative matrix cofactorisation was originally proposed as a means of separating drums from mixtures of pitched instruments [26] and was adapted in [24] for the purposes of vocal separation. Here, the original mixture spectrogram and a spectrogram of the separated vocal are decomposed simultaneously, while sharing some frequency basis functions between the two spectrograms, thereby forcing some basis functions to be associated with vocals only, while allowing the remaining basis functions to adapt to other sources in the mixture. This model can be expressed as:

X̂ = A_T S_T + A_V S_V   (6)

and

D̂ = A_V S_V1   (7)

where X is the mixture spectrogram and D is the separated vocal spectrogram. A_T and S_T contain the frequency and time basis functions for the instrumental track containing both pitched instruments and percussion. A_V then contains the common frequency basis functions between the two input matrices associated with the vocals. S_V and S_V1 contain the time basis functions for the vocal frequency basis functions for matrices X and D respectively. Then, again using the generalised Kullback-Leibler divergence as a cost function, update equations can be derived for the model parameters. The resulting vocal separation and separated backing track obtained using this technique typically contain fewer artifacts than the original vocal separation used as input.
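A minimal NumPy sketch of Eqs. (6)-(7) is given below, using standard multiplicative updates for the joint generalised Kullback-Leibler cost; the shared A_V receives gradient contributions from both X and D, which is what ties it to the vocal. The ranks, iteration count and eps guard are illustrative assumptions, not values from [24] or [26]:

```python
import numpy as np

def partial_cofactorise(X, D, r_t=8, r_v=4, n_iter=100, seed=0):
    """Sketch: X ~ AT@ST + AV@SV and D ~ AV@SV1 with AV shared.
    Multiplicative updates for the summed generalised KL divergence."""
    eps = 1e-12
    rng = np.random.default_rng(seed)
    n, m = X.shape
    AT = rng.random((n, r_t)); ST = rng.random((r_t, m))
    AV = rng.random((n, r_v)); SV = rng.random((r_v, m)); SV1 = rng.random((r_v, m))
    for _ in range(n_iter):
        Xh = AT @ ST + AV @ SV + eps
        AT *= ((X / Xh) @ ST.T) / (np.ones_like(X) @ ST.T + eps)
        Xh = AT @ ST + AV @ SV + eps
        ST *= (AT.T @ (X / Xh)) / (AT.T @ np.ones_like(X) + eps)
        # AV is updated with contributions from both approximations
        Xh = AT @ ST + AV @ SV + eps
        Dh = AV @ SV1 + eps
        AV *= ((X / Xh) @ SV.T + (D / Dh) @ SV1.T) / (
              np.ones_like(X) @ SV.T + np.ones_like(D) @ SV1.T + eps)
        Xh = AT @ ST + AV @ SV + eps
        Dh = AV @ SV1 + eps
        SV *= (AV.T @ (X / Xh)) / (AV.T @ np.ones_like(X) + eps)
        SV1 *= (AV.T @ (D / Dh)) / (AV.T @ np.ones_like(D) + eps)
    return AT, ST, AV, SV, SV1

def kl(V, Vh):
    """Generalised Kullback-Leibler divergence between non-negative matrices."""
    eps = 1e-12
    return np.sum(V * np.log((V + eps) / (Vh + eps)) - V + Vh)

rng = np.random.default_rng(4)
X = rng.random((40, 60))   # stand-in mixture magnitude spectrogram
D = rng.random((40, 60))   # stand-in separated-vocal magnitude spectrogram
AT, ST, AV, SV, SV1 = partial_cofactorise(X, D)
```

The improved vocal estimate would then be A_V S_V, converted to a mask as in Section 3.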

At each stage in the vocal separation process, the Wiener filtering technique is again used to generate the source signals. Once the final separated vocal has been obtained, the separated instrumental track is then separated using firstly the drum separation algorithm, and then the individual instruments are separated using the tensor factorisation algorithm, yielding the separated sources for use in upmixing.

7. UPMIXING EXAMPLES

Having described the various sound source separation techniques used in the upmixing separation system, we now present real-world examples of upmixes created using the separations obtained from the system. The separated signals were imported into a digital audio workstation, and were panned to various positions to create the stereo upmix.

Figure 2 shows the spectrogram of an excerpt taken from the original mono recording of “Good Vibrations” by the Beach Boys. Here the instrumentation consists of piano, mixed percussion, theremin and bass guitar. Figure 3 shows the spectrogram of the left channel of the stereo upmix, while figure 4 shows the spectrogram of the right channel of the upmix. In this case, the piano was panned to a mid-left position, while the percussion was panned hard right, the theremin to mid right and the bass guitar to the centre of the stereo field.

Fig. 2. Spectrogram of an excerpt from “Good Vibrations” - Original Mono Recording. Instrumentation includes piano, mixed percussion, theremin and bass guitar.

It can be seen that the presence of percussion is very much reduced in the left channel spectrogram in comparison with the original mono signal, while the presence of the percussion can be clearly seen in the right channel. The harmonics of the piano can clearly be seen in both left and right spectrograms, though it is clearly louder in the left channel than the right. The theremin is visible as a modulating sinusoid near the bottom right corner of both the original spectrogram and the right channel of the upmix, and indeed can be more clearly seen in the right channel of the upmix than in the original mono mix. The low frequency energy of the bass guitar can be seen to be equally loud in both left and right spectrograms of the upmix.

Fig. 3. Spectrogram of an excerpt from “Good Vibrations” - Stereo Upmix Left Channel. The piano and bass are the predominant instruments in this channel.

On listening to the stereo upmix, a wide stereo image can be heard. This is particularly noticeable when swapping between the original mono mix and the stereo upmix. No artifacts can be heard in the created upmix, and the sources can clearly be identified as having come from a given position in the stereo field. The stability of the positioning of the sources is quite good throughout the upmix, with little or no source drift evident upon listening. This demonstrates the potential of the upmixing system in creating realistic stereo upmixes from mono real-world recordings. A longer version of the excerpt used in these figures is available for listening at http://eleceng.dit.ie/derryfitzgerald/index.php?uid=489&menu_id=54. Other upmixing examples are also available for listening from this page.

8. CONCLUSIONS

Having outlined the reasons for using sound source separation for the purposes of upmixing from mono to stereo, we then dealt with considerations that need to be taken into account in order to achieve successful upmixing using sound source separation. Following from this, an overview of the upmixing separation system was presented, and a description of the signal reconstruction method given. The various sound separation technologies used in the system were then briefly discussed. Finally, the effectiveness of the system for upmixing was demonstrated on real-world mono recordings.

Fig. 4. Spectrogram of an excerpt from “Good Vibrations” - Stereo Upmix Right Channel. The percussion, theremin and bass are the predominant instruments in this channel.

Future work will concentrate on improving the quality of the separations obtained from the various algorithms, particularly with a view to the considerations highlighted in this paper. Work will also be carried out in performing listening tests on upmixes obtained from a mono version of a multitrack recording, in comparison with actual stereo mixes made from the multitrack. It is hoped that this will allow quantification of the performance of the upmixing system against actual stereo mixes.

9. REFERENCES

[1] M. Schroeder, "An artificial stereophonic effect obtained from a single audio signal," Journal of the Audio Engineering Society, vol. 6, no. 2, pp. 74–80, 1958.

[2] R. Orban, "A rational technique for synthesizing pseudo-stereo from monophonic sources," Journal of the Audio Engineering Society, vol. 18, pp. 157–165, 1970.

[3] D. Barry and G. Kearney, "Localization quality assessment in source separation-based upmixing algorithms," in AES 35th International Conference, London, 2009.

[4] E. Vincent and X. Rodet, "Underdetermined source separation with structured source priors," in 5th International Conference on Independent Component Analysis and Blind Signal Separation, 2004.

[5] T. Virtanen, "Separation of sound sources by convolutive sparse coding," in Proc. of ISCA Tutorial and Research Workshop on Statistical and Perceptual Audio Processing, 2004.

[6] P. Smaragdis, "Discovering auditory objects through non-negativity constraints," in Proc. of ISCA Tutorial and Research Workshop on Statistical and Perceptual Audio Processing, 2004.

[7] Z. Duan, Y. Zhang, C. Zhang, and Z. Shi, "Unsupervised single-channel music source separation by average harmonic structure modeling," IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 4, pp. 766–778, May 2008.

[8] T. Virtanen, Sound Source Separation in Monaural Music Signals, Ph.D. thesis, Tampere University of Technology, Tampere, Finland, 2006.

[9] O. Gillet and G. Richard, "Transcription and separation of drum signals from polyphonic music," IEEE Transactions on Audio, Speech and Language Processing, vol. 16, no. 3, pp. 529–540, 2008.

[10] M. Helen and T. Virtanen, "Separation of drums from polyphonic music using non-negative matrix factorisation and support vector machine," in Proc. European Signal Processing Conference, Antalya, Turkey, 2005.

[11] D. FitzGerald, E. Coyle, and M. Cranitch, "Using tensor factorisation models to separate drums from polyphonic music," in Proc. of 12th International Conference on Digital Audio Effects (DAFX09), Como, Italy, Sept. 2009.

[12] K. Yoshii, M. Goto, and H. Okuno, "Drum sound recognition for polyphonic audio signals by adaptation and matching of spectrogram templates with harmonic structure suppression," IEEE Transactions on Audio, Speech and Language Processing, vol. 15, no. 1, pp. 333–345, 2007.

[13] K. Yoshii, M. Goto, and H. Okuno, "INTER:D: a drum sound equaliser for controlling volume and timbre of drums," in Proc. European Workshop on the Integration of Knowledge, Semantic and Digital Media Technologies, 2005.

[14] D. Barry, D. FitzGerald, and E. Coyle, "Drum source separation using percussive feature detection and spectral modulation," in Proc. IEE Irish Signals and Systems Conference, 2005.

[15] N. Ono, K. Miyamoto, J. Le Roux, H. Kameoka, and S. Sagayama, "Separation of a monaural audio signal into harmonic/percussive components by complementary diffusion on spectrogram," in Proceedings of the 16th European Signal Processing Conference, 2008.

[16] A. Ozerov, P. Phillipe, F. Bimbot, and R. Gribonval, "Adaptation of Bayesian models for single channel source separation and its application to voice/music separation in popular songs," IEEE Transactions on Audio, Speech and Language Processing, 2007.

[17] S. Vembu and S. Baumann, "Separation of vocals from polyphonic audio recordings," in Proc. Int. Symp. Music Inf. Retrieval (ISMIR05), 2005.

[18] B. Raj, P. Smaragdis, M. Shashanka, and R. Singh, "Separating a foreground singer from background music," in Proc. Int. Symp. Frontiers Research Speech Music (FRSM).

[19] C. Hsu and J. Jang, "On the improvement of singing voice separation for monaural recordings using the MIR-1K dataset," IEEE Transactions on Audio, Speech and Language Processing, 2010.

[20] D. FitzGerald, "Harmonic/percussive separation using median filtering," in Proc. of 13th International Conference on Digital Audio Effects (DAFX10), Graz, Austria, Sept. 2010.

[21] D. FitzGerald, M. Cranitch, and E. Coyle, "Extended non-negative tensor factorisation models for musical sound source separation," Computational Intelligence and Neuroscience, vol. 2008, Article ID 872425, 2008.

[22] D. FitzGerald, M. Cranitch, and E. Coyle, "Musical source separation using generalised non-negative tensor factorisation models," in Workshop on Music and Machine Learning, International Conference on Machine Learning, Helsinki.

[23] B. Bader and T. Kolda, "Matlab Tensor Toolbox version 2.2," 2007.

[24] D. FitzGerald and M. Gainza, "Single channel vocal separation using median filtering and factorisation techniques," ISAST Transactions on Electronic and Signal Processing, vol. 4, no. 1, pp. 62–73, 2010.

[25] C. Schoerkhuber and A. Klapuri, "Constant-Q transform toolbox for music processing," in 7th Sound and Music Computing Conference, Barcelona, Spain, 2010.

[26] J. Yoo, M. Kim, K. Kang, and S. Choi, "Nonnegative matrix partial co-factorization for drum source separation," in Proc. of the IEEE Conference on Acoustics, Speech, and Signal Processing, 2010.