Automatic Music Stretching Resistance Classification Using Audio Features and Genres

Jun Chen and Chaokun Wang, Member, IEEE

Abstract—Music stretching resistance (MSR) is a fresh but important concept in audio signal processing, which characterizes the ability of a music piece to be stretched in time (compressed or elongated) without objectionable perceptual artifacts. It has the potential to be in high demand in various multimedia applications like music resizing, audio editing and multimedia integration, but there is almost no prior knowledge about this property of music in the literature. In this letter, the task of MSR estimation is formulated for the first time, and an MSR classification method that employs metric learning on audio features and genres is proposed. It attempts to automatically determine the humanly acceptable time-stretching rate range of a music piece. The proposed method outperforms the reference classification methods in accuracy in the comparative experiments.

Index Terms—Audio features, metric learning, music stretching resistance (MSR), musical genre.

I. INTRODUCTION

Music resizing, a branch of content-aware music adaptation [1], enables general users to compress or elongate a given music piece to a preferred time length, say 0.80 times as long as the original, to meet requirements like audio-video synchronization, animation production, etc. However, current music resizing approaches [1]–[3] and the related methods [5]–[7] suffer from perceptual artifacts, and cannot guarantee the human acceptance of the resized music pieces when over-compressed or over-elongated. Among these existing methods, LyDAR [3] carries out compression based on lyrics density but fails to support low-rate compression, which degrades the quality of the output music piece. [1], [2] simply employ a naïve approximation of the human acceptable time-stretching range [4], and therefore set the minimum compressing rate $r_{\min}$ and the maximum elongating rate $r_{\max}$ to 0.80 and 1.20, respectively, for each song in the experiments. This unified estimation fails to consider and utilize the characteristics of each individual music piece, i.e., pitch, timbre, and musical genre.

Manuscript received August 26, 2013; accepted October 08, 2013. Date of publication October 17, 2013; date of current version October 24, 2013. This work was supported by the National Natural Science Foundation of China under Grants 61373023, 61170064, and 60803016, the National High Technology Research and Development Program of China under Grant 2013AA013204, and by the Opening Foundation of Key Laboratory of Intelligent Information Processing, ICT, CAS under Grant IIP 2012-5. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Ali Taylan Cemgil. The authors are with the School of Software, Tsinghua University, Beijing 100084, China (e-mail: [email protected]; [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/LSP.2013.2286200

The same issue also exists in dynamic re-scaling of music [5] and time-scale modification of music [6], [7], where the boundaries of stretching or re-scaling of music pieces are unknown.

All of these problems call for the automatic estimation of MSR, a.k.a. stretch-resistance in [2], which is composed of the minimum compressing rate $r_{\min}$ and the maximum elongating rate $r_{\max}$, denoted as MSR $= (r_{\min}, r_{\max})$. MSR characterizes the ability of a music piece to be stretched within the bounds of human auditory and psychological acceptance. To the best of our knowledge, there is no prior research addressing this issue. The estimation of MSR contributes to the selection of the global stretching threshold for a given music piece, based on which the resizing algorithm is scheduled [3]. Moreover, the idea of MSR can be refined from coarse granularity (music pieces) to fine granularity (music segments or sentences), based on which non-homogeneous music resizing can take the most advantage of stretchability [1]–[3]. Besides, MSR shapes the boundaries for dynamic music re-scaling [5] in the time domain and for time-scale modification of music [6], [7]. It may also be applied in music thumbnail generation when the time-stretching technique is used.

In this letter, we report our research on MSR by proposing a classification method, based on the metric learning technique and using audio features and genres, to estimate the MSR value of a given music piece. Audio features and genres are involved to capture the characteristics of music pieces and their cultural backgrounds [8]–[11]. The human factor is taken into account by utilizing a beforehand user-tagged MSR dataset. The comparative experimental results show that our method provides a much more reliable estimation of MSR than the reference approaches. All the music pieces in the experiments are stretched using the synchronized overlap-add (SOLA) method [6] as implemented in SoundTouch (www.surina.net/soundtouch/). However, the proposed algorithm is independent of the time-stretching technique, and it could also be applied in other scenarios using different time-stretching methods like phase vocoding, pitch-synchronous overlap-add, and waveform-similarity-based overlap-add [7].

The rest of this letter is structured as follows. A brief introduction to our experimental settings is given in Section II. The algorithm that generates the MSR classes for classification is discussed in Section III. Feature selection is demonstrated in Section IV, followed by a detailed description of the proposed classification algorithm in Section V. The experimental results and conclusion are provided in Sections VI and VII, respectively.



TABLE I. MUSICAL GENRE DISTRIBUTION AMONG THE SONG COLLECTION.

II. EXPERIMENTAL SETTINGS

We conducted our experiments on a collection of 894 songs downloaded from mp3.baidu.com, evenly covering the 11 musical genres shown in Table I. Meanwhile, we invited 17 volunteers (5 female and 12 male), aged from 18 to 25, from Tsinghua University to take part in listening experiments. All of them are non-musicians, though 2 of them had received piano instruction for more than 3 years; we considered this mix likely to reflect the general situation in real life. During every trial, each participant listened to a given song and its stretched versions one after another. Each song was stretched on a discrete scale with a rate step of 0.02, i.e., compressed versions ($r < 1.00$) and elongated versions ($r > 1.00$). After listening, the participant judged the MSR ($r_{\min}$, $r_{\max}$) for the given music piece based on her own music perception and acceptance, and input it into our experimental webpages. To reduce the error caused by participants' individual differences, we provided them with some judging criteria, for example, extremely high/low singing speed, overly dense/sparse lyrics density, and uncomfortable tremolo or guttural sounds. Participants were chosen according to two requirements: they should enjoy listening to music in daily life, and they were asked to conduct the rewarded experiment for at least half an hour a day. MSR rates on the same song across different participants are majority-voted before being added to our user-tagged MSR dataset.
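As a concrete illustration of this setup, the following is a minimal sketch of how the stretched versions and the fused tags could be produced. It assumes WAV inputs and the soundstretch command-line tool that ships with SoundTouch; the rate grid endpoints (0.80–1.20) are an assumption for illustration, since the letter only states the 0.02 step, and all function names are ours.

```python
import subprocess
from collections import Counter

# Illustrative rate grid: 0.80 .. 1.20 in steps of 0.02 (the endpoints are
# an assumption; the letter only specifies the 0.02 step).
RATES = [round(0.80 + 0.02 * i, 2) for i in range(21)]

def stretch(in_wav: str, out_wav: str, rate: float) -> None:
    """Time-stretch a WAV file to `rate` times its original length with
    SoundTouch's soundstretch tool. A rate of 0.80 (20% shorter) means
    the tempo must rise by 1/0.80 - 1 = +25%."""
    tempo_percent = (1.0 / rate - 1.0) * 100.0
    subprocess.run(["soundstretch", in_wav, out_wav,
                    f"-tempo={tempo_percent:+.1f}"], check=True)

def majority_vote(tags):
    """Fuse a list of (r_min, r_max) tags from several participants by
    taking the most frequent value of each rate independently."""
    r_min = Counter(t[0] for t in tags).most_common(1)[0][0]
    r_max = Counter(t[1] for t in tags).most_common(1)[0][0]
    return r_min, r_max
```

For instance, `stretch("song.wav", "song_090.wav", 0.90)` would produce a version 10% shorter than the original.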

III. CLASSES GENERATION

Rather than giving an exact numerical estimation of MSR ($r_{\min}$, $r_{\max}$), we considered it more reasonable to separately provide an estimated range for each of $r_{\min}$ and $r_{\max}$, in view of errors. Each estimated range corresponds to a predefined class generated by discretizing $r_{\min}$ and $r_{\max}$ using histogram analysis [15]. We illustrate the MSR histogram of our song collection in Fig. 1, based on the user-tagged MSR dataset introduced in Section II. A Gaussian-like distribution is observed for both $r_{\min}$ and $r_{\max}$, based on which the MSR classes are generated.

Fig. 1. The distribution histogram of $r_{\min}$ and $r_{\max}$ in the song collection.

According to this observation, we used a greedy equal-frequency histogram method to separately discretize the rate axis into non-overlapping rate ranges for both $r_{\min}$ and $r_{\max}$. The method is implemented as follows (a code sketch is given at the end of this section):

1) Given the desired number of classes $c$, the minimum number of music pieces for a class is defined as $s = \lfloor n/c \rfloor$, where $n$ is the total number of music pieces in the collection. In our case $n = 894$, and we set $c = 5$ because it discriminates the MSR of the music pieces in our song collection well enough, while a larger or smaller $c$ leads to a more unbalanced or an ineffective separation of MSR in our experiments. Thus, $s = \lfloor 894/5 \rfloor = 178$ for our song collection.

2) Find the peak among the un-added bins of the histogram within the rate range [0.00, 1.00) for $r_{\min}$ (or [1.00, 2.00) for $r_{\max}$) and assign it to a new bucket. Add up the piece count by joining the higher of the two adjacent un-added bins of this bucket, until the size of the bucket exceeds $s$ or no adjacent bins are available. Output a new class which spans this bucket.

3) Repeat Step 2) on the rest of the histogram to generate more classes for both $r_{\min}$ and $r_{\max}$, until the whole histogram is covered.

More or fewer than $c$ classes may be generated due to the greedy strategy. However, the method still provides a fair way to discretize MSR considering its distribution in the collection. In our experiment, the rate step between two adjacent bins is 0.02, which is a high enough resolution for human perception of music. Table II shows the MSR discretization results, where five classes have been generated for each of $r_{\min}$ and $r_{\max}$. Thus, each song in the collection corresponds to both an $r_{\min}$ class and an $r_{\max}$ class.

TABLE II. THE GENERATED CLASSES FOR $r_{\min}$ AND $r_{\max}$. LOWER BOUND AND UPPER BOUND REPRESENT THE MINIMUM AND MAXIMUM RATE VALUES OF EACH LEFT-CLOSED, RIGHT-OPEN INTERVAL, RESPECTIVELY.
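As referenced above, here is a minimal Python sketch of the greedy equal-frequency discretization, under stated assumptions: the histogram is supplied as a dict mapping each bin's left edge (rounded to two decimals) to its piece count, and the function name and signature are ours rather than from the letter.

```python
def generate_classes(counts, lo, hi, step=0.02, c=5):
    """Greedy equal-frequency discretization of an MSR histogram (Sec. III).
    `counts` maps a bin's left edge (e.g. 0.80) to the number of pieces
    tagged with that rate; bins span [lo, hi) with the given step.
    Returns a sorted list of (lower_bound, upper_bound) left-closed,
    right-open intervals."""
    n = sum(counts.values())
    s = n // c                                   # min pieces per class
    nbins = round((hi - lo) / step)
    edges = [round(lo + i * step, 2) for i in range(nbins)]
    unused = set(edges)
    classes = []
    while unused:
        # Step 2: seed a bucket at the tallest remaining bin ...
        peak = max(unused, key=lambda e: counts.get(e, 0))
        bucket = {peak}
        unused.remove(peak)
        size = counts.get(peak, 0)
        # ... then grow it by joining the taller adjacent un-added bin
        # until it holds at least s pieces or runs out of neighbours.
        while size < s:
            left = round(min(bucket) - step, 2)
            right = round(max(bucket) + step, 2)
            nbrs = [e for e in (left, right) if e in unused]
            if not nbrs:
                break
            nxt = max(nbrs, key=lambda e: counts.get(e, 0))
            unused.remove(nxt)
            bucket.add(nxt)
            size += counts.get(nxt, 0)
        classes.append((min(bucket), round(max(bucket) + step, 2)))
    # Step 3 is the outer loop: repeat until the histogram is covered.
    return sorted(classes)
```

With $n = 894$ and $c = 5$, the bucket threshold is $s = 178$; as noted above, the greedy growth may still yield more or fewer than $c$ classes. For the $r_{\min}$ histogram one would call `generate_classes(counts, 0.00, 1.00)`.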

IV. FEATURE SELECTION

A variety of audio features have been proposed, among which timbre, pitch, and rhythm are the most commonly used [8]–[10], [12]. We conducted comparative experiments on different audio features and their combinations with various classifiers. In all, five types of audio features—mel-frequency cepstral coefficients (MFCC, 13 dims), spectral information and zero-crossing rate (4 dims; spectral information includes centroid, rolloff, and flux), chroma (14 dims), tempo (1 dim), and pitch contour—have been extracted for each song using Marsyas (http://marsyas.info).


Fig. 2. The accuracy of (a) $r_{\min}$ classification and (b) $r_{\max}$ classification using different audio features and their combinations with various classifiers.

For a given music piece, we use the means of the frame-level features (MFCC, spectral information and zero-crossing rate, and chroma) as its global features. A pitch histogram [12] (4 dims) is constructed as the replacement of the pitch contour in the global feature representation. For a feature $f$, let $v_f$ denote its vector, and let $v_{fg}$ denote the vector of the combination of features $f$ and $g$, obtained by appending $v_g$ at the tail of $v_f$; the other features and combinations are denoted in the same way. All the feature vectors are normalized into the range [0, 1] using the min-max normalization method.
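The following is a minimal sketch of this global-feature construction, assuming the frame-level features have already been extracted (e.g., with Marsyas) into NumPy arrays; the function names are ours.

```python
import numpy as np

def global_feature(frames: np.ndarray) -> np.ndarray:
    """Collapse frame-level features (n_frames x n_dims) to a song-level
    vector by taking per-dimension means, as done for MFCC etc."""
    return frames.mean(axis=0)

def combine(*vectors: np.ndarray) -> np.ndarray:
    """Concatenate feature vectors tail-to-tail to form a combination."""
    return np.concatenate(vectors)

def min_max_normalize(X: np.ndarray) -> np.ndarray:
    """Column-wise min-max normalization of the song-by-feature matrix X
    into [0, 1]; constant columns are left at 0."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / np.where(hi > lo, hi - lo, 1.0)
```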

Fig. 2 illustrates the results of the classification experiments conducted on the extracted audio features with the state-of-the-art classifiers [8]–[10] implemented in Weka (http://www.cs.waikato.ac.nz/ml/weka/). From Fig. 2(a) and 2(b), we conclude that one feature combination is clearly the most effective for $r_{\min}$ classification, while for $r_{\max}$ classification two combinations perform equally well and outperform the others, of which we choose the more expressive one. Thus, in our proposed classification method, we use the respective best-performing combinations as the feature representations of $r_{\min}$ and $r_{\max}$. Notice that the feature selections of $r_{\min}$ and $r_{\max}$ are different. In particular, the addition of MFCC degrades the performance in estimating $r_{\max}$. Since MFCC approximates the response of the human auditory system well and is widely used in speech recognition [13], we may conclude that human hearing is more sensitive to voices under speech acceleration than under deceleration. Meanwhile, KNN and SVM dominate the other classifiers in accuracy under all conditions in Fig. 2. Thus, optimization based on KNN or SVM is considered more appropriate in building MSR classifiers.

V. CLASSIFICATION METHOD

In line with the feature selection, we separately built two classifiers for $r_{\min}$ and $r_{\max}$ based on their respective feature selections. The classifiers are optimized using the metric learning technique, which assigns proper weights to the different dimensions of the features. To achieve better discrimination of MSR, we also employ musical genres in the distance measurement.

A. Learning Audio Feature Weights

Metric learning [14] is based on the weighted distance measurement shown in Equation (1), where $A$ is a real-valued positive semidefinite matrix, and $x_i$ and $x_j$ are the feature-representation vectors of the $i$th and $j$th music pieces in the song collection:

$$d_A(x_i, x_j) = \sqrt{(x_i - x_j)^{\mathsf T} A\,(x_i - x_j)} \quad (1)$$

Since the matrix $A$ is obtained through learning, for ease of calculation, $A$ is usually restricted to a diagonal matrix whose diagonal elements are all nonnegative. These nonnegative elements measure the importance of the corresponding feature dimensions for the classification of MSR. We defined the objective function as:

$$g(A) = \sum_{(x_i, x_j) \in \text{Same-Class}} d_A^2(x_i, x_j) \;-\; \log\!\Big(\sum_{(x_i, x_j) \notin \text{Same-Class}} d_A(x_i, x_j)\Big) \quad (2)$$

where Same-Class indicates that two music pieces have the same MSR class. Equation (2) requires that the overall distance among MSR-similar pieces be relatively small while the overall distance among MSR-dissimilar pieces be larger. Since the number of pairs whose elements have the same MSR class is much smaller than the number of pairs whose elements have different MSR classes, according to the MSR discretization results in Table II, the $\log$ operator is applied. It weakens the impact of inter-class dissimilarity on the distance measurement and therefore enhances the importance of intra-class similarity. Thus, the matrix $A$ can be learned by the gradient descent method, which iterates Equation (3) on the user-tagged MSR training set:

$$A \leftarrow A - \eta\,\frac{\partial g(A)}{\partial A} \quad (3)$$

To avoid over-learning, a threshold on the maximum number of iterations, or on the minimum difference of $g(A)$ values between two adjacent iterations, should be set according to the dataset and actual demands. The learned matrix $A$ is near-optimal, and it underlines the importance of the different audio features, and even of individual feature dimensions, in shaping MSR. The improvement in classification accuracy with a metric-learned $A$ compared with an identity matrix is discussed in Section VI.
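Below is a compact sketch of this learning procedure as we read Equations (1)–(3): a diagonal $A$ is represented by its diagonal vector `a` and learned by full-batch gradient descent with a fixed learning rate; the hyperparameter names and values are our own choices, not the authors' implementation.

```python
import numpy as np

def learn_diagonal_metric(X, labels, lr=0.01, max_iter=500, tol=1e-6):
    """Learn nonnegative diagonal weights a (A = diag(a)) by gradient
    descent on g(A) = sum_S d_A^2(xi,xj) - log(sum_D d_A(xi,xj)),
    where S pairs share an MSR class and D pairs do not (Eq. (2))."""
    X = np.asarray(X)
    labels = np.asarray(labels)
    n, d = X.shape
    a = np.ones(d)
    # Precompute squared per-dimension differences for every pair.
    idx_i, idx_j = np.triu_indices(n, k=1)
    diff2 = (X[idx_i] - X[idx_j]) ** 2            # (n_pairs, d)
    same = labels[idx_i] == labels[idx_j]         # pair in Same-Class?

    def g_and_grad(a):
        d2 = diff2 @ a                            # squared distances d_A^2
        dist = np.sqrt(np.maximum(d2, 1e-12))
        g = d2[same].sum() - np.log(dist[~same].sum())
        # d(sum_S d^2)/da_k = sum_S diff2_k ;
        # d(log sum_D d)/da_k = sum_D diff2_k / (2 d) / sum_D d
        grad = diff2[same].sum(axis=0)
        grad -= (diff2[~same] / (2 * dist[~same, None])).sum(axis=0) \
                / dist[~same].sum()
        return g, grad

    prev = np.inf
    for _ in range(max_iter):
        g, grad = g_and_grad(a)
        if abs(prev - g) < tol:                   # early stop (Sec. V-A)
            break
        prev = g
        a = np.maximum(a - lr * grad, 0.0)        # keep diagonal nonnegative
    return a
```

The projection `np.maximum(..., 0.0)` enforces the nonnegativity of the diagonal elements stated above, and the `tol`/`max_iter` pair plays the role of the over-learning threshold.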

B. Tuning Musical Genre Weights

Considering the usability of the musical genre taxonomy [8]–[11], genre information is also included in the distance measurement, as shown in Equation (4), where $I(x_i, x_j)$ is the genre indicative function (Equation (5)).


Fig. 3. The accuracy of (a) $r_{\min}$ classification and (b) $r_{\max}$ classification using our method compared to KNN. Here, ML stands for metric learning.

The weight $w$ cannot be learned through gradient descent due to the discontinuity of $I(x_i, x_j)$. Therefore, it is tuned separately in extra experiments. Since the 11 musical genres used in our experiments are on the top level of the genre tree (see the additional documents for review), the fuzziness [10] between any two musical genres is minimized. The genre indicative function is then expressive enough regarding both correctness and simplicity.

$$d(x_i, x_j) = d_A(x_i, x_j) + w\,\big(1 - I(x_i, x_j)\big) \quad (4)$$

$$I(x_i, x_j) = \begin{cases} 1, & x_i \text{ and } x_j \text{ have the same genre} \\ 0, & \text{otherwise} \end{cases} \quad (5)$$

Thus, our proposed classifiers perform KNN classification using this new distance measurement.
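A minimal sketch of this classifier is given below, using the additive genre penalty as reconstructed in Equation (4); `a` is the learned diagonal metric from the previous sketch, and all names are ours.

```python
import numpy as np
from collections import Counter

def msr_distance(xi, xj, gi, gj, a, w):
    """Eq. (4): learned feature distance plus a genre penalty w applied
    when the two pieces' genres differ (I(xi, xj) = 0 in Eq. (5))."""
    d_a = np.sqrt(np.dot(a, (xi - xj) ** 2))
    return d_a + w * (0.0 if gi == gj else 1.0)

def knn_predict(x, g, X_train, g_train, y_train, a, w, k):
    """Classify one piece by majority vote among its k nearest training
    pieces under the genre-augmented distance."""
    dists = [msr_distance(x, X_train[i], g, g_train[i], a, w)
             for i in range(len(X_train))]
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]
```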

VI. EXPERIMENTAL RESULTS

Since, to the best of our knowledge, there is no prior research addressing MSR classification, we compared our method with plain KNN, which dominates the other classifiers in Section IV. The experiments were conducted on our user-tagged MSR dataset introduced in Section II, and 5-fold cross validation was employed in the accuracy evaluation. The results are illustrated in Fig. 3. As can be seen, the addition of musical genre contributed considerably to discriminating MSR for both $r_{\min}$ and $r_{\max}$ when compared to KNN without genre information. Moreover, the optimization based on KNN and musical genres using the metric learning technique also slightly improved the accuracy.

On our song collection, the $r_{\min}$ classification reaches its highest accuracy, 46.9%, when using its selected feature combination (Section IV) with our method, setting $k$ to 60 and $w$ to 1.00. In addition, the $r_{\max}$ classification reaches its best accuracy, 43.6%, when utilizing its selected feature combination with our method, setting $k$ to 20 and $w$ to 0.80.

Although we do not observe an extremely high accuracy in the classification experiments, our method is still able to give an estimation of MSR for a given music piece, and this is essential in many multimedia applications like music resizing and audio editing.
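For completeness, the accuracy evaluation could look like the following sketch of 5-fold cross validation, reusing `knn_predict` from the previous sketch; the fold-splitting details are our own, and in a full replication the metric `a` and the weight $w$ would be re-tuned on each training fold.

```python
import numpy as np

def cross_validate(X, genres, y, a, w, k, folds=5, seed=0):
    """Mean accuracy of the genre-augmented KNN classifier over
    `folds`-fold cross validation (5-fold in the letter)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    parts = np.array_split(order, folds)
    accs = []
    for f in range(folds):
        test = parts[f]
        train = np.concatenate([parts[i] for i in range(folds) if i != f])
        hits = sum(knn_predict(X[t], genres[t], X[train], genres[train],
                               y[train], a, w, k) == y[t] for t in test)
        accs.append(hits / len(test))
    return float(np.mean(accs))
```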

VII. CONCLUSION

In this letter, we described our research into music stretching resistance and proposed an applicable classification method using audio features and genre information based on the metric learning technique. Two classifiers were built separately for $r_{\min}$ and $r_{\max}$ under their respective feature selections. The experimental results demonstrate the advantage of our method over the reference approaches that use state-of-the-art classifiers in MSR classification.

REFERENCES

[1] Z. Liu, C. Wang, J. Wang, H. Wang, and Y. Bai, "Adaptive music resizing with stretching, cropping and insertion," Multimedia Syst., vol. 19, no. 4, pp. 359–380, 2013.

[2] Z. Liu, C. Wang, Y. Bai, H. Wang, and J. Wang, "MUSIZ: A generic framework for music resizing with stretching and cropping," ACM Multimedia, pp. 523–532, 2011.

[3] Z. Liu, C. Wang, G. Luo, Y. Bai, and J. Wang, "LyDAR: A lyrics density based approach to non-homogeneous music resizing," ICME, pp. 310–315, 2010.

[4] E. Lee, T. Nakra, and J. Borchers, "You're the conductor: A realistic interactive conducting system for children," NIME, pp. 68–73, 2004.

[5] J. Charles and B. Wenner, "Scalable music: Automatic music retargeting and synthesis," Eurographics, vol. 32, no. 2, pp. 345–354, 2013.

[6] S. Roucos and A. Wilgus, "High quality time-scale modification for speech," ICASSP, vol. 10, pp. 493–496, 1985.

[7] W. Verhelst and M. Roelands, "An overlap-add technique based on waveform similarity (WSOLA) for high quality time-scale modification of speech," ICASSP, pp. 554–557, 1993.

[8] G. Tzanetakis, G. Essl, and P. Cook, "Automatic musical genre classification of audio signals," IEEE Trans. Speech Audio Process., vol. 10, no. 5, pp. 293–302, 2002.

[9] T. Li and M. Ogihara, "A comparative study on content-based music genre classification," SIGIR, pp. 282–289, 2003.

[10] N. Scaringella, G. Zoia, and D. Mlynek, "Automatic genre classification of music content: A survey," IEEE Signal Process. Mag., pp. 133–141, 2006.

[11] U. Bagci and E. Erzin, "Automatic classification of musical genres using inter-genre similarity," IEEE Signal Process. Lett., pp. 521–524, 2007.

[12] G. Tzanetakis, A. Ermolinskyi, and P. Cook, "Pitch histograms in audio and symbolic music information retrieval," ISMIR, pp. 31–38, 2002.

[13] T. Ganchev, N. Fakotakis, and G. Kokkinakis, "Comparative evaluation of various MFCC implementations on the speaker verification task," SPECOM, vol. 1, pp. 191–194, 2005.

[14] E. Xing, A. Ng, M. Jordan, and S. Russell, "Distance metric learning, with application to clustering with side-information," NIPS, vol. 15, pp. 505–512, 2002.

[15] J. Han and M. Kamber, Data Mining: Concepts and Techniques, 2nd ed. San Francisco, CA, USA: Morgan Kaufmann, 2006.