Multitimbral Musical Instrument Classification

Peter Somerville and Alexandra L. Uitdenbogerd
School of Computer Science and Information Technology

RMIT University
Melbourne, Vic., Australia, 3000

pcsomerv,[email protected]

Abstract

The automatic identification of musical instrument timbres occurring in a recording of music has many applications, including music search by timbre, music recommender systems and transcribers. A major difficulty is that most music is multitimbral, making it difficult to identify the individual timbres present. One approach is to classify music based on specific groups of musical instruments. In this paper we report on our experiments that classify musical instrument timbres based on specific groups that are often found in commercial recordings.

Classification using the K-Nearest Neighbour classifier, with audio features such as Mel Frequency Cepstral Coefficients, on a set of 160 samples from commercial recordings, gave an accuracy of 80%. Some of the difficulties arose from distinguishing similar instrument groups, such as those only differing by the inclusion or exclusion of a voice. However, when these were examined in isolation, greater accuracy was achieved, suggesting that a hierarchical approach may be helpful.

1. Introduction

The typical method for acquiring recorded music has changed radically over the last 10 years. It is more common to download the latest single from an on-line music sales site such as iTunes than it is to buy a CD single from a record store. The existence of on-line music sites presents opportunities for richer ways of finding music. While iTunes relies on the traditional techniques of finding music by metadata such as the song title or artist, plus recommended music based on customers with similar purchase patterns, other sites such as eMusic, which have a greater emphasis on new music and non-mainstream artists, rely more on the provision of a variety of methods for finding music. Our team is researching the techniques needed for different ways of finding music, such as by melody, perceived mood, and timbre.

The term timbre refers to the sound of a musical instrument that makes it identifiably different from another instrument. For example, consider the different sound of a piano to that of a violin. People often express a preference for some instrument timbres over others; for example, the sound of a clarinet is thought by many to be more appealing than, say, that of a piano accordion. Tastes differ greatly in this regard, with some people enjoying electronic music timbres, or the more organic noise timbres associated with industrial music, while others prefer the clean sounds of acoustic instruments such as the acoustic guitar.

As the timbre of instruments within a piece of music is an important factor in determining whether a given person will enjoy the music, it can be a useful feature for search. In addition, regardless of enjoyment, a user may wish to locate pieces that contain a particular instrument, such as trumpet. In our ongoing work on this topic, we have explored the feasibility of classifying music according to the instrument timbres contained within it. Our early work considered timbres in isolation, both in the simple case of recordings of a single note [7], and multiple notes [6]. In this work we explore the situation more likely to occur in popular music recordings: the occurrence of simultaneous musical instruments, including the voice.

Due to the extreme variety available in synthetic instrument sounds, such as those found in electronic or industrial music, we have focused on music containing a selection of instruments that are more consistently identifiable as well as frequently occurring in popular and classical music, namely acoustic and electric guitar, piano, organ, bass guitar, drums, orchestral instruments, voice and combinations of these instruments. Within this small range there is still potential for considerable variety of timbre, for example, with different effects being applied to an electric guitar, or different settings being used for an organ.

For our experiments we classified music into eight possible groups according to the musical instruments present. Audio feature extractors used in the experiments were:
Spectral Centroid, Rolloff, Flux, Zero Crossings, RMS (Root Mean Square amplitude envelope) and Mel-Frequency Cepstral Coefficients (MFCC). The classifiers applied to the instrument samples were Decision Trees (J48), K-Nearest Neighbour (KNN), and two Bayesian-based classifiers, NaiveBayes and BayesNet. The best performing classifier was KNN, with a classification accuracy of 80%, and the most significant features were the MFCCs.

This paper is arranged in the following manner. Related work is followed by the approach used, data sources, experiments and results, ending with the conclusion and suggestions for future work.

2. Related Work

Research into classification of multitimbral music according to timbre has relatively recent origins. Essid et al. [2] addressed the issue of instrument recognition in polyphonic music (multiple notes occurring at the same time) by representing combinations of instruments that are likely to be played together within a certain musical genre. The experimental platform consisted of sound excerpts from commercial recordings of the jazz genre. Ensembles using a combination of the following ten instruments were used in the experiments: double bass, drums, piano, percussion, trumpet, tenor sax, electro-acoustic guitar, Spanish guitar, and male and female singing voices.

The results showed that by using a hierarchical classification algorithm, the recognition of classes consisting of combinations of instruments was possible. The scheme produced a hierarchy of nested clusterings. The approach started with the same number of clusters as classes and then measured the distances between pairs of clusters. The closest pairs were then grouped into new clusters. This process was continued until all classes lay in a single cluster. The work showed an average accuracy of 53% for segmented music with respect to the instruments played [2].
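To make the nested clustering concrete, the following sketch performs the same kind of agglomeration on per-class feature centroids. It is a minimal illustration in Python, not Essid et al.'s implementation; the class names and two-dimensional centroids in the usage example are made up.

    import numpy as np

    def agglomerate(class_centroids, class_names):
        """Repeatedly merge the closest pair of clusters until one remains.

        class_centroids: (n_classes, n_features) array of per-class mean feature vectors.
        Returns the sequence of merges, mirroring the nested clustering described above.
        """
        clusters = {name: vec.copy() for name, vec in zip(class_names, class_centroids)}
        merges = []
        while len(clusters) > 1:
            names = list(clusters)
            # Find the pair of clusters with the smallest Euclidean distance.
            a, b = min(
                ((x, y) for i, x in enumerate(names) for y in names[i + 1:]),
                key=lambda pair: np.linalg.norm(clusters[pair[0]] - clusters[pair[1]]),
            )
            # Represent the new cluster by the mean of the two merged centroids.
            clusters[a + "+" + b] = (clusters.pop(a) + clusters.pop(b)) / 2.0
            merges.append((a, b))
        return merges

    # Toy usage: three hypothetical instrument-combination classes in a 2-D feature space.
    centroids = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
    print(agglomerate(centroids, ["piano", "piano+bass", "drums"]))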

Sandvold et al. [5] used feature-based modelling for classifying percussive sounds mixed in polyphonic music. Localised sound models were built for each recording using features and combined with prior knowledge (general models) to improve percussion classification. Categories were kick, snare, cymbal, kick+cymbal, snare+cymbal and not-percussion. The results showed accuracy values 20% higher than those of the general models.

Eggink and Brown [1] proposed an approach that enabled instrument recognition in situations where multiple tones may overlap in time. The approach aimed to ignore features that included noise and interfering tones whilst using the parts of the signal which were dominated by the target sound; to do so, frequency regions dominated by interfering sounds were excluded. Audio was divided into 40 ms frames with a 20 ms overlap. Hanning windows were then used along with a fast Fourier transform (FFT). The classifier was trained for flute, oboe, clarinet, violin and cello classes.

Fujihara et al. [3] proposed a method for estimating the F0s (fundamental frequencies) of sung vocals from polyphonic audio signals. The approach consisted of three parts: pitch likelihood calculation, vocal probability calculation and F0 tracking based on Viterbi search. Pitch likelihood is the likelihood that an F0 is the most predominant F0 in an examined spectrum, and the Viterbi search refers to an algorithm that computes the most likely sequence of hidden states. This is generally used in the context of hidden Markov models, where the challenge is to find the hidden parameters from the observable parameters. Feature extractors were LPC-derived mel cepstral coefficients (LPMCCs) and F0s. Linear predictive coding (LPC) helps to represent the spectral envelope of a digital speech signal in a compressed form. The LPMCCs were chosen as they have been shown to express vocal characteristics better than MFCCs. Two Gaussian mixture models (GMMs) were used and trained on feature vectors extracted from two areas: vocal sections and interlude sections. Training data consisted of 21 songs spanning 14 different singers taken from the popular section of the RWC (Real World Computing) Music Database. Their method was evaluated on 10 musical pieces also taken from the popular section of the RWC Music Database. Results indicated that their method improved the estimation accuracy from 78.1% to 84.3%. The introduction of F0 saw an improvement of accuracy from 82.6% to 84.3% and was therefore confirmed as a good feature for vocal/non-vocal discrimination. The accuracy of LPMCCs was 0.3% higher than that of MFCCs. More recently this team has developed techniques for handling overlapping notes [9], and using instrument existence probabilities [8].
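For readers unfamiliar with the Viterbi search mentioned above, the sketch below is a generic textbook implementation of the dynamic-programming recursion for a small discrete HMM. It is not Fujihara et al.'s F0-tracking model; the transition and emission tables in the example are toy values.

    import numpy as np

    def viterbi(obs, start_p, trans_p, emit_p):
        """Most likely hidden-state sequence for a discrete HMM (log domain)."""
        n_states = len(start_p)
        T = len(obs)
        logdelta = np.zeros((T, n_states))           # best log-probability ending in each state
        backptr = np.zeros((T, n_states), dtype=int)
        logdelta[0] = np.log(start_p) + np.log(emit_p[:, obs[0]])
        for t in range(1, T):
            for s in range(n_states):
                cand = logdelta[t - 1] + np.log(trans_p[:, s])
                backptr[t, s] = np.argmax(cand)
                logdelta[t, s] = cand[backptr[t, s]] + np.log(emit_p[s, obs[t]])
        # Trace back the best path from the final frame.
        path = [int(np.argmax(logdelta[-1]))]
        for t in range(T - 1, 0, -1):
            path.append(int(backptr[t, path[-1]]))
        return path[::-1]

    # Toy example: two hidden states ("vocal", "non-vocal") and three observation symbols.
    start = np.array([0.6, 0.4])
    trans = np.array([[0.8, 0.2], [0.3, 0.7]])
    emit = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
    print(viterbi([0, 1, 2, 2], start, trans, emit))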

3. Approach Used

Popular music regularly contains song mixes containing at least drums and bass tracks. Other common instruments found in such music are acoustic guitar, electric guitar, piano, organ, orchestral sounds and, in most cases, the human voice. This research uses eight different categories of instrument groupings, composed from instruments that are found in a high percentage of modern music. For the purposes of keeping the research within workable boundaries, synthesized sounds and effects have not been considered. The instrument groupings chosen are: piano, acoustic guitar, organ, electric guitar, piano & bass guitar & drums, orchestral, electric guitar & bass guitar & drums, and voice. Piano & bass guitar & drums, for instance, represents segments of a song where only piano, bass guitar and drums are playing simultaneously. The orchestral group may contain any number of brass, woodwind, strings and typical instruments normally found in orchestral music. The voice category identifies a segment of music where voice plus other instrument combinations are playing simultaneously. Thus four of the classes in our experiment are multitimbral and four are monotimbral. The instrument categories, including the voice category, can be seen in Table 1.

The experiments carried out in this research extract features from 160 different sound files. Twenty-five songs were used for combined-instrument purposes and seven songs were used for voice. Five one-second segments were extracted from each song, giving 160 samples in total, as shown in (1).

25 × 5 + 7 × 5 = 160 (1)

The result of the feature extraction was then passed to the classifier for classification into the relevant instrument categories.

3.1. Feature Extraction

The feature extractors used in our approach were Spectral Centroid, Flux, Rolloff, Zero Crossing, RMS (Root Mean Square) and 13 MFCCs. ACE's jAudio, an open source package, was used for the feature extraction tasks [4]. Each audio file had features extracted every 50 ms. Within jAudio, a window size of 2048 samples was used.
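The extraction stage can be sketched as follows. jAudio was the tool actually used; the librosa library is substituted here purely for illustration, with a 2048-sample window and a hop of roughly 50 ms, and with spectral flux computed directly from the STFT since the exact jAudio definition may differ.

    import numpy as np
    import librosa  # stand-in for jAudio, which was the tool actually used

    def frame_features(path, sr=44100, n_fft=2048, hop_ms=50):
        """Return an (18, n_frames) matrix: 13 MFCCs plus centroid, rolloff, flux, ZCR and RMS."""
        y, sr = librosa.load(path, sr=sr, mono=True)
        hop = int(sr * hop_ms / 1000)                     # ~50 ms hop, as in the experiments
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=n_fft, hop_length=hop)
        centroid = librosa.feature.spectral_centroid(y=y, sr=sr, n_fft=n_fft, hop_length=hop)
        rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr, n_fft=n_fft, hop_length=hop)
        zcr = librosa.feature.zero_crossing_rate(y, frame_length=n_fft, hop_length=hop)
        rms = librosa.feature.rms(y=y, frame_length=n_fft, hop_length=hop)
        # Spectral flux: frame-to-frame change of the magnitude spectrum.
        mag = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))
        flux = np.sqrt(np.sum(np.diff(mag, axis=1, prepend=mag[:, :1]) ** 2, axis=0))[np.newaxis, :]
        return np.vstack([mfcc, centroid, rolloff, flux, zcr, rms])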

Just under 20 complete occurrences of the 18 features were extracted from the digital audio, giving 358 attributes in total, spanning approximately 1000 milliseconds (ms) of digital audio.
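A sketch of how roughly one second of 50 ms frames becomes a single fixed-length attribute vector; the exact jAudio attribute ordering and the precise count of 358 are not reproduced here.

    import numpy as np

    def one_second_vector(features, n_frames=20):
        """Flatten the first ~20 frames of an (18, n_frames) feature matrix into one vector.

        With 18 features per frame, 20 complete frames give 360 values; the 358 attributes
        described above correspond to just under 20 complete frames in the one-second segment.
        """
        clip = features[:, :n_frames]                 # keep ~1 s of 50 ms frames
        return clip.flatten(order="F")                # concatenate frame by frame

    # Usage with the frame_features() sketch above (hypothetical file name):
    # vec = one_second_vector(frame_features("sample.wav"))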

3.2. Classification

Feature extraction was followed by the use of the classifiers. In particular, these experiments applied Decision Trees (J48), KNN, NaiveBayes and BayesNet as implemented by the data mining software Weka, using default values. Evaluation was based on ten-fold stratified cross-validation, in which Weka attempts to represent each instrument class proportionally in both training and test sets.
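The evaluation protocol can be sketched with scikit-learn standing in for Weka. The value of k and the randomly generated feature matrix are purely illustrative, as the experiments relied on Weka's default settings and the real 160-sample feature set.

    import numpy as np
    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    # X: (160, n_attributes) matrix of one-second feature vectors, y: instrument-category labels.
    # Random data stands in for the real features here.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(160, 358))
    y = np.repeat(np.arange(8), 20)

    # Ten-fold stratified cross-validation, mirroring the Weka evaluation described above.
    knn = KNeighborsClassifier(n_neighbors=1)   # illustrative k; the experiments used Weka defaults
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    scores = cross_val_score(knn, X, y, cv=cv)
    print("mean accuracy: %.2f" % scores.mean())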

4. Data Sources

Songs used for this experiment came from commercial tracks covering a variety of popular styles and genres; in particular, genres such as pop, jazz, classical, gospel and folk have been included. Samples used in these experiments were 16 bit, mono samples recorded at 44.1 kHz. A final set of 31 different commercial audio tracks was chosen from a larger collection, as these tracks contained segments of music where the different combinations of instruments were playing simultaneously. Twenty-five of the 31 tracks had audio segments extracted where only the relevant combination of instruments is playing; no voice is present in these segments. In this case, five seconds of audio were examined at either the beginning of the track or at the five-second position of the track. Seven of the 31 tracks had audio segments extracted where voice was definitely present in the mix. The 35 or 40 second position of the audio track was used, as these time locations had voice present alongside other instruments. One song was used for both the accompanied voice class and an instrumental class, thus 31 songs were used for 32 five-second samples, each of which was divided into five one-second instances for the experiment.

The 32 five-second samples were divided among the eight instrument categories, and the final allocation of tracks to each category can be seen in the final column, labelled 'No. of tracks', in Table 1.

5. Experiments

These experiments aimed to determine whether it is possible to classify music according to multitimbral mixtures, and whether this is more difficult than classifying a monotimbral recording.

5.1. Multitimbral Classification Experiment

This experiment took the eight instrument categories, four of which were multitimbral, and performed feature extraction followed by classification. Of the four classifiers used, the KNN classifier returned the best results, with a correctly classified percentage of 80%. The predominant feature, as indicated by the J48 decision tree classifier, was an MFCC. The results are highlighted in the confusion matrices given in Fig. 1. Figures have been normalised and are listed as percentages (%). For the KNN classifier, the electric guitar category showed very little confusion with other categories, and it was only slightly confused with voice when using the Decision Tree classifier. Some confusion can be seen between piano and orchestral, and between acoustic guitar and piano, bass guitar and drums. Overall, covering the four classifiers, the greatest confusion was between the voice category and the electric guitar, bass guitar and drums category, followed closely by some confusion between organ and the electric guitar, bass guitar and drums category. It is no surprise that electric guitar, bass and drums continually became confused with other instrument categories, given that the combination of these three instruments generally provides a very full sound covering many of the available sound frequencies. The voice category can easily get lost in a mix saturated with heavy electric guitar, bass and drums. The organ category also represents a very full sound, so it is easy to imagine some of its timbre and frequencies overlapping and being hidden in a full electric guitar, bass guitar and drums song mix.
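The normalisation applied to the matrices in Fig. 1 amounts to converting each row of raw counts into percentages of that row's total. A minimal sketch, with toy counts:

    import numpy as np

    def normalise_rows(confusion):
        """Convert a confusion matrix of raw counts into row-wise percentages."""
        confusion = np.asarray(confusion, dtype=float)
        row_totals = confusion.sum(axis=1, keepdims=True)
        return np.round(100.0 * confusion / row_totals).astype(int)

    # Toy example: two classes, five samples per class.
    print(normalise_rows([[4, 1], [2, 3]]))   # -> [[80 20], [40 60]]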

Table 1. Instrument categories, itemising the groupings used, abbreviations and track allocation numbers.

Full description of instrument combinations     Chart abbreviations   Abbreviations used in experiments   No. of tracks
acoustic guitar                                 acous                 acousticguitar                      3
electric guitar, bass guitar & drums            ebd                   elecguitbassdrum                    4
electric guitar                                 elec                  electricguitar                      3
combination of orchestral based instruments     orch                  orchestral                          5
pipe organ or Hammond style organ               organ                 organ                               2
various selection of pianos                     piano                 piano                               5
piano & bass guitar & drums                     pbd                   pianobassdrum                       3
selection of accompanied vocals                 voice                 voice                               7


Fig. 2 shows the best classification results achieved for each of the eight instrument groups. In most cases, the KNN classifier gained the highest result, the only exceptions being BayesNet for piano and Naive Bayes for accompanied voice. The chart clearly indicates that the electric guitar and orchestral instrument groups were best at avoiding confusion with other instrument categories. It is also clear from the results that monotimbral samples were not necessarily the easiest to classify. On average, the four monotimbral classes had better results, but one of the best classes (orchestral) was multitimbral, and, when considering the best result over all classifiers, the worst class was monotimbral (acoustic guitar). There may be other factors that affected these results, such as the genre of music, and the small overlap of songs across categories. In the next section we look at these issues, as well as a comparison of accompanied voice and its most frequently confused class, in some detail.

5.2. Further Analysis

In the previous experiment, the accompanied voice class was the most problematic for classification, being frequently confused with the electric guitar, bass and drums class, as well as the organ class. It is possible that the types of accompaniment occurring with the voice are confused with similar instrumental groups without voice. Here, we look at the first two of these classes in isolation in several ways. Firstly, we look at how consistently songs in these two classes are classified together, by treating each song as a class. Thus, there are five instances for each class. There were seven songs containing voice plus four songs in the electric guitar, bass guitar and drums category. Note that the song that was used for both the accompanied voice and instrumental categories in the previous experiment was not used here, as its instrument class was organ.

The results revealed two aspects in particular. Classification was carried out using the four classifiers as in the previous experiment. The best classifier was once again KNN (see Figure 3), with a correctly classified percentage of 61%, and the predominant feature, as determined by the J48 decision tree classifier, was an MFCC. Firstly, on numerous occasions, voice segments were confused with one another and, to a lesser extent, segments within the electric guitar, bass guitar and drums category were also confused with each other. Secondly, on a number of occasions, electric guitar, bass guitar and drums segments from some gospel rock songs were confused with Darren Hayes's 'To the Moon and Back', where vocals were present, and some other gospel rock songs were confused with Coldplay's 'Fix You' and 'The Hardest Part' and Darren Hayes's 'So Beautiful', where vocals were also present. Little or no confusion was evident with the other vocal tracks, namely a French track, a Green Day song and the Hallelujah Chorus from the Messiah. The only surprise here is the Green Day song, which might have been expected to be confused as well. It would be fair to assume that music by Darren Hayes and Coldplay is more likely to have instrument timbres similar to those of an instrument category containing electric guitar, bass guitar and drums, mainly because their music consists largely of those types of instruments.

Clearly the data set used here is too small to draw any statistically significant conclusions that can be generalised, but it does show that, for our collection, samples extracted from the same song are not necessarily classified together.

Additionally, we examined the same data grouped into the two classes of accompanied voice and electric guitar, bass and drums (ebd). To reduce bias from different sized classes, we restricted the number of songs to three for each class, thus resulting in 15 instances for each class. Both the Naive Bayes and KNN classifiers had 87% correctly classified instances, with KNN more reliably classifying voice (all excerpts classified as voice were actually voice samples), and Naive Bayes being more successful with ebd (12/13 samples classified as ebd were ebd samples). It may be worth trying a combination of the two classifiers with a larger data set to see if greater accuracy can be achieved.
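One simple way such a combination could be realised, sketched here with scikit-learn as a stand-in rather than as the method used in these experiments, is soft voting over the predicted class probabilities of the two classifiers; the variable names in the commented usage are placeholders.

    from sklearn.ensemble import VotingClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import KNeighborsClassifier

    # Soft voting averages the class probabilities of the two classifiers, so a confident
    # Naive Bayes "ebd" prediction can override an uncertain KNN "voice" prediction.
    combo = VotingClassifier(
        estimators=[("knn", KNeighborsClassifier(n_neighbors=3)),
                    ("nb", GaussianNB())],
        voting="soft",
    )
    # combo.fit(X_train, y_train); combo.predict(X_test)   # placeholder data names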

Bayes Network (BayesNet), classified as:
     a    b    c    d    e    f    g    h
a   69    0   11    0    3    0    3   14
b    4   80    0    0   12    0    4    0
c    0    7   67    7    0   13    7    0
d   20   40   10    0    0    0    0   30
e    0    7    0    0   93    0    0    0
f    0    7   27    0    7   53    7    0
g    0    4    0    0    4    0   92    0
h   20    0    0    0    0    0   15   65

Decision Tree (J48), classified as:
     a    b    c    d    e    f    g    h
a   49    3    3    6   11    0    9   20
b    8   60    0    0    4   20    8    0
c    0   13   67    0    0    0   13    7
d   40   20    0   40    0    0    0    0
e   27    0    0    0   73    0    0    0
f    0   20   13    0    0   67    0    0
g    8   20    8    0    8    4   52    0
h   20    0    5   25    5    0   15   30

KNN, classified as:
     a    b    c    d    e    f    g    h
a   60    0    3   14    3    6    0   14
b    0   80    0    0    0    0   20    0
c    0    0   73    0    0   27    0    0
d    0    0   10   90    0    0    0    0
e    0    0    0    0  100    0    0    0
f    7    0   13    0    0   80    0    0
g    0    0    0    0    0    0  100    0
h    0    0    0   15    5    0    5   75

Naive Bayes, classified as:
     a    b    c    d    e    f    g    h
a   74    0    0    6    0    0    3   17
b    4   60    0    8    8   12    8    0
c   20    7   27   13    0   13   13    7
d   40    0    0   50    0    0    0   10
e    0    7    0    0   93    0    0    0
f   27    0    7    0    0   47    7   13
g    0    0    0    0    8    0   92    0
h   25    0    5    0    0    5    0   65

Legend: a = voice, b = piano, c = acousticguitar, d = organ, e = electricguitar, f = pianobassdrum, g = orchestral, h = elecguitbassdrum

Figure 1. Confusion matrices for the four classifiers (rows: actual class, columns: predicted class). Electric guitar generally showed very little confusion with other instrument groups.

[Figure 2 is a bar chart: x-axis shows the instrument categories (acous, ebd, elec, orch, organ, piano, pbd, voice); y-axis shows correctly classified instances (%).]

Figure 2. Classification results achieved using the best result from the best classifier.


5.3. Discussion

Based on the feature extractors used and the given classifiers, the results gathered from this study, dealing with the classification of eight different instrument categories, revealed that the combination of electric guitar, bass guitar and drums was sometimes confused with other songs containing a mixture of instruments and vocals. Songs containing organs were also sometimes confused with electric guitar, bass guitar and drums. The instrument categories of electric guitar and orchestral instruments were the best at avoiding confusion with other instrument categories.

A closer look at the first of these confusions revealed some confusion within the categories of voice and of electric guitar, bass guitar and drums. These results also showed some confusion between some contemporary gospel rock songs not containing voice and songs by some popular commercial artists which included voice.

The KNN classifier was the most successful classifier for all classification tests, while the most useful features were MFCCs. Classification into eight instrument categories resulted in 80% of instances being correctly classified. The further analysis showed that it may be useful to combine classifiers that work on two classes, such as accompanied voice and ebd, to achieve much higher accuracy overall.

6. Conclusion and Future Work

Our experiments explored the idea of classifying a piece of music based on the instruments that occur within it. On average, the success of classification is slightly less for multitimbral categories than monotimbral ones, but the difference is not large. Typical audio features coupled with a KNN classifier gave reasonable results. Unlike some work in the field, we used music from commercial recordings and classes that match styles of music likely to be found on-line by consumers.

KNN, classified as:
     a   b   c   d   e   f   g   h   i   j   k
a    2   2   0   0   1   0   0   0   0   0   0
b    0   4   0   0   0   0   0   1   0   0   0
c    0   0   0   0   1   0   0   4   0   0   0
d    0   1   0   3   0   0   0   1   0   0   0
e    0   0   1   0   4   0   0   0   0   0   0
f    0   0   0   0   0   5   0   0   0   0   0
g    0   0   0   0   0   0   5   0   0   0   0
h    0   0   0   0   0   0   0   5   0   0   0
i    0   0   0   0   0   0   0   1   4   0   0
j    0   0   0   0   1   0   0   4   0   0   0
k    0   0   0   0   0   0   1   2   0   0   2

Legend: a = coldplayfixyouV, b = coldplayhardestV, c = dhayessobeautV, d = dhayestothemoonV, e = frenchaquoicaV, f = greendayareweV, g = messiahhallelujahV, h = petrahelloagain, i = roguetrdrsbelieve, j = thirddaykeepon, k = thirddaytunnel

Figure 3. Unnormalised confusion matrix of song samples for the best performing classifier, KNN.


Examining our results more closely, we conclude that it may be useful to combine classifiers and build a hierarchical classifier using several binary classifiers for classes that are frequently confused with each other.
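One way such a hierarchy could be realised, sketched here under the assumption of scikit-learn-style classifiers rather than as a description of our system, is to let a first-stage classifier choose among all categories and hand a frequently confused pair, such as accompanied voice versus ebd, to a dedicated binary classifier.

    import numpy as np
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import KNeighborsClassifier

    class TwoStageClassifier:
        """First-stage KNN over all categories, with a binary disambiguator for a confusable pair.

        X is assumed to be a NumPy array of feature vectors and y an array of class-label strings.
        """

        def __init__(self, pair=("voice", "elecguitbassdrum")):
            self.pair = set(pair)
            self.stage1 = KNeighborsClassifier(n_neighbors=1)
            self.stage2 = GaussianNB()                     # binary voice-vs-ebd model

        def fit(self, X, y):
            y = np.asarray(y)
            self.stage1.fit(X, y)
            mask = np.isin(y, list(self.pair))
            self.stage2.fit(X[mask], y[mask])              # trained only on the two confusable classes
            return self

        def predict(self, X):
            preds = self.stage1.predict(X)
            redo = np.isin(preds, list(self.pair))         # re-decide only the confusable predictions
            if redo.any():
                preds[redo] = self.stage2.predict(X[redo])
            return preds

Whether such a two-stage scheme actually improves on the flat classifier is an empirical question best answered with a larger collection.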

There are various difficulties with scaling up experiments of this nature, one of the major ones being the manual process of locating suitable recordings; however, we hope to work with larger collections in the future.

Many questions remain unanswered in the quest for timbre-related music search, one of which is whether more sophisticated techniques can improve the accuracy of classification, and ultimately make a “query-by-timbre” system feasible for large collections of music. One possible approach is the use of source separation of individual instrument tracks from a song mixture prior to classification or retrieval. This is the focus of our current research on the problem.
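Source separation itself is beyond the scope of this paper, but as a pointer, one common starting point is non-negative matrix factorisation (NMF) of the magnitude spectrogram. The sketch below uses librosa and scikit-learn as assumed tools; it is not a description of our method, and the number of components is arbitrary.

    import numpy as np
    import librosa
    from sklearn.decomposition import NMF

    def nmf_components(path, n_components=4, n_fft=2048):
        """Factorise a mixture's magnitude spectrogram into spectral templates and activations."""
        y, sr = librosa.load(path, sr=44100, mono=True)
        mag = np.abs(librosa.stft(y, n_fft=n_fft))
        model = NMF(n_components=n_components, init="nndsvd", max_iter=400)
        templates = model.fit_transform(mag)      # (n_freq_bins, n_components) spectral shapes
        activations = model.components_           # (n_components, n_frames) gains over time
        return templates, activations

Each template/activation pair could then be soft-masked and inverted back to audio before classification.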

References

[1] J. Eggink and G. Brown. Application of missing feature theory to the recognition of musical instruments in polyphonic audio. In Proc. International Conference on Music Information Retrieval, ISMIR'03, pages 125–131, Washington DC, USA, 2003.

[2] S. Essid, G. Richard, and B. David. Instrument recognition in polyphonic music based on automatic taxonomies. IEEE Transactions on Audio, Speech, and Language Processing, 14:68–80, January 2006.

[3] H. Fujihara, T. Kitahara, M. Goto, K. Komatani, T. Ogata, and H. Okuno. F0 estimation method for singing voice in polyphonic audio signal based on statistical vocal model and Viterbi search. In 2006 International Conference on Acoustics, Speech and Signal Processing (ICASSP 2006), volume V, pages 253–256, May 2006.

[4] D. McEnnis, C. McKay, I. Fujinaga, and P. Depalle. jAudio: A feature extraction library. In Proc. of the 6th International Conference on Music Information Retrieval, pages 600–603, London, UK, Sept 2005.

[5] V. Sandvold, F. Gouyon, and P. Herrera. Drum sound classification in polyphonic audio recordings using localized sound models. In Proceedings of the Fifth International Conference on Music Information Retrieval, pages 537–540, Barcelona, January 2004.

[6] P. Somerville and A. Uitdenbogerd. Note-based segmentation and hierarchy in the classification of digital musical instruments. In Proc. of the International Computer Music Conference, pages 240–247, Copenhagen, Denmark, Aug 2007.

[7] P. Somerville and A. L. Uitdenbogerd. Classification of music based on musical instrument timbre. In S. J. Simoff, G. J. Williams, J. Galloway, and I. Kolyshkina, editors, Proceedings of the Australasian Data Mining Conference, volume 4, pages 173–188, Sydney, Dec. 2005. University of Technology Sydney.

[8] T. Kitahara, M. Goto, K. Komatani, T. Ogata, and H. Okuno. Instrogram: Probabilistic representation of instrument existence for polyphonic music. IPSJ Digital Courier, 3:1–13, 2007.

[9] T. Kitahara, M. Goto, K. Komatani, T. Ogata, and H. Okuno. Instrument identification in polyphonic music: Feature weighting to minimize influence of sound overlaps. EURASIP Journal on Advances in Signal Processing, Special issue on Music Information Retrieval Based on Signal Processing, 2007:1–15, 2007.
