
  • Polyphonic music information retrieval based on multi-label cascade classification system

    presented by Zbigniew W. Ras
    University of North Carolina, Charlotte, NC
    College of Computing and Informatics
    www.kdd.uncc.edu

    http://www.mir.uncc.edu

  • Student: Wenxin Jiang; Advisor: Dr. Zbigniew W. Ras. Polyphonic music information retrieval based on multi-label cascade classification system

  • 43 MIR systems. Most are pitch-estimation-based melody and rhythm matching.

    This presentation will focus on timbre estimation. Survey of MIR: http://mirsystems.info/

  • Outcome: Musical Database [music pieces indexed by instruments and emotions]. The resulting database will be represented as an FS-tree, guaranteeing efficient storage and retrieval. MIRAI - Musical Database (mostly MUMS) [music pieces played by 59 different music instruments]. Goal: Design and implement a system for automatic indexing of music by instruments (objective task) and emotions (subjective task).

  • Automatic Indexing of Music. What is needed? A database of monophonic and polyphonic music signals and their descriptions in terms of new features (including temporal) in addition to the standard MPEG-7 features. These signals are labeled by instruments and emotions, forming additional features called decision features.

    Why is it needed? To build classifiers for automatic indexing of musical sound by instruments and emotions.

  • MIRAI - Cooperative Music Information Retrieval System based on Automatic Indexing [architecture diagram elements: User, Instruments, Query, Query Adapter, Indexed Audio Database, Durations, Empty Answer?, Music Objects]

  • Raw data - signal representation. PCM (Pulse Code Modulation) is the most straightforward mechanism to store audio: analog audio is sampled and the individual samples are stored sequentially in binary format.

    Binary file, PCM: sampling rate 44.1 kHz, 16 bits, 2,646,000 values/min.
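    As a rough illustration, the sketch below reads headerless 16-bit PCM with NumPy and reproduces the values-per-minute figure; the file name and the mono assumption are placeholders, not part of the original system.

```python
import numpy as np

SAMPLE_RATE = 44_100        # 44.1 kHz sampling rate
# 44,100 samples/s * 60 s = 2,646,000 values per minute of mono audio
print(SAMPLE_RATE * 60)     # 2646000

# Read raw 16-bit signed PCM samples from a headerless binary file
# ("recording.pcm" is a placeholder name) and normalize to [-1, 1).
samples = np.fromfile("recording.pcm", dtype=np.int16)
signal = samples.astype(np.float64) / 32768.0
```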

  • The nature and types of raw data - challenges to applying KDD in MIR

    Data source        Organization   Volume       Type                    Quality
    Traditional data   Structured     Modest       Discrete, categorical   Clean
    Audio data         Unstructured   Very large   Continuous, numeric     Noisy

  • Feature Database - traditional pattern recognition. Feature extraction transforms the lower-level raw data form (amplitude values at each sample point) into manageable higher-level representations used for classification, clustering, and regression.

  • MPEG-7 features [extraction chain: signal → Hamming window → STFT (NFFT FFT points) → power spectrum, signal envelope, fundamental frequency, harmonic peaks detection]: Instantaneous Harmonic Spectral Centroid, Instantaneous Harmonic Spectral Deviation, Instantaneous Harmonic Spectral Spread, Instantaneous Harmonic Spectral Variation, Spectral Centroid, Temporal Centroid, Log Attack Time.
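    To make the extraction chain concrete, here is a minimal sketch of one such descriptor (a per-frame spectral centroid) computed from a Hamming-windowed STFT; the frame and hop sizes are illustrative assumptions, not the MPEG-7 reference settings.

```python
import numpy as np

def spectral_centroid(signal, sr=44_100, frame=4096, hop=1764):
    """Per-frame spectral centroid from a Hamming-windowed STFT (sketch)."""
    window = np.hamming(frame)
    freqs = np.fft.rfftfreq(frame, d=1.0 / sr)        # bin frequencies in Hz
    centroids = []
    for start in range(0, len(signal) - frame + 1, hop):
        spectrum = np.fft.rfft(signal[start:start + frame] * window)
        power = np.abs(spectrum) ** 2                 # power spectrum of the frame
        centroids.append(np.sum(freqs * power) / (np.sum(power) + 1e-12))
    return np.array(centroids)
```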

  • Derived Database: extended MPEG-7 features, other features & new features

    Feature                       Durations   Sub-Total   Total
    Tristimulus Parameters        4           10          40
    Spectrum Centroid/Spread II   4           2           8
    Flux                          4           1           4
    Roll Off                      4           1           4
    Zero Crossing                 4           1           4
    MFCC                          4           4x13        208
    Spectrum Centroid/Spread I    3           2           6
    Harmonic Parameters           3           4           12
    Flatness                      3           4x24        288
    Durations                     3           1           3
    Total                                                 577

    Feature                       Count
    Harmonic Upper Limit          1
    Harmonic Ratio                1
    Basis Functions               190
    Log Attack Time               1
    Temporal Centroid             1
    Spectral Centroid             1
    Spectrum Centroid/Spread I    2
    Harmonic Parameters           4
    Flatness                      24x4
    Total                         297

  • Hierarchical Classification - Schema I

  • Schema II - Hornbostel-Sachs [tree: Aerophone, Chordophone, Membranophone, Idiophone; aerophone subclasses: Free, Single Reed, Side, Lip Vibration, Whip; example instruments: Alto Flute, Flute, C Trumpet, French Horn, Tuba, Oboe, Bassoon]

  • Schema III - Play Methods [tree: Muted, Pizzicato, Bowed, Picked, Shaken, Blown; example instruments: Piccolo, Flute, Bassoon, Alto Flute]

  • Database Table (Xin Cynthia Zhang)

    Obj   Classification Attributes   Decision Attributes
          CA1  ...  CAn               Hornbostel-Sachs                   Play Method
    1     0.22 ...  0.28              [Aerophone, Side, Alto Flute]      [Blown, Alto Flute]
    2     0.31 ...  0.77              [Idiophone, Concussion, Bell]      [Concussive, Bell]
    3     0.05 ...  0.21              [Chordophone, Composite, Cello]    [Bowed, Cello]
    4     0.12 ...  0.11              [Chordophone, Composite, Violin]   [Martele, Violin]

  • Example [diagram: classification attributes and decision attributes with hierarchical values, e.g. C[1], C[2], C[2,1], C[2,2] and d[1], d[2], d[3], d[3,1], d[3,2], arranged at Level I and Level II]

    X    a      b      c        d
    x1   a[1]   b[2]   c[1]     d[3]
    x2   a[1]   b[1]   c[1]     d[3,1]
    x3   a[1]   b[2]   c[2,2]   d[1]
    x4   a[2]   b[2]   c[2]     d[1]

  • Classification: 90% training, 10% testing, 10 folds. Hierarchical (Schema I) vs. non-hierarchical. Compared with different classifiers: J48 tree and Naïve Bayesian.
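    The experiments use WEKA's J48 and Naïve Bayes; the sketch below mimics that setup with scikit-learn stand-ins (DecisionTreeClassifier in place of J48, GaussianNB for Naïve Bayes) and random placeholder data instead of the real feature database.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Placeholder feature matrix (one row per sound) and instrument labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y = rng.integers(0, 5, size=500)

for name, clf in [("J48-like tree", DecisionTreeClassifier()),
                  ("Naive Bayes", GaussianNB())]:
    scores = cross_val_score(clf, X, y, cv=10)   # 10 folds: 90% train, 10% test
    print(f"{name}: mean accuracy {scores.mean():.2%}")
```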

  • Results of the non-hierarchical classification

                J48-Tree    Naïve Bayesian
    All         70.4923%    68.5647%
    MPEG        65.7256%    56.9824%

  • Results of the hierarchical classification (Schema I) with MPEG-7 features

                 J48-Tree    Naïve Bayesian
    Family       86.434%     64.7041%
    No-pitch     73.7299%    66.2949%
    Percussion   85.2484%    84.9379%
    String       72.4272%    61.8447%
    Wind         67.8133%    67.8133%

  • Results of the hierarchical classification (Schema I) with all features

                 J48-Tree    Naïve Bayesian
    Family       91.726%     72.6868%
    No-pitch     77.943%     75.2169%
    Percussion   86.0465%    88.3721%
    String       76.669%     66.6021%
    Woodwind     75.761%     78.0158%

  • Classification Results (J48-Tree)

                      With new features       Without new features
                      Accuracy    Recall      Accuracy    Recall
    Con-clarinet      100.0       60.0        83.3        100.0
    Electric bass     100.0       73.3        93.3        93.3
    Flute             100.0       50.0        60.0        75.0
    Steel Drums       100.0       66.7        50.0        66.7
    Tuba              100.0       100.0       100.0       85.7
    Vibraphone        87.5        93.3        78.6        73.3
    Cello             87.0        95.2        86.7        61.9
    Violin            84.0        77.8        66.7        59.3
    Piccolo           83.3        50.0        60.0        60.0
    Marimba           82.4        87.5        83.3        93.8
    C trumpet         81.3        76.5        87.5        82.4
    Alto Flute        80.0        80.0        80.0        80.0
    English Horn      80.0        57.1        42.9        42.9

  • Polyphonic Sound

    segmentation

    Feature extraction

    Classifier

    Get Instrument

    Polyphonic sounds - how to handle them? Either single-label classification based on sound separation, or multi-labeled classifiers. Problem with separation: information loss during the signal subtraction. (Sound Separation Flowchart: get frame → segmentation → feature extraction → classifier → get instrument.)

  • This presentation will focus on timbre estimation in polyphonic sounds and designing multi-labeled classifiers. Timbre-relevant descriptors: Spectrum Centroid and Spread, Spectrum Flatness Band Coefficients,

    Harmonic Peaks, Mel frequency cepstral coefficients (MFCC), Tristimulus.

  • Sub-pattern of a single instrument in a mixture - feature extraction

  • Timbre estimation based on multi-label classifier [flowchart: segmentation → 40 ms frames → feature extraction → acoustic descriptors (features) → classifier]

    Each frame yields a list of candidate instruments with confidences:

    Instrument    Confidence
    Candidate 1   70%
    Candidate 2   50%
    ...           ...
    Candidate N   10%
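    A minimal sketch of how such a per-frame candidate list can be produced from classifier confidences; the instrument names and the 0.4 cut-off (the threshold mentioned later in the experiments) are illustrative assumptions.

```python
import numpy as np

def frame_candidates(confidences, instruments, threshold=0.4):
    """Keep the instruments whose confidence reaches the threshold,
    sorted from most to least confident (per-frame candidate list)."""
    order = np.argsort(confidences)[::-1]
    return [(instruments[i], float(confidences[i]))
            for i in order if confidences[i] >= threshold]

instruments = ["flute", "trombone", "violin"]
print(frame_candidates(np.array([0.70, 0.50, 0.10]), instruments))
# [('flute', 0.7), ('trombone', 0.5)]
```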

  • Polyphonic Sound

    Get frame

    Feature extraction

    Perform multi-label classification

    Finish all the Frames estimation

    Voting process based on context → Get final winners

    Multiple labels. (Flowchart of the multi-label classification system.)
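    The voting step that turns per-frame label lists into final winners for a window could look roughly like the frequency count below; this is a sketch only, and the system's context-based weighting is not reproduced here.

```python
from collections import Counter

def vote(frame_labels, n_winners=2):
    """Aggregate per-frame candidate labels into the final winners for a window."""
    counts = Counter(label for frame in frame_labels for label in frame)
    return [label for label, _ in counts.most_common(n_winners)]

frames = [["flute", "trombone"], ["flute"], ["flute", "violin"], ["trombone", "flute"]]
print(vote(frames))   # ['flute', 'trombone']
```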

  • Timbre Estimation Results based on different methods [Instruments - 45; Training Data (TD) - 2917 single instrument sounds from MUMS; testing on 308 mixed sounds randomly chosen from TD; window size 1 sec, frame size 120 ms, hop size 40 ms; MFCC extracted from each frame (following MPEG-7)]. A threshold of 0.4 controls the total number of estimations for each index window.

    Experiment #   Pitch based   Sound Separation   N(Labels) max   Recall    Precision   F-score
    1              Yes           Yes                1               54.55%    39.2%       45.60%
    2              Yes           Yes                2               61.20%    38.1%       46.96%
    3              Yes           No                 2               64.28%    44.8%       52.81%
    4              Yes           No                 4               67.69%    37.9%       48.60%
    5              Yes           No                 8               68.3%     36.9%       47.91%

    [Embedded spreadsheet/chart residue: "Single Label vs Multiple Label" and "Separation vs Non-Separation" recall charts. Additional data points listed there: a configuration with no pitch estimation, no sound separation, and N(Labels)=4 reached 70.13% recall; feature-based classification with sound separation + decision tree reached 36.49% (n=1) and 48.65% (n=2) recognition rate.]

  • Polyphonic Sound (window)

    Get frame

    Feature extraction

    Classifiers

    Multiple labels. Compressed representations of the signal: Harmonic Peaks, Mel Frequency Cepstral Coefficients (MFCC), Spectral Flatness, etc.

    Irrelevant information (inharmonic frequencies or partials) is removed.

    Violin and viola have similar MFCC patterns; the same holds for double bass and guitar. It is difficult to distinguish them in polyphonic sounds.

    More information from the raw signal is needed. Polyphonic Sounds

  • Short-Term Power Spectrum - a low-level representation of the signal (calculated by STFT). Power spectrum patterns of flute & trombone can be seen in the mixture. Spectrum slice: 0.12 seconds long.

  • Experiment:

    Middle C instrument sounds (pitch equal to C4 in MIDI notation, frequency 261.6 Hz).

    Training set: power spectra from 3323 frames, extracted by STFT from 26 single instrument sounds: electric guitar, bassoon, oboe, B-flat clarinet, marimba, C trumpet, E-flat clarinet, tenor trombone, French horn, flute, viola, violin, English horn, vibraphone, accordion, electric bass, cello, tenor saxophone, B-flat trumpet, bass flute, double bass, alto flute, piano, Bach trumpet, tuba, and bass clarinet.

    Testing set: fifty-two audio files mixed (using Sound Forge) from pairs of these 26 single instrument sounds.

    Classifiers: (1) KNN with Euclidean distance (spectrum-match-based classification); (2) decision tree (multi-label classification based on previously extracted features).

  • Timbre Pattern Match Based on Power Spectrum. n - number of labels assigned to each frame; k - parameter for KNN.

    Experiment #   Description                                                      Recall    Precision   F-score
    1              Feature-based + Decision Tree (n=2)                              64.28%    44.8%       52.81%
    2              Spectrum Match + KNN (k=1; n=2)                                  79.41%    50.8%       61.96%
    3              Spectrum Match + KNN (k=5; n=2)                                  82.43%    45.8%       58.88%
    4              Spectrum Match + KNN (k=5; n=2) without percussion instruments   87.1%
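    A minimal sketch of the spectrum-match idea, assuming each frame is represented by its short-term power spectrum: the k nearest training frames (Euclidean distance) vote, and up to n instrument labels are returned.

```python
import numpy as np
from collections import Counter

def spectrum_match(test_spectrum, train_spectra, train_labels, k=5, n=2):
    """KNN spectrum match: Euclidean distance between power spectra,
    majority vote among the k nearest training frames, up to n labels returned."""
    dists = np.linalg.norm(train_spectra - test_spectrum, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return [label for label, _ in votes.most_common(n)]
```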


  • Hierarchical structure [machine-learned tree fragment: Flute, English Horn, Violin, Viola]

  • Instrument granularity: classifiers are trained at each level of the hierarchical tree (Hornbostel/Sachs).

  • Modules of the cascade classifier for single instrument estimation - Hornbostel/Sachs [diagram; example: pitch 3B, module confidences 91.80%, 96.02%, 98.94%; 96.02% × 98.94% ≈ 95.00%]
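    A sketch of the cascade idea: each node of the hierarchical tree has its own (feature, classifier) pair, the sample is classified at the top level first, the chosen child's classifier is applied next, and the confidences are multiplied along the path. The `predict_with_confidence` interface and the node dictionary are hypothetical stand-ins, not the system's actual API.

```python
def cascade_classify(features, classifiers, node="root", confidence=1.0):
    """Walk the hierarchy: apply the classifier selected for the current node,
    descend into the predicted child, and accumulate confidence.
    `classifiers` maps node name -> fitted model; leaves have no entry."""
    model = classifiers.get(node)
    if model is None:                     # reached an instrument (leaf)
        return node, confidence
    child, prob = model.predict_with_confidence(features)   # hypothetical interface
    return cascade_classify(features, classifiers, child, confidence * prob)
```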

  • New Experiment:

    Middle C instrument sounds (pitch equal to C4 in MIDI notation, frequency 261.6 Hz).

    Training set: 2762 frames extracted from the following instrument sounds: electric guitar, bassoon, oboe, B-flat clarinet, marimba, C trumpet, E-flat clarinet, tenor trombone, French horn, flute, viola, violin, English horn, vibraphone, accordion, electric bass, cello, tenor saxophone, B-flat trumpet, bass flute, double bass, alto flute, piano, Bach trumpet, tuba, and bass clarinet.

    Classifiers (WEKA): (1) KNN with Euclidean distance (spectrum-match-based classification); (2) decision tree (classification based on previously extracted features).

    Confidence: the ratio of correctly classified instances to the total number of instances.

  • Classification on different feature groups

    Group   Feature description                                            Confidence (KNN)   Confidence (Decision Tree)
    A       33 Spectrum Flatness Band Coefficients                         99.23%             94.69%
    B       13 MFCC coefficients                                           98.19%             93.57%
    C       28 Harmonic Peaks                                              86.60%             91.29%
    D       38 Spectrum projection coefficients                            47.45%             31.81%
    E       Log spectral centroid, spread, flux, rolloff, zero crossing    99.34%             99.77%

  • Feature and classifier selection at each level of the cascade system. Top level: KNN + Band Coefficients.

    Node          Feature             Classifier
    chordophone   Band Coefficients   KNN
    aerophone     MFCC coefficients   KNN
    idiophone     Band Coefficients   KNN

    Node                Feature             Classifier
    chrd_composite      Band Coefficients   KNN
    aero_double-reed    MFCC coefficients   KNN
    aero_lip-vibrated   MFCC coefficients   KNN
    aero_side           MFCC coefficients   KNN
    aero_single-reed    Band Coefficients   Decision Tree
    idio_struck         Band Coefficients   KNN

  • Classification on the combination of different feature groups [charts: classification based on KNN; classification based on Decision Tree]

  • From those two experiments, we see that:

    The KNN classifier works better with feature vectors such as spectral flatness coefficients, projection coefficients, and MFCC. The decision tree works better with harmonic peaks and statistical features.

    Simply adding more features together does not improve the classifiers and sometimes even worsens classification results (e.g., adding harmonic peaks to other feature groups).

  • Feature and classifier selection at each level of Cascade System - Hornbostel/Sachs hierarchical tree

    Feature and classifier selection at top level

  • Feature and classifier selection at second level

  • Feature and classifier selection at third level

  • Feature and Classifier Selection

    Feature and classifier selection table for Level 1:

    Node          Feature                 Classifier
    chordophone   Flatness coefficients   KNN
    aerophone     MFCC coefficients       KNN
    idiophone     Flatness coefficients   KNN

    Feature and classifier selection table for Level 2:

    Node                Feature                 Classifier
    chrd_composite      Flatness coefficients   KNN
    aero_double-reed    MFCC coefficients       KNN
    aero_lip-vibrated   MFCC coefficients       KNN
    aero_side           MFCC coefficients       KNN
    aero_single-reed    Flatness coefficients   Decision Tree
    idio_struck         Flatness coefficients   KNN

  • HIERARCHICAL STRUCTURE BUILT BY CLUSTERING ANALYSIS

    Common methods to calculate the distance or similarity between clusters: single linkage (nearest neighbor), complete linkage (furthest neighbor), unweighted pair-group method using arithmetic averages (UPGMA), weighted pair-group method using arithmetic averages (WPGMA), unweighted pair-group method using the centroid average (UPGMC), weighted pair-group method using the centroid average (WPGMC), Ward's method.

    Most common distance functions: Euclidean, Manhattan, Canberra (the sum of a series of fractional differences between the coordinates of a pair of objects), Pearson correlation coefficient (PCC), which measures the degree of association between objects, and Spearman's rank correlation coefficient.

    Clustering algorithm: HCLUST (agglomerative hierarchical clustering), R package.
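    The experiments use R's hclust; a roughly equivalent sketch with SciPy is shown below, combining a Pearson-correlation distance with Ward linkage and cutting the tree into w clusters. The data is a placeholder, and since R's hclust accepts arbitrary dissimilarities, this is only an approximation of that setup.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# One row per 0.12 s frame (e.g. 33 flatness coefficients); placeholder data here.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 33))

dist = pdist(X, metric="correlation")            # Pearson-correlation distance
tree = linkage(dist, method="ward")              # Ward linkage on the dissimilarities
cluster_ids = fcluster(tree, t=37, criterion="maxclust")   # cut into w = 37 clusters
```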

  • Testing datasets (MFCC, flatness coefficients, harmonic peaks):

    The middle C pitch group, which contains 46 different musical sound objects. Each sound object is segmented into multiple 0.12 s frames, and each frame is stored as an instance in the testing dataset. There are 2884 frames in total.

    We also extract three different features (MFCC, flatness coefficients, and harmonic peaks) from those sound objects. Each feature produces one dataset of 2884 frames for clustering.

    Clustering: when the algorithm finishes the clustering process, a particular cluster ID is assigned to each single frame.

  • Contingency table derived from the clustering result

                    Cluster 1   ...   Cluster j   ...   Cluster n
    Instrument 1    X11         ...   X1j         ...   X1n
    Instrument i    Xi1         ...   Xij         ...   Xin
    Instrument n    Xn1         ...   Xnj         ...   Xnn

  • Evaluation result of the Hclust algorithm (the 14 results that yield the highest score among 126 experiments). w - number of clusters, δ - average clustering accuracy over all instruments, score = δ · w.

    Feature                 Method     Metric      δ        w    Score
    Flatness Coefficients   ward       pearson     87.3%    37   32.30
    Flatness Coefficients   ward       euclidean   85.8%    37   31.74
    Flatness Coefficients   ward       manhattan   85.6%    36   30.83
    mfcc                    ward       kendall     81.0%    36   29.18
    mfcc                    ward       pearson     83.0%    35   29.05
    Flatness Coefficients   ward       kendall     82.9%    35   29.03
    mfcc                    ward       euclidean   80.5%    35   28.17
    mfcc                    ward       manhattan   80.1%    35   28.04
    mfcc                    ward       spearman    81.3%    34   27.63
    Flatness Coefficients   ward       spearman    83.7%    33   27.62
    Flatness Coefficients   ward       maximum     86.1%    32   27.56
    mfcc                    ward       maximum     79.8%    34   27.12
    Flatness Coefficients   mcquitty   euclidean   88.9%    30   26.67
    mfcc                    average    manhattan   87.3%    30   26.20
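    A sketch of the evaluation, under the assumption that each instrument's clustering accuracy is the fraction of its frames falling into its best-matching cluster; δ is the mean of those accuracies, and the score multiplies δ by the number of clusters w.

```python
import numpy as np

def clustering_score(contingency):
    """contingency[i, j] = number of frames of instrument i assigned to cluster j."""
    counts = np.asarray(contingency, dtype=float)
    per_instrument = counts.max(axis=1) / counts.sum(axis=1)   # best-cluster fraction
    delta = per_instrument.mean()                              # average accuracy
    w = counts.shape[1]                                        # number of clusters
    return delta, w, delta * w                                 # (delta, w, score)
```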

  • Clustering result from the Hclust algorithm with Ward linkage and the Pearson distance measure; flatness coefficients are used as the selected feature. C trumpet (ctrumpet) and Bach trumpet (batchtrumpet) are clustered in the same group. ctrumpet_harmonStemOut is clustered in a single group of its own instead of merging with ctrumpet. Bassoon is considered a sibling of the regular French horn. Muted French horn is clustered in a different group, together with English horn and oboe.

  • Comparison between non-cascade classification and cascade classification with different hierarchical schemas

    Experiment   Classification method   Description        Recall   Precision   F-score
    1            non-cascade             Feature-based      64.3%    44.8%       52.81%
    2            non-cascade             Spectrum-Match     79.4%    50.8%       61.96%
    3            cascade                 Hornbostel/Sachs   75.0%    43.5%       55.06%
    4            cascade                 play method        77.8%    53.6%       63.47%
    5            cascade                 machine learned    87.5%    62.3%       72.78%

  • We evaluate the classification system on mixture sounds that contain two single instrument sounds.

    We also create 49 polyphonic sounds by randomly selecting three different single instrument sounds and mixing them together (see the mixing sketch below).

    We then test those three-instrument mixtures with five different classification methods (experiments 2 to 6), which are described in the previous two-instrument mixture experiments. Single-label classification based on the sound separation method is also tested on the mixtures (experiment 1). KNN (k=3) is used as the classifier for each experiment.
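    As referenced above, a simple way to build such test mixtures; the actual sounds were mixed in Sound Forge, so this NumPy sketch only illustrates the idea by summing the trimmed signals and peak-normalizing.

```python
import numpy as np

def mix(signals):
    """Mix single-instrument recordings into one polyphonic test sound:
    trim to the shortest length, sum, and peak-normalize."""
    n = min(len(s) for s in signals)
    mixture = np.sum([np.asarray(s[:n], dtype=np.float64) for s in signals], axis=0)
    return mixture / np.max(np.abs(mixture))
```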

  • Classification results for 3-instrument mixtures with different algorithms

    Exp #   Classifier                  Method                                       Recall    Precision   F-score
    1       Non-cascade                 Single-label based on sound separation       31.48%    43.06%      36.37%
    2       Non-cascade                 Feature-based multi-label classification     69.44%    58.64%      63.59%
    3       Non-cascade                 Spectrum-Match multi-label classification    85.51%    55.04%      66.97%
    4       Cascade (Hornbostel)        Multi-label classification                   64.49%    63.10%      63.79%
    5       Cascade (play method)       Multi-label classification                   66.67%    55.25%      60.43%
    6       Cascade (machine learned)   Multi-label classification                   63.77%    69.67%      66.59%

  • User entering a query. The user is not satisfied and enters a new query - Action Rules System.

  • Action Rule. An action rule is defined as a term

    [(ω) ∧ (α → β)] ⇒ (φ → ψ)

    where ω is a conjunction of fixed condition features shared by both groups, (α → β) represents proposed changes in values of flexible features, and (φ → ψ) is the desired effect of the action.

    Information System:

    A    B    D
    a1   b2   d1
    a2   b2
    a2   b2   d2
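    Purely as an illustration of the term above (not code from the system), an action rule could be represented as follows, using the values from the small information system: keep b = b2 stable, change a from a1 to a2, and expect the decision attribute d to change from d1 to d2.

```python
from dataclasses import dataclass

@dataclass
class ActionRule:
    """[(omega) AND (alpha -> beta)] => (phi -> psi)"""
    stable: dict      # omega: fixed condition attribute values
    flexible: dict    # alpha -> beta: attribute -> (from_value, to_value)
    effect: tuple     # phi -> psi: (decision attribute, from_value, to_value)

rule = ActionRule(stable={"b": "b2"},
                  flexible={"a": ("a1", "a2")},
                  effect=("d", "d1", "d2"))
```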

  • "Action Rules Discovery without pre-existing classification rules", Z.W. Ras, A. Dardzinska, Proceedings of RSCTC 2008 Conference, in Akron, Ohio, LNAI 5306, Springer, 2008, 181-190 http://www.cs.uncc.edu/~ras/Papers/Ras-Aga-AKRON.pdf

  • Auto indexing system for musical instruments

    intelligent query answering system for music instruments

    WWW.MIR.UNCC.EDU


    Spectrum Centroid describes the gravity center of the spectrum. Spectrum Spread describes the deviation of the power spectrum with respect to the gravity center in a frame; like Spectrum Centroid, it is an economical way to describe the shape of the power spectrum. Spectrum Band Coefficients describe the flatness property of the power spectrum within a frequency bin. Projection coefficients project the spectrum from the high-dimensional spectrum space to a low-dimensional space with compact, salient statistical information. Harmonic Peaks is a sequence of local peaks of harmonics of each frame. Mel frequency cepstral coefficients describe the spectrum according to the human perception system on the mel scale; they are computed by grouping the STFT points of each frame into a set of coefficients. Tristimulus: the concept of tristimulus originates in the world of colour, describing the way three primary colours can be mixed together to create a given colour. By analogy, the musical tristimulus measures the mixture of harmonics in a given sound, grouped into three sections. The parameters describe the ratio of the energy of three groups of harmonic partials to the total energy of harmonic partials. The following groups are used: fundamental, medium partials (2, 3, and 4), and higher partials. The first tristimulus measures the relative weight of the first harmonic; the second tristimulus measures the relative weight of the 2nd, 3rd, and 4th harmonics taken together; and the third tristimulus measures the relative weight of all the remaining harmonics.

    As the figure shows, the power spectrum patterns of the single flute and the single trombone can still be identified in the mixture spectrum without blurring into each other (as marked in the figure). Therefore, we do get a clear picture of the distinct pattern of each single instrument when we observe each spectrum slice of the polyphonic sound wave. This explains why the human hearing system can still accurately recognize the two different instruments in the mixture instead of misclassifying them as some other instruments. However, those distinct timbre-relevant characteristics of each instrument preserved in the signal cannot be observed in the previous feature space. From the results shown in the table, we draw the following conclusions: 1. Using a multiple-label classifier for each frame yields better results than using a single-label classifier. 2. Spectrum-based KNN classification improves the recognition rate of polyphonic sounds significantly. 3. Some percussion instruments (such as vibraphone and marimba) are not suitable for spectrum-based classification, but most instruments generating harmonic sounds work well with this new method. Energy describes the total energy of harmonic partials.

    According to the previous discussion and conclusion, in order to get the highest accuracy for the ultimate estimation at bottom level of hierarchical tree, cascade system must be able to pick the pair of feature and classifier from the available features pool and classifiers pool in the way that system achieve the best estimation at each level of cascade classification. To get such information, We need to deduced the knowledge from current training database by combining each feature from feature pool (A,B,C,D) with each classifier from the classifier pool( NaiveBayes, KNN, Decision Tree), and running the classification experiments in weka on the subset which corresponds to the each node in the hierarchical tree used by cascade classification system. *