


An Evaluation of Musical Score Characteristics for Automatic Classification of Composers

Ofer Dor and Yoram Reich
School of Mechanical Engineering, Faculty of Engineering, Tel-Aviv University, Tel-Aviv 69978, Israel

Computer Music Journal, 35:3, pp. 86–97, Fall 2011. © 2011 Massachusetts Institute of Technology.

Although humans can distinguish between different types of music, automated music classification is a great challenge. Within the last decade, numerous studies have been conducted on the subject, using both audio and score analysis (Kranenburg and Baker 2004; Manaris et al. 2005; Laurier and Herrera 2007; Weihs et al. 2007; Raphael 2008; Laurier et al. 2009). The classifications in these studies were done mostly by inference methods and/or machine-learning methods. The results have been quite modest (Kranenburg and Baker 2004; Laurier et al. 2009).

As music can be classified in many ways, studies have focused on diverse classification targets. Kranenburg and Baker (2004) have shown that it is possible to automatically recognize musical style from compositions of five well-known 18th-century composers. Numerous algorithms have been proposed to detect important musical features (melodic, rhythmic, and harmonic) with data mining and machine-learning techniques in large corpora of scores (Hartmann et al. 2007). Geertzen and Zaanen (2008) presented an approach to automatic composer recognition based on learning recurring patterns in music by grammatical inference.

Music can be represented in audio or as notation. Existing classification studies encode features with different representations. Clearly, the nature of the representation is a major determinant of the success of the classification. Therefore, the difficulty that present approaches have in classifying composers by their compositions stems from using features that do not sufficiently capture the differences between the composers.

Manaris et al. (2005), using a new set of features (metrics), achieved a classification accuracy of 94 percent for five composers. Their experiments, however, seem questionable: They performed only one holdout test instead of multiple cross-validation tests with statistically significant results, as is customary in data mining or machine-learning studies (Reich and Barai 1999; Demsar 2006). Therefore, it is difficult to assess the benefit of their proposed features. Furthermore, the distribution of the composers in the training and testing instances is not clear. (A later paper [Manaris et al. 2008] did use cross-validation, but with respect to genre classification rather than composer classification.)

We propose an approach that classifies composers of classical music based on certain low-level characteristics of their compositions. The proposed composition characteristics are descriptive features derived from the time-ordered sequence of pitches in a composition. These are simple features appropriate for a data-mining application; they are not high-level music-theoretical features, such as the results of traditional harmonic, contrapuntal, motivic, or metrical analysis. However, our results indicate that by using these features, classification by composer in a two-composer data set can usually be done with greater than 90 percent accuracy. Our results also show that classifying composers with works in the same genre and/or the same instrumentation achieves higher accuracy than does classifying works without regard to genre or instrumentation. (Although the term "genre" has multiple meanings in music scholarship, we use it here to refer to a major period of classical music or, in the case of ragtime, to a particular historical style.) We show, for example, that distinguishing between keyboard music of Mozart and Haydn, which is considered a challenging task for most humans because these composers have very similar styles, can be done with 75 percent accuracy. Further, we demonstrate the contribution of individual features to the classification accuracy in data sets containing multiple composers, as well as in data sets containing only two composers.

In what follows, we describe the data structure used in this study, including the new features discovered by a machine-learning program called CHECKUP, and we discuss the experiments and their results. The data sets and scores are freely available at www.eng.tau.ac.il/~yoram/tools/db_composers.zip.

Data Structure

The notation of a musical score needs to be converted to different syntaxes that are suitable for machine-learning classifiers.

For nine composers, all available musical score files in **kern format (Huron 1997) were downloaded from the Humdrum project's library at kern.humdrum.org. The scores are for keyboard (e.g., piano or organ) or for string instruments. The **kern format is a special Humdrum format that represents the underlying syntactic information conveyed by a musical score. Humdrum data in **kern files are encoded in a two-dimensional table with time increasing down the page and higher parts/voices toward the right. Next, the **kern files were converted to the Melisma Notes file format, an event list of notes (Temperley and Sleator 1999). A musical score file in the Melisma Notes file format contains a list of notes from various voices (e.g., soprano, alto, tenor, and bass), where each note is made of three parameters: on-time, off-time, and pitch. When notes have the same on-time, the order of the notes is given from lower to higher (e.g., bass to soprano) voices. All voices are mixed into one list with no indication of the voice to which a note belongs. The on-time and off-time parameters describe a note's beginning and ending, expressed in msec from the start of the composition (even though the data in our case were derived from a musical score, not from a recording of a performance). The pitch parameter is equivalent to a MIDI note number.
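For concreteness, the sketch below shows one way such a note list could be represented and read in Python. The exact line layout assumed here ("Note <on-time> <off-time> <pitch>", one note per line) is an illustrative simplification, not a specification of the Melisma Notes file format:

from dataclasses import dataclass
from typing import List

@dataclass
class Note:
    on_time: int   # onset, in msec from the start of the composition
    off_time: int  # offset, in msec
    pitch: int     # MIDI note number

def load_notes(path: str) -> List[Note]:
    # Parse a simplified note-list file: one note per line,
    # "Note <on-time> <off-time> <pitch>" (an assumed layout).
    notes = []
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) == 4 and fields[0] == "Note":
                on, off, pitch = (int(v) for v in fields[1:])
                notes.append(Note(on, off, pitch))
    return notes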

Musical Score Characteristics

Learning from musical scores in their raw structure, which is made of vectors of notes (pitches) that play at different times, is impossible with current machine-learning classifiers. The number of notes in each score is different, and there are relations between sequential notes in the score (e.g., there are recurring chords). Therefore, instead of the raw data, we used a collection of features to describe the score. Most of the features we used are constructed manually and are derived from the pitch values. In addition to these manually constructed features, and in order to improve classification performance, we used a feature-discovery method to automatically construct new features. The full list of features is the characterization of the score.

Manually Constructed Features

The manually constructed features are intuitive features that capture basic characteristics of the musical score. These features are presented subsequently.

Pitch Class Features

There are twelve pitch class features, each specifying the number of occurrences of one pitch class in the score, divided by the total number of notes in the score. The pitch class is calculated from the pitch parameter in the note as: pitch-class = pitch % 12 (i.e., MIDI note number modulo 12). A value of 0 is assigned to pitch-class C, so C# is 1, and so on, through B = 11.

Octave Features

There are ten octave features, each specifying the number of occurrences of one octave range in the score divided by the total number of notes in the score. The octave is calculated from the pitch parameter in the note as follows: octave = floor(pitch / 12). A value of 5 is given to the middle octave, so Middle C is 12 × 5 + 0 = 60.
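A minimal sketch of how the pitch class and octave feature groups defined in the two preceding subsections can be computed from a list of MIDI pitches; the choice of ten octave bins numbered 0–9 is an assumption made for illustration:

from collections import Counter
from typing import Dict, List

def pitch_class_features(pitches: List[int]) -> Dict[str, float]:
    # Fraction of notes in each of the 12 pitch classes (pitch % 12).
    counts = Counter(p % 12 for p in pitches)
    return {f"pitch_class_{pc}": counts.get(pc, 0) / len(pitches) for pc in range(12)}

def octave_features(pitches: List[int]) -> Dict[str, float]:
    # Fraction of notes in each octave range (floor(pitch / 12));
    # ten bins, 0-9, are assumed here.
    counts = Counter(p // 12 for p in pitches)
    return {f"octave_{o}": counts.get(o, 0) / len(pitches) for o in range(10)}

# Example: a C major triad around Middle C (MIDI 60).
print(pitch_class_features([60, 64, 67]))
print(octave_features([60, 64, 67]))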

Note Count Feature

The note count feature indicates the total number of notes in the score.


Table 1. Example of a Pitch Trigram [52, 52, 45], for Chopin, Mazurka in A Minor, Op. 68, No. 2

Strikethrough notes are those that are removed in order to create the singular-time score. Grayed-out notes are the singular-time notes that are not in the pitch trigram.

Pitch Range Features

The two pitch range features hold the minimum and maximum pitch values for all notes in the score.

Pitch Trigram Features

As used here, a pitch trigram specifies the most common sequence of three pitches in a "singular-time score" (to be defined subsequently). There are four pitch trigram features: one for each of the three pitches, plus a fourth feature storing the ratio between the number of occurrences of this trigram in the singular-time score and the total number of notes in the singular-time score. A singular-time score is the score that results from removing all notes that overlap in time with the previous note. The criterion for removal is: (previous off-time > on-time). This enables finding a pattern, or a recurring sequence of notes, in a musical part (e.g., the lowest voice) while ignoring other parts. An example of a pitch trigram, which has the sequential pitch values [52, 52, 45] and exists in 16.2 percent of all the notes in Chopin's Mazurka in A Minor, Op. 68, No. 2, is presented in Table 1 and Figure 1.
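A sketch of the singular-time filtering and trigram extraction just described, working on parallel on-time, off-time, and pitch lists. The text does not state whether "previous note" means the previous note in the original list or the previously retained note; the version below assumes the former:

from collections import Counter
from typing import List, Tuple

def singular_time_pitches(on_times: List[int], off_times: List[int],
                          pitches: List[int]) -> List[int]:
    # Keep a note only if the previous note in the list has already ended
    # when it starts (previous off-time <= current on-time).
    kept = []
    for i in range(len(pitches)):
        if i == 0 or off_times[i - 1] <= on_times[i]:
            kept.append(pitches[i])
    return kept

def pitch_trigram_features(on_times: List[int], off_times: List[int],
                           pitches: List[int]) -> Tuple[int, int, int, float]:
    # Most common sequence of three consecutive singular-time pitches, plus
    # its occurrence count divided by the number of singular-time notes.
    st = singular_time_pitches(on_times, off_times, pitches)
    trigrams = Counter(zip(st, st[1:], st[2:]))
    if not trigrams:
        return (0, 0, 0, 0.0)
    (p1, p2, p3), count = trigrams.most_common(1)[0]
    return (p1, p2, p3, count / len(st))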

Composer Feature

This is the target feature denoting the musical score's composer.

Feature Discovery

CHECKUP, a unique classifier tool that can learn from both vector and regular features (explained subsequently) and construct mathematical vector relations, was used in this study (Dor and Reich 2007). CHECKUP is able to discover many types of features that can be used to produce higher-accuracy classification and an associated rule set. A schematic diagram of CHECKUP is shown in Figure 2. The input data set is a list of instances; an instance is made of a list of features, where each feature could be a regular feature (numeric or nominal) or a vector feature (an ordered list of regular features). The input features are transformed into a list of regular features according to an instruction list, constructed by the feature generation module in the previous iteration. The transformed features are classified with a classifier (e.g., the C4.5 classifier). The classification information (e.g., accuracy, rule set), along with the input and previously generated features, is used by the generation module to construct a new list of features. This new list contains thousands of different features, based on mathematical or logical definitions that could, if desired, be extended further. An example of an instruction list is: a = mean(A), b = sum(A + B), where A and B are vectors and a and b are numeric values.

Figure 1. Example of a pitch trigram [52=E, 52=E, 45=A], for Chopin, Mazurka in A Minor, Op. 68, No. 2.

Figure 2. Schematic diagram of the CHECKUP classifier tool.
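The fragment below is only a toy illustration of the kind of instruction list mentioned above, mapping vector features to regular (scalar) features before classification; it is not the CHECKUP implementation, and the names A, B, a, and b are taken from the example in the text:

from statistics import mean
from typing import Callable, Dict, List

Instance = Dict[str, List[float]]

# Each instruction names a new regular feature and defines it as a function
# of the instance's vector features, e.g. a = mean(A), b = sum(A + B).
INSTRUCTIONS: Dict[str, Callable[[Instance], float]] = {
    "a": lambda inst: mean(inst["A"]),
    "b": lambda inst: sum(x + y for x, y in zip(inst["A"], inst["B"])),
}

def apply_instructions(instance: Instance) -> Dict[str, float]:
    # Transform vector features into regular features for a classifier.
    return {name: rule(instance) for name, rule in INSTRUCTIONS.items()}

# Example instance with two vector features A and B.
print(apply_instructions({"A": [60.0, 62.0, 64.0], "B": [1.0, 1.0, 2.0]}))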

The tested data set had 100 instances, where each instance is a sub-section of a score with four features. Three of the features are vector features: on-time, off-time, and pitch vectors, with a vector size of 200 (i.e., 200 notes). The fourth feature is the composer target feature. As the 1,183 scores used in this study have between 300 and 15,000 notes, we did not use the full score data: Learning from these large vectors required large computation resources and time. CHECKUP discovered that the most significant features are the note duration and pitch gradient feature groups (described subsequently). These features were used, in addition to the manually constructed features described earlier, to classify the composers from the data set of musical scores.

Note Duration Features

The note duration features are the mean and the variance of all note (non-rest) durations. For each note, the note duration is defined as the difference between its off-time and on-time parameters. For example, in Table 1, the first note with a pitch value equal to 52 has an on-time equal to 1,164 and an off-time equal to 1,681. The note duration value for this note is therefore 517. The same calculations are performed on all notes in the score; the mean and variance of all note duration values are then calculated.

Pitch Gradient Features

The pitch gradient features are the mean and the standard deviation of all pitch gradients in the score. For each note, a pitch gradient is defined as follows: pitch-gradient = sign(pitch – previous pitch). If the current pitch is higher than the previous pitch, then the gradient is equal to 1; if the current pitch is lower than the previous pitch, then the gradient is equal to –1; and if the current pitch is equal to the previous pitch, the gradient is equal to 0.
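A compact sketch of these two CHECKUP-discovered feature groups, computed from parallel on-time, off-time, and pitch lists; whether the variance is the population or sample variance is not stated in the text, so the population versions are assumed here:

from statistics import mean, pstdev, pvariance
from typing import Dict, List

def duration_and_gradient_features(on_times: List[int], off_times: List[int],
                                   pitches: List[int]) -> Dict[str, float]:
    # Note duration: off-time minus on-time, per note.
    durations = [off - on for on, off in zip(on_times, off_times)]
    # Pitch gradient: sign of the difference between successive pitches (+1, -1, or 0).
    gradients = [(b > a) - (b < a) for a, b in zip(pitches, pitches[1:])]
    return {
        "duration_mean": mean(durations),
        "duration_variance": pvariance(durations),
        "gradient_mean": mean(gradients),
        "gradient_std": pstdev(gradients),
    }

# Example with a short hypothetical three-note fragment.
print(duration_and_gradient_features([0, 500, 1000], [400, 900, 1500], [52, 52, 45]))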

Experiments

As shown in Table 2, a collection of 1,183 scores by nine composers, from different genres and for different instruments, was used. Two groups of data sets were created. The first group is made of eight different multiple-composer combinations, as shown in Table 3. The second group is made of all the 36 possible pairings of composers, given the set of nine composers. Each data set was generated ten times: Each time a different selection and order of scores were used. The number of scores for every composer in each data set is equal to the lowest number of scores among all composers.


Table 2. The Nine Composers Listed with Genre Classification and the Number of Musical Scores Used in the Experiments

Composer Key   Composer Name              Genre      Scores for Strings   Scores for Keyboard   Total Scores
bach           Bach, Johann Sebastian     Baroque    16                   230                   246
beethoven      Beethoven, Ludwig van      Classical  73                   108                   181
chopin         Chopin, Frederic           Romantic   0                    87                    87
corelli        Corelli, Arcangelo         Baroque    47                   198                   245
haydn          Haydn, Franz Joseph        Classical  181                  31                    212
joplin         Joplin, Scott              Ragtime    0                    45                    45
mozart         Mozart, Wolfgang Amadeus   Classical  16                   66                    82
scarlatti      Scarlatti, Domenico        Baroque    0                    58                    58
vivaldi        Vivaldi, Antonio           Baroque    27                   0                     27

Table 3. The Composer List and Data Set Size for the Multiple-Composer Data Sets

Data Set             Composer Keys                                                        Number of Scores   Instrumentation
classical            beethoven, haydn, mozart                                             246                strings, keyboard
baroque              bach, corelli, scarlatti, vivaldi                                    108                strings, keyboard
strings              beethoven, corelli, haydn, vivaldi                                   108                strings
keyboard             bach, beethoven, chopin, corelli, joplin, mozart, scarlatti          315                keyboard
classical-keyboard   beethoven, haydn, mozart                                             93                 keyboard
baroque-keyboard     bach, corelli, scarlatti                                             174                keyboard
major-composers      bach, beethoven, mozart                                              246                strings, keyboard
composers            bach, beethoven, chopin, corelli, haydn, joplin, mozart, scarlatti   360                strings, keyboard

All data sets have string and keyboard scores, except the strings data set, which has only string scores, and the keyboard, classical-keyboard, and baroque-keyboard data sets, which have only keyboard scores.

For example, the classical data set has three composers and each composer has a different number of scores: Beethoven, Haydn, and Mozart have 181, 212, and 82, respectively. The lowest number of scores is 82; therefore, the number of scores in the classical data set is set to 82 × 3 = 246. The classical data set is generated ten times, where each time a random selection of 82 scores among all of the available scores by each composer is performed, then the order of scores is randomized in the newly created data set. In this way, we created 80 different multiple-composer data sets (eight combinations of multiple composers times ten random selections) and 790 different two-composer data sets (36 pairs of composers for all instruments, plus 28 pairs for keyboard instruments, plus 15 pairs for string instruments, times 10 random selections).
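A minimal sketch of the balanced sampling procedure described above, assuming each composer's scores are given as a list of file names (the identifiers here are hypothetical):

import random
from typing import Dict, List, Tuple

def balanced_datasets(scores_by_composer: Dict[str, List[str]],
                      repetitions: int = 10,
                      seed: int = 0) -> List[List[Tuple[str, str]]]:
    # For each repetition, draw from every composer as many scores as the
    # least-represented composer has, then shuffle the combined list of
    # (score, composer) pairs.
    rng = random.Random(seed)
    n = min(len(scores) for scores in scores_by_composer.values())
    datasets = []
    for _ in range(repetitions):
        combined = []
        for composer, scores in scores_by_composer.items():
            combined.extend((score, composer) for score in rng.sample(scores, n))
        rng.shuffle(combined)
        datasets.append(combined)
    return datasets

# Hypothetical usage for the classical data set (82 scores per composer, ten times):
# classical = balanced_datasets({"beethoven": beethoven_scores,
#                                "haydn": haydn_scores,
#                                "mozart": mozart_scores})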

We used various classifiers from the Waikato Environment for Knowledge Analysis (WEKA), a suite of machine-learning classifiers (Garner 1995; Witten and Frank 1999): IBk, a K-nearest neighbor classifier; C4.5, a decision tree classifier; a naive Bayes classifier; SMO, a sequential minimal optimization algorithm for training a support vector classifier; RandomForest, a random forests classifier; SimpleLogistic, a classifier for building linear logistic regression models; and JRip/Ripper (Repeated Incremental Pruning to Produce Error Reduction), a propositional rule classifier. We also did several tests with a neural network, but we did not find that it performed better than the aforementioned classifiers.

Table 4. The Classification Accuracies (Mean ± Standard Deviation) Derived from a Sample of Ten, 10-Fold Cross-Validation Tests for Multiple-Composer Data Sets and All Classifiers

Data Set             C4.5           IBk            JRip           Naive Bayes    RandomForest   SMO            SimpleLogistic
classical            62.64 ± 2.77   57.52 ± 3.81   58.21 ± 2.87   62.36 ± 1.82   65.89 ± 3.61   68.09 ± 2.83   69.23 ± 2.20
baroque              73.52 ± 6.79   75.37 ± 4.17   69.44 ± 5.05   82.41 ± 4.02   79.91 ± 4.28   82.50 ± 3.74   84.26 ± 3.32
strings              76.48 ± 6.13   68.80 ± 5.66   67.78 ± 5.14   79.91 ± 3.55   82.04 ± 3.03   83.15 ± 3.51   84.91 ± 2.76
keyboard             76.03 ± 3.59   74.63 ± 1.42   70.10 ± 1.90   82.06 ± 1.62   82.73 ± 1.44   82.38 ± 2.04   83.52 ± 1.83
classical-keyboard   70.97 ± 4.18   70.22 ± 6.17   65.91 ± 3.79   77.20 ± 4.61   76.88 ± 3.37   78.82 ± 4.84   81.40 ± 4.15
baroque-keyboard     94.14 ± 1.29   89.25 ± 2.45   92.36 ± 1.95   94.43 ± 2.15   95.63 ± 1.16   92.70 ± 1.58   95.00 ± 0.67
major-composers      88.74 ± 1.66   85.16 ± 2.43   84.72 ± 2.52   88.66 ± 1.06   91.30 ± 1.38   89.43 ± 1.13   90.53 ± 2.43
composers            67.11 ± 3.14   65.58 ± 3.28   61.94 ± 2.62   75.00 ± 1.64   74.44 ± 1.71   74.36 ± 2.71   79.06 ± 1.17

All the experiments were performed with ten-fold stratified cross-validations, and for each data set a mean and standard deviation were calculated from the ten randomly selected score data sets. The multiple-composer data sets were tested with all used classifiers. In order to find the best classifier, a comparison was done according to the Friedman test with α = 0.05 (Demsar 2006). The two-composer data sets were tested with the SMO classifier, which performed better than the other tested classifiers in selected binary data sets. Feature contribution tests were performed on all data sets, where in each data set (among the 870 data sets), features were added incrementally in order to learn the influence and contribution of each group of features to the composer classification performance. We started by using one group of features, then added another group of features, and continued until we used all features from all seven groups, totaling 6,090 experiments. The order of added feature groups was as follows: pitch class, octave, note count, note duration, pitch gradient, pitch range, and pitch trigram. Different orders of adding features could affect the feature contribution because of interactions between the features. However, a complete analysis is computationally infeasible.
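The evaluation protocol can be sketched as follows; this uses scikit-learn's support vector classifier as a rough stand-in for WEKA's SMO (the original experiments used WEKA, not scikit-learn), with random placeholder data in place of the real feature matrix:

import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

def ten_fold_accuracy(X: np.ndarray, y: np.ndarray, seed: int = 0):
    # Ten-fold stratified cross-validation; returns the mean and standard
    # deviation of the per-fold accuracies.
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    scores = cross_val_score(SVC(kernel="linear"), X, y, cv=cv, scoring="accuracy")
    return scores.mean(), scores.std()

# Placeholder data: 164 instances (a hypothetical balanced two-composer data set)
# and 33 descriptive features (12 pitch class + 10 octave + 1 note count +
# 2 note duration + 2 pitch gradient + 2 pitch range + 4 pitch trigram).
rng = np.random.default_rng(0)
X = rng.normal(size=(164, 33))
y = np.array([0, 1] * 82)
print(ten_fold_accuracy(X, y))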

Results and Discussion

Two categories of results are presented: first, the classification performance of multiple-composer and two-composer data sets, and second, the contribution of features to the performance of different machine-learning classifiers.

Classification Performance Using Multiple-Composer Data Sets

The accuracies of the classification prediction, tested with various classifiers in the multiple-composer data sets, are presented in Table 4. These results show that the SimpleLogistic classifier achieved the best performance in most of the multiple-composer data sets, except for the baroque-keyboard and major-composers data sets, in which the RandomForest classifier achieved the best performance. For the multiple-composer data sets, the SimpleLogistic classifier achieved the highest rank, having the best overall performance compared to other classifiers as tested with the Friedman test with α = 0.05 (Demsar 2006).

These results show that classifying the classical data set achieves a poor accuracy of 69.23 percent. Learning from the classical data set with keyboard scores only, however, achieved an accuracy of 81.4 percent.


Table 5. The Classification Accuracies (Mean ± Standard Deviation) for Two-Composer Data Sets, as Tested with the SMO Classifier

             bach           beethoven      chopin         corelli        haydn          joplin         mozart         scarlatti      vivaldi
bach         -              96.35 ± 0.79   96.95 ± 0.82   97.29 ± 0.29   98.54 ± 0.27   95.89 ± 1.66   97.07 ± 0.85   96.12 ± 1.96   94.07 ± 3.79
beethoven    96.35 ± 0.79   -              84.43 ± 1.28   97.10 ± 0.15   92.79 ± 0.75   97.33 ± 0.78   89.27 ± 2.53   94.74 ± 2.09   97.41 ± 2.34
chopin       96.95 ± 0.82   84.43 ± 1.28   -              99.25 ± 0.39   95.63 ± 0.95   89.33 ± 1.97   94.39 ± 0.69   93.88 ± 1.54   95.37 ± 1.80
corelli      97.29 ± 0.29   97.10 ± 0.15   99.25 ± 0.39   -              96.67 ± 0.23   99.11 ± 1.15   96.71 ± 1.73   93.28 ± 1.40   85.19 ± 4.28
haydn        98.54 ± 0.27   92.79 ± 0.75   95.63 ± 0.95   96.67 ± 0.23   -              99.22 ± 0.91   62.80 ± 2.59   97.67 ± 0.82   91.85 ± 1.99
joplin       95.89 ± 1.66   97.33 ± 0.78   89.33 ± 1.97   99.11 ± 1.15   99.22 ± 0.91   -              99.78 ± 0.47   98.89 ± 0.91   97.59 ± 0.89
mozart       97.07 ± 0.85   89.27 ± 2.53   94.39 ± 0.69   96.71 ± 1.73   62.80 ± 2.59   99.78 ± 0.47   -              95.60 ± 0.95   95.74 ± 1.96
scarlatti    96.12 ± 1.96   94.74 ± 2.09   93.88 ± 1.54   93.28 ± 1.40   97.67 ± 0.82   98.89 ± 0.91   95.60 ± 0.95   -              94.63 ± 1.37
vivaldi      94.07 ± 3.79   97.41 ± 2.34   95.37 ± 1.80   85.19 ± 4.28   91.85 ± 1.99   97.59 ± 0.89   95.74 ± 1.96   94.63 ± 1.37   -

Table 6. The Classification Accuracies (Mean ± Standard Deviation) for Two-Composer Data Sets with Keyboard Instrument Scores, as Tested with the SMO Classifier

             bach           beethoven      chopin         corelli        haydn          joplin         mozart         scarlatti
bach         -              96.85 ± 0.57   97.18 ± 0.74   98.84 ± 0.32   96.94 ± 1.60   98.44 ± 1.19   98.18 ± 1.14   96.90 ± 1.23
beethoven    96.85 ± 0.57   -              80.92 ± 1.32   97.50 ± 0.73   95.97 ± 3.25   97.56 ± 1.37   94.17 ± 1.43   93.45 ± 1.16
chopin       97.18 ± 0.74   80.92 ± 1.32   -              98.74 ± 0.80   94.68 ± 2.41   91.56 ± 3.40   92.65 ± 1.39   92.76 ± 1.48
corelli      98.84 ± 0.32   97.50 ± 0.73   98.74 ± 0.80   -              91.94 ± 2.63   99.33 ± 0.94   96.14 ± 1.15   92.24 ± 1.77
haydn        96.94 ± 1.60   95.97 ± 3.25   94.68 ± 2.41   91.94 ± 2.63   -              98.87 ± 1.53   75.16 ± 5.05   93.87 ± 2.72
joplin       98.44 ± 1.19   97.56 ± 1.37   91.56 ± 3.40   99.33 ± 0.94   98.87 ± 1.53   -              99.33 ± 0.78   98.33 ± 0.79
mozart       98.18 ± 1.14   94.17 ± 1.43   92.65 ± 1.39   96.14 ± 1.15   75.16 ± 5.05   99.33 ± 0.78   -              94.48 ± 1.30
scarlatti    96.90 ± 1.23   93.45 ± 1.16   92.76 ± 1.48   92.24 ± 1.77   93.87 ± 2.72   98.33 ± 0.79   94.48 ± 1.30   -

Learning from the baroque data set with strings and keyboard scores, and with only keyboard scores, achieved accuracies of 84.26 percent and 95.63 percent, respectively. Learning from the instrument-specific data sets, the strings and keyboard data sets, achieved results above 83 percent. Classification of the major composers (Bach, Beethoven, and Mozart) achieved a higher accuracy of 91.3 percent.

The classification accuracy in the multiple-composer data sets is not dependent on differences of genre or instrumentation across composers, because the accuracy within a genre-specific or instrument-specific group is higher than in a nonspecific group, whereas one would expect the opposite to be true if such differences were important. For example, the classification accuracy of the baroque-keyboard data set is higher compared to the baroque data set, and the classification accuracy of the baroque data set is higher than that of the composers data set. Furthermore, learning from all composers, including mixed genre and instrumentation, resulted in a moderate accuracy of 79 percent, compared to subsets of the scores, genre, and/or instrument data sets, as presented in the classical and baroque data sets.

Classification Performance Using Two-Composer Data Sets

The accuracies of the classification prediction, tested with the SMO classifier for two-composer data sets, are presented in Table 5. Similar comparisons, using only keyboard or strings scores, are presented in Tables 6 and 7, respectively. Most of the accuracies are above 90 percent in all two-composer data sets, whether or not instrument-specific collections are used.

In most cases, the accuracies in the instrument-specific data sets are higher than in the corresponding non-instrument-specific data sets.


Table 7. The Classification Accuracies (Mean ± Standard Deviation) for Two-Composer Data Sets with String Scores, as Tested with the SMO Classifier

            bach           beethoven       corelli         haydn          mozart         vivaldi
bach        -              98.13 ± 2.19    86.25 ± 3.36    93.75 ± 5.10   91.88 ± 3.67   85.31 ± 6.43
beethoven   98.13 ± 2.19   -               100.00 ± 0.00   91.30 ± 1.68   91.88 ± 3.36   97.41 ± 1.56
corelli     86.25 ± 3.36   100.00 ± 0.00   -               98.83 ± 0.78   99.38 ± 1.32   94.26 ± 2.38
haydn       93.75 ± 5.10   91.30 ± 1.68    98.83 ± 0.78    -              73.75 ± 6.78   90.19 ± 6.99
mozart      91.88 ± 3.67   91.88 ± 3.36    99.38 ± 1.32    73.75 ± 6.78   -              91.25 ± 4.11
vivaldi     85.31 ± 6.43   97.41 ± 1.56    94.26 ± 2.38    90.19 ± 6.99   91.25 ± 4.11   -

For example, the Corelli–Vivaldi classification accuracy increased from 85.19 percent with non-instrument-specific scores to 94.26 percent with a strings score data set. The Haydn–Mozart classification accuracy is 62.8 percent for the non-instrument-specific data set, and increased to 75.16 percent and 73.75 percent for the keyboard and strings data sets, respectively. In most cases, however, the increase in accuracy from a non-instrument-specific data set to the corresponding strings-specific data set is lower than the increase to the corresponding keyboard-specific data set. There are some exceptions. For example, the Bach–Beethoven data set achieved accuracies of 96.35 percent, 96.85 percent, and 98.13 percent for non-instrument-specific, keyboard, and strings data sets, respectively. As another example, the Beethoven–Corelli data set achieved accuracies of 97.1 percent, 97.5 percent, and 100 percent for the same respective data sets.

The results are non-genre-specific, because there are no performance advantages or disadvantages for two-composer data sets in the same genre, and the results do not indicate any correlation between the genre of the composers' works and their classification performance. According to the results, Beethoven, who is considered to lie between the classical and romantic eras, has more in common with the romantic composer Chopin than with the classical composers Mozart and Haydn. This is shown by the non-instrument-specific Beethoven–Chopin data set, which achieved a classification accuracy of 84.43 percent, lower than that of the Beethoven–Mozart and Beethoven–Haydn data sets. The baroque data set contains the composers Bach, Corelli, Scarlatti, and Vivaldi, and in most of the composer pairs the accuracies are above 92 percent in instrument-specific and non-instrument-specific data sets. However, in the strings data set the Corelli–Vivaldi classification accuracy increased from 85.19 percent to 94.26 percent, whereas the Bach–Vivaldi accuracy decreased from 94.07 percent to 85.31 percent.

Feature Contribution in Multiple-Composer Data Sets

Feature contribution to classification accuracy was tested in multiple-composer data sets, and is presented in Figures 3 and 4. The results show that some features have a greater contribution and some have a minor contribution. The most significant groups are the pitch class, octave, note duration, and pitch gradient features. All the features are useful because they all contribute to performance in most of the data sets. The pitch trigram features, however, slightly reduced the accuracy in the baroque data set. Note duration, pitch gradient, and pitch range features contribute positively to accuracy in the baroque data sets, but the same features contribute little or negatively in the classical data sets. Note duration and pitch gradient features have more influence on accuracy in the keyboard data set than in the strings data set. As Figure 4 shows, the pitch gradient contributes more to classification accuracy than the pitch trigram does in the keyboard, composers, and major-composers data sets.


Figure 3. The influence of adding features (left to right) on classification accuracy in the baroque and classical data sets.

Figure 4. The influence of adding features (left to right) on classification accuracy in the strings, keyboard, and composers data sets.

Feature Contribution in Two-Composer Data Sets

The contribution of features to classification accuracy was tested in all combinations of two composers, with and without specific instrumentation, as presented in Figures 5, 6, and 7. The feature contribution (positive or negative) to accuracy tends to be inconsistent, showing differences between composer pairs and between instrument types.

Figure 5. The influence of adding features on classification accuracy in all two-composer data sets. Note that the right-hand plot (Reduction in accuracy) has a different scale; it is more "zoomed in" than the left-hand plot (Contribution of features to accuracy).

Those features that contribute to accuracy in a specific data set presumably have unique characteristics for each composer, in contrast to features that reduce the accuracy, which have characteristics that confuse the two composers.

The behavior of the feature contribution to accuracy depends on the instrumentation. For example, all features in Haydn–Mozart with the non-instrument-specific data set contribute to accuracy, except the note duration features, which reduce the accuracy. However, in the keyboard data sets the features that reduce accuracy in Haydn–Mozart are note duration and pitch range, in contrast to the strings data set, where those features contribute to accuracy and where the features that reduce the accuracy are note count and pitch trigram.


Figure 6. The influence of adding features on classification accuracy in all two-composer data sets for the keyboard scores only.

Haydn–Mozart, without the pitch trigram features in the keyboard data set, achieved an accuracy of 76.9 percent.

Features contributed to or reduced the accuracy in two different ways: independently, and depending on other features. Together, the note count and note duration features boost the accuracy more than the sum of the accuracies of each feature alone. Note count without the note duration features in the Bach–Chopin data set contributes 0.3 percentage points to accuracy, and the note duration features without note count contribute 6 percentage points to accuracy. However, note count and note duration features together in the Bach–Chopin data set contribute 7.6 percentage points to accuracy, an improvement of 1.3 percentage points over the sum of the contributions of each feature alone.

The results show that all of the features contribute to classification accuracy in Beethoven–Chopin, except the note duration features, in both instrument-specific and non-instrument-specific data sets. The pitch trigram features contribute 2.9 percentage points to accuracy in the non-instrument-specific data set, and the pitch gradient feature contributes 4.1 percentage points to accuracy in the keyboard data set.

Figure 7. The influence of adding features on classification accuracy in all two-composer data sets for the strings scores only.

Across all two-composer data sets (including instrument-specific ones), pitch classes and octaves are the only feature groups that always increase the accuracy. For all data sets, the most significant features after pitch classes and octaves are the pitch gradient features.

The results also show that feature contributions have no consistent relation to genre or instrumentation. There are no overall patterns of increasing or decreasing accuracy in specific two-composer data sets in keyboard or strings, as shown in Figures 6 and 7.

Table 8 shows the overall contribution to accuracy of each feature, with three cases: (1) averaged across all the data sets, (2) averaged across only the keyboard-specific data sets, and (3) averaged across only the string-specific data sets. This calculation included only the two-composer data sets, not the multiple-composer data sets. It can be seen that, in general, it is better not to use the pitch trigram in the strings data sets, because the overall contribution is a reduction in accuracy. However, for a specific pair of composers, omitting the pitch trigram might not be the best approach.


Table 8. The Average Contribution Accuracies of All Features in the All, Keyboard, and Strings Data Sets

           Pitch Classes   Octaves   Note Count   Note Duration   Pitch Gradient   Pitch Range   Pitch Trigram
All        64.29           21.43     1.25         2.41            3.50             1.20          0.31
Keyboard   66.26           21.28     1.00         2.70            2.45             0.86          0.20
Strings    60.70           27.67     0.00         0.24            2.98             0.90          -0.25

For example, in the Beethoven–Mozart data set, including the pitch trigram increases the accuracy in the strings data set. Thus, it can be useful to consult the detailed results in Figures 3–7, not just the averages in this table, when deciding which features to include in a specific comparison.

Conclusions

In this article we presented a new technique for composer classification, using a set of compositional characteristics as features, evaluated on a data set of compositions. We showed that the pitch class and octave features are the most significant compositional characteristics compared to the other classification features. The next-most significant features are the two best-performing features that were automatically generated by the CHECKUP exploratory classifier tool: note duration and pitch gradient. We showed, for the multiple-composer data sets, that the SimpleLogistic classifier is the best performer of the WEKA classifiers tested.

We succeeded in classifying various multiple-composer data sets and all binary combinations of composers, including data sets with specific and nonspecific genre and instrumentation. We achieved a classification accuracy of 97 percent between Bach and Chopin from their compositions, compared to the nearly 80 percent accuracy achieved by Geertzen and Zaanen (2008). We noticed that classification of multiple composers was not dependent on genre or instrumentation. Most of the accuracies are above 90 percent in all composer pairs, with or without instrumentation distinctions. The classification performance for pairs of composers in instrument-specific data sets is higher than in non-instrument-specific data sets in most cases.

The Haydn–Mozart classification accuracy is 75.16 percent in keyboard scores. We learned that features can increase or decrease the accuracy, either independently of, or depending on, other features. Further investigation needs to be done in order to increase the overall classification accuracy for a large number of composers. In the Melisma Notes file format, as used in the current work, the voices are mixed together in the note list; future research might investigate how the results change if voice information is preserved.

Note that these results concern machine classification and do not necessarily apply to human music perception. In these experiments, pitch class was found to be the most significant feature for accurate discrimination between composers. However, most humans, including musicians, do not possess perfect pitch. Therefore, when guessing the composer of a composition to which they are listening, people rely more on intervals (relative pitch distances) than on pitch classes per se. Transposing a composition upward by a semitone, for example, probably has little impact on human identification of composers, even though it drastically changes the distribution of pitch classes. An expansion of the experiments described herein might encode notes as intervals from a key center (tonic note), and compare classifier results of this representation to results obtained from using only pitch classes.
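The small sketch below illustrates the point about transposition and the proposed interval-from-tonic encoding; the tonic pitch classes are assumed to be known (C for the original fragment, C# for its transposition):

from collections import Counter
from typing import List

def pitch_class_histogram(pitches: List[int]) -> List[float]:
    # Distribution over the 12 pitch classes.
    counts = Counter(p % 12 for p in pitches)
    return [counts.get(pc, 0) / len(pitches) for pc in range(12)]

def interval_from_tonic_histogram(pitches: List[int], tonic_pc: int) -> List[float]:
    # Distribution over the 12 intervals above the tonic pitch class.
    counts = Counter((p - tonic_pc) % 12 for p in pitches)
    return [counts.get(i, 0) / len(pitches) for i in range(12)]

fragment = [60, 62, 64, 65, 67]          # C major fragment, tonic C
transposed = [p + 1 for p in fragment]   # the same fragment a semitone higher, tonic C#

# Pitch-class distributions differ after transposition...
print(pitch_class_histogram(fragment) == pitch_class_histogram(transposed))  # False
# ...but interval-from-tonic distributions are identical.
print(interval_from_tonic_histogram(fragment, 0) == interval_from_tonic_histogram(transposed, 1))  # True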

References

Demsar, J. 2006. "Statistical Comparisons of Classifiers over Multiple Data Sets." Journal of Machine Learning Research 7:1–30.

Dor, O., and Y. Reich. 2007. "CHECKUP: A Feature Generation Rule-Base Learner." In The Israeli Association for Artificial Intelligence Symposium (IAAI07).

Garner, S. R. 1995. "WEKA: The Waikato Environment for Knowledge Analysis." In Proceedings of the New Zealand Computer Science Research Students Conference, pp. 57–64.


Geertzen, J., and M. Zaanen. 2008. "Composer Classification Using Grammatical Inference." In Proceedings of the International Workshop on Machine Learning and Music (MML-2008), pp. 17–18.

Hartmann, K., et al. 2007. "Interactive Data Mining and Machine Learning Techniques for Musicology." In Conference on Interdisciplinary Musicology, pp. 1–8.

Huron, D. 1997. "Humdrum and Kern: Selective Feature Encoding." In E. Selfridge-Field, ed. Beyond MIDI: The Handbook of Musical Codes. Cambridge, Massachusetts: MIT Press, pp. 375–401.

Kranenburg, P. v., and E. Baker. 2004. "Musical Style Recognition—A Quantitative Approach." In Conference on Interdisciplinary Musicology, pp. 106–107.

Laurier, C., et al. 2009. "Exploring Relationships between Audio Features and Emotion in Music." In ESCOM, Conference of European Society for the Cognitive Sciences of Music, pp. 260–264.

Laurier, C., and P. Herrera. 2007. "Audio Music Mood Classification Using Support Vector Machine." In Proceedings of the International Conference on Music Information Retrieval (pages unnumbered).

Manaris, B., et al. 2005. "Zipf's Law, Music Classification, and Aesthetics." Computer Music Journal 29(1):55–69.

Manaris, B., et al. 2008. "Armonique: Experiments in Content-Based Similarity Retrieval Using Power-Law Melodic and Timbre Metrics." In Proceedings of the Ninth International Conference on Music Information Retrieval, pp. 343–348.

Raphael, C. 2008. "A Classifier-Based Approach to Score-Guided Source Separation of Musical Audio." Computer Music Journal 32(1):51–59.

Reich, Y., and S. V. Barai. 1999. "Evaluating Machine Learning Models for Engineering Problems." Artificial Intelligence in Engineering 13(3):257–272.

Temperley, D., and D. Sleator. 1999. "Modeling Meter and Harmony: A Preference Rule Approach." Computer Music Journal 23(1):10–27.

Weihs, C., et al. 2007. "Classification in Music Research." Advances in Data Analysis and Classification 1(3):255–291.

Witten, I. H., and E. Frank. 1999. Practical Machine Learning Tools and Techniques with Java Implementations. San Francisco, California: Morgan Kaufmann.
