Playing Mozart by Analogy: Learning Multi-level Timing and Dynamics Strategies

This article was downloaded by: [University of Auckland Library]On: 05 December 2014, At: 11:34Publisher: RoutledgeInforma Ltd Registered in England and Wales Registered Number: 1072954 Registered office: MortimerHouse, 37-41 Mortimer Street, London W1T 3JH, UK

Journal of New Music ResearchPublication details, including instructions for authors and subscription information:http://www.tandfonline.com/loi/nnmr20

Playing Mozart by Analogy: Learning Multi-levelTiming and Dynamics StrategiesGerhard Widmer & Asmir TobudicPublished online: 09 Aug 2010.

To cite this article: Gerhard Widmer & Asmir Tobudic (2003) Playing Mozart by Analogy: Learning Multi-level Timing andDynamics Strategies, Journal of New Music Research, 32:3, 259-268

To link to this article: http://dx.doi.org/10.1076/jnmr.32.3.259.16860

PLEASE SCROLL DOWN FOR ARTICLE

Taylor & Francis makes every effort to ensure the accuracy of all the information (the “Content”) containedin the publications on our platform. However, Taylor & Francis, our agents, and our licensors make norepresentations or warranties whatsoever as to the accuracy, completeness, or suitability for any purpose ofthe Content. Any opinions and views expressed in this publication are the opinions and views of the authors,and are not the views of or endorsed by Taylor & Francis. The accuracy of the Content should not be reliedupon and should be independently verified with primary sources of information. Taylor and Francis shallnot be liable for any losses, actions, claims, proceedings, demands, costs, expenses, damages, and otherliabilities whatsoever or howsoever caused arising directly or indirectly in connection with, in relation to orarising out of the use of the Content.

This article may be used for research, teaching, and private study purposes. Any substantial or systematicreproduction, redistribution, reselling, loan, sub-licensing, systematic supply, or distribution in anyform to anyone is expressly forbidden. Terms & Conditions of access and use can be found at http://www.tandfonline.com/page/terms-and-conditions

http://www.tandfonline.com/loi/nnmr20

http://dx.doi.org/10.1076/jnmr.32.3.259.16860

http://www.tandfonline.com/page/terms-and-conditions

http://www.tandfonline.com/page/terms-and-conditions

Abstract

The article describes basic research in the area of machinelearning and musical expression. A first step towards auto-matic induction of multi-level models of expressive per-formance (currently only tempo and dynamics) from realperformances by skilled pianists is presented. The goal is tolearn to apply sensible tempo and dynamics “shapes” atvarious levels of the hierarchical musical phrase structure.We propose a general method for decomposing given expres-sion curves into elementary shapes at different levels, and forseparating phrase-level expression patterns from local, note-level ones. We then present a hybrid learning system thatlearns to predict, via two different learning algorithms, bothnote-level and phrase-level expressive patterns, and com-bines these predictions into complex composite expressioncurves for new pieces. Experimental results indicate that theapproach is generally viable; however, we also discuss anumber of severe limitations that still need to be overcomein order to arrive at truly musical machine-generated performances.

1. Introduction

The work described in this article is another step in a long-term research endeavour that aims at building quantitativemodels of expressive music performance via Artificial Intelligence and, in particular, inductive machine learningmethods (Widmer, 2001b). This is to be regarded as basicresearch. We do not intend to engineer computer programsthat generate music performances that sound as human-likeas possible. Rather, our goal is to investigate to what extenta machine can automatically build, via inductive learning

from “real-world” data (i.e., real performances by highlyskilled musicians), operational models of certain aspects of performance, for instance, predictive models of tempo,timing, or dynamics. By analysing the models induced by themachine, we hope to get new insights into fundamental principles underlying the complex phenomenon of ex-pressive music performance, and in this way contribute to thegrowing body of scientific knowledge about performance (see Gabrielsson, 1999, for an excellent overview of currentknowledge in this area).

Previous research has shown that computers can indeedfind and describe interesting and useful regularities at thelevel of individual notes. Using a new machine learning algo-rithm (Widmer, 2001a), we succeeded in discovering a smallset of simple, robust, and highly general rules that predict asubstantial part of the note-level choices of a performer (e.g.,whether she will shorten or lengthen a particular note) withhigh precision (Widmer, 2002). However, it became equallyclear (actually, it was clear from the outset) that this low levelof single notes is far from sufficient as a basis for a completemodel of expressive performance, and that these note-levelmodels must be complemented with models of expression at higher levels of musical organization, e.g., the level ofphrases.

The work presented here is a first preliminary step in thisdirection. We describe a system that learns to predict ele-mentary tempo and dynamics “shapes” at different levels ofthe hierarchical musical phrase structure, and combines thesepredictions with local timing and dynamics effects predictedby learned note-level models. To do this, the learning systemmust first be able to decompose given expression curves intoelementary patterns that can be associated with individual

Accepted: 21 March, 2003

Correspondence: Gerhard Widmer, Dept. of Medical Cybernetics and Artificial Intelligence, University of Vienna, Freyung 6/2, A-1010Vienna, Austria. E-mail: [email protected]

Playing Mozart by Analogy: Learning Multi-level Timing and

Dynamics Strategies

Gerhard Widmer1,2 and Asmir Tobudic2

1Department of Medical Cybernetics and Artificial Intelligence, University of Vienna, Vienna, Austria; 2Austrian ResearchInstitute for Artificial Intelligence, Vienna, Austria

Journal of New Music Research 0929-8215/03/3203-259$16.002003, Vol. 32, No. 3, pp. 259–268 © Swets & Zeitlinger

Dow

nloa

ded

by [

Uni

vers

ity o

f A

uckl

and

Lib

rary

] at

11:

34 0

5 D

ecem

ber

2014

260 Gerhard Widmer and Asmir Tobudic

phrases at different phrase levels, in order to obtain mean-ingful training examples for phrase-level learning, and toseparate phrase-level effects from local note-level effects(which will be learned by a separate learning algorithm).Likewise, we need a strategy for combining expressiveshapes predicted at different levels into one final compositeexpression curve.

In the following, we describe our current solution to the problems of expression curve decomposition and re-combination and present a first prototype system that com-bines two types of learning algorithms: a simple nearestneighbor algorithm that predicts phrase-level expressiveshapes in new pieces by “analogy” to shapes identified insimilar phrases in other pieces, and a rule learning algorithmthat learns prediction rules for note-level effects from the“residuals” that cannot be attributed to the phrase structureby the expression decomposition algorithm. Systematicquantitative experiments with performances of various sec-tions of Mozart piano sonatas show that the approach isviable in principle. We will also look at two specific musicalexamples in a bit more detail. The second one of these exam-ples, where the learning computer fails to produce a musi-cally sensible result, will give insight into some of theproblems associated with our approach. These and other limitations of our current system will be discussed in the final section of the article.

2. Multilevel decomposition of expression curves

Input to our learning system are the scores of musical piecesplus measurements of the tempo and dynamics variationsapplied by a pianist in a particular performance. These variations are given in the form of tempo and dynamicscurves and represent the local tempo and the relative loud-ness of each melody note of the piece, respectively. Both tempo and loudness are represented as multiplicativefactors, relative to the average tempo and dynamics of thepiece. For instance, a tempo indication of 1.5 for a notemeans that the note was played 1.5 times as fast as theaverage tempo of the piece, and a loudness of 1.5 means thatthe note was played 50% louder than the average loudness of all melody notes.

In addition, the system is given information about thehierarchical phrase structure of the pieces, currently at fourlevels of phrasing. Phrase structure analysis is currently doneby hand, as no reliable algorithms are available for this task.

Given an expression (dynamics or tempo) curve, thelearner is first faced with the problem of extracting the train-ing examples for phrase-level and note-level learning. Thatis, the complex curve must be decomposed into basic expres-sive “shapes” that represent the most likely contribution ofeach phrase to the overall expression curve.

As approximation functions to represent these shapes wedecided to use the class of second-degree polynomials (func-

tions of the form y = ax2 + bx + c), because there is ampleevidence from previous research that high-level tempo anddynamics are well characterized by quadratic or parabolicfunctions (Kronman & Sundberg, 1987; Repp, 1992; Todd,1992). Decomposing a given expression curve is an iterativeprocess, where each step deals with a specific level of thephrase structure: for each phrase at a given level, we computethe polynomial that best fits the part of the curve that corre-sponds to this phrase, and “subtract” the tempo or dynamicsdeviations “explained” by the approximations. The curve thatremains after this “subtraction” is then used in the next levelof the process. We start with the highest given level of phras-ing and move to the lowest. The rudimentary expressioncurve left after all levels of phrase approximations have beensubtracted is called the residual curve (Windsor & Clarke,1997).

As by our definitions, tempo and dynamics curves are listsof multiplicative factors, “subtracting” the effects predictedby a fitted curve from an existing curve simply means divid-ing the y values on the curve by the respective values of theapproximation curve.

More formally, let Np = {n1, . . . , nk} be the sequence ofmelody notes spanned by a phrase p, Op = {onsetp(ni) :ni ŒNp} the set (sequence) of relative note positions of these noteswithin phrase p (on a normalized scale from 0 to 1), and Ep = {expr(ni) :ni Œ Np} the part of the expression curve (i.e., tempo or dynamics values) associated with these notes.Fitting a second-order polynomial onto Ep then meansfinding a function fp(x) = ax2 + bx + c such that

is minimal.Given an expression curve (i.e., sequence of tempo or

dynamics values) Ep = {expr(n1), . . . , expr(nk)} over a phrasep, and an approximation polynomial fp(x), “subtracting” theshape predicted by fp(x) from Ep then means computing thenew curve

The final curve we obtain after the fitted polynomials at all phrase levels have been “subtracted” is called the residual of the expression curve.

To illustrate, Figure 1 shows the dynamics curve of thelast part (mm.31–38) of the Mozart Piano Sonata K.279 (Cmajor), first movement, first section. The four-level phrasestructure our music analyst assigned to the piece is indicatedby the four levels of brackets at the bottom of each plot. Thefigure shows the stepwise approximation of the performancecurve by polynomials at these four phrase levels. The greyline in part (e) of the figure shows how much of the originalcurve is accounted for by the four levels of approximations,and part (f ) shows the residual that is not explained by thehigher-level patterns and will be submitted to a rule learnerfor note-level learning.

E r n f onset n i kp i p p i¢ = ( ) ( )( ) ={ }exp : . . . .1

D f x N f onset n r np p p p i in Ni p

( )( ) = ( )( ) - ( )[ ]ŒÂ, exp

2

Dow

nloa

ded

by [

Uni

vers

ity o

f A

uckl

and

Lib

rary

] at

11:

34 0

5 D

ecem

ber

2014

Playing Mozart by analogy 261

3. Learning to predict tempo and dynamics

Given expression curves decomposed into levels of phrasalshapes (approximation polynomials) and a residual curve, weapply a two-level learning strategy to these training exam-ples. Phrase shapes for phrases in new pieces are predictedby a standard nearest-neighbor learning algorithm (seesection 3.1), and the residuals are fed into an inductive rulelearning algorithm that induces rules that predict low, note-level deviations (section 3.2). For prediction in new pieces,note-level and phrase-level predictions are then combined ina straightforward way (section 3.3).

3.1 Phrase-level learning via nearest neighbor prediction

Given a set of training performances with tempo and dynam-ics curves decomposed into phrasal shapes and residuals asdescribed above, a straightforward Nearest Neighbor learn-ing algorithm with one neighbor (Duda & Hart, 1967) is usedto predict phrase shapes (polynomials) for phrases in newpieces. Given a phrase in a new piece, the algorithm searchesits memory for the most similar phrase in the known pieces(at the same phrase level) and predicts the polynomial asso-ciated with this phrase as the appropriate shape for the newphrase.

The similarity between phrases is computed as the inverseof the standard Euclidean distance between the new targetphrase and a phrase retrieved from memory. For the moment,phrases are represented simply as fixed-length vectors ofattribute values, where the attributes describe very basicphrase properties like the length of a phrase, melodic inter-vals between the starting and ending notes, information aboutwhere the highest melodic point (the “apex”) of the phraseis, the harmonic progression between start, apex, and end,whether the phrase ends with a cadential chord sequence, etc.Given such a fixed-length representation, the definition of theEuclidean distance is trivial.

We have decided to use only the one nearest neighbor forprediction (instead of performing general k-NN, with k > 1),because what is predicted is not a scalar value, but a tripleof values (the three parameters a, b, c of an approximationpolynomial y = ax2 + bx + c), where it is not quite clear howseveral predictions would be combined. Also, an obviousdrawback of nearest neighbor algorithms is that they do notproduce explicit, interpretable models – they make predic-tions, but they do not describe the data and the target classes.As a next research step, we plan to investigate the utility ofother inductive learning algorithms for phrase-level learning,so that we will also get interpretable models that we can learnsomething from.

3.2 Rule-based learning of “residuals”

As Figure 1 shows quite clearly, the quadratic phrasal func-tions tend to reconstruct the larger trends in a performance

curve quite well, but they cannot describe all the detailedlocal nuances added by a pianist, e.g., the emphasis on par-ticular notes. Local nuances will be left over in what we callthe residuals – the tempo and dynamics fluctuations leftunexplained by the phrase-level shapes (see level (f) of Fig. 1). We would like to induce a model also of these localexpressive choices.

Actually, the residuals can be expected to represent amixture of noise and meaningful or intended local deviations(Windsor & Clarke, 1997). To learn reliable rules for pre-dicting note-level expressive actions, we need a learningalgorithm that is capable of effectively distinguishingbetween signal and noise. Nearest neighbor algorithms arenot particularly suitable here. Instead, we have chosen to usePLCG (Widmer, 2001a), a new inductive rule learning algo-rithm that has been shown to be highly effective in discov-ering reliable, robust rules from complex data where only apart of the data can actually be explained. PLCG also has theadvantage that it learns explicit sets of prediction rules, sothat we will get interpretable models at least at the note level.

PLCG learns sets of classification rules for discrete clas-sification problems. In order to apply it to the residual learn-ing problem, we need to define discrete target classes. Thesimple solution adopted here, which turns out to work suffi-ciently well, is to assign all expression values above 1.0 to aclass above1 and all others to class below1. The trainingexamples at the residual level are single notes, described viaa set of attributes that represent both intrinsic properties(such as scale degree, duration, metrical position) and someaspects of the local context (e.g., melodic properties like thesize and direction of the intervals between the note and itspredecessor and successor notes, and rhythmic propertieslike the durations of surrounding notes and some abstractionsthereof). An example of the kinds of rules that PLCG dis-covered under these definitions is shown in section 5 below.

To be able to predict numeric note-level expressionvalues, PLCG has been extended with a numeric learningmethod – again, a nearest-neighbor algorithm: all the train-ing examples (notes) covered by a learned rule are storedtogether with the rule. When predicting an expression valuefor a new note in a new test piece, PLCG first finds a match-ing rule to decide what category to apply, and then performsa k-NN search among the training examples stored with thatrule, to find the k (currently 3) notes most similar to thecurrent one. The numeric tempo or dynamics value predictedfor the new note is then a distance-weighted average of thevalues associated with the k most similar notes.

3.3 Combining phrase-level and note-level predictions

As noted above, the numeric values that make up our per-formance curves are to be interpreted as multiplicativefactors. Applying multi-level predictions made for a newpiece by the phrase-level and note-level learners is thusstraightforward – it is simply the inverse of the curve decom-position problem. Given a new piece to produce a perfor-

Dow

nloa

ded

by [

Uni

vers

ity o

f A

uckl

and

Lib

rary

] at

11:

34 0

5 D

ecem

ber

2014


mance for, the system starts with an initial “flat” expressioncurve (i.e., a list of 1.0 values) and then successively multi-plies the current value by the phrase-level predictions and thenote-level prediction.

Formally, for a given note ni that is contained in mhierarchically nested phrases pj, j = 1 . . . m, the expression(tempo or dynamics) value expr(ni) to be applied to it is com-puted as

where predPLCG(ni) is the note-level prediction of tempo ordynamics made by the “residual rules” learned by PLCG, andfpj

is the approximation polynomial predicted as being bestsuited for the jth-level phrase pj by the nearest-neighbor learn-ing algorithm.

4. Experiments and quantitative results

4.1 Data

In the following, we briefly present some experiments withour new approach. The data used for the experiments were

expr n pred n f onset ni PLCG i pj

m

p ij j( ) = ( ) ¥ ( )( )

=’

1

,

derived from performances of Mozart piano sonatas by aViennese concert pianist on a Bösendorfer SE 290 computer-controlled grand piano. The measurements obtained from thepiano permit the exact calculation of the tempo and dynamicscurves corresponding to these performances.

A manual phrase structure analysis (and harmonic analy-sis) of some sections of these sonatas was carried out by amusicologist.1 Phrase structure was marked at four hierar-chical levels. The resulting set of annotated pieces availablefor our experiment is summarized in Table 1. The pieces andperformances are quite complex and different in character;automatically learning expressive strategies from them is achallenging task.

4.2 Systematic quantitative evaluation

A systematic leave-one-piece-out cross-validation experi-ment was carried out to quantitatively assess the resultsachievable with our approach. Each of the 16 sections that

Fig. 1. Multilevel decomposition of dynamics curve of performance of Mozart Sonata K.279:1:1, mm.31–38. Level (a): original dynamicscurve (black) plus the second-order polynomial giving the best fit at the top phrase level (grey); levels (b–d) each show, for successively lowerphrase levels, the dynamics curve after “subtraction” of the previous approximation, and the best-fitting approximations at this phrase level;Level (e): “reconstruction” (grey) of the original curve by the four levels of polynomial approximations; level (f ): residual after all higher-level shapes have been subtracted.

1 The analysis was performed on the basis of the printed score only,without reference to a particular performance.

Dow

nloa

ded

by [

Uni

vers

ity o

f A

uckl

and

Lib

rary

] at

11:

34 0

5 D

ecem

ber

2014

Table 2. Results, by sonata sections, of cross-validation experiment. Measures subscripted with D refer to the “default” (mechanical, inexpressive) performance, those with L to the performance produced by the learner.

dynamics tempo

MSED MSEL MAED MAEL CorrL MSED MSEL MAED MAEL CorrL

kv279:1:1 0.0383 0.0411 0.1643 0.1544 0.6212 0.0348 0.0406 0.1220 0.1479 0.3550kv279:1:2 0.0318 0.0737 0.1479 0.1975 0.4204 0.0244 0.0335 0.1004 0.1327 0.2984kv280:1:1 0.0313 0.0266 0.1432 0.1226 0.7080 0.0254 0.0192 0.1053 0.1032 0.5821kv280:1:2 0.0281 0.0491 0.1365 0.1642 0.4711 0.0250 0.0304 0.1074 0.1232 0.4010kv280:2:1 0.1558 0.0831 0.3498 0.2002 0.7168 0.0343 0.0187 0.1189 0.1079 0.7518kv280:2:2 0.1424 0.0879 0.3178 0.2235 0.6980 0.0406 0.0431 0.1349 0.1400 0.5128kv280:3:1 0.0334 0.0134 0.1539 0.0916 0.7765 0.0343 0.0244 0.1218 0.1136 0.5813kv280:3:2 0.0226 0.0728 0.1231 0.2089 0.4590 0.0454 0.0418 0.1365 0.1327 0.3953kv282:1:1 0.1126 0.0465 0.2792 0.1721 0.7667 0.0295 0.0315 0.1212 0.1160 0.4222kv282:1:2 0.0920 0.0521 0.2537 0.1782 0.6976 0.0227 0.0421 0.1096 0.1477 0.3460kv282:1:3 0.1230 0.0613 0.2595 0.2105 0.7200 0.1011 0.0583 0.2354 0.1815 0.6676kv283:1:1 0.0283 0.0234 0.1423 0.1194 0.6007 0.0183 0.0274 0.0918 0.1193 0.2441kv283:1:2 0.0371 0.0520 0.1611 0.1629 0.4406 0.0178 0.0275 0.0932 0.1208 0.1948kv283:3:1 0.0404 0.0320 0.1633 0.1323 0.6030 0.0225 0.0214 0.1024 0.1085 0.4460kv283:3:2 0.0417 0.0402 0.1676 0.1466 0.5336 0.0238 0.0254 0.1069 0.1150 0.2948kv332:2 0.0919 0.0844 0.2554 0.2370 0.5475 0.0286 0.0416 0.1110 0.1520 0.2787

Mean: 0.0657 0.0525 0.2012 0.1701 0.6113 0.0330 0.0329 0.1199 0.1289 0.4232


make up the 8 movements was once set aside as a test piece,while the remaining 15 pieces were used for learning. Thelearned phrase-level and note-level predictions were thenapplied to the test piece, and the following measures werecomputed: the mean squared error of the system’s predic-tions on the piece relative to the actual expression curve pro-duced by the pianist (MSE = Sn

i=1(pred(ni) - expr(ni))2/n), the

mean absolute error (MAE = Sni=1|pred(ni) - expr(ni)|/n), and

the correlation between predicted and “true” curve. MSEparticularly punishes curves that produce a few extreme“errors” (deviations from what the pianist actually does).MSE and MAE were also computed for a default curve thatwould correspond to a purely mechanical, unexpressive per-formance, i.e., a performance curve consisting of all 1’s. That

allows us to judge if learning is really better than just doingnothing. The results of the experiment are summarized inTable 2, where each line gives the results obtained on therespective test piece when all others were used for training.

As can be seen, the results are mixed. We are interestedin cases where the relative errors (i.e., MSEL/MSED andMAEL/MAED) are less than 1.0, that is, where the curves pre-dicted by the learner are closer to the pianist’s actual perfor-mance than a purely mechanical rendition. In the dynamicsdimension, this is the case in 11 out of 16 cases for MSE,and in 12 out of 16 for MAE. Tempo seems not as well pre-dictable: only in 6 out of 16 cases (both w.r.t. MSE and MAE)does learning produce an improvement over a mechanicalperformance, at least in terms of these purely quantitative,unmusical measures. Also, the correlations vary between0.78 (kv280:3:1, dynamics) and only 0.19 (kv283:1:2,tempo).

Averaging over all 16 experiments, it seems that dynam-ics seems learnable under this scheme to some extent (therelative errors being RMSE = 0.799 and RMAE = 0.845),while tempo seems hard to predict in this way (RMSE ª 1,RMAE = 1.075). The correlations are quite high in mostcases.

The results can be improved if we split this set of ratherdifferent pieces into more homogeneous subsets, andperform learning within these subsets. For instance, separat-ing the pieces into fast and slow ones and learning in eachof these sets separately considerably increases the number ofcases where learning produces improvement over no learn-ing – again, especially in the domain of dynamics; temporemains a problem. Table 3 summarizes the results in termsof wins/losses between learning and no learning.

Table 1. Sonata movements used in experiments (notes refers to“melody” notes).

sonata section notes phrases at level

1 2 3 4

K.279:1 fast 4/4 1029 129 55 23 10K.280:1 fast 3/4 996 107 53 29 10K.280:2 slow 6/8 248 60 30 14 7K.280:3 fast 3/8 656 68 48 21 9K.282:1 slow 4/4 409 57 24 12 6K.283:1 fast 3/4 807 112 55 23 11K.283:3 fast 3/8 884 132 77 31 9K.332:2 slow 4/4 477 49 23 12 4

Total: 5506 714 365 165 66

Dow

nloa

ded

by [

Uni

vers

ity o

f A

uckl

and

Lib

rary

] at

11:

34 0

5 D

ecem

ber

2014


Although also the tempo can be predicted quite well insome pieces, the tempo results in general seem quite disap-pointing. But a closer analysis reveals that part of these ratherpoor results for tempo can be attributed to problems with the quadratic approximations. It turns out that quadratic or parabolic approximations might not be as suitable fordescribing expressive timing as has hitherto been believed.2

When we look at how well the four-level decompositions(without the residuals) reconstruct the respective trainingcurves,3 we find that the dynamics curves are generally betterapproximated by the four levels of polynomials than thetempo curves. The overall figures are given in Table 4. Thedifference between tempo and dynamics is quite dramatic.This phenomenon definitely deserves more detailed investigations.

Generally, we must keep in mind that our current repre-sentation for phrases is extremely limited: characterizingphrases via a small number of global attributes does not givethe learner access to the detailed contents of a phrase. Theresults might improve substantially if we had a better repre-sentation. Extensive studies in this direction are currentlyplanned.

Another question of interest is whether the learning ofnote-level rules from the residuals contributes anything tothe results. When we disable note-level learning and only usephrase-level learning for predicting expression curves, theresults are as shown in Table 5. Comparing this to Table 2,we note that the note-level rules do indeed improve thequality of the results, both in terms of error and correlation.The improvement may be slight in quantitative terms, but listening tests show that the predicted residuals contributeimportant audible effects that improve the musical quality ofthe resulting performances.

5. Musical results: two examples

It is instructive to look at the performance curves producedby the learning system, and to listen to the resulting “expres-sive” performances. The quality varies strongly, passages thatare musically sensible are sometimes followed by ratherextreme errors, at least in musical terms. One incorrect shapecan seriously compromise the quality of a composite perfor-mance curve that would otherwise be perfectly musical.

In the following, we look at two exemplary cases: a casewhere the learner produces quite sensible musical results,and a case of failure, which we will analyze in a bit moredetail.

5.1 A case of success

Figure 2 shows a case where prediction worked quite well,especially concerning the higher-level aspects. Some of thelocal patterns were also predicted quite well, while otherswere obviously missed. The piece from which this passagewas taken – the first section of the third movement of the Mozart piano sonata K.280, F major – is included as a sound example in the journal’s electronic appendix(http://www.swets.nl/jnmr/jnmr.html). For comparison, theelectronic appendix also contains a purely mechanical, inex-pressive version produced directly from the score.

Listening to these sound examples, the reader will findthat the system’s performance sounds quite lively and notuninteresting, with a number of quite musical developmentsboth in tempo and dynamics, and with a closing of the pieceby a nice final ritard – but also with some obvious mistakes,of course.

Table 3. Summary of wins vs. losses between learning and nolearning; + means curves predicted by the learner better fit thepianist than a flat curve (i.e., relative error <1), – means the opposite.

dynamics tempo

Learning from all pieces:MSE 11+/5- 6+/10-MAE 12+/4- 6+/10-

Learning from slow and fast pieces separately:MSE 14+/2- 8+/8-MAE 14+/2- 8+/8-

Table 4. Summary of fit of four-level polynomial decompositionon the training data. Measures subscripted with D refer to the“default” (mechanical, inexpressive) performances (repeated fromTable 2), those with P to the fit of the curves reconstructed by thepolynomial decompositions.

MSED MSEP MAED MAEP CorrP

dynamics 0.0657 0.0055 0.2012 0.0523 0.9456tempo 0.0330 0.0127 0.1199 0.0720 0.7421

2The problems of fitting parabolic curves to actual performanceshave also been noted in Windsor and Clarke (1997) and Clarke andWindsor (2000).3That is, we do not look at the performance of the learning system,but only at the effectiveness of approximating a given curve by fourlevels of quadratic functions.

Table 5. Results of learning at phrase-levels only (i.e., withoutresidual predictions).

MSED MSEL MAED MAEL CorrL

dynamics 0.0657 0.0533 0.2012 0.1718 0.6027tempo 0.0330 0.0339 0.1199 0.1308 0.3877

Dow

nloa

ded

by [

Uni

vers

ity o

f A

uckl

and

Lib

rary

] at

11:

34 0

5 D

ecem

ber

2014


It should be kept in mind that this is purely a result ofautomated learning. Only tempo and dynamics were shapedby the system. Articulation and pedalling are simply ignored,so the result cannot be expected to sound truly pianist-like.Also, grace notes and other ornaments are currently insertedvia an extremely simple and crude heuristic and should bemade to sound much more musical, depending on the context(see, e.g., Timmers et al., 2002). The only other effect thatwas added was a simple dynamic enhancement of themelody: melody notes were made to be 20% louder than the rest, in order to make the melody more clearly audible.This factor is roughly consistent with empirical results of arecent study on melody dynamics and melody lead (Goebl,2001).

We chose this sonata section because it is, by our musicaljudgment, the best performance by the learner of any of the16 test pieces, although it is not the best in terms of numericerror or correlation. Not surprisingly, numeric error andmusical quality do not always correlate.

With respect to note-level learning, an analysis of the ruleslearned by PLCG from the residuals shows that PLCG indeedseems to discover quite general and sensible principles oflocal timing and dynamics. An example of a rule discoveredby PLCG is

RULE TL4:below1 IFnext_dur_ratio £ 1/3 &dur_next > 1

“Lengthen a note if it is followed by a substantially longernote (i.e., the ratio between its duration and that of the nextnote is £1 :3) and if the next note is longer than 1 beat.”

This kind of principle – slightly delaying a long note thatfollows short ones – has been noted before and indeed hasbeen found to be quite a general principle, not only in Mozartperformances (Widmer, 2001a).

5.2 A case of failure

It is equally, if not more instructive to look at cases of failure.One such case is indicated in Figure 3, which shows thepianist’s and learner’s tempo curves in the first eight bars of thesecond movement of the F major sonate K.332. The respectivesound examples (mechanical and computer-generated ver-sions) are again included in the electronic appendix.

This interpretation produced by the learning algorithmsounds terrible, especially with respect to timing. Closerinspection reveals several reasons for this poor result (apart

Fig. 2. Learner’s predictions for the dynamics curve of Mozart Sonata K.280, 3rd movement, mm.25–50. Top: quadratic expression shapespredicted for phrases at four levels; bottom: composite predicted dynamics curve resulting from phrase-level shapes and note-level predic-tions (grey, without markers) vs. pianist’s actual dynamics (black, with markers).

Dow

nloa

ded

by [

Uni

vers

ity o

f A

uckl

and

Lib

rary

] at

11:

34 0

5 D

ecem

ber

2014


from the fact that this sonata movement is quite different incharacter from all the other pieces in the training set, whichmakes the learner’s task difficult to begin with).

The first point is actually that it does not take many mis-takes on the part of the learner to produce a result of verypoor musical quality. One incorrectly chosen shape (cf. theupward U shape predicted for the level-1 phrase that spansmeasure 3) can completely wreck the musical acceptabilityof a passage.

In addition, Figure 4 shows that K.332:2 is a good exampleof a piece where the tempo curve cannot be well approximatedby phrase-level shapes. For one thing, there are many localtiming deviations – mostly note lengthenings or delays – inthe pianist’s performance that are essential to the musicaleffect in a good performance. These cannot be captured by thephrase-level approximations (but might be accounted for bythe note-level rules learned from the residuals).

But here, in addition, it seems that the phrase structureanalysis might have been performed at too global a level,missing some relevant grouping boundaries that obviouslyguided the performer’s timing. A case in point is phrasenumber three at the lowest level, which covers all of bar three:here, our music analyst chose to treat the entire bar 3 as onephrase – thus maintaining the “phrasing rhythm” set by the

previous two bars – instead of dividing it into two parallelgroups. Given this grouping structure, it is impossible for thecomputer to mimic the pianist’s phrasing, which clearly rein-forces the two-subphrase reading of the passage and couldhave been nicely modelled by two quadratic shapes.

That points to a general problem, namely, the dependencyof our approach on a “correct” or “appropriate” phrase struc-ture analysis, where “appropriate” actually depends on theshape of the performance curves to be approximated – thus,we have a kind of circularity here. Whether one could in fact turn the problem around and use information about thestructure of performance curves to extract the underlyingphrase structure of the music is an extremely interesting open research question.

6. Discussion and future work

In this article, we have presented a two-level approach tolearning both phrase-level and note-level timing and dynam-ics strategies for expressive music performance. Both quali-tative and quantitative analyses show that the approach hassome promise, but also clearly indicate that there are stillsome severe problems that must be solved.

Fig. 3. Learner’s predictions for the tempo curve of Mozart Sonata K.332, 2nd mvt., mm.1–8.5. Top: the shapes predicted for the variousphrases (phrase structure indicated by brackets below); bottom: learner’s predicted tempo curve (grey, without markers) vs. pianist’s actualcurve (black, with markers).

Dow

nloa

ded

by [

Uni

vers

ity o

f A

uckl

and

Lib

rary

] at

11:

34 0

5 D

ecem

ber

2014


One obvious limitation is the propositional attribute-valuerepresentation we are currently using to characterize phrases,which does not permit the learner to refer to details of the internal structure and content of phrases. As a next step,we will look at possibilities of using more expressive repre-sentation languages and related learning algorithms, e.g.,first-order logic and relational learning methods from thearea of Inductive Logic Programming (Lavrac & Dzeroski,1994).

A general problem with nearest neigbor learning is that itdoes not produce interpretable models. As the explicit goalof our project is to contribute new insights to musical per-formance research, this is a serious drawback. Alternativelearning algorithms will have to be investigated.

A more difficult problem is the fact that we are currentlypredicting phrasal shapes individually and independently ofthe shapes associated with (or predicted for) other, relatedphrases, i.e., phrases that contain the current phrase, or arecontained by it. Obviously, this is too simplistic. Shapesapplied at different levels are highly dependent. Predictinginterdependent concepts at different levels of resolution is anew kind of scenario for machine learning, with potentialapplications in many domains, and we are planning to studythis problem in a general way.

Acknowledgements

This research is part of the START programme Y99-INF,financed by the Austrian Federal Ministry for Education,Science, and Culture. The Austrian Research Institute forArtificial Intelligence acknowledges basic financial supportfrom the Austrian Federal Ministry for Education, Science,and Culture. Thanks to Werner Goebl for performing the har-monic and phrase structure analysis of the Mozart sonatas.We are particularly grateful to the pianist Roland Batik forthe permission to use his performances for our investigations.

References

Clarke, E.F., & Windsor, W.L. (2000). Real and simulatedexpression: A listening study. Music Perception, 17, 1–27.

Duda, R., & Hart, P. (1967). Pattern classification and sceneanalysis. New York, NY: John Wiley & Sons.

Gabrielsson, A. (1999). The performance of music. In: D. Deutsch (Ed.), The Psychology of Music (2nd Ed.),501–602. San Diego, CA: Academic Press.

Goebl, W. (2001). Melody lead in piano performance: Expres-sive device or artifact? Journal of the Acoustical Society ofAmerica, 110, 563–572.

Fig. 4. Reconstruction of pianist’s tempo of Mozart K.332:2, mm.1–8, by hierarchy of four phrase shapes. Top: the shapes associated withthe various phrases (phrase structure indicated by brackets below); bottom: reconstructed tempo curve (grey, without markers) vs. pianist’sactual curve (black, with markers).

Dow

nloa

ded

by [

Uni

vers

ity o

f A

uckl

and

Lib

rary

] at

11:

34 0

5 D

ecem

ber

2014


Kronman, U., & Sundberg, J. (1987). Is the musical ritard anallusion to physical motion? In: A. Gabrielsson (Ed.), Actionand Perception in Rhythm and Music, 57–68. Stockholm,Sweden: Royal Swedish Academy of Music No.55.

Lavrac, N., & Dzeroski, S. (1994). Inductive logic programming.Chichester, NY: Ellis Horwood.

Repp, B. (1992). Diversity and commonality in music perfor-mance: An analysis of timing microstructure in Schumann’s“Träumerei.” Journal of the Acoustical Society of America,92, 2546–2568.

Timmers, R., Ashley, R., Desain, P., Honing, H., & Windsor,W.L. (2002). Timing of ornaments in the theme fromBeethoven’s Paisiello variations: Empirical data and a model.Music Perception, 20, 3–33.

Todd, N. McA. (1992). The Dynamics of dynamics: A model of musical expression. Journal of the Acoustical Society ofAmerica, 91, 3540–3550.

Widmer, G. (2001a). Discovering strong principles of expressive music performance with the PLCG rule learning strategy. In Proceedings of the 11th European Conference on Machine Learning (ECML’01), Freiburg.Berlin: Springer Verlag.

Widmer, G. (2001b). Using AI and machine learning to studyexpressive music performance: Project survey and firstreport. AI Communications, 14, 149–162.

Widmer, G. (2002). Machine discoveries: A few simple, robustlocal expression principles. Journal of New Music Research,31, 37–50.

Windsor, W.L., & Clarke, E.F. (1997). Expressive timing anddynamics in real and artificial musical performances: Usingan algorithm as an analytical tool. Music Perception, 15,127–152.

Dow

nloa

ded

by [

Uni

vers

ity o

f A

uckl

and

Lib

rary

] at

11:

34 0

5 D

ecem

ber

2014

Documents

Playing Mozart by Analogy: Learning Multi-level Timing and Dynamics Strategies