Musical Rhythm Transcription Based on Bayesian Piece-Specific Score Models Capturing Repetitions

Eita Nakamura a,b,*, Kazuyoshi Yoshii a,c

a Graduate School of Informatics, Kyoto University, Kyoto 606-8501, Japan
b The Hakubi Center for Advanced Research, Kyoto University, Kyoto 606-8501, Japan

c AIP Center for Advanced Intelligence Project, RIKEN, Tokyo 103-0027, Japan

Abstract

Most work on musical score models (a.k.a. musical language models) for music transcription has focused on describing the local sequential dependence of notes in musical scores and failed to capture their global repetitive structure, which can be a useful guide for transcribing music. Focusing on rhythm, we formulate several classes of Bayesian Markov models of musical scores that describe repetitions indirectly using the sparse transition probabilities of notes or note patterns. This enables us to construct piece-specific models for unseen scores with an unfixed repetitive structure and to derive tractable inference algorithms. Moreover, to describe approximate repetitions, we explicitly incorporate a process for modifying the repeated notes/note patterns. We apply these models as prior musical score models for rhythm transcription, where piece-specific score models are inferred from performed MIDI data by Bayesian learning, in contrast to the conventional supervised construction of score models. Evaluations using the vocal melodies of popular music showed that the Bayesian models improved the transcription accuracy for most of the tested model types, indicating the universal efficacy of the proposed approach. Moreover, we found an effective data representation for modelling rhythms that maximizes the transcription accuracy and computational efficiency.

Keywords: Music transcription; musical language model; Bayesian modelling; symbolic music processing; musical rhythm.

1. Introduction

Music transcription is an actively studied yet unsolved problem in music information processing [1, 2]. One of the goals of music transcription is to convert a musical performance signal into a human-readable symbolic musical score. While recent studies have achieved highly accurate pitch detection methods [3, 11, 33], it is also necessary to transcribe rhythms in order to obtain symbolic music representations [6, 7, 10, 15, 17, 18, 19, 20, 23, 31, 32]. Since there are many logically possible representations of rhythms (including inappropriate ones for transcription) for a given performance [6], using a (musical) score model (a.k.a. musical language model) that describes the prior knowledge of musical scores is a key to solving this problem. Here, we study the problem of the rhythm transcription of monophonic music, which is the task of recognizing score-notated musical rhythms from human-performed MIDI data that contain onset time deviations.

This work is in part supported by JSPS KAKENHI Nos. 15K16054, 16H01744, 16J05486, and 19K20340; JST ACCEL No. JPMJAC1602; the Kyoto University Foundation; and the Kayamori Foundation. The work of EN was supported by the JSPS research fellowship (PD).

* Corresponding author. Email address: [email protected] (Eita Nakamura)

A rhythm transcription method can also be used as a module of audio-to-score music transcription systems, as studied in [17, 20].

A common approach for music transcription is to integrate a score model and a performance/acoustic model to obtain a proper transcription that best fits an input performance signal, similar to the statistical speech recognition method [21]. This is a top-down approach for describing the generative process of musical performance signals using statistical models, in contrast to bottom-up approaches for extracting features from musical performance signals, such as using ratios of inter-onset intervals [13, 25] and using the wavelet transform [30] to represent rhythmic features. A major advantage of the former approach is the potential to utilize statistical machine learning techniques.

In constructing a score model for music transcription, capturing both local and global features of musical scores is considered to be effective. Among the local features, the sequential dependences of musical notes can be used to induce output scores that obey the musical grammar. Among the global features, repetitive structures are commonly found in various styles of music [12, 26] and can be a useful guide for transcription since they can complement the local information of musical scores. This is because, by using a repetitive structure, one can in effect cancel out timing deviations and other "noises" in performances.



Figure 1: Schematic illustrations of (a) a conventional score model and (b) the studied score model. In (a), all target scores are generated from a "generic" model. In (b), a piece-specific score model is generated for each individual score, where a repetitive structure is induced by a sparse distribution.

Conventional score models for music transcription that are designed to capture the local sequential dependence of musical notes include Markov models [10, 19, 20, 22, 23, 24, 31], hidden Markov models (HMMs) [27], recurrent neural networks (RNNs) [28], and long short-term memory (LSTM) networks [34]. However, it is challenging to construct a score model incorporating a repetitive structure for transcription, particularly because the computational costs for imposing extensive constraints on output scores typically become prohibitively large.

Recently, Nakamura et al. [18] proposed a statistical score model incorporating a repetitive structure and derived a tractable algorithm for music transcription based on the model. A prominent feature of this method is that piece-specific score models for individual musical pieces are constructed and subsequently used for transcription, in contrast to the use of a generic score model for all target musical pieces in the conventional methods. The model is formulated as a Bayesian model whereby a piece-specific score model is first generated from a generic score model and a musical score is then generated from the piece-specific model (Fig. 1). This model is similar to the topic model of natural language based on latent Dirichlet allocation (LDA) [5], where a word distribution of an individual text is generated from a generic word distribution of a collection of texts. Instead of a word distribution, a distribution of note patterns (i.e. subsequences of notes corresponding to bars) is considered as a score model for generating musical scores. The process for generating piece-specific models is described as a Dirichlet process [14] with a small concentration parameter, which induces sparse distributions of note patterns. As a consequence of the sparseness of the piece-specific score model, repetitions of note patterns are naturally induced in the generated score. It is important that the piece-specific score model is stochastically generated, and thus the total model can describe unseen scores with an unfixed repetitive structure. Moreover, by combining a process of modifying the generated note patterns, the model can also describe approximate repetitions, which are commonly seen in music practice.

In [18], the efficacy of the score model was tested in the task of the rhythm transcription of monophonic music. It was found that both the Bayesian construction of score models and the note modification process improved the transcription accuracy, showing the potential of the approach. It was also found that the computational costs were too large for practical applications.

The purpose of this study is to find a score model capturing repetitions that can be practically used for rhythm transcription. We construct wider classes of Bayesian score models and conduct a systematic comparative evaluation of these models to find an optimal model in terms of transcription accuracy and computational costs. While repetitions were considered in units of note patterns in [18], it is theoretically possible to consider repetitions in units of notes. We construct Bayesian score models based on the note value Markov model (MM) [31] and the metrical MM [10, 23], which are advantageous for computational efficiency compared to the note pattern MM considered in [18]. We apply the constructed score models to rhythm transcription and conduct evaluations to examine the effect of capturing repetitions and to reveal the best model for the task.

After introducing basic score models for rhythms (note value MM, metrical MM, and note pattern MM) and presenting statistical analyses of the repetitive structure in Sec. 2, we formulate our models in Sec. 3. Bayesian extensions of the three types of Markov models and the integration of note modification processes (note divisions and onset shifts) are formulated there. A rhythm transcription method based on the score models is presented in Sec. 4. Evaluations are carried out in Sec. 5, where the models are compared in terms of the transcription error rate and computation time. We also examine the influence of the parameterization of the relevant hyperparameters. The data and source code used in this study are available at https://bayesianscoremodel.github.io.

2. Repetitive Structure and Sparseness of Piece-Specific Score Models

We here define the form of the music data we address, introduce basic score models, and present statistical analyses that indicate the necessity of considering piece-specific score models in constructing musical score models capturing the repetitive structure of music.

2.1. Musical Score Data

The score data used in this study are extracted from a collection of vocal melodies of popular music. Specifically, we use 99 pieces from the RWC popular music dataset [9], 190 pieces by The Beatles, and 142 contemporary Japanese popular music (J-pop) pieces. To enable comparisons between different models, only pieces in 4/4 time are chosen, only rhythms (note onset positions) are used, the 16th-note length is used as the minimal beat resolution, and each piece is segmented into a half-note length.


Table 1: Musical score data used in this study.

Dataset  Original data       Pieces  Bars    Notes
Train    Beatles             180     16,746  40,913
         RWC                 89      11,449  31,998
         Contemporary J-pop  132     21,152  66,073
         Total               401     49,347  138,583
Test     Beatles             10      924     2,124
         RWC                 10      1,452   3,950
         Contemporary J-pop  10      1,518   4,672
         Total               30      3,894   10,716

[Figure 2 content. (a) A one-bar example: metrical positions 0-7; note pattern 10111000; onset metrical positions 0, 2, 3, 4; note values 2, 1, 1, 4. (b) A longer example with phrase labels A B' B C C' over the first line and A B B C'' C over the second, related by note combination, note division, and onset shift:
Note patterns: |00001010|10001000|10101000|10001010|10001011|00001010|10101000|10101000|10001001|10001010|
Metrical positions: |4 6|0 4|0 2 4|0 4 6|0 4 6 7|4 6|0 2 4|0 2 4|0 4 7|0 4 6|
Note values: |2 2|4 4|2 2 4|4 2 2|4 2 1 5|2 2|2 2 4|2 2 4|4 3 1|4 2 2|]

Figure 2: Representations of musical rhythms. (a) A one-bar example. (b) A longer example. The score is taken from "Yasashii uta (tender song)" by the J-pop band Mr. Children.

In this step, we disregard rests, note onsets in finer beat resolutions, and segments without any note onsets. In addition, for simplicity, if a pair of consecutive note onsets in two segments is more than a half-note length apart, a new onset is added at the beginning of the latter segment so that the maximum note length will be a half-note length. Hereafter, the obtained segments are called bars; that is, pieces are represented in 2/4 time. The relative positions of note onsets in each bar in units of 16th notes are called the metrical positions, and the number of metrical positions N_b (= 8) is called the bar length. We should remark that the time signature, bar length, and the manipulations made in preparing our data were chosen to balance the precision of the representation and the computational efficiency of numerical experiments; the extension for including longer durations and triplet notes is theoretically possible, as discussed in Sec. 5.5. Finally, we randomly split the data into two sets, one for training the models and the other for evaluating them, whose contents are shown in Table 1.

A note pattern (i.e. rhythmic pattern in a bar) can be represented as an N_b-bit binary vector σ_1 σ_2 ··· σ_{N_b}, where σ_b = 1 indicates an onset at metrical position b and otherwise σ_b = 0. The onset time of a note measured from the beginning of a piece in units of 16th notes is called the onset score time. A score-notated note length (difference in onset score times) is called a note value. As shown in Fig. 2, we can use one of three quantities (note value, metrical position, and note pattern) to represent musical rhythms. These three representations are equivalent; however, the metrical position of the first note onset is lost in the note value representation.

We represent a musical score as a sequence of onset score times (τ_n)_{n=0}^{N}, where n indexes the note onsets and N is the number of notes (N + 1 note onsets are necessary to define the lengths of N notes). The onset score times are described in units of 16th-note lengths. In the example of Fig. 2(b), τ_0 = 4, τ_1 = 6, τ_2 = 8, τ_3 = 12, etc. Their absolute values are unimportant, but we assume that τ_n (mod N_b) represents the metrical position b_n ∈ {0, . . . , N_b − 1} of the n th note onset. The note value r_n of the n th note (n = 1, . . . , N) is defined as r_n = τ_n − τ_{n−1}. In the example of Fig. 2(b), b_0 = 4, b_1 = 6, b_2 = 0, b_3 = 4, etc. and r_1 = 2, r_2 = 2, r_3 = 4, etc. Since the maximum note value and N_b are both 8, we can also write r_n = b_n − b_{n−1} if b_n > b_{n−1} and r_n = b_n − b_{n−1} + N_b otherwise. Hereafter, we use the notation τ_{n1:n2} = (τ_n)_{n=n1}^{n2} to indicate a sequence of variables.

2.2. Basic Score Models

A score model is a probabilistic generative model for sequences of onset score times that describes the probability P(τ_{0:N}) or, equivalently, P(b_{0:N}) or P(r_{1:N}). Here we review three basic score models: the note value MM, the metrical MM, and the note pattern MM.

2.2.1. Note Value Markov Models (NoteMM)

The basic first-order note value MM (NoteMM1) [31] is defined with the initial and transition probabilities of note values:

P(r_1 = r) = π_{ini r},  P(r_n = r | r_{n−1} = r′) = π_{r′r}.  (1)

The zeroth-order note value MM (NoteMM0) is defined by using the unigram probabilities P(r_n) instead of the transition probabilities P(r_n | r_{n−1}). The second-order note value MM (NoteMM2) and higher-order models can be constructed by considering transition probabilities of the form P(r_n | r_{n−k}, . . . , r_{n−1}). Since similar methods can be applied for lower-order and higher-order models [16], we describe only the first-order models in what follows.
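As an illustration, a first-order note value MM can be trained by counting transitions over note value sequences. The sketch below (our own, using additive smoothing as in Sec. 5.1) estimates the probabilities of Eq. (1) and scores a sequence; the function and variable names are ours.

```python
import math
from collections import Counter

def train_note_mm1(pieces, values=tuple(range(1, 9)), smooth=0.1):
    """Estimate pi_ini and pi_{r'r} of Eq. (1) by smoothed maximum likelihood.
    pieces: list of note value sequences, each value in 1..N_b."""
    ini = Counter(p[0] for p in pieces)
    trans = Counter((r0, r1) for p in pieces for r0, r1 in zip(p, p[1:]))
    n_ini = sum(ini.values())
    pi_ini = {r: (ini[r] + smooth) / (n_ini + smooth * len(values)) for r in values}
    pi = {}
    for r0 in values:
        row_total = sum(trans[(r0, r1)] for r1 in values)
        pi[r0] = {r1: (trans[(r0, r1)] + smooth) / (row_total + smooth * len(values))
                  for r1 in values}
    return pi_ini, pi

def log_prob(rs, pi_ini, pi):
    """log P(r_1, ..., r_N) under the first-order model."""
    lp = math.log(pi_ini[rs[0]])
    for r0, r1 in zip(rs, rs[1:]):
        lp += math.log(pi[r0][r1])
    return lp
```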

2.2.2. Metrical Markov Models (MetMM)

The basic first-order metrical MM (MetMM1) [10, 23] generates a sequence of metrical positions as

P(b_0 = b) = χ_{ini b},  P(b_n = b | b_{n−1} = b′) = χ_{b′b}.  (2)


The onset score times (τ_n)_{n=0}^{N} and the note values (r_n)_{n=1}^{N} are then given as

τ_0 = b_0,  r_n = τ_n − τ_{n−1} = [b_{n−1}, b_n],  (3)

[b_{n−1}, b_n] := b_n − b_{n−1} if b_{n−1} < b_n;  b_n − b_{n−1} + N_b if b_{n−1} ≥ b_n.  (4)

2.2.3. Note Pattern Markov Models (PatMM)

The note pattern MMs can be described as a generalization of the metrical MMs. We denote a note pattern of metrical positions as

B_k = (B_{k1}, . . . , B_{kI_k}),  (5)

where k is an index in the set of note patterns, I_k denotes the number of note onsets in note pattern B_k, and each B_{ki} (i ∈ {1, . . . , I_k}) indicates a metrical position (B_{ki} < B_{k(i+1)}). For example, the first two note patterns in Fig. 2(b) are represented as (4, 6) and (0, 4). Note that B_{k1} need not be 0: if B_{k1} ≠ 0, it indicates that the first note of pattern B_k is a tied note. This is an extension of the note pattern model used in [18], which cannot describe tied notes.

The basic first-order note pattern MM (PatMM1) is described as a two-level hierarchical MM [8] with a state space indexed by a pair (k, i) of a note-pattern index k and an internal index i ∈ {1, . . . , I_k} of notes in each pattern k. The upper-level model generates a sequence of note patterns as

Γ_{ini k} = P(k_1 = k),  Γ_{k′k} = P(k_m = k | k_{m−1} = k′).  (6)

The lower-level model generates internal positions as

γ^k_{ini i} = P(i_1 = i | k) = δ_{i1},  (7)
γ^k_{i′i} = P(i_{ℓ+1} = i | i_ℓ = i′, k) = δ_{(i′+1)i},  (8)
γ^k_{i end} = P(i_{I_k} = i | k) = δ_{iI_k},  (9)

where δ denotes Kronecker's delta. The initial and transition probabilities of the total model are given as

ψ_{ini (ki)} = P(k_1 = k, i_1 = i) = Γ_{ini k} γ^k_{ini i} = Γ_{ini k} δ_{i1},  (10)
ψ_{(k′i′)(ki)} = P(k_n = k, i_n = i | k_{n−1} = k′, i_{n−1} = i′) = δ_{kk′} γ^k_{i′i} + γ^{k′}_{i′ end} Γ_{k′k} γ^k_{ini i} = δ_{kk′} δ_{i(i′+1)} + δ_{i′I_{k′}} Γ_{k′k} δ_{i1},  (11)

in which case we have b_n = B_{k_n i_n}. The onset score times and note values are given as in Eqs. (3) and (4).
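The transition rule of Eq. (11) says: either advance to the next note inside the current pattern, or, when the previous pattern has ended, jump to the first note of a new pattern. A sketch of this rule (our own, using 0-based internal indices rather than the paper's 1-based ones):

```python
def psi(state_prev, state, Gamma, patterns):
    """Eq. (11) with 0-based internal indices: states are (k, i) pairs.
    patterns[k] is the tuple B_k of metrical positions; Gamma[k_prev][k] is the
    pattern-to-pattern transition probability."""
    k_prev, i_prev = state_prev
    k, i = state
    p = 0.0
    if k == k_prev and i == i_prev + 1:                 # step inside the same pattern
        p += 1.0
    if i_prev == len(patterns[k_prev]) - 1 and i == 0:  # pattern boundary: start a new one
        p += Gamma[k_prev][k]
    return p
```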

2.3. Repetitions and Sparseness

To quantitatively see how a repetitive structure is reflected in the sparseness of piece-specific score models, a distribution (generic model) of note patterns obtained from all the pieces in our training dataset is shown in Fig. 3, together with distributions (piece-specific models) obtained from individual pieces. We use only the training dataset for the analysis in this subsection.

[Figure 3 content: probability (log scale, 10⁻⁵ to 10⁰) vs. note pattern (0-250) for (a) The Beatles: 'For No One' (piece entropy 3.8) and (b) RWC No. 41 (piece entropy 2.0), each against the generic distribution over all 401 pieces (entropy 5.3).]

Figure 3: A generic distribution of note patterns obtained from the training data (background, red) and piece-specific distributions obtained from individual pieces in the same data (foreground, green). The note patterns are ordered according to their probability.

[Figure 4 content: entropy (bits/pattern, 1 to 5.5) vs. pieces (0-400); generic model entropy 5.3; real data mean 3.5; Dirichlet process (α = 11) mean 3.5; finite samples mean 4.6.]

Figure 4: Distribution of the entropies of piece-specific distributions of note patterns, that of sequences sampled from the generic note pattern distribution, and that of distributions generated from the Dirichlet process. The pieces are ordered according to their entropy.

The sparseness of the piece-specific distributions, which can be measured by the (Shannon) entropy, is evident. From the entropies of the piece-specific distributions of all the pieces in Fig. 4, we see that most pieces have a much lower entropy compared to the generic distribution, whose entropy is 5.3.

The sparseness of the piece-specific distributions cannot be explained merely by the statistical fluctuation of finite data. To confirm this, we drew 401 sets of samples (note patterns) from the generic distribution, each of which corresponds to a piece in the data and has the same number of bars (the average number of bars was 123). The entropies calculated from the distributions obtained from the sampled data are also plotted in Fig. 4. The result shows that these distributions still have much higher entropies than the real data, indicating that the sparseness of the real piece-specific distributions originates from the repetitive structure.
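The finite-sample baseline in Fig. 4 can be reproduced in a few lines; the sketch below (our illustration; the inputs are assumed) computes the entropy of an empirical note pattern distribution and the entropy of patterns resampled i.i.d. from a generic distribution.

```python
import math
import random
from collections import Counter

def entropy_bits(counts):
    """Shannon entropy (bits) of the empirical distribution given by a Counter."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c)

def finite_sample_entropy(generic, n_bars, seed=0):
    """Entropy of n_bars patterns drawn i.i.d. from a generic distribution
    (dict: pattern -> probability); models the 'finite samples' baseline of Fig. 4."""
    rng = random.Random(seed)
    pats, probs = zip(*generic.items())
    sample = rng.choices(pats, weights=probs, k=n_bars)
    return entropy_bits(Counter(sample))
```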

As mentioned above, rhythmic patterns can also be described as sequences of metrical positions or note values (Fig. 2), and repetitions at the note pattern level induce repetitions at the note level and vice versa. This indicates that a repetitive structure is reflected in the sparseness of piece-specific note value MMs and metrical MMs, whose main parameters are the transition probabilities of note values and those of metrical positions, respectively. In Figs. 5(a) and 5(b), we plot the distributions of the entropies for these models, both for the real data and for data sampled from the generic model. We can again observe that the entropies of the actual piece-specific transition probabilities are much lower than the entropy of the generic model and that this cannot be explained merely by the finite-sample effect.


[Figure 5 content: entropy (bits/note, 0.5 to 2.5) vs. pieces (0-400). (a) Note values: generic model 2.0; real data mean 1.5; Dirichlet process (α = 7) mean 1.5; finite samples mean 1.8. (b) Metrical positions: generic model 1.9; real data mean 1.4; Dirichlet process (α = 3.5) mean 1.4; finite samples mean 1.9.]

Figure 5: Distributions of the entropies of the piece-specific transition probabilities of (a) note values and (b) metrical positions. The corresponding distributions obtained from data sampled from the generic transition probabilities and those obtained from the transition probabilities generated from the Dirichlet process are also shown. The pieces are ordered according to their entropy.


These results support our assertion that a repetitive structure in music can be useful for transcription from the viewpoint of generative modelling. Lower entropy means a higher predictive ability in generative modelling. Thus, if we can infer the piece-specific score model appropriately, the transcription will be easier due to the model's high predictive ability.

2.4. Note Modifications for Approximate Repetitions

In music, repetitions may appear with modifications, leading to approximate repetitions. The example in Fig. 2(b) illustrates the two most common types of modifications: note division/combination and onset shift [18, 29]. In a note division/combination (or simply division/combination), a note onset is inserted/deleted while the other note onsets are unchanged. These modifications often appear in sung melodies where the number of syllables varies in repeated phrases or sections. In an onset shift (or simply shift), a note onset is shifted forward or backward in time. This is another typical type of rhythmic variation that includes syncopations.

Modelling these modifications for describing musical scores with approximate repetitions can be useful for transcription because by identifying modifications we can induce more repetitions in the data, which can lead to a higher predictive ability. The concept of approximate repetitions is also important for humans to understand or transcribe music [12]. The example in Fig. 2(b) cannot be recognized as repeated phrases if modified note patterns are considered to be completely different ones.

Table 2: List of the constructed score models based on the note value Markov model (MM). In the acronyms, the number indicates the order of the Markov model, 'S' and 'D' indicate onset shift and note division, and 'B' indicates the Bayesian extension.

Acronym      MM order  Bayesian  Modification
NoteMM0      0th       —         —
NoteMM0B     0th       ✓         —
NoteMM1      1st       —         —
NoteMM1B     1st       ✓         —
NoteMM1SB    1st       ✓         Shift only
NoteMM1DB    1st       ✓         Division only
NoteMM1SDB   1st       ✓         Shift & division
NoteMM2      2nd       —         —
NoteMM2B     2nd       ✓         —


Figure 6: The note division operation (left) and the onset shift operation (right).

3. Proposed Score Models

3.1. Overview of the Studied Models

Here, we explain the proposed score models. We construct score models based on three basic models: the note value MM, the metrical MM, and the note pattern MM. Each basic model is extended in a Bayesian manner to describe the repetitive structure and combined with note modification models to describe approximate repetitions. The incorporation of note modifications is described in Sec. 3.2 and the Bayesian extensions are formulated in Sec. 3.3. The constructed models based on the note value MM and their acronyms are summarized in Table 2. Similarly, we construct 9 models (MetMM0, MetMM0B, . . ., MetMM2B) based on the metrical MM and 7 models (PatMM0, PatMM0B, . . ., PatMM1SDB) based on the note pattern MM. We do not consider the second-order note pattern MM and its extensions due to their impractical computational cost.

3.2. Incorporation of Note Modifications

3.2.1. Note Modification Operations

We consider the most common types of note modifications: onset shifts, note divisions, and note combinations (Sec. 2.4). As note divisions and combinations are the opposite of each other, we only consider note divisions in our models. In [18], only onset shifts for notes on bar onsets (i.e. syncopations) were considered. Here, we generalize the idea and consider onset shifts for any notes.


We denote the shift of the onset score time of the n th note as s_n, and the note value and onset score time of the n th note before the application of onset shifts are notated as r̃_n and τ̃_n, respectively (Fig. 6, right). An onset shift is then described as

τ̃_n → τ_n = τ̃_n + s_n  or  r̃_n → r_n = r̃_n + s_n − s_{n−1}.  (12)

The shift variable s_n describes the amount of the shift and takes values in the range [−N_b + 1, N_b − 1] with probability

ξ_s = P(s_n = s).  (13)

We assume the constraint −r̃_n < s_n ≤ r̃_n to make the onset shift interpretable as a modification. Musically, we should also have r_n = r̃_n + s_n − s_{n−1} > 0 for all n.

In the most general case, a note division is described as an operation r̃ → q_h that maps a note value r̃ to a sequence of note values q_h = q_{h1} ··· q_{hG_h} (Fig. 6, left). Here, h labels the patterns of note divisions and G_h is the number of notes after a division. The division probability can be written as

ζ_{r̃h} = P(r̃ → q_h).  (14)

We must have ζ_{r̃h} = 0 unless r̃ = q_{h1} + ··· + q_{hG_h}. Using an index g ∈ {1, . . . , G_h}, the pair of indices (h, g) specifies the note value after applying the note division as r_n = q_{h_n g_n}.

When we combine onset shifts and note divisions, we first apply note divisions and then apply an onset shift to each of the divided notes. The resulting notes are indexed by (r̃, h, g, s), and their note values are given as

r_n = q_{h_n g_n} + s_n − s_{n−1}.  (15)

In this study, we only consider note divisions into two notes, that is, G_h ≤ 2, to avoid impractical computational costs.
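Operationally, the two modifications act on the rhythm representations as follows. This is a sketch of our own; the probabilistic models above score these operations rather than apply them deterministically.

```python
def apply_onset_shift(taus, n, s):
    """Eq. (12): shift the n-th onset score time by s (in 16th notes).
    The caller must keep the resulting note values positive."""
    out = list(taus)
    out[n] += s
    return out

def apply_division(note_values, n, q):
    """Replace the n-th note value by a division pattern q = (q_1, ..., q_G),
    valid only when sum(q) equals the original note value (the zeta constraint)."""
    assert sum(q) == note_values[n], "a division must preserve the total note value"
    return note_values[:n] + list(q) + note_values[n + 1:]

# e.g. dividing the third note of (2, 2, 4) into (2, 2):
assert apply_division([2, 2, 4], 2, (2, 2)) == [2, 2, 2, 2]
```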

3.2.2. Score Models with Note Modifications

Note value MMs can be extended to incorporate note modifications as follows. The model incorporating onset shifts (NoteMM1S) is described by an HMM with a state space indexed by z_0 = s_0 and z_n = (r̃_n, s_n) (n = 1, . . . , N), where r̃_n indicates the note value before modification and s_n indicates the amount of the shift. The initial and transition probabilities are given as

P(z_0 = s) = ξ_s,  P(z_1 = (r̃, s)) = π_{ini r̃} ξ_s,  (16)
P(z_n = (r̃, s) | z_{n−1} = (r̃′, s′)) = π_{r̃′r̃} ξ_s,  (17)

and the final output is obtained from the output probability

P(r_n | z_{n−1}, z_n) = I(r_n = r̃_n + s_n − s_{n−1}),  (18)

where I(C) is 1 when expression C is true and 0 otherwise.

The note value MM incorporating note divisions (NoteMM1D) is described by an HMM with a state space indexed by z = (r̃, h, g), where r̃ indicates the note value before the note division and h and g indicate a note-division pattern and its internal note position, respectively (see Sec. 3.2.1). The initial and transition probabilities are given as

P(z_1 = (r̃, h, g)) = π_{ini r̃} ζ_{r̃h} δ_{g1},  (19)
P(z_n = (r̃, h, g) | z_{n−1} = (r̃′, h′, g′)) = δ_{r̃r̃′} δ_{hh′} δ_{g(g′+1)} + δ_{g′G_{h′}} π_{r̃′r̃} ζ_{r̃h} δ_{g1},  (20)

and the final output is obtained from the output probability

P(r_n | z_n) = I(r_n = q_{h_n g_n}).  (21)

The note value MM incorporating both onset shifts and note divisions (NoteMM1SD) is obtained by combining the above two models. It is described by an HMM with a state space indexed by z_0 = s_0 and z_n = (r̃_n, h_n, g_n, s_n) (n = 1, . . . , N), and the initial and transition probabilities are given as

P(z_0 = s) = ξ_s,  P(z_1 = (r̃, h, g, s)) = π_{ini r̃} ζ_{r̃h} δ_{g1} ξ_s,
P(z_n = (r̃, h, g, s) | z_{n−1} = (r̃′, h′, g′, s′)) = (δ_{r̃r̃′} δ_{hh′} δ_{g(g′+1)} + δ_{g′G_{h′}} π_{r̃′r̃} ζ_{r̃h} δ_{g1}) ξ_s.  (22)

The final output is obtained from the output probability

P(r_n | z_{n−1}, z_n) = I(r_n = q_{h_n g_n} + s_n − s_{n−1}).  (23)

The model parameters of the (non-Bayesian) first-order note value models are summarized as follows. Hereafter we use r instead of r̃ to unify the notation. Let us write π_ini = (π_{ini r})_r, π_r = (π_{rr′})_{r′}, ζ_r = (ζ_{rh})_h, and ξ = (ξ_s)_s.

• NoteMM1: π_ini, (π_r)_r.

• NoteMM1S: π_ini, (π_r)_r, ξ.

• NoteMM1D: π_ini, (π_r)_r, (ζ_r)_r.

• NoteMM1SD: π_ini, (π_r)_r, (ζ_r)_r, ξ.

We can similarly formulate four models (MetMM1, MetMM1S, MetMM1D, and MetMM1SD) based on the metrical MM with or without the note modification process, and four models (PatMM1, PatMM1S, PatMM1D, and PatMM1SD) based on the note pattern MM. See the Supplemental Material for the explicit formulation.

3.3. Bayesian Extensions Based on Dirichlet Processes

We extend the score models in a Bayesian manner to formulate the generative process of piece-specific models from a generic model. We assume that piece-specific models and generic models have the same architecture but different parameterizations. For NoteMM1SD, the generative process can be formulated by assigning prior distributions to the model parameters:

π_ini ∼ DP(α_ini, π̄_ini),  π_r ∼ DP(α_π, π̄_r),  ζ_r ∼ DP(α_ζ, ζ̄_r),  ξ ∼ DP(α_ξ, ξ̄),  (24)


where DP denotes a Dirichlet process [14]; π̄_ini, π̄_r, etc. denote base distributions; and α_ini, α_π, etc. denote concentration parameters. Since the distributions are finite discrete distributions, the Dirichlet process can be described as a Dirichlet distribution. For example,

π_r ∼ DP(α_π, π̄_r) = Dir(α_π π̄_r).  (25)

The expectation value of the distributions generated by a Dirichlet process is equal to the base distribution [14]. Based on this fact, we can interpret the distributions (π_r, ζ_r, etc.) generated by the Dirichlet process as the piece-specific models and the base distributions (π̄_r, ζ̄_r, etc.) as the generic model representing the average of all the piece-specific models. The distributions generated from a Dirichlet process become sparser as we use a smaller concentration parameter. The parameters of the note modification models ζ_r and ξ are also considered for individual pieces to describe the situation in which the note modifications used in each piece have different statistical tendencies.
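Because the distributions are finite, generating a piece-specific model amounts to a Dirichlet draw, as in Eq. (25). A minimal NumPy sketch (ours) shows how a small concentration parameter yields the sparse rows observed in Sec. 2.3:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_piece_specific_row(base_row, alpha):
    """Draw pi_r ~ Dir(alpha * base_row); the expectation of the draw is base_row."""
    return rng.dirichlet(alpha * np.asarray(base_row))

base = np.full(8, 1 / 8)                          # a uniform generic row over N_b = 8 values
sparse = sample_piece_specific_row(base, 1.0)     # low alpha: mass concentrated on few entries
diffuse = sample_piece_specific_row(base, 100.0)  # high alpha: close to the base row
```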

We can similarly construct Bayesian extensions of metrical MMs and note pattern MMs. For metrical MMs, we have

χ_ini ∼ DP(α_ini, χ̄_ini),  χ_b ∼ DP(α_χ, χ̄_b),  (26)
ζ_r ∼ DP(α_ζ, ζ̄_r),  ξ ∼ DP(α_ξ, ξ̄),  (27)

and for note pattern MMs, we have

Γ_ini ∼ DP(α_ini, Γ̄_ini),  Γ_k ∼ DP(α_Γ, Γ̄_k),  (28)
ζ_r ∼ DP(α_ζ, ζ̄_r),  ξ ∼ DP(α_ξ, ξ̄),  (29)

where χ̄_ini, χ̄_b, Γ̄_ini, Γ̄_k, etc. denote the base distributions corresponding to the generic models and α_ini, α_χ, α_Γ, etc. denote the corresponding concentration parameters.

In Fig. 4, the entropies of the distributions of note patterns (zeroth-order note pattern MMs) generated by a Dirichlet process are shown, where the base distribution is the generic model and the concentration parameter is adjusted to match the average entropy of the real data. Although the obtained distribution of entropies does not match the actual distribution very well, these distributions overlap much more than the actual distribution does with the entropy distribution of the data sampled from the generic model. Similarly, in Figs. 5(a) and 5(b), we plot the distributions of the entropies for the piece-specific note value MMs and metrical MMs obtained from the Dirichlet processes. We see that the Dirichlet processes can generate transition probabilities with entropies similar to those of the actual piece-specific transition probabilities. These results indicate that we can use small concentration parameters to account for the sparse piece-specific transition probabilities in the real data.

4. Rhythm Transcription Method

The aim of a rhythm transcription method is to estimate the onset score times τ_{0:N} (or note values r_{1:N}) from input MIDI data with note onset times t_{0:N} (or durations d_{1:N}, where d_n = t_n − t_{n−1}) that are represented in units of seconds. We first construct a musical performance generative model that describes the probability P(τ_{0:N}, t_{0:N}) or P(r_{1:N}, d_{1:N}) by combining a score model (one of the aforementioned models) and the performance model explained in Sec. 4.1. We then derive a rhythm transcription algorithm that estimates the score by maximizing the probability P(τ_{0:N} | t_{0:N}) or P(r_{1:N} | d_{1:N}), which is explained in Sec. 4.3. For the Bayesian score models, the model parameters of the piece-specific model are learned in the transcription step, as described in Sec. 4.2.

4.1. Performance Model

A performance (timing) model describes the generative process of durations d_{1:N} from note values r_{1:N}, which gives the probability P(d_{1:N} | r_{1:N}). We consider a simplified model with a constant (inverse) tempo v. Given the note value r_n of the n th note, the corresponding duration d_n is generated as

d_n ∼ Gauss(v r_n, σ_t²),  (30)

where Gauss(µ, Σ) denotes a Gaussian distribution with mean µ and variance Σ, and σ_t represents the amount of onset time deviation. The onset times t_{0:N} are determined by the durations d_{1:N} up to the initial time t_0, which is insignificant.

In a real music transcription situation, the global tempo is unknown and the tempo changes over time, especially in solo performances and classical music performances. The performance model can be generalized to describe these situations by introducing an additional process that generates a sequence of local tempos v_n [19]. To focus on the effect of the musical score models, we only consider a constant and known tempo in this study.

4.2. Parameter Learning

For non-Bayesian score models, the model parameters are pretrained or preset and are fixed during the transcription step. The initial and transition probabilities can be directly learned from musical score data. The probabilities ζ̄_r and ξ̄ of the modification models cannot be directly learned from musical score data, and thus they are manually preset to reflect the belief that modifications are rare. Specifically, we assign probabilities ξ̄_0 and ζ̄_0, which are slightly less than unity, to the cases without modifications, and the other entries in ζ̄_r and ξ̄ are assumed to have uniform probabilities.

For Bayesian models, the hyperparameters of the prior distributions, except for the concentration parameters, are pretrained or preset to the parameterization of the corresponding non-Bayesian models. The concentration parameters can in principle be optimized according to the transcription accuracy. In our evaluation in Sec. 5.2, they


are preset to a fixed value, and how the values of the concentration parameters for the unigram or transition probabilities influence the transcription accuracy is studied in Sec. 5.4.

The other parameters of the Bayesian models, namely the parameters and internal variables of a piece-specific score model, are treated as variables and estimated in the transcription step according to the input MIDI data. Let X = d_{1:N} (or t_{0:N}) denote the input MIDI data, Z denote the set of internal variables of a score model including note values, Θ denote the set of parameters of the piece-specific score model, and Ξ denote the set of hyperparameters. For NoteMM1S, for example, Z = (s_0, r̃_1, s_1, . . . , r̃_N, s_N), Θ = (π_ini, π_r, ξ), and Ξ = (α_ini, π̄_ini, α_π, π̄_r, α_ξ, ξ̄). The generative process can be summarized as

P(X, Z, Θ | Ξ) = P(X|Z) P(Z|Θ) P(Θ|Ξ).  (31)

Here, the first factor on the right-hand side represents the performance model (Sec. 4.1), the second factor represents the (non-Bayesian part of the) score model (Secs. 2.2 and 3.2), and the last factor represents the prior distributions (Sec. 3.3). Parameters Θ of the piece-specific score model, reflecting both the piece-specific distribution of the rhythmic patterns inferred from the observation X and the prior distribution, which induces the sparseness of Θ, can be obtained by maximizing Eq. (31). Since an analytical solution to this inference problem is impossible, as is typically the case for Bayesian models with latent variables, we use the Gibbs sampling method to estimate Θ [4]. Using the Gibbs sampling method, we can draw samples from the posterior probability P(Z, Θ | X, Ξ). After some iterations, we obtain the optimal parameters Θ that maximize the likelihood P(X|Θ) = Σ_Z P(X|Z) P(Z|Θ) among the sampled parameters. For a simple case (the first-order metrical MM), an explicit algorithm using Gibbs sampling is presented in Appendix A. Although the algorithmic details of the Gibbs sampling method are important for the implementation of the models, they are derived by means of standard techniques and are not essential for understanding the main results of this study. Thus, for the other models, we present those details together with the explicit formulations of the integrated models in the Supplemental Material.
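Schematically, the estimation alternates between sampling the latent variables and the parameters. The skeleton below is our paraphrase of this procedure, not the paper's explicit algorithm; the model-specific conditional samplers (e.g. a sampler for Z given Θ and posterior Dirichlet draws for Θ given Z) are given in the paper's Appendix A and Supplemental Material.

```python
def gibbs_estimate(X, theta_init, n_iter, sample_Z, sample_Theta, log_likelihood):
    """Alternate Z ~ P(Z | X, Theta) and Theta ~ P(Theta | X, Z), keeping the
    sampled Theta that maximizes P(X | Theta) = sum_Z P(X|Z) P(Z|Theta)."""
    theta = theta_init
    best_theta, best_ll = theta, float("-inf")
    for _ in range(n_iter):
        Z = sample_Z(X, theta)       # latent variables given current parameters
        theta = sample_Theta(X, Z)   # parameters given current latent variables
        ll = log_likelihood(X, theta)
        if ll > best_ll:
            best_theta, best_ll = theta, ll
    return best_theta
```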

4.3. Algorithms for Rhythm Transcription

Our generative models for rhythm transcription, constructed as combinations of a score model and a performance model, are HMMs whose latent states are the variables of the score model and whose observed variables are onset times or durations of the input (human-performed) MIDI data. Thus, the standard Viterbi algorithm [21] can be used to estimate the variables of the score model, including the note values or metrical positions, from the input MIDI data. For Bayesian models, the model parameters are first estimated from the input signal using the method explained in Sec. 4.2, and the latent variables are then estimated by the Viterbi algorithm. As an example, an explicit algorithm for rhythm transcription based on the first-order Bayesian metrical MM (MetMM1B) is presented in Appendix A.

For some of the present models, especially PatMM1DB and PatMM1SDB, the latent state space is huge, and the Viterbi algorithm requires an impractical amount of computation time. To address this issue, we apply a beam search and only retain the top-W most likely states at each Viterbi update (W is a preset number called the beam width). Using N_Z to represent the number of latent states, the computational costs are reduced from O(N_Z²) to O(N_Z W) via this technique, while the chance of obtaining the globally optimal solution decreases for smaller W. Since the inferential precision increases as we increase W, its value should be set to a practical upper limit for tractable computation. For the Bayesian extensions of these models, we also apply a similar method in the forward algorithm used for estimating the posterior probability.

5. Evaluation

To evaluate the performance of the studied models, we conducted two evaluation experiments. In the first experiment (Sec. 5.2), in order to compare the effects of different model architectures, the predictive ability of the non-Bayesian score models is evaluated. In the second experiment (Sec. 5.3), the transcription accuracies of the models are measured to examine the effects of the Bayesian extensions and the modification models. We also examine the influence of the hyperparameters of the studied models in Sec. 5.4. Using measurements of the computation time of the models, in Sec. 5.5, we discuss which models are practically important.

5.1. Setup

The initial and transition probabilities of all the models were trained with the same training dataset (Sec. 2.1). The probabilities were estimated by the maximum likelihood method with an additive smoothing constant of 0.1. We confirmed that the results were not sensitive to the smoothing constant, except for the transition probabilities of PatMM1, for which there were more parameters than training samples. For this model, we applied a linear interpolation with the unigram probabilities (0.8 unigram probability + 0.2 transition probability) that was roughly optimized w.r.t. the transcription accuracy.

For the evaluations, we used the test dataset consisting of melodies taken from 30 pieces of popular music (Sec. 2.1). For the transcription experiments, we collected performed MIDI data played by amateur musicians and recorded them with a digital piano. Four musicians participated and were exclusively assigned some pieces in the test dataset. The first 40 bars of each piece were used, and the musicians were asked to play the scores at a tempo of 105 bpm (beats per minute). For systematic examinations, we also used synthetic MIDI data generated by the performance timing model in Sec. 4.1 with σ_t = 0.04 sec and a tempo of 144 bpm (the value of σ_t was determined to simulate human performance, and the tempo was determined so that the transcription difficulty roughly matches that of the real MIDI data). All pieces in the test dataset were used for the synthetic data.


Table 3: Cross entropies (CEs) (bits/symbol) of the non-Bayesian score models.

Model     CE     Model     CE     Model     CE
NoteMM0   2.29   NoteMM1   2.07   NoteMM2   1.93
MetMM0    2.63   MetMM1    2.01   MetMM2    1.87
PatMM0    1.97   PatMM1    1.89


5.2. Evaluation of the Score Models

We evaluated the non-Bayesian score models listed in Table 3 in terms of the cross entropy, which is a standard measure for quantifying the performance of language models. Lower cross entropy means a higher predictive ability. For all classes of models (note value MM, metrical MM, and note pattern MM), the cross entropy decreased as we increased the order. Comparing the note value MMs and the metrical MMs, the former outperformed the latter when the order was zero, but the latter outperformed for the higher-order cases. The zeroth-order note pattern MM has a lower cross entropy than the first-order note value MM and metrical MM. This demonstrates the strength of the note pattern models, which can capture long-range sequential dependence at the note level. The reason the first-order note pattern MM was worse than the second-order metrical MM is likely that the data sparseness was more severe for the former model.
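For reference, the cross entropy in Table 3 can be computed as the negative average log probability of the test data under a model. A sketch (ours; the per-symbol normalization shown here is our assumption about the exact definition used):

```python
import math

def cross_entropy_bits(test_pieces, log_prob_fn):
    """-(1/N) sum over pieces of log2 P(piece), with N the total number of symbols.
    log_prob_fn returns a natural-log probability, e.g. log_prob from the
    NoteMM1 sketch in Sec. 2.2.1."""
    total_lp = sum(log_prob_fn(piece) for piece in test_pieces) / math.log(2)
    total_symbols = sum(len(piece) for piece in test_pieces)
    return -total_lp / total_symbols
```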

5.3. Evaluation of the Transcription Accuracy

We used the synthetic data in addition to the real data of human performances to clearly examine the effects of the score models. The following setups were used for the rhythm transcription algorithms based on the studied models. For the evaluations using the synthetic data, we fixed the parameter σ_t to its actual value (0.04 sec) used for the synthesis; for the real data, we set σ_t = 0.035 sec, which was roughly optimized in preliminary experiments. All the concentration parameters were set to 10. To estimate the model parameters of the Bayesian models, we iterated the Gibbs sampling 100 times, except for the models with too large computational costs. For these models (MetMM1SDB, PatMM1DB, and PatMM1SDB), we iterated 20 times. The influence of varying the values of these parameters is investigated in Sec. 5.4. The probability parameters for note modifications were set as ξ̄_0 = ζ̄_0 = 0.9, reflecting the assumption that note modifications are rare. For PatMM1DB and PatMM1SDB, we set the beam width as W = 200. This value was a practical upper limit to run all experiments in a reasonable amount of time. We use the error rate (i.e. the ratio of the number of notes with incorrectly estimated note values to the total number of notes) as an evaluation measure.

[Figure 7 content: bar charts of the error rate (%) with ±SD for the NoteMM, MetMM, and PatMM variants (0, 0B, 1, 1B, 1SB, 1DB, 1SDB, 2, 2B); (a) synthetic performance data (range roughly 4-14%), (b) real performance data (range roughly 1.5-3.5%).]

Figure 7: Transcription error rates for (a) the synthetic MIDI data and (b) the real performed MIDI data. See Table 2 for the model labels.

Since note values can be obtained from metrical positions, this evaluation measure can be applied universally to the transcribed results of all the models. Since the transcription results for the Bayesian models depend on the random numbers used for Gibbs sampling, we ran the methods 10 times with different random number seeds and obtained the average error rates.
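The evaluation measure itself is straightforward; a sketch (ours) comparing estimated and ground-truth note values note by note:

```python
def error_rate(estimated, ground_truth):
    """Ratio of notes whose estimated note value differs from the ground truth."""
    assert len(estimated) == len(ground_truth)
    wrong = sum(e != g for e, g in zip(estimated, ground_truth))
    return wrong / len(ground_truth)
```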

The results are shown in Fig. 7. We first discuss the results for the synthetic data. For the non-Bayesian models, higher-order models outperformed the corresponding lower-order models. The rank of the transcription accuracy approximately matches the rank of the cross entropy in Table 3. The Bayesian models without modification models had significantly lower error rates than the corresponding non-Bayesian models. The effect of the Bayesian extension was often larger than that of increasing the order of the models. For example, MetMM0B outperformed MetMM1 and MetMM1B outperformed MetMM2. Regarding the effects of the modification models, incorporating note divisions decreased the error rate for all model types, incorporating onset shifts did so only for the metrical MM, and incorporating both operations was no more effective than incorporating only note divisions for all model types. These results show that the incorporation of modification models can often improve the transcription results, but not significantly.

Comparing the different model types, the metrical MMs outperformed the corresponding note value MMs in most cases. The accuracies of PatMM0 and PatMM1 were close to that of MetMM2, and the effect of the Bayesian extension was larger for the metrical MMs. The best model in terms of accuracy was MetMM2B.

For the real data, most Bayesian models again significantly outperformed the corresponding non-Bayesian models.


[Figure 8 content: music notation excerpts in 2/4 comparing the Input, GT, MetMM2, and MetMM2B staves for two segments.]

Figure 8: Examples of transcription results (RWC No. 89). Only bars with repeated rhythms, which are of interest here, are shown. GT refers to the ground truth score. Transcription errors are indicated by stars.

Incorporating the modification models was effective in some cases, but there were also cases where the error rates increased. We should note that the error rates seem to saturate around 2.0% and that many models remain within the range of statistical fluctuation. In particular, the differences in the error rates between MetMM1 and MetMM2 and between MetMM1B and MetMM2B are much smaller than in the case of the synthetic data. This fact also makes it difficult to clearly identify the best models. MetMM2B again had the lowest error rate, but many other models had similar error rates. The note value MMs were consistently worse than the corresponding metrical MMs.

Our results differ from the results of [18] in one respect. Paper [18] reported a significant decrease in the error rate when incorporating the modification models into the note pattern MM. This can be explained by the fact that the state space of the note pattern MM in [18] did not span all possible note patterns, so the search space was extended by incorporating note modifications. This bias towards decreasing the performance of the basic note pattern MM can be clearly seen in the evaluation results ([18], Table 1), and it is removed in our results by constructing note pattern MMs with a search space equivalent to that of the other types of MMs.

We manually analysed the estimation errors made by MetMM2B, which had the lowest error rates, for the real data. There were approximately 70 errors (corresponding to an error rate of 2.08%) in the transcription results, and 43 of them were clear performance errors. Note that a performance error on one note onset usually induces two estimation errors of note values. The remaining estimation errors were caused by less clear but significant timing deviations. There was one performance in which complex rhythms with frequent tied notes were not accurately played, and the model made 18 estimation errors. From these error analyses, we found that the limiting transcription accuracy of approximately 2% is reasonable for this test dataset.


[Figure 9 content: error rate (%) curves (mean with ±1 SD shading) for MetMM0B, MetMM1B, and MetMM2B versus (a) the concentration parameter (log scale, 10⁻¹ to 10², with entropy-match values marked), (b) the number of Gibbs sampling iterations (log scale, 1 to 100), and (c) the performance model parameter σ_t (0.015 to 0.055 sec).]

Figure 9: Transcription error rates for varying (a) the concentration parameter of the unigram/transition probabilities, (b) the number of iterations of the Gibbs sampling, and (c) the performance model parameter σ_t. The solid curves indicate the mean values and the shaded regions indicate the ranges of ±1 standard deviation.

The transcription examples in Fig. 8 demonstrate the effect of the Bayesian learning of piece-specific score models. This piece has repetitions of a tied dotted rhythm (see the first bars of the segments in Fig. 8), which are played with inaccurate timings. Since this rhythm is not common, the MetMM2 method using a generic score model made errors when the timing deviations were too large. In contrast, the MetMM2B method was able to recognize the correct rhythm in most cases by inducing a piece-specific score model capturing these repetitions and in effect cancelling out the timing deviations.

5.4. Influence of the Hyperparameters

We studied the influence of the values of the hyperparameters on the transcription accuracy. Here, we focus on the Bayesian metrical MMs that do not incorporate the modification process in order to investigate the generic tendencies, because the non-Bayesian models have poorer performance than the Bayesian models, and the Bayesian note pattern MMs and other Bayesian models with modification models have practically prohibitive computational costs for precise measurements.

The effects of varying the values of the three most relevant parameters, namely the concentration parameter for the unigram/transition probabilities, the number of iterations for Gibbs sampling, and the parameter σ_t of the


performance model, were investigated by measuring the transcription error rates as in Sec. 5.3. When varying one of these parameter values, the other parameters were fixed to the values used in Sec. 5.3. For the plots in Fig. 9, we ran each algorithm with 10 different random number seeds and obtained the average error rate. For the concentration parameter and the number of iterations, we used the synthetic MIDI data to obtain results with small statistical fluctuations; for the performance model parameter σ_t, we used the real performed MIDI data.

Regarding the concentration parameter of the unigram/transition probabilities (Fig. 9(a)), the optimal values for all the models were in the range [1, 100], but the values differ among the models. The theoretically optimal values, at which the expected entropy of the unigram/transition probabilities generated by the Dirichlet processes matches the corresponding mean entropy of the piece-specific unigram/transition probabilities of the training data, are also shown there. We see that the theoretically optimal values do not necessarily match the empirically optimal values that maximize the transcription accuracy. We can also confirm that the differences between the error rates for these values and those for the value of 10 used in Sec. 5.3 are within one standard deviation of the statistical fluctuation.

Regarding the number of iterations of the Gibbs sampling (Fig. 9(b)), the error rate monotonically decreased as the number of iterations increased, until a plateau was reached for each model. For MetMM0B and MetMM1B, the plateau was already reached with approximately 10 iterations; for MetMM2B, the error rate seemed to be still decreasing at the 500th iteration. Since the computation time for the parameter estimation is approximately proportional to the number of iterations, the trade-off between the computation time and the estimation precision is relevant, especially for higher-order models.

Regarding the performance model parameter σt (Fig. 9(c)), we see that different models had different optimal values in the range [0.03, 0.05] sec, but with a non-monotonic behaviour and a large statistical fluctuation. From these results, we understand that it is difficult to reliably find the precise optimal value of σt for the real data and that the tentative value 0.035 sec used in Sec. 5.3 is at least not a bad choice.

5.5. Discussion of the Computation Time and Extendability

We now discuss the computation time, which is another important aspect of transcription methods. The computation times for the studied models are listed in Table 4. These values were measured on a computer with a 3.6 GHz Intel Core i9 CPU and 64 GB of RAM. Roughly speaking, the computation time increases linearly with the number of notes in the input data, and, for the Bayesian models, it also increases linearly with the number of iterations for Gibbs sampling.

From the results, we see that incorporating modification models significantly increases the computation time and that note pattern MMs require much more time than note value MMs and metrical MMs. Considering both the accuracy and the computation time, the first-order and second-order metrical MMs with Bayesian extensions are the most useful models in practical applications. Incorporating modification models into these models would increase the accuracy but also significantly increase the computation time. The order of a model, the use of the Bayesian extension, the number of iterations for Gibbs sampling, and the use of modification models should be determined according to the available computation time.

In practical applications, we need to address longer note values and finer beat resolutions than those considered in this study. For example, if we consider the full bar length in 4/4 time and a beat resolution corresponding to one third of a 16th note (to accommodate triplets), Nb = 48. The computation times required for the first-order note value MMs and metrical MMs are roughly proportional to Nb^2, and those for the second-order models are roughly proportional to Nb^3. However, the number of possible note patterns increases as 2^Nb, which makes the computation times of the note pattern MMs intractable, at least without refined searching techniques. This means that the metrical MMs also have better extendability.
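To make these growth rates concrete, here is a small illustrative calculation (ours, not from the paper's code) of the rough table sizes involved:

    # Rough sizes of the probability tables / state spaces as Nb grows
    # (orders of magnitude only; illustrative).
    for Nb in (16, 48):
        print(f"Nb={Nb}: first-order tables ~ Nb^2 = {Nb**2}, "
              f"second-order ~ Nb^3 = {Nb**3}, "
              f"note patterns ~ 2^Nb = {2**Nb:.2e}")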

6. Conclusion

We have studied the statistical description of the repetitive structure of musical notes using Bayesian score models and its application to rhythm transcription. The main results are summarized as follows.

• The repetitive structure of musical rhythms is reflected in the sparseness of piece-specific score models, and the Dirichlet process describing the generative process of piece-specific score models from a generic score model can explain the distribution of the entropies of piece-specific models, particularly for the note value MM and the metrical MM (Fig. 5).

• We have confirmed that the Bayesian score models capturing repetitions were indeed effective for significantly improving the transcription accuracy.

• Considering the transcription accuracy and the computational efficiency, the second-order Bayesian metrical MM was found to be the best among the tested models.

• Incorporating the note modification process can help capture approximate repetitions and therefore improve the accuracy, but the effect was small and the computational cost significantly increased.

In general, these results provide yet another piece of evidence for the importance of musical language modelling for music transcription and demonstrate the effectiveness of utilizing the repetitive structure. Since repetitions are found in various musical elements including rhythms, pitch configurations, and chord progressions, the present approach can also be useful for polyphonic music transcription, chord recognition, and other automatic music recognition tasks.

Table 4: Computation times (sec) for transcribing the real data. For the Bayesian models, the number of iterations for Gibbs sampling was 100; for PatMM1DB and PatMM1SDB, W was 1000. The acronyms for the models are explained in Table 2.

Model        Time         Model        Time         Model        Time
NoteMM0      < 0.1        NoteMM0B     0.13         NoteMM1      < 0.1
NoteMM1B     0.65         NoteMM1SB    111          NoteMM1DB    23
NoteMM1SDB   1.4 × 10^3   NoteMM2      < 0.1        NoteMM2B     5.0
MetMM0       < 0.1        MetMM0B      0.7          MetMM1       < 0.1
MetMM1B      0.7          MetMM1SB     111          MetMM1DB     413
MetMM1SDB    2.0 × 10^4   MetMM2       < 0.1        MetMM2B      4.8
PatMM0       1.3          PatMM0B      558          PatMM1       1.6
PatMM1B      656          PatMM1SB     2.4 × 10^4   PatMM1DB     3.3 × 10^4
PatMM1SDB    9.6 × 10^4

As future work, it is important to extend the model and incorporate pitches and multiple voices for applications to more general types of music transcription. The present approach of inferring piece-specific models can be even more effective in cases where the amount of noise is larger than in the rhythm transcription setting studied here. Such a situation is common when MIDI sequences are obtained from audio signals [17, 20]. Another interesting possibility is to combine the present Bayesian approach with deep generative models that can learn more complex structures than Markov models.

Supporting information

Supplemental Material. Details of the inference algorithms for the Bayesian Markov models.

Appendix A. Rhythm Transcription Algorithm for the First-Order Bayesian Metrical Markov Model (MetMM1B)

In this appendix, an explicit algorithm for rhythm transcription based on the first-order Bayesian metrical Markov model (MetMM1B) is presented with some details of the computation procedure. We use the symbol $\mathbb{R}_+$ to represent the set of nonnegative real numbers and write $a \in \mathbb{R}_+^D$ to express that $a$ is a $D$-dimensional vector with nonnegative elements (probability vectors are further assumed to be normalized).

Appendix A.1. Generative model

The generative process of the model is summarized as follows. The metrical Markov model (Sec. 2.2.2) generates the metrical positions $b_{0:N} = (b_0, \ldots, b_N)$ as
$$P(b_0 = b) = \chi_{\mathrm{ini}\,b}, \qquad P(b_n = b \,|\, b_{n-1} = b') = \chi_{b'b}. \tag{A.1}$$

Here, each metrical position $b_n$ takes a value in $\{0, 1, \ldots, N_b - 1\}$ ($N_b$ is the bar length), and $\chi_{\mathrm{ini}} = (\chi_{\mathrm{ini}\,b})_b \in \mathbb{R}_+^{N_b}$ and $\chi_b = (\chi_{bb'})_{b'} \in \mathbb{R}_+^{N_b}$ are probability vectors. The performance model (Sec. 4.1) generates note durations $d_{1:N}$ ($d_n > 0$) of a performance MIDI signal as
$$P(d_n \,|\, b_{n-1} = b', b_n = b) = \phi_{b'b d_n} = \mathrm{Gauss}(d_n; [b', b]v, \sigma_t^2), \tag{A.2}$$
where
$$\mathrm{Gauss}(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[ -\frac{(x - \mu)^2}{2\sigma^2} \right] \tag{A.3}$$
denotes a Gaussian distribution, $v > 0$ denotes the (inverse) tempo, $\sigma_t > 0$ denotes the amount of onset time deviations, and $[b', b]$ represents the note value as in Eq. (4). In the experimental setup of this study, the parameters $v$ and $\sigma_t$ are constants specified prior to an application of the algorithm. The complete-data probability of the whole data sequence ($d_{1:N}$ and $b_{0:N}$) is given as
$$P(d_{1:N}, b_{0:N}) = \chi_{\mathrm{ini}\,b_0} \prod_{n=1}^{N} \chi_{b_{n-1}b_n} \phi_{b_{n-1}b_n d_n}. \tag{A.4}$$

The piece-specific model parameters $\chi_{\mathrm{ini}} = (\chi_{\mathrm{ini}\,b})_b$ and $\chi_b = (\chi_{bb'})_{b'}$ are generated from the generic model parameters $\bar{\chi}_{\mathrm{ini}} \in \mathbb{R}_+^{N_b}$ and $\bar{\chi}_b \in \mathbb{R}_+^{N_b}$ by a Dirichlet process:
$$P(\chi_{\mathrm{ini}}) = \mathrm{Dir}(\chi_{\mathrm{ini}}; \alpha_{\mathrm{ini}} \bar{\chi}_{\mathrm{ini}}), \tag{A.5}$$
$$P(\chi_b) = \mathrm{Dir}(\chi_b; \alpha_\chi \bar{\chi}_b), \tag{A.6}$$
where
$$\mathrm{Dir}(\chi; \alpha) = \Gamma(\alpha_1 + \cdots + \alpha_D) \prod_{i=1}^{D} \frac{\chi_i^{\alpha_i - 1}}{\Gamma(\alpha_i)} \tag{A.7}$$
denotes a Dirichlet distribution ($\Gamma(x)$ denotes the gamma function), and $\alpha_{\mathrm{ini}} > 0$ and $\alpha_\chi > 0$ are concentration parameters. These parameters are treated as constants to be specified prior to an application of the algorithm.

In the application of this Bayesian model to rhythm transcription, the generic model parameters $\bar{\chi}_{\mathrm{ini}}$ and $\bar{\chi}_b$ are pretrained using musical score data (Sec. 4.2). The piece-specific model parameters $\chi_{\mathrm{ini}}$ and $\chi_b$ are treated as variables and estimated during a run of the algorithm, as described below.
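For illustration, the generative process of Eqs. (A.1) and (A.2) with fixed piece-specific parameters can be simulated as in the following Python sketch (ours, not the authors' implementation; the parameter values are placeholders, and the Dirichlet draw of Eqs. (A.5)-(A.6) is omitted for brevity):

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_performance(chi_ini, chi, v, sigma_t, N, Nb):
        """Sample b_{0:N} from the metrical Markov model (Eq. A.1) and
        durations d_{1:N} from the performance model (Eq. A.2)."""
        b = [rng.choice(Nb, p=chi_ini)]
        d = []
        for n in range(1, N + 1):
            b.append(rng.choice(Nb, p=chi[b[-1]]))
            note_value = (b[-1] - b[-2]) % Nb or Nb   # [b', b]: score-time interval
            d.append(rng.normal(note_value * v, sigma_t))
        return np.array(b), np.array(d)

    # Placeholder parameters: Nb = 16 positions per bar, uniform model,
    # inverse tempo v = 0.125 sec per unit, sigma_t = 0.035 sec.
    Nb = 16
    chi_ini = np.full(Nb, 1 / Nb)
    chi = np.full((Nb, Nb), 1 / Nb)
    b, d = sample_performance(chi_ini, chi, v=0.125, sigma_t=0.035, N=8, Nb=Nb)
    print(b, np.round(d, 3))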

Appendix A.2. Estimation of piece-specific model parameters

For rhythm transcription, we estimate the metrical positions $b_{0:N}$ from a given performance MIDI signal with note durations $d_{1:N}$. In the Bayesian setting, the piece-specific model parameters are estimated first, as described in this subsection, and then the metrical positions are estimated to obtain the final rhythm transcription, as described in the next subsection. As formulated in Sec. 4.3, the former step is conducted by using the Gibbs sampling method [4] with the following steps of latent variable sampling and parameter sampling.

In the latent variable sampling step, the model parameters $\chi_{\mathrm{ini}}$ and $\chi_b$ are fixed and $b_{0:N}$ are sampled by the forward filtering-backward sampling method. The forward variables $F_n(b_n) = P(d_{1:n}, b_n)$ are computed by the forward algorithm:
$$F_0(b_0) := \chi_{\mathrm{ini}\,b_0} \quad \text{(initialization)}, \tag{A.8}$$
$$F_n(b_n) = \sum_{b_{n-1}} \chi_{b_{n-1}b_n} \phi_{b_{n-1}b_n d_n} F_{n-1}(b_{n-1}). \tag{A.9}$$
The variables $b_{0:N}$ are then sampled by the backward sampling method as
$$P(b_N \,|\, d_{1:N}) = \frac{F_N(b_N)}{\sum_b F_N(b)}, \tag{A.10}$$
$$P(b_n \,|\, b_{n+1:N}, d_{1:N}) = \frac{\chi_{b_n b_{n+1}} \phi_{b_n b_{n+1} d_{n+1}} F_n(b_n)}{\sum_b \chi_{b b_{n+1}} \phi_{b b_{n+1} d_{n+1}} F_n(b)}. \tag{A.11}$$

In the parameter sampling step, the metrical positions $b_{0:N}$ are fixed and the model parameters $\chi_{\mathrm{ini}}$ and $\chi_b$ are sampled by using the probabilities
$$P(\chi_{\mathrm{ini}}) = \mathrm{Dir}(\chi_{\mathrm{ini}}; \alpha_{\mathrm{ini}} \bar{\chi}_{\mathrm{ini}} + c^{\mathrm{ini}}), \tag{A.12}$$
$$P(\chi_{b'}) = \mathrm{Dir}(\chi_{b'}; \alpha_\chi \bar{\chi}_{b'} + c^{\chi}_{b'}), \tag{A.13}$$
where the variables $c^{\mathrm{ini}} = (c^{\mathrm{ini}}_b)_b$ and $c^{\chi}_{b'} = (c^{\chi}_{b'b})_b$ count how many times metrical positions appear in $b_{0:N}$ and are defined as
$$c^{\mathrm{ini}}_b = \delta_{b_0 b}, \qquad c^{\chi}_{b'b} = \sum_{n=1}^{N} \delta_{b_{n-1}b'} \delta_{b_n b}. \tag{A.14}$$

The standard method for sampling a probability vector $\chi$ from a Dirichlet distribution $\mathrm{Dir}(\chi; \alpha)$ can be used: after sampling a number $y_i$ from the gamma distribution $\mathrm{Gam}(\alpha_i, 1)$ independently for each $i$, the vector $\chi$ is obtained by normalization ($\chi_i = y_i / \sum_j y_j$).
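A minimal sketch (ours) of this parameter sampling step, combining the count statistics of Eq. (A.14) with the gamma-normalization trick (it assumes strictly positive generic probabilities):

    import numpy as np

    rng = np.random.default_rng(2)

    def sample_parameters(b, chi_ini_bar, chi_bar, alpha_ini, alpha_chi, Nb):
        """Sample chi_ini and chi from the posterior Dirichlets (Eqs. A.12-A.14),
        via gamma draws followed by normalization."""
        # Count statistics c_ini and c_chi from the fixed positions b_{0:N}
        c_ini = np.zeros(Nb)
        c_ini[b[0]] = 1.0
        c_chi = np.zeros((Nb, Nb))
        np.add.at(c_chi, (b[:-1], b[1:]), 1.0)
        # Gamma draws with scale 1, then normalize (Dirichlet sampling)
        y_ini = rng.gamma(alpha_ini * chi_ini_bar + c_ini)
        y_chi = rng.gamma(alpha_chi * chi_bar + c_chi)
        return y_ini / y_ini.sum(), y_chi / y_chi.sum(axis=1, keepdims=True)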

For initialization, the model parameters $\chi_{\mathrm{ini}}$ and $\chi_b$ are set to $\bar{\chi}_{\mathrm{ini}}$ and $\bar{\chi}_b$, respectively. The latent variable sampling step and the parameter sampling step are then iterated for a predetermined number of times. At each iteration, the evidence probability $P(d_{1:N}) = \sum_b F_N(b)$, calculated by Eq. (A.9) with the temporary values of $\chi_{\mathrm{ini}}$ and $\chi_b$, is memorized. After running all the iterations, the parameterization that yields the maximum evidence probability is finally chosen and used for the subsequent step of estimating the metrical positions.
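Putting these steps together, the outer loop could be sketched as follows (our illustration, reusing the ffbs and sample_parameters sketches above; evidence is a hypothetical helper that runs the forward pass of Eq. (A.9) and returns $\sum_b F_N(b)$):

    import numpy as np

    def gibbs_learning(d, chi_ini_bar, chi_bar, alpha_ini, alpha_chi,
                       v, sigma_t, Nb, n_iter=100):
        """Alternate latent-variable and parameter sampling (Appendix A.2),
        keeping the parameters with the maximum evidence P(d_{1:N})."""
        chi_ini, chi = chi_ini_bar.copy(), chi_bar.copy()  # initialization
        best = (-np.inf, chi_ini, chi)
        for _ in range(n_iter):
            b = ffbs(d, chi_ini, chi, v, sigma_t, Nb)          # latent sampling
            chi_ini, chi = sample_parameters(b, chi_ini_bar, chi_bar,
                                             alpha_ini, alpha_chi, Nb)
            ev = evidence(d, chi_ini, chi, v, sigma_t, Nb)     # hypothetical: sum_b F_N(b)
            if ev > best[0]:
                best = (ev, chi_ini, chi)
        return best[1], best[2]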

Appendix A.3. Estimation of metrical positions

After the piece-specific model parameters $\chi_{\mathrm{ini}}$ and $\chi_b$ are estimated, the metrical positions $\hat{b}_{0:N}$ for the final rhythm transcription result are estimated by maximizing the probability $P(b_{0:N} \,|\, d_{1:N})$ (Sec. 4.2). By the Bayes formula, $P(b_{0:N} \,|\, d_{1:N}) \propto P(d_{1:N}, b_{0:N})$, so this is equivalent to maximizing the probability in Eq. (A.4) w.r.t. $b_{0:N}$. To solve this problem, we can apply the Viterbi algorithm [21] as follows. The Viterbi variables $V_n(b_n)$ and backtracking indices $U_n(b_n)$ are computed as
$$V_0(b_0) := \chi_{\mathrm{ini}\,b_0} \quad \text{(initialization)}, \tag{A.15}$$
$$V_n(b_n) = \max_{b_{n-1}} \left[ \chi_{b_{n-1}b_n} \phi_{b_{n-1}b_n d_n} V_{n-1}(b_{n-1}) \right], \tag{A.16}$$
$$U_n(b_n) = \mathop{\mathrm{argmax}}_{b_{n-1}} \left[ \chi_{b_{n-1}b_n} \phi_{b_{n-1}b_n d_n} V_{n-1}(b_{n-1}) \right]. \tag{A.17}$$
The most probable sequence $\hat{b}_{0:N}$ is then obtained by backtracking:
$$\hat{b}_N = \mathop{\mathrm{argmax}}_{b_N} V_N(b_N), \tag{A.18}$$
$$\hat{b}_n = U_{n+1}(\hat{b}_{n+1}), \tag{A.19}$$

which completes our rhythm transcription algorithm.

In the actual implementation, most computations involving probabilities of data sequences explained in this appendix are performed in the logarithmic domain to avoid numerical underflow. See the source code in the Supplemental Material for implementation details. Rhythm transcription algorithms for the other models studied in this paper can be derived similarly. The details are also given in the Supplemental Material.
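For completeness, a log-domain sketch (ours) of Eqs. (A.15)-(A.19):

    import numpy as np
    from scipy.stats import norm

    def viterbi(d, chi_ini, chi, v, sigma_t, Nb):
        """Log-domain Viterbi decoding of b_{0:N} (Eqs. A.15-A.19)."""
        N = len(d)
        note_val = (np.arange(Nb)[None, :] - np.arange(Nb)[:, None]) % Nb
        note_val[note_val == 0] = Nb
        log_chi = np.log(chi)
        V = np.log(chi_ini)                    # V_0 (Eq. A.15)
        U = np.zeros((N + 1, Nb), dtype=int)   # backtracking indices
        for n in range(1, N + 1):
            log_phi = norm.logpdf(d[n - 1], loc=note_val * v, scale=sigma_t)
            scores = V[:, None] + log_chi + log_phi   # (Eqs. A.16, A.17)
            U[n] = scores.argmax(axis=0)
            V = scores.max(axis=0)
        b_hat = np.empty(N + 1, dtype=int)
        b_hat[N] = V.argmax()                  # (Eq. A.18)
        for n in range(N - 1, -1, -1):         # backtracking (Eq. A.19)
            b_hat[n] = U[n + 1][b_hat[n + 1]]
        return b_hat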

References

[1] Benetos, E., Dixon, S., Duan, Z., and Ewert, S. (2019). Automatic music transcription: An overview. IEEE Signal Processing Magazine, 36(1):20–30.
[2] Benetos, E., Dixon, S., Giannoulis, D., Kirchhoff, H., and Klapuri, A. (2013). Automatic music transcription: Challenges and future directions. J. Intelligent Information Systems, 41(3):407–434.
[3] Benetos, E. and Weyde, T. (2015). An efficient temporally-constrained probabilistic model for multiple-instrument music transcription. In Proc. ISMIR, pages 701–707.
[4] Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
[5] Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet allocation. J. Machine Learning Res., 3:993–1022.
[6] Cemgil, A. T., Desain, P., and Kappen, B. (2000). Rhythm quantization for transcription. Comp. Mus. J., 24(2):60–76.
[7] Desain, P. and Honing, H. (1989). The quantization of musical time: A connectionist approach. Comp. Mus. J., 13(3):56–66.
[8] Fine, S., Singer, Y., and Tishby, N. (1998). The hierarchical hidden Markov model: Analysis and applications. Machine Learning, 32(1):41–62.
[9] Goto, M., Hashiguchi, H., Nishimura, T., and Oka, R. (2002). RWC music database: Popular, classical and jazz music databases. In Proc. ISMIR, pages 287–288.
[10] Hamanaka, M., Goto, M., Asoh, H., and Otsu, N. (2003). A learning-based quantization: Unsupervised estimation of the model parameters. In Proc. ICMC, pages 369–372.
[11] Hawthorne, C., Elsen, E., Song, J., Roberts, A., Simon, I., Raffel, C., Engel, J., Oore, S., and Eck, D. (2018). Onsets and frames: Dual-objective piano transcription. In Proc. ISMIR, pages 50–57.
[12] Huron, D. (2006). Sweet Anticipation: Music and the Psychology of Expectation. The MIT Press.
[13] Jacoby, N. and McDermott, J. H. (2017). Integer ratio priors on musical rhythm revealed cross-culturally by iterated reproduction. Current Biology, 27(3):359–370.
[14] Jordan, M. I. (2005). Dirichlet processes, Chinese restaurant processes and all that. In Tutorial Presentation at the NIPS Conference.
[15] Longuet-Higgins, H. (1987). Mental Processes: Studies in Cognitive Science. MIT Press.
[16] Mari, J.-F., Haton, J.-P., and Kriouile, A. (1997). Automatic word recognition based on second-order hidden Markov models. IEEE Transactions on Speech and Audio Processing, 5(1):22–25.
[17] Nakamura, E., Benetos, E., Yoshii, K., and Dixon, S. (2018). Towards complete polyphonic music transcription: Integrating multi-pitch detection and rhythm quantization. In Proc. ICASSP, pages 101–105.
[18] Nakamura, E., Itoyama, K., and Yoshii, K. (2016). Rhythm transcription of MIDI performances based on hierarchical Bayesian modelling of repetition and modification of musical note patterns. In Proc. EUSIPCO, pages 1946–1950.
[19] Nakamura, E., Yoshii, K., and Sagayama, S. (2017). Rhythm transcription of polyphonic piano music based on merged-output HMM for multiple voices. IEEE/ACM TASLP, 25(4):794–806.
[20] Nishikimi, R., Nakamura, E., Itoyama, K., and Yoshii, K. (2016). Musical note estimation for F0 trajectories of singing voices based on a Bayesian semi-beat-synchronous HMM. In Proc. ISMIR, pages 461–467.
[21] Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286.
[22] Raczynski, S., Vincent, E., and Sagayama, S. (2013). Dynamic Bayesian networks for symbolic polyphonic pitch modeling. IEEE/ACM TASLP, 21(9):1830–1840.
[23] Raphael, C. (2002). A hybrid graphical model for rhythmic parsing. Artificial Intelligence, 137:217–238.
[24] Raphael, C. (2005). A graphical model for recognizing sung melodies. In Proc. ISMIR, pages 658–663.
[25] Ravignani, A., Thompson, B., Grossi, T., Delgado, T., and Kirby, S. (2018). Evolving building blocks of rhythm: How human cognition creates music via cultural transmission. Annals of the New York Academy of Sciences, 1423:176–187.
[26] Savage, P. E., Brown, S., Sakai, E., and Currie, T. E. (2015). Statistical universals reveal the structures and functions of human music. PNAS, 112(29):8987–8992.
[27] Schramm, R., McLeod, A., Steedman, M., and Benetos, E. (2017). Multi-pitch detection and voice assignment for a cappella recordings of multiple singers. In Proc. ISMIR, pages 552–559.
[28] Sigtia, S., Benetos, E., and Dixon, S. (2016). An end-to-end neural network for polyphonic piano music transcription. IEEE/ACM TASLP, 24(5):927–939.
[29] Sioros, G., Davies, M. E. P., and Guedes, C. (2018). A generative model for the characterization of musical rhythms. J. New Music Res., 47(2):114–128.
[30] Smith, L. M. and Honing, H. (2008). Time-frequency representation of musical rhythm by continuous wavelets. Journal of Mathematics and Music, 2(2):81–97.
[31] Takeda, H., Otsuki, T., Saito, N., Nakai, M., Shimodaira, H., and Sagayama, S. (2002). Hidden Markov model for automatic transcription of MIDI signals. In Proc. MMSP, pages 428–431.
[32] Temperley, D. and Sleator, D. (1999). Modeling meter and harmony: A preference-rule approach. Comp. Mus. J., 23(1):10–27.
[33] Wu, Y.-T., Chen, B., and Su, L. (2019). Polyphonic music transcription with semantic segmentation. In Proc. ICASSP, pages 166–170.
[34] Ycart, A. and Benetos, E. (2017). A study on LSTM networks for polyphonic music sequence modelling. In Proc. ISMIR, pages 421–427.


Supplemental Material for
"Music Transcription Based on Bayesian Piece-Specific Score Models Capturing Repetitions"

Eita Nakamura 1,2,∗ and Kazuyoshi Yoshii 1,3,†

1 Graduate School of Informatics, Kyoto University, Sakyo, Kyoto 606-8501, Japan
2 The Hakubi Center for Advanced Research, Kyoto University, Sakyo, Kyoto 606-8501, Japan
3 AIP Center for Advanced Intelligence Project, RIKEN, Tokyo 103-0027, Japan

∗ Electronic address: [email protected]
† Electronic address: [email protected]

Contents

1 Introduction
2 Note Value Markov Model
  2.1 Non-Bayesian Models
  2.2 Integration with the Performance Model
  2.3 Bayesian Extension
  2.4 Gibbs Sampling for Bayesian Learning
3 Metrical Markov Model
  3.1 Non-Bayesian Models
  3.2 Integration with the Performance Model
  3.3 Bayesian Extension
  3.4 Gibbs Sampling for Bayesian Learning
4 Note Pattern Markov Model
  4.1 Non-Bayesian Models
  4.2 Integration with the Performance Model
  4.3 Bayesian Extension
  4.4 Gibbs Sampling for Bayesian Learning

1 Introduction

In this Supplemental Material, we explicitly formulate three types of Markov models, their extensions incorporating note modifications, their Bayesian extensions, and their integrations with the performance model. We also explicitly describe the Gibbs sampling methods for estimating model parameters using performance data.

Based on the convention in the main text, the labels NoteMM1, NoteMM1S, NoteMM1D, and NoteMM1SD are used for the score models based on the note value Markov model (MM); MetMM1, MetMM1S, MetMM1D, and MetMM1SD are used for the score models based on the metrical MM; and PatMM1, PatMM1S, PatMM1D, and PatMM1SD are used for the score models based on the note pattern MM. When these score models are integrated with the performance model, the integrated models are labelled NoteHMM1, NoteHMM1S, etc.

2 Note Value Markov Model

2.1 Non-Bayesian Models

2.1.1 Basic Model (NoteMM1)

The basic note value Markov model is defined with the initial and transition probabilities:
$$P(r_1 = r) = \pi_{\mathrm{ini}\,r}, \qquad P(r_n = r \,|\, r_{n-1} = r') = \pi_{r'r}. \tag{1}$$
The complete-data probability is
$$P(r_{1:N}) = \pi_{\mathrm{ini}\,r_1} \prod_{n=2}^{N} \pi_{r_{n-1}r_n}. \tag{2}$$

2.1.2 Model with Only Onset Shifts (NoteMM1S)

We introduce internal variables $z_0 = s_0$ and $z_n = (\tilde{r}_n, s_n)$ ($n = 1, \ldots, N$), where $\tilde{r}$ indicates the note value before the onset shift and $s$ indicates the amount of the onset shift. The model is defined as
$$P(z_0 = s) = \xi_s, \quad P(z_1 = (\tilde{r}, s)) = \pi_{\mathrm{ini}\,\tilde{r}}\, \xi_s, \quad P(z_n = (\tilde{r}, s) \,|\, z_{n-1} = (\tilde{r}', s')) = \pi_{\tilde{r}'\tilde{r}}\, \xi_s, \tag{3}$$
$$P(r_n \,|\, z_{n-1}, z_n) = I(r_n = \tilde{r}_n + s_n - s_{n-1}). \tag{4}$$

2.1.3 Model with Only Note Divisions (NoteMM1D)

We introduce an internal variable $z = (\tilde{r}, h, g)$, where $\tilde{r}$ indicates the note value before the note division and $h$ and $g$ indicate the division pattern and its internal note position. The model is defined as
$$P(z_1 = (\tilde{r}, h, g)) = \pi_{\mathrm{ini}\,\tilde{r}}\, \zeta_{\tilde{r}h}\, \delta_{g1}, \tag{5}$$
$$P(z_n = (\tilde{r}, h, g) \,|\, z_{n-1} = (\tilde{r}', h', g')) = \delta_{\tilde{r}\tilde{r}'}\delta_{hh'}\delta_{g(g'+1)} + \delta_{g'G_{h'}}\, \pi_{\tilde{r}'\tilde{r}}\, \zeta_{\tilde{r}h}\, \delta_{g1}, \tag{6}$$
$$P(r_n \,|\, z_n = (\tilde{r}_n, h_n, g_n)) = I(r_n = q_{h_n g_n}), \tag{7}$$
and the complete-data probability is
$$P(r_{1:N}, z_{1:N}) = \pi_{\mathrm{ini}\,\tilde{r}_1} \zeta_{\tilde{r}_1 h_1} \delta_{g_1 1} \prod_{n=2}^{N} \left( \delta_{\tilde{r}_n\tilde{r}_{n-1}}\delta_{h_n h_{n-1}}\delta_{g_n(g_{n-1}+1)} + \delta_{g_{n-1}G_{h_{n-1}}}\, \pi_{\tilde{r}_{n-1}\tilde{r}_n}\, \zeta_{\tilde{r}_n h_n}\, \delta_{g_n 1} \right) \prod_{n=1}^{N} I(r_n = q_{h_n g_n}).$$

2.1.4 Model with Onset Shifts and Note Divisions (NoteMM1SD)

We introduce internal variables $z_0 = s_0$ and $z_n = (\tilde{r}_n, h_n, g_n, s_n)$ ($n = 1, \ldots, N$). The model is defined as
$$P(z_0 = s) = \xi_s, \qquad P(z_1 = (\tilde{r}, h, g, s)) = \pi_{\mathrm{ini}\,\tilde{r}}\, \zeta_{\tilde{r}h}\, \delta_{g1}\, \xi_s, \tag{8}$$
$$P(z_n = (\tilde{r}, h, g, s) \,|\, z_{n-1} = (\tilde{r}', h', g', s')) = \left( \delta_{\tilde{r}\tilde{r}'}\delta_{hh'}\delta_{g(g'+1)} + \delta_{g'G_{h'}}\, \pi_{\tilde{r}'\tilde{r}}\, \zeta_{\tilde{r}h}\, \delta_{g1} \right) \xi_s, \tag{9}$$
$$P(r_n \,|\, z_{n-1}, z_n) = I(r_n = q_{h_n g_n} + s_n - s_{n-1}). \tag{10}$$


2.2 Integration with the Performance Model

2.2.1 NoteHMM1

Latent variables: $z_n = r_n$.
$$P(z_1 = r) = \pi_{\mathrm{ini}\,r}, \qquad P(z_n = r \,|\, z_{n-1} = r') = \pi_{r'r}, \tag{11}$$
$$P(d_n \,|\, z_n = r) = \phi_{r d_n} = \mathrm{Gauss}(d_n; rv, \sigma_t^2). \tag{12}$$
The complete-data probability is
$$P(d_{1:N}, z_{1:N}) = \pi_{\mathrm{ini}\,r_1} \prod_{n=2}^{N} \pi_{r_{n-1}r_n} \prod_{n=1}^{N} \phi_{r_n d_n}. \tag{13}$$

2.2.2 NoteHMM1S

Latent variables: $z_0 = s_0$, $z_n = (\tilde{r}_n, s_n)$ ($n = 1, \ldots, N$).
$$P(z_0 = s) = \xi_s, \quad P(z_1 = (\tilde{r}, s)) = \pi_{\mathrm{ini}\,\tilde{r}}\, \xi_s, \quad P(z_n = (\tilde{r}, s) \,|\, z_{n-1} = (\tilde{r}', s')) = \pi_{\tilde{r}'\tilde{r}}\, \xi_s, \tag{14}$$
$$P(d_n \,|\, z_{n-1}, z_n) = \phi_{\tilde{r}_n s_{n-1} s_n d_n} = \mathrm{Gauss}(d_n; (\tilde{r}_n + s_n - s_{n-1})v, \sigma_t^2), \tag{15}$$
and the complete-data probability is
$$P(d_{1:N}, z_{0:N}) = \pi_{\mathrm{ini}\,\tilde{r}_1} \prod_{n=2}^{N} \pi_{\tilde{r}_{n-1}\tilde{r}_n} \prod_{n=0}^{N} \xi_{s_n} \prod_{n=1}^{N} \phi_{\tilde{r}_n s_{n-1} s_n d_n}. \tag{16}$$

2.2.3 NoteHMM1D

Latent variables: $z_n = (\tilde{r}_n, h_n, g_n)$.
$$P(z_1 = (\tilde{r}, h, g)) = \pi_{\mathrm{ini}\,\tilde{r}}\, \zeta_{\tilde{r}h}\, \delta_{g1}, \tag{17}$$
$$P(z_n = (\tilde{r}, h, g) \,|\, z_{n-1} = (\tilde{r}', h', g')) = \delta_{\tilde{r}\tilde{r}'}\delta_{hh'}\delta_{g(g'+1)} + \delta_{g'G_{h'}}\, \pi_{\tilde{r}'\tilde{r}}\, \zeta_{\tilde{r}h}\, \delta_{g1}, \tag{18}$$
$$P(d_n \,|\, z_n = (\tilde{r}, h, g)) = \phi_{hg d_n} = \mathrm{Gauss}(d_n; q_{hg}v, \sigma_t^2), \tag{19}$$
and the complete-data probability is
$$P(d_{1:N}, z_{1:N}) = \pi_{\mathrm{ini}\,\tilde{r}_1} \zeta_{\tilde{r}_1 h_1} \delta_{g_1 1} \prod_{n=2}^{N} \left( \delta_{\tilde{r}_n\tilde{r}_{n-1}}\delta_{h_n h_{n-1}}\delta_{g_n(g_{n-1}+1)} + \delta_{g_{n-1}G_{h_{n-1}}}\, \pi_{\tilde{r}_{n-1}\tilde{r}_n}\, \zeta_{\tilde{r}_n h_n}\, \delta_{g_n 1} \right) \prod_{n=1}^{N} \phi_{h_n g_n d_n}.$$

2.2.4 NoteHMM1SD

Latent variables: $z_0 = s_0$, $z_n = (\tilde{r}_n, h_n, g_n, s_n)$ ($n = 1, \ldots, N$).
$$P(z_0 = s) = \xi_s, \qquad P(z_1 = (\tilde{r}, h, g, s)) = \pi_{\mathrm{ini}\,\tilde{r}}\, \zeta_{\tilde{r}h}\, \delta_{g1}\, \xi_s, \tag{20}$$
$$P(z_n = (\tilde{r}, h, g, s) \,|\, z_{n-1} = (\tilde{r}', h', g', s')) = \left( \delta_{\tilde{r}\tilde{r}'}\delta_{hh'}\delta_{g(g'+1)} + \delta_{g'G_{h'}}\, \pi_{\tilde{r}'\tilde{r}}\, \zeta_{\tilde{r}h}\, \delta_{g1} \right) \xi_s, \tag{21}$$
$$P(d_n \,|\, z_{n-1}, z_n) = \phi_{h_n g_n s_{n-1} s_n d_n} = \mathrm{Gauss}(d_n; (q_{h_n g_n} + s_n - s_{n-1})v, \sigma_t^2), \tag{22}$$
and the complete-data probability is
$$P(d_{1:N}, z_{0:N}) = \pi_{\mathrm{ini}\,\tilde{r}_1} \zeta_{\tilde{r}_1 h_1} \delta_{g_1 1} \prod_{n=2}^{N} \left( \delta_{\tilde{r}_n\tilde{r}_{n-1}}\delta_{h_n h_{n-1}}\delta_{g_n(g_{n-1}+1)} + \delta_{g_{n-1}G_{h_{n-1}}}\, \pi_{\tilde{r}_{n-1}\tilde{r}_n}\, \zeta_{\tilde{r}_n h_n}\, \delta_{g_n 1} \right) \cdot \prod_{n=0}^{N} \xi_{s_n} \prod_{n=1}^{N} \phi_{h_n g_n s_{n-1} s_n d_n}. \tag{23}$$

2.3 Bayesian Extension

We put prior models on the parameters $\pi_{\mathrm{ini}} = (\pi_{\mathrm{ini}\,\tilde{r}})_{\tilde{r}}$, $\pi_{\tilde{r}} = (\pi_{\tilde{r}\tilde{r}'})_{\tilde{r}'}$, $\zeta_{\tilde{r}} = (\zeta_{\tilde{r}h})_h$, and $\xi = (\xi_s)_s$ as
$$\pi_{\mathrm{ini}} = \mathrm{DP}(\alpha_{\mathrm{ini}}, \bar{\pi}_{\mathrm{ini}}), \quad \pi_{\tilde{r}} = \mathrm{DP}(\alpha_\pi, \bar{\pi}_{\tilde{r}}), \quad \zeta_{\tilde{r}} = \mathrm{DP}(\alpha_\zeta, \bar{\zeta}_{\tilde{r}}), \quad \xi = \mathrm{DP}(\alpha_\xi, \bar{\xi}). \tag{24}$$

2.4 Gibbs Sampling for Bayesian Learning

Bayesian learning of the model parameters using performance data can be done with Gibbs sampling. Let $\theta$ be the parameters, $z$ be the set of latent variables, and $\lambda$ be the collection of hyperparameters. We sample the distribution $P(\theta, z \,|\, d, \lambda)$ by alternately sampling $P(\theta \,|\, z, d, \lambda) = P(\theta \,|\, z, \lambda)$ (parameter sampling) and $P(z \,|\, \theta, d, \lambda) = P(z \,|\, \theta, d)$ (latent variable sampling).

2.4.1 Parameter Sampling for NoteHMM1 and NoteHMM1S

The sampling of the parameters $\theta$ from $P(\theta \,|\, z, \lambda)$ can be done using the following formulas:
$$\pi_{\mathrm{ini}} \sim \mathrm{Dir}(\alpha_{\mathrm{ini}}\bar{\pi}_{\mathrm{ini}} + c^{\mathrm{ini}}), \qquad c^{\mathrm{ini}}_{\tilde{r}} = \delta_{\tilde{r}_1\tilde{r}}, \tag{25}$$
$$\pi_{\tilde{r}'} \sim \mathrm{Dir}(\alpha_\pi \bar{\pi}_{\tilde{r}'} + c^{\pi}_{\tilde{r}'}), \qquad c^{\pi}_{\tilde{r}'\tilde{r}} = \sum_{n=2}^{N} \delta_{\tilde{r}_{n-1}\tilde{r}'}\delta_{\tilde{r}_n\tilde{r}}, \tag{26}$$
$$\xi \sim \mathrm{Dir}(\alpha_\xi \bar{\xi} + c^{\xi}), \qquad c^{\xi}_s = \sum_{n=0}^{N} \delta_{s_n s}. \tag{27}$$
Here, $c^{\mathrm{ini}}_{\tilde{r}}$, $c^{\pi}_{\tilde{r}'\tilde{r}}$, and $c^{\xi}_s$ count the numbers of times the relevant variables appear in the sequence $z$. Similar notation will be used in the following without further mention.

2.4.2 Parameter Sampling for NoteHMM1D and NoteHMM1SD

The sampling of the parameters $\theta$ from $P(\theta \,|\, z, \lambda)$ can be done using the following formulas:
$$\pi_{\mathrm{ini}} \sim \mathrm{Dir}(\alpha_{\mathrm{ini}}\bar{\pi}_{\mathrm{ini}} + c^{\mathrm{ini}}), \qquad c^{\mathrm{ini}}_{\tilde{r}} = \delta_{\tilde{r}_1\tilde{r}}, \tag{28}$$
$$\pi_{\tilde{r}'} \sim \mathrm{Dir}(\alpha_\pi \bar{\pi}_{\tilde{r}'} + c^{\pi}_{\tilde{r}'}), \qquad c^{\pi}_{\tilde{r}'\tilde{r}} = \sum_{n=2}^{N} \delta_{g_n 1}\delta_{\tilde{r}_{n-1}\tilde{r}'}\delta_{\tilde{r}_n\tilde{r}}, \tag{29}$$
$$\zeta_{\tilde{r}} \sim \mathrm{Dir}(\alpha_\zeta \bar{\zeta}_{\tilde{r}} + c^{\zeta}_{\tilde{r}}), \qquad c^{\zeta}_{\tilde{r}h} = \sum_{n=1}^{N} \delta_{g_n 1}\delta_{\tilde{r}_n\tilde{r}}\delta_{h_n h}, \tag{30}$$
$$\xi \sim \mathrm{Dir}(\alpha_\xi \bar{\xi} + c^{\xi}), \qquad c^{\xi}_s = \sum_{n=0}^{N} \delta_{s_n s}. \tag{31}$$


2.4.3 Latent Variables Sampling

The latent variables can be sampled using the forward filtering-backward sampling method. The forward variable is defined as
$$\alpha_n(z_n) = P(d_{1:n}, z_n).$$
For NoteHMM1 and NoteHMM1D, the forward variable can be computed as
$$\alpha_1(z_1) = P(z_1) P(d_1 \,|\, z_1), \tag{32}$$
$$\alpha_n(z_n) = \sum_{z_{n-1}} P(d_n \,|\, z_n) P(z_n \,|\, z_{n-1}) \alpha_{n-1}(z_{n-1}), \tag{33}$$
$$P(d = d_{1:N}) = \sum_{z_N} \alpha_N(z_N). \tag{34}$$
The backward sampling step goes as
$$P(z_N \,|\, d) \propto P(d, z_N) = \alpha_N(z_N), \tag{35}$$
$$P(z_n \,|\, z_{n+1:N}, d) = P(z_n \,|\, z_{n+1}, d_{1:n}) \propto P(z_n, z_{n+1}, d_{1:n}) = P(z_{n+1} \,|\, z_n) \alpha_n(z_n). \tag{36}$$
For NoteHMM1S and NoteHMM1SD, the forward variables can be computed as
$$\alpha_0(z_0) = P(z_0), \tag{37}$$
$$\alpha_n(z_n) = \sum_{z_{n-1}} P(d_n \,|\, z_{n-1}, z_n) P(z_n \,|\, z_{n-1}) \alpha_{n-1}(z_{n-1}), \tag{38}$$
$$P(d = d_{1:N}) = \sum_{z_N} \alpha_N(z_N). \tag{39}$$
The backward sampling step goes as
$$P(z_N \,|\, d) \propto P(d, z_N) = \alpha_N(z_N), \tag{40}$$
$$P(z_n \,|\, z_{n+1:N}, d) = P(z_n \,|\, z_{n+1}, d_{1:n+1}) \propto P(z_n, z_{n+1}, d_{1:n+1}) = P(z_{n+1} \,|\, z_n) P(d_{n+1} \,|\, z_n, z_{n+1}) \alpha_n(z_n). \tag{41}$$

3 Metrical Markov Model

3.1 Non-Bayesian Models

3.1.1 Basic Model (MetMM1)

The metrical positions $b_{0:N}$ are generated as
$$P(b_0) = \chi_{\mathrm{ini}\,b_0}, \qquad P(b_n \,|\, b_{n-1}) = \chi_{b_{n-1}b_n}.$$
The note values and the onset score times are then given as
$$\tau_0 = b_0, \qquad r_n = \tau_n - \tau_{n-1} = [b_{n-1}, b_n]_{N_b} = \begin{cases} b_n - b_{n-1}, & b_{n-1} < b_n; \\ b_n - b_{n-1} + N_b, & b_{n-1} \geq b_n. \end{cases}$$

3.1.2 Model with Only Onset Shifts (MetMM1S)

Introduce internal variables $z_n = (\tilde{b}_n, s_n)$ ($n = 0, \ldots, N$).
$$P(z_0 = (\tilde{b}, s)) = \chi_{\mathrm{ini}\,\tilde{b}}\, \xi_s, \qquad P(z_n = (\tilde{b}, s) \,|\, z_{n-1} = (\tilde{b}', s')) = \chi_{\tilde{b}'\tilde{b}}\, \xi_s, \tag{42}$$
$$P(r_n \,|\, z_{n-1}, z_n) = I(r_n = [\tilde{b}_{n-1}, \tilde{b}_n] + s_n - s_{n-1}). \tag{43}$$


3.1.3 Model with Only Note Divisions (MetMM1D)

Introduce internal variables $z_0 = \tilde{b}_0$, $z_n = (\tilde{b}_n, h_n, g_n)$ ($n = 1, \ldots, N$).
$$P(z_0 = \tilde{b}_0) = \chi_{\mathrm{ini}\,\tilde{b}_0}, \qquad P(z_1 = (\tilde{b}, h, g) \,|\, z_0 = \tilde{b}') = \chi_{\tilde{b}'\tilde{b}}\, \zeta_{[\tilde{b}',\tilde{b}]h}\, \delta_{g1}, \tag{44}$$
$$P(z_n = (\tilde{b}, h, g) \,|\, z_{n-1} = (\tilde{b}', h', g')) = \delta_{\tilde{b}\tilde{b}'}\delta_{hh'}\delta_{g(g'+1)} + \delta_{g'G_{h'}}\, \chi_{\tilde{b}'\tilde{b}}\, \zeta_{[\tilde{b}',\tilde{b}]h}\, \delta_{g1}, \tag{45}$$
$$P(r_n \,|\, z_n) = I(r_n = q_{h_n g_n}). \tag{46}$$

3.1.4 Model with Onset Shifts and Note Divisions (MetMM1SD)

Introduce internal variables $z_0 = (\tilde{b}_0, s_0)$, $z_n = (\tilde{b}_n, h_n, g_n, s_n)$ ($n = 1, \ldots, N$).
$$P(z_0 = (\tilde{b}, s)) = \chi_{\mathrm{ini}\,\tilde{b}}\, \xi_s, \qquad P(z_1 = (\tilde{b}, h, g, s) \,|\, z_0 = (\tilde{b}', s')) = \xi_s\, \chi_{\tilde{b}'\tilde{b}}\, \zeta_{[\tilde{b}',\tilde{b}]h}\, \delta_{g1}, \tag{47}$$
$$P(z_n = (\tilde{b}, h, g, s) \,|\, z_{n-1} = (\tilde{b}', h', g', s')) = \xi_s \left( \delta_{\tilde{b}\tilde{b}'}\delta_{hh'}\delta_{g(g'+1)} + \delta_{g'G_{h'}}\, \chi_{\tilde{b}'\tilde{b}}\, \zeta_{[\tilde{b}',\tilde{b}]h}\, \delta_{g1} \right), \tag{48}$$
$$P(r_n \,|\, z_{n-1}, z_n) = I(r_n = q_{h_n g_n} + s_n - s_{n-1}). \tag{49}$$

3.2 Integration with the Performance Model

3.2.1 MetHMM1

Latent variables: $z_n = b_n$ ($n = 0, \ldots, N$).
$$P(z_0 = b) = \chi_{\mathrm{ini}\,b}, \qquad P(z_n = b \,|\, z_{n-1} = b') = \chi_{b'b}, \tag{50}$$
$$P(d_n \,|\, z_{n-1} = b', z_n = b) = \phi_{b'b d_n} = \mathrm{Gauss}(d_n; [b', b]v, \sigma_t^2), \tag{51}$$
and the complete-data probability is
$$P(d_{1:N}, z_{0:N}) = \chi_{\mathrm{ini}\,b_0} \prod_{n=1}^{N} \left( \chi_{b_{n-1}b_n} \phi_{b_{n-1}b_n d_n} \right).$$

3.2.2 MetHMM1S

Latent variables: $z_n = (\tilde{b}_n, s_n)$ ($n = 0, \ldots, N$).
$$P(z_0 = (\tilde{b}, s)) = \chi_{\mathrm{ini}\,\tilde{b}}\, \xi_s, \qquad P(z_n = (\tilde{b}, s) \,|\, z_{n-1} = (\tilde{b}', s')) = \chi_{\tilde{b}'\tilde{b}}\, \xi_s, \tag{52}$$
$$P(d_n \,|\, z_{n-1}, z_n) = \phi_{\tilde{b}_{n-1}s_{n-1}\tilde{b}_n s_n d_n} = \mathrm{Gauss}(d_n; ([\tilde{b}_{n-1}, \tilde{b}_n] + s_n - s_{n-1})v, \sigma_t^2), \tag{53}$$
and the complete-data probability is
$$P(d_{1:N}, z_{0:N}) = \chi_{\mathrm{ini}\,\tilde{b}_0}\, \xi_{s_0} \prod_{n=1}^{N} \left( \chi_{\tilde{b}_{n-1}\tilde{b}_n}\, \xi_{s_n}\, \phi_{\tilde{b}_{n-1}s_{n-1}\tilde{b}_n s_n d_n} \right). \tag{54}$$

3.2.3 MetHMM1D

Latent variables: $z_0 = \tilde{b}_0$, $z_n = (\tilde{b}_n, h_n, g_n)$ ($n = 1, \ldots, N$).
$$P(z_0 = \tilde{b}_0) = \chi_{\mathrm{ini}\,\tilde{b}_0}, \qquad P(z_1 = (\tilde{b}, h, g) \,|\, z_0 = \tilde{b}') = \chi_{\tilde{b}'\tilde{b}}\, \zeta_{[\tilde{b}',\tilde{b}]h}\, \delta_{g1}, \tag{55}$$
$$P(z_n = (\tilde{b}, h, g) \,|\, z_{n-1} = (\tilde{b}', h', g')) = \delta_{\tilde{b}\tilde{b}'}\delta_{hh'}\delta_{g(g'+1)} + \delta_{g'G_{h'}}\, \chi_{\tilde{b}'\tilde{b}}\, \zeta_{[\tilde{b}',\tilde{b}]h}\, \delta_{g1}, \tag{56}$$
$$P(d_n \,|\, z_n) = \phi_{h_n g_n d_n} = \mathrm{Gauss}(d_n; q_{h_n g_n}v, \sigma_t^2), \tag{57}$$
and the complete-data probability is
$$P(d_{1:N}, z_{0:N}) = \chi_{\mathrm{ini}\,\tilde{b}_0}\, \chi_{\tilde{b}_0\tilde{b}_1}\, \zeta_{[\tilde{b}_0,\tilde{b}_1]h_1}\, \delta_{g_1 1} \prod_{n=1}^{N} \phi_{h_n g_n d_n} \cdot \prod_{n=2}^{N} \left( \delta_{\tilde{b}_n\tilde{b}_{n-1}}\delta_{h_n h_{n-1}}\delta_{g_n(g_{n-1}+1)} + \delta_{g_{n-1}G_{h_{n-1}}}\, \chi_{\tilde{b}_{n-1}\tilde{b}_n}\, \zeta_{[\tilde{b}_{n-1},\tilde{b}_n]h_n}\, \delta_{g_n 1} \right). \tag{58}$$

3.2.4 MetHMM1SD

Latent variables: $z_0 = (\tilde{b}_0, s_0)$, $z_n = (\tilde{b}_n, h_n, g_n, s_n)$ ($n = 1, \ldots, N$).
$$P(z_0 = (\tilde{b}, s)) = \chi_{\mathrm{ini}\,\tilde{b}}\, \xi_s, \qquad P(z_1 = (\tilde{b}, h, g, s) \,|\, z_0 = (\tilde{b}', s')) = \xi_s\, \chi_{\tilde{b}'\tilde{b}}\, \zeta_{[\tilde{b}',\tilde{b}]h}\, \delta_{g1}, \tag{59}$$
$$P(z_n = (\tilde{b}, h, g, s) \,|\, z_{n-1} = (\tilde{b}', h', g', s')) = \xi_s \left( \delta_{\tilde{b}\tilde{b}'}\delta_{hh'}\delta_{g(g'+1)} + \delta_{g'G_{h'}}\, \chi_{\tilde{b}'\tilde{b}}\, \zeta_{[\tilde{b}',\tilde{b}]h}\, \delta_{g1} \right), \tag{60}$$
$$P(d_n \,|\, z_{n-1}, z_n) = \phi_{h_n g_n s_{n-1} s_n d_n} = \mathrm{Gauss}(d_n; (q_{h_n g_n} + s_n - s_{n-1})v, \sigma_t^2), \tag{61}$$
and the complete-data probability is
$$P(d_{1:N}, z_{0:N}) = \chi_{\mathrm{ini}\,\tilde{b}_0}\, \chi_{\tilde{b}_0\tilde{b}_1}\, \zeta_{[\tilde{b}_0,\tilde{b}_1]h_1}\, \delta_{g_1 1} \prod_{n=0}^{N} \xi_{s_n} \prod_{n=1}^{N} \phi_{h_n g_n s_{n-1} s_n d_n} \cdot \prod_{n=2}^{N} \left( \delta_{\tilde{b}_n\tilde{b}_{n-1}}\delta_{h_n h_{n-1}}\delta_{g_n(g_{n-1}+1)} + \delta_{g_{n-1}G_{h_{n-1}}}\, \chi_{\tilde{b}_{n-1}\tilde{b}_n}\, \zeta_{[\tilde{b}_{n-1},\tilde{b}_n]h_n}\, \delta_{g_n 1} \right). \tag{62}$$

3.3 Bayesian Extension

We put prior models on the parameters as

$$\chi_{\mathrm{ini}} = \mathrm{DP}(\alpha_{\mathrm{ini}}, \bar{\chi}_{\mathrm{ini}}), \quad \chi_{\tilde{b}} = \mathrm{DP}(\alpha_\chi, \bar{\chi}_{\tilde{b}}), \quad \zeta_r = \mathrm{DP}(\alpha_\zeta, \bar{\zeta}_r), \quad \xi = \mathrm{DP}(\alpha_\xi, \bar{\xi}). \tag{63}$$

3.4 Gibbs Sampling for Bayesian Learning

The general procedure for Bayesian learning is the same as in Sec. 2.4.

3.4.1 Parameter Sampling for MetHMM1 and MetHMM1S

$$\chi_{\mathrm{ini}} \sim \mathrm{Dir}(\alpha_{\mathrm{ini}}\bar{\chi}_{\mathrm{ini}} + c^{\mathrm{ini}}), \qquad c^{\mathrm{ini}}_b = \delta_{\tilde{b}_0 b}, \tag{64}$$
$$\chi_{\tilde{b}'} \sim \mathrm{Dir}(\alpha_\chi \bar{\chi}_{\tilde{b}'} + c^{\chi}_{\tilde{b}'}), \qquad c^{\chi}_{\tilde{b}'\tilde{b}} = \sum_{n=1}^{N} \delta_{\tilde{b}_{n-1}\tilde{b}'}\delta_{\tilde{b}_n\tilde{b}}, \tag{65}$$
$$\xi \sim \mathrm{Dir}(\alpha_\xi \bar{\xi} + c^{\xi}), \qquad c^{\xi}_s = \sum_{n=0}^{N} \delta_{s_n s}. \tag{66}$$


3.4.2 Parameter Sampling for MetHMM1D and MetHMM1SD

$$\chi_{\mathrm{ini}} \sim \mathrm{Dir}(\alpha_{\mathrm{ini}}\bar{\chi}_{\mathrm{ini}} + c^{\mathrm{ini}}), \qquad c^{\mathrm{ini}}_b = \delta_{\tilde{b}_0 b}, \tag{67}$$
$$\chi_{\tilde{b}'} \sim \mathrm{Dir}(\alpha_\chi \bar{\chi}_{\tilde{b}'} + c^{\chi}_{\tilde{b}'}), \qquad c^{\chi}_{\tilde{b}'\tilde{b}} = \sum_{n=1}^{N} \delta_{g_n 1}\delta_{\tilde{b}_{n-1}\tilde{b}'}\delta_{\tilde{b}_n\tilde{b}}, \tag{68}$$
$$\zeta_r \sim \mathrm{Dir}(\alpha_\zeta \bar{\zeta}_r + c^{\zeta}_r), \qquad c^{\zeta}_{rh} = \sum_{n=1}^{N} \delta_{g_n 1}\, I([\tilde{b}_{n-1}, \tilde{b}_n] = r)\, \delta_{h_n h}, \tag{69}$$
$$\xi \sim \mathrm{Dir}(\alpha_\xi \bar{\xi} + c^{\xi}), \qquad c^{\xi}_s = \sum_{n=0}^{N} \delta_{s_n s}. \tag{70}$$

3.4.3 Latent Variables Sampling

The forward variable is defined as
$$\alpha_n(z_n) = P(d_{1:n}, z_n),$$
which can be computed as
$$\alpha_0(z_0) = P(z_0), \tag{71}$$
$$\alpha_n(z_n) = \sum_{z_{n-1}} P(d_n \,|\, z_{n-1}, z_n) P(z_n \,|\, z_{n-1}) \alpha_{n-1}(z_{n-1}), \tag{72}$$
$$P(d = d_{1:N}) = \sum_{z_N} \alpha_N(z_N). \tag{73}$$
The backward sampling step goes as follows:
$$P(z_N \,|\, d) \propto P(d, z_N) = \alpha_N(z_N), \tag{74}$$
$$P(z_n \,|\, z_{n+1:N}, d) = P(z_n \,|\, z_{n+1}, d_{1:n+1}) \propto P(z_n, z_{n+1}, d_{1:n+1}) \propto P(d_{n+1} \,|\, z_n, z_{n+1}) P(z_{n+1} \,|\, z_n) \alpha_n(z_n). \tag{75}$$

4 Note Pattern Markov Model

4.1 Non-Bayesian Models

4.1.1 Basic Model (PatMM1)

The basic note pattern Markov model is described with internal variables $z_n = (k_n, i_n)$ ($n = 1, \ldots, N$). The initial and transition probabilities are given as
$$P(z_1 = (k, i)) = \psi_{\mathrm{ini}(ki)} = \Gamma_{\mathrm{ini}\,k}\, \gamma^{k}_{\mathrm{ini}\,i} = \Gamma_{\mathrm{ini}\,k}\, \delta_{i1}, \tag{76}$$
$$P(z_n = (k, i) \,|\, z_{n-1} = (k', i')) = \psi_{(k'i')(ki)} = \delta_{kk'}\, \gamma^{k}_{i'i} + \gamma^{k'}_{i'\,\mathrm{end}}\, \Gamma_{k'k}\, \gamma^{k}_{\mathrm{ini}\,i} = \delta_{kk'}\, \delta_{i(i'+1)} + \delta_{i'I_{k'}}\, \Gamma_{k'k}\, \delta_{i1}, \tag{77}$$
$$P(r_n \,|\, z_{n-1}, z_n) = I(r_n = [B_{k_{n-1}i_{n-1}}, B_{k_n i_n}]). \tag{78}$$

4.1.2 Model with Only Onset Shifts (PatMM1S)

Introduce internal variables $z_n = (k_n, i_n, s_n)$ ($n = 0, \ldots, N$).
$$P(z_0 = (k, i, s)) = \psi_{\mathrm{ini}(ki)}\, \xi_s, \qquad P(z_n = (k, i, s) \,|\, z_{n-1} = (k', i', s')) = \psi_{(k'i')(ki)}\, \xi_s, \tag{79}$$
$$P(r_n \,|\, z_{n-1}, z_n) = I(r_n = [B_{k_{n-1}i_{n-1}}, B_{k_n i_n}] + s_n - s_{n-1}). \tag{80}$$


4.1.3 Model with Only Note Divisions (PatMM1D)

Introduce internal variables $z_0 = (k_0, i_0)$, $z_n = (k_n, i_n, h_n, g_n)$ ($n = 1, \ldots, N$).
$$P(z_0 = (k, i)) = \psi_{\mathrm{ini}(ki)}, \qquad P(z_1 = (k, i, h, g) \,|\, z_0 = (k', i')) = \psi_{(k'i')(ki)}\, \zeta_{[B_{k'i'},B_{ki}]h}\, \delta_{g1},$$
$$P(z_n = (k, i, h, g) \,|\, z_{n-1} = (k', i', h', g')) = \delta_{kk'}\delta_{ii'}\delta_{hh'}\delta_{g(g'+1)} + \delta_{g'G_{h'}}\, \psi_{(k'i')(ki)}\, \zeta_{[B_{k'i'},B_{ki}]h}\, \delta_{g1},$$
$$P(r_n \,|\, z_n) = I(r_n = q_{h_n g_n}).$$

4.1.4 Model with Onset Shifts and Note Divisions (PatMM1SD)

Introduce internal variables $z_0 = (k_0, i_0, s_0)$, $z_n = (k_n, i_n, h_n, g_n, s_n)$ ($n = 1, \ldots, N$).
$$P(z_0 = (k, i, s)) = \psi_{\mathrm{ini}(ki)}\, \xi_s, \qquad P(z_1 = (k, i, h, g, s) \,|\, z_0 = (k', i', s')) = \psi_{(k'i')(ki)}\, \zeta_{[B_{k'i'},B_{ki}]h}\, \delta_{g1}\, \xi_s,$$
$$P(z_n = (k, i, h, g, s) \,|\, z_{n-1} = (k', i', h', g', s')) = \left( \delta_{kk'}\delta_{ii'}\delta_{hh'}\delta_{g(g'+1)} + \delta_{g'G_{h'}}\, \psi_{(k'i')(ki)}\, \zeta_{[B_{k'i'},B_{ki}]h}\, \delta_{g1} \right) \xi_s,$$
$$P(r_n \,|\, z_{n-1}, z_n) = I(r_n = q_{h_n g_n} + s_n - s_{n-1}).$$

4.2 Integration with the Performance Model

4.2.1 PatHMM1

Latent variables: $z_n = (k_n, i_n)$ ($n = 0, \ldots, N$).
$$P(z_0 = (k, i)) = \psi_{\mathrm{ini}(ki)}, \qquad P(z_n = (k, i) \,|\, z_{n-1} = (k', i')) = \psi_{(k'i')(ki)}, \tag{81}$$
$$P(d_n \,|\, z_{n-1} = (k', i'), z_n = (k, i)) = \phi_{(k'i')(ki)d_n} = \mathrm{Gauss}(d_n; [B_{k'i'}, B_{ki}]v, \sigma_t^2), \tag{82}$$
and the complete-data probability is
$$P(d_{1:N}, z_{0:N}) = \psi_{\mathrm{ini}(k_0 i_0)} \prod_{n=1}^{N} \left( \psi_{(k_{n-1}i_{n-1})(k_n i_n)}\, \phi_{(k_{n-1}i_{n-1})(k_n i_n)d_n} \right).$$

4.2.2 PatHMM1S

Latent variables: $z_n = (k_n, i_n, s_n)$ ($n = 0, \ldots, N$).
$$P(z_0 = (k, i, s)) = \psi_{\mathrm{ini}(ki)}\, \xi_s, \qquad P(z_n = (k, i, s) \,|\, z_{n-1} = (k', i', s')) = \psi_{(k'i')(ki)}\, \xi_s, \tag{83}$$
$$P(d_n \,|\, z_{n-1}, z_n) = \phi_{k_{n-1}i_{n-1}s_{n-1}k_n i_n s_n d_n} = \mathrm{Gauss}(d_n; ([B_{k_{n-1}i_{n-1}}, B_{k_n i_n}] + s_n - s_{n-1})v, \sigma_t^2), \tag{84}$$
and the complete-data probability is
$$P(d_{1:N}, z_{0:N}) = \psi_{\mathrm{ini}(k_0 i_0)}\, \xi_{s_0} \prod_{n=1}^{N} \left( \psi_{(k_{n-1}i_{n-1})(k_n i_n)}\, \xi_{s_n}\, \phi_{k_{n-1}i_{n-1}s_{n-1}k_n i_n s_n d_n} \right). \tag{85}$$

4.2.3 PatHMM1D

Latent variables: $z_0 = (k_0, i_0)$, $z_n = (k_n, i_n, h_n, g_n)$ ($n = 1, \ldots, N$).
$$P(z_0 = (k_0, i_0)) = \psi_{\mathrm{ini}(k_0 i_0)}, \qquad P(z_1 = (k, i, h, g) \,|\, z_0 = (k', i')) = \psi_{(k'i')(ki)}\, \zeta_{[B_{k'i'},B_{ki}]h}\, \delta_{g1}, \tag{86}$$
$$P(z_n = (k, i, h, g) \,|\, z_{n-1} = (k', i', h', g')) = \delta_{kk'}\delta_{ii'}\delta_{hh'}\delta_{g(g'+1)} + \delta_{g'G_{h'}}\, \psi_{(k'i')(ki)}\, \zeta_{[B_{k'i'},B_{ki}]h}\, \delta_{g1}, \tag{87}$$
$$P(d_n \,|\, z_n) = \phi_{h_n g_n d_n} = \mathrm{Gauss}(d_n; q_{h_n g_n}v, \sigma_t^2), \tag{88}$$
and the complete-data probability is
$$P(d_{1:N}, z_{0:N}) = \psi_{\mathrm{ini}(k_0 i_0)}\, \psi_{(k_0 i_0)(k_1 i_1)}\, \zeta_{[B_{k_0 i_0},B_{k_1 i_1}]h_1}\, \delta_{g_1 1} \prod_{n=1}^{N} \phi_{h_n g_n d_n} \cdot \prod_{n=2}^{N} \left( \delta_{k_n k_{n-1}}\delta_{i_n i_{n-1}}\delta_{h_n h_{n-1}}\delta_{g_n(g_{n-1}+1)} + \delta_{g_{n-1}G_{h_{n-1}}}\, \psi_{(k_{n-1}i_{n-1})(k_n i_n)}\, \zeta_{[B_{k_{n-1}i_{n-1}},B_{k_n i_n}]h_n}\, \delta_{g_n 1} \right).$$

4.2.4 PatHMM1SD

Latent variables: $z_0 = (k_0, i_0, s_0)$, $z_n = (k_n, i_n, h_n, g_n, s_n)$ ($n = 1, \ldots, N$).
$$P(z_0 = (k, i, s)) = \psi_{\mathrm{ini}(ki)}\, \xi_s, \qquad P(z_1 = (k, i, h, g, s) \,|\, z_0 = (k', i', s')) = \xi_s\, \psi_{(k'i')(ki)}\, \zeta_{[B_{k'i'},B_{ki}]h}\, \delta_{g1},$$
$$P(z_n = (k, i, h, g, s) \,|\, z_{n-1} = (k', i', h', g', s')) = \xi_s \left( \delta_{kk'}\delta_{ii'}\delta_{hh'}\delta_{g(g'+1)} + \delta_{g'G_{h'}}\, \psi_{(k'i')(ki)}\, \zeta_{[B_{k'i'},B_{ki}]h}\, \delta_{g1} \right),$$
$$P(d_n \,|\, z_{n-1}, z_n) = \phi_{h_n g_n s_{n-1} s_n d_n} = \mathrm{Gauss}(d_n; (q_{h_n g_n} + s_n - s_{n-1})v, \sigma_t^2),$$
and the complete-data probability is
$$P(d_{1:N}, z_{0:N}) = \psi_{\mathrm{ini}(k_0 i_0)}\, \psi_{(k_0 i_0)(k_1 i_1)}\, \zeta_{[B_{k_0 i_0},B_{k_1 i_1}]h_1}\, \delta_{g_1 1} \prod_{n=0}^{N} \xi_{s_n} \prod_{n=1}^{N} \phi_{h_n g_n s_{n-1} s_n d_n} \cdot \prod_{n=2}^{N} \left( \delta_{k_n k_{n-1}}\delta_{i_n i_{n-1}}\delta_{h_n h_{n-1}}\delta_{g_n(g_{n-1}+1)} + \delta_{g_{n-1}G_{h_{n-1}}}\, \psi_{(k_{n-1}i_{n-1})(k_n i_n)}\, \zeta_{[B_{k_{n-1}i_{n-1}},B_{k_n i_n}]h_n}\, \delta_{g_n 1} \right).$$

4.3 Bayesian Extension

We put prior models on the parameters as

$$\Gamma_{\mathrm{ini}} = \mathrm{DP}(\alpha_{\mathrm{ini}}, \bar{\Gamma}_{\mathrm{ini}}), \quad \Gamma_k = \mathrm{DP}(\alpha_\Gamma, \bar{\Gamma}_k), \quad \zeta_r = \mathrm{DP}(\alpha_\zeta, \bar{\zeta}_r), \quad \xi = \mathrm{DP}(\alpha_\xi, \bar{\xi}). \tag{89}$$

4.4 Gibbs Sampling for Bayesian Learning

The general procedure for Bayesian learning is the same as in Sec. 2.4.

4.4.1 Parameter Sampling for PatHMM1 and PatHMM1S

$$\Gamma_{\mathrm{ini}} \sim \mathrm{Dir}(\alpha_{\mathrm{ini}}\bar{\Gamma}_{\mathrm{ini}} + c^{\mathrm{ini}}), \qquad c^{\mathrm{ini}}_k = \delta_{k_0 k}, \tag{90}$$
$$\Gamma_{k'} \sim \mathrm{Dir}(\alpha_\Gamma \bar{\Gamma}_{k'} + c^{\Gamma}_{k'}), \qquad c^{\Gamma}_{k'k} = \sum_{n=1}^{N} \delta_{i_n 1}\delta_{k_{n-1}k'}\delta_{k_n k}, \tag{91}$$
$$\xi \sim \mathrm{Dir}(\alpha_\xi \bar{\xi} + c^{\xi}), \qquad c^{\xi}_s = \sum_{n=0}^{N} \delta_{s_n s}. \tag{92}$$


4.4.2 Parameter Sampling for PatHMM1D and PatHMM1SD

$$\Gamma_{\mathrm{ini}} \sim \mathrm{Dir}(\alpha_{\mathrm{ini}}\bar{\Gamma}_{\mathrm{ini}} + c^{\mathrm{ini}}), \qquad c^{\mathrm{ini}}_k = \delta_{k_0 k}, \tag{93}$$
$$\Gamma_{k'} \sim \mathrm{Dir}(\alpha_\Gamma \bar{\Gamma}_{k'} + c^{\Gamma}_{k'}), \qquad c^{\Gamma}_{k'k} = \sum_{n=1}^{N} \delta_{g_n 1}\delta_{i_n 1}\delta_{k_{n-1}k'}\delta_{k_n k}, \tag{94}$$
$$\zeta_r \sim \mathrm{Dir}(\alpha_\zeta \bar{\zeta}_r + c^{\zeta}_r), \qquad c^{\zeta}_{rh} = \sum_{n=1}^{N} \delta_{g_n 1}\, I([B_{k_{n-1}i_{n-1}}, B_{k_n i_n}] = r)\, \delta_{h_n h}, \tag{95}$$
$$\xi \sim \mathrm{Dir}(\alpha_\xi \bar{\xi} + c^{\xi}), \qquad c^{\xi}_s = \sum_{n=0}^{N} \delta_{s_n s}. \tag{96}$$

4.4.3 Latent Variables Sampling

The forward variable is defined as
$$\alpha_n(z_n) = P(d_{1:n}, z_n),$$
which can be computed as
$$\alpha_0(z_0) = P(z_0), \tag{97}$$
$$\alpha_n(z_n) = \sum_{z_{n-1}} P(d_n \,|\, z_{n-1}, z_n) P(z_n \,|\, z_{n-1}) \alpha_{n-1}(z_{n-1}), \tag{98}$$
$$P(d = d_{1:N}) = \sum_{z_N} \alpha_N(z_N). \tag{99}$$
The backward sampling step goes as follows:
$$P(z_N \,|\, d) \propto P(d, z_N) = \alpha_N(z_N),$$
$$P(z_n \,|\, z_{n+1:N}, d) = P(z_n \,|\, z_{n+1}, d_{1:n+1}) \propto P(z_n, z_{n+1}, d_{1:n+1}) \propto P(d_{n+1} \,|\, z_n, z_{n+1}) P(z_{n+1} \,|\, z_n) \alpha_n(z_n).$$
