Automatic Transcription of Melody, Bass Line, and Chords in Polyphonic Music

Matti P. Ryynänen and Anssi P. Klapuri
Department of Signal Processing
Tampere University of Technology
P.O. Box 553
FI-33101 Tampere, Finland
{matti.ryynanen, anssi.klapuri}@tut.fi

This article proposes a method for the automatic transcription of the melody, bass line, and chords in polyphonic pop music. The method uses a frame-wise pitch-salience estimator as a feature extraction front-end. For the melody and bass-line transcription, this is followed by acoustic modeling of note events and musicological modeling of note transitions. The acoustic models include a model for the target notes (i.e., melody or bass notes) and a background model. The musicological model involves key estimation and note bigrams that determine probabilities for transitions between target notes. A transcription of the melody or the bass line is obtained using Viterbi search via the target and the background note models. The performance of the melody and the bass-line transcription is evaluated using approximately 8.5 hours of realistic polyphonic music. The chord transcription maps the pitch salience estimates to a pitch-class representation and uses trained chord models and chord-transition probabilities to produce a transcription consisting of major and minor triads. For chords, the evaluation material consists of the first eight Beatles albums. The method is computationally efficient and allows causal implementation, so it can process streaming audio.

Transcription of music refers to the analysis of an acoustic music signal for producing a parametric representation of the signal. The representation may be a music score with a meticulous arrangement for each instrument or an approximate description of melody and chords in the piece, for example. The latter type of transcription is commonly used in commercial songbooks of pop music and is usually sufficient for musicians or music hobbyists to play the piece. On the other hand, more detailed transcriptions are often employed in classical music to preserve the exact arrangement of the composer.

Computer Music Journal, 32:3, pp. 72–86, Fall 2008. © 2008 Massachusetts Institute of Technology.

We propose a method for the automatic transcription of the melody, bass line, and chords in pop-music recordings. Conventionally, these tasks have been carried out by trained musicians who listen to a piece of music and write down notes or chords by hand, which is time-consuming and requires musical training. A machine transcriber enables several applications. First, it provides an easy way of obtaining a description of a music recording, allowing musicians to play it. Second, the produced transcriptions may be used in music analysis, music information retrieval (MIR) from large music databases, content-based audio processing, and interactive music systems, for example.

A note is here defined by a discrete pitch, an onset time, and duration. The melody of a piece is an organized sequence of consecutive notes and rests, usually performed by a lead singer or by a solo instrument. More informally, the melody is the part one often hums when listening to a piece. The bass line consists of notes in a lower pitch register and is usually played with a bass guitar, a double bass, or a bass synthesizer. A chord is a combination of notes that sound simultaneously or nearly simultaneously. In pop music, these concepts are usually rather unambiguous.

Figure 1 shows the waveform of an example music signal and two different representations of its melody, bass line, and chords. The middle panels show a piano-roll representation of the melody and the bass notes, respectively. Notes in this representation can be compactly saved in a MIDI file. The lowest panel represents the same notes and the chords in the common musical notation, where the note onsets and durations are indicated by discrete symbols. The proposed method produces a piano-roll representation of the melody and the bass line together with chord labels. If desired, the note timings can be further quantized to obtain common music notation (Hainsworth 2006; Whiteley, Cemgil, and Godsill 2006).

Figure 1. Given an audio recording (top panel), one can represent the melody, bass line, and chords with a piano roll (middle panels) and with a music score (bottom panel).

Work on polyphonic music transcription dates back more than 30 years (Moorer 1977). Nowadays, the concept of automatic music transcription includes several topics, such as multi-pitch analysis, beat tracking and rhythm analysis, transcription of percussive instruments, instrument recognition, harmonic analysis and chord transcription, and music structure analysis. For an overview of the topics, see Klapuri and Davy (2006). Fundamental frequency (F0) tracking of the melody and bass lines in polyphonic music signals was first considered by Goto (2000, 2004). Later, either F0 tracking or note-level transcription of the melody has been considered by Paiva, Mendes, and Cardoso (2005); Ellis and Poliner (2006); Dressler (2006); and Ryynänen and Klapuri (2006), and bass-line transcription has been addressed by Hainsworth and Macleod (2001) and Ryynänen and Klapuri (2007). Poliner et al. (2007) reported results on a comparative evaluation of melody transcription methods. Automatic chord transcription from audio has been considered by Sheh and Ellis (2003), Bello and Pickens (2005), and Harte and Sandler (2005).

Figure 2 shows a block diagram of the proposed method. An audio signal is processed frame-wise with two feature extractors: a pitch-salience estimator and an accent estimator that indicates potential note onsets based on signal energy. These features are used to compute observation likelihoods for target notes (i.e., melody or bass notes), other instrument notes, and silence or noise segments, each of which is modeled using a Hidden Markov Model (HMM). (See Rabiner and Juang 1993 for an introduction.) The musicological model estimates the musical key based on the pitch-salience function and then chooses between-note transition probabilities modeled with a note bigram. The Viterbi algorithm (Forney 1973) is used to find the optimal path through the target note models to produce a transcription. To summarize, the method incorporates both low-level acoustic modeling and high-level musicological modeling, and it produces discrete pitch labels and the beginning and ending times for the transcribed notes simultaneously.

Figure 2. A block diagram of the proposed method.

The chord-transcription method uses a 24-state HMM consisting of twelve states for both major and minor triads. The observation likelihoods are computed by mapping the pitch saliences into a pitch-class representation and comparing them with trained profiles for major and minor chords. Probabilities of between-chord transitions are estimated from training data, and Viterbi decoding is used to find a path through the chord models. The rest of this article explains the proposed method in more detail and presents evaluation results with quantitative comparisons to other methods.

Feature Extraction

The front-end of the method consists of two frame-wise feature extractors: a pitch-salience estimator and an accent estimator. Pitch salience s_t(τ) measures the strength of fundamental period τ in analysis frame t, and the accent a_t measures spectral change from frame t − 1 to frame t, in practice indicating potential note onsets. Input signals are sampled at an f_s = 44.1 kHz rate, and stereo signals are downmixed to mono before the feature extraction.

Pitch Salience Estimation

The salience, or strength, of each F0 candidate is calculated as a weighted sum of the amplitudes of its harmonic partials in a spectrally whitened signal frame. The calculations are similar to those of Klapuri (2006) and are briefly explained here.

Spectral whitening, or flattening, is first applied to suppress timbral information and thereby make the estimation more robust against variation in the sound sources. Given one frame of the input signal x(n), the discrete Fourier transform X(k) is calculated after Hamming-windowing and zero-padding the frame to twice its length. Then, a band-pass filterbank is simulated in the frequency domain. Center frequencies of the subbands are distributed uniformly on the critical-band scale, f_c = 229 (10^{(0.5c+1)/21.4} − 1), and each subband c = 1, ..., 60 has a triangular power response extending from f_{c−2} to f_{c+2}. The power σ_c^2 of the signal within each subband c is calculated by applying the triangular response in the frequency domain and adding the resulting power spectrum values within the band. Then, bandwise compression coefficients γ_c = σ_c^{ν−1} are calculated, where ν = 0.33 is a parameter determining the amount of spectral whitening applied. The coefficients γ_c are linearly interpolated between the center frequencies f_c to obtain compression coefficients γ(k) for all frequency bins k. Finally, a whitened magnitude spectrum |Y(k)| is obtained as |Y(k)| = γ(k)|X(k)|.

The salience s(τ) of a fundamental period candidate τ is then calculated as

s(\tau) = \sum_{i=1}^{I} g(\tau, i) \max_{k \in \kappa_{\tau,i}} |Y(k)|    (1)

where the set κ_{τ,i} defines a range of frequency bins in the vicinity of the partial number i of the F0 candidate f_s/τ, and I = 20. More precisely,

\kappa_{\tau,i} = \{ \langle iK/(\tau + \Delta\tau/2) \rangle, \ldots, \langle iK/(\tau - \Delta\tau/2) \rangle \}    (2)

where ⟨·⟩ denotes rounding to the nearest integer, K is the length of the Fourier transform, and Δτ denotes the spacing between successive period candidates τ. We use Δτ = 0.5, that is, a spacing corresponding to half the sampling interval. The partial weighting function g(τ, i) in Equation 1 is of the form

g(\tau, i) = \frac{f_s/\tau + \varepsilon_1}{i f_s/\tau + \varepsilon_2}    (3)

where ε_1 = 52 Hz and ε_2 = 320 Hz. Note that f_s/τ is the fundamental frequency corresponding to τ, and Equation 3 reduces to 1/i if the moderation terms ε_1 and ε_2 are omitted. For details, see Klapuri (2006).

The salience function in Equation 1 is calculated for F0 values between 35 Hz and 1.1 kHz in overlapping 92.9-msec frames, with a 23.2-msec interval between successive frames. Based on this, the differential salience of a particular period candidate τ is defined as Δs_t(τ) = s_t(τ) − s_{t−1}(τ). For convenience, the fundamental frequency candidates are expressed as unrounded MIDI note numbers by

F(\tau) = 69 + 12 \log_2\!\big( (f_s/\tau) / 440 \big)    (4)

Figure 3. Salience function s_t(τ) in the top panel and the differential salience function Δs_t(τ) in the bottom panel.

Figure 3 shows the salience and the differential salience features extracted from the signal shown in Figure 1. Salience values indicate the melody notes quite clearly. Differential salience shows potential note onsets and vibrato, which is commonly applied in singing.
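To make the pitch-salience computation concrete, the following minimal sketch evaluates Equations 1–3 for a single frame. It assumes that a whitened magnitude spectrum of the zero-padded frame (from a real FFT) is already available; the function and parameter names are illustrative and not taken from the authors' implementation.

```python
# Sketch of Equations 1-3 for one analysis frame, under the assumption that
# Y_mag holds |Y(k)| for a real FFT of the zero-padded (even-length) frame.
import numpy as np

def salience(Y_mag, tau, fs=44100.0, num_partials=20,
             delta_tau=0.5, eps1=52.0, eps2=320.0):
    """Salience s(tau) of fundamental period candidate tau (in samples)."""
    K = 2 * (len(Y_mag) - 1)          # length of the zero-padded DFT (rfft assumed)
    f0 = fs / tau                     # F0 corresponding to the period candidate
    s = 0.0
    for i in range(1, num_partials + 1):
        # Frequency-bin range around partial i (Equation 2)
        k_lo = int(round(i * K / (tau + delta_tau / 2)))
        k_hi = int(round(i * K / (tau - delta_tau / 2)))
        if k_lo >= len(Y_mag):
            break                     # remaining partials lie above the Nyquist bin
        k_hi = min(k_hi, len(Y_mag) - 1)
        # Partial weight g(tau, i) (Equation 3)
        g = (f0 + eps1) / (i * f0 + eps2)
        s += g * Y_mag[k_lo:k_hi + 1].max()
    return s

# Example period grid: F0 candidates between 35 Hz and 1.1 kHz, spaced 0.5 samples
taus = np.arange(44100.0 / 1100, 44100.0 / 35, 0.5)
```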

Accent Signal

Accent signal a_t measures the amount of incoming spectral energy in time frame t and is useful for detecting note beginnings. Calculation of the accent feature has been explained in detail in Klapuri, Eronen, and Astola (2006). Briefly, a "perceptual spectrum" is first calculated in analysis frame t by measuring log-power levels within critical bands. Then, the perceptual spectrum in frame t − 1 is element-wise subtracted from that in frame t, and the resulting positive level differences are added across bands. As a result, we have a_t, which is a perceptually motivated measure of the amount of incoming spectral energy in frame t. The frame rate while calculating the accent signal is four times higher than that of the salience estimation; therefore, we down-sample the accent signal by selecting the maximum value in four-frame blocks to match the frame rate of the pitch-salience function.
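A short sketch of this computation is given below, assuming band-wise log-power values (the "perceptual spectrum") are already available as a (num_frames, num_bands) array; all names are illustrative.

```python
# Accent feature sketch: positive band-wise log-power increases summed over
# bands, then down-sampled 4:1 by block maximum to the salience frame rate.
import numpy as np

def accent_signal(log_power, hop_ratio=4):
    """Accent a_t computed from a (num_frames, num_bands) log-power matrix."""
    diff = np.diff(log_power, axis=0)             # frame-to-frame change per band
    accent = np.maximum(diff, 0.0).sum(axis=1)    # keep only increases, sum over bands
    accent = np.concatenate(([0.0], accent))      # align length with the input frames
    # Down-sample to the pitch-salience frame rate (maximum over blocks of 4 frames)
    n = (len(accent) // hop_ratio) * hop_ratio
    return accent[:n].reshape(-1, hop_ratio).max(axis=1)
```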

Acoustic Modeling of Notes

The basic idea of acoustic modeling is that all possible note pitches n at all times are classified either as target notes (melody or bass), notes from other instruments, or as noise or silence. For this purpose, three acoustic models are trained: (1) a target-notes model, (2) an other-notes model, and (3) a noise-or-silence model. Use of the target-notes and the other-notes models attempts to improve discriminability of the target sound source from other instruments. The target and the other notes are modeled with three-state left-to-right HMMs, where the consecutive states can be interpreted to represent the attack, sustain, and release segments of the notes. The noise-or-silence model is a three-state fully connected HMM, because no similar temporal order can be assumed for these segments. At all times, each candidate note n is in one of the internal states of one of the models. The notes are identified by their discrete MIDI note pitch n ∈ N, where N is the set of possible pitches. The set consists of MIDI notes {44, ..., 84}, i.e., A-flat2 to C6, for the melody, and of notes {26, ..., 59}, i.e., D1–B3, for the bass line.

The acoustic models and their parameters do not depend on note pitch n. This has the advantage that only one set of HMM parameters must be trained for each of the three models. However, the observation vectors o_{n,t} are specific to each note. These are obtained from the extracted features by selecting the maximum-salience fundamental period τ_{n,t} in a ±1 semitone range around the note n in frame t:

\tau_{n,t} = \arg\max_{i} s_t(i), \quad i \in \{\tau \mid |F(\tau) - n| \le 1\}    (5)

The observation vector o_{n,t} is then defined as

o_{n,t} = [\Delta F,\; s_t(\tau_{n,t}),\; \Delta s_t(\tau_{n,t}),\; a_t]^T    (6)

where ΔF = F(τ_{n,t}) − n is the distance between the pitch of the detected salience peak and the nominal pitch of the note, and s_t(τ_{n,t}) and Δs_t(τ_{n,t}) are the salience and the differential salience of τ_{n,t}. The accent value a_t does not depend on pitch but is common to all notes. Notice that the pitch-dependent features are obtained directly from the salience function in Equation 5. This is advantageous compared to the often-used approach of deciding a set of possible pitches within a frame already at the feature-extraction stage; here, the final decision on the transcribed pitches is postponed to the probabilistic models, and the feature extraction becomes considerably simpler and computationally more efficient.
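The per-note feature selection of Equations 5–6 can be sketched as follows, assuming the salience values, their frame differences, the accent value, and the period grid expressed as MIDI numbers F(τ) are available for one frame; the function name and argument layout are hypothetical.

```python
# Sketch of the observation vector o_{n,t} of Equations 5-6 for one note and frame.
import numpy as np

def note_observation(note, F_tau, s_t, ds_t, a_t):
    """o_{n,t} = [pitch deviation, salience, differential salience, accent]."""
    # Candidate periods within +/- 1 semitone of the nominal note pitch (Equation 5)
    idx = np.flatnonzero(np.abs(F_tau - note) <= 1.0)
    best = idx[np.argmax(s_t[idx])]          # maximum-salience period tau_{n,t}
    dF = F_tau[best] - note                  # deviation from the nominal note pitch
    return np.array([dF, s_t[best], ds_t[best], a_t])
```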

Training the Acoustic Models

The acoustic models are trained using the Real World Computing (RWC) database, which includes realistic musical recordings with manual annotations of the melody, the bass line, and the other instruments (Goto et al. 2002, 2003). For the time region of each annotated note n, the observation vectors given by Equation 6 constitute a training sequence for either the target-notes or the other-notes model. The HMM parameters are then obtained using the Baum-Welch algorithm (Rabiner 1989), where observation-likelihood distributions are modeled with Gaussian Mixture Models (GMMs). Prior to the training, the features are normalized to have zero mean and unit variance over the training set.

The noise-or-silence model requires training sequences as well. Therefore, we generate random "note events" at positions of the time-pitch plane where there are no sounding notes in the reference annotation. Durations of the generated notes are sampled from a normal distribution with mean and variance calculated from the annotated notes in the song.
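A training step of this kind can be prototyped with off-the-shelf HMM tools. The rough sketch below assumes the hmmlearn package and per-note observation sequences shaped as in Equation 6; it is an illustration only, the authors' implementation is not based on this library, and a strict left-to-right topology would require additional constraints on the transition matrix.

```python
# Rough sketch: feature normalization and Baum-Welch training of a three-state
# GMM-HMM with diagonal covariances (assumes the hmmlearn package).
import numpy as np
from hmmlearn.hmm import GMMHMM

def normalize_features(sequences):
    """Zero-mean, unit-variance normalization over the whole training set."""
    stacked = np.vstack(sequences)
    mean, std = stacked.mean(axis=0), stacked.std(axis=0) + 1e-9
    return [(seq - mean) / std for seq in sequences], mean, std

def train_note_model(sequences, n_states=3, n_mix=3):
    """Fit one note model on a list of (T_i, 4) observation arrays."""
    X = np.vstack(sequences)
    lengths = [len(seq) for seq in sequences]
    model = GMMHMM(n_components=n_states, n_mix=n_mix,
                   covariance_type="diag", n_iter=25)
    # Note: enforcing the left-to-right state order would need a constrained
    # transition matrix; this sketch trains an unconstrained topology.
    model.fit(X, lengths)
    return model
```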


After the training, we have the following parameters for each HMM: observation likelihood distributions P(o_{n,t} | q_t = j) for states j = 1, 2, 3; state-transition probabilities P(q_t = j | q_{t−1} = i), i.e., the probability that state i is followed by state j within a random state sequence q_1 q_2 ... q_T ≡ q_{1:T}; and initial state probabilities P(q_1 = j). The observation likelihood distributions are modeled with diagonal covariance matrices and three or four GMM components for the melody and bass model sets, respectively. In the following, we distinguish the states of the target-notes model, the other-notes model, and the noise-or-silence model by the notation j_tgt, j_oth, and j_ns, respectively.

Using the Acoustic Models

As already mentioned, the sequence of target notes is extracted by classifying all possible note pitches either as a target note, a note from another instrument, or as noise or silence. However, by definition there can be only up to one target note sounding at each time. This constraint is implemented as follows.

First, we find a state sequence b*_{n,1:T} which best explains the feature vectors o_{n,1:T} of note n using only the other-notes and the noise-or-silence models:

b^*_{n,1:T} = \arg\max_{q_{1:T}} \left[ P(q_1)\, P(o_{n,1} \mid q_1) \prod_{t=2}^{T} P(q_t \mid q_{t-1})\, P(o_{n,t} \mid q_t) \right]    (7)

Here, q_t ∈ {i_oth, j_ns}, where i, j ∈ {1, 2, 3}, meaning that the sequence b*_{n,1:T} may visit the states of both the other-notes and the noise-or-silence model. Because we have a combination of the two models, we must allow switching between them by defining nonzero probabilities for P(q_t = j_oth | q_{t−1} = i_ns), where j = 1 and i ∈ {1, 2, 3}, as well as for P(q_t = j_ns | q_{t−1} = i_oth), where i = 3 and j ∈ {1, 2, 3}. The state sequence is found with the Viterbi algorithm (Forney 1973). This procedure is repeated for all considered note pitches n ∈ N.

Now we have the most likely explanation for all notes at all times without using the target-notes model. We define the "background" probability for note n in frame t as

\beta_{n,t} = P(o_{n,t} \mid b^*_{n,t})\, P(b^*_{n,t} \mid b^*_{n,t-1})    (8)

Figure 4. Background probability evaluation for one note. See text for details.

The use of the other-notes and the noise-or-silence models is illustrated in Figure 4, where the continuous arrowed line shows the state sequence b*_{n,1:7} solved from Equation 7. The example shows the evaluation of the background probability β_{n,3} in frame t = 3 using Equation 8.
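Both the background decoding of Equation 7 and the later note and chord searches reduce to a standard Viterbi recursion. A minimal log-domain version is sketched below; the array shapes and names are illustrative rather than the authors' implementation.

```python
# Generic log-domain Viterbi decoder: log initial probabilities, a log
# transition matrix, and per-frame log observation likelihoods per state.
import numpy as np

def viterbi(log_init, log_trans, log_obs):
    """Return the most likely state sequence for log_obs of shape (T, S)."""
    T, S = log_obs.shape
    delta = log_init + log_obs[0]               # best log score ending in each state
    psi = np.zeros((T, S), dtype=int)           # back-pointers
    for t in range(1, T):
        scores = delta[:, None] + log_trans     # rows: previous state, cols: next state
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_obs[t]
    # Backtrack from the best final state
    path = np.empty(T, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T - 1, 0, -1):
        path[t - 1] = psi[t, path[t]]
    return path
```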

Next, we try to find a path through time for which the target-notes model gives a high likelihood and which is not well explained by the other models. The state space of this "target path" is larger than in Equation 7, because the path can visit the internal states j_tgt of any note n. We denote the state of the target path by a variable r_t, which determines the note n and the internal state j_tgt at time t. More exactly, r_t = [r_t(1), r_t(2)]^T, where r_t(1) ∈ {N ∪ R} and r_t(2) ∈ {1, 2, 3}. Here R denotes a rest state, where no target notes are sounding. In this case, the value of r_t(2) is meaningless.

To find the best overall explanation for all notes at all times, let us first assume that the notes are independent of each other. In this case, the overall probability given by the "background" model in frame t is Z_t = ∏_n β_{n,t}. When the target path state r_t visits a note n at time t, the overall probability becomes

Z'_t = P(r_t \mid r_{t-1})\, P(o_{n,t} \mid r_t(2))\, \frac{Z_t}{\beta_{n,t}}    (9)

where Z_t / β_{n,t} ≡ ∏_{i \ne n} β_{i,t}. The best explanation for all the notes is given by a target path that maximizes ∏_t Z'_t. The factor Z_t in Equation 9 is independent of the target path and can be omitted while searching for the optimal path. As a result, we can omit the independence assumption too, because it only concerns the calculation of Z_t. For rest frames, the background explanation applies to all notes, and thus Z'_t = P(r_t | r_{t−1}) Z_t. The transition probabilities P(r_t | r_{t−1}) must still remain in all frames, because they include the musicological model that controls transitions between notes and between notes and rests.

The best target path is then obtained as

r^*_{1:T} = \arg\max_{r_{1:T}} \left[ P(r_1)\, f(r_1) \prod_{t=2}^{T} P(r_t \mid r_{t-1})\, f(r_t) \right]    (10)

where

f(r_t) = \begin{cases} P(o_{n,t} \mid r_t(2)) / \beta_{n,t}, & r_t(1) = n \in N \\ 1, & r_t(1) = R \end{cases}    (11)

The transition probabilities P(r_t | r_{t−1}) are defined as

P(r_t \mid r_{t-1}) = \begin{cases} P(r_t(2) = j_{tgt} \mid r_{t-1}(2) = i_{tgt}), & \text{when inside target HMMs} \\ P(r_t(1) = u \mid r_{t-1}(1) = u'), & \text{when applying the musicological model} \end{cases}    (12)

Above, the acoustic model is applied when staying within the same target-note HMM, that is, when r_t(1) = r_{t−1}(1) ∈ N and r_t(2) ≥ r_{t−1}(2). The musicological model is applied when switching between notes or between notes and rests, that is, when the above condition is not fulfilled, and u, u′ ∈ {N ∪ R}. The musicological model is explained in the next section. The path given by Equation 10 is found with the Viterbi algorithm. This simultaneously produces the discrete pitch labels and the note segmentation, i.e., the note onsets and offsets. A note starts when the path enters the first state of a note model and ends when the path exits its last state. Rests are produced where r*_t(1) = R.

Figure 5. Using the target-note models to produce a transcription. See text for details.

Figure 5 illustrates this process with two target-note HMMs for MIDI notes 60 and 61 and the rest state. The arrowed line shows the best target path r*_{1:21}, where the gray arrows show the transitions inside target-note models and the black arrows show the transitions where the musicological model is applied. The black boxes show the four transcribed notes (i.e., MIDI notes 60, 61, 61, 60) with their onsets and offsets. The figure also shows example calculations for Equations 11–12.

Musicological Model for Note Transitions

Musical context plays an important role in both melodies and bass lines. Consider a target note sequence in the key of C major in which the note E is possibly followed by F-sharp or G, for which the acoustic model gives approximately equal likelihoods. Because F-sharp follows E less often in this key, the note G is musically preferred. We use this feature in the melody and bass-line transcription by employing key-dependent note-transition probabilities to solve ambiguous situations.

Key Estimation

A musical key is roughly defined by the basic scale used in a piece. A major key and a minor key are here referred to as a relative-key pair if they consist of scales with the same notes (e.g., C major and A minor). The relative-key pair also specifies the key signature of a musical score.

The proposed key estimator finds the relative-key pair using the pitch-salience function s_t(τ). First, the function values are mapped into a pitch-class representation, where notes in different octaves are considered equivalent. The set of notes that belong to a pitch class m ∈ {0, 1, ..., 11} is defined by H_m = {n ∈ N | mod(n, 12) = m}, where mod(x, y) ≡ x − y⌊x/y⌋. For key estimation, the note range N consists of the MIDI notes {48, ..., 83}. The salience function is mapped to the pitch-class profile PCP_t(m) by

PCP_t(m) = \frac{1}{W_t} \sum_{n \in H_m} \max_{i} s_t(i), \quad i \in \{\tau \mid |F(\tau) - n| \le 0.5\}    (13)

where W_t is a normalization factor so that Σ_{m=0}^{11} PCP_t(m) = 1. The pitch-class profile in each frame is then compared to key profiles K_maj(d) and K_min(d), which give the occurrence frequencies of note degrees d ∈ {0, 1, ..., 11} in major and minor keys, respectively. For example, d = 0 is the tonic note of the key and d = 7 is the perfect fifth.

Let k ∈ {0, 1, ..., 11} denote the relative-key pairs [C major/A minor], [D-flat major/B-flat minor], and so forth, until the pair [B major/G-sharp minor], respectively. Given the pitch-class profile PCP_t(m) and the key profiles, the most probable relative-key pair k*_t at time t is calculated by

L_t(k) = \sum_{d=0}^{11} \big[ \log[K_{maj}(d)]\, PCP_t(\mathrm{mod}(d + k, 12)) + \log[K_{min}(d)]\, PCP_t(\mathrm{mod}(d + k + 9, 12)) \big]    (14)

k^*_t = \arg\max_{k} \sum_{j=1}^{t} L_j(k)    (15)

When calculating note degrees for minor keys, the term +9 in Equation 14 shifts the key index k to the relative minor key. As key profiles K_maj(d) and K_min(d), we use those reported by Krumhansl (1990, p. 67).
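The key estimation of Equations 13–15 can be written compactly as matrix operations. The sketch below assumes a per-frame pitch-class profile matrix and 12-element Krumhansl major and minor profiles; the names are illustrative.

```python
# Sketch of Equations 14-15: per-frame key-pair likelihoods and the cumulative
# decision. PCP has shape (T, 12); K_maj and K_min are 12-element profiles.
import numpy as np

def relative_key(PCP, K_maj, K_min):
    """Return per-frame estimates k*_t of the relative-key pair (0..11)."""
    log_maj, log_min = np.log(K_maj), np.log(K_min)
    T = PCP.shape[0]
    L = np.zeros((T, 12))
    for k in range(12):
        # Align degree d = 0 with key index k (and k + 9 for the relative minor)
        maj_idx = (np.arange(12) + k) % 12
        min_idx = (np.arange(12) + k + 9) % 12
        L[:, k] = PCP[:, maj_idx] @ log_maj + PCP[:, min_idx] @ log_min
    # Equation 15: accumulate the evidence over time, then pick the best pair
    return np.argmax(np.cumsum(L, axis=0), axis=1)
```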

Note Bigrams

Note bigrams determine the probability of making a transition between notes or between notes and the rest state. This aims at favoring note sequences characteristic of the target material. In Equation 12, these transitions were denoted by P(r_t(1) = u | r_{t−1}(1) = u′), where u, u′ ∈ {N ∪ R}. Because the target-note models are left-to-right HMMs, it is possible to enter or exit a target-note HMM only via its first or last state, respectively. In addition, our system includes transitions from target-note models to the rest state and vice versa, as well as the probability to stay within the rest state.

Given the relative-key pair k, a transition from note u′ to note u is specified by the degree of the first note, mod(u′ − k, 12), and the interval between the two notes, u − u′. For note-to-rest and rest-to-note transitions, only the note degree matters. The probability to stay within the rest state does not depend on the key but is a free parameter that can be used to control the amount of rests in the transcription. In summary, the transitions given by the musicological model are defined by

P(r_t(1) = u \mid r_{t-1}(1) = u') = \begin{cases} P(r_t = [u, 1]^T \mid r_{t-1} = [u', 3]^T, k), & \text{note to note} \\ P(r_t = [u, 1]^T \mid r_{t-1}(1) = R, k), & \text{rest to note} \\ P(r_t(1) = R \mid r_{t-1} = [u', 3]^T, k), & \text{note to rest} \\ P(r_t(1) = R \mid r_{t-1}(1) = R), & \text{rest to rest} \end{cases}    (16)

For melody notes, the note-bigram probabilities were estimated by counting the note transitions in the Essen Associative Code and Folksong Database (EsAC), which contains about 20,000 folksongs with key information (see www.esac-data.org). Krumhansl's profiles were used to give probabilities for rest-to-note and note-to-rest transitions. This is based on the assumption that it is more probable to start or end a phrase with a note that is harmonically "stable" in the estimated key.

For bass lines, the note-bigram estimation is done from a collection of over 1,300 MIDI files including bass lines. For each file, we estimate the relative-key pair using the Krumhansl profiles and all MIDI notes from the file. A rest is added between two consecutive bass notes if they are separated by more than 200 msec. Then, we count the key-dependent note transitions similarly to the melody notes. The bigrams for both melody and bass lines are smoothed with the Witten-Bell discounting algorithm (Witten and Bell 1991).

The note bigrams do not take into account the absolute pitch of the notes but only the interval between them. However, it is advantageous to prefer target notes in the typical pitch range of melodies or bass lines: for example, preferring low-pitched notes in bass-line transcription. We implement this by weighting the note-bigram probabilities with values from a normal distribution over notes n. For melodies, we found a distribution mean at MIDI note value 62.6 and a variance of 6.8 semitones to perform best in our simulations. For bass lines, we found mean 33.2 and variance 3.7.
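As a small illustration of this pitch-range weighting, the sketch below evaluates unnormalized normal-distribution weights over MIDI note numbers using the reported melody parameters; the function name is hypothetical.

```python
# Pitch-range weighting sketch: normal-distribution weights over MIDI notes
# (melody: mean 62.6, variance 6.8; bass: mean 33.2, variance 3.7).
import numpy as np

def range_weights(notes, mean=62.6, var=6.8):
    """Unnormalized Gaussian weights for the given MIDI note numbers."""
    notes = np.asarray(notes, dtype=float)
    return np.exp(-0.5 * (notes - mean) ** 2 / var)

# Example: weights for the melody note range 44..84
melody_weights = range_weights(np.arange(44, 85))
```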

Chord Transcription

The proposed chord-transcription method attempts to label each frame with a certain major or minor chord. The reason for introducing the method here is to demonstrate the usefulness of the pitch-salience function in the chord-transcription task. In addition, chords complement the melody and the bass-line transcriptions to produce a useful representation of the song under analysis.

Harte et al. (2005) proposed a text-based chord syntax and publicly provided the chord annotations for the first eight albums by the Beatles. This database is used here for training and evaluation of the proposed method. The database also includes complex chords, such as major and dominant sevenths and extended chords (ninths, sixths), which are treated here as major or minor triads. Augmented or diminished chords are left out from the training. Database details are given in the next section.

The chord-transcription method is the following. First, we train pitch-class profiles for major and minor triads. The salience function is mapped into two pitch-class profiles by using Equation 13: PCP^lo_t(m) for low-register MIDI notes n ∈ {26, ..., 49}, and PCP^hi_t(m) for high-register MIDI notes n ∈ {50, ..., 73}. For each major chord in the training data, PCP^lo_t(m) and PCP^hi_t(m) are calculated, the profiles are rotated so that the pitch class m = 0 corresponds to the chord root note, and the profiles are then obtained by averaging over time. A similar procedure is repeated for the minor chords. Figure 6 shows the estimated low-register profiles C^lo_maj, C^lo_min and high-register profiles C^hi_maj, C^hi_min for both major and minor chords. The low-register profile captures the bass notes contributing to the chord root, whereas the high-register profile has clearer peaks also for the major or minor third and the fifth.

Figure 6. Major and minor chord profiles for low and high registers.

Next, we define a chord HMM with 24 states (twelve states for both major and minor triads). In frame t, a state in this HMM is denoted by c_t = [c_t(1), c_t(2)]^T, where c_t(1) ∈ {0, 1, ..., 11} denotes the chord roots C, D-flat, and so forth, and c_t(2) ∈ {maj, min} denotes the chord type. We want to find a path c_{1:T} through the chord HMM. For the twelve major chord states c_t, c_t(2) = maj, the chord observation log-likelihoods L(c_t) are calculated in a manner analogous to the key estimation:

L(c_t) = \sum_{d=0}^{11} \Big[ \log\big[C^{lo}_{maj}(d)\big]\, PCP^{lo}_t(\mathrm{mod}(d + c_t(1), 12)) + \log\big[C^{hi}_{maj}(d)\big]\, PCP^{hi}_t(\mathrm{mod}(d + c_t(1), 12)) \Big]    (17)

The calculation is exactly similar for the minor triads L(c_t), c_t(2) = min, except that the major profiles are replaced with the minor profiles.

We also train a chord-transition bigram P(c_t | c_{t−1}). The transitions are independent of the key, so only the chord type and the distance between the chord roots, mod(c_t(1) − c_{t−1}(1), 12), matter. For example, a transition from A minor, c_{t−1} = [9, min]^T, to F major, c_t = [5, maj]^T, is counted as a transition from a minor chord to a major chord with distance mod(5 − 9, 12) = 8. Figure 7 illustrates the estimated chord-transition probabilities. The probability to stay in the chord itself is a free parameter that controls the amount of chord changes.

Figure 7. The estimated chord-transition probabilities from major and minor chords. The text boxes show examples for transitions from a C-major chord and an A-minor chord.
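The key-independent transition counting can be sketched as follows, with chords represented as hypothetical (root, is_minor) pairs; the example reproduces the A minor to F major case from the text.

```python
# Sketch of chord-transition counting: each transition is characterized by the
# two chord types and the root distance mod(c_t(1) - c_{t-1}(1), 12).
from collections import Counter

def count_chord_transitions(chord_sequence):
    """Count (type_prev, type_next, root_distance) transition events."""
    counts = Counter()
    for (root_a, min_a), (root_b, min_b) in zip(chord_sequence, chord_sequence[1:]):
        distance = (root_b - root_a) % 12
        counts[(min_a, min_b, distance)] += 1
    return counts

# Example from the text: A minor (9, True) to F major (5, False) has distance 8
print(count_chord_transitions([(9, True), (5, False)]))
```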

Now, we have defined the log-likelihoods L(c_t) for all chords in each frame t and the chord-transition bigram P(c_t | c_{t−1}). The chord transcription is then obtained by finding an optimal path c*_{1:T} through the chord states:

c^*_{1:T} = \arg\max_{c_{1:T}} \left[ L(c_1) + \sum_{t=2}^{T} \big( L(c_t) + \log P(c_t \mid c_{t-1}) \big) \right]    (18)

which is again found using the Viterbi algorithm. The initial probabilities for chord states are uniform and are therefore omitted. The method does not detect silent segments but produces a chord label in each frame. Figure 8 shows the chord transcription for With a Little Help From My Friends by the Beatles.

Figure 8. Chord transcription of With a Little Help From My Friends by the Beatles. Likelihoods from Equation 17 are shown for all major and minor triads. The reference chords are indicated by the white line and the transcription c*_{1:T} by the black line.
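Putting Equations 17 and 18 together, a rough sketch of the chord decoding is shown below. It assumes low- and high-register pitch-class profile matrices of shape (T, 12), trained 12-element chord profiles, and a 24 × 24 log-transition matrix, and it reuses the viterbi() helper from the earlier sketch; all names are illustrative.

```python
# Chord decoding sketch: observation log-likelihoods (Equation 17) for 24
# states (12 major roots followed by 12 minor roots) plus Viterbi search.
import numpy as np

def chord_loglik(PCP_lo, PCP_hi, C_lo, C_hi):
    """Log-likelihoods L(c_t) for the 12 roots of one chord type (Eq. 17)."""
    T = PCP_lo.shape[0]
    L = np.zeros((T, 12))
    for root in range(12):
        idx = (np.arange(12) + root) % 12        # rotate degrees to the chord root
        L[:, root] = PCP_lo[:, idx] @ np.log(C_lo) + PCP_hi[:, idx] @ np.log(C_hi)
    return L

def transcribe_chords(PCP_lo, PCP_hi, profiles, log_trans):
    """Viterbi decoding over 24 chord states using the earlier viterbi() sketch."""
    L = np.hstack([chord_loglik(PCP_lo, PCP_hi, profiles["lo_maj"], profiles["hi_maj"]),
                   chord_loglik(PCP_lo, PCP_hi, profiles["lo_min"], profiles["hi_min"])])
    log_init = np.zeros(24)                      # uniform initial probabilities (constant omitted)
    return viterbi(log_init, log_trans, L)       # chord state path c*_{1:T} (Eq. 18)
```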

Results

The proposed melody-, bass-, and chord-transcription methods are quantitatively evaluated using the databases described subsequently. For all evaluations, a two-fold cross validation is used. With a C++ implementation running on a 3.2 GHz Pentium 4 processor, the entire method takes about 19 sec to process 180 sec of stereo audio. The feature extraction takes about 12 sec, and the melody and bass-line transcription take about 3 sec each. The key estimation and chord transcription take less than 0.1 sec. In addition, the method allows causal implementation to process streaming audio in a manner described in Ryynänen and Klapuri (2007).

Audio Databases

For the development and evaluation of the melody and bass-line transcription, we use the Real World Computing "popular music" and "genre" databases (Goto et al. 2002, 2003). The databases include a MIDI file for each song, which contains a manual annotation of the melody, the bass, and other instrument notes, collectively referred to as the reference notes. MIDI notes for drums, percussive instruments, and sound effects are excluded from the reference notes. Some songs in the databases were not used due to an unreliable synchronization of the MIDI annotation and the audio recording. Also, some songs do not include the melody or the bass line. Consequently, we used 130 full acoustic recordings for melody transcription: 92 pop songs (the RWC "popular" database) and 38 songs with varying styles (the RWC "genre" database). For bass-line transcription, we used 84 songs from RWC "popular" and 43 songs from RWC "genre" (127 recordings altogether). This gives approximately 8.7 and 8.5 hours of music for the evaluation of melody and bass-line transcription, respectively. There are reference notes outside the reasonable transcription note range for both melody (<0.1 percent of the melody reference notes) and bass lines (1.8 percent of the notes). These notes are not used in training but are counted as transcription errors in testing.

As already mentioned, the chord-transcription method is evaluated using the first eight Beatles albums with the chord annotations provided by Harte and colleagues. The albums include 110 songs with approximately 4.6 hours of music. The reference major and minor chords cover approximately 75 percent and 20 percent of the audio, respectively. Chords that are not recognized by our method and the no-chord segments cover about 3 percent and 1 percent of the audio, respectively.

Melody and Bass-Line Transcription Results

The performance of melody and bass-line transcription is evaluated by counting correctly and incorrectly transcribed notes. We use the recall rate R and the precision rate P defined by

R = \frac{\#(\text{correctly transcribed notes})}{\#(\text{reference notes})}, \qquad P = \frac{\#(\text{correctly transcribed notes})}{\#(\text{transcribed notes})}    (19)

A reference note is correctly transcribed by a note in the transcription if their MIDI note numbers are equal, the absolute difference between their onset times is less than 150 msec, and the transcribed note is not already associated with another reference note. We use the F-measure F = 2RP/(R + P) to give an overall measure of performance. The temporal overlap ratio of a correctly transcribed note with the associated reference note is measured by ρ = (min{E} − max{B}) / (max{E} − min{B}), where the sets B and E contain the beginning and ending times of the two notes, respectively. The mean overlap ratio ρ is obtained by averaging ρ values over the correctly transcribed notes. The recall rate, the precision rate, the F-measure, and the mean overlap ratio are calculated separately for each recording, and then the average over all the recordings is reported.
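The note-level scoring just described can be sketched as a simple greedy matching; notes are represented here as hypothetical (pitch, onset, offset) tuples, and the function illustrates the criteria rather than an official evaluation script.

```python
# Sketch of the note-level metrics: recall, precision, F-measure, and the mean
# overlap ratio, with each transcribed note matched to at most one reference note.
def score_notes(reference, transcribed, onset_tol=0.150):
    matched, overlaps = set(), []
    for ref_pitch, ref_on, ref_off in reference:
        for j, (tr_pitch, tr_on, tr_off) in enumerate(transcribed):
            if j in matched:
                continue
            if tr_pitch == ref_pitch and abs(tr_on - ref_on) < onset_tol:
                matched.add(j)
                overlaps.append((min(ref_off, tr_off) - max(ref_on, tr_on)) /
                                (max(ref_off, tr_off) - min(ref_on, tr_on)))
                break
    R = len(matched) / len(reference) if reference else 0.0
    P = len(matched) / len(transcribed) if transcribed else 0.0
    F = 2 * R * P / (R + P) if R + P else 0.0
    rho = sum(overlaps) / len(overlaps) if overlaps else 0.0
    return R, P, F, rho
```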

Table 1. Melody and Bass-Line Transcription Results (%)

Melody          R     P     F     ρ      Bass line       R     P     F     ρ
RWC "popular"   60.5  49.4  53.8  61.1   RWC "popular"   57.7  57.5  56.3  61.9
RWC "genre"     41.7  50.3  42.9  55.8   RWC "genre"     35.3  57.5  39.3  57.6
Total           55.0  49.6  50.6  59.6   Total           50.1  57.5  50.6  60.4

Table 1 shows the melody and bass-line transcription results. Both the melody and the bass-line transcription achieve over 50 percent average F-measure. The performance on pop songs is clearly better than on the songs from various genres. This was expected, since the melody and bass lines are usually more prominent in pop music than in other genres, such as heavy rock or dance music. In addition, the RWC "popular" database includes only vocal melodies, whereas the RWC "genre" database also includes melodies performed with other instruments. The musicological model plays an important role in the method: the total F-measures drop to 40 percent for both melody and bass-line transcription if the note bigrams are replaced with uniform distributions.

For comparison, Ellis and Poliner kindly provided the pitch tracks produced by their melody-transcription method (Ellis and Poliner 2006) for the recordings in the RWC databases. Their method decides the note pitch for the melody in each frame whenever the frame is judged to be voiced. Briefly, the pitch classification in each frame is conducted using a one-versus-all, linear-kernel support vector machine (SVM). The voiced-unvoiced decision is based on energy thresholding, and the pitch track is smoothed using HMM post-processing. Because the Ellis-Poliner method does not produce segmented note events but rather a pitch track, we compare the methods using the frame-level evaluation metrics adopted for melody-extraction evaluation in the Music Information Retrieval Evaluation eXchange (Poliner et al. 2007). For this, the RWC reference MIDI note values are sampled every 10 msec to obtain a frame-level reference. A similar conversion is made for the melody notes produced by our method.

Table 2. Frame-Level Melody Transcription Results (%)

Method             Data            Overall acc.  Raw pitch  Vc det.  Vc FA  Vc d′
Proposed           RWC "popular"   63.0          63.6       87.0     39.4   1.40
                   RWC "genre"     62.4          40.5       58.8     17.9   1.14
                   Total           62.8          56.9       78.7     33.1   1.23
Ellis and Poliner  RWC "popular"   42.9          50.4       93.4     66.3   1.09
                   RWC "genre"     38.5          42.8       91.6     64.7   1.00
                   Total           41.6          48.2       92.9     65.9   1.06

Table 2 shows the results for the proposed method and for the Ellis-Poliner method on the RWC databases. The overall accuracy denotes the proportion of frames with either a correct pitch label or a correct unvoiced decision, where the pitch label is correct if the absolute difference between the transcription and the reference is less than half a semitone. The raw pitch accuracy denotes the proportion of correct pitch labels to voiced frames in the reference. Voicing detection (Vc det.) measures the proportion of correct voicing in the transcription to voiced frames in the reference. Voicing false alarm (Vc FA) measures the proportion of frames that are labeled as voiced in the transcription but are unvoiced in the reference. The voicing d′ (Vc d′) combines the voicing detection and voicing false-alarm rates to describe the system's ability to discriminate the voiced and unvoiced frames. High voicing detection and low voicing false alarm give good discriminability, with a high voicing d′ value (Duda, Hart, and Stork 2001). According to the overall accuracy and the voicing d′, the proposed method outperforms the Ellis-Poliner method in this evaluation. Their method classifies most of the frames as voiced, resulting in a high voicing-detection rate but also high voicing false-alarm rates.
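For reference, the frame-level measures can be sketched as follows, assuming 10-msec pitch tracks where zero marks an unvoiced frame and other values are (possibly fractional) MIDI pitches; this is an approximation of the MIREX definitions, not the official scoring code.

```python
# Sketch of the frame-level melody metrics: overall accuracy, raw pitch
# accuracy, voicing detection, and voicing false alarm.
import numpy as np

def frame_metrics(ref, est):
    ref, est = np.asarray(ref, float), np.asarray(est, float)
    ref_voiced, est_voiced = ref > 0, est > 0
    pitch_ok = np.abs(est - ref) < 0.5                    # within half a semitone
    overall = np.mean(np.where(ref_voiced, est_voiced & pitch_ok, ~est_voiced))
    raw_pitch = np.mean(pitch_ok[ref_voiced])             # correct pitch on reference voiced frames
    vc_det = np.mean(est_voiced[ref_voiced])              # voicing detection rate
    vc_fa = np.mean(est_voiced[~ref_voiced])              # voicing false-alarm rate
    return overall, raw_pitch, vc_det, vc_fa
```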

Chord Transcription Results

The chord-transcription method is evaluated by comparing the transcribed chords with the reference chords frame by frame. For method comparison, Bello and Pickens kindly provided outputs of their chord-transcription method (Bello and Pickens 2005) on the Beatles data. As a framework, their method also uses a chord HMM where states correspond to major and minor triads. Our method resembles theirs in this sense, but the methods differ in details.

Table 3. Chord Transcription Results, Frame Classification Proportions Averaged Over the Songs (%)

                                                            Proposed  Bello-Pickens
Exactly correct                                             70.6      69.6
Root correct (major and minor are confused)                 5.7       3.4
Relative chord (e.g., C was labeled as Am or vice versa)    2.7       5.8
Dominant (e.g., C/Cm was labeled as G/Gm)                   4.5       4.0
Subdominant (e.g., C/Cm was labeled as F/Fm)                3.7       2.2
III or VI (e.g., C was labeled with Em or vice versa)       1.6       2.4
Other errors                                                6.4       7.7
Chord not recognized by the methods (e.g., C dim)           4.8       4.8

The results with error analysis are given in Table 3. On average over the songs, both methods decide the exactly correct chord in about 70 percent of the frames, that is, the transcribed chord root and type are the same as in the reference. The proposed method makes more major/minor chord confusions, whereas the Bello-Pickens method more often labels a reference chord with its relative chord. Because neither of the methods detects augmented or diminished chords, or no-chord segments, those frames are always treated as transcription errors (4.8 percent of all the frames).

Conclusions

We proposed a method for the automatic transcription of the melody, bass line, and chords in polyphonic music. The method consists of frame-wise pitch-salience estimation, acoustic modeling, and musicological modeling. The transcription accuracy was evaluated using several hours of realistic music, and direct comparisons to state-of-the-art methods were provided. Using quite straightforward time quantization, common musical notation such as that shown in Figure 1 can be produced. In addition, the statistical models can be easily retrained for different target materials. Future work includes using timbre and metrical analysis to improve melody and bass-line transcription, and a more detailed chord-analysis method.

The transcription results are already useful for several applications. The proposed method has been integrated into a music-transcription tool with a graphical user interface and MIDI editing capabilities, and the melody transcription has been successfully applied in a query-by-humming system (Ryynänen and Klapuri 2008), for example. Only a few years ago, the authors considered the automatic transcription of commercial music recordings to be a very difficult problem. However, the rapid development of transcription methods and the latest results have demonstrated that feasible solutions are possible. We believe that the proposed method with melody, bass-line, and chord transcription takes a step toward more complete and accurate music transcription. Audio examples of the transcriptions are available online at www.cs.tut.fi/sgn/arg/matti/demos/mbctrans.

Acknowledgments

This work was supported by the Academy of Finland, project no. 5213462 (Finnish Centre of Excellence Program 2006–2011). The authors would like to thank Dan Ellis, Graham Poliner, and Juan Pablo Bello for providing outputs of their transcription methods for comparison.

References

Bello, J. P., and J. Pickens. 2005. "A Robust Mid-level Representation for Harmonic Content in Music Signals." Proceedings of the 6th International Conference on Music Information Retrieval. London: Queen Mary, University of London, pp. 304–311.

Dressler, K. 2006. "An Auditory Streaming Approach on Melody Extraction." MIREX Audio Melody Extraction Contest Abstracts, MIREX06 extended abstract. London: Queen Mary, University of London. Available online at www.music-ir.org/evaluation/MIREX/2006_abstracts/AME_dressler.pdf.

Duda, R. O., P. E. Hart, and D. G. Stork. 2001. Pattern Classification, 2nd ed. New York: Wiley.

Ellis, D., and G. Poliner. 2006. "Classification-Based Melody Transcription." Machine Learning Journal 65(2–3):439–456.

Forney, G. D. 1973. "The Viterbi Algorithm." Proceedings of the IEEE 61(3):268–278.

Goto, M. 2000. "A Robust Predominant-F0 Estimation Method for Real-Time Detection of Melody and Bass Lines in CD Recordings." Proceedings of the 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. New York: Institute of Electrical and Electronics Engineers, pp. 757–760.

Goto, M. 2004. "A Real-Time Music-Scene-Description System: Predominant-F0 Estimation for Detecting Melody and Bass Lines in Real-World Audio Signals." Speech Communication 43(4):311–329.

Goto, M., et al. 2002. "RWC Music Database: Popular, Classical, and Jazz Music Databases." Proceedings of the 3rd International Conference on Music Information Retrieval. Paris: IRCAM, pp. 287–288.

Goto, M., et al. 2003. "RWC Music Database: Music Genre Database and Musical Instrument Sound Database." Proceedings of the 4th International Conference on Music Information Retrieval. Baltimore, Maryland: Johns Hopkins University. Available online at ismir2003.ismir.net/papers/Goto1.PDF.

Hainsworth, S. 2006. "Beat Tracking and Musical Metre Analysis." In A. Klapuri and M. Davy, eds. Signal Processing Methods for Music Transcription. Berlin: Springer, pp. 101–129.

Hainsworth, S. W., and M. D. Macleod. 2001. "Automatic Bass Line Transcription from Polyphonic Music." Proceedings of the 2001 International Computer Music Conference. San Francisco, California: International Computer Music Association, pp. 431–434.

Harte, C., et al. 2005. "Symbolic Representation of Musical Chords: A Proposed Syntax for Text Annotations." Proceedings of the 6th International Conference on Music Information Retrieval. London: Queen Mary, University of London, pp. 66–71.

Harte, C. A., and M. B. Sandler. 2005. "Automatic Chord Identification Using a Quantised Chromagram." Proceedings of the 118th Audio Engineering Society Convention. New York: Audio Engineering Society, paper number 6412.

Klapuri, A. 2006. "Multiple Fundamental Frequency Estimation by Summing Harmonic Amplitudes." Proceedings of the 7th International Conference on Music Information Retrieval. Victoria: University of Victoria, pp. 216–221.

Klapuri, A., and M. Davy, eds. 2006. Signal Processing Methods for Music Transcription. Berlin: Springer.

Klapuri, A. P., A. J. Eronen, and J. T. Astola. 2006. "Analysis of the Meter of Acoustic Musical Signals." IEEE Transactions on Audio, Speech, and Language Processing 14(1):342–355.

Krumhansl, C. 1990. Cognitive Foundations of Musical Pitch. Oxford: Oxford University Press.

Moorer, J. A. 1977. "On the Transcription of Musical Sound by Computer." Computer Music Journal 1(4):32–38.

Paiva, R. P., T. Mendes, and A. Cardoso. 2005. "On the Detection of Melody Notes in Polyphonic Audio." Proceedings of the 6th International Conference on Music Information Retrieval. London: Queen Mary, University of London, pp. 175–182.

Poliner, G., et al. 2007. "Melody Transcription from Music Audio: Approaches and Evaluation." IEEE Transactions on Audio, Speech, and Language Processing 15(4):1247–1256.

Rabiner, L., and B.-H. Juang. 1993. Fundamentals of Speech Recognition. Upper Saddle River, New Jersey: Prentice Hall.

Rabiner, L. R. 1989. "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition." Proceedings of the IEEE 77(2):257–289.

Ryynänen, M., and A. Klapuri. 2006. "Transcription of the Singing Melody in Polyphonic Music." Proceedings of the 7th International Conference on Music Information Retrieval. Victoria: University of Victoria, pp. 222–227.

Ryynänen, M., and A. Klapuri. 2007. "Automatic Bass Line Transcription from Streaming Polyphonic Audio." Proceedings of the 2007 IEEE International Conference on Acoustics, Speech, and Signal Processing. New York: Institute of Electrical and Electronics Engineers, pp. 1437–1440.

Ryynänen, M., and A. Klapuri. 2008. "Query by Humming of MIDI and Audio Using Locality Sensitive Hashing." Proceedings of the 2008 IEEE International Conference on Acoustics, Speech, and Signal Processing. Available online at www.cs.tut.fi/~mryynane/ryynanen_icassp08.pdf.

Sheh, A., and D. P. Ellis. 2003. "Chord Segmentation and Recognition Using EM-Trained Hidden Markov Models." Proceedings of the 4th International Conference on Music Information Retrieval. Baltimore, Maryland: Johns Hopkins University. Available online at ismir2003.ismir.net/papers/Sheh.PDF.

Whiteley, N., A. T. Cemgil, and S. Godsill. 2006. "Bayesian Modeling of Temporal Structure in Musical Audio." Proceedings of the 7th International Conference on Music Information Retrieval. Victoria: University of Victoria, pp. 29–34.

Witten, I. H., and T. C. Bell. 1991. "The Zero-Frequency Problem: Estimating the Probabilities of Novel Events in Adaptive Text Compression." IEEE Transactions on Information Theory 37(4):1085–1094.
