GCT634: Musical Applications of Machine Learning ...mac.kaist.ac.kr/~juhan/gct634/2018/slides/10-polyphonic_transcription.pdf · J. Brahms, Clarinet Quintet in B minor, op.115. 3rd

GCT634: Musical Applications of Machine Learning
Polyphonic Music Transcription

Non-negative Matrix Factorization

Graduate School of Culture Technology, KAIST
Juhan Nam

Outline

• Introduction

• Score-Audio Alignment

• Multi-Pitch Estimation

• Non-negative Matrix Factorization (NMF)

Polyphonic Music Transcription

• Converting an acoustic musical signal into some form of music notation
  - MIDI piano roll, staff notation
  - Note information: pitch, onset, offset, loudness

[Figure: Input → Model → Output]

Related Tasks

• Multi-pitch estimation
  - Single source: piano, guitar
  - Multiple sources: quartet (woodwind, string)

• Predominant F0 estimation
  - Melody extraction, singing melody

• Drum transcription
  - Kick, snare, hi-hat

Core problem: multi-pitch detection

How difficult is it?

• Let’s listen to a piece and try to transcribe (hum) the different tracks
  - J. Brahms, Clarinet Quintet in B minor, op.115, 3rd movement

Two Directions

• Performance transcription
  - Detecting exact timing and dynamics of notes (micro-timing with 10 ms resolution or so)
  - Frame-level: onset, offset, intensity
  - Piano-roll notation is usually used (performance score)

• Score transcription
  - Transform performance into staff notation
  - Note-level: tempo, beat, downbeat
  - Rhythmic transcription (tempo, beat, downbeat) → temporal quantization
  - Expression detection (pedal, articulation), often phrase-level
  - Instrument identification
  - Very challenging

Score and Performance

[Figure panels: MIDI (score), Valentina Lisitsa, Vladimir Horowitz]

Where Are The Differences?

• Tempo
  - Note-level (note onset/offset timings), phrase-level, song-level

• Dynamics
  - Note-level (note velocity), phrase-level, song-level

• Different interpretations of musical expressions in the score
  - Temporal: ritardando, rubato
  - Dynamics: piano, forte, crescendo, …
  - Playing techniques or articulation: legato, staccato
  - Mood and emotion: dolce, grazioso

Score-to-Audio Alignment

• Temporal alignment between score and audio from a piece of music
  - Audio-to-audio and MIDI-to-MIDI (either one is a performance) are also possible

• Why do we synchronize them?
  - Automatic page turning
  - Performance analysis
  - Score following
  - Auto-accompaniment

[Muller]

Algorithm Overview

• Choose feature representations to compare
  - Often, MIDI is converted to audio so that both are aligned in the same feature space

• Compute a similarity matrix between the two feature sequences
  - All possible combinations of local feature pairs

• Find the path that gives the best alignment on the similarity matrix
  - Dynamic Time Warping (DTW)

[Figure: Feature Seq. #1 and Feature Seq. #2 → compute the local similarity → Similarity Matrix → find the best path (Dynamic Programming)]

Feature Representations

• Audio feature representations
  - Frequent choice for piano music is chroma
  - CENS: Normalized Chroma Features (Muller, 2005)

[Figure panels: MIDI, Lisitsa]

Similarity Matrix

• Similarity between every pair of frame-level features
  - Euclidean or cosine distance
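As a minimal sketch of the pairwise comparison, the snippet below computes a cosine-distance matrix between two feature sequences. The function name `cosine_distance_matrix` and the random 12-dimensional "chroma" frames are illustrative assumptions, not part of the lecture material.

```python
import numpy as np

def cosine_distance_matrix(X, Y, eps=1e-12):
    """Pairwise cosine distance between frames of X (N x d) and Y (M x d)."""
    # Normalize every frame to unit length, then 1 - dot product = cosine distance.
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + eps)
    Yn = Y / (np.linalg.norm(Y, axis=1, keepdims=True) + eps)
    return 1.0 - Xn @ Yn.T

# Hypothetical data: random frames stand in for real chroma features.
rng = np.random.default_rng(0)
score_chroma = rng.random((5, 12))
# The "performance" plays the same content at half tempo (each frame repeated).
audio_chroma = np.repeat(score_chroma, 2, axis=0)

C = cosine_distance_matrix(score_chroma, audio_chroma)  # shape (5, 10)
```

Low values of `C` trace a staircase from one corner to the other, which is the valley that the alignment path should follow.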

Finding the Optimal Path

• There are so many possible paths from one corner to another

[Figure: similarity matrix; axes: Schumann−Traumerei−Lisitsa and Schumann−Traumerei−MIDI (frame index)]

3D Surface Plot of Similarity Matrix

• Finding the optimal path is analogous to finding the hiking trail that takes the least effort.

Dynamic Time Warping

• Finding an (N, M)-warping path of length L
  - P = (p_1, p_2, p_3, ..., p_L) where p_i = (n_i, m_i)

• Three conditions
  - Boundary condition: p_1 = (1,1), p_L = (N,M)
  - Monotonicity condition: n_1 <= n_2 <= … <= n_L and m_1 <= m_2 <= … <= m_L
  - Step size condition: move only upward, rightward, or diagonally (upper-right)

[Muller]

Dynamic Time Warping : Bad Examples

[Muller]

Dynamic Programming for DTW

• Algorithm
  - Initialization:
      D(n,1) = sum(C(1:n,1)), n = 1…N
      D(1,m) = sum(C(1,1:m)), m = 1…M
  - Recurrence relation: for each m = 2…M and each n = 2…N
      D(n,m) = C(n,m) + min{ D(n-1,m), D(n,m-1), D(n-1,m-1) }
  - Termination: D(N,M) is the DTW distance (the total cost of the optimal path)
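The dynamic program can be sketched in a few lines of NumPy. This is an illustrative implementation under the three step directions listed earlier, using 0-based indices instead of the 1-based indices on the slide; the function name `dtw` and the toy sequences are assumptions.

```python
import numpy as np

def dtw(C):
    """Accumulated cost matrix D and optimal warping path for a cost matrix C (N x M)."""
    N, M = C.shape
    D = np.zeros((N, M))
    # Initialization: first column and first row are cumulative sums.
    D[:, 0] = np.cumsum(C[:, 0])
    D[0, :] = np.cumsum(C[0, :])
    # Recurrence: best of the three allowed predecessors.
    for n in range(1, N):
        for m in range(1, M):
            D[n, m] = C[n, m] + min(D[n - 1, m], D[n, m - 1], D[n - 1, m - 1])
    # Back-tracking from the end point to the start point.
    n, m = N - 1, M - 1
    path = [(n, m)]
    while (n, m) != (0, 0):
        if n == 0:
            m -= 1
        elif m == 0:
            n -= 1
        else:
            candidates = [(D[n - 1, m - 1], n - 1, m - 1),
                          (D[n - 1, m], n - 1, m),
                          (D[n, m - 1], n, m - 1)]
            _, n, m = min(candidates)
        path.append((n, m))
    return D, path[::-1]

# Toy example: y is x with its first element held for one extra frame.
x = np.array([0.0, 1.0, 2.0])
y = np.array([0.0, 0.0, 1.0, 2.0])
C = np.abs(x[:, None] - y[None, :])
D, path = dtw(C)
# D[-1, -1] is the DTW distance; path lists the aligned index pairs.
```

Because the path is only available after the whole matrix is filled, this version is inherently offline.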

Dynamic Programming for DTW

• Toy Example

[Muller]

[Figure panels: Similarity Matrix (C), Accumulated cost (D)]

Score and Audio Alignment by DTW

[Figure panels: C(i,j), D(i,j)]

Limitations

• The optimal path is obtained only after we arrive at the destination (by back-tracking)
  - In other words, DTW works offline
  - What if the sequences are very long?
  - Online version of DTW?

• Every frame is treated as equally important
  - In general, humans are more sensitive to note onsets
  - Perceptually, not every frame is equally important

Online DTW

• Set a moving search window and calculate the cost only within the window
  - Time and space cost: quadratic → linear

• The movement is determined by the position that gives the minimum cost within the current window. If the position is ...
  - Corner: move both up and right (alternatively)
  - Upper edge: move up
  - Right edge: move right

[Excerpt from: Proc. of the 8th Int. Conference on Digital Audio Effects (DAFx’05), Madrid, Spain, September 20-22, 2005]

Figure 2: An example of the on-line time warping algorithm with search window c = 4, showing the order of evaluation (cells numbered 1 to 21) for a particular sequence of row and column increments. The axes represent the variables t and j (see Figure 1) respectively. All calculated cells are framed in bold, and the optimal path is coloured grey.

(step 12). Otherwise the minimum path cost for each cell in the current row and column is found. If this occurs in the current position (t, j), then both the row and column counts are incremented (e.g., steps 20–21); if it occurs elsewhere in row j, then the row count is incremented (e.g., step 10), otherwise the column count t is incremented (e.g., step 19). This enables dynamic tracking of the minimum cost path using a small fixed width band around a varying “diagonal”.

Since the on-line time warping algorithm cannot look into the future, its alignment path must be calculated in the forward direction. In the algorithm above, the function GetInc calculates the current optimal path as ending at the point (x, y), which we call the current alignment point. Now, if the kth alignment point is (x_k, y_k), there is no way of knowing if this point will lie on the optimal path for k' > k. Further, there is no guarantee of continuity between the paths of length k − 1 and k, nor in the sequence of alignment points (x_1, y_1), (x_2, y_2), ..., (x_k, y_k).

Two approaches can be taken to address this problem. First, if the application allows a certain amount of latency, then the choice of alignment points can be based on a limited view into the future. That is, for path length k + δ, we output the point (x'_k, y'_k), the kth point on the optimal path to (x_{k+δ}, y_{k+δ}), which might be different to the point (x_k, y_k) calculated for path length k. For increasing values of δ, the path becomes increasingly smooth and closer to the global optimum computed by the reverse path algorithm of DTW. The second approach applies smoothing directly to the sequence of alignment points. This requires no future information, but it still builds an effective latency into the system. (If the smoothing function is interpreted as a filter, the latency is equal to its group delay.) In the system described in section 4, neither approach was deemed necessary, since if the forward path estimation is correct, no retrospective adjustment of the path is necessary, and the path consisting of the current alignment points is continuous.

3.1. Efficiency and Correctness

For each new row or column, the on-line time warping algorithm calculates up to c cells and makes less than 2c + MaxRunCount comparisons. We are specifically interested in the behaviour with respect to the arrival of a new element u_t. As long as the slope of the sequence of increments is bounded (i.e. by MaxRunCount), then the number of calculations to be performed for each time t is bounded by a constant.

The correctness of the algorithm (in terms of finding the globally minimal path) cannot be guaranteed without calculating the complete distance matrix. Thus, any path constraint immediately denies this sense of optimality, but as stated previously, minimum cost paths with large singularities are usually undesired artifacts of an imperfect cost function. For each incoming data point u_t, the minimum cost path calculated at time t is the same as that calculated by DTW, assuming the same path constraints. The advantage of the on-line algorithm is that the centre of the search band is adaptively adjusted to follow the best match, which allows a smaller search band than the standard bands around a fixed diagonal.

4. TRACKING OF MUSICAL EXPRESSION

In music performance, high level information such as structure and emotion is communicated by the performer through a range of different parameters, such as tempo, dynamics, articulation and vibrato. These parameters vary within a musical piece, between musical pieces and between performers. An important step towards modelling of this phenomenon is the measurement of the expression parameters in human musical performance, which is a far from trivial task [9, 10]. Since we do not anticipate that the great musicians would perform with sensors attached to their fingers or instruments (!), we wish to extract this information directly from the audio signal.

State of the art audio analysis algorithms are unable to reliably extract precise performance information, so a hand-correction step is often employed to complement the automatic steps. This step is labour intensive, error-prone, and only suitable for off-line processing, so we propose using automatic alignment of different performances of the same piece of music as a key step in extracting performance parameters. In the off-line case, automatic alignment enables comparative studies of musical interpretation directly from audio recordings [11]. If one performance is already matched to the score, it can then be used as a reference piece for the extraction of absolute measurements from other performances. Further, one could synthesise a performance directly from the score, and avoid the initial manual matching of score and performance entirely. Of particular interest in this paper is the on-line case, the live tracking and visualisation of expressive parameters during a performance. This could be used to complement the listening experience of concert-goers, to provide feedback to teachers and students, and to implement interactive performance and automatic accompaniment systems. In this section we describe the implementation of a real-time performance alignment system using on-line time warping, concluding with an example of an application for tracking and visualising musical expression.

4.1. Cost Function

The alignment of audio files is based on a cost function which assesses the similarity of frames of audio data. We use a low level …

[Dixon, 2005]

Automatic Page Turner (JKU, Austria)

Onset-sensitive Alignment

• We are sensitive to the time alignment on note onsets.
  - The similarity matrix gives no additional weight to onsets

• DLNCO Features
  - Decaying Locally-adapted Normalized Chroma Onset
  - Capture only onset strength on chroma features
  - Normalize onset energy and note length (by an artificially-created note tail)

[Ewert, 2009]

Demo: PerformScore

• https://jdasam.github.io/PerformScore/

Multi-pitch Estimation

• Two types of polyphonic settings
  - Polyphonic instruments: piano, guitar
  - Ensemble of monophonic instruments: woodwind quintet, string quartet, chorale

• Three levels of subtasks
  - First-level: frame-wise estimation of pitches and polyphony (number of notes)
  - Second-level: tracking pitch within a note based on temporal continuity
  - Third-level: tracking notes for each sound source, usually for ensembles of monophonic instruments

Challenges

• Many sources are mixed and played simultaneously
  - They are likely to be harmonically related in music
  - Some sources can be masked by others
  - Content changes continuously through musical expression (e.g. vibrato)

• Compromises
  - Transcribe as many source sounds as possible
  - Only dominant sources: melody, bass, drum

Frame-wise Multi-pitch Estimation

• Three categories of approaches
  - Iterative F0 search: repeatedly finds the predominant F0 and removes its related sources
  - Joint source estimation: examines possible combinations of multiple sources, e.g., NMF
  - Classification-based approach: no prior knowledge of musical acoustics, relies only on supervised learning

Iterative F0 estimation

• Based on repeated cancellation of harmonic overtones of detected F0s (Klapuri, 2003)

• Procedure
  1. Set the original to the residual
  2. Detect the predominant F0: based on the harmonic sieve method
  3. Spectral smoothing on the harmonics of the detected F0
  4. Cancel the smoothed harmonics from the residual
  5. Repeat steps 2 to 4 until the residual is sufficiently flat

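[Figure: block diagram with F0 detection and cancellation of the detected sound from the mixture; the residual spectrum Y_R(k) is updated as]

$Y_R(k) \leftarrow \max(Y_R(k) - d\,Y_D(k),\ 0)$

Iterative F0 estimation: Spectral Smoothness

[Figure panels: Spectral Smoothness, Iterative Estimation; from ECE 477 - Computer Audition, Zhiyao Duan, 2014]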
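The detect-smooth-cancel loop can be illustrated with a toy sketch. It is not Klapuri's algorithm: F0 detection is replaced by a plain harmonic-sum salience over integer bin indices, smoothing is a 3-point moving average capped at the observed amplitude, and the stopping rule is a fixed number of sources. The function names and the synthetic spectrum are assumptions for illustration only.

```python
import numpy as np

def harmonic_bins(f0, n_harmonics, n_bins):
    """Bin indices of the first harmonics of f0 that fall inside the spectrum."""
    bins = f0 * np.arange(1, n_harmonics + 1)
    return bins[bins < n_bins]

def iterative_f0(Y, f0_candidates, n_harmonics=8, n_sources=2, d=1.0):
    """Toy iterative F0 search on a magnitude spectrum Y (F0s given as bin indices)."""
    residual = Y.astype(float).copy()          # step 1: residual = original
    detected = []
    for _ in range(n_sources):
        # Step 2 (simplified): salience = sum of residual at harmonic positions.
        salience = [residual[harmonic_bins(f0, n_harmonics, len(residual))].sum()
                    for f0 in f0_candidates]
        f0 = f0_candidates[int(np.argmax(salience))]
        bins = harmonic_bins(f0, n_harmonics, len(residual))
        amps = residual[bins]
        # Step 3: smooth the harmonic amplitudes, never exceeding the observed ones.
        padded = np.pad(amps, 1, mode="edge")
        smooth = np.minimum(amps, np.convolve(padded, np.ones(3) / 3, mode="valid"))
        # Step 4: Y_R(k) <- max(Y_R(k) - d * Y_D(k), 0)
        residual[bins] = np.maximum(residual[bins] - d * smooth, 0.0)
        detected.append(f0)
    return detected, residual

# Synthetic mixture of two harmonic sounds with F0s at bins 10 and 13.
Y = np.zeros(128)
for h in range(1, 9):
    Y[10 * h] += 1.0 / h
    Y[13 * h] += 0.8 / h

detected, residual = iterative_f0(Y, list(range(5, 16)))
```

The smoothing step matters when harmonics of different notes overlap: subtracting only the smooth part leaves energy behind for the sources detected later.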

Iterative F0 estimation

• Advantages
  - Deterministic: uses only signal processing, with no data-driven training
  - Can handle inharmonicity (e.g. piano) and vibrato

• Limitations
  - F0 estimation becomes unreliable as the number of iterations increases
  - Spectral smoothing is not accurate enough

Joint Source Estimation

• Based on a model of the sound mixture
  - All sources compete with each other to explain the mixture, and the subset that is most likely is selected
  - The number of sources is limited
  - Non-negative matrix factorization (NMF) has been most widely explored


• How many spectral templates can explain the source?

• We can explain the spectrogram with three spectral basis vectors (𝑊) and their corresponding activations (𝐻)

• Can we decompose 𝑉 into 𝑊 and 𝐻 automatically?

𝑊

𝐻

𝑉 ≈ 𝑊𝐻

Non-negative Matrix Factorization (NMF)

• A matrix factorization in which all elements are non-negative
  - 𝑉 (𝑀 x 𝑁 matrix): original data (e.g. spectrogram)
  - 𝑊 (𝑀 x 𝐾 matrix): 𝐾 basis vectors (e.g. dictionary)
  - 𝐻 (𝐾 x 𝑁 matrix): activation matrix (e.g. weights or gains)

• Note that this provides a compressed representation.
  - A low-rank approximation

𝑉 (𝑀 x 𝑁) ≈ 𝑊 (𝑀 x 𝐾) 𝐻 (𝐾 x 𝑁)

Algorithm for NMF

• 𝑉 is known, and 𝑊 and 𝐻 are unknown. How?

• Alternate the estimation (similar to the EM algorithm)
  - Start with random 𝑊
  - Estimate 𝐻 given 𝑊
  - Estimate a new 𝑊 given 𝐻
  - Repeat until convergence

• If the distance is Euclidean, solve the following:
  - Estimate 𝐻 given 𝑊: $H = (W^T W)^{-1} W^T V$ (least squares!)
  - Make 𝐻 non-negative: 𝐻 = max(𝐻, 0)
  - Estimate 𝑊 given 𝐻: $W = V H^T (H H^T)^{-1}$ (least squares!)
  - Make 𝑊 non-negative: 𝑊 = max(𝑊, 0)
  - Repeat until convergence

• The problems
  - Requires pseudoinverses at every iteration: expensive and prone to stability issues
  - Gaussian assumption on the approximation error
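A minimal sketch of this alternating least-squares scheme follows. It uses `numpy.linalg.lstsq` rather than explicit inverses; the function name `nmf_als`, the iteration count, and the random test data are assumptions, and the clipping step means convergence is not guaranteed.

```python
import numpy as np

def nmf_als(V, K, n_iter=100, seed=0):
    """Alternating least squares with clipping: V (M x N) ~ W (M x K) @ H (K x N)."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], K))            # start with a random W
    for _ in range(n_iter):
        # Estimate H given W (solves W @ H = V in the least-squares sense), then clip.
        H = np.maximum(np.linalg.lstsq(W, V, rcond=None)[0], 0.0)
        # Estimate W given H (solves H.T @ W.T = V.T), then clip.
        W = np.maximum(np.linalg.lstsq(H.T, V.T, rcond=None)[0].T, 0.0)
    return W, H

# Hypothetical data: a non-negative rank-2 matrix standing in for a spectrogram.
rng = np.random.default_rng(1)
V = rng.random((10, 2)) @ rng.random((2, 15))
W, H = nmf_als(V, K=2)
rel_error = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
```

Each iteration solves two least-squares problems, which is the cost the slide refers to.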

Algorithm for NMF

$\min_{W,H \ge 0} \sum_{i,j} (V - WH)_{ij}^2$, where $\hat{V} = WH$

Algorithm for NMF

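• Instead, we use a special distance
  - A variant of the Kullback-Leibler (KL) divergence:

    $\min_{W,H \ge 0} \sum_{i,j} \left( V_{ij} \log \frac{V_{ij}}{(WH)_{ij}} - V_{ij} + (WH)_{ij} \right)$

• “Multiplicative” (magic) update rule
  - Estimate 𝑊: $W_{ik} \leftarrow W_{ik} \, \frac{\sum_j H_{kj} \, V_{ij} / (WH)_{ij}}{\sum_j H_{kj}}$
  - Estimate 𝐻: $H_{kj} \leftarrow H_{kj} \, \frac{\sum_i W_{ik} \, V_{ij} / (WH)_{ij}}{\sum_i W_{ik}}$
  - Repeat until convergence

• This is much faster and requires no matrix inversion!

(Lee and Seung, 2000)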
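The multiplicative updates for the KL-type divergence can be sketched as below, in the normalized form given by Lee and Seung. The function names, the small constant `eps` that guards against division by zero, and the random data are assumptions made for illustration.

```python
import numpy as np

def kl_divergence(V, WH, eps=1e-12):
    """Generalized KL divergence D(V || WH)."""
    return np.sum(V * np.log((V + eps) / (WH + eps)) - V + WH)

def nmf_kl(V, K, n_iter=200, seed=0, eps=1e-12):
    """NMF with multiplicative updates for the KL divergence."""
    rng = np.random.default_rng(seed)
    M, N = V.shape
    W = rng.random((M, K)) + eps
    H = rng.random((K, N)) + eps
    history = []
    for _ in range(n_iter):
        # W_ik <- W_ik * sum_j H_kj V_ij / (WH)_ij / sum_j H_kj
        W *= ((V / (W @ H + eps)) @ H.T) / (H.sum(axis=1) + eps)
        # H_kj <- H_kj * sum_i W_ik V_ij / (WH)_ij / sum_i W_ik
        H *= (W.T @ (V / (W @ H + eps))) / (W.sum(axis=0)[:, None] + eps)
        history.append(kl_divergence(V, W @ H))
    return W, H, history

# Hypothetical data: a random non-negative matrix standing in for a spectrogram.
rng = np.random.default_rng(1)
V = rng.random((20, 30)) + 0.1
W, H, history = nmf_kl(V, K=3)
```

Because every factor in the update is non-negative, `W` and `H` stay non-negative without any clipping, and `history` should not increase from one iteration to the next.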

Example on faces (from Machine Learning for Signal Processing, Fall 2015)

• Both PCA and NMF describe the data to a good degree
• Eigenfaces are not interpretable though (very abstract notions)
• NMF-faces find parts that are additive (noses, eyes, etc.)
• NMF is a better way to explain structured data

Property of NMF

• The learned basis vectors (𝑊) capture parts
  - An example is explained by a combination of the parts (e.g. additive synthesis)
  - The basis vectors are more structured and interpretable


Interpretation of NMF on spectrogram

• Columns of the spectrogram are a weighted sum of basis vectors

[Figure: spectrogram column ≈ weighted sum of basis vectors]

Interpretation of NMF on spectrogram

• The whole spectrogram is approximated as a sum of matrix “layers”, each of which is explained by one spectral component.

[Figure: spectrogram = layer 1 + layer 2 + layer 3]

Source Separation by NMF

• We can separate each source

[Figure: spectrogram = source 1 + source 2 + source 3]

Resynthesized results:

Supervised Learning

• Perform NMF separately on isolated training data of each source in a mixture
  - Pre-learn individual models of each source, e.g., W1, W2 and W3
  - Combine them into a single model W = [W1 W2 W3] that explains a mixture
  - Given a mixture V, perform NMF (fix W and update H only)
  - Then, the activation H indicates the strength of F0s
  - Usually needs sparsity and temporal continuity constraints on H (Virtanen, 2007)
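The inference step (fix W, update H only) can be sketched with the KL multiplicative update for H. In this illustration the dictionaries `W1` and `W2` are hand-made harmonic templates rather than pre-learned ones, and the function name `activations_fixed_W` is an assumption.

```python
import numpy as np

def activations_fixed_W(V, W, n_iter=200, seed=0, eps=1e-12):
    """Estimate activations H for a fixed dictionary W (KL multiplicative update)."""
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], V.shape[1])) + eps
    denom = W.sum(axis=0)[:, None] + eps       # sum_i W_ik, constant because W is fixed
    for _ in range(n_iter):
        H *= (W.T @ (V / (W @ H + eps))) / denom
    return H

# Hand-made templates (assumed, not learned): harmonics at bins 2,4,6 and 3,6,9.
W1 = np.zeros((12, 1)); W1[[2, 4, 6], 0] = 1.0
W2 = np.zeros((12, 1)); W2[[3, 6, 9], 0] = 1.0
W = np.hstack([W1, W2])                         # W = [W1 W2]

# Mixture: note 1 alone, note 2 alone, then both together.
H_true = np.array([[1.0, 0.0, 1.0],
                   [0.0, 1.0, 0.5]])
V = W @ H_true
H = activations_fixed_W(V, W)
```

Each row of `H` can then be thresholded over time to decide when the corresponding note is active.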

Supervised Learning

[Figure: training data of each source → NMF → W1 and W2 (throw away H1 and H2); mixture → NMF with W=[W1 W2] → H=[H1 H2]]

Semi-supervised Learning

• Problem in supervised learning
  - It is difficult to have training data for all individual sources.
  - Unknown sources are mixed in the majority of real-time scenarios

• Semi-supervised Learning
  - Learn spectral basis vectors (i.e. dictionaries) for the available sources, say, W1
  - In the testing phase, add new spectral basis vectors W2 that explain the remaining sources in the mixture
  - Fix the trained W1 and update only W2 in the NMF iteration
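A sketch of the semi-supervised variant: the known dictionary W1 stays fixed, while the extra columns W2 and all activations H are updated with KL multiplicative updates. The function name, the template used for `W1`, and the random mixture are illustrative assumptions.

```python
import numpy as np

def semi_supervised_nmf(V, W1, K2, n_iter=200, seed=0, eps=1e-12):
    """Keep W1 fixed, learn K2 extra basis vectors W2 and all activations H."""
    rng = np.random.default_rng(seed)
    K1 = W1.shape[1]
    # W = [W1 W2], with W2 initialized with random numbers.
    W = np.hstack([W1, rng.random((V.shape[0], K2)) + eps])
    H = rng.random((K1 + K2, V.shape[1])) + eps
    for _ in range(n_iter):
        R = V / (W @ H + eps)
        # Update only the W2 columns; the W1 columns are never touched.
        W[:, K1:] *= (R @ H[K1:].T) / (H[K1:].sum(axis=1) + eps)
        R = V / (W @ H + eps)
        # Update the activations of both the known and the unknown sources.
        H *= (W.T @ R) / (W.sum(axis=0)[:, None] + eps)
    return W, H

# Assumed known source template, plus a random mixture with unknown content.
W1 = np.zeros((12, 1)); W1[[2, 4, 6], 0] = 1.0
rng = np.random.default_rng(1)
V = rng.random((12, 8)) + 0.1
W, H = semi_supervised_nmf(V, W1, K2=2)
```

After the iterations, `H[:1]` holds the activations of the known source and `W[:, 1:]` describes whatever else is present in the mixture.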

Semi-supervised Learning

[Figure: training data of the known source → NMF → W1 (throw away H1); mixture → NMF with W=[W1 W2] → H=[H1 H2]; W2 is initialized with random numbers]

Unsupervised Learning

• We have no information about the individual sources
  - Update both W and H for the mixture sound
  - Need additional constraints
  - Spectral harmonicity and smoothness on W (Vincent, 2010)
  - Very difficult!

Unsupervised Learning

[Figure panels: Supervised NMF (fixed W); Unsupervised NMF (adapted W); Unsupervised NMF + Harmonicity & Smoothness constraint on W] [Vincent, 2010]

Issues

• Number of basis vectors (K)
  - Too small
    - Reconstruction errors will increase
    - The model underfits the data
  - Too large
    - Does not learn parts (the distribution of spectral basis vectors becomes sparse)
    - The model becomes too general, so it can explain other sources well
    - Sparsity is often added to the activations in order to learn “parts”. For example,

      $\min_{W,H \ge 0} D(V \,\|\, WH) + \lambda \|H\|_1$
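One common way to handle an L1 penalty on the activations with KL multiplicative updates is to add the penalty weight to the denominator of the H update. The sketch below does this for a fixed dictionary, which sidesteps the scaling ambiguity between W and H; the weight `lam`, the function name, and the toy templates are assumptions.

```python
import numpy as np

def sparse_activations(V, W, lam, n_iter=200, seed=0, eps=1e-12):
    """H update for D(V || WH) + lam * sum(H), with W held fixed."""
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], V.shape[1])) + eps
    # The L1 penalty shows up as an extra constant in the denominator.
    denom = W.sum(axis=0)[:, None] + lam + eps
    for _ in range(n_iter):
        H *= (W.T @ (V / (W @ H + eps))) / denom
    return H

# Toy dictionary with two overlapping harmonic templates (assumed, not learned).
W = np.zeros((12, 2))
W[[2, 4, 6], 0] = 1.0
W[[3, 6, 9], 1] = 1.0
V = W @ np.array([[1.0, 0.0, 1.0],
                  [0.0, 1.0, 0.5]])

H_plain = sparse_activations(V, W, lam=0.0)
H_sparse = sparse_activations(V, W, lam=1.0)
```

A larger `lam` shrinks the activations more strongly, trading reconstruction accuracy for sparsity.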


• Advantages
  - Compositional model: applicable to any mixture
  - Models can be extended well with additional constraints: e.g. source-filter model, inharmonicity

• Limitations
  - The model can be computationally expensive: long inference time due to iteration
  - Modeled pitches are usually discrete

Classification-Based Transcription

• Train a binary classifier for each note
  - Each classifier is trained with two groups of audio features: one including the note and the other not including it
  - 88 classifiers for polyphonic piano transcription

[Figure: audio features feed one on/off classifier per note (C4 note, C#4 note, ...); in the feature space, frames including C4 are separated from frames not including C4]


• Often trained with real music data (not single notes)
  - There are abundant MIDI files for classical piano music. It is easy to get audio files from them: e.g. using software synthesizers or player pianos

[Figure panels: MIDI piano roll, audio spectrogram]


• Audio features
  - Auditory filter bank
  - Spectrogram or log-spectrogram

• Classifiers
  - Support vector machines
  - Neural networks

• Multi-label classification problem
  - Approach #1: separate binary classification for each note: select balanced sets for each note
  - Approach #2: cross-entropy between the binary label vector and the predicted output (this is more commonly used)
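Approach #2 amounts to a sum of 88 independent binary cross-entropies per frame. A minimal NumPy sketch, assuming sigmoid outputs and hypothetical variable names:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def multilabel_cross_entropy(logits, labels, eps=1e-12):
    """Mean over frames of the summed binary cross-entropy over all notes."""
    p = sigmoid(logits)
    per_note = labels * np.log(p + eps) + (1 - labels) * np.log(1 - p + eps)
    return -np.mean(np.sum(per_note, axis=1))

# Two frames, 88 piano notes; frame 0 contains note indices 39 and 43.
labels = np.zeros((2, 88))
labels[0, [39, 43]] = 1.0

uninformed = multilabel_cross_entropy(np.zeros((2, 88)), labels)   # all p = 0.5
confident = multilabel_cross_entropy(10.0 * (2 * labels - 1), labels)
```

With all-zero logits every note gets probability 0.5, so the loss is 88 log 2 per frame; logits that agree with the labels drive it towards zero.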

[Figure diagram labels: Single-Note Training, Multiple-Note Training; Input, Hidden Layers, Output; Linear SVM (Baseline), Linear SVM + Hidden Layers]

Figure 2: Network configurations for single-note and multiple-note training. Features are obtained from feed-forward transformation as indicated by the bottom-up arrows. They can be finetuned by back-propagation as indicated by the top-down arrows.

… of a normalized spectrogram). Their transcription system requires individual supervised training for each note. Thus, we refer to this as single-note training.

We constrained the SVM in our experiments to a linear kernel because Poliner and Ellis reported that high-order kernels (e.g. RBF kernel) provided only modest performance gains with significantly more computation [13] and also a linear SVM is more suitable to large-scale data. We formed the training data by selecting spectrogram frames that include the note (positive examples) and those that do not include it (negative examples). Poliner and Ellis randomly sampled 50 positive (when available) and negative examples from each piano song per note. We used their sampling paradigm for single-note training.

While their system used a normalized spectrogram, we replaced it with DBN-based feature representations on spectrogram frames. As shown in the left column of Figure 2, the previous approach directly feeds spectrogram frames into SVM, whereas our approach transforms the spectrogram frames into mid-level features via one or two layers of learned networks and then feeds them into the classifier. We also finetuned the networks with the error from the SVM.

3.2 Multiple-note Training

When we experimented with single-note training described above, we observed that the classifiers are somewhat “aggressive”, that is, they produced even more “false alarm” errors (detect inactive notes as active ones) than “miss” errors (fail to detect active notes). In particular, this significantly degraded onset accuracy. Also, it was substantially slow to finetune the DBN networks separately for each note. Thus, we suggest a way of training multiple binary classifiers at the same time. We refer to this as multiple-note training.

The idea is to sum 88 SVM objectives and train them with shared audio features and 88 binary labels (at a given time, a single audio feature has 88 corresponding binary labels), as if we train a single classifier.¹ This allows cross-validation to be jointly performed for 88 SVMs, thereby saving a significant amount of training time. On the other hand, this requires a different way of sampling examples. Since we combined all 88 notes in our experiments, all spectrogram frames except silent ones are a positive example to at least one SVM. Thus we sampled training data by selecting spectrogram frames at every K frame time. K was set to 16 as a trade-off between data reduction and performance. Note that this makes the ratio of positive and negative examples for each SVM determined by occurrences of the note in the whole training set, thereby having significantly more negative examples than positive ones for most SVMs. It turned out that this “unbalanced” data ratio makes the classifiers “less aggressive,” as a result, increasing overall performance.

We illustrate multiple-note training in the right column of Figure 2. In fact, without finetuning the DBNs, multiple-note training is equivalent to single-note training with the unbalanced data ratio. The only difference is that the single-note training does separate cross-validation for each SVM. We compared multiple-note training to the single-note training with the unbalanced data ratio, but found no noticeable difference in performance. On the other hand, when we finetune the DBNs, these two training approaches become completely different. While single-note training produces separate DBN parameters for each note, multiple-note training allows the networks to share the parameters among all notes by updating them with the errors from the combined SVMs. For example, when the multiple-note training looks at the presence of a C3 note given input features, it simultaneously checks out if other notes (e.g. C4 or C5) are played. This can be seen as an example of multi-task learning.

3.3 HMM Post-processing

The frame-level classification described above treats training examples independently without considering dependency between frames. Poliner and Ellis used HMM-based post-processing to temporally smooth the SVM prediction. They modeled each note independently with a two-state HMM. We also adopted this approach. In our implementation, however, we converted the SVM output (distance to the boundary) to a posterior probability using

$p(y_i = 1 \mid x_i) = \mathrm{sigmoid}(\alpha(\theta^T x_i))$, (3)

¹ The classifier we used is a linear SVM with a L2-regularized L2-loss [2]. We implemented the SVM in MATLAB using minFunc, which is a Matlab library found in http://www.cs.ubc.ca/~schmidtm/Software/minFunc.html. Thus, summing 88 SVM objectives was done by simply treating 88 binary labels as a vector.

Viterbi Decoding

• Temporal smoothing of predicted outputs
  - Separate HMM for each note: binary state (note on/off)
  - 88 initial state distributions (2x1) and transition probability matrices (2x2)
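A sketch of Viterbi decoding for one note with a two-state (off/on) HMM is given below. As a simplifying assumption, the per-frame classifier posteriors are used directly as observation scores; the prior, the transition matrix, and the example posteriors are made-up values rather than parameters estimated from data.

```python
import numpy as np

def viterbi_binary(p_on, prior, trans, eps=1e-12):
    """Most likely off(0)/on(1) state sequence for one note.

    p_on:  (T,) per-frame probability that the note is on
    prior: (2,) initial state distribution
    trans: (2, 2) transition probabilities, rows = from, columns = to
    """
    obs = np.log(np.stack([1.0 - p_on, p_on], axis=1) + eps)   # (T, 2)
    log_trans = np.log(trans + eps)
    T = len(p_on)
    delta = np.log(prior + eps) + obs[0]
    back = np.zeros((T, 2), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans     # scores[i, j]: from state i to state j
        back[t] = np.argmax(scores, axis=0)
        delta = scores.max(axis=0) + obs[t]
    # Back-track the best path.
    states = np.zeros(T, dtype=int)
    states[-1] = int(np.argmax(delta))
    for t in range(T - 1, 0, -1):
        states[t - 1] = back[t, states[t]]
    return states

# Made-up posteriors with a one-frame dropout (0.2) in the middle of a note.
p_on = np.array([0.1, 0.1, 0.9, 0.2, 0.9, 0.9, 0.1, 0.1])
prior = np.array([0.5, 0.5])
trans = np.array([[0.9, 0.1],
                  [0.1, 0.9]])
states = viterbi_binary(p_on, prior, trans)
```

Thresholding `p_on` at 0.5 would split the note in two, whereas the decoded sequence keeps it on through the dropout because switching states twice is more costly than one unlikely observation.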

[Figure panels: Input (Spectrogram), Hidden layer activation, SVM output, HMM output]

[Nam,2011]

References

• G. Widmer, “In search of the Horowitz Factor”, 2003
• S. Dixon, “Live Tracking Of Musical Performance Using On-line Time Warping”, 2005
• S. Ewert, “High Resolution Audio Synchronization Using Chroma Onset Features”, 2009
• A. Klapuri, “Multiple fundamental frequency estimation based on harmonicity and spectral smoothness”, 2003
• T. Virtanen, “Monaural Sound Source Separation by Nonnegative Matrix Factorization with Temporal Continuity and Sparseness Criteria”, 2007
• E. Vincent, “Adaptive Harmonic Spectral Decomposition for Multiple Pitch Estimation”, 2010
• G. Poliner, “A Discriminative Model for Polyphonic Piano Transcription”, 2007
• J. Nam, “A Classification-Based Polyphonic Piano Transcription Approach Using Learned Feature Representations”, 2011
