
1/22
Efficient Learning of Harmonic Priors for Pitch Detection in Polyphonic Music
Pablo A. Alvarado, Dan Stowell
Published in: arXiv.org
Centre for Digital Music, Queen Mary University of London
2/22
Content
• Introduction
• Pitch detection model
• Results: transcription of polyphonic music
• Conclusions
3/22
Introduction
Automatic music transcription (AMT): AMT consists of updating our beliefs about the symbolic description (piano-roll) of a piece of music after observing a corresponding audio recording [1, 2]:
$$p(\text{piano-roll} \mid \text{signal}) = \frac{p(\text{signal} \mid \text{piano-roll})\, p(\text{piano-roll})}{p(\text{signal})}.$$
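As a toy illustration of this update (hypothetical numbers, not from the paper), a minimal sketch with two candidate piano-rolls:

```python
# Hypothetical two-hypothesis Bayes update; the probabilities are made up.
prior = {"C4": 0.5, "C4+E4": 0.5}          # p(piano-roll)
likelihood = {"C4": 0.2, "C4+E4": 0.6}     # p(signal | piano-roll)

evidence = sum(prior[h] * likelihood[h] for h in prior)            # p(signal)
posterior = {h: prior[h] * likelihood[h] / evidence for h in prior}
print(posterior)  # {'C4': 0.25, 'C4+E4': 0.75}
```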
Content
• Introduction
• Pitch detection model
• Results: transcription of polyphonic music
• Conclusions
5/22
Pitch detection model
• We address the transcription problem from a time-domain source separation perspective, as in [3].
• Given an acoustic signal $\mathcal{D} = \{t_n, y_n\}_{n=1}^{N}$, we use the regression model
$$y(t) = \sum_{m=1}^{M} f_m(t) + \epsilon = \sum_{m=1}^{M} \phi_m(t)\, w_m(t) + \epsilon,$$
where the sets of functions $\{\phi_m(t)\}_{m=1}^{M}$ and $\{w_m(t)\}_{m=1}^{M}$ are called activation processes and quasi-periodic component processes, respectively.
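A minimal generative sketch of this regression model (illustrative stand-ins only: the paper places GP priors on the latent functions, here fixed sinusoids play their role):

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 16000)                 # 1 s at 16 kHz
f0 = {"C4": 261.63, "E4": 329.63}                # fundamental frequencies

y = np.zeros_like(t)
for m, f in enumerate(f0.values()):
    w_m = np.sin(2 * np.pi * f * t)              # quasi-periodic component w_m(t)
    g_m = 5.0 * np.sin(2 * np.pi * 1.0 * t - m)  # slow latent function g_m(t)
    phi_m = 1.0 / (1.0 + np.exp(-g_m))           # activation phi_m(t) in (0, 1)
    y += phi_m * w_m                             # sum of modulated components
y += 0.01 * rng.standard_normal(t.shape)         # additive noise eps
```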
6/22
Gaussian processes
$$g(t) \sim \mathcal{GP}\left(\mu(t),\, k(t, t')\right).$$
• The covariance function, or kernel, $k(t, t')$ defines the properties of the random function $g(t)$, such as smoothness and frequency content.
• Any finite number of function evaluations $\mathbf{g} = [g(t_1), \ldots, g(t_N)]^{\top}$ follows a multivariate normal distribution.
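A minimal sketch of this property: draws of $\mathbf{g}$ at $N$ inputs come from a multivariate normal whose covariance matrix is built from the kernel (a squared-exponential stand-in here, not the paper's Matérn mixture):

```python
import numpy as np

def k(t, tp, lengthscale=0.1, variance=1.0):
    # Squared-exponential kernel; controls the smoothness of g(t).
    return variance * np.exp(-0.5 * (t[:, None] - tp[None, :]) ** 2 / lengthscale ** 2)

t = np.linspace(0.0, 1.0, 200)
K = k(t, t) + 1e-8 * np.eye(len(t))      # jitter for numerical stability
L = np.linalg.cholesky(K)                # K = L L^T
g = L @ np.random.default_rng(1).standard_normal(len(t))  # one draw g ~ N(0, K)
```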
7/22
Activation process
Sigmoid model:
$$y = \sum_{m=1}^{M} \phi_m(t)\, w_m(t) + \epsilon,$$
where activations follow $\phi_m(t) = \sigma(g_m(t))$, and components follow $w_m(t) \sim \mathcal{GP}(0, k_m(t, t'))$.
8/22
Activation process
Softmax model: to introduce dependences between all activations we use the softmax function [4, 5],
$$\phi_m(t) = \frac{\exp(g_m(t))}{\sum_{\forall j} \exp(g_j(t))},$$
and components follow $w_m(t) \sim \mathcal{GP}(0, k_m(t, t'))$.
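The practical difference between the two activation maps, in a small sketch (toy latent values, not inferred ones):

```python
import numpy as np

g = np.array([2.0, -1.0, 0.5])            # toy latent values g_1, g_2, g_3 at one time t

sigmoid = 1.0 / (1.0 + np.exp(-g))        # independent: each value lies in (0, 1)
softmax = np.exp(g) / np.exp(g).sum()     # coupled: values compete and sum to 1

print(sigmoid)  # approx [0.881 0.269 0.622]
print(softmax)  # approx [0.786 0.039 0.175]
```

Raising one $g_m$ leaves the sigmoid activations of the other pitches unchanged, but lowers their softmax activations.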
9/22
Component process
$$y = \sum_{m=1}^{M} \phi_m(t)\, w_m(t) + \epsilon.$$
We seek to make
$$\mathcal{F}\{k_m(r)\} \approx |\mathcal{F}\{y_m(t)\}|,$$
where $y_m(t)$ corresponds to the audio recording of an isolated sound event with pitch $m$.
[Figure: waveform of an isolated sound event (amplitude vs. time, s) and its normalised magnitude spectrum (frequency, Hz).]
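A sketch of this matching criterion under simplifying assumptions (a synthetic three-partial note, and a Lorentzian mixture standing in for the Matérn mixture spectral density):

```python
import numpy as np

fs = 16000
t = np.arange(0, 2.0, 1.0 / fs)
y_m = sum(a * np.sin(2 * np.pi * h * 261.63 * t)        # synthetic C4 note
          for h, a in [(1, 1.0), (2, 0.5), (3, 0.25)])  # with three partials

freqs = np.fft.rfftfreq(len(t), 1.0 / fs)
target = np.abs(np.fft.rfft(y_m))
target /= target.max()                                   # normalised |F{y_m(t)}|

def kernel_spectrum(freqs, f0, amps, width):
    # One Lorentzian per partial of f0: the spectral density of a mixture of
    # exponentially damped cosines (a stand-in for the Matern mixture kernel).
    return sum(a / (1.0 + ((freqs - h * f0) / width) ** 2)
               for h, a in enumerate(amps, start=1))

model = kernel_spectrum(freqs, 261.63, [1.0, 0.5, 0.25], width=5.0)
mse = np.mean((model / model.max() - target) ** 2)       # quantity to be reduced
```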
Standard model:
$$y = \sum_{m=1}^{M} \phi_m(t)\, w_m(t) + \epsilon.$$
LOO model:
• Activations follow $\phi_i(t) = \sigma(g_i(t))$.
• Components follow $w_j(t) \sim \mathcal{GP}(0, k_j(t, t'))$.
12/22
Variational inference
• GP posteriors are in general computationally expensive to compute.
• In this case the posterior also does not have a closed form.
• The key idea is to approximate the posterior with optimization [2].
• We first choose a family of probability distributions, then try to find the member of that family closest to the exact posterior.
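A generic one-dimensional sketch of this idea (a toy non-Gaussian target, not the paper's model): fit $q(x) = \mathcal{N}(\mu, s^2)$ by maximising a Monte Carlo estimate of the ELBO.

```python
import numpy as np
from scipy.optimize import minimize

def log_joint(x):
    # Unnormalised log posterior of a toy non-Gaussian target.
    return -0.5 * x ** 2 + np.log1p(0.5 * np.cos(3.0 * x))

eps = np.random.default_rng(0).standard_normal(512)     # fixed base samples

def neg_elbo(params):
    mu, log_s = params
    x = mu + np.exp(log_s) * eps                        # reparameterised draws from q
    entropy = log_s + 0.5 * np.log(2.0 * np.pi * np.e)  # entropy of q = N(mu, s^2)
    return -(log_joint(x).mean() + entropy)             # negative ELBO estimate

res = minimize(neg_elbo, x0=[0.0, 0.0], method="Nelder-Mead")
mu_hat, s_hat = res.x[0], np.exp(res.x[1])              # fitted variational parameters
```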
13/22
[Figure: components and activations (amplitude and activation vs. time, s).]
13/22
[Figure: posterior after 5 iterations (components and activations vs. time, s).]
[Figure: posterior after 50 iterations.]
[Figure: posterior after 500 iterations.]
Content
• Introduction
• Pitch detection model
• Results: transcription of polyphonic music
• Conclusions
15/22
Test data
• Synthetic electric guitar audio signal [3].
• Mixture sounds: C4, E4, G4, C4+E4, C4+G4, E4+G4, and C4+E4+G4.
• Duration: 14 seconds, sampled at 16 kHz.
[Figure: test audio waveform (amplitude vs. time, s) and ground-truth piano-roll (MIDI pitch vs. time, s).]
Training data
• Three isolated sound events, with pitches C4 (261.63 Hz), E4 (329.63 Hz), and G4 (392.00 Hz).
Experiments
• Detection of pitches C4 and E4 (using the standard model with sigmoid (SIG) or softmax (SOF) activations).
• Detection of all three pitches C4, E4, G4 (using SIG-LOO).
Learning hyperparameters
• Maximising the marginal likelihood (ML).
• Reducing the MSE between $\mathcal{F}\{k_m(r)\}$ and $|\mathcal{F}\{y_m(t)\}|$ (FL, proposed method).
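For reference, the ML alternative evaluates the standard GP log marginal likelihood over kernel hyperparameters; a generic sketch (textbook GP regression form, not code from the paper):

```python
import numpy as np

def log_marginal_likelihood(K, y, noise_var):
    # log p(y) for y ~ N(0, K + noise_var * I), computed via Cholesky.
    n = len(y)
    L = np.linalg.cholesky(K + noise_var * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))
            - 0.5 * n * np.log(2.0 * np.pi))
```

Each evaluation of this objective needs an $O(N^3)$ factorisation, whereas the FL objective sketched on the component-process slide works directly on the magnitude spectrum.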
17/22
Results: transcription of polyphonic music
Pitch detection using the SIG-LOO model.
[Figure: estimated piano-rolls (MIDI pitch vs. time, s) for pitch detection with the SIG-LOO model.]
Results: transcription of polyphonic music
         TM        ML        FL
SIG      89.54%    59.23%    98.68%
SOF      86.28%    55.28%    97.15%
SIG-LOO  76.21%    84.86%    98.19%

Table: F-measure for the SIG and SOF models detecting two pitches (first two rows), and for the SIG-LOO model detecting three pitches (bottom row), using three different learning approaches: TM, ML, and FL.
19/22
Content
• Introduction
• Pitch detection model
• Results: transcription of polyphonic music
• Conclusions
20/22
Conclusions
• We proposed a GP regression approach for pitch detection in polyphonic music.
• We introduced a Matérn mixture kernel able to reflect the complex frequency content of single-note sounds.
• The proposed approach allows us to introduce prior beliefs about smoothness, positivity constraints, and correlation between activations.
• Pitch detection results suggest that proper priors over the frequency content of the sound events to be detected are more relevant than encouraging dependency between activations.
• The linear scalability of the LOO model with respect to the number of pitches makes it appropriate for detecting more than just three pitches.
21/22
Bibliography I
[1] A. T. Cemgil, S. J. Godsill, P. H. Peeling, and N. Whiteley, "Bayesian statistical methods for audio and music processing," in The Oxford Handbook of Applied Bayesian Analysis, 2010.
[2] D. Blei, A. Kucukelbir, and J. McAuliffe, "Variational inference: a review for statisticians," arXiv preprint arXiv:1601.00670, 2016.
[3] K. Yoshii, R. Tomioka, D. Mochihashi, and M. Goto, "Beyond NMF: time-domain audio source separation without phase reconstruction," in Proc. 14th International Society for Music Information Retrieval Conference (ISMIR), 2013.
[4] K. P. Murphy, Machine Learning: A Probabilistic Perspective. The MIT Press, 2012.
[5] C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
22/22
Questions