Efficient Learning of Harmonic Priors for Pitch Detection in Polyphonic Music

Pablo A. Alvarado, Dan Stowell
Published in: arXiv.org
Centre for Digital Music, Queen Mary University of London
Content

- Introduction
- Pitch detection model
- Results: transcription of polyphonic music
- Conclusions
Introduction

Automatic music transcription (AMT) consists in updating our beliefs about the symbolic description (piano-roll) of a piece of music, after observing a corresponding audio recording [1, 2]:

  p(piano-roll | signal) = p(signal | piano-roll) p(piano-roll) / p(signal).
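Bayes' rule above can be made concrete with a toy discrete example (the states, prior, and likelihood values below are invented purely for illustration; they are not from the paper):

```python
import numpy as np

# Hypothetical piano-roll states for one time frame: which of {C4, E4} sound.
states = ["silence", "C4", "E4", "C4+E4"]
prior = np.array([0.4, 0.25, 0.25, 0.1])          # p(piano-roll), made up
likelihood = np.array([0.01, 0.30, 0.05, 0.60])   # p(signal | piano-roll), made up

# Bayes' rule: posterior = likelihood * prior / evidence
posterior = likelihood * prior
posterior /= posterior.sum()                       # dividing by p(signal)

for s, p in zip(states, posterior):
    print(f"{s:8s} {p:.3f}")
```

The normalisation step is exactly the division by p(signal); AMT replaces these toy tables with GP priors and an audio likelihood.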
Pitch detection model

- We address the transcription problem from a time-domain source separation perspective, as in [3].
- Given an acoustic signal D = {t_n, y_n}_{n=1}^{N}, we use the regression model

    y(t) = ∑_{m=1}^{M} f_m(t) + ε = ∑_{m=1}^{M} φ_m(t) w_m(t) + ε,

  where the sets of functions {φ_m(t)}_{m=1}^{M} and {w_m(t)}_{m=1}^{M} are called activation processes and quasi-periodic component processes, respectively.
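The generative model above can be sketched with synthetic data (a minimal illustration, not the paper's implementation; the sample rate, envelopes, and partial weights are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 1000)                # 1 s at 1 kHz, illustrative only

# Hypothetical quasi-periodic components w_m(t): two tones with one overtone each.
f0 = [261.63, 329.63]                          # C4 and E4 fundamentals (Hz)
w = [np.sin(2*np.pi*f*t) + 0.5*np.sin(2*np.pi*2*f*t) for f in f0]

# Hypothetical activations phi_m(t): smooth on/off envelopes in (0, 1).
phi = [1/(1 + np.exp(-50*(0.3 - t))),          # first note fades out near t=0.3
       1/(1 + np.exp(-50*(t - 0.5)))]          # second note fades in near t=0.5

eps = 0.01 * rng.standard_normal(t.size)       # additive noise term
y = sum(p*c for p, c in zip(phi, w)) + eps     # y(t) = sum_m phi_m(t) w_m(t) + eps
```

Transcription then amounts to inverting this construction: inferring the envelopes phi_m from y alone.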
Gaussian processes

  g(t) ∼ GP(μ(t), k(t, t′)).

- The covariance function, or kernel, k(t, t′) defines the properties of the random function g(t), such as smoothness and frequency content.
- Any finite number of function evaluations g = [g(t_1), ..., g(t_N)]^T follows a multivariate normal distribution.
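That last property is what makes GPs computable: drawing a GP sample at N inputs is just a multivariate normal draw. A minimal sketch (the squared-exponential kernel here is for illustration; the paper's kernels are harmonic Matérn mixtures):

```python
import numpy as np

def rbf_kernel(t, s, lengthscale=0.1, variance=1.0):
    """Squared-exponential kernel k(t, t'); the lengthscale controls smoothness."""
    d = t[:, None] - s[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale)**2)

t = np.linspace(0.0, 1.0, 200)
K = rbf_kernel(t, t)

# g = [g(t_1), ..., g(t_N)]^T is jointly Gaussian with covariance K.
rng = np.random.default_rng(0)
g = rng.multivariate_normal(mean=np.zeros(t.size),
                            cov=K + 1e-8*np.eye(t.size))  # jitter for stability
```

Changing the kernel changes what typical samples look like, which is exactly how the harmonic priors in this work are encoded.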
Activation process

Sigmoid model: activations are modelled independently as

  φ_m(t) = σ(g_m(t)),

and

  y(t) = ∑_{m=1}^{M} φ_m(t) w_m(t) + ε.
Activation process

Softmax model: to introduce dependences between all activations we use the softmax function [4, 5]

  φ_m(t) = exp(g_m(t)) / ∑_j exp(g_j(t)),

and

  y(t) = ∑_{m=1}^{M} φ_m(t) w_m(t) + ε.
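The two squashing choices differ in one key respect: the sigmoid maps each latent function independently into (0, 1), while the softmax couples all activations so they sum to one at every time point. A small sketch of both:

```python
import numpy as np

def sigmoid(g):
    """Independent activations: phi_m(t) = sigma(g_m(t)), each in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-g))

def softmax(g, axis=0):
    """Coupled activations: phi_m = exp(g_m) / sum_j exp(g_j)."""
    z = g - g.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

g = np.array([[2.0, -1.0],                    # M = 2 latent functions,
              [0.5,  3.0]])                   # evaluated at two time points
phi = softmax(g, axis=0)                      # each column sums to 1
```

With the softmax, boosting one activation necessarily suppresses the others, which is the dependence the SOF model exploits.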
Component process

We seek to make

  F{k_m(r)} ≈ |F{y_m(t)}|,

where y_m(t) corresponds to the audio recording of an isolated sound event with pitch m.
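This matching can be visualised directly. The sketch below builds a toy note, takes its magnitude spectrum, and evaluates a hypothetical Matérn-1/2 mixture spectrum, i.e. a sum of Lorentzians centred at the harmonic partials (the Lorentzian form is the standard Matérn-1/2 spectral density; the partial weights and bandwidth here are made up):

```python
import numpy as np

def matern12_mixture_spectrum(freqs, centres, weights, lam):
    """Spectral density of a Matérn-1/2 mixture: one Lorentzian per partial."""
    S = np.zeros_like(freqs)
    for f0, a in zip(centres, weights):
        S += a * lam / (lam**2 + (2*np.pi*(freqs - f0))**2)
    return S

fs = 16000
t = np.arange(0, 0.5, 1/fs)
# Toy C4 note with four harmonics and geometrically decaying amplitudes.
y = sum((0.8**k) * np.sin(2*np.pi*261.63*(k+1)*t) for k in range(4))

freqs = np.fft.rfftfreq(t.size, 1/fs)
Y = np.abs(np.fft.rfft(y))                    # target: |F{y_m(t)}|

partials = [261.63*(k+1) for k in range(4)]   # harmonic centre frequencies
S = matern12_mixture_spectrum(freqs, partials,
                              [0.8**k for k in range(4)], lam=20.0)
```

Both Y and S peak at the same partials; tuning the mixture weights and bandwidth to close the remaining gap is the core of the learning scheme described later.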
[Figure: waveform of an isolated note (time, 0-2 s) and its magnitude spectrum (frequency, 0-3500 Hz).]
Standard model:

  y(t) = ∑_{m=1}^{M} φ_m(t) w_m(t) + ε.

LOO model:

- Activations follow φ_i(t) = σ(g_i(t)).
- Components follow w_j(t) ∼ GP.

- GP posteriors are in general computationally expensive to compute.
- In this case the posterior also does not have closed form.
Variational inference

- The key idea is to approximate the posterior with optimization [2].
- We first choose a family of probability distributions, then we try to find the member of that family closest to the exact posterior.
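A bare-bones illustration of that idea, not the paper's actual inference scheme: fix a Gaussian family q = N(m, s²) and search for the member minimising KL(q || p) to a non-Gaussian target (the target density and the grid search are invented for illustration; real variational inference uses gradient-based optimisation of an evidence lower bound):

```python
import numpy as np

x = np.linspace(-5, 5, 2001)
dx = x[1] - x[0]

# Hypothetical unnormalised, non-Gaussian "posterior": two unequal bumps.
p_unnorm = np.exp(-0.5*(x - 1)**2) + 0.5*np.exp(-0.5*((x + 2)/0.5)**2)
p = p_unnorm / (p_unnorm.sum() * dx)           # normalise on the grid

def kl(q, p):
    """Grid approximation of KL(q || p)."""
    mask = q > 1e-12
    return np.sum(q[mask] * np.log(q[mask] / p[mask])) * dx

best = None
for m in np.linspace(-3, 3, 61):               # candidate means
    for s in np.linspace(0.3, 3, 28):          # candidate standard deviations
        q = np.exp(-0.5*((x - m)/s)**2) / (s*np.sqrt(2*np.pi))
        d = kl(q, p)
        if best is None or d < best[0]:
            best = (d, m, s)
```

Because KL(q || p) penalises q for placing mass where p has little, the winning Gaussian locks onto the dominant mode rather than averaging over both, a well-known property of this divergence direction.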
[Figure: components and activations.]
[Figure: posterior after 5 iterations.]
[Figure: posterior after 50 iterations.]
[Figure: posterior after 500 iterations.]
Test data

- Synthetic electric guitar audio signal [3].
- Mixture sounds: C4, E4, G4, C4+E4, C4+G4, E4+G4, and C4+E4+G4.
- Duration 14 seconds, sampled at 16 kHz.
[Figure: test-signal waveform and ground-truth piano-roll (MIDI pitch vs. time, 0-14 s).]
Training data

- Three isolated sound events with pitches C4 (261.63 Hz), E4 (329.63 Hz), and G4 (392.00 Hz), respectively.
Experiments

- Detection of pitches C4, E4 (using the standard model with sigmoid (SIG) or softmax (SOF) activations).
- Detection of all three pitches C4, E4, G4 (using SIG-LOO).
Learning hyperparameters

- Maximising the marginal likelihood (ML).
- Reducing the MSE between F{k_m(r)} and |F{y_m(t)}| (FL, proposed method).
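The FL idea can be sketched as a direct spectral fit: choose kernel hyperparameters so the kernel's spectrum matches the magnitude spectrum of an isolated training note. The toy signal, Lorentzian kernel spectrum, candidate grids, and parameter names below are all illustrative, not the paper's implementation:

```python
import numpy as np

fs = 16000
t = np.arange(0, 0.25, 1/fs)
# Toy "isolated C4 recording": fundamental plus one weaker octave partial.
y = np.sin(2*np.pi*261.63*t) + 0.4*np.sin(2*np.pi*523.26*t)

freqs = np.fft.rfftfreq(t.size, 1/fs)
target = np.abs(np.fft.rfft(y))
target /= target.max()                         # normalise |F{y_m(t)}|

def kernel_spectrum(freqs, lam, w1, w2):
    """Lorentzian pair at the first two partials (Matérn-1/2 mixture spectrum)."""
    S = w1*lam/(lam**2 + (2*np.pi*(freqs - 261.63))**2) \
      + w2*lam/(lam**2 + (2*np.pi*(freqs - 523.26))**2)
    return S / S.max()

# Grid search over hyperparameters; a gradient optimiser would also work.
best = min(((np.mean((kernel_spectrum(freqs, lam, 1.0, w2) - target)**2), lam, w2)
            for lam in [5.0, 20.0, 80.0]
            for w2 in [0.1, 0.4, 0.8]))
mse, lam_opt, w2_opt = best
```

Unlike maximising the marginal likelihood, this objective never touches the mixture audio, which is what makes the learning step cheap.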
Results: transcription of polyphonic music

Pitch detection using the SIG-LOO model.

[Figure: four piano-roll panels (MIDI pitch vs. time, 0-14 s).]
Results: transcription of polyphonic music

            TM        ML        FL
  SIG       89.54%    59.23%    98.68%
  SOF       86.28%    55.28%    97.15%
  SIG-LOO   76.21%    84.86%    98.19%

Table: F-measure for the SIG and SOF models detecting two pitches (first two rows), and for the SIG-LOO model detecting three pitches (bottom row), using three different learning approaches: TM, ML, and FL.
Conclusions

- We proposed a GP regression approach for pitch detection in polyphonic music.
- We introduced a Matérn mixture kernel able to reflect the complex frequency content of single-note sounds.
- The proposed approach allows us to introduce prior beliefs about smoothness, positivity constraints, and correlation between activations.
- Pitch detection results suggest that proper priors over the frequency content of the sound events to be detected are more relevant than encouraging dependency between activations.
- The linear scalability of the LOO model with respect to the number of pitches makes it appropriate for detecting more than just three pitches.
Bibliography

[1] A. T. Cemgil, S. J. Godsill, P. H. Peeling, and N. Whiteley, "Bayesian statistical methods for audio and music processing," The Oxford Handbook of Applied Bayesian Analysis, 2010.
[2] D. Blei, A. Kucukelbir, and J. McAuliffe, "Variational inference: A review for statisticians," arXiv preprint arXiv:1601.00670, 2016.
[3] K. Yoshii, R. Tomioka, D. Mochihashi, and M. Goto, "Beyond NMF: Time-domain audio source separation without phase reconstruction," in 14th International Society for Music Information Retrieval Conference (ISMIR 2013), 2013.
[4] K. P. Murphy, Machine Learning: A Probabilistic Perspective. The MIT Press, 2012.
[5] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics). Secaucus, NJ, USA: Springer-Verlag New York, Inc., 2006.
Questions