GCT535 Sound Technology for Multimedia: Tonal Analysis
Graduate School of Culture Technology, KAIST
Juhan Nam
Outline
§ Pitch Perception
– Perceptual Pitch Scale
– Log-Scaled Spectrum
§ Tonal Analysis
– Chroma Feature
– Key Estimation
– Chord Recognition
Frequency Scale in Spectrogram
§ Linear frequency scale
– Good for seeing the harmonic structure of a single tone
– However, it is not the most intuitive way to visualize musical signals
[Figure: linear-frequency spectrograms of a piano chromatic scale (0-4 kHz, ~50 s) and of the Beatles' "Hey Jude" (0-10 kHz, ~8 s)]
Human Pitch Perception
§ Human ears are sensitive to frequency changes on a log scale
– Pitch resolution: the just noticeable difference (JND) increases as frequency goes up
– Place theory: pitch corresponds to the resonance position along the basilar membrane in the cochlea
[Figure: response of the basilar membrane to a pair of tones, from CCRMA Music 150 slides (Thomas Rossing)]
Critical Bandwidth
§ The frequency bandwidth within which one tone interferes with the perception of another tone through auditory masking
– Roughly constant at low frequencies, but grows roughly linearly at high frequencies
From CCRMA Music 150 slides (Thomas Rossing)
[Figure: comparison of pitch scales, normalized ERB, Mel, and Bark scales vs. frequency (0-25 kHz)]
Psychoacoustical Pitch Scales
§ Mel scale
– Based on pitch ratios of tones ("mel" comes from "melody")
– m = 2595 log10(1 + f/700)
§ Bark scale
– Critical bands measured by auditory masking
– Bark = 13 arctan(0.00076 f) + 3.5 arctan((f/7500)^2)
§ Equivalent Rectangular Bandwidth (ERB) rate
– Critical bands measured with the notched-noise method
– ERBS = 21.4 log10(1 + 0.00437 f)
Using Matlab code from https://www.speech.kth.se/~giampi/auditoryscales/
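The three scale formulas above translate directly into code. A minimal sketch (function names are illustrative; the constants are the ones quoted on this slide):

```python
import numpy as np

def hz_to_mel(f):
    """Mel scale: m = 2595 log10(1 + f/700)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def hz_to_bark(f):
    """Bark scale (Zwicker-style approximation)."""
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

def hz_to_erbs(f):
    """ERB-rate scale: ERBS = 21.4 log10(1 + 0.00437 f)."""
    return 21.4 * np.log10(1.0 + 0.00437 * f)

# All three grow nearly linearly at low frequency and
# logarithmically (compressively) at high frequency.
```

Note that the Mel scale is calibrated so that 1000 Hz maps to roughly 1000 mel, which is a quick sanity check on the formula.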
Musical Pitch Scale
§ Equal temperament
– 2^(1/12) frequency ratio between two adjacent notes
– MIDI note number (m) and frequency (f) in Hz:
f = 440 · 2^((m - 69)/12),   m = 12 log2(f/440) + 69
https://newt.phys.unsw.edu.au/jw/notes.html
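The pair of conversion formulas above can be checked in a couple of lines:

```python
import numpy as np

def midi_to_hz(m):
    """A4 = MIDI 69 = 440 Hz; adjacent notes differ by a factor of 2^(1/12)."""
    return 440.0 * 2.0 ** ((np.asarray(m) - 69) / 12.0)

def hz_to_midi(f):
    """Inverse mapping: frequency in Hz back to (fractional) MIDI note number."""
    return 12.0 * np.log2(np.asarray(f) / 440.0) + 69.0
```

For example, middle C (MIDI 60) comes out near 261.6 Hz, and 880 Hz maps back to MIDI 81 (A5).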
[Figure: the piano chromatic scale as a log-frequency spectrogram (MIDI note number vs. time) and as a linear-frequency spectrogram (Hz vs. time)]
Frequency Mapping Using Spectrogram
§ Mapping the linear scale to a perceptual (log-like) scale
– Locate center frequencies according to the frequency mapping
– Linearly interpolate around each center frequency with the corresponding bandwidth skirt
[Figure: triangular weights defined by a center frequency and bandwidth map the linear-frequency spectrogram to the log-frequency spectrogram]
§ The mapping can be expressed as a matrix multiplication
– Each row of the mapping matrix contains the interpolation coefficients for one log-frequency bin
§ Limitation
– Simple, but the time and frequency resolution is still constrained by the underlying STFT
Y = M · X   (M: mapping matrix, X: linear-frequency spectrogram, Y: log-scaled spectrogram)
[Figure: the mapping matrix M multiplies the linear-frequency spectrogram (Hz vs. time) to yield the log-frequency spectrogram (MIDI note number vs. time)]
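The construction of M and the product Y = M·X can be sketched as follows. This is a minimal illustration assuming triangular interpolation skirts around log-spaced center frequencies; the function name and default parameters are not from the slides:

```python
import numpy as np

def log_mapping_matrix(sr, n_fft, n_bins=48, fmin=110.0, bins_per_octave=12):
    """Mapping matrix M of shape (n_bins, n_fft//2 + 1): each row holds
    triangular interpolation weights around one log-spaced center frequency."""
    freqs = np.arange(n_fft // 2 + 1) * sr / n_fft            # linear FFT bin freqs
    centers = fmin * 2.0 ** (np.arange(n_bins) / bins_per_octave)
    M = np.zeros((n_bins, len(freqs)))
    for k, fc in enumerate(centers):
        bw = fc * (2.0 ** (1.0 / bins_per_octave) - 1.0)      # ~1-semitone skirt
        w = 1.0 - np.abs(freqs - fc) / bw                     # triangular weights
        M[k] = np.maximum(w, 0.0)
    return M

# Y = M @ X turns a linear-frequency spectrogram X into a log-scaled one.
sr, n_fft = 22050, 2048
M = log_mapping_matrix(sr, n_fft)
X = np.abs(np.random.randn(n_fft // 2 + 1, 100))              # dummy |STFT|
Y = M @ X                                                     # (48, 100)
```

The matrix is built once and reused for every frame, which is why the mapping is cheap despite the STFT's fixed resolution.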
Mel-Frequency Spectrogram
§ The Mel scale is a popular choice
– Example: MFCC (mel-frequency cepstral coefficients)
[Figure: linear-frequency spectrogram (Hz) and Mel-frequency spectrogram (Mel bin) of the same signal]
Constant-Q transform
§ Uses a set of sinusoidal kernels with:
– Logarithmically spaced center frequencies
– Constant Q = frequency / bandwidth
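With B bins per octave, adjacent center frequencies differ by the factor 2^(1/B), so a constant Q follows as Q = 1/(2^(1/B) - 1), and the window length per bin is N_k = Q·sr/f_k. A quick numeric check (the sample rate and fmin are arbitrary choices for illustration):

```python
import numpy as np

B = 12                                  # bins per octave
Q = 1.0 / (2.0 ** (1.0 / B) - 1.0)      # constant quality factor (~16.8 for B=12)
sr = 22050                              # assumed sample rate
fmin = 55.0                             # assumed lowest center frequency (A1)

f_k = fmin * 2.0 ** (np.arange(4 * B) / B)   # 4 octaves of center frequencies
N_k = np.round(Q * sr / f_k).astype(int)     # per-bin window length (samples)
# Windows shrink as frequency rises, keeping Q = f/bandwidth constant.
```

This is exactly the trade-off the CQT makes: long windows (fine frequency resolution) at low frequencies, short windows (fine time resolution) at high frequencies.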
Figure 1. The upper panel illustrates the real part of the transform bases (temporal kernels) that can be used to calculate the CQT over two octaves, with 12 bins per octave. The lower panel shows the absolute values of the corresponding spectral kernels.
3.1 Algorithm of Brown and Puckette
Let us assume that we want to calculate the CQT transform coefficients X_CQ(k, n) as defined by (1) at one point n of an input signal x(n). A direct implementation of (1) obviously requires calculating inner products of the input signal with each of the transform bases. The upper panel of Fig. 1 illustrates the real part of the transform bases a_k(n), assuming here for simplicity only B = 12 bins per octave and a frequency range of two octaves.
A computationally more efficient implementation is obtained by utilizing the identity

sum_{n=0}^{N-1} x(n) a*(n) = sum_{j=0}^{N-1} X(j) A*(j)    (7)

where X(j) denotes the discrete Fourier transform (DFT) of x(n) and A(j) denotes the DFT of a(n). Equation (7) holds for any discrete signals x(n) and a(n) and stems from Parseval's theorem [3].
Using (7), the CQT transform in (1) can be written as

X_CQ(k, N/2) = sum_{j=0}^{N-1} X(j) A_k*(j)    (8)

where A_k(j) is the complex-valued N-point DFT of the transform basis a_k(n), so that the bases a_k(n) are centered at the point N/2 within the transform frame. Following the terminology of [9], we will refer to A_k(j) as the spectral kernels and to a_k(n) as the temporal kernels. The lower panel of Fig. 1 illustrates the absolute values of the spectral kernels A_k(j) corresponding to the temporal kernels a_k(n) in the upper panel.
As observed by Brown and Puckette, the spectral kernels A_k(j) are sparse: most of the values are near zero because they are Fourier transforms of modulated sinusoids. Therefore the summation in (8) can be limited to values near the peak in the spectral kernel to achieve sufficient numerical accuracy, omitting near-zero values in A_k(j). This is the main idea of the efficient CQT transform proposed in [9]. It is also easy to see that the summing has to be carried out for positive frequencies only, followed by multiplication by two.
For convenience, we store the spectral kernels A_k(j) as columns in matrix A. The transform in (8) can then be written in matrix form as

X_CQ = A* X    (9)

where A* denotes the conjugate transpose of A. Matrices X and X_CQ have only one column each, containing the DFT values X(j) and the corresponding CQT coefficients, respectively.
3.2 Processing One Octave at a Time
There are two remaining problems with the method outlined in the previous subsection. Firstly, when a wide range of frequencies is considered (for example, eight octaves from 60 Hz to 16 kHz), quite long DFT transform blocks are required and the spectral kernel is no longer very sparse, since the frequency responses of higher frequency bins are wider, as can be seen from Fig. 1. Secondly, in order to analyze all parts of the input signal adequately, the CQT transform for the highest frequency bins has to be calculated at least every N_K/2 samples apart, where N_K is the window length for the highest CQT bin. Both of these factors reduce the computational efficiency of the method.

We propose two extensions to address the above problems. The first is processing by octaves.² We use a spectral kernel matrix A which produces the CQT for the highest octave only. After computing the highest-octave CQT bins over the entire signal, the input signal is lowpass filtered and downsampled by factor two, and then the same process is repeated to calculate the CQT bins for the next octave, using exactly the same DFT block size and spectral kernel (see (8)). This is repeated iteratively until the desired number of octaves has been covered. Figure 2 illustrates this process.

Since the spectral kernel A now represents frequency bins that are at maximum one octave apart, the length of the DFT block can be made quite short (according to N_k of the lowest CQT bin), and the matrix A is very sparse even for the highest-frequency bins.

Another computational efficiency improvement is obtained by using several temporally translated versions of the transform bases a_k(n) within the same spectral kernel.

² We want to credit J. Brown for mentioning this possibility already in [8], although octave-by-octave processing was not implemented in [8, 9].
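The spectral-kernel computation of (8) and (9) can be sketched roughly as below. This is an illustrative single-frame toy version, not the toolbox implementation: the Hann windowing, sparsification threshold, and normalization are assumed choices, and the complex kernels make the positive-frequency-only simplification unnecessary here.

```python
import numpy as np

def cqt_frame(x, sr, fmin=220.0, B=12, n_bins=24, thresh=0.005):
    """Single-frame CQT via spectral kernels (Eqs. (7)-(9)):
    X_CQ = A^H X, where column k of A is the sparsified DFT of a
    temporal kernel a_k(n) centered at N/2."""
    N = len(x)
    Q = 1.0 / (2.0 ** (1.0 / B) - 1.0)
    A = np.zeros((N, n_bins), dtype=complex)
    for k in range(n_bins):
        fk = fmin * 2.0 ** (k / B)
        Nk = int(round(Q * sr / fk))                 # constant-Q window length
        n = np.arange(Nk)
        kernel = (np.hanning(Nk) / Nk) * np.exp(2j * np.pi * fk * n / sr)
        a = np.zeros(N, dtype=complex)
        start = (N - Nk) // 2                        # center the kernel at N/2
        a[start:start + Nk] = kernel
        Ak = np.fft.fft(a)
        Ak[np.abs(Ak) < thresh] = 0.0                # keep only near-peak values
        A[:, k] = Ak
    X = np.fft.fft(x)                                # DFT of the input frame
    return (A.conj().T @ X) / N                      # one CQT coefficient per bin
```

As a sanity check, a 440 Hz tone should produce its largest coefficient one octave above fmin = 220 Hz, i.e. at bin k = 12.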
Comparison of Different Time-Frequency Representations
[Figure: four panels (frequency vs. time): spectrogram with a short window, spectrogram with a long window, Mel spectrogram, and Constant-Q transform]
Example of Constant-Q transform
[Figure: log-frequency spectrograms (MIDI note number vs. time) obtained by spectrogram mapping and by the Constant-Q transform]
Chord Recognition in MIR
§ Identifying the chord progression of tonal music
§ It is a challenging task (even for humans)
– Chords are not explicit in the music
– Non-chord notes and passing notes obscure the harmony
– Key changes and chromaticism require in-depth knowledge of music theory
– In audio, multiple musical instruments are mixed
• Relevant: harmonically arranged notes
• Irrelevant: percussive sounds (though they can help detect chord changes)
§ What kind of audio features can be extracted to recognize chords in a robust way?
Pitch Helix
§ The basic assumption in tonal harmony is that notes an octave apart belong to the same pitch class
– No dissonance among them
– As a result, there are 12 pitch classes
§ Shepard represented octave equivalence with the "pitch helix"
– Chroma: represents the inherent circularity of pitch organization
– Height: increases naturally, one octave per rotation of the helix
[Figure: pitch helix and chroma (Shepard, 2001)]
Chroma
§ Chroma is independent of height
– Shepard tone: a single pitch class across all harmonics
– Creates the illusion of constantly rising or falling pitch
§ Chroma captures the relative distribution of pitch classes, whereas pitch height is noisy variation for chord recognition
– Thus, chroma is considered well-suited for analyzing harmony
Optical-illusion stairs; Shepard tone: https://vimeo.com/34749558
Chroma Features
§ Chroma features are audio feature vectors that capture the chroma characteristics
– Ideally obtained by polyphonic note transcription, but that is too expensive
– In addition, as notes become more harmonized, separating polyphonic notes becomes harder
§ In practice, chroma features are obtained by projecting all time-frequency energy onto the 12 pitch classes
§ Used not only for chord recognition but also for key estimation, segmentation, synchronization, and cover-song detection
Chroma Features: FFT-based approach
§ Compute the spectrogram and a mapping matrix
– Convert each frequency to the musical pitch scale and get its pitch class
– Set one at the corresponding pitch class and zero elsewhere
– Adjust the non-zero values so that low-frequency content gets more weight
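The projection described above can be sketched as follows. This is a bare-bones illustration; the square-root low-frequency weighting is one assumed choice among many:

```python
import numpy as np

def chroma_from_spectrogram(S, sr, n_fft):
    """Project a magnitude spectrogram S (n_fft//2+1 x frames) onto 12
    pitch classes: round each bin's frequency to the nearest MIDI note,
    fold modulo 12, and sum with a low-frequency emphasis."""
    freqs = np.arange(1, S.shape[0]) * sr / n_fft       # skip the DC bin
    pitch_class = np.round(12 * np.log2(freqs / 440.0) + 69).astype(int) % 12
    weights = 1.0 / np.sqrt(freqs / freqs[0])           # assumed weighting
    C = np.zeros((12, S.shape[1]))
    for i, p in enumerate(pitch_class):
        C[p] += weights[i] * S[i + 1]
    return C
```

For a pure 440 Hz tone, the resulting chroma vector should peak at pitch class 9 (A).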
Improvements
§ Blurring
– An intrinsic problem with the STFT
– Solution: find amplitude peaks and use only them
§ De-tuning
– Notes can deviate from the reference tuning
– Compute 36-bin chroma features: three neighboring bins per pitch class
– Use only the peak value among the three bins for each pitch class
§ Normalization
– Divide the frame chroma features by the local maximum or mean to regularize volume changes
Chroma Features: Filter-bank approach
§ Alternatively, a filter bank can be used to get a log-scale time-frequency representation
– Center frequencies are arranged over the 88 piano notes
– Bandwidths are set to have constant Q and be robust to +/- 25 cents of detuning
§ The outputs that belong to the same pitch class are wrapped around and summed
(Müller, 2011)
Beat-Synchronous Chroma Features
§ Make chroma features homogeneous within a beat (Bartsch and Wakefield, 2001)
(From Ellis's slides)
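A plausible sketch of beat-synchronous averaging, assuming a chroma matrix C of shape (12, frames) and beat positions given as frame indices:

```python
import numpy as np

def beat_sync_chroma(C, beat_frames):
    """Average the chroma matrix C (12 x frames) within each beat
    interval, so that every beat yields one 12-dim chroma vector."""
    bounds = np.concatenate(([0], np.asarray(beat_frames), [C.shape[1]]))
    segs = [C[:, a:b].mean(axis=1)
            for a, b in zip(bounds[:-1], bounds[1:]) if b > a]
    return np.stack(segs, axis=1)
```

Besides enforcing homogeneity within a beat, this also shrinks the feature sequence to one vector per beat, which simplifies later matching.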
Key Estimation Overview
§ Estimate the musical key from music data
– One of 24 keys: 12 pitch classes (C, C#, D, ..., B) + major/minor
§ General framework (Gómez, 2006)
– Chroma features → average → similarity measure with key templates → key strength (e.g., G major)
Key Template
§ Probe-tone profile (Krumhansl and Kessler, 1982)
– Relative stability or weight of tones
– Listeners rated which tones best completed the first seven notes of a major scale
• For example, in C major: C, D, E, F, G, A, B, ... what?
[Figure: probe-tone profile, relative pitch ratings]
Key Estimation
§ Measure similarity by cross-correlating the averaged chroma features with the key templates
§ Pick the key that produces the maximum correlation
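A compact sketch of this correlation-based matching, using the Krumhansl-Kessler major and minor profiles (the values below are those commonly reported in the literature; treat them as indicative):

```python
import numpy as np

# Krumhansl-Kessler probe-tone profiles, rotated to form all 24 key templates
MAJOR = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                  2.52, 5.19, 2.39, 3.66, 2.29, 2.88])
MINOR = np.array([6.33, 2.68, 3.52, 5.38, 2.60, 3.53,
                  2.54, 4.75, 3.98, 2.69, 3.34, 3.17])

def estimate_key(mean_chroma):
    """Correlate the averaged chroma vector with all 24 key templates;
    return (tonic_pitch_class, 'major'|'minor') of the best match."""
    best, best_r = None, -np.inf
    for mode, profile in (("major", MAJOR), ("minor", MINOR)):
        for tonic in range(12):
            template = np.roll(profile, tonic)      # shift profile to this tonic
            r = np.corrcoef(mean_chroma, template)[0, 1]
            if r > best_r:
                best, best_r = (tonic, mode), r
    return best
```

For instance, a chroma vector shaped like the major profile rotated to pitch class 7 should be classified as G major.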
Chord Recognition
§ Estimate chords from music data
– Typically one of 24 chords: 12 pitch classes (root) + major/minor
– Often diminished chords are added (36 chords)
§ General framework
– Audio → transform → chroma features → decision making against chord templates or models (template matching, HMM, SVM)
Template-Based Approach
§ Use chord templates (Fujishima, 1999; Harte and Sandler, 2005) and find the best matches
§ Chord templates
[Figure: binary chord templates (from Bello's slides)]
Template-Based Approach
§ Compute the cross-correlation between the chroma features and the chord templates, and select the chord with the maximum value in each frame
(from Bello's slides)
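A minimal template-matching sketch with binary major and minor triad templates (the chord-naming scheme is illustrative):

```python
import numpy as np

NOTE = "C C# D D# E F F# G G# A A# B".split()

def chord_templates():
    """24 binary templates: major (0,4,7) and minor (0,3,7) triads."""
    T, names = [], []
    for root in range(12):
        for suffix, intervals in ((":maj", (0, 4, 7)), (":min", (0, 3, 7))):
            t = np.zeros(12)
            t[[(root + i) % 12 for i in intervals]] = 1.0
            T.append(t)
            names.append(NOTE[root] + suffix)
    return np.array(T), names

def recognize(C):
    """Per frame, pick the template with the highest normalized correlation."""
    T, names = chord_templates()
    Tn = T / np.linalg.norm(T, axis=1, keepdims=True)
    Cn = C / (np.linalg.norm(C, axis=0, keepdims=True) + 1e-9)
    scores = Tn @ Cn                       # (24, n_frames) similarity matrix
    return [names[i] for i in scores.argmax(axis=0)]
```

An idealized C major chroma (energy only at C, E, G) should come back as "C:maj", and A, C, E as "A:min".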
Limitations
§ The template approach is too simplistic
– Binary templates are hard assignments
§ The temporal dependency of chords is not considered
– The majority of tonal music follows certain types of chord progressions
§ The recognized chord sequence is not smooth
– Some post-processing (smoothing) is necessary
Demo
§ Chordify: https://chordify.net
References
§ P. R. Cook (Ed.), "Music, Cognition, and Computerized Sound: An Introduction to Psychoacoustics", 2001
§ C. Krumhansl, "Cognitive Foundations of Musical Pitch", 1990
§ M. A. Bartsch and G. H. Wakefield, "To Catch a Chorus: Using Chroma-Based Representations for Audio Thumbnailing", 2001
§ E. Gómez and P. Herrera, "Estimating the Tonality of Polyphonic Audio Files: Cognitive Versus Machine Learning Modeling Strategies", 2004
§ M. Müller and S. Ewert, "Chroma Toolbox: MATLAB Implementations for Extracting Variants of Chroma-Based Audio Features", 2011
§ T. Fujishima, "Real-Time Chord Recognition of Musical Sound: A System Using Common Lisp Music", 1999