GCT535 Sound Technology for Multimedia: Tonal Analysis
Graduate School of Culture Technology, KAIST
Juhan Nam
Outline
§ Pitch Perception
– Perceptual Pitch Scale
– Log-Scaled Spectrum
§ Tonal Analysis
– Chroma Feature
– Key Estimation
– Chord Recognition
Frequency Scale in Spectrogram
§ Linear frequency scale
– Good for seeing the harmonic structure of a single tone
– However, it is not the most intuitive way to visualize musical signals
[Figure: linear-frequency spectrograms of a piano chromatic scale (0-4 kHz, ~50 s) and of the Beatles' "Hey Jude" (0-10 kHz, ~8 s)]
Human Pitch Perception
§ Human ears are sensitive to frequency changes on a log scale
– Pitch resolution: the just noticeable difference (JND) increases as frequency goes up
– Place theory: pitch corresponds to the resonance position along the basilar membrane in the cochlea
[Figure: response of the basilar membrane to a pair of tones, from CCRMA Music 150 slides (Thomas Rossing)]
Critical Bandwidth
§ The frequency bandwidth within which one tone interferes with the perception of another tone through auditory masking
– Roughly constant at low frequencies, but grows roughly linearly at high frequencies
From CCRMA Music 150 slides (Thomas Rossing)
[Figure: comparison of pitch scales, normalized ERB, Mel, and Bark scales vs. frequency (0-25 kHz)]
Psychoacoustical Pitch Scales
§ Mel scale
– Based on pitch ratios of tones ("mel" comes from "melody")
– m = 2595 log10(1 + f/700)
§ Bark scale
– Critical bands measured by auditory masking
– Bark = 13 arctan(0.00076 f) + 3.5 arctan((f/7500)^2)
§ Equivalent Rectangular Bandwidth (ERB) rate
– Critical bands measured with the notched-noise method
– ERBS = 21.4 log10(1 + 0.00437 f)
Using Matlab code from https://www.speech.kth.se/~giampi/auditoryscales/
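The three scale formulas above translate directly into code. A minimal sketch (function names are illustrative; the constants are the ones quoted on this slide):

```python
import numpy as np

def hz_to_mel(f):
    """Mel scale: m = 2595 log10(1 + f/700)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def hz_to_bark(f):
    """Bark scale (Zwicker-style approximation)."""
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

def hz_to_erbs(f):
    """ERB-rate scale: ERBS = 21.4 log10(1 + 0.00437 f)."""
    return 21.4 * np.log10(1.0 + 0.00437 * f)

# All three grow nearly linearly at low frequency and
# logarithmically (compressively) at high frequency.
```

Note that the Mel scale is calibrated so that 1000 Hz maps to roughly 1000 mel, which is a quick sanity check on the formula.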
Musical Pitch Scale
§ Equal temperament
– 2^(1/12) frequency ratio between two adjacent notes
– MIDI note number (m) and frequency (f) in Hz:
f = 440 · 2^((m - 69)/12),   m = 12 log2(f/440) + 69
https://newt.phys.unsw.edu.au/jw/notes.html
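The pair of conversion formulas above can be checked in a couple of lines:

```python
import numpy as np

def midi_to_hz(m):
    """A4 = MIDI 69 = 440 Hz; adjacent notes differ by a factor of 2^(1/12)."""
    return 440.0 * 2.0 ** ((np.asarray(m) - 69) / 12.0)

def hz_to_midi(f):
    """Inverse mapping: frequency in Hz back to (fractional) MIDI note number."""
    return 12.0 * np.log2(np.asarray(f) / 440.0) + 69.0
```

For example, middle C (MIDI 60) comes out near 261.6 Hz, and 880 Hz maps back to MIDI 81 (A5).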
[Figure: the piano chromatic scale as a log-frequency spectrogram (MIDI note number vs. time) and as a linear-frequency spectrogram (Hz vs. time)]
Frequency Mapping Using Spectrogram
§ Mapping the linear scale to a perceptual (log-like) scale
– Locate center frequencies according to the frequency mapping
– Linearly interpolate around each center frequency with the corresponding bandwidth skirt
[Figure: triangular weights defined by a center frequency and bandwidth map the linear-frequency spectrogram to the log-frequency spectrogram]
§ The mapping can be expressed as a matrix multiplication
– Each row of the mapping matrix contains the interpolation coefficients for one log-frequency bin
§ Limitation
– Simple, but the time and frequency resolution is still constrained by the underlying STFT
Y = M · X   (M: mapping matrix, X: linear-frequency spectrogram, Y: log-scaled spectrogram)
[Figure: the mapping matrix M multiplies the linear-frequency spectrogram (Hz vs. time) to yield the log-frequency spectrogram (MIDI note number vs. time)]
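The construction of M and the product Y = M·X can be sketched as follows. This is a minimal illustration assuming triangular interpolation skirts around log-spaced center frequencies; the function name and default parameters are not from the slides:

```python
import numpy as np

def log_mapping_matrix(sr, n_fft, n_bins=48, fmin=110.0, bins_per_octave=12):
    """Mapping matrix M of shape (n_bins, n_fft//2 + 1): each row holds
    triangular interpolation weights around one log-spaced center frequency."""
    freqs = np.arange(n_fft // 2 + 1) * sr / n_fft            # linear FFT bin freqs
    centers = fmin * 2.0 ** (np.arange(n_bins) / bins_per_octave)
    M = np.zeros((n_bins, len(freqs)))
    for k, fc in enumerate(centers):
        bw = fc * (2.0 ** (1.0 / bins_per_octave) - 1.0)      # ~1-semitone skirt
        w = 1.0 - np.abs(freqs - fc) / bw                     # triangular weights
        M[k] = np.maximum(w, 0.0)
    return M

# Y = M @ X turns a linear-frequency spectrogram X into a log-scaled one.
sr, n_fft = 22050, 2048
M = log_mapping_matrix(sr, n_fft)
X = np.abs(np.random.randn(n_fft // 2 + 1, 100))              # dummy |STFT|
Y = M @ X                                                     # (48, 100)
```

The matrix is built once and reused for every frame, which is why the mapping is cheap despite the STFT's fixed resolution.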
Mel-Frequency Spectrogram
§ The Mel scale is a popular choice
– Example: MFCC (mel-frequency cepstral coefficients)
[Figure: linear-frequency spectrogram (Hz) and Mel-frequency spectrogram (Mel bin) of the same signal]
Constant-Q transform
§ Uses a set of sinusoidal kernels with:
– Logarithmically spaced center frequencies
– Constant Q = frequency / bandwidth
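With B bins per octave, adjacent center frequencies differ by the factor 2^(1/B), so a constant Q follows as Q = 1/(2^(1/B) - 1), and the window length per bin is N_k = Q·sr/f_k. A quick numeric check (the sample rate and fmin are arbitrary choices for illustration):

```python
import numpy as np

B = 12                                  # bins per octave
Q = 1.0 / (2.0 ** (1.0 / B) - 1.0)      # constant quality factor (~16.8 for B=12)
sr = 22050                              # assumed sample rate
fmin = 55.0                             # assumed lowest center frequency (A1)

f_k = fmin * 2.0 ** (np.arange(4 * B) / B)   # 4 octaves of center frequencies
N_k = np.round(Q * sr / f_k).astype(int)     # per-bin window length (samples)
# Windows shrink as frequency rises, keeping Q = f/bandwidth constant.
```

This is exactly the trade-off the CQT makes: long windows (fine frequency resolution) at low frequencies, short windows (fine time resolution) at high frequencies.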
Figure 1. The upper panel illustrates the real part of the transform bases (temporal kernels) that can be used to calculate the CQT over two octaves, with 12 bins per octave. The lower panel shows the absolute values of the corresponding spectral kernels.
3.1 Algorithm of Brown and Puckette
Let us assume that we want to calculate the CQT transform coefficients X_CQ(k, n) as defined by (1) at one point n of an input signal x(n). A direct implementation of (1) obviously requires calculating inner products of the input signal with each of the transform bases. The upper panel of Fig. 1 illustrates the real part of the transform bases a_k(n), assuming here for simplicity only B = 12 bins per octave and a frequency range of two octaves.
A computationally more efficient implementation is obtained by utilizing the identity

sum_{n=0}^{N-1} x(n) a*(n) = sum_{j=0}^{N-1} X(j) A*(j)    (7)

where X(j) denotes the discrete Fourier transform (DFT) of x(n) and A(j) denotes the DFT of a(n). Equation (7) holds for any discrete signals x(n) and a(n) and stems from Parseval's theorem [3].
Using (7), the CQT transform in (1) can be written as

X_CQ(k, N/2) = sum_{j=0}^{N-1} X(j) A_k*(j)    (8)

where A_k(j) is the complex-valued N-point DFT of the transform basis a_k(n), so that the bases a_k(n) are centered at the point N/2 within the transform frame. Following the terminology of [9], we will refer to A_k(j) as the spectral kernels and to a_k(n) as the temporal kernels. The lower panel of Fig. 1 illustrates the absolute values of the spectral kernels A_k(j) corresponding to the temporal kernels a_k(n) in the upper panel.
As observed by Brown and Puckette, the spectral kernels A_k(j) are sparse: most of the values are near zero because they are Fourier transforms of modulated sinusoids. Therefore the summation in (8) can be limited to values near the peak in the spectral kernel to achieve sufficient numerical accuracy, omitting near-zero values in A_k(j). This is the main idea of the efficient CQT transform proposed in [9]. It is also easy to see that the summing has to be carried out for positive frequencies only, followed by multiplication by two.
For convenience, we store the spectral kernels A_k(j) as columns in matrix A. The transform in (8) can then be written in matrix form as

X_CQ = A* X    (9)

where A* denotes the conjugate transpose of A. Matrices X and X_CQ have only one column each, containing the DFT values X(j) and the corresponding CQT coefficients, respectively.
3.2 Processing One Octave at a Time
There are two remaining problems with the method outlined in the previous subsection. Firstly, when a wide range of frequencies is considered (for example, eight octaves from 60 Hz to 16 kHz), quite long DFT transform blocks are required and the spectral kernel is no longer very sparse, since the frequency responses of higher frequency bins are wider, as can be seen from Fig. 1. Secondly, in order to analyze all parts of the input signal adequately, the CQT transform for the highest frequency bins has to be calculated at least every N_K/2 samples apart, where N_K is the window length for the highest CQT bin. Both of these factors reduce the computational efficiency of the method.

We propose two extensions to address the above problems. The first is processing by octaves.² We use a spectral kernel matrix A which produces the CQT for the highest octave only. After computing the highest-octave CQT bins over the entire signal, the input signal is lowpass filtered and downsampled by factor two, and then the same process is repeated to calculate the CQT bins for the next octave, using exactly the same DFT block size and spectral kernel (see (8)). This is repeated iteratively until the desired number of octaves has been covered. Figure 2 illustrates this process.

Since the spectral kernel A now represents frequency bins that are at maximum one octave apart, the length of the DFT block can be made quite short (according to N_k of the lowest CQT bin), and the matrix A is very sparse even for the highest-frequency bins.

Another computational efficiency improvement is obtained by using several temporally translated versions of the transform bases a_k(n) within the same spectral kernel.

² We want to credit J. Brown for mentioning this possibility already in [8], although octave-by-octave processing was not implemented in [8, 9].
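The spectral-kernel computation of (8) and (9) can be sketched roughly as below. This is an illustrative single-frame toy version, not the toolbox implementation: the Hann windowing, sparsification threshold, and normalization are assumed choices, and the complex kernels make the positive-frequency-only simplification unnecessary here.

```python
import numpy as np

def cqt_frame(x, sr, fmin=220.0, B=12, n_bins=24, thresh=0.005):
    """Single-frame CQT via spectral kernels (Eqs. (7)-(9)):
    X_CQ = A^H X, where column k of A is the sparsified DFT of a
    temporal kernel a_k(n) centered at N/2."""
    N = len(x)
    Q = 1.0 / (2.0 ** (1.0 / B) - 1.0)
    A = np.zeros((N, n_bins), dtype=complex)
    for k in range(n_bins):
        fk = fmin * 2.0 ** (k / B)
        Nk = int(round(Q * sr / fk))                 # constant-Q window length
        n = np.arange(Nk)
        kernel = (np.hanning(Nk) / Nk) * np.exp(2j * np.pi * fk * n / sr)
        a = np.zeros(N, dtype=complex)
        start = (N - Nk) // 2                        # center the kernel at N/2
        a[start:start + Nk] = kernel
        Ak = np.fft.fft(a)
        Ak[np.abs(Ak) < thresh] = 0.0                # keep only near-peak values
        A[:, k] = Ak
    X = np.fft.fft(x)                                # DFT of the input frame
    return (A.conj().T @ X) / N                      # one CQT coefficient per bin
```

As a sanity check, a 440 Hz tone should produce its largest coefficient one octave above fmin = 220 Hz, i.e. at bin k = 12.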
Comparison of Different Time-Frequency Representations
[Figure: four panels (frequency vs. time): spectrogram with a short window, spectrogram with a long window, Mel spectrogram, and Constant-Q transform]
Example of Constant-Q transform
[Figure: log-frequency spectrograms (MIDI note number vs. time) obtained by spectrogram mapping and by the Constant-Q transform]
Chord Recognition in MIR
§ Identifying the chord progression of tonal music
§ It is a challenging task (even for humans)
– Chords are not explicit in the music
– Non-chord notes and passing notes obscure the harmony
– Key changes and chromaticism require in-depth knowledge of music theory
– In audio, multiple musical instruments are mixed
• Relevant: harmonically arranged notes
• Irrelevant: percussive sounds (though they can help detect chord changes)
§ What kind of audio features can be extracted to recognize chords in a robust way?
Pitch Helix
§ The basic assumption in tonal harmony is that notes an octave apart belong to the same pitch class
– No dissonance among them
– As a result, there are 12 pitch classes
§ Shepard represented octave equivalence with the "pitch helix"
– Chroma: represents the inherent circularity of pitch organization
– Height: increases naturally, one octave per rotation of the helix
[Figure: pitch helix and chroma (Shepard, 2001)]
Chroma
§ Chroma is independent of height
– Shepard tone: a single pitch class across all harmonics
– Creates the illusion of constantly rising or falling pitch
§ Chroma captures the relative distribution of pitch classes, whereas pitch height is noisy variation for chord recognition
– Thus, chroma is considered well-suited for analyzing harmony
Optical-illusion stairs; Shepard tone: https://vimeo.com/34749558
Chroma Features
§ Chroma features are audio feature vectors that capture the chroma characteristics
– Ideally obtained by polyphonic note transcription, but that is too expensive
– In addition, as notes become more harmonized, separating polyphonic notes becomes harder
§ In practice, chroma features are obtained by projecting all time-frequency energy onto the 12 pitch classes
§ Used not only for chord recognition but also for key estimation, segmentation, synchronization, and cover-song detection
Chroma Features: FFT-based approach
§ Compute the spectrogram and a mapping matrix
– Convert each frequency to the musical pitch scale and get its pitch class
– Set one at the corresponding pitch class and zero elsewhere
– Adjust the non-zero values so that low-frequency content gets more weight
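The projection described above can be sketched as follows. This is a bare-bones illustration; the square-root low-frequency weighting is one assumed choice among many:

```python
import numpy as np

def chroma_from_spectrogram(S, sr, n_fft):
    """Project a magnitude spectrogram S (n_fft//2+1 x frames) onto 12
    pitch classes: round each bin's frequency to the nearest MIDI note,
    fold modulo 12, and sum with a low-frequency emphasis."""
    freqs = np.arange(1, S.shape[0]) * sr / n_fft       # skip the DC bin
    pitch_class = np.round(12 * np.log2(freqs / 440.0) + 69).astype(int) % 12
    weights = 1.0 / np.sqrt(freqs / freqs[0])           # assumed weighting
    C = np.zeros((12, S.shape[1]))
    for i, p in enumerate(pitch_class):
        C[p] += weights[i] * S[i + 1]
    return C
```

For a pure 440 Hz tone, the resulting chroma vector should peak at pitch class 9 (A).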
Improvements
§ Blurring
– An intrinsic problem with the STFT
– Solution: find amplitude peaks and use only them
§ De-tuning
– Notes can deviate from the reference tuning
– Compute 36-bin chroma features: three neighboring bins per pitch class
– Use only the peak value among the three bins for each pitch class
§ Normalization
– Divide the frame chroma features by the local maximum or mean to regularize volume changes
Chroma Features: Filter-bank approach
§ Alternatively, a filter bank can be used to get a log-scale time-frequency representation
– Center frequencies are arranged over the 88 piano notes
– Bandwidths are set to have constant Q and be robust to +/- 25 cents of detuning
§ The outputs that belong to the same pitch class are wrapped around and summed
(Müller, 2011)
Beat-Synchronous Chroma Features
§ Make chroma features homogeneous within a beat (Bartsch and Wakefield, 2001)
(From Ellis's slides)
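A plausible sketch of beat-synchronous averaging, assuming a chroma matrix C of shape (12, frames) and beat positions given as frame indices:

```python
import numpy as np

def beat_sync_chroma(C, beat_frames):
    """Average the chroma matrix C (12 x frames) within each beat
    interval, so that every beat yields one 12-dim chroma vector."""
    bounds = np.concatenate(([0], np.asarray(beat_frames), [C.shape[1]]))
    segs = [C[:, a:b].mean(axis=1)
            for a, b in zip(bounds[:-1], bounds[1:]) if b > a]
    return np.stack(segs, axis=1)
```

Besides enforcing homogeneity within a beat, this also shrinks the feature sequence to one vector per beat, which simplifies later matching.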
Key Estimation Overview
§ Estimate the musical key from music data
– One of 24 keys: 12 pitch classes (C, C#, D, ..., B) + major/minor
§ General framework (Gómez, 2006)
– Chroma features → average → similarity measure with key templates → key strength (e.g., G major)
Key Template
§ Probe-tone profile (Krumhansl and Kessler, 1982)
– Relative stability or weight of tones
– Listeners rated which tones best completed the first seven notes of a major scale
• For example, in C major: C, D, E, F, G, A, B, ... what?
[Figure: probe-tone profile, relative pitch ratings]
Key Estimation
§ Measure similarity by cross-correlating the averaged chroma features with the key templates
§ Pick the key that produces the maximum correlation
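A compact sketch of this correlation-based matching, using the Krumhansl-Kessler major and minor profiles (the values below are those commonly reported in the literature; treat them as indicative):

```python
import numpy as np

# Krumhansl-Kessler probe-tone profiles, rotated to form all 24 key templates
MAJOR = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                  2.52, 5.19, 2.39, 3.66, 2.29, 2.88])
MINOR = np.array([6.33, 2.68, 3.52, 5.38, 2.60, 3.53,
                  2.54, 4.75, 3.98, 2.69, 3.34, 3.17])

def estimate_key(mean_chroma):
    """Correlate the averaged chroma vector with all 24 key templates;
    return (tonic_pitch_class, 'major'|'minor') of the best match."""
    best, best_r = None, -np.inf
    for mode, profile in (("major", MAJOR), ("minor", MINOR)):
        for tonic in range(12):
            template = np.roll(profile, tonic)      # shift profile to this tonic
            r = np.corrcoef(mean_chroma, template)[0, 1]
            if r > best_r:
                best, best_r = (tonic, mode), r
    return best
```

For instance, a chroma vector shaped like the major profile rotated to pitch class 7 should be classified as G major.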
Chord Recognition
§ Estimate chords from music data
– Typically one of 24 chords: 12 pitch classes (root) + major/minor
– Often diminished chords are added (36 chords)
§ General framework
– Audio → transform → chroma features → decision making against chord templates or models (template matching, HMM, SVM)
Template-Based Approach
§ Use chord templates (Fujishima, 1999; Harte and Sandler, 2005) and find the best matches
§ Chord templates
[Figure: binary chord templates (from Bello's slides)]
Template-Based Approach
§ Compute the cross-correlation between the chroma features and the chord templates, and select the chord with the maximum value in each frame
(from Bello's slides)
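A minimal template-matching sketch with binary major and minor triad templates (the chord-naming scheme is illustrative):

```python
import numpy as np

NOTE = "C C# D D# E F F# G G# A A# B".split()

def chord_templates():
    """24 binary templates: major (0,4,7) and minor (0,3,7) triads."""
    T, names = [], []
    for root in range(12):
        for suffix, intervals in ((":maj", (0, 4, 7)), (":min", (0, 3, 7))):
            t = np.zeros(12)
            t[[(root + i) % 12 for i in intervals]] = 1.0
            T.append(t)
            names.append(NOTE[root] + suffix)
    return np.array(T), names

def recognize(C):
    """Per frame, pick the template with the highest normalized correlation."""
    T, names = chord_templates()
    Tn = T / np.linalg.norm(T, axis=1, keepdims=True)
    Cn = C / (np.linalg.norm(C, axis=0, keepdims=True) + 1e-9)
    scores = Tn @ Cn                       # (24, n_frames) similarity matrix
    return [names[i] for i in scores.argmax(axis=0)]
```

An idealized C major chroma (energy only at C, E, G) should come back as "C:maj", and A, C, E as "A:min".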
Limitations
§ The template approach is too simplistic
– Binary templates are hard assignments
§ The temporal dependency of chords is not considered
– The majority of tonal music follows certain types of chord progressions
§ The recognized chord sequence is not smooth
– Some post-processing (smoothing) is necessary
Demo
§ Chordify: https://chordify.net
References
§ P. R. Cook (Ed.), "Music, Cognition, and Computerized Sound: An Introduction to Psychoacoustics", 2001
§ C. Krumhansl, "Cognitive Foundations of Musical Pitch", 1990
§ M. A. Bartsch and G. H. Wakefield, "To Catch a Chorus: Using Chroma-Based Representations for Audio Thumbnailing", 2001
§ E. Gómez and P. Herrera, "Estimating the Tonality of Polyphonic Audio Files: Cognitive Versus Machine Learning Modeling Strategies", 2004
§ M. Müller and S. Ewert, "Chroma Toolbox: MATLAB Implementations for Extracting Variants of Chroma-Based Audio Features", 2011
§ T. Fujishima, "Real-Time Chord Recognition of Musical Sound: A System Using Common Lisp Music", 1999