GCT634: Musical Applications of Machine Learning
Rhythm Transcription and Dynamic Programming
Graduate School of Culture Technology, KAIST. Juhan Nam. September 14, 2018.


  • GCT634: Musical Applications of Machine Learning: Rhythm Transcription

    Dynamic Programming

    Graduate School of Culture Technology, KAIST. Juhan Nam

  • Outline

    • Overview of Automatic Music Transcription (AMT)
      - Types of AMT tasks

    • Rhythm Transcription
      - Introduction
      - Onset detection
      - Tempo estimation

    • Dynamic Programming
      - Beat tracking

  • Overview of Automatic Music Transcription (AMT)

    • Predicting musical score information from audio
      - The primary score information is the notes, but they are arranged on the basis of rhythm, harmony, and structure
      - Equivalent to automatic speech recognition (ASR) for speech signals

    (Figure: a model maps audio to onsets, tempo, beat, key, chord, and structure)

  • Types of AMT Tasks

    • Rhythm transcription
      - Onset detection
      - Tempo estimation
      - Beat tracking

    • Tonal analysis
      - Key estimation
      - Chord recognition

    • Timbre analysis
      - Instrument identification

    • Note transcription
      - Monophonic notes
      - Polyphonic notes
      - Expression detection (e.g. vibrato, pedal)

    • Structure analysis
      - Musical structure
      - Musical boundary / repetition detection
      - Highlight detection


    We will mainly focus on these topics!

  • Overview of AMT Systems

    • Acoustic model
      - Estimates the target information from the input audio (usually a short segment)

    • Musical knowledge
      - Music theory (e.g. rhythm, harmony) and performance practice (e.g. playability)

    • Prior/lexical model
      - Statistical distribution of score-level music information (e.g. chord progressions)

    (Figure: an acoustic model turns audio-level input into beats/tempo, keys/chords, and notes; musical knowledge and a prior or lexical model constrain the score-level transcription model)

  • Introduction to Rhythm

    • Rhythm
      - A strong, regular, repeated pattern of sound
      - Distinguishes music from speech

    • The most primitive and foundational element of music
      - Melody, harmony, and other musical elements are arranged on the basis of rhythm

    • Humans and rhythm
      - Humans have an innate sense of rhythm: heartbeat, walking
      - Associated with motor control: dance, labor songs

  • Introduction to Rhythm

    • Hierarchical structure of rhythm
      - Beat (tactus): the most prominent level; the foot-tapping rate
      - Division (tatum): the temporal atom, e.g. an eighth or sixteenth note
      - Measure (bar): the unit of a rhythmic pattern (and also of harmonic changes)

    • Notation
      - Tempo: beats per minute, e.g. 90 bpm
      - Time signature: e.g. 4/4, 3/4, 6/8

    [Wikipedia]

  • Human Perception of Tempo

    • McKinney and Moelants (2006)
      - Collected tapping data from 40 human subjects
      - Initial synchronization delay and anticipation (by tempo estimation)
      - Ambiguity in tempo: the beat or its division?

    [D. Ellis' e4896 slides]

  • Overview of Rhythm Transcription Systems

    • Consists of several cascaded tasks that detect moments of musical stress (accents) and their regularity

    (Pipeline: Onset Detection → Tempo Estimation → Beat Tracking, informed by musical knowledge)

  • Onset Detection

    • Identify the starting times of musical events
      - Notes, drum sounds

    • Types of onsets
      - Hard onsets: percussive sounds
      - Soft onsets: source-driven sounds (e.g. singing voice, woodwinds, bowed strings)

    [M. Müller]

  • Example: Onset Detection

    (Waveform plot, 0 to 6 sec: where are the onsets?)

    "Eat (꺼내먹어요)" by Zion.T

  • Onset Detection Systems

    • Onset detection function (ODF)
      - An instantaneous measure of temporal change, often called a "novelty" function
      - Types: time-domain energy, spectral or sub-band energy, phase difference

    • Decision algorithm
      - Rule-based approach
      - Learning-based approach

    (Pipeline: audio representations → onset detection function (feature extraction) → decision algorithm (classifier))

  • Onset Detection Function (ODF)

    • Types of ODFs
      - Time-domain energy
      - Spectral or sub-band energy
      - Phase difference

  • Time-Domain Onset Detection

    • Local energy
      - Onsets usually have high energy
      - Effective for percussive sounds

    • Variants
      - Frame-level energy:

        ODF(n) = E(n) = Σ_m |x(n + m)|² w(m)

      - Half-wave rectified difference:

        ODF(n) = H(E(n + 1) − E(n)),  where  H(r) = (r + |r|)/2 = { r if r ≥ 0; 0 if r < 0 }

    (Plots: waveform, 0 to 6 sec; frame-level energy ODF; half-wave rectified ODF)
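As a concrete sketch of the frame-level energy ODF with half-wave rectification (the window size, hop size, and test signal are illustrative choices, not values from the lecture):

```python
import numpy as np

def energy_odf(x, frame_len=1024, hop=512):
    """Frame-level local energy followed by a half-wave rectified difference."""
    w = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    # E(n): windowed energy of each frame
    energy = np.array([
        np.sum((x[n * hop: n * hop + frame_len] ** 2) * w)
        for n in range(n_frames)
    ])
    # ODF(n) = H(E(n+1) - E(n)): keep only energy increases
    return np.maximum(np.diff(energy), 0.0)

# A click train: the energy jumps at each click should produce ODF peaks.
np.random.seed(0)
sr = 22050
x = np.zeros(sr * 2)
for t in (0.25, 0.75, 1.25, 1.75):
    i = int(t * sr)
    x[i:i + 256] = np.random.randn(256)
odf = energy_odf(x)
```

Because of the rectification the ODF is nonnegative, and for this percussive signal its peaks align with the clicks.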

  • Spectral-Based Onset Detection

    • Spectral flux
      - Sum of the positive bin-wise differences of the log spectrogram
      - The ODF changes depending on the amount of compression ρ

      ODF(n) = Σ_{k=0}^{N/2} H(Y(n + 1, k) − Y(n, k)),  where  Y(n, k) = log(1 + ρ |X(n, k)|)  and  X(n, k) is the STFT

    (Plots: spectrogram (frequency in Hz vs. time) and the resulting ODF)
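A minimal sketch of the log-compressed spectral flux (the FFT size, hop size, and compression factor ρ are illustrative, not the lecture's values):

```python
import numpy as np

def spectral_flux(x, n_fft=1024, hop=512, rho=1000.0):
    """Spectral flux: sum of positive bin-wise differences of log(1 + rho*|X|)."""
    w = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    # Magnitude STFT, bins 0..n_fft/2
    X = np.array([
        np.abs(np.fft.rfft(x[n * hop: n * hop + n_fft] * w))
        for n in range(n_frames)
    ])
    Y = np.log1p(rho * X)                          # log compression
    diff = np.diff(Y, axis=0)                      # Y(n+1, k) - Y(n, k)
    return np.sum(np.maximum(diff, 0.0), axis=1)   # half-wave rectify, sum over k

# A 440 Hz tone that starts at 0.5 s should produce a flux peak near its onset.
sr = 8000
x = np.zeros(sr)
x[sr // 2:] = np.sin(2 * np.pi * 440.0 * np.arange(sr - sr // 2) / sr)
sf = spectral_flux(x)
```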

  • Phase Deviation

    • The sinusoidal components of a note are continuous while the note is sustained
      - An abrupt change in phase suggests that there may be a new event

    [D. Ellis' e4896 slides]

    • Phase continuation (e.g. during the sustain of a single note):

      φ_k(n) − φ_k(n − 1) ≈ φ_k(n − 1) − φ_k(n − 2)

      Δφ_k(n) = φ_k(n) − 2 φ_k(n − 1) + φ_k(n − 2) ≈ 0

    • Deviation from the steady state, averaged over all N frequency bins:

      ζ_p(n) = (1/N) Σ_{k=1}^{N} |Δφ_k(n)|
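A sketch of the phase-deviation ODF (frame parameters are illustrative): during a steady sinusoid each bin's phase advances linearly across frames, so the second difference stays near zero.

```python
import numpy as np

def phase_deviation(x, n_fft=1024, hop=256):
    """Mean absolute second difference of the unwrapped STFT phase."""
    w = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    phi = np.array([
        np.angle(np.fft.rfft(x[n * hop: n * hop + n_fft] * w))
        for n in range(n_frames)
    ])
    phi = np.unwrap(phi, axis=0)
    # Delta-phi(n, k) = phi(n) - 2*phi(n-1) + phi(n-2)
    d2 = phi[2:] - 2 * phi[1:-1] + phi[:-2]
    return np.mean(np.abs(d2), axis=1)

# A steady 440 Hz sine: the phase advances linearly, so the deviation stays small.
sr = 8000
x = np.sin(2 * np.pi * 440.0 * np.arange(sr) / sr)
zeta = phase_deviation(x)
```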

  • Post-Processing

    • DC removal
      - Subtract the mean of the ODF

    • Normalization
      - Scale the level of the ODF

    • Low-pass filtering
      - Remove small peaks

    • Down-sampling
      - For data reduction

    (Plot: ODF before and after low-pass filtering; the filtered version is the solid line)

    (Tzanetakis, 2010)

  • Onset Decision Algorithm

    • Rule-based approach: peak detection rules
      - Peaks above a threshold are determined to be onsets
      - The threshold is often computed adaptively from the ODF
      - The mean and the median are popular choices for computing the threshold

      threshold = α + β · median(ODF),  where α is an offset and β a scaling factor
      (median computed over a sliding window, e.g. of size 5)

    (Plot: ODF over 0 to 5 sec with the adaptive threshold overlaid)
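The rule above can be sketched as follows (α, β, and the median window size are illustrative parameter values, not the lecture's):

```python
import numpy as np

def pick_onsets(odf, alpha=1.0, beta=1.5, half=2):
    """Peak picking with an adaptive median threshold.

    A frame is an onset if it is a local maximum and exceeds
    threshold(n) = alpha + beta * median(ODF) over a sliding window.
    """
    onsets = []
    for n in range(1, len(odf) - 1):
        lo, hi = max(0, n - half), min(len(odf), n + half + 1)
        thresh = alpha + beta * np.median(odf[lo:hi])
        if odf[n] > thresh and odf[n] >= odf[n - 1] and odf[n] >= odf[n + 1]:
            onsets.append(n)
    return onsets

# Two clear peaks (frames 3 and 7) stand above the local median threshold.
odf = np.array([0, 1, 0, 8, 1, 0, 1, 9, 1, 0, 1, 0], dtype=float)
onsets = pick_onsets(odf)
```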

  • Challenging Issue in Onset Detection: Vibrato

    (Figure: spectral-flux onset detection on a note with vibrato)

  • SuperFlux

    • A state-of-the-art rule-based onset detection function
      - S. Böck et al., "Maximum Filter Vibrato Suppression for Onset Detection", DAFx, 2013

    • Step 1: log spectrogram
      - Makes the harmonic partials have the same depth of vibrato contour

      Y(n, m) = log(1 + |X(n, k)| · F(k, m)),  where X(n, k) is the STFT and F(k, m) is a filterbank

    • Step 2: max-filtering
      - Take the maximum within a window along the frequency axis
      - The vibrato contours become thicker

      Y_max(n, m) = max(Y(n, m − l : m + l))


  • SuperFlux

    (Figures: log spectrogram and max-filtered log spectrogram)

  • SuperFlux

    • Step 3: SuperFlux ODF
      - Take the difference with some frame distance μ
      - Assumption: the frame rate is high in onset detection (i.e. a small hop size)

      SF*(n) = Σ_k H(Y(n + μ, k) − Y_max(n, k))

      (μ ≥ 1 is derived from the analysis window, the hop size h, and a ratio parameter r, 0 ≤ r ≤ 1)

    • Step 4: peak-picking. Frame n is an onset if:
      1) SF*(n) = max(SF*(n − pre_max : n + post_max))
      2) SF*(n) ≥ mean(SF*(n − pre_avg : n + post_avg)) + δ
      3) n − n_last_onset > combination_width
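The max-filter-plus-lagged-difference idea can be sketched on a toy spectrogram (μ and the filter half-width l are illustrative): a contour that merely wobbles between adjacent bins cancels out, while genuinely new energy still produces a peak.

```python
import numpy as np

def superflux_odf(Y, mu=2, l=1):
    """SuperFlux-style ODF on a (frames x bins) log-magnitude spectrogram.

    Max-filters each frame along frequency (half-width l), then sums the
    positive differences against the max-filtered frame mu steps earlier.
    """
    n_frames, n_bins = Y.shape
    Ymax = np.empty_like(Y)
    for k in range(n_bins):
        Ymax[:, k] = Y[:, max(0, k - l): k + l + 1].max(axis=1)
    diff = Y[mu:] - Ymax[:-mu]               # lagged difference
    return np.sum(np.maximum(diff, 0.0), axis=1)

# Vibrato: energy wobbling between bins 3 and 4 is suppressed...
vib = np.zeros((10, 8))
for n in range(10):
    vib[n, 3 + n % 2] = 1.0
sf_vib = superflux_odf(vib)

# ...while a genuine new event (energy appearing at frame 5) still peaks.
onset = np.zeros((10, 8))
onset[5:, 0] = 1.0
sf_onset = superflux_odf(onset)
```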

  • SuperFlux

    (Figures: max-filtered log spectrogram and peak-picking on the resulting SuperFlux ODF)

  • Tempo Estimation

    • Estimate the regular time interval between beats
      - Tempo is usually treated as a global attribute of a song, e.g. 90 bpm or "a mid-tempo song"

    • Tempo often changes within a song
      - Intentionally, e.g. for dramatic effect ("Top 10 tempo changes")
      - Unintentionally, e.g. in re-mastering or live performance

    • There are also local tempo changes, e.g. rubato

  • Tempo Estimation Methods

    • Auto-correlation
      - Find the periodicity, as in pitch detection

    • Discrete Fourier Transform
      - Apply the DFT to the ODF and find the periodicity

    • Comb-filter banks
      - Leverage the "oscillating nature" of musical beats

  • Auto-Correlation

    • The ACF is a generic method to detect the periodicity of a signal
      - It can therefore be applied to the ODF to find a dominant period that may correspond to the tempo
      - The ACF shows dominant peaks that indicate the dominant tempi

    (Plots: onset detection function (spectral flux) and its auto-correlation)
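A sketch of ACF-based tempo estimation from an ODF (the bpm search range and frame rate are illustrative; this is not a robust estimator):

```python
import numpy as np

def tempo_from_odf(odf, fps, bpm_range=(60, 240)):
    """Estimate tempo as the dominant auto-correlation peak of the ODF.

    fps is the ODF frame rate; the search is restricted to lags whose
    implied tempo lies inside bpm_range.
    """
    odf = odf - odf.mean()                                  # remove DC
    ac = np.correlate(odf, odf, mode="full")[len(odf) - 1:] # lags >= 0
    lag_min = int(round(fps * 60.0 / bpm_range[1]))
    lag_max = int(round(fps * 60.0 / bpm_range[0]))
    lag = lag_min + int(np.argmax(ac[lag_min:lag_max + 1]))
    return 60.0 * fps / lag

# Synthetic ODF: impulses every 0.5 s at 100 frames/sec, i.e. 120 bpm.
fps = 100
odf = np.zeros(1000)
odf[::50] = 1.0
bpm = tempo_from_odf(odf, fps)
```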

  • Tempo Estimation Using Tempo Prior

    • The tempo is estimated by multiplying the prior with the auto-correlation (the observation)
      - The auto-correlation corresponds to a likelihood function
      - The tempo prior can be calculated from the beat annotations of a dataset
      - The distribution fits a log-normal distribution well

    (Plot: histogram of beat periods from a dataset)

    [D. Ellis' e4896 slides] (Klapuri, 2003)

  • Beat Spectrum

    • Leverages the repetitive nature of music

    • Algorithm (Foote, 2001)
      - Step 1: compute the cosine similarity between every pair of frames of the magnitude spectrogram:

        S(i, j) = (V_i · V_j) / (‖V_i‖ ‖V_j‖)

      - Step 2: sum the elements along the diagonals:

        B(l) = Σ_k S(k, k + l)
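The two steps above can be sketched directly (the per-diagonal normalization is my addition so that diagonals of different lengths are comparable):

```python
import numpy as np

def beat_spectrum(S_mag):
    """Beat spectrum from a (frames x bins) magnitude spectrogram.

    Cosine similarity between all frame pairs, then a sum along each
    diagonal of the similarity matrix, normalized by its length.
    """
    norms = np.linalg.norm(S_mag, axis=1, keepdims=True) + 1e-12
    V = S_mag / norms
    S = V @ V.T                     # S(i, j): cosine similarity of frames i, j
    n = S.shape[0]
    # B(l) = sum_k S(k, k + l), divided by the number of terms per diagonal
    return np.array([np.trace(S, offset=l) / (n - l) for l in range(n)])

# A spectrogram that repeats every 4 frames peaks at lags 0, 4, 8, ...
frames = np.tile(np.eye(4), (8, 1))   # 32 frames, period 4
B = beat_spectrum(frames)
```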

  • Beat Spectrum

    • A more robust version can be obtained from the 2D auto-correlation of the similarity matrix:

      B(k, l) = Σ_{i,j} S(i, j) · S(i + k, j + l)

    • The final beat spectrum is derived by summing over one axis
      - The plot shows five beats, with a triplet within a beat

    • A "beat spectrogram" can also be obtained from successive beat spectra

    (Foote, 2001)

  • Tempogram

    • Models the onset function with a sinusoid as the predominant local periodicity (PLP) (Grosche, 2009)

    • Algorithm
      - Step 1: compute the ODF from the half-wave rectified spectral flux
      - Step 2: find the frequency ŵ and phase φ̂ that maximize the correlation with the ODF, and form a local sinusoidal kernel:

        κ(m) = w(m − n) cos(2π(ŵm − φ̂))

      - Step 3: accumulate the successive local sinusoidal kernels to form a PLP curve
      - Step 4: take the DFT or the auto-correlation

  • Tempogram

    • Cyclic tempogram
      - Accumulates the tempogram over integer multiples of a tempo (up to four octaves)
      - Conceptually similar to the chromagram

    (Grosche, 2011)

  • Comb-Filter Banks

    • Also called resonant filter banks
      - Comb filter equation:

        y(n) = x(n) + α y(n − τ)

    • Builds up rhythmic evidence (by anticipation?)

    (Klapuri, 2006)
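The comb filter equation can be sketched directly; a filter whose delay τ matches the pulse period builds up much more energy than a mismatched one, which is how a bank of these filters scores candidate tempi (the feedback gain α and the toy signal are illustrative):

```python
import numpy as np

def comb_filter(x, tau, alpha=0.9):
    """Resonant comb filter: y(n) = x(n) + alpha * y(n - tau)."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        y[n] = x[n] + (alpha * y[n - tau] if n >= tau else 0.0)
    return y

# A pulse train with period 10 resonates in the matching filter only.
x = np.zeros(200)
x[::10] = 1.0
matched = comb_filter(x, tau=10)
mismatched = comb_filter(x, tau=7)
```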

  • Sub-band Resonant Filter Banks

    • Algorithm (Scheirer, 1998)
      - A sub-band filter bank as front-end processing
      - Parallel ODFs for 6 bands
      - 150 resonators per band, covering all candidate tempo values (60 to 240 bpm)
      - Pick the delay that produces the highest peak as the tempo

  • Beat Tracking

    • Estimate the positions of beats in music
      - Usually a subset of the detected onsets, selected according to the tempo

  • Beat Tracking by the Resonator Model

    • Once the resonator model has chosen the tempo that returns the highest peaks, its output is a sequence of resonated peaks
      - These correspond to the beats

    (Scheirer, 1998)

  • Beat Tracking by Dynamic Programming

    • Find the optimal "hopping" path through the music (Ellis, 2007)
      - C({t_i}): score of the beat sequence {t_i}
      - O(t_i): onset strength function (i.e. the ODF)
      - F(Δt, T): tempo (T) consistency score, e.g. F(Δt, T) = −(log(Δt/T))²

      C({t_i}) = Σ_{i=1}^{N} O(t_i) + α Σ_{i=2}^{N} F(t_i − t_{i−1}, T)

  • Finding the Minimum-Cost-Path

    • Naïve approach
      - Enumerate all paths from A to K, calculate the cost of each, and choose the path with the minimum cost
      - As the number of nodes increases, the number of possible paths grows exponentially

    (Figure: a stage-by-stage weighted graph with nodes A through K and per-edge costs)

  • Dynamic Programming (DP)

    • Observation
      - Suppose the minimum-cost path passes through a node p
      - What is the minimum-cost path from A to p?
      - It is just a sub-path of the minimum-cost path from A to K
      - Thus we do not have to compute the cost from scratch; we can reuse the costs computed at the previous nodes

    (Figure: the same weighted graph as above)

  • Dynamic Programming (DP)

    • The minimum cost is computed by the following recurrence:

      C_k(j) = O_k(j) + min_i { C_{k−1}(i) + c_ij }

      - C_k(j): minimum cost up to node j
      - O_k(j): local cost at node j
      - c_ij: transition cost from node i to node j

    • The minimum-cost path is then found by tracing the computation back

    (Figure: the same weighted graph as above)
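The recurrence and traceback can be sketched on a small stage-by-stage graph (this toy example is mine, not the slide's A..K graph):

```python
def min_cost_path(local_cost, trans_cost):
    """DP for C_k(j) = O_k(j) + min_i { C_{k-1}(i) + c_ij }, with traceback.

    local_cost[k][j]: cost of node j in stage k.
    trans_cost[k][i][j]: cost of moving from node i in stage k to node j
    in stage k+1.
    """
    n_stages = len(local_cost)
    C = [list(local_cost[0])]
    back = []                           # back[k-1][j]: best predecessor of j
    for k in range(1, n_stages):
        row, ptr = [], []
        for j in range(len(local_cost[k])):
            cands = [C[k - 1][i] + trans_cost[k - 1][i][j]
                     for i in range(len(C[k - 1]))]
            best_i = min(range(len(cands)), key=cands.__getitem__)
            row.append(local_cost[k][j] + cands[best_i])
            ptr.append(best_i)
        C.append(row)
        back.append(ptr)
    # Trace back from the cheapest final node
    j = min(range(len(C[-1])), key=C[-1].__getitem__)
    total, path = C[-1][j], [j]
    for k in range(n_stages - 1, 0, -1):
        j = back[k - 1][j]
        path.append(j)
    return total, path[::-1]

# Toy graph: 3 stages of 2 nodes each; the cheap route avoids the costly node.
local = [[0, 0], [1, 5], [0, 0]]
trans = [[[1, 2], [4, 1]], [[1, 1], [1, 1]]]
total, path = min_cost_path(local, trans)
```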

  • Applying DP to Beat Tracking

    • Objective to optimize:

      C({t_i}) = Σ_{i=1}^{N} O(t_i) + α Σ_{i=2}^{N} F(t_i − t_{i−1}, T)

    • DP solution
      - Define C*(t) as the best score up to time t and compute it for every t:

        C*(t) = O(t) + max_τ { α F(t − τ, T) + C*(τ) }

      - Also store the time τ that yields the maximum score:

        P(t) = argmax_τ { α F(t − τ, T) + C*(τ) }

      - At the end of the sequence, trace back through P(t), which returns the best beat sequence {t_i}

    (Plot: ODF with the candidate predecessor τ and the current time t marked)
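An Ellis-style DP beat tracker can be sketched as follows; the tempo is assumed known (e.g. from the estimation stage), and the search window, the weight α, and the clamping of negative scores to zero are illustrative choices, not values from the lecture:

```python
import numpy as np

def track_beats(odf, fps, bpm=120.0, alpha=10.0):
    """DP beat tracking: C*(t) = O(t) + max_tau { alpha*F(t - tau, T) + C*(tau) }
    with F(dt, T) = -(log(dt / T))**2, then a traceback through P(t)."""
    T = fps * 60.0 / bpm                 # beat period in frames
    n = len(odf)
    C = odf.astype(float).copy()         # C*(t), initialized to O(t)
    P = np.full(n, -1)                   # P(t): best predecessor, -1 if none
    for t in range(n):
        lo = max(0, t - int(2 * T))      # search tau in roughly [t-2T, t-T/2]
        hi = t - max(1, int(T / 2))
        if hi <= lo:
            continue
        taus = np.arange(lo, hi)
        dt = (t - taus).astype(float)
        score = C[taus] - alpha * np.log(dt / T) ** 2
        best = int(np.argmax(score))
        if score[best] > 0:              # only chain onto a worthwhile predecessor
            C[t] = odf[t] + score[best]
            P[t] = taus[best]
    # Trace back from the best-scoring frame
    beats = [int(np.argmax(C))]
    while P[beats[-1]] >= 0:
        beats.append(int(P[beats[-1]]))
    return beats[::-1]

# Impulses every 0.5 s at 100 frames/sec: the tracker should land on them.
fps = 100
odf = np.zeros(500)
odf[::50] = 1.0
beats = track_beats(odf, fps, bpm=120.0)
```

Exact-period hops cost nothing (F = 0 when Δt = T), so the tracker chains through the impulse frames spaced 50 frames apart.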

  • Example of DP to Beat Tracking

  • References

    • E. Scheirer, "Tempo and Beat Analysis of Acoustic Musical Signals", 1998
    • J. Foote and S. Uchihashi, "The Beat Spectrum: A New Approach to Rhythm Analysis", 2001
    • G. Tzanetakis, "Musical Genre Classification of Audio Signals", 2002
    • A. Klapuri, "Analysis of the Meter of Acoustic Musical Signals", 2006
    • P. Grosche and M. Müller, "Computing Predominant Local Periodicity Information in Music Recordings", 2009
    • P. Grosche and M. Müller, "Cyclic Tempogram: A Mid-Level Tempo Representation for Music Signals", 2010
    • D. Ellis, "Beat Tracking by Dynamic Programming", 2007
    • S. Böck and G. Widmer, "Maximum Filter Vibrato Suppression for Onset Detection", 2013