




Representations 1 SGN-24006

Mid-level representations for audio content analysis
*Slides for this lecture were created by Anssi Klapuri

Sources:
Ellis, Rosenthal, Mid-level representations for Computational Auditory Scene Analysis, IJCAI, 1995.
Schörkhuber et al., A Matlab Toolbox for Efficient Perfect Reconstruction Time-Frequency Transforms with Log-Frequency Resolution, AES, 2014.
Virtanen, Audio Signal Modeling with Sinusoids plus Noise, MSc thesis, TUT, 2001.

Contents
Introduction
Desirable properties of mid-level representations
STFT spectrogram
Constant-Q transform
Sinusoids plus noise
Perceptually-motivated representations

Representations 2 SGN-24006
1 Introduction

The concept of mid-level data representations is useful in characterizing signal analysis systems.
The analysis process can be viewed as a sequence of representations from an acoustic signal ("low level") towards the analysis result ("high level").
Usually intermediate abstraction levels are needed between the two, since high-level information is not readily visible in the raw acoustic signal.
An appropriate mid-level representation functions as an interface for further analysis and facilitates the design of efficient algorithms.

Representations 3 SGN-24006
Desirable properties of mid-level representations

It is natural to ask if a certain mid-level representation is better than others in a given task. Ellis and Rosenthal list several desirable qualities for a mid-level representation:

Component reduction: the number of objects in the representation is smaller and the meaningfulness of each is higher compared to the individual samples of the input signal.
The sound should be decomposed into sufficiently fine-grained elements so as to support sound source separation by grouping the elements to their sound sources.
Invertibility: the original sound can be resynthesized from its representation in a perceptually accurate way.
Psychoacoustic plausibility of the representation.

Representations 4 SGN-24006
Categorization of some mid-level representations

Ellis and Rosenthal classify representations according to three conceptual axes:

choice between fixed and variable bandwidth of the initial frequency analysis
discreteness: is the representation structured as meaningful chunks?
dimensionality of the transform (some possess extra dimensions)

Figure: [Ellis-Rosenthal]


Representations 5 SGN-24006
2 Complex-valued STFT spectrogram

STFT = short-time Fourier transform
The time-domain signal x(n) is transformed into the time-frequency domain by applying the discrete Fourier transform (DFT) in successive time frames:

complex spectra: all information is preserved
the amount of data remains the same

To some extent the STFT spectrogram fulfills the criteria of
supporting sound source separation (sources overlap less in time-frequency than in the time domain)
invertibility: the original sound can be perfectly reconstructed
psychoacoustic plausibility: frequency analysis (albeit in a different form) happens in the auditory system too

Representations 6 SGN-24006
STFT spectrogram

Figure: example signal (music); top: time domain (zoomed in), middle: time domain, bottom: STFT spectrogram (magnitudes).

Representations 7 SGN-24006
Spectrum estimation

The spectrum of audio signals is typically estimated in short consecutive segments, frames.
Why?

the Fourier transform models the signal with stationary sinusoids (constant spectrum)
real audio signals are not stationary but vary through time
framewise processing assumes the signal is time-invariant in short enough frames

For audio signals, the frame length typically varies between 10 ms and 100 ms, depending on the application
for speech signals often 25 ms
Transient-like sounds are difficult to represent and process in the frequency domain
time blurring (but let's see the constant-Q transform later...)

Representations 8 SGN-24006
Windowing

Windowing is essential in frame-wise processing:
weight the signal with a window function w(n) prior to the transform
as a rule of thumb, windowing is always needed: one cannot just take a short part of a signal without windowing

signal in frame m:   x_m(n),  n = 0, ..., N-1
windowed signal:     x_m(n) w(n)
short-time spectrum: X_m(k) = \sum_{n=0}^{N-1} x_m(n) w(n) W_N^{nk},  where W_N = e^{-j 2\pi / N}
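To make the framing and windowing concrete, here is a minimal NumPy sketch; the frame length, hop size, and the choice of a Hann window are illustrative, not values prescribed by the slides:

```python
import numpy as np

def stft_frames(x, frame_len=1024, hop=512):
    """Compute windowed short-time spectra X_m(k) frame by frame."""
    w = np.hanning(frame_len)                   # window function w(n)
    n_frames = 1 + (len(x) - frame_len) // hop
    spectra = []
    for m in range(n_frames):
        x_m = x[m * hop : m * hop + frame_len]  # signal in frame m
        spectra.append(np.fft.rfft(x_m * w))    # DFT of the windowed frame
    return np.array(spectra)                    # shape: (n_frames, frame_len//2 + 1)

# Example: 1 s of a 440 Hz tone at a 16 kHz sample rate
fs = 16000
t = np.arange(fs) / fs
X = stft_frames(np.sin(2 * np.pi * 440 * t))
```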


Representations 9 SGN-24006
Windowing

Example: spectrum of a sinusoid with/without windowing
1. No windowing (= rectangular window), sinusoid at a spectral bin
2. No windowing, random off-bin frequency: spectral blurring!
3. Hanning window, sinusoid at a spectral bin
4. Hanning window, random off-bin frequency: ok

There are different types of windows, but the most important thing is not to forget windowing altogether.
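A small numerical sketch of cases 2 and 4 above (the frequency, frame length, and leakage measure are illustrative): an off-bin sinusoid analysed without windowing spreads energy across many bins, while the same sinusoid with a Hann window stays far more concentrated around its peak.

```python
import numpy as np

N, fs = 1024, 8000
n = np.arange(N)
f = 443.7                                    # off-bin frequency (not a multiple of fs/N)
x = np.sin(2 * np.pi * f * n / fs)

rect = np.abs(np.fft.rfft(x))                # no windowing -> heavy spectral leakage
hann = np.abs(np.fft.rfft(x * np.hanning(N)))

def leakage(mag):
    """Fraction of spectral energy outside the 5 bins around the peak."""
    k = np.argmax(mag)
    inside = np.sum(mag[max(k - 2, 0):k + 3] ** 2)
    return 1 - inside / np.sum(mag ** 2)

print(f"rectangular: {leakage(rect):.3f}, Hann: {leakage(hann):.3f}")
```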

Representations 10 SGN-24006
Windowing in framewise processing

Figure: Hanning windows (Hamming works too)
adjacent windows sum to unity when frames overlap 50% (checked numerically in the sketch at the end of this slide)
all parts of the signal get an equal weight

In each frame, the signal is weighted with the window function and the short-time discrete Fourier transform is calculated.
This yields a spectrogram: a complex spectrum in each frame over time.

Figure: framewise processing (frequency vs. time in ms).
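The unity-sum property can be verified directly; a minimal sketch, assuming the periodic form of the Hann window (for which 50% overlap sums exactly to one away from the signal edges):

```python
import numpy as np

N, hop = 1024, 512                          # 50% overlap
n = np.arange(N)
w = 0.5 - 0.5 * np.cos(2 * np.pi * n / N)   # periodic Hann window

# Overlap-add the window itself over many frames and inspect the interior
total = np.zeros(hop * 12 + N)
for m in range(12):
    total[m * hop : m * hop + N] += w

# Away from the edges the overlapping windows sum to a constant (unity)
print(np.allclose(total[N:-N], 1.0))        # -> True
```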

Representations 11 SGN-24006
Windowing in analysis-synthesis systems

The sine window is useful in analysis-synthesis systems (see figure).
Windowing is done again at resynthesis to avoid artefacts at frame boundaries in the case that the signal is manipulated in the frequency domain.

Figure below: 50% frame overlap leads to perfect reconstruction if nothing is done at the subbands.

Figure: signal in one frame → windowing → DFT → processing at subbands (freq. domain) → inverse DFT → windowing → output (overlap-add frames).

Representations 12 SGN-24006
Reconstructing the time domain signal: overlap-add technique

Reconstructing a signal from its spectrogram:
1. inverse Fourier transform the spectrum of each frame back to the time domain
2. apply windowing in each frame (e.g. sine or Hanning window)
3. successive frames are positioned to overlap 50% or more, and summed sample-by-sample
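A sketch of a full analysis-synthesis round trip implementing these steps. Following the previous slide, a sine window is applied at both analysis and resynthesis with 50% overlap, so the interior of the signal is reconstructed exactly when nothing is done at the subbands; the frame length and hop are illustrative:

```python
import numpy as np

def sine_window(N):
    # sine (sqrt-Hann) window; applied at both analysis and synthesis,
    # 50% overlap gives perfect reconstruction away from the signal edges
    return np.sin(np.pi * (np.arange(N) + 0.5) / N)

def analysis(x, N=1024, hop=512):
    w = sine_window(N)
    return [np.fft.rfft(x[m * hop : m * hop + N] * w)
            for m in range((len(x) - N) // hop + 1)]

def overlap_add(spectra, N=1024, hop=512):
    w = sine_window(N)
    y = np.zeros((len(spectra) - 1) * hop + N)
    for m, X_m in enumerate(spectra):
        y[m * hop : m * hop + N] += np.fft.irfft(X_m, n=N) * w  # steps 1-3
    return y

x = np.random.randn(16 * 512 + 1024)
y = overlap_add(analysis(x))
print(np.allclose(x[1024:-1024], y[1024:-1024]))   # interior reconstructed exactly
```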


Representations 13 SGN-24006
3 Constant-Q transform (CQT)

A time-frequency representation where the frequency bins are uniformly distributed in log-frequency and their Q-factors (ratios of center frequencies to bandwidths) are all equal.
In effect, this means that the frequency resolution is better for low frequencies and the time resolution is better for high frequencies.
Musically and perceptually motivated:
the frequency resolution of the inner ear is approx. constant-Q above 500 Hz
in music (equal temperament), note frequencies (piano notes) obey F_k = 440 Hz · 2^{k/12}

Representations 14 SGN-24006
Constant-Q transform (CQT)

Mathematical definition:

X^{CQT}(k, n) = \sum_{m=0}^{N-1} x(m) g_k(m - n) e^{-j 2\pi m f_k / f_s}

where N is the length of the input signal, g_k(m) is a zero-centered window function that picks one time frame of the signal at point n, and f_s is the sample rate.

Compare the CQT with the short-time Fourier transform (STFT):

X(k, n) = \sum_{m=0}^{N-1} x(m) h(m - n) e^{-j 2\pi m k / N}

where now the window function h(m) is the same for all frequency bins. In the CQT, to achieve constant Q-factors, the support of the window g_k(m) (the time span of significant non-zero values) is inversely proportional to f_k.

In the CQT, the center frequencies are geometrically spaced: f_k = f_0 · 2^{k/B}, where B determines the number of bins per octave and f_0 is the lowest bin. In the DFT, the center frequencies are linearly spaced: f_k = k f_s / N.
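A small sketch of the geometry these formulas imply: geometrically spaced center frequencies, a shared Q-factor, and window lengths inversely proportional to f_k. The sample rate, lowest bin, bins per octave, and the particular Q formula below are example choices, not values given on the slide:

```python
import numpy as np

fs, f0, B = 44100, 55.0, 48                    # sample rate, lowest bin, bins per octave
K = int(np.floor(B * np.log2(fs / 2 / f0)))    # number of bins up to the Nyquist frequency

k = np.arange(K)
f_k = f0 * 2 ** (k / B)                        # geometrically spaced center frequencies
Q = 1 / (2 ** (1 / B) - 1)                     # one common choice; all bins share this Q-factor
N_k = np.round(Q * fs / f_k).astype(int)       # window length per bin, proportional to 1/f_k

print(f_k[:3], f_k[-3:])
print(N_k[:3], N_k[-3:])                       # long windows at low f, short at high f
```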

Representations 15 SGN-24006
Constant-Q transform

Figure: time-domain window functions and the resulting frequency resolution.

Representations 16 SGN-24006
Toolbox for computing CQT

The CQT is essentially a wavelet transform, but with rather high frequency resolution (typically 10-100 bins/octave)
conventional wavelet transform techniques cannot be used
Matlab Toolbox for CQT and ICQT [Schörkhuber et al. 2013]: http://www.cs.tut.fi/sgn/arg/CQT/

Efficient computation is achieved by:
1. FFT of the entire input signal
2. apply a bandpass one CQT frequency bin wide on the (huge) spectrum
3. move the subband around zero
4. inverse-FFT the narrowband spectrum to get the CQT coefficients over time for that bin
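A much-simplified single-bin sketch of that idea; this is not the toolbox implementation (which also decimates the coefficients and uses carefully designed spectral kernels), and the Hann-shaped kernel and Q value here are illustrative assumptions:

```python
import numpy as np

def cqt_bin(x, fs, f_k, Q=17):
    """CQT coefficients over time for a single frequency bin (simplified)."""
    L = len(x)
    X = np.fft.fft(x)                            # 1. FFT of the entire input signal
    c = int(round(f_k * L / fs))                 # FFT bin index of the center frequency
    half = max(int(round(f_k / Q * L / fs)), 1)  # half-width of the bandpass (in FFT bins)
    kern = np.zeros(L)
    idx = np.arange(c - half, c + half + 1)
    kern[idx] = np.hanning(len(idx))             # 2. one-bin-wide bandpass on the spectrum
    Y = np.roll(X * kern, -c)                    # 3. move the subband around zero
    return np.fft.ifft(Y)                        # 4. inverse FFT -> coefficients over time

# Example: coefficients of one bin for a 440 Hz tone
fs = 16000
t = np.arange(fs) / fs
coeffs = cqt_bin(np.sin(2 * np.pi * 440 * t), fs, f_k=440.0)
print(np.abs(coeffs[:5]))
```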


Representations 17 SGN-24006
Toolbox for computing CQT

Due to the way that the CQT is computed by the toolbox, the frequency-domain response of an individual frequency bin can be controlled perfectly (no sidelobes), but the effective time-domain window has sidelobes (that extend over the entire signal).
Figure: the response of one time-frequency element as a function of frequency (left) and as a function of time (right).

Representations 18 SGN-24006

Representations 19 SGN-24006
STFT spectrogram for the same signal (either high or low frequencies blur)

Representations 20 SGN-24006
STFT spectrogram for the same signal (either high or low frequencies blur)


Representations 21 SGN-24006
Constant-Q transform

Representations 22 SGN-24006
Drawbacks of CQT

Reasons why the CQT has not replaced the FFT more widely in audio signal processing:
1. The CQT is computationally more intensive than the DFT spectrogram
2. The CQT produces a data structure that is more difficult to handle than the time-frequency matrix (spectrogram)

Representations 23 SGN-24006
Example application: Pitch shifting

Pitch shifting is a natural operation in the CQT domain:
1. CQT
2. translate the CQT coefficients up or down in frequency
3. retain phase coherence
4. inverse CQT
The toolbox includes PITCHSHIFT.m to implement this.

Examples: original, -6 semitones, +6 semitones
Transients at high frequencies are retained due to the short frames used there.
Time stretching can be done by pitch shifting + resampling.
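The frequency translation in step 2 amounts to shifting the CQT coefficient matrix along the bin axis by B/12 bins per semitone. A rough sketch of that step only (this is not PITCHSHIFT.m, and it ignores the phase-coherence step, so it is only meaningful for a magnitude CQT matrix; the matrix C and B are assumed inputs):

```python
import numpy as np

def shift_bins(C, semitones, B=48):
    """Translate CQT coefficients in frequency by a whole number of bins.
    C: (n_bins, n_frames) CQT magnitude matrix, B: bins per octave."""
    shift = int(round(semitones * B / 12))      # one semitone = B/12 bins
    out = np.zeros_like(C)
    if shift >= 0:
        out[shift:, :] = C[:C.shape[0] - shift, :]   # shift upwards
    else:
        out[:shift, :] = C[-shift:, :]               # shift downwards
    return out
```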

Representations 24 SGN-24006
4 Sinusoids plus noise model (recap from SGN-14006)

Signal model
The signal x(t) is represented with N sinusoids (frequency, amplitude, phase) and a noise residual r(t):

x(t) = \sum_{n=1}^{N} a_n(t) \cos(2\pi f_n(t) t + \theta_n(t)) + r(t)

Additive synthesis
according to the Fourier theorem, any signal can be represented as a sum of sinusoids
this makes sense only for periodic signals, for which the number of sinusoids needed is small
the non-deterministic part would require a large number of sinusoids: use stochastic modeling instead
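A minimal sketch of the sinusoidal part of the model with frame-constant parameters (the residual r(t) is handled separately on the following slides; the example frequencies and amplitudes are illustrative):

```python
import numpy as np

def additive_synthesis(freqs, amps, phases, fs, duration):
    """Sum of N sinusoids with fixed frequency, amplitude, and phase."""
    t = np.arange(int(fs * duration)) / fs
    x = np.zeros_like(t)
    for f_n, a_n, th_n in zip(freqs, amps, phases):
        x += a_n * np.cos(2 * np.pi * f_n * t + th_n)
    return x

# Example: a crude harmonic tone with F0 = 220 Hz
fs = 16000
x = additive_synthesis([220, 440, 660], [1.0, 0.5, 0.25], [0, 0, 0], fs, 0.5)
```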


Representations 25 SGN-24006

Representations 26 SGN-24006

Representations 27 SGN-24006

Representations 28 SGN-24006


Representations 29 SGN-24006

Representations 30 SGN-24006

Representations 31 SGN-24006

Representations 32 SGN-24006
Sinusoids+noise model

Analysis
Block diagram [Virtanen 2001]:
1. detect sinusoids in framewise spectra
2. estimate sinusoid parameters and resynthesize
3. subtract the sinusoids from the original signal
4. model the noise residual

We get:
sinusoid parameters
noise level at different subbands


Representations 33 SGN-24006
Sinusoids+noise model

Detecting and estimating sinusoids
Block diagram: [Virtanen01]
Spectral peaks are interpreted as sinusoids:
1. peak: a local maximum in the magnitude spectrum
2. the peak frequency, amplitude, and phase can be picked from the complex spectrum

Tracking the peaks detected in successive frames gives the parameters of a time-varying sinusoid: a sinusoidal trajectory.

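A sketch of steps 1-2 for a single frame: find local maxima of the magnitude spectrum and read frequency, amplitude, and phase from the complex spectrum. The fixed relative threshold is an illustrative assumption; [Virtanen 2001] uses a more elaborate detector:

```python
import numpy as np

def detect_peaks(X, fs, N, threshold=0.01):
    """X: complex spectrum of one windowed frame (from np.fft.rfft), N: frame length."""
    mag = np.abs(X)
    peaks = []
    for k in range(1, len(mag) - 1):
        if mag[k] > mag[k - 1] and mag[k] > mag[k + 1] and mag[k] > threshold * mag.max():
            peaks.append({
                "freq": k * fs / N,          # peak frequency (bin resolution)
                "amp": mag[k],               # amplitude
                "phase": np.angle(X[k]),     # phase from the complex spectrum
            })
    return peaks
```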

Representations 34 SGN-24006
Sinusoids+noise model

Tracking the peaks
If needed, spectral peaks in successive frames can be associated and joined into time-varying sinusoids:
frequency, amplitude, and phases are joined into curves

Figure: peak tracking algorithm [Virtanen 2001]
based e.g. on the track derivatives; try to form a smooth track
kill: if no continuation is found, end the sinusoid
birth: if a spectral peak is not a continuation of an existing sinusoid, create a new one
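A toy continuation rule in the spirit of the birth/kill logic above; this is not [Virtanen 2001]'s algorithm (which also uses track derivatives to favour smooth tracks), just greedy nearest-frequency matching with an assumed maximum frequency jump:

```python
def track_peaks(frames, max_jump=30.0):
    """frames: list of per-frame peak lists from detect_peaks().
    Returns sinusoidal trajectories as lists of (frame_index, peak) pairs."""
    active, finished = [], []
    for i, peaks in enumerate(frames):
        unused = list(peaks)
        still_active = []
        for track in active:
            last = track[-1][1]
            # nearest unused peak in frequency is the candidate continuation
            cand = min(unused, key=lambda p: abs(p["freq"] - last["freq"]), default=None)
            if cand is not None and abs(cand["freq"] - last["freq"]) <= max_jump:
                track.append((i, cand))
                unused.remove(cand)
                still_active.append(track)
            else:
                finished.append(track)        # kill: no continuation found
        # birth: unmatched peaks start new tracks
        active = still_active + [[(i, p)] for p in unused]
    return finished + active
```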

Representations 35 SGN-24006
Sinusoids+noise model

Synthesis of sinusoids
Additive synthesis:

s(t) = \sum_{n=1}^{N} a_n(t) \cos(2\pi f_n(t) t + \theta_n(t))

Often tracking the peaks is not necessary; instead
synthesize the sinusoids in each frame separately, keeping the parameters fixed within one frame
window the obtained signal with a Hann window
overlap-add

Representations 36 SGN-24006
Sinusoids+noise model

Synthesis, subtraction from the original
Synthesized sinusoids vs. the original signal (upper panel)
Residual obtained from the subtraction (lower panel)


Representations 37 SGN-24006
Sinusoids+noise model

Modeling the noise residual
The residual is obtained by subtracting the synthesized sinusoids from the original signal in the time domain.
The residual signal is analyzed frame-by-frame:
calculate the spectrum R_t(f) in frame t
subdivide the spectrum into 25 perceptual subbands (Bark scale)
calculate the short-time energy at each band b, b = 1, 2, ..., 25:

E_t(b) = \sum_{f \in b} |R_t(f)|^2
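A sketch of the band-energy computation for one residual frame; band_edges is an assumed input listing the band boundaries in Hz (e.g. the Bark-band edges), not something defined on the slide:

```python
import numpy as np

def band_energies(residual_frame, fs, band_edges):
    """Short-time energies E_t(b) of a residual frame within given frequency bands."""
    R = np.fft.rfft(residual_frame * np.hanning(len(residual_frame)))
    freqs = np.fft.rfftfreq(len(residual_frame), d=1.0 / fs)
    E = []
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        band = (freqs >= lo) & (freqs < hi)
        E.append(np.sum(np.abs(R[band]) ** 2))   # E_t(b) = sum over f in band of |R_t(f)|^2
    return np.array(E)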

Representations 38 SGN-24006
Sinusoids+noise model

Noise synthesis from parameters
The noise residual is represented parametrically:
in each frame, store only the short-time energies within the Bark bands, E_t(b)
this modeling can be done because the auditory system is not sensitive to energy changes within one Bark band in the case of noise

Synthesis:
1. generate a magnitude spectrum with |R_t(f)| ∝ \sqrt{E_t(b)}, i.e. the energy within each Bark band is shared uniformly within the band
2. generate random phases
3. inverse Fourier transform to the time domain
4. windowing with a Hann window
5. overlap-add
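A sketch of the synthesis steps for one frame, complementing the analysis sketch above (band_edges and frame_len are assumed inputs; the exact energy normalization depends on the FFT convention, so this is only a sketch of the idea):

```python
import numpy as np

def synthesize_noise_frame(E, fs, band_edges, frame_len):
    """Generate one frame of residual noise from the Bark-band energies E_t(b)."""
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / fs)
    mag = np.zeros_like(freqs)
    for (lo, hi), E_b in zip(zip(band_edges[:-1], band_edges[1:]), E):
        band = (freqs >= lo) & (freqs < hi)
        if band.any():
            mag[band] = np.sqrt(E_b / band.sum())            # 1. share energy uniformly in the band
    phase = np.random.uniform(-np.pi, np.pi, len(mag))       # 2. random phases
    frame = np.fft.irfft(mag * np.exp(1j * phase), n=frame_len)  # 3. back to the time domain
    return frame * np.hanning(frame_len)                     # 4. Hann window (then 5. overlap-add)
```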

Representations 39 SGN-24006
Sinusoids+noise model

Properties
Audio examples: http://www.cs.tut.fi/sgn/arg/music/tuomasv/sinusoids.html
The sinusoids+noise model has several nice properties:

satisfies the component reduction property (see slide 3)
invertibility: the synthesized signal has reasonable quality (transient sounds are problematic)
the model is generic: any sound can be processed
straightforward to compute (especially if peak tracking is skipped)
manipulation such as time stretching and pitch shifting is easy

The representation also supports sound source separation to some extent, see the next slides.

Representations 40 SGN-24006
Intelligent component grouping

Auditory organization in humans has been found to depend on certain acoustic cues.
Two components may be grouped, i.e., associated to a common sound source, by:
1. Spectral proximity (time, frequency)
2. Harmonic concordance
frequencies of the components in integral ratios
such components are deduced to be produced by a common source
3. Synchronous changes of the components
common onset / offset
common AM / FM modulation
equidirectional movement in the spectrum

4. Spatial proximity (angle of arrival)

Cues may compete and conflict


Representations 41 SGN-24006
Example: grouping sinusoids

The sinusoidal model reveals the cues better than the time-domain signal.

Figure: the original signal and the extracted sinusoids ("original" / "sines").

Representations 42 SGN-24006
Sound separation

Example: grouping implemented
Estimate the perceptual distance between each pair of spectral components
the sinusoids are then classified into groups

Representations 43 SGN-24006
Sound separation

Example: grouping implemented
The classified sets of sinusoids can be synthesized separately.

Representations 44 SGN-24006
5 Perceptually-motivated representations

Peripheral hearing:
1. Frequency selectivity of the inner ear
bank of linear bandpass filters: auditory channels
2. Mechanical-to-neural transduction
compression, rectification, lowpass filtering
detailed models exist, too

In the brain, for pitch processing:
3. Periodicity analysis within channels
4. Combination across channels

Between-channel phase differences do not affect the pitch percept.

Figure: the signal in the auditory nerve; processing in the brain is not directly observable (yet).


Representations 45 SGN-24006
Perceptually-motivated representations

The signal traveling in the auditory nerve fibers from the auditory periphery to the brain can be viewed as a mid-level representation.
The idea of using the same data representation as the human auditory system is very appealing.
The auditory periphery is quite accurately known.

Representations 46 SGN-24006
Auditory filterbank

Band-wise processing is an inherent part of hearing.
Figure: frequency responses (top) and impulse response (bottom) of a few auditory filters; gammatone filters [Slaney93].

Bandwidths are proportional to the center frequency:

b_c = 0.108 f_c + 24.7 Hz
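For example, the bandwidths this formula gives at a few center frequencies (in a gammatone filterbank these would set the filter bandwidths):

```python
import numpy as np

f_c = np.array([100.0, 500.0, 1000.0, 4000.0])   # center frequencies in Hz
b_c = 0.108 * f_c + 24.7                          # bandwidths proportional to f_c
print(b_c)                                        # 35.5, 78.7, 132.7, 456.7 Hz
```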

Representations 47 SGN-24006
Mechanical-to-neural transduction

Simplified model:
a. Compression (and level adaptation)
b. Half-wave rectification
c. Lowpass filtering

Compression:
memoryless: scale the signal with a factor

a_c = (\sigma_c)^{\nu - 1}

where \sigma_c is the standard deviation of the signal within channel c
spectral flattening (whitening) results when 0 < \nu < 1
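A sketch of the three steps for one channel signal; the compression exponent and the lowpass cutoff below are illustrative assumptions, not values given on the slide, and SciPy is used for the lowpass filter:

```python
import numpy as np
from scipy.signal import butter, lfilter

def transduction(x_c, fs, nu=0.5, cutoff=1000.0):
    """Simplified mechanical-to-neural transduction for one channel signal x_c."""
    a_c = np.std(x_c) ** (nu - 1)           # a. memoryless compression factor a_c = sigma_c^(nu-1)
    y = a_c * x_c
    y = np.maximum(y, 0.0)                  # b. half-wave rectification
    b, a = butter(2, cutoff / (fs / 2))     # c. lowpass filtering
    return lfilter(b, a, y)
```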

Representations 48 SGN-24006
Mechanical-to-neural transduction

Half-wave rectification
Half-wave rectification within subbands produces:
the input partials + beating partials (at the frequency intervals between the input partials)


Representations 49 SGN-24006
Mechanical-to-neural transduction

Half-wave rectification
Rectification maps the contribution of higher-order partials to the position of the F0 and its few multiples in the spectrum.
The extent to which harmonic h is mapped to the position of the fundamental increases as a function of h.

Figure panels: harmonic sound; amplitude-modulated noise.

Representations 50 SGN-24006
Autocorrelation within channels

Autocorrelation: periodicity analysis within each channel
Summary autocorrelation: combining across channels
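A sketch of the two operations for one time frame: short-time autocorrelation within each channel, then a summary autocorrelation obtained by summing across channels. Here x_bands is an assumed array of per-channel signals (e.g. gammatone filterbank outputs after the transduction step), and the frame indexing is illustrative:

```python
import numpy as np

def frame_acf(frame, max_lag):
    """Short-time autocorrelation r(tau) of one frame for lags 0..max_lag-1."""
    return np.array([np.sum(frame[:len(frame) - tau] * frame[tau:])
                     for tau in range(max_lag)])

def summary_acf(x_bands, start, frame_len, max_lag):
    """Sum the per-channel ACFs at time index `start` across all channels."""
    acfs = np.array([frame_acf(x_c[start:start + frame_len], max_lag)
                     for x_c in x_bands])
    return acfs.sum(axis=0)                 # summary autocorrelation across channels
```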

Representations 51 SGN-24006
Correlogram

Performing periodicity analysis within critical bands produces a three-dimensional volume r_c(n, τ), for channel c at time n and lag τ.
The figure below illustrates the correlogram.

The input signal was a trumpet sound with F0 ≈ 260 Hz (period 3.8 ms)
left: the 3D correlogram volume
middle: the zero-lag face of the correlogram (= power spectrogram)
right: one time slice of the volume, from which the summary ACF can be obtained by summing over frequency