
advanced spectral processing

Jordi Janer Music Technology Group Universitat Pompeu Fabra, Barcelona jordi.janer @ upf.edu

CDSIM – UPF May 2014 http://mtg.upf.edu/

Outline
1.  Introduction to spectral processing
2.  Decomposing sound signals

1- Introduction to spectral processing


Simple Periodic Waves (sine waves)

[Figure: sine wave plotted over 0 to 0.02 s, amplitude between –0.99 and 0.99]

•  Characterized by:
   •  period T
   •  amplitude A
   •  phase φ

•  Fundamental frequency in cycles per second, or Hz: F0 = 1/T

y = A·sin(2πF0t + φ), so y(0) = A·sin(φ)

(Many slides come from materials from Dan Jurafsky)


Simple periodic waves

•  Frequency: 5 cycles in 0.5 seconds = 10 cycles/second = 10 Hz
•  Amplitude: 1
•  Phase: at time 0 seconds, y(0) = A·sin(2π·10·0 + φ) = sin(φ) = 0 ⇒ φ = πk, k ∈ ℤ; we take φ = 0
•  Equation:

y(t) = A·sin(20πt)


(more) Basic facts about sound waves

f = c/λ

•  where c = speed of sound, and λ = wavelength in meters

•  c = 34,400 cm/s (≈345 m/s) at 21 degrees Celsius at sea level

•  Example: with λ = 10 m, frequency f = 34.5 Hz


Speech sound waves

•  A little piece from the waveform of a vowel
•  Y axis:
   –  Amplitude = amount of air pressure at that time point
      •  Positive is compression
      •  Zero is normal air pressure
      •  Negative is rarefaction (expansion)

•  X axis: time.


Fundamental frequency

•  The fundamental frequency (or F0) is the lowest frequency of a periodic (voiced) waveform, produced by any particular instrument (our vocal folds are like a “complicated” instrument)

•  It is also called the first harmonic, while its integer multiples are called the second, third, etc. harmonics

[Figure: waveforms of the fundamental frequency (first harmonic) and of the 2nd to 7th harmonics]


Fundamental frequency

In speech, see for example the waveform of a vowel

•  The fundamental frequency can be computed as the number of repetitions per second of the wave:
   –  The vowel above has 10 repetitions in 0.03875 s, so the frequency is 10/0.03875 ≈ 258 Hz

•  This is the rate at which the vocal folds vibrate, hence the voicing rate

•  Each peak corresponds to an opening of the vocal folds


Pitch

•  Pitch is defined as the perceived fundamental frequency of a sound

•  F0 and pitch are different concepts:
   –  F0 corresponds to a physically measurable frequency
   –  Pitch corresponds to a perceived frequency

•  The relationship between pitch and F0 is not linear
   –  Human pitch perception is most accurate between 100 Hz and 1000 Hz
      •  Linear in this range: at F0₁ = 200 Hz, if Pitch₂ = Pitch₁/2 then F0₂ ≈ 100 Hz
      •  Logarithmic above 1000 Hz: at F0₁ = 5 kHz, if Pitch₂ = Pitch₁/2 then F0₂ < 2 kHz

•  Still, in the literature F0 and pitch are often treated as the same
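The two numeric examples above can be reproduced with a perceptual frequency scale. The sketch below assumes the mel scale with one commonly used formula, m = 2595·log10(1 + f/700); the slides do not name a scale, so both the scale and the helper names are assumptions made for illustration.

```python
import math

def hz_to_mel(f_hz):
    """Frequency in Hz -> mels (assumed formula, not taken from the slides)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def half_pitch_frequency(f_hz):
    """Frequency whose mel value is half the mel value of f_hz."""
    return mel_to_hz(hz_to_mel(f_hz) / 2.0)

for f in (200.0, 5000.0):
    print(f"{f:.0f} Hz -> half the pitch at about {half_pitch_frequency(f):.0f} Hz")
```

Under this assumed scale, halving the pitch of a 200 Hz tone lands close to 100 Hz, while halving the pitch of a 5 kHz tone lands well below 2 kHz, in line with the two bullets.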


F0 tracking

•  F0 can be computed using several techniques, and using tools like Praat
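One of the simplest such techniques is to pick the strongest peak of the autocorrelation function. The following is a minimal sketch of that idea, not the method used by Praat; the function name, the search range (60 to 500 Hz) and the synthetic test signal are assumptions.

```python
import numpy as np

def estimate_f0_autocorr(x, fs, fmin=60.0, fmax=500.0):
    """Estimate F0 (Hz) of one frame from the highest autocorrelation peak."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    # Autocorrelation for non-negative lags 0..N-1
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]
    lag_min = int(fs / fmax)          # shortest period considered
    lag_max = int(fs / fmin)          # longest period considered
    lag = lag_min + int(np.argmax(ac[lag_min:lag_max + 1]))
    return fs / lag

# Synthetic vowel-like frame: 258 Hz fundamental plus two harmonics (assumed test data)
fs = 16000
t = np.arange(int(0.05 * fs)) / fs
x = (np.sin(2 * np.pi * 258 * t)
     + 0.5 * np.sin(2 * np.pi * 516 * t)
     + 0.25 * np.sin(2 * np.pi * 774 * t))
f0 = estimate_f0_autocorr(x, fs)
print(f"estimated F0: {f0:.1f} Hz")
```

The estimate is quantized to integer lags, so its resolution is limited by the sampling rate; practical trackers interpolate around the peak and add voicing decisions.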


Frequency analysis

•  Waves have different frequencies

[Figure: two sine waves, 100 Hz and 1000 Hz, each plotted over 0 to 0.02 s with amplitude between –0.99 and 0.99]


Frequency analysis

•  Complex waves: adding a 100 Hz and a 1000 Hz wave together

[Figure: sum of the two waves plotted over 0 to 0.05 s, amplitude between –0.9654 and 0.99]


Spectrum

Frequency components (100 and 1000 Hz) on the x-axis

[Figure: amplitude versus frequency in Hz, with peaks at 100 and 1000 Hz]
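The spectrum of the 100 Hz + 1000 Hz mixture can be checked numerically. This is a small sketch using NumPy's real FFT; the sampling rate, duration and amplitudes are assumptions chosen so that both tones fall exactly on FFT bins.

```python
import numpy as np

fs = 8000                                   # sampling rate in Hz (assumed)
t = np.arange(fs) / fs                      # exactly 1 second of samples
x = 0.5 * np.sin(2 * np.pi * 100 * t) + 0.5 * np.sin(2 * np.pi * 1000 * t)

X = np.fft.rfft(x)                          # spectrum for non-negative frequencies
freqs = np.fft.rfftfreq(len(x), d=1 / fs)   # frequency of each bin, in Hz
amps = 2 * np.abs(X) / len(x)               # scale so a sine of amplitude A reads A

# The two largest bins are the frequency components of the complex wave
peaks = np.sort(freqs[np.argsort(amps)[-2:]])
print(peaks)
```

With a 1 s signal the bins are 1 Hz apart, so the two components show up as two isolated bins and everything else is numerically zero.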


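Fourier transform analysis

•  Fourier analysis: any wave can be represented as the (possibly infinite) sum of sine waves of different frequencies, each with its own amplitude and phase

•  For continuous signals:

\[ X(f) = \int_{-\infty}^{\infty} x(t)\, e^{-j 2\pi f t}\, dt \]

•  For discrete signals:

\[ X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-j 2\pi k n / N}, \qquad k = 0, \dots, N-1 \]

When N is finite (and relatively short) we call the result the short-term spectrum; computing it over successive windows of the signal gives the short-time Fourier transform (STFT)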
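As a sanity check on the discrete case, the sum over N samples can be written directly as a matrix product and compared with an FFT routine. This is a didactic sketch (O(N²) cost); the function name `dft` is a hypothetical helper.

```python
import numpy as np

def dft(x):
    """Direct O(N^2) discrete Fourier transform: X[k] = sum_n x[n] e^{-j 2 pi k n / N}."""
    x = np.asarray(x, dtype=complex)
    N = len(x)
    n = np.arange(N)            # sample index
    k = n.reshape(-1, 1)        # frequency-bin index, as a column
    return np.exp(-2j * np.pi * k * n / N) @ x

x = np.random.default_rng(0).standard_normal(64)
matches_fft = np.allclose(dft(x), np.fft.fft(x))
print(matches_fft)
```

The FFT computes the same values in O(N log N), which is why it is the routine used in practice for short-term spectra.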


Spectrum example

•  Spectrum of one instant in an actual soundwave: many components across the frequency range

•  Each frequency component of the wave is separated

[Figure: magnitude spectrum, frequency from 0 to 5000 Hz on the x-axis, magnitude from 0 to 40 dB on the y-axis]


Formants

•  Formants are defined as the peaks of the sound spectrum envelope
•  Formants are independent of F0, as they are defined over the envelope of the spectrum
•  They are created by the passage of the sound through the vocal tract


Seeing formants: the spectrogram


Example

What about helium voice? … http://www.phys.unsw.edu.au/jw/speechmodel.html

1.  Introduction to acoustic signals
2.  Spectral analysis
3.  Applications of spectral processing


Spectrogram


Spectrogram

•  Time-frequency representation
•  Short-time windowing
•  Fast Fourier Transform (FFT)
•  Available tools:
   –  Sonic Visualiser (for music analysis)
   –  Praat (for speech analysis)

•  Other resources:
   –  Live spectrogram: http://labrosa.ee.columbia.edu/expo/
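The three ingredients listed above (short-time windowing, FFT, time-frequency layout) fit in a few lines. The sketch below is a minimal illustration and not what Sonic Visualiser or Praat do internally; the window length, hop size and the chirp test signal are assumptions.

```python
import numpy as np

def spectrogram(x, fs, win_len=1024, hop=256):
    """Magnitude spectrogram in dB: one Hann-windowed FFT per hop."""
    window = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[i * hop: i * hop + win_len] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1))            # (n_frames, win_len//2 + 1)
    times = (np.arange(n_frames) * hop + win_len / 2) / fs   # frame centres, in s
    freqs = np.fft.rfftfreq(win_len, d=1 / fs)               # bin frequencies, in Hz
    return times, freqs, 20 * np.log10(mag + 1e-12)

# Test signal: linear chirp whose instantaneous frequency is 200 + 800*t Hz
fs = 8000
t = np.arange(2 * fs) / fs
chirp = np.sin(2 * np.pi * (200 * t + 0.5 * 800 * t ** 2))

times, freqs, S_db = spectrogram(chirp, fs)
peak_track = freqs[np.argmax(S_db, axis=1)]   # strongest frequency in each frame
print(peak_track[0], peak_track[-1])
```

Plotting `S_db` with time on the x-axis and frequency on the y-axis would show the chirp as a rising diagonal line.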


Window size

•  Understanding time-frequency resolution
   –  Long windows: good frequency resolution
   –  Short windows: good temporal resolution
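The trade-off can be observed with two near tones. In the sketch below, two tones 10 Hz apart are analysed with a short and a long Hann window, counting the spectral peaks found between 400 and 500 Hz. The tone frequencies, window lengths and the half-of-maximum peak threshold are all assumptions made for the illustration.

```python
import numpy as np

def count_peaks(x, fs, win_len, fmin=400.0, fmax=500.0):
    """Count strong local maxima of one windowed magnitude spectrum in [fmin, fmax]."""
    frame = x[:win_len] * np.hanning(win_len)
    mag = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(win_len, d=1 / fs)
    inner = mag[1:-1]
    # Local maximum that is also at least half as strong as the global maximum
    is_peak = (inner > mag[:-2]) & (inner >= mag[2:]) & (inner > 0.5 * mag.max())
    in_band = (freqs[1:-1] >= fmin) & (freqs[1:-1] <= fmax)
    return int(np.sum(is_peak & in_band))

fs = 8000
t = np.arange(fs) / fs
two_tones = np.sin(2 * np.pi * 440 * t) + np.sin(2 * np.pi * 450 * t)

n_short = count_peaks(two_tones, fs, win_len=256)    # 32 ms window, bins 31.25 Hz apart
n_long = count_peaks(two_tones, fs, win_len=4096)    # 512 ms window, bins about 2 Hz apart
print(n_short, n_long)
```

The short window merges both tones into a single peak, while the long window separates them, at the price of averaging over half a second of signal.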


Observing test signals

•  Two near tones
•  Noise burst
•  Chirp
•  Pure tones
•  Harmonic richness (square/saw)
•  Low tone

Sonic Visualiser: http://mtg.upf.edu/~jjaner/teaching/CDSIM2014/Test_various_signals.wav


Applications of spectral processing

technologies for the synthesis of sound and music

technologies for the analysis of sound and music

technologies for the transformation of sound and music


Analysis

•  Skore – automatic singing voice rating


Transforming signals

•  Approaches for spectral transformations:
   –  SMS: http://mtg.upf.edu/sms
   –  Phase Vocoder

•  Basic transformations
   –  Pitch transposition
   –  Harmonic/noise decomposition
   –  Time-stretching

(Matlab internal MTG software)


Transforming signals

•  Basic transformations
   –  Original
   –  Pitch transposition
   –  Harmonic/noise decomposition
   –  Time-stretching (50x)


Transformation

•  Time scaling
   –  Detection of transients
   –  Repetition/removal of spectral frames
   –  Demo: Fast Remixing
      •  Original, fast time-varying remix

•  Swing detection
   –  Tempo detection at 8th-note level
   –  Change swing factor
   –  Demo: video


Synthesis

•  Sample-based (Violin)
   –  Gesture modelling to provide a more realistic synthesis

•  Voice-driven synthesis
   –  Voice analysis is used to control the synthesis of a violin sound

2- Decomposing sound signals

Signal decomposition and Source separation


source separation

The objective

•  Music is distributed as mixdowns in various formats
•  Users aim to further manipulate music signals in multiple application contexts (karaoke, soloing, remixing, etc.)

* from multitrack originals

The problem

•  Music signals are complex
•  Variety of music styles and instrumentations
•  Modern production techniques go beyond linear combination of recorded acoustic sources
   –  (FXs, digital synths, etc.)

Background I

Existing generic SS approaches:
•  Spectral subtraction
   –  Intuitive
   –  Well-studied (industrial interest)
   –  Good for speech/stationary noise reduction
   –  Less appropriate for music signals
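A minimal sketch of magnitude spectral subtraction follows, assuming a noise-only excerpt is available to estimate the average noise spectrum. The function names, the 50%-overlap periodic Hann window and the spectral floor value are assumptions; this is one basic variant, not a specific published implementation.

```python
import numpy as np

def stft(x, window, hop):
    n = len(window)
    n_frames = 1 + (len(x) - n) // hop
    return np.stack([np.fft.rfft(x[i * hop: i * hop + n] * window)
                     for i in range(n_frames)])

def istft(X, hop, n, length):
    """Overlap-add; exact in the interior for a periodic Hann window at 50% overlap."""
    y = np.zeros(length)
    for i, spec in enumerate(X):
        y[i * hop: i * hop + n] += np.fft.irfft(spec, n)
    return y

def spectral_subtraction(noisy, noise_only, win_len=512, floor=0.05):
    hop = win_len // 2
    window = np.hanning(win_len + 1)[:-1]        # periodic Hann: shifted copies sum to 1
    noise_mag = np.abs(stft(noise_only, window, hop)).mean(axis=0)
    X = stft(noisy, window, hop)
    # Subtract the average noise magnitude, keep a small floor, reuse the noisy phase
    mag = np.maximum(np.abs(X) - noise_mag, floor * np.abs(X))
    return istft(mag * np.exp(1j * np.angle(X)), hop, win_len, len(noisy))

# Assumed test case: a 440 Hz tone in stationary white noise
rng = np.random.default_rng(0)
fs = 8000
t = np.arange(fs) / fs
clean = 0.5 * np.sin(2 * np.pi * 440 * t)
noisy = clean + 0.1 * rng.standard_normal(fs)
denoised = spectral_subtraction(noisy, 0.1 * rng.standard_normal(fs // 2))

inner = slice(512, 7000)                          # skip the partially covered edges
err_before = np.mean((noisy - clean)[inner] ** 2)
err_after = np.mean((denoised - clean)[inner] ** 2)
print(err_before, err_after)
```

The approach relies on the noise spectrum staying roughly constant over time, which is why it suits stationary noise better than the time-varying sources found in music.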

Background II

Existing music-specific approaches I:
•  Pan-frequency masks
   o  Assumes non-overlapping signals in time-frequency bins
   o  Stereo signals are required
   o  Amplitude ratio between L and R FFT bins
   o  2D user interface

•  Examples
   o  Good for simple excerpts
   o  Bad for complex mixes

* Loses brightness, vocals less reduced due to reverb, flute is also removed, …
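The amplitude-ratio idea can be sketched on a toy stereo mix. For brevity the example below uses a single FFT over the whole signal; a real system would apply the same mask per STFT frame. The pan index definition, the mask width and the two synthetic sources are assumptions.

```python
import numpy as np

def pan_mask_mute(left, right, pan_center=0.5, width=0.1):
    """Remove bins whose pan index lies within `width` of `pan_center`."""
    L = np.fft.rfft(left)
    R = np.fft.rfft(right)
    # Pan index per bin: 0 = hard left, 0.5 = centre, 1 = hard right
    pan = np.abs(R) / (np.abs(L) + np.abs(R) + 1e-12)
    keep = np.abs(pan - pan_center) > width
    n = len(left)
    return np.fft.irfft(L * keep, n), np.fft.irfft(R * keep, n)

# Toy mix: a centred "voice" (440 Hz) and a left-panned "guitar" (1000 Hz)
fs = 8000
t = np.arange(fs) / fs
voice = np.sin(2 * np.pi * 440 * t)
guitar = np.sin(2 * np.pi * 1000 * t)
left = 0.5 * voice + 0.9 * guitar
right = 0.5 * voice + 0.1 * guitar

out_left, out_right = pan_mask_mute(left, right)   # mute whatever sits in the centre
print(np.max(np.abs(out_left - 0.9 * guitar)))
```

This toy case works perfectly because the two sources never share a bin; with reverb or overlapping partials the pan index of a bin is a blend of several sources, which matches the limitations noted above.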

Background III

Existing music-specific approaches II:
•  Non-negative Matrix Factorization (NMF)
   –  Magnitude spectrogram V (non-negative)
   –  Decomposition as matrix product: V ≈ W·H
   –  W (spectral basis) and H (gain activations over time)
   –  Each spectrum frame is explained as a linear combination of R basis vectors
   –  Minimization problem that finds W and H: min D(V, WH)
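The minimization can be sketched with multiplicative updates. The slides leave the distance D unspecified; the example below assumes the Euclidean distance and the classic multiplicative update rules for it, with random initialization. The function name and test matrix are hypothetical.

```python
import numpy as np

def nmf(V, R, n_iter=300, seed=0):
    """Factorize non-negative V (F x T) as W (F x R) @ H (R x T), Euclidean cost."""
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, R)) + 1e-3
    H = rng.random((R, T)) + 1e-3
    eps = 1e-12                      # avoids division by zero
    for _ in range(n_iter):
        # Multiplicative updates keep W and H non-negative at every step
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Synthetic "spectrogram" built from 3 non-negative basis spectra (assumed data)
rng = np.random.default_rng(42)
V = rng.random((20, 3)) @ rng.random((3, 30))

W5, H5 = nmf(V, R=3, n_iter=5)
W, H = nmf(V, R=3, n_iter=300)
err_early = np.linalg.norm(V - W5 @ H5)
err_final = np.linalg.norm(V - W @ H)
print(err_early, err_final)
```

Each column of W plays the role of one spectral basis and each row of H its gain over time; other choices of D (for example Kullback-Leibler divergence) lead to different update rules.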

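[Figure: magnitude spectrogram V factorized into spectral basis W and activations H]

NMF details

•  Non-negative Matrix Factorization I
•  3 spectral bases in W

[Figure: example with 1 overlapping note; H: activation gains]

NMF details

•  Non-negative Matrix Factorization I
•  3 spectral bases in W

[Figure: example with 2 overlapping notes; H: activation gains]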

NMF challenges

•  Predominant instrument separation
   –  (pitch/timbre analysis)

•  Completeness of instrument removal
   –  (attack/sustain, residual/breathing noise, unvoiced consonants, …)

•  Percussive instruments separation
   –  (transient detection, wideband spectrum)

•  Polyphonic instrument separation
   –  (blind and score-informed)

Vocals/Background separation

•  “Music print” decomposition:
   –  song containing a region without the target (e.g. vocals)
   –  basis model learnt from the user-selected “music print”

[Figure: song timeline showing the music print (region without vocals) and a region with vocals]

•  “Music print” decomposition:
   –  Demos:

[Diagram: the background excerpt is decomposed as W·H to obtain Wbgd; the input is decomposed as [Wbgd, Wother]·[Hbgd, Hother]; Wiener filtering with (Wbgd·Hbgd)/(W·H) gives the output]

[Audio demos: original, output (mute)]


•  “Music print” decomposition:
   –  Demos:

[Diagram: the background excerpt is decomposed as W·H to obtain Wbgd; the input is decomposed as [Wbgd, Wother]·[Hbgd, Hother]; Wiener filtering with (Wother·Hother)/(W·H) gives the output]

[Audio demos: original, output (solo)]
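The pipeline in the two diagrams can be sketched on synthetic magnitude spectrograms: learn Wbgd from the music print, keep it fixed while fitting the mixture with extra free bases, then build Wiener-style soft masks. The Euclidean multiplicative updates, the function names and the toy data are assumptions; no audio resynthesis is included.

```python
import numpy as np

EPS = 1e-12

def nmf_partial(V, W_fixed, n_free, n_iter=300, seed=0):
    """Fit V ~ [W_fixed, W_free] @ H, updating only W_free and H."""
    rng = np.random.default_rng(seed)
    F, T = V.shape
    k = W_fixed.shape[1]
    W = np.hstack([W_fixed, rng.random((F, n_free)) + 1e-3])
    H = rng.random((k + n_free, T)) + 1e-3
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + EPS)
        W[:, k:] *= (V @ H[k:].T) / ((W @ H) @ H[k:].T + EPS)   # fixed columns untouched
    return W, H

def wiener_masks(W, H, k):
    """Soft masks: share of the model explained by the first k (background) bases."""
    mask_bgd = (W[:, :k] @ H[:k]) / (W @ H + EPS)
    return mask_bgd, 1.0 - mask_bgd

# Toy data: background lives in bins 0-9, "vocals" in bins 10-19
rng = np.random.default_rng(1)
F, T = 20, 40
W_b = np.zeros((F, 2)); W_b[:10] = rng.random((10, 2))
W_v = np.zeros((F, 1)); W_v[10:] = rng.random((10, 1)) + 0.5
V_print = W_b @ rng.random((2, T))               # region without vocals
bgd_true = W_b @ rng.random((2, T))
V_mix = bgd_true + W_v @ rng.random((1, T))      # region with vocals

W_bgd, _ = nmf_partial(V_print, np.zeros((F, 0)), n_free=2)   # step 1: learn the print
W, H = nmf_partial(V_mix, W_bgd, n_free=1)                    # step 2: fit the mixture
mask_bgd, mask_other = wiener_masks(W, H, k=2)                # step 3: soft masks

bgd_est = mask_bgd * V_mix        # "mute" output: background only
solo_est = mask_other * V_mix     # "solo" output: everything the print cannot explain
err_mix = np.linalg.norm(V_mix - bgd_true)
err_est = np.linalg.norm(bgd_est - bgd_true)
print(err_mix, err_est)
```

In a full system the masks would multiply the complex STFT of the input before inversion, so that the mute and solo signals keep the mixture phase.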


•  “Music print” decomposition:
   –  not always possible…
      •  accompaniment (music print) changes throughout the song
      •  target always present in some sections


•  Solution → Predominant pitch detection
   –  e.g. MELODIA (J. Salamon, MTG)

•  Separation → Binary mask from pitch information
   –  Simplest approach
   –  Time-frequency mask: 1’s at harmonic positions, 0’s elsewhere
   –  Can be combined with pan-frequency mask
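Given a per-frame pitch track, the binary mask can be built as below. This sketch assumes the F0 values are already available as an array (for instance from a predominant-pitch tracker, which is not called here); the function name and the tolerance around each harmonic are hypothetical.

```python
import numpy as np

def harmonic_binary_mask(f0_track, freqs, half_width_hz=20.0):
    """mask[t, k] = 1 if freqs[k] is within half_width_hz of a harmonic of f0_track[t]."""
    mask = np.zeros((len(f0_track), len(freqs)))
    for t, f0 in enumerate(f0_track):
        if f0 <= 0:                       # unvoiced frame: select nothing
            continue
        harmonics = np.arange(f0, freqs[-1] + f0, f0)      # f0, 2*f0, 3*f0, ...
        dist = np.abs(freqs[:, None] - harmonics[None, :]).min(axis=1)
        mask[t] = dist <= half_width_hz
    return mask

fs, win_len = 8000, 1024
freqs = np.fft.rfftfreq(win_len, d=1 / fs)
f0_track = np.array([0.0, 200.0, 220.0])   # assumed pitch values in Hz, 0 = unvoiced

mask = harmonic_binary_mask(f0_track, freqs)
bin_400 = int(np.argmin(np.abs(freqs - 400.0)))   # 2nd harmonic of 200 Hz
bin_300 = int(np.argmin(np.abs(freqs - 300.0)))   # halfway between harmonics
print(mask[1, bin_400], mask[1, bin_300])
```

Multiplying the mixture STFT by `mask` gives the solo version and multiplying by `1 - mask` gives the mute version; any other source with energy at the same harmonic positions is affected too.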

•  Demos
   •  Voice is properly removed/attenuated
   •  Bass guitar is “comb-filtered”, and horns attenuated
   •  Soloing produces more artifacts

[Audio demos: original, mute, solo]


Vocals/Background removal

Advanced separation approaches
Special treatment for vocals: source/filter models

Breathiness residual (noise added on formant envelope)

Demos: Original; Solo version without residual; Solo version with residual

Vocals/Background removal

Advanced separation approaches
Special treatment for vocals

Breathiness residual (noise added on formant envelope)
Unvoiced fricative modelling: /s/, /f/, /sh/, …

•  supervised basis from solo phoneme recordings
   o  Demos: Original; Solo version where /s/ are missing; Solo version where /s/ are present

[Figure: spectrogram of the fricative recording used to train the spectral basis.]

Piano decomposition/retouch

•  Using instrument-specific NMF dictionaries
   –  Piano model of 88 notes (W matrix is pre-learned).

•  Retouch use case:
   –  Amateur recording with errors.
   –  The user can select and correct individual notes after decomposition/separation.

Demos: Original (played with errors); Separated notes; Corrected remix; Original (ref)

Score-informed separation

•  Multiple sources in an orchestral recording
•  Score data is used to initialize the activations matrix H

•  Video demo:
•  Isolated instruments: violin, cello, oboe, bassoon, flute

Other potential applications

•  Singer replacement
   –  Demos: Original; Vocals mute; Vocaloid Clara; Vocaloid Clara mix

•  Drums enhancement
   –  Demos: Original; Drums +6 dB; Drums −6 dB

•  Step-remixer for drums
   –  user-supervised transients (onset time and instrument)
   –  Demos: Original; All drums; Single instrument

Other potential applications (piano)

•  Mono-to-stereo upmixing
•  Input
   –  Mozart K331 recording (RWC dataset)

•  Output
   –  Upmixing from mono
      •  left/right hands are panned in stereo

Other potential applications (piano)

•  Automatic accompaniment
•  Input
   –  Mozart K331 recording (RWC dataset)

•  Output
   •  automatic object detection
   •  String ensemble resynthesis

Demos: synth solo (Kontakt); mixture

Thanks!

Jordi Janer Music Technology Group Universitat Pompeu Fabra, Barcelona jordi.janer @ upf.edu

CDSIM – UPF May 2014 http://mtg.upf.edu/~jjaner
