Advanced spectral processing

Jordi Janer
Music Technology Group, Universitat Pompeu Fabra, Barcelona
jordi.janer@upf.edu

CDSIM – UPF, May 2014
http://mtg.upf.edu/


Outline

1. Introduction to spectral processing
2. Decomposing sound signals

1 - Introduction to spectral processing

CDSIM  UPF  –  May  2014  

Simple Periodic Waves (sine waves)

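[Figure: a sine wave over 0 to 0.02 s with amplitude between -0.99 and 0.99; the period T and the amplitude A are marked]

• Characterized by:
  – period: T
  – amplitude: A
  – phase: φ
• Fundamental frequency, in cycles per second (Hz): F0 = 1/T

y(t) = A·sin(2πF0t + φ), so y(0) = A·sin(φ)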

(Many  slides  come  from  materials  from  Dan  Jurafsky)  


Simple periodic waves

• Frequency: 5 cycles in 0.5 seconds = 10 cycles/second = 10 Hz
• Amplitude: 1
• Phase: at time 0 seconds, y(0) = A·sin(2π·10·0 + φ) = sin(φ) = 0 ⇒ φ = πk, k ∈ ℤ; choosing k = 0 gives φ = 0
• Equation: y(t) = A·sin(20πt) = sin(20πt)
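The worked example above can be checked numerically. This is a minimal sketch, assuming a 1000 Hz sample rate (the slide does not state one); `sine_wave` and `count_cycles` are hypothetical helper names introduced only for illustration.

```python
import math

def sine_wave(amplitude, f0_hz, phase_rad, duration_s, sample_rate_hz):
    """Sample y(t) = A*sin(2*pi*F0*t + phi) at a fixed sample rate."""
    n_samples = int(round(duration_s * sample_rate_hz))
    return [
        amplitude * math.sin(2 * math.pi * f0_hz * (n / sample_rate_hz) + phase_rad)
        for n in range(n_samples)
    ]

def count_cycles(samples):
    """Count upward zero crossings, i.e. the number of cycles that start."""
    return sum(1 for a, b in zip(samples, samples[1:]) if a <= 0.0 < b)

# The slide's example: A = 1, F0 = 10 Hz, phi = 0, observed for 0.5 s.
samples = sine_wave(amplitude=1.0, f0_hz=10.0, phase_rad=0.0,
                    duration_s=0.5, sample_rate_hz=1000)
cycles = count_cycles(samples)
print(cycles, "cycles in 0.5 s ->", cycles / 0.5, "Hz")
```

Counting zero crossings only works for clean, single-frequency waves like this one; real recordings need a more robust F0 method.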

(more) Basic facts about sound waves

f = c/λ

• where c = speed of sound and λ = wavelength in meters
• c ≈ 34,400 cm/s (≈ 345 m/s) at 21 degrees Celsius at sea level
• Example: with λ = 10 m, frequency f = 345/10 = 34.5 Hz
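The relation f = c/λ is easy to turn into a small helper. A sketch, assuming the rounded value c = 345 m/s used in the example; the function names are hypothetical.

```python
def frequency_from_wavelength(wavelength_m, speed_of_sound_m_s=345.0):
    """Return f = c / wavelength, in Hz."""
    if wavelength_m <= 0:
        raise ValueError("wavelength must be positive")
    return speed_of_sound_m_s / wavelength_m

def wavelength_from_frequency(frequency_hz, speed_of_sound_m_s=345.0):
    """Return wavelength = c / f, in meters."""
    if frequency_hz <= 0:
        raise ValueError("frequency must be positive")
    return speed_of_sound_m_s / frequency_hz

# The slide's example: a 10 m wave.
print(frequency_from_wavelength(10.0), "Hz")
```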

Speech sound waves

• A little piece from the waveform of a vowel
• Y axis:
  – Amplitude = amount of air pressure at that time point
    • Positive is compression
    • Zero is normal air pressure
    • Negative is rarefaction (expansion)
• X axis: time

Fundamental frequency

• The fundamental frequency (or F0) is the lowest frequency of a periodic (voiced) waveform produced by a particular instrument (our vocal folds are like a "complicated" instrument)
• It is also called the first harmonic; its integer multiples are called the second, third, etc. harmonics

[Figure: the fundamental frequency (first harmonic) followed by the 2nd to 7th harmonics]

Fundamental frequency

In speech, see for example the waveform of a vowel.

• The fundamental frequency can be computed as the number of repetitions per second of the wave:
  – The vowel above has 10 repetitions in 0.03875 s, so the frequency is 10/0.03875 ≈ 258 Hz
• This is the rate at which the vocal folds vibrate, hence the voicing rate
• Each peak corresponds to an opening of the vocal folds

Pitch

• Pitch is defined as the perceived fundamental frequency of a sound
• F0 and pitch are different concepts:
  – F0 corresponds to a physically measurable frequency
  – Pitch corresponds to a perceived frequency
• The relationship between pitch and F0 is not linear
  – Human pitch perception is most accurate between 100 Hz and 1000 Hz
  – Approximately linear in this range: at F0_1 = 200 Hz, if Pitch_2 = Pitch_1/2 then F0_2 ≈ 100 Hz
  – Logarithmic above 1000 Hz: at F0_1 = 5 kHz, if Pitch_2 = Pitch_1/2 then F0_2 < 2 kHz
• Still, F0 and pitch are often treated as the same in the literature
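One common way to approximate this nonlinear relationship is the mel scale. The sketch below uses the widely quoted formula mel = 2595·log10(1 + f/700); this is an assumption, since the slides do not say which pitch scale they have in mind, and the function names are hypothetical.

```python
import math

def hz_to_mel(f_hz):
    """Map frequency in Hz to mels (assumed formula, see lead-in)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def half_pitch_frequency(f_hz):
    """Frequency whose mel value is half that of f_hz."""
    return mel_to_hz(hz_to_mel(f_hz) / 2.0)

for f in (200.0, 5000.0):
    print(f, "Hz -> half the perceived pitch at about",
          round(half_pitch_frequency(f)), "Hz")
```

Under this formula, halving the pitch of a 200 Hz tone lands near 94 Hz (close to half the frequency), while halving the pitch of a 5 kHz tone lands near 1.3 kHz, well below 2 kHz, which matches the two cases on the slide.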

F0 tracking

• F0 can be computed using several techniques, and with tools like Praat
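One of the simplest such techniques is autocorrelation: a periodic signal correlates strongly with itself when shifted by one period. This is a minimal sketch of that idea, not the algorithm Praat uses; the search range of 50 to 500 Hz, the 16 kHz sample rate and the synthetic test tone are all assumptions.

```python
import numpy as np

def estimate_f0_autocorr(signal, sample_rate_hz, f0_min_hz=50.0, f0_max_hz=500.0):
    """Estimate F0 as the lag with the highest autocorrelation in a plausible range."""
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()
    # Full autocorrelation; keep the non-negative lags only.
    acf = np.correlate(x, x, mode="full")[len(x) - 1:]
    lag_min = int(sample_rate_hz / f0_max_hz)   # shortest period considered
    lag_max = int(sample_rate_hz / f0_min_hz)   # longest period considered
    best_lag = lag_min + int(np.argmax(acf[lag_min:lag_max + 1]))
    return sample_rate_hz / best_lag

# Synthetic "vowel-like" tone: 258 Hz fundamental plus two weaker harmonics.
sample_rate_hz = 16000
t = np.arange(int(0.05 * sample_rate_hz)) / sample_rate_hz
tone = sum((1.0 / h) * np.sin(2 * np.pi * h * 258.0 * t) for h in (1, 2, 3))
print(round(estimate_f0_autocorr(tone, sample_rate_hz), 1), "Hz")
```

The estimate is quantized to integer lags, so its resolution gets coarser as F0 rises; practical trackers interpolate around the peak and add voicing decisions.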

Frequency analysis

• Waves have different frequencies

[Figure: two sine waves over 0 to 0.02 s, amplitude between -0.99 and 0.99; one at 100 Hz and one at 1000 Hz]

Frequency analysis

• Complex waves: adding a 100 Hz and a 1000 Hz wave together

[Figure: the sum of the two waves over 0 to 0.05 s, amplitude between -0.9654 and 0.99]

Spectrum

[Figure: amplitude spectrum with the two frequency components (100 and 1000 Hz) on the x-axis; x-axis: frequency in Hz, y-axis: amplitude]
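The two-component spectrum can be reproduced with an FFT. A sketch under assumed settings (8 kHz sample rate, 0.1 s of signal, both components at amplitude 0.5); none of these values come from the slides.

```python
import numpy as np

sample_rate_hz = 8000
duration_s = 0.1
t = np.arange(int(sample_rate_hz * duration_s)) / sample_rate_hz

# Complex wave: a 100 Hz and a 1000 Hz sine added together.
wave = 0.5 * np.sin(2 * np.pi * 100 * t) + 0.5 * np.sin(2 * np.pi * 1000 * t)

# Amplitude spectrum, scaled so a sine of amplitude a shows up as a peak of height a.
spectrum = np.abs(np.fft.rfft(wave)) / (len(wave) / 2)
freqs_hz = np.fft.rfftfreq(len(wave), d=1.0 / sample_rate_hz)

# The two largest bins are the two frequency components.
peak_bins = np.argsort(spectrum)[-2:]
peak_freqs_hz = sorted(freqs_hz[peak_bins].tolist())
print(peak_freqs_hz)
```

The duration is chosen so that both tones complete a whole number of cycles; otherwise the energy would leak into neighbouring bins and the peaks would be less clean.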

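Fourier transform analysis

• Fourier analysis: any wave can be represented as the (infinite) sum of sine waves of different frequencies, each with its own amplitude and phase
• For continuous signals: X(f) = ∫ x(t)·e^(−j2πft) dt, integrated over all t
• For discrete signals: X[k] = Σ x[n]·e^(−j2πkn/N), summed over n = 0, …, N−1
• When N is finite (and relatively short), the result is called the short-term spectrum; computing it frame by frame gives the short-time Fourier transform (STFT)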
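For discrete signals the transform is a finite sum that can be evaluated directly. The sketch below assumes the standard DFT definition X[k] = Σ x[n]·e^(−j2πkn/N); it is deliberately naive (O(N²)) to mirror the formula, whereas the FFT computes the same values much faster.

```python
import cmath
import math

def dft(x):
    """Direct evaluation of X[k] = sum over n of x[n] * exp(-j*2*pi*k*n/N)."""
    n_samples = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * math.pi * k * n / n_samples)
            for n in range(n_samples))
        for k in range(n_samples)
    ]

# A cosine with exactly 3 cycles in 16 samples: energy appears only in
# bin 3 and in its mirror image, bin 13.
frame = [math.cos(2 * math.pi * 3 * n / 16) for n in range(16)]
magnitudes = [abs(value) for value in dft(frame)]
print([round(m, 3) for m in magnitudes])
```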

Spectrum example

• Spectrum of one instant in an actual sound wave: many components across the frequency range
• Each frequency component of the wave shows up separately

[Figure: magnitude spectrum; x-axis: frequency from 0 to 5000 Hz, y-axis: magnitude in dB (0 to 40)]

Formants

• Formants are defined as the peaks of the spectral envelope of the sound
• Formants are independent of F0, as they are defined over the envelope of the spectrum
• They are created as the sound passes through the vocal tract

Seeing formants: the spectrogram


Example

What about helium voice? … http://www.phys.unsw.edu.au/jw/speechmodel.html

1. Introduction to acoustic signals
2. Spectral analysis
3. Applications of spectral processing

Spectrogram  


• Time-frequency representation
• Short-time windowing
• Fast Fourier Transform (FFT)
• Available tools:
  – Sonic Visualiser (for music analysis)
  – Praat (for speech analysis)
• Other resources:
  – Live spectrogram: http://labrosa.ee.columbia.edu/expo/

Window  size  

• Understanding time-frequency resolution
  – Long windows: good frequency resolution
  – Short windows: good temporal resolution
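The trade-off can be seen with two near tones. In this sketch (assumed values: 8 kHz sample rate, tones at 1000 and 1040 Hz, Hann window, frames of 64 and 1024 samples) the short frame merges the tones into one spectral peak while the long frame resolves both; `count_spectral_peaks` is a hypothetical helper.

```python
import numpy as np

def count_spectral_peaks(frame, rel_threshold=0.5):
    """Count local maxima of the windowed magnitude spectrum above a relative threshold."""
    windowed = frame * np.hanning(len(frame))
    mag = np.abs(np.fft.rfft(windowed))
    inner = mag[1:-1]
    is_peak = (inner > mag[:-2]) & (inner >= mag[2:]) & (inner > rel_threshold * mag.max())
    return int(np.count_nonzero(is_peak))

sample_rate_hz = 8000
t = np.arange(1024) / sample_rate_hz
two_tones = np.sin(2 * np.pi * 1000 * t) + np.sin(2 * np.pi * 1040 * t)

short_frame = two_tones[:64]    # 8 ms: bins are 125 Hz apart
long_frame = two_tones          # 128 ms: bins are about 7.8 Hz apart

print("short window peaks:", count_spectral_peaks(short_frame))
print("long window peaks:", count_spectral_peaks(long_frame))
```

The price of the long window is time resolution: anything that changes within those 128 ms, such as a noise burst or a fast chirp, is smeared across the whole frame.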

Observing  test  signals  

• Two near tones
• Noise burst
• Chirp
• Pure tones
• Harmonic richness (square/saw)
• Low tone

Sonic Visualiser: http://mtg.upf.edu/~jjaner/teaching/CDSIM2014/Test_various_signals.wav

Applications of spectral processing

Technologies for the synthesis of sound and music

Technologies for the analysis of sound and music

Technologies for the transformation of sound and music

Analysis  

• Skore – automatic singing voice rating

Transforming  signals  

• Approaches for spectral transformations:
  – SMS: http://mtg.upf.edu/sms
  – Phase vocoder
• Basic transformations
  – Pitch transposition
  – Harmonic/noise decomposition
  – Time-stretching

(Matlab internal MTG software)

Transforming  signals  

• Basic transformations
  – Original
  – Pitch transposition
  – Harmonic/noise decomposition
  – Time-stretching (50x)

Transformation

• Time scaling
  – Detection of transients
  – Repetition/removal of spectral frames
  – Demo: Fast Remixing
    • Original, fast, time-varying, remix
• Swing detection
  – Tempo detection at 8th-note level
  – Change swing factor
  – Demo: video

Synthesis  

• Sample-based (violin)
  – Gesture modelling to provide a more realistic synthesis
• Voice-driven synthesis
  – Voice analysis is used to control the synthesis of a violin sound

2 - Decomposing sound signals

Signal decomposition and source separation

Source separation

The objective

• Music is distributed as mixdowns in various formats
• Users aim to further manipulate music signals in multiple application contexts (karaoke, soloing, remixing, etc.)

* from multitrack originals

The problem

• Music signals are complex
• Variety of music styles and instrumentations
• Modern production techniques go beyond linear combination of recorded acoustic sources
  – (FX, digital synths, etc.)

Background I

Existing generic source separation (SS) approaches:
• Spectral subtraction
  – Intuitive
  – Well-studied (industrial interest)
  – Good for speech/stationary noise reduction
  – Less appropriate for music signals
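The core step of spectral subtraction is short enough to sketch. This assumes the simplest magnitude-domain variant: a per-bin noise estimate, here averaged over frames known to contain only noise, is subtracted from every frame and the result is clamped at a floor. The tiny spectrogram and the function name are made up for illustration.

```python
import numpy as np

def spectral_subtraction(noisy_mag, noise_mag_estimate, floor=0.0):
    """Subtract a per-bin noise magnitude from every frame, clamping at a floor.

    noisy_mag: magnitude spectrogram, shape (n_bins, n_frames).
    noise_mag_estimate: noise magnitude per bin, shape (n_bins,).
    """
    cleaned = noisy_mag - noise_mag_estimate[:, np.newaxis]
    return np.maximum(cleaned, floor)

# Toy spectrogram: 2 bins, 3 frames. Frames 0 and 1 contain only noise.
noisy_mag = np.array([[1.0, 1.0, 5.0],
                      [2.0, 2.0, 2.0]])
noise_estimate = noisy_mag[:, :2].mean(axis=1)
cleaned = spectral_subtraction(noisy_mag, noise_estimate)
print(cleaned)
```

A single fixed estimate is exactly why the method suits stationary noise and struggles with music, where the "interference" changes from frame to frame.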

Background II

Existing music-specific approaches I:
• Pan-frequency masks
  o Assumes non-overlapping signals in time-frequency bins
  o Stereo signals are required
  o Amplitude ratio between L and R FFT bins
  o 2D user interface
• Examples
  o Good for simple excerpts
  o Bad for complex mixes

* Loses brightness, vocals are less reduced due to reverb, flute is also removed, …
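A pan mask can be sketched from the L/R amplitude ratio of each bin. The pan index used here (0 = fully left, 0.5 = center, 1 = fully right) and the window parameters are assumptions chosen for illustration, not the formulation of any particular tool.

```python
import numpy as np

def pan_mask(left_spec, right_spec, pan_center, pan_width, eps=1e-12):
    """Binary mask selecting bins whose L/R amplitude ratio falls inside a pan window."""
    left_mag = np.abs(left_spec)
    right_mag = np.abs(right_spec)
    # 0 = only in the left channel, 0.5 = equal in both, 1 = only in the right.
    pan_index = right_mag / (left_mag + right_mag + eps)
    return np.abs(pan_index - pan_center) <= pan_width / 2.0

# Three bins: one centered, one hard left, one hard right.
left = np.array([1.0 + 0j, 1.0 + 0j, 0.0 + 0j])
right = np.array([1.0 + 0j, 0.0 + 0j, 1.0 + 0j])
center_mask = pan_mask(left, right, pan_center=0.5, pan_width=0.2)

# "Mute the center": zero the selected bins in both channels.
left_without_center = np.where(center_mask, 0.0, left)
right_without_center = np.where(center_mask, 0.0, right)
print(center_mask)
```

Because each bin is assigned entirely to one pan position, anything that overlaps the target in time and frequency (reverb tails, other centered instruments) is removed or kept along with it.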

Background III

Existing music-specific approaches II:
• Non-negative Matrix Factorization (NMF)
  – Magnitude spectrogram V (non-negative)
  – Decomposition as a matrix product, V ≈ W·H
  – W (spectral bases) and H (gain activations over time)
  – Each spectrum frame is explained as a linear combination of R bases
  – Minimization problem that finds W and H: min D(V, W·H)

[Figure: the matrices V, W and H]
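A minimal NMF sketch follows. It assumes D is the Euclidean distance and uses the classic multiplicative update rules; the slides leave D unspecified, and divergences such as Kullback-Leibler are also common for audio. The random test matrix stands in for a real magnitude spectrogram.

```python
import numpy as np

def nmf(V, n_basis, n_iter=200, eps=1e-9, seed=0):
    """Factor a non-negative matrix V (bins x frames) as W @ H.

    W: spectral bases, shape (n_bins, n_basis).
    H: activation gains over time, shape (n_basis, n_frames).
    """
    rng = np.random.default_rng(seed)
    n_bins, n_frames = V.shape
    W = rng.random((n_bins, n_basis)) + eps
    H = rng.random((n_basis, n_frames)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep W and H non-negative by construction.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy "spectrogram" built from 3 hidden bases (R = 3).
rng = np.random.default_rng(1)
V = rng.random((20, 3)) @ rng.random((3, 30))
W, H = nmf(V, n_basis=3)
print("reconstruction error:", np.linalg.norm(V - W @ H))
```

The result depends on the random initialization and is only a local minimum, which is one reason informed variants fix or initialize W or H from prior knowledge.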

NMF details

• Non-negative Matrix Factorization I
  – 3 spectral bases W
  – H: activation gains
  – Example 1: 1 overlapping note
  – Example 2: 2 overlapping notes

NMF challenges

• Predominant instrument separation
  – (pitch/timbre analysis)
• Completeness of instrument removal
  – (attack/sustain, residual/breathing noise, unvoiced consonants, …)
• Percussive instrument separation
  – (transient detection, wideband spectrum)
• Polyphonic instrument separation
  – (blind and score-informed)

Vocals/Background separation

• "Music print" decomposition:
  – song containing a region without the target (e.g. vocals)
  – basis model learnt from the user-selected "music print"

[Figure: a song with a music print region (without vocals) and a region with vocals]

Vocals/Background separation

• "Music print" decomposition:
  – Demos: original, mute

[Diagram: background excerpt → basis decomposition W·H → Wbgd; input → basis decomposition [Wbgd, Wother]·[Hbgd, Hother] → Wiener filtering (Wbgd, Hbgd)/(W·M) → output mute]

Vocals/Background separation

• "Music print" decomposition:
  – Demos: original, solo

[Diagram: background excerpt → basis decomposition W·H → Wbgd; input → basis decomposition [Wbgd, Wother]·[Hbgd, Hother] → Wiener filtering (Wother, Hother)/(W·M) → output solo]
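The Wiener filtering step in the two diagrams can be sketched as a pair of soft masks. This assumes the usual ratio form, where each source's model spectrogram is divided by the sum of both models; reading the denominator on the slides that way is an interpretation, and the random matrices below are placeholders for learned bases and activations.

```python
import numpy as np

def wiener_masks(W_bgd, H_bgd, W_other, H_other, eps=1e-12):
    """Soft masks for the background and for everything else (e.g. vocals)."""
    bgd_model = W_bgd @ H_bgd          # background part of the model spectrogram
    other_model = W_other @ H_other    # remaining part (the target)
    total = bgd_model + other_model + eps
    return bgd_model / total, other_model / total

rng = np.random.default_rng(0)
W_bgd, H_bgd = rng.random((5, 2)), rng.random((2, 4))
W_other, H_other = rng.random((5, 1)), rng.random((1, 4))
mask_bgd, mask_other = wiener_masks(W_bgd, H_bgd, W_other, H_other)

# Applied to the complex STFT X of the mix (same shape as the masks):
#   mute version = mask_bgd * X, solo version = mask_other * X
print(mask_bgd.round(2))
```

Since the two masks sum to one in every bin, the mute and solo outputs add back up to the original mix.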

Vocals/Background separation

• "Music print" decomposition:
  – not always possible…
    • the accompaniment (music print) changes throughout the song
    • the target is always present in some sections

Vocals/Background separation

• Solution → Predominant pitch detection
  – e.g. MELODIA (J. Salamon, MTG)
• Separation → Binary mask from pitch information
  – Simplest approach
  – Time-frequency mask: 1s at harmonic positions, 0s elsewhere
  – Can be combined with a pan-frequency mask
• Demos: original, mute, solo
  – Voice is properly removed/attenuated
  – Bass guitar is "comb-filtered", and horns are attenuated
  – Soloing produces more artifacts
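The binary mask can be built directly from a per-frame pitch track. A sketch, assuming the track gives F0 in Hz with 0 for unvoiced frames and that each harmonic covers a fixed number of neighbouring bins; the function name and parameters are hypothetical, and this does not call MELODIA.

```python
import numpy as np

def harmonic_binary_mask(f0_hz_per_frame, n_bins, sample_rate_hz, n_fft, half_width_bins=1):
    """Mask of shape (n_bins, n_frames): True at harmonic positions of the pitch, False elsewhere."""
    mask = np.zeros((n_bins, len(f0_hz_per_frame)), dtype=bool)
    bin_hz = sample_rate_hz / n_fft
    nyquist_hz = sample_rate_hz / 2
    for frame, f0 in enumerate(f0_hz_per_frame):
        if f0 <= 0:                     # unvoiced frame: leave the column empty
            continue
        harmonic_hz = f0
        while harmonic_hz < nyquist_hz:
            center = int(round(harmonic_hz / bin_hz))
            low = max(center - half_width_bins, 0)
            high = min(center + half_width_bins, n_bins - 1)
            mask[low:high + 1, frame] = True
            harmonic_hz += f0
    return mask

# Two frames: one unvoiced, one with a 1000 Hz pitch (bins are 100 Hz wide here).
mask = harmonic_binary_mask([0.0, 1000.0], n_bins=41, sample_rate_hz=8000,
                            n_fft=80, half_width_bins=0)
print(np.flatnonzero(mask[:, 1]))

# solo = mask * X, mute = (~mask) * X for a mix STFT X of the same shape.
```

The comb-like shape of this mask explains the artifacts listed above: any other instrument with energy at the same harmonic positions, such as the bass guitar, gets cut at those bins too.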

Vocals/Background removal

Advanced separation approaches
Special treatment for vocals: source/filter models

Breathiness residual (noise added on formant envelope)
Demos: original, solo version without residual, solo version with residual

Vocals/Background removal

Advanced separation approaches
Special treatment for vocals

Breathiness residual (noise added on formant envelope)
Unvoiced fricative modelling: /s/, /f/, /sh/, …

• Supervised basis from solo phoneme recordings
  o Demos: original, solo version where /s/ are missing, solo version where /s/ are present

[Figure: spectrogram of the fricative recording used to train the spectral basis]

Piano decomposition/retouch

• Using instrument-specific NMF dictionaries
  – Piano model of 88 notes (the W matrix is pre-learned)
• Retouch use case:
  – Amateur recording with errors
  – The user can select and correct individual notes after decomposition/separation
• Demos: original (played with errors), separated notes, corrected remix, original (ref)

Score-informed separation

• Multiple sources in an orchestral recording
• Score data is used to initialize the activations matrix H
• Video demo:
  – Isolated instruments: violin, cello, oboe, bassoon, flute
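Initializing H from the score works well with multiplicative NMF updates, because an activation that starts at zero stays at zero. The sketch below illustrates that property under stated assumptions: a fixed, pre-learned W, a Euclidean cost, and a binary "piano roll" aligned to the spectrogram frames. All names and the toy data are hypothetical.

```python
import numpy as np

def score_informed_activations(score_roll, V, W, n_iter=50, eps=1e-9, seed=0):
    """Estimate H with W fixed, allowing activity only where the score has a note.

    score_roll: binary matrix (n_basis, n_frames), 1 where the score says the source plays.
    """
    rng = np.random.default_rng(seed)
    # Random positive start inside the score regions, exactly zero outside.
    H = score_roll.astype(float) * (rng.random(score_roll.shape) + eps)
    for _ in range(n_iter):
        # Multiplicative update: zero entries remain zero.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
    return H

rng = np.random.default_rng(2)
W = rng.random((6, 2)) + 0.1                 # two sources, 6 frequency bins
score_roll = np.array([[1, 1, 0, 0],         # source 0 plays in frames 0-1
                       [0, 0, 1, 1]])        # source 1 plays in frames 2-3
V = W @ (score_roll * 2.0)                   # toy mix with gain 2 wherever a note is on
H = score_informed_activations(score_roll, V, W)
print(H.round(3))
```

In practice the score and the audio are never perfectly aligned, so the allowed regions are usually widened by some tolerance around each note.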

Other potential applications

• Singer replacement
  – Demos: original, vocals mute, Vocaloid Clara, Vocaloid Clara mix
• Drums enhancement
  – Demos: original, drums +6 dB, drums -6 dB
• Step-remixer for drums
  – user-supervised transients (onset time and instrument)
  – Demos: original, all drums, single instrument

Other potential applications (piano)

• Mono-to-stereo upmixing
• Input
  – Mozart K331 recording (RWC dataset)
• Output
  – Upmixing from mono
    • left/right hands are panned in stereo

Other potential applications (piano)

• Automatic accompaniment
• Input
  – Mozart K331 recording (RWC dataset)
• Output
  – automatic object detection
  – string ensemble resynthesis
• Demos: synth solo (Kontakt), mixture

Thanks!    

Jordi Janer
Music Technology Group, Universitat Pompeu Fabra, Barcelona
jordi.janer@upf.edu

CDSIM – UPF, May 2014
http://mtg.upf.edu/~jjaner