The Development of a Parametric Real-Time Voice Source Model for use with Vocal Tract Modelling Synthesis on Portable Devices

Jacob Harrison

MSc by Research
University of York
Electronics
November 2014



Abstract

This research is concerned with the natural synthesis of the human voice, in particular the expansion of the LF-model voice source synthesis method. The LF-model is a mathematical representation of the acoustic waveform produced by the vocal folds in the human speech production system. Although the model has been used in many voice synthesis applications since its inception in the 1970s, its parametric capabilities have remained mostly unexploited in terms of real-time manipulation. With recent advances in dynamic acoustic modelling of the human vocal tract using the two-dimensional digital waveguide mesh (2D DWM), a logical next step is to include a real-time parametric voice source model rather than the static LF-waveform archetype.

     

This thesis documents the development of a parameterised LF-model to be used in conjunction with an iOS-based 2D DWM vocal tract synthesiser, designed with the further study of voice synthesis naturalness, as well as improvements to assistive technology, in mind.

       

Table of Contents

Abstract
List of Figures
List of Tables
List of Accompanying Material
Acknowledgements
Author's Declaration
1. Introduction
   1.1 Thesis Overview
   1.2 Thesis Structure
2. Literature Review
   2.1 The 'Source + Modifier' Principle
   2.2 Physiology of the Vocal Folds
   2.3 Voice Types
   2.4 Modelling the Voice Source
      2.4.1 Existing Methods for Voice Source Synthesis
      2.4.2 The Liljencrants-Fant Glottal Flow Model
   2.5 Vocal Tract Modelling
   2.6 'Naturalness' in Speech Synthesis
3. A Parametric, Real-Time Voice Source Model
   3.1 Motivation for Design
   3.2 Specifications
   3.3 Design
      3.3.1 Choice of Parameters
      3.3.2 Wavetable Synthesis
      3.3.3 Voice Types
      3.3.4 iOS Interface
      3.3.5 Final Design
   3.4 Implementation
      3.4.1 Implementation in MATLAB
      3.4.2 Implementation in iOS
   3.5 System Testing
      3.5.1 Waveform Reproduction
      3.5.2 Fundamental Frequency
      3.5.3 'Vocal Tension' Parameters
      3.5.4 Automatic Pitch-Dependent Voice Types
      3.5.5 Automatic f0 Trajectory
   3.6 Conclusions
4. Vocal Tract Modelling
   4.1 Vocal Tract Modelling with the 2D DWM
   4.2 Implementation of the 2D DWM in MATLAB
   4.3 Implementation in iOS
   4.4 System Testing
      4.4.1 Formant Analysis
      4.4.2 Multiple Vowels
      4.4.3 System Performance
   4.5 Conclusions
5. Summary and Analysis
   5.1 Summary
   5.2 Analysis
      5.2.1 Voice Source Synthesis using the LF-Model
      5.2.2 Extensions to the LF-Model
      5.2.3 Use within 2D DWM Vocal Tract Model
      5.2.4 Core Aims
   5.3 Future Research
      5.3.1 Issues within LF-Model Implementation
      5.3.2 Further Extensions to the Voice Source Model
      5.3.3 Multi-touch, Gestural User Interfaces
      5.3.4 Implementation of Dynamic Impedance Mapping within the 2D DWM
   5.4 Conclusion
Appendix A – 'LFModelFull.m' MATLAB Source Code
Appendix B – 'ViewController.h' LFGen App Header File
Appendix C – 'ViewController.m' LFGen App Main File
Appendix D – 'AudioEngine.h' LFGen App Header File
Appendix E – 'AudioEngine.m' LFGen App Main File
References

List of Figures

Figure 2.1   The human vocal system
Figure 2.2   Cross-section of the human speech system
Figure 2.3   Glottal flow waveform and derivative
Figure 2.4   Comparison between F- and L- model waveforms
Figure 2.5   Annotated LF-model flow derivative waveform
Figure 2.6   Vocal tract represented as a series of tubes
Figure 2.7   1D digital waveguide structure
Figure 2.8   Achieving a cross-sectional area function from MRI data
Figure 2.9   2D digital waveguide mesh structure
Figure 2.10  Raised cosine function
Figure 2.11  2D and 3D DWM topologies
Figure 2.12  Wolfgang von Kempelen's 'Speaking Machine'
Figure 2.13  The 'Uncanny Valley' effect
Figure 3.1   'Typical' LF waveform
Figure 3.2   'Typical' LF waveform with varying te value
Figure 3.3   'Typical' LF waveform with varying tp value
Figure 3.4   'Typical' LF waveform with varying ta value
Figure 3.5   LFGen app interface with CorePlot waveform display
Figure 3.6   'Breathy' voice waveform
Figure 3.7   Black box diagram for LFGen app
Figure 3.8   Software diagram for LFGen app
Figure 3.9   LFGen app interface (final version)
Figure 3.10  3D-printed vocal tract model
Figure 3.11  Modal voice type waveform and spectrum
Figure 3.12  Breathy voice type waveform and spectrum
Figure 3.13  Vocal fry voice type waveform and spectrum
Figure 3.14  Falsetto voice type waveform and spectrum
Figure 3.15  'Typical' voice type waveform and spectrum
Figure 3.16  'Typical' voice type waveform with varying vocal tension
Figure 3.17  'Typical' voice type spectrum
Figure 3.18  'Typical' voice type waveform with minimum VT
Figure 3.19  'Typical' voice type waveform with maximum VT
Figure 3.20  'Typical' voice type waveform with varying ta values
Figure 3.21  'Typical' voice type spectrum
Figure 3.22  'Typical' voice type spectrum with minimum ta
Figure 3.23  'Typical' voice type spectrum with maximum ta
Figure 3.24  Waveform of an f0 sweep with 'auto-voice' enabled
Figure 3.25  Waveform between 24-52 Hz with 'auto-voice' enabled
Figure 3.26  Waveform between 52-94 Hz with 'auto-voice' enabled
Figure 3.27  Waveform between 94-207 Hz with 'auto-voice' enabled
Figure 3.28  Waveform between 207-288 Hz with 'auto-voice' enabled
Figure 3.29  Waveform above 288 Hz with 'auto-voice' enabled
Figure 3.30  Spectrogram of human /A/ vowel with varying f0
Figure 3.31  Spectrogram of synthesised /A/ vowel with varying f0
Figure 4.1   Spectrogram of synthesised /A/ vowel with 'typical' voice
Figure 4.2   English vowel chart
Figure 4.3   Xcode performance check
Figure 5.1   Idealised vocal fry waveform
Figure 5.2   HandSynth touchscreen interface
Figure 5.3   Proposed multitouch interface design

     

     

     

     

     

     

     

     

     

       


List of Tables

Table 2.1   Four voice types and their corresponding waveforms
Table 2.2   Four voice types with spectra, pitch range and noise amount
Table 2.3   Four male voice types and their timing parameter values
Table 3.1   Five LFGen voice types and their timing parameter values
Table 4.1   Synthesised formants vs average English male speech formants

List of Accompanying Material

The following material can be found on the accompanying data CD:

1. A PDF of this document
2. 'Audio Examples' folder – synthesised voice types and 2D DWM vowels:
   a. 'Breathy110Hz.wav' – breathy voice type at 110 Hz
   b. 'Falsetto110Hz.wav' – falsetto voice type at 110 Hz
   c. 'Modal110Hz.wav' – modal voice type at 110 Hz
   d. 'Typical3-bird.wav' – typical voice type with /3/ vowel
   e. 'Typical110Hz.wav' – typical voice type at 110 Hz
   f. 'TypicalA-bart.wav' – typical voice type with /A/ vowel
   g. 'TypicalAe-Bat.wav' – typical voice type with /Ae/ vowel
   h. 'TypicalI-beet.wav' – typical voice type with /I/ vowel
   i. 'TypicalQ-bod.wav' – typical voice type with /Q/ vowel
   j. 'TypicalU-food.wav' – typical voice type with /U/ vowel
   k. 'VocalFryFu110Hz.wav' – vocal fry voice type at 110 Hz
3. 'Code Listings' folder:
   a. 'LFGenMkVI.zip' – compressed folder containing the Xcode project for the LFGen iOS app
   b. 'LFModelF0Data.m' – MATLAB script for producing a synthesised vowel for a given voice type with an f0 sweep taken from a voice recording
   c. 'LFModelFull.m' – MATLAB script for producing any voice type, with options for pitch, amplitude, duration, breathiness and vocal tension
4. Demonstration video – 'LFGenDemoVideo.mp4'

       

Acknowledgements

     

To my parents, thank you for your constant love, support and encouragement throughout this project.

To my supervisor David Howard, thank you for the inspiring supervisions and general advice during this project and others throughout my time at York.

To Steve, Amelia, Laurence, Becky, Andrew, Tom, Eyal, Frank, Jude and Helena, thank you for some truly memorable crossword sessions during the Audio Lab lunch breaks, and the near-constant supply of cake.

To Jiajun, Ed and Simon, your patience and understanding with the often-frustrating life of a post-graduate researcher made our house a pleasure to come back to after many late nights in the library.

To Dimitri, your expertise and willingness to teach iOS and Core Audio helped this project materialise at a crucial point in the development stages.

Special thanks to my friends on both sides of the country, especially Benedict, Sam, JP, Ben, Mike, Annie and Rosie.

Author's Declaration

The work presented in this thesis is entirely the author's own, with any substantial external influences attributed in the text. None of the content in this thesis has been published by the author in any form. This work has not previously been presented for an award at this, or any other, University.

1. Introduction

     

The title of this thesis is The Development of a Parametric Real-Time Voice Source Model for use with Vocal Tract Modelling Synthesis on Portable Devices. The research project described herein is concerned with digital modelling of the human voice source to help improve the naturalness of existing speech synthesis technology. This thesis contains an analysis of existing voice source models, followed by a description of the development of a voice source modelling application for iOS devices.

This chapter introduces the key themes of this research, and the motivation for this specific project. An overview of the remaining chapters is given in section 1.2.

1.1 Thesis Overview

The human voice is the most expressive and versatile instrument we possess. Whether delivering a public speech, singing in a church choir or having a private conversation, the sheer flexibility of the vocal instrument allows us to convey a huge spectrum of human emotion with the subtlest of expressive touches. It is not surprising that a totally accurate reproduction of the human vocal system has not yet been achieved. Apple's Siri software [1] is capable of producing speech output that, on a casual listen, can sound indistinguishable from human speech; however, the software's vocabulary is limited to pre-recorded voice sounds. The DECtalk system (commonly associated with Stephen Hawking's communication aid) [2] is instantly recognisable as a computerised or 'robotic' voice, but has an unlimited vocabulary, as it can produce any speech sound. The compromises inherent in both these systems are informed by the context in which they are used: Siri users do not rely on the software to communicate, but might prefer a pleasant voice, whereas users of communication aids such as DECtalk rely on the ability to convey any information in an efficient and intelligible manner, with naturalness or realism being of lesser importance.

     

The work described in this thesis aims to contribute to a voice synthesis system that is both versatile and expressive. It takes into account the importance of the voice source (discussed in Chapter 2) in human speech production, and looks at ways in which a more sophisticated voice source model can be incorporated into existing speech synthesis applications.

     

The motivation for this research comes from two places of interest. Firstly, natural voice synthesis provides a fascinating research area, with inspiration from and implications for a variety of disciplines such as engineering, psychoacoustics, linguistics, voice pathology and even philosophy. The software developed for this work was designed predominantly as a research tool that could be used in any of these fields: as an input source for a new vocal tract model, for example, or as a means of exploring the role of voice source variation in the perception of synthesised voices.

       


As well as a general interest in voice synthesis, the impact of related software for assistive technology applications is considered a major motivation for improving the technology in this field. This partly informed the decision to focus on portable devices such as tablets and smartphones, which, for some users of assistive technology, have become useful and often essential items [3] [4]. Whilst the goal of this work was never to develop a fully formed communication aid, it is hoped that the research and software described herein will contribute to future developments for such an application.

     

1.2 Thesis Structure

Chapter 2 provides a summary of the existing literature on topics related to this work. First, the 'source + modifier' model of speech production and voice synthesis is explained, followed by a description of voice source physiology. The main 'voice types' are then introduced, and an overview of voice source modelling is given. Vocal tract modelling techniques are then described, including a description of the digital waveguide mesh, which is used to model the vocal tract in this work. Finally, previous research projects on the subject of 'naturalness' are recounted to set the work in context.

     

Chapter 3 describes the majority of the development process for a parametric, real-time voice source model. The general motivation for this design is given, as well as a technical specification. The design of the software is described, followed by an implementation report and system testing results.

     


Chapter 4 documents the process of porting an existing 2D digital waveguide mesh model of the vocal tract first to MATLAB and then to iOS. Chapter 5 concludes the work, with an analysis of the project as a whole, followed by a brief exploration of potential future work on the subject.

     

       

2. Literature Review

     

This chapter summarises the key themes of the research undertaken, and discusses existing literature on the subject. The impetus for this research came from the conclusions of two earlier research projects [5] which dealt with the concept of 'naturalness' in voice synthesis, and made attempts to improve or explore this notion through real-time control.

     

During the initial stages of the current project, it was concluded that a different approach should be taken, namely improving the synthesis engine rather than its interface. For the sake of completeness, and to place this work in context, a brief summary of these earlier studies and related literature is included. The relevant literature can, therefore, be split into four key areas:

• voice source physiology and acoustics
• speech synthesis and vocal tract modelling
• 'naturalness' in speech synthesis
• voice source modelling

The latter (voice source modelling) is the primary research area.

     

2.1 The 'Source + Modifier' Principle

Before discussing the physiology and acoustics of the voice source, it is necessary to define what is meant by the 'voice source' in relation to speech production as a whole. It is widely understood that an appropriate analogue of the human speech system is a description consisting of a sound source and sound modifiers. Howard and Murphy [6] provide a detailed introduction to voice science, encompassing everything from speech system physiology to speech and singing recording techniques. This book includes another component in the 'source and modifier' analogue: the 'power source', being the lungs. It is important to include the power source when considering human speech production; however, in synthesised speech the airflow from the lungs is usually not incorporated into the synthesis engine, so a single 'voice source' can be considered as an approximation of the waveform created when the airstream resulting from lung pressure acts on the vocal folds. The sound modifiers are the acoustic cavities between the glottis and the lips (the vocal tract) and the articulators (the tongue, lips and jaw), which modify the voice source signal by acoustically filtering certain frequencies and creating speech components such as consonants. Figure 2.1 displays a cross-section of the human vocal system, detailing power and noise source, compared with the voice source waveform created when they act together.

     

Figure 2.1 – The human vocal system with waveform of voice source (equivalent to power source + noise source)
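The source + modifier separation can be sketched in a few lines of code: a periodic source signal standing in for the glottal pulses is passed through a cascade of resonant filters standing in for the formant resonances of the vocal tract. The following Python sketch is purely illustrative and is not code from this thesis; the formant centre frequencies and bandwidths are rough textbook values for an /A/-like vowel, and the impulse-train source is the crudest possible glottal stand-in.

```python
import numpy as np

def resonator(x, f, bw, fs):
    """Two-pole resonator: a crude stand-in for one vocal tract formant."""
    r = np.exp(-np.pi * bw / fs)                 # pole radius from bandwidth
    c1, c2 = 2 * r * np.cos(2 * np.pi * f / fs), -r * r
    y = np.zeros(len(x) + 2)                     # two zero-initialised past samples
    for i in range(len(x)):
        y[i + 2] = x[i] + c1 * y[i + 1] + c2 * y[i]
    return y[2:]

fs, f0, dur = 16000, 110.0, 0.5
n = int(fs * dur)

# Source: an impulse train at the fundamental frequency, standing in for
# the pulse train produced by the vibrating vocal folds.
source = np.zeros(n)
source[::int(fs / f0)] = 1.0

# Modifiers: a cascade of formant resonators (illustrative /A/-like values).
speech = source
for formant, bw in [(730, 90), (1090, 110), (2440, 170)]:
    speech = resonator(speech, formant, bw, fs)
speech /= np.max(np.abs(speech))                 # normalise to +/-1
```

Sweeping the resonator centre frequencies while keeping the source fixed changes the perceived vowel without changing the pitch, which is exactly the independence that the source + modifier analogue assumes.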

     

It should be noted that, in reality, the vocal folds and vocal tract do not act fully independently of each other [7], and a truly accurate model of the speech system would take into account the cross-coupled relationship between the vocal tract and vocal folds [8]. Most existing voice source models remain fairly rudimentary, staying faithful to the discretised model presented above [7] [9]. There are advantages and disadvantages to both approaches: complex, cross-coupled physical models are able to replicate the behaviour of the vocal folds under certain conditions, at the expense of computational ease, while rudimentary mathematical models of the glottal flow waveform can be more computationally efficient, at the expense of realistic behaviour under certain conditions. However, the flexibility given by these mathematical models allows for increased functionality in terms of acoustic responses to given conditions.
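To make the distinction concrete, the LF family of models (introduced fully in section 2.4.2) describes the glottal flow derivative over one period as an exponentially growing sinusoid (the open phase) followed by an exponential 'return phase' after the instant of main excitation. The Python sketch below is a simplified illustration of that shape and not the implementation developed in this thesis: the growth rate alpha is fixed to an arbitrary illustrative value rather than solved for the LF area-balance condition, and the timing parameters are only representative.

```python
import numpy as np

def lf_pulse_derivative(fs=44100, f0=110.0, Ee=1.0,
                        tp=0.40, te=0.55, ta=0.008, alpha=3.0):
    """One period of a simplified LF-style glottal flow derivative.

    tp (instant of peak flow), te (instant of main excitation) and the
    return-phase constant ta are fractions of the period. alpha is fixed
    to an illustrative value here; a full LF implementation solves for
    alpha and epsilon so the pulse integrates to zero over the period.
    """
    T = 1.0 / f0
    n = int(round(fs * T))
    t = np.arange(n) / (fs * T)        # normalised time, 0 <= t < 1
    tc = 1.0                           # glottal closure at end of period
    wg = np.pi / tp                    # open-phase angular frequency

    # epsilon from the LF return-phase condition:
    #   epsilon * ta = 1 - exp(-epsilon * (tc - te))
    eps = 1.0 / ta
    for _ in range(50):                # simple fixed-point iteration
        eps = (1.0 - np.exp(-eps * (tc - te))) / ta

    # scale the open phase so that E(te) = -Ee (main excitation peak)
    E0 = -Ee / (np.exp(alpha * te) * np.sin(wg * te))

    E = np.empty(n)
    op = t <= te                       # open phase: growing sinusoid
    E[op] = E0 * np.exp(alpha * t[op]) * np.sin(wg * t[op])
    rp = ~op                           # return phase: exponential recovery
    E[rp] = -(Ee / (eps * ta)) * (np.exp(-eps * (t[rp] - te))
                                  - np.exp(-eps * (tc - te)))
    return t, E

t, E = lf_pulse_derivative()
```

Plotting E against t reproduces the characteristic LF shape: a positive hump during the open phase, a negative peak of depth Ee at te, and a rapid exponential recovery whose abruptness is governed by ta (a smaller ta gives a sharper closure and a brighter voice quality).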

     

2.2 Physiology of the Vocal Folds

    Fig.  2.2  below  shows  a  cross-‐section  of  the  voice  production  system  in  humans.  

    Voice   production   begins   at   the   diaphragm   below   and   the   intercostal   muscles  

    surrounding  the  lungs.  At  rest,  the  diaphragm  is  bowed  upwards,  and  flattens  out  

    when   constricted.   When   the   diaphragm   is   constricted   and   the   intercostal  

    muscles  expand  the  ribs,  air  enters  the  lungs.  Breathing  out  requires  the  lungs  to  

    be   compressed   in   some   manner,   through   contraction   of   the   intercostal   or  

    abdominal  muscles   [6].  Airflow   from  the   lungs   then  passes   towards   the  glottis.  

    The  glottis  is  the  area  between  the  vocal  folds.  The  vocal  folds  are  described  as  

    ‘the   vibrating   elements   in   the   larynx’   [6]  and   are   the   two  mucosal  membranes  

    that  traverse  either  side  of  the  glottis,  and  meet  in  the  middle  to  close  the  larynx  

    completely.   The   prevailing   theory   for   the   kinematic   process   of   vocal   fold  

    vibration  is  attributed  to  the  Bernoulli  effect  [10].  This  is  the  same  process  that  is  

    used  to  describe  lift  in  aeroplanes,  helicopters  and  aerofoils,  and  occurs  when  an  

    airstream  passes  over  a  curved  surface,  creating  an  area  of  low  pressure  due  to  

    the  faster  airstream  closer  to  the  curve.  When  air  passes  through  the  glottis,  the  

    vocal   folds  are   forced  open.  The  curvature  of   the  open   folds   creates  an  area  of  

    low  pressure  in  between  and  below  them,  drawing  the  folds  back  together.  This  

    process   repeats,   creating   a   constant   oscillation.   It   should   be   noted   that   more  

    recent   research   has   discredited   the   use   of   the   Bernoulli   effect   to   explain  

    phenomena   such   as   aerofoil   lift   and   vocal   fold   vibration.   In   [11],   Babinsky  


    explains   the   fallacy   of   invoking   the   Bernoulli   equation,   but   an   in-‐depth  

    discussion   of   this   is   outside   the   scope   of   this   thesis.   To   put   it   briefly,   the  

    Bernoulli   equation   can  only   legitimately  be  used  when   all   airstreams  originate  

    from  the  same  source.  In  the  case  of  vocal  folds,  where  the  airstreams  above  and  

below the glottis have different origins (from the lungs below and the area above the glottis), Bernoulli's equation cannot be used to describe the behaviour of both airstreams simultaneously.

     

     

    Figure  2.2  –  Cross-‐section  of  the  human  speech  system  

     

    The  muscles   surrounding   the   glottis   alter   the   tension  of   the   vocal   folds.   Like   a  

    stringed  instrument,  a  change  in  tension  causes  slower  or  faster  oscillations  -‐  in  

    other  words,   a   change   in  pitch  or   frequency.  The   frequency  at  which   the  vocal  

    folds  oscillate  is  the  fundamental  frequency  of  any  voicing  produced.  The  terms  


    glottal   flow   and   glottal   flow   derivative   are   used   throughout   the   literature   to  

    describe  the  observed  glottal  pulse  waveform  obtained  via  inverse-‐filtering  and  

    its  numerical  derivative.  The  glottal  flow  derivative  waveform  takes  into  account  

    the  effects  of   lip  radiation,  which  can  be  modeled  as  a   first-‐derivative   filter  [9].  

    Figure  2.3  displays  the  glottal  flow  waveform  compared  with  its  derivative:  

     

    Figure  2.3  –  One  full  pitch  period  of  the  glottal  flow  waveform  and  its  numerical  

    derivative  

     

    The   waveform   displayed   above   is   an   approximation   of   the   true   acoustic  

    waveform,   using   the   Liljencrants-‐Fant   glottal   flow   model   [12].   This   is   a  

    mathematical  model  of  the  voice  source  waveform,  which  will  be  discussed  later  

    in  this  chapter.  Fant’s  earlier  work  on  the  acoustic  and  physical  properties  of  the  


    voice   source   [7]   [13]   highlighted   the   complex,   interactive  nature   of   the   role   of  

    the  vocal   folds  within  the  vocal  system.  He  showed  that   the  voice  source   is  not  

    merely  a  function  of  a  pitched  vocal  fold  vibration,  but  was  also  dependent  on  the  

    speaker’s  physiology,   impedance   load   from  sub-‐  and  supra-‐glottal  air  pressure,  

    and  even  the  current  vowel  being  spoken  [7].  
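The lip-radiation relationship introduced above (Figure 2.3) can be sketched numerically: the radiated output is obtained by passing the glottal flow through a first-derivative filter [9]. A minimal sketch, assuming numerical differentiation with `np.gradient` (the sampling rate value is illustrative):

```python
import numpy as np

def glottal_flow_derivative(flow, fs=44100):
    """Lip radiation approximated as a first-derivative filter [9]:
    differentiate the glottal flow to obtain the flow derivative.
    np.gradient keeps the output the same length as the input."""
    return np.gradient(flow) * fs
```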

     

    2.3  Voice  Types    

    The  voice  type  is  a  factor  of  voiced  speech  that  is  defined  by  the  voice  source.  The  

    speaker’s  age,  gender,  physiology,  mood  and  setting  all  contribute  to  the  overall  

    acoustic   properties   of   the   glottal   flow   waveform,   and   thus   the   overall   speech  

    output.  Childers  and  Lee  [14]  cite  six  distinct  voice  types:  modal  voice,  vocal  fry,  

    falsetto,   breathy   voice,   harshness   and   whisper.   In   their   study,   harshness   and  

    whisper  were   excluded   due   to   the   lack   of   periodicity   in   both   voice   types.   The  

    four   voice   types   are   presented   in   table   2.1,   along   with   inverse-‐filtered   voice  

    source  waveforms  and  their  approximated  LF-‐model  fits.    

       


     

Table 2.1 - Four voice types and their corresponding waveforms (LF-model fits to inverse-filtered glottal source recordings from [14]; the waveform figures are not reproduced here)

Modal: The most commonly used voice type for speech and singing. Also referred to as the 'typical' voice type; most cultures and languages make use of the modal voice for everyday phonation. Little to no turbulent airflow is present, meaning no high-frequency noise component in the waveform [14].

Vocal Fry: Commonly employed to achieve lower frequencies than is possible using a modal voice (although it can extend into the modal pitch range as well). Characterised by very short glottal bursts followed by a large closed quotient [14].

Breathy: During breathy voice phonation, the vocal folds do not fully seal the glottis, allowing an amount of turbulent airflow. This can be perceived as a high-frequency noise component during the closing and opening stages of the glottal flow cycle [14].

Falsetto: Created by vibrating only a small portion of the vocal folds, which allows the speaker/singer to achieve a much higher frequency range than the modal voice. A noise component is also present due to a lack of complete closure at the glottis [14].

     

    Childers  and  Lee  found  that  the  voice  type  could  be  characterised  by  four  main  

    factors,  namely  glottal  pulse  width,  glottal  pulse  skewness,  abruptness  of  glottal  

    closure,  and  turbulent  noise  [14].  ‘Glottal  pulse  width’  refers  to  the  portion  of  the  


    waveform  where  the  glottis  is  open,  also  known  as  the  open  quotient.    In  terms  of  

    the  glottal  flow  derivative,  the  open  quotient   ‘is  estimated  by  the  time  duration  

    between  a  positive  peak  and  the  next  adjacent  negative  peak’  [14].  Glottal  pulse  

    skewness  (or  the  speed  quotient)  refers  to  the  relationship  between  the  lengths  

    of   the   opening   phase   and   the   closing   phase.   Abruptness   of   glottal   closure   and  

    turbulent  noise  refer  to  the  steepness  of  the  return  phase  of  the  waveform  and  

    the  high  frequency  noise  created  by  airflow  through  the  glottis  respectively.  
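The open-quotient estimate quoted above can be expressed directly as a measurement on one period of the glottal flow derivative. A sketch only, assuming one clean pulse per array and no noise handling:

```python
import numpy as np

def open_quotient_estimate(dflow):
    """Open quotient as estimated in [14]: the time duration between a
    positive peak and the next adjacent negative peak of the glottal
    flow derivative, expressed here as a fraction of the period."""
    i_pos = int(np.argmax(dflow))
    i_neg = i_pos + int(np.argmin(dflow[i_pos:]))
    return (i_neg - i_pos) / len(dflow)
```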

     

    Table   2.2   shows   the   approximate   spectrum,   fundamental   pitch   range   and  

    turbulent  noise  properties  for  the  four  voice  source  types.    

       


Table 2.2 - Four voice types with spectral content, pitch range, and noise component information (spectrum diagrams taken from [14], not reproduced here)

Voice Type    Range (approx. male voice)    Noise Component
Modal         ~52-207 Hz                    None
Vocal Fry     ~24-94 Hz                     None
Breathy       ~52-207 Hz                    Noise present at around 5% of total signal
Falsetto      ~207-440 Hz                   Noise present at around 5% of total signal

     

    The  voice  source  type  (also  referred  to  as  ‘voice  quality’)  has  been  shown  to  play  

    a  major  role  in  the  perception  of  emotion  and  stress  in  speech  [15].  Whilst  there  

    have  been  many  empirical  studies  analysing  the  nature  of   these  voice  qualities,  

    Gobl   states   that   ‘very   few   …   have   focussed   on   the   voice   source   correlates   of  

    affective  speech’.  In  Gobl’s  study,  a  recording  of  an  utterance  spoken  in  Swedish  

    was  inverse-‐filtered  to  obtain  an  approximation  of  the  voice  source  waveform.  A  


    voice   source   model   was   then   fitted   to   this   approximation,   which   allowed   for  

    parameterisation  of  the  voice  source  to  fit  seven  voice  qualities.  The  voice  source  

    model  was   then  used   to  drive  a   formant  synthesiser,  and   the  original   recorded  

    phrase  was   resynthesised   for   each   voice   quality.   The   resynthesised   utterances  

    were   played   to   a   number   of   non-‐Swedish   speaking   subjects   (so   that   the  

    emotional  context  of   the  words  would  not   influence   the  subject’s  perception  of  

    emotion).   It   was   found   that   the   perceived   ‘tenseness’   of   the   voice   source  

    influenced   the   listener’s  perception  of  emotional   content   in   the  voice,  although  

    this  was   shown   to   be   far  more   effective   for   some   emotions   (relaxed/stressed,  

    bored,  intimate,  content)  than  others  (happy,  friendly,  sad,  afraid).    

     

    Chen  discusses  the  glottal  gap  phenomenon  in  [16].  This  is  a  feature  of  the  voice  

    source   that   occurs  when   the   glottis   does   not   fully   close,   such   as   in   breathy   or  

    falsetto  phonations.   It  was   found   that   the   size  of   the  glottal   gap   relative   to   the  

    pitch  cycle  affected  the  overall  speech  output  to  a  significant  degree,  in  terms  of  

    the  perceived  voice  quality.  Most  affected  were  the  spectral  tilt  and  the  turbulent  

    noise   component,   both   of   which   increased   proportionally   with   the   size   of   the  

    glottal  gap.  

     

    Though  not  a  distinct  voice  type  in  and  of  itself,  vocal  vibrato  is  a  common  vocal  

    feature   that   originates   at   the   voice   source,   primarily   used   in   singing.   Sung  

    phrases  are   typically  of   the  modal  or   falsetto  voice   types   (although  vocal   fry   is  

    somewhat   prevalent   in   pop   singing).   In   [17],   the   perceptual   benefits   of   vocal  

vibrato are discussed. One such benefit is the effective 'gluing' of partials, or harmonics, together. For example, while vocal sounds are generally perceived as


    a  homogenous  blend  of  harmonics,   it  has  been  shown  that,  at  a  fixed  pitch,   it   is  

    possible  to  discern  between  separate  partials  present   in  the  speech  signal  [17].  

    When  the  f0  is  constantly  varied,  as  in  vocal  vibrato,  these  separate  partials  are  

    ‘glued’  together  again.  Another  hypothesised  perceptual  effect  of  vocal  vibrato  is  

    the   increased   intelligibility   of   vowels   when   vibrato   is   present.   As   Sundberg  

    states,   it   is  reasonable   to  assume  that  as   the  harmonics  above  the   fundamental  

    frequency  undulate   in   time  with   the   f0,   those  harmonics  present  around  vowel  

    formant  frequencies  will  reinforce  the  perception  of  the  formant.  This   is  due  to  

    the   amplitude   modulation   of   these   harmonics   as   they   align   with   the   formant  

frequency. Counter-intuitively, further studies failed to prove this effect conclusively [18], although during the current research it was also found that subjective responses to a synthesised voice with varying pitch were much more favourable than to one with a constant f0.
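A vibrato pitch contour of the kind discussed above amounts to a slow sinusoidal modulation of the fundamental frequency. A minimal sketch; the 5.5 Hz rate and 50-cent depth are typical illustrative values, not figures taken from [17]:

```python
import numpy as np

def vibrato_f0(f0=220.0, rate_hz=5.5, depth_cents=50.0, dur=1.0, fs=44100):
    """Sketch of a vibrato pitch contour: sinusoidal modulation of f0,
    with the modulation depth expressed in cents."""
    t = np.arange(int(dur * fs)) / fs
    return f0 * 2.0 ** (depth_cents * np.sin(2 * np.pi * rate_hz * t) / 1200.0)
```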

     

    2.4  Modelling  the  Voice  Source    

    Any   source/modifier   approach   to   synthesising   the   human   voice   will   employ  

some form of voice source model. These range from the fairly rudimentary, such as a simple saw-wave or pulse-train [19] [20], to resynthesised human voice source waveforms obtained via inverse-filtering [21]. As Chen et al. point out, 'few

    studies   have   attempted   to   systematically   validate   glottal   models   perceptually,  

    and  model  development  has  focused  more  on  replicating  observed  pulse  shapes  

    than   on   perceptual   sufficiency’   [22].   Fitting   existing  models   to   observed   pulse  

    shapes  is  so  far  the  most  reliable  method  for  achieving  accurate  recreations,  due  

    to   the   impracticality   of   capturing   an   isolated   voice   source   waveform   using  


    conventional   recording  methods   [15]   -‐   this  has  been  attempted,  but   the  highly  

    invasive   procedure   involved   miniature   transducer   microphones   inserted  

    between   the   vocal   folds,   which   necessitated   the   use   of   local   anaesthetic   [23].  

    Inverse   filtered   glottal   pulse   signals   and   LX-‐waveforms   obtained   via  

    laryngoscope   [21]   [24]   are   the  most   common   references  used   for   voice   source  

    modelling.   This   sub-‐section   summarises   attempts  made   to   recreate   this   signal  

    using  mathematical  modelling  and  other  techniques.  

     

    2.4.1  Existing  Methods  for  Voice  Source  Synthesis  

     

    In  order  to  produce  the  formants  that  occur  in  natural  speech,  a  complex  source  

    waveform  with  sufficient  harmonics  must  be  used.  It  has  been  recognized  since  

    at   least   the   1970s   [25]   that   a   source   waveform   approximating   that   found   in  

    natural   speech   would   provide   the   most   accurate   speech   output.   While   it   is  

    possible   for   very   simple   formant   synthesisers   to   achieve   speech-‐like   results  

    using   saw-‐waves,   square  waves,   or   even  white   noise   as   an   input,   the   spectral  

    content   of   the   glottal   source   signal   is   of   significant   importance   to   the   overall  

    naturalness  of  the  synthesised  speech  content.  Rosenberg  was  one  of  the  first  to  

    compare   differing   methods   of   speech   synthesis   excitation   using   time-‐domain  

representations of the source waveform. He showed that, out of six wave shapes of varying complexity, a complex trigonometric waveform based on observations of glottal pulse movement and speech recordings was the most preferred in a listening test, when compared with a natural speech recording.

     


    In  [9],  the  distinction  is  made  between:    

    1.)  non-‐interactive  parametric  glottal  models  -‐  mathematical  models  that  assume  

    a  linear  separability  between  the  voice  source  and  vocal  tract,    

    2.)  interactive  mechanical  and  parametric  glottal  models,  which  are  based  on  the  

    interaction  between  the  vocal  source  and  the  rest  of  the  vocal  system,  either  via  a  

    mechanical  model  or  numerical  simulation,  and    

    3.)   physiological   glottal   models,   in   which   an   attempt   is   made   to   accurately  

    simulate  the  physical  properties  of  the  vocal  folds  in  three  dimensions.      

     

    Non-‐interactive  parametric  glottal  models  are  intuitively  the  simplest  to  achieve,  

    requiring  only  knowledge  of  the  voice  source  waveform  and  its  spectrum.  Early  

    studies   such   as   Rosenberg’s   [25]   confirmed   that   as   the   glottal   pulse   shape  

    approached   similarity  with   that   observed   through   inverse-‐filtering   techniques,  

    the   perceived   quality   of   voice   synthesis   improved.   In   these   early   studies,   the  

    glottal   flow   waveform   was   modeled,   rather   than   the   glottal   flow   derivative.  

    Liljencrants  and  Fant  [12]  were  one  of  the  first  to  apply  the  first-‐derivative  filter  

    to   the  glottal  pulse  model   in  order   to  simulate   the  effects  of   lip  radiation.  They  

    developed   a   parameterised  model   of   the   glottal   flow   derivative,   known   as   the  

    Liljencrants-‐Fant  or  LF  Model  which  is  now  the  most  commonly  used  among  the  

    non-‐interactive  parametric  models  [9].  Due  to  the  model’s  flexibility  and  ease  of  

    adaptation  to  existing  speech  source  waveforms,  it  has  been  widely  accepted  as  

    the  standard  voice  source  model  for  speech  processing  and  analysis  [14].  The  LF  

    model  has  provided  the  basis   for   this  research,  and  so  will  be   further  analysed  

    later   in   the   chapter.   Other   parameterised  models   of   the   glottal   flow  derivative  

    have   been   developed,   such   as   Fujisaki   and   Ljungqvist’s  model   [26]  which  was  


shown to be equally successful as the LF model in minimising the linear predictive error when directly compared with natural speech; however, due to the computational complexity of calculating Fujisaki and Ljungqvist's model, the LF model is generally favoured [9] [14].

     

    Cummings  et  al.  state  that  

     

    ‘although   simple   non-‐interactive   glottal   models   produce   intelligible  

    synthetic  speech  and  are  adequate  for  many  coding  and  analysis  tasks,  very  

    high-‐quality   speech   synthesis   and   complex   speech   analysis   necessitate   the  

    ability  to  model  glottal  excitation  more  accurately’  [9]  

     

    Cummings   summarises   these   methods,   which   are   achieved   numerically   or   via  

    equivalent-‐circuit   design.   Two   common   effects   of   source-‐tract   interaction   that  

    are  included  in  these  models  are  the  effects  of   low  first-‐formant  frequencies  on  

    the  vocal  tract’s  impedance  load  and  the  glottal  pulse  ripple.    

     

    The  most  complex  form  of  glottal  source  model  is  the  physiological  glottal  model.  

    Titze   and  Talkin   [27]   [28]   developed   a   four-‐parameter  mathematical  model   of  

    the  glottis  based  on  earlier  theoretical  work  by  Titze  [8]  [29].  This  is  essentially  a  

    mass-‐and-‐spring  mathematical  model  of  the  physiology  of  the  vocal  folds,  which  

    takes  into  account  the  following:  

    • ‘abduction   quotient,   a   measure   indicating   the   amount   of   adduction   or  

    abduction  of  the  vocal  folds,  


    • shape   quotient,   a   measure   of   the   shape   of   the   pre-‐phonatory   glottis  

    (converging,  diverging,  or  partly  converging  and  partly  diverging)  

    • bulging   quotient,   a   measure   representing   the   amount   of   medial   surface  

    bulging  of  the  vocal  folds,  and  

    • phase   quotient,   a   measure   of   the   phase   delay   between   the   upper   and  

    lower  edges  of  the  vocal  folds.’  [9]  

     Physiological   glottal   models   such   as   these   are   capable   of   creating   a   highly  

    sophisticated  representation  of  the  glottal  flow.  However,  a  precise  knowledge  of  

    glottal  physiology  is  required  in  order  to  use  models  such  as  these,  as  the  glottal  

    volume  velocity  waveform   is  an   indirect  result  of   the  model,  as  opposed   to   the  

    simpler  model   types  which   attempt   to   recreate   the   volume   velocity  waveform  

    directly.    

     

    2.4.2  The  Liljencrants-‐Fant  Glottal  Flow  Model  

     

    As  discussed,  the  Liljencrants-‐Fant  (or  LF-‐)  model  is  one  of  the  most  widely  used  

    glottal  flow  models  in  voice  synthesis  and  speech  processing  applications.  This  is  

    largely  due  to  its  relative  computational  ease  and  parameterisation.  Earlier  work  

    by   Gunnar   Fant   [7]   established   a   foundation   for   this   model   by   observing  

    predicted   glottal   flow   volume   velocity   waveforms   from   inverse-‐filtered  

    recordings   of   connected   speech.   Findings   from   this   study   allowed   Fant   to  

    develop   an   early   two-‐parameter   glottal   flow  model   (called   the   F-‐model).   This  

    early   model   comprised   a   rising   and   descending   branch   around   the   boundary  

    between  the  opening  and  closing  phases.  The  F-‐model  contained  a  discontinuity  


    at  the  flow  peak  (Fig.  2.4),  so  a  more  sophisticated  model  was  sought.  The  three-‐

    parameter  L-‐Model  developed  by  Liljencrants  was  used  as  a  starting  point.  

     

    The   advantage   of   the   L-‐model   over   the   F-‐model   is   its   continuity,  which  means  

    that  no  secondary  weak  excitations  are  present  in  the  acoustic  waveform.  The  L-‐

model also displayed less spectral ripple than the F-model. Neither model incorporated a term for the gradient of the return phase of the glottal pulse,

    which  was  found  to  be  crucial  for  modelling  certain  voice  types  and  phonations  

    [12].  For  example,  during  a  voiced  ‘H’  sound,  the  glottis  remains  open  for  most  of  

    the   pitch   cycle,   allowing   turbulent   airflow   to   create   the   high-‐frequency   noise  

    component  (also  observed  in  breathy  and  falsetto  voice  types).  In  order  to  model  

voice source effects such as these, an exponential return phase, whose gradient formed a fourth parameter, was added, based on observations by Liljencrants, Fant and Ananthapadmanabha [12] [7] [13].

     


     

    Figure  2.4  -‐  Comparisons  between  the  F-‐  and  L-‐  glottal  model  waveforms  (left)  and  

    their  derivatives  (right)  with  varying  values  of  Rd  –  a  ‘shape  parameter’  based  on  

    the  amplitude  and  position  of  the  positive  peak  -‐  taken  from  [12]  


     

    Figure  2.5  -‐  LF-‐model  glottal  flow  derivative  waveform  with  timing  parameter  

    annotations.  The  value  for  ta  is  the  distance  between  point  te  and  the  zero  crossing  

    of  the  derivative  of  the  return  curve  (red  line  on  graph).  

     

    Figures  2.4  and  2.5  show  one  pitch  period  of  each  of  the  aforementioned  glottal  

    models.   The   timing   parameters   tp,   te,   ta,   and   tc   are   shown   on   the   LF-‐model  

    diagram.  These   four   timing  parameters   can  be  modified   in  order   to   fit   existing  

    glottal  flow  measurements  for  speech  analysis,  or  to  synthesise  new  waveforms  

    in   order   to   simulate   different   voice   types   in   speech   synthesis.   These   timing  

    parameters   are   defined   as   a   percentage   of   the   overall   pitch   cycle   length   T0.  

Parameter tp describes the length of the opening phase of the cycle, i.e. when the vocal folds are moving apart and the glottis is opening; its end corresponds to the moment of maximum flow. te gives the timing of the negative peak in the waveform, which occurs at the beginning of the return phase. ta gives the effective duration of the return phase, calculated as the length of time between te and the zero-crossing of the derivative of the return slope at te. tc describes the length of the open phase, or the portion of the pitch cycle during which the vocal folds are in motion. If tc is less than T0, the remainder of the waveform between tc and T0 is

    known  as  the  closed  phase.  One  requirement  of  the  LF-‐model  is  that  the  overall  

    net  gain  of  flow  during  a  pitch  period  must  equal  zero:  

     

\int_0^{T_0} LF(t)\, dt = 0 \qquad [2.1]

    The   waveform   is   calculated   in   two   stages.   The   first   stage   involves   an  

exponentially growing sinusoid between the moment of glottal opening (t = 0) and the negative peak at t = te. An exponential component describes the second

    stage  -‐   the  return  phase  between  te  and  tc.  The  two  equations  for  the  LF-‐model  

    waveform  can  be  written  as  

     

LF(t) = E_0\, e^{\alpha t} \sin(\omega_g t), \qquad 0 \le t \le t_e \qquad [2.2]

LF(t) = -\frac{E_e}{\varepsilon t_a} \left( e^{-\varepsilon (t - t_e)} - e^{-\varepsilon (t_c - t_e)} \right), \qquad t_e \le t \le t_c \le T_0 \qquad [2.3]

     

where E0 describes the maximum positive flow, Ee the maximum negative flow, α and ωg are respectively the exponential growth factor and the angular frequency of the sinusoidal component, and ε is the exponential time constant of the return phase. In order to maintain the area balance condition described in equation 2.1, ε, E0 and α are solved iteratively so that the following conditions hold:

\varepsilon = \frac{1 - e^{-\varepsilon (t_c - t_e)}}{t_a} \qquad [2.4]

E_0 = \frac{-E_e}{e^{\alpha t_e} \sin(\omega_g t_e)} \qquad [2.5]

    (analysis  of  LF-‐Model  equation  based  on  Jack  Mullen’s  summary  [30])  
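The two-stage waveform and its constraints can be sketched as follows. This is an illustrative implementation only: the timing parameters are taken as fractions of T0 (defaults are the modal-voice values from Table 2.3 below), ε is solved from equation 2.4 by fixed-point iteration, and α is found by a simple bisection on the area balance of equation 2.1 — the bisection is an assumption about the solution method, not the procedure used in [12]:

```python
import numpy as np

def lf_pulse(f0=110.0, tp=0.413, te=0.554, ta=0.004, tc=0.582,
             Ee=1.0, fs=44100):
    """One pitch period of the LF-model glottal flow derivative (sketch).

    tp, te, ta, tc are fractions of the period T0; Ee is the magnitude
    of the negative peak at te."""
    T0 = 1.0 / f0
    tp, te, ta, tc = tp * T0, te * T0, ta * T0, tc * T0
    wg = np.pi / tp                      # sinusoid peak (maximum flow) at tp

    # eq. 2.4: eps*ta = 1 - exp(-eps*(tc - te)), solved by fixed-point iteration
    eps = 1.0 / ta
    for _ in range(50):
        eps = (1.0 - np.exp(-eps * (tc - te))) / ta

    t1 = np.arange(0.0, te, 1.0 / fs)    # open phase samples
    t2 = np.arange(te, tc, 1.0 / fs)     # return phase samples
    ret = -(Ee / (eps * ta)) * (np.exp(-eps * (t2 - te))
                                - np.exp(-eps * (tc - te)))

    def net_area(alpha):
        # open phase scaled so that LF(te) = -Ee, i.e. eq. 2.5 folded in
        opn = -Ee * np.exp(alpha * (t1 - te)) * np.sin(wg * t1) / np.sin(wg * te)
        return (opn.sum() + ret.sum()) / fs

    # eq. 2.1: bisect on alpha until the pulse integrates to (almost) zero
    lo, hi = -2000.0, 20000.0
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if net_area(mid) > 0 else (lo, mid)
    alpha = 0.5 * (lo + hi)

    opn = -Ee * np.exp(alpha * (t1 - te)) * np.sin(wg * t1) / np.sin(wg * te)
    closed = np.zeros(int(round(T0 * fs)) - len(t1) - len(t2))
    return np.concatenate([opn, ret, closed])
```

Note how the continuity condition is enforced implicitly: scaling the open phase so that LF(te) = -Ee makes it join the return phase, which also starts at -Ee once equation 2.4 holds.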

     

    By  manipulating  the  values  of   the  timing  parameters  (tc,  te,  tp,  ta),   the  LF-‐model  

can be modified to describe certain voice types, or matched to pre-recorded

    voice  source  data.  ‘Voice  quality  factors:  Analysis,  synthesis,  and  perception’  [14]  

    is  an  example  of  one  of   the  many  studies   into  voice  synthesis  and  analysis   that  

    have  used  the  LF-‐model  in  an  attempt  to  synthesise  different  voice  types,  as  well  

    as  establish  the  role  of  various  LF-‐parameters  in  terms  of  the  perception  of  the  

    synthesised   voice.   Beginning  with   inverse-‐filtered   speech  waveforms   and   data  

    from  electroglottographic  recordings,  Childers  &  Lee  [14]  analysed  the  spectral  

    content  and  waveform  characteristics  of  four  voice  types    -‐  modal,  breathy,  vocal  

    fry   and   falsetto.   From   earlier   studies   [31],   it   was   found   that   the   LF-‐model  

    provided  a  convenient  and  efficient  basis  from  which  to  recreate  the  timing  and  

    spectral  characteristics  of  the  four  voice  types.  By  adjusting  LF-‐parameters  to  fit  

the initial recordings, then optimising the LF-model estimate using a least-mean-squared error criterion, average values of the LF-parameters for different voice

    types  were  found:  

Table 2.3 - Four male voice types and corresponding timing parameter values (taken from [14])

             te (%)    tp (%)    ta (%)    tc (%)
Modal        55.4      41.3      0.4       58.2
Breathy      57.5      45.7      0.9       100
Vocal Fry    59.6      48.1      0.27      72
Falsetto     89        62        4.3       n/a
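Transcribed into code, the table's timing parameters (as fractions of T0) read as follows; the value of tc for falsetto was not reported in [14], so a full open phase (1.0) is used here as a stand-in and is an assumption:

```python
# Timing parameters as fractions of T0, transcribed from Table 2.3
# (male voices, from Childers & Lee [14]).
VOICE_TYPES = {
    "modal":     dict(te=0.554, tp=0.413, ta=0.004,  tc=0.582),
    "breathy":   dict(te=0.575, tp=0.457, ta=0.009,  tc=1.0),
    "vocal fry": dict(te=0.596, tp=0.481, ta=0.0027, tc=0.72),
    "falsetto":  dict(te=0.89,  tp=0.62,  ta=0.043,  tc=1.0),  # tc assumed
}
```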

     


    By  modifying  each  timing  parameter  in  turn  followed  by  the  overall  pulse  width  

    (open   quotient   or   OQ)   and   pulse   skewing   (speed   quotient   or   SQ),   keeping   all  

    other   parameters   fixed,   and   synthesising   short   vowels   using   a   Klatt   formant  

    synthesiser   [32],   it   was   possible   to   evaluate   the   perceptual   effects   of   each  

    parameter,   and   to   establish  which  were  most   useful   for   synthesising   different  

    voice  types.  Criteria  for  simulating  hypo-‐/hyperfunction  (lax/tense  vocal  quality)  

    were   established   in   the   time   and   frequency   domains,   with   a   high   SQ   creating  

    more   high   frequency   energy,   contributing   to   a   perceptually   more   tense   voice  

    quality.   This   study   also   incorporated   a   noise   generator,   in   order   to   simulate  

breathiness. It was found that white noise, high-pass filtered at 2 kHz and added to the LF-model signal, contributed to the perception of breathiness. Modulating the

    noise   signal’s   amplitude   so   that   it   was   present   during   50%   of   the   pitch   cycle  

    (roughly  lining  up  with  the  closed  part  of  the  vocal  fold  oscillation),  with  a  noise-‐

    signal  ratio  of  0.25%,  provided  the  best  results   for  simulating  breathiness.  This  

    study  confirmed  the  importance  of  a  variation  in  voice  source  quality  in  natural  

    speech  synthesis,  and  concludes  that  ‘various  intonation  and  stress  patterns  may  

    be   correlated   to   source   parameters   other   than   fundamental   frequency   and  

    timing’  [14].  This  idea  is  the  primary  concept  behind  the  current  research,  which  

    is   aimed   at   developing   a   more   natural,   dynamic   and   user-‐configurable   voice  

    source  for  voice  synthesis  applications.    
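The breathiness scheme described above — high-pass filtered white noise, gated on for half of each pitch cycle — can be sketched as follows. The first-order filter design, the gain value, and the choice of the second half-cycle for the gate are assumptions; [14] specifies only the 2 kHz cutoff, the 50% duty cycle, and a small noise-to-signal ratio:

```python
import numpy as np

def breathy_noise(n_samples, period_samples, fs=44100,
                  cutoff=2000.0, ratio=0.25, seed=0):
    """Sketch of a breathiness component: white noise, high-pass
    filtered at `cutoff`, present during 50% of each pitch cycle."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(n_samples)

    # simple one-pole/one-zero high-pass: y[n] = k*(y[n-1] + x[n] - x[n-1])
    k = np.exp(-2.0 * np.pi * cutoff / fs)
    hp = np.empty_like(noise)
    prev_x = prev_y = 0.0
    for i, x in enumerate(noise):
        prev_y = k * (prev_y + x - prev_x)
        prev_x = x
        hp[i] = prev_y

    # gate: noise audible only during half of each pitch cycle
    phase = (np.arange(n_samples) % period_samples) / period_samples
    gate = (phase >= 0.5).astype(float)
    return ratio * gate * hp
```

In practice this signal would be summed with the LF-model output, with the gate aligned to the closed portion of the vocal fold oscillation as described in [14].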

     

    Whilst   the   least-‐mean-‐squared-‐error   technique   described   in   [14]   and   [33]  

    provides   a   close   fit   of   the   LF-‐model   waveform   to   a   glottal   source   recording,  

    further  research  has  been  undertaken  to  optimise  the  timing  parameter  values  in  

    order  to  more  accurately  recreate  voice  source  qualities  [34]  [21]  [24].  One  such  


    method  is  described  in  [21],  known  as  Extended  Kalman  Filtering  (EKF).  EKF  is  

    an   iterative   error   correction   method   that   makes   use   of   a   priori   estimates   to  

    converge  on  an  optimum  estimate.  By  incorporating  the  EKF  equations  in  those  

    describing   the  LF-‐model,   it   is   possible   to   calculate  𝛼  and  𝜀  values   to   achieve   an  

    optimum  model   fit.  Further  research   into  EKF  techniques   for  model   fitting  also  

    generated  a   time-‐domain   fitting  algorithm  using  EKF   that  was   shown   to  be   far  

    more   accurate   than   a   previously   used   standard   algorithm   [35].   The   timing  

    parameters   described   in   [21]   were   originally   obtained   from   [24],   which  

    describes  the  use  of  a  pitch-‐synchronous  model-‐based  glottal  source  estimation  

    method   to   obtain   an   accurate   set   of   mean   values   for   LF-‐parameters   from   an  

    inverse-‐filtered  glottal  source  waveform.  

     

    In   [36]   the   many   parameters   used   to   describe   the   glottal   source   model   are  

    investigated   and   their   importance   in   terms   of   vocal   quality   perception   is  

    explored.   It   is   acknowledged   that   ‘the   closing   phase   constitutes   the   main  

excitation of the vocal tract'. The closing phase is the portion of the pitch period from the negative peak of the flow derivative to the point of glottal closure, and is commonly characterised by the normalised amplitude quotient (NAQ). The authors recommend varying the NAQ to obtain the largest and most perceptually effective variation in voice type. These findings are

    corroborated  in  [37]  [38]  and  [39].  
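NAQ is conventionally computed from one pitch period of glottal flow as the AC flow amplitude divided by the product of the negative peak of the flow derivative and the period length. A minimal sketch (numpy assumed; the raised-cosine 'flow pulse' is a crude stand-in, not a realistic glottal waveform, which would be asymmetric):

```python
import numpy as np

def naq(glottal_flow, fs, f0):
    """Normalised amplitude quotient of one pitch period of glottal flow:
    NAQ = f_ac / (d_peak * T), where f_ac is the peak-to-peak flow
    amplitude, d_peak the magnitude of the negative peak of the flow
    derivative, and T the period length in seconds."""
    T = 1.0 / f0
    d = np.diff(glottal_flow) * fs            # flow derivative
    f_ac = glottal_flow.max() - glottal_flow.min()
    d_peak = -d.min()                         # negative peak magnitude
    return f_ac / (d_peak * T)

# Synthetic single period at 100 Hz: a raised cosine standing in for a
# glottal flow pulse.
fs, f0 = 44100, 100.0
n = int(fs / f0)
flow = 0.5 * (1 - np.cos(2 * np.pi * np.arange(n) / n))
value = naq(flow, fs, f0)
print(value)
```

Because NAQ is normalised by the period length, it characterises the relative sharpness of glottal closure independently of fundamental frequency, which is what makes it useful as a single control for perceived voice quality.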

       


    2.5  Vocal  Tract  Modelling    

    As   well   as   the   voice   source,   the   physical   properties   of   the   vocal   tract   can   be  

    mathematically  modeled  in  order  to  recreate  its  acoustic  effects.  This  method  of  

    voice   synthesis   falls   into   the   category   of   ‘articulatory   speech   synthesis’.   [40]  

    gives  the  following  definition  for  articulatory  synthesis:  

     

    ‘Articulatory  speech  synthesis  models  the  natural  speech  production  process  

    as  accurately  as  possible.  This  is  accomplished  by  creating  a  synthetic  model  

    of  human  physiology  and  making  it  speak.’  [40]  

     

    Palo  acknowledges  that  articulatory  speech  synthesis  methods  are  less  effective  

    at  creating  intelligible  speech  when  compared  with  concatenative  synthesis,  but  

    vastly  more   flexible   in   terms   of   the   range   of   speech-‐like   vocalisations   that   are  

    available.  The  first  example  of  synthesised  speech  created  by  a  vocal  tract  model  

    was   developed   by   Kelly   and   Lochbaum   in   the   1960s   [41].   This   was   a   fully  

    digitised   acoustic  model   of   the  human  vocal   tract,   achieved  by  discretising   the  

    vocal   tract   into   a   series   of   concatenated   tubes   (fig.   2.6).   The   travelling   wave  

    solution  for  each  tube  was  obtained,  and  then  digitised  using  Nyquist’s  sampling  

    theorem.  Vocal  tract  area  data  was  obtained  via  x-‐ray  for  several  vowel  sounds,  

and the cross-sectional area of each tube section of the model was proportional to the corresponding vocal tract area. This was one of the first and most

    enduring  examples  of  physical  modelling  synthesis.  


     

    Figure  2.6  -‐  Representation  of  the  vocal  tract  idealized  as  a  series  of  concatenated  

    acoustic  tubes,  with  glottis  end  at  the  left  and  lips  at  the  right.  Note  that  the  ‘bend’  

    in  the  vocal  tract  that  occurs  above  the  glottis  is  not  included  in  this  

    representation.  

     

    The   advances   made   in   computing   by   the   1980s   meant   that   new   methods   of  

    physical  modelling   synthesis  were   being   experimented  with.  One   such  method  

    that  had  implications  for  vocal  tract  modelling  was  digital  waveguide  synthesis.  

    Julius   Orion   Smith   III   describes   the   early   conception   of   the   one-‐dimensional  

    digital   waveguide   in   [42].   As   d’Alembert   first   pointed   out,   the   vibration   of   an  

    ideal   string   can   be   described   as   the   sum   of   two   travelling   waves   going   in  

    opposite  directions  [43].  The  conception  of  the  digital  waveguide  is  based  on  this  

    principle.   A   digital   waveguide   is   essentially   a   bi-‐directional   digital   delay   line,  

    with   the   sample   propagation   travelling   in   opposite   directions   (fig.   2.7).   This  

    approach   allows   for   an   efficient  discrete-‐time   simulation  of   the   traveling  wave  

    solution,   which   can   be   used   to   model   ‘any   one-‐dimensional   linear   acoustic  

    system  such  as  a  violin  string,  clarinet  bore,  flute  pipe,  trumpet-‐valve  pipe,  or  the  

    like’    [44].  Terminations  and  changes  in  impedance  along  the  acoustic  system  can  

    be  modelled  using  boundary  conditions  and  scattering   junctions.  A  termination  


    (for  example  a  bridge  on  a  guitar)  can  be  modelled  simply  by  inverting  the  phase  

    of   the   incoming   signal,   which   acts   as   a   total   reflection   of   the   displaced   wave.  

    Changes   in   impedance   are   modelled   using   the   Kelly-‐Lochbaum   scattering  

    junction.   Conservation   of   mass   and   energy   dictates   that   for   a   change   in  

    impedance  (such  as  from  a  narrow  to  a  wide  section  of  tube),  the  pressure  and  

    volume  velocity  variables  of   the   travelling  wave  must  be   continuous   [44].  This  

    means  that  some  of  the  acoustic  energy  will  be  transmitted  across  the  impedance  

    discontinuity,  and  the  remainder  will  be  reflected  back.  This  is  achieved  digitally  

    via  the  scattering  junction.    
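The structure described above can be sketched as two bi-directional delay lines joined by a Kelly-Lochbaum scattering junction (numpy assumed; the section lengths, impedances, and the 0.9 'lip' reflection coefficient are arbitrary illustration values, not taken from any vocal tract data):

```python
import numpy as np

# Two tube sections of differing impedance, each a bi-directional delay
# line, joined by a Kelly-Lochbaum scattering junction. Terminations
# reflect with phase inversion; the 0.9 'lip' reflection is arbitrary.

N1, N2 = 20, 30                    # delay-line lengths (samples)
Z1, Z2 = 1.0, 3.0                  # acoustic impedances of the sections
k = (Z2 - Z1) / (Z2 + Z1)          # junction reflection coefficient

r1, l1 = np.zeros(N1), np.zeros(N1)   # right/left-going rails, section 1
r2, l2 = np.zeros(N2), np.zeros(N2)   # right/left-going rails, section 2

r1[0] = 1.0                        # unit impulse excitation at the left end
out = []
for _ in range(400):
    a, b = r1[-1], l2[0]           # waves arriving at the junction
    p_l, p_r = l1[0], r2[-1]       # waves arriving at the terminations
    into2 = (1 + k) * a - k * b    # Kelly-Lochbaum scattering:
    into1 = k * a + (1 - k) * b    # partial transmission and reflection
    r1[1:] = r1[:-1]; r2[1:] = r2[:-1]   # advance the delay lines
    l1[:-1] = l1[1:]; l2[:-1] = l2[1:]
    r2[0], l1[-1] = into2, into1
    r1[0] = -p_l                   # left end: total inverting reflection
    l2[-1] = -0.9 * p_r            # right end: partial inverting reflection
    out.append(p_r + l2[-1])       # pressure radiated at the right end
print(max(abs(v) for v in out))
```

The impulse bounces between the terminations, losing a little energy at the partially reflecting right end each pass, while the junction splits every wave that crosses the impedance discontinuity into transmitted and reflected components.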

     

    Figure  2.7  -‐  1D  Digital  Waveguide  Structure.  Sample  delay  units  (marked  z-‐1)  

    propagate  an  input  signal  in  left  and  right  directions,  with  changes  in  impedance  

    modelled  by  attenuating  the  signal  between  delay  units.  Sampling  points  extract  

the current sample at a particular point along the DWG – similar to a pickup on a guitar string.

     

    The   1D   digital   waveguide  models   changes   in   cross-‐sectional   area   in   the   vocal  

    tract   as   a   series   of   impedance   changes   in   a   1D   linear   acoustic   system.   A   2D  

    extension   of   this   method,   known   as   the   2D   Digital   Waveguide   Mesh   (DWM)  


    models   the   same   cross-‐sectional   area   function   as   a   2D   plane,  with  width-‐wise  

    delay  lines  of  varying  length,  as  seen  in  figures  2.8  and  2.9.  

     

    Figure  2.8  -‐  Achieving  a  cross-‐sectional  area  function  from  MRI  data.  Note  the  lack  

    of  nasal  cavity  in  the  vocal  tract  cross-‐section  and  the  ‘straightening’  of  the  track  

    when  converted  to  an  area  function.  

     

    Figure  2.9  –  2D  Digital  Waveguide  Mesh  structure  with  impedance  mapping,  with  

    glottis  end  at  the  left  and  lips  at  the  right  (red  indicates  a  high  impedance,  creating  

    effective  ‘boundaries’  [highlighted  in  blue])  

     


    The  2D  DWM  was  developed  by  Van  Duyne  and  Smith   in   the   early  1990s   [45]  

    with   further  development   at   the  Audio  Lab   at   the  University   of   York   [46]   [47]  

    [48]   [49]   [50]   [51].   The   DWM   structure   is   ideally   suited   to   modelling   the  

    propagation  of   acoustic  waves   across   a  membrane  or  plate,   although   the   extra  

    dimensionality  is  also  an  advantage  over  the  1D  waveguide  for  modelling  other  

    acoustic   systems.   In   the   example   of   vocal   tract   modelling,   the   cross-‐sectional  

tract area can be modelled directly as a number of waveguide points across the width,

    as   opposed   to   the   1D   solution,   which   requires   a   conversion   from   area   to  

    impedance.   Inputs   and   outputs   to   the   system   can   also   be   included   at   spatially  

    meaningful  points  on  the  mesh,  due  to  the  analogous  topography  of  the  mesh  to  

    the  modeled  surface  [46].  As  Mullen  points  out,    

     

    ‘it  should  be  noted  that  the  magnitude  of  vibrations  is  the  physical  variable  

    under  simulation  that  would  be  observed   in  the  real-‐world  system.  The  bi-‐

    directional  travelling  wave  components  are  a  hypothetical  consideration  to  

    facilitate  propagation’.  [30]  

     

Waveguide mesh topologies are not limited to a grid layout as illustrated

    above,  and  other  arrangements  of  delay  lines  and  scattering  junctions  have  been  

    experimented  with  [50].  

     

    An  extensive  study  into  vocal  tract  modelling  using  the  2D  DWM  is  described  in  

    [30].  This  thesis  describes  the  theory  behind  digital  waveguide  mesh  modelling,  

    and  its  application  to  vocal  tract  modelling.  It  also  describes  the  development  of  

    a  novel  method  of  modelling  dynamic  area  function  changes  in  real  time,  known  


    as  dynamic  impedance  mapping.  Conventional  waveguide  mesh  structures  follow  

    the  layout  of  the  acoustic  area  they  are  modelling.  Vocal  tract  modelling  requires  

    a  more   flexible  method,   as   the   layout   is   constantly   changing  depending  on   the  

    current   articulation.   Dynamic   impedance   mapping   allows   the   mesh   size   and  

    shape   to   remain   constant,  while  manipulating   the   impedances   at   each   node   to  

    effectively   alter   the   shape   of   the   area   through   which   acoustic   energy   can  

    propagate.  This  is  much  less  computationally  expensive  than  altering  the  layout  

    of  the  mesh  at  each  sample  step,  and  allows  for  real-‐time,  dynamic  articulations.  
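The principle of dynamic impedance mapping can be sketched as follows: the mesh arrays keep a fixed size, and articulation is simulated purely by rewriting impedance values in place (numpy assumed; all area values below are invented for illustration and are not real vocal tract data):

```python
import numpy as np

# The mesh arrays never change size; articulation is simulated by
# rewriting the impedance values in place. All area values below are
# invented purely for illustration.

NX, NY = 60, 16                           # fixed mesh dimensions
Z = np.ones((NX, NY))                     # impedance map, allocated once

area_a = np.linspace(2.0, 8.0, NX)        # one vowel-like area function
area_b = np.linspace(8.0, 2.0, NX)        # another articulation

def update_map(Z, area):
    """Rewrite the impedance map in place from a 1D area function
    (impedance taken as the inverse of cross-sectional area)."""
    Z[:, :] = (1.0 / area)[:, None]

# Glide between the two articulations over ten 'frames' -- no mesh
# reallocation, only impedance updates:
for w in np.linspace(0.0, 1.0, 10):
    update_map(Z, (1 - w) * area_a + w * area_b)

print(Z.shape)
```

Because only the values in an already-allocated array change, the per-frame cost is a single vectorised write rather than a rebuild of the mesh topology, which is what makes real-time articulation feasible.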

     

    The   process   for   vocal   tract   modelling   using   the   digital   waveguide   mesh   is   as  

    follows:    

    1. Obtain   cross-‐sectional   area   function   data   of   the   vocal   tract   for   a   set   of  

    specific   vowels.   This   is   achieved   using   a   magnetic   resonance   imaging  

    (MRI)  machine  (see  Figure  2.8  above).    

    2. Convert   area   function  data   to   a   series   of   discrete   area   values   at   regular  

    intervals  along  the  tract.  

    3. Calculate   size   of   a   single   waveguide.   This   is   related   to   the   theoretical  

distance an acoustic wave would propagate during one sample period. It is

    calculated  using  the  following  formula:    

     

√2 × c / fs
[2.6]
Where c is the speed of sound and fs is the sampling frequency.


    4. Calculate   the   size   of   the   waveguide   mesh   in   terms   of   the   number   of  

individual waveguides in the x and y directions. The average male vocal tract is approximately 17.5 cm long and 5 cm wide.

    5. Interpolate  area  function  data  from  original  number  of  values  to  number  

    of   waveguides   in   x   direction.   Invert   each   value   to   obtain   the   related  

    impedance  value  for  each  cross-‐section.  

    6. The   impedances   of   the   width-‐wise   waveguides   (y   direction)   are  

    calculated  using  a  raised-‐cosine  area  function  (fig.  2.10).  This  was  found  

    to  be  the  ideal  solution  for  maintaining  an  open  ‘channel’  in  the  middle  of  

    the  mesh  (i.e.  at  minimum  impedance)  with  maximum  impedance  at   the  

    outer   edges   of   the   impedance   map.   This   means   at   each   point   in   the   x  

    direction,   for   a   DWM   n   waveguides   wide,   a   raised   cosine   function   of   n  

    samples  is  created.  Each  point  in  the  y  direction  is  assigned  an  impedance  

    value  based  on  the  corresponding  raised  cosine  value.    

     

    Figure  2.10  -‐  Approximation  of  raised  cosine  function,  with  minimum  impedance  

    (Zmin)  at  the  centre  and  maximum  (Zmax)  at  the  edges  

     


    7. The   averages   between   adjacent   points   in   the   mesh   are   taken,   and   the  

    impedance  map  is  updated  based  on  these  averages.  The  pressure  at  the  

    current   junction   (average   of   all   surrounding   points)   is   taken,   and   the  

    outgoing  pressures  are  calculated.    

    8. At  every  timestep,  the  incoming  pressures  to  each  junction  are  calculated  

    based  on  the  previous  pressure  values  at  surrounding  points.  Boundaries  

    are  modeled   in   the   same  way   as   a   termination   in   a   1-‐D  waveguide,   for  

    each  outer  point  in  the  mesh.  At  the  glottis  end,  the  incoming  pressure  for  

    each   junction   is   excited  with   the   current   sample   of   the   input  waveform  

(i.e. the voice source).

    9. Finally,  the  output  pressure  is  taken  as  the  sum  of  all  rightmost  junctions  

    multiplied  by  the  lip  radiation.  
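Steps 3 to 6 above might be sketched as follows (numpy assumed; the area function values are placeholders rather than MRI-derived data, and the wall impedance factor of 100 is an arbitrary choice):

```python
import numpy as np

c, fs = 343.0, 44100.0                    # speed of sound (m/s), sample rate

# Step 3: waveguide element size (sqrt(2) * c / fs for a 2D mesh)
d = np.sqrt(2.0) * c / fs

# Step 4: mesh dimensions from an average male tract (17.5 cm x 5 cm)
nx = int(round(0.175 / d))
ny = int(round(0.05 / d))

# Step 5: interpolate a (placeholder) area function to nx points and
# invert each value to obtain an impedance per cross-section.
raw_area = np.array([2.0, 1.5, 1.0, 2.5, 6.0, 8.0, 7.0])    # cm^2, made up
area = np.interp(np.linspace(0, 1, nx),
                 np.linspace(0, 1, len(raw_area)), raw_area)
z_line = 1.0 / area

# Step 6: expand each cross-section into a raised-cosine profile across
# the mesh width -- minimum impedance in the centre 'channel', maximum
# at the outer edges (the wall factor of 100 is an arbitrary choice).
z_max = z_line.max() * 100.0
x = np.arange(ny) / (ny - 1)
profile = 0.5 * (1 + np.cos(2 * np.pi * x - np.pi))   # 0 at edges, 1 centre
Z = z_max - (z_max - z_line[:, None]) * profile[None, :]

print(nx, ny, Z.shape)
```

At a 44.1 kHz update rate the resulting mesh is only a handful of waveguides wide, which is one reason DWM vocal tract models are typically run at much higher internal sample rates than the audio output rate.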

     

    The   impedance-‐mapped  2D  DWM  was   excited  with  Gaussian   noise   to   obtain   a  

    frequency   response   for   several   vowel   area   functions.   The   results   showed   that  

    formant   frequencies   obtained   from   the   2D   DWM   varied   in   accuracy   when  

    compared  with  average   formant   frequency  values   for  male   speakers.  For   some  

    vowels,  the  2D  DWM  formant  frequencies  were  less  in  line  with  average  values  

    than   the   1D   waveguide   counterpart.   It   is   acknowledged   that   these   average  

    formant   values   are   not   definitive,   and   the   strongest   case   for   vowel   accuracy  

    would   be   a   perceivable   similarity   to   the   simulated   vowel,   based   on   subjective  

    listening  results.    
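The evaluation approach described above can be illustrated with a self-contained sketch, substituting a pair of two-pole resonators with invented 'formant' frequencies for the 2D DWM (numpy assumed; all frequencies, bandwidths, and thresholds are illustrative choices):

```python
import numpy as np

# Excite a system with Gaussian noise, estimate its frequency response,
# and read off formant peaks. A pair of two-pole resonators (formants
# invented at 700 Hz and 1200 Hz) stands in for the 2D DWM here.

fs = 16000
rng = np.random.default_rng(0)
x = rng.standard_normal(fs * 2)           # 2 s of Gaussian noise

def resonate(x, f, bw):
    """Filter x through a two-pole resonator (centre f Hz, bandwidth bw Hz)."""
    r = np.exp(-np.pi * bw / fs)
    a1, a2 = -2 * r * np.cos(2 * np.pi * f / fs), r * r
    y = np.zeros_like(x)
    for n in range(len(x)):
        y[n] = x[n] - a1 * y[n - 1] - a2 * y[n - 2]
    return y

y = resonate(resonate(x, 700.0, 80.0), 1200.0, 80.0)

# Averaged periodogram of the noise response
nseg, hop = 2048, 1024
win = np.hanning(nseg)
psd = np.mean([np.abs(np.fft.rfft(y[i:i + nseg] * win)) ** 2
               for i in range(0, len(y) - nseg, hop)], axis=0)
freqs = np.fft.rfftfreq(nseg, 1.0 / fs)

# Estimated 'formants': strongest bin below and above 1 kHz
f1 = freqs[np.argmax(np.where(freqs < 1000, psd, 0.0))]
f2 = freqs[np.argmax(np.where(freqs >= 1000, psd, 0.0))]
print(f1, f2)
```

The same measurement applied to a DWM output would yield the formant frequencies compared against speaker averages in the study above; averaging several windowed periodograms suppresses the variance of the noise excitation.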

     

    The   increased   dimensionality   introduced   by   the   2D   DWM   allows   for   more  

    accurate  plane-‐wave  propagation  simulations  than  the  1D  counterpart.  However,  


    the  impedance  mapping  of  the  2D  DWM  is  based  on  the  same  1D  area  function  

    data.  The  effects  of  the  curve  in  the  vocal  trac