
Analysis and Signal Processing of Oesophageal and Pathological Voices

Guest Editors: Juan Ignacio Godino-Llorente, Pedro Gómez-Vilda, and Tan Lee

EURASIP Journal on Advances in Signal Processing


Copyright © 2009 Hindawi Publishing Corporation. All rights reserved.

This is a special issue published in volume 2009 of “EURASIP Journal on Advances in Signal Processing.” All articles are open access articles distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


Editor-in-Chief
Phillip Regalia, Institut National des Telecommunications, France

Associate Editors

Adel M. Alimi, Tunisia
Kenneth Barner, USA
Yasar Becerikli, Turkey
Kostas Berberidis, Greece
Jose Carlos Bermudez, Brazil
Enrico Capobianco, Italy
A. Enis Cetin, Turkey
Jonathon Chambers, UK
Mei-Juan Chen, Taiwan
Liang-Gee Chen, Taiwan
Huaiyu Dai, USA
Satya Dharanipragada, USA
Kutluyil Dogancay, Australia
Florent Dupont, France
Frank Ehlers, Italy
Sharon Gannot, Israel
M. Greco, Italy
Irene Y. H. Gu, Sweden
Fredrik Gustafsson, Sweden
Ulrich Heute, Germany
Sangjin Hong, USA
Jiri Jan, Czech Republic
Magnus Jansson, Sweden
Sudharman K. Jayaweera, USA
Soren Holdt Jensen, Denmark
Mark Kahrs, USA
Moon Gi Kang, South Korea
Walter Kellermann, Germany
Lisimachos P. Kondi, Greece
Alex Chichung Kot, Singapore
C.-C. Jay Kuo, USA
Ercan E. Kuruoglu, Italy
Tan Lee, China
Geert Leus, The Netherlands
T.-H. Li, USA
Husheng Li, USA
Mark Liao, Taiwan
Y.-P. Lin, Taiwan
Shoji Makino, Japan
Stephen Marshall, UK
C. Mecklenbrauker, Austria
Gloria Menegaz, Italy
Ricardo Merched, Brazil
Marc Moonen, Belgium
Vitor Heloiz Nascimento, Brazil
Christophoros Nikou, Greece
Sven Nordholm, Australia
Patrick Oonincx, The Netherlands
Douglas O’Shaughnessy, Canada
Bjorn Ottersten, Sweden
Jacques Palicot, France
Ana Perez-Neira, Spain
Wilfried Philips, Belgium
Aggelos Pikrakis, Greece
Ioannis Psaromiligkos, Canada
Athanasios Rontogiannis, Greece
Gregor Rozinaj, Slovakia
Markus Rupp, Austria
William Allan Sandham, UK
Bulent Sankur, Turkey
Ling Shao, UK
Dirk Slock, France
Y.-P. Tan, Singapore
Joao Manuel R. S. Tavares, Portugal
George S. Tombras, Greece
Dimitrios Tzovaras, Greece
Bernhard Wess, Austria
Jar-Ferr Yang, Taiwan
Azzedine Zerguine, Saudi Arabia
Abdelhak M. Zoubir, Germany


Contents

Analysis and Signal Processing of Oesophageal and Pathological Voices, Juan I. Godino-Llorente, Pedro Gomez-Vilda, and Tan Lee
Volume 2009, Article ID 283504, 4 pages

Jitter Estimation Algorithms for Detection of Pathological Voices, Darcio G. Silva, Luís C. Oliveira, and Mario Andrea
Volume 2009, Article ID 567875, 9 pages

Removing the Influence of Shimmer in the Calculation of Harmonics-To-Noise Ratios Using Ensemble-Averages in Voice Signals, Carlos Ferrer, Eduardo Gonzalez, María E. Hernández-Díaz, Diana Torres, and Anesto del Toro
Volume 2009, Article ID 784379, 7 pages

On the Use of the Correlation between Acoustic Descriptors for the Normal/Pathological Voices Discrimination, Thomas Dubuisson, Thierry Dutoit, Bernard Gosselin, and Marc Remacle
Volume 2009, Article ID 173967, 19 pages

A Joint Time-Frequency and Matrix Decomposition Feature Extraction Methodology for Pathological Voice Classification, Behnaz Ghoraani and Sridhar Krishnan
Volume 2009, Article ID 928974, 11 pages

A First Comparative Study of Oesophageal and Voice Prosthesis Speech Production, Massimiliana Carello and Mauro Magnano
Volume 2009, Article ID 821304, 6 pages

Linear Classifier with Reject Option for the Detection of Vocal Fold Paralysis and Vocal Fold Edema, Constantine Kotropoulos and Gonzalo R. Arce
Volume 2009, Article ID 203790, 13 pages

Back-and-Forth Methodology for Objective Voice Quality Assessment: From/to Expert Knowledge to/from Automatic Classification of Dysphonia, Corinne Fredouille, Gilles Pouchoulin, Alain Ghio, Joana Revis, Jean-François Bonastre, and Antoine Giovanni
Volume 2009, Article ID 982102, 13 pages

Analysis of Acoustic Features in Speakers with Cognitive Disorders and Speech Impairments, Oscar Saz, Javier Simon, W.-Ricardo Rodríguez, Eduardo Lleida, and Carlos Vaquero
Volume 2009, Article ID 159234, 11 pages

Automated Intelligibility Assessment of Pathological Speech Using Phonological Features, Catherine Middag, Jean-Pierre Martens, Gwen Van Nuffelen, and Marc De Bodt
Volume 2009, Article ID 629030, 9 pages


Modelling Errors in Automatic Speech Recognition for Dysarthric Speakers, Santiago Omar Caballero Morales and Stephen J. Cox
Volume 2009, Article ID 308340, 14 pages

Assessment of Severe Apnoea through Voice Analysis, Automatic Speech, and Speaker Recognition Techniques, Ruben Fernandez Pozo, Jose Luis Blanco Murillo, Luis Hernandez Gomez, Eduardo Lopez Gonzalo, Jose Alcazar Ramírez, and Doroteo T. Toledano
Volume 2009, Article ID 982531, 11 pages

Alternative Speech Communication System for Persons with Severe Speech Disorders, Sid-Ahmed Selouani, Mohammed Sidi Yakoub, and Douglas O’Shaughnessy
Volume 2009, Article ID 540409, 12 pages


Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2009, Article ID 283504, 4 pages
doi:10.1155/2009/283504

Editorial

Analysis and Signal Processing of Oesophageal and Pathological Voices

Juan Ignacio Godino-Llorente,1 Pedro Gomez-Vilda (EURASIP Member),2 and Tan Lee3

1 Department of Circuits & Systems Engineering, Universidad Politecnica de Madrid, Carretera Valencia Km 7, 28031, Madrid, Spain
2 Department of Computer Science & Engineering, Universidad Politecnica de Madrid, Campus de Montegancedo, Boadilla del Monte, 28660, Madrid, Spain

3 Department of Electronic Engineering, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong

Correspondence should be addressed to Juan Ignacio Godino-Llorente, [email protected]

Received 29 October 2009; Accepted 29 October 2009

Copyright © 2009 Juan Ignacio Godino-Llorente et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

Speech is not limited to the process of communication: it is also very important for transferring emotions, it is a small part of our personality, it reflects situations of stress, and it has an added cosmetic value in many professional activities. Since speech communication is fundamental to human interaction, we are moving toward a new scenario where speech is gaining greater importance in our daily lives. On the other hand, modern lifestyles have increased the risk of experiencing some kind of voice alteration. In this sense, the National Institute on Deafness and Other Communication Disorders (NIDCD) pointed out that approximately 7.5 million people in the United States have trouble using their voices [1]. Even though providing statistics on people affected by voice disorders is a very difficult task, it is underlined in [2] that between 5 and 10% of the US working population must be considered intensive voice users. In Finland, this figure is estimated to be close to 25%. The conclusions in [2] also point out that the voice is the primary tool for about 25 to 33% of the working population. While the case of teachers has been studied extensively in the literature [2, 3], singers, doctors, lawyers, nurses, telemarketers, professional trainers, and public speakers also make great demands on their voices and, consequently, are prone to experiencing voice problems [1, 4–6]. Therefore, in addition to medical consequences in daily life (treatment, rehabilitation, etc.), some voice disorders also have severe consequences for professional aspects (job performance, attendance, occupation changes) and economic aspects, as well as far from negligible consequences for social activities and interaction with others [2–4].

However, despite many years of effort devoted to developing algorithms for speech signal processing, and despite the elaboration of automatic speech recognition and synthesis systems, our knowledge of the nature of the speech signal and of the effects of pathologies is still limited. In spite of this, voice scientists and clinicians take advantage of the simple models and methods developed by speech signal processing engineers to build their own analysis methods for the assessment of disorders of voice (DoV).

Yet, the limitations of existing models and methods are felt in both areas of expertise, that is, speech signal processing applications and the assessment of DoV. For example, the intervals within which signal model parameters must remain constant to represent signals with a timbre that is perceived as natural are unknown. Moreover, an efficient control of voice quality has important applications in modern text-to-speech synthesis systems (creating new synthetic voices, simulating emotions, etc.). Voice clinicians, on the other hand, have expressed their disappointment with the performance of existing methods for assessing voice quality, with a special focus on the forensic implications. Major issues with current methods include robustness against noise, consistency of measurements, interpretation of the estimated features from a speech production point of view, and correlation with perception.

So there exists a need for new and objective ways to evaluate speech, its quality, and its connection with other phenomena, since the deviation from the patterns of


normality can be correlated with many different symptoms and psychophysical situations. As previously commented, research to date in speech technology has focused its efforts on areas such as speech synthesis, recognition, and speaker verification/recognition. Speech technologies have evolved to the stage where they are reliable enough to be applied in other areas. In this sense, acoustic analysis is a noninvasive technique and an efficient tool for the objective support of the diagnosis of DoV, the screening of vocal and voice diseases (and particularly their early detection), the objective determination of vocal function alterations, and the evaluation of surgical as well as pharmacological treatments and rehabilitation. Its application should not be restricted to the medical area alone, as it may also be of special interest in forensic applications, in the control of voice quality for voice professionals such as singers and speakers, in the evaluation of stress, and so forth.

In addition, digital speech processing techniques play a special role in dealing with oesophageal voices. The quality of voice and the functional limitations of laryngectomized patients remain an important challenge for improving their quality of life.

On the other hand, acoustic analysis emerges as a tool complementary to other evaluation methods used in the clinic that are based on the direct observation of the vocal folds using videoendoscopy. Therefore, a deeper insight into the voice production mechanism and its relevant parameters could help clinicians to improve the prevention and treatment of DoV. In this sense, and in order to contribute to filling this gap, during the last ten years links and cooperation among different research fields have become effective in defining and setting up simple and reliable tools for voice analysis. As a result, there exists a joint initiative at the European level devoted to research in this field: the COST 2103 Action [7], funded by the European Science Foundation, is a joint initiative of speech processing teams and the European Laryngological Research Group (ELRG). The main objective of this action is to improve voice production models and analysis algorithms, with a view to assessing voice disorders, by incorporating new or previously unexploited techniques and recent theoretical developments in order to improve the modelling of normal and abnormal voice production, including substitution voices. This is an interdisciplinary action that aims to foster synergies between various complementary disciplines as a promising way to efficiently address the complexity of many current research and development problems in the field of DoV. In particular, progress in the clinical assessment and enhancement of voice quality requires the cooperation of speech processing engineers and voice clinicians.

The aim of this special issue is to contribute a step forward in filling the aforementioned gaps.

2. Summary of the Issue

For this special issue, 31 submissions were received. After a difficult review process, 12 papers were accepted for publication. The accepted articles address important issues in speech processing and its applications to oesophageal and pathological voices.

The articles in this special issue cover the following topics: methods of voice quality analysis based on frequency and amplitude perturbation and noise measurements; development of acoustic features to detect, classify, or discriminate pathological voices; classification techniques for the automatic detection of pathological voices; automatic assessment of voice quality; automatic word and phoneme intelligibility in pathological voices; analysis and assessment of the speech of cognitively impaired people; automatic detection of obstructive sleep apnoea from speech; robust recognition of dysarthric speakers; and automatic speech recognition and synthesis to enhance the quality of communication.

In this issue, two papers describe methods of voice quality analysis based on frequency and amplitude perturbation (i.e., jitter and shimmer) and noise measurements. Although these measurements have been widely applied in the state of the art for a long time, they still present some drawbacks, and further research is needed in this field.

The jitter value is a measure of the irregularity of a quasiperiodic signal and is a good indicator of the presence of pathologies in the larynx, such as vocal fold nodules or a vocal fold polyp. The paper by Silva et al. focuses on the evaluation of different methods found in the state of the art to estimate the amount of jitter present in speech signals. The authors also propose a new jitter measurement. Given the irregular nature of the speech signal, each jitter estimation algorithm relies on its own model, making a direct comparison of the results very difficult. For this reason, in this paper the evaluation of the different jitter estimation methods is targeted at their ability to detect pathological voices. The paper shows that there are significant differences in the performance of the jitter algorithms under evaluation.

In addition, with respect to the classic acoustic measurements, since the calculation of the Harmonics-to-Noise Ratio (HNR) in voiced signals is affected by general aperiodicity (such as jitter, shimmer, and waveform variability), the paper by Ferrer et al. develops a method to reduce the effects of shimmer in the calculation of the HNR. The authors propose an ensemble averaging technique that has been gradually refined in terms of its sensitivity to jitter, waveform variability, and the required number of pulses. In this paper, shimmer is introduced in the model of the ensemble average, and a formula is derived which allows the reduction of shimmer effects in the HNR calculation.
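As a point of reference for the ensemble-averaging idea, the sketch below computes a classic ensemble-averaged HNR (the baseline that such refinements build on, not the shimmer-corrected formula of Ferrer et al.): the periodic component is approximated by the mean of time-aligned, equal-length glottal cycles, and the noise by each cycle's deviation from that mean. The function name and the equal-length-cycle assumption are ours.

```python
import numpy as np

def ensemble_hnr_db(cycles):
    """Classic ensemble-averaged HNR estimate, in dB.

    `cycles` is a 2-D array of shape (num_cycles, cycle_length) holding
    time-aligned glottal cycles of equal length (alignment/resampling is
    assumed to have been done beforehand).
    """
    cycles = np.asarray(cycles, dtype=float)
    harmonic = cycles.mean(axis=0)        # ensemble average = periodic part
    noise = cycles - harmonic             # per-cycle deviations = noise part
    signal_power = np.mean(harmonic ** 2)
    noise_power = np.mean(noise ** 2)
    return 10.0 * np.log10(signal_power / noise_power)
```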

On the other hand, several articles in this issue report work on detecting, classifying, or discriminating pathological voices. Three of them focus on the development of acoustic features.

The paper by Dubuisson et al. presents a system developed to discriminate normal and pathological voices. The proposed system is based on features inspired by voice pathology assessment and music information retrieval. The paper uses two features (spectral decrease and first spectral tristimulus in the Bark scale) and their correlation, leading to correct classification rates of 94.7% for pathological voices and 89.5% for normal ones. Moreover, the system provides a normal/pathological factor giving an objective indication to the clinician.


Ghoraani and Krishnan propose another methodology for the automatic detection of pathological voices. The authors propose the extraction of meaningful and unique features using an adaptive time-frequency distribution (TFD) and nonnegative matrix factorization (NMF). The adaptive TFD dynamically tracks the nonstationarity in the speech, and NMF quantifies the constructed TFD. The proposed method extracts meaningful and unique features from the joint TFD of the speech, and automatically identifies and measures the abnormality of the signal.

In addition, Carello and Magnano evaluated in their paper the acoustic properties of oesophageal voices (EVs) and tracheo-oesophageal voices (TEPs). For each patient, several acoustic features were calculated: fundamental frequency, intensity, jitter, shimmer, and noise-to-harmonic ratio. Moreover, for TEP patients, the tracheostoma pressure at the time of phonation was measured in order to obtain information about the “in vivo” pressure necessary to open the phonatory valve and enable speech. The authors reported noise components between 600 Hz and 800 Hz in all patients, with a harmonic component between 1200 Hz and 1600 Hz. Besides, the TEP voices have better acoustic characteristics and a lower standard deviation. To investigate the correlation between the pressure and the TEP voice signals, the cross spectrum based on the Fourier transform was evaluated. The most important and interesting result of this analysis is that the two signals showed the same fundamental frequency and the same harmonic components for each TEP subject considered.

Two more papers in this issue discuss different classification techniques for the automatic detection of pathological voices. The paper by Kotropoulos and Arce compares two distinct pattern recognition approaches: the detection of male subjects who are diagnosed with vocal fold paralysis against male subjects who are diagnosed as normal, and the detection of female subjects who suffer from vocal fold edema against female subjects who do not suffer from any voice pathology. Linear prediction coefficients extracted from sustained vowels were used as features. The evaluation was carried out using a Bayes classifier with Gaussian class-conditional probability density functions with equal covariance matrices.

Fredouille et al. address the important task of voice quality assessment. They propose an original back-and-forth methodology involving an automatic classification system as well as the knowledge of human experts (machine learning experts, phoneticians, and pathologists). The automatic system was validated on a dysphonic corpus rated according to the GRBAS perceptual scale by an expert jury. The analysis showed the relevance of the 0–3000 Hz frequency band for this classification problem. Additionally, an automatic phonemic analysis underlined the significance of consonants, and more surprisingly of unvoiced consonants, for the same classification task. Submitted to the human experts, these observations led to a manual analysis of unvoiced plosives, which highlighted a lengthening of voice onset time (VOT) according to the dysphonia severity, validated by a preliminary statistical analysis.

Four more papers deal with the analysis and assessment of different types of impaired or disordered speech.

The paper by Saz et al. presents results of the analysis of the acoustic features (formants and the three suprasegmental features: tone, intensity, and duration) of vowel production in a group of young speakers suffering from different kinds of speech impairments due to physical and cognitive disorders. A corpus of unimpaired children's speech is used to determine the reference values for these features for speakers without any kind of speech impairment within the same domain as the impaired speakers, that is, 57 isolated words. The signal processing to extract the formant and pitch values is based on a linear prediction coefficient (LPC) analysis of the segments considered as vowels in a hidden Markov model- (HMM-) based Viterbi forced alignment. Intensity and duration are also based on the outcome of the automated segmentation. As the main conclusion of the work, it is shown that the intelligibility of vowel production is lowered in impaired speakers even when the vowel is perceived as correct by human labelers. The decrease in intelligibility is due to a 30% increase in confusability in the formant map, a 50% reduction in the discriminative power in energy between stressed and unstressed vowels, and a 50% increase in the standard deviation of the length of the vowels. On the other hand, impaired speakers kept good control of tone in the production of stressed and unstressed vowels.

Likewise, it is commonly acknowledged that word or phoneme intelligibility is an important criterion in the assessment of the communication efficiency of a pathological speaker. Middag et al. developed a system based on automatic speech recognition (ASR) technology to automate and objectify the intelligibility assessment. This paper presents a methodology that uses phonological features, automatic speech alignment (based on acoustic models trained with normal speech), context-dependent speaker feature extraction, and intelligibility prediction based on a small model that can be trained on pathological speech samples. The experimental evaluation of the new system revealed that the root mean squared error of the discrepancies between perceived and computed intelligibilities can be as low as 8 on a scale of 0 to 100.

Morales and Cox modelled the errors made by a dysarthric speaker and attempted to correct them using two techniques: (a) a set of “metamodels” that incorporate a model of the speaker's phonetic confusion matrix into the ASR process; (b) a cascade of weighted finite-state transducers at the confusion-matrix, word, and language levels. Both techniques attempt to correct the errors made at the phonetic level and make use of a language model to find the best estimate of the correct word sequence. The experiments showed that both techniques outperform standard adaptation techniques.

Pozo et al. proposed the use of ASR techniques for the automatic diagnosis of patients with severe obstructive sleep apnoea (OSA). Early detection of severe apnoea cases is important so that patients can receive early treatment, and an effective ASR-based detection system could dramatically reduce medical testing time. Working with a carefully


designed speech database of healthy and apnoea subjects, they describe an acoustic search for distinctive apnoea voice characteristics. The paper also studies abnormal nasalization in OSA patients by modelling vowels in nasal and nonnasal phonetic contexts using Gaussian mixture model (GMM) pattern recognition on speech spectra.

Finally, the paper by Selouani et al. proposes the use of assistive speech-enabled systems to help both French- and English-speaking persons with various speech disorders. The proposed assistive systems use ASR and speech synthesis in order to enhance the quality of communication. These systems aim at improving the intelligibility of pathologic speech while making it as natural as possible and close to the original voice of the speaker. The resynthesized utterances use new basic units, a new concatenation algorithm, and a grafting technique to correct poorly pronounced phonemes. The ASR responses are uttered by the new speech synthesis system in order to convey an intelligible message to listeners. An improvement of the perceptual evaluation of speech quality (PESQ) value of 5% and of more than 20% was achieved by the speech synthesis systems dealing with substitution disorders (SSD) and dysarthria, respectively.

To conclude, this special issue aims at offering an interdisciplinary platform for presenting new knowledge in the field of analysis and signal processing of oesophageal and pathological voices. From these papers, we hope that the interested reader will find useful suggestions and further stimulation to carry on research in this field.

Acknowledgments

The authors are extremely grateful to all the reviewers who took time and consideration to assess the submitted manuscripts. Their diligence and their constructive criticism and remarks contributed greatly to ensuring that the final papers conform to the high standards expected of this publication. Moreover, we would like to thank all the authors who submitted papers to this special issue for their patience during the always hard and long reviewing process, especially those who unfortunately did not have the opportunity to see their work published. Last, but not least, we would like to thank the Editor-in-Chief and the Editorial Office of the EURASIP Journal on Advances in Signal Processing for their continuous efforts and valuable support.

Juan Ignacio Godino-Llorente
Pedro Gomez Vilda

Tan Lee

References

[1] National Institute on Deafness and Other Communication Disorders (NIDCD), October 2009, http://www.nidcd.nih.gov/health/statistics/vsl.asp.

[2] La Voix. Ses Troubles Chez Les Enseignants, INSERM, 2006.

[3] American Speech-Language-Hearing Association, October 2009, http://www.asha.org/default.htm.

[4] E. Smith, M. Taylor, M. Mendoza, J. Barkmeier, J. Lemke, and H. Hoffman, “Spasmodic dysphonia and vocal fold paralysis: outcomes of voice problems on work-related functioning,” Journal of Voice, vol. 12, no. 2, pp. 223–232, 1998.

[5] MedlinePlus, October 2009, http://www.nlm.nih.gov/medlineplus/voicedisorders.html.

[6] J. Kreiman, B. R. Gerratt, G. B. Kempster, A. Erman, and G. S. Berke, “Perceptual evaluation of voice quality: review, tutorial, and a framework for future research,” Journal of Speech and Hearing Research, vol. 36, no. 1, pp. 21–40, 1993.

[7] M. Kob and P. H. Dejonckere, ““Advanced voice function assessment”—goals and activities of COST action 2103,” Biomedical Signal Processing and Control, vol. 4, no. 3, pp. 173–175, 2009.


Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2009, Article ID 567875, 9 pages
doi:10.1155/2009/567875

Research Article

Jitter Estimation Algorithms for Detection of Pathological Voices

Darcio G. Silva,1 Luís C. Oliveira,1 and Mario Andrea2

1 INESC-ID/IST, Lisbon, 1649-028 Lisbon, Portugal
2 Faculty of Medicine, University of Lisbon, Portugal

Correspondence should be addressed to Luís C. Oliveira, [email protected]

Received 27 November 2008; Revised 15 April 2009; Accepted 30 June 2009

Recommended by Juan I. Godino-Llorente

This work is focused on the evaluation of different methods to estimate the amount of jitter present in speech signals. The jitter value is a measure of the irregularity of a quasiperiodic signal and is a good indicator of the presence of pathologies in the larynx, such as vocal fold nodules or a vocal fold polyp. Given the irregular nature of the speech signal, each jitter estimation algorithm relies on its own model, making a direct comparison of the results very difficult. For this reason, the evaluation of the different jitter estimation methods was targeted on their ability to detect pathological voices. Two databases were used for this evaluation: a subset of the MEEI database and a smaller database acquired in the scope of this work. The results showed that there were significant differences in the performance of the algorithms being evaluated. Surprisingly, on the largest database the best results were not achieved with the commonly used relative jitter, measured as a percentage of the glottal cycle, but with absolute jitter values measured in microseconds. Also, the newly proposed measure for jitter, LocJitt, performed in general equally to or better than the commonly used tools MDVP and Praat.

Copyright © 2009 Darcio G. Silva et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

Most voice-related pathologies are due to irregular masses located on the vocal folds interfering with their normal and regular vibration. This phenomenon causes a decrease in voice quality, which is usually the first symptom of this type of disorder. In the past, the only way to measure voice quality was by applying perceptual measurements denoting the existence or absence of several voice characteristics [1]. There has been an increasing need for techniques that can evaluate voice quality in an objective way, providing a robust and reliable measurement of important acoustic parameters of the voice [2]. With the recent developments in technology, quality equipment and sophisticated software are now available to analyse the speech signal in order to estimate numerous parameters that indicate amplitude and frequency perturbations, the level of air leakage, the degree of turbulence, and so forth. The implementation of real-time analysis tools can give important and instantaneous feedback on voice performance for both voice therapy and voice coaching procedures.

One of the most commonly used tools for this purpose is the Multidimensional Voice Program (MDVP) produced by KayPENTAX [3]. This commercial software tool is able to perform different types of acoustic analysis on the speech signal, producing a large number of parameters. The MDVP is usually sold together with KayPENTAX's Computerized Speech Lab, a hardware platform for digital voice recording, making its use very common among health professionals.

Another commonly used speech analysis tool is Praat [4], created by Paul Boersma and David Weenink of the Institute of Phonetic Sciences, University of Amsterdam. This free software is used by speech researchers, and it has a wider range of use than MDVP, although with a steeper learning curve.

In this work we will focus on the estimation of irregularities in the vibration of the vocal folds, which are commonly measured by the jitter parameter. Jitter measures the irregularities in a quasi-periodic signal and can account for variations in one or more of its features, like period, amplitude, shape, and so forth [5]. In the case of the speech


signal, its definition is less clear since the signal is very irregular. Even a sustained vowel produced by a professional speaker can hardly be considered a periodic signal. Thus, the jitter of a voiced speech signal is usually taken as a measure of the change in the duration of consecutive glottal cycles. When this definition is applied to a sustained vowel with a constant average glottal period, the presence of jitter indicates that some periods are shorter while others are longer than the average pitch period.

Both MDVP and Praat can produce an estimate of the amount of jitter in a sustained vowel. However, it is known that the MDVP system has a tendency to score jitter values above the ones calculated by Praat; when applied to the same speech signal, they provide different estimates [6]. Apart from these, there are other methods to estimate jitter, and the question is how to compare them.

In this paper we present the results of our evaluation of 3 jitter estimation methods, including the one used by MDVP and Praat. The goal of this study is not to develop a system for the detection of pathologic voices [7–9] but solely to understand the relative performance of the 3 jitter estimation techniques in this task.

The paper starts by presenting the glottal source and vocal tract models used in this work, followed by a description of the speech material that was used in the evaluation process. Next we present some methods for marking fixed points in the glottal cycle, as required by the jitter estimation algorithms. We formalise the three jitter models that were used, followed by a description of the jitter estimation algorithms that were evaluated. A comparison of the algorithms for both pitch marking and jitter estimation is then presented. A set of 14 tools, combining the different algorithms, is then evaluated on its ability to detect pathological voices. Finally, we present the conclusions and some ideas for future work.

2. Voice Source Model

Voice production starts with the vibration of the vocal folds, which can be more or less stretched to achieve higher or lower pitch tones. In normal conditions, and in spite of this pitch variation ability, phonation is considered stabilized and regular. Any transformation of the vocal folds' tissue can cause an irregular, nonperiodic vibration, which will change the shape of the glottal source signal from one period to the next, introducing jitter [10]. The same problem can occur in amplitude. If, for instance, the vocal folds are too stiff, they will need a higher subglottal pressure to vibrate. The glottal cycle can thus also be irregularly disturbed in amplitude, originating shimmer. No less important is the possible existence of high-frequency noise, especially during the closed phase of the glottal cycle, originated by a partial closure of the vocal folds, which causes an air leakage through the glottis and produces a turbulence effect. All these phenomena affect the glottal source signal, but we do not have direct access to this signal, only to the sound pressure radiated at the lips. The estimation of the glottal source signal from the voice signal is not a simple task. Research in this field shows that it is reasonable to approximate the influence of the vocal tract by a linear filter. Using this approximation, the voice signal can be filtered by the inverse of this filter to obtain an estimate of the glottal source signal [11]. In this work we will use a noninteractive approach that considers neither the influence of the supraglottal vocal tract nor the influence of the subglottal cavities on the glottal flow. As a consequence, we assume that the source and filter parameters are independent.

3. Vocal Tract Model

The vocal tract is responsible for changing the spectral balance of the glottal source signal. By changing the vocal tract shape, the speaker can modify its resonance frequencies to produce a wide variety of different sounds. Humans use the evolution in time of these resonance frequencies to produce speech. In this work, we model the vocal tract by an all-pole filter estimated using a Linear Prediction (LPC) analysis [12]. LPC is a powerful and widely used tool for speech analysis that assumes the already mentioned separation of the source signal from the vocal tract filter. The contribution of the vocal tract resonances estimated by the LPC algorithm can be removed from the speech signal by inverse filtering. This process produces an estimate of the glottal source signal, also called the residue. The ability to exchange the residue for other similar inputs, with different fundamental frequencies or amplitudes, and apply them to the original vocal tract filter allows the production of many combinations of synthetic voices.
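As a rough illustration of this source-filter separation, the sketch below fits an all-pole model to a voiced frame with the autocorrelation method (Levinson-Durbin) and inverse-filters the frame to obtain the residue. It is a minimal stand-in for the LPC analysis referred to in the text, not the authors' exact implementation; the function names and the default filter order are our own choices.

```python
import numpy as np

def lpc_coefficients(frame, order):
    """All-pole coefficients a[0..order] (with a[0] = 1) via the
    autocorrelation method and the Levinson-Durbin recursion."""
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    r = np.correlate(frame, frame, mode="full")[n - 1:n + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
        a[1:i + 1] = a[1:i + 1] + k * a[i - 1::-1][:i]   # reflection update
        err *= (1.0 - k * k)
    return a

def lpc_residue(frame, order=18):
    """Inverse-filter a voiced frame with its own LPC filter A(z) to obtain
    an estimate of (the derivative of) the glottal source signal."""
    a = lpc_coefficients(frame, order)
    # FIR filtering with A(z): e[n] = sum_k a[k] * x[n - k]
    return np.convolve(frame, a, mode="full")[:len(frame)]
```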

4. Speech Data

The evaluation of the jitter detection algorithms was also conducted on real voices. For this purpose, two databases were used: the Disordered Voice Database (MEEI) provided by KayPENTAX, and a database named DB02 specifically created for this study.

The Disordered Voice Database (MEEI) was developed by the Massachusetts Eye and Ear Infirmary (MEEI) Voice and Speech Lab. It includes more than 1400 voice samples from approximately 700 subjects [13]. The database includes samples from patients with a wide variety of organic, neurological, traumatic, psychogenic, and other voice disorders, together with normal subjects. For this work, a group of 50 pathological voices and 50 normal voices was randomly chosen from this data set.

The DB02 database was acquired in similar conditions to the MEEI database, using the Computerized Speech Lab 4150 acquisition system from KayPENTAX together with a dynamic low-impedance microphone (Shure SM48). The CSL 4150 provides 16-bit A/D conversion, preamplification, and antialiasing filtering. All voices for this study were recorded with a sampling frequency of 50 kHz and a signal-to-noise ratio of 39.5 dB [3]. Special care was taken to maintain the same microphone position, the posture, and also the type of interaction with the patient. The suggested posture was, according to the normal procedures for a correct


phonation, back and head straight and aligned with the chair. The microphone was positioned in a way that minimized the effect of room reverberation, making an angle of 45° with the opposing wall. Another important issue was to maintain a fixed distance between the microphone and the patient's mouth, which can influence the amplitude of the captured signal or even introduce undesirable resonances at specific frequencies. The direction is also relevant; a microphone pointed directly at the mouth can capture a pressure wave that will cause an exaggerated excitation of the microphone. The distance and angle chosen were 15 cm and 45°.

Before each recording session, the volume level was calibrated to adapt the dynamic range of the input signal in order to prevent overload distortion and, at the same time, minimize the quantization error caused by the discrete and limited range of the A/D converter.

The new database was organized per patient and per date of exam. Each exam was saved in wav format with the filename according to the type of the exam and the patient's reference number. The personal identification numbers of the patients were separated from the rest of the database for privacy reasons.

The DB02 database is still being acquired, and it currently comprises 22 speakers, of which 8 had diagnosed larynx pathologies. For balancing reasons, a subset of the database was also used in this case, including all the diagnosed speakers and 8 randomly selected speakers with no diagnosed pathologies.

5. Pitch-Mark Detectors

The jitter estimation algorithms that we want to evaluate require the location of a fixed point in the glottal cycle, called a pitch-mark (PM). A good candidate for this reference point is the glottal closure instant (GCI), since it corresponds to a discontinuity in the glottal flow caused by the abrupt closure of the vocal folds, interrupting the passage of the air through the glottis. Since the residue signal resulting from the inverse filtering of the speech signal by the LPC filter is an approximation of the derivative of the glottal flow, the discontinuity in the flow produces large negative peaks. Normally these peaks fall slower than they recover, which can be explained by the vocal folds' closing/opening process. A regular vibration produces periodic peaks with fundamental frequency F0.

A common algorithm for glottal closure instant detection is dypsa [14], for which there is an implementation in the VoiceBox toolbox [15].

We have implemented a modification of the dypsa algorithm for sustained vowels, named dymp. This modification considers that the glottal closure instants calculated by dypsa are a first approximation of the real GCIs. Since we assume that the vocal tract is stable, instead of using time-varying LPC filter coefficients, we can try to locate the set of coefficients that produced the most prominent peaks in the residue. By analysing the residue resulting from the time-varying LPC filter, we can locate the pair of pitch periods with the largest peaks and the corresponding set of filter coefficients. This best set of filter coefficients is then used to filter the whole sustained vowel, producing a residue with more prominent peaks (Figure 1(b)). The GCIs are then better located in this enhanced residue signal.

Table 1: Naming of the pitch marking tools.

dymp: Pitch marks computed using dypsa with pitch-synchronous LPC coefficients
mdvp: Pitch marks computed with MDVP's peak-picking tool
praat: Pitch marks computed with Praat's cross-correlation tool

The results, when compared to advanced systems like Praat and MDVP, suggest a significant improvement, especially for irregular voices.

MDVP and Praat rely on pitch marks that do not coincide with the glottal closure instant. Praat uses a waveform-matching procedure that locates the pitch marks where the best matching between wave shapes occurs, using the “cross-correlation” maximum. On the other hand, MDVP uses a peak-picking procedure that locates the pitch marks on the local peaks of the waveform.
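For experimentation without MDVP, Praat, or the VoiceBox dypsa implementation, a very crude stand-in for a pitch-mark detector is to pick the large negative peaks of the LPC residue with a minimum spacing between marks. The sketch below (our own naming, threshold, and spacing heuristic; not any of the tools above) is only meant to produce candidate marks for the jitter measures that follow.

```python
import numpy as np

def naive_pitch_marks(residue, fs, f0_max=400.0):
    """Very rough GCI candidates: large negative residue peaks separated by
    at least one minimum admissible glottal period.  Not a replacement for
    dypsa, MDVP, or Praat pitch marking."""
    residue = np.asarray(residue, dtype=float)
    min_dist = int(fs / f0_max)                  # shortest admissible period (samples)
    threshold = -3.0 * np.std(residue)           # "large" negative excursion (heuristic)
    candidates = np.where(residue < threshold)[0]
    marks = []
    for n in candidates:
        if not marks or n - marks[-1] >= min_dist:
            marks.append(int(n))
        elif residue[n] < residue[marks[-1]]:    # keep the deeper of two close peaks
            marks[-1] = int(n)
    return np.array(marks)

# Period lengths (in samples) follow directly from consecutive marks:
# periods = np.diff(naive_pitch_marks(residue, fs))
```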

6. Jitter Models

For this study, three different models of jitter were used.

The first one considers that jitter is just a simple variation of the period, which can be measured by subtracting each period of the pitch period sequence from its neighbour or combinations of its neighbours. This method usually assumes a long-time periodicity that sometimes does not exist, and it provides a single measurement for the whole signal:

Jitta = \frac{1}{N-1} \sum_{n=1}^{N-1} \left| P_0(n+1) - P_0(n) \right|,   (1)

where P_0(n) is the sequence of pitch period lengths measured in microseconds.
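For reference, (1) is a one-liner on a sequence of period lengths; the function name is ours, and the periods are assumed to be already expressed in microseconds.

```python
import numpy as np

def jitta(periods_us):
    """Absolute jitter as in (1): mean absolute difference between consecutive
    pitch period lengths, expressed in microseconds."""
    p = np.asarray(periods_us, dtype=float)
    return float(np.mean(np.abs(np.diff(p))))
```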

The second model can be represented by a combination of two periodic phenomena over a long time range to achieve a locally aperiodic behaviour over a short time range (Figure 2). If we assume a pulse-like signal, it can be expressed as

s(n) = \sum_{k=-\infty}^{+\infty} \delta(n - 2kP) + \sum_{k=-\infty}^{+\infty} \delta(n + \varepsilon - (2k+1)P).   (2)

In this model, P is the average period and ε is a value that expresses the displacement of every other period, in a cyclic perturbation of a locally constant value, occurring in every second impulse. The value of ε can range from 0 (no jitter) to P (the average period length).

It is important to note that, for a direct comparison of the results, if we apply the first model to this second approach, the estimated jitter value is Jitta = 2ε. This factor comes from the assumption that in the first case Jitta is the direct subtraction of two periods, while in the latter ε is half the difference between two periods (Figure 2).
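Because the second model is fully specified by P and ε, it is easy to synthesize a pulse train with a known amount of jitter and use it to sanity-check an estimator. A minimal sketch follows (our naming, with P and ε given in integer samples).

```python
import numpy as np

def jittered_pulse_train(num_periods, period, eps):
    """Impulse train following (2): even-numbered pulses at 2kP and
    odd-numbered pulses at (2k+1)P - eps, so that the period lengths
    alternate between P - eps and P + eps."""
    length = num_periods * period + 1
    s = np.zeros(length)
    for k in range(num_periods // 2 + 1):
        even = 2 * k * period
        if even < length:
            s[even] = 1.0
        odd = (2 * k + 1) * period - eps
        if 0 <= odd < length:
            s[odd] = 1.0
    return s

# Example: 20 periods of 160 samples with eps = 5 samples of alternating jitter;
# such a train could excite an all-pole filter, as in the synthetic test later on.
excitation = jittered_pulse_train(20, 160, 5)
```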


[Figure 1: The residue signal resulting from the original dypsa algorithm (a) and from the proposed dypsaMP (b). Both panels plot amplitude (V) against time (samples) and are titled “PM detection on residue using dypsa” and “PM detection on residue using dypsaMP”.]

[Figure 2: Example of a pitch period sequence with a local periodic and a local aperiodic component. Impulses at 0, P − ε, 2P, 3P − ε, and 4P give period lengths alternating between P − ε and P + ε (amplitude versus period in samples).]

The major inconvenience of both of these models is the assumption that the underlying signal has a fixed fundamental frequency. However, apart from professional singers, many speakers do not have total control over the whole process of phonation. Providing a regular glottal flow as well as a constant position of the vocal tract, while producing a regular vibration of the vocal folds, during the recording period (normally 8 to 10 seconds), is not achievable by all speakers. The amount of jitter determined by both previous methods depends on the ability of the speaker to hold a constant pitch. Slow monotonic changes in the fundamental frequency are counted as period-to-period variation. In our view, only the nonmonotonic variation should be used as an indicator of the presence of pathologies in the voice. For this reason we propose a third model allowing the glottal period to change linearly over time, as shown in Figure 3. In this approach ε accounts only for the alternating change in period length, not including the effect of monotonic fundamental frequency variations.

The model can be expressed as

P_0(n) = P_0 + (n-1)\,\Delta P + (-1)^n \varepsilon,   (3)

where ΔP is the constant variation in the period length, ε represents the jitter value, and P_0 is the initial glottal period. Using 3 consecutive pitch periods (P_0(1), P_0(2), P_0(3)) it is possible to determine the 3 parameters of the model. With this short analysis window, it is sufficient to consider a linear approximation of the monotonic variation of the period.

This model assumes that the constant variation of the period within the 3-period frame should not be considered pathologic jitter. The separation of the two contributions is thought to be important for properly studying real voices with or without fundamental frequency variations, leading to a more realistic measurement of local pathologic jitter. This third model is the basis for a new method for jitter estimation.

7. Jitter Estimation Algorithms

7.1. The Jitt Algorithm (Used by MDVP and Praat). Both MDVP and Praat estimate the jitter value by computing the average absolute difference between consecutive periods (from the period sequence P_0(n)), divided by the average period and expressed as a percentage:

Jitt = 100 \cdot \frac{\frac{1}{N-1} \sum_{n=1}^{N-1} \left| P_0(n+1) - P_0(n) \right|}{\frac{1}{N} \sum_{n=1}^{N} P_0(n)}.   (4)

This measure is commonly referred to as percent jitter or relative jitter, while Jitta is the absolute jitter value expressed in microseconds. In MDVP this algorithm is named Jitt, while in Praat it is called Jitter (Local). In this work we will use the MDVP name.
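The relative measure of (4) is equally direct; again the function name is ours, and a clean sequence of period lengths is assumed.

```python
import numpy as np

def jitt_percent(periods):
    """Relative jitter as in (4): mean absolute period-to-period difference
    divided by the mean period, expressed as a percentage."""
    p = np.asarray(periods, dtype=float)
    return float(100.0 * np.mean(np.abs(np.diff(p))) / np.mean(p))
```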


[Figure 3: Example of a pitch period sequence with an increasing period. Impulses at 0, P1, P2, and P3 (amplitude versus period in samples).]

We will also evaluate the average absolute difference between consecutive periods as expressed in (1), naming it Jitta and expressing it in microseconds.

7.2. The STJE Algorithm. The Short Time Jitter Estimation (STJE) algorithm was proposed by Vasilakis and Stylianou [16], and it uses the second model for jitter mentioned above. The algorithm is based on mathematical attributes of the magnitude spectrum; the train of impulses can be separated into a harmonic part (H) and a subharmonic part (S), where the subharmonic part is a direct result of the jitter effect:

|P(\omega)|^2 = H(\varepsilon, \omega) + S(\varepsilon, \omega).   (5)

When both spectra are represented in the same graph, it can be proved that the number of crossings of the two components is equal to the number of samples of jitter (ε) of the signal. This means that the minimum number of crossings in a graph of this type is 0 (no jitter) and the maximum is P (the period length). An example of these plots can be seen later in this study (Figures 4 and 5).

The algorithm uses a sliding frame of 4P samples, which slides P samples at a time, estimating a jitter value at each step. More details of the implementation can be found in [6].

It is important to recall that this algorithm provides a sequence of local jitter estimations that does not depend on long-term periodicity, while Praat and MDVP provide a unique value due to expressions (3) and (4). To compare this result with the ones provided by MDVP or Praat, it is necessary to calculate the mean value of the sequence of local jitter estimations.
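The crossing-counting step can be illustrated separately from the construction of the two spectra: given the harmonic and subharmonic magnitude spectra H and S of one 4P-sample frame, sampled on the same frequency grid, the local jitter estimate (in samples) is the number of sign changes of H − S, and the per-frame estimates are averaged for comparison with MDVP or Praat. The helpers below sketch only that counting and averaging step, with our own names and tolerance; they are not the full algorithm of [16].

```python
import numpy as np

def count_crossings(h_spectrum, s_spectrum, tol=1e-12):
    """Number of crossings between the harmonic and subharmonic spectra of one
    analysis frame, taken as the local jitter estimate in samples."""
    diff = np.asarray(h_spectrum, dtype=float) - np.asarray(s_spectrum, dtype=float)
    signs = np.sign(diff[np.abs(diff) > tol])    # drop near-ties before counting
    return int(np.sum(signs[1:] != signs[:-1]))

def average_local_jitter(per_frame_estimates):
    """Mean of the per-frame local estimates, for comparison with the single
    global value reported by MDVP or Praat."""
    return float(np.mean(per_frame_estimates))
```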

To analyse the performance of the STJE algorithm we used a synthetic voice produced using an all-pole filter to model the vocal tract. The filter coefficients were obtained by performing an LPC analysis on a sustained vowel produced by a male speaker, with a fundamental frequency of around 144 Hz, and using an analysis frame size of 4 glottal periods. As expected, the STJE algorithm was able to detect five intersections, corresponding to the exact jitter value introduced in the impulse train used as the filter excitation (Figure 4). For a more realistic result, the STJE algorithm was applied to two frames of a real voice using a window length of four periods. The first frame was carefully chosen in order to comply with the second jitter model, while the second frame was chosen randomly. In both cases the jitter value was also manually estimated on the time signal using the Jitt algorithm, and the results were compared. In the first case STJE correctly estimated a jitter of 1 sample, but in the second one the estimated jitter was 5 samples while the manual estimation was 1 sample.

Figure 5 shows the power spectrum of both the harmonic and subharmonic components. The result shows an unexpected number of intersections, which increases the jitter estimate to values impossible to compare with MDVP's or Praat's. Several attempts to correct the intersection counting, such as changing the threshold for intersection validation, applying different pre-emphasis, or even displacing the middle of the analysis frame inside the period (to assure that it was not a PM detection problem), were tried, but no significant improvements were obtained.

One explanation for the higher than expected intersection count can be the lowpass characteristic of the voiced component of the speech signal, which, when aspiration noise is present, is masked in the high-frequency region of the spectrum. This adds additional crossings between the harmonic and subharmonic components not predicted by the model.

In conclusion, if the real voice follows the proposed jitter model, the algorithm estimates correct values. However, since natural human voices are quite irregular, only in a few cases does STJE produce results comparable with MDVP or Praat.

7.3. The LocJitt Algorithm (Proposed). The proposed LocJitt algorithm aims to estimate the local jitter using the third model for jitter that was previously presented. The main goal is to provide a better jitter estimation by discarding monotonic variations in fundamental frequency that occur in natural voices.

The algorithm uses a frame of length equal to 3 consecutive glottal cycles (4 pitch marks):

P_0(1) = P_0 - \varepsilon, \quad P_0(2) = P_0 + \Delta P + \varepsilon, \quad P_0(3) = P_0 + 2\Delta P - \varepsilon,   (6)

where P_0 is the length of the first glottal cycle excluding the jitter effect, ΔP is the monotonic increment in the length of the glottal cycle occurring every period, and ε is the cycle-to-cycle fluctuation caused by jitter. Using this set of equations, it is easy to derive an expression to compute the jitter value from the lengths of 3 consecutive glottal cycles:

\varepsilon = \frac{1}{4}\left[ \left( P_0(2) - P_0(1) \right) - \left( P_0(3) - P_0(2) \right) \right].   (7)

Like STJE, this algorithm has the ability to compute a jitter estimate for every glottal cycle by shifting the analysis window by one glottal cycle.
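A minimal sketch of the resulting estimator, sliding a three-period window over the period sequence and applying (7) at every step; the function names and the use of the absolute value (so that estimates from successive windows do not cancel when averaged) are our choices and may differ in detail from the authors' implementation.

```python
import numpy as np

def locjitt_local(periods):
    """Local jitter per (7) for every three consecutive period lengths,
    discarding a linear (monotonic) drift of the period."""
    p = np.asarray(periods, dtype=float)
    return np.array([abs((p[i + 1] - p[i]) - (p[i + 2] - p[i + 1])) / 4.0
                     for i in range(len(p) - 2)])

def locjitta(periods_us):
    """Absolute LocJitt value: mean local estimate, in microseconds."""
    return float(np.mean(locjitt_local(periods_us)))

def locjitt_percent(periods):
    """Relative LocJitt value: mean local estimate as a percentage of the
    average period length."""
    p = np.asarray(periods, dtype=float)
    return float(100.0 * np.mean(locjitt_local(p)) / np.mean(p))
```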


[Figure 4: Power spectrum (dB) versus frequency (degrees) of the harmonic and subharmonic parts of a synthetic signal (“Synthetic voice: ε = 5”). The jitter introduced (ε = 5 samples) corresponds to five crossings. No pre-emphasis was performed.]

Two versions of the algorithm were made: LocJitt produces an estimate of the local jitter as a percentage of the average glottal period, and LocJitta estimates the absolute value of the local jitter expressed in microseconds.

To evaluate the effect of these slower variations of the fundamental frequency on the jitter estimation computed using the Jitt algorithm used by MDVP and Praat, we will assume that the pitch period sequence is given by (3) with fixed values for ε and ΔP. Using (4) it can easily be shown that, for an even number of periods, if the amount of jitter is larger than the slowly varying changes in the pitch period, the Jitta algorithm estimates the correct value for ε:

2\varepsilon > \Delta P \;\Longrightarrow\; Jitta = 2\varepsilon.   (8)

However, for small jitter values compared with the slow variations of the glottal period, the Jitta algorithm estimates not ε but the slow variation:

2\varepsilon < \Delta P \;\Longrightarrow\; Jitta = \Delta P.   (9)

The proposed LocJitta algorithm does not have this problem and correctly separates the estimate of ε from the value of ΔP.

This difference is most important in the cases where jitter is present but with a small value, when it is most difficult to detect. Also, localized variations in fundamental frequency that went undetected during the voice acquisition procedure can result in erroneous jitter estimates.
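A tiny numerical check of (8) and (9), with hypothetical values: a drift of ΔP = 4 µs per period and a small alternating jitter ε = 1 µs (so 2ε < ΔP). The plain period-to-period difference is dominated by the drift, whereas the three-period estimator of (7) recovers ε.

```python
import numpy as np

P0, dP, eps, N = 8000.0, 4.0, 1.0, 10       # microseconds; note 2*eps < dP
periods = np.array([P0 + (n - 1) * dP + (-1) ** n * eps for n in range(1, N + 1)])

jitta = np.mean(np.abs(np.diff(periods)))   # model-1 measure, eq. (1)
locjitt = np.mean([abs((periods[i + 1] - periods[i]) - (periods[i + 2] - periods[i + 1])) / 4.0
                   for i in range(N - 2)])  # eq. (7), averaged over the frame

print(jitta)    # 4.0 -> equals dP, as predicted by (9)
print(locjitt)  # 1.0 -> recovers eps despite the drift
```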

8. Evaluation of Jitter Algorithms for Pathological Voice Detection

As we saw earlier, each algorithm for jitter estimation is based on its own model of jitter. It is thus hard to compare the results on real voices, since each algorithm is, in effect, measuring a different thing. The best way to evaluate the different algorithms is on their ability to perform a certain task. In our case, we decided to compare the algorithms on their capability to detect a pathologic voice. In this way, we are not interested in their ability to provide a good estimate of the amount of irregularity of the glottal cycles, but only in whether they can discriminate the irregularities that correspond to pathological conditions from the normal aperiodicity observed in natural voices.

Table 2: Naming of the jitter estimation algorithms.

Jitt: Global estimation based on the average difference in period length
STJE: Local estimation based on the difference in length of every 2 periods
LocJitt: Local estimation based on the non-monotonic difference in period length

[Figure 5: Power spectrum (dB) versus frequency (degrees) of the harmonic and subharmonic parts of a real voice (“Real voice: ε = 1”, second case). The jitter measured on the time signal was 1 sample, but the STJE algorithm counted 5 crossings.]

For this purpose, two databases were analysed: the MEEI database, provided by KayPENTAX, and the database DB02, created for this study and presented earlier.

The goals of the analysis were, first, to test if each algorithm was good enough to be used by itself to distinguish pathological from normal voices and, second, to find out which algorithm had the best performance for such a task.

To evaluate both the pitch marking methodology and the jitter estimation algorithm, a set of 14 tools was created:

(i) dympSTJE: STJE based on dypsaMP's pitch marks, jitter measured as a percentage of the period,

(ii) dympSTJEa: same as previous but with jitter measured as an absolute value in microseconds,

(iii) dympJitt: Jitt based on dypsaMP's pitch marks, jitter measured as a percentage of the period,

(iv) dympJitta: same as previous but with jitter measured as an absolute value in microseconds,


(v) dympLocJitt: LocJitt based on dypsaMP's pitch marks, jitter measured as a percentage of the period,

(vi) dympLocJitta: same as previous but with jitter measured as an absolute value in microseconds,

(vii) mdvpJitt: Jitt using MDVP's pitch marks, jitter measured as a percentage of the period,

(viii) mdvpJitta: same as previous but with jitter measured as an absolute value in microseconds,

(ix) mdvpLocJitt: LocJitt using MDVP's pitch marks, jitter measured as a percentage of the period,

(x) mdvpLocJitta: same as previous but with jitter measured as an absolute value in microseconds,

(xi) praatJitt: Jitt using Praat's pitch marks, jitter measured as a percentage of the period,

(xii) praatJitta: same as previous but with jitter measured as an absolute value in microseconds,

(xiii) praatLocJitt: LocJitt using Praat's pitch marks, jitter measured as a percentage of the period,

(xiv) praatLocJitta: same as previous but with jitter measured as an absolute value in microseconds.

The preliminary results with the STJE algorithm showed that, when compared with the other methods, it has a reduced dependency on the pitch marking tool being used. This is because the algorithm is based on spectral analysis, while the remaining methods are temporal based. These results, together with the computational complexity of the algorithm, justify its use only in conjunction with the pitch marking tool dymp.

8.1. Decision Threshold. All the tools provide their own estimate of the amount of jitter in the input signal. Since we require a binary decision regarding the possibility of the voice being pathological or not, a decision threshold must be found for each tool.

To tune the thresholds we used a group of 50 pathological and 50 normal voices randomly selected from the MEEI data set presented earlier. Since some data was sampled at 25 kHz and some at 50 kHz, we decided to start by converting all voices to 25 kHz and then to oversample them to 44.1 kHz. In order to avoid overtraining, the data set was divided into 10 randomly chosen groups with five pathological and five nonpathological voices each. A 10-fold cross-validation was then performed, where, in each fold, the threshold was selected based on nine of these groups (a total of 90 samples), but its performance was evaluated on the remaining group of 10 voices. By rotating the left-out group, ten tests were conducted; the results are presented in Table 3. The mean accuracy is the average of the percentage of correct pathological/nonpathological voice decisions for each fold. The variance of the 10 results is also presented.

Table 3: Results of the 10-fold cross validation procedure. The mdvpLocJitta tool produced the best average accuracy with a low variance on the 10 tests.

Tool            Mean accuracy   Variance   Threshold
dympSTJE        76%             2%         3.44%
dympJitt        68%             2%         0.72%
dympLocJitt     68%             2%         0.66%
mdvpJitt        70%             0%         0.44%
mdvpLocJitt     70%             0%         0.40%
praatJitt       78%             3%         0.15%
praatLocJitt    74%             2%         0.12%
dympSTJEa       81%             2%         250.1 μs
dympJitta       70%             1%         46.1 μs
dympLocJitta    71%             1%         60.9 μs
mdvpJitta       82%             1%         19.1 μs
mdvpLocJitta    84%             1%         19.6 μs
praatJitta      80%             2%         8.6 μs
praatLocJitta   79%             2%         7.4 μs

This table shows that the different tools provide different estimates for jitter, not only because they rely on different models of jitter but also because the results are based on different pitch marking methods. This can explain, for example, the difference between the thresholds of dympJitt and mdvpJitt, or between dympLocJitt and dympSTJE.

The results of the 10-fold cross-validation procedure were used to calculate the best decision threshold for each tool. The values are also presented in Table 3.
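For concreteness, the following sketch (Python) reproduces the flavour of this 10-fold threshold tuning. The details not stated in the text are our own assumptions: the per-fold threshold is found by an exhaustive search over midpoints of the sorted training values, and the stratified five/five group construction is replaced here by a plain random split.

```python
import numpy as np

def tune_threshold(jitter_vals, is_pathological, n_folds=10, rng=None):
    """10-fold cross-validation of a single-threshold pathological/normal
    classifier. jitter_vals: one jitter estimate per voice; is_pathological:
    boolean labels. Sketch only; names and search strategy are ours."""
    if rng is None:
        rng = np.random.default_rng(0)
    x = np.asarray(jitter_vals, float)
    y = np.asarray(is_pathological, bool)
    folds = np.array_split(rng.permutation(len(x)), n_folds)
    accs, thresholds = [], []
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        # candidate thresholds: midpoints between sorted training values
        xs = np.sort(x[train])
        cand = (xs[:-1] + xs[1:]) / 2.0
        acc_train = [np.mean((x[train] > t) == y[train]) for t in cand]
        t_best = cand[int(np.argmax(acc_train))]
        thresholds.append(t_best)
        accs.append(np.mean((x[test] > t_best) == y[test]))
    return np.mean(accs), np.var(accs), np.mean(thresholds)
```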

8.2. Tool Evaluation. After the definition of the best thresholds for the pathological/nonpathological voice classifier, the different tools were evaluated on the two previously described databases: the subset of the MEEI and DB02.

On the selected subset of the MEEI database, the tools showed a behaviour similar to what was observed in the 10-fold cross-validation test: the best PM locator is the MDVP software. Regarding the jitter estimation tool, the STJE algorithm performed better than the remaining tools. Comparing this result with the 10-fold test, it is clear that the larger variability of values that this algorithm produces makes it more dependent on the size of the data that is used to tune the threshold. Except for the case of pitch marks produced by the Praat tool, the new LocJitt algorithm performed equal to or better than the common Jitt measure.

Another interesting result is the better performance of absolute jitter values (measured in microseconds) over relative ones (measured in % of the glottal period). This observation suggests that there is a certain amount of aperiodicity that seems to indicate the presence of a pathology and that is independent of the length of the glottal cycle. The use of relative jitter measures can prevent the detection of a pathology when the voice has a very low fundamental frequency, which would be detected with an absolute jitter measurement. Finally, the STJE algorithm seems to present a good accuracy, although it requires much higher thresholds combined with a rather low robustness (reflected by a larger variance).

To see how the tools behaved on a completely different database, we also performed the evaluation on the DB02 database.


Table 4: Results of the evaluation on the full databases.

Tool            MEEI   DB02
dympSTJE        83%    69%
dympJitt        71%    88%
dympLocJitt     71%    88%
mdvpJitt        73%    63%
mdvpLocJitt     75%    63%
praatJitt       80%    69%
praatLocJitt    77%    69%
dympSTJEa       87%    69%
dympJitta       75%    88%
dympLocJitta    76%    88%
mdvpJitta       84%    63%
mdvpLocJitta    85%    63%
praatJitta      82%    69%
praatLocJitta   82%    69%

This database, although smaller, had the advantage of not being used in the threshold tuning process; moreover, it was recorded in a completely different environment. Table 4 presents interesting results when compared to the previous ones. A general analysis shows that the results for this database are quite different. Firstly, the STJE performance decreases, probably explained by the fact that these voices were recorded with a much higher sampling frequency and contain more noise, which increases the probability of intersections in the frequency domain.

Secondly, tools using MDVP's PM also seem to provide lower accuracies on the new database. It is a fact that MDVP is sensitive to noise, which may influence the localization of the pitch marks, conditioning the final jitter estimation. On the other hand, Praat seems to present more accurate results in a noisy environment; this fact is also described in the literature [6].

For the evaluation on DB02, the best performance goes to the tools using the dymp pitch marking tool. Given the low number of voices in this database, the fact that no differences between the Jitt and LocJitt algorithms are detected is considered acceptable. Also, in this database, there were no noticeable differences in performance between absolute jitter values and relative ones. This can be explained by the smaller size of this database and by the fact that it was recorded at a higher sampling rate (50 kHz).

All results, although preliminary, support a very important conclusion: jitter seems to be in fact an important measurement to indicate the existence of a possible pathology of the vocal folds.

9. Conclusions and Future Work

The first conclusion is that, although most previous results use relative jitter values, in our study on the MEEI database absolute jitter values produced better results in the detection of pathological voices. This difference was not observed in DB02, which can be explained by the smaller size of this database.

It was expected that the amount of the disorder (expressed by the jitter parameter) would depend directly on the frequency of vibration of the vocal folds, but the results suggest a different conclusion: the jitter threshold for pathological voice seems to be independent of the period length. In future work we plan to extend this study, analysing sustained vowels of the same speaker with a higher and a lower pitch to see the influence of the fundamental frequency on jitter measurements.

The dymp pitch marking tool produced the best performance when applied to nonideal conditions or to higher sampling frequencies. The inverse filtering technique is a promising solution for clinical applications, where it is normally difficult to provide an ideal acoustic environment.

Concerning the newly proposed measure for jitter, LocJitt, it provided the highest accuracy and the minimum variance during the parameter tuning process. In the evaluation on the full databases, the best results for the MEEI database were achieved with the STJE algorithm; however, the result seems to be dependent on the database, since it did not perform as well on DB02. The only case where Jitt outperforms LocJitt is when the pitch marks are computed with the Praat tool and relative jitter is used. In all other cases and for both databases, LocJitt achieved results that are equal to or better than Jitt.

An interesting future work would be to continue the recordings of the DB02 database in order to have a significant number of entries to better adjust threshold levels, not only for an individual jitter evaluation but also for more complex evaluations where jitter is one of several features used to detect perturbations in voice.

Finally, the DB02 database also includes other exams, like the sustained vowel with increasing pitch, the text reading, or even the AEIOU exam, that were not yet used. We hope that further research on these exams will bring useful information about the effect of the different pathologies on the mode of vibration of the vocal folds.

As a final conclusion, we would like to stress that objective measures of voice quality resulting from acoustic analysis can be a very powerful tool, not just for pathological voice detection but also for other domains like voice therapy or even professional voice coaching. The joint effort of physicians and engineers should be targeted not only at finding voice disorders but, most importantly, at preventing them.

Acknowledgments

The authors would like to acknowledge the support of COST Action 2103 for this work, namely, in funding the participation of the first author in the "Multi-disciplinary Summer Training School on Voice Assessment" in Tampere, Finland. This work was also partially funded by the Portuguese Foundation for Science and Technology (FCT).

References

[1] J. P. Dworkin and R. J. Meleca, Vocal Pathologies: Diagnosis, Treatment & Case Studies, Singular, San Diego, Calif, USA, 1996.


[2] J. Kreiman, B. R. Gerratt, G. B. Kempster, A. Erman, and G. S. Berke, "Perceptual evaluation of voice quality: review, tutorial, and a framework for future research," Journal of Speech and Hearing Research, vol. 36, no. 1, pp. 21–40, 1993.

[3] "Multi-Dimensional Voice Program, Model 5105".

[4] P. Boersma and D. Weenink, "Praat, a system for doing phonetics by computer," Glot International, vol. 5, pp. 341–345, 2001.

[5] J. Schoentgen, "Stochastic models of jitter," Journal of the Acoustical Society of America, vol. 109, no. 4, pp. 1631–1650, 2001.

[6] O. Amir, M. Wolf, and N. Amir, "A clinical comparison between two acoustic analysis softwares: MDVP and Praat," Biomedical Signal Processing and Control, vol. 4, no. 3, pp. 202–205, 2009.

[7] J. I. Godino-Llorente and P. Gomez-Vilda, "Automatic detection of voice impairments by means of short-term cepstral parameters and neural network based detectors," IEEE Transactions on Biomedical Engineering, vol. 51, no. 2, pp. 380–384, 2004.

[8] R. J. Moran, R. B. Reilly, P. de Chazal, and P. D. Lacy, "Telephony-based voice pathology assessment using automated speech analysis," IEEE Transactions on Biomedical Engineering, vol. 53, no. 3, pp. 468–477, 2006.

[9] P. Gomez-Vilda, R. Fernandez-Baillo, V. Rodellar-Biarge, et al., "Glottal source biometrical signature for voice pathology detection," Speech Communication, vol. 50, no. 9, pp. 759–781, 2009.

[10] D. Wong, M. R. Ito, N. B. Cox, and I. R. Titze, "Observation of perturbations in a lumped-element model of the vocal folds with application to some pathological cases," The Journal of the Acoustical Society of America, vol. 89, no. 1, pp. 383–394, 1991.

[11] L. Lehto, M. Airas, E. Bjorkner, J. Sundberg, and P. Alku, "Comparison of two inverse filtering methods in parameterization of the glottal closing phase characteristics in different phonation types," The Journal of Voice, vol. 21, no. 2, pp. 138–150, 2007.

[12] B. S. Atal and S. L. Hanauer, "Speech analysis and synthesis by linear prediction of the speech wave," The Journal of the Acoustical Society of America, vol. 50, no. 2B, pp. 637–655, 1971.

[13] "Disordered Voice Database and Program, Model 4337," 1994.

[14] A. Kounoudes, P. Naylor, and M. Brookes, "The DYPSA algorithm for estimation of glottal closure instants in voiced speech," in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '02), vol. 1, pp. 349–352, Orlando, Fla, USA, May 2002.

[15] M. Brookes, "VOICEBOX: Speech Processing Toolbox for MATLAB," 2003.

[16] M. Vasilakis and Y. Stylianou, "A mathematical model for accurate measurement of jitter," in Proceedings of the 5th International Workshop on Models and Analysis of Vocal Emissions for Biomedical Applications, Firenze University Press, Firenze, Italy, December 2007.


Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2009, Article ID 784379, 7 pages
doi:10.1155/2009/784379

Research Article

Removing the Influence of Shimmer in the Calculation of Harmonics-To-Noise Ratios Using Ensemble-Averages in Voice Signals

Carlos Ferrer, Eduardo Gonzalez, María E. Hernandez-Díaz, Diana Torres, and Anesto del Toro

Center for Studies on Electronics and Information Technologies, Central University of Las Villas, C. Camajuaní, km 5.5, Santa Clara, CP 54830, Cuba

Correspondence should be addressed to Carlos Ferrer, [email protected]

Received 1 November 2008; Revised 10 March 2009; Accepted 13 April 2009

Recommended by Juan I. Godino-Llorente

Harmonics-to-noise ratios (HNRs) are affected by general aperiodicity in voiced speech signals. To specifically reflect a signal-to-additive-noise ratio, the measurement should be insensitive to other periodicity perturbations, like jitter, shimmer, and waveform variability. The ensemble averaging technique is a time-domain method which has been gradually refined in terms of its sensitivity to jitter and waveform variability and required number of pulses. In this paper, shimmer is introduced in the model of the ensemble average, and a formula is derived which allows the reduction of shimmer effects in HNR calculation. The validity of the technique is evaluated using synthetically shimmered signals, and the prerequisites (glottal pulse positions and amplitudes) are obtained by means of fully automated methods. The results demonstrate the feasibility and usefulness of the correction.

Copyright © 2009 Carlos Ferrer et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

When the source-filter model of speech production [1] is assumed in Type 1 [2] signals (no apparent bifurcations/chaos), the sources of periodicity perturbations in voiced speech signals can be divided into four classes [3]: (a) pulse frequency perturbations, also known as jitter, (b) pulse amplitude perturbations, also known as shimmer, (c) additive noise, and (d) waveform variations, caused either by changes in the excitation (source) or in the vocal tract (filter) transfer function. Vocal quality measurements have focused mainly on the first three classes (see [4] for a comprehensive survey of methods reported in the previous century). The findings of significant interrelations among measures of jitter, shimmer, and additive noise [5] raised the question of "whether it is important to be able to assign a given acoustic measurement to a specific type of aperiodicity" (page 457). This ability of a measurement to gauge a particular signal attribute while being insensitive to other factors has been a persistent interest in vocal quality research.

Harmonics-to-Noise Ratios (HNRs) have been proposed as measures of the amount of additive noise in the acoustic waveform. However, an HNR measure insensitive to all the other sources of perturbation is, if feasible, still to be found. Methods in both the time and the frequency (or transformed) domains always have intrinsic flaws. Schoentgen [6] described analytically the effects of the different perturbations on the Fourier spectra of source and radiated waveforms. According to the derivations from his models, it is not possible to perform separate measurements of each type of perturbation by using spectral-based methods. Time domain methods have been criticized [7, 8] for depending on the correct determination of the individual pulse boundaries, among many other method-specific factors.

Yumoto et al. introduced a time-domain method for determining HNR [9], where the energy of the harmonic (repetitive) component is equal to the variance of a pulse "template" obtained as the ensemble average of the individual pulses.


The energy of the noise component is calculated as the variance of the differences between the ensemble and the template (see (4) in Section 2).

The original ensemble-averaging technique has been criticized [10, 11] for its slow convergence with N, the number of averaged pulses. The requirement of a large N facilitates the inclusion of slow waveform changes in the ensemble, which are incorrectly treated as noise by the method. The sensitivity of the method to jitter and shimmer has also been reported [5], and many approaches attempting to overcome these limitations have been proposed.

In [12] the need to average a large number of pulses is suppressed by determining an expression which corrects the ensemble-average HNR.

Qi et al. used Dynamic Time Warping (DTW) [13] and later Zero Phase Transforms (ZPTs) [14] of individual pulses prior to averaging to reduce the influence of waveform variability (and jitter) on the template. For the same purpose, the ensemble averaging technique was applied to the spectral representations of individual glottal source pulses in [3], where a pitch-synchronous method allowed accounting for jitter and shimmer in the glottal waveforms. However, the assumptions are valid only on glottal source signals; hence the results are not applicable to vocal tract filtered signals. Functional Data Analysis (FDA) has also been used to perform the optimal time alignment of pulses prior to averaging [15].

Shimmer corrections to ensemble-average HNRs have received less attention than pulse duration (jitter) corrections, in spite of being a prerequisite for some of the mentioned jitter correction methods. DTW and FDA, for instance, start by considering equal amplitude pulses to determine the required expansion/compression of the waveform duration. Besides, shimmer always increases the variability of the ensemble with respect to the template in the reported methods. A normalization of each individual pulse by its RMS value was proposed in [7] to reduce shimmer effects on HNR and was first used in a method that also accounted for jitter and offset effects in [16]. This pulse amplitude (shimmer) normalization can help in the time warping of the pulses and actually reduces the variance of the template in Yumoto's HNR formula. However, it still yields only an approximate value of HNR.

In this paper, an analysis of the original ensemble average HNR formula in the presence of shimmer is performed, which results in a general form of Ferrer's correcting formula [12] and allows the suppression of the effect of shimmer in HNR.

2. Ensemble-Averages HNR Calculation

The most widely used model for ensemble averaging assumes each pulse representation xi(t) prior to averaging as a repetitive signal s(t) plus a noise term ei(t):

$$x_i(t) = s(t) + e_i(t). \qquad (1)$$

This representation has been used for source [3] and radiated signals [5, 9, 14, 16], as well as for both indistinctly [12, 15]. If we denote the glottal flow waveform as g(t),

the vocal tract impulse response as h(t), the radiation at the lips as r(t), and the turbulent noise generated at the glottis as n(t), the components of the pulse waveform in (1) can be expressed differently for the source and radiated signals. If (1) represents the excitation signal, then s(t) = g(t) and e(t) = n(t), while for radiated signals s(t) = g(t) ∗ h(t) ∗ r(t) and e(t) = n(t) ∗ h(t) ∗ r(t) [17], with the asterisk denoting the convolution operation. Some important differences between both alternatives are [17] as follows.

(i) HNR measured in the radiated signal differs from HNR in the glottal signal.

(ii) Jitter in the glottal signal produces shimmer in the radiated signal.

(iii) Additive White Gaussian Noise (AWGN) in the glottis (a rough approximation [18] frequently assumed) yields colored noise at the lips.

In the general form of the ensemble average approach, if the noise term ei(t) is stationary and ergodic and s(t) and ei(t) are zero mean signals (the typical assumptions in the minimization of the mean squared error [12, 19, 20]) with variances σs² and σe², the actual HNR for the set of N pulses is

$$\mathrm{HNR} = \frac{E\!\left[\sum_{i=1}^{N} s(t)^2\right]}{E\!\left[\sum_{i=1}^{N} e_i(t)^2\right]} = \frac{N\,E\!\left[s(t)^2\right]}{\sum_{i=1}^{N} E\!\left[e_i(t)^2\right]} = \frac{\sigma_s^2}{\sigma_e^2} \qquad (2)$$

with E[·] denoting the expected value operation. The ensemble averaging method proposed by Yumoto et al. [9] is based on the use of a pulse template x̄(t) as an estimate of the repetitive component s(t):

$$\bar{x}(t) = \frac{\sum_{i=1}^{N} x_i(t)}{N} = s(t) + \frac{\sum_{i=1}^{N} e_i(t)}{N}. \qquad (3)$$

This approximation to s(t) is then used to obtain an estimate of ei(t) according to (1), and both estimates are used in (2) to produce Yumoto's HNR formula:

$$\mathrm{HNR}_{\mathrm{Yum}} = \frac{N\,E\!\left[\bar{x}^2(t)\right]}{\sum_{i=1}^{N} E\!\left[\left(x_i(t) - \bar{x}(t)\right)^2\right]}. \qquad (4)$$

The bias produced in HNRYum due to the use of (3) in its calculation and the terms needed to correct it are described in [12], where it is shown that

$$\mathrm{HNR} = \frac{\sigma_s^2}{\sigma_e^2} = \frac{N-1}{N}\,\mathrm{HNR}_{\mathrm{Yum}} - \frac{1}{N}. \qquad (5)$$

However, the model previously described neglects the effect of shimmer when the different replicas of the repetitive signal are of different amplitude.
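The computations of (3)–(5) are straightforward to express in code; the sketch below (Python, with a function name of our own) assumes that the pulses have already been segmented and brought to a common length, one pulse per row.

```python
import numpy as np

def hnr_ensemble(pulses):
    """Ensemble-average HNR of a set of equal-length pulses (one pulse per row),
    following (4), together with the pulse-number correction of (5).
    Sketch only: pulse segmentation and length normalisation are assumed done."""
    X = np.asarray(pulses, float)
    N = X.shape[0]
    template = X.mean(axis=0)                          # eq (3): pulse template
    num = N * np.mean(template ** 2)                   # N * E[xbar^2]
    den = sum(np.mean((X[i] - template) ** 2) for i in range(N))
    hnr_yum = num / den                                # eq (4)
    hnr_corrected = (N - 1) / N * hnr_yum - 1.0 / N    # eq (5)
    return hnr_yum, hnr_corrected
```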


3. Insertion of Shimmer in the Model

To account for shimmer, a variable ai can be added to the model in (1):

$$x_i(t) = a_i\,s(t) + e_i(t). \qquad (6)$$

For this model, the actual HNR is

$$\mathrm{HNR} = \frac{E\!\left[\sum_{i=1}^{N} (a_i s(t))^2\right]}{E\!\left[\sum_{i=1}^{N} e_i(t)^2\right]} = \frac{\sum_{i=1}^{N} a_i^2\,E\!\left[s(t)^2\right]}{\sum_{i=1}^{N} E\!\left[e_i(t)^2\right]} = \frac{\sum_{i=1}^{N} a_i^2\,\sigma_s^2}{N\sigma_e^2}. \qquad (7)$$

Using the original ensemble average procedure, the template yields

$$\bar{x}(t) = \frac{\sum_{i=1}^{N} x_i(t)}{N} = \frac{s(t)\sum_{i=1}^{N} a_i + \sum_{i=1}^{N} e_i(t)}{N}, \qquad (8)$$

and its variance is

$$\sigma_{\bar{x}}^2 = E\!\left[\bar{x}^2(t)\right] = \frac{E\!\left[\left(s(t)\sum_{i=1}^{N} a_i\right)^2 + 2 s(t)\sum_{i=1}^{N} e_i(t)\sum_{k=1}^{N} a_k + \sum_{i=1}^{N} e_i(t)\sum_{k=1}^{N} e_k(t)\right]}{N^2}. \qquad (9)$$

If ei(t) is uncorrelated with s(t) or with any ek(t) such that k ≠ i, the second term between brackets in (9), as well as all the products in the third term where k ≠ i, can be suppressed:

$$E\!\left[\bar{x}^2(t)\right] = \frac{\left(\sum_{i=1}^{N} a_i\right)^2 E\!\left[s(t)^2\right] + \sum_{i=1}^{N} E\!\left[e_i(t)^2\right]}{N^2} = \left(\sum_{i=1}^{N} a_i\right)^{\!2}\frac{\sigma_s^2}{N^2} + \frac{\sigma_e^2}{N}. \qquad (10)$$

With the inclusion of shimmer in the model, the denominator in (4) is

$$\begin{aligned}
\mathrm{Den} &= \sum_{i=1}^{N} E\!\left[\left(x_i(t) - \bar{x}(t)\right)^2\right] \\
&= \sum_{i=1}^{N} E\!\left[\left(a_i s(t) + e_i(t) - \sum_{j=1}^{N}\frac{a_j s(t)}{N} - \sum_{j=1}^{N}\frac{e_j(t)}{N}\right)^{\!2}\right] \\
&= \sum_{i=1}^{N} E\!\left[\left(a_i\frac{N-1}{N}s(t) - \sum_{\substack{j=1\\ j\neq i}}^{N}\frac{a_j}{N}s(t) + e_i(t)\frac{N-1}{N} - \sum_{\substack{j=1\\ j\neq i}}^{N}\frac{e_j(t)}{N}\right)^{\!2}\right].
\end{aligned} \qquad (11)$$

To simplify further derivations, the letters m, n, o, and p are used to represent the four terms summed and squared in (11):

$$m = a_i\frac{N-1}{N}s(t), \quad n = -\sum_{\substack{j=1\\ j\neq i}}^{N}\frac{a_j}{N}s(t), \quad o = e_i(t)\frac{N-1}{N}, \quad p = -\sum_{\substack{j=1\\ j\neq i}}^{N}\frac{e_j(t)}{N}. \qquad (12)$$

Using (12), (11) can be written as

$$\mathrm{Den} = \sum_{i=1}^{N} E\!\left[m^2 + n^2 + o^2 + p^2 + 2mn + 2mo + 2mp + 2no + 2np + 2op\right], \qquad (13)$$

where the last five terms between brackets can be suppressed, since E[ei(t)ej(t)] = 0 for any i ≠ j. From the first five terms, it was already shown in [12] that

$$\sum_{i=1}^{N} E\!\left[o^2 + p^2\right] = (N-1)\,\sigma_e^2. \qquad (14)$$

The summations of the other nonzero expected values (E[m²], E[n²], and E[2mn]) are examined as follows:

$$\sum_{i=1}^{N} E\!\left[m^2\right] = \sum_{i=1}^{N} E\!\left[a_i^2\frac{(N-1)^2}{N^2}s^2(t)\right] = \frac{(N-1)^2\sum_{i=1}^{N} a_i^2}{N^2}\,\sigma_s^2, \qquad (15)$$


while

$$\sum_{i=1}^{N} E\!\left[n^2\right] = \sum_{i=1}^{N} E\!\left[\frac{s^2(t)}{N^2}\sum_{\substack{j=1\\ j\neq i}}^{N} a_j \sum_{\substack{k=1\\ k\neq i}}^{N} a_k\right] = \frac{\sigma_s^2}{N^2}\sum_{i=1}^{N}\left(\sum_{\substack{j=1\\ j\neq i}}^{N} a_j \sum_{\substack{k=1\\ k\neq i}}^{N} a_k\right), \qquad (16)$$

and using

$$\sum_{i=1}^{N}\left(\sum_{\substack{j=1\\ j\neq i}}^{N} a_j \sum_{\substack{k=1\\ k\neq i}}^{N} a_k\right) = \sum_{i=1}^{N} a_i^2 + (N-2)\left(\sum_{i=1}^{N} a_i\right)^{\!2}, \qquad (17)$$

(16) yields

$$\sum_{i=1}^{N} E\!\left[n^2\right] = \frac{\sigma_s^2}{N^2}\left(\sum_{i=1}^{N} a_i^2 + (N-2)\left(\sum_{i=1}^{N} a_i\right)^{\!2}\right). \qquad (18)$$

Finally

$$\sum_{i=1}^{N} E\!\left[2mn\right] = -\frac{2(N-1)\,E\!\left[s^2(t)\right]}{N^2}\sum_{i=1}^{N} a_i \sum_{\substack{j=1\\ j\neq i}}^{N} a_j, \qquad (19)$$

since

$$\left(\sum_{i=1}^{N} a_i\right)^{\!2} = \sum_{i=1}^{N} a_i^2 + \sum_{i=1}^{N} a_i \sum_{\substack{j=1\\ j\neq i}}^{N} a_j, \qquad (20)$$

then (19) results in

$$\sum_{i=1}^{N} E\!\left[2mn\right] = -2\sigma_s^2\frac{N-1}{N^2}\left(\left(\sum_{i=1}^{N} a_i\right)^{\!2} - \sum_{i=1}^{N} a_i^2\right). \qquad (21)$$

The sum of (15), (18), and (21) is

$$\sum_{i=1}^{N} E\!\left[m^2 + n^2 + 2mn\right] = \sigma_s^2\left(\sum_{i=1}^{N} a_i^2 - \frac{1}{N}\left(\sum_{i=1}^{N} a_i\right)^{\!2}\right). \qquad (22)$$

Now, substituting (14) and (22) in the denominator of (4), and (10) in the numerator, gives

$$\mathrm{HNR}_{\mathrm{Yum}} = \frac{\left(\sum_{i=1}^{N} a_i\right)^2\left(\sigma_s^2/N\right) + \sigma_e^2}{\sigma_s^2\left(\sum_{i=1}^{N} a_i^2 - \left(\sum_{i=1}^{N} a_i\right)^2(1/N)\right) + \sigma_e^2(N-1)}. \qquad (23)$$

From (23) the ratio of the signal and noise variances can be determined as

$$\frac{\sigma_s^2}{\sigma_e^2} = \frac{\mathrm{HNR}_{\mathrm{Yum}}(N-1) - 1}{\left(\sum_{i=1}^{N} a_i\right)^2(1/N) - \mathrm{HNR}_{\mathrm{Yum}}\left(\sum_{i=1}^{N} a_i^2 - \left(\sum_{i=1}^{N} a_i\right)^2(1/N)\right)}, \qquad (24)$$

and the actual HNR given by (7) can be rewritten as

$$\mathrm{HNR} = \frac{\left[\mathrm{HNR}_{\mathrm{Yum}}(N-1) - 1\right]\sum_{i=1}^{N} a_i^2}{\left(\sum_{i=1}^{N} a_i\right)^2 - \mathrm{HNR}_{\mathrm{Yum}}\left(N\sum_{i=1}^{N} a_i^2 - \left(\sum_{i=1}^{N} a_i\right)^2\right)}. \qquad (25)$$

Equation (25) can be simplified by using a factor K defined as

$$K = \frac{N\sum_{i=1}^{N} a_i^2}{\left(\sum_{i=1}^{N} a_i\right)^2}, \qquad (26)$$

and HNR expressed as

$$\mathrm{HNR} = \frac{K\left[\mathrm{HNR}_{\mathrm{Yum}}(N-1) - 1\right]}{N\left(1 - \mathrm{HNR}_{\mathrm{Yum}}(K-1)\right)}. \qquad (27)$$

According to (26), K will be a positive number ranging from one (in the no-shimmer case, all ai being equal) to N, when a single pulse is much greater than all the others. The latter situation does not occur in voiced signals, where the largest shimmer almost never exceeds 50% of the mean amplitude [2], even in extremely pathological voices. Equation (27) is a generalization of Ferrer's correcting formula [12] expressed in (5), being equal to it in the no-shimmer case (K = 1).
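A minimal sketch of the shimmer correction (26)–(27), complementing the ensemble-average sketch given at the end of Section 2; the pulse amplitudes a_i are assumed to be known or estimated beforehand, and the function name is ours.

```python
import numpy as np

def hnr_shimmer_corrected(hnr_yum, amplitudes):
    """Shimmer-corrected HNR following (26)-(27). hnr_yum is the value given by
    (4); amplitudes are the (known or estimated) relative pulse amplitudes a_i."""
    a = np.asarray(amplitudes, float)
    N = len(a)
    K = N * np.sum(a ** 2) / np.sum(a) ** 2            # eq (26), K >= 1
    return K * (hnr_yum * (N - 1) - 1) / (N * (1 - hnr_yum * (K - 1)))  # eq (27)

# usage: combine with the hnr_ensemble sketch, e.g. for pairs of consecutive pulses
# hnr = hnr_shimmer_corrected(hnr_ensemble(pulse_pair)[0], amplitude_pair)
```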

4. Experiment

The calculation of (27) requires the prior determination of both pulse boundaries and amplitudes. Pulse boundaries are usually determined by means of a cycle-to-cycle pitch detection algorithm (PDA). The determination of pulse amplitudes relies on the pitch contour detected by the PDA, and a comparison of several amplitude measures can be found in [21]. In practice, the detected pulse boundaries and amplitudes differ from the real ones, causing a reduction in the theoretical usefulness of (27). An additional deterioration can be expected in the presence of correlated noise, as should be the case in radiated speech signals.

To evaluate the effects of these deteriorations, synthetic voiced signals were used with known pulse positions, noise, and shimmer levels. The synthesis procedure of the speech signal s(t) is described by (28):

$$s(t) = h(t) * \sum_{i=1}^{M} k_i\, g'(t - iT_0) + e(t), \qquad (28)$$

where h(t) is the vocal tract impulse response, ∗ denotes the convolution operation, ki is the variable pulse amplitude, g(t) is the glottal flow waveform, i is the pulse number, T0 is the pitch period, and e(t) is the additive noise in the signal. The effect of lip radiation has been included as the first derivative operation present in g′(t). This synthesis procedure is similar to the one used in [12, 19, 21, 22], but uses a more refined glottal excitation than an impulse train. In this case, a train of Rosemberg's type B polynomial model pulses [23] was chosen; this alternative is also used in [3, 24].


Figure 1: Results for the different HNR estimation methods, plotted as HNR (dB) versus maximum shimmer level (%). HNRY (triangles) is the original formula in [9], HNRC (squares) the pulse-number correction in [12], HNRS (plus signs) the shimmer correction proposed here (using known pulse amplitudes), and HNRSr (circles) the shimmer correction using estimated pulse amplitudes. Dashed lines represent results with AWGN; solid lines and apostrophes represent vocal tract filtered AWGN. The horizontal dashed line at 30 dB represents the true HNR.

The discrete implementation of (28) was performed by setting a sampling frequency of 22050 Hz, a fundamental frequency of 150 Hz (yielding 147 samples per period), and M = 300, to produce approximately 2 seconds of synthesized voice. The h(t) was obtained as the impulse response of a five-formant all-pole filter, with the same parameters used in [12, 19, 21, 22]. The glottal flow was generated using a rising time of 0.33 T0 and a falling time of 0.09 T0, the values which resulted in the most natural-sounding synthesis in [23].

The shimmer was controlled by changing the value of each pulse amplitude ki, obtained as ki = 1 + vi, where vi is a random real value uniformly distributed in the interval ±vm. Eight levels of shimmer were synthesized, using values of vm from 0% to 47.6% in steps of 6.8%, measured in percent of the unaltered amplitude k = 1, the same values as in [12, 21].
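The following sketch reproduces the flavour of this synthesis in Python. The Rosemberg pulse formulation and the formant frequencies and bandwidths below are our own placeholder choices (the paper takes the filter parameters from [12, 19, 21, 22]), so it should be read as an illustration of (28) rather than the exact configuration used in the experiments.

```python
import numpy as np
from scipy.signal import lfilter

fs, f0, M = 22050, 150.0, 300                 # values given in the text
T0 = int(round(fs / f0))                      # 147 samples per period

def rosenberg_pulse(T0, tp=0.33, tn=0.09):
    """Polynomial glottal pulse with rising time 0.33*T0 and falling time
    0.09*T0 (a common formulation of Rosemberg's type B model), zero elsewhere."""
    n = np.arange(T0)
    Np, Nn = int(tp * T0), int(tn * T0)
    g = np.zeros(T0)
    rise = n[:Np] / Np
    g[:Np] = 3 * rise**2 - 2 * rise**3
    fall = (n[Np:Np + Nn] - Np) / Nn
    g[Np:Np + Nn] = 1 - fall**2
    return g

# lip radiation included as the first derivative of the glottal flow
gd = np.diff(rosenberg_pulse(T0), prepend=0.0)

# shimmered excitation: k_i = 1 + v_i, v_i uniform in +/- v_m
vm = 0.204                                    # one of the eight levels (20.4%)
rng = np.random.default_rng(0)
k = 1.0 + rng.uniform(-vm, vm, size=M)
excitation = np.zeros(M * T0)
for i in range(M):
    excitation[i * T0:(i + 1) * T0] = k[i] * gd

# five-formant all-pole vocal tract; the formant values below are placeholders,
# not the parameters of [12, 19, 21, 22]
formants, bws = [500, 1500, 2500, 3500, 4500], [60, 90, 120, 150, 180]
a = np.array([1.0])
for F, B in zip(formants, bws):
    r = np.exp(-np.pi * B / fs)
    a = np.convolve(a, [1, -2 * r * np.cos(2 * np.pi * F / fs), r**2])
speech = lfilter([1.0], a, excitation)

# additive noise scaled for a true HNR of 1000 (30 dB)
noise = rng.normal(size=speech.size)
noise *= np.sqrt(speech.var() / (1000.0 * noise.var()))
speech_noisy = speech + noise
```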

The HNR estimates calculated were the original ensemble average formula by Yumoto given in (4), the correction for any number of pulses given in (5), and the removal of shimmer effects given by (27). The three HNR estimates were calculated using first the known pulse durations and amplitudes, and then using the positions given by a well-known PDA (the superresolution approach from Medan et al. [19]); the amplitudes were calculated with Milenkovic's formula [20], using the procedure described in [21].

A base level of noise was added to the signal, to avoid values near zero in the denominator of HNRYum in (4).

The variance of the noise added was chosen to produce an actual HNR = 1000 (30 dB). Two types of noise were added: AWGN, in conformity with the assumptions of uncorrelated noise made in deriving (27), and a vocal tract filtered version, having some level of correlation, which is most likely the case in radiated signals.

The HNR estimates were found for ensembles of two consecutive pulses (N = 2) in the synthesized signals, and the overall HNR was found as the average of these pairwise HNRs.

5. Results and Discussion

The average value for 100 realizations of the random variables involved (noise and shimmer) was found for each HNR estimation variant at each shimmer level. It is relevant to note that the PDA detected the pulse positions without any error (not even a sample), for all realizations and all levels of shimmer. For this reason, (4) and (5) produced the same results using both the known and the detected pulse positions. Equation (27) produced different results since it also involves the calculation of the amplitude ratios among pulses, which produced results different from the values used in the synthesis.

The results for the different methods facing both noise types are shown in Figure 1, and the discussion below is centered first on the AWGN and later on the effect of the correlation present in the vocal tract filtered noise.

AWGN. For the zero-shimmer level the results are as predicted: the original approach (HNRY) overestimates the actual HNR (30 dB), while the corrected approaches produce adequate and equivalent results. When shimmer appears, HNRC begins to fall in parallel with HNRY, while both approaches considering shimmer, HNRS and HNRSr, show superior performance, with their values less affected by the increasing levels of shimmer.

Two relevant facts are as follows.

(i) Shimmer-corrected approaches (HNRS and HNRSr) are nevertheless deteriorated by the shimmer level.

(ii) There is a better performance of HNRSr in comparison with HNRS, in spite of using estimated values for the pulse amplitudes.

Both facts can be explained by the presence, in any pulse of the signal, of the decaying tails of previous pulses. This summation of tails adds differences to the pulses, interpreted as noise in the model and causing a reduction in the calculated HNR as the introduced shimmer increases. On the other hand, the summation of tails in one pulse is not completely uncorrelated with the summation of tails in the other. For this reason, the estimation of relative pulse amplitudes, based on the assumption of uncorrelated noise, produces amplitudes with an overestimation of the signal component, yielding a higher HNRSr than HNRS.

It is to be expected that in the presence of jitter HNRSr will perform worse, since pulse tails would not always be aligned with the adjacent pulse, and the correlation should be lower.


The evaluation of the influence of jitter (as well as of other levels of noise and their combinations) on the performance of the PDA and HNRSr would require extensive tests and is out of the scope of this paper.

Vocal tract filtered AWGN. When noise is not uncorrelated as assumed in the derivation of (27), a fraction of it is regarded as signal, incrementing the HNR estimates (solid lines) in all variants with respect to the results with uncorrelated noise (dashed lines). A significant fact is that this overestimation is more relevant in HNRS (plus signs in Figure 1) than in HNRSr (circles). The correlated contributions of noise and shimmered tails add to what is considered signal by the model in HNRS, while in HNRSr this effect seems to be compensated by its related consequence in estimating pulse amplitudes with the same assumptions about noise and signal correlations.

In general, shimmer corrections with estimated amplitude contours (HNRSr, in circles in Figure 1) produce the closest estimates to the true HNR, which for these experiments would be the flat horizontal line at 30 dB shown in Figure 1.

6. Conclusions

The performed analysis shows that shimmer effects can be reduced in HNR estimations based on the ensemble-averages technique, using assumptions similar to those in [3, 20]. The requirements for the calculation of (27) (detection of pulse positions and amplitudes) can be met with satisfactory results using available methods.

More tests should be performed considering more types of perturbations (different noise and jitter values, as well as their combinations) and different vocal tract configurations. However, the experiments in this paper were performed using configurations reported in other works, and, based on the preliminary results shown, the proposed approach appears to be an alternative for the estimation of HNR in the time domain that is superior to previous ensemble-average techniques.

Acknowledgments

This research was partially funded by the Canadian International Development Agency Project Tier II-394-TT02-00 and by the Flemish VLIR-UOS Program for Institutional University Cooperation (IUC).

References

[1] G. Fant, Acoustic Theory of Speech Production, Mouton, The Hague, The Netherlands, 1960.

[2] I. R. Titze, Workshop on Acoustic Voice Analysis: Summary Statement, National Center for Voice and Speech, 1994.

[3] P. J. Murphy, "Perturbation-free measurement of the harmonics-to-noise ratio in voice signals using pitch synchronous harmonic analysis," Journal of the Acoustical Society of America, vol. 105, no. 5, pp. 2866–2881, 1999.

[4] E. H. Buder, "Acoustic analysis of vocal quality: a tabulation of algorithms 1902–1990," in Voice Quality Measurement, R. D. Kent and M. J. Ball, Eds., pp. 119–244, Singular, San Diego, Calif, USA, 2000.

[5] J. Hillenbrand, "A methodological study of perturbation and additive noise in synthetically generated voice signals," Journal of Speech and Hearing Research, vol. 30, no. 4, pp. 448–461, 1987.

[6] J. Schoentgen, "Spectral models of additive and modulation noise in speech and phonatory excitation signals," Journal of the Acoustical Society of America, vol. 113, no. 1, pp. 553–562, 2003.

[7] J. Hillenbrand, R. A. Cleveland, and R. L. Erickson, "Acoustic correlates of breathy vocal quality," Journal of Speech and Hearing Research, vol. 37, no. 4, pp. 769–778, 1994.

[8] Y. Qi and R. E. Hillman, "Temporal and spectral estimations of harmonics-to-noise ratio in human voice signals," Journal of the Acoustical Society of America, vol. 102, no. 1, pp. 537–543, 1997.

[9] E. Yumoto, W. J. Gould, and T. Baer, "The harmonic-to-noise ratio as an index of the degree of hoarseness," Journal of the Acoustical Society of America, vol. 71, pp. 1544–1550, 1982.

[10] H. Kasuya, S. Ogawa, K. Mashima, and S. Ebihara, "Normalized noise energy as an acoustic measure to evaluate pathologic voice," Journal of the Acoustical Society of America, vol. 80, no. 5, pp. 1329–1334, 1986.

[11] J. Schoentgen, M. Bensaid, and F. Bucella, "Multivariate statistical analysis of flat vowel spectra with a view to characterizing dysphonic voices," Journal of Speech, Language, and Hearing Research, vol. 43, no. 6, pp. 1493–1508, 2000.

[12] C. Ferrer, E. Gonzalez, and M. E. Hernandez-Diaz, "Correcting the use of ensemble averages in the calculation of harmonics to noise ratios in voice signals," Journal of the Acoustical Society of America, vol. 118, no. 2, pp. 605–607, 2005.

[13] Y. Qi, "Time normalization in voice analysis," Journal of the Acoustical Society of America, vol. 92, no. 5, pp. 2569–2576, 1992.

[14] Y. Qi, B. Weinberg, N. Bi, and W. J. Hess, "Minimizing the effect of period determination on the computation of amplitude perturbation in voice," Journal of the Acoustical Society of America, vol. 97, no. 4, pp. 2525–2532, 1995.

[15] J. C. Lucero and L. L. Koenig, "Time normalization of voice signals using functional data analysis," Journal of the Acoustical Society of America, vol. 108, no. 4, pp. 1408–1420, 2000.

[16] N. B. Cox, M. R. Ito, and M. D. Morrison, "Data labeling and sampling effects in harmonics-to-noise ratios," Journal of the Acoustical Society of America, vol. 85, no. 5, pp. 2165–2178, 1989.

[17] P. J. Murphy, K. G. McGuigan, M. Walsh, and M. Colreavy, "Investigation of a glottal related harmonics-to-noise ratio and spectral tilt as indicators of glottal noise in synthesized and human voice signals," Journal of the Acoustical Society of America, vol. 123, no. 3, pp. 1642–1652, 2008.

[18] R. E. Hillman, E. Oesterle, and L. L. Feth, "Characteristics of the glottal turbulent noise source," Journal of the Acoustical Society of America, vol. 74, no. 3, pp. 691–694, 1983.

[19] Y. Medan, E. Yair, and D. Chazan, "Super resolution pitch determination of speech signals," IEEE Transactions on Signal Processing, vol. 39, no. 1, pp. 40–48, 1991.

[20] P. Milenkovic, "Least mean square measures of voice perturbation," Journal of Speech and Hearing Research, vol. 30, no. 4, pp. 529–538, 1987.


[21] C. Ferrer, E. Gonzalez, and M. E. Hernandez-Diaz, "Using waveform matching techniques in the measurement of shimmer in voice signals," in Proceedings of the 8th Annual Conference of the International Speech Communication Association (INTERSPEECH '07), pp. 1214–1217, Antwerp, Belgium, August 2007.

[22] V. Parsa and D. G. Jamieson, "A comparison of high precision Fo extraction algorithms for sustained vowels," Journal of Speech, Language, and Hearing Research, vol. 42, pp. 112–126, 1999.

[23] A. E. Rosemberg, "Effect of glottal pulse shape on the quality of natural vowels," Journal of the Acoustical Society of America, vol. 49, no. 2B, pp. 583–590, 1971.

[24] I. R. Titze and H. Liang, "Comparison of Fo extraction methods for high-precision voice perturbation measurements," Journal of Speech, Language, and Hearing Research, vol. 36, pp. 1120–1133, 1993.


Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2009, Article ID 173967, 19 pages
doi:10.1155/2009/173967

Research Article

On the Use of the Correlation between Acoustic Descriptors for the Normal/Pathological Voices Discrimination

Thomas Dubuisson,1 Thierry Dutoit,1 Bernard Gosselin,1 and Marc Remacle2

1 TCTS Lab, Faculté Polytechnique de Mons, 31 Boulevard Dolez, 7000 Mons, Belgium
2 ORL-ORLO Lab, Université Catholique de Louvain, Avenue Dr Therasse, 8, 5530 Yvoir, Belgium

Correspondence should be addressed to Thomas Dubuisson, [email protected]

Received 27 October 2008; Revised 25 February 2009; Accepted 23 April 2009

Recommended by Juan I. Godino-Llorente

This paper presents an analysis system aiming at discriminating between normal and pathological voices. Compared to the literature of voice pathology assessment, it is characterised by two aspects. First, the system is based on features inspired from voice pathology assessment and music information retrieval. Second, the distinction between normal and pathological voices is simply based on the correlation between acoustic features, while more complex classifiers are common in the literature. Based on the normal and pathological samples included in the MEEI database, it has been found that using two features (spectral decrease and first spectral tristimulus in the Bark scale) and their correlation leads to correct classification rates of 94.7% for pathological voices and 89.5% for normal ones. The system also outputs a normal/pathological factor aiming at giving an indication to the clinician about the location of a subject with respect to the database.

Copyright © 2009 Thomas Dubuisson et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

The acoustic evaluation of voice quality is an important tool for the assessment of pathological voices. This assessment may be performed following two different approaches: the perceptive judgement and the objective assessment. On the one hand, the perceptive judgement is used in the clinical domain and consists in qualifying and quantifying the voice pathologies by listening to the production of a patient. This evaluation is performed by trained professionals who rate the speech samples on a grade scale (GRBAS scale [1]) according to their perception of the voice disorder. This subjective evaluation suffers from the drawbacks of being highly dependent on the experience of the listener and of its inconsistency in judging pathological voice quality. On the other hand, the objective analysis consists in qualifying and quantifying the voice pathologies by acoustical, aerodynamic, and physiological measurements. It offers the advantages of being quantitative, cheaper, faster, and more comfortable for the patient than methods like electroglottography (EGG) [2] or the imaging of the vocal folds (by stroboscopy [3] or, more recently, by high-speed camera [4]).

Many methods of acoustic evaluation of pathological voices have been proposed in the literature. Among them, an important part consists of computing acoustic descriptors, using them in a classifier, and computing a classification performance from the outputs of this system. Interesting results are obtained, but two drawbacks can be highlighted.

(i) Using a classifier like Neural Networks amounts to a kind of "black box": it is difficult to identify what really happens inside it and, in the case of a transformation of the input space, which feature or set of features discriminates well between the normal and pathological samples.

(ii) The features used in the literature are often linked to the clinical evaluation, while other acoustic features have been proposed in other domains of sound processing. Moreover, features like jitter or the Harmonic-to-Noise Ratio (HNR) are based on the evaluation of the fundamental period, which can be subject to controversy in speech analysis, even more in the case of pathological voices [5].

For these reasons, this paper presents an analysis system using only the information from the correlation between acoustic descriptors in order to discriminate between normal and pathological voices.


Figure 1: Structure of the analysis system (blocks: speech signal, features extraction, correlation computation, features selection, normal/pathological factor).

These features come from both the clinical and sound analysis domains and have in common that none is based on the value of the fundamental frequency.

The analysis system is composed of three parts (see Figure 1).

(1) Feature extraction: this part consists in cutting the signal into frames, windowing them, and computing descriptors (from both normal/pathological voice assessment and music analysis) in the temporal and spectral domains. Only the values of descriptors corresponding to voiced parts of speech are considered. This part is described in Section 5.

(2) Correlation computation: the correlation between descriptors in voiced parts of speech is computed and stored into a matrix (a minimal sketch of this step is given after this list). This part is described in Section 6.

(3) Exploitation of the correlation: the elements of the correlation matrix are manipulated in order to discriminate between normal and pathological samples and to compute a final descriptor (normal/pathological factor). These manipulations are the result of a statistical study described in Section 7.
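A minimal sketch of the correlation step (Python). The feature names in the usage comment are only illustrative of the descriptors mentioned in the abstract, and the selection and decision logic of Sections 6 and 7 is not reproduced here.

```python
import numpy as np

def correlation_matrix(feature_frames):
    """Pearson correlation between the trajectories of acoustic descriptors over
    the voiced frames of a recording. feature_frames: (n_features, n_voiced_frames)
    array, one descriptor per row. Sketch only; names are ours."""
    F = np.asarray(feature_frames, float)
    return np.corrcoef(F)          # (n_features, n_features), entries in [-1, 1]

# e.g. corr = correlation_matrix(np.vstack([spectral_decrease, tristimulus_1]))
# where spectral_decrease and tristimulus_1 are hypothetical frame-wise arrays;
# corr[0, 1] would then be the kind of entry exploited for the decision
```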

In a nutshell, the aims of this study are as follows.

(1) Giving an overview of the classic features and classifiers in normal/pathological voices discrimination. These two aspects are described in Sections 2 and 3, respectively.

(2) Proposing features coming from other domains of sound analysis.

(3) Showing that simply using the correlation between features not based on the fundamental frequency, instead of a classic classifier, allows discriminating well between normal and pathological samples, extracted from the database described in Section 4.

2. Classic Features in Pathological Voice Assessment

The subject of this section is an overview of the classic features involved in pathological voice assessment. It is obviously not possible to include all the descriptors found in the literature; only the most common are presented.

2.1. Fundamental Frequency. When working in the speech processing domain, an obvious feature for researchers is the fundamental period and its spectral equivalent, the fundamental frequency. This parameter is used in most of the studies, sometimes in conjunction with the Mel-Frequency Cepstral Coefficients (MFCC).

2.2. Mel-Frequency Cepstral Coefficients. MFCCs are one of the most widely used ways to represent the speech signal in domains like recognition or coding [6]. These coefficients are computed by weighting the Fourier Transform of the signal by a MEL filterbank (perceptive scale), then computing the cepstrum from this weighted spectrum, and finally taking the Discrete Cosine Transform (DCT) of this cepstrum.
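As an illustration, MFCCs and their first derivative can be obtained with an off-the-shelf toolbox such as librosa; the file name below is hypothetical, and the frame length and hop size are left at the library defaults, which are not necessarily those of the cited studies.

```python
import numpy as np
import librosa

# load a voice sample at its native sampling rate (file name is hypothetical)
y, sr = librosa.load("voice_sample.wav", sr=None)

# 16 MFCCs from a 24-band Mel filterbank, a setup used in some studies cited
# below, plus their first derivative
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=16, n_mels=24)
delta = librosa.feature.delta(mfcc)
features = np.vstack([mfcc, delta])    # (32, n_frames) frame-wise feature matrix
```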

Using the MFCC parameters provides three advantages.

(i) The human perception is taken into account by considering a perceptive scale of frequencies.

(ii) The MFCC parameters are uncorrelated thanks to the DCT operation. This may be an advantage if these parameters are used directly as input of a classification system. In this case, each parameter brings its own information, without link to the other ones.

(iii) The spectral envelope is summarized into a limited set of parameters.

As MFCC coefficients are widely used in speech processing, some studies aim at adapting techniques of Automatic Speaker Recognition (ASR) to pathological voice assessment. In [7, 8] the aim is to train a GMM classifier (see Section 3.1) able to determine the grade corresponding to a particular voice sample; 16 MFCC coefficients and their first derivative are computed by using a 24-band MEL filterbank. In [9–11], 12 MFCC coefficients, with their first and second derivatives, and the fundamental period are the inputs of an HMM classifier (see Section 3.2) trained in order to discriminate between normal and pathological samples. The distinction between these two classes is also the subject of [12], in which MFCC coefficients are used, among others, as inputs of an SVM classifier (see Section 3.3).

2.3. Acoustic Features from MDVP Software. The Multi-Dimensional Voice Program (MDVP) is a software produced by KayPentax Corp. [13]. When assessing the production of a subject, this system computes acoustic descriptors related to the perturbation of the fundamental frequency (or period) and to the amplitude of the signal (the whole definition of these descriptors is given in [14]).

As these features are considered "classic" in the domain of speech pathology assessment, some authors use them in their classification systems.


Among these studies, some use the acoustic descriptors computed directly by the MDVP software [10, 15], which is facilitated by the fact that these features are stored with the speech samples in the MEEI database (see Section 4). However, one can object that there is little control over the computation of these acoustic descriptors. Some other studies use features inspired from those computed by the MDVP software, meaning that their definitions are taken or inspired from [14]. For example, [16, 17] present a classification system for normal/pathological discrimination after transmission through a telephone channel, in which the input features are, among others, the perturbation of the fundamental period and amplitude as defined in the MDVP software.

2.4. Fundamental Frequency and Amplitude Perturbations (Jitter and Shimmer). Perhaps among the most famous acoustic descriptors in speech pathology assessment, jitter and shimmer are defined as the variation of the duration and of the amplitude of the glottal cycle during the production of a sustained vowel.

MDVP software includes different ways of calculating jitter and shimmer, all of them being based on a classic estimation of the fundamental frequency. Among these different implementations, the Perturbation Quotients of fundamental frequency and amplitude are especially used as measures of jitter and shimmer in a majority of studies [17–20]. Another descriptor, derived from the Perturbation Quotient and showing more correlation with pathology, is proposed in [21].

Most of the methods of jitter and shimmer computation rely on the assumption that periodicity exists in speech, which may be questionable in presence of pathology. That is why some methods propose alternative ways of computing jitter and shimmer to those based on the estimation of the fundamental period. In [22] the salience of a sample (defined as the longest interval over which this sample is maximum) is used to derive a duration quite close to the definition of the glottal cycle length; jitter and shimmer can easily be derived once this duration is available. One other interesting method relies on modelling the power spectrum of speech as a harmonic part influenced by the jitter and a subharmonic one appearing because of jitter [5]; jitter can be estimated by observing the behaviour of these two parts. Concerning shimmer, the study proposed in [23] uses the waveform matching technique [24] to compute it. This study also proposes an interesting review of the different acceptations of the term amplitude in the definition of shimmer.
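For reference, the sketch below computes the plain cycle-to-cycle ("local") jitter and shimmer from sequences of per-cycle durations and peak amplitudes; it is not MDVP's exact Perturbation Quotient, and the function name is ours.

```python
import numpy as np

def local_jitter_shimmer(periods, amplitudes):
    """Cycle-to-cycle (local) jitter and shimmer in percent, from per-cycle
    durations and peak amplitudes of a sustained vowel. Sketch only."""
    T = np.asarray(periods, float)
    A = np.asarray(amplitudes, float)
    jitter_pct = 100.0 * np.mean(np.abs(np.diff(T))) / np.mean(T)
    shimmer_pct = 100.0 * np.mean(np.abs(np.diff(A))) / np.mean(A)
    return jitter_pct, shimmer_pct
```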

2.5. Spectral Balances. It is considered that the location of energy in the spectral domain may be discriminant between the two populations. That is why descriptors are computed in limited frequency regions. Among these, the spectral balance is defined as the ratio between the spectral energies in two frequency bands.

Apart from HNR, de Krom defines in [25] four frequency bands ([60–400 Hz], [400–2000 Hz], [2000–5000 Hz], [5000–8000 Hz]) between which spectral balances are computed. The method exposed in [19] extends these frequency bands with the [8000–11025 Hz] band. Spectral

balances between all possible pairs of bands and the whole spectrum are also considered. Other frequency bands (below 1 kHz, above 1 kHz, below 2 kHz, above 2 kHz) are proposed in [26] and are involved in the computation of two spectral tilt parameters, tuned to indicate the amount of noise without the influence of jitter and shimmer.
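A minimal sketch of a spectral balance measure for one analysis frame, using by default the [60–400 Hz] and [400–2000 Hz] bands of de Krom [25]; expressing the ratio in dB is our own convention.

```python
import numpy as np

def spectral_balance(frame, fs, band_a=(60, 400), band_b=(400, 2000)):
    """Ratio (in dB) of the spectral energies in two frequency bands,
    computed from the windowed magnitude spectrum of one frame. Sketch only."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    e_a = spec[(freqs >= band_a[0]) & (freqs < band_a[1])].sum()
    e_b = spec[(freqs >= band_b[0]) & (freqs < band_b[1])].sum()
    return 10.0 * np.log10(e_a / e_b)
```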

2.6. Harmonics and Formant Level Variation. Considering the levels in harmonic and formant regions is popular in speech pathology assessment because they reflect the presence of perturbations on the speech signal (jitter, shimmer) and glottal source characteristics (spectral tilt, open quotient).

Concerning the harmonics, the first and second harmonics are most of the time considered. In [25, 27] the difference of amplitude between these harmonics and the relative level of the first one with regard to the level in the [400–2000 Hz] band are measured. The study [26] uses this measure in addition to the measure of the first harmonic level. Finally, the authors of [19] choose to compute the level of the first harmonic and the relative level between the two first harmonics in the cepstral domain.

Concerning the formants, the level differences between the first harmonic and the strongest one in the first and third formant regions are used in [26], and the level difference between the first and third formants is considered in [27]. The energy levels in the regions of the first, second, and third formants are also selected in [28].

It must however be noticed that the measurement of these levels strongly relies on fundamental frequency estimation (which may be problematic in pathological cases) and formant detection.

2.7. Noise Features. As pathological speech is often perceived as noisy, researchers have been interested in measuring the harmonic and noise components of speech. This kind of measure is moreover part of the tools used in the clinical domain (e.g., the MDVP software).

Some noise measures proposed in the literature can be highlighted:

(i) Harmonic to Noise Ratio (HNR): defined in [29] as the log ratio of the energies of the periodic and aperiodic components, different methods of HNR computation have been proposed in the literature (a comparison between them is given in [29] in the context of voice quality analysis). Some methods are based on a model in which speech is assumed to be composed of a periodic component and an aperiodic component [25, 30] (notably by computing the cepstrum of the speech signal), while others use the short-time autocorrelation function [31]. They all share the fact that they are based on the estimation of fundamental frequency.

When looking at studies of pathological speech assessment, HNR arises as a popular measure. It is sometimes computed for the whole frequency range [15] or, more frequently, in particular frequency bands, because of the assumption that noise energy is located in different frequency regions in normal and pathological phonations. Indeed, HNR in four frequency bands is used in [32] together with the spectral energy in critical bands for the discrimination between normal and pathological samples in the MEEI Database. The same measure is used in [33] to show that an HMM is able to classify different voice qualities, and in [16, 17] to discriminate normal and pathological samples after transmission through a telephone channel. Speech samples for both methods are also extracted from the MEEI Database.

(ii) Normalized Noise Energy (NNE): this measure is proposed in [34] and aims at quantifying the energy of the noise component from the spectrum of speech. In the computation of NNE, the noise energy between the harmonics is obtained directly from the spectrum, while inside the harmonics the noise energy is assumed to be the mean value of the adjacent minima of the spectrum. The authors point out that there may be an estimation problem when the harmonics are broadened (in case of jitter). This measure is used in [15] for the discrimination between normal and pathological samples in the MEEI database.

(iii) Glottal to Noise Excitation (GNE) ratio: this measure is proposed in [35] and aims at quantifying the amount of voice excitation by vocal-fold oscillations versus excitation by turbulent noise. This descriptor is computed as the maximum correlation between the Hilbert envelopes of the inverse-filtered speech wave in different frequency bands.

GNE is compared to HNR and NNE in [35], in which the authors show its relevance for the measure of noise energy, even in the presence of strong jitter and shimmer (in the case of synthetic speech signals). This work is continued in [18], in which GNE is compared to other features (HNR, NNE, jitter, and shimmer from the MDVP software) in the field of voice quality assessment (in the case of real speech signals). As for HNR and NNE, GNE is computed for different frequency bands. It turns out from this study that, in pathological speech, GNE (measuring the additive noise due to air passing through the glottis in case of incomplete closure) and jitter and shimmer (measuring the irregularity of vocal fold vibration) describe different voice aspects that often appear for this kind of voice.

GNE has also been shown to exhibit significant differences between subjects with normal and pathological phonation ([15] for various pathologies, [27] for the particular case of cancer).

3. Classifiers Used in Normal/Pathological Voices Discrimination

This section aims at describing the different kinds of classifiers used in voice pathology assessment. Their structure and behaviour are briefly presented, with a focus on the way they are adapted to this particular problem. This section is complementary to Section 2, since the features used as inputs are described in the latter.

3.1. Gaussian Mixture Model. Gaussian Mixture Modelling (GMM) is widely used in Automatic Speaker Recognition, where it acts as a supervised classification system able to separate speech samples into classes (two for speaker verification and n for speaker identification).

In [7] GMM is adapted from speaker identification so that a class no longer corresponds to a given speaker but to one of the grades of the GRBAS scale (from 0 (normal) to 3). Each class is thus learnt using samples whose pathology is associated with this grade. The normal and pathological corpora are part of a database developed by LAPEC (Hopital de la Timonne, Marseille, France) and consist of 20 normal samples and 60 pathological samples whose grade has been assessed by experts. The building of the classification system consists of three phases.

(i) Parametrization: MFCC coefficients and their first derivatives are extracted from speech.

(ii) Model training: a generic GMM is estimated on a normal corpus and class GMMs are derived from the generic one by adapting the means of all the Gaussians (MAP technique). In the case of normal/pathological discrimination, a normal and a pathological GMM are adapted from the generic GMM, while in the case of grade classification, each grade is represented by a GMM adapted from the generic GMM.

(iii) Classification: when a speech sample has to be classified, the likelihood between this sample and each GMM is estimated and the decision relies on the maximum of these likelihoods (a simplified sketch of this scheme is given below).
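The following is a minimal sketch of such a GMM-based decision, assuming scikit-learn and NumPy are available. For simplicity it trains one GMM per class directly rather than MAP-adapting class models from a generic GMM as in [7]; the function names and the diagonal-covariance, 16-component configuration are assumptions of this sketch, not details of the cited work.

```python
# Sketch of GMM-based normal/pathological classification (not the exact
# system of [7]): one GMM per class, decision by maximum average likelihood.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_class_gmm(feature_matrices, n_components=16):
    """Fit one GMM on the stacked MFCC frames of all recordings of a class."""
    X = np.vstack(feature_matrices)            # (total_frames, n_coeffs)
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
    return gmm.fit(X)

def classify(frames, gmm_normal, gmm_patho):
    """Average frame log-likelihood under each model; pick the larger one."""
    ll_norm = gmm_normal.score_samples(frames).mean()
    ll_path = gmm_patho.score_samples(frames).mean()
    return "normal" if ll_norm > ll_path else "pathological"
```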

For the normal/pathological classification, 95% of normal subjects and 81.7% of pathological ones are correctly classified. For the grade classification, the same performance is obtained for grade 0 (corresponding to the normal subjects), while a loss of performance is observed for the pathological grades, especially between adjacent grades. Although the results are judged promising by the authors, no particular attention is paid to the choice of acoustic parameters. The same system is used in [8] to determine which kind of information is better suited to the classification of the four grades. Three levels are considered.

(i) Energy: only the information extracted from non-silence frames is considered.

(ii) Phonetic: the information is extracted from frames after automatic phonetic alignment.

(iii) Voiced: only the information extracted from voiced segments is considered.

Whatever the level of information, the same performance for grade 0 as in [7] is obtained. For the other grades, it turns out that the information extracted at the phonetic level provides the best overall classification result (71% for the same database as in [7]). The authors pursue their work in [36], where the same system as in [7, 8] is applied, this time to parameters extracted from a division of the [0–8000 Hz] frequency range into subbands. It turns out that the [0–3000 Hz] band seems to be more informative (in terms of discrimination between the four grades) than the full frequency range. Interesting results of this study are the evidence that (1) attention to the acoustic features is important for the classification and (2) the whole frequency range may not be as discriminative as a subband for separating normal and pathological voices.


GMMs are also used in [37] in order to discriminate between 41 normal samples and 111 pathological samples collected in a room of the ENT department of a hospital. This time, the features come from the MDVP analysis (see Section 2.3) and consist of Jitt, RAP, Shim, APQ, HNR, and SPI. The methodology for building the GMM is slightly different from, for example, [7], because this time one GMM models all the pathological samples and the varying dimension is the number of Gaussians involved in the mixture. For the optimal number of Gaussians, the correct classification rate is 92.9% for the normal data and 98.6% for the pathological ones. As in [7], the authors point out that it would be interesting to pay more attention to the choice of the acoustic features.

3.2. Hidden Markov Model. Hidden Markov Models (HMMs) are well known in speech processing, notably in speech recognition and more recently in speech synthesis.

In [9] HMMs are trained on 12 MFCC coefficients, their first and second derivatives, and the pitch, for the sustained vowels /a/ and the spoken utterances of the MEEI Database (see Section 4). The correct classification rate is 98.3% for the sustained vowels and 97.75% for the spoken utterances. The authors compare these performances to those obtained by using the results of the MDVP analysis on the same corpus; using these features provides a lower classification rate for both kinds of production. This work is continued in [10], in which four degrees of a particular pathology (A-P Squeezing) are classified using an HMM structured as in [9]. It turns out that the correct classification rate is higher when the degree of pathology is severe, and that using an HMM with the same features as in [9] provides a better classification rate than using the features from the MDVP analysis. The classification of different pathologies is pursued in [11], where the authors aim at discriminating five pathologies (A-P Squeezing, hyperfunction, ventricular compression, paralysis, gastric reflux) in the pathological corpus of vowels /a/ from the MEEI Database. The same features as in the two papers above are used as input of the HMM. In this case, five HMMs are trained, each one corresponding to one pathology versus the others. When a new sample is presented to the classification system, the assigned pathology is the one corresponding to the HMM that outputs the maximum score. The average correct classification rate is 71%. Although the results of discrimination between pathologies are encouraging, the authors point out that extending this work to other pathologies would be difficult because of the sparseness of data in the MEEI database. In terms of features, these three papers show that (1) spectral envelope features and pitch tend to be more reliable than the features estimated by the MDVP analysis and (2) HMM-based classification provides good results both for the discrimination between normal and pathological voices and for the discrimination between different kinds of pathologies.

3.3. Support Vector Machines. Support Vector Machines (SVMs) [38] are a well-known classifier used in problems of classification, regression, and novelty detection.

This classifier has been used for the discrimination between normal and pathological samples. For example, [12] proposes a set of features consisting of 11 MFCC coefficients, HNR, NNE, GNE, Energy, and their first derivatives. The classifier is trained on the vowels /a/ from the MEEI Database (53 normal samples and 77 pathological samples) and the average correct classification rate is 95.12%. The authors point out that the cepstral and the noise features complement each other well and that the results are better than those of an MLP classifier with the same inputs. This kind of classifier is also used in [39], in which the input features are extracted from the wavelet transform of 30 normal and 60 pathological speech utterances (from a database designed at the Republic Center of Hearing, Voice and Speech Pathologies, Minsk, Belarus). The correct classification rate is in this case 97.5% for normal voices and 100% for pathological ones.

3.4. Linear Discriminant. Among the simplest classification systems, the Linear Discriminant (LD) classifier aims at partitioning the feature space under the hypothesis of a Gaussian distribution of the features of each class. Under these assumptions, the decision boundaries are linear. When a new sample is presented as input to this classifier, its assigned class is the one for which the classifier outputs the highest probability.

The remote diagnosis tool presented in [16] uses as inputs features inspired by the MDVP analysis (pitch and amplitude perturbations, HNR in different frequency bands) to discriminate normal and pathological samples when speech is transmitted through the telephone channel. The database consists of all the samples of the MEEI database after transmission through this channel. The authors use an LD classifier and obtain a correct classification rate of 89.10% for the original database and 74.15% for the telephone database. Although they point out that the results are promising, they admit that more samples are needed to increase the performance of the system and that the difficulty of accurately tracking the pitch in speech could severely limit the discriminant ability of pitch perturbation measures.

This work is pursued in [17] by using pitch and amplitude perturbation features to classify the pathological samples of the telephone database of [16] into different kinds of pathologies (neuromuscular, physical, and mixed). The LD classifier provides a correct classification rate of 87.27% for the neuromuscular corpus, 77.97% for the physical corpus, and 61.08% for the mixed corpus. It also turns out that, in the case of the database transmitted through the telephone channel, HNR measures are not as relevant as others to discriminate between normal and pathological groups and between the different groups of pathologies.

3.5. K-Nearest Neighbours. The K-Nearest Neighbours (KNN) classifier [38] is a system aiming at clustering a feature space into as many regions as classes, these regions being delimited by piecewise linear planes.

In [32] a modification of this classifier is used in order to classify 53 normal and 163 pathological samples extracted from the MEEI database. In this system, a new sample is not compared to its K nearest neighbours but to a vector that represents the mean of all the feature vectors belonging to a class. The class assigned to the new sample is thus the one corresponding to the mean vector closest to it. As the method exposed in [32] considers HNR in 4 frequency bands and spectral energy in 21 critical bands as inputs, these two kinds of features are used to design classifiers between normal and pathological samples. For the first set of features the obtained accuracy is 94.28%, and for the second one 92.38%. Although the dimension of the first set of features is smaller and the obtained accuracy higher than for the second set, the authors point out that fundamental frequency is difficult to estimate for pathological voices, leading to erroneous estimation of the harmonic and noise components. They also highlight that the computational load of HNR is higher than that of spectral energy.

3.6. Neural Networks. Artificial Neural Networks are widely used classifiers in various domains, such as pattern classification and recognition or speech recognition. Basically this type of classifier can be viewed as an interconnection of simple small units, the neurons, designed to model to some extent the behaviour of the human brain.

In [15] an MLP classifier is designed to discriminate between normal and pathological samples in the MEEI database. The network consists of a 26-neuron input layer (26 acoustic descriptors computed by the MDVP software and stored in the database with the speech samples), one hidden layer, and a 1-neuron output layer (normal or pathological). The average correct classification rate is 94% when HNR, VTI, and ShdB are used as input features. The authors of [19] are also interested in the discrimination between normal and pathological samples in a database of 5 Spanish sustained vowels (100 normal samples and 68 pathological samples). Each vowel is treated by a neural network which takes as input classic parameters and others extracted from the bicoherence. The decisions from the 5 networks are then combined to decide whether the input sample is healthy or not. The correct classification rate is 94.4% for the classic parameters and increases by 4% when the other ones are added. A similar study is conducted in [20], in which the same classifier as in [19] is applied to two sets of features extracted from the database presented in the same paper. The two sets of features consist of classic parameters and of classic parameters plus nonlinear features inspired by dynamic system theory (the correlation dimension and the largest Lyapunov exponent). The authors show that using the latter configuration of features leads to a correct classification rate of 93%.

4. Database

In the domain of pathological voice assessment, a widely used database is the MEEI Disordered Voice Database, produced by KayPentax Corp. [14]. It has been chosen because a number of studies [10–12, 15–17, 32, 33] use it, which allows comparison with other methods, and because it already provides a distinction between normal and pathological samples.

Figure 2: Details of the first part of the analysis system.

The database contains sustained vowels and read text samples (12-second readings of the "Rainbow Passage"), from 53 subjects with normal voice and 657 subjects with a large panel of pathologies. The recordings are linked to information about the subjects (age, gender, smoker or not) and to the results of the analysis by the MDVP software. The sampling frequency of the recordings is 25 kHz or 50 kHz, with only 25 kHz for the normal voices.

In this study, only the sustained vowels of the MEEI Database are considered. This group is split into a training and a test set, representing respectively 65% and 35% of the whole database of sustained vowels.

(i) Training set: this set contains normal and pathological samples. The normal ones consist of 34 samples randomly chosen among the 53 normal samples of the database. The pathological ones consist of 427 samples randomly chosen among the 657 pathological samples of the database.

(ii) Test set: this set contains the normal and pathological samples that are not part of the training set. It thus consists of 19 normal samples and 230 pathological ones.

The training set is used to find which information is the most discriminant between normal and pathological samples, and the test set is used to assess the classification performance of this information.

In order to limit the computational load and to avoid any effect of the sampling frequency on the discrimination between the two groups, all the voices are resampled at 16 kHz and quantized on 16 bits.

5. Feature Extraction

The first part of our analysis system consists in extracting features from the speech signal (see Figure 2). This section thus describes first the practical conditions of feature extraction, then the reasons for selecting features from the ones presented in Section 2 and from music sound analysis. The mathematical formulation of the selected descriptors and the implementation of the voicing decision are finally presented.

5.1. Practical Conditions of Feature Extraction. The computation of the descriptors is preceded by several steps aiming at preparing the speech signal.

As explained in Section 4, the speech samples studied here (sustained vowels /a/) are part of the MEEI database. Their sampling frequencies vary (25 or 50 kHz), so it was chosen to resample them directly to 16 kHz.

In order to be as independent as possible of the recording conditions (e.g., tuning of the recording system), the speech signal is normalized according to (1), where x(n) stands for the samples of the speech signal and N for its length:

\[
x(n) = \frac{x(n) - \dfrac{1}{N}\sum_{i=1}^{N} x(i)}{\sqrt{\dfrac{1}{N}\sum_{i=1}^{N}\left(x(i) - \dfrac{1}{N}\sum_{i=1}^{N} x(i)\right)^{2}}}. \tag{1}
\]

The descriptors considered in this paper are all local descriptors and thus have to be computed over short time periods. Besides, in order to keep a good time resolution for the extraction of information, these time periods must overlap. As 30 milliseconds and 10 milliseconds are common values in the speech processing community for, respectively, the duration of and the delay between consecutive time periods, these two values are used in the analysis system. Each time period is also weighted by a window function (here a Hanning window) in order to avoid strong discontinuities at its boundaries.
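A minimal sketch of this pre-processing, assuming NumPy and a 16 kHz signal stored in a 1-D array; the function names are illustrative only.

```python
# Sketch of the pre-processing: normalization of (1) followed by
# 30 ms Hanning-windowed frames with a 10 ms hop (16 kHz assumed).
import numpy as np

def normalize(x):
    return (x - x.mean()) / x.std()

def frame_signal(x, fs=16000, frame_ms=30, hop_ms=10):
    frame_len = int(fs * frame_ms / 1000)   # 480 samples at 16 kHz
    hop = int(fs * hop_ms / 1000)           # 160 samples at 16 kHz
    win = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop:i * hop + frame_len] * win
                     for i in range(n_frames)])   # (n_frames, frame_len)

frames = frame_signal(normalize(np.random.randn(16000)))
```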

5.2. Selecting Descriptors from Pathological Voice Assessment. Acoustic descriptors widely used for the discrimination between normal and pathological samples have been presented in Section 2. As already said in the introduction and stated in [5], the estimation of fundamental frequency may be doubtful in speech (especially in pathological speech). That is why it has been chosen not to consider descriptors relying on this measure (e.g., HNR, jitter, shimmer, harmonic levels). Besides, the results of some classification methods such as [36] suggest that dividing the frequency range into frequency bands may be more informative than considering the whole frequency range. For this reason, among the features of the normal/pathological voice discrimination literature, those related to spectral balances are retained in our system. These features are defined in Section 5.5.

5.3. Selecting Descriptors from Other Domains of Sound Analysis. The speech signal is itself a particular example of sound. Apart from speech processing, many other domains are dedicated to the extraction of information from sound. Among those, Music Information Retrieval (MIR) aims at extracting information from music in order to build systems that classify music, for example by artist. As this extraction is based on acoustic descriptors, it is interesting to highlight some of them that could be used in voice pathology assessment. These descriptors are part of the CUIDADO project [40], aiming at developing a new chain of applications through the use of audio/music content descriptors, and of the MPEG-7 standard [41] for multimedia content description. These features are not so far from speech processing: they are simply complementary to the standard features exposed in Section 2.

All the considered features coming from the MIR domain (except the tristimuli) are not based on the estimation of fundamental frequency; a modified definition of the tristimuli is proposed in order to keep this measure in the feature vector. All these features are defined in Sections 5.4 and 5.5.

5.4. Temporal Domain. The features describing the temporal behaviour of the speech signal are mathematically defined in this section. For the rest of the paper, x(n) stands for a frame of the speech signal and N for its length.

5.4.1. Temporal Energy. The energy of the frame (expressed in dB) is computed as

\[
E_T = 10 \times \log_{10} \sum_{n=1}^{N} \left(x(n)\right)^{2}. \tag{2}
\]

5.4.2. Temporal Mean. The mean value of the frame is computed as

\[
\mu_T = \frac{1}{N}\sum_{n=1}^{N} x(n). \tag{3}
\]

5.4.3. Temporal Standard Deviation. The standard deviation of the frame is computed as

\[
\sigma_T = \sqrt{\frac{1}{N}\sum_{n=1}^{N}\left(x(n) - \mu_T\right)^{2}}. \tag{4}
\]

5.4.4. Temporal Zero Crossing. The zero-crossing rate [40, 42] aims at quantifying the frequency at which the signal crosses the zero axis. This descriptor is notably used to indicate whether a speech fragment is voiced or not [43]. For a given frame, the number of sign changes between a sample and the previous one (from positive to negative or the opposite) is counted. To convert this value into Hz, it is divided by the interval of time over which it is computed, 30 milliseconds in the present case.
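A sketch of the four temporal descriptors for a single 30-millisecond frame, assuming NumPy; the small offset added before the logarithm (to avoid a log of zero on silent frames) is an addition of this sketch, not part of the definitions above.

```python
# Sketch of the temporal descriptors (2)-(4) and the zero-crossing rate
# for one 30 ms frame x (1-D NumPy array).
import numpy as np

def temporal_features(x, frame_duration=0.03):
    energy_db = 10 * np.log10(np.sum(x ** 2) + 1e-12)   # (2); offset is a guard
    mean = x.mean()                                      # (3)
    std = x.std()                                        # (4)
    sign_changes = np.sum(np.abs(np.diff(np.sign(x))) > 0)
    zcr_hz = sign_changes / frame_duration               # crossings per second
    return energy_db, mean, std, zcr_hz
```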

5.5. Spectral Domain. The features describing the spectral behaviour of the speech signal are mathematically defined in this section. For the rest of the paper, X(k), |X(k)|, k, and N_FFT stand, respectively, for the Discrete Fourier Transform of the sequence x(n), its modulus, its bin index, and the number of frequency bins on which it is computed. N_FFT is set to 1024 in this study.


5.5.1. Spectral Delta. The delta value aims at quantifying the range of the amplitude spectrum. It is defined as

\[
\mathrm{Delta}_S = \max_{k}\left(|X(k)|\right) - \min_{k}\left(|X(k)|\right). \tag{5}
\]

5.5.2. Spectral Mean Value. The mean value of the amplitude spectrum is defined as

\[
\mu_S = \frac{1}{N_{\mathrm{FFT}}}\sum_{k=1}^{N_{\mathrm{FFT}}} |X(k)|. \tag{6}
\]

5.5.3. Spectral Median Value. The median value of the amplitude spectrum is defined as the amplitude value that divides all the values into two groups of the same cardinality. The median is characterized by the fact that it is less influenced by extreme values than the mean.

5.5.4. Spectral Standard Deviation. The standard deviation of the amplitude spectrum is defined as

\[
\sigma_S = \sqrt{\frac{1}{N_{\mathrm{FFT}}}\sum_{k=1}^{N_{\mathrm{FFT}}}\left(|X(k)| - \mu_S\right)^{2}}. \tag{7}
\]

5.5.5. Spectral Center of Gravity. The spectral center of gravity (also known as the spectral centroid) is a very common feature in the MIR domain [42, 44–47]. Perceptively connected to the perception of brightness, it indicates where the "center of mass" of the spectrum is located, and it is known as an economical spectral descriptor giving an estimation of the main location of spectral energy. It is computed as

\[
\mathrm{COG} = \frac{\sum_{k=1}^{N_{\mathrm{FFT}}/2} k \times |X(k)|}{\sum_{k=1}^{N_{\mathrm{FFT}}/2} |X(k)|}. \tag{8}
\]

The amplitude corresponding to this frequency is also stored as a feature.

5.5.6. Spectral Moments. As the power spectrum of a signal can be considered as the distribution of its energy along frequency, this distribution can be described with descriptors from the theory of statistics. The spectral moments of the power spectrum [46] are well adapted to this description, and the first four moments [48] are considered in this study.

In order to compute them, the power spectral density and its total energy are computed as

\[
\mathrm{PSD}(k) = \frac{1}{N_{\mathrm{FFT}}}|X(k)|^{2}, \qquad
E_S = \sum_{k=1}^{N_{\mathrm{FFT}}} \mathrm{PSD}(k). \tag{9}
\]

Then the four moments are computed as follows.

(1) The first moment is equivalent to the spectral center of gravity, but computed this time on the PSD:

\[
M_1 = \frac{2}{E_S}\sum_{k=1}^{N_{\mathrm{FFT}}/2} k \times \mathrm{PSD}(k). \tag{10}
\]

(2) The second moment expresses the spread of the spectrum around its first moment:

\[
M_2 = \frac{2}{E_S}\sum_{k=1}^{N_{\mathrm{FFT}}/2}\left(k - M_1\right)^{2}\mathrm{PSD}(k). \tag{11}
\]

(3) The third moment is defined as

\[
M_3 = \frac{2}{E_S}\sum_{k=1}^{N_{\mathrm{FFT}}/2}\left(k - M_1\right)^{3}\mathrm{PSD}(k). \tag{12}
\]

In itself, the third moment is not stored as a feature; it is used to compute the skewness [49], which describes the orientation of the PSD around its first moment. If it is positive, the PSD leans towards the right; if it is negative, towards the left. The skewness is computed as

\[
\mathrm{Skewness} = \frac{M_3}{M_2^{3/2}}. \tag{13}
\]

(4) The fourth moment is defined as

\[
M_4 = \frac{2}{E_S}\sum_{k=1}^{N_{\mathrm{FFT}}/2}\left(k - M_1\right)^{4}\mathrm{PSD}(k). \tag{14}
\]

In itself, the fourth moment is not stored as a feature; it is used to compute the kurtosis [49], which describes the acuity of the PSD around its first moment. A Gaussian distribution has a kurtosis equal to 3; a distribution with a higher kurtosis is more peaked than a Gaussian one, while a distribution with a lower kurtosis is flatter. The kurtosis is computed as follows (a sketch covering the four spectral moments is given after this list):

\[
\mathrm{Kurtosis} = \frac{M_4}{M_2^{2}}. \tag{15}
\]

5.5.7. Spectral Decrease. The spectral decrease [40] aims at quantifying the amount of decrease of the amplitude spectrum. Coming from perceptive studies, it is supposed to be more correlated with human perception. This descriptor is computed as

\[
\mathrm{Decrease} = \frac{\sum_{k=2}^{N_{\mathrm{FFT}}/2}\left(|X(k)| - |X(1)|\right)/(k-1)}{\sum_{k=2}^{N_{\mathrm{FFT}}/2} |X(k)|}. \tag{16}
\]

5.5.8. Spectral Slope. The spectral slope [46, 50] is another representation of the amount of decrease of the amplitude spectrum. It is computed by linear regression of the amplitude spectrum.


Figure 3: Relation between Hz and MEL scales.

In this formulation, the amplitude spectrum is approximated as
\[
X(k) = S \times k + \mathrm{constant}, \tag{17}
\]
and the slope is computed as
\[
S = \frac{\left(N_{\mathrm{FFT}}/2\right)\sum_{k=1}^{N_{\mathrm{FFT}}/2} k\,|X(k)| \;-\; \sum_{k=1}^{N_{\mathrm{FFT}}/2} k \times \sum_{k=1}^{N_{\mathrm{FFT}}/2} |X(k)|}{\left[\sum_{k=1}^{N_{\mathrm{FFT}}/2} |X(k)|\right]\left[\left(N_{\mathrm{FFT}}/2\right)\sum_{k=1}^{N_{\mathrm{FFT}}/2} k^{2} - \left(\sum_{k=1}^{N_{\mathrm{FFT}}/2} k\right)^{2}\right]}. \tag{18}
\]

5.5.9. Spectral Roll-Off. The spectral roll-off [42, 46] is the frequency k_c below which 95% of the spectral energy is located. It is computed by solving the equality

\[
\sum_{k=1}^{k_c} |X(k)|^{2} = 0.95 \times \sum_{k=1}^{N_{\mathrm{FFT}}/2} |X(k)|^{2}. \tag{19}
\]

5.5.10. Perceptive Scales. As already said in Section 2.2, the perceptive behaviour of human hearing can be approximated by nonlinear frequency scales. Among those, one may cite the MEL scale and the Bark scale.

The MEL scale consists in a nonlinear division of the frequency range, guided by perceptive considerations. Proposed in [51], this perceptive scale of pitches is defined so that a constant variation on the MEL scale is perceived as a constant variation in pitch. One particular link between the two scales is that 1000 Hz corresponds to 1000 mels. The relation between the Hz scale and the MEL scale is presented in Figure 3 and obeys

\[
m = 2595 \times \log_{10}\left(1 + \frac{f}{700}\right). \tag{20}
\]

Based on this scale, a filterbank is designed, consisting of 24 triangular-shaped filters whose center frequencies are linearly distributed on the MEL scale and whose bandwidth increases with the center frequency.

The Bark scale [52] divides the frequency range into critical bands. This division is defined so that two sinusoids of the same amplitude located in the same critical band are perceived in the same way, while their perceived intensity differs if they are located in different bands. The relation between the Hz and Bark scales obeys

\[
\mathrm{Bark} = 13 \times \arctan\left(\frac{f}{1315.8}\right) + 3.5 \times \arctan\left(\frac{f}{7518}\right). \tag{21}
\]

Based on this scale, the critical bands are implemented using 24 rectangular filters whose center frequencies are linearly distributed on the Bark scale and whose bandwidth increases with the center frequency. The Bark scale is used to compute the loudness and derived measures [46] such as the sharpness and the spread (see Section 5.5.12).

5.5.11. Spectral Tristimuli. The tristimuli [46] are proposed in [53] as a timbre equivalent of the colour attributes of vision. They are defined as energy ratios between the fundamental frequency and its harmonics. As it was decided in this study to use descriptors not based on the estimation of fundamental frequency, the implementation of the tristimuli is modified by using frequency bands from the MEL or Bark scales instead of the harmonics. The three tristimuli are defined as in (22), in which k_Band[1] stands for the FFT bins corresponding to the frequency range defined by the first MEL or Bark frequency band, and k_Band[1,...,24] for the bins corresponding to the frequency range covered by the first to the 24th MEL or Bark frequency bands:

\[
T_1 = \frac{\sum_{k_{\mathrm{Band}[1]}} |X(k)|}{\sum_{k_{\mathrm{Band}[1,\ldots,24]}} |X(k)|}, \qquad
T_2 = \frac{\sum_{k_{\mathrm{Band}[2,3,4]}} |X(k)|}{\sum_{k_{\mathrm{Band}[1,\ldots,24]}} |X(k)|}, \qquad
T_3 = \frac{\sum_{k_{\mathrm{Band}[5,\ldots,24]}} |X(k)|}{\sum_{k_{\mathrm{Band}[1,\ldots,24]}} |X(k)|}. \tag{22}
\]

5.5.12. Spectral Loudness. The specific loudness [46] is the loudness associated with each Bark band and is defined as in (23), where z is the index of the Bark band (z ranging from 1 to 24) and k_Band[z] the FFT bins corresponding to the frequencies included in the zth critical band:

\[
\mathrm{Loudness}(z) = \left(\sum_{k_{\mathrm{Band}[z]}} |X(k)|\right)^{0.23}. \tag{23}
\]

The total loudness is defined as the sum of the specific loudnesses:

\[
\mathrm{Loudness}_{\mathrm{Total}} = \sum_{z=1}^{24} \mathrm{Loudness}(z). \tag{24}
\]

For each band, a relative loudness is defined as

\[
\mathrm{Loudness}_{\mathrm{Relative}}(z) = \frac{\mathrm{Loudness}(z)}{\mathrm{Loudness}_{\mathrm{Total}}}. \tag{25}
\]

Page 37: Analysis and Signal Processing of Oesophageal and ...downloads.hindawi.com/journals/specialissues/732412.pdf · developing algorithms for speech signal processing, and despite the

10 EURASIP Journal on Advances in Signal Processing

Based on the Bark scale and the loudness, a perceptive equivalent of the spectral center of gravity is computed as

\[
A = 0.11 \times \frac{\sum_{z=1}^{24} z\, g(z)\, \mathrm{Loudness}(z)}{\mathrm{Loudness}_{\mathrm{Total}}}, \tag{26}
\]
where g(z) is defined as
\[
g(z) =
\begin{cases}
1, & \text{if } z < 15,\\
0.066 \times e^{0.171 z}, & \text{if } z \geq 15.
\end{cases} \tag{27}
\]

Finally, the spread measures the distance from the largest specific loudness to the total loudness:

\[
\mathrm{Spread} = \left(\frac{\mathrm{Loudness}_{\mathrm{Total}} - \max_{z}\left[\mathrm{Loudness}(z)\right]}{\mathrm{Loudness}_{\mathrm{Total}}}\right)^{2}. \tag{28}
\]

5.5.13. Spectral Balances. As defined in [19], 5 frequency bands are considered:

(1) L0: [60–400 Hz],

(2) L1: [400–2000 Hz],

(3) L2: [2000–5000 Hz],

(4) L3: [5000–8000 Hz],

(5) LT : [60–8000 Hz].

The energy in each of these bands is computed as in (29), where L_i stands for the ith frequency band and k_Li for the FFT bins corresponding to the frequencies included in the band L_i (i takes the values 0, 1, 2, 3, T):

\[
E_{L_i} = 10 \times \log_{10} \sum_{k_{L_i}} \mathrm{PSD}(k). \tag{29}
\]

The energy ratio between two of these frequency bands is defined as in (30) (i and j take the values 0, 1, 2, 3, T):

\[
E_{L_{i,j}} = 10 \times \log_{10} \frac{\sum_{k_{L_i}} \mathrm{PSD}(k)}{\sum_{k_{L_j}} \mathrm{PSD}(k)}. \tag{30}
\]

The Soft Phonation Index is defined in the same way as in (30), but for the [0–1000 Hz] and [0–8000 Hz] frequency bands.
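A sketch of the band energies (29), the balances (30), and this Soft Phonation Index variant, assuming NumPy; the dictionary of band edges and the absence of numerical guards (e.g., against empty or silent bands) are simplifications of this sketch.

```python
# Sketch of the band energies (29), spectral balances (30) and the SPI variant.
import numpy as np

BANDS = {"L0": (60, 400), "L1": (400, 2000), "L2": (2000, 5000),
         "L3": (5000, 8000), "LT": (60, 8000)}

def band_energy(psd, freqs, lo, hi):
    return psd[(freqs >= lo) & (freqs < hi)].sum()

def spectral_balances(x, fs=16000, nfft=1024):
    psd = np.abs(np.fft.fft(x, nfft)) ** 2 / nfft
    freqs = np.arange(nfft) * fs / nfft
    e = {name: band_energy(psd, freqs, lo, hi) for name, (lo, hi) in BANDS.items()}
    energies_db = {name: 10 * np.log10(val) for name, val in e.items()}        # (29)
    balances = {(i, j): 10 * np.log10(e[i] / e[j])
                for i in BANDS for j in BANDS if i != j}                        # (30)
    spi = 10 * np.log10(band_energy(psd, freqs, 0, 1000)
                        / band_energy(psd, freqs, 0, 8000))
    return energies_db, balances, spi
```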

5.5.14. Spectral Flux. The spectral flux [42, 46, 47] is a descriptor aiming at quantifying the variation of the spectrum over time. It is particularly useful when particular events (such as voice onsets [54]) must be detected. This temporal variation is computed from the normalized cross-correlation between two successive amplitude spectra:

\[
\mathrm{SF}(t) = 1 - \frac{\sum_{k=1}^{N_{\mathrm{FFT}}/2} |X(t-1,k)| \times |X(t,k)|}{\sqrt{\sum_{k=1}^{N_{\mathrm{FFT}}/2} |X(t-1,k)|^{2}} \times \sqrt{\sum_{k=1}^{N_{\mathrm{FFT}}/2} |X(t,k)|^{2}}}. \tag{31}
\]

5.6. Voicing Decision. As it has been chosen to compute the correlation between features only for the voiced parts of speech, a voicing detection algorithm dedicated to this purpose has been developed. The different steps of this algorithm are as follows.

(1) Prior estimation of the fundamental period: many methods are proposed in the literature, but the YIN algorithm [55] has emerged in recent years in the speech processing and MIR communities. This algorithm provides a prior estimation of the fundamental period, necessary for the following step.

(2) Computation of the local cross-correlation [49]: the cross-correlation function (see (32) for two sequences y(n) and z(n) of length N) is the main element used to determine whether the speech segment is voiced or not:

\[
R_{yz}(m) = \frac{1}{N - |m|}\sum_{n=0}^{N-m-1} y(n+m)\, z(n). \tag{32}
\]

Every 30 milliseconds, the corresponding estimate of the fundamental period is considered and two frames are extracted from the speech signal: one fundamental period to the left of the current analysis instant and one fundamental period to its right. The cross-correlation between these two frames is then computed according to (32).

(3) Thresholding of the cross-correlation: by observing the evolution of the maximum of the cross-correlation function (let us call it MaxXC) and according to [43], it has been observed that this descriptor, correctly thresholded, provides a preliminary discrimination between voiced and unvoiced frames. The most satisfying value for the threshold is 0.02. A voicing mask is defined for the whole speech signal:

\[
\text{Voiced Mask} =
\begin{cases}
1, & \text{if } \mathrm{MaxXC} \geq 0.02,\\
0, & \text{if } \mathrm{MaxXC} < 0.02.
\end{cases} \tag{33}
\]

(4) Correction of the voicing mask: although the results of the previous step are already satisfying, some mistakes remain, as in other problems in which a threshold has to be applied. A typical mistake is an isolated voiced frame among unvoiced ones. To overcome these detection errors, a second-order moving average filter is applied to the voicing mask: for a given frame, if the output of the filter is lower than 1, the frame is tagged as unvoiced, and as voiced otherwise.

Once the voicing mask is available, the evolution of each of the features presented above is multiplied by the mask in order to keep only the feature values in the voiced parts of speech.
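A sketch of this voicing decision, assuming NumPy; the fundamental-period estimates of step (1) are taken as given (e.g., from a YIN implementation, not reproduced here), and the three-point moving-average window used for the correction step is an assumption of this sketch.

```python
# Sketch of the voicing decision of Section 5.6.
import numpy as np

def local_xcorr_max(left, right):
    """Maximum over all lags of the normalized cross-correlation (32)."""
    N = len(left)
    full = np.correlate(left, right, mode="full")    # length 2N-1
    overlap = N - np.abs(np.arange(-(N - 1), N))     # N - |m| overlapping samples
    return np.max(full / overlap)

def voicing_mask(x, centers, periods, threshold=0.02):
    """centers: analysis instants (samples); periods: period estimates (samples)."""
    mask = np.zeros(len(centers))
    for i, (c, T) in enumerate(zip(centers, periods)):
        T = int(T)
        if T > 0 and c - T >= 0 and c + T <= len(x):
            voiced = local_xcorr_max(x[c - T:c], x[c:c + T]) >= threshold  # (33)
            mask[i] = 1.0 if voiced else 0.0
    # step (4): moving-average correction removing isolated voiced frames
    smoothed = np.convolve(mask, np.ones(3) / 3, mode="same")
    return (smoothed >= 1.0).astype(int)
```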

6. Correlation Computation

As presented in Section 5, a total of 87 features are considered in this study. They were originally intended to be inputs of a classification system. In order to eliminate the redundant information in the features, the correlation between features is first computed. This operation is included in the second part of the analysis system (see Figure 4).

Figure 4: Details of the second part of the analysis system.

6.1. Definition of the Correlation. The Pearson correlation coefficient [49] is computed as follows (for two numeric sequences x and y of length N):

\[
R_{xy} = \frac{\sum_{i=1}^{N}\left(x_i - \overline{x}\right)\left(y_i - \overline{y}\right)}{\sqrt{\sum_{i=1}^{N}\left(x_i - \overline{x}\right)^{2}} \times \sqrt{\sum_{i=1}^{N}\left(y_i - \overline{y}\right)^{2}}}, \tag{34}
\]

where

\[
\overline{x} = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad
\overline{y} = \frac{1}{N}\sum_{i=1}^{N} y_i. \tag{35}
\]

The values of the correlation coefficient lie in the [−1, 1] interval, |R_xy| = 1 corresponding to perfectly correlated sequences and R_xy = 0 to perfectly uncorrelated sequences.

In the case of multiple sequences of features, the correlation is computed between each pair of sequences and the overall correlation matrix is computed as

\[
M(p, q) = \frac{\sum_{i=1}^{N}\left(x_{i,p} - \overline{x}\right)\left(y_{i,q} - \overline{y}\right)}{\sqrt{\sum_{i=1}^{N}\left(x_{i,p} - \overline{x}\right)^{2}} \times \sqrt{\sum_{i=1}^{N}\left(y_{i,q} - \overline{y}\right)^{2}}}, \tag{36}
\]

where p and q (in the [1, 87] interval) identify two sequences of features and where the means of x and y are computed for each sequence. The correlation matrices for a normal subject and a pathological sample (from the database described in Section 4) are presented in Figures 5 and 6.
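Since each feature is a trajectory over the voiced frames, the whole matrix of (36) can be obtained in one call, as sketched below with NumPy.

```python
# Sketch of the 87x87 correlation matrix of (36): each row of `features` is
# the voiced-frame trajectory of one descriptor for a given recording.
import numpy as np

def feature_correlation_matrix(features):
    """features: array of shape (n_descriptors, n_voiced_frames)."""
    return np.corrcoef(features)   # Pearson correlation between all pairs of rows

M = feature_correlation_matrix(np.random.randn(87, 300))   # toy example
```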

When looking at these matrices, one can see that their structures are quite different, a fact confirmed for the other samples of the database. That is why it was decided to exploit the information of the correlation matrix rather than the features themselves, in order to see whether significant differences could be found between normal and pathological samples.

Figure 5: Correlation matrix for a normal sample.

Figure 6: Correlation matrix for a pathological sample.

6.2. Exploitation of the Correlation Matrix. As shown in Figures 5 and 6, the correlation matrix of the normal sample contains more elements close to 1 (in absolute value) than that of the pathological sample. Information can be extracted from the correlation matrix by considering the elements of its upper part and by treating each of these elements as a feature itself. As the correlation matrix is symmetric and its diagonal consists of elements equal to 1 by definition, the number of elements in its upper part is

\[
\frac{N_{\mathrm{Descriptors}} \times \left(N_{\mathrm{Descriptors}} - 1\right)}{2}, \tag{37}
\]

where N_Descriptors stands for the number of acoustic descriptors. In the present case, as 87 descriptors are considered as inputs of the system, there are 3741 elements to consider in the correlation matrix.
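A sketch of the extraction of these 3741 elements, assuming NumPy.

```python
# Sketch of turning the upper triangle of the correlation matrix into a
# feature vector of length N(N-1)/2, as in (37).
import numpy as np

def upper_triangle_features(M):
    iu = np.triu_indices(M.shape[0], k=1)   # indices strictly above the diagonal
    return M[iu]                            # 3741 values for an 87x87 matrix
```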

In order to find a statistical discriminant factor between normal and pathological samples, the correlation matrix is computed for all the samples of the training set; to assess the classification performance of this factor, it has also been computed for the samples of the test set.

Figure 7: Distribution of the sum of the elements of the correlation matrix for normal samples (green) and pathological samples (red) included in the training set.

Figure 8: Details of the third part of the analysis system.

A first attempt to extract information from the correlation matrix is to sum the elements of its upper part. The distribution of this sum for the samples of the training set is shown in Figure 7. The normal distribution is in green and the pathological one in red; the y-axis gives the percentage of samples associated with a particular value of the sum (x-axis). When looking at these distributions, one can see that each one is wide and that they strongly overlap, leading to the conclusion that the sum is not able to separate the normal and pathological samples. One can make the hypothesis that some features bring confusion into the sum operation. That is why it was decided to select only a few of them, in order to see whether the separation between normal and pathological samples improves in that case. The feature selection operation is presented in the next section.

7. Feature Selection

As each element of the correlation matrix can be considered as a feature, the analysis system now computes 3741 features. As seen in the previous section, the sum of these features does not separate well the distributions of the normal and pathological samples of the training set. It is thus necessary to select, among the 3741 features, the few ones that discriminate best between the two populations. Several methods are proposed in the literature; each of them is briefly described here. The reasons for choosing one of them are then presented and the selected method is finally applied to the present problem. The selection and combination of the elements of the correlation matrix constitute the third part of the analysis system (see Figure 8).

7.1. Methods for Feature Selection

7.1.1. Principal Component Analysis. Principal Component Analysis (PCA) is a well-known method for the preprocessing of features in a classification system [38, 56]. It is used to linearly transform the features in order to find the best way to represent them (in terms of least square error). If X represents the normalized feature matrix, the new feature matrix Z is obtained by

\[
Z = U^{T} X, \tag{38}
\]

where U is the linear transformation matrix. One can show that the matrix U leading to the best final representation consists of the eigenvectors of the autocorrelation matrix XX^T. The dispersion of the features around each new axis of representation is given by the eigenvalue associated with this axis. A reduction of the feature dimensionality is possible by selecting the axes of representation associated with the highest eigenvalues. It must however be emphasized that the transformation defined in (38) is not based on class labelling but only on the features. Besides, PCA computes a linear combination of the original features, which makes the physical interpretation of the new ones difficult.

7.1.2. Generalized Fisher Criterion. The generalized Fisher criterion [56] is a class separation criterion based on the features and on the class labelling. It relies on the ratio between two matrices.

(i) Within-class covariance matrix: quantifies the amount of inner feature dispersion for all the classes.

(ii) Between-class covariance matrix: quantifies the feature dispersion around the general mean for all the classes.

Figure 9: Discriminant power of the 3741 correlations for the samples of the training set.

Figure 10: Distribution of the first discriminant correlation for the samples of the training set (green: normal class; red: pathological class).

For a given feature k, only the diagonal elements of the two matrices defined above are considered and its discriminant power between C classes is defined as

\[
D_k = \frac{\sum_{c=1}^{C} p(\omega_c)\left(\mu_{ck} - \mu_k\right)^{2}}{\sum_{c=1}^{C} p(\omega_c)\, \sigma_{ck}^{2}}, \tag{39}
\]

where p(ω_c) stands for the proportion of class c in the database, μ_ck for the mean of feature k in class c, μ_k for the mean of feature k over all the classes, and σ_ck for the standard deviation of feature k in class c. A feature selection is possible by selecting the features associated with the highest values of the discriminant power. Compared to PCA, this criterion has the advantage of being based on class labelling and of preserving the meaning of the features.

7.1.3. Fisher Discriminant Analysis. Fisher discriminant analysis [56, 57] is a procedure that changes the representation system of the features and selects among them in one operation. For a C-class problem, this method consists of finding C − 1 linear discriminant functions, these functions maximizing the ratio between the between-class and within-class covariances. One can prove that these functions are the eigenvectors of a particular matrix. This method allows the dimensionality of a problem to be reduced, although this dimensionality is fixed by the number of involved classes. Besides, as with PCA, the new features are the result of a linear combination of the original ones, which makes their physical interpretation difficult.

7.2. Application of Feature Selection. It has been chosen in this study to apply the generalized Fisher criterion, in order to keep the choice of the final dimensionality (contrary to the linear discriminant analysis) and the physical meaning of the features (contrary to PCA and linear discriminant analysis).

Figure 9 shows the discriminant power of the 3741 correlations, sorted in ascending order, for the samples of the training set. One can see that the discriminant power is clearly higher for some correlations than for others. It has been chosen to study two cases here: the case in which only the correlation associated with the highest discriminant power is kept, and the case in which the two correlations associated with the highest discriminant powers are kept.

7.2.1. One Correlation Case. The selected correlation is the one between the first spectral tristimulus on the Bark scale (see Section 5.5.11) and the spectral decrease (see Section 5.5.7). The distribution of this correlation for the two classes of the training set is shown in Figure 10. The normal samples are characterized by a high concentration of values around −0.75. This means that the evolutions of the spectral decrease and of the first spectral tristimulus are fairly strongly linked for a large majority of samples, although this link is not absolute (because the correlation is not −1 but −0.75). The pathological samples are characterized by a larger dispersion of the correlation value, meaning that for some samples the two characteristics are only slightly linked and for others no link exists at all. Compared to Figure 7, the two classes are much better separated. It may thus be possible to split the normal and pathological samples of the training set by thresholding the most discriminant correlation.

Figure 11: ROC for the "One Correlation Case."

In order to have an overview of the classification performance of this thresholding for the training set, a Receiver Operating Characteristic (ROC) curve is built by computing the False Positive Rate (FPR) and the True Positive Rate (TPR) for thresholds uniformly distributed between the lower and upper limits of the correlation. These numbers are computed for each threshold value as follows.

(1) For each sample of the training set, compute an automatic labelling by assigning the class Normal if the correlation is lower than the threshold and the class Pathological otherwise.

(2) Compute the confusion matrix from the confrontation between the manual class labelling (in other terms, whether a sample has been manually tagged Normal or Pathological) and the automatic class labelling. The elements of the confusion matrix are defined as in Table 1, where TP stands for True Positive, FP for False Positive, FN for False Negative, and TN for True Negative. These values may be normalized according to the cardinality of the Normal class (called #Normal) and the cardinality of the Pathological class (called #Pathological); TPR and FPR are therefore obtained by dividing, respectively, TP and FP by #Pathological and #Normal. One may also define the accuracy (Acc), measuring how well a binary classifier correctly identifies or excludes a condition (a sketch of this threshold sweep is given after this list):
\[
\mathrm{Acc} = \frac{\mathrm{TP} + \mathrm{TN}}{\#\mathrm{Normal} + \#\mathrm{Pathological}}. \tag{40}
\]

The ROC for the "One Correlation Case" is shown in Figure 11. In this curve, the point (0, 0) corresponds to the case in which all the normal samples are correctly classified but all the pathological ones are misclassified, and the point (1, 1) corresponds to the opposite situation. One may also cite the ideal point (0, 1), corresponding to the perfect classification of both normal and pathological samples: the closer the ROC is to this point, the better the classifier.

Table 1: Confusion matrix.

Manual pathological Manual Normal

Auto pathological TP FP

Auto normal FN TN

Table 2: Confusion matrix for the one correlation case (Trainingset).

Manual pathological Manual normal

Auto pathological 0.947 0.088

Auto normal 0.053 0.912

Table 3: Confusion matrix for the one correlation case (Test set).

Manual pathological Manual normal

Auto pathological 0.947 0.105

Auto normal 0.053 0.895

Table 4: Mean confusion matrix for the 10 training sets (Onecorrelation case).

Manual pathological Manual normal

Auto pathological 0.943 0.109

Auto normal 0.057 0.891

Between the points (0, 0) and (1, 1), the choice of a particular threshold depends on the objective. If one wants to avoid errors on the Normal class, the corresponding threshold will lead to a low FPR (but also a low TPR). On the contrary, if it is important to avoid mistakes on the Pathological class, the corresponding threshold will lead to a high TPR (but also a high FPR).

A particular point is highlighted on the ROC (black square), corresponding to the threshold located at the crossing point of the two distributions in Figure 10 (threshold = −0.3). For this threshold, the confusion matrix is shown in Table 2 (Acc = 0.9446). These first results are already satisfying.

Now that the most discriminant correlation has been chosen and its classification performance assessed on the training set, this performance has to be evaluated for samples that are not part of the training set, here the samples forming the test set. When applying the threshold on the correlation for the samples of the test set, one obtains the confusion matrix shown in Table 3 (Acc = 0.9426). One can see that the performance is of the same order as for the training set, although the chosen correlation and its threshold lead to a lower classification performance for the normal samples. It must be emphasized here that the normal samples are much less represented than the pathological ones in the MEEI Database, and thus in the training and test sets. Consequently, the misclassification of a normal sample leads to a larger variation of the classification performance than the misclassification of a pathological sample, because #Normal is much lower than #Pathological. The classification results should therefore be interpreted while keeping this difference in mind.


Table 5: Mean confusion matrix for the 10 test sets (Onecorrelation case).

Manual pathological Manual normal

Auto pathological 0.955 0.074

Auto normal 0.045 0.926

Table 6: Accuracy for the 10 pairs of training and test sets (Onecorrelation case).

Number Training set Test set

1 0.946 0.947

2 0.942 0.947

3 0.942 0.947

4 0.940 0.951

5 0.940 0.951

6 0.938 0.955

7 0.929 0.971

8 0.938 0.955

9 0.940 0.963

10 0.942 0.945

In order to validate the fact that the chosen correlation and its thresholding are the most appropriate for the distinction between normal and pathological samples, 10 training sets and 10 test sets (different from the training and test sets defined in Section 4) have been randomly formed from the samples of the MEEI Database, in the proportion described in Section 4. For each training set, the Fisher analysis has been performed. It turned out that the most discriminant correlation is always the correlation between the first spectral tristimulus on the Bark scale and the spectral decrease. Moreover, it appeared that the same threshold as the one corresponding to the crossing point of the distributions in Figure 10 could be appropriate for the classification task. This threshold has therefore been applied to the chosen correlation in the 10 training sets and 10 test sets, and the associated confusion matrices have been computed. Tables 4, 5, and 6 show, respectively, the mean confusion matrix for the 10 training sets, the mean confusion matrix for the 10 test sets, and the accuracy for the 10 pairs of sets.

All these results confirm that the chosen correlation and its associated threshold perform well in the task of discriminating between normal and pathological samples.

7.2.2. Two Correlations Case. In this case the two correlations associated with the highest discriminant powers are selected. The first one is the correlation between the first spectral tristimulus on the Bark scale and the spectral decrease, and the second one is the correlation between the relative loudness in the first Bark band (see Section 5.5.12) and the spectral decrease. For these two correlations, the location of the normal and pathological samples of the training set is shown in Figure 12. Based on this distribution, it has been chosen to evaluate to what extent the sum of the two correlations could separate the normal and pathological samples.

Figure 12: Location of the normal and pathological samples of the training set in the "two most discriminant correlations" representation space (Normal: green; Pathological: red).

Figure 13: Distribution of the sum of the two most discriminant correlations for the two corpora of the training set (Normal: green; Pathological: red).

The distribution of this sum for the samples of the training set is represented in Figure 13. As for the "One Correlation Case," it may be possible to split the two populations by thresholding the sum of the two correlations. The ROC for thresholds uniformly distributed between the lower and upper limits of this sum is computed in the same way as in the previous case and is shown in Figure 14. The same remarks as in the first case can be made about the meaning of the curve. Moreover, when comparing the two curves, one may observe that the second configuration is better than the first one for TPR values below 0.92, and the contrary for higher values.

Figure 14: ROC for the "Two Correlations Case."

A particular point is highlighted in the ROC (black square), corresponding to the threshold located at the crossing point of the two distributions in Figure 13 (threshold = −0.5). For this threshold, the confusion matrix is shown in Table 7 (Acc = 0.9335). One may remark that, compared to Table 2, the correct classification rate is lower for the pathological class (correct classification decreased by 1.2%) but remains unchanged for the normal class. The same threshold has been applied to the sum of the two correlations for the samples in the test set. The results are shown in Table 8 (Acc = 0.9394). One can see that the performance is of the same order as for the training set, although the chosen correlations and their threshold again lead to slightly lower classification performance for the pathological samples and unchanged classification performance for the normal ones.
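The column-normalised confusion matrices reported in Tables 7 and 8 can be reproduced with a few lines of the following kind; pred and truth are assumed binary arrays (1 = pathological, 0 = normal) obtained by thresholding the sum of the two correlations at −0.5, and the helper names are illustrative.

```python
import numpy as np

def column_normalised_confusion(pred, truth):
    # Rows: automatic decision, columns: manual (reference) decision;
    # each manual class is normalised to sum to 1, as in Tables 7-10.
    cm = np.zeros((2, 2))
    for p, t in zip(pred, truth):
        cm[p, t] += 1
    return cm / cm.sum(axis=0, keepdims=True)

def overall_accuracy(pred, truth):
    # Accuracy as quoted in the text: fraction of correctly classified samples,
    # which therefore depends on #Normal and #Pathological.
    return float(np.mean(np.asarray(pred) == np.asarray(truth)))
```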

The 10 training and test sets of validation defined in the "One Correlation Case" have been used to assess the validity of the "Two Correlations Case" approach. For each training set, the Fisher analysis has been performed, and it turned out that the two most discriminant correlations are always the correlation between the first spectral tristimulus in the Bark scale and the spectral decrease and the correlation between the relative loudness in the first Bark band and the spectral decrease. Moreover, it appeared that the same threshold as the one corresponding to the crossing point of the distributions in Figure 13 could be appropriate for the classification task. Therefore this threshold has been applied to the sum of the chosen correlations for the 10 pairs of training and test sets, and the associated confusion matrices have been computed.

Table 7: Confusion matrix for the two correlations case (Training set).
                     Manual pathological    Manual normal
Auto pathological          0.935                0.088
Auto normal                0.065                0.912

Table 8: Confusion matrix for the two correlations case (Test set).
                     Manual pathological    Manual normal
Auto pathological          0.938                0.105
Auto normal                0.062                0.895

Table 9: Mean confusion matrix for the 10 training sets (Two correlations case).
                     Manual pathological    Manual normal
Auto pathological          0.930                0.106
Auto normal                0.070                0.894

Table 10: Mean confusion matrix for the 10 test sets (Two correlations case).
                     Manual pathological    Manual normal
Auto pathological          0.945                0.074
Auto normal                0.055                0.926

Table 11: Accuracy for the 10 pairs of training and test sets (Two correlations case).
Number    Training set    Test set
1         0.933           0.934
2         0.933           0.934
3         0.931           0.939
4         0.925           0.951
5         0.931           0.938
6         0.927           0.947
7         0.920           0.960
8         0.927           0.947
9         0.929           0.943
10        0.929           0.943

Tables 9, 10, and 11 show, respectively, the mean confusion matrix for the 10 training sets, the mean confusion matrix for the 10 test sets, and the accuracy for the 10 pairs of sets.

Concerning the mean confusion matrices, the classification performance for the pathological samples is in both cases lower than in the "One Correlation Case" (see Table 5). When looking at the accuracies, one can see that they are lower than the ones in the "One Correlation Case" for all the pairs of sets (see Table 6).

7.3. Discussion. The application of a feature selection on the correlations between acoustic descriptors has proved its ability to separate the normal and pathological samples in the MEEI database. When comparing the "One Correlation Case" and the "Two Correlations Case," one may say that the first one is better than the second one.



This decision is supported by several considerations. Firstly, the ROC of the first configuration is better than the ROC of the second one for TPR higher than 0.92. Although this corresponds to a higher FPR, the result is better because the FPR is more sensitive to misclassification than the TPR (because #Normal is lower than #Pathological). Secondly, when comparing the confusion matrices, it has been found that the second configuration leads to lower classification performance than the first one for the pathological samples and unchanged performance for the normal ones. This is of significant importance since it is more important to detect all the pathological samples than all the normal ones. Thirdly, when comparing the accuracies, they are always lower for the second configuration than for the first one. Since the accuracy depends on #Normal and #Pathological, a lower number of correctly classified pathological samples induces a smaller accuracy than a lower number of correctly classified normal samples. In terms of accuracy, the second configuration is thus characterized by a higher number of misclassified pathological samples than the first configuration. Fourthly, the first configuration has the advantage of being easier to interpret than the second configuration (only one correlation instead of the sum of two correlations). Finally, the first configuration only requires the computation of two spectral features and one correlation, while the second one requires the computation of three spectral features and two correlations.

Some interpretations can be given about the features selected in the first configuration. As shown in Figure 10, the normal samples are characterized by correlation values concentrated around −0.75. That means that the evolutions of the spectral decrease and the first spectral tristimulus are fairly strongly linked for a large majority of the samples, although this link is not perfect (because the correlation is not −1 but −0.75). The pathological samples are characterized by a larger dispersion of the correlation values, meaning that for some samples the two features are only weakly linked and for others no link exists at all. Although the trend is clearer for the normal samples than for the pathological ones, one must keep in mind that the number of normal samples is much lower than the number of pathological ones in the database.
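The per-sample quantity discussed here is simply the Pearson correlation between two frame-wise descriptor trajectories; a minimal sketch is given below, assuming spectral_decrease and tristimulus_1 are already computed (one value per analysis frame) following the descriptor definitions of Section 5.

```python
import numpy as np

def descriptor_correlation(spectral_decrease, tristimulus_1):
    # Pearson correlation between the two descriptor trajectories of one recording.
    x = np.asarray(spectral_decrease, dtype=float)
    y = np.asarray(tristimulus_1, dtype=float)
    x -= x.mean()
    y -= y.mean()
    return float(np.sum(x * y) / np.sqrt(np.sum(x**2) * np.sum(y**2)))

# Values concentrated around -0.75 are typical of the normal samples here, whereas a
# wide spread of values is what characterises the pathological corpus.
```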

Concerning the speech utterances used in this work, it is interesting to discuss the sense of jointly assessing sustained and continuous speech samples, since these two kinds of samples are included in the MEEI database. On the one hand, the sustained vowel offers the advantage of being acquired in relatively stable conditions, meaning that the characteristics of the source and the vocal tract are quite stable. This makes it easier to compute features, and especially their perturbation, than in the case of continuous speech. The correlation between features is also easier to understand and to interpret. Besides, analyzing sustained vowels also allows the computation and interpretation of features to be less influenced by intonation, stress, or phonetic context than continuous speech. On the other hand, continuous speech better reflects the dynamics of speech production, since the characteristics of the source and vocal tract are no longer stable. This production includes onsets, terminations, variations of pitch and amplitude, and voice breaks. According to clinicians, this kind of information is also informative about the presence of pathology and more representative of the everyday life of a patient than the sustained vowel. Assessing sustained vowels and continuous speech jointly therefore seems to make sense because these two kinds of productions describe different (but complementary) conditions: the sustained vowel relates more to stable conditions while continuous speech relates more to dynamic conditions.

Apart from the discussion above, it must be emphasized that the output of the analysis system presented in this paper is a normal/pathological factor (see the overview of the system in Section 1). When a new subject is presented to the analysis system, this output could be the value of the most discriminant correlation and the position of the subject with respect to the distribution of this correlation in the test database. The aim would not be to provide a unilateral decision about the presence or absence of pathology, but to provide an indication to the clinician, who remains the person who has the final appreciation.

8. Conclusion

A classification scheme between normal and pathological voices has been presented in this paper. When applied to speech samples extracted from the MEEI database, this system provides a correct classification rate of 94.7% for pathological samples and 89.5% for normal samples. Compared with the literature, these results are slightly below those offered by methods based on this database, but our method is unique in several aspects: the considered features are not based on the estimation of the fundamental period, they come from both the normal/pathological voice assessment and Music Information Retrieval domains, and the correlation between selected features is used to discriminate normal and pathological samples instead of using a complex classifier. Besides, a potential use of our system is the computation of a normal/pathological factor, aiming at giving the clinician an indication of the location of a subject with respect to the database.

Among the future works, the test of this classification system on larger databases is planned in order to see whether using correlation remains powerful for the discrimination between the two populations. Using mutual information for estimating the link between features will also be investigated, since it has not been considered in this study. Finally, some features provided by the source-tract separation of speech could be integrated in the system in order to see if they are relevant for classification purposes.

Acknowledgments

The authors would like to thank the Walloon Region, Belgium, for its support (Grant WALEO II ECLIPSE 516009), the Universite Catholique de Louvain (ORL-ORLO Laboratory) for the availability of the database, and the COST Action 2103 "Advanced Voice Function Assessment."



This paper presents research results of the Belgian Network DYSCO (Dynamical Systems, Control, and Optimization), funded by the Interuniversity Attraction Poles Programme, initiated by the Belgian State, Science Policy Office. The scientific responsibility rests with its author(s).


Hindawi Publishing Corporation, EURASIP Journal on Advances in Signal Processing, Volume 2009, Article ID 928974, 11 pages, doi:10.1155/2009/928974

Research Article

A Joint Time-Frequency and Matrix Decomposition Feature Extraction Methodology for Pathological Voice Classification

Behnaz Ghoraani and Sridhar Krishnan

Signal Analysis Research Lab, Department of Electrical and Computer Engineering, Ryerson University, Toronto, ON, Canada M5B 2K3

Correspondence should be addressed to Sridhar Krishnan, [email protected]

Received 1 November 2008; Revised 28 April 2009; Accepted 21 July 2009

Recommended by Juan I. Godino-Llorente

The number of people affected by speech problems is increasing as the modern world places increasing demands on the human voice via mobile telephones, voice recognition software, and interpersonal verbal communications. In this paper, we propose a novel methodology for automatic pattern classification of pathological voices. The main contribution of this paper is the extraction of meaningful and unique features using Adaptive time-frequency distribution (TFD) and nonnegative matrix factorization (NMF). We construct Adaptive TFD as an effective signal analysis domain to dynamically track the nonstationarity in the speech and utilize NMF as a matrix decomposition (MD) technique to quantify the constructed TFD. The proposed method extracts meaningful and unique features from the joint TFD of the speech, and automatically identifies and measures the abnormality of the signal. Depending on the abnormality measure of each signal, we classify the signal into normal or pathological. The proposed method is applied on the Massachusetts Eye and Ear Infirmary (MEEI) voice disorders database, which consists of 161 pathological and 51 normal speakers, and an overall classification accuracy of 98.6% was achieved.

Copyright © 2009 B. Ghoraani and S. Krishnan. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

Dysphonia or pathological voice refers to speech problems resulting from damage to or malformation of the speech organs. Dysphonia is more common in people who use their voice professionally, for example, teachers, lawyers, salespeople, actors, and singers [1, 2], and it dramatically affects these professional groups' lives both financially and psychosocially [2]. In the past 20 years, significant attention has been paid to the science of voice pathology diagnosis and monitoring. The purpose of this work is to help patients with pathological problems monitor their progress over the course of voice therapy. Currently, patients are required to routinely visit a specialist to follow up on their progress. Moreover, the traditional ways to diagnose voice pathology are subjective, and depending on the experience of the specialist, different evaluations can result. Developing an automated technique saves time for both the patients and the specialist and can improve the accuracy of the assessments.

Our purpose in developing automatic pathological voice classification is to train a classification system which enables us to automatically categorize any input voice as either normal or pathological. As with any other signal classification method, before applying any classifier we are required to reduce the dimension of the data by extracting some discriminative and representative features from the signal. Once the signal features are extracted, if they are well defined, even simple classification methods will be good enough for classifying the data. There have been some attempts in the literature to extract the most appropriate features. Temporal features, such as amplitude perturbation and pitch perturbation [3, 4], have been used for pathological speech classification; however, the temporal features alone are not enough for pathological voice analysis. Spectral and cepstral domains have also been used for pathological voice feature extraction; for example, the mean fundamental frequency and the standard deviation of the frequency [4], the energy spectrum of a speech signal [5], and mel-frequency cepstral coefficients (MFCCs) [6] have been used as pathological voice features.



Linear prediction cepstral coefficients (LPCCs) [7] have been used as well. Gelzinis et al. [8] and Saenz-Lechon et al. [9] provide a comprehensive review of the current pathological feature extraction methods and their outcomes. We mention only a few of the techniques which reported a high accuracy; for example, Parsa and Jamieson [10] achieve 96.5% classification accuracy using four fundamental-frequency-dependent features and two independent features based on linear prediction (LP) modeling of vowel samples. In [7], Godino-Llorente et al. feed MFCC coefficients of the vowel /ah/ from both normal and pathological speakers into a neural-network classifier and achieve a 96% classification rate. In [11], Umapathy et al. present a new feature extraction methodology: the authors propose a segment-free approach to extract features such as octave max and mean, energy ratio and length, and frequency ratio from the speech signals. This method was applied on continuous speech samples, and it resulted in 93.4% classification accuracy.

In this paper, we study feature extraction for pathological voice classification and propose a novel set of meaningful features which are interpretable in terms of spectral and temporal characteristics of the normal and pathological signals. In Section 2, we explain the proposed methodology. Section 3 provides an overview of the desired characteristics of the selected signal analysis domain and chooses a signal representation which satisfies the criteria. Section 4 describes nonnegative matrix factorization (NMF) as a part-based matrix decomposition (MD). In Section 5, we propose a novel temporal and spectral feature set and apply a simple classifier to train the pattern classifier. Results are given in Section 6, and the conclusion is described in Section 7.

2. Methodology

In this paper, we propose a novel approach for automatic pathological voice feature extraction and classification. The majority of the current methods apply a short-time spectrum analysis to the signal frames and extract the spectral and temporal features from each frame. In other words, these methods assume the stationarity of the pathological speech over 10–30 millisecond intervals and represent each frame with one feature vector; however, to our knowledge, the stationarity of pathological speech over 10–30 milliseconds has not been confirmed yet, and in fact, our observations from the TFD of abnormal speech suggest that there are more transients in the abnormal signals, and that the formants in pathological speech are more spread and less structured. Another shortcoming of the current approaches is that they require segmenting the signal into short intervals. Choosing an appropriate signal segmentation has always been a controversial topic in windowed TF approaches. Since real-world signals have nonstationary dynamics, segmentation at nonstationary parts of the signal could lose useful information. To overcome these limitations, we propose a novel approach to extract TF features from the speech in a way that captures the dynamic changes of the pathological speech.

Figure 1 is a schematic of the proposed pathological speech classification approach. As shown in this figure, a joint TF representation of the pathological and normal signals is estimated. It has been shown that TF analysis is effective for revealing nonstationary aspects of signals such as trends, discontinuities, and repeated patterns where other signal processing approaches fail or are not as effective. However, most TF analyses have been utilized for visualization purposes, and quantification and parametrization of the TFD for feature extraction and automatic classification have not been explicitly studied so far. In this paper, we explore TF feature extraction for pathological signal classification. As we mention in Section 3, not every TF signal analysis is suitable for our purpose. In Section 3, we explain the criteria for a suitable TFD and propose Adaptive TFD as a method which successfully captures the temporal and spectral localization of the signal components.

Once the signal is transformed to the TF plane, we interpret the TFD as a matrix V_{M×N} and apply a matrix decomposition (MD) technique to the TF matrix as given below:

\[
V_{M\times N} = W_{M\times r} H_{r\times N} = \sum_{i=1}^{r} w_i h_i, \quad (1)
\]

where N is the length of the signal, M is the frequency resolution of the constructed TFD, and r is the order of the MD. Applying an MD to the TF matrix V, we derive the TF matrices W and H, which are defined as follows:

\[
W_{M\times r} = \left[\, w_1 \; w_2 \; \cdots \; w_r \,\right], \qquad
H_{r\times N} =
\begin{bmatrix}
h_1 \\ h_2 \\ \vdots \\ h_r
\end{bmatrix}. \quad (2)
\]

In (1), the MD reduces the TF matrix V to the base and coefficient vectors ({w_i}_{i=1,...,r} and {h_i}_{i=1,...,r}, resp.) in such a way that the former represent the basis components in the TF signal structure, and the latter indicate the location of the corresponding base vectors in time. The estimated base and coefficient vectors are used in Section 5 to extract novel joint time and frequency features. Unlike window-based feature extraction approaches, the proposed method makes no assumption about the stationarity of the signal, and the MD automatically decides over which intervals the signal is stationary. In this paper, we choose nonnegative matrix factorization (NMF) as the MD technique. NMF and the optimization method are explained in Section 4.

Finally, the extracted features are used to train a classifier. The classification and the evaluation are explained in Section 5.3.

3. Signal Representation Domain

The TFD, V(t, f), from which meaningful features can be extracted should preserve the joint temporal and spectral localization of the signal.



Figure 1: The schematic of the proposed pathological feature extraction and classification methodology.

As shown in [12], the TFD that preserves the time and frequency localized components has the following properties:

(1) There are nonnegative values:
\[
V(t, f) \ge 0. \quad (3)
\]

In order to produce meaningful features, the value of the TFD should be positive at each point; otherwise the extracted features may not be interpretable. For example, the Wigner-Ville distribution (WVD) always gives the derivative of the phase for the instantaneous frequency, which is always positive, but it also implies that the expectation value of the square of the frequency, for a fixed time, can become negative, which does not make sense [13]. Moreover, it is very difficult to explain negative probabilities.

(2) There are correct time and frequency marginals:
\[
\int_{-\infty}^{+\infty} V(t, f)\, df = |x(t)|^2, \quad (4)
\]
\[
\int_{-\infty}^{+\infty} V(t, f)\, dt = |X(f)|^2, \quad (5)
\]

where V(t, f) is the TFD of the signal x(t) with Fourier transform X(f). A TFD which satisfies the above criteria is called a positive TFD [13]. A positive TFD with correct marginals estimates a cross-term-free distribution of the true joint TF distribution of the signal. Such a TFD provides high TF localization of the signal energy, and it is therefore a suitable TF representation for feature extraction from nonstationary signals. In this study, we use a TFD that satisfies the criteria in (3) and (5). This TFD is called Adaptive TFD as it is constructed according to the properties of the signal being analyzed. Adaptive TFD has been used for instantaneous feature extraction from vibroarthrographic (VAG) signals in knee joint problems to classify the pathological conditions of the articular cartilage [14].

3.1. Adaptive TFD. The Adaptive TFD method [14] uses the matching pursuit TFD (MP-TFD) as an initial TFD estimate to construct a positive, high-resolution, and cross-term-free TFD. As explained in Appendix A, MP-TFD decomposes the signal into Gabor atoms with a wide variety of modulation frequencies and phases, time shifts and durations, and adds up the Wigner distribution of each component. MP-TFD eliminates the cross-term problem of bilinear TFDs and provides a better representation for multicomponent signals. However, the shortcoming of MP-TFD is that it does not necessarily satisfy the marginal properties.

As described by Krishnan et al. [14], we apply a cross-entropy minimization to the matching pursuit TFD (MP-TFD), denoted by V(t, f), as a prior estimate of the true TFD, and construct an optimal estimate of the TFD in such a way that the estimated TFD satisfies the time and frequency marginals, m_0(t) and m_0(f), respectively.

The Adaptive TFD is iteratively estimated from the MP-TFD as given below.

(1) The time marginal is satisfied by multiplying and then dividing the TFD by the desired and the current time marginals:
\[
V^{(0)}(t, f) = V(t, f)\, \frac{m_0(t)}{p(t)}, \quad (6)
\]
where p(t) is the time marginal of V(t, f). At this stage, V^{(0)}(t, f) has the correct time marginal.

(2) The frequency marginal is satisfied by multiplying and then dividing the TFD by the desired and the current frequency marginals:
\[
V^{(1)}(t, f) = V^{(0)}(t, f)\, \frac{m_0(f)}{p^{(0)}(f)}, \quad (7)
\]
where p^{(0)}(f) is the frequency marginal of V^{(0)}(t, f). At this stage V^{(1)}(t, f) satisfies the frequency marginal condition, but the time marginal could be disrupted.

(3) It is shown that repeating the above steps makes the estimated TFD closer to the true TF representation of the signal (a short sketch of this iteration follows).
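A minimal numerical sketch of this marginal-matching iteration is given below. It assumes V is an M x N nonnegative MP-TFD and m_t, m_f are the desired time and frequency marginals (|x(t)|² and |X(f)|²); it is a simplification of the cross-entropy minimisation of [14], not the authors' implementation.

```python
import numpy as np

def adapt_marginals(V, m_t, m_f, n_iter=5, eps=1e-12):
    # V: (M, N) nonnegative TFD; m_t: (N,) desired time marginal; m_f: (M,) desired
    # frequency marginal. Five iterations are enough in the experiments reported here.
    V = V.copy()
    for _ in range(n_iter):
        p_t = V.sum(axis=0)                      # current time marginal, step (1)
        V *= m_t / (p_t + eps)
        p_f = V.sum(axis=1, keepdims=True)       # current frequency marginal, step (2)
        V *= m_f[:, None] / (p_f + eps)
    return V                                     # step (3): steps (1)-(2) repeated
```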



4. Matrix Decomposition

We consider the TFD, V(t, f), as a matrix, V_{M×N}, where N is the number of samples and M is the frequency resolution of the constructed TFD; for example, given an 81.92 ms frame with a sampling frequency of 25 kHz, N is 2048 and the highest possible frequency resolution, M, is 1024, which is half of the frame length. Next, we apply an MD technique to decompose the TF matrix into the components W_{M×r} and H_{r×N}, in such a way that V ≈ WH. The W and H matrices are called the basis and encoding matrices, respectively, and r < N is the order of the decomposition.

Depending on the utilized matrix decomposition technique, the estimated components satisfy different criteria and offer different properties. An MD technique that is suitable for TF quantification has to estimate the encoding and base components with high TF localization. Three well-known MD techniques are Principal Component Analysis (PCA), Independent Component Analysis (ICA), and Nonnegative Matrix Factorization (NMF). PCA finds a set of orthogonal components that minimizes the mean squared error of the reconstructed data. The PCA algorithm decomposes the data into a set of eigenvectors W corresponding to the first r largest eigenvalues of the covariance matrix of the data, and H, the projection of the data on this space. ICA is a statistical technique for decomposing a complex dataset into components that are as independent as possible. If r independent components w_1, ..., w_r compose the linear mixtures v_1, ..., v_n as V = WH, the goal of ICA is estimating H, while our observation is only the random matrix V. Once the matrix H is estimated, the independent components can be obtained as W = VH^{-1}. The NMF technique is applied to a nonnegative matrix and constrains the matrix factors W and H to be nonnegative. In a previous study [15], we demonstrated that NMF-decomposed factors promise a higher TF representation and localization compared to ICA and PCA factors. In addition, as mentioned in Section 3, negative TF distributions do not result in interpretable features, and they are not suitable for feature extraction. Therefore, in this paper, we use NMF for TF matrix decomposition.

The NMF algorithm starts with an initial estimate for W and H and performs an iterative optimization to minimize a given cost function. In [16], Lee and Seung introduce two updating algorithms using the least-squares error and the Kullback-Leibler (KL) divergence as the cost functions:

Least-squares error:
\[
W \longleftarrow W \cdot \frac{V H^{T}}{W H H^{T}}, \qquad
H \longleftarrow H \cdot \frac{W^{T} V}{W^{T} W H},
\]
KL divergence:
\[
W \longleftarrow W \cdot \frac{(V / W H)\, H^{T}}{\mathbf{1} \cdot H}, \qquad
H \longleftarrow H \cdot \frac{W^{T} (V / W H)}{W \cdot \mathbf{1}}. \quad (8)
\]

In these equations, A · B and A/B denote the term-by-term multiplication and division of the matrices A and B.
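For concreteness, the KL-divergence updates in (8) can be written out as follows; this sketch is written directly from the equations above (with a small constant added for numerical safety) and is not the optimisation actually used in the paper, which is the projected-gradient method described next.

```python
import numpy as np

def nmf_kl(V, r, n_iter=200, eps=1e-12):
    # Multiplicative KL-divergence updates of Lee and Seung [16] for V (M x N) ~ W H.
    M, N = V.shape
    rng = np.random.default_rng(0)
    W = rng.random((M, r)) + eps
    H = rng.random((r, N)) + eps
    ones = np.ones((M, N))
    for _ in range(n_iter):
        W *= ((V / (W @ H + eps)) @ H.T) / (ones @ H.T + eps)   # "1 . H" denominator
        H *= (W.T @ (V / (W @ H + eps))) / (W.T @ ones + eps)   # "W . 1" denominator
    return W, H
```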

Various alternative minimization strategies have been proposed [17]. In this work, we use a projected-gradient bound-constrained optimization method proposed by Lin [18]. The optimization is performed on the function f = V − WH and consists of three steps.

(1) Updating the Matrix W. In this stage, the optimization of f_H(W) is solved with respect to W, where f_H(W) is the function f = V − WH in which the matrix H is assumed to be constant. In every iteration, the matrix W is updated as
\[
W^{t+1} = \max\left\{ W^{t} - \alpha_t \nabla f_H\!\left(W^{t}\right),\, 0 \right\}, \quad (9)
\]

where t is the iteration index, ∇f_H(W) is the gradient of the function f while H is constant, and α_t is the step size used to update the matrix. The step size is found as α_t = β^{K_t}, where β, β², β³, ... are the possible step sizes, and K_t is the first nonnegative integer for which

\[
f\!\left(W^{t+1}\right) - f\!\left(W^{t}\right) \le \sigma \left\langle \nabla f_H\!\left(W^{t}\right),\, W^{t+1} - W^{t} \right\rangle, \quad (10)
\]

where the operator ⟨·, ·⟩ is the inner product between two matrices, defined as
\[
\langle A, B \rangle = \sum_{i} \sum_{j} a_{ij}\, b_{ij}. \quad (11)
\]

In [18], the values of σ and β are suggested to be 0.01 and 0.1, respectively. Once the step size α_t is found, the stationarity condition of the function f_H(W) at the updated matrix is checked as

\[
\left\| \nabla^{P} f_H\!\left(W^{t+1}\right) \right\| \le \epsilon \left\| \nabla f_H\!\left(W^{1}\right) \right\|, \quad (12)
\]
where ‖∇f_H(W^1)‖ is the norm of the gradient of the function f_H(W) at the first iteration (t = 1), ε is a very small tolerance, and ∇^P f_H(W) is the projected gradient defined as

\[
\nabla^{P} f_H(W) =
\begin{cases}
\nabla f_H(W), & w_{mr} > 0, \\
\min\!\left(0,\, \nabla f_H(W)\right), & w_{mr} = 0.
\end{cases} \quad (13)
\]

If the stationarity condition is met, the procedure stops; if not, the optimization is repeated until the point W^{t+1} becomes a stationary point of f_H.

(2) Updating the Matrix H. This stage solves the optimization problem with respect to H, assuming W is constant. A procedure similar to that of stage 1 is repeated here; the only difference is that in the previous stage H is constant, whereas here W is constant.

(3) The Convergence Test. Once the above subproblems are solved, we check for the stationarity of the W and H solutions together:

\[
\left\| \nabla f_H\!\left(W^{t}\right) \right\| + \left\| \nabla f_W\!\left(H^{t}\right) \right\|
\le \epsilon \left( \left\| \nabla f_H\!\left(W^{1}\right) \right\| + \left\| \nabla f_W\!\left(H^{1}\right) \right\| \right). \quad (14)
\]



Figure 2: Block diagram of the proposed feature extraction technique.

The optimization is complete if the global convergence rule (14) is satisfied; otherwise, steps 1 and 2 are repeated iteratively until convergence.

The gradient-based NMF is computationally competitive and offers better convergence properties than the standard approach, and it is therefore used in the present study.
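A compact sketch of one projected-gradient sub-problem (updating W with H fixed) is shown below, following (9)-(13) with σ = 0.01 and β = 0.1 as suggested in [18]. The squared-error objective f = ‖V − WH‖² used for the gradient is an assumption made for the sketch, which is a simplified illustration rather than Lin's full algorithm.

```python
import numpy as np

def update_W(V, W, H, sigma=0.01, beta=0.1, max_backtrack=20):
    # One sub-problem of the projected-gradient NMF: minimise f_H(W) with H fixed.
    def cost(Wc):
        return 0.5 * np.sum((V - Wc @ H) ** 2)    # assumed squared-error objective
    grad = (W @ H - V) @ H.T                      # gradient of f_H with respect to W
    alpha = 1.0
    W_new = W
    for _ in range(max_backtrack):
        W_new = np.maximum(W - alpha * grad, 0.0) # projection onto the nonnegative orthant, (9)
        # sufficient-decrease condition (10), with the inner product of (11)
        if cost(W_new) - cost(W) <= sigma * np.sum(grad * (W_new - W)):
            break
        alpha *= beta                             # candidate step sizes beta, beta^2, beta^3, ...
    return W_new
```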

5. Feature Extraction and Classification

In this section, we extract a novel feature set from the decomposed TF base and coefficient vectors (W and H). Our observations indicate that abnormal speech behaves differently for voiced (vowel) and unvoiced (consonant) components. Therefore, prior to feature extraction, we divide the base vectors into two groups: (a) Low Frequency (LF): the bases with dominant energy in the frequencies lower than 4 kHz, and (b) High Frequency (HF): the bases with major energy concentration in the higher frequencies.

Next, as depicted in Figure 2, we extract four features from each LF base and five features from each HF base, and only two of these features are common to the two sets. In order to derive the discriminative features of normal and abnormal signals, we investigate the TFD differences between the two groups. To do so, we choose one normal and one pathological speech sample and construct the Adaptive TFD of each 80 ms frame of the signals. The sum of the TF matrices for each speech sample is shown in Figure 3. We observed two major differences between the pathological and the normal speech: (1) the pathological signal has more transient components compared to the normal signal, and (2) the pathological voice presents weaker formants compared to the normal signal.

Based on the above observations, we extract the following features from the coefficient and base vectors.

5.1. Coefficient Vectors. It is observed that the pathological voice can be characterized by its noisy structure. The more transients and discontinuities are present in the signal, the more abnormality is observed in the speech. Two features are proposed to represent this characteristic of the pathological speech.

5.1.1. Sparsity. The sparsity of the coefficient vector distinguishes the nonfrequent transient components of the abnormal signals from the natural frequent components. Several sparseness measures have been proposed in the literature. In this paper, we use the function defined as

\[
S_{h_i} = \frac{\sqrt{N} - \left( \sum_{n=1}^{N} h_i(n) \right) \Big/ \sqrt{\sum_{n=1}^{N} h_i^{2}(n)}}{\sqrt{N} - 1}. \quad (15)
\]

The above function is unity if and only if h_i contains a single nonzero component, and is zero if and only if all the components are equal. The sparsity measure in (15) has been used for applications such as NMF matrix decomposition with more part-based properties [19]; however, it has never been used for feature extraction applications.

The next proposed feature differentiates the discontinuity characteristics of the pathological speech from the normal signal.

5.1.2. Sum of Derivative. We have
\[
D_{h_i} = \sum_{n=1}^{N-1} h_i'(n)^{2}, \quad (16)
\]
where
\[
h_i'(n) = h_i(n+1) - h_i(n), \qquad n = 1, \ldots, N-1. \quad (17)
\]

D_{h_i} captures the discontinuities and abrupt changes, which are typical in pathological voice samples.
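Both coefficient-vector features follow directly from (15)-(17); a small sketch (with h one row of the coefficient matrix H) is:

```python
import numpy as np

def coefficient_sparsity(h):
    # Hoyer-type sparsity of (15): 1 for a single nonzero entry, 0 for a flat vector.
    h = np.asarray(h, dtype=float)
    N = h.size
    return float((np.sqrt(N) - h.sum() / np.sqrt(np.sum(h**2))) / (np.sqrt(N) - 1))

def sum_of_derivative(h):
    # Sum of squared first differences, (16)-(17): large for abrupt changes.
    dh = np.diff(np.asarray(h, dtype=float))
    return float(np.sum(dh**2))
```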

5.2. Base Vectors. The base vectors represent the frequency components present in the signal. The dynamics of the voice abnormality vary between the HF and LF base groups. Hence, we extracted different frequency features for each group.

5.2.1. Moments. Our observations showed that in pathological speech, the HF bases tend to have their energy concentrated at higher frequencies compared to normal signals. To discriminate this abnormality property, we extract the first three moments of the base vectors as features:
\[
MO^{(o)}_{w_i} = \sum_{m=1}^{M} f^{o}\, w_i(m), \quad o = 1, 2, 3, \quad (18)
\]
where MO^{(1)}, MO^{(2)}, and MO^{(3)} are the three moments, and M is the frequency resolution. The moment features are extracted from the HF bases; the higher the frequency energies, the larger the feature values. Although these features are useful for distinguishing the abnormalities of the HF components, they are not useful for representing the abnormalities of the LF bases.



(a) TF distribution of a normal voice with a male speaker. (b) TF distribution of a pathological voice with a male speaker.

Figure 3: TFD of a normal (a) and an abnormal signal (b) constructed using the Adaptive TFD with Gabor atoms, 100 MP iterations, and 5 MCE iterations. As evident in these figures, the pathological signal has more transient components, especially at high frequencies. In addition, the TF of the pathological signal presents weak formants, while the normal signal has more periodicity in low frequencies and introduces stronger formants.

The reason is that the major frequency variation in the LF components is dominated by the difference in pitch frequency from one speaker to another, and it does not provide any discrimination between normality and abnormality of the speech. Two features are proposed for the LF bases.

5.2.2. Sparsity. As is known in the literature, periodic structures are expected in the low-frequency components of normal speech. Therefore, when a large amount of scattered energy is observed in the low-frequency components, we conclude that a level of abnormality is present in the signal. To measure this property, we propose the sparsity of the base vectors w_i as given below:
\[
S_{w_i} = \frac{\sqrt{M} - \left( \sum_{m=1}^{M} w_i(m) \right) \Big/ \sqrt{\sum_{m=1}^{M} w_i^{2}(m)}}{\sqrt{M} - 1}. \quad (19)
\]

For normal signals we expect higher sparsity features, while pathological speech signals have lower sparsity values.

5.2.3. Sharpness. S_{w_i} measures the spread of the components in low frequencies. In addition, we need another feature to provide information on the energy distribution in frequency. Comparing the LF bases of the normal and the pathological signals, we notice that normal signals have strong formants, whereas the pathological signals have weak and less structured formants.

For each base vector, we first calculate the Fourier transform as given by
\[
W_i(\nu) = \left| \sum_{m=1}^{M} e^{-j(2\pi m \nu / M)}\, w_i(m) \right|, \quad (20)
\]

where M is the length of the base vector and W_i(ν) is the Fourier transform of the base vector w_i. Next, we perform a second Fourier transform and obtain W_i(κ) as follows:

\[
W_i(\kappa) = \left| \sum_{\nu=1}^{M/2} e^{-j\left(2\pi \nu \kappa / (M/2)\right)}\, W_i(\nu) \right|. \quad (21)
\]

Finally, we sum up all the values of |W_i(κ)| for κ greater than m_0, where m_0 is a small number:
\[
SH_{w_i} = \sum_{\kappa = m_0}^{M/4} \left| W_i(\kappa) \right|. \quad (22)
\]

In Appendix B, we demonstrate that SH_{w_i} is large for bases representing strong formants, such as in normal speech, but small for distorted formants, such as in pathological speech.
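The base-vector features can be sketched in the same way; w is one column of W, the frequency axis used in the moments is the plain bin index (an assumption, since the paper does not state its frequency scaling), and m0 = 3 is an illustrative choice for the small lower limit in (22).

```python
import numpy as np

def base_moments(w):
    # First three spectral moments of an HF base vector, following (18).
    w = np.asarray(w, dtype=float)
    f = np.arange(1, w.size + 1, dtype=float)      # assumed frequency axis (bin index)
    return [float(np.sum(f**o * w)) for o in (1, 2, 3)]

def base_sparsity(w):
    # Sparsity of an LF base vector, (19): low when the energy is scattered.
    w = np.asarray(w, dtype=float)
    M = w.size
    return float((np.sqrt(M) - w.sum() / np.sqrt(np.sum(w**2))) / (np.sqrt(M) - 1))

def base_sharpness(w, m0=3):
    # Double magnitude spectrum of the base vector, (20)-(22): large for strong formants.
    w = np.asarray(w, dtype=float)
    W1 = np.abs(np.fft.fft(w))[: w.size // 2]      # eq. (20)
    W2 = np.abs(np.fft.fft(W1))[: w.size // 4]     # eq. (21)
    return float(np.sum(W2[m0:]))                  # eq. (22)
```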

5.3. Classification. As shown in Figure 1, once the features are extracted, we feed them into a pattern classifier, which consists of a training and a testing stage.

5.3.1. Training Stage. Various classifiers have been used for pathological voice classification [8], such as linear discriminant analysis, hidden Markov models, and neural networks. In the proposed technique, we use K-means clustering as a simple classifier.



Figure 4: The block diagram of the test stage.

K-means clustering is one of the simplest unsupervised learning algorithms. The method starts with initial random centroids, and it iteratively classifies a given data set into a certain number of clusters (K) by minimizing the squared Euclidean distance of the samples in each cluster to the centroid of that cluster. For each cluster C_i, the centroid is the mean of the points in that cluster.

Since separate features are extracted for the LF and HF components, we have to train a separate classifier for each group: C^{LF} and C^{HF} for the LF and HF components, respectively. Once the clusters are estimated, we count the number of abnormality feature vectors in each cluster; a cluster with a majority of abnormal points is labeled as abnormal, otherwise it is labeled as normal:

\[
C_k \in
\begin{cases}
\text{Abnormality}, & \text{if } \sum f^{C_k}_{\mathrm{abn}} > \alpha \sum f^{C_k}_{n}, \\
\text{Normality}, & \text{if } \sum f^{C_k}_{\mathrm{abn}} < \alpha \sum f^{C_k}_{n},
\end{cases} \quad (23)
\]

where Σ f^{C_k}_{abn} and Σ f^{C_k}_{n} are the total numbers of abnormality and normality features in the cluster C_k, respectively. We found the value α = 1.2 to be a proper choice for this threshold.

In (23), we choose the classes that represent the abnormality in the speech. The equation labels a cluster as abnormal if the number of features estimated from the pathological voices exceeds the number of features derived from the normal speech. The abnormality clusters are denoted as C^{LF}_{abn} and C^{HF}_{abn} for the LF and HF groups, respectively.

5.3.2. Testing Stage. In this stage, we test the trained classifier. For a voice sample, we find the nearest cluster to each of its feature vectors using the Euclidean distance criterion. If the number of feature vectors that belong to the abnormality clusters is dominant, the voice sample is classified as a pathological voice; otherwise, it is classified as normal speech.

Figure 4 demonstrates the testing stage. The f^{LF}_{test} and f^{HF}_{test} feature vectors are derived from the base and coefficient vectors in the LF and HF groups, respectively. For each feature vector, we find the closest cluster, C_{k_0}, as given in
\[
f^{LF}_{t} \in C^{LF}_{k_0} \;\text{ if }\; k_0 = \arg\min_{k=1,\ldots,K} \sqrt{\sum_{i=1}^{4} \left( f^{LF}_{t}(i) - C^{LF}_{k}(i) \right)^{2}}, \quad t = 1, \ldots, T_{LF},
\]
\[
f^{HF}_{t} \in C^{HF}_{k_0} \;\text{ if }\; k_0 = \arg\min_{k=1,\ldots,K} \sqrt{\sum_{i=1}^{5} \left( f^{HF}_{t}(i) - C^{HF}_{k}(i) \right)^{2}}, \quad t = 1, \ldots, T_{HF}, \quad (24)
\]
where f^{LF}_{t} and f^{HF}_{t} are the input feature vectors, and T_{HF} and T_{LF} are the total numbers of test feature vectors for the HF and LF components, respectively.

Next, the number of all the features that belong to abnormal and normal clusters is calculated:
\[
\text{if } C^{LF}_{k_0} \in C^{LF}_{\mathrm{abn}} \Longrightarrow \mathrm{abn}^{LF}_{\mathrm{test}} = \mathrm{abn}^{LF}_{\mathrm{test}} + 1,
\]
\[
\text{if } C^{HF}_{k_0} \in C^{HF}_{\mathrm{abn}} \Longrightarrow \mathrm{abn}^{HF}_{\mathrm{test}} = \mathrm{abn}^{HF}_{\mathrm{test}} + 1, \quad (25)
\]

where abn^{LF}_{test} and abn^{HF}_{test} are the numbers of all the feature vectors of the LF and HF groups that belong to an abnormal cluster. The signal is classified as normal if
\[
L_{\mathrm{abnormality}} < Th_{\mathrm{patho}}, \quad (26)
\]

where Th_{patho} is the abnormality threshold, and L_{abnormality} is the number of the abnormality features in the voice sample:
\[
L_{\mathrm{abnormality}} = \frac{\mathrm{abn}^{LF}_{\mathrm{test}}}{T_{LF}} + \frac{\mathrm{abn}^{HF}_{\mathrm{test}}}{T_{HF}}. \quad (27)
\]

If the criterion in (26) is not satisfied, the signal is classified as pathological speech.
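The test-stage decision of (24)-(27) then reduces to a nearest-cluster assignment and a count of abnormal hits; the sketch below reuses the hypothetical train_group objects above and the threshold Th_patho = 0.59 reported in Section 6.

```python
import numpy as np

def classify_sample(f_lf, f_hf, km_lf, abn_lf, km_hf, abn_hf, th_patho=0.59):
    # f_lf: (T_LF, 4) LF feature vectors; f_hf: (T_HF, 5) HF feature vectors.
    lf_clusters = km_lf.predict(np.log(f_lf))      # nearest cluster, eq. (24)
    hf_clusters = km_hf.predict(np.log(f_hf))
    abn_lf_count = sum(c in abn_lf for c in lf_clusters)   # eq. (25)
    abn_hf_count = sum(c in abn_hf for c in hf_clusters)
    L = abn_lf_count / len(lf_clusters) + abn_hf_count / len(hf_clusters)  # eq. (27)
    return "normal" if L < th_patho else "pathological"    # eq. (26)
```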



Figure 5: The normalized projected energy (NPE) at each iteration is plotted for one normal (a) and one pathological signal (b). As can be observed in this figure, most of the coherent structure of the signal is projected before 100 iterations, and the remaining energy is negligible.

6. Results

The proposed methodology was applied to the Massachusetts Eye and Ear Infirmary (MEEI) voice disorders database, distributed by Kay Elemetrics Corporation [20]. The database consists of 51 normal and 161 pathological speakers whose disorders spanned a variety of organic, neurological, traumatic, and psychogenic factors. The speech signal is sampled at 25 kHz and quantized at a resolution of 16 bits/sample. In this paper, 25 abnormal and 25 normal signals were used to train the classifier.

The MP-TFD with Gabor atoms is estimated for each 80 ms of the signal. Gabor atoms provide optimal TF resolution in the TF plane and have been commonly used in MP-TFD. To determine the required number of iterations (I) in the MP decomposition, we calculate the energy of the projected signal at each iteration, ⟨R^i x, g_{γ_i}⟩ in (A.2). Figure 5 illustrates the mean of the projected energy per iteration for one normal and one pathological signal. As evident in this figure, most of the coherent structure of the signal is projected before 100 iterations. Therefore, in this paper, the MP-TFD is constructed using the first 100 iterations and the remaining energy is ignored. As explained in Section 3.1, the Adaptive TFD is constructed by applying MCE iterations to the estimated MP-TFD. It can be shown that after 5 iterations, the constructed TFD satisfies the marginal criteria in (5).

Next, we apply NMF-MD with a base number of r = 15 to each TF matrix and estimate the base and coefficient matrices, W and H, respectively. Each base vector is categorized into either the LF or the HF group: a base vector is grouped as an LF component if its energy is concentrated in the frequency range of 4 kHz or less; otherwise, it is grouped as an HF component. We extract 4 features (S_h, D_h, S_w, SH_w) from each LF base vector w and its coefficient vector h, and 5 features (S_h, D_h, MO^{(1)}_w, MO^{(2)}_w, MO^{(3)}_w) from each HF base vector and its coefficient vector.

Figure 6: The relative height of each feature represents the relative importance of the feature compared to the other features.

vector and its coefficient vector. In order to obtain the roleof each feature in the classification accuracy, we calculate theP-value of each feature using the Student’s t-test. The featurewith the smallest P-value plays the most important role inthe classification accuracy. Figure 6 demonstrates the relativeimportance of each 9 features. As shown in this figure, Dh

and SHw from LF features, and Sh, MO(2)w and MO(3)

w fromHF features play the most significant role in the classificationaccuracy.
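The following Python sketch illustrates this step under stated assumptions: it uses scikit-learn's NMF to obtain W and H from a nonnegative TF matrix, groups base vectors by whether more than half of their energy lies at or below 4 kHz, and ranks a feature by its Student's t-test P-value. The feature definitions themselves ($S_h$, $D_h$, $S_w$, $SH_w$, $MO_w$) are given earlier in the paper and are not reproduced here; freqs_hz is assumed to map the rows of W to frequencies in Hz.

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.decomposition import NMF

def nmf_bases(V, r=15):
    """Factor a nonnegative TF matrix V (frequency bins x time frames) as V ~ W H."""
    model = NMF(n_components=r, init="nndsvda", max_iter=500)
    W = model.fit_transform(V)   # columns of W: spectral base vectors
    H = model.components_        # rows of H: temporal coefficient vectors
    return W, H

def split_lf_hf(W, freqs_hz, cutoff_hz=4000.0):
    """Mark base vectors whose energy is mostly at or below 4 kHz as LF, the rest as HF."""
    low = freqs_hz <= cutoff_hz
    energy = W ** 2
    lf_mask = energy[low].sum(axis=0) > 0.5 * energy.sum(axis=0)
    return lf_mask                # True: LF base vector, False: HF base vector

def feature_p_value(values_normal, values_pathological):
    """Student's t-test P-value used to rank a feature's importance (cf. Figure 6)."""
    return ttest_ind(values_normal, values_pathological).pvalue
```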

Finally, we apply K-means clustering to the logarithm of the derived feature vectors and define the abnormality clusters. Figure 7 illustrates the application of the proposed methodology to a pathological voice sample, which is shown in Figure 7(a). As explained in Section 5.3, the test procedure determines the feature vectors that belong to the abnormality clusters. We use the base and coefficient matrices, $W_{\mathrm{abn}}$ and $H_{\mathrm{abn}}$, corresponding to the abnormality feature vectors to reconstruct the abnormality TF matrix, $V_{\mathrm{abn}}$, as $V_{\mathrm{abn}} = W_{\mathrm{abn}} H_{\mathrm{abn}}$. Figure 7(b) depicts the reconstructed TF matrix. As expected, the proposed method successfully identifies transients, high frequency components, and weak formants as abnormality.
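A compact sketch of this clustering and reconstruction step, assuming the feature matrix (one row per base/coefficient pair) is strictly positive and that the indices of the abnormality clusters have already been identified during training:

```python
import numpy as np
from sklearn.cluster import KMeans

def abnormality_tf_matrix(W, H, features, n_clusters, abnormal_cluster_ids):
    """Cluster the log feature vectors (one per base/coefficient pair) with K-means
    and rebuild the abnormality TF matrix V_abn = W_abn H_abn from the pairs
    assigned to the abnormality clusters."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(np.log(features))
    abn = np.isin(labels, abnormal_cluster_ids)
    return W[:, abn] @ H[abn, :]
```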

In the test stage, the trained classifier is used to calculate the measure of abnormality ($L_{\mathrm{abnormality}}$ in (27)) for each voice sample. Figure 8 shows the abnormality measure for 51 normal and 161 pathological speech signals in the MEEI database. As evident in this figure, the pathological samples have a higher abnormality measure compared to the normal samples. Each signal is classified as normal if its abnormality measure is smaller than a threshold ($Th_{\mathrm{patho}}$ in (26)); otherwise it is classified as pathological. In order to find the abnormality threshold, receiver operating curves (ROCs) of $L_{\mathrm{abnormality}}$ are computed, with the area under the curve indicating relative abnormality detection (Figure 9). Based on the ROC, the cut point of 0.59 is chosen as the abnormality threshold ($Th_{\mathrm{patho}} = 0.59$). Table 1 shows the accuracy of the classifier. From the table, it can be observed that out of 51 normal signals, 50 were classified as normal, and only 1 was misclassified as pathological. Also, the table shows that out of 161 pathological signals, 159 were classified


Figure 7: The classifier of Figure 4 is applied to the TF matrix of a pathological speech shown in (a), and the estimated abnormality TF matrix is shown in (b). As evident in this figure, the abnormality components are mainly transients, high frequency components, and weak formants. [(a) TFD of a pathological speech; (b) TFD of the estimated abnormality; axes: normalized frequency versus time.]

as pathological and only 2 were misclassified as normal. The total classification accuracy is 98.6%. As can be concluded from the result, the extracted features successfully discriminate the abnormality region in the speech.

In Figure 9 and Table 1, we utilized MD with a decomposition order (r) of 15. We repeated the proposed method using different decomposition orders. Our experiments showed that a decomposition order of 5 or higher is suitable for our application. Table 2 shows the P-values obtained with Student's t-test for three decomposition orders.

As explained in Section 2, our proposed feature extraction methodology performs longer-term modeling compared to current methods. Pathological speech classification is conventionally performed on 10–30 ms of signal. At a sampling frequency of 8 kHz, this corresponds to 80–240 samples per segment. In this paper, we

Figure 8: For each voice sample, the number of the feature vectors that belong to an abnormality cluster is calculated, and the abnormality measure is calculated as the ratio of the total number of the abnormal feature vectors to the total number of feature vectors in the voice sample. [Abnormality measure versus voice sample index for the normal and pathological groups.]

Figure 9: Receiver operating curve for the pathological voice classification is plotted. In this analysis, pathological speech is considered negative, and normal is considered positive. The area under the ROC is 0.999, and the maximum sensitivity for pathological speech detection while preserving 100% specificity is 98.1%. [Axes: sensitivity versus 1 − specificity.]

use 80 ms of speech at a sampling frequency of 25 kHz. As a result, we are working with 2048 samples/frame, which is 10 times the conventional length. The results shown in this section demonstrate that the proposed methodology successfully discriminates the pathological characteristics of the speech. In addition to the high accuracy rate, the advantages of our proposed methodology can be summarized in 3 points. (1) By performing MP on the speech signal, we project the most coherent structure of the signal. The


Table 1: Classification result.

Classes        Normal   Abnormal   Total
Normal         50       1          51
Pathological   2        159        161
Normal         98.0%    2.0%       100%
Pathological   1.2%     98.8%      100%

Table 2: P-value of the classifiers obtained with three different decomposition orders.

Decomposition order (r)    5            10           15
P-value                    3 × 10^-10   1 × 10^-11   1 × 10^-13

remaining part represents the random noise present in the signal. Hence, we perform automatic denoising of the signal, which allows the technique to be practical for low-SNR speech signals. (2) In this method, we reconstruct the TF matrix of the abnormality part of the signal, and we estimate the amount of abnormality in the speech signal. The reconstructed TF matrix and the abnormality measure have the potential to be used as a measure of patients' progress over the course of voice therapy. (3) In this work, we use a very simple classifier rather than a complex classifier, such as hidden Markov models or neural networks.

7. Conclusion

TF analysis is effective for revealing non-stationary aspects of signals such as trends, discontinuities, and repeated patterns, where other signal processing approaches fail or are not as effective; however, most TF analyses are restricted to the visualization of TFDs and do not focus on the quantification or parametrization that is essential for feature analysis and pattern classification.

In this paper, we presented a joint TF and MD feature extraction approach for pathological voice classification. The proposed methodology extracts meaningful speech features that are difficult to capture by other means. TF features are extracted from a positive TFD that satisfies the marginal conditions and can be considered a true joint distribution of time and frequency. The utilized TFD is a segment-free TF approach, and it provides a high-resolution and cross-term-free TFD.

The TF matrix was decomposed into its base (spectral) and coefficient (temporal) vectors using the nonnegative matrix factorization (NMF) method. Four features were extracted from the components with low frequency structure, and five features were derived from the bases with high frequency composition. The features were extracted from the decomposed vectors based on the spectral and temporal characteristics of the normal and pathological signals. In this study, we applied K-means clustering to the proposed feature vectors, and we achieved an accuracy rate of 98.6% for the MEEI voice disorders database, including 161 pathological and 51 normal speakers.

Appendices

A. Matching Pursuit TFD

Matching pursuit (MP) was proposed by Mallat and Zhang [21] in 1993 to decompose a signal into Gabor atoms, $g_{\gamma_i}$, with a wide variety of modulation frequencies ($f_i$), phases ($\phi_i$), time shifts ($p_i$), and durations ($s_i$), as shown in

$$g_{\gamma_i}(t) = \frac{1}{\sqrt{s_i}}\, g\!\left(\frac{t - p_i}{s_i}\right) \exp\!\left[\,j\left(2\pi f_i t + \phi_i\right)\right], \qquad (A.1)$$

where $\gamma_i$ represents the set of parameters $(s_i, p_i, f_i, \phi_i)$. The MP dictionary consists of Gabor atoms with durations ($s_i$) varying from 2 samples to N (the length of the signal x(t)), and it is therefore a very flexible technique for non-stationary signal representation. At each iteration, the MP algorithm chooses the Gabor atom that best fits the input signal. Therefore, after I iterations, the MP procedure has chosen the Gabor atoms that best fit the signal structure without any pre-assumption about the signal's stationarity. Components with long stationarity properties will be represented by long Gabor atoms, and transients will be characterized by short Gabor atoms.

At each iteration, MP projects the signal into a set of TF atoms as follows:

$$x(t) = \sum_{i=0}^{I-1} \left\langle R^i x,\, g_{\gamma_i} \right\rangle g_{\gamma_i}(t) + R^I x, \qquad (A.2)$$

where $\langle R^i x, g_{\gamma_i}\rangle$ is the expansion coefficient on the atom $g_{\gamma_i}(t)$, and $R^I x$ is the decomposition residue after I iterations. At this stage, the selected components represent coherent structures, and the residue represents incoherent structures in the signal. The residue may be assumed to be due to random noise, since it does not show any TF localization. Therefore, the decomposition residue in (A.2) is ignored, and the Wigner-Ville distribution (WVD) of each of the I components is added as follows:

$$V(t, f) = \sum_{i=0}^{I-1} \left| \left\langle R^i x,\, g_{\gamma_i} \right\rangle \right|^2 W_{g_{\gamma_i}}(t, f), \qquad (A.3)$$

where $W_{g_{\gamma_i}}(t, f)$ is the WVD of the Gabor atom $g_{\gamma_i}(t)$, and $V(t, f)$ is called the MP-TFD. The Wigner distribution is a powerful TF representation; however, when more than one component is present in the signal, the TF resolution will be confounded by cross-terms. Nevertheless, when we apply the Wigner distribution to single components and add them up, the summation will be a cross-term-free TFD.
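For illustration, the following simplified Python sketch implements a greedy matching pursuit over a precomputed dictionary of real-valued Gaussian (Gabor-like) atoms and accumulates an approximate MP-TFD by placing a closed-form Gaussian time-frequency blob for each selected atom in place of its exact WVD. Dictionary construction, atom parameters, and the real-valued cosine modulation are assumptions of this sketch, not the authors' implementation.

```python
import numpy as np

def gabor_atom(n, s, p, f):
    """Real-valued Gaussian atom of scale s, centred at sample p, modulated to
    normalized frequency f (cycles/sample); normalized to unit energy."""
    t = np.arange(n)
    g = np.exp(-np.pi * ((t - p) / s) ** 2) * np.cos(2 * np.pi * f * (t - p))
    return g / np.linalg.norm(g)

def matching_pursuit(x, atoms, n_iter=100):
    """Greedy decomposition as in (A.2): repeatedly pick the atom with the largest
    correlation with the residual and subtract its contribution."""
    residual = np.asarray(x, dtype=float).copy()
    picks = []
    for _ in range(n_iter):
        coeffs = atoms @ residual               # <R^i x, g> for every atom
        k = int(np.argmax(np.abs(coeffs)))
        picks.append((coeffs[k], k))
        residual = residual - coeffs[k] * atoms[k]
    return picks, residual

def mp_tfd(picks, params, n, n_freq=128):
    """Approximate MP-TFD of (A.3): each selected atom contributes |coefficient|^2
    times a Gaussian time-frequency blob centred at its (position, frequency)."""
    t = np.arange(n)[None, :]
    f = np.linspace(0.0, 0.5, n_freq)[:, None]
    V = np.zeros((n_freq, n))
    for coef, k in picks:
        s, p, fc = params[k]
        V += (coef ** 2) * 2.0 * np.exp(
            -2.0 * np.pi * (((t - p) / s) ** 2 + (s * (f - fc)) ** 2))
    return V
```

Here atoms is an (n_atoms × n) array whose k-th row is gabor_atom(n, s, p, fc) with params[k] = (s, p, fc); both are built by the caller, so the dictionary size and parameter grid remain a design choice of this sketch.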


B. Analysis of the Sharpness Feature

In order to demonstrate the behavior of the feature $SH_w$, we assume that the base vector $w_i$ has two components at frequency samples $m_1$ and $m_2$ with energies $\alpha$ and $\beta$, respectively:

$$w_i(m) = \alpha\,\delta(m - m_1) + \beta\,\delta(m - m_2), \qquad (B.1)$$

$|W(\nu)|$ in (21) is calculated as

$$|W(\nu)| = \sqrt{\alpha^2 + \beta^2 + 2\alpha\beta\cos\!\left(2\pi(m_1 - m_2)\nu\right)}. \qquad (B.2)$$

$|W(\nu)|$ is independent of the parameter $\nu$ only when $m_1 \approx m_2$, or when the energy ratio of the components in (B.1) is very small (either $\beta/\alpha \approx 0$ or $\alpha/\beta \approx 0$). In this case, when we calculate the Fourier transform of $|W(\nu)|$ as shown in (21), $|W(\kappa)|$ is non-zero only at small values of $\kappa$ (say $\kappa < m_0$, where $m_0$ is a small number). Hence, $SH_{w_i}$ as calculated in (22) results in a small feature value. On the other hand, $|W(\nu)|$ depends on the parameter $\nu$ when both components in (B.1) are strong ($\beta/\alpha \approx R$, $R \neq 0$). In this case, the Fourier transform of $|W(\nu)|$ is not negligible at $\kappa > m_0$, and $SH_{w_i}$ results in larger values.

From the above explanation, we conclude that small values of $SH_{w_i}$ represent pathological formants, in which the components' energies are very small compared to the energy of the main frequency ($\beta/\alpha \approx 0$ or $\alpha/\beta \approx 0$), and large values of $SH_{w_i}$ indicate the strong formants in speech ($\beta/\alpha \approx R$, $R \neq 0$).
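A small numerical illustration of this argument, using a crude proxy for $SH_w$ (whose exact definition in (21)-(22) is not reproduced in this excerpt): the high-quefrency energy of $|W(\nu)|$ is small when one component dominates and noticeably larger when both components are strong.

```python
import numpy as np

def envelope_magnitude(alpha, beta, m1, m2, nu):
    """|W(nu)| of the two-impulse base vector in (B.1), as given by (B.2)."""
    return np.sqrt(alpha**2 + beta**2
                   + 2.0 * alpha * beta * np.cos(2.0 * np.pi * (m1 - m2) * nu))

def sharpness_proxy(alpha, beta, m1=10, m2=40, n=256, m0=4):
    """Fraction of the Fourier energy of |W(nu)| lying beyond the first m0 bins:
    small when one component dominates, larger when both components are strong."""
    nu = np.arange(n) / n
    spectrum = np.abs(np.fft.rfft(envelope_magnitude(alpha, beta, m1, m2, nu)))
    return spectrum[m0:].sum() / spectrum.sum()

# e.g. sharpness_proxy(1.0, 0.05) is much smaller than sharpness_proxy(1.0, 0.9)
```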

References

[1] R. T. Sataloff, Professional Voice: The Science and Art of Clinical Care, Raven Press, New York, NY, USA, 1991.

[2] P. Carding and A. Wade, “Managing dysphonia caused by misuse and abuse,” British Medical Journal, vol. 321, pp. 1544–1545, 2000.

[3] E. J. Wallen and J. H. L. Hansen, “A screening test for speech pathology assessment using objective quality measures,” in Proceedings of the International Conference on Spoken Language Processing (ICSLP ’96), vol. 2, pp. 776–779, Philadelphia, Pa, USA, October 1996.

[4] R. J. Moran, R. B. Reilly, P. de Chazal, and P. D. Lacy, “Telephony-based voice pathology assessment using automated speech analysis,” IEEE Transactions on Biomedical Engineering, vol. 53, no. 3, pp. 468–477, 2006.

[5] T. Ananthakrishna, K. Shama, and U. C. Niranjan, “k-means nearest neighbor classifier for voice pathology,” in Proceedings of the IEEE India Conference (INDICON), pp. 352–354, Indian Institute of Technology, Kharagpur, India, 2004.

[6] A. A. Dibazar, S. Narayanan, and T. W. Berger, “Feature analysis for automatic detection of pathological speech,” in Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBS ’02), vol. 1, pp. 182–183, Houston, Tex, USA, 2002.

[7] J. I. Godino-Llorente and P. Gomez-Vilda, “Automatic detection of voice impairments by means of short-term cepstral parameters and neural network based detectors,” IEEE Transactions on Biomedical Engineering, vol. 51, no. 2, pp. 380–384, 2004.

[8] A. Gelzinis, A. Verikas, and M. Bacauskiene, “Automated speech analysis applied to laryngeal disease categorization,” Computer Methods and Programs in Biomedicine, vol. 91, no. 1, pp. 36–47, 2008.

[9] N. Saenz-Lechon, J. I. Godino-Llorente, V. Osma-Ruiz, and P. Gomez-Vilda, “Methodological issues in the development of automatic systems for voice pathology detection,” Biomedical Signal Processing and Control, vol. 1, no. 2, pp. 120–128, 2006.

[10] V. Parsa and D. G. Jamieson, “Identification of pathological voices using glottal noise measures,” Journal of Speech, Language, and Hearing Research, vol. 43, no. 2, pp. 469–485, 2000.

[11] K. Umapathy, S. Krishnan, V. Parsa, and D. G. Jamieson, “Discrimination of pathological voices using a time-frequency approach,” IEEE Transactions on Biomedical Engineering, vol. 52, no. 3, pp. 421–430, 2005.

[12] F. Auger and P. Flandrin, “Improving the readability of time-frequency and time-scale representations by the reassignment method,” IEEE Transactions on Signal Processing, vol. 43, no. 5, pp. 1068–1089, 1995.

[13] L. Cohen and T. E. Posch, “Positive time-frequency distribution functions,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 33, no. 1, pp. 31–38, 1985.

[14] S. Krishnan, R. M. Rangayyan, G. D. Bell, and C. B. Frank, “Adaptive time-frequency analysis of knee joint vibroarthrographic signals for noninvasive screening of articular cartilage pathology,” IEEE Transactions on Biomedical Engineering, vol. 47, no. 6, pp. 773–783, 2000.

[15] B. Ghoraani and S. Krishnan, “Quantification and localization of features in time-frequency plane,” in Proceedings of the Canadian Conference on Electrical and Computer Engineering (CCECE ’08), pp. 1207–1210, May 2008.

[16] D. D. Lee and H. S. Seung, “Algorithms for non-negative matrix factorization,” in Proceedings of the Conference on Advances in Neural Information Processing Systems (NIPS ’01), pp. 556–562, 2001.

[17] M. W. Berry, M. Browne, A. N. Langville, V. P. Pauca, and R. J. Plemmons, “Algorithms and applications for approximate nonnegative matrix factorization,” Computational Statistics and Data Analysis, vol. 52, no. 1, pp. 155–173, 2007.

[18] C.-J. Lin, “Projected gradient methods for nonnegative matrix factorization,” Neural Computation, vol. 19, no. 10, pp. 2756–2779, 2007.

[19] P. O. Hoyer, “Non-negative matrix factorization with sparseness constraints,” Journal of Machine Learning Research, vol. 5, pp. 1457–1469, 2004.

[20] Massachusetts Eye and Ear Infirmary, Voice Disorders Database, Version 1.03, Kay Elemetrics Corporation, Lincoln Park, NJ, USA, 1994.

[21] S. G. Mallat and Z. Zhang, “Matching pursuits with time-frequency dictionaries,” IEEE Transactions on Signal Processing, vol. 41, no. 12, pp. 3397–3415, 1993.


Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2009, Article ID 821304, 6 pages
doi:10.1155/2009/821304

Research Article

A First Comparative Study of Oesophageal and Voice Prosthesis Speech Production

Massimiliana Carello1 and Mauro Magnano2

1 Dipartimento di Meccanica, Politecnico di Torino, Corso Duca degli Abruzzi 24, 10129 Torino, Italy
2 Ospedali Riuniti di Pinerolo, A.S.L. TO3, Via Brigata Cagliari 39, 10064 Pinerolo, Torino, Italy

Correspondence should be addressed to Massimiliana Carello, [email protected]

Received 31 October 2008; Revised 2 March 2009; Accepted 30 April 2009

Recommended by Juan I. Godino-Llorente

The purpose of this work is to evaluate and to compare the acoustic properties of oesophageal voice and voice prosthesis speech production. A group of 14 Italian laryngectomized patients were considered: 7 with oesophageal voice and 7 with tracheoesophageal voice (with phonatory valve). For each patient the spectrogram obtained with the phonation of the vowel /a/ (frequency, intensity, jitter, shimmer, noise to harmonic ratio) and the maximum phonation time were recorded and analyzed. For the patients with the valve, the tracheostoma pressure at the time of phonation was measured in order to obtain important information about the “in vivo” pressure necessary to open the phonatory valve to enable speech.

Copyright © 2009 M. Carello and M. Magnano. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

Laryngeal cancer is the second most common upper aerodigestive cancer; it causes pain and dysphagia and impedes speech, breathing, and social interactions.

The management of advanced cancers often includes radical surgery, such as a total laryngectomy, which involves the removal of the vocal cords and, as a consequence, the loss of voice. Total laryngectomy is an operation that drastically affects respiratory dynamics and phonation mechanisms, suppressing normal verbal communication; it is disabling and has a detrimental effect on the individual's quality of life. In fact, for some laryngectomy patients, the loss of speech is more important than survival itself.

With the laryngectomy, the patient is deprived of the vibrating sound source (the vocal folds and laryngeal box) and the energy source for voice production, as the air stream from the lungs is no longer connected to the vocal tract.

Consequently, since 1980, different methods for regaining phonation have been developed; the most important are (1) the use of an electro-larynx, (2) conventional speech therapy, and (3) surgical prosthetic methods [1–3].

The use of an electro-larynx allows the restoration of the voice by an external sound generator; it is exclusively reserved for patients who have not benefited from conventional speech therapy or on whom a tracheoesophageal prosthesis cannot be applied.

Conventional speech therapy allows the acquisition of an autonomous oesophageal voice (EV) and is therefore the most commonly used treatment in the voice rehabilitation of laryngectomized patients. It requires a sequence of training sessions to develop the ability to insufflate the oesophagus by inhaling or injecting air through coordinated muscle activity of the tongue, cheeks, palate, and pharynx. The last technique of capturing air is by swallowing air into the stomach. Voluntary air release or “regurgitation” of small volumes vibrates the cervical esophageal inlet, hypopharyngeal mucosa, and other portions of the upper aerodigestive tract to produce a “burp-like” sound. Articulation of the lips, teeth, palate, and tongue produces intelligible speech.

The surgical prosthetic methods (TEP), introduced in 1980 by Weinberg et al. [4], spread rapidly due to the excellent outcomes that they achieved. In this case a phonatory valve is positioned in a specifically made shunt in the tracheoesophageal wall and, by closing the tracheostoma, the air reaches the mouth (through the cervical esophageal inlet, hypopharyngeal mucosa, and the upper aerodigestive tract), and the vibration is modulated to produce a new voice.


Table 1: Patient data, vocal, and pressure parameters.

Columns: Age; Sex; Tracheostoma area [cm²]; Fundamental frequency [Hz]; Jitter [ms]; Jitter perc. [%]; Shimmer [Pa]; Shimmer perc. [%]; NHR [-]; Maximum phonation time [s]; Tracheostoma pressure [Pa]; Acoustic pressure/Tracheostoma pressure [-] × 10^(-7).

EV1    49  M  1.56  75.188   17.67  13.44  0.00073  0.36  0.832  0.90   —     —
EV2    77  M  0.87  153.846  42.67  33.41  0.00019  0.56  3.265  0.77   —     —
EV3    62  M  1.37  96.154   33.67  18.01  0.00026  0.43  1.063  0.65   —     —
EV4    60  M  1.69  56.497   13.33  24.46  0.00026  0.21  1.575  0.68   —     —
EV5    74  M  1.94  69.444   28.33  21.76  0.00005  0.19  1.297  1.63   —     —
EV6    71  M  0.69  98.039   22.67  22.39  0.00048  0.83  1.032  0.68   —     —
EV7    61  M  0.62  56.818   30.33  25.38  0.00006  0.15  1.146  0.57   —     —
TEP1   68  M  1.75  112.360  3.33   3.79   0.00012  0.20  0.834  48.45  4906  1.7077
TEP2   61  F  2.37  102.041  6.00   6.13   0.00005  0.23  0.487  12.18  2960  1.0955
TEP3   76  M  0.68  86.957   18.67  17.06  0.00029  0.51  1.906  7.86   3752  2.0051
TEP4   78  M  1.62  109.890  3.33   3.86   0.00012  0.30  2.892  6.47   5077  1.6604
TEP5   61  M  1.44  60.606   4.67   2.86   0.00001  0.17  0.146  22.39  1790  0.3187
TEP6   76  M  2.21  58.590   13.67  10.99  0.00033  0.36  0.216  4.67   2481  3.9962
TEP7   60  M  1.00  107.527  9.00   10.41  0.00021  0.38  2.776  19.11  5127  3.2538

The resulting speech depends on the expiratory capacity, but the voice quality is very good and resembles the “original” voice. This kind of voice is called the “tracheoesophageal” voice. The intelligibility of EV can vary according to several perceptive factors, on the precise definition of which there is no general agreement. Furthermore, aerodynamic data in the study of EV physiology and, in particular, correlations between those data and the perceptive findings have not yet been defined.

The sound generator of both oesophageal and tracheoesophageal speech is the mucosa of the pharyngo-esophageal (PE) segment, which differs from patient to patient depending on the shape and stiffness of the scar between the hypopharynx and oesophagus, the localization of the carcinoma, different surgical needs and procedures, and the extent of the remaining esophageal mucosa. Several investigations of the substitute voice attempted to detect a correlation between voice quality and morphological or dynamic properties of the PE segment [5], but sometimes the method is not very comfortable for the patient.

In this paper, a simple and physiological method of measurement of voice characteristics is presented, useful above all for oesophageal and tracheoesophageal voices, which are characterised by a strong aperiodicity.

Voice quality is a perceptual phenomenon, and consequently, perceptual evaluations are considered the “gold standard” of voice quality evaluation. In clinical practice, perceptual evaluation plays a prominent role in therapy evaluation, while acoustic analyses are not routinely performed.

Several studies have described acoustic analysis of oesophageal and tracheoesophageal voice quality and have concluded that there is a considerable difference with respect to the laryngeal voice in the acoustic measures, because these voices have a high aperiodicity [6–8].

For this reason, a commercially available Multi-Dimensional Voice Program (MDVP), suitable for non-laryngectomized subjects with a laryngeal voice, is not useful for analyzing all tracheoesophageal voices, in which the vocal signal, in terms of frequency and amplitude outline, is not regular with distinguishable peak values and a clean sound [6].

2. Patients

The subjects included 14 Italian laryngectomized patients (13 men and 1 woman) with ages ranging from 49 to 78 years and a mean of 66.7 years. Seven of them speak with oesophageal voice (EV), while seven patients have a Provox voice prosthesis (TEP).

For each patient a picture of the stoma has been taken to obtain its size (or area). The stoma size ranged from 0.62 cm² to 2.21 cm², with a mean of 1.41 cm².

Table 1 shows the personal data of the patients: age, sex, and size of the stoma.

3. Methods

3.1. Voice and Tracheostoma Pressure Measurement. Phonetic specialists have a standard method to evaluate voice characteristics: the first is a perceptive evaluation, but the most important is the objective evaluation, which measures the acoustic characteristics of the voice using a computerized analysis [9–11].


The oesophageal and the tracheoesophageal voice are characterized by aperiodicity and important noise components, so it is very difficult to identify the peak values. For this reason the use of the multiparameter programme MDVP for these kinds of voices does not provide reliable results, while the programme is very reliable for laryngeal voices; this has been pointed out by different research groups [6, 8, 11, 12]. In this paper a different system has been proposed and used, taking into account the knowledge of engineering signal analysis.

For the research shown in this paper, a specific experimental setup has been assembled from a microphone (Brüel & Kjær type 4133, with a stabilized power supply type 2804 and a preamplifier type 2669) and a digital oscilloscope (Tektronix) with a specific setup that allows a data sequence to be recorded.

The measurement and recording of speech signals have been taken with the patient standing up and a microphone positioned 20 cm from the mouth at an angle of 45°. In this condition, the patient pronounced the vowel /a/ with a tone and sound level considered by himself to correspond to a usual conversation.

The speech signal was recorded for 1 second so that it could be considered constant. In this way, it is possible to consider a steady signal, with constant average value and variance, and with the power spectral analysis it is possible to use the Fourier transform and the Wiener-Khinchine theorem. The use of a sampling frequency of 10 kHz allows the signal to be evaluated up to a frequency of 5 kHz, according to the Nyquist theorem.

The maximum phonation time was measured in the same conditions but with the patient pronouncing the vowel /a/ for as long as possible.

Every test on each individual patient was carried out three times to verify the repeatability of the measurements; Table 1 reports the mean values.

For the patients with tracheoesophageal voice, the speech signal and the pressure at the tracheostoma were recorded simultaneously.

The pressure was measured with a specifically made device. A Provox adhesive plaster (usually used for the stoma filter) positioned on the tracheostoma allows a small Teflon cylinder of suitable diameter to be fixed. A soft rubber part is connected to the other extremity of the cylinder; the patient, using two fingers, closes the rubber part on the tracheostoma.

A pressure transducer (RS Components 235-5790), positioned at a pressure measurement point in a radial position on the cylinder, allows a dynamic measurement of the tracheostoma pressure to be taken by means of a digital oscilloscope.

The pressure measurement device is shown in Figures 1(a) and 1(b). In particular, in the case of Figure 1(a) the patient can breathe freely; in the case of Figure 1(b) the device can be closed by the patient to allow voice production. In these conditions the pressure and the voice signal are recorded simultaneously using a digital oscilloscope.

The pressure and voice signals have been treated with a program (developed in MATLAB) specifically written to

Figure 1: Device for tracheostoma pressure measurement.

Figure 2: Vocal signal amplitude versus time (EV1).

carry out spectral power analysis and, based on a decision-making tool, to obtain the following (a minimal sketch of the vocal signal analysis in item (i) is given after this list):

(i) vocal signal analysis: power spectral density (by Welch periodogram analysis); time-frequency spectrogram (or sonogram); fundamental frequency (cepstrum method); jitter and jitter percentage; shimmer and shimmer percentage; Noise to Harmonic Ratio (NHR);

(ii) tracheostoma pressure signal analysis: power spectral analysis, pressure average value;

(iii) cross-spectral analysis of the vocal and pressure signals to point out the same harmonic components;

(iv) acoustic pressure to tracheostoma pressure ratio (ratio of the maximum values).
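A minimal Python sketch of the vocal signal analysis in item (i), assuming a 1 s recording sampled at 10 kHz: it computes the Welch power spectral density and a cepstrum-based fundamental frequency estimate. Jitter, shimmer, and NHR are omitted, and the window lengths and search range are illustrative rather than those of the authors' MATLAB program.

```python
import numpy as np
from scipy.signal import welch

def vocal_signal_analysis(x, fs=10_000):
    """Welch power spectral density and a cepstrum-based estimate of the
    fundamental frequency of a 1 s vowel recording sampled at fs Hz."""
    f, pxx = welch(x, fs=fs, nperseg=1024)                 # power spectral density
    spectrum = np.abs(np.fft.rfft(x * np.hanning(len(x))))
    cepstrum = np.fft.irfft(np.log(spectrum + 1e-12))      # real cepstrum
    qmin, qmax = int(fs / 400), int(fs / 50)               # search F0 in 50-400 Hz
    q_peak = qmin + int(np.argmax(cepstrum[qmin:qmax]))
    return f, pxx, fs / q_peak                             # PSD and estimated F0
```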

The tracheostoma pressure provides important information about the “in vivo” pressure necessary to open the phonatory valve for speech, while the ratio of the acoustic pressure to the tracheostoma pressure gives the pulmonary effort level necessary for the patient to produce the voice. In fact, it is possible to note that, at equal acoustic pressure, a lower pulmonary effort is necessary for a subject that has a low tracheostoma pressure.


Figure 3: Vocal signal amplitude versus time (TEP3).

Figure 4: Vocal signal amplitude versus frequency (EV1).

Sometimes EV and TEP voice samples could not be analysed at all, or only very short parts were analyzable. Visual inspection of these voice samples showed that the patients had very low-pitched voices (for this reason the use of the MDVP system is not suitable) or even that there was no fundamental frequency present at all.

The obtained vocal and tracheostoma pressure parameters are shown in Table 1.

4. Results and Discussion

Taking into account the data shown in Table 1, the average value and standard deviation (±σ) were calculated for the two groups of voices (EV and TEP). The results are shown in Table 2; it is possible to note that the tracheoesophageal voices (TEP) have a lower standard deviation for the vocal parameters (frequency, jitter, shimmer); in fact, the TEP voices are more repeatable and have better acoustic characteristics.

Figure 5: Vocal signal amplitude versus frequency (TEP3).

Figure 6: Vocal signal frequency versus time (EV1).

The oesophageal voice (EV) has a lower standard deviation regarding the maximum phonation time, but it should be noted that patients with a TEP voice generally have a longer phonation time, and this allows a better way to communicate and a better quality of life.

Each patient's voice signal (oesophageal EV and tracheoesophageal TEP) has been recorded and treated with the developed MATLAB program. As an example, the results concerning two patients, namely EV1 and TEP3, are shown in Figures 2 to 7.

The recorded signal in terms of amplitude versus time is shown in Figures 2 (EV1) and 3 (TEP3).

The spectral power analysis allows the amplitude to be obtained as a function of frequency, or the frequency as a function of time.

Figures 4 (EV1) and 5 (TEP3) show the amplitude versus frequency spectra. It is possible to note that the esophageal voice EV has one fundamental frequency and a noise component at a high frequency level, while the tracheoesophageal voice TEP has a frequency peak value and two noise components.


Table 2: Average and standard deviation for patient data, vocal, and pressure parameters.

Columns: Age; Sex; Tracheostoma area [cm²]; Fundamental frequency [Hz]; Jitter [ms]; Jitter perc. [%]; Shimmer [Pa]; Shimmer perc. [%]; NHR [-]; Maximum phonation time [s]; Tracheostoma pressure [Pa]; Acoustic pressure/Tracheostoma pressure [-] × 10^(-7).

EV average               64.86  —  1.25  86.569  26.95  22.69  0.00029  0.39  1.459  0.84   —     —
EV standard deviation     9.72  —  0.52  34.063   9.96   6.24  0.00024  0.24  0.830  0.36   —     —
TEP average              68.57  —  1.58  91.139   8.38   7.87  0.00016  0.31  1.322  17.30  3728  2.0053
TEP standard deviation    8.04  —  0.61  23.089   5.84   5.19  0.00012  0.12  1.188  15.23  1358  1.2518

Figure 7: Vocal signal frequency versus time (TEP3).

The frequency spectrum, in terms of frequency versus time behaviour, is shown in Figures 6 (EV1) and 7 (TEP3).

Similar behaviour was observed for the other patients. Finally, an overall analysis of the data obtained from the 14 patients was made, pointing out a noise component between 600 Hz and 800 Hz in all cases, with a harmonic component between 1200 Hz and 1600 Hz. This phenomenon could be correlated to the physiological characteristics of the pseudo-glottis (or larynx-oesophageal tract).

For all the TEP patients, the tracheostoma pressure versus time was recorded and the power spectral analysis was carried out. The results for TEP3 are shown in Figure 8 in terms of pressure versus time and in Figure 9 in terms of amplitude versus frequency.

To investigate the correlation between the pressure and the voice signals (for the TEP subjects), the cross-spectrum based on the Fourier transform was evaluated. The most important and interesting result pointed out by this analysis is that the two signals have equal fundamental frequency and the same harmonic components for each TEP subject considered. Figure 10 shows the results obtained for TEP3.
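A brief sketch of this cross-spectral check, assuming both signals are sampled at the same rate: scipy.signal.csd estimates the cross power spectral density, and the location of its largest peak indicates the fundamental component shared by the voice and pressure signals. The sampling rate and segment length are illustrative assumptions.

```python
import numpy as np
from scipy.signal import csd

def shared_fundamental(voice, pressure, fs=10_000):
    """Cross power spectral density of the voice and tracheostoma pressure signals;
    the frequency of the largest cross-spectral peak is the component they share."""
    f, pxy = csd(voice, pressure, fs=fs, nperseg=2048)
    return f[int(np.argmax(np.abs(pxy)))], f, np.abs(pxy)
```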

Figure 8: Pressure signal versus time (TEP3).

Figure 9: Pressure signal amplitude versus frequency (TEP3).


Figure 10: Pressure and voice signal amplitudes (cross spectrum) versus frequency (TEP3).

Future steps of this research could be (i) increasing the number of patients to improve the statistical reliability of the analysis; (ii) comparing the tracheostoma pressure before and after the TEP procedure to improve the correlation between voice frequency and tracheostoma pressure after the TEP procedure.

References

[1] H. F. Mahieu, Voice and Speech Rehabilitation Following Laryngectomy, Doctoral dissertation, Rijksuniversiteit Groningen, Groningen, The Netherlands, 1988.

[2] E. D. Blom, M. I. Singer, and R. C. Hamaker, Tracheoesophageal Voice Restoration Following Total Laryngectomy, Singular Publishing, San Diego, Calif, USA, 1998.

[3] G. Belforte, M. Carello, G. Bongioannini, and M. Magnano, “Laryngeal prosthetic devices,” in Encyclopedia of Medical Devices and Instrumentation, J. G. Webster, Ed., vol. 4, pp. 229–234, John Wiley & Sons, New York, NY, USA, 2nd edition, 2006.

[4] B. Weinberg, Y. Horii, E. Blom, and M. Singer, “Airway resistance during esophageal phonation,” Journal of Speech and Hearing Disorders, vol. 47, no. 2, pp. 194–199, 1982.

[5] M. Schuster, F. Rosanowski, R. Schwarz, U. Eysholdt, and J. Lohscheller, “Quantitative detection of substitute voice generator during phonation in patients undergoing laryngectomy,” Archives of Otolaryngology, vol. 131, no. 11, pp. 945–952, 2005.

[6] C. J. van As-Brooks, F. J. Koopmans-van Beinum, L. C. W. Pols, and F. J. M. Hilgers, “Acoustic signal typing for evaluation of voice quality in tracheoesophageal speech,” Journal of Voice, vol. 20, no. 3, pp. 355–368, 2006.

[7] C. J. van As-Brooks, F. J. M. Hilgers, F. J. Koopmans-van Beinum, and L. C. W. Pols, “Anatomical and functional correlates of voice quality in tracheoesophageal speech,” Journal of Voice, vol. 19, no. 3, pp. 360–372, 2005.

[8] C. J. van As-Brooks, F. J. M. Hilgers, I. M. Verdonck-de Leeuw, and F. J. Koopmans-van Beinum, “Acoustical analysis and perceptual evaluation of tracheoesophageal prosthetic voice,” Journal of Voice, vol. 12, no. 2, pp. 239–248, 1998.

[9] W. De Colle, Voce & Computer, Omega Edizioni, Italy, 2001.

[10] A. Schindler, A. Canale, A. L. Cavalot, et al., “Intensity and fundamental frequency control in tracheoesophageal voice,” Acta Otorhinolaryngologica Italica, vol. 25, no. 4, pp. 240–244, 2005.

[11] C. F. Gervasio, A. L. Cavalot, G. Nazionale, et al., “Evaluation of various phonatory parameters in laryngectomized patients: comparison of esophageal and tracheo-esophageal prosthesis phonation,” Acta Otorhinolaryngologica Italica, vol. 18, no. 2, pp. 101–106, 1998.

[12] S. Motta, I. Galli, and L. Di Rienzo, “Aerodynamic findings in esophageal voice,” Archives of Otolaryngology, vol. 127, no. 6, pp. 700–704, 2001.


Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2009, Article ID 203790, 13 pages
doi:10.1155/2009/203790

Research Article

Linear Classifier with Reject Option for the Detection of Vocal Fold Paralysis and Vocal Fold Edema

Constantine Kotropoulos (EURASIP Member)1, 2 and Gonzalo R. Arce2

1 Department of Informatics, Aristotle University of Thessaloniki, Thessaloniki 54124, Box 451, Greece
2 Department of Electrical and Computer Engineering, University of Delaware, 140 Evans Hall, Newark, DE 19716, USA

Correspondence should be addressed to Constantine Kotropoulos, [email protected]

Received 1 November 2008; Revised 19 May 2009; Accepted 30 July 2009

Recommended by Juan I. Godino-Llorente

Two distinct two-class pattern recognition problems are studied, namely, the detection of male subjects who are diagnosed with vocal fold paralysis against male subjects who are diagnosed as normal and the detection of female subjects who are suffering from vocal fold edema against female subjects who do not suffer from any voice pathology. To do so, utterances of the sustained vowel “ah” are employed from the Massachusetts Eye and Ear Infirmary database of disordered speech. Linear prediction coefficients extracted from the aforementioned utterances are used as features. The receiver operating characteristic curve of the linear classifier, that stems from the Bayes classifier when Gaussian class conditional probability density functions with equal covariance matrices are assumed, is derived. The optimal operating point of the linear classifier is specified with and without reject option. First results using utterances of the “rainbow passage” are also reported for completeness. The reject option is shown to yield statistically significant improvements in the accuracy of detecting the voice pathologies under study.

Copyright © 2009 C. Kotropoulos and G. R. Arce. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

Vocal pathologies arise due to accident, disease, misuse of the voice, or surgery affecting the vocal folds and have a profound impact on patients' lives. The modeling of normal and pathological voice sources and the analysis of healthy and pathological voices have gained increasing interest recently [1]. Among the most interesting works are those concerned with Parkinson's Disease (PD) and multiple sclerosis, which belong to a class of neurodegenerative diseases that affect patients' speech, motor, and cognitive capabilities [2, 3]. People with neurological conditions causing disability often have associated dysarthria, which is the most common acquired speech disorder, affecting 170 per 100 000 population [4]. Several studies explore the main voice characteristics (i.e., the fundamental frequency and vocal tract resonance frequencies) together with their deviation from the nominal conditions for persons who exhibit voice disorders. Although the majority of techniques analyze the speech signal, the video modality offers complementary information [5, 6]. For example, three-dimensional (3D) magnetic resonance imaging could be used to build a 3D numerical model of the vocal tract, and videokymography could overcome the transmission speed and volume limitations of 2D imaging (i.e., stroboscopy) for severely dysphonic patients with an aperiodic signal, allowing the movements of the vocal folds to be registered with a high time resolution on a line perpendicular to the glottis [1]. Furthermore, the irregular vocal fold oscillations can be observed by means of a digital high-speed camera using image processing techniques in order to extract the vocal fold edges, estimate the minimum glottal area defined by the vocal fold positions, and compute the distance between the glottal midline and the vocal fold edges extracted at the medial position in real time [7]. The time series of such displacements can drive an inversion procedure in order to adjust the parameters of a biomechanical model of vocal folds for both pathological and healthy vocal fold oscillations. All the aforementioned techniques aim at evaluating the performance of special treatments, such as the Lee Silverman Voice Treatment [3], assisting the e-inclusion of people with physical disabilities and disordered speech by offering better access to telecommunication services [8] or


more efficient environmental control systems [9]. Thus, it is a matter of great significance to develop systems able to classify the incoming voice samples as normal or pathological ones before other procedures are further applied.

Voice pathologies may be assessed by either perceptual judgments or an objective assessment. The perceptual judgment resorts to qualifying and quantifying the vocal pathology by listening to patients' speech. Although this is the method most commonly used by clinicians, it suffers from several drawbacks. First of all, the perceptual judgment has to be performed by an expert jury in order to increase its reliability. Second, due to the lack of universal assessment scales and the dependence on experts' professional background and experience or the knowledge of the patient's history, the perceptual judgment may involve large intra- and inter-variability. Third, the perceptual analysis is very costly in time and human resources and cannot be planned regularly. Nowadays an increasing use of objective measurement-based analysis as a non-invasive technique for supporting diagnosis in laryngeal pathology has been observed [8–11]. Objective measurement-based analysis qualifies and quantifies the voice pathology by analyzing acoustical, aerodynamic, and physiological measurements. These measurements may be directly extracted from the patient's speech utterance using a simple computer-based system or may require special instruments. Typical techniques, such as fundamental frequency and jitter estimation, should be carefully adapted in order to take into account the significant variations of the fundamental frequency from cycle to cycle as well as the presence of subharmonic and aperiodic components in the pathological voice [12–14]. Very useful insight into the production of disordered speech could be obtained through simulation studies [15–17]. Although the objective analysis alleviates the subjectivity of perceptual judgments, it has certain limitations as well. First, the objective analysis often relies on pattern recognition techniques, such as linear discriminant analysis and correlation estimation, which do depend on the measurements being analyzed. Second, the objective analysis is frequently confined to the study of sustained vowels only, which are not representative of continuous speech [18]. In the medical literature, agreement between the perceptual judgments and the findings of objective analysis is generally sought [19, 20].

Several techniques for the detection and classification of voice pathologies by means of acoustic analysis, parametric and non-parametric feature extraction, and pattern recognition are reviewed in [21]. In all these techniques, first, descriptive features are extracted from the speech signal. A number of so-called classical parameters quantify pitch perturbations (jitter) and amplitude perturbations (shimmer) and estimate the Harmonic to Noise Ratio at different frequency bands and the critical-band energy spectrum by employing either the short-term Discrete Fourier Transform and cepstral analysis [22–24] or the singularities in the power spectral density of the vocal cord cover wave (also referred to as the mucosal wave correlate) [25]. Alternatively, features stemming from the 1-D bicoherence index derived from the bispectrum [22] or from nonlinear dynamical system theory, such as statistics of the correlation dimension and the largest Lyapunov exponent [26] or the return period density entropy [27], were extracted. Features could also be obtained by applying the continuous wavelet transform to each speech frame and averaging neighboring wavelet coefficients on the time-frequency scale [28]. Frequently, feature vectors undergo dimensionality reduction by applying Principal Component Analysis (PCA) [29–31] before classification, or a subset of features is selected by applying either a wrapper or a filter. Next, the features are either clustered into a number of pre-defined classes, say by a K-means algorithm [30], or are fed to a classifier, which is designed to solve a two-class pattern recognition problem, that is, to verify a specific pathology in a test utterance or to decide whether a test utterance is pathological or not. Commonly used classifiers resort to linear discriminant analysis (LDA) [23, 27, 29, 32], nearest neighbors [24, 26, 29], vector quantization [33], or support vector machines (SVMs) [28, 31, 34]. It is worth noting that the detection of voice pathology is closely related to speaker verification. In particular, pathological class models can be derived from generic Gaussian mixture models by employing the maximum a posteriori adaptation technique [35] and adapting only the means [34]. While a sustained phonation can be classified as normal or pathological with an accuracy greater than 90% when speech is recorded in laboratory conditions [21], telephone quality speech can be classified as normal or pathological with a much smaller accuracy, that is, 74.15% [23].

In this paper, we are concerned with vocal fold paralysis and vocal fold edema, which are both associated with communication deficits that affect the perceptual characteristics of pitch, loudness, quality, and intonation, and have similar symptoms to PD and other neuro-degenerative diseases [36]. We are interested in detecting male subjects who are diagnosed with vocal fold paralysis against male subjects who are diagnosed as normal. Similarly, we would like to distinguish between female subjects who are diagnosed with vocal fold edema and female subjects who are diagnosed as normal. Utterances from the Massachusetts Eye & Ear Infirmary (MEEI) Voice Disorders Database, which is distributed by Kay Elemetrics [37], are employed, because the MEEI database is a benchmark annotated speech corpus. A review of several voice pathology detection approaches with the MEEI database can be found in [21]. However, the majority of these approaches aim at identifying whether an utterance is pathological or not without addressing which speech pathology is observed. Although a direct comparison between these methods is not possible, because different data subsets have been used and different performance criteria have been employed, one can roughly claim that the state of the art accuracy in detecting whether an utterance is pathological or not exceeds 98% [38, 39]. In the following, let us confine ourselves to vocal fold paralysis and edema detection. The identification of vocal fold paralysis using the normalized energy across various scaling factors of the wavelet transform and a multilayer neural network trained by back-propagation was proposed in [40]. For 50 data samples of the MEEI database, an average classification accuracy of 90% was reported. The performance of Fisher's linear classifier, the K-nearest neighbor classifier, and the


nearest mean one for detecting vocal fold paralysis in male utterances and vocal fold edema in female utterances was assessed in [29]. The subjects were asked to articulate the sustained vowel “ah” (/a/). From each recording, two central frames were selected among the ones that belong to the most stationary portion of the sustained speech signal, as proposed in [41, 42]. 14th-order linear prediction coefficients (LPCs) were extracted from each frame. The dimensionality of the raw feature vector was then reduced to 2 by PCA. Receiver operating characteristic (ROC) curves for the Fisher linear classifier were demonstrated. It was shown that a probability of detection close to 85% could be achieved for a probability of false alarm of 10% in the case of vocal fold paralysis in male utterances, while the probability of detection for vocal fold edema in female utterances was found to be approximately 73% at the same probability of false alarm. The nearest mean classifier was found to outperform K-nearest neighbor classifiers for K = 1, 2, 3 in both experiments. Two linear classifiers were examined in [32]. The first one is based on a sample-based optimal linear classifier design [43], while the second one is based on the dual-space linear discriminant analysis [44]. Again, 14 LPCs were extracted by processing utterances corresponding to the sustained vowel “ah.” Both the rectangular and the Hamming window were used to extract the speech frames [45]. The assessment of the classifiers studied in [32] was done by estimating the probability of false alarm and the probability of detection using the leave-one-out method. The parametric classifier was found to be more accurate than the dual-space linear discriminant classifier. In particular, a slightly higher probability of detection for vocal fold paralysis in men was measured, which is approximately equal to 90% for a probability of false alarm of 10%. The gain in the probability of detection for vocal fold edema in women was 20% higher than that achieved by the Fisher linear discriminant in [29]. LPCs, LPC-derived cepstral coefficients, and mel-frequency cepstral coefficients were extracted for vocal fold edema detection in [33]. A vector quantizer was trained based on the distance between the feature vectors. Experiments were conducted by using 53 normal speakers and another 67 who were diagnosed with voice pathologies including vocal fold edema. Only a single operating point was reported, which yields a probability of detection of approximately 73% for a probability of false alarm of 4% [33]. For the same probability of false alarm, a probability of detection which falls between 80.95% for the rectangular window and 90.47% for the Hamming window was reported in [32].

Two distinct two-class pattern recognition problems are studied, namely, the detection of male subjects who are diagnosed with vocal fold paralysis against male subjects who are diagnosed as normal, and the detection of female subjects who are suffering from vocal fold edema against female subjects who do not suffer from any voice pathology. The rationale for gender-dependent voice pathology detection lies in the inherent differences of the speech production system for male and female speakers and in the higher accuracy for speech emotion recognition, speaker indexing, speaker recognition, and so forth, offered by gender-dependent models compared with gender-independent ones. The ROC curve of the linear classifier, which stems from the Bayes classifier when Gaussian class conditional probability density functions with equal covariance matrices are assumed, is derived. The optimal operating point of the linear classifier is specified with and without the reject option. The contribution of this paper is in the assessment of the impact of the reject option on the ROC curve of the linear classifier for the two-class pattern recognition problems under study. Although sustained vowels are not representative of continuous speech, utterances of the sustained vowel “ah” from the MEEI database are employed here due to their wide use in medical practice and, primarily, in order to maintain direct compatibility with previously reported results [29, 32] and minimal problem complexity, so that we focus on the role of the reject option. However, first experimental results using continuous speech utterances are reported for completeness. A reject region in classifier design was also proposed in [27], but without demonstrating its impact on the ROC curve. The motivation behind the introduction of the reject option in classifier design is two-fold. First, when the conditional error given a feature vector due to the decision rule (also known as the classification risk) is high, the classifier should postpone making any decision and rather request an expert's advice. Second, new classes may appear during the test phase which were not present during training, or some classes may be sampled poorly during training, leading to inaccurate class models [46]. The introduction of the reject option in the design of two-class classifiers (also known as dichotomizers) and its impact on the ROC has recently attracted the attention of the pattern recognition community [46–49]. Linear prediction coefficients extracted from the utterances are used as features. The reject option is shown to yield statistically significant improvements in the accuracy of detecting the voice pathologies under study.

The outline of the paper is as follows. Section 2 describes briefly the Bayes classifier for both minimum error and minimum cost classification in a two-class pattern recognition problem without a reject option and discusses the motivation behind the adoption of a linear classifier. Section 2.1 defines the ROC curve and its use to derive the optimal operating point for a two-class classifier. The introduction of the reject option in a dichotomizer is addressed in Section 3. The data set used is presented in Section 4 along with feature extraction. Experimental results are reported in Section 5, and conclusions are drawn in Section 6.

2. The Bayes and the Linear Classifiers without Reject Option

Let X denote a sample (i.e., a feature vector). Let the class Ω1 comprise samples from healthy subjects and the class Ω2 comprise samples from subjects diagnosed with certain pathologies. The Bayes rule for minimum error assigns X to the class Ωi having the maximum a posteriori probability given X [43]. That is,

$$\ell(X) = \frac{p_1(X)}{p_2(X)} \;\underset{\Omega_2}{\overset{\Omega_1}{\gtrless}}\; \frac{P_2}{P_1}, \qquad (1)$$


where $p_i(X)$ are the class conditional probability density functions (pdfs) and $P_i$ are the a priori probabilities of the classes $\Omega_i$, i = 1, 2. The term $\ell(X)$ on the left-hand side of (1) is known as the likelihood ratio, and the fraction on the right-hand side of (1) is called the threshold value of the likelihood ratio for decision [43]. Frequently, the decision is expressed in terms of the minus log-likelihood ratio $h(X) = -\ln \ell(X)$, which is known as the discriminant function. Let us assume that the class conditional pdfs are normal densities with mean vectors $M_i$ and covariance matrices $\Sigma_i$, i = 1, 2. Then, the discriminant function becomes a quadratic function of X, that is,

$$h(X) = \frac{1}{2}(X - M_1)^T \Sigma_1^{-1}(X - M_1) - \frac{1}{2}(X - M_2)^T \Sigma_2^{-1}(X - M_2) + \frac{1}{2}\ln\frac{|\Sigma_1|}{|\Sigma_2|} \;\underset{\Omega_2}{\overset{\Omega_1}{\lessgtr}}\; \ln\frac{P_1}{P_2}. \qquad (2)$$

The minimization of the probability of classification error treats equally the misclassifications of $\Omega_1$- and $\Omega_2$-samples. However, a higher decision cost should be assigned whenever a patient is misclassified as normal than whenever a normal subject is misclassified as a patient. By introducing the cost $c_{ij}$ of deciding $X \in \Omega_i$ although X actually belongs to $\Omega_j$ according to the ground truth, the Bayes test for minimum cost is obtained:

\frac{p_1(X)}{p_2(X)} \;\overset{\Omega_1}{\underset{\Omega_2}{\gtrless}}\; \frac{(c_{12} - c_{22})P_2}{(c_{21} - c_{11})P_1}. \qquad (3)

The comparison of (3) with (1) reveals that only the threshold on the right-hand side of the likelihood ratio test has changed. Clearly, for a symmetrical cost function, that is, c12 − c22 = c21 − c11, the aforementioned likelihood ratio tests coincide. Hereafter, we will employ the linear classifier that stems from the quadratic one (2) if equal covariance matrices Σ1 = Σ2 = Σ are assumed, that is,

h(X) = (M_2 - M_1)^T \Sigma^{-1} X + \frac{1}{2}\left(M_1^T \Sigma^{-1} M_1 - M_2^T \Sigma^{-1} M_2\right) \;\overset{\Omega_1}{\underset{\Omega_2}{\lessgtr}}\; t, \qquad (4)

where Mi is the sample mean for Ωi, i = 1, 2, t denotes the threshold admitting a value in the range of the discriminant function, and Σ is the gross sample covariance matrix estimated from the design set without making any distinction between normal and pathological samples. That is, Σ = (1/N) ∑_{l=1}^{N} (X_l − M)(X_l − M)^T, where X_l, l = 1, 2, ..., N, are the feature vectors in the design set of cardinality N and M is the gross sample mean feature vector. In the Bayes sense, the linear classifier is optimum only for normal distributions with equal covariance matrices [43]. Although the assumption of equal covariance matrices might not be plausible in reality, the simplicity of the classifier compensates for any potential loss in accuracy relative to other classifiers (e.g., SVMs). Indeed, (4) requires only Σ and Mi, i = 1, 2, to be estimated from the design set. However, it should be stressed that no linear classifier performs well when the distributions are separated not by the mean difference but by the covariance difference. In the latter case, one has to adopt a more complex classifier, for example, a quadratic one.
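To make the estimation step concrete, the following minimal NumPy sketch fits the linear discriminant of (4) from a design set and evaluates h(X); the function and variable names are illustrative and not part of the original study.

```python
import numpy as np

def fit_linear_discriminant(X1, X2):
    """Estimate the quantities needed by the linear discriminant in (4).

    X1, X2: (n1, d) and (n2, d) arrays of design samples for the normal
    (Omega_1) and pathological (Omega_2) classes, respectively.
    """
    M1, M2 = X1.mean(axis=0), X2.mean(axis=0)
    X = np.vstack([X1, X2])
    M = X.mean(axis=0)
    # Gross sample covariance, pooled over both classes without distinction.
    Sigma = (X - M).T @ (X - M) / X.shape[0]
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (M2 - M1)                              # weight vector
    b = 0.5 * (M1 @ Sigma_inv @ M1 - M2 @ Sigma_inv @ M2)  # bias term
    return w, b

def discriminant(X, w, b):
    """h(X) of (4): decide Omega_1 when h(X) < t and Omega_2 when h(X) > t."""
    return X @ w + b
```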

2.1. ROC Curve without Reject Option. The decisions taken by the linear classifier (4) for all test samples yield the following measures, which are functions of the threshold t:

(i) true positive rate (TP), also called sensitivity or probability of detection PD, which is defined as the ratio between the pathological samples correctly classified and the total number of pathological samples;

(ii) false negative rate (FN), also called probability of miss, which is defined as the ratio between the pathological samples wrongly classified and the total number of pathological samples;

(iii) true negative rate (TN), also called specificity, which is defined as the ratio between the normal samples correctly classified and the total number of normal samples;

(iv) false positive rate (FP), also known as probability of false alarm PFA, which is defined as the ratio between the normal samples wrongly classified and the total number of normal samples.
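As an illustration of how these rates trace out an ROC curve, the short sketch below sweeps the threshold t over the discriminant scores and collects the empirical (PFA, PD) pairs; the convention of deciding "pathological" when h(X) ≥ t, as well as the names, are assumptions made for the example only.

```python
import numpy as np

def roc_points(h_scores, labels, thresholds):
    """Empirical (PFA, PD) pairs for the rule of (4), with 'pathological'
    decided when h(X) >= t.  labels: 1 = pathological, 0 = normal."""
    h = np.asarray(h_scores)
    labels = np.asarray(labels)
    pfa, pd = [], []
    for t in thresholds:
        decide_path = h >= t
        pd.append(decide_path[labels == 1].mean())   # true positive rate
        pfa.append(decide_path[labels == 0].mean())  # false positive rate
    return np.array(pfa), np.array(pd)
```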

By varying the threshold, we obtain several operating points of the classifier, which can be represented through the receiver operating characteristic (ROC) curve, that is, the plot of PD (TP) versus PFA (FP) with t as an implicit parameter. The ROC is always a concave upwards curve [50]. If a single figure of merit out of an ROC curve is sought, the most commonly used one is the area under the ROC curve. An ideal classifier would have a unit area under the ROC curve. Besides the visualization of classifier performance, the ROC curve can be used to select the most appropriate decision threshold for a particular application [47]. In this case, one has to resort to the costs cij, i, j = 1, 2, shown in the upper two rows of Table 1. Clearly, c12 and c21 are related to a false negative and a false positive classification, while c11 and c22 refer to the costs of a true negative and a true positive classification. A particular operating point (PFA(t), PD(t)) at threshold t is associated with the expected cost [47]:

EC(t) = P_1(c_{21} - c_{11})P_{FA}(t) + P_2(c_{22} - c_{12})P_D(t) + P_1 c_{11} + P_2 c_{12}, \qquad (5)

which defines a set of straight lines with slope

\alpha = -\frac{P_1}{P_2}\,\frac{c_{21} - c_{11}}{c_{22} - c_{12}} \qquad (6)


Table 1: Costs for voice pathology detection with reject option.

Detector's decision      Actual diagnosis
                         Normal (1)      Pathological (2)
Normal (1)               c11             c12
Pathological (2)         c21             c22
Reject                   cR1 (CRN)       cR2 (CRP)

on the (PFA(t), PD(t)) plane. Among these lines, the one that touches the ROC curve determines the best operating point, that is, the threshold that minimizes the expected cost. If the ROC curve has been obtained by means of a parametric model, it is a smooth curve and the best operating point is where the line is tangent to the ROC curve [50]. When the ROC curve is defined with respect to a finite number of experimental measurements connected with straight lines, the optimal operating point can be determined as the point where a line with slope α touches the ROC curve while moving downwards from the top left corner of the (PFA, PD) plane [51]. Such a point lies on the ROC convex hull, that is, the smallest convex set containing the points of the ROC curve [47].
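A minimal sketch of this selection step is given below: it evaluates the expected cost (5) at each empirical ROC point and returns the minimiser, which is the vertex of the ROC convex hull touched by the iso-cost line of slope α in (6). The cost dictionary and the function names are illustrative assumptions.

```python
import numpy as np

def expected_cost(pfa, pd, P1, P2, c):
    """EC(t) of (5); c holds the costs 'c11', 'c12', 'c21', 'c22'."""
    return (P1 * (c['c21'] - c['c11']) * np.asarray(pfa)
            + P2 * (c['c22'] - c['c12']) * np.asarray(pd)
            + P1 * c['c11'] + P2 * c['c12'])

def optimal_operating_point(pfa, pd, P1, P2, c):
    """Index of the empirical ROC point with minimum expected cost; the
    minimum of a linear function over the points is attained on the hull."""
    return int(np.argmin(expected_cost(pfa, pd, P1, P2, c)))
```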

3. Dichotomizers with Reject Option

Given X, the conditional error (or risk) of the Bayes classifier for minimum error (1) is

r(X) = \min\{P_1 p_1(X),\, P_2 p_2(X)\}. \qquad (7)

When r(X) is close to 0.5, decision making can be postponed by introducing a reject test. By setting a threshold θ for r(X), the reject region is defined as [43]

r(X) \geq \theta \iff -\ln\frac{1-\theta}{\theta} + \ln\frac{P_1}{P_2} \;\leq\; h(X) \;\leq\; \ln\frac{1-\theta}{\theta} + \ln\frac{P_1}{P_2}. \qquad (8)

Thus, whenever (8) is satisfied, the sample X is rejected. That is, no decision is taken by the classifier, and further advice is requested from a medical doctor in the context of the application discussed in the paper. Samples in Ω1 satisfying h(X) > ln((1 − θ)/θ) + ln(P1/P2) are misclassified (FP). Similarly, samples in Ω2 satisfying h(X) < −ln((1 − θ)/θ) + ln(P1/P2) are misclassified (FN). Equation (8) suggests modifying the linear classifier decision rule (4) by introducing two thresholds t1 and t2 with t1 ≤ t2 as follows:

X \in \Omega_1\ (\mathrm{N})\ \text{if } h(X) < t_1, \quad X \in \Omega_2\ (\mathrm{P})\ \text{if } h(X) > t_2, \quad X\ \text{is rejected if } t_1 \leq h(X) \leq t_2. \qquad (9)

Obviously, (9) implies that, although the probability of rejection is estimated as a fraction of all test samples, the probability of false alarm and the probability of detection are now fractions of the test samples that are not rejected. That is, the denominators in the estimates of the just mentioned probabilities are now different from those without rejection.

In a sample-based approach, we may set t1 = t − ϑ and t2 = t + ϑ, where t admits values uniformly spaced in the interval [hmin, hmax] with hmin = min_{X∈(Ω1∪Ω2)} {h(X)} and hmax = max_{X∈(Ω1∪Ω2)} {h(X)}, while ϑ = γΔt, where Δt is the step increment of t and γ is a small integer. However, such a choice does not harm the validity of the analysis that follows for generic (asymmetric) thresholds t1 and t2 [47]. Let T be the set of discrete thresholds determined by the just described procedure for t. One may set t1 ∈ T and t2 ∈ T so that t2 > t1.
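The sketch below illustrates the three-way rule of (9) with the symmetric thresholds t1 = t − ϑ and t2 = t + ϑ, together with the rate estimates computed over the non-rejected samples as described above; the names and the label convention are assumptions made for the example.

```python
import numpy as np

def classify_with_reject(h_scores, t, theta):
    """Rule (9) with t1 = t - theta, t2 = t + theta.
    Returns +1 (pathological), -1 (normal) or 0 (rejected) per sample."""
    h = np.asarray(h_scores)
    out = np.zeros(h.shape, dtype=int)
    out[h < t - theta] = -1          # Omega_1 (normal)
    out[h > t + theta] = +1          # Omega_2 (pathological)
    return out                       # remaining zeros are rejected samples

def rates_with_reject(decisions, labels):
    """PD and PFA over the non-rejected samples only; the rejection
    probability is estimated over all test samples."""
    labels = np.asarray(labels)
    accepted = decisions != 0
    p_reject = 1.0 - accepted.mean()
    path = accepted & (labels == 1)
    norm = accepted & (labels == 0)
    pd = (decisions[path] == 1).mean() if path.any() else np.nan
    pfa = (decisions[norm] == 1).mean() if norm.any() else np.nan
    return pfa, pd, p_reject
```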

3.1. ROC Curve with Reject Option. When a reject option is introduced in the classifier design, the costs for rejection in the last row of Table 1 come into play. The optimal values of t and ϑ (or γ) should be determined so that two conflicting requirements are fulfilled, namely, classification error reduction and a limited reject region, in order to preserve as many correct classifications as possible. Following similar lines to [47], it can be shown that the expected cost associated with the classification (9) is now a function of two variables and is given by

EC(t, \vartheta) = \varepsilon_2(t + \vartheta) - \varepsilon_1(t - \vartheta) + P_2 c_{12} + P_1 c_{11}, \qquad (10)

where

\varepsilon_1(t - \vartheta) = P_2(c_{12} - c_{R2})P_D(t - \vartheta) + P_1(c_{11} - c_{R1})P_{FA}(t - \vartheta),
\varepsilon_2(t + \vartheta) = P_2(c_{22} - c_{R2})P_D(t + \vartheta) + P_1(c_{21} - c_{R1})P_{FA}(t + \vartheta). \qquad (11)

The optimal t and ϑ satisfy ∇_{t,ϑ} EC(t, ϑ) = 0. This is equivalent to

P_2(c_{22} - c_{R2})\frac{\partial P_D(t_2)}{\partial t_2} + P_1(c_{21} - c_{R1})\frac{\partial P_{FA}(t_2)}{\partial t_2} - P_2(c_{12} - c_{R2})\frac{\partial P_D(t_1)}{\partial t_1} - P_1(c_{11} - c_{R1})\frac{\partial P_{FA}(t_1)}{\partial t_1} = 0,
P_2(c_{22} - c_{R2})\frac{\partial P_D(t_2)}{\partial t_2} + P_1(c_{21} - c_{R1})\frac{\partial P_{FA}(t_2)}{\partial t_2} + P_2(c_{12} - c_{R2})\frac{\partial P_D(t_1)}{\partial t_1} + P_1(c_{11} - c_{R1})\frac{\partial P_{FA}(t_1)}{\partial t_1} = 0, \qquad (12)

where the change of variables t1 = t − ϑ and t2 = t + ϑ has been made. By adding and subtracting the two equations in (12), we arrive at

P_2(c_{22} - c_{R2})\frac{\partial P_D(t_2)}{\partial t_2} + P_1(c_{21} - c_{R1})\frac{\partial P_{FA}(t_2)}{\partial t_2} = 0,
P_2(c_{12} - c_{R2})\frac{\partial P_D(t_1)}{\partial t_1} + P_1(c_{11} - c_{R1})\frac{\partial P_{FA}(t_1)}{\partial t_1} = 0. \qquad (13)



Figure 1: (a) Experimental ROC curves of the linear classifier tested for vocal fold paralysis detection in men without reject option (dashed line) and with reject option (solid line). (b) Zoom in the ROC curves.

The set of equations (13) defines two straight lines with slopes

\alpha_1 = -\frac{P_1}{P_2}\,\frac{c_{21} - c_{R1}}{c_{22} - c_{R2}}, \qquad (14)
\alpha_2 = -\frac{P_1}{P_2}\,\frac{c_{11} - c_{R1}}{c_{12} - c_{R2}} \qquad (15)

on the plane of PFA and PD. Equations (14) and (15) are valid for generic t1 and t2. The set of equations (13) suggests that the straight lines of slope α1 and α2 should touch the convex hull of the ROC curve without reject option at two distinct points having implicit parameters t1 and t2 such that t1 < t2. Each of these distinct points can be found by means of a simple search over the edges of the ROC convex hull derived without the reject option [47]. Having found t1 and t2, the set of equations t1 = t − ϑ and t2 = t + ϑ is then solved for t and ϑ. Clearly, the just derived estimates of t and ϑ are initial ones, because they depend on the convex hull resolution of the ROC curve without rejection estimated from the threshold values t ∈ T. The initial estimates of t and ϑ can be corrected when the operating point they define lies inside the convex hull of the ROC curve with rejection. Since the probability of false alarm and the probability of detection in the latter ROC curve are fractions of the test samples that are not rejected, the lines of slope α given by (6) should touch the convex hull of the ROC curve with rejection at the optimal operating point. The values of t and ϑ at the aforementioned optimal operating point are better estimates than the initial ones. If the initial estimates of t and ϑ define an operating point outside the convex hull of the ROC curve with rejection, then no further correction is needed, because such an operating point defines a new vertex of the convex hull linked by two new edges with the nearest vertices already included in the available convex hull. Obviously, the new vertex will be the point where the lines of slope α touch the updated convex hull.
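A compact sketch of the initialisation step is shown below: for each slope, the touching vertex of the empirical ROC without rejection is the point with the largest intercept PD − α·PFA, and the implicit thresholds of the two touching points give t and ϑ. The cost dictionary and function names are illustrative assumptions, and the refinement step on the ROC with rejection is not shown.

```python
import numpy as np

def touch_point(pfa, pd, alpha):
    """Index of the ROC vertex touched by a level line of slope alpha,
    i.e. the point maximising the intercept PD - alpha * PFA."""
    return int(np.argmax(np.asarray(pd) - alpha * np.asarray(pfa)))

def initial_reject_thresholds(pfa, pd, thresholds, P1, P2, c):
    """Initial t and theta obtained from the slopes alpha_1 (14) and
    alpha_2 (15) applied to the ROC obtained without rejection.
    pfa, pd and thresholds must be aligned arrays (same implicit t)."""
    alpha1 = -(P1 / P2) * (c['c21'] - c['cR1']) / (c['c22'] - c['cR2'])
    alpha2 = -(P1 / P2) * (c['c11'] - c['cR1']) / (c['c12'] - c['cR2'])
    t2 = thresholds[touch_point(pfa, pd, alpha1)]
    t1 = thresholds[touch_point(pfa, pd, alpha2)]
    return 0.5 * (t1 + t2), 0.5 * (t2 - t1)      # t, theta
```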

4. Datasets and Feature Extraction

The MEEI database was released in 1994 [37]. It contains over 1400 voice signals from approximately 700 subjects. Two different kinds of recordings were collected: in each session, the patients were asked to articulate the sustained vowel "ah" (/a/) and to read the "rainbow passage". The database contains recordings of the vowel "ah" (53 normal and 657 pathological utterances) and of continuous speech (53 normal and 661 pathological utterances). The discussion is focused on the sustained vowel recordings, and first results on the "rainbow passage" recordings will be reported. The recordings were made under matching acoustic conditions using the Kay Computerized Speech Lab. Each subject was asked to produce a sustained phonation of the vowel "ah" at a comfortable pitch and loudness for at least 3 seconds. The process was repeated three times for each subject, and a speech pathologist chose the best sample for the database. The recordings of the sustained vowel were made at a sampling rate of 25 kHz for the patients and 50 kHz for the healthy subjects. In the latter case, the sampling rate was reduced to 25 kHz by down-sampling. The normal voice recordings are about 5 seconds long, whereas the pathological ones are about 3 seconds long. The major asset of the MEEI database is the clinical assessment of the subjects as well as the availability of the subjects' personal details. However, there are several drawbacks that are carefully identified in [21].

Due to the inherent differences in the speech production system of male and female subjects, it makes sense to deal with disordered speech detection separately for each gender.



Figure 2: (a) Convex hull of the experimental ROC curve of the linear classifier without reject option (solid line) with the level lines of slope α (dashed lines) overlaid. (b) Zoom in (a): the arrow points to the optimal operating point (PFA, PD) = (0.0252, 0.9296).

Two experiments are conducted. The first experiment concerns vocal fold paralysis detection, and the dataset comprises recordings from 21 males aged 26 to 60 years, who were medically diagnosed as normal, and another 21 males aged 20 to 75 years, who were medically diagnosed with vocal fold paralysis. The second experiment concerns vocal fold edema detection, where 21 females aged 22 to 52 years, who were medically diagnosed as normal, and another 21 females aged 18 to 57 years, who were medically diagnosed with vocal fold edema, served as subjects. The subjects might suffer from other diseases too, such as hyperfunction, ventricular compression, atrophy, teflon granuloma, and so forth. Although a multi-label classification framework would be more appropriate, we will assume a sort of label tying in this paper by ignoring the other diagnoses, so that enough design and test samples are available for our study. Multi-label classification is left for future research. However, the linear classifier studied in the paper requires only the estimation of the class-conditional mean vectors and the gross dispersion matrix. Accordingly, the number of adjustable parameters is not high.

As in [29, 32], 14 LPCs are extracted from each speech frame. The speech frames have a duration of 20 ms, and neighboring frames do not overlap. The rectangular window is used to extract the speech frames. By varying the number of LPCs from 14 to 30, we have found that the probability of correct classification for both voice pathologies does not improve enough to justify a linear prediction analysis of order higher than 14. On the contrary, more than 14 LPCs are frequently found to deteriorate the probability of correct classification.
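For illustration, a self-contained NumPy sketch of this front end is given below, using the autocorrelation method with a Levinson-Durbin recursion to obtain 14 LPCs per 20 ms non-overlapping, rectangular-windowed frame; the implementation details and names are assumptions, not the authors' code.

```python
import numpy as np

def levinson_durbin(r, order):
    """LPC polynomial a (with a[0] = 1) from autocorrelation lags r[0..order]."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                      # reflection coefficient
        a[1:i] += k * a[i - 1:0:-1]
        a[i] = k
        err *= (1.0 - k * k)
    return a, err

def lpc_features(signal, fs=25000, frame_ms=20, order=14):
    """order LPCs per non-overlapping, rectangular-windowed frame."""
    n = int(fs * frame_ms / 1000)
    feats = []
    for start in range(0, len(signal) - n + 1, n):
        frame = np.asarray(signal[start:start + n], dtype=float)
        r = np.correlate(frame, frame, mode='full')[n - 1:n + order]
        if r[0] <= 0.0:                     # skip empty/silent frames
            continue
        a, _ = levinson_durbin(r, order)
        feats.append(a[1:])                 # drop the leading 1
    return np.array(feats)
```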

In the first experiment, the sample set consists of 4236 14-dimensional feature vectors (i.e., samples), of which 3171 samples were extracted from normal speech utterances of the sustained vowel "ah" and the remaining 1065 samples were extracted from pathological speech uttered by male speakers. In the second experiment, the sample set consists of 4199 14-dimensional feature vectors, of which 3096 samples were extracted from normal speech utterances of the sustained vowel "ah" and the remaining 1103 samples were extracted from pathological speech uttered by female speakers. For each experiment, first experimental results using utterances of the "rainbow passage" are also reported.

Table 2: Arithmetic values of the costs employed for voice pathology detection with reject option.

Detector's decision      Actual diagnosis
                         Normal (1)      Pathological (2)
Normal (1)               -1              10
Pathological (2)         5               -1
Reject                   1               2

5. Experimental Results

The assessment of the linear classifier for detecting vocal fold paralysis in men and vocal fold edema in women, either with or without the reject option, is based on the ROC curve. 80% of the samples have been used for classifier design, and the remaining 20% of the samples have been used for testing the classifier. The classifier design aims at estimating the parameters appearing in (4). The costs depicted in Table 2 have been used in the study of the ROC curves. The negative sign for true positives and true negatives should be interpreted as a gain. The assignment of a higher cost to false negatives (misses) than to false positives (false alarms) is easily understood. The costs cR2 (CRP) and cR1 (CRN) are chosen so that the inequality

\frac{c_{11} - c_{R1}}{c_{12} - c_{R2}} > \frac{c_{21} - c_{R1}}{c_{22} - c_{R2}} \qquad (16)

holds [47]. A design strategy is as follows.



Figure 3: Probability of rejection in vocal fold paralysis detection as a function of (a) t and ϑ, (b) t1, t2 ∈ T with t2 ≥ t1.


Figure 4: (a) Zoom in the convex hull of the ROC without reject option (solid line); the level lines of slope α1 (dashed lines) are overlaid. The arrow points to the optimal operating point (PFA(t2), PD(t2)) = (0.0252, 0.9296). (b) Zoom in the convex hull of the ROC without reject option (solid line); the level lines of slope α2 (dashed lines) are overlaid. The arrow points to the optimal operating point (PFA(t1), PD(t1)) = (0.0472, 0.9531).

(1) Choose c22 < cR2 < c12, for example, cR2 = 2.

(2) Let η = (c12 − cR2)/(cR2 − c22) > 0, for example, η = 1.

(3) Then, cR1 < (c21η + c11)/(η + 1), for example, cR1 < 4.5.

In addition, cR1 should be chosen so that the straight lines of slope α1 and α2 touch the convex hull of the ROC curve without reject option at two distinct points, in order for the reject option to be meaningful. The choice cR1 = 1 satisfies both requirements. However, any other assignment stemming from the just described strategy could also be used.
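As a quick sanity check of a chosen cost assignment, the snippet below verifies inequality (16) for the Table 2 costs and for the alternative cR1 = −1 discussed in Section 5.1; it is only an illustrative helper, not part of the original experiments.

```python
def reject_costs_valid(c11, c12, c21, c22, cR1, cR2):
    """Check inequality (16), which the reject costs must satisfy."""
    return (c11 - cR1) / (c12 - cR2) > (c21 - cR1) / (c22 - cR2)

# Table 2 assignment (c11 and c22 are gains, hence negative).
print(reject_costs_valid(c11=-1, c12=10, c21=5, c22=-1, cR1=1, cR2=2))   # True
# Alternative from Section 5.1: a gain (cR1 = -1) for rejecting normal subjects.
print(reject_costs_valid(c11=-1, c12=10, c21=5, c22=-1, cR1=-1, cR2=2))  # True
```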

5.1. Vocal Fold Paralysis in Men. The experimental ROC curves of the linear classifier without reject option (4) and with reject option (9), which were derived by counting classifier decisions, are shown in Figure 1.

In order to obtain a better insight into the detection, first the convex hull of the ROC curve without the reject option is plotted in Figure 2(a). In the same figure, several parallel level lines PD(t) = α PFA(t) + β(t) are overlaid. Clearly, one of these lines passes through the ideal operating point (PFA(t), PD(t)) = (0, 1). The intercept of this line is β(t)|_{PFA(t)=0, PD(t)=1} = 1.



Figure 5: Zoom in the ROC convex hulls with reject option (solid line) and without reject option (dashed line).

Accordingly, to produce the set of parallel lines, one has to vary β uniformly in [0, 1]. The inspection of Figure 2(b) reveals the optimal operating point (PFA(t), PD(t)) = (0.0252, 0.9296), where the level lines touch the ROC convex hull. Indeed, the line above the one touching the ROC curve does not determine any feasible operating point for the classifier, although it exhibits a lower expected cost, while the line below intersects the ROC curve in at least two points, but at a greater expected cost. The easiest method to identify the optimal point is visual inspection of the graph. However, since the vertices of the convex hull have already been determined, one can insert the associated (PFA(t), PD(t)) into (5), sort the vertices in increasing order of the expected cost, and read off the operating point that yields the minimum expected cost. Alternatively, one may search the edges of the ROC convex hull as suggested in [47]. All these methods have been successfully tested in all the experiments conducted.

The introduction of the reject option in (9) induces the probability of rejection, which is plotted in Figure 3 as a function of t1 and t2 when the costs shown in Table 2 are used. Figure 3(a) depicts the probability of rejection as a function of t and ϑ. In particular, t ∈ T and 10 equally spaced values of ϑ ∈ [0, 3Δt] were defined. As expected, the largest probability of rejection (i.e., 0.1804) occurs for t = −0.7330 and ϑ = 0.2434, yielding thresholds t1 and t2 in the middle of their domain T. The probability of rejection for t1, t2 ∈ T with t2 ≥ t1 is plotted in Figure 3(b). It is seen that the generic rejection region may yield large probabilities of rejection, leaving very few test samples to be processed by the classifier. On the contrary, many fewer test samples need to be submitted to a clinician for further screening if t1 and t2 are set equal to t ± ϑ.

In Figure 4(a), the convex hull of the ROC without rejection is plotted along with the level lines having slope α1 given by (14). The points that define the ROC convex hull are indicated by markers. The level lines touch the ROC convex hull at the operating point (PFA(t2), PD(t2)) = (0.0252, 0.9296).


Figure 6: Zoom in the experimental ROC curves of the linear classifier applied to vocal fold edema detection in women without reject option (dashed line) and with reject option (solid line).

The level lines having slope α2 given by (15) touch the convex hull of the ROC without rejection at the operating point (PFA(t1), PD(t1)) = (0.0472, 0.953), as can be seen in Figure 4(b). The implicit thresholds associated with the two operating points are t1 = −0.2822 and t2 = −0.1920. Indeed, the reject option is useful in the middle of the domain of thresholds T. By applying the procedure described in Section 3.1, the associated probabilities of false alarm and detection with reject option at the optimal operating point are found to be 0.01904 and 0.99484. It is seen that the introduction of rejection has improved the probability of detection by 6.59% for a probability of false alarm fixed at approximately 2%. The classification accuracy with reject option at the operating point under discussion is measured as 98.47%, that is, 2.13% higher than that measured without rejection. The confidence interval for the classification accuracy can be estimated as in [21], that is,

CI = \pm z_{1-\delta/2}\sqrt{\frac{q(1-q)}{N}}, \qquad (17)

where z_{1−δ/2} is the standard Gaussian percentile for confidence level 100(1 − δ)% (e.g., for δ = 0.05, z_{1−δ/2} = z_{0.975} = 1.967), q is the experimentally measured classification accuracy, and N is the number of samples. In our case, for N = 847 and q = 0.96863, (17) yields 0.83%, which indicates that the just mentioned improvement is statistically significant at the 95% level of significance. If cR1 is set equal to −1 (i.e., a gain is introduced for rejecting normal subjects), which is a permissible policy according to the cost assignment methodology described previously, and all other costs are left intact, the probability of correct classification at the best operating point increases to 98.59%, which yields a statistically significant improvement at the same level of significance (CI = 0.7954%). At the latter operating point, we have PFA = 0.0172 and PD = 0.994709 when the reject option is enabled.
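The half-width in (17) is straightforward to compute; the short helper below is a minimal sketch with illustrative arguments (the percentile defaults to the value quoted in the text).

```python
import math

def accuracy_confidence_interval(q, N, z=1.967):
    """Half-width of the confidence interval (17) for a measured
    classification accuracy q estimated over N test samples."""
    return z * math.sqrt(q * (1.0 - q) / N)

# Example usage with illustrative values (not the paper's exact figures):
half_width = accuracy_confidence_interval(q=0.95, N=847)
```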



Figure 7: (a) Convex hull of the experimental ROC curve of the linear classifier without reject option (solid line) with the level lines of slope α (dashed lines) overlaid. (b) Zoom in (a): the arrow points to the optimal operating point (PFA, PD) = (0.0629, 0.7955).


The superiority of the linear classifier with reject option is demonstrated in Figure 5, where only the convex hulls of the ROC curves with reject option (solid line) and without reject option (dashed line) are plotted. It is self-evident that the area of the convex hull for the ROC with reject option is greater than that without reject option. The area of the convex hull is correlated with the area under the ROC curve, which is frequently used as an objective figure of merit. In particular, the area under the ROC was measured as 0.9868 without rejection and 0.9951 with the rejection option, when t1 = t − ϑ and t2 = t + ϑ.

The same procedure has been applied to a set of 5049 test feature vectors extracted from utterances of the "rainbow passage". At the optimal operating point with respect to the costs of Table 2, the classifier without reject option yields PFA = 0.477227 and PD = 0.9358, and its accuracy is 72.93%. The introduction of the reject option yields, at the optimal operating point, PFA = 0.0686 and PD = 0.91875, while the probability of correct classification increases to 92.45%. It is seen that the reject option drastically reduces the probability of false alarm by approximately 40% at the same probability of detection. Needless to say, the improvement in classification accuracy is statistically significant.

5.2. Vocal Fold Edema in Women. The experimental ROC curves of the linear classifier without reject option (4) and with reject option (9), which were derived by counting classifier decisions with the cost assignment shown in Table 2, are plotted in Figure 6.

The convex hull of the ROC curve without reject option is plotted in Figure 7. In the same figure, a set of parallel level lines having the slope given by (6) is overlaid, and the points that define the ROC convex hull are indicated by markers. If the costs shown in Table 2 are employed, the minimum expected cost is found for the threshold that yields the operating point (PFA(t), PD(t)) = (0.0629, 0.7955), where the level lines touch the ROC convex hull.

The introduction of the reject option in (9) induces the probability of rejection, which is plotted in Figure 8 as a function of t and ϑ.


Figure 8: Probability of rejection as a function of (t1, t2) for vocal fold edema detection.

100 equally spaced values in the range [hmin, hmax] were taken for t, and 10 equally spaced values of ϑ ∈ [0, 3Δt] were defined, as previously for vocal fold paralysis. As expected, the larger probability of rejection occurs in the middle of the domain of t ± ϑ.

In Figure 9(a), the convex hull of the ROC without rejection is plotted along with the level lines having slope α1 given by (14). The points that define the ROC convex hull are indicated by markers. The level lines touch the ROC convex hull at the operating point (PFA(t2), PD(t2)) = (0.0177, 0.7227). The level lines of slope α2 given by (15) touch the convex hull of the ROC without rejection at the operating point (PFA(t1), PD(t1)) = (0.1322, 0.8590), as is demonstrated in Figure 9(b). These operating points correspond to t1 = −0.2643 and t2 = 0.2937. By applying the procedure described in Section 3.1, the associated probabilities of false alarm and detection with reject option are found to be 0.02003 and 0.836842, respectively.



Figure 9: (a) Zoom in the convex hull of the ROC without reject option (solid line); the level lines of slope α1 (dashed lines) are overlaid. The arrow points to the optimal operating point (PFA(t2), PD(t2)) = (0.0177, 0.7227). (b) Zoom in the convex hull of the ROC without reject option (solid line); the level lines of slope α2 (dashed lines) are overlaid. The arrow points to the optimal operating point (PFA(t1), PD(t1)) = (0.1322, 0.8590).


Figure 10: Zoom in the ROC convex hulls with reject option (solid line) and without reject option (dashed line).

The classification accuracy with reject option at the best operating point, when the costs of Table 2 are used, is measured as 94.316%, that is, 4.316% higher than that measured without rejection. The confidence interval for the classification accuracy predicted by (17) for N = 840 and q = 0.94316 is 1.57%, which indicates that the just mentioned improvement of 4.316% is statistically significant at the 95% level of significance. By fixing the probability of detection to 83.64%, the reject option is found to reduce the probability of false alarm by 9.12%.

The superiority of the linear classifier with reject option is demonstrated in Figure 10, where only the convex hulls of the ROC curves with reject option (solid line) and without reject option (dashed line) are plotted. It is self-evident that the area of the convex hull for the ROC with reject option is greater than that without reject option. In particular, the area under the ROC increases from 0.9458 to 0.96 with the introduction of the reject option.

The same procedure has been applied to a set of 3365 test feature vectors extracted from utterances of the "rainbow passage". At the optimal operating point with respect to the costs of Table 2, the classifier without reject option yields PFA = 0.5965 and PD = 0.8959, and its probability of correct classification is 64.96%. The introduction of the reject option yields, at the optimal operating point, PFA = 0.5228 and PD = 0.8853, while the accuracy increases to 68.8%. It is seen that the reject option reduces the probability of false alarm by approximately 7.3% at the same probability of detection. The improvement of 3.9% in classification accuracy is statistically significant at the 95% level of significance (CI = 1.57%).

6. Conclusions

The reject option has been shown to improve the accuracy of a linear classifier in detecting vocal fold paralysis in male patients as well as vocal fold edema in female patients over that obtained without the reject option. Moreover, the reported improvements are shown to be statistically significant at the 95% confidence level. In addition, the linear classifier with reject option outperforms the classifiers previously employed in [29, 32] to detect the aforementioned voice pathologies under exactly the same experimental protocol. Future research will address the introduction of the reject option in the design of the Bayes classifier, when Gaussian mixture models approximate the class conditional probability density functions of the linear prediction coefficients extracted from continuous speech.


References

[1] C. Manfredi, "Voice models and analysis for biomedical applications," Biomedical Signal Processing and Control, vol. 1, no. 2, pp. 99–101, 2006.

[2] F. Quek, M. Harper, Y. Haciahmetoglou, L. Chen, and L. O. Ramig, "Speech pauses and gestural holds in Parkinson's disease," in Proceedings of the 7th International Conference on Spoken Language Processing (ICSLP '02), pp. 2485–2488, Denver, Colo, USA, September 2002.

[3] L. Will, L. O. Ramig, and J. L. Spielman, "Application of Lee Silverman Voice Treatment (LSVT) to individuals with multiple sclerosis, ataxic dysarthria, and stroke," in Proceedings of the 7th International Conference on Spoken Language Processing (ICSLP '02), pp. 2497–2500, Denver, Colo, USA, September 2002.

[4] P. Enderby and L. Emerson, Does Speech and Language Therapy Work? Singular Publications, 1995.

[5] R. P. Schumeyer and K. E. Barner, "Effect of visual information on word initial consonant perception of dysarthric speech," in Proceedings of the 4th International Conference on Spoken Language Processing (ICSLP '96), vol. 1, pp. 46–49, Philadelphia, Pa, USA, October 1996.

[6] K. Mady, R. Sader, A. Zimmermann, et al., "Assessment of consonant articulation in glossectomee speech by dynamic MRI," in Proceedings of the 7th International Conference on Spoken Language Processing (ICSLP '02), pp. 961–964, Denver, Colo, USA, September 2002.

[7] R. Schwarz, U. Hoppe, M. Schuster, T. Wurzbacher, U. Eysholdt, and J. Lohscheller, "Classification of unilateral vocal fold paralysis by endoscopic digital high-speed recordings and inversion of a biomechanical model," IEEE Transactions on Biomedical Engineering, vol. 53, no. 6, pp. 1099–1108, 2006.

[8] V. Parsa and D. G. Jamieson, "Interactions between speech coders and disordered speech," Speech Communication, vol. 40, no. 7, pp. 365–385, 2003.

[9] M. S. Hawley, P. Green, P. Enderby, S. Cunningham, and R. K. Moore, "Speech technology for e-inclusion of people with physical disabilities and disordered speech," in Proceedings of the 9th European Conference on Speech Communication and Technology (INTERSPEECH '05), pp. 445–448, Lisbon, Portugal, September 2005.

[10] F. Plante, H. Kessler, B. Cheetham, and J. Earis, "Speech monitoring of infective laryngitis," in Proceedings of the 4th International Conference on Spoken Language Processing (ICSLP '96), vol. 2, pp. 749–752, Philadelphia, Pa, USA, October 1996.

[11] E. J. Wallen and J. H. L. Hansen, "Screening test for speech pathology assessment using objective quality measures," in Proceedings of the 4th International Conference on Spoken Language Processing (ICSLP '96), vol. 2, pp. 776–779, Philadelphia, Pa, USA, October 1996.

[12] M. N. Vieira, F. R. McInnes, and M. A. Jack, "Robust F0 and jitter estimation in pathological voices," in Proceedings of the 4th International Conference on Spoken Language Processing (ICSLP '96), vol. 2, pp. 745–748, Philadelphia, Pa, USA, October 1996.

[13] P. Mitev and S. Hadjitodorov, "Fundamental frequency estimation of voice of patients with laryngeal disorders," Information Sciences, vol. 156, no. 1-2, pp. 3–19, 2003.

[14] H. Weiping, W. Xiuxin, and P. Gomez, "Robust pitch extraction in pathological voice based on wavelet and cepstrum," in Proceedings of the 12th European Signal Processing Conference (EUSIPCO '04), pp. 297–300, Vienna, Austria, September 2004.

[15] L. Deng, X. Shen, D. Jamieson, and J. Till, "Simulation of disordered speech using a frequency-domain vocal tract model," in Proceedings of the 4th International Conference on Spoken Language Processing (ICSLP '96), vol. 2, pp. 768–771, Philadelphia, Pa, USA, October 1996.

[16] B. Gabelman and A. Alwan, "Analysis by synthesis of FM modulation and aspiration noise components in pathological voices," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '02), vol. 1, pp. 449–452, Orlando, Fla, USA, May 2002.

[17] J. Hanquinet, F. Grenez, and J. Schoentgen, "Synthesis of disordered speech," in Proceedings of the 9th European Conference on Speech Communication and Technology (INTERSPEECH '05), pp. 1077–1080, Lisbon, Portugal, September 2005.

[18] V. Parsa and D. G. Jamieson, "Acoustic discrimination of pathological voice: sustained vowels versus continuous speech," Journal of Speech, Language, and Hearing Research, vol. 44, no. 2, pp. 327–339, 2001.

[19] A. McAllister, "Acoustic, perceptual and physiological studies of ten-year-old children's voices," Speech, Music and Hearing Quarterly Progress and Status Report, vol. 38, no. 1, 1997.

[20] V. Uloza, V. Saferis, and I. Uloziene, "Perceptual and acoustic assessment of voice pathology and the efficacy of endolaryngeal phonomicrosurgery," Journal of Voice, vol. 19, no. 1, pp. 138–145, 2005.

[21] N. Saenz-Lechon, J. I. Godino-Llorente, V. Osma-Ruiz, and P. Gomez-Vilda, "Methodological issues in the development of automatic systems for voice pathology detection," Biomedical Signal Processing and Control, vol. 1, no. 2, pp. 120–128, 2006.

[22] J. B. Alonso, J. de Leon, I. Alonso, and M. A. Ferrer, "Automatic detection of pathologies in the voice by HOS based parameters," EURASIP Journal on Applied Signal Processing, vol. 2001, no. 4, pp. 275–284, 2001.

[23] R. B. Reilly, R. Moran, and P. Lacy, "Voice pathology assessment based on a dialogue system and speech analysis," in Proceedings of the AAAI Fall Symposium on Dialogue Systems for Health Communication, pp. 104–109, Washington, DC, USA, October 2004.

[24] K. Shama, A. Krishna, and N. U. Cholayya, "Study of harmonics-to-noise ratio and critical-band energy spectrum of speech as acoustic indicators of laryngeal and voice pathology," EURASIP Journal on Advances in Signal Processing, vol. 2007, Article ID 85286, 9 pages, 2007.

[25] P. Gomez, J. I. Godino, F. Rodríguez, et al., "Evidence of vocal cord pathology from the mucosal wave cepstral contents," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '04), vol. 5, pp. 437–440, Montreal, Canada, May 2004.

[26] J. B. Alonso, F. D. de Maria, C. M. Trevieso, and M. A. Ferrer, "Using nonlinear features for voice disorder detection," in Proceedings of the 3rd International Conference on Non-Linear Speech Processing (NOLISP '05), pp. 94–106, Barcelona, Spain, 2005.

[27] M. Little, P. McSharry, I. Moroz, and S. Roberts, "Nonlinear, biophysically-informed speech pathology detection," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '06), vol. 2, pp. 1080–1083, Toulouse, France, May 2006.

[28] P. Kukharchik, I. Kheidorov, E. Bovbel, and D. Ladeev, "Speech signal processing based on wavelets and SVM for vocal tract pathology detection," in Proceedings of the 3rd International Conference on Image and Signal Processing (ICISP '08), vol. 5099 of Lecture Notes in Computer Science, pp. 192–199, Springer, Cherbourg-Octeville, France, July 2008.

[29] M. Marinaki, C. Kotropoulos, I. Pitas, and N. Maglaveras, "Automatic detection of vocal fold paralysis and edema," in Proceedings of the International Conference on Spoken Language Processing (ICSLP '04), pp. 537–540, Jeju, South Korea, October 2004.

[30] P. Gomez, F. Díaz, A. Alvarez, et al., "Principal component analysis of spectral perturbation parameters for voice pathology detection," in Proceedings of the 18th IEEE Symposium on Computer-Based Medical Systems (CBMS '05), pp. 41–46, Dublin, Ireland, June 2005.

[31] C. Peng, W. Chen, and B. Wan, "A preliminary study of pathological voice classification," in Proceedings of the 7th IEEE International Conference on Computer and Information Technology (CIT '07), pp. 1106–1110, October 2007.

[32] E. Ziogas and C. Kotropoulos, "Detection of vocal fold paralysis and edema using linear discriminant classifiers," in Proceedings of the 4th Hellenic Conference on Advances in Artificial Intelligence (SETN '06), vol. 3955 of Lecture Notes in Computer Science, pp. 454–464, Springer, Heraklion, Greece, May 2006.

[33] B. G. A. Aguiar Neto, J. M. Fechine, S. C. Costa, and M. Muppa, "Feature estimation for vocal fold edema detection using short-term cepstral analysis," in Proceedings of the 7th IEEE International Conference on Bioinformatics and Bioengineering (BIBE '07), pp. 1158–1162, October 2007.

[34] C. Fredouille, G. Pouchoulin, J.-F. Bonastre, M. Azzarello, A. Giovanni, and A. Ghio, "Application of automatic speaker recognition techniques to pathological voice assessment (dysphonia)," in Proceedings of the 9th European Conference on Speech Communication and Technology (EUROSPEECH '05), pp. 149–152, Lisbon, Portugal, September 2005.

[35] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, "Speaker verification using adapted Gaussian mixture models," Digital Signal Processing, vol. 10, no. 1–3, pp. 19–41, 2000.

[36] http://emedicine.medscape.com/article/863779-overview.

[37] Massachusetts Eye and Ear Infirmary, Voice Disorders Database, Version 1.03, Kay Elemetrics Corp., Lincoln Park, NJ, USA, 1994, CD-ROM.

[38] A. A. Dibazar, S. Narayanan, and T. W. Berger, "Feature analysis for automatic detection of pathological speech," in Proceedings of the 25th IEEE Annual International Conference of the Engineering in Medicine and Biology, vol. 1, pp. 182–183, 2002.

[39] V. Parsa, D. G. Jamieson, K. Stenning, and H. A. Leeper, "On the estimation of signal-to-noise ratio in continuous speech for abnormal voices," in Proceedings of the 7th International Conference on Spoken Language Processing (ICSLP '02), pp. 2505–2508, Denver, Colo, USA, September 2002.

[40] J. Nayak and P. S. Bhat, "Identification of voice disorders using speech samples," in Proceedings of the 10th IEEE International Conference on Convergent Technologies for Asia-Pacific Region (TENCON '03), vol. 3, pp. 951–953, 2003.

[41] R. A. Prosek, A. A. Montgomery, B. E. Walden, and D. B. Hawkins, "An evaluation of residue features as correlates of voice disorders," Journal of Communication Disorders, vol. 20, pp. 105–107, 1987.

[42] M. De Oliveira Rosa, J. C. Pereira, and M. Grellet, "Adaptive estimation of residue signal for voice pathology diagnosis," IEEE Transactions on Biomedical Engineering, vol. 47, no. 1, pp. 96–104, 2000.

[43] K. Fukunaga, Introduction to Statistical Pattern Recognition, Academic Press, San Diego, Calif, USA, 2nd edition, 1990.

[44] X. Tang and W. Wang, "Dual-space linear discriminant analysis for face recognition," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '04), vol. 2, pp. 1064–1068, 2004.

[45] J. R. Deller, J. G. Proakis, and J. H. L. Hansen, Discrete Time Processing of Speech Signals, MacMillan Publishing Company, New York, NY, USA, 1993.

[46] T. C. W. Landgrebe, D. M. J. Tax, P. Paclík, and R. P. W. Duin, "The interaction between classification and reject performance for distance-based reject-option classifiers," Pattern Recognition Letters, vol. 27, no. 8, pp. 908–917, 2006.

[47] F. Tortorella, "A ROC-based reject rule for dichotomizers," Pattern Recognition Letters, vol. 26, no. 2, pp. 167–180, 2005.

[48] C. M. Santos-Pereira and A. M. Pires, "On optimal reject rules and ROC curves," Pattern Recognition Letters, vol. 26, no. 7, pp. 943–952, 2005.

[49] C. Marrocco, M. Molinara, and F. Tortorella, "An empirical comparison of ideal and empirical ROC-based reject rules," in Proceedings of the 5th International Conference on Machine Learning and Data Mining (MLDM '07), vol. 4571 of Lecture Notes in Computer Science, pp. 47–60, 2007.

[50] H. L. V. Trees, Detection, Estimation and Modulation Theory, Part I, John Wiley & Sons, New York, NY, USA, 1968.

[51] M. H. Zweig and G. Campbell, "Receiver-operating characteristic (ROC) plots: a fundamental evaluation tool in clinical medicine," Clinical Chemistry, vol. 39, no. 4, pp. 561–577, 1993.


Hindawi Publishing Corporation, EURASIP Journal on Advances in Signal Processing, Volume 2009, Article ID 982102, 13 pages, doi:10.1155/2009/982102

Research Article

Back-and-Forth Methodology for Objective Voice Quality Assessment: From/to Expert Knowledge to/from Automatic Classification of Dysphonia

Corinne Fredouille,1 Gilles Pouchoulin,1 Alain Ghio,2 Joana Revis,2

Jean-Francois Bonastre,1 and Antoine Giovanni2

1 Laboratoire Informatique d'Avignon (LIA), University of Avignon, 84911 Avignon, France
2 LPL-CNRS, Aix-Marseille University, 13604 Aix-en-Provence, France

Correspondence should be addressed to Corinne Fredouille, [email protected]

Received 31 October 2008; Revised 1 April 2009; Accepted 10 June 2009

Recommended by Juan I. Godino-Llorente

This paper addresses voice disorder assessment. It proposes an original back-and-forth methodology involving an automatic classification system as well as the knowledge of human experts (machine learning experts, phoneticians, and pathologists). The goal of this methodology is to bring a better understanding of the acoustic phenomena related to dysphonia. The automatic system was validated on a dysphonic corpus (80 female voices) rated according to the GRBAS perceptual scale by an expert jury. Firstly, focused on the frequency domain, the classification system showed the interest of the 0–3000 Hz frequency band for the classification task based on the GRBAS scale. Later, an automatic phonemic analysis underlined the significance of consonants and, more surprisingly, of unvoiced consonants for the same classification task. Submitted to the human experts, these observations led to a manual analysis of unvoiced plosives, which highlighted a lengthening of VOT according to the dysphonia severity, validated by a preliminary statistical analysis.

Copyright © 2009 Corinne Fredouille et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

Assessment of voice quality is a key point for establishing telecommunication standards as well as for the medical area linked to speech and voice disorders. In the telecommunication field, voice quality assessment is mainly addressed at the perceptual level using the Mean Opinion Score (MOS) scale [1] standardized by the International Telecommunication Union (ITU). The evaluation of voice quality is done by a jury composed of nonspecialized listeners. Several algorithms were proposed in order to move from this human perception-based measure to an automatic measure, both to reduce costs and to move from a subjective to an objective method. The best known algorithm is the Perceptual Evaluation of Speech Quality (PESQ) [2], also normalized by the ITU. The effectiveness of PESQ is measured by the correlation between the MOS measures obtained by a human jury and those obtained using the PESQ. Even if the PESQ (and its extensions) is well suited for the telecommunication field, it requires parallel audio records without and with noise disturbance to evaluate voice quality. This constraint is of course impossible to satisfy in the medical/pathological area. However, independently of this difference, it is interesting to notice that the MOS/PESQ is estimated at the perceptual level and that there is no analytical description of the information at the acoustic or phonetic level characterizing a given level of quality. In fact, the human subjective perception is used as a baseline (MOS), and an automatic approach (PESQ) is used to match some signal differences with the MOS scale.

In the field of voice disorder assessment, a general approach very similar to the one used in telecommunications is followed. Human experts are committed to evaluating the quality of speech samples at the perceptual level, generally implying approaches based on the expertise of researchers and practitioners. Three main drawbacks of this scheme can be highlighted: the assessment remains subjective, costly (if an expert jury is involved), and not analytical, that is, the judgment may be global or not based on a standardized set of criteria. As opposed to the telecommunication area, for which a standard scale (MOS) is proposed, only very few assessment scales [3–5] are to be found in the pathological field; these are generally accepted but not really considered as a standard, due to the large diversity of pathological voice disorders and to the intrinsic difficulty of characterizing some pathologies.

This paper addresses the three points highlighted previously by proposing a general approach based on both human expertise and an automatic voice classification approach. The proposed scheme makes it possible to automate the voice quality estimate, like PESQ, and to move from a subjective to an objective approach. Moreover, its most interesting part is to use the automatic approach in order to support the human expertise by highlighting some specific acoustic aspects of the addressed pathology or class of pathologies. Figure 1 presents the first part of the proposed scheme. The automatic voice classification system is fed with the pathological voice examples associated with the perceptual labels given by the human experts. A feedback loop is proposed to assess the ability of the automatic system in the classification task. Of course, several iterations involving inputs from machine learning experts are needed to obtain a satisfactory system.

Figure 2 illustrates the second part of the proposed scheme. Here, the automatic voice classification system is applied on a set of voice examples to produce analytical information. This information is given to the experts through a second feedback loop and associated with statistical measures and voice excerpts. It allows the experts to listen to and/or manually analyze only small parts of a large speech database in order to assess the interest of a given piece of information. Depending on the previous results, the experts, together with machine learning specialists, could change the feature selection and allow the system to output targeted information.

This scheme can be applied to any kind of pathology under a couple of main constraints: (1) enough expert knowledge is available (to seed the automatic classification system), and (2) a good/large enough corpus is available.

In this paper, we focus on dysphonia, an impairment of the voice, for two main reasons. Firstly, dysphonia respects the constraints reported previously. Secondly, even if dysphonia is often considered a "minor" trouble linked to an esthetic point of view, this pathology has a drastic impact on the patient's quality of life. An explanation for this underestimation of dysphonia is that voice quality is generally described as a paralinguistic phenomenon with little impact on communication. However, the social relevance and economic impact of voice disorders are now obvious, especially for school teachers and other professionals who use their voice as a primary tool of trade. For instance, a recent study [6] has revealed that 10.5% of teachers are clearly suffering from voice disorders, while several enquiries [7] show that the voice is the primary tool for about 25% to 33% of the working population. In addition to the medical and professional consequences, some voice disorders also have severe consequences regarding social activities and interaction with others. This is the reason why voice therapy is an important issue in social, economic, and clinical contexts, and, among voice therapy activities, voice assessment is an important part of this clinical and scientific challenge.

A large set of methods can be used to assess voice disorders, such as discussion with the patient, endoscopic examination of the larynx, postural behavior of the patient [8], psychological and behavioral profile [9], self-evaluation such as the Voice Handicap Index questionnaire [10], perceptual judgment [11], or instrumental assessment [12]. It is preferable to increase the fields of observation in order to take the multidimensional aspect of spoken communication into account. Indeed, an assessment method taken individually appears as a reduced view of the voice disorder and provides only a part of the truth.

The perceptual dimension of voice is an essential aspect of vocal evaluation, as speech and voice are produced to be perceived. Evaluating voice without studying its impact on listeners amounts to losing its "raison d'être". Moreover, the majority of dysphonic speakers decide to consult a practitioner when their entourage hears changes in their vocal production, based on perceptive impressions only. In the same way, practitioners appreciate therapeutic results mainly by listening to the patients' voice: auditory perception is the first and the most accessible method to evaluate vocal quality. Lastly, the human being and his/her perceptive system are powerful at decoding speech [12]. However, the perceptual judgment remains a controversial method because of various drawbacks, notably its subjectivity [13, 14].

The multiparametric instrumental analysis represents an alternative solution to quantify vocal dysfunctions [15]. Methods can be based on acoustic measurements but also on aerodynamic parameters or electrophysiological signals. These measurements are carried out on vocal productions with sensors designed to record and compute multiple parameters issued from speech production. The majority of studies in this domain outlines the necessity of combining various complementary measures in order to take the multidimensional properties of vocal production into account [15–20]. In a recent study [21], we have applied a perceptual assessment (GRBAS scale [3]) on 449 voice samples, including 391 patients with a voice disorder, recorded in the ENT Department of the Timone University Hospital Center in Marseille (France). Concurrently, on the same cohort of patients, an instrumental voice analysis was carried out using the EVA workstation (SQLab-LPL, Aix-en-Provence, France). The subjects were instructed to pronounce three consecutive sustained vowels and several consecutive /pa/. For more than 80% of this population, the grade proposed by the perceptual and instrumental assessments was concordant, which was considered an acceptable result by our practitioners for clinical use. It is important to notice that the state of the art of such instrumental approaches allows only nonnatural, noncontinuous speech materials, whereas studying the different phenomena on natural continuous speech is of large interest.

This limitation encourages the authors to take an interest in automatic speaker and speech recognition techniques for dysphonic voice characterization. Indeed, in addition to analytical instrumental assessment approaches, another kind of method, mainly drawing upon both the automatic speech processing and pattern recognition domains, has been proposed in the literature.



Figure 1: Objective assessment of voice quality based on an automatic classification approach and the human expertise.


Figure 2: The “automatic system to human experts/knowledge” feedback loop.

Mostly dedicated to voice disorder detection, these approaches rely on automatic acoustic analyses such as spectral [22, 23], cepstral [24–27], or multidimensional acoustic analysis inspired by the analytic instrumental assessment (F0, jitter, shimmer, or Harmonics-to-Noise Ratio) [25, 28–30], combined with automatic classifiers based on Linear Discriminant Analysis (LDA) [28, 29], Hidden Markov Models (HMM) [24, 28], Gaussian Mixture Models (GMM) [24, 26], Support Vector Machines (SVM) [31], or Artificial Neural Networks (ANN) [23, 24, 30].

Compared with the analytic instrumental assessment methods described previously, the originality and interest of these automatic classification-based assessment approaches are: (1) the ability to analyze continuous speech close to natural elocution; (2) the ability to process a large set of data, allowing large-scale studies and statistically significant results; (3) a simple and automatic acoustic analysis providing an easy-to-use and noninvasive tool for clinical use.

In this paper, we present a complete approach based on the "back-and-forth" methodology we have just presented. Section 2 is dedicated to the dysphonic voice classification system. Section 3 describes the experimental protocols as well as an experimental validation of the first part of the method: the objective assessment of dysphonic voices. The next section presents the core of our method, which aims to gather new knowledge about dysphonia. Finally, Section 5 concludes this paper and presents some future works.

2. Dysphonic Voice Classification System

The system presented below belongs to the automatic system-based assessment approaches previously defined and is involved in the "back-and-forth" methodology. The principle retained here is to adapt a state-of-the-art speaker recognition system to the dysphonic voice classification task. A speaker recognition system can be seen as a supervised classification process able to assign speech signals to classes. A class of signals generally belongs to a given speaker and is modeled using a set of examples from the latter. In


some cases, a composite class may be necessary (associated with several speakers); it can be modeled either by grouping several classes modeled independently or by modeling a unique composite class on all the signals of the speakers belonging to it. Two adaptation levels are necessary to suit a speaker recognition system to the dysphonic voice classification task. Firstly, a class no longer corresponds to a given speaker but to a specific pathology or to a severity grade of this pathology. The class is then modeled using data from a set of speakers affected by the corresponding pathology or severity grade. Obviously, the voices used for training a pathological class cannot be included in the set of tested voices, in order to differentiate pathology detection from speaker recognition. The second adjustment to apply to the speaker recognition system concerns the representation of speech data, which can be optimized for the voice disorder classification task.

The speaker recognition technique used in this study is based on statistical Gaussian Mixture modeling, which remains one of the state-of-the-art solutions for speaker recognition [32]. This approach consists of three phases:

(i) a parameterization phase;

(ii) a modeling phase;

(iii) a decision/classification phase.

2.1. Parameterization Phase. The parameterization phase extracts relevant information from the speech signal. Here, it is based on a short-term spectral analysis resulting in 24 frequency spectrum coefficients, performed as follows. The speech signal, sampled at 16 kHz, is first pre-emphasized by applying a filter that enhances the high frequencies of the spectrum, which are generally attenuated in speech production. This filter is defined as x'(t) = x(t) − k · x(t − 1), with k fixed empirically at 0.95. The speech signal is then windowed using a 20 millisecond Hamming window, shifted at a 10 millisecond rate. The goal of the Hamming window is to reduce side effects. A Fast Fourier Transform (FFT) is then applied locally on each window (512 points) and the FFT modulus is computed, leading to a power spectrum. This power spectrum is multiplied by a filterbank (a series of bandpass frequency filters) in order to extract the envelope of the spectrum. Here, 24 triangular filters are used. Depending on the experiment, they are either spaced linearly over the 8 kHz frequency band (referred to as LFSC, standing for Linear Frequency Spectrum Coefficients, in this paper), or spaced according to a MEL scale (referred to as MFSC, standing for Mel Frequency Spectrum Coefficients), well known to be closer to the frequency scale of the human ear.

The feature vectors issued from this analysis, at a 10 millisecond rate, are finally normalized to fit a zero-mean and unit-variance distribution, coefficient by coefficient (means and variances are estimated on the non-silence portions of the signal). Classically, this normalization is employed to reduce the effect of the recording channel and facilitates the subsequent statistical processing.

The LFSC/MFSC computation is done using the (GPL) SPRO toolkit [33]. Finally, the feature vectors can be augmented by adding dynamic information representing the way these vectors vary in time. Here, the first and second derivatives of the static coefficients are considered (also named Δ and ΔΔ coefficients), resulting in 72 coefficients.
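To make the front-end concrete, the sketch below (in Python/NumPy, not the SPRO toolkit actually used by the authors) reproduces the processing chain described above: pre-emphasis, 20 ms Hamming windows with a 10 ms shift, a 512-point FFT power spectrum, a 24-filter triangular filterbank with linear (LFSC) or Mel (MFSC) spacing, per-coefficient mean/variance normalization, and optional Δ/ΔΔ appending. Function and parameter names are illustrative, and the silence detection used for the normalization statistics is omitted; this is a minimal sketch, not the authors' implementation.

```python
import numpy as np

def lfsc_features(signal, sr=16000, n_filters=24, win_ms=20, hop_ms=10,
                  preemph=0.95, n_fft=512, mel=False):
    """Sketch of the LFSC/MFSC front-end described in Section 2.1."""
    # 1. Pre-emphasis: x'(t) = x(t) - k * x(t-1), with k = 0.95
    x = np.append(signal[0], signal[1:] - preemph * signal[:-1])

    # 2. 20 ms Hamming windows, shifted every 10 ms
    win, hop = int(sr * win_ms / 1000), int(sr * hop_ms / 1000)
    n_frames = 1 + max(0, (len(x) - win) // hop)
    frames = np.stack([x[i * hop:i * hop + win] * np.hamming(win)
                       for i in range(n_frames)])

    # 3. 512-point FFT modulus -> power spectrum
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2            # (n_frames, n_fft/2 + 1)

    # 4. 24 triangular filters, linearly spaced (LFSC) or Mel spaced (MFSC)
    hz2mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel2hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    if mel:
        edges = mel2hz(np.linspace(0.0, hz2mel(sr / 2), n_filters + 2))
    else:
        edges = np.linspace(0.0, sr / 2, n_filters + 2)
    bins = np.floor((n_fft / 2) * edges / (sr / 2)).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c + 1] = np.linspace(0.0, 1.0, c - l + 1)
        fbank[i, c:r + 1] = np.linspace(1.0, 0.0, r - c + 1)
    feats = power @ fbank.T                                     # (n_frames, 24)

    # 5. Zero-mean / unit-variance normalization, coefficient by coefficient
    feats = (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-12)
    return feats

def add_deltas(feats, width=2):
    """Append Delta and Delta-Delta coefficients (24 -> 72 dimensions)."""
    def deriv(f):
        pad = np.pad(f, ((width, width), (0, 0)), mode="edge")
        num = sum(n * (pad[width + n:width + n + len(f)] - pad[width - n:width - n + len(f)])
                  for n in range(1, width + 1))
        return num / (2 * sum(n * n for n in range(1, width + 1)))
    d1 = deriv(feats)
    d2 = deriv(d1)
    return np.hstack([feats, d1, d2])
```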

2.2. Modeling Phase. Classically, this phase aims to estimate models of the targeted classes, such as individual speakers in speaker recognition. In this paper, the models represent either a set of pathological/normal voices or a set of voices related to a specific severity grade.

Modeling relies on Gaussian Mixture Models (GMM) and estimation techniques drawing upon the speaker recognition domain. In this context, a set of D-dimensional feature vectors, denoted by X = x1, . . . , xT, is represented by a weighted sum of M multidimensional Gaussian distributions. Each distribution is defined by a D × 1 mean vector μi, a D × D covariance matrix Σi, and a weight wi of the distribution inside the mixture. The set of distributions and related parameters, also called the Gaussian Mixture Model, is denoted λ = (wi, μi, Σi), i = 1, . . . , M. The modeling phase consists in estimating all these parameters from training data.

In speaker recognition, a two-step modeling is typically applied to estimate the model parameters and to improve their robustness, especially when only a small amount of training data is available for some specific classes, as follows.

(i) The parameters of a GMM are first estimated on a large amount of speech data issued from a generic population of speakers. This generic speech model, also called the Universal Background Model (UBM), tends to represent the speaker-independent space of acoustic features. It is generally trained using the iterative Expectation-Maximization (EM) algorithm [34] associated with the Maximum Likelihood (ML) criterion.

(ii) A speaker model is then derived from this UBM by means of adaptation techniques such as MAP [35], using the small amount of training data available for the given speaker. In practice, only the mean parameters are updated, while the covariance matrices and distribution weights generally remain unchanged, directly inherited from the UBM. The mean adaptation relies on a combining function involving mean values issued from both the UBM and the speaker training data (a minimal sketch is given below).
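The following sketch illustrates mean-only MAP adaptation in the spirit of [35], assuming a diagonal-covariance UBM and a relevance factor (here 16); the variable names and the relevance value are illustrative and are not taken from the paper.

```python
import numpy as np
from scipy.stats import multivariate_normal

def map_adapt_means(ubm_w, ubm_mu, ubm_var, X, relevance=16.0):
    """Mean-only MAP adaptation of a diagonal-covariance GMM-UBM.

    ubm_w   : (M,)   mixture weights
    ubm_mu  : (M, D) means
    ubm_var : (M, D) diagonal covariances
    X       : (T, D) training frames of the target class (pathology or grade)
    """
    M = len(ubm_w)
    # Responsibility gamma[t, i] of each UBM component for each frame
    log_like = np.stack([
        multivariate_normal.logpdf(X, ubm_mu[i], np.diag(ubm_var[i]))
        for i in range(M)], axis=1) + np.log(ubm_w)
    gamma = np.exp(log_like - log_like.max(axis=1, keepdims=True))
    gamma /= gamma.sum(axis=1, keepdims=True)

    n_i = gamma.sum(axis=0)                                    # soft counts
    E_i = (gamma.T @ X) / np.maximum(n_i[:, None], 1e-10)      # data means
    alpha = n_i / (n_i + relevance)                            # adaptation coefficients
    # Adapted means; weights and covariances are kept from the UBM
    return alpha[:, None] * E_i + (1.0 - alpha[:, None]) * ubm_mu
```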

In this paper, the same scheme is applied because of the small amount of training data available for pathological and control speech (see Section 3.1 for more details on the corpus). The UBM parameters are estimated on a French read-speech corpus composed of 76 female speech utterances of 2 minutes each. This female population is extracted from the BREF corpus [36], which is entirely separate from the dysphonic corpus and from the targeted task. All the GMM models are composed of 128 Gaussian components with diagonal covariance matrices.

Regarding the dysphonic voice classification task, one GMM is estimated per targeted class of information (for instance, one GMM per grade of dysphonia severity).


2.3. Decision/Test Phase. During this phase, a set of new feature vectors Y = y1, . . . , yT, associated with an incoming speech signal, is presented to the system and compared with one or several GMM models λ. This comparison consists in computing the averaged frame-based likelihood, denoted L(Y | λ), as follows:

\[
L(Y \mid \lambda) = \frac{1}{T}\sum_{t=1}^{T} L(y_t \mid \lambda) \tag{1}
\]

with

\[
L(y_t \mid \lambda) = \sum_{i=1}^{M} w_i \cdot L_i(y_t),
\qquad
L_i(y_t) = \frac{1}{(2\pi)^{D/2}\,|\Sigma_i|^{1/2}}
\exp\left\{-\frac{1}{2}\,(y_t-\mu_i)^{T}\,\Sigma_i^{-1}\,(y_t-\mu_i)\right\}, \tag{2}
\]

where Li(yt) represents the likelihood of frame yt according to the ith Gaussian distribution of the model λ = (wi, μi, Σi), i = 1, . . . , M.

In the context of dysphonic voice classification, the classification decision is made by selecting the GMM model λ, and consequently the class of information associated with it, for which the largest likelihood measure is obtained given the incoming speech signal Y.
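A minimal sketch of this decision rule is given below. It works in the log domain for numerical stability (a standard equivalent of Equations (1)-(2)); the model container format is an assumption made for illustration only.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def avg_frame_loglik(Y, w, mu, var):
    """Average per-frame log-likelihood of frames Y under a diagonal GMM
    (log-domain counterpart of Eq. (1)-(2))."""
    comp = np.stack([
        multivariate_normal.logpdf(Y, mu[i], np.diag(var[i]))
        for i in range(len(w))], axis=1) + np.log(w)
    return logsumexp(comp, axis=1).mean()

def classify(Y, class_models):
    """class_models: dict mapping a class label (e.g. a grade) to (w, mu, var).
    Returns the label whose model gives the largest averaged likelihood."""
    scores = {label: avg_frame_loglik(Y, *m) for label, m in class_models.items()}
    return max(scores, key=scores.get)
```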

3. Experimental Protocol

The results provided in the rest of the paper are expressed in terms of correct classification rates (named CCR). (In the tables, the number of well-classified voices is also provided in brackets.) For indication, 95% confidence intervals are provided for the overall CCR only, given the small number of tests available from the corpus used and despite the precautions taken by the authors (see Sections 3.1 and 3.2). Finally, it has to be pointed out that all these results are issued from the GMM classifier and have to be interpreted from a statistical viewpoint.

3.1. Corpus. The corpus used in this study, called DV, is composed of read speech pronounced by a set of dysphonic subjects, mostly affected by nodules, polyps, oedemas, and cysts, as well as by a control group. The subjects' voices are classified according to the G criterion of Hirano's GRBAS scale [3], where a normal voice is rated as grade 0, a slight dysphonia as 1, a moderate dysphonia as 2 and, finally, a severe dysphonia as 3. The choice of the G criterion was driven by two main reasons: (1) it refers to a global quality judgment, as opposed to the other criteria (RBAS), which is more suitable regarding the type of parameterization used in this work; (2) like the R and B criteria, it is more robust to intra- and interlistener variability.

The corpus was supplied by the ENT Department of the Timone University Hospital Center in Marseille (France). It is composed of 80 voices of females aged 17 to 50 (mean: 32.2). The speech material was obtained by reading the same short French text, whose signal duration varies from 13.5 to 77.7 seconds (mean: 18.7 seconds). The 80 voices are equally balanced across the four GRBAS perceptual grades (20 voices each), which were determined by a jury composed of three expert listeners. This perceptual judgment was carried out by consensus between the different jury members, as this is the usual way voice quality is assessed by our therapist partners. The judgment was done during one session only.

This corpus is used for all the experiments presented in this paper. Because of its small size, care has been taken to ensure the statistical significance of the results over all the experiments by applying specific methods such as the leave-one-out technique.

3.2. Leave-One-Out Technique. As shown in Section 2.2, the training data used to learn the models of the pathological classes have to be separated from the testing data. In other words, speakers included in the training set should not be present in the testing set. As the DV corpus is relatively small (80 voices), it is not well suited to being split into two separate subsets. Consequently, special protocols have been designed for the different classification tasks (Task1 and Task2) in order to respect this constraint while providing more statistically significant results. These protocols rely on the leave-one-out technique, which consists in discarding a speaker, denoted x, from the experimental set, learning the models on the remaining data, and testing the data of speaker x with these models. This scheme is repeated until a sufficient number of tests is reached.
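The loop below is a minimal sketch of this leave-one-out evaluation; the voice objects, their grade attribute, and the helper functions train_fn and score_fn are hypothetical placeholders for the grade-model training and scoring steps described in Section 2.

```python
def leave_one_out_ccr(voices, train_fn, score_fn):
    """Leave-one-out evaluation: each voice is scored with models trained on
    all the remaining voices (train_fn/score_fn are hypothetical helpers)."""
    correct = 0
    for held_out in voices:
        remaining = [v for v in voices if v is not held_out]
        models = train_fn(remaining)            # e.g. one GMM per grade
        predicted = score_fn(held_out, models)  # argmax of averaged likelihoods
        correct += (predicted == held_out.grade)
    return 100.0 * correct / len(voices)        # correct classification rate (%)
```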

3.3. Task1-Protocol P1. Task1 consists in determining whether a given voice is normal or dysphonic. Consequently, two different GMM models have to be estimated: a normal model, trained on a subset of G0 voices, and a dysphonic model, trained on a voice subset equally balanced across the G1, G2 and G3 grades.

In accordance with the leave-one-out approach and with this grade balancing, various voice subsets excluding the testing voice, composed of 18 voices each, are available to estimate both the normal and the dysphonic GMM models. In the dysphonic case, these subsets are built randomly, including 6 voices per grade, under the constraint that every voice is used at least once.

For testing, each individual voice available in the DV corpus is first compared to all the normal voice models (from which it has been discarded if it is normal), resulting in an averaged normal voice likelihood, and is then compared to all the dysphonic voice models (from which it has been discarded if it is dysphonic), resulting in an averaged dysphonic voice likelihood. The decision for an individual voice relies on the maximum of this pair of likelihoods.

3.4. Task2-Protocol P2. Task2 consists in assessing a given voice according to the G criterion of the GRBAS scale. Four classes and the corresponding models (one per grade: λG0, λG1, λG2 and λG3) are in competition in the system. In this context,


Figure 3: Number of voices correctly classified from the 4-G classification following 1000 Hz-width frequency subbands (24 LFSC). (Four panels, (a) Grade 0 to (d) Grade 3; each panel shows, for the eight 1000 Hz frequency ranges from [0-1] to [7-8] kHz, the number of correctly classified speakers out of 20.)

the 20 normal voices (G0-rated voices) and the 60 dysphonic ones (G1-, G2- and G3-rated voices) available in the DV corpus are used as follows.

(i) All the subsets of 19 voices among the G0 set are used to estimate one model each, denoted λG0^(−Y), with Y the discarded voice. The same process is applied to the sets G1, G2 and G3. This results in 20 different models available per grade.

(ii) When a voice Y, perceptually labeled as grade i, is tested, Y is first compared to the model λGi^(−Y), leading to the likelihood L(Y | λGi^(−Y)). Then, an averaged likelihood is computed for each of the other grades (different from i), using the grade-dependent model sets (average over the 20 likelihoods per grade).

(iii) The decision relies on the maximum of the four likelihoods.

3.5. Validation. To evaluate the quality of the classification system, on which the subsequent experimental results will depend, Tables 1 and 2 provide its intrinsic performance for protocols P1 and P2, respectively. In addition to the 24 LFSC-based parameterization, which will be used in the next sections, the performance of a second system configuration, based on 72 MFSC, is provided (see Section 2.1 for details of these parameterizations). This second configuration aims to illustrate the potential of the automatic system when more complex and relevant information, such as the joint use of static and dynamic features, is extracted from the speech signal.

As expected, the 72 MFSC-based configuration shows the best classification performance regardless of the protocol, taking benefit of the more complex information. Focusing on protocol P2, a large confusion can be observed between grades 1 and 2, whatever the configuration used and despite the more relevant information involved in the 72 MFSC. This confirms the need for a better


Table 1: Performance of the normal and dysphonic voice classification (Task1) expressed in terms of correct classification rate (CCR in %) as well as the number of succeeded tests (in parentheses) on the DV corpus, according to two parameterization configurations (24 LFSC and 72 MFSC). Confidence intervals (CI) are provided for the overall scores.

System   | Normal    | Dysphonic | Overall   | ±CI
24 LFSC  | 95.0 (19) | 66.7 (40) | 73.8 (59) | 9.7
72 MFSC  | 95.0 (19) | 91.7 (55) | 92.5 (74) | 5.8

understanding of the acoustic phenomena related to dysphonia and to its different levels of severity.

4. Knowledge Gathering

The goal of this section is to describe how the automatic classification results allow relevant knowledge to be gathered for an in-depth and refined analysis by the human experts. To this end, the automatic system is first combined with a frequency subband analysis. The aim of this subband-based analysis is to study how the acoustic characteristics of the phenomena linked to dysphonia are spread over the different frequency bands depending on the severity level; in other words: "is one frequency subband more discriminant than another for dysphonic voice classification?" In a second step, this subband analysis is coupled with a phonetic analysis to help refine the observations.

All the experiments have been conducted following protocol P2. Despite its lower performance, the 24 LFSC-based parameterization was preferred to the 72 MFSC for all the following experiments, for two main reasons. Firstly, the use of linear filters is more straightforward in this context and facilitates the comparison between individual subbands. Secondly, the goal of the following experiments is to examine the acoustic phenomena related to dysphonia in the speech signal through the classification task, rather than to improve the intrinsic performance of the automatic system.

4.1. Frequency Subband Analysis. The subband-based analysis consists in cutting the frequency domain into subbands that are processed independently. The main motivation of this approach resides in the assumption that the relevance of frequency information can depend on the band of frequencies considered. For example, [37] shows that, for the automatic speaker recognition task, some subbands seem to be more relevant than others to characterize speakers. In the same way, subband architecture-based approaches have been used for automatic speech recognition in adverse conditions, since subbands may be affected differently by noise [38].

In this context, the full frequency band 0–8000 Hz is first split into individual fixed-width subbands (1000 Hz width), on which the automatic classification system (described in Section 2) is then applied. According to the performance observed on the individual subbands, larger subbands are then investigated.
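The paper does not detail the exact mechanism used to restrict the system to a subband. One plausible implementation, shown in the sketch below under that assumption, is to keep only the LFSC coefficients whose linearly spaced filter centres fall inside the requested subband, and then to re-train and re-test the GMM classifier on these reduced feature vectors; names and the centre-frequency approximation are illustrative.

```python
import numpy as np

def subband_coefficients(lfsc, band_hz, full_band=(0.0, 8000.0), n_filters=24):
    """Keep only the LFSC coefficients whose (linearly spaced) filter centres
    fall inside the requested frequency subband, e.g. (0, 3000) or (5400, 8000)."""
    lo, hi = full_band
    centres = np.linspace(lo, hi, n_filters + 2)[1:-1]   # centre of each triangular filter
    keep = (centres >= band_hz[0]) & (centres < band_hz[1])
    return lfsc[:, keep]
```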

4.1.1. 1000 Hz Subband Performance. In this first experiment, eight 1000 Hz-width subbands are processed individually through the classification system. The classification performance is presented per subband in Figure 3. Three main trends can be pointed out.

(i) The frequency bands between 0 and 3000 Hz obtain the best performance, with an overall CCR varying from 55% to 70%.

(ii) The frequencies between 3000 and 5000 Hz exhibit the worst overall performance. Only the normal voices (grade 0) obtain a satisfactory score of 65% CCR, despite a loss of performance compared with the full band (85% CCR). On the other hand, a strong confusion can be observed for the dysphonic voices, leading to very low scores (20% CCR).

(iii) The frequencies above 5000 Hz provide better overall performance than the 3000 to 5000 Hz subbands, even though most of the classification errors are scattered over the grades, still demonstrating a large confusion. In contrast, it can be observed that severe dysphonic voices (grade 3) are well classified in both subbands between 5000 and 7000 Hz (70% CCR) and in the 7000–8000 Hz subband (80% CCR, best score).

Therefore, considering the 1000 Hz-width frequency bands individually highlights (1) some difficulties in classifying the grade 2 voices whatever the individual subband considered, (2) the ability of the low frequencies to discriminate most of the voices, except the grade 2 voices, and (3) the "surprising" performance on the grade 3 voices in the high frequencies, especially in the 7000–8000 Hz subband, in which the amount of speech information is very low.

4.1.2. Joint Frequency Band Performance. This section focuses on the three frequency zones highlighted in the previous section. The automatic classification is now performed on the following frequency subbands: 0–3000 Hz, 3000–5400 Hz and 5400–8000 Hz, with the aim of taking benefit of the complementarity of the 1000 Hz-width subbands. The performance, reported in Table 3, shows that the behavior observed on the individual 1000 Hz-width subbands is emphasized here. Indeed, the 0–3000 Hz band (joining the first three 1000 Hz-width subbands) remains the most interesting frequency band, exhibiting an overall 71.25% CCR and achieving the best score for the grade 2 voices (65% CCR versus 50% for both the full band and the best individual subband, 1000–2000 Hz). Conversely, the 3000–5400 Hz band exhibits the lowest overall CCR (48.75%) compared with the other subbands. The confusion observed in the individual 1000 Hz-width subbands is still present, except for the grade 3 voices, which tend to take benefit of the


Table 2: Performance of the 4-G classification (Task2) expressed in terms of correct classification rates (CCR in %) as well as the number of succeeded tests (in parentheses) on the DV corpus, according to two parameterization configurations (24 LFSC and 72 MFSC). Confidence intervals (CI) are provided for the overall scores.

System   | Grade 0   | Grade 1   | Grade 2   | Grade 3   | Overall   | ±CI
24 LFSC  | 85.0 (17) | 55.0 (11) | 50.0 (10) | 70.0 (14) | 65.0 (52) | 10.5
72 MFSC  | 95.0 (19) | 65.0 (13) | 70.0 (14) | 85.0 (17) | 78.8 (63) | 9

Figure 4: Performance per grade in terms of correct classification rate (CCR %) considering the "All phonemes", consonant and vowel classes, for the 0–8000 Hz (first set of three columns per grade) and 0–3000 Hz (second set of three columns per grade) frequency bands.

complementarity of the individual subbands (65% CCR versus 25% and 20% for the corresponding individual 1000 Hz-width subbands). Finally, the 5400–8000 Hz band, related to the residual zone of fricative and plosive consonants, provides reasonable performance for the normal (65% CCR) and severe dysphonic voices (70% CCR). Regarding the speech information carried by this band, the CCR of the grade 3 voices may be explained by the noise resulting from the veiled (or breathy) quality of severe dysphonic voices. In contrast, it is more difficult to explain the behavior of the normal voices in this band, except by a lack of discriminant information compared with the other grades.

4.2. Frequency Band-Based Phonetic Analysis. To help understand and interpret the behavior of the automatic classification system in the 0–3000 Hz frequency band, the authors propose to pursue the observation of the classification system through a frequency band-based phonetic analysis. The performance of the classification system is thus analyzed per phoneme class and per frequency range (0–8000 Hz and 0–3000 Hz), in order to evaluate the impact that dysphonia may have on phonemes or phoneme classes in particular frequency bands according to the grades. This phonetic analysis is close to the "phonetic labeling" proposed in [39], in which a descriptive and perceptual study of the pathological characteristics of different phonemes is presented. To perform this frequency band-based phonetic analysis, a phonetic segmentation is necessary for each speech signal available in the DV corpus. This segmentation was extracted automatically by performing an automatic text-constrained phonetic alignment. The latter was produced by the LIA alignment system, based on Viterbi decoding


Figure 5: Performance per grade in terms of correct classification rate (CCR %) considering the voiced and unvoiced consonant classes (legend: consonants, voiced consonants, unvoiced consonants), for the 0–8000 Hz (first set of three columns per grade) and 0–3000 Hz (second set of three columns per grade) frequency bands.

Figure 6: Oscillogram and spectrogram for the extract "[· · · ] perdait toutes [· · · ]" /pErdetut/; grade 0 on the left (normal voice), grade 3 on the right (severe dysphonia). On the right, note (1) the abnormal extension of the voice onset time (pseudo VOT) on the unvoiced plosive /t/, the consonant being almost turned into a fricative (spirantization), and (2) the quasi-unvoiced consonant /d/ transformed into /t/.


and graph-search algorithms [40], a text-restricted lexicon of words associated with their phonological variants, and a set of 38 French phonemes. It is worth noting that the phonetic segmentation is coupled with the automatic dysphonic classification system for the decision step only. Indeed, for the classification tests and decision making, the averaged frame-based likelihood (see Section 2.3) between the incoming voice and the grade models is computed on the restricted set of segments associated with a targeted phoneme class. Conversely, the grade models are learned on all the phonemic material available per grade in the DV corpus, independently of the targeted phoneme class. Table 4 provides duration information for the targeted phoneme classes.
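The sketch below illustrates this decision-step restriction: the per-frame log-likelihoods computed against a grade model are averaged only over the frames that fall inside segments of the targeted phoneme class. The segment tuple format and the 10 ms frame rate mapping are assumptions for illustration; the alignment itself comes from the forced-alignment system described above.

```python
import numpy as np

def class_restricted_score(frame_logliks, segments, phoneme_class, hop_ms=10):
    """Average the per-frame log-likelihoods over the frames falling inside
    segments whose label belongs to the targeted phoneme class.

    frame_logliks : (T,) per-frame log-likelihoods under one grade model
    segments      : list of (start_s, end_s, label) from the forced alignment
    phoneme_class : set of phoneme labels, e.g. the unvoiced consonants
    """
    mask = np.zeros(len(frame_logliks), dtype=bool)
    for start, end, label in segments:
        if label in phoneme_class:
            first = int(start * 1000 / hop_ms)
            last = int(end * 1000 / hop_ms)
            mask[first:last] = True
    return frame_logliks[mask].mean() if mask.any() else float("-inf")
```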

Figure 4 compares the performance of the overall phoneme set (denoted "All phonemes" in the figure) with that of the consonant and vowel classes, for the 0–8000 Hz and 0–3000 Hz frequency bands. Figure 5 focuses on consonant performance, comparing the voiced and unvoiced classes.

Comparing the vowel and consonant classes (Figure 4) on the 0–8000 Hz band only, it can be observed that consonants slightly outperform vowels for grade 0 (80 against 70% CCR), for grade 2 (50 against 40% CCR), and for grade 3 (85 against 60% CCR). The best performance is reached by the vowel class for grade 1 only, even if the difference with the consonant class is slight (55 against 50% CCR). Therefore, consonants tend to be more efficient than vowels in this context for discriminating dysphonia severity grades, although the former contain both voiced and unvoiced components. Along the same lines, Figure 5 shows that the unvoiced consonants outperform the voiced ones for both grade 0 (80 against 70% CCR) and grade 1 (60 against 45% CCR), as opposed to grade 2, for which the voiced consonants obtain 60% CCR against 55% for the unvoiced consonants. Both reach 75% CCR for grade 3.

Considering now the 0–3000 Hz frequency band, Figure 4 shows that consonants outperform vowels for both grade 1 (55 against 30% CCR) and grade 3 (75 against 55% CCR), while reaching similar performance for grades 0 (90 against 85% CCR) and 2 (70% CCR for both). Thus, while a similar behavior can be observed for grades 0 and 3 when comparing the two frequency bands, the performance of the consonant and vowel classes becomes equal for grade 2 on the 0–3000 Hz band. Only the behavior of the consonant and vowel classes for grade 1 is quite different, since the performance of the vowels decreases largely on the 0–3000 Hz band. This tends to indicate that confusion with the other grades is much higher for grade 1 when only the first formants of the vowels (present in the 0–3000 Hz band) are considered. Regarding Figure 5, the behavior of the unvoiced consonants is rather similar for grades 0 and 3 when comparing the two frequency bands. Conversely, the behavior for grades 1 and 2 is quite different, since the unvoiced consonants reach 35% CCR for grade 1 on the 0–3000 Hz band against 60% CCR on the 0–8000 Hz band, and 75% against 50% CCR for grade 2. Therefore, the 0–3000 Hz frequency band seems to increase the confusion of grade 1 with the other grades when only the unvoiced consonants are considered.

4.3. Discussion. The progression of the experiments based on the automatic classification system reported above, from the subband to the phonetic analysis, tends to underline the relevance of the unvoiced consonants in the discrimination of the GRBAS grades of dysphonia. This observation is rather unexpected given the definition of dysphonia. Indeed, studies reported in the literature generally focus on voiced components, because they are directly affected by pathologies related to the glottal source. For instance, sustained vowels are extensively associated with perceptual or objective approaches in the literature, since they make the assessment or measurement of parameters directly linked to the vocal source easier. The relatively high performance of the unvoiced consonants exhibited in this paper tends to highlight that these components can be of interest for assessing the severity grade of dysphonia, similarly to the voiced components. An interesting hypothesis for this observation would be that the consequences of dysphonia on vowel production may also impact the production of the unvoiced consonants, considering Vowel-Consonant (VC) or Consonant-Vowel (CV) contexts.

4.4. From Automatic Classification to Expertise: Preliminary Results on Prolonged Voice Onset Time. The previous sections have raised several interesting observations requiring further analysis by the human experts. As dysphonia is a laryngeal disorder, the rather good performance reached on the unvoiced consonants by the automatic classifier was quite unexpected. For this reason, the data were analyzed manually, focusing first on the unvoiced plosives (Figure 6). By verifying the automatic boundaries of the plosives, a lengthening of the voice onset time with the dysphonia severity was highlighted.

Voice onset time (VOT) is the duration between the release of a plosive and the beginning of the vocal fold vibration. This duration can be indicative of the speaker's capacity to coordinate his/her articulatory and phonatory organs. For instance, during the production of the sequence /pa/, the speaker must control the release of the lips, which creates a burst, followed by the vibration of the vocal cords to produce the vowel. However, a deregulation can involve a lengthening or a shortening of this duration because of peripheral biomechanical constraints, or when the motor control of the laryngeal vibration is delayed or anticipated with respect to the opening gesture of the consonant. This deregulation could also appear if the speaker does not have well-tuned pneumophonatory control, for instance in dysphonia without laryngeal pathology. Abnormal VOT has been studied in second-language learning [41], aphasia or apraxia of speech [42], dysarthria [43], stuttering [44], dysphagia [45], and spasmodic dysphonia [46]. To confirm the observed VOT lengthening, VOT was measured on 865 unvoiced plosives (161 /p/, 244 /k/, 460 /t/) present in the French text uttered by the 80 female speakers of the DV corpus. For the statistical analysis, the "R" software v.2.6.2, a language and environment for statistical computing (http://www.R-project.org), was used, associated with a linear mixed model [47]. The latter is a powerful model class used for the analysis of grouped


Table 3: Performance of the 4-G classification following joint frequency subbands in terms of correct classification rates (CCR in %) as well as the number of succeeded tests (in parentheses), 24 LFSC. Confidence intervals (CI) are provided for the overall scores.

Band          | Grade 0   | Grade 1   | Grade 2   | Grade 3   | Overall    | ±CI
Full band     | 85.0 (17) | 55.0 (11) | 50.0 (10) | 70.0 (14) | 65.00 (52) | 10.5
0–3000 Hz     | 90.0 (18) | 65.0 (13) | 65.0 (13) | 65.0 (13) | 71.25 (57) | 10
3000–5400 Hz  | 65.0 (13) | 40.0 (8)  | 25.0 (5)  | 65.0 (13) | 48.75 (39) | 11
5400–8000 Hz  | 65.0 (13) | 35.0 (7)  | 45.0 (9)  | 70.0 (14) | 53.75 (43) | 11

Table 4: Total duration in seconds per phonetic class and per grade, as well as the number of phonemes (nb), their average duration (μ) and the associated standard deviation (σ).

Phonetic class | G0     | G1     | G2     | G3     | nb    | μ     | σ
Consonant      | 135.13 | 139.21 | 149.83 | 167.28 | 6395  | 0.092 | 0.045
Liquid         | 34.56  | 34.01  | 36.04  | 43.03  | 2181  | 0.068 | 0.033
Nasal          | 29.72  | 30.17  | 31.85  | 33.42  | 1279  | 0.098 | 0.039
Fricative      | 31.77  | 32.32  | 35.07  | 40.70  | 1144  | 0.122 | 0.057
Occlusive      | 39.08  | 42.71  | 46.87  | 50.13  | 1791  | 0.100 | 0.039
Vowel          | 103.58 | 98.77  | 103.46 | 109.79 | 5586  | 0.074 | 0.046
Oral           | 84.37  | 80.45  | 85.22  | 93.66  | 4862  | 0.071 | 0.044
All phonemes   | 241.51 | 240.96 | 256.66 | 280.52 | 12140 | 0.084 | 0.046

data such as the repeated observations of a speaker available in this study. A key feature of mixed models is that they address multiple sources of variation by introducing random effects in addition to fixed effects; in other words, in this context they take both within- and between-subject variation into account. The model VOT = f(GRADE) was studied here, where GRADE is a 4-level ordered factor used as regressor and SPEAKER is a random effect (intercept only). The statistical analysis, depicted in Figure 7, shows that the linear component of the GRADE factor is significant (P = .0001), with no significant quadratic or cubic effect. The main result is that VOT increases significantly with the dysphonia level. This result can be explained by the difficulty encountered by dysphonic speakers in correctly initiating the vocal fold vibration. It confirms the phenomenon observed manually on the attack of sustained vowels in [48], which showed the importance of the vowel onset for identifying the dysphonia severity perceptually. Of course, this study needs to be pursued by observing other kinds of data. To conclude this set of experiments, we can point out the interest of the automatic system for highlighting features that cannot be observed or expected a priori by the human experts, especially on continuous speech.
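The analysis in the text was run in R with a linear mixed model [47]. As a hedged illustration only, the sketch below shows an analogous random-intercept model in Python with statsmodels; the CSV file and column names are hypothetical, and grade is treated here as a numeric regressor, that is, only the linear trend reported as significant in the text is modeled.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data frame with one row per measured VOT:
# columns: vot (seconds), grade (0..3), speaker (identifier)
df = pd.read_csv("vot_measurements.csv")

# Random-intercept mixed model, analogous to the R analysis described above:
# fixed effect of grade on VOT, with a per-speaker random intercept.
model = smf.mixedlm("vot ~ grade", df, groups=df["speaker"])
result = model.fit()
print(result.summary())
```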

5. Conclusion

The work presented in this paper aims to show that machine learning approaches can help human experts to analyze more deeply the acoustic features linked to voice disorders. Initially, the described approach relies on the human expertise, which is necessary for feeding an automatic voice

Figure 7: VOT (in seconds) according to the dysphonia severity grade (0: normal, 3: severe dysphonia). (x-axis: grade 0 to 3; y-axis: VOT from 0.05 to 0.15 s.)

classification system. More precisely, the latter requires a set of voice samples associated with meaningful labels provided by the experts, such as pathologies and related perceptual grades. After several tuning and validation steps involving human experts (machine learning experts, phoneticians and pathologists), the voice classification system is able to model the initial knowledge related to the labels. In a second phase, the automatic classification system is used to determine the relevant information available in different parts of the input


speech recordings or exploited through different kinds of acoustic features. This second phase aims to formulate new hypotheses on voice disorders (more precisely, new hypotheses on the features characterizing voice disorders). Therefore, a new feedback loop between the automatic classification system and the human experts is mandatory. To assess the general methodology, this approach was tested through an automatic classification system of dysphonia severity grades (following the G criterion of the GRBAS scale). This paper presents in detail the automatic classification system, the class of original information it allows to be highlighted, as well as the first preliminary study related to the VOT carried out by the human experts.

Experiments based on this automatic classification system and conducted on a dysphonic corpus led to several interesting observations. First, the 0–3000 Hz frequency band achieved the best performance compared with the 3000–5400 Hz and 5400–8000 Hz bands, with an overall correct classification rate of about 71% (to be compared with 48% and 53%, respectively, for the other subbands) for the severity grade classification task. When the system was used to rank the usefulness of speech in terms of phonemic content, it was observed that consonants outperform vowels and, more surprisingly, that unvoiced consonants appeared to be very relevant for the classification task. Submitted to the human experts, these results led to a manual observation and analysis of the unvoiced consonants. Focused on the unvoiced plosives, this analysis highlighted a lengthening of the voice onset time (VOT) with the dysphonia severity. This observation was confirmed by a statistical analysis performed on 865 unvoiced plosives issued from the dysphonic corpus. This phenomenon can be intuitively explained by the difficulty encountered by dysphonic speakers in correctly initiating the vocal fold vibration. However, to the authors' knowledge, it had never been discussed from a scientific point of view. Even if this preliminary study on the VOT has to be pursued by observing, for instance, other classes of unvoiced consonants, the approach proposed in this paper has shown the potential of the back-and-forth loop between the automatic dysphonic voice classification system and the human experts. It should drive the latter towards a better understanding of the acoustic phenomena related to voice disorders in the speech signal. In addition to the validation of the VOT lengthening with the dysphonia severity levels, future work will be dedicated to bringing human expertise to bear on the potential of the unvoiced components for discriminating dysphonia severity grades. First studies will examine more complex phonemic contexts, such as Consonant-Vowel (CV) or Vowel-Consonant (VC), in order to determine whether vowel alterations due to dysphonia may have an impact on the adjacent unvoiced consonants. Once validated, this new knowledge will be analyzed by the machine learning experts for its potential integration in the automatic classification system.

Acknowledgment

This research was partially supported by COST Action 2103 "Advanced Voice Function Assessment".

References

[1] IEEE, "IEEE recommended practice for speech quality measurements," IEEE Transactions on Audio and Electroacoustics, vol. 17, no. 3, pp. 225–246, 1969.
[2] ITU-T Rec. P.862, "Perceptual evaluation of speech quality (PESQ), an objective method for end-to-end speech quality assessment of narrowband telephone networks and speech codecs," International Telecommunication Union, Geneva, Switzerland, February 2001.
[3] M. Hirano, "Psycho-acoustic evaluation of voice: GRBAS scale for evaluating the hoarse voice," Clinical Examination of Voice, Springer, 1981.
[4] B. Hammarberg, Perceptual and Acoustic Analysis of Dysphonia, Department of Logopedics and Phoniatrics, Karolinska Institutet, 1986.
[5] M. S. De Bodt, P. H. Van de Heyning, F. L. Wuyts, and L. Lambrechts, "The perceptual evaluation of voice disorders," Acta Oto-Rhino-Laryngologica Belgica, vol. 50, no. 4, pp. 283–291, 1996.
[6] D. Morsomme, S. Russel, J. Jamart, M. Remacle, and I. Verduyckt, "Evaluation subjective de la voix (VHI) chez 723 enseignants en region bruxelloise," in Actes du Congres de la Societe Francaise de Phoniatrie, Paris, France, 2008.
[7] INSERM, La Voix. Ses Troubles Chez Les Enseignants, INSERM, 2006.
[8] J. D. Hoit, "Influence of body position on breathing and its implications for the evaluation and treatment of speech and voice disorders," Journal of Voice, vol. 9, no. 4, pp. 341–347, 1995.
[9] N. Roy, D. M. Bless, and D. Heisey, "Personality and voice disorders: a multitrait-multidisorder analysis," Journal of Voice, vol. 14, no. 4, pp. 521–548, 2000.
[10] B. H. Jacobson, A. Johnson, C. Grywalski, et al., "The Voice Handicap Index (VHI): development and validation," American Journal of Speech-Language Pathology, vol. 6, no. 3, pp. 66–69, 1997.
[11] P. H. Dejonckere, C. Obbens, G. M. de Moor, and G. H. Wieneke, "Perceptual evaluation of dysphonia: reliability and relevance," Clinical Linguistics and Phonetics, vol. 45, no. 2, pp. 76–83, 1993.
[12] P. H. Dejonckere, P. Bradley, P. Clemente, et al., "A basic protocol for functional assessment of voice pathology, especially for investigating the efficacy of (phonosurgical) treatments and evaluating new assessment techniques: guideline elaborated by the Committee on Phoniatrics of the European Laryngological Society (ELS)," European Archives of Oto-Rhino-Laryngology, vol. 258, no. 2, pp. 77–82, 2001.
[13] J. Kreiman, B. R. Gerratt, G. B. Kempster, A. Erman, and G. S. Berke, "Perceptual evaluation of voice quality: review, tutorial, and a framework for future research," Journal of Speech and Hearing Research, vol. 36, no. 1, pp. 21–40, 1993.
[14] L. Anders, H. Hollien, P. Hurme, A. Sonninnen, and J. Wendler, "Perceptual evaluation of hoarseness by several classes of listeners," Clinical Linguistics and Phonetics, vol. 40, pp. 91–100, 1988.
[15] F. L. Wuyts, M. S. De Bodt, G. Molenberghs, et al., "The dysphonia severity index: an objective measure of vocal quality based on a multiparameter approach," Journal of Speech, Language, and Hearing Research, vol. 43, no. 3, pp. 796–809, 2000.
[16] J. F. Piccirillo, C. Painter, D. Fuller, and J. M. Fredrickson, "Multivariate analysis of objective vocal function," The Annals of Otology, Rhinology and Laryngology, vol. 107, no. 2, pp. 107–112, 1998.

[17] A. Giovanni, V. Molines, B. Teston, and N. Nguyen, "L'evaluation objective de la dysphonie: une methode multiparametrique," in Proceedings of the International Congress of Phonetic Sciences (ICPhS '91), pp. 274–277, Aix-en-Provence, France, 1991.
[18] B. Teston and B. Galindo, "A diagnosis of rehabilitation aid workstation for speech and voice pathologies," in Proceedings of the European Conference on Speech Communication and Technology (Eurospeech '95), pp. 1883–1886, Madrid, Spain, 1995.
[19] A. Giovanni, D. Robert, N. Estublier, B. Teston, M. Zanaret, and M. Cannoni, "Objective evaluation of dysphonia: preliminary results of a device allowing simultaneous acoustic and aerodynamic measurements," Clinical Linguistics and Phonetics, vol. 48, no. 4, pp. 175–185, 1996.
[20] A. Ghio and B. Teston, "Evaluation of the acoustic and aerodynamic constraints of a pneumotachograph for speech and voice studies," in Proceedings of the International Conference on Voice Physiology and Biomechanics, pp. 55–58, 2004.
[21] P. Yu, R. Garrel, R. Nicollas, M. Ouaknine, and A. Giovanni, "Objective voice analysis in dysphonic patients. New data including non linear measurements," Clinical Linguistics and Phonetics, vol. 59, pp. 20–30, 2007.
[22] L. Gavidia-Ceballos and J. H. L. Hansen, "Direct speech feature estimation using an iterative EM algorithm for vocal fold pathology detection," IEEE Transactions on Biomedical Engineering, vol. 43, no. 4, pp. 373–383, 1996.
[23] R. T. Ritchings, G. V. Conroy, M. McGillion, et al., "A neural network based approach to objective voice quality assessment," in Proceedings of the 18th International Conference on Expert System (ES '98), pp. 198–209, Cambridge, UK, 1998.
[24] A. A. Dibazar, S. Narayanan, and T. W. Berger, "Feature analysis for automatic detection of pathological speech," in Proceedings of the Engineering Medicine and Biology Symposium, vol. 1, pp. 182–183, 2002.
[25] C. Maguire, P. de Chazal, R. B. Reilly, and P. Lacy, "Identification of voice pathology using automated speech analysis," in Proceedings of the 3rd International Workshop on Models and Analysis of Vocal Emission for Biomedical Applications, Florence, Italy, December 2003.
[26] C. Fredouille, G. Pouchoulin, J.-F. Bonastre, M. Azzarello, A. Giovanni, and A. Ghio, "Application of automatic speaker recognition techniques to pathological voice assessment (dysphonia)," in Proceedings of the 9th European Conference on Speech Communication and Technology (Interspeech '05), pp. 149–152, Lisboa, Portugal, 2005.
[27] J. I. Godino-Llorente, P. Gomez-Vilda, and M. Blanco-Velasco, "Dimensionality reduction of a pathological voice quality assessment system based on Gaussian mixture models and short-term cepstral parameters," IEEE Transactions on Biomedical Engineering, vol. 53, no. 10, pp. 1943–1953, 2006.
[28] M. Wester, "Automatic classification of voice quality: comparing regression models and hidden Markov models," in Proceedings of the Symposium on Databases in Voice Quality Research and Education (VOICEDATA '98), pp. 92–97, Utrecht, December 1998.
[29] V. Parsa and D. G. Jamieson, "Acoustic discrimination of pathological voice: sustained vowels versus continuous speech," Journal of Speech, Language, and Hearing Research, vol. 44, no. 2, pp. 327–339, 2001.
[30] J. B. Alonso, F. Diaz, C. M. Travieso, and M. A. Ferrer, "Using nonlinear features for voice disorder detection," in Proceedings of the International Conference on Non-Linear Speech Processing (NOLISP '05), pp. 94–106, Barcelona, Spain, April 2005.
[31] W. Chen, C. Peng, X. Zhu, B. Wan, and D. Wei, "SVM-based identification of pathological voices," in Proceedings of the 29th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, vol. 2007, pp. 3786–3789, 2007.
[32] F. Bimbot, J.-F. Bonastre, C. Fredouille, et al., "A tutorial on text-independent speaker verification," EURASIP Journal on Applied Signal Processing, vol. 2004, no. 4, pp. 430–451, 2004.
[33] G. Gravier, "SPRO: a free speech signal processing toolkit (version 4.0.1)," 2003, http://gforge.inria.fr/projects/spro.
[34] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, Series B, vol. 39, no. 1, pp. 1–38, 1977.
[35] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, "Speaker verification using adapted Gaussian mixture models," Digital Signal Processing: A Review Journal, vol. 10, no. 1–3, pp. 19–41, 2000.
[36] L. F. Lamel, J. L. Gauvain, and M. Eskenazi, "BREF, a large vocabulary spoken corpus for French," in Proceedings of the European Conference on Speech Communication and Technology (Eurospeech '91), pp. 505–508, Genoa, Italy, 1991.
[37] L. Besacier, J.-F. Bonastre, and C. Fredouille, "Localization and selection of speaker specific information with statistical modelling," Speech Communication, vol. 31, pp. 89–106, 2000.
[38] I. A. McCowan and S. Sridharan, "Multi-channel sub-band speech recognition," EURASIP Journal on Applied Signal Processing, vol. 2001, no. 1, pp. 45–52, 2001.
[39] J. Revis, A. Ghio, and A. Giovanni, "Phonetic labeling of dysphonia: a new perspective in perceptual voice analysis," in Proceedings of the 7th International Conference on Advances in Quantitative Laryngology, Voice and Speech Research, October 2006.
[40] F. Brugnara, D. Falavigna, and M. Omologo, "Automatic segmentation and labeling of speech based on hidden Markov models," Speech Communication, vol. 12, no. 4, pp. 357–370, 1993.
[41] J. Alba-Salas, "Voice Onset Time and foreign accent detection: are L2 learners better than monolinguals?" Revista Alicantina de Estudios Ingleses, vol. 17, November 2004.
[42] P. Auzou, C. Ozsancak, R. J. Morris, M. Jan, F. Eustache, and D. Hannequin, "Voice onset time in aphasia, apraxia of speech and dysarthria: a review," Clinical Linguistics and Phonetics, vol. 14, no. 2, pp. 131–150, 2000.
[43] R. J. Morris, "VOT and dysarthria: a descriptive study," Journal of Communication Disorders, vol. 22, no. 1, pp. 23–33, 1989.
[44] L. Jancke, "Variability and duration of voice onset time and phonation in stuttering and nonstuttering adults," Journal of Fluency Disorders, vol. 19, no. 1, pp. 21–37, 1994.
[45] J. Ryalls, K. Gustafson, and C. Santini, "Preliminary investigation of voice onset time production in persons with dysphagia," Dysphagia, vol. 14, no. 3, pp. 169–175, 1999.
[46] J. D. Edgar, C. M. Sapienza, K. Bidus, and C. L. Ludlow, "Acoustic measures of symptoms in abductor spasmodic dysphonia," Journal of Voice, vol. 15, no. 3, pp. 362–372, 2001.
[47] J. C. Pinheiro and D. M. Bates, Mixed-Effects Models in S and S-PLUS (Statistics and Computing), Springer, New York, NY, USA, 2000.
[48] J. Revis, A. Giovanni, and J.-M. Triglia, "Influence de l'attaque sur l'analyse perceptive des dysphonies," Clinical Linguistics and Phonetics, vol. 54, no. 1, pp. 19–25, 2002.


Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2009, Article ID 159234, 11 pages
doi:10.1155/2009/159234

Research Article

Analysis of Acoustic Features in Speakers with Cognitive Disorders and Speech Impairments

Oscar Saz,1 Javier Simon,2 W.-Ricardo Rodríguez,1 Eduardo Lleida,1 and Carlos Vaquero1

1 Communications Technology Group (GTC), Aragon Institute for Engineering Research (I3A), University of Zaragoza, 50018 Zaragoza, Spain

2 Department of General and Hispanic Linguistics, University of Zaragoza, 50009 Zaragoza, Spain

Correspondence should be addressed to Oscar Saz, [email protected]

Received 31 October 2008; Revised 11 February 2009; Accepted 8 April 2009

Recommended by Juan I. Godino-Llorente

This work presents the results of the analysis of the acoustic features (formants and the three suprasegmental features: tone, intensity and duration) of the vowel production in a group of 14 young speakers suffering from different kinds of speech impairments due to physical and cognitive disorders. A corpus of unimpaired children's speech is used to determine the reference values for these features in speakers without any kind of speech impairment within the same domain as the impaired speakers, that is, 57 isolated words. The signal processing used to extract the formant and pitch values is based on a Linear Prediction Coefficients (LPC) analysis of the segments considered as vowels in a Hidden Markov Model (HMM) based Viterbi forced alignment. Intensity and duration are also based on the outcome of the automated segmentation. As the main conclusion of the work, it is shown that the intelligibility of the vowel production is lowered in impaired speakers even when the vowel is perceived as correct by human labelers. The decrease in intelligibility is due to a 30% increase in confusability in the formant map, a 50% reduction in the discriminative power in energy between stressed and unstressed vowels, and a 50% increase of the standard deviation in the length of the vowels. On the other hand, impaired speakers keep good control of tone in the production of stressed and unstressed vowels.

Copyright © 2009 Oscar Saz et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

The presence of certain speech and language disorders produces a decrease in the intelligibility of the speech of the patients affected by them [1]. In languages like Spanish, vowels are the nuclei of every syllable and play an important role in the intelligibility of speech, so the decrease in their quality and discriminative power has a major effect on the overall intelligibility of the speech. The goal of this work is to analyze and characterize this loss of intelligibility in a group of young speakers with cognitive disorders.

Several analytic studies have been carried out on the vocalic production of patients with different speech impairments. Cases of aphasia, a language disorder due to brain damage, have been studied to understand its influence on, and the decrease in quality of, the vocalic production of these patients [2, 3]. Dysarthria has also been studied; it has been claimed that patients with severe affections still control some of their suprasegmental vocalic features [4], although with a lack of fine control over them. The effect on vocalic production of the speech disorders due to Down's syndrome has also been studied [5], in pre- and postsurgical situations. Finally, the authors made an initial approach to this kind of analysis [6, 7] with the Spanish database of project HACRO, containing different kinds of impaired speech [8].

In this work, it will be studied how vowel production quality varies in a group of young speakers with cognitive disorders and the, sometimes severe, speech impairments associated with them, such as dysarthria, with respect to a set of reference unimpaired speakers. Four features will be studied: formant frequencies, fundamental frequency (tone), intensity (energy) and duration (length). Formants are the acoustic parameters required to distinguish the different vowels, while tone and intensity may play the main role in the utterance of stressed versus unstressed vowels [9, 10]. Finally,


the duration of vowels affects the correct perception of syllable prominence and position within the whole word or utterance [11], although its impact is not clear in the Spanish language.

The organization of this paper is as follows. In Section 2, the acoustic features studied in this work are presented from the point of view of acoustic and perceptual phonetics. Section 3 introduces the young speech corpora used in this paper: the reference subcorpus and the impaired subcorpus. In Section 4, the methods for the extraction of all the studied features are presented, as well as the reference values extracted from the unimpaired speech corpus. The results on the impaired speech corpus and the comparison with the reference values are given in Section 5 and discussed in Section 6. Finally, the conclusions of this work are drawn in Section 7.

2. Features of Spanish Vowels

This section gives a brief review of the main acoustic features of vocalic production, focusing on their influence on the articulation of the Spanish vowels. The Spanish language contains five vowels (/a/, /e/, /i/, /o/ and /u/), clearly defined by their position in the formant map, as will be shown in the study of the reference corpus in Section 4.1. There are two allophones of the /i/ and /u/ vowels acting like glides ([j] and [w], resp.) that, despite being close to the vowels, cannot be considered vocalic sounds: when they are unstressed vowels, they make the transition to a purely vocalic sound, which is the nucleus of the syllable [12]. Hence, these glides are never considered for analysis in this work. Next, we provide a basic theory of the Spanish vowels, according to their acoustic production and their influence on the perception of speech.

2.1. Formants. Formant frequencies are the only acoustic feature needed to describe Spanish vowels, and these frequencies rely heavily on the articulatory properties of each vowel [13]. The two main articulatory properties are the horizontal position of the tongue (defining palatal or front versus velar or back vowels) and the vertical position of the tongue (defining high versus low vowels). With this classification, a low position of the tongue produces a higher first formant, while a more palatal position of the tongue produces a higher second formant. Higher-order formants, like the third or fourth formants, do not have a significant impact on Spanish vowels and are not considered in this work; moreover, tone does not have an impact on the distinction of vowels either.

According to this organization, Spanish has two highvowels (low first formant, 300–400 Hz): the velar /u/ (lowsecond formant, 900 Hz) and the palatal /i/ (high secondformant, 2300–2700 Hz), while only one low vowel (high firstformant, 700–900 Hz) /a/ with a central position betweenpalatal and velar (middle second formant, 1500–1700 Hz).Finally, two more vowels share a central-high position(high first formant, 500–600 Hz): the velar /o/ (low secondformant, 1000–1200 Hz) and the palatal /e/ (high secondformant, 2000–2400 Hz) [14].

2.2. Suprasegmental Features. There are three main acous-tical features that affect the suprasegmental production inSpanish: tone, intensity and duration. In isolated words likeit is the case of the work in this paper, these features mostlyaffect the distinct perception of stressed and unstressedvowels, although they do it in very different ways. Stress isconsidered in many phonetic theories as a binary feature thatcan be characterized as +stress or –stress, as perceived by thelistener. Several trends differ in which suprasegmental featurecarries most of the stress information, although nowadaysit is widely accepted that tone is the main carrier of stress[15], followed by intensity. Anyways, no categorical assertioncan be made in this subject, as the main prosody of thesentence and other microprosodic features can affect thisperception in different utterances, as well as in the differentcharacterization of tone in each language.

Finally, duration also has an influence in the perceptionof stress, but it is very affected by the fact that everysyllable has a canonic length, so the duration of a stressedvowel is only comparable to the duration of the sameunstressed vowel when they are the nucleus of the samesyllabic structure. Otherwise, no categorical conclusion canbe made from the comparison of the duration of stressed andunstressed vowels.

3. Corpora for Analysis

This section presents the most relevant features of the corpora used in this work for the analysis carried out in Sections 4 and 5. Further information concerning other features of the corpus can be found in [16]. The vocabulary used in the recording sessions is the 57 words from the Induced Phonological Register (RFI) [17], a very well-known speech therapy handbook in Spanish. These 57 words contain 129 syllables and 292 phonemes, with several repetitions of the vowels in different syllabic structures (90 different syllables). More precisely, the total number of vowels in the set of words is 129 (58 /a/, 18 /e/, 9 /i/, 38 /o/ and 6 /u/), each of them being the nucleus of one of the 129 syllables (in Section 2, it was argued that glides are considered nonvocalic sounds).

Speech acquisition was carried out using "Vocaliza" [18], a computer-aided speech therapy tool that allows the acquisition of speech elicited from children by prompting them with text, audio, and images. Recordings were made in an empty classroom environment with a close-talk microphone (AKG C444L) connected to a laptop with a conventional sound card, acquiring the signals at a 16 kHz sampling frequency and storing them with a depth of 16 bits. The main corpus is divided into two subcorpora: unimpaired and impaired speech.

3.1. Unimpaired Speech Corpus. The unimpaired speech subcorpus contains speech from 168 young speakers (73 males and 95 females), aged 10 to 18 years, attending primary and secondary school in Zaragoza, Spain. Every speaker uttered one session of the isolated words in the RFI.


The total number of utterances in this subset of the corpus is 9576 isolated words (6 hours of signal). The recording process was fully supervised by at least one member of the research team to ensure good pronunciation quality and intelligibility of the utterances. Furthermore, only children with a good literacy assessment by their teachers were chosen to take part in the recordings. This subcorpus was recorded to provide a reference for the standard features in the speech of young speakers, as it is well known that children's speech has special characteristics [19].

3.2. Impaired Speech Corpus. The impaired speech subset of the corpus contains speech from 14 young speakers, whose ages and genders are listed in Table 1. Every speaker uttered 4 sessions of the RFI isolated words, that is, 228 isolated words per speaker and a total of 3192 isolated words in the corpus (3 hours of signal). All 14 speakers have cognitive disabilities and some also have physical disabilities [16]. These disabilities affect their speech, producing a decrease in the quality and intelligibility of their utterances and also severe mispronunciations of some phonemes, which are either substituted by another phoneme or completely deleted.

Every utterance in the impaired speech subcorpus was manually labeled by three different experts to determine the perceived pronunciation mistakes made by the speakers. With a pairwise interlabeler agreement of 89.65%, the mispronunciation rate (substitutions and deletions) is 17.61% over the full set of phones (vowels, glides, and consonants). The vowel mispronunciation rates per speaker are shown in Table 2, where a great variability across speakers can be seen, with some speakers making nearly no mistakes and others reaching 20% of mistakes. Although some speakers make no mistakes in the vowels, this does not indicate that their voice is completely healthy, because they present some degree of dysarthria that affects their voice quality.

The average mispronunciation rate of every vowel is shown in Table 3; the mean result for the 5 vowels altogether is 7.43% of mispronunciations, where /a/ and /o/ are around 4%-5% and /e/, /i/ and /u/ are more frequently mispronounced, with 9%-10% of mistakes. Once again, note that this manual labeling only refers to the substituted and deleted phonemes, resembling a perceptual labeling of how human experts perceive the phonemes (as the canonical one or as any other), but not indicating which phoneme the speakers actually uttered in place of the canonical expansion.

4. Acoustic Analysis and Reference Results

The acoustic analysis carried out aims at a robust estimation of the four features selected for study in Section 2. This section gives a brief review of the algorithms used for the acoustic analysis and focuses on the reference results over the unimpaired subcorpus. State-of-the-art speech processing algorithms are implemented to estimate these values following the diagram in Figure 1, as also implemented in the speech therapy tool "PreLingua" for the improvement of phonatory control in young children [20].

Table 1: Impaired speakers in the corpus (Down's stands for Down's Syndrome).

Speaker  Age  Gender  Degree
Spk01    13   Female  Down's
Spk02    11   Male    Severe
Spk03    21   Male    Moderate
Spk04    20   Female  Moderate
Spk05    18   Male    Down's
Spk06    16   Male    Moderate
Spk07    18   Male    Severe
Spk08    19   Male    Severe
Spk09    11   Female  Moderate
Spk10    14   Female  Moderate
Spk11    19   Female  Moderate
Spk12    18   Male    Severe
Spk13    13   Female  Down's
Spk14    11   Female  Moderate

Table 2: Rate of vowel mispronunciations per speaker.

Speaker   Spk01  Spk02  Spk03  Spk04  Spk05   Spk06  Spk07
% Errors  0.39%  3.10%  0.39%  0.39%  17.44%  0.19%  0.78%

Speaker   Spk08  Spk09  Spk10  Spk11  Spk12  Spk13   Spk14
% Errors  8.53%  0.78%  7.56%  3.10%  8.33%  28.68%  0.00%

Table 3: Rate of mispronunciations per vowel.

Vowel     /a/    /e/    /i/    /o/    /u/
% Errors  4.16%  9.92%  9.52%  4.61%  8.93%

The speech processing is applied framewise (with a frame length of 25 milliseconds and a frame shift of 10 milliseconds) after obtaining the automated segmentation of the input speech via a Viterbi-based forced alignment. The Hidden Markov Models (HMMs) used for the Viterbi alignment were trained with 3 different databases containing adult unimpaired speech: Albayzin [21], SpeechDat-Car [22] and Domolab [7]. 39-dimensional Mel Frequency Cepstral Coefficient (MFCC) vectors are used as features for the HMM alignment, composed of 12 static features and energy, plus delta and delta-delta features. An example of the outcome of the automated segmentation over one of the utterances in the unimpaired children's subcorpus can be seen in Figure 2(a). The automated segmentation is initially based on the canonical transcription of each utterance (isolated words) but, to avoid the pernicious effect of phoneme deletions in the impaired speakers' pronunciations, the deleted phonemes (as perceived in the human labeling) are not fed into the automated segmentation, as shown in the example in Figure 2(b).
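As a rough illustration of the framing and feature configuration described above, the following sketch computes 39-dimensional MFCC vectors with 25 ms frames and a 10 ms shift using librosa; this is not the toolkit used by the authors, and the exact coefficient values will differ from theirs.

```python
import librosa
import numpy as np

def mfcc_39(wav_path):
    # 25 ms windows (400 samples at 16 kHz) with a 10 ms shift (160 samples)
    y, sr = librosa.load(wav_path, sr=16000)
    static = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                  n_fft=400, win_length=400, hop_length=160)
    # librosa's 0th cepstral coefficient stands in for the frame energy term
    delta = librosa.feature.delta(static)
    delta2 = librosa.feature.delta(static, order=2)
    return np.vstack([static, delta, delta2]).T   # shape: (n_frames, 39)
```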

After segmentation, impaired speech is studied in two different groups: correctly pronounced vowels and mispronounced vowels. This way, intelligibility is studied separately for the cases in which the labelers still perceive the vowel as correctly pronounced and for the cases perceived as mispronunciations.

4.1. Feature Estimation. The feature estimation is carried out in the following steps: after signal preprocessing (DC offset removal, pre-emphasis and Hamming windowing), a Linear Prediction Coefficient (LPC) analysis [23] is applied to every frame to extract the roots of the coefficients (a_k) of the 16th-order speech prediction model.


Table 4: Mean (μ), standard deviation (σ), skewness (γ1) and excess kurtosis (γ2) values for the first and second formants in the reference corpus.

         First formant                      Second formant
Vowel    μ        σ       γ1     γ2         μ        σ       γ1     γ2
/a/      762.75   108.77  -0.29  -2.90      1567.30  288.48  0.27   -0.78
/e/      512.21   61.73   -0.21  -3.00      2356.78  422.64  0.45   0.16
/i/      379.58   68.17   0.52   -2.98      2787.75  267.27  0.14   -0.16
/o/      552.72   69.46   -0.23  -2.95      1173.13  212.38  1.31   2.56
/u/      423.40   61.48   -0.26  -2.98      1083.16  213.67  0.69   0.32

Table 5: Mean (μ), standard deviation (σ), skewness (γ1) and excess kurtosis (γ2) values of pitch in the reference corpus (females 13-14 years old).

         Stressed vowels                    Unstressed vowels
Vowel    μ        σ      γ1     γ2          μ        σ      γ1     γ2
/a/      229.67   30.43  -0.54  1.04        207.78   27.54  0.65   4.02
/e/      229.04   30.12  -0.39  0.89        219.37   29.08  0.02   0.42
/i/      241.08   33.38  -0.54  1.30        219.51   29.64  0.74   3.71
/o/      228.55   32.45  -0.53  0.93        203.24   26.42  0.41   2.65
/u/      237.66   39.23  -0.44  1.30        236.88   30.68  -0.28  2.02

Figure 1: Acoustic analysis diagram (preprocessing, LPC analysis, root extraction for the formant frequencies, and prediction-error autocorrelation for the pitch).

This prediction model is given in (1):

$$H(z) = \frac{G}{1 - \sum_{k=1}^{16} a_k z^{-k}}, \qquad (1)$$

where the input signal s(n) is estimated as \hat{s}(n) using the time-domain impulse response h(n) associated with H(z), as in (2):

$$\hat{s}(n) = h(n) * s(n). \qquad (2)$$

The estimation of the formants takes the 16 LPC coefficients (a_k) of the prediction model H(z) and extracts the polynomial roots, each of them associated with a formant frequency. The roots with the two largest absolute values correspond to the first and second formants.
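A minimal sketch of this formant estimation step is given below, assuming a pre-emphasized, Hamming-windowed frame as input; the function name and the way frequencies are selected are illustrative, not taken from the paper.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc_formants(frame, fs=16000, order=16):
    # Autocorrelation (Yule-Walker) solution of the 16th-order LPC model in (1)
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    a = solve_toeplitz((r[:order], r[:order]), r[1:order + 1])
    poly = np.concatenate(([1.0], -a))            # A(z) = 1 - sum_k a_k z^-k
    roots = np.roots(poly)
    roots = roots[np.imag(roots) > 0]             # one root per conjugate pair
    freqs = np.angle(roots) * fs / (2.0 * np.pi)  # pole angle -> frequency (Hz)
    # Following the paper, keep the two poles with the largest absolute value
    strongest = np.argsort(-np.abs(roots))[:2]
    f1, f2 = sorted(freqs[strongest])
    return f1, f2
```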

Tone estimation is based on the prediction error e(n) given in (3) and its autocorrelation r(k) in (4), with frl the frame length (25 milliseconds per frame):

$$e(n) = s(n) - \hat{s}(n), \qquad (3)$$

$$r(k) = \sum_{n=0}^{frl} e(n)\, e(n-k). \qquad (4)$$

The index k at which the autocorrelation reaches its maximum value outside the area around the origin r(0) is the pitch period (k_pitch), associated with the pitch frequency (F_pitch = F_sample / k_pitch), where F_sample is 16 kHz as mentioned before. An estimate of the sonority, defined as the ratio between the maximum value of the autocorrelation and the autocorrelation at the origin (r(k_pitch)/r(0)), indicates whether the frame is sonorant enough to be considered a vowel and, hence, whether the calculated pitch and formant values can be taken as correct. Requiring a high sonority ratio avoids pitch and formant estimation mistakes, although some correct frames might be rejected.
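The pitch and sonority computation can be sketched as follows; the search range (75-400 Hz) used to exclude the region around r(0) is an assumption, as the paper does not state the exact limits.

```python
import numpy as np
from scipy.signal import lfilter

def pitch_and_sonority(frame, lpc_poly, fs=16000, fmin=75, fmax=400):
    # Prediction error e(n) = s(n) - s_hat(n), obtained by filtering with A(z)
    e = lfilter(lpc_poly, [1.0], frame)
    r = np.correlate(e, e, mode='full')[len(e) - 1:]
    kmin, kmax = int(fs / fmax), int(fs / fmin)
    k_pitch = kmin + int(np.argmax(r[kmin:kmax]))
    sonority = r[k_pitch] / r[0] if r[0] > 0 else 0.0
    return fs / k_pitch, sonority        # (F_pitch in Hz, r(k_pitch)/r(0))
```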

For the intensity estimation, some considerations are needed. First, actual intensity values (that is, sample values or directly computed frame energy) cannot be used in the study, as it is not possible to reliably assume that the input intensity during the recording process stayed steady across all sessions, since the recordings of all the speakers took more than a year. However, it is reasonable to assume that the Signal-to-Noise Ratio (SNR) remains constant for similar speech intensity, independently of the input volume, since a close-talk microphone was used for the recordings.

This assumption is evaluated by estimating the background noise power level over the corpus used in this work, whose mean value is 27.15 dB (standard deviation of 7.22 dB) for the reference subcorpus.


Figure 2: Examples of the outcome of the automated segmentation. (a) Utterance of the word "moto" (SIL-/m/-/o/-/t/-/o/-SIL) by an unimpaired 11-year-old male. (b) Utterance of the word "arbol" (SIL-/a/-/r/-/B/-/o/-/l/) by Spk05, where /r/ and /l/ are labeled as deletions.

Figure 3: Representation of the 4 features in the reference corpus: (a) formant map (mean and standard deviation of F1 versus F2) for the five vowels; (b) pitch histogram for stressed and unstressed vowels /o/ (females 13-14 years old); (c) energy histogram of stressed and unstressed vowels /o/; (d) length histogram of vowels /o/.


Figure 4: Formant maps (mean and standard deviation of F1 versus F2) for the impaired speakers: (a) correctly pronounced vowels; (b) mispronounced vowels.

Table 6: Mean (μ), standard deviation (σ), skewness (γ1) and excess kurtosis (γ2) values of the framewise energy (SNR) in the reference corpus.

         Stressed vowels                  Unstressed vowels
Vowel    μ       σ      γ1     γ2         μ       σ      γ1     γ2
/a/      37.78   6.93   -0.39  0.39       30.34   8.32   -0.27  -0.16
/e/      37.21   7.36   -0.58  0.97       34.18   7.91   -0.27  0.15
/i/      36.77   6.44   -0.33  -0.50      33.18   7.29   -0.39  0.44
/o/      38.42   6.96   -0.57  0.78       29.46   8.11   -0.18  -0.17
/u/      37.27   7.12   -0.46  0.34       34.61   6.34   -0.35  0.30

Table 7: Mean (μ), standard deviation (σ), skewness (γ1) and excess kurtosis (γ2) values of the vowel length in the reference corpus.

Vowel    μ        σ       γ1     γ2
/a/      120.75   53.11   1.25   4.49
/e/      107.73   50.81   1.06   2.45
/i/      114.88   39.42   0.56   0.62
/o/      123.03   58.42   2.24   13.25
/u/      113.16   45.89   0.62   0.52

The corresponding value for the impaired subcorpus is 27.07 dB (standard deviation of 6.61 dB), which validates the hypothesis that the noise level is directly related to the intensity level and maintains similar, good properties throughout the recordings. Hence, prior to energy estimation, the average background noise power is calculated over all the frames considered as nonspeech in the forced alignment. Afterwards, for each frame of the vowels, the framewise energy is calculated and the SNR is obtained by subtracting the noise power of the utterance. By convention, from now on, intensity or energy refers to this SNR value, from which the background noise level has been subtracted.
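A sketch of this frame-wise energy (SNR) computation is shown below; noise_power is assumed to be the average linear power of the non-speech frames of the utterance, and the function name is illustrative.

```python
import numpy as np

def frame_snr_db(frame, noise_power):
    # Frame power in dB minus the background-noise power in dB
    power = np.mean(np.asarray(frame, float) ** 2)
    return 10.0 * np.log10(power / noise_power)
```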

Duration is estimated as the length of the vowel in milliseconds, by counting the number of frames assigned to each vowel in the forced alignment and multiplying by the frame shift value of 10 milliseconds per frame. A threshold over the energy is applied to restrict the vowel boundaries and hence avoid the effect of coarticulation in the transitions to or from consonantal sounds. This threshold was preset to discard boundary frames with low energy, for which the calculation of pitch and formants could be inaccurate.
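The duration computation, with the boundary-trimming behaviour described above, might look like the following sketch; threshold_db is an assumed value, since the paper does not report the exact setting.

```python
def vowel_duration_ms(frame_energies_db, frame_shift_ms=10, threshold_db=15):
    # Trim low-energy frames at both vowel boundaries, then count what remains
    start, end = 0, len(frame_energies_db)
    while start < end and frame_energies_db[start] < threshold_db:
        start += 1
    while end > start and frame_energies_db[end - 1] < threshold_db:
        end -= 1
    return (end - start) * frame_shift_ms
```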

4.2. Reference Results. The reference subcorpus of 168 unimpaired young speakers was initially analyzed to determine the standard values of the formants and suprasegmental features under study in this work. Some general assumptions are made in this paper concerning the statistical properties of the studied features: first, the values of the formants follow a 2-dimensional Gaussian distribution for each vowel; the values of pitch and energy follow a Gaussian distribution separately for stressed and unstressed vowels (pitch can only be considered for a single speaker or for a population of the same gender and age); finally, the values of vowel length follow a Gaussian distribution for each vowel.

All the values in this section are given in terms of mean (μ), standard deviation (σ), skewness (γ1) and excess kurtosis (γ2), where values of γ1 and γ2 close to zero support the Gaussian assumptions. Once the Gaussian properties are assured, μ and σ will be the only statistics used in the studies on the impaired subcorpus in Section 5.
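The four statistics can be obtained directly with scipy; kurtosis(..., fisher=True) returns the excess kurtosis, so values of γ1 and γ2 near zero support the Gaussian assumption.

```python
import numpy as np
from scipy.stats import skew, kurtosis

def gaussianity_stats(values):
    values = np.asarray(values, float)
    return {
        'mean': values.mean(),
        'std': values.std(ddof=1),
        'skewness': skew(values),
        'excess_kurtosis': kurtosis(values, fisher=True),
    }
```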


Table 8: Mean (μ) and standard deviation (σ) for the formants in the impaired corpus.

         Correct vowels                             Mispronounced vowels
         First formant       Second formant         First formant       Second formant
Vowel    μ        σ          μ        σ             μ        σ          μ        σ
/a/      789.90   191.70     1568.03  302.28        643.25   172.22     1692.28  365.08
/e/      542.55   78.29      2230.34  410.96        585.28   100.29     1965.08  597.43
/i/      368.81   70.89      2671.75  426.03        503.10   102.72     2177.16  590.60
/o/      586.92   86.15      1185.30  216.18        572.12   79.85      1245.47  215.34
/u/      390.02   69.24      1075.79  163.38        577.96   95.51      1144.39  130.56

Table 9: Mean (μ) values of pitch in the impaired corpus for correct vowels.

         Group A                Group B                Group C                Group D
Vowel    Stressed  Unstressed   Stressed  Unstressed   Stressed  Unstressed   Stressed  Unstressed
/a/      150.93    139.10       231.46    200.71       259.56    233.79       308.50    278.65
/e/      152.43    145.35       238.40    208.26       259.42    248.60       302.11    283.32
/i/      170.45    144.40       254.82    219.44       277.78    242.95       308.34    283.63
/o/      150.94    139.27       236.52    200.28       259.28    232.22       302.90    270.11
/u/      161.93    146.68       251.93    230.36       267.58    245.53       316.96    292.26

All reference values are shown in Tables 4 (formants), 5 (pitch), 6 (energy/SNR) and 7 (length). Table 5 shows only the results for the group of unimpaired females of 13-14 years old as an example of the pitch trend in the unimpaired data (the remaining groups behave similarly and are not shown here for reasons of space; recall that pitch has to be studied separately per gender and age to maintain the Gaussian distribution condition).

A graphical representation of these features is given in Figure 3. First, Figure 3(a) shows the results of the formant analysis over the reference corpus, plotting an ellipse whose center is given by the mean values of the first and second formants and whose axes are the standard deviations of both formants. Figures 3(b) and 3(c) show the histograms of pitch and energy, respectively, for the vowel /o/ in the reference corpus, separating stressed from unstressed vowels, while Figure 3(d) shows the histogram of the duration of the vowel /o/ across the reference corpus. The vowel /o/ has been chosen for this graphical view of the histograms because it is one of the most frequent vowels in the corpus.

Referring to the formant results in Table 4 and Figure 3(a), the values are similar to the canonical formant values traditionally accepted in Spanish phonetics, and a good discrimination can be made among all five vowels. Pitch and energy, in Tables 5 and 6 and Figures 3(b) and 3(c), show their discriminative effect in the perception of stress, as the pitch of stressed vowels is 10–20 Hz above that of unstressed vowels and the energy of stressed vowels is 4–7 dB above that of unstressed vowels. Finally, regarding length in Table 7 and Figure 3(d), vowel production is seen to be steady in its length, with a standard deviation not exceeding 40%–50% of the mean length (around 120 milliseconds).

5. Impaired Speech Results

In this section, the results of the acoustic analysis over the impaired speech subset of the corpus are given. This analysis covers the four acoustic features considered in Section 2 and makes an initial comparison with the results obtained in Section 4.2 over the reference subcorpus. The full comparative analysis is made in Section 6 with the help of statistical tools such as the Kullback-Leibler Divergence and the Fisher Ratio.

5.1. Formant Results. The formant map for the 14 impaired speakers is shown in Figure 4. Figure 4(a) provides the formant map for the vowels perceived as correctly pronounced by the human labelers, with their statistics given in the first columns of Table 8. Two major effects can be appreciated: first, the increase in the area of every vowel in the formant map in Figure 4(a), which appears as an increase in the standard deviation of the formants in Table 8 when compared to the formants of the reference speakers in Table 4; and second, the shift of vowels /a/, /e/ and /o/ towards the center of the formant map in Figure 4(a), also visible in the mean results in Table 8.

Concerning the results for the vowels perceived as mispronounced by the human labelers, given in Figure 4(b) and the second half of Table 8, a total confusion of the formants can be appreciated, as expected in this case where the speakers made a mistake in the pronounced vowel. All the formants are centered in the middle of the formant map and the standard deviation is much higher. What the speakers are really uttering is different from the expected canonical vowel, and the production of speech is blurred in the formant map, as the labelers were not asked to indicate what the speaker was actually saying.


Table 10: Mean (μ) values of pitch in the impaired corpus for mispronounced vowels.

         Group A                Group B                Group C                Group D
Vowel    Stressed  Unstressed   Stressed  Unstressed   Stressed  Unstressed   Stressed  Unstressed
/a/      150.27    137.56       263.48    229.33       267.54    256.01       264.44    236.27
/e/      151.98    146.10       246.23    228.11       255.77    268.71       306.03    312.17
/i/      139.74    149.60       272.98    210.56       270.37    225.41       —         —
/o/      150.75    138.28       241.19    210.61       260.71    245.16       —         250.61
/u/      —         138.04       223.33    196.08       273.40    250.00       —         267.78

Table 11: Mean (μ) and standard deviation (σ) of energy in the impaired corpus.

         Correct vowels                     Mispronounced vowels
         Stressed         Unstressed        Stressed         Unstressed
Vowel    μ       σ        μ       σ         μ       σ        μ       σ
/a/      37.09   7.94     31.87   9.06      35.85   7.62     35.03   9.70
/e/      37.93   7.29     34.02   8.34      35.25   8.20     32.88   7.37
/i/      37.59   7.85     33.36   7.60      37.37   9.55     30.98   8.82
/o/      37.91   8.47     34.70   9.69      37.01   9.58     33.22   8.69
/u/      38.63   7.81     34.38   7.03      36.86   9.64     27.84   8.50

Table 12: Mean (μ) and standard deviation (σ) of length in the impaired corpus.

         Correct vowels      Mispronounced vowels
Vowel    μ        σ          μ        σ
/a/      138.42   75.62      99.47    109.53
/e/      142.01   84.72      100.17   87.69
/i/      128.88   66.76      143.75   117.11
/o/      151.66   93.81      115.33   120.13
/u/      127.42   64.62      138.40   77.98

5.2. Tone (Pitch) Results. The study of the pitch values for the impaired subcorpus would best be given separately for every speaker; however, the lack of sufficient data for a correct statistical analysis (especially when studying mispronounced vowels) makes it necessary to gather speakers into groups with similar pitch values. Hence, 4 groups are created:

(i) Group A gathers speakers Spk03, Spk06, Spk07 and Spk12 (4 of the older males, with very low pitch values).

(ii) Group B gathers speakers Spk05, Spk08, Spk10 and Spk11 (2 females and 2 males with medium pitch values).

(iii) Group C gathers speakers Spk04, Spk09, Spk13 and Spk14 (4 females with medium-high pitch values).

(iv) Group D gathers speakers Spk01 and Spk02 (one male and one female with a high pitch).

The results for the 4 groups of speakers are given in Tables 9 (correctly pronounced vowels) and 10 (mispronounced vowels, where some values are missing due to the absence of data for those cases).

It can be seen that impaired speakers keep good control of these prosodic features: pitch values are steady among all five vowels, and speakers show the ability to discriminate stressed from unstressed vowels for all 5 vowels in a similar way to the reference speakers (with 10–20 Hz of separation between stressed and unstressed vowels). The results for mispronounced vowels have to be considered with caution, as the absence of some cases leads to anomalous results.

5.3. Intensity (Energy) Results. Regarding the values of framewise energy (SNR, as explained in Section 4), the average results for all the impaired speakers are given in Table 11. Energy keeps good properties for the impaired speakers, who are able to increase their intensity when uttering stressed vowels, although compared to the reference results in Section 4.2 there is a slight increase in the energy of unstressed vowels. On the other hand, a reduction of the energy of stressed vowels is noticed in the vowels labeled as mispronunciations.

5.4. Duration (Length) Results. The statistics of the vowel length for the group of 14 impaired speakers are shown in Table 12. There is an increase in the average length of around 15 milliseconds for all vowels when compared to the reference speakers in Table 7, but what is more noticeable is the increase in standard deviation (more than 50%), which indicates the presence of vowels with very variable length, that is, the existence of extremely long and extremely short vowels, as there is no significant change in the skewness and kurtosis of the statistics. The increase in standard deviation is especially noticeable in the mispronounced vowels, which indicates that what the speakers are really uttering instead of the vowels is a nonsteady realization of speech.


Table 13: sKLD and FR for the most confusable pairs of vowels in the Spanish formants (Avg is the weighted average over the number of appearances of every vowel).

          Unimpaired speakers    Impaired speakers        Impaired speakers
                                 (Correct vowels)         (Mispronounced vowels)
Vowels    sKLD     FR            sKLD     FR              sKLD     FR
/a/-/e/   17.40    6.39          11.80    3.11            1.78     0.24
/a/-/o/   9.72     3.86          7.51     1.99            5.43     1.25
/e/-/i/   6.49     2.82          6.60     3.26            0.78     0.39
/o/-/u/   4.15     2.03          7.27     3.34            1.02     0.16
/e/-/o/   20.97    6.45          16.17    5.21            9.35     1.29
Avg.      12.67    4.62          10.10    3.19            4.17     0.76

Table 14: sKLD and FR in pitch between stressed and unstressed vowels (Avg is the weighted average over the number of appearances of every vowel).

         Unimpaired speakers    Impaired speakers        Impaired speakers
                                (Correct vowels)         (Mispronounced vowels)
Vowel    sKLD     FR            sKLD     FR              sKLD     FR
/a/      1.15     0.51          1.02     0.46            4.14     1.66
/e/      0.52     0.13          0.52     0.23            1.29     0.21
/i/      1.19     0.54          1.86     0.55            48.97    1.16
/o/      1.43     0.67          0.94     0.42            2.37     0.53
/u/      0.19     0.06          0.47     0.28            3.06     0.24
Avg      1.10     0.49          0.96     0.41            6.30     1.02

This clearly might indicate that speakers are unsure of their speech production, so they either try to skip the vowel (making it shorter) or make it longer while they try to pronounce the right sound.

6. Discussion

The results obtained in Section 5 give way to a discussion on several aspects of the vowel production of impaired speakers. The discussion in this section is accompanied by the computation of the Kullback-Leibler Divergence (KLD) and the Fisher Ratio (FR) [24]. These two measures are known to provide a good metric of the discriminative power between two different random variables. In this work, they help to quantify the discriminative separation between vowels in the formant map and between stressed and unstressed vowels in terms of tone and intensity.

This work considers the KLD definition for n-dimensional Gaussian distributions (2-dimensional in the case of the formants and 1-dimensional for the other features). This definition, for two distributions A ~ N(μ_A, Σ_A) and B ~ N(μ_B, Σ_B), where μ_A and μ_B are mean vectors, Σ_A and Σ_B diagonal covariance matrices, and n the dimension of the distributions, is given by (5):

$$\mathrm{KL}(A,B) = \sum_{i=0}^{n} \left( \log\frac{\Sigma_{A_i}}{\Sigma_{B_i}} + \frac{(\mu_{A_i} - \mu_{B_i})^2}{\Sigma_{B_i}} + \frac{\Sigma_{A_i}}{\Sigma_{B_i}} - 1 \right). \qquad (5)$$

However, with this definition the KLD is nonsymmetric, that is, KL(A,B) ≠ KL(B,A), so a symmetrized KLD (sKLD) is defined in (6):

$$\mathrm{sKLD}(A,B) = \frac{\mathrm{KL}(A,B) + \mathrm{KL}(B,A)}{2}. \qquad (6)$$

Finally, the FR for the two n-dimensional Gaussian distributions A ~ N(μ_A, Σ_A) and B ~ N(μ_B, Σ_B) is given in (7):

$$\mathrm{FR}(A,B) = \sum_{i=0}^{n} \frac{(\mu_{A_i} - \mu_{B_i})^2}{\Sigma_{A_i} + \Sigma_{B_i}}. \qquad (7)$$
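A small sketch of (5)-(7) for Gaussians with diagonal covariances is given below; inputs are per-dimension means and variances (one element for pitch or energy, two for the formant map), and the function name is illustrative.

```python
import numpy as np

def skld_and_fisher(mu_a, var_a, mu_b, var_b):
    mu_a, var_a = np.asarray(mu_a, float), np.asarray(var_a, float)
    mu_b, var_b = np.asarray(mu_b, float), np.asarray(var_b, float)

    def kl(m1, v1, m2, v2):                       # KL(A,B) as in (5)
        return np.sum(np.log(v1 / v2) + (m1 - m2) ** 2 / v2 + v1 / v2 - 1.0)

    skld = 0.5 * (kl(mu_a, var_a, mu_b, var_b) + kl(mu_b, var_b, mu_a, var_a))
    fr = np.sum((mu_a - mu_b) ** 2 / (var_a + var_b))   # Fisher Ratio (7)
    return skld, fr
```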

Concerning the formants (the only acoustic feature of the vowels), there is an important decrease in sKLD and FR in the formant map between the vowels /a/, /e/ and /o/ in Table 13, while vowels /i/ and /u/ separate from the other 3 vowels, increasing their sKLD and FR in the formant map.

However, this is not a precise picture of the situation, because it should not be forgotten that these two vowels are the least frequent in the Spanish language; not only in the vocabulary of this work in Section 3, but also in other major text corpora in Spanish such as the Europarl corpus [25], where the percentage of appearances of the vowels is 11.83% for /e/, 9.51% for /a/, 8.07% for /o/ and only 4.28% for /i/ and 1.74% for /u/. Thus, when computing a weighted average of sKLD and FR (last row of Table 13), where the weights are the percentages of appearances of every vowel in the vocabulary, there is an average reduction of 20.28% in the sKLD and 30.95% in the FR between unimpaired speakers and impaired speakers uttering the vowels correctly.


Table 15: sKLD and FR in energy between stressed and unstressed vowels (Avg is the weighted average over the number of appearances of every vowel).

         Unimpaired speakers    Impaired speakers        Impaired speakers
                                (Correct vowels)         (Mispronounced vowels)
Vowel    sKLD     FR            sKLD     FR              sKLD     FR
/a/      1.04     0.47          0.42     0.19            0.13     0.00
/e/      0.17     0.08          0.29     0.12            0.12     0.05
/i/      0.31     0.14          0.30     0.15            0.24     0.11
/o/      1.49     0.70          0.51     0.23            0.19     0.09
/u/      0.19     0.08          0.35     0.16            1.04     0.49
Avg      0.96     0.44          0.42     0.19            0.20     0.06

This reduction in discriminative power rises to 83.55% between unimpaired and impaired speakers in the case of mispronunciations. This result is clearly to be expected, since we are considering a situation where the canonical form of the vowel has not been uttered, but it serves as a consistent validation of the human labeling made by the experts.

In terms of suprasegmental features, the separation between stressed and unstressed vowels is given in Tables 14 (pitch) and 15 (energy). Table 14 shows that there is no significant decrease in the weighted sKLD and FR of pitch between unimpaired speakers and impaired speakers (when uttering the vowels correctly). This corroborates previous work [4] in that impaired speakers can still control some prosodic features in their speech even when they lose intelligibility in their vowel production. The results for the vowels mispronounced by impaired speakers cannot be considered due to the pernicious effect of unseen cases in the test data.

It is in terms of energy (or intensity) that impaired speakers seem to have greater problems in the control of prosody and stress. There is a reduction of 56.26% in sKLD and 56.82% in FR in the discriminative power between these two distributions, and this reduction increases to 80% in the case of mispronounced vowels. As mentioned in Section 5, this reduction in discriminative power is mostly due to an increase in the energy of unstressed vowels. The reason for this might be that impaired speakers try to reassure themselves in their pronunciation by raising their intensity in situations of hesitation. This extra intensity would not affect stressed vowels, because stressed vowels already have higher energy as a result of stress.

Finally, the study of the length of the vowels produced by the impaired speakers in Table 12 shows a dispersion effect in vowel length. This means that vowels uttered by these speakers are more often abnormally long or short. Actually, two separate effects can be appreciated: in the case of vowels correctly pronounced by the impaired speakers there is a lengthening of the vowels (around a 20%–30% increase in mean values between Tables 7 and 12), while mispronounced vowels are excessively dispersed (with standard deviations of 80% of the mean values), mainly due to the doubts and hesitations of incorrect pronunciations. The increase in duration of correctly pronounced vowels might indicate certain hesitations of the speakers when uttering their speech, due to the insecurity in speech production caused by their speech disorders.

7. Conclusion

As conclusions to this work, a whole corpus of unimpaired and impaired children's speech has undergone an acoustic study based on LPC analysis to calculate acoustic features such as formants and suprasegmental features (pitch, energy and length). Results show that the good properties of unimpaired speakers (well-behaved formants, separation of stressed and unstressed vowels in terms of pitch and energy, and statistically consistent length features) are distorted in different ways in the impaired speakers.

Impaired speakers reduce by 20%–30% the discriminative ability of the formant map, even when the pronunciation is perceived as correct by a set of human experts. Results in the case of mispronunciations show a total blur of the formant map, as expected and as detected by the human experts. Impaired speakers have good control of tone as a feature for the microprosody of the words, but intensity discrimination between stressed and unstressed vowels is reduced by 50% due to an increase in the energy of unstressed vowels. Finally, it has been shown that these speakers have problems maintaining a steady production of vowels in terms of their length, with the abnormal production of extremely long or short vowels reflected in a 50% increase in the standard deviation of the vowel length.

Hence, it can be concluded that the main problems in vowel production due to the speech disorders analyzed in this work show up in terms of formants, intensity control and vowel length, while the speakers are able to maintain a correct production of pitch. Further work in this area may include a more precise analysis of the formant values, considering their relationship to the pitch value of every speaker. Also, the results in this work could be validated against the results obtained with a manual segmentation of the vowels.


However, the automated segmentation is robust enough and, together with the strict sonority threshold applied, ensures that all the analyzed frames belong to vowels.

Further studies of vowel duration may also be done considering a new vocabulary with the same syllables in different positions and stress conditions. Finally, a larger study considering connected speech might be carried out to study the loss of prosodic features in complete sentences. Such a study might be useful to determine whether impaired speakers have problems with prosody control in a more complex context than the simple control of stress features. Another study of interest would be to link these results to the outcome of a full phonetic transcription of the speakers' speech (with a confusion matrix of the mispronunciations) and also to analyze each speaker's speech separately in terms of acoustic parameters, although that would require a more careful statistical analysis due to the reduction in the amount of data studied.

Acknowledgments

The authors want to acknowledge Jose Manuel Marcos, Cesar Canalís, Pedro Pegero, and Beatriz Martínez from the School for Special Education "Alborada", located in Zaragoza (Spain), for their collaboration in this work.

References

[1] M.-J. Ball, Phonetics for Speech Pathology, Whurr Publishers, London, UK, 1983.

[2] K. Croot, "An acoustic analysis of vowel production across tasks in a case of non-fluent progressive aphasia," in Proceedings of the 5th International Conference on Spoken Language Processing (ICSLP-Interspeech '98), pp. 907–910, Sydney, Australia, December 1998.

[3] T. Prizl-Jakovac, "Vowel production in aphasia," in Proceedings of the 6th European Conference on Speech Communication and Technology (EuroSpeech '99), pp. 583–586, Budapest, Hungary, September 1999.

[4] R. Patel, "Phonatory control in adults with cerebral palsy and severe dysarthria," Augmentative and Alternative Communication, vol. 18, no. 1, pp. 2–10, 2002.

[5] C. P. Moura, D. Andrade, L. M. Cunha, et al., "Voice quality in Down syndrome children treated with rapid maxillary expansion," in Proceedings of the 9th European Conference on Speech Communication and Technology (EuroSpeech '05), pp. 1073–1076, Lisboa, Portugal, September 2005.

[6] O. Saz, A. Miguel, E. Lleida, A. Ortega, and L. Buera, "Study of time and frequency variability in pathological speech and error reduction methods for automatic speech recognition," in Proceedings of the 9th International Conference on Spoken Language Processing (ICSLP '06), pp. 993–996, Pittsburgh, Pa, USA, September 2006.

[7] R. Justo, O. Saz, V. Guijarrubia, A. Miguel, M.-I. Torres, and E. Lleida, "Improving dialogue systems in a home automation environment," in Proceedings of the 1st International Conference on Ambient Media and Systems (Ambi-Sys '08), Quebec, Canada, February 2008.

[8] J.-L. Navarro-Mesa, P. Quintana-Morales, I. Perez-Castellano, and J. Espinosa-Yanez, "Oral corpus of the project HACRO (Help tool for the confidence of oral utterances)," Tech. Rep., Department of Signal and Communications, University of Las Palmas de Gran Canaria, Gran Canaria, Spain, May 2005.

[9] O. Fujimura and D. Erickson, "Acoustic phonetics," in The Handbook of Phonetic Sciences, W.-J. Hardcastle and J. Laver, Eds., pp. 65–115, Blackwell, Oxford, UK, 1997.

[10] K.-N. Stevens, Acoustic Phonetics, MIT Press, Cambridge, Mass, USA, 1998.

[11] O. Scharenborg, "Modelling fine-phonetic detail in a computational model of word recognition," in Proceedings of the International Conference on Spoken Language Processing (ICSLP '08), pp. 1473–1476, Brisbane, Australia, September 2008.

[12] J.-I. Hualde, The Sounds of Spanish, Cambridge University Press, Cambridge, UK, 2005.

[13] A. Quilis, Fonetica Acustica de la Lengua Espanola, Gredos, Madrid, Spain, 1981.

[14] E. Martínez-Celdran and A.-M. Fernandez-Planas, Manual de Fonetica Espanola. Articulaciones y Sonidos del Espanol, Ariel, Barcelona, Spain, 2007.

[15] D. Fry, "Experiments in the perception of stress," Language and Speech, vol. 1, pp. 126–152, 1958.

[16] O. Saz, W.-R. Rodríguez, E. Lleida, and C. Vaquero, "A novel children's corpus of disordered speech," in Proceedings of the 1st Workshop on Child, Computer and Interaction (WOCCI '08), Chania, Greece, October 2008.

[17] M. Monfort and A. Juarez-Sanchez, Registro Fonologico Inducido (Tarjetas Graficas), Cepe, Madrid, Spain, 1989.

[18] C. Vaquero, O. Saz, E. Lleida, and W.-R. Rodríguez, "E-inclusion technologies for the speech handicapped," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '08), pp. 4509–4512, Las Vegas, Nev, USA, March-April 2008.

[19] S. Lee, A. Potamianos, and S. Narayanan, "Acoustics of children's speech: developmental changes of temporal and spectral parameters," Journal of the Acoustical Society of America, vol. 105, no. 3, pp. 1455–1468, 1999.

[20] W.-R. Rodríguez, C. Vaquero, O. Saz, and E. Lleida, "Speech technology applied to children with speech disorders," in Proceedings of the 4th International Conference on Biomedical Engineering (BioMed '08), pp. 247–250, Kuala Lumpur, Malaysia, June 2008.

[21] A. Moreno, D. Poch, A. Bonafonte, et al., "Albayzin speech database: design of the phonetic corpus," in Proceedings of the 3rd European Conference on Speech Communication and Technology (EuroSpeech '93), pp. 175–178, Berlin, Germany, September 1993.

[22] A. Moreno, B. Lindberg, C. Draxler, et al., "SpeechDat-Car: a large speech database for automotive environments," in Proceedings of the 2nd Language Resources European Conference (LREC '00), Athens, Greece, June 2000.

[23] L.-R. Rabiner and R.-W. Schafer, Digital Processing of Speech Signals, Signal Processing Series, Prentice-Hall, Englewood Cliffs, NJ, USA, 1978.

[24] T.-M. Cover and J.-A. Thomas, Elements of Information Theory, Wiley Interscience, New York, NY, USA, 1991.

[25] P. Koehn, "Europarl: a parallel corpus for statistical machine translation," in Proceedings of the 10th Machine Translation Summit, pp. 79–86, Phuket, Thailand, September 2005.


Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2009, Article ID 629030, 9 pages
doi:10.1155/2009/629030

Research Article

Automated Intelligibility Assessment of Pathological Speech Using Phonological Features

Catherine Middag,1 Jean-Pierre Martens,1 Gwen Van Nuffelen,2 and Marc De Bodt2

1 Department of Electronics and Information Systems, Ghent University, 9000 Ghent, Belgium
2 Antwerp University Hospital, University of Antwerp, 2650 Edegem, Belgium

Correspondence should be addressed to Catherine Middag, [email protected]

Received 31 October 2008; Accepted 24 March 2009

Recommended by Juan I. Godino-Llorente

It is commonly acknowledged that word or phoneme intelligibility is an important criterion in the assessment of the communication efficiency of a pathological speaker. People have therefore put a lot of effort into the design of perceptual intelligibility rating tests. These tests usually have the drawback that they employ unnatural speech material (e.g., nonsense words) and that they cannot fully exclude errors due to listener bias. Therefore, there is a growing interest in the application of objective automatic speech recognition technology to automate the intelligibility assessment. Current research is headed towards the design of automated methods which can be shown to produce ratings that correspond well with those emerging from a well-designed and well-performed perceptual test. In this paper, a novel methodology that builds on previous work (Middag et al., 2008) is presented. It utilizes phonological features, automatic speech alignment based on acoustic models that were trained on normal speech, context-dependent speaker feature extraction, and intelligibility prediction based on a small model that can be trained on pathological speech samples. The experimental evaluation of the new system reveals that the root mean squared error of the discrepancies between perceived and computed intelligibilities can be as low as 8 on a scale of 0 to 100.

Copyright © 2009 Catherine Middag et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

In clinical practice there is a great demand for fast and reliable methods for assessing the communication efficiency of a person with a (pathological) speech disorder. It is argued in several studies (e.g., [1]) that intelligibility is an important criterion in this assessment. Therefore, several perceptual tests aiming at the measurement of speech intelligibility have been conceived [2–4]. One of the primary prerequisites for getting reliable scores is that the test should be designed in such a way that the listener cannot guess the correct answer based solely on contextual information. That is why these tests use random word lists, varying lists at different trials, real words as well as pseudowords, and so forth. Another important issue is that the listener should not be too familiar with the tested speaker, since this creates a positive bias. Finally, if one wants to use the test for monitoring the efficiency of a therapy, one cannot work with the same listener all the time, because this would introduce a bias shift. The latter actually excludes the speaker's therapist as a listener, which is very unfortunate from a practical viewpoint.

For the last couple of years there has been a growing interest in trying to apply automatic speech recognition (ASR) for the automation of the traditional perceptual tests [5–8]. By definition an ASR is an unbiased listener, but is it already reliable enough to give rise to computed intelligibility scores that correlate well with the scores obtained from a well-designed and well-performed perceptual test? In this paper, we present and evaluate an automated test which seems to provide such scores.

The simplest approach to automated testing is to let an ASR listen to the speech, let it perform a lexical decoding of that speech, and compute the intelligibility as the percentage of correctly decoded words or phonemes. Recent work [9, 10] has demonstrated that this approach can work well when applied to read text passages of speakers with a particular disorder (e.g., dysarthric or laryngectomized speakers).


In that setting it can yield intelligibilities that correlate well with an impression of intelligibility expressed on a 7-point Likert scale [11].

In order to explore the potential of the approach in more demanding situations, we have let a state-of-the-art ASR system [12] recognize isolated monosyllabic words and pseudowords spoken by a variety of pathological speakers (different types of pathology and different degrees of severity of that pathology). The perceptual intelligibilities against which we compared the computed ones represented intelligibility at the phone level. The outcome of our experiments was that the correlations between the perceptual and the computed scores were only moderate [13]. This is in line with our expectations, since the ASR employs acoustic models that were trained on the speech of nonpathological speakers. Consequently, when confronted with severely disordered speech, the ASR is asked to score sounds that are in many respects very different from the sounds it was trained on. This means that acoustic models are asked to make extrapolations in areas of the acoustic space that were not examined at all during training. One cannot expect that under these circumstances a lower acoustic likelihood always points to a larger deviation (distortion) of the observed pronunciation from the norm.

Based on this last argument we have conceived an alternative approach. It first of all employs phonological features as an intermediate description of the speech sounds. Furthermore, it computes a series of features used for characterizing the voice of a speaker, and it employs a separate intelligibility prediction model (IPM) to convert these features into a computed intelligibility. Our first hypothesis was that even in the case of severe speech disorders, some of the articulatory dimensions of a sound may still be more or less preserved. A description of the sounds in an articulatory feature space may possibly offer a foundation for at least assessing the severity of the relatively limited distortions in these articulatory dimensions. Note that the term "articulatory" is usually reserved to designate features stemming from direct measurements of articulatory movements (e.g., by means of an articulograph). We adopt the term "phonological" for features that are also intended to describe articulatory phenomena, although here they are derived from the waveform. Our second hypothesis was that it would take only a simple IPM with a small number of free parameters to convert the speaker features into an intelligibility score, and therefore that this IPM can be trained on a small collection of both pathological and normal speakers.

We formerly developed an initial version of our system [13], and we were able to demonstrate that its computed intelligibilities correlated well with perceived phone-level intelligibilities [14] for our speech material. However, these good correlations could only be attained with a system incorporating two distinct ASR components: one working directly in the acoustic feature space and one working in the phonological feature space. In this paper we present significant improvements of the phonological component of our system, and we show that as a result of these improvements we can now obtain high accuracy using phonological features alone. This means that we now obtain good results with a much simpler system comprising only one ASR with no more than 55 context-independent acoustic models.

The rest of this paper is organized as follows. In Section 2, we briefly describe the perceptual test that was automated and the pathological speech corpus that was available for the training and evaluation of our system. In Section 3 we present the system architecture, and we briefly discuss the basic operations performed by the initial stages of the system. The novel speaker feature extractor and the training of the IPM are discussed in Sections 4 and 5, respectively. In Section 6 we assess the reliability of the new system and compare it to that of the original system. The paper ends with a conclusion and some directions for future work.

2. Perceptual Test and Evaluation Database

The subjective test we have automated is the Dutch Intelligibility Assessment (DIA) test [4], which was specifically designed to measure the intelligibility of Dutch speech at the phoneme level. Each speaker reads 50 consonant-vowel-consonant (CVC) words, with one relaxation, namely, words with one of the two consonants missing are also allowed. The words are selected from three lists: list A is intended for testing the consonants in a word-initial position (19 words, including one with a missing initial consonant), list B is intended for testing them in a word-final position (15 words, including one with a missing final consonant), and list C is intended for testing the vowels and diphthongs in a word-central position (16 words with an initial and final consonant). To avoid guessing by the listener, there are 25 variants of each list, and each variant contains existing words as well as pronounceable pseudowords. For each test word, the listener must complete a word frame by filling in the missing phoneme or by indicating the absence of that phoneme. In case the initial consonant is tested, the word frame could be something like ".it" or ".ol". The perceptual intelligibility score is calculated as the percentage of correctly identified phonemes. Previous research [4, 15] has demonstrated that the intelligibility scores derived from the DIA are highly reliable (an interrater correlation of 0.91 and an intrarater correlation of 0.93 [15]).

In order to train and test our automatic intelligibility measurement system, we had at our disposal a corpus of recordings from 211 speakers. All speakers uttered 50 CVC words (the DIA test) and a short text passage.

The speakers belong to 7 distinct categories: 51 speakers without any known speech impairment (the control group), 60 dysarthric speakers, 12 children with cleft lip or palate, 42 persons with pathological speech secondary to hearing impairment, 37 laryngectomized speakers, 7 persons diagnosed with dysphonia, and 2 persons with a glossectomy.

The DIA recordings of all speakers were scored by one trained speech therapist. This therapist was, however, not familiar with the recorded patients.


The perceptual (subjective) phoneme intelligibilities of the pathological training speakers range from 28 to 100 percent, with a mean of 78.7 percent. The perceptual scores of the control speakers range from 84 to 100 percent, with a mean of 93.3 percent. More details on the recording conditions and the severity of the speech disorders can be found in [14].

We intend to make the data freely available for research through the Dutch Speech and Language Resources agency (TST-centrale), but this requires good documentation in English first. In the meantime, the data can already be obtained by simple request (just contact the first author of this paper).

3. An Automatic Intelligibility Measurement System

As already mentioned in the introduction, we have conceived a new speech intelligibility measurement system that is more than just a standard word recognizer. The architecture of the system is depicted in Figure 1. The acoustic front-end extracts a stream of mel-frequency cepstral coefficient (MFCC) [16] feature vectors from the waveform. At every time t = 1, ..., T which is a multiple of 10 milliseconds, it computes a vector X_t of 12 MFCCs plus a log-energy (all derived from a segment of 30 milliseconds centered around t). This MFCC feature stream is then converted into a phonological feature stream. At each time t, the phonological feature detector computes a vector Y_t of 24 components, each representing the posterior probability P(A_i | X_{t-5}, ..., X_{t+5}) that one of 24 binary phonological classes A_i (i = 1, ..., 24) is "supported by the acoustics" in a 110 milliseconds window around time t. The full list of phonological classes can be found in [17]. Some typical examples are the classes voiced (= vocal source class), burst (= manner class), labial (= place-consonant class), and mid-low (= vowel class). The phonological feature detector is a conglomerate of four artificial neural networks that were trained on continuous speech uttered by normal speakers [17].

The forced alignment system lines up the phonological feature stream with a typical (canonical) acoustic-phonetic transcription of the target word. This transcription is a sequence of basic acoustic-phonetic units, commonly referred to as phones [18]. The acoustic-phonetic transcription is modeled by a sequential finite state machine composed of one state per phone. The states are context-independent, meaning that all occurrences of a particular phone are modeled by the same state. This is considered acceptable because coarticulations can be handled in an implicit way by the phonological feature detector. In fact, the latter analyzes a long time interval for any given time frame, and this window can expose most of the contextual effects. Each state is characterized by a set of canonical values A_ci for the phonological classes A_i. These values can either be 1 (= on, present), 0 (= off, absent), or irrelevant (= both values are equally acceptable). Self-loops and skip transitions make it possible to handle variable phone durations and phone omissions in an easy way.

The alignment system is instructed to return the state sequence S = {s1, ..., sT} with the largest posterior probability P(S | X1, ..., XT). This probability is approximated as follows (see [17] for more details):

\[
P(S \mid X_1,\ldots,X_T) = \prod_{t=1}^{T} \frac{P(s_t \mid X_{t-5},\ldots,X_{t+5})\,P(s_t \mid s_{t-1})}{P(s_t)},
\qquad
P(s_t \mid X_{t-5},\ldots,X_{t+5}) = \Biggl[\,\prod_{A_{ci}(s_t)=1} Y_{ti}\Biggr]^{1/N_p(s_t)},
\tag{1}
\]

with Np(st) representing the number of classes with a positive canonical value for state st. The transition probabilities P(st | st−1) and the prior state probabilities P(st) were trained on normal speech. The probability P(st | Xt−5, ..., Xt+5) is hereafter abbreviated as P(st | Xt).
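A minimal sketch of how (1) can be evaluated frame by frame is given below. The containers `canonical` (holding the canonical class values Aci(s)), `trans` (the transition probabilities), and `prior` (the state priors) are hypothetical, and the handling of states without positive canonical classes is our own simplification.

    import numpy as np

    def state_observation_prob(Y_t, canonical, state):
        # P(s_t | X_t) as in (1): the geometric mean of the posteriors Y_ti of the
        # phonological classes whose canonical value is 1 ("on") for this state.
        pos = [i for i, v in enumerate(canonical[state]) if v == 1]
        if not pos:                       # simplification for states without
            return 1.0                    # positive canonical classes
        return float(np.prod([Y_t[i] for i in pos]) ** (1.0 / len(pos)))

    def frame_score(Y_t, canonical, prev_state, state, trans, prior):
        # One factor of the product in (1): the scaled posterior times the
        # transition probability, divided by the state prior.
        return (state_observation_prob(Y_t, canonical, state)
                * trans[prev_state][state] / prior[state])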

Once the 3-tuples (st, Yt, P(st | Xt)) are available for all frames of all utterances of one speaker, the speaker feature extractor can derive from these 3-tuples (and from the canonical values of the phonological classes in the different states) a set of phonological features that characterize the speaker. The Intelligibility Prediction Model (IPM) then converts these speaker features into a computed phoneme intelligibility score.

In the subsequent sections, we will provide a more detailed description of the last two processing stages, since these are the stages that mostly distinguish the new from the original system.

4. Speaker Feature Extraction

In [13], only context-independent speaker features were derived from the alignments. In this work we will benefit from the binary nature of the phonological classes to identify an additional set of context-dependent speaker features that can be extracted from these alignments.

The extraction of speaker features is always based on averaging either P(st | Xt) or Yt over frames that were assigned to a particular state or set of states. The averaging is not restricted to frames that, according to the alignment, contribute to the realization of a phoneme that is being tested in the DIA (e.g., the initial consonant of the word). We let the full utterances and the corresponding state sequences contribute to the feature computation because we assume that this should lead to a more reliable (stable) characterization of the speaker. However, at certain places, we have compensated for the fact that not every speaker has pronounced the same words (due to subtest variants), and therefore, that the distribution of phonemes can differ from speaker to speaker as well.

4.1. Phonemic Features (PMFs). A phonemic feature PMF(f) for phone f is derived as the mean of P(st | Xt) over all frames Xt that were assigned to a state st which is equal to f (there is 1 state per phone). Repeating this for every phone in the inventory then gives rise to 55 PMFs of the form

\[
\mathrm{PMF}(f) = \langle P(s_t \mid X_t) \rangle_{t;\, s_t = f}, \qquad f = 1,\ldots,55,
\tag{2}
\]

with ⟨x⟩_selection representing the mean of x over the frames specified by the selection.

Figure 1: Architecture of the automatic intelligibility measurement system. The waveform s(n) passes through the acoustic front-end (yielding Xt) and the phonological feature detector (yielding Yt); the alignment system, driven by the acoustic-phonetic transcription, produces st and P(st | Xt), which the speaker feature extractor and the intelligibility prediction model convert into an intelligibility score.
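The PMF computation of (2) amounts to a simple per-phone average, as in the sketch below; it assumes that the alignment results have already been collected as (state, P(st | Xt)) pairs over all utterances of one speaker, and the container names are ours.

    from collections import defaultdict

    def phonemic_features(frames):
        # PMF(f): mean of P(s_t | X_t) over all frames assigned to the single
        # state of phone f.  `frames` is a list of (state, posterior) pairs.
        sums, counts = defaultdict(float), defaultdict(int)
        for state, posterior in frames:
            sums[state] += posterior
            counts[state] += 1
        return {f: sums[f] / counts[f] for f in sums}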

4.2. Phonological Features (PLFs). Instead of averaging the posterior probabilities P(st | Xt), one can also average the phonological features Yti (i = 1, ..., 24). In particular, one can take the mean of Yti (for some i) over all frames that were assigned to one of the phones that are characterized by a canonical value Aci = A for feature class Ai (A can be either 1 or 0 here). This mean score is thus generally determined by the realizations of multiple phones. Consequently, since different speakers have uttered different word lists, the different phones could have a speaker-dependent weight in the computed means. In order to avoid this, the simple averaging scheme is replaced by the following two-stage procedure:

(1) take the mean of Yti over all frames that were assigned to a phone f whose Aci(f) = A, denote this mean as PLF(f, i, A), and repeat the procedure for all valid combinations (f, i, A);

(2) compute PLF(i, A) as the mean over f of the PLF(f, i, A) that were obtained in the previous stage.

This procedure gives equal weights to every phone contributing to PLF(i, A). Written in mathematical notation, one gets

\[
\mathrm{PLF}(f, i, A) = \langle Y_{ti} \rangle_{t;\, s_t = f;\, A_{ci}(f) = A} \quad \forall\ \text{valid } (f, i, A),
\qquad
\mathrm{PLF}(i, A) = \langle \mathrm{PLF}(f, i, A) \rangle_{f;\, A_{ci}(f) = A} \quad i = 1,\ldots,24;\ A = 0, 1.
\tag{3}
\]

Since for every one of the 24 phonological feature classes there are phones with canonical values 0 and 1 for that class, one always obtains 48 phonological features. The 24 phonological features PLF(i, 1) are called positive features because they measure to what extent a phonological class that was supposed to be present during the realization of certain phones is actually supported by the acoustics observed during these realizations. The 24 phonological features PLF(i, 0) are called negative features. We add this negative PLF set because it is important for a patient's intelligibility not only that phonological features occur at the right time but also that they are absent when they should be.
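The two-stage averaging of (3) could be sketched as follows. Here `frames` is assumed to hold (phone, Yt) pairs from the alignment and `canonical[phone]` the list of 24 canonical class values (1, 0, or None for irrelevant); both names are hypothetical.

    from collections import defaultdict

    def phonological_features(frames, canonical):
        # Stage 1: per-phone means PLF(f, i, A) of Y_ti over the frames of phone f,
        # for every class i whose canonical value A is 0 or 1 (None = irrelevant).
        sums, counts = defaultdict(float), defaultdict(int)
        for phone, Y_t in frames:
            for i, A in enumerate(canonical[phone]):
                if A in (0, 1):
                    sums[(phone, i, A)] += Y_t[i]
                    counts[(phone, i, A)] += 1
        per_phone = {key: sums[key] / counts[key] for key in sums}
        # Stage 2: average over phones, so that every phone contributing to
        # PLF(i, A) receives the same weight.
        plf_sums, plf_counts = defaultdict(float), defaultdict(int)
        for (phone, i, A), value in per_phone.items():
            plf_sums[(i, A)] += value
            plf_counts[(i, A)] += 1
        return {key: plf_sums[key] / plf_counts[key] for key in plf_sums}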

4.3. Context-Dependent Phonological Features (CD-PLFs). It can be expected that pathological speakers encounter more problems with the realization of a particular phonological class in some contexts than in others. Consequently, it makes sense to compute the mean value of a phonological feature Yti under different circumstances that take not only the canonical value of feature class Ai in the tested phone into account but also the properties of the surrounding phones. Since the phonological classes are supposed to refer to different dimensions of articulation, it makes sense to consider them more or less independently, and therefore, to consider only the canonical values of the tested phonological class in these phones as context information. Due to the ternary nature of the phonological class values (on, off, irrelevant), the number of potential contexts per (i, A) is limited to 3 × 3 = 9. If we further include "silence" as a special context to indicate that there is no preceding or succeeding phone, the final number of contexts is 16. Taking into account that PLFs are only generated for canonical values A of 0 and 1 (and not for irrelevant), the total number of sequences of canonical values (SCVs) for which to compute a CD-PLF is 24 × 2 × 16 = 768. This number is however an upper bound, since many of these SCVs will not occur in the 50 word utterances of the speaker.

In order to determine in advance all the SCVs that are worthwhile to consider in our system, we examined the canonical acoustic-phonetic transcriptions of the words in the different variants of the A-, B-, and C-lists, respectively. We derived from these lists how many times they contain a particular SCV. We then retained only those SCVs that appeared at least twice in any combination of variants one could make. It is easy to determine the minimal number of occurrences of each SCV. One just needs to determine the number of times each variant of the A-list contains the SCV and to record the minimum over these counts to get an A-count. Similarly, one determines a B-count and a C-count, and one takes the sum of these counts. For our test, we found that 123 of the 768 SCVs met the condition we set out.
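This selection rule could be implemented along the following lines; `variant_scv_counts` is a hypothetical mapping from list identifier ('A', 'B', 'C') to one {SCV: count} dictionary per variant of that list.

    def retained_scvs(variant_scv_counts):
        # Keep an SCV only if it is guaranteed to occur at least twice in any
        # combination of one A-, one B- and one C-list variant: take, per list,
        # the minimum count over that list's variants, and sum the three minima.
        all_scvs = {scv for variants in variant_scv_counts.values()
                    for counts in variants for scv in counts}
        retained = set()
        for scv in all_scvs:
            guaranteed = sum(min(counts.get(scv, 0) for counts in variants)
                             for variants in variant_scv_counts.values())
            if guaranteed >= 2:
                retained.add(scv)
        return retained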

If AL and AR represent the canonical values of feature class Ai in the left and right context phone, the computation of a context-dependent feature for the combination (A, AL, AR) is obtained by means of a two-stage scheme:

(1) take the mean of Yti over all frames which were assigned to a phone f having a canonical value Aci(f) = A (A can be either 1 or 0 here) and appearing between phones whose canonical values of class Ai are AL and AR, denote this mean as PLF(f, i, A, AL, AR), and repeat the procedure for all combinations (f, i, A, AL, AR) occurring in the data;

(2) compute PLF(i, A, AL, AR) as the mean over f of the PLF(f, i, A, AL, AR) that were computed in the first stage.

Again, this procedure gives equal weights to all the phones that contribute to a certain CD-PLF. In mathematical notation one obtains

\[
\mathrm{PLF}(f, i, A, A_L, A_R) = \langle Y_{ti} \rangle_{t;\, s_t = f;\, A_{ci} = A;\, A^L_{ci} = A_L;\, A^R_{ci} = A_R} \quad \forall\ \text{occurring } (f, i, A, A_L, A_R),
\]
\[
\mathrm{PLF}(i, A, A_L, A_R) = \langle \mathrm{PLF}(f, i, A, A_L, A_R) \rangle_{f;\, \text{occurring}(f, i, A, A_L, A_R)} \quad \forall\ \text{occurring } (i, A, A_L, A_R),
\tag{4}
\]

with Aci, ALci, and ARci being short notations for, respectively, the canonical values of Ai in the state visited at time t, in the state from where this state was reached at some time before t, and in the state which is visited after having left the present state at some time after t.

Note that the context is derived from the phone sequence that was actually realized according to the alignment system. Consequently, if a phone is omitted, a context that was not expected from the canonical transcriptions can occur, and vice versa. Furthermore, there may be fewer observations than expected for the SCV that has the omitted phone in central position. In the case that no observation of a particular SCV would be available, the corresponding feature is replaced by its expected value (as derived from a set of recorded tests).
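A sketch of the CD-PLF computation in (4), under an assumed data layout, is given below. Here `segments` holds the realized phone sequence from the alignment (with 'sil' marking utterance boundaries), `expected` supplies the back-off values for unobserved SCVs, and `scv_list` is the set of retained SCVs (123 in our experiments); all of these names are illustrative.

    from collections import defaultdict

    def cd_phonological_features(segments, canonical, expected, scv_list):
        # `segments` is a list of (phone, frames) pairs, where `frames` holds the
        # Y_t vectors assigned to that phone; `canonical[f][i]` is 1, 0 or None.
        def context_value(phone, i):
            return 'sil' if phone == 'sil' else canonical[phone][i]

        # Stage 1: per-phone means PLF(f, i, A, A_L, A_R), with the context taken
        # from the phones that were actually realized around phone f.
        sums, counts = defaultdict(float), defaultdict(int)
        for k, (phone, frames) in enumerate(segments):
            if phone == 'sil':
                continue
            left = segments[k - 1][0] if k > 0 else 'sil'
            right = segments[k + 1][0] if k + 1 < len(segments) else 'sil'
            for i, A in enumerate(canonical[phone]):
                if A not in (0, 1):
                    continue
                key = (phone, i, A, context_value(left, i), context_value(right, i))
                for Y_t in frames:
                    sums[key] += Y_t[i]
                    counts[key] += 1
        per_phone = {key: sums[key] / counts[key] for key in sums}

        # Stage 2: equal weight per contributing phone; back off to the expected
        # value when an SCV was never observed for this speaker.
        cd_sums, cd_counts = defaultdict(float), defaultdict(int)
        for (phone, i, A, A_L, A_R), value in per_phone.items():
            cd_sums[(i, A, A_L, A_R)] += value
            cd_counts[(i, A, A_L, A_R)] += 1
        return {scv: (cd_sums[scv] / cd_counts[scv]) if cd_counts[scv] > 0
                else expected[scv] for scv in scv_list}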

5. Intelligibility Prediction Model (IPM)

When all speaker features are computed, they need to be converted into an objective intelligibility score for the speaker. In doing so we use a regression model that is trained on both pathological and normal speakers.

5.1. Model Choice. A variety of statistical learners is available for optimizing regression problems. However, in order to avoid overfitting, only a few of these can be applied to our data set. This is because the number of training speakers (211) is limited compared to the number of features (e.g., 123 CD-PLFs) per speaker. A linear regression model in terms of selected features, with the possible combination of some ad hoc transformation of these features, is about the most complex model we can construct.

5.2. Model Training. We build linear regression models for different feature sets, namely, PMF, PLF, and CD-PLF, and combinations thereof. A fivefold cross-validation (CV) method is used to identify the feature subset yielding the best performance. In contrast to our previous work, we no longer take the Pearson Correlation Coefficient (PCC) as the primary performance criterion. Instead, we opt for the root mean squared error (RMSE) of the discrepancies between the computed and the measured intelligibilities. Our main arguments for this change of strategy are the following.

First of all, the RMSE is directly interpretable. In case the discrepancies (errors) are normally distributed, 67% of the computed scores lie closer than the RMSE to the measured (correct) scores. Using the Lilliefors test [19] we verified that, in practically all the experiments we performed, the errors were indeed normally distributed.

A second argument is that we want the computed scores to approximate the correct scores directly. Per test set, the PCC actually quantifies the degree of correlation between the correct scores and the best linear transformation of the computed scores. As this transformation is optimized for the considered test set, the PCC may yield an overly optimistic evaluation result.

Finally, we noticed that if a model is designed to cover a large intelligibility range, and if it is evaluated on a subgroup (e.g., the control group) covering only a small subrange, the PCC can be quite low for this subgroup even though the errors remain acceptable. This happens when the rankings of the speakers of this group along the perceptual and the objective scores, respectively, are significantly different. The RMSE results were found to be much more stable across subgroups.

Due to the large number of features, an exhaustive search for the best subset would take a lot of computation time. Therefore we investigated two much faster but definitely suboptimal sequential procedures. The so-called forward procedure starts with the best combination of 3 features and adds one feature (the best) at a time. The so-called backward procedure starts with all the features and removes one feature at a time.

Figure 2 illustrates a typical variation of RMSE versus the number of features being selected. By measuring not only the global RMSE but also the individual RMSEs in the 5 folds of the CV test, one can get an estimate of the standard deviation on the global RMSE for a particular selected feature set. In order to avoid that too many features are being selected, we have adopted the following 2-step procedure: (1) determine the selected feature set yielding the minimal RMSE; (2) select the smallest feature set yielding an RMSE that is not larger than the minimal (best) RMSE augmented with the estimated standard deviation on that RMSE.
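The forward selection with this RMSE-plus-one-standard-deviation stopping rule could be sketched as follows, using scikit-learn's linear regression. For brevity the sketch starts from an empty feature set rather than from the best combination of three features, and the fold assignment is an assumption; `X` and `y` stand for the speaker feature matrix and the perceptual scores.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import KFold

    def cv_rmse(X, y, n_folds=5):
        # Global RMSE under k-fold cross-validation, plus the standard deviation
        # of the per-fold RMSEs (used as the tolerance below).
        fold_rmse, errors = [], []
        for train, test in KFold(n_splits=n_folds, shuffle=True,
                                 random_state=0).split(X):
            model = LinearRegression().fit(X[train], y[train])
            err = y[test] - model.predict(X[test])
            errors.append(err)
            fold_rmse.append(np.sqrt(np.mean(err ** 2)))
        return np.sqrt(np.mean(np.concatenate(errors) ** 2)), np.std(fold_rmse)

    def forward_selection(X, y, max_features=50):
        # Greedy forward selection; afterwards, keep the smallest subset whose
        # RMSE is within one standard deviation of the best RMSE found.
        selected, history = [], []
        remaining = list(range(X.shape[1]))
        while remaining and len(selected) < max_features:
            rmse, best_f = min((cv_rmse(X[:, selected + [f]], y)[0], f)
                               for f in remaining)
            selected.append(best_f)
            remaining.remove(best_f)
            history.append((list(selected),) + cv_rmse(X[:, selected], y))
        _, best_rmse, sd = min(history, key=lambda h: h[1])
        for subset, rmse, _ in history:          # history is ordered by size
            if rmse <= best_rmse + sd:
                return subset
        return history[-1][0]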

Figure 2: Typical evolution of the root mean squared error (RMSE) as a function of the number of selected features for the forward selection procedure. Also indicated is the evolution of RMSE ± σ, with σ representing the standard deviation on the RMSEs found for the 5 folds. The square and the circle show the sizes of the best and the actually selected feature subsets.

6. Results and Discussion

We present results for the new system as well as for a previously published system that was much more complex, since it comprised two subsystems, each containing a different ASR and each generating a set of speaker features. The first subsystem generated 55 phonemic features (PMF-tri) originating from acoustic scores computed by state-of-the-art triphone acoustic models in the MFCC feature space. The second subsystem generated 48 phonological features (PLFs) in the way described in Section 4.2. The speaker features of the two subsystems could be combined before they were supplied to the intelligibility prediction model.

6.1. General Results. We have used the RMSE criterion to obtain three general IPMs (trained on all speakers) that were based on the speaker features generated by our original system. The first model only used the phonemic features (PMF-tri) emerging from the first subsystem, the second one applied the phonological features (PLF) emerging from the other subsystem, and the third one utilized the union of these two feature sets (PMF-tri + PLF). The number of selected features and the RMSEs for these models are listed in the first three rows of Table 1.

Next, we examined all the combinations of 1, 2, or 3 speaker feature sets as they emerged from the new system. The figures in Table 1 show that all IPMs using the CD-PLFs perform about as well as our previous best system, PMF-tri + PLF. In the future, as we look further into the underlying articulatory problems of pathological speakers, it will be most pertinent to opt for an IPM based solely on articulatory information, such as PLF + CD-PLF.

Taking this IPM as our reference system, the Wilcoxon signed-rank test [19] has revealed the following:

Table 1: Number of selected features and RMSE for a number of general models (trained on all speakers) created for different speaker feature sets. The features with suffix "tri" emerge from our previously published system. Results differing significantly from the ones of our reference system PLF + CD-PLF are marked in bold.

Speaker features        Selected features   RMSE
PMF-tri                          5            8.9
PLF                             16            9.2
PMF-tri + PLF                   19            7.7
PMF                             11           10.1
PLF                             16            9.2
CD-PLF                          21            8.2
PLF + CD-PLF                    27            7.9
PMF + CD-PLF                    31            7.8
PMF + PLF                       20            9.0
PMF + PLF + CD-PLF              42            7.8

Table 2: Root mean squared error (RMSE) for pathology-specific IPMs (labels are explained in the text) based on several speaker feature sets. N denotes the number of selected features. The results which differ significantly from the reference system PLF + CD-PLF are marked in bold.

                            DYS   LARYNX   HEAR
CD-PLF               RMSE   6.4     5.2     5.8
                     N       28      19      43
PMF + PLF            RMSE   7.9     7.3     8.1
                     N       12       8      22
PMF + CD-PLF         RMSE   6.1     4.3     3.9
                     N       22      31      55
PLF + CD-PLF         RMSE   6.1     5.3     4.8
                     N       28      17      52
PMF + PLF + CD-PLF   RMSE   5.9     4.1     4.2
                     N       38      28      49
PMF-tri + PLF        RMSE   6.4     7.6     5.5
                     N       26      10      22

(1) There is no significant difference between the accuracy of the new reference system and that of the formerly published system; (2) the context-dependent feature set yields a significantly better accuracy than any of the context-independent feature sets; (3) the addition of context-independent features to CD-PLF only yields a nonsignificant improvement; and (4) a combination of context-independent phonemic and phonological features emerging from one ASR (PMF + PLF) cannot compete with a combination of similar features (PMF-tri + PLF) originating from two different ASRs. Although maybe a bit disappointing at first glance, the first conclusion is an important one because it shows that the new system, with only one ASR comprising 55 context-independent acoustic states, achieves the same performance as our formerly published system with two ASRs, one of which is a rather complex one comprising about a thousand triphone acoustic states.

Figure 3: Computed versus perceptual intelligibility scores emerging from the systems PMF-tri + PLF (a) and PLF + CD-PLF (b). Different symbols were used for dysarthric speakers (D), persons with hearing impairment (H), laryngectomized speakers (L), speakers with normal speech (N), and others (O).

Scatter plots of the subjective versus the objective intelligibility scores for the systems PMF-tri + PLF and PLF + CD-PLF are shown in Figure 3. They confirm that most of the dots lie, in the vertical direction, less than the RMSE (about 8 points) away from the diagonal, which represents the ideal model. They also confirm that the RMSE emerging from our former system is slightly lower than that emerging from our new system.

The largest deviations from the diagonal appear for the speakers with a low intelligibility rate. This is a logical consequence of the fact that we only have a few such speakers in the database.

Figure 4: Computed versus perceptual intelligibility scores emerging from the PMF-tri + PLF (a) and PLF + CD-PLF (b) systems for dysarthric speakers.

This means that the trained IPM will be more specialized in rating medium- to high-quality speakers. Consequently, it will tend to produce overrated intelligibilities for bad speakers. We were not able to record many more bad speakers because they often have other disabilities as well and are therefore incapable of performing the test. By giving more weight to the speakers with low perceptual scores during the training of the IPM, it is possible to reduce the errors for the low perceptual scores at the expense of only a small increase of the RMSE, caused by the slightly larger errors for the high perceptual scores.

6.2. Pathology-Specific Intelligibility Prediction Models. If a clinician is mainly working with one pathology, he is probably more interested in an intelligibility prediction model that is specialized in that pathology. Our hypothesis is that, since people with different pathologies are bound to have different articulation problems, pathology-specific models should select pathology-specific features. We therefore search for the feature set offering the lowest RMSE on the speakers of the validation group with the targeted pathology. However, for training the regression coefficients of the IPM we use all the speakers in the training fold. This way we can alleviate the problem of having an insufficient number of pathology-specific speakers to compute reliable regression coefficients. The characteristics of the specialized models for dysarthria (DYS), laryngectomy (LARYNX), and hearing impairment (HEAR) can be found in Table 2. The results which differ significantly from the reference results are marked in bold; the reference results are themselves marked in italic. The data basically support the conclusions that were drawn from Table 1, with two exceptions: (1) for the HEAR model, adding PMF to CD-PLF turns out to yield a significant improvement now, and (2) for the LARYNX model, the combination PMF + PLF is not significantly worse than PMF-tri + PLF.

Scatter plots of the computed versus the perceptual intelligibility scores emerging from the former (PMF-tri + PLF) and the new (PLF + CD-PLF) dysarthria model are shown in Figure 4.

In [13] we already compared results obtained with our former system to results reported by Riedhammer et al. [9] for a system also comprising two state-of-the-art ASR systems. Although a direct comparison is difficult to make, it appears that our results emerging from an evaluation on a diverse speaker set are very comparable to those reported in [9], even though the latter emerged from an evaluation on a narrower set of speakers (either tracheo-oesophageal speakers or speakers with cancer of the oral cavity).

7. Conclusions and Future Work

In our previous work [13], we showed that an alignment-based method combining two ASR systems can yield good correlations between subjective (human) and objective (computed) intelligibility scores. For a general model, we obtained Pearson correlations of about 0.86. For a dysarthria-specific model these correlations were as large as 0.94. In the present paper we have shown that, by introducing context-dependent phonological features, it is possible to achieve equal to higher accuracies by means of a system comprising only one ASR, which works on phonological features that were extracted from the waveform by a set of neural networks.

Now that we have an intelligibility score which is described in terms of features that refer to articulatory dimensions, we can start to think of extracting more detailed information that can reveal the underlying articulatory problems of a tested speaker.

In terms of technology, we still need to conceive more robust speaker feature selection procedures. We must also examine whether an alignment model remains a viable model for the analysis of severely disordered speech. Finally, we believe that there exist more efficient ways of using the new context-dependent phonological features than the one adopted in this paper (e.g., clustering of contexts, better dealing with effects of phone omissions). Finding such ways should result in further improvements of the intelligibility predictions.

Acknowledgment

This work was supported by the Flemish Institute for the Promotion of Innovation by Science and Technology in Flanders (IWT) (contract SBO/40102).

References

[1] R. D. Kent, Ed., Intelligibility in Speech Disorders: Theory, Measurement, and Management, John Benjamins, Philadelphia, Pa, USA, 1992.

[2] R. D. Kent, G. Weismer, J. F. Kent, and J. C. Rosenbek, "Toward phonetic intelligibility testing in dysarthria," Journal of Speech and Hearing Disorders, vol. 54, no. 4, pp. 482–499, 1989.

[3] R. D. Kent, "The perceptual sensorimotor examination for motor speech disorders," in Clinical Management of Sensorimotor Speech Disorders, pp. 27–47, Thieme Medical, New York, NY, USA, 1997.

[4] M. De Bodt, C. Guns, and G. V. Nuffelen, NSVO: Nederlandstalig SpraakVerstaanbaarheidsOnderzoek, Vlaamse Vereniging voor Logopedisten, Herentals, Belgium, 2006.

[5] J. Carmichael and P. Green, "Revisiting dysarthria assessment intelligibility metrics," in Proceedings of the 8th International Conference on Spoken Language Processing (ICSLP '04), pp. 742–745, Jeju, South Korea, October 2004.

[6] J.-P. Hosom, L. Shriberg, and J. R. Green, "Diagnostic assessment of childhood apraxia of speech using automatic speech recognition (ASR) methods," Journal of Medical Speech-Language Pathology, vol. 12, no. 4, pp. 167–171, 2004.

[7] H.-Y. Su, C.-H. Wu, and P.-J. Tsai, "Automatic assessment of articulation disorders using confident unit-based model adaptation," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '08), pp. 4513–4516, Las Vegas, Nev, USA, March 2008.

[8] P. Vijayalakshmi, M. R. Reddy, and D. O'Shaughnessy, "Assessment of articulatory sub-systems of dysarthric speech using an isolated-style phoneme recognition system," in Proceedings of the 9th International Conference on Spoken Language Processing (Interspeech '06), vol. 2, pp. 981–984, Pittsburgh, Pa, USA, September 2006.

[9] K. Riedhammer, G. Stemmer, T. Haderlein, et al., "Towards robust automatic evaluation of pathologic telephone speech," in Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU '07), pp. 717–722, Kyoto, Japan, December 2007.

[10] A. Maier, M. Schuster, A. Batliner, E. Nöth, and E. Nkenke, "Automatic scoring of the intelligibility in patients with cancer of the oral cavity," in Proceedings of the 8th Annual Conference of the International Speech Communication Association (Interspeech '07), vol. 1, pp. 1206–1209, Antwerp, Belgium, August 2007.

[11] R. Likert, "A technique for the measurement of attitudes," Archives of Psychology, vol. 22, no. 140, pp. 1–55, 1932.

[12] K. Demuynck, J. Roelens, D. V. Compernolle, and P. Wambacq, "Spraak: an open source speech recognition and automatic annotation kit," in Proceedings of the 9th International Conference on Spoken Language Processing (Interspeech '08), pp. 495–496, Brisbane, Australia, September 2008.

[13] C. Middag, G. Van Nuffelen, J. P. Martens, and M. De Bodt, "Objective intelligibility assessment of pathological speakers," in Proceedings of the International Conference on Spoken Language Processing (Interspeech '08), pp. 1745–1748, Brisbane, Australia, September 2008.

[14] G. Van Nuffelen, C. Middag, M. De Bodt, and J. P. Martens, "Speech technology-based assessment of phoneme intelligibility in dysarthria," International Journal of Language and Communication Disorders. In press.

[15] G. Van Nuffelen, M. De Bodt, C. Guns, F. Wuyts, and P. Van de Heyning, "Reliability and clinical relevance of segmental analysis based on intelligibility assessment," Folia Phoniatrica et Logopaedica, vol. 60, no. 5, pp. 264–268, 2008.

[16] S. B. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, no. 4, pp. 357–366, 1980.

[17] F. Stouten and J.-P. Martens, "On the use of phonological features for pronunciation scoring," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '06), vol. 1, pp. 329–332, Toulouse, France, May 2006.

[18] J. Garofolo, L. Lamel, W. Fisher, J. Fiscus, D. Pallett, and N. Dahlgren, "DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM," Tech. Rep. NISTIR 4930, National Institute of Standards and Technology, Gaithersburg, Md, USA, 1993.

[19] D. J. Sheskin, Handbook of Parametric and Nonparametric Statistical Procedures, CRC Press, Boca Raton, Fla, USA, 2004.


Hindawi Publishing Corporation, EURASIP Journal on Advances in Signal Processing, Volume 2009, Article ID 308340, 14 pages, doi:10.1155/2009/308340

Research Article

Modelling Errors in Automatic Speech Recognition for Dysarthric Speakers

Santiago Omar Caballero Morales and Stephen J. Cox

Speech, Language, and Music Group, School of Computing Sciences, University of East Anglia, Norwich NR4 7TJ, UK

Correspondence should be addressed to Santiago Omar Caballero Morales, [email protected]

Received 3 November 2008; Revised 27 January 2009; Accepted 24 March 2009

Recommended by Juan I. Godino-Llorente

Dysarthria is a motor speech disorder characterized by weakness, paralysis, or poor coordination of the muscles responsible for speech. Although automatic speech recognition (ASR) systems have been developed for disordered speech, factors such as low intelligibility and limited phonemic repertoire decrease speech recognition accuracy, making conventional speaker adaptation algorithms perform poorly on dysarthric speakers. In this work, rather than adapting the acoustic models, we model the errors made by the speaker and attempt to correct them. For this task, two techniques have been developed: (1) a set of "metamodels" that incorporate a model of the speaker's phonetic confusion matrix into the ASR process; (2) a cascade of weighted finite-state transducers at the confusion matrix, word, and language levels. Both techniques attempt to correct the errors made at the phonetic level and make use of a language model to find the best estimate of the correct word sequence. Our experiments show that both techniques outperform standard adaptation techniques.

Copyright © 2009 S. O. Caballero Morales and S. J. Cox. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

“Dysarthria is a motor speech disorder that is often associated with irregular phonation and amplitude, incoordination of articulators, and restricted movement of articulators” [1]. This condition can be caused by a stroke, cerebral palsy, traumatic brain injury (TBI), or a degenerative neurological disease such as Parkinson's disease or Alzheimer's disease. The muscles affected by this condition may include the lungs, larynx, oropharynx and nasopharynx, soft palate, and articulators (lips, tongue, teeth, and jaw), and the degree to which these muscle groups are compromised determines the particular pattern of speech impairment [1].

Based on the presentation of symptoms, dysarthria is classified as flaccid, spastic, mixed spastic-flaccid, ataxic, hyperkinetic, and hypokinetic [2–4]. In all types of dysarthria, phonatory dysfunction is a frequent impairment and is difficult to assess because it often occurs along with other impairments affecting articulation, resonance, and respiration [2–6]. In particular, six impairment features are related to phonatory dysfunction, reducing the speaker's intelligibility and altering the naturalness of his/her speech [4, 7, 8].

(i) Monopitch: in all types of dysarthria.

(ii) Pitch level: in spastic and mixed spastic-flaccid.

(iii) Harsh voice: in all types of dysarthria.

(iv) Breathy voice: in flaccid and hypokinetic.

(v) Strained-strangled: in spastic and hyperkinetic.

(vi) Audible inspiration: in flaccid.

These features make the task of developing assistive Automatic Speech Recognition (ASR) systems for people with dysarthria very challenging. As a consequence of phonatory dysfunction, dysarthric speech is typically characterized by strained phonation, imprecise placement of the articulators, and incomplete consonant closure. Intelligibility is affected when there is reduction or deletion of word-initial consonants [9]. Because of these articulatory deficits, the pronunciation of dysarthric speakers often deviates from that of nondysarthric speakers in several aspects: the rate of speech is lower; segments are pronounced differently; pronunciation is less consistent; for longer stretches of speech, pronunciation can be even more varying due to fatigue [10]. Speaking rate, which is important for ASR performance, is affected by slow pronunciation that produces prolonged phonemes. This can cause a one-syllable word to be interpreted as a two-syllable word (day → dial), and words with long voiceless stops can be interpreted as two words because of the long silent occlusion phase in the middle of the target word (before → be for) [11].

The design of ASR systems for dysarthric speakers is difficult because they require different types of ASR depending on their particular type and level of disability [1]. Additionally, phonatory dysfunction and related impairments cause dysarthric speech to be characterized by phonetic distortions, substitutions, and omissions [12, 13] that decrease the speaker's intelligibility [1] and thus ASR performance. However, it is important to develop ASR systems for dysarthric speakers because of the advantages they offer when compared with interfaces such as switches or keyboards. These may be more physically demanding and tiring [14–17], and since dysarthria is usually accompanied by other physical handicaps, they may be impossible for these users to operate. Even with the speech production difficulties exhibited by many of these speakers, speech communication requires less effort and is faster than conventional typing methods [18], despite the difficulty of achieving robust recognition performance.

Experiments with commercial ASR systems have shown levels of recognition accuracy up to 90% for some dysarthric speakers with high intelligibility after a certain number of tests, although speakers with lower intelligibility did not achieve comparable levels of recognition accuracy [11, 19–22]. Most of the speakers involved in these studies presented individual error patterns, and variability in recognition rates was observed between test sessions and when trying different ASR systems. Usually these commercial systems require some speech samples from the speaker to adapt to his/her voice and thus increase recognition performance. However, a system that is trained on a normal speech corpus is not expected to work well on severely dysarthric speech, as adaptation techniques are insufficient to deal with gross abnormalities [16]. Moreover, it has been reported that recognition performance on such systems rapidly deteriorates for vocabulary sizes greater than 30 words, even for speakers with mild to moderate dysarthria [23].

Thus, research has concentrated on techniques to achieve more robust ASR performance. In [22], a system based on Artificial Neural Networks (ANNs) produced better results when compared with a commercial system, and outperformed the recognition of human listeners. In [10], the performance of HMM-based speaker-dependent (SD) and speaker-independent (SI) systems on dysarthric speech was evaluated. SI systems are trained on nondysarthric speech (as the commercial systems above) and SD systems are trained on a limited amount of speech of the dysarthric speaker. The performance of the SD system was better than the SI's, and the word error rates (WERs) obtained showed that ASR of dysarthric speech is certainly possible for low-perplexity tasks (with a highly constrained bigram language model).

The Center for Spoken Language Understanding [1] improved vowel intelligibility by the manipulation of a small set of highly relevant speech features. Although they limited themselves to studying consonant-vowel-consonant (CVC) contexts from a special-purpose database, they significantly improved the intelligibility of dysarthric vowels from 48% to 54%, as evaluated by a vowel identification task using 64 CVC stimuli judged by 24 listeners. The ENABL project ("ENabler for Access to computer-Based vocational tasks with Language and speech") [24, 25] was developed to provide access by voice, via speech recognition, to an engineering design system, ICAD. The baseline recognition engine was trained on nondysarthric speech (speaker-independent), and it was adapted to dysarthric speech using MLLR (Maximum Likelihood Linear Regression, see Section 2) [26]. This reduced the action error rate of the ICAD from 24.1% to 8.3%. However, these results varied from speaker to speaker, and for some speakers the improvement was substantially greater than for others.

The STARDUST project (Speech Training And Recognition for Dysarthric Users of Speech Technology) [16, 27–29] has developed speech technology for people with severe dysarthria. Among the applications developed, an ECS (Environmental Control System) was designed for home control with a small-vocabulary speaker-dependent recognizer (10 word commands). The methodology for building the recognizer was adapted to deal with the scarcity of training data and the increased variability of the material which was available. This problem was addressed by closing the loop between recognizer training and user training. They started by recording a small amount of speech data from the speaker, then they trained a recognizer using that data, and later used it to drive a user-training application, which allowed the speaker to practice to improve consistency of articulation. The speech-controlled ECS was faster to use than switch-scanning systems. Other applications from STARDUST are the following.

(i) STRAPTk (Speech Training Application Toolkit) [29], a system that integrates tools for speech analysis, exercise tasks, design, and evaluation of recognizers.

(ii) VIVOCA (Voice Input Voice Output Communication Aid) [30], which is aimed at developing a portable speech-in/speech-out communication aid for people with disordered or unintelligible speech. Another tool, the "Speech Enhancer" from Voicewave Technology Inc. [31], improves speech communication in real time for people with unclear speech and inaudible voice [32]. While VIVOCA recognizes disordered speech and resynthesises it in a normal voice, the Speech Enhancer does not recognize or correct speech distortions due to dysarthria.

A project at the University of Illinois aims to provide (1) a freely distributable multimicrophone, multicamera audiovisual database of dysarthric speech [33], and (2) programs and training scripts that could form the foundation for an open-source speech recognition tool designed to be useful for dysarthric speakers. At the University of Delaware, research has been done by the Speech Research Lab [34] to develop natural-sounding software for speech synthesis (ModelTalker) [35], tools for articulation training for children (STAR), and a database of dysarthric speech [36].


As already mentioned, commercial "dictation" ASR systems have shown good performance for people with mild to moderate dysarthria [20, 21, 37], although these systems fail for speakers with more severe conditions [11, 22]. Variability in recognition accuracy, the speaker's inability to access the system by him/herself, restricted vocabulary, and continuous assistance and editing of words were evident in these studies. Although isolated-word recognizers have performed better than continuous speech recognizers, these are limited by their small vocabulary (10–78 possible words or commands), making them only suitable for "control" applications. For communication purposes, a continuous speech recognizer can be more suitable, and studies have shown that under some conditions a continuous system can perform better than a discrete system [37].

The motivation of our research is to develop techniques that could lead to the development of large-vocabulary ASR systems for speakers with different types of dysarthria, particularly when the amount of speech data available for adaptation or training is small. In this paper, we describe two techniques that incorporate a model of the speaker's pattern of errors into the ASR process in such a way as to increase word recognition accuracy. Although these techniques have general application to ASR, we believe that they are particularly suitable for use in ASR for dysarthric speakers who have low intelligibility due, in some degree, to a limited phonemic repertoire [13], and the results presented here confirm this.

We continue in Section 1.1 by showing the pattern of errors caused in ASR by the effects of a limited phonemic repertoire, and thus expand on the effect of phonatory dysfunction on dysarthric speech. The description of our research starts in Section 2 with the details of the baseline system used for our experiments, the adaptation technique used for comparison, the database of dysarthric speech, and some initial word recognition experiments. In Section 3 the approach of incorporating information from the speaker's pattern of errors into the recognition process is explained. In Section 4 we present the first technique ("metamodels"), and in Section 5, results on word recognition accuracy when it is applied to dysarthric speech. Section 6 comments on the technique and motivates the introduction of a second technique in Section 7, which is based on a network of Weighted Finite-State Transducers (WFSTs). The results of this technique are presented in Section 8. Finally, conclusions and future work are presented in Section 9.

1.1. Limited Phonemic Repertoire. Among the identified factors that give rise to ASR errors in dysarthric speech [13], the most important are decreased intelligibility (because of substitutions, deletions, and insertions of phonemes) and limited phonemic repertoire, the latter leading to phoneme substitutions. To illustrate the effect of reduced phonemic repertoire, Figure 1 shows an example phoneme confusion matrix for a dysarthric speaker from the NEMOURS Database of Dysarthric Speech (described in Section 2).

Figure 1: Phoneme confusion matrix from a dysarthric speaker (stimulus phonemes versus response phonemes).

Figure 2: Phoneme confusion matrix from a normal speaker (stimulus phonemes versus response phonemes).

This confusion matrix is estimated by a speaker-independent ASR system, and so it may show confusions that would not actually be made by humans, and also spurious confusions that are actually caused by poor transcription/output alignment (see Section 4.1). However, since we are concerned with machine rather than human recognition here, we can make the following observations.

(i) A small set of phonemes (in this case the phonemes /ua/, /uw/, /m/, /n/, /ng/, /r/, and /sil/) dominates the speaker's output speech.

(ii) Some vowel sounds, and the consonants /g/, /zh/, and /y/, are never recognized correctly. This suggests that there are some phonemes that the speaker apparently cannot enunciate at all, and for which he or she substitutes a different phoneme, often one of the dominant phonemes mentioned above.

These observations differ from the pattern of confusions seen in a normal speaker from the Wall Street Journal (WSJ) database [38], as shown in Figure 2. This confusion matrix shows a clearer pattern of correct recognitions and few confusions of vowels with consonants.


Most speaker adaptation algorithms are based on the principle that it is possible to apply a set of transformations to the parameters of a set of acoustic models of an "average" voice to move them closer to the voice of an individual. Whilst this has been shown to be successful for normal speakers, it may be less successful in cases where the phoneme uttered is not the one that was intended but is substituted by a different phoneme or phonemes, as often happens in dysarthric speech. In this situation, we argue that a more effective approach is to combine a model of the substitutions likely to have been made by the speaker with a language model to infer what was said. So rather than attempting to adapt the system, we model the insertion, deletion, and substitution errors made by a speaker and attempt to correct them.

2. Speech Data, Baseline Recognizer, and Adaptation Technique

Our speaker-independent (SI) speech recognizer was built with the HTK Toolkit [39] using the data from 92 speakers in set si_tr of the Wall Street Journal (WSJ) database [38]. A Hamming window of 25 milliseconds moving at a frame rate of 10 milliseconds was applied to the waveform data to convert it to 12 MFCCs (using 26 filterbanks), and energy, delta, and acceleration coefficients were added. The resulting data was used to construct 45 monophone acoustic models. The monophone models had a standard three-state left-to-right topology with eight mixture components per state. They were trained using standard maximum-likelihood techniques, using the routines provided in HTK.
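As an illustration of the delta and acceleration coefficients appended to the 13 static coefficients (giving 39-dimensional observation vectors), a simple regression-based sketch is shown below; the ±2-frame window is an assumption, since the exact HTK configuration parameters are not reproduced here.

    import numpy as np

    def deltas(features, window=2):
        # Regression-style delta coefficients: for each frame, a least-squares
        # slope estimated over +/- `window` neighbouring frames (edge-padded).
        T, D = features.shape
        padded = np.pad(features, ((window, window), (0, 0)), mode="edge")
        denom = 2 * sum(th * th for th in range(1, window + 1))
        return np.stack([sum(th * (padded[t + window + th] - padded[t + window - th])
                             for th in range(1, window + 1)) / denom
                         for t in range(T)])

    # 12 MFCCs + log-energy per 10 ms frame give 13 static coefficients; appending
    # deltas and accelerations yields the 39-dimensional observation vectors:
    #   static = ...                                # (T, 13) array from the front-end
    #   d = deltas(static); a = deltas(d)
    #   observations = np.hstack([static, d, a])    # (T, 39)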

The dysarthric speech data was provided by the NEMOURS database [36]. This database is a collection of 814 short sentences spoken by 11 speakers (74 sentences per speaker) with varying degrees of dysarthria (data from only 10 speakers was used, as some data is missing for one speaker). The sentences are nonsense phrases that have a simple syntax of the form "the X is Y the Z", where X and Z are usually nouns and Y is a verb in present participle form (for instance, the phrases "The shin is going the who", "The inn is heaping the shin", etc.). Note that although each of the 740 sentences is different, the vocabulary of 112 words is shared.

Speech recognition experiments were implemented by using the baseline recognizer on the dysarthric speech. For these experiments, a word-bigram language model was estimated from the (pooled) 74 sentences provided by each speaker.
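A word-bigram language model of this kind could be estimated along the following lines; the add-k smoothing is our own simplification and not necessarily the discounting scheme applied in the experiments. `sentences` is assumed to be a list of word lists pooled over the 74 sentences.

    from collections import defaultdict

    def bigram_lm(sentences, smoothing=0.5):
        # Word-bigram probabilities P(w2 | w1) over the closed 112-word vocabulary,
        # with sentence-boundary markers and simple add-k smoothing.
        vocab = {w for s in sentences for w in s} | {"<s>", "</s>"}
        counts = defaultdict(lambda: defaultdict(float))
        for s in sentences:
            for w1, w2 in zip(["<s>"] + s, s + ["</s>"]):
                counts[w1][w2] += 1
        V = len(vocab)
        return {w1: {w2: (counts[w1][w2] + smoothing) /
                         (sum(counts[w1].values()) + smoothing * V)
                     for w2 in vocab}
                for w1 in vocab}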

The technique used for the speaker adaptation experiments was MLLR (Maximum Likelihood Linear Regression) [26]. A two-pass MLLR adaptation was implemented as described in [39], where a global adaptation is done first by using only one class. This produces a global input transformation that can be used to define more specific transforms to better adapt the baseline system to the speaker's voice. Dynamic adaptation is then implemented by using a regression class tree with 32 terminal nodes or base classes.

Figure 3: Comparison of recognition performance (% correct words) for each speaker: human assessment (FDA), unadapted (BASE), and adapted (MLLR 16) SI models.

From the complete set of 74 sentences per speaker, 34 sentences were used for adaptation and the remaining 40 for testing. The set of 34 was divided into subsets to measure the performance of the adapted baseline system when using different amounts of adaptation data. Thus adaptation was implemented using 4, 10, 16, 22, 28, and 34 sentences. For future reference, the baseline system adapted with X sentences will be termed MLLR X and the baseline without any adaptation BASE.

Table 1 shows the number of MLLR transform classes (XFORMS) for the 10 dysarthric speakers used in these experiments using different amounts of adaptation data. For comparison purposes, Table 2 shows the same for ten speakers selected randomly from the si_dt set of the WSJ database using similar sets of adaptation data. In both cases, the number of transforms increases as more data is available. The mean number of transforms (Mean XFORMS) is similar for both sets of speakers, but the standard deviation (STDEV) is higher for the dysarthric speakers. This shows that within dysarthric speakers there are more differences and variability than within normal speakers, which may be caused by individual patterns of phonatory dysfunction.

An experiment was done to compare the performance of the baseline and MLLR-adapted recognizer (using 16 utterances for adaptation) with a human assessment of the dysarthric speakers used in this study. Recognition was performed with a grammar scale factor and word insertion penalty as described in [39].

Figure 3 shows the intelligibility of each of the dysarthric speakers as measured using the Frenchay Dysarthria Assessment (FDA) test in [36], and the recognition performance (% words correct) when tested on the unadapted baseline system (BASE) and the adapted models (MLLR 16). The correlation between the FDA performance and the recognizer performance is 0.67 (unadapted models) and 0.82 (adapted). Both are significant at the 1% level, which gives some confidence that the recognizer displays a similar performance trend to humans when exposed to different degrees of dysarthric speech.


Table 1: MLLR transforms for dysarthric speakers.

Adaptation data   BB  BK  BV  FB  JF  LL  MH  RK  RL  SC   Mean XFORMS   STDEV
 4                 0   4   1   2   2   1   1   1   5   3        2         1.6
10                 3  10   5   4   5   4   4   3   8   7        5         2.3
16                 5  11   6   7   7   5   5   5  11   9        7         2.4
22                 7  11   7   9  10   9   8   6  11  11        9         1.9
28                 9  11   9   9  10  10  10   8  11  12       10         1.2
34                10  11  10   9  11  11  10   9  11  12       10         1.0

Table 2: MLLR transforms for normal speakers.

Adaptation data  C31 C34 C35 C38 C3C C40 C41 C42 C45 C49   Mean XFORMS   STDEV
 5                 5   4   6   5   3   3   5   5   4   3        4         1.1
10                 8   7   8   6   7   8   6   6   7   6        7         0.9
15                11  10   9   9   9  11   9   8  10   9       10         1.0
20                12  12  12  11  10  12  11   9  12  11       11         1.0
30                13  13  13  12  11  13  13  11  13  12       12         0.8

3. Incorporating a Model of the Confusion Matrix into the Recognizer

We suppose that a dysarthric speaker wishes to utter a word sequence W that can be transcribed as a phone sequence P. In practice, he or she utters a different phone sequence P̂. Hence the probability of the acoustic observations O produced by the speaker given W can be written as

\[
\Pr(O \mid W) = \Pr(O \mid P) = \sum_{\hat{P}} \Pr(O \mid P, \hat{P}) \Pr(\hat{P} \mid P).
\tag{1}
\]

However, once P̂ is known, there is no dependence of O on P, so we can write

\[
\Pr(O \mid W) = \sum_{\hat{P}} \Pr(O \mid \hat{P}) \Pr(\hat{P} \mid P).
\tag{2}
\]

Hence the probability of a particular word sequence W* with associated phone sequence P* is

\[
\Pr(P^* \mid O) = \frac{\Pr(O \mid P^*) \Pr(P^*)}{\Pr(O)}
\tag{3}
\]
\[
= \frac{\Pr(P^*) \sum_{\hat{P}} \Pr(O \mid \hat{P}) \Pr(\hat{P} \mid P^*)}{\Pr(O)}.
\tag{4}
\]

In the usual way, we can drop the denominator of (4), as it is common to all W sequences. Furthermore, we can approximate

\[
\sum_{\hat{P}} \Pr(O \mid \hat{P}) \Pr(\hat{P} \mid P) \approx \max_{\hat{P}} \Pr(O \mid \hat{P}) \Pr(\hat{P} \mid P),
\tag{5}
\]

which will be approximately correct when a single phone sequence dominates. The observed phone sequence from the dysarthric speaker, P̂*, is obtained as

\[
\hat{P}^* = \arg\max_{\hat{P}} \Pr(O \mid \hat{P})
\tag{6}
\]

from a phone recognizer, which also provides the term Pr(O | P̂*). Hence the most likely phone sequence is given as

\[
P^* = \arg\max_{P} \Pr(P) \Pr(O \mid \hat{P}^*) \Pr(\hat{P}^* \mid P),
\tag{7}
\]

where it is understood that P ranges over all valid phone sequences defined by the dictionary and the language model. If we now make the assumption of conditional independence of the individual phones in the sequences P and P̂*, we can write

\[
W^* = \arg\max_{P} \prod_{j} \Pr(p_j) \Pr(\hat{p}^*_j \mid p_j),
\tag{8}
\]

where pj is the jth phoneme in the postulated phone sequence P, and p̂*j the jth phoneme in the decoded sequence P̂* from the dysarthric speaker. Equation (8) indicates that the most likely word sequence is the sequence that is most likely given the observed phone sequence from the dysarthric speaker. The term Pr(p̂*j | pj) is obtained from a confusion matrix for the speaker.
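The rescoring implied by (8) can be sketched as follows, assuming each candidate word hypothesis is represented by its canonical phone sequence and has already been aligned one-to-one with the decoded sequence; the probability floor and the dictionary-style containers (`confusion`, `phone_prior`) are illustrative assumptions.

    import math

    def sequence_score(candidate, decoded, confusion, phone_prior, floor=1e-6):
        # Log-score of a candidate canonical phone sequence P given the decoded
        # sequence, following (8): sum_j log Pr(p_j) + log Pr(p_hat_j | p_j).
        score = 0.0
        for p, p_hat in zip(candidate, decoded):
            score += math.log(max(phone_prior[p], floor))
            score += math.log(max(confusion[p].get(p_hat, 0.0), floor))
        return score

    def best_hypothesis(hypotheses, decoded, confusion, phone_prior):
        # Pick, among the candidate phone sequences, the one that best explains
        # the decoded phones (the argmax of (8)).
        return max(hypotheses, key=lambda p_seq:
                   sequence_score(p_seq, decoded, confusion, phone_prior))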

The overall procedure for incorporating the estimates of Pr(p̂*j | pj) into the recognition process is presented in Figure 4. A set of training sentences (as described in Section 2) is used to estimate Pr(p̂*j | pj) and to identify patterns of deletions/insertions of phonemes. This information is modelled by our two techniques, which will be presented in Sections 4 and 7. Evaluation is performed when P̂* (which is now obtained from test speech) is decoded by the "trained" techniques into sequences of words W*. The correction process is done at the phonemic level, and by incorporating a word language model a more accurate estimate of W is obtained.

Figure 4: Diagram of the correction process. Training speech (sets of 4, 10, ..., 34 utterances) is passed through the baseline recogniser for confusion-matrix estimation, and the resulting estimates of Pr(p̂*j | pj) are used to train the error modelling techniques (metamodels or WFSTs); at test time, the decoded sequence P̂* from the baseline recogniser is converted into a word sequence W* by these models together with the word language model.

Table 3: Upper pair: alignment of transcription and recognized output using HResults; Lower pair: same, using the improved aligner.

P: dh ax sh uw ih z b ea r ih ng dh ax b ey dh

P∗: dh ax ng dh ax y ua ng dh ax b l ih ng dh ax b uw

P: dh ax sh uw ih z b ea r ih ng dh ax b ey dh

P∗: dh ax ng dh ax y ua ng dh ax b l ih ng dh ax b uw

Figure 5: Metamodel of a phoneme (states 0–4, with transition probabilities a01, a02, a11, a12, a23, a24, a33, a34).

4. First Technique: Metamodels

In practice, it is too restrictive to use only the confusion matrix to model Pr(p̂*j | pj), as this cannot model insertions well. Instead, a hidden Markov model (HMM) is constructed for each of the phonemes in the phoneme inventory. We term these HMMs metamodels [40]. The function of a metamodel is best understood by comparison with a "standard" acoustic HMM: a standard acoustic HMM estimates Pr(O′ | pj), where O′ is a subsequence of the complete sequence of observed acoustic vectors in the utterance, O, and pj is a postulated phoneme in P. A metamodel estimates Pr(P̂′ | pj), where P̂′ is a subsequence of the complete sequence of observed (decoded) phonemes in the utterance P̂.

The architecture of the metamodel of a phoneme is shown in Figure 5. Each state of a metamodel has a discrete probability distribution over the symbols for the set of phonemes, plus an additional symbol labelled DELETION. The central state (2) of a metamodel for a certain phoneme models correct decodings, substitutions, and deletions of this phoneme made by the phone recognizer. States 1 and 3 model (possibly multiple) insertions before and after the phoneme. If the metamodel were used as a generator, the output phone sequence produced could consist of, for example,

(i) a single phone which has the same label as the metamodel (a correct decoding) or a different label (a substitution);

(ii) a single phone labelled DELETION (a deletion);

(iii) two or more phones (one or more insertions).

As an example of the operation of a metamodel, consider a hypothetical phoneme that is always decoded correctly, without substitutions, deletions, or insertions. In this case, the discrete distribution associated with the central state would consist of zeros except for the probability associated with the symbol for the phoneme itself, which would be 1.0. In addition, the transition probabilities a02 and a24 would be set to 1.0 so that no insertions could be made. When used as a generator, this model can produce only one possible phoneme sequence: a single phoneme which has the same label as the metamodel.
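For this degenerate case, the metamodel parameters could be written down explicitly as in the sketch below; the dictionary layout is ours and does not correspond to any particular toolkit's model format.

    def perfect_metamodel(phone, symbols):
        # Metamodel parameters for a phone that is always decoded correctly:
        # no insertions (a02 = a24 = 1.0) and a central-state distribution that
        # puts all its mass on the phone's own symbol.  `symbols` is the phone
        # inventory plus the extra 'DELETION' symbol.
        point_mass = {s: (1.0 if s == phone else 0.0) for s in symbols}
        return {
            # (from_state, to_state): probability, i.e. a01, a02, a11, a12,
            # a23, a24, a33, a34 of Figure 5.
            "transitions": {(0, 1): 0.0, (0, 2): 1.0,
                            (1, 1): 0.0, (1, 2): 1.0,
                            (2, 3): 0.0, (2, 4): 1.0,
                            (3, 3): 0.0, (3, 4): 1.0},
            # States 1 and 3 are never entered here (a01 = a23 = 0), so their
            # insertion distributions are immaterial in this degenerate case.
            "emissions": {1: dict(point_mass), 2: dict(point_mass),
                          3: dict(point_mass)},
        }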

We use the reference transcription P of a training-set utterance to enable us to concatenate the appropriate sequence of phoneme metamodels for this utterance. The associated recognition output sequence P̂* for the utterance is obtained from the phoneme transcription of the word sequences decoded by a speech recognizer and is used to train the parameters of the metamodels in this utterance. Note that the speech recognizer itself can be built using unadapted or MLLR-adapted phoneme models. By using embedded reestimation over the {P, P̂*} pairs of all the utterances, we can train the complete set of metamodels. In practice, the parameters formed, especially the probability distributions, are sensitive to the initial values to which they are set, and it is essential to "seed" the probabilities of the distributions using data obtained from an accurate alignment of P and P̂* for each training-set sentence. After the initial seeding is complete, the parameters of the metamodels are reestimated using embedded reestimation as described above. Before recognition, the language model is used to compile a "metarecognizer" network, which is identical to the network used in a standard word recognizer except that the nodes of the network are the appropriate metamodels rather than the acoustic models used by the word recognizer. At recognition time, the output phoneme sequence P̂* is passed to the metarecognizer to produce a set of word hypotheses.

4.1. Improving Alignment for Confusion Matrix Estimation. Use of a standard dynamic programming (DP) tool to align two symbol strings (such as the one available in the HResults routine in the HTK package [39]) can lead to unsatisfactory results when a precise alignment is required between P and P∗ to estimate a confusion matrix, as is the case here. This is because these alignment tools typically use a distance measure which is "0" if a pair of symbols are the same, "1" otherwise. In the case of HResults, a correct match has a score of "0", an insertion and a deletion carry a score of "7", and a substitution a score of "10" [39]. To illustrate this, consider the top alignment in Table 3, which was made using HResults. It is not a plausible alignment, because

(i) the first three phones in the recognized output are unaligned and so must be regarded as insertions;

(ii) the fricative /sh/ in the transcription has been aligned to the vocalic /y/;

(iii) the sequence /b ea/ in the transcription has been aligned to the sequence /ax b/.

In the lower alignment in Table 3, these problems have been rectified, and a more plausible alignment results. This alignment was made using a DP matching algorithm in which the distance D(p∗j, pj) between a phone in the reference transcription P and a phone in the recognition output P∗ considers a similitude score given by the empirically derived expression:

Sim(p∗j, pj) = 5 PrSI(q∗j | qj) − 2,   (9)

where PrSI(q∗j | qj) is a speaker-independent confusion matrix pooled over 92 WSJ speakers and is estimated by a DP algorithm that uses a simple aligner (e.g., HResults). Hence, a pair of phonemes that were always confused is assigned a score of +3, and a pair that is never confused is assigned a score of −2.

[Figure 6: Mean word recognition accuracy of the adapted models and the metamodels across all dysarthric speakers. Word accuracy (%, 50–64) versus the number of sentences used for MLLR adaptation and metamodel training (4–34); series: MLLR, and metamodels on MLLR.]

The effect of this is that the DP algorithm prefers to align phoneme pairs that are more likely to be confused.
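A minimal sketch of such an aligner is given below: a standard dynamic-programming string alignment whose substitution score is Sim(p∗j, pj) = 5 PrSI(q∗j | qj) − 2, as in equation (9). The gap score of −1 used for insertions and deletions is an assumption for illustration, not a value taken from the paper.

    def align(ref, hyp, pr_si, gap=-1.0):
        """DP alignment of a reference and a decoded phone string that maximises
        the total similitude score Sim = 5 * Pr_SI(hyp_phone | ref_phone) - 2."""
        def sim(h, r):
            return 5.0 * pr_si.get((h, r), 0.0) - 2.0

        n, m = len(ref), len(hyp)
        score = [[0.0] * (m + 1) for _ in range(n + 1)]   # score[i][j]: ref[:i] vs hyp[:j]
        back = [[None] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            score[i][0], back[i][0] = i * gap, "del"
        for j in range(1, m + 1):
            score[0][j], back[0][j] = j * gap, "ins"
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                options = [
                    (score[i - 1][j - 1] + sim(hyp[j - 1], ref[i - 1]), "sub"),
                    (score[i - 1][j] + gap, "del"),     # reference phone deleted
                    (score[i][j - 1] + gap, "ins"),     # decoded phone inserted
                ]
                score[i][j], back[i][j] = max(options)
        pairs, i, j = [], n, m                          # trace back the aligned pairs
        while i > 0 or j > 0:
            move = back[i][j]
            if move == "sub":
                pairs.append((ref[i - 1], hyp[j - 1])); i, j = i - 1, j - 1
            elif move == "del":
                pairs.append((ref[i - 1], None)); i -= 1
            else:
                pairs.append((None, hyp[j - 1])); j -= 1
        return score[n][m], list(reversed(pairs))

    # toy usage with a tiny speaker-independent confusion table Pr(decoded | reference)
    pr_si = {("b", "b"): 0.8, ("p", "b"): 0.2, ("ih", "ih"): 0.9}
    print(align(["b", "ih", "t"], ["p", "ih"], pr_si))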

5. Results of the Metamodels on Dysarthric Speakers

Figure 6 shows the results of the metamodels on the phoneme strings from the MLLR-adapted acoustic models. When a very small set of sentences, for example four, is used for training of the metamodels, it is possible to get an improvement of approximately 1.5% over the MLLR-adapted models. This gain in accuracy increases as the training/adaptation data is increased, reaching an improvement of almost 3% when all 34 sentences are used. The matched-pairs test described in [41] was used to test for significant differences between the recognition accuracy using metamodels and the accuracy obtained with MLLR adaptation when a certain number of sentences were available for metamodel training. The results with the associated P-values are presented in Table 4. In all cases, the metamodels improve on MLLR adaptation, with P-values below .01. Note that the metamodels trained with only four sentences (META 04) decrease the number of word errors from 1174 (MLLR 04) to 1139.

5.1. Low and High Intelligibility Speakers. Low intelligibility speakers were classified as those with low recognition performance using the unadapted and adapted models. As shown in Figure 3, automatic recognition followed a similar trend to human recognition (as scored by the FDA intelligibility test). So in the absence of a human assessment test, it is reasonable to classify a speaker's intelligibility based on their automatic recognition performance.

The set of speakers was divided into two equal-sized groups: high intelligibility (BB, FB, JF, LL, and MH) and low intelligibility (BK, BV, RK, RL, and SC). In Figure 7 the results for all low intelligibility speakers are presented. There is an overall improvement of about 5% when using different training sets. However, for speakers with high intelligibility, there is no improvement over MLLR, as shown in Figure 8.


Table 4: Comparison of statistical significance of results over all dysarthric speakers.

System     Errors    P
MLLR 04    1174      .00168988
META 04    1139
MLLR 10    1073      .0002459
META 10    1036
MLLR 16    1043      .00204858
META 16     999
MLLR 22     989      .0000351
META 22     941
MLLR 28     990      .00240678
META 28     952
MLLR 34     992      .00000014
META 34     924

[Figure 7: Mean word recognition accuracy of the adapted models and the metamodels across all low intelligibility dysarthric speakers. Word accuracy (%, 40–65) versus the number of sentences used for MLLR adaptation and metamodel training (4–34); series: MLLR, and metamodels on MLLR.]

These results indicate that the use of metamodels is a significantly better approach to ASR than speaker adaptation in cases where the intelligibility of the speaker is low and only a few adaptation utterances are available, which are two important conditions when dealing with dysarthric speech. We believe that the success of metamodels in increasing performance for low-intelligibility speakers can be attributed to the fact that these speakers often display a confusion matrix that is similar to the matrix shown in Figure 1, in which a few phonemes dominate the speaker's repertoire. The metamodels learn the patterns of substitution more quickly than the speaker adaptation technique, and hence perform better even when only a few sentences are available to estimate the confusion matrix.

[Figure 8: Mean word recognition accuracy of the adapted models and the metamodels across all high intelligibility dysarthric speakers. Word accuracy (%, 55–75) versus the number of sentences used for MLLR adaptation and metamodel training (4–34); series: MLLR, and metamodels on MLLR.]

6. Limitations of the Metamodels

As presented in Section 5, we had some success using the metamodels on dysarthric speakers. However, the experiments showed that they suffered from two disadvantages.

(1) The models had a problem dealing with deletions. If the metamodel network defining a legal sequence of words is defined in such a way that it is possible to traverse it by "skipping" every metamodel, the decoding algorithm fails because it is possible to traverse the complete network of HMMs without absorbing a single input symbol. We attempted to remedy this problem by adding an extra "deletion" symbol (see Section 4), but as this symbol could potentially substitute every single phoneme in the network, it led to an explosion in the size of the dictionary, which was unsatisfactory.

(2) The metamodels were unable to model specific phone sequences that were output in response to individual phone inputs. They were capable of outputting sequences, but the symbols (phones) in these sequences were conditionally independent, and so specific sequences cannot be modelled.

A network of Weighted Finite-State Transducers (WFSTs) [42] is an attractive alternative to metamodels for the task of estimating W from P∗. WFSTs can be regarded as a network of automata. Each automaton accepts an input symbol and outputs one of a finite set of outputs, each of which has an associated probability. The outputs are drawn (in this case) from the same alphabet as the input symbols and can be single symbols, sequences of symbols, or the deletion symbol ε. The automata are linked by a set (typically sparse) of arcs, and there is a probability associated with each arc.

These transducers can model the speaker's phonetic confusions. In addition, a cascade of such transducers can model the mapping from phonemes to words, and the mapping from words to a word sequence described by a grammar.

The usage proposed here complements and extends the work presented in [43], in which WFSTs were used to correct phone recognition errors. Here, we extend the technique to convert noisy phone strings into word sequences.

7. Second Technique: Network of Weighted Finite-State Transducers

As shown in, for instance, [42, 44], the speech recognition process can be realised as a cascade of WFSTs. In this approach, we define the following transducers to decode P∗ into a sequence of words W∗.

(1) C, the confusion matrix transducer, which models the probabilities of phoneme insertions, deletions, and substitutions.

(2) D, the dictionary transducer, which maps sequences of decoded phonemes from P∗ ◦ C into words in the dictionary.

(3) G, the language model transducer, which defines valid sequences of words from D.

Thus, the process of estimating the most probable sequence of words W∗ given P∗ can be expressed as

W∗ = τ∗(P∗ ◦ C ◦ D ◦ G),   (10)

where τ∗ denotes the operation of finding the most likely path through a transducer and ◦ denotes composition of transducers [42]. Details of each transducer used will be presented in the following sections.
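In a toolkit such as the AT&T FSM Library or OpenFst, equation (10) is realised literally as a composition followed by a shortest-path search. The pure-Python sketch below is only a conceptual stand-in for that cascade: it restricts C to substitutions (no insertions or deletions) and enumerates the word sequences allowed by a toy bigram grammar, scoring each by the product of confusion and bigram probabilities in the negative-log domain. All of the probabilities, words, and pronunciations in it are hypothetical.

    import itertools, math

    conf = {("p", "b"): 0.3, ("b", "b"): 0.7, ("ih", "ih"): 0.9, ("ih", "iy"): 0.1,
            ("n", "n"): 0.8, ("ng", "n"): 0.2}              # Pr(decoded | reference), hypothetical
    lexicon = {"bin": ["b", "ih", "n"], "bean": ["b", "iy", "n"]}
    bigram = {("<s>", "bin"): 0.5, ("<s>", "bean"): 0.5,
              ("bin", "</s>"): 1.0, ("bean", "</s>"): 1.0}   # toy grammar standing in for G

    def neg_log(p):
        return float("inf") if p <= 0.0 else -math.log(p)

    def decode(p_star, max_words=1):
        """Brute-force analogue of W* = tau*(P* o C o D o G), substitutions only."""
        best, best_cost = None, float("inf")
        for n in range(1, max_words + 1):
            for words in itertools.product(lexicon, repeat=n):
                phones = [ph for w in words for ph in lexicon[w]]
                if len(phones) != len(p_star):
                    continue                                 # no insertions/deletions in this toy
                cost = sum(neg_log(conf.get((d, r), 0.0)) for d, r in zip(p_star, phones))
                seq = ("<s>",) + words + ("</s>",)
                cost += sum(neg_log(bigram.get(bg, 0.0)) for bg in zip(seq, seq[1:]))
                if cost < best_cost:
                    best, best_cost = words, cost
        return best, best_cost

    print(decode(["p", "ih", "ng"]))   # expected to recover ("bin",)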

7.1. Confusion Matrix Transducer (C). In this section, we describe the formation of the confusion matrix transducer C. In Section 3, we defined p∗j as the jth phoneme in P∗ and pj as the jth phoneme in P, where Pr(p∗j | pj) is estimated from the speaker's confusion matrix, which is obtained from an alignment of many sequences of P∗ and P. While single substitutions are modelled in the same way by both metamodels and WFSTs, insertions and deletions are modelled differently, taking advantage of the characteristics of the WFSTs. Here, the confusion matrix transducer C can map single and multiple phoneme insertions and deletions.

Consider Table 5, which shows an alignment from one of our experiments. The top row of phone symbols represents the transcription of the word sequence and the bottom row the output from the speech recognizer. It can be seen that the phoneme sequence /b aa/ is deleted after /ax/, and this can be represented in the transducer as a multiple substitution/insertion: /ax/ → /ax b aa/. Similarly, the insertion of /ng dh/ after /ih/ is modelled as /ih ng dh/ → /ih/. The probabilities of these multiple substitutions/insertions/deletions are estimated by counting. In cases where a multiple insertion or deletion is made of the form A → /B C/, the appropriate fraction of the unigram probability mass Pr(A → B) is subtracted and given to the probability Pr(A → /B C/), and the same process is used for higher-order insertions or deletions.
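The counting step can be sketched as follows, assuming the aligned data has already been grouped into (decoded span, reference span) pairs such as (/ax/, /ax b aa/) or (/ih ng dh/, /ih/); this input format is our assumption, and the subsequent reallocation of unigram mass is only indicated in a comment.

    from collections import defaultdict

    def channel_probabilities(aligned_spans):
        """Relative-frequency estimate of the confusion channel used to build C.

        aligned_spans: list of (decoded, reference) pairs of phone tuples obtained
        from the DP alignments, e.g. (("ax",), ("ax", "b", "aa")) for a recovered
        deletion and (("ih", "ng", "dh"), ("ih",)) for an absorbed insertion.
        """
        counts = defaultdict(lambda: defaultdict(float))
        for decoded, reference in aligned_spans:
            counts[decoded][reference] += 1.0
        probs = {}
        for decoded, outcomes in counts.items():
            total = sum(outcomes.values())
            probs[decoded] = {ref: c / total for ref, c in outcomes.items()}
        # Multi-symbol entries such as Pr(/ax/ -> /ax b aa/) would then take their
        # mass from the corresponding unigram cell, as described in the text.
        return probs

    example = [
        (("ax",), ("ax",)), (("ax",), ("ax", "b", "aa")),   # one recovered deletion
        (("ih", "ng", "dh"), ("ih",)), (("b",), ("b",)),
    ]
    print(channel_probabilities(example))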

A fragment of the confusion matrix transducer that represents the alignment of Table 5 is presented in Figure 9. For computational convenience, the weight for each confusion in the transducer is represented as −log Pr(p∗j | pj). In practice, we have found it convenient to build an initial set of transducers directly from the speaker's "unigram" confusion matrix, which is estimated using each transcription/output alignment pair available from that speaker, and then to add extra transducers that represent multiple substitutions/insertions/deletions. The complete set of transducers is then determinized and minimized, as described in [42]. The result of these operations is a single transducer for the speaker.

[Figure 9: Example of the confusion matrix transducer C. Arcs are labelled input:output/weight, where the weight is the negated log probability, for example ax:ax/0.73, ax:ey/2.81, dh:w/3.42, ng:ε/0, ε:b/0, ε:aa/0.]

[Figure 10: Sparse confusion matrix for C, plotted as stimulus (rows) versus response (columns) over the full phone set; blank columns correspond to phonemes that were never decoded.]

One problem encountered when limited training data is available from speakers is that some phonemes are never decoded during the training phase, and therefore it is not possible to make any estimate of Pr(p∗j | pj). This is shown in Figure 10, which shows a confusion matrix estimated from a single talker using only four sentences. Note that the columns are the response and the rows are the stimulus in this matrix, and so blank columns are phonemes that have never been decoded. We used two techniques to smooth the missing probabilities.

Table 5: Alignment of transcription P and recognized output P∗.

P:  ax b aa th ih ax z w ey ih ng dh ax b eh t
P∗: ax r ih ng dh ax ng dh ax l ih ng dh ax b

7.2. Base Smoothing. It is essential to have a nonzero value for every diagonal element of a confusion matrix to enable the decoding process to work using an arbitrary language model. One possibility is to set all diagonal elements for which no data exists to 1.0, that is, to assume that the associated phone is always correctly decoded. However, if the estimate of the overall probability of error of the recognizer on this speaker is p, a more robust estimate is to set any unseen diagonal elements to p, and we begin by doing this. We then need to decide how to assign nondiagonal probabilities for unseen confusions. We do this by "stealing" a small proportion of the probability mass on the diagonal and redistributing it along the associated row. This is equivalent to assigning a proportion of the probability of correctly decoded phonemes to as yet unseen confusions. The proportion of the diagonal probability that is used to estimate these unseen confusions depends on the amount of data from the speaker: clearly, as the data increases, the confusion probability estimates become more accurate and it is not appropriate to use a large proportion. Some experimentation on our data revealed that redistributing approximately 20% of the diagonal probability to unseen confusions worked well.
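A minimal sketch of this base smoothing is given below, with the matrix held as Pr(decoded | reference) in a nested dictionary. Setting unseen diagonals from the speaker's overall estimate p and redistributing roughly 20% of each diagonal follow the description above; spreading the redistributed mass uniformly over the unseen cells is an assumption about a detail the text leaves open.

    def base_smooth(conf, phones, p, steal=0.20):
        """Base smoothing of a sparse confusion matrix conf[ref][dec] = Pr(dec | ref)."""
        out = {}
        for ref in phones:
            row = dict(conf.get(ref, {}))
            if row.get(ref, 0.0) == 0.0:
                row[ref] = p                       # unseen diagonal, set from the overall estimate
            unseen = [d for d in phones if d != ref and row.get(d, 0.0) == 0.0]
            if unseen:
                moved = steal * row[ref]           # "steal" ~20% of the diagonal mass
                row[ref] -= moved
                for d in unseen:                   # spread it uniformly over unseen confusions
                    row[d] = row.get(d, 0.0) + moved / len(unseen)
            total = sum(row.values())              # renormalise the row
            out[ref] = {d: v / total for d, v in row.items()}
        return out

    # toy usage: only the row for /b/ has any data
    phones = ["b", "p", "d"]
    sparse = {"b": {"b": 0.9, "p": 0.1}}
    print(base_smooth(sparse, phones, p=0.3))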

7.3. SI Smoothing. The "base" smoothing described in Section 7.2 could be regarded as "speaker-dependent" (SD) in that it uses the (sparse) confusion estimates made from the speaker's own data to smooth the unseen confusions. However, these estimates are likely to be noisy, so we add another layer of smoothing using the speaker-independent (SI) confusion matrix whose elements are well estimated from 92 speakers of the WSJ database (see Section 2). The influence of this confusion matrix on the speaker-dependent matrix is controlled by a mixing factor λ. Defining the elements of the SI confusion matrix as q∗j and qj (see Section 4.1), the resulting joint confusion matrix can be expressed as

Cjoint = λ SI + (1 − λ) SD
       = λ PrSI(q∗j | qj) + (1 − λ) Pr(p∗j | pj).   (11)

The effect of both the base smoothing and the SI smoothing on the sparse confusion matrix of Figure 10 can be seen in Figure 11. The effect of λ on the mean word accuracy across all dysarthric speakers is shown in Figure 14.
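The interpolation of equation (11) is a per-cell convex combination of the two matrices; a minimal sketch, assuming both are stored as dictionaries keyed by (decoded, reference) phoneme pairs:

    def interpolate(si, sd, lam):
        """C_joint = lambda * SI + (1 - lambda) * SD, applied cell by cell."""
        keys = set(si) | set(sd)
        return {k: lam * si.get(k, 0.0) + (1.0 - lam) * sd.get(k, 0.0) for k in keys}

    # toy usage with lambda = 0.25, the value used for the results reported in Figure 15
    si = {("p", "b"): 0.2, ("b", "b"): 0.8}
    sd = {("b", "b"): 1.0}
    print(interpolate(si, sd, 0.25))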

7.4. Dictionary and Language Model Transducer (D, G). The transducer D maps sequences of phonemes into valid words. Although other work has investigated the possibility of using WFSTs to model pronunciation in this component [45], in our study the pronunciation modelling is done by the transducer C. A small fragment of the dictionary entries is shown in Figure 12(a), where each sequence of phonemes that forms a word is listed as an FST. The minimized union of all these word entries is shown in Figure 12(b). The single and multiple pronunciations of each word were taken from the British English BEEP pronouncing dictionary [39].

[Figure 11: SI smoothing of C, with λ = 0.25 (stimulus versus response over the full phone set).]

[Figure 12: Example of the dictionary transducer D. (a) Individual FST entries for the words "shoe" (sh uw), "shin" (sh ih n), and "shooing" (sh uw ih ng), with phoneme inputs, ε outputs along the path, and the word label emitted at the final arc; (b) the minimized union of these word entries.]

The language model transducer consisted of a word bigram, as used in the metamodels, but now represented as a WFST. HLStats [39] was used to estimate these bigrams, which were then converted into a format suitable for use in a WFST. A fragment of the word bigram FST G is shown in Figure 13. Note that the network of Figure 13 allows sequences of the form "the X is Y the Z" (see Section 2) to be recognized explicitly, but an arbitrary word bigram grammar can be represented using one of these transducers.

All three transducers used in these experiments were determinized and minimized in order to make execution more efficient.
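Within HTK, HLStats produces these bigram statistics; the sketch below shows the equivalent relative-frequency estimate in plain Python so that the numbers entering G are explicit. The sentence-boundary markers and the absence of any discounting are simplifications, and the toy corpus reuses words that appear in Figures 12 and 13.

    from collections import defaultdict

    def bigram_lm(sentences):
        """Relative-frequency word bigram estimates Pr(w2 | w1) with sentence markers."""
        counts = defaultdict(lambda: defaultdict(int))
        for words in sentences:
            seq = ["<s>"] + words + ["</s>"]
            for w1, w2 in zip(seq, seq[1:]):
                counts[w1][w2] += 1
        return {w1: {w2: c / sum(nxt.values()) for w2, c in nxt.items()}
                for w1, nxt in counts.items()}

    # toy usage on sentences of the "the X is Y the Z" form mentioned above
    corpus = [["the", "shin", "is", "heaping", "the", "shoe"],
              ["the", "inn", "is", "going", "the", "shin"]]
    print(bigram_lm(corpus))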


[Figure 13: Example of the language model transducer G. Arcs are labelled word/weight (negated log bigram probability), for example The/0.125, Is/0.602, Going/1.69, Shooing/1.69, Shin/1.99, Inn/1.99, Who/1.99, together with !Enter and !Exit arcs.]

[Figure 14: Mean across all dysarthric speakers: comparison of WFST performance for different values of λ. Word accuracy (%, 50–64) versus mixing factor λ (0–1); one curve per training-set size (WFSTs 04 to WFSTs 34).]

8. Results of the WFST Approach on Dysarthric Speakers

The FSM Library [42, 46] from AT&T was used for the experiments with WFSTs. Figure 15 shows the mean word accuracies across all the dysarthric speakers for different amounts of adaptation data and using different decoding techniques. The figure clearly shows the gain in performance given by the WFSTs over both MLLR and the metamodels, and that SI smoothing increases WFST performance over base smoothing.

Note that Figure 15 shows results for only two values of λ: λ = 0 (base smoothing only) and SI smoothing with λ = 0.25, since the variation in performance for values of λ above 0.25 is small, as observed in Figure 14.

When the WFSTs are trained with 4 and 22 utterances (WFSTs 04, WFSTs 22), the best performance is obtained with λ = 1. WFSTs trained with 10 and 34 utterances reach their maximum with λ = 0.25, while with 16 and 28 the maximum is obtained with λ = 0.50. However, the variation in performance is small for λ > 0.25 in most cases. It is important to mention that the mixing factor is applied to the unigram probability mass (see Section 3), which in turn affects the probability of insertion/deletion of any sequence of phonemes associated with that unigram. These sequences are still estimated from the data provided by the speaker, and thus are considered even when only the speaker-independent estimates are used (λ = 1).

[Figure 15: Mean across all dysarthric speakers: comparison of % word accuracy for different techniques. Word accuracy (%, 50–64) versus the number of sentences used for MLLR adaptation and metamodel/WFST training (4–34); series: MLLR, metamodels on MLLR, WFSTs with λ = 0, and WFSTs with λ = 0.25.]

[Figure 16: Mean word recognition accuracy of the adapted models, the metamodels, and the WFSTs across all low intelligibility dysarthric speakers. Word accuracy (%, 40–65) versus the number of sentences used for training (4–34); series: MLLR, metamodels on MLLR, WFSTs with λ = 0, and WFSTs with λ = 0.25.]

8.1. Low and High Intelligibility Speakers. By separating the speakers into high and low intelligibility groups, as done in Section 5.1, a more detailed comparison of performance can be presented. In Figure 16, for low intelligibility speakers, the WFSTs with λ = 0.25 show a significant gain in performance over the metamodels when 4, 10, and 28 sentences are used for training. The gain in recognition accuracy is also evident for high intelligibility speakers, as shown in Figure 17. Figure 17 is encouraging because, as commented in Section 5.1, the modelling using metamodels did not achieve improvements on high intelligibility speakers. Hence, WFSTs may be a useful technique for improving recognition performance on normal speech.

[Figure 17: Mean word recognition accuracy of the adapted models, the metamodels, and the WFSTs across all high intelligibility dysarthric speakers. Word accuracy (%, 57–75) versus the number of sentences used for training (4–34); series: MLLR, metamodels on MLLR, WFSTs with λ = 0, and WFSTs with λ = 0.25.]

9. Summary and Conclusions

We have argued that in the case of dysarthric speakers, who have a limited phonemic repertoire and thus consistently substitute certain phonemes for others, modelling and correcting the errors made by the speaker under the guidance of a language model is a more effective approach than adapting acoustic models in the way that is effective for normal speakers. Our first system proposed the use of a technique called metamodels, which are HMM-like stochastic models that incorporate a model of a speaker's confusion matrix into the decoding process. Results obtained using metamodels showed a statistically significant improvement over the standard MLLR algorithm when the speech has low intelligibility and there is limited adaptation data available for a speaker, two conditions that are often met when dealing with dysarthric speakers. However, the architecture of metamodels gave rise to difficulties when modelling deletions of sequences of phones, which led us to refine the technique to use weighted finite-state transducers (WFSTs). These were used at the confusion matrix, word, and language levels in a cascade in order to correct errors. The results obtained using this technique were significantly better than those obtained using MLLR, and also better than those obtained using metamodels.

The work presented here must be treated as preliminary given the small size of the vocabulary and the restricted syntax of the sentences uttered in the NEMOURS database, and it needs to be validated on a larger dataset with more dysarthric speakers, more utterances per speaker, a larger vocabulary, and a freer syntax. Future work will concentrate on this, and also

(i) applying the techniques described here to normal speech;

(ii) better integrating the confusion matrix transducer with the speech recognizer;

(iii) obtaining robust estimates of confusion matrices from sparse data [47].

References

[1] A. Kain, X. Niu, J. P. Hosom, Q. Miao, and J. van Santen,“Formant re-synthesis of dysarthric speech,” in Proceedings of

the 5th ISCA Speech Synthesis Workshop (SSW ’04), pp. 25–30,Pittsburgh, Pa, USA, June 2004.

[2] F. L. Darley, A. E. Aronson, and J. R. Brown, “Differentialdiagnostic patterns of dysarthria,” Journal of Speech andHearing Research, vol. 12, no. 2, pp. 246–269, 1969.

[3] F. L. Darley, A. E. Aronson, and J. R. Brown, “Clusters ofdeviant speech dimensions in the dysarthrias,” Journal ofSpeech and Hearing Research, vol. 12, no. 3, pp. 462–496, 1969.

[4] R. D. Kent, H. K. Vorperian, J. F. Kent, and J. R. Duffy,“Voice dysfunction in dysarthria: application of the multi-dimensional voice programTM,” Journal of CommunicationDisorders, vol. 36, no. 4, pp. 281–306, 2003.

[5] R. D. Kent, J. F. Kent, J. Duffy, and G. Weismer, “Thedysarthrias: speech-voice profiles, related dysfunctions, andneuropathology,” Journal of Medical Speech-Language Pathol-ogy, vol. 6, no. 4, pp. 165–211, 1998.

[6] W. M. Holleran, S. G. Ziegler, O. Goker-Alpan, et al., “Skinabnormalities as an early predictor of neurologic outcome inGaucher disease,” Clinical Genetics, vol. 69, no. 4, pp. 355–357,2006.

[7] K. Bunton, R. D. Kent, J. F. Kent, and J. R. Duffy, “The effectsof flattening fundamental frequency contours on sentenceintelligibility in speakers with dysarthria,” Clinical Linguistics& Phonetics, vol. 15, no. 3, pp. 181–193, 2001.

[8] L. Ramig, “The role of phonation in speech intelligibility: areview and preliminary data from patients with Parkinson’sdisease,” in Intelligibility in Speech Disorders: Theory, Measure-ment and Management, R. D. Kent, Ed., pp. 119–155, JohnBenjamins, Amsterdam, The Netherlands, 1992.

[9] M. Hasegawa-Johnson, J. Gunderson, A. Perlman, and T.Huang, “HMM-based and SVM-based recognition of thespeech of talkers with spastic dysarthria,” in Proceedings ofIEEE International Conference on Acoustics, Speech and SignalProcessing (ICASSP ’06), vol. 3, pp. 1060–1063, Toulouse,France, May 2006.

[10] H. Strik, E. Sanders, M. Ruiter, and L. Beijer, “Automaticrecognition of dutch dysarthric speech: a pilot study,” inProceedings of the 7th International Conference on SpokenLanguage Processing (ICSLP ’02), pp. 661–664, Denver, Colo,USA, September 2002.

[11] P. Raghavendra, E. Rosengren, and S. Hunnicutt, “An inves-tigation of different degrees of dysarthric speech as inputto speaker-adaptive and speaker-dependent recognition sys-tems,” Augmentative and Alternative Communication, vol. 17,no. 4, pp. 265–275, 2001.

[12] P. D. Polur and G. E. Miller, “Effect of high-frequencyspectral components in computer recognition of dysarthricspeech based on a Mel-cepstral stochastic model,” Journal ofRehabilitation Research and Development, vol. 42, no. 3, pp.363–371, 2005.

[13] K. Rosen and S. Yampolsky, “Automatic speech recognitionand a review of its functioning with dysarthric speech,”Augmentative and Alternative Communication, vol. 16, no. 1,pp. 48–60, 2000.

[14] G. K. Poock, W. C. Lee Jr., and S. W. Blackstone, “Dysarthricspeech input to expert systems, electronic mail, and daily jobactivities,” in Proceedings of the American Voice Input/OutputSociety Conference (AVIOS ’87), pp. 33–43, Alexandria, Va,USA, October 1987.

[15] N. Thomas-Stonell, A.-L. Kotler, H. A. Leeper, and P. C. Doyle,“Computerized speech recognition: influence of intelligibilityand perceptual consistency on recognition accuracy,” Augmen-tative and Alternative Communication, vol. 14, no. 1, pp. 51–56, 1998.


[16] P. Green, J. Carmichael, A. Hatzis, P. Enderby, M. Hawley,and M. Parker, “Automatic speech recognition with sparsetraining data for dysarthric speakers,” in Proceedings of the8th European Conference on Speech Communication and Tech-nology (Eurospeech ’03), pp. 1189–1192, Geneva, Switzerland,September 2003.

[17] A.-L. Kotler and C. Tam, “Effectiveness of using discreteutterance speech recognition software,” Augmentative andAlternative Communication, vol. 18, no. 3, pp. 137–146, 2002.

[18] L. J. Ferrier, “Clinical study of a dysarthric adult using a touchtalker with words strategy,” Augmentative and AlternativeCommunication, vol. 7, no. 4, pp. 266–274, 1991.

[19] L. J. Ferrier, H. C. Shane, H. F. Ballard, T. Carpenter, andA. Benoit, “Dysarthric speakers’ intelligibility and speechcharacteristics in relation to computer speech recognition,”Augmentative and Alternative Communication, vol. 11, no. 3,pp. 165–175, 1995.

[20] A.-L. Kotler and N. Thomas-Stonell, “Effects of speech train-ing on the accuracy of speech recognition for an individualwith a speech impairment,” Augmentative and AlternativeCommunication, vol. 13, no. 2, pp. 71–80, 1997.

[21] N. J. Manasse, K. Hux, and J. L. Rankin-Erickson, “Speechrecognition training for enhancing written language genera-tion by a traumatic brain injury survivor,” Brain Injury, vol.14, no. 11, pp. 1015–1034, 2000.

[22] G. Jayaram and K. Abdelhamied, “Experiments in dysarthricspeech recognition using artificial neural networks,” Journalof Rehabilitation Research and Development, vol. 32, no. 2, pp.162–169, 1995.

[23] C. Goodenough-Trapagnier and M. J. Rosen, “Towards amethod for computer interface design using speech recogni-tion,” in Proceedings of the 4th Rehabilitation Engineering andAssistive Technology Society of North America (RESNA ’91), pp.328–329, Kansas City, Mo, USA, June 1991.

[24] N. Talbot, “Improving the speech recognition in the ENABLproject,” TMH-QPSR, vol. 41, no. 1, pp. 31–38, 2000.

[25] T. Magnuson and M. Blomberg, “Acoustic analysis ofdysarthric speech and some implications for automatic speechrecognition,” TMH-QPSR, vol. 41, no. 1, pp. 19–30, 2000.

[26] C. J. Leggetter and P. C. Woodland, “Maximum likelihoodlinear regression for speaker adaptation of continuous densityhidden Markov models,” Computer Speech & Language, vol. 9,no. 2, pp. 171–185, 1995.

[27] M. S. Hawley, P. Green, P. Enderby, S. Cunningham, and R.K. Moore, “Speech technology for e-inclusion of people withphysical disabilities and disordered speech,” in Proceedings ofthe 9th European Conference on Speech Communication andTechnology (Interspeech ’05), pp. 445–448, Lisbon, Portugal,September 2005.

[28] M. Parker, S. Cunningham, P. Enderby, M. Hawley, andP. Green, “Automatic speech recognition and training forseverely dysarthric users of assistive technology: the STAR-DUST project,” Clinical Linguistics and Phonetics, vol. 20, no.2-3, pp. 149–156, 2006.

[29] A. Hatzis, P. Green, J. Carmichael, et al., “An integrated toolkitdeploying speech technology for computer based speech train-ing with application to dysarthric speakers,” in Proceedingsof the 8th European Conference on Speech Communicationand Technology (Eurospeech ’03), pp. 2213–2216, Geneva,Switzerland, September 2003.

[30] Clinical Applications of Speech Technology, Speech andHearing Group, “Voice Input Voice Output CommunicationAid (VIVOCA),” Department of Computer Science, Universityof Sheffield, 2008, http://www.shef.ac.uk/cast/projects/vivoca.

[31] Voicewave Technology Inc., “Speech Enhancer,” 2008,http://www.speechenhancer.com/equipment.htm.

[32] J. Rothwell and D. Fuller, “Functional communication for softor inaudible voices: a new paradigm,” in Proceedings of the 28thRehabilitation Engineering and Assistive Technology Society ofNorth America (RESNA ’05), Atlanta, Ga, USA, June 2005.

[33] H. Kim, M. Hasegawa-Johnson, A. Perlman, et al., “Dysarthricspeech database for universal access research,” in Proceedingsof the International Conference on Spoken Language Processing(Interspeech ’08), pp. 1741–1744, Brisbane, Australia, Septem-ber 2008.

[34] Speech Research Lab, A.I. duPont Hospital for Childrenand the University of Delaware, 2008, http://www.asel.udel.edu/speech/projects.html.

[35] Speech Research Lab, “InvTool Recording Software andModelTalker Synthesizer,” A.I. duPont Hospital for Chil-dren and the University of Delaware, 2008, http://www.asel.udel.edu/speech/ModelTalker.html.

[36] X. Menendez-Pidal, J. B. Polikoff, S. M. Peters, J. E. Leonzio,and H. T. Bunnell, “The Nemours database of dysarthricspeech,” in Proceedings of the International Conference onSpoken Language Processing (ICSLP ’96), vol. 3, pp. 1962–1965,Philadelphia, Pa, USA, October 1996.

[37] K. Hux, J. Rankin-Erickson, N. Manasse, and E. Lauritzen,“Accuracy of three speech recognition systems: case study ofdysarthric speech,” Augmentative and Alternative Communica-tion, vol. 16, no. 3, pp. 186–196, 2000.

[38] T. Robinson, J. Fransen, D. Pye, J. Foote, and S. Renals, “WSJ-CAM0: a british english speech corpus for large vocabularycontinuous speech recognition,” in Proceedings of the 20thIEEE International Conference on Acoustics, Speech, and SignalProcessing (ICASSP ’95), vol. 1, pp. 81–84, Detroit, Mich, USA,May 1995.

[39] S. Young and P. Woodland, The HTK Book Version 3.4,Cambridge University Engineering Department, Cambridge,UK, 2006.

[40] S. J. Cox and S. Dasmahapatra, “High-level approaches to con-fidence estimation in speech recognition,” IEEE Transactionson Speech and Audio Processing, vol. 10, no. 7, pp. 460–471,2002.

[41] L. Gillick and S. J. Cox, “Some statistical issues in thecomparison of speech recognition algorithms,” in Proceedingsof the IEEE International Conference on Acoustics, Speech, andSignal Processing (ICASSP ’89), vol. 1, pp. 532–535, Glasgow,Scotland, May 1989.

[42] M. Mohri, F. Pereira, and M. Riley, “Weighted finite-statetransducers in speech recognition,” Computer Speech & Lan-guage, vol. 16, no. 1, pp. 69–88, 2002.

[43] M. Levit, H. Alshawi, A. Gorin, and E. Noth, “Context-sensitive evaluation and correction of phone recognitionoutput,” in Proceedings of the 8th ISCA European Conference onSpeech Communication and Technology (Eurospeech ’03), pp.925–928, Geneva, Switzerland, September 2003.

[44] E. Fosler-Lussier, I. Amdal, and H.-K. J. Kuo, “On the roadto improved lexical confusability metrics,” in Proceedings ofthe ISCA Tutorial and Research Workshop on PronunciationModelling and Lexicon Adaptation (PMLA ’02), pp. 53–58,Estes Park, Colo, USA, September 2002.

[45] N. Bodenstab and M. Fanty, “Multi-pass pronunciation adap-tation,” in Proceedings of the IEEE International Conference onAcoustics, Speech, and Signal Processing (ICASSP ’07), vol. 4,pp. 865–868, Honolulu, Hawaii, USA, April 2007.

[46] "Weighted Finite-State Transducer Software Library Lecture," Courant Institute of Mathematical Sciences, New York University, 2007, http://www.cs.nyu.edu/∼mohri/asr07/lecture 2.pdf.

[47] S. J. Cox, “On estimation of a speaker’s confusion matrix fromsparse data,” in Proceedings of the International Conferenceon Spoken Language Processing (Interspeech ’08), pp. 1–4,Brisbane, Australia, September 2008.


Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2009, Article ID 982531, 11 pages
doi:10.1155/2009/982531

Research Article

Assessment of Severe Apnoea through Voice Analysis, Automatic Speech, and Speaker Recognition Techniques

Rubén Fernández Pozo,1 José Luis Blanco Murillo,1 Luis Hernández Gómez,1 Eduardo López Gonzalo,1 José Alcázar Ramírez,2 and Doroteo T. Toledano3

1 Signal, Systems and Radiocommunications Department, Universidad Politécnica de Madrid, Madrid 28040, Spain
2 Respiratory Department, Hospital Torrecárdenas, Almería 04009, Spain
3 ATVS Biometric Recognition Group, Universidad Autónoma de Madrid, Madrid 28049, Spain

Correspondence should be addressed to Rubén Fernández Pozo, [email protected]

Received 1 November 2008; Revised 5 February 2009; Accepted 8 May 2009

Recommended by Tan Lee

This study is part of an ongoing collaborative effort between the medical and the signal processing communities to promote research on applying standard Automatic Speech Recognition (ASR) techniques for the automatic diagnosis of patients with severe obstructive sleep apnoea (OSA). Early detection of severe apnoea cases is important so that patients can receive early treatment. Effective ASR-based detection could dramatically cut medical testing time. Working with a carefully designed speech database of healthy and apnoea subjects, we describe an acoustic search for distinctive apnoea voice characteristics. We also study abnormal nasalization in OSA patients by modelling vowels in nasal and nonnasal phonetic contexts using Gaussian Mixture Model (GMM) pattern recognition on speech spectra. Finally, we present experimental findings regarding the discriminative power of GMMs applied to severe apnoea detection. We have achieved an 81% correct classification rate, which is very promising and underpins the interest in this line of inquiry.

Copyright © 2009 Rubén Fernández Pozo et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

Obstructive sleep apnoea (OSA) is a highly prevalent disease [1], affecting an estimated 2–4% of the male population between the ages of 30 and 60. It is characterized by recurring episodes of sleep-related collapse of the upper airway at the level of the pharynx (AHI > 15, where the Apnoea-Hypopnoea Index represents the number of apnoeas and hypopnoeas per hour of sleep), and it is usually associated with loud snoring and increased daytime sleepiness. OSA is a serious threat to an individual's health if not treated. The condition is a risk factor for hypertension and, possibly, cardiovascular diseases [2]; it is usually related to traffic accidents caused by somnolent drivers [1–3], and it can lead to a poor quality of life and impaired work performance. At present, the most effective and widespread treatment for OSA is nasal CPAP (Continuous Positive Airway Pressure), which prevents apnoea episodes by providing a pneumatic splint to the airway. OSA can be diagnosed on the basis of a characteristic history (snoring, daytime sleepiness) and physical examination (increased neck circumference), but a full overnight sleep study is usually needed to confirm the disorder. The procedure is known as conventional polysomnography, which involves the recording of neuroelectrophysiological and cardiorespiratory variables (ECG). Excellent automatic OSA recognition performance (around 90% [4]) is attainable with this method based on nocturnal ECG recordings. Nevertheless, this diagnostic procedure is expensive and time consuming, and patients usually have to endure a waiting list of several years before the test is done, since the demand for consultations and diagnostic studies for OSA has recently increased [1]. There is, therefore, a strong need for methods of early diagnosis of apnoea patients in order to reduce these considerable delays.

The pathogenesis of obstructive sleep apnoea has been under investigation for over 25 years, during which a number of factors that contribute to upper airway (UA) collapse during sleep have been identified. Essentially, pharyngeal collapse occurs when the normal reduction in pharyngeal dilator muscle tone at the onset of sleep is superimposed on a narrowed and/or highly compliant pharynx. This suggests that OSA may be a heterogeneous disorder, rather than a single disease, involving the interaction of anatomic and neural state-related factors in causing pharyngeal collapse. An excellent review of the anatomic and physiological factors predisposing to UA collapse in adults with OSA can be found in [5]. Furthermore, it is worth noting here that OSA is an anatomic illness, the appearance of which may have been favoured by the evolutionary adaptations in man's upper respiratory tract to facilitate speech, a phenomenon that Jared Diamond calls "The Great Leap Forward" [6]. These anatomic changes include a shortening of the maxillary, ethmoid, palatal and mandibular bones, acute oral cavity-skull base angulation, pharyngeal collapse with anterior migration of the foramen magnum, posterior migration of the tongue into the pharynx and descent of the larynx, and shortening of the soft palate with loss of the epiglottic-soft palate lock-up. The adaptations came about, it is believed, partly due to positive selection pressures for bipedalism, binocular vision and the development of voice, speech, and language, but they may also have provided the structural basis for the occurrence of obstructive sleep apnoea.

In our research we investigate the acoustical characteristics of the speech of patients with OSA for the purpose of learning whether severe OSA may be detected using Automatic Speech Recognition (ASR) techniques. The automated acoustic analysis of normal and pathological voices as an alternative method of diagnosis is becoming increasingly interesting for researchers in laryngological and speech pathologies in general because of its nonintrusive nature and its potential for providing quantitative data relatively quickly. Most of the approaches found in the literature have focused on parameters based on long-time signal analysis, which require accurate estimation of the fundamental frequency, which is a fairly complex task [7, 8]. In recent years, some studies have investigated the use of short-time measures for pathological voice detection. Excellent recognition rates have been achieved by modelling short-time speech spectrum information with cepstral coefficients and using statistical pattern classification techniques such as Gaussian Mixture Models (GMMs) [9, 10] or discriminative methods such as Support Vector Machines (SVMs) [11]. These techniques based on short-time analyses can provide a characterization of pathological voices in a direct and noninvasive manner, and so they promise to become a useful support tool for the diagnosis of voice pathologies in general. In our research we are trying to characterize severe apnoea voices in particular.
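As a concrete picture of that GMM approach, the sketch below trains one mixture model per class on short-time cepstral features and classifies a test recording by comparing average log-likelihoods under the two models. It uses librosa for MFCC extraction and scikit-learn's GaussianMixture; the file names, feature settings, and mixture sizes are illustrative and are not those used in this study.

    import numpy as np
    import librosa
    from sklearn.mixture import GaussianMixture

    def mfcc_features(path, sr=16000, n_mfcc=13):
        """Short-time cepstral features, returned as (frames x coefficients)."""
        y, sr = librosa.load(path, sr=sr)
        return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

    def train_gmm(paths, n_components=16):
        """Pool the frames of several recordings and fit one GMM for the class."""
        X = np.vstack([mfcc_features(p) for p in paths])
        return GaussianMixture(n_components=n_components, covariance_type="diag").fit(X)

    def classify(path, gmm_apnoea, gmm_control):
        """Label a recording by the model giving the higher average log-likelihood."""
        X = mfcc_features(path)
        return "apnoea" if gmm_apnoea.score(X) > gmm_control.score(X) else "control"

    # hypothetical file lists
    # gmm_a = train_gmm(["apnoea_01.wav", "apnoea_02.wav"])
    # gmm_c = train_gmm(["control_01.wav", "control_02.wav"])
    # print(classify("test_subject.wav", gmm_a, gmm_c))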

In this contribution we discuss several ways to apply ASR techniques to the detection of OSA-related traits in specific linguistic contexts. The acoustic properties of the voices of speakers suffering from obstructive sleep apnoea are not well understood, as not much research has been carried out in this area. However, some studies have suggested that certain abnormalities in phonation, articulation, and resonance may be connected to the condition [12]. In order to have a controlled experimental framework to study apnoea voice characterization, we collected a speech database [13] designed following linguistic and phonetic criteria we derived from previous research in the field. Our work is focused on continuous speech rather than on sustained vowels, the latter being the standard approach in pathological voice analysis [14]. Therefore, as we are interested in the acoustic analysis of the speech signal in different linguistic and phonetic contexts, our analysis starts with the automatic phonetic segmentation of each sentence using automatic speech recognition based on Hidden Markov Models (HMMs). Together with automatic phonetic segmentation, some basic acoustic processing techniques, mainly related to articulation, phonation, and nasalization, were applied over nonapnoea and apnoea voices to obtain an initial contrastive study of the acoustic discrimination found in our database. These results provide the proper experimental framework to progress beyond previous research in the field.

After this preliminary acoustic analysis of the discriminative characteristics of our database, we explored the possibilities of using GMM-based automatic speaker recognition techniques [15] to try to observe possible peculiarities in apnoea patients' voices. Successfully detecting traits that prove to be characteristic of the voices of severe apnoea patients by applying such techniques would allow automatic (and rapid) diagnosis of the condition. To our knowledge this study constitutes pioneering research on automatic severe OSA diagnosis using speech processing algorithms on continuous speech. The proposed method is intended as complementary to existing OSA diagnosis methods (e.g., polysomnography) and clinicians' judgment, as an aid for early detection of these cases. We have observed a marked inadequacy of resources that has led to unacceptable waiting periods. Early severe OSA detection can help increase the efficiency of medical protocols by giving higher priority to more serious cases, thus optimizing both social benefits and medical resources. For instance, patients with severe apnoea have a higher risk of suffering a car accident because of somnolence caused by their condition; early detection would, therefore, contribute to reducing this risk for these patients.

The rest of this document is organized as follows. Section 2 presents the main physiological characteristics of OSA patients and the distinctive acoustic qualities of their voices, as described in the literature. The speech database used in our experimental work, as well as its design criteria, is explained in Section 3. In Section 4 we present a preliminary analysis of the speech signal of the voices in our database, using standard acoustic measurements with the purpose of confirming the occurrence of the characteristic acoustic features identified in previous research. Section 5 explores the advantages that standard automatic speech recognition can bring to diagnosis and monitoring. Next, in Section 6, we describe how we used GMMs to study nasalization in speech, comparing the voices of severe apnoea patients with those in a "healthy" control group. In the same section we also present a test we carried out to assess the accuracy of a GMM-based system we developed to classify speakers (apnoea/nonapnoea). Finally, conclusions and a brief outline of future research are given in Section 7.


2. Physiological and Acoustic Characteristics in OSA Speakers

At present, neither the articulatory/physiological peculiarities nor the acoustic characteristics of speech in apnoea speakers are well understood. Most of the more valuable information in this area can be found in Fox and Monoson's work [12], a perceptual study in which skilled judges compared the voices of apnoea patients with those of a control group (referred to as "healthy" subjects). The study showed that, although differences between both groups of speakers were found, acoustic cues for these differences are somewhat contradictory and unclear. What did seem to be clear was that the apnoea group had abnormal resonances that might be due to an altered structure or function of the upper airway. Theoretically, such an anomaly should result not only in respiratory but also in speech dysfunction. Consequently, the occurrence of speech disorder in the OSA population should be expected, and it could include anomalies in articulation, phonation, and resonance.

(1) Articulatory Anomalies. Fox and Monoson stated that neuromotor dysfunction could be found in the sleep apnoea population due to a "lack of regulated innervations to the breathing musculature or upper airway muscle hypotonus." This dysfunction is normally related to speech disorders, especially dysarthria. There are several types of dysarthria, resulting in various different acoustic features. All types of dysarthria affect the articulation of consonants and vowels, causing the slurring of speech. Another common feature in apnoea patients is hypernasality and problems with respiration.

(2) Phonation Anomalies. These may be due to the heavy snoring of sleep apnoea patients, which can cause inflammation in the upper respiratory system and affect the vocal cords.

(3) Resonance Anomalies. What seems to be clear is that the apnoea group has abnormal resonances that might be due to an altered structure or function of the upper airway causing velopharyngeal dysfunction. This anomaly should, in theory, result in an abnormal vocal quality related to the coupling of the vocal tract with the nasal cavity, and is revealed through two features.

(i) First, speakers with a defective velopharyngeal mechanism can produce speech with inappropriate nasal resonance. The term nasalization can refer to two different phenomena in the context of speech: hyponasality and hypernasality. The former is said to occur when no nasalization is produced where the sound should be nasal. Hypernasality is nasalization during the production of nonnasal (voiced oral) sounds. The interested reader can find an excellent reference in [16]. Fox and Monoson's work on the nasalization characteristics of the sleep apnoea group was not conclusive. What they could conclude was that these resonance abnormalities could be perceived as a form of either hyponasality or hypernasality. Perhaps more importantly, speakers with apnoea may exhibit smaller intraspeaker differences between nonnasal and nasal vowels due to this dysfunction (vowels ordinarily acquire either a nasal or a nonnasal quality depending on the presence or absence of adjacent nasal consonants). Only recently has resonance disorder affecting speech sound quality been associated with vocal tract damping features distinct from airflow imbalance between the oral and nasal cavities. The term applied to this speech disorder is "cul-de-sac" resonance, a type of hyponasality that causes the sound to be perceived as if it were resonating in a blind chamber.

(ii) Secondly, due to the pharyngeal anomaly, differences in formant values can be expected since, for instance, according to [17] the position of the third formant might be related to the size of the velopharyngeal opening (lowering of the velum produces higher third formant frequencies). This is confirmed in Robb et al.'s work [18], in which vocal tract acoustic resonance was evaluated in a group of OSA males. Statistically significant differences were found in formant frequency and bandwidth values between apnoea and healthy groups. In particular, the results of the formant frequency analysis showed that F1 and F2 values among the OSA group were generally lower than those in the non-OSA groups. The lower formant values were attributed to greater vocal tract length.

These types of anomalies may occur either in isolation or combined. However, none of them was found to be sufficient on its own to allow accurate assessment of the OSA condition. In fact, all three descriptors were necessary to differentiate and predict whether the subject was in the normal group or in the OSA group.

3. Apnoea Database

3.1. Speech Corpus. In this section, we describe the apnoea speaker database we designed with the goal of covering all the relevant linguistic/phonetic contexts in which physiological OSA-related peculiarities could have a greater impact. These peculiarities include the articulatory, phonation and resonance anomalies revealed in the previous research review (see Section 2).

As we pointed out in the introduction, the central aim of our study is to apply speech processing techniques to automatically detect OSA-related traits in continuous speech, building on previous perceptual work [12]. Thus, in the present paper we will not be concerned with sustained vowels, even though this has been the most common approach in the literature on pathological voice analysis [14]. This trend no doubt seeks to exploit certain advantages of using sustained vowels, the main one being that their speech signal is more time invariant than that of continuous speech, and therefore it should, in principle, allow a better estimation of the parameters for voice characterization. Another advantage for some applications is that certain speaker characteristics such as speaking rate, dialect and intonation do not influence the result. Nevertheless, analysing continuous speech may well afford greater possibilities than working with sustained vowels because certain traits of pathological voice patterns, and in particular those of OSA patients, could then be detected in different sound categories (i.e., nasals, fricatives, etc.) and also in the coarticulation between adjacent sound units. This makes it possible to study the nature of these peculiarities (say, resonance anomalies) in a variety of phonetic contexts, and this is why we have chosen to focus on continuous speech. However, we note that it is not our intention here to compare the performance of continuous speech and sustained vowel approaches.

The speech corpus contains readings of four sentences in Spanish repeated three times by each speaker. Always keeping Fox and Monoson's work in mind, we designed phrases for our speech database that include instances of the following specific phonetic contexts.

(i) In relation to resonance anomalies, we designed sentences that allow intraspeaker variation measurements; that is, measuring differential voice features for each speaker, for instance to compare the degree of vowel nasalization within and without nasal contexts.

(ii) With regard to phonation anomalies, we included continuous voiced sounds to measure irregular phonation patterns related to muscular fatigue in apnoea patients.

(iii) Finally, to look at articulatory anomalies we collected voiced sounds affected by certain preceding phonemes that have their primary locus of articulation near the back of the oral cavity, specifically, velar phonemes such as the Spanish velar approximant "g". This anatomical region has been seen to display physical anomalies in speakers suffering from apnoea. Thus, it is reasonable to suspect that different coarticulatory effects may occur with these phonemes in speakers with and without apnoea. In particular, in our corpus we collected instances of transitions from the Spanish voiced velar plosive /g/ to vowels, in order to analyse the specific impact of articulatory dysfunctions in the pharyngeal region.

All the sentences were designed to exhibit a similar melodic structure, and speakers were asked to read them with a specific rhythmic structure under the supervision of an expert. We followed this controlled rhythmic recording procedure hoping to minimise nonrelevant interspeaker linguistic variability. The sentences used were the following.

(1) Francia, Suiza y Hungría ya hicieron causa común.
    ′fraN θja ′suj θa i uη ′gri a ya j ′θje roη ′kaw sa ko ′mun

(2) Julián no vio la manga roja que ellos buscan, en ningún almacén.
    xu ′ljan no ′βjo la ′maη ga ′ro xa ke ′e λoz ′βus kan en niη ′gun al ma ′ken

(3) Juan no puso la taza rota que tanto le gusta en el aljibe.
    xwan no ′pu so la ′ta θa ′ro ta ke ′taN to le ′γus ta en el al ′xi βe

(4) Miguel y Manu llamarán entre ocho y nueve y media.
    mi ′γel i ′ma nu λa ma ′ran ′eN tre ′o t∫o i ′nwe βe i ′me ðja

The first phrase was taken from the Albayzin database, a standard phonetically balanced speech database for Spanish [19]. It was chosen because it contains an interesting sequence of successive /a/ and /i/ vowel sounds.

The second and third phrases, both negative, have a similar grammatical and intonation structure. They are potentially useful for contrastive studies of vowels in different linguistic contexts. Some examples of these contrastive pairs arise from comparing a nasal context, "manga roja" (′maη ga ′ro xa), with a neutral context, "taza rota" (′ta θa ′ro ta). As we mentioned in the previous section, these contrastive analyses could be very helpful to confirm whether indeed the voices of speakers with apnoea have an altered overall nasal quality and display smaller intraspeaker differences between nonnasal and nasal vowels due to velopharyngeal dysfunction.

The fourth phrase has a single and relatively long melodic group containing mainly voiced sounds. The rationale for this fourth sentence is that apnoea speakers usually show fatigue in the upper airway muscles. Therefore, this sentence may be helpful to discover various anomalies during the sustained generation of voiced sounds. These phonation-related features of segments of harmonic voice can be characterized following any of a number of conventional approaches that use a set of individual measurements such as the Harmonic to Noise Ratio (HNR) [20], periodicity measures and pitch dynamics (e.g., jitter). The sentence also contains several vowel sounds embedded in nasal contexts that could be used to study phonation and articulation in nasalized vowels. Finally, with regard to the resonance anomalies found in the literature, one of the possible traits of apnoea speakers is dysarthria. Our sentence can be used to analyse dysarthric voices that typically show differences in vowel space with respect to normal speakers [21].
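As an illustration of one such measurement, local jitter can be computed directly from consecutive pitch-period durations; a minimal sketch, in which the period values in the example are invented:

    import numpy as np

    def local_jitter(periods):
        """Local jitter: mean absolute difference between consecutive pitch periods,
        divided by the mean period (often reported as a percentage)."""
        periods = np.asarray(periods, dtype=float)
        return np.mean(np.abs(np.diff(periods))) / np.mean(periods)

    # hypothetical period durations in seconds (roughly 125 Hz phonation)
    periods = [0.0080, 0.0081, 0.0079, 0.0082, 0.0080]
    print(f"jitter = {100 * local_jitter(periods):.2f}%")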

3.2. Data Collection. The database was recorded in the Respiratory Department at Hospital Clínico Universitario of Málaga, Spain. It contains the readings (see Section 3.1) of 80 male subjects; half of them suffer from severe sleep apnoea (AHI > 30), and the other half are either healthy subjects or only have mild OSA (AHI < 10). Subjects in both groups have similar physical characteristics such as age and Body Mass Index (BMI); see Table 1. The speech material for the apnoea group was recorded and collected in two different sessions: one just before being diagnosed and the other after several months under CPAP treatment. This allows studying the evolution of apnoea voice characteristics for a particular patient before and after treatment.



Table 1: Distribution of normal and pathological speakers in the database.

Group     Number   Mean age   Std. dev. age   Mean BMI   Std. dev. BMI
Normal    40       42.2       8.8             26.2       3.9
Apnoea    40       49.5       10.8            32.8       5.4

3.2.1. Speech Collection. Speech was recorded using a sampling frequency of 48 kHz in an acoustically isolated booth. The recording equipment consisted of a standard laptop computer with a conventional sound card equipped with a SP500 Plantronics headset microphone with A/D conversion and digital data exchange through a USB port.

3.2.2. Image Collection. Additionally, for each subject in the database, two facial images (frontal and lateral views) were collected under controlled illumination conditions and over a flat white background. A conventional digital camera was used to obtain images in 24-bit RGB format, without compression and with 2272 × 1704 resolution. We collected these images because simple visual inspections are usually a first step when evaluating patients under clinical suspicion of suffering from OSA. Visual examination of patients includes searching for distinctive features of the facial morphology of OSA such as a short neck, characteristic mandibular distances and alterations, and obesity. To our knowledge, no research has ever been carried out to detect these OSA-related facial features by means of automatic image processing techniques.

4. Preliminary Acoustic Analysis of the Apnoea Database

In order to build on the relatively little knowledge available in this area and to evaluate how well our Apnoea Database is suited for the purposes of our research, we first examined some of the standard acoustic features traditionally used for pathological voice characterization, comparing the apnoea patient group and the control group in specific linguistic contexts.

In a related piece of research, Fiz et al. [22] applied spectral analysis on sustained vowels to detect possible apnoea-pathological cases. They used the following acoustic features: maximum frequency of harmonics, mean frequency of harmonics and number of harmonics. They found statistically significant differences between a control group (healthy subjects) and the sleep apnoea group regarding the maximum harmonic frequency for the vowels /i/ and /e/, it being lower for OSA patients. Another piece of research on the acoustic characterization of sustained vowels uttered by apnoea patients using Linear Predictive Coding (LPC) can be found in [23]. However, these studies do not investigate all of the possible acoustic peculiarities that may be found in the voices of apnoea patients, since focusing solely on sustained vowels precludes the discovery of acoustic effects that occur in continuous speech only in certain linguistic contexts.

Thus the first stage of our contrastive study was a perceptual and visual comparison of frequency representations (mainly spectrographic, pitch, energy and formant analysis) of apnoea and control group speakers. After this we carried out comparative statistical tests on various other acoustic measurements that might reveal distinctive OSA traits. These measurements were computed in specific linguistic contexts using a phonetic segmentation generated with an HMM-based (Hidden Markov Model) automatic speech recognition system. We chose standard acoustic features and tested their discriminative power on normal and apnoea voices. We chose to compare groups using Mann-Whitney U tests because part of the data was not normally distributed.
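This kind of group comparison can be illustrated with a minimal Python sketch; the per-speaker values below are synthetic placeholders (not measurements from our corpus), and only the use of SciPy's Mann-Whitney U test reflects the procedure described here.

```python
# Minimal sketch: comparing one per-speaker acoustic measurement between the apnoea and
# control groups with a Mann-Whitney U test (chosen because the data need not be normal).
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
# Hypothetical per-speaker values (e.g., an F3-F2 distance in Hz), 40 speakers per group.
apnoea_values = rng.normal(loc=620.0, scale=40.0, size=40)
control_values = rng.normal(loc=590.0, scale=40.0, size=40)

u_stat, p_value = mannwhitneyu(apnoea_values, control_values, alternative="two-sided")
print(f"U = {u_stat:.1f}, p = {p_value:.4f}")
print(f"median apnoea = {np.median(apnoea_values):.1f}, median control = {np.median(control_values):.1f}")
```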

With this experimental setup, and following up on previous research on the acoustic characteristics of OSA speakers, we searched for articulatory, phonation and resonance anomalies in apnoea-suffering speakers.

(1) Articulatory Anomalies. An interesting conclusion from our initial perceptual contrastive study was that, when comparing the distance between the second (F2) and third formant (F3) for the vowel /i/, clear differences between the apnoea and control groups were found. For apnoea speakers the distance was greater, and this was especially clear in diphthongs with /i/ as the stressed vowel, as in the Spanish word "Suiza" (′suj θa) (see Figure 1). This finding is in agreement with Robb's conclusion that the F2 formant value in the vowels produced by apnoea subjects is lower (and therefore the distance between F3 and F2 is larger) than normal [18].

This finding may be related to the greater length of the vocal tract of OSA patients [18], but also, and perhaps more importantly, to a characteristically abnormal velopharyngeal opening which may cause a shift in the position of the third formant. Indeed, a lowering of the velum (typical in apnoea speakers) is known to produce higher third formant frequencies. We measured the distance between F2 and F3 in the utterances of the first test phrase listed above, which contains good examples of stressed i's. We measured absolute distances in spite of the fact that the actual location of the formants is speaker dependent. Nevertheless, we considered that normalization was not necessary because our database contains only male subjects with similar relevant physical characteristics, and the formants should lie roughly in the same regions for all of our speakers. Significant differences were indeed found (Table 2). This fact could support the hypothesis that some form of nasalization is taking place in the case of apnoea speakers.
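For illustration only, the following hedged sketch estimates F2 and F3 for a vowel segment from LPC polynomial roots and reports the F3 - F2 distance. The file name, segment boundaries and LPC order are hypothetical; a dedicated formant tracker (e.g., Praat) would normally be preferred in practice, and the original study does not specify its formant-measurement tool.

```python
# Rough LPC-root formant estimate for one stressed /i/ segment, then the F3-F2 distance.
import numpy as np
import librosa

def lpc_formants(frame, sr):
    """Estimate formant frequencies (Hz) from the roots of an LPC polynomial."""
    frame = np.append(frame[0], frame[1:] - 0.97 * frame[:-1])   # pre-emphasis
    frame = frame * np.hamming(len(frame))
    a = librosa.lpc(frame, order=int(2 + sr / 1000))             # rule-of-thumb LPC order
    roots = [r for r in np.roots(a) if np.imag(r) >= 0]          # keep upper half-plane roots
    freqs = sorted(np.angle(roots) * sr / (2 * np.pi))
    return [f for f in freqs if 90 < f < sr / 2 - 90]            # drop near-DC/Nyquist roots

y, sr = librosa.load("suiza_vowel_i.wav", sr=None)               # placeholder vowel segment
frame = y[int(0.10 * sr): int(0.13 * sr)]                         # hypothetical 30 ms window
formants = lpc_formants(frame, sr)
if len(formants) >= 3:
    f2, f3 = formants[1], formants[2]
    print(f"F2 = {f2:.0f} Hz, F3 = {f3:.0f} Hz, F3 - F2 = {f3 - f2:.0f} Hz")
```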

(2) Phonation Anomalies. In [12] it is reported that the heavy snoring of sleep apnoea patients can cause inflammation and fatigue in the upper airway muscles and may affect the vocal cords. As indicators of these phonation abnormalities we can use various individual measurements such as the Harmonic to Noise Ratio (HNR) and dysperiodicity parameters.



Figure 1: Differences between third and second formant for the vowel "i" in the word "Suiza" (′suj θa), (a) for an apnoea speaker and (b) a control group speaker.

Table 2: Median and P-values for articulatory measurements obtained when both groups were compared with the Mann-Whitney U test.

Feature                              Group     Median   P-value (95% conf)
Dif. third and second formant        Apnoea    614      P < .001
                                     Control   586.5

(i) HNR [20] is a measurement of voice pureness. It is based on calculating the ratio of the energy of the harmonics to the noise energy present in the voice (measured in dB).

(ii) Dysperiodicity, a common symptom of voice disorders, refers to anomalies in the glottal excitation signal generated by the vibrating vocal folds and the glottal airflow. We estimated vocal dysperiodicities in connected speech following [24].

A normal voice will tend to have a higher HNR and less dysperiodicity (higher signal-to-dysperiodicity ratio) than a "pathological" voice. We computed HNR and signal-to-dysperiodicity measures for the fourth phrase in the database since it mainly contains voiced sounds and the subjects were asked to read it as a single melodic group. A Mann-Whitney U test revealed significant differences (P < .05) for these measures in the specific linguistic contexts which we stated previously, as we can see in Table 3. This result suggests that OSA can be linked to certain phonation anomalies, and that the data we collected reveals these phenomena.
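A simplified per-frame HNR estimate in the spirit of the autocorrelation method of [20] can be sketched as follows; this is an illustrative approximation (peak of the normalized autocorrelation at the pitch lag), not the exact algorithm used in the study, and the test signal is synthetic.

```python
# Per-frame harmonics-to-noise ratio estimate from the normalized autocorrelation peak.
import numpy as np

def frame_hnr(frame, sr, f0_min=75.0, f0_max=300.0):
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    ac = ac / ac[0]                                    # normalized autocorrelation
    lag_min, lag_max = int(sr / f0_max), int(sr / f0_min)
    r_max = np.clip(np.max(ac[lag_min:lag_max]), 1e-6, 1 - 1e-6)
    return 10.0 * np.log10(r_max / (1.0 - r_max))      # HNR in dB

# Hypothetical usage on a 40 ms voiced frame sampled at 16 kHz:
sr = 16000
t = np.arange(int(0.04 * sr)) / sr
voiced = np.sin(2 * np.pi * 120 * t) + 0.1 * np.random.default_rng(0).standard_normal(t.size)
print(f"HNR estimate: {frame_hnr(voiced, sr):.1f} dB")
```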

(3) Resonance Anomalies. Fox et al. state in [12] that a common resonance feature in apnoea patients is abnormal nasality. The presence and the size of one extra low frequency formant can be considered an indicator of nasalization [25], but no perceptual differences between the groups in the overall nasality level could be found. As discussed in previous sections, this could be due to common perceptual difficulties to classify the voice of apnoea speakers as hyponasal or hypernasal. However, we did find differences in both groups (apnoea and nonapnoea) in how nasalization varied from nasal to nonnasal contexts and vice versa.

Table 3: Median and P-values of phonation measurements obtained when both groups were compared with the Mann-Whitney U test.

Feature                      Group     Median   P-value (95% conf)
HNR                          Apnoea    10.3     P = .0110
                             Control   10.6
Signal-to-dysperiodicity     Apnoea    30.1     P < .001
                             Control   32.6

Interestingly, we found variation in nasalization to be smaller for OSA speakers. One hypothesis is that the voices of apnoea speakers have a higher overall nasality level caused by velopharyngeal dysfunction, so differences between oral and nasal vowels are smaller than normal because the oral vowels are also nasalized. An explanation for this could be that apnoea speakers have weaker control over the velopharyngeal mechanism, which may cause difficulty in changing nasality levels, whether the absolute nasalization level is high or low. These hypotheses are intriguing and we will delve deeper into them later.

5. Automatic Speech and Speaker Recognition Techniques

When trying to develop a combined model of various features from sparse data, statistical modelling is an adequate solution. Digital processing of speech signals allows several parameterizations of the utterances to be computed in order to weigh up the various dimensions of the feature space and thus outline a proper modelling space. Parameters extracted from a given data set, combined with heuristic techniques, will, hopefully, describe a generative model of the group's feature space, which may be compared to others in order to identify common features, analyze existing variability, determine the statistical significance of certain features, or even classify entities. Selecting a convenient parameterization is therefore a relevant task, and one that depends significantly on the specific problem we are dealing with.



Every sentence in our speech database was processed using short-time analysis with a 20 millisecond time frame and a 10 millisecond shift between frames, which gives a 50% overlap. Each of the windows analyzed will later be presented in the form of a training vector for our statistical models (both HMMs and GMMs). However, before training it is of great importance, as we have already pointed out, to choose an appropriate parameterization for the information. For the task of acoustical space modelling we chose to use 39 standard components: 12 Mel Frequency Cepstral Coefficients (MFCCs), plus energy, extended with their speed (delta) and acceleration (delta-delta) components. (We acknowledge that an optimized representation—similar to that of Godino et al., for laryngeal pathology detection [9]—could produce better results, but this would require specific adaptation of the recognition techniques to be applied, which falls beyond the goals of the study we present here.) The vectors resulting from this front-end process are placed together in training sets for statistical modelling. This grouping task can be carried out following a variety of criteria depending on the features we are interested in or the phonetic classes that need to be modelled.
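A hedged sketch of this 39-dimensional front-end is given below, using librosa purely for convenience; the original work does not specify this toolkit, the input file name is a placeholder, and replacing c0 by the frame log-energy is one common way to obtain the "12 MFCCs plus energy" static vector.

```python
# 39-dimensional feature sketch: 12 MFCCs + log-energy, with deltas and delta-deltas,
# computed on 20 ms frames advanced by 10 ms (50% overlap).
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=None)          # placeholder file, native sampling rate
win = int(0.020 * sr)                                    # 20 ms analysis window
hop = int(0.010 * sr)                                    # 10 ms frame shift

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=win, hop_length=hop)
frames = librosa.util.frame(y, frame_length=win, hop_length=hop)
log_energy = np.log(np.sum(frames ** 2, axis=0) + 1e-10)
n = min(mfcc.shape[1], log_energy.size)                  # align frame counts
static = np.vstack([mfcc[1:13, :n], log_energy[np.newaxis, :n]])

delta = librosa.feature.delta(static)                    # speed (delta) components
delta2 = librosa.feature.delta(static, order=2)          # acceleration (delta-delta) components
features = np.vstack([static, delta, delta2])            # shape: (39, n_frames)
print(features.shape)
```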

As we explained in Section 4, after speech signal parameterization we extract sequences of acoustic features corresponding to specific phonetic and linguistic contexts—we believe they may reveal distinctive voice characteristics for OSA speakers. We used well-known speech and speaker recognition techniques to carry out speech phonetic segmentation and apnoea/nonapnoea voice classification.

Since we needed to consider specific acoustical features and phonetic contexts, we first performed a phonetic segmentation of every utterance in the database. This allows combining speech frames from different phonetic contexts for each sound in order to generate a global model, or classifying data by keeping them in separate training sets. For each sentence in the speech database, automatic phonetic segmentation was carried out using the open-source HTK tool [26]. A full set of 24 context-independent phonetic Hidden Markov Models (HMMs) was trained on a manually phonetically tagged subcorpus of the Albayzin database [19]. As our speech apnoea database includes the transcription of all the utterances, forced segmentation was used to align a phonetic transcription using the 3-state context-independent HMMs; optional silences between words were allowed to model optional pauses in each sentence. Using automatic forced alignment avoids the need for costly annotation of the data set by hand. It also guarantees good quality segmentation, which is crucial if we are to distinguish phonemes and phonetic contexts.

After phonetic segmentation, statistical pattern recognition can be applied to classify, study or compare apnoea and nonapnoea (control) voices for specific speech segments belonging to different linguistic and phonetic contexts. As cepstral coefficients may follow any statistical distribution on different speech segments, the well-known Gaussian Mixture Model (GMM) approach was chosen to fit a flexible parametric distribution to the statistical distribution of the selected speech segment. Figure 2 summarizes the whole process we have described, showing the direct training of the GMMs from a given database.

In our case we decided to train a universal background GMM model (UBM) from phonetically balanced utterances taken from the Albayzin database [19], and use MAP (Maximum a Posteriori) adaptation to derive the specific GMMs for the different classes to be trained. This technique increases the robustness of the models, especially when sparse speech material is available [15]. Only the means were adapted, as is classically done in speaker verification. Figure 3 illustrates the GMM training process.
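The following minimal sketch illustrates GMM-UBM training with MAP adaptation of the means only, in the spirit of [15]. scikit-learn's GaussianMixture and random stand-in feature matrices are used purely for illustration; the actual experiments relied on the BECARS tool mentioned below, and the relevance factor value is an assumption.

```python
# GMM-UBM sketch: fit a UBM, then MAP-adapt only the component means per class.
import numpy as np
from sklearn.mixture import GaussianMixture

def map_adapt_means(ubm, X, relevance=16.0):
    """Return a copy of the UBM whose component means are MAP-adapted to the frames X."""
    post = ubm.predict_proba(X)                          # responsibilities, shape (N, K)
    n_k = post.sum(axis=0) + 1e-10                       # soft counts per component
    x_bar = (post.T @ X) / n_k[:, None]                  # per-component data means
    alpha = (n_k / (n_k + relevance))[:, None]           # data-dependent adaptation coefficients
    adapted = GaussianMixture(n_components=ubm.n_components, covariance_type=ubm.covariance_type)
    adapted.weights_ = ubm.weights_                      # weights and covariances kept from the UBM
    adapted.covariances_ = ubm.covariances_
    adapted.precisions_cholesky_ = ubm.precisions_cholesky_
    adapted.means_ = alpha * x_bar + (1.0 - alpha) * ubm.means_
    return adapted

# Stand-in random feature matrices (frames x 39); in the real system these would be the
# Albayzin frames for the UBM and the apnoea/control training frames for adaptation.
rng = np.random.default_rng(0)
albayzin_frames = rng.standard_normal((5000, 39))
apnoea_frames = rng.standard_normal((2000, 39)) + 0.3
control_frames = rng.standard_normal((2000, 39)) - 0.3

ubm = GaussianMixture(n_components=32, covariance_type="diag", max_iter=100, random_state=0)
ubm.fit(albayzin_frames)
gmm_apnoea = map_adapt_means(ubm, apnoea_frames)
gmm_control = map_adapt_means(ubm, control_frames)
```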

For the experiments discussed below, both processes, generation of the UBM and MAP adaptation to train the apnoea and the control group GMM models, were carried out with the BECARS open source tool [27].

For testing purposes, and in order to increase the number of tests and thus improve the statistical relevance of our results, the standard leave-one-out testing protocol was used. This protocol consists in holding out one speaker from the experimental database and training the classifier with the remaining speakers; the excluded speaker is then used as the test data. This scheme is repeated until a sufficient number of tests have been performed.

6. Apnoea Voice Modelling with GMMs

In this section we present experimental results that shed light on the potential of using GMMs to discover and model peculiarities in the acoustical signal of apnoea voices, peculiarities which may be related to the perceptually distinguishable traits described in previous research and corroborated in our preceding contrastive study. The main reason for using GMMs over the cepstral domain is related to the great potential this combination of techniques has shown for the modelling of the acoustic space of human speech, for both speech and speaker recognition. For our study we required a good modelling of the anomalies described in Section 2, which we expected to find in OSA patients. Since cepstral coefficients are related with the spectral envelope of speech signals, and therefore with the articulation of sounds, and since GMM training sets can be carefully selected in order to model specific characteristics (e.g., in order to consider resonance anomalies in particular), it seems promising to combine all this information in a fused model. We should expect such a model to be useful for describing the acoustic spaces of both the OSA patient group and the healthy group, and for discriminating between them.

This approach was applied to specific linguistic contexts obtained from our HMM-based automatic phonetic segmentation. In particular, as our apnoea speech database was designed to allow a detailed contrastive analysis of vowels in oral and nasal phonetic contexts, we focus on reporting perceptual differences related to resonance anomalies that could be perceived as either hyponasality or hypernasality. For this purpose, Section 6.1 discusses how GMM techniques can be applied to study these differences in degree of nasalization in different linguistic contexts. After this prospective research, Section 6.2 presents experimental results to test the potential of applying these standard techniques to the automatic diagnosis of apnoea, and demonstrate the discriminative power of GMM techniques for severe apnoea assessment.



Figure 2: Phonetic class GMM model training. (Block diagram: database speech utterances are phonetically segmented, short-time analysis produces MFCC vectors, training sets are generated per phonetic class, and one GMM is trained per class.)

Figure 3: Apnoea and control GMM model training. (Block diagram: a UBM is trained on the Albayzin database and MAP-adapted with apnoea-group and control-group data from the apnoea database to obtain GMMapnoea and GMMcontrol.)


6.1. A Study of Apnoea Speaker Resonance Anomalies Using GMMs. To our knowledge, signal processing and pattern recognition techniques have never been used to analyse hyponasal or hypernasal continuous speech from OSA patients. Our aim with the GMM-based experimental setup was to try to model certain resonance anomalies that have already been described for apnoea speakers in preceding research [12] and revealed in our own contrastive acoustic study. Our work focuses mainly on nasality, since distinguishing traits for speakers with apnoea have traditionally been sought in this acoustical aspect.

We therefore used GMM techniques to perform a contrastive analysis to identify differences in degree of nasalization in different linguistic contexts. Two GMMs for each apnoea or healthy speaker were trained using speech with nasalized and nonnasalized vowels. Both speaker-dependent nasal and nonnasal GMMs were trained following the approach described in Section 5. MAP adaptation was carried out with a generic vowel UBM trained using the Albayzin database [19]. These two nasal/nonnasal GMMs were used to quantify the acoustic differences between nasal and nonnasal contexts for each speaker in both the apnoea and the control groups. The smaller the difference between the nasal and the nonnasal GMMs, the more similar the nasalized and the nonnasalized vowels are. Unusually similar nasal and nonnasal vowels for any one speaker reveal the presence of resonance anomalies. We took a fast approximation of the Kullback-Leibler (KL) divergence for Gaussian Mixture Models [28] as a measure of distance between nasal and nonnasal GMMs. This distance is commonly used in Automatic Speaker Recognition to define cohorts or groups of speakers producing similar sounds.
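One common matched-component simplification of such a fast GMM-to-GMM divergence, valid when both models are MAP-adapted (means only) from the same UBM and therefore share weights and diagonal covariances, is sketched below. It is an assumption-laden illustration and not necessarily the exact formulation of [28]; the nasal/nonnasal GMM objects in the usage comment are hypothetical.

```python
# Weighted KL between matched diagonal-covariance components of two mean-adapted GMMs.
import numpy as np

def matched_kl(gmm_a, gmm_b):
    """Sum over components k of w_k * KL(N_a,k || N_b,k) with shared weights/covariances."""
    diff = gmm_a.means_ - gmm_b.means_                   # (K, D) mean differences
    inv_var = 1.0 / gmm_a.covariances_                   # shared diagonal covariances, (K, D)
    per_component = 0.5 * np.sum(diff * diff * inv_var, axis=1)
    return float(np.sum(gmm_a.weights_ * per_component))

# Hypothetical usage with the speaker-dependent nasal and nonnasal GMMs described above:
# distance = matched_kl(gmm_nasal, gmm_nonnasal)
```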



We found that the distance between nasal and nonnasal vowel GMMs was significantly larger for the control group speakers than for the speakers with severe apnoea (a Mann-Whitney U test revealed significant differences (P < .05) for these distance measures). This interesting result confirms that the margin of acoustic variation for vowels articulated in nasal versus nonnasal phonetic contexts is narrower than normal in speakers with severe apnoea. It also validates the GMM approach as a powerful speech processing and classification technique for research on OSA voice characterization and the detection of OSA speakers.

6.2. Assessment of Severe Apnoea Using GMMs. As we have suggested in the previous section, with the GMM approach we can identify some of the resonance anomalies of apnoea speakers that have already been described in the literature. With our experiment we intended to explore the possibilities that applying GMM-based speaker recognition techniques may open up for the automatic diagnosis of severe apnoea. A speaker verification system is a supervised classification system capable of discriminating between two classes of speech signals (usually "genuine" and "impostor"). For our present purposes the classes are not defined by reference to any particular speaker. Rather, we generated a general severe sleep apnoea class and a control class (speech from healthy subjects) by grouping together all of the training data from speakers of each class and directly applying the appropriate algorithm to fit both Gaussian mixtures onto our data, because what we are interested in is being able to classify people (as accurately as possible) as either suffering from severe OSA or not. This method is suitable for keeping track of the progress of voice dysfunction in OSA patients; it is easy to use, fast, noninvasive and much cheaper than traditional alternatives. While we do not suggest it should replace current OSA diagnosis methods, we believe it can be a great aid for early detection of severe apnoea cases.

Following a similar approach to that of other pathological voice assessment studies [9], GMMs representing the apnoea and control classes were built as follows.

(i) The pathological and control GMMs were trained from the generic UBM relying on MAP adaptation and the standard leave-one-out technique, as described above (Section 5).

(ii) During the apnoea/nonapnoea detection phase an input speech signal corresponding to the whole utterance of the speaker to be diagnosed is presented to the system. The parameterised speech is then processed with each apnoea and control GMM, generating two likelihood scores. From these two scores an apnoea/control decision is made according to a decision threshold adjusted beforehand as a tradeoff to achieve acceptable rates of both failure to detect apnoea voices (false negatives) and falsely classifying healthy cases as apnoea voices (false positives). A minimal sketch of this scoring step is given below.
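The sketch below illustrates this likelihood scoring and thresholding; the feature matrix, GMM objects and threshold value are placeholders rather than the system's actual components.

```python
# Score one parameterised utterance against the apnoea and control GMMs and threshold the
# average frame log-likelihood ratio.
import numpy as np

def classify_utterance(features, gmm_apnoea, gmm_control, threshold=0.0):
    """Return the decision and the average frame log-likelihood ratio for one utterance."""
    llr = np.mean(gmm_apnoea.score_samples(features) - gmm_control.score_samples(features))
    return ("apnoea" if llr > threshold else "control", llr)
```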

Table 4 shows the correct classification rates we obtained when we applied the GMM control/pathological voice classification approach to our speech apnoea database [10]. We see that the overall correct classification rate was 81%.

Table 4: Correct classification rate.

                                    Control group    Apnoea group    Overall
Correct classification rate (%)     77.5% (31/40)    85% (34/40)     81% (65/80)

Table 5: Contingency table of clinical diagnosis versus automatic classification of patients.

                                             GMM classification       GMM classification
                                             severe apnoea            nonapnoea
Diagnosed severe apnoea (AHI > 30) (A) 40    True positive (TP) 31    False negative (FN) 9
Diagnosed nonapnoea (AHI < 10) (N) 40        False positive (FP) 6    True negative (TN) 34

Table 5 is a contingency table that shows that 31 of the 40 speakers in the database diagnosed with severe apnoea were classified as such by our GMM-based system (true positives), while 9 of them were wrongly classified as nonapnoea speakers (false negatives); and 34 of the 40 speakers diagnosed as not suffering from severe apnoea were classified as such by our GMM-based system (true negatives), while 6 of them were wrongly classified as apnoea speakers (false positives).

Fisher's exact test revealed a significant association (P < .001) between diagnosis and automatic (GMM-based) classification, that is, it is significantly more likely that a diagnosed patient (either with or without apnoea) will be correctly classified by our system than incorrectly classified.

In order to evaluate the performance of the classifier, and so that we may easily compare it with others, we plotted a Detection Error Tradeoff (DET) curve [29], which is a widely employed tool in the domain of speaker verification. On this curve, false positives are plotted against false negatives for different threshold values, giving a uniform treatment to both types of error. On a DET plot, the better the detector, the closer the curve will get to the bottom-left corner. Figure 4 shows the DET curve for our detector. The point marked with a diamond is the equal error rate (EER) point, that is, the point for which the false positive rate equals the false negative rate. We obtained an EER of approximately 20%.
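The EER point of such a detector can be located from per-speaker scores as sketched below; the scores and labels are synthetic placeholders, not the actual experiment outputs, and scikit-learn's ROC routine is used only for illustration.

```python
# Locate the equal error rate from detection scores: the miss rate is 1 - TPR and the
# false alarm rate is the FPR returned by the ROC computation.
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(1.0, 1.0, 40), rng.normal(-1.0, 1.0, 40)])  # apnoea, control
labels = np.concatenate([np.ones(40), np.zeros(40)])

fpr, tpr, thresholds = roc_curve(labels, scores)
fnr = 1.0 - tpr
eer_index = int(np.argmin(np.abs(fnr - fpr)))
print(f"EER about {0.5 * (fpr[eer_index] + fnr[eer_index]):.1%} at threshold {thresholds[eer_index]:.2f}")
```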

We now evaluate the performance of the classifier using the following criteria.

(i) Sensitivity: ratio of correctly classified apnoea-suffering speakers (true positives) to total number of speakers actually diagnosed with severe apnoea. Therefore, Sensitivity = TP/(TP + FN).

(ii) Specificity: ratio of true negatives to total number of speakers diagnosed as not suffering from apnoea. Specificity = TN/(TN + FP).

(iii) Positive Predictive Value: ratio of true positives to total number of patients GMM-classified as having a severe apnoea voice. Positive Predictive Value = TP/(TP + FP).


Figure 4: DET plot for our classifier (miss probability versus false alarm probability, both in %).

Table 6: Sensitivity, specificity, positive and negative predictive value, and overall accuracy.

Sensitivity      Specificity     Positive predictive value   Negative predictive value   Overall accuracy
77.5% (31/40)    85% (34/40)     83.8% (31/37)               79% (34/43)                 81% (65/80)

(iv) Negative Predictive Value: ratio of true negatives to total number of patients GMM-classified as not having a severe apnoea voice. Negative Predictive Value = TN/(TN + FN).

(v) Overall Accuracy: ratio of all correctly GMM-classified patients to total number of speakers tested. Overall Accuracy = (TP + TN)/(TP + TN + FP + FN).

Table 6 shows the values we obtained in our test for these measures of accuracy.
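As a quick check, the Table 6 figures can be recomputed directly from the Table 5 counts using the definitions listed above:

```python
# Confusion-matrix counts from Table 5: TP=31, FN=9, FP=6, TN=34.
TP, FN, FP, TN = 31, 9, 6, 34

sensitivity = TP / (TP + FN)                      # 31/40  = 0.775
specificity = TN / (TN + FP)                      # 34/40  = 0.85
ppv = TP / (TP + FP)                              # 31/37 ~ 0.838
npv = TN / (TN + FN)                              # 34/43 ~ 0.791
accuracy = (TP + TN) / (TP + TN + FP + FN)        # 65/80  = 0.8125
print(f"Sens {sensitivity:.1%}, Spec {specificity:.1%}, PPV {ppv:.1%}, NPV {npv:.1%}, Acc {accuracy:.1%}")
```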

Some comments are in order regarding the correct classification rates obtained. The results are encouraging and they show that distinctive apnoea traits can be identified by a GMM-based approach, even when there is relatively little speech material with which to train the system. Furthermore, such promising results were obtained without choosing any acoustic parameters in particular on which to base the classification. Better results should be expected with a representation and parameterization of audio data that is optimized for apnoea discrimination. Obviously, our experiments need to be validated with a larger test sample. Nevertheless, our results already give us an idea of the discriminative power of this approach to automatic diagnosis of severe apnoea cases.

7. Conclusions and Future Research

In this paper we have presented pioneering research in the field of automatic assessment of severe obstructive sleep apnoea. The acoustic properties of the voices of speakers suffering from OSA were studied and an apnoea speech database was designed attempting to cover all the major linguistic contexts in which these physiological OSA features could have a greater impact. For this purpose we analyzed in depth the possibilities of applying standard speech-based recognition systems to the modelling of the peculiar features of the realizations of certain phonemes by apnoea patients. In relation with this issue, we focused on nasality as an important feature in the acoustic characteristics of apnoea speakers. Our state-of-the-art GMM approach has confirmed that there are indeed significant differences between apnoea and control group speakers in terms of relative levels of nasalization between different linguistic contexts. Furthermore, we tested the discriminative power of GMM-based speaker recognition techniques adapted to severe apnoea detection, with promising experimental results. A correct classification rate of 81% shows that GMM-based OSA diagnosis could be useful for the preliminary assessment of apnoea patients, which suggests it is worthwhile to continue exploring this area.

Regarding future research, our automatic apnoea assessment needs to be validated with a larger sample from a broader spectrum of the population. Furthermore, better results can be expected using a representation of the audio data that is optimized for apnoea discrimination. Regarding the decision threshold, an interesting study would be to look at all the possible operating points of the system on a DET curve. It would then be possible to move the system's threshold and fine-tune it to an optimal operating point for medical applications (where, according to common medical criteria, a false negative is a more serious matter than a false positive). Finally, we mention that future research will also be focused on exploiting physiological OSA features in relevant linguistic contexts in order to explore the discriminating power of each feature using linear discriminant classifiers or calibration tools such as the open-source FoCal Toolkit [30]. We aim to apply these findings to improve the performance of the automatic apnoea diagnosis system.

Acknowledgments

The activities described in this paper were funded by the Spanish Ministry of Science and Technology as part of the TEC2006-13170-C02-02 Project. The authors would like to thank the volunteers at Hospital Clínico Universitario of Málaga, Spain, and Guillermo Portillo, who made the speech and image data collection possible. The authors also gratefully acknowledge the helpful comments and discussions of David Díaz Pardo.

References

[1] F. J. Puertas, G. Pin, J. M. María, and J. Duran, "Documento de consenso nacional sobre el síndrome de apneas-hipopneas del sueño (SAHS)," Grupo Español de Sueño (GES), 2005.



[2] G. Coccagna, A. Pollini, and F. Provini, "Cardiovascular disorders and obstructive sleep apnea syndrome," Clinical and Experimental Hypertension, vol. 28, pp. 217–224, 2006.

[3] P. Lloberes, G. Levy, C. Descals, et al., "Self-reported sleepiness while driving as a risk factor for traffic accidents in patients with obstructive sleep apnoea syndrome and in non-apnoeic snorers," Respiratory Medicine, vol. 94, no. 10, pp. 971–976, 2000.

[4] T. Penzel, J. McNames, P. de Chazal, B. Raymond, A. Murray, and G. Moody, "Systematic comparison of different algorithms for apnoea detection based on electrocardiogram recordings," Medical and Biological Engineering and Computing, vol. 40, no. 4, pp. 402–407, 2002.

[5] C. M. Ryan and T. D. Bradley, "Pathogenesis of obstructive sleep apnea," Journal of Applied Physiology, vol. 99, no. 6, pp. 2440–2450, 2005.

[6] T. M. Davidson, "The great leap forward: the anatomic basis for the acquisition of speech and obstructive sleep apnea," Sleep Medicine, vol. 4, no. 3, pp. 185–194, 2003.

[7] B. Boyanov and S. Hadjitodorov, "Acoustic analysis of pathological voices: a voice analysis system for the screening of laryngeal diseases," IEEE Engineering in Medicine and Biology Magazine, vol. 16, no. 4, pp. 74–82, 1997.

[8] B. Guimaraes Aguiar, "Acoustic Analysis and Modelling of Pathological Voices," Microsoft Research, 2007, http://www.researchchannel.org/prog/displayeventaspx?rID=21533&fID=4834.

[9] C. Fredouille, G. Pouchoulin, J.-F. Bonastre, M. Azzarello, A. Giovanni, and A. Ghio, "Application of automatic speaker recognition techniques to pathological voice assessment (dysphonia)," in Proceedings of the 9th European Conference on Speech Communication and Technology (Interspeech '05), pp. 149–152, Lisboa, Portugal, September 2005.

[10] J. I. Godino-Llorente, P. Gomez-Vilda, and M. Blanco-Velasco, "Dimensionality reduction of a pathological voice quality assessment system based on Gaussian mixture models and short-term cepstral parameters," IEEE Transactions on Biomedical Engineering, vol. 53, no. 10, pp. 1943–1953, 2006.

[11] J. I. Godino-Llorente, P. Gomez-Vilda, N. Saenz-Lechon, M. Blanco-Velasco, F. Cruz-Roldan, and M. A. Ferrer-Ballester, "Support vector machines applied to the detection of voice disorders," in Proceedings of the International Conference on Non-Linear Speech Processing (NOLISP '05), vol. 3817 of Lecture Notes in Computer Science, pp. 219–230, Springer, Barcelona, Spain, April 2005.

[12] A. W. Fox, P. K. Monoson, and C. D. Morgan, "Speech dysfunction of obstructive sleep apnea. A discriminant analysis of its descriptors," Chest, vol. 96, no. 3, pp. 589–595, 1989.

[13] R. Fernandez, L. A. Hernandez, E. Lopez, J. Alcazar, G. Portillo, and D. T. Toledano, "Design of a multimodal database for research on automatic detection of severe apnoea cases," in Proceedings of the 6th Language Resources and Evaluation Conference (LREC '08), Marrakech, Morocco, 2008.

[14] V. Parsa and D. G. Jamieson, "Acoustic discrimination of pathological voice: sustained vowels versus continuous speech," Journal of Speech, Language, and Hearing Research, vol. 44, no. 2, pp. 327–339, 2001.

[15] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, "Speaker verification using adapted Gaussian mixture models," Digital Signal Processing, vol. 10, no. 1, pp. 19–41, 2000.

[16] T. Pruthi, Analysis, vocal-tract modeling and automatic detection of vowel nasalization, Doctoral thesis, University of Maryland, Baltimore, Md, USA, 2007.

[17] A. Hidalgo and M. Quilis, Fonética y Fonología Españolas, Tirant Blanch, 2002.

[18] M. P. Robb, J. Yates, and E. J. Morgan, "Vocal tract resonance characteristics of adults with obstructive sleep apnea," Acta Oto-Laryngologica, vol. 117, no. 5, pp. 760–763, 1997.

[19] A. Moreno, D. Poch, A. Bonafonte, et al., "ALBAYZIN speech database: design of the phonetic corpus," in Proceedings of the 3rd European Conference on Speech Communication and Technology (EuroSpeech '93), vol. 1, pp. 175–178, Berlin, Germany, September 1993.

[20] P. Boersma, "Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound," in Proceedings of the Institute of Phonetic Sciences 17, pp. 97–110, 1993.

[21] G. S. Turner, K. Tjaden, and G. Weismer, "The influence of speaking rate on vowel space and speech intelligibility for individuals with amyotrophic lateral sclerosis," Journal of Speech and Hearing Research, vol. 38, no. 5, pp. 1001–1013, 1995.

[22] J. A. Fiz, J. Morera, J. Abad, et al., "Acoustic analysis of vowel emission in obstructive sleep apnea," Chest, vol. 104, no. 4, pp. 1093–1096, 1993.

[23] A. Obrador, O. Capdevila, M. Monso, et al., "Análisis de la voz en los pacientes con síndrome de apnea-hipopnea en el sueño," in Congreso Nacional de Neumología, 2008.

[24] F. Bettens, F. Grenez, and J. Schoentgen, "Estimation of vocal dysperiodicities in disordered connected speech by means of distant-sample bidirectional linear predictive analysis," Journal of the Acoustical Society of America, vol. 117, no. 1, pp. 328–337, 2005.

[25] J. R. Glass and V. W. Zue, "Detection of nasalized vowels in American English," in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '85), vol. 10, pp. 1569–1572, Tampa, Fla, USA, April 1985.

[26] S. Young, The HTK Book (for HTK Version 3.2), 2002.

[27] R. Blouet, C. Mokbel, H. Mokbel, E. Sanchez Soto, G. Chollet, and H. Greige, "BECARS: a free software for speaker verification," in Proceedings of the Speaker and Language Recognition Workshop (ODYSSEY '04), pp. 145–148, Toledo, Spain, May-June 2004.

[28] M. N. Do, "Fast approximation of Kullback-Leibler distance for dependence trees and hidden Markov models," IEEE Signal Processing Letters, vol. 10, no. 4, pp. 115–118, 2003.

[29] A. Martin, G. Doddington, T. Kamm, M. Ordowski, and M. Przybocki, "The DET curve in assessment of detection task performance," in Proceedings of the 6th European Conference on Speech Communication and Technology (Eurospeech '97), pp. 1895–1898, Rhodes, Greece, September 1997.

[30] N. Brummer and J. du Preez, "Application-independent evaluation of speaker detection," Computer Speech and Language, vol. 20, no. 2-3, pp. 230–275, 2006.


Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2009, Article ID 540409, 12 pages
doi:10.1155/2009/540409

Research Article

Alternative Speech Communication System for Persons with Severe Speech Disorders

Sid-Ahmed Selouani,1 Mohammed Sidi Yakoub,2

and Douglas O’Shaughnessy (EURASIP Member)2

1 LARIHS Laboratory, Université de Moncton, Campus de Shippagan, NB, Canada E8S 1P6
2 INRS-Énergie-Matériaux-Télécommunications, Place Bonaventure, Montréal, QC, Canada H5A 1K6

Correspondence should be addressed to Sid-Ahmed Selouani, [email protected]

Received 9 November 2008; Revised 28 February 2009; Accepted 14 April 2009

Recommended by Juan I. Godino-Llorente

Assistive speech-enabled systems are proposed to help both French and English speaking persons with various speech disorders. The proposed assistive systems use automatic speech recognition (ASR) and speech synthesis in order to enhance the quality of communication. These systems aim at improving the intelligibility of pathologic speech, making it as natural as possible and close to the original voice of the speaker. The resynthesized utterances use new basic units, a new concatenating algorithm and a grafting technique to correct the poorly pronounced phonemes. The ASR responses are uttered by the new speech synthesis system in order to convey an intelligible message to listeners. Experiments involving four American speakers with severe dysarthria and two Acadian French speakers with sound substitution disorders (SSDs) are carried out to demonstrate the efficiency of the proposed methods. An improvement of the Perceptual Evaluation of Speech Quality (PESQ) value of 5% and more than 20% is achieved by the speech synthesis systems that deal with SSD and dysarthria, respectively.

Copyright © 2009 Sid-Ahmed Selouani et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

The ability to communicate through speaking is an essential skill in our society. Several studies revealed that up to 60% of persons with speech impairments have experienced difficulties in communication abilities, which have severely disrupted their social life [1]. According to the Canadian Association of Speech Language Pathologists & Audiologists (CASLPA), one out of ten Canadians suffers from a speech or hearing disorder. These people face various emotional and psychological problems. Despite this negative impact on these people, on their families, and on the society, very few alternative communication systems have been developed to assist them [2]. Speech troubles are typically classified into four categories: articulation disorders, fluency disorders, neurologically-based disorders, and organic disorders.

Articulation disorders include substitution or omissions of sounds and other phonological errors. The articulation is impaired as a result of delayed development, hearing impairment, or cleft lip/palate. Fluency disorders, also called stuttering, are disruptions in the normal flow of speech that may yield repetitions of syllables, words or phrases, hesitations, interjections, and/or prolongations. It is estimated that stuttering affects about one percent of the general population in the world, and overall males are affected two to five times more often than females [3]. The effects of stuttering on self-concept and social interactions are often overlooked. The neurologically-based disorders are a broad area that includes any disruption in the production of speech and/or the use of language. Common types of these disorders encompass aphasia, apraxia, and dysarthria. Aphasia is characterized by difficulty in formulating, expressing, and/or understanding language. Apraxia makes words and sentences sound jumbled or meaningless. Dysarthria results from paralysis, lack of coordination or weakness of the muscles required for speech. Organic disorders are characterized by loss of voice quality because of inappropriate pitch or loudness. These problems may result from hearing impairment, damage to the vocal cords, surgery, disease, or cleft palate [4, 5].



In this paper we focus on dysarthria and a Sound Substitution Disorder (SSD) belonging to the articulation disorder category. We propose to extend our previous work [6] by integrating in a new pathologic speech synthesis system a grafting technique that aims at enhancing the intelligibility of dysarthric and SSD speech uttered by American and Acadian French speakers, respectively. The purpose of our study is to investigate to what extent automatic speech recognition and speech synthesis systems can be used to the benefit of American dysarthric speakers and Acadian French speakers with SSD. We intend to answer the following questions.

(i) How well can pathologic speech be recognized by an ASR system trained with a limited amount of pathologic speech (SSD and dysarthria)?

(ii) Will the recognition results change if we train the ASR by using a variable analysis frame length, particularly in the case of dysarthria, where the utterance duration plays an important role?

(iii) To what extent can a language model help in correcting SSD errors?

(iv) How well can dysarthric speech and SSD be corrected in order to be more intelligible by using appropriate Text-To-Speech (TTS) technology?

(v) Is it possible to objectively evaluate the resynthesized (corrected) signals using a perceptually-based criterion?

To answer these questions we conducted a set of experiments using two databases. The first one is the Nemours database, for which we used read speech of four American dysarthric speakers and one nondysarthric (reference) speaker [7]. All speakers read semantically unpredictable sentences. For recognition an HMM phone-based ASR was used. Results of the recognition experiments were presented as word recognition rates. Performance of the ASR was tested by using speaker dependent models. The second database used in our ASR experiments is an Acadian French corpus of pathologic speech that we have previously elaborated. The two databases are also used to design a new speech synthesis system that allows conveying an intelligible message to listeners. The Mel Frequency Cepstral Coefficients (MFCCs) are the acoustical parameters used by our systems. The MFCCs are discrete Fourier transform- (DFT-) based parameters originating from studies of the human auditory system and have proven very effective in speech recognition [8]. As reported in [9], the MFCCs have been successfully employed as input features to classify speech disorders by using HMMs. Godino-Llorente and Gomez-Vilda [10] use MFCCs and their derivatives as front-end for a neural network that aims at discriminating normal/abnormal speakers relatively to various voice disorders including glottic cancer. The reported results lead to the conclusion that short-term MFCC is a good parameterization approach for the detection of voice diseases [10].

2. Characteristics of Dysarthric and Stuttered Speech

2.1. Dysarthria. Dysarthria is a neurologically-based speech disorder affecting millions of people. A dysarthric speaker has much difficulty in communicating. This disorder induces poorly or not pronounced phonemes, variable speech amplitude, poor articulation, and so forth. According to Aronson [11], dysarthria covers various speech troubles resulting from neurological disorders. These troubles are linked to the disturbance of brain and nerve stimuli of the muscles involved in the production of speech. As a result, dysarthric speakers suffer from weakness, slowness, and impaired muscle tone during the production of speech. The organs of speech production may be affected to varying degrees. Thus, the reduction of intelligibility is a common disruption to the various forms of dysarthria.

Several authors have classified the types of dysarthria taking into consideration the symptoms of neurological disorders. This classification is based only upon an auditory perceptual evaluation of disturbed speech. All types of dysarthria affect the articulation of consonants, causing the slurring of speech. Vowels may also be distorted in very severe dysarthria. According to the widely used classification of Darley [12], seven kinds of dysarthria are considered.

Spastic Dysarthria. The vocal quality is harsh. The voice of a patient is described as strained or strangled. The fundamental frequency is low, with breaks occurring in some cases. Hypernasality may occur but is usually not important enough to cause nasal emission. Bursts of loudness are sometimes observed. Besides this, an increase in phoneme-to-phoneme transitions, in syllable and word duration, and in voicing of voiceless stops, is noted.

Hyperkinetic Dysarthria. The predominant symptoms are associated with involuntary movement. Vocal quality is the same as that of spastic dysarthria. Voice pauses associated with dystonia may occur. Hypernasality is common. This type of dysarthria could lead to a total lack of intelligibility.

Hypokinetic Dysarthria. This is associated with Parkinson's disease. Hoarseness is common in Parkinson's patients. Also, low volume frequently reduces intelligibility. Monopitch and monoloudness often appear. The compulsive repetition of syllables is sometimes present.

Ataxic Dysarthria. According to Duffy [4], this type of dysarthria can affect respiration, phonation, resonance, and articulation. Thus, the loudness may vary excessively, and increased effort is evident. Patients tend to place equal and excessive stress on all syllables spoken. This is why ataxic speech is sometimes described as explosive speech.

Flaccid Dysarthria. This type of dysarthria results from damage to the lower motor neurons involved in speech. Commonly, one vocal fold is paralyzed. Depending on the place of paralysis, the voice will sound harsh and have low volume, or it is breathy, and an inspirational stridency may be noted.

Mixed Dysarthria. Characteristics will vary depending on whether the upper or lower motor neurons remain mostly intact. If upper motor neurons are deteriorated, the voice will sound harsh. However, if lower motor neurons are the most affected, the voice will sound breathy.

Unclassified Dysarthria. Here, we find all types that are not covered by the six above categories.

Dysarthria is treated differently depending on its level of severity. Patients with a moderate form of dysarthria can be taught to use strategies that make their speech more intelligible. These persons will be able to continue to use speech as their main mode of communication. Patients whose dysarthria is more severe may have to learn to use alternative forms of communication.

There are different systems for evaluating dysarthria. Darley et al. [12] propose an assessment of dysarthria through an articulation test uttered by the patients. Listeners identify unintelligible and/or mispronounced phonemes. Kent et al. [13] present a method which starts by identifying the reasons for the lack of intelligibility and then adapts the rehabilitation strategies. This test comes in the form of a list of words that the patient pronounces aloud; the listener has four choices of words to say what he has heard. The lists of choices take into account the phonetic contrasts that can be disrupted. The design of the Nemours dysarthric speech database, used in this paper, is mainly based on the Kent method. An automatic recognition of Dutch dysarthric speech was carried out, and experiments with speaker independent and speaker dependent models were compared. The results confirmed that speaker dependent speech recognition for dysarthric speakers is more suitable [14]. Another study suggests that the variety of dysarthric users may require dramatically different speech recognition systems since the symptoms of dysarthria vary so much from subject to subject. In [15], three categories of audio-only and audiovisual speech recognition algorithms for dysarthric users are developed. These systems include phone-based and whole-word recognizers using HMMs, phonologic-feature-based and whole-word recognizers using support vector machines (SVMs), and hybrid SVM-HMM recognizers. Results did not show a clear superiority for any given system. However, the authors state that HMMs are effective in dealing with large-scale word-length variations by some patients, and the SVMs showed some degree of robustness against the reduction and deletion of consonants. Our proposed assistive system is a dysarthric speaker-dependent automatic speech recognition system using HMMs.

2.2. Sound Substitution Disorders. Sound substitution disorders (SSDs) affect the ability to communicate. SSDs belong to the area of articulation disorders, which involve difficulties with the way sounds are formed and strung together. SSDs are also known as phonemic disorders, in which some speech phonemes are substituted for other phonemes, for example, "fwee" instead of "free." SSDs refer to the structure of forming the individual sounds in speech. They do not relate to producing or understanding the meaning or content of speech. The speakers incorrectly make a group of sounds, usually substituting earlier developing sounds for later-developing sounds and consistently omitting sounds. The phonological deficit often substitutes t/k and d/g. They frequently leave out the letter "s", so "stand" becomes "tand" and "smoke," "moke." In some cases phonemes may be well articulated but inappropriate for the context, as in the cases presented in this paper. SSDs are varied. For instance, in some cases phonemes /k/ and /t/ cannot be distinguished, so "call" and "tall" are both pronounced as "tall." This is called phoneme collapse [16]. In other cases many sounds may all be represented by one. For example, /d/ might replace /t/, /k/, and /g/. Usually persons with SSDs are able to hear phoneme distinctions in the speech of others, but they are not able to speak them correctly. This is known as the "fis phenomenon." It can be detected at an early age if a speech pathologist says: "Did you say "fis," don't you mean "fish"?" and the patient answers: "No, I didn't say "fis," I said "fis"." Other cases can deal with various ways to pronounce consonants. Some examples are glides and liquids. Glides occur when the articulatory posture changes gradually from consonant to vowel. As a result, the number of error sounds is often greater in the case of SSDs than in other articulation disorders.

Many approaches have been used by speech-language pathologists to reduce the impact of phonemic disorders on the quality of communication [17]. In the minimal pair approach, commonly used to treat moderate phonemic disorders and poor speech intelligibility, words that differ by only one phoneme are chosen for articulation practice using the listening of correct pronunciations [18]. The second widely used method is called the phonological cycle [19]. It includes auditory overload of phonological targets at the beginning and end of sessions, to teach formation and a series of the sound targets. Recently, an increasing interest has been noticed for adaptive systems that aim at helping persons with articulation disorders by means of computer-aided systems. However, the problem is still far from being resolved. To illustrate these research efforts, we can cite the Ortho-Logo-Paedia (OLP) project, which proposes a method to supplement speech therapy for specific disorders at the articulation level based on an integrated computer-based system together with automatic ASR and distance learning. The key elements of the project include real-time audio-visual feedback of a patient's speech according to a therapy protocol, an automatic speech recognition system used to evaluate the speech production of the patient and web services to provide remote experiments and therapy sessions [20]. The Speech Training, Assessment, and Remediation (STAR) system was developed to assist speech and language pathologists in treating children with articulation problems. Performance of an HMM recognizer was compared to perceptual ratings of speech recorded from children who substitute /w/ for /r/. The findings show that the difference in log likelihood between /r/ and /w/ models correlates well with perceptual ratings (averaged by listeners) of utterances containing substitution errors. The system is embedded in a video game involving a spaceship, and the goal is to teach the "aliens" to understand selected words by spoken utterances [21]. Many other laboratory systems used speech recognition for speech training purposes in order to help persons with SSD [22–24].

The adaptive system we propose uses speaker-dependent automatic speech recognition systems and speech synthesis systems designed to improve the intelligibility of speech delivered by dysarthric speakers and those with articulation disorders.

3. Speech Material

3.1. Acadian French Corpus of Pathologic Speech. To assess the performance of the system that we propose to reduce SSD effects, we use an Acadian French corpus of pathologic speech that we have collected throughout the French regions of the New Brunswick Canadian province. Approximately 32.4% of New Brunswick's total population of nearly 730 000 is francophone, and for the most part, these individuals identify themselves as speakers of a dialect known as Acadian French [25]. The linguistic structure of Acadian French differs from other dialects of Canadian French. The participants in the pathologic corpus were 19 speakers (10 women and 9 men) from the three main francophone regions of New Brunswick. The age of the speakers ranges from 14 to 78 years. The text material consists of 212 read sentences. Two "calibration" or "dialect" sentences, which were meant to elicit specific dialect features, were read by all 19 speakers. The two calibration sentences are given in (1).

(1)a Je viens de lire dans "l'Acadie Nouvelle" qu'un pêcheur de Caraquet va monter une petite agence de voyage.

(1)b C'est le même gars qui, l'année passée, a vendu sa maison à cinq Français d'Europe.

The remaining 210 sentences were selected from published lists of French sentences, specifically the lists in Combescure and Lennig [26, 27]. These sentences are not representative of particular regional features but rather they correspond to the type of phonetically balanced materials used in coder rating tests or speech synthesis applications where it is important to avoid skew effects due to bad phonetic balance. Typically, these sentences have between 20 and 26 phonemes each. The relative frequencies of occurrence of phonemes across the sentences reflect the distribution of phonemes found in reference corpora of French spoken in theatre productions; for example, /a/, /r/, and schwa are among the most frequent sounds. The words in the corpus are fairly common and are not part of a specialized lexicon. Assignment of sentences to speakers was made randomly. Each speaker read 50 sentences including the two dialect sentences. Thus, the corpus contains 950 sentences. Eight speech disorders are covered by our Acadian French corpus: stuttering, aphasia, dysarthria, sound substitution disorder, Down syndrome, cleft palate and disorders due to hearing impairment. As specified, only sound substitution disorders are considered in the present study.

3.2. Nemours Database of American Dysarthric Speakers. The Nemours dysarthric speech database is recorded in Microsoft RIFF format and is composed of wave files sampled with 16-bit resolution at a 16 kHz sampling rate after low-pass filtering at a nominal 7500 Hz cutoff frequency with a 90 dB/octave filter. Nemours is a collection of 814 short nonsense sentences pronounced by eleven young adult males with dysarthria resulting from either cerebral palsy or head trauma. Speakers record 74 sentences, with the first 37 sentences randomly generated from the stimulus word list, and the second 37 sentences constructed by swapping the first and second nouns in each of the first 37 sentences. This protocol is used in order to counterbalance the effect of position within the sentence for the nouns.

The database was designed to test the intelligibility of English dysarthric speech according to the same method depicted by Kent et al. in [13]. To investigate this intelligibility, the list of selected words and associated foils was constructed in such a way that each word in the list (e.g., boat) was associated with a number of minimally different foils (e.g., moat, goat). The test words were embedded in short semantically anomalous sentences, with three test words per sentence (e.g., the boat is reaping the time). The structure of sentences is as follows: "THE noun1 IS verb-ing THE noun2."
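As a purely illustrative sketch of this sentence-construction protocol (the word lists below are placeholders, not the Nemours stimulus lists), the following Python snippet generates sentences of the form "THE noun1 IS verb-ing THE noun2" and their noun-swapped counterparts used to counterbalance position effects:

import random

# Hypothetical stimulus lists; the actual Nemours word lists differ.
NOUNS = ["boat", "bin", "tin", "coat"]
VERBS_ING = ["pairing", "reaping", "heaping"]

def make_sentence(noun1, verb_ing, noun2):
    return f"THE {noun1} IS {verb_ing} THE {noun2}"

def build_counterbalanced_set(n_sentences, seed=0):
    rng = random.Random(seed)
    first_half = []
    for _ in range(n_sentences):
        noun1, noun2 = rng.sample(NOUNS, 2)      # two distinct nouns
        verb = rng.choice(VERBS_ING)
        first_half.append((noun1, verb, noun2))
    # Second half: swap the first and second nouns of each sentence.
    second_half = [(n2, v, n1) for (n1, v, n2) in first_half]
    return [make_sentence(*s) for s in first_half + second_half]

if __name__ == "__main__":
    for s in build_counterbalanced_set(3):
        print(s)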

Note that, unlike Kent et al. [13] who used exclusively monosyllabic words, Menendez-Padial et al. [7] included in the Nemours test materials infinitive verbs in which the final consonant of the first syllable of the infinitive could be the phoneme of interest. That is, the /p/ of reaping could be tested with foils such as reading and reeking. Additionally, the database contains two connected-speech paragraphs produced by each of the eleven speakers.

4. Speech-Enabled Systems to Correct Dysarthria and SSD

4.1. Overall System. Figure 1 shows the system we propose to recognize and resynthesize both dysarthric speech and speech affected by SSD. This system is speaker-dependent due to the nature of the speech and the limited amount of data available for training and testing. At the recognition level (ASR), in the case of dysarthric speech the system uses a variable Hamming window size for each speaker; the size giving the best recognition rate is used in the final system. Our interest in frame length is justified by the fact that duration plays a crucial role in characterizing dysarthria and is specific to each speaker. For speakers with SSD, a regular frame length of 30 milliseconds, advanced by 10 milliseconds, is used. At the synthesis level (Text-To-Speech), the system introduces a new technique to define variable units, a new concatenating algorithm, and a new grafting technique to correct the speaker's voice and make it more intelligible for dysarthric speech and SSD. The role of the concatenating algorithm consists of joining basic units and producing the desired intelligible speech. The bad units pronounced by the dysarthric speakers are indirectly identified by the ASR system and then need to be corrected.


Figure 1: Overall system designed to help both dysarthric speakers and those with SSD. The source speech is passed to the ASR (phone and word recognition); the recognized text (utterance) drives the TTS (speech synthesizer), which uses the new concatenating algorithm and the grafting technique to combine good units from the speaker with bad units grafted from a normal speaker, producing the target speech.

Figure 2: The three different segmented units of the dysarthric speaker BB: (a) at the beginning, DH AH; (b) in the middle, AH B AE; (c) at the end, AE TH.

Therefore, to improve them we use a grafting technique that uses the same units from a reference (normal) speaker to correct poorly pronounced units.

4.2. Unit Selection for Speech Synthesis. The communication system is tailored to each speaker and to the particularities of his speech disorder. An efficient alternative communication system must take into account the specificities of each patient. From our point of view, it is not realistic to target a speaker-independent system that can efficiently tackle the different varieties of speech disorders. Therefore, there is no universal rule to select the synthesis units. The synthesis units are based on two phonemes or more. Each unit must start and/or finish with a vowel (/a/, /e/, . . . or /i/). They are taken from the speech at the vowel position. We build three different kinds of units according to their position in the utterance.

(i) At the beginning, the unit must finish with a vowel preceded by any phoneme.

(ii) In the middle, the unit must start and finish with a vowel. Any phoneme can be put between them.

(iii) At the end, the unit must start with a vowel followed by any phoneme.

Figure 2 shows examples of these three units. This technique of building units is justified by our objective, which consists of facilitating the grafting of poorly pronounced phonemes uttered by dysarthric speakers. This technique is also used to correct the poorly pronounced phonemes of speakers with SSD.
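As a minimal illustration of these rules (not the authors' implementation), the following Python sketch splits a phoneme-level segmentation into beginning, middle, and end units; the phone labels and the vowel inventory used here are placeholders, and a real system would work from a forced alignment rather than bare label lists:

VOWELS = {"AA", "AE", "AH", "EH", "IH", "IY", "UW"}  # placeholder vowel set

def is_vowel(ph):
    return ph in VOWELS

def build_units(phones):
    """Split a phone sequence into beginning / middle / end units.

    Beginning unit: ends with the first vowel (any phonemes before it).
    Middle units:   start and end with a vowel, any phonemes in between.
    End unit:       starts with the last vowel (any phonemes after it).
    """
    vowel_idx = [i for i, ph in enumerate(phones) if is_vowel(ph)]
    if not vowel_idx:
        return {"beginning": phones, "middle": [], "end": []}
    return {
        "beginning": phones[: vowel_idx[0] + 1],
        "middle": [phones[a : b + 1] for a, b in zip(vowel_idx, vowel_idx[1:])],
        "end": phones[vowel_idx[-1]:],
    }

# Example matching Figure 2 (speaker BB):
print(build_units(["DH", "AH", "B", "AE", "TH"]))
# beginning: DH AH, middle: [AH B AE], end: AE TH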

4.3. New Concatenating Algorithm. The units replacing the units poorly pronounced because of SSD or dysarthria are concatenated at the edges, which start or end with vowels (quasiperiodic signals). Our algorithm always concatenates two periods of the same vowel with different shapes in the time domain: it concatenates /a/ and /a/, /e/ and /e/, and so forth. To the ear, two similar vowels following each other sound the same as one vowel, even if their shapes are different [28] (e.g., /a/ followed by /a/ sounds as /a/). The concatenating algorithm is then as follows.

(i) Take one period from the left unit (LP).

(ii) Take one period from the right unit (RP).

(iii) Use a warping function [29] to convert LP to RP in the frequency domain; a simple example is Y = aX + b. This conversion takes into account the energy and fundamental frequency of both periods and adds the periods necessary between the two units to maintain a homogeneous energy. Figure 3 shows such a general warping function in the frequency domain.

(iv) Each converted period is followed by an interpolation in the time domain.

(v) The number of added periods is called the step conversion number. This number controls how many conversions and interpolations are performed between two units.

Figure 4 illustrates our concatenation technique in an example using two units: /ah//b//ae/ and /ae//t//ih/.
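A simplified time-domain sketch of this concatenation idea is given below. It assumes that one pitch period has already been extracted from each unit (as NumPy arrays) and reduces the frequency-domain warping of step (iii) to a linear resampling of the period length, with the period length and RMS energy interpolated across the inserted periods; it is therefore only an approximation of the procedure described above, not the exact implementation.

import numpy as np

def resample_period(period, target_len):
    """Linearly resample one pitch period to a new length (crude warping stand-in)."""
    x_old = np.linspace(0.0, 1.0, num=len(period))
    x_new = np.linspace(0.0, 1.0, num=target_len)
    return np.interp(x_new, x_old, period)

def concatenate_units(left_unit, right_unit, left_period, right_period, n_steps=8):
    """Join two units by inserting n_steps transition periods between them.

    left_period / right_period: the last period of the left unit (LP) and the
    first period of the right unit (RP). Each inserted period interpolates the
    length and RMS energy between LP and RP to keep the energy homogeneous.
    """
    lp_len, rp_len = len(left_period), len(right_period)
    lp_rms = np.sqrt(np.mean(left_period ** 2)) + 1e-12
    rp_rms = np.sqrt(np.mean(right_period ** 2)) + 1e-12

    inserted = []
    for k in range(1, n_steps + 1):
        t = k / (n_steps + 1)                               # interpolation factor
        length = int(round((1 - t) * lp_len + t * rp_len))
        # Morph the waveform shape by cross-fading the two resampled periods.
        shape = (1 - t) * resample_period(left_period, length) \
                + t * resample_period(right_period, length)
        # Interpolate the energy so the joint does not produce an audible step.
        target_rms = (1 - t) * lp_rms + t * rp_rms
        shape *= target_rms / (np.sqrt(np.mean(shape ** 2)) + 1e-12)
        inserted.append(shape)

    return np.concatenate([left_unit] + inserted + [right_unit])

For the example of Figure 4, left_period would contain 102 samples, right_period 91 samples, and n_steps = 8, so the inserted period lengths step down gradually from 102 toward 91 samples.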

4.4. Grafting Technique to Correct SSD and Dysarthric Speech. In order to make dysarthric speech and speech affected by SSD more intelligible, a correction of all units containing those phonemes is necessary. Thus, a grafting technique is used for this purpose.


Figure 3: The warping function used in the frequency domain, mapping the source spectrum (one source period, e.g., AH) onto the target spectrum (one target period, e.g., AH) over the frequency range 0 to π.

Figure 4: The proposed concatenating algorithm used to link two units, /AH B AE/ (left unit) and /AE SH IH/ (right unit). One left period (102 points) and one right period (91 points) are shown before concatenation, after concatenation, and after the interpolations and conversions, where 8 periods are added whose lengths decrease gradually from 102 to 91 points.

The grafting technique we propose removes all poorly pronounced or unpronounced phonemes (silences) following or preceding the vowel of the bad unit and replaces them with those from the reference speaker. This method has the advantage of providing a synthetic voice that is very close to that of the speaker. Corrected units are stored in order to be used by the alternative communication system (ASR + TTS). A smoothing at the edges is necessary in order to normalize the energy [29]. Besides this, in order to keep the grafted phonemes from dominating and to hear the speaker with SSD or dysarthria instead of the normal speaker, we must lower the amplitude of those phonemes. By iterating this mechanism, the energy of the unit vowels rises while that of the grafted phonemes falls; therefore, the vowel energy on both sides dominates and the original voice dominates as well. The grafting technique is performed according to the following steps.

Figure 5: Grafting technique example correcting the unit /IH/Z/W/IH/: (a) the bad unit and its spectrogram before grafting; (b) the grafting steps, in which the left phonemes (IH_Z) and right phonemes (W_IH) are kept from the patient, the grafted phonemes (Z_W) come from the normal speaker with their amplitude lowered by 34%, and periods are added by the concatenating algorithm; (c) the corrected unit and its spectrogram after grafting.

1st step. Extract the left phonemes of the bad unit (vowel + phoneme) from the speaker with SSD or dysarthria.

2nd step. Extract the grafted phonemes of the good unit from the normal speaker.

3rd step. Cut the right phonemes of the bad unit (vowel + phoneme) from the speaker with SSD or dysarthria.

4th step. Concatenate and smooth the parts obtained in the first three steps.

5th step. Lower the amplitude of the signal obtained in step 2, and repeat step 4 until the result sounds good.

Figure 5 illustrates the proposed grafting on an example using the unit /IH/Z/W/IH/, where the /W/ is not pronounced correctly.
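The five steps can be sketched in Python as follows. The waveform segments, the cross-fade smoothing, the attenuation factor, and the fixed iteration count are illustrative assumptions, not the exact implementation; in practice the repetition of step 4 is driven by listening.

import numpy as np

def crossfade_concat(a, b, fade=64):
    """Concatenate two waveforms with a short linear cross-fade to avoid clicks."""
    fade = min(fade, len(a), len(b))
    ramp = np.linspace(0.0, 1.0, num=fade)
    overlap = a[-fade:] * (1.0 - ramp) + b[:fade] * ramp
    return np.concatenate([a[:-fade], overlap, b[fade:]])

def graft_unit(left_patient, graft_normal, right_patient,
               attenuation=0.66, n_iterations=1):
    """Steps 1-5 of the grafting technique on waveform segments.

    left_patient, right_patient: vowel + phoneme edges kept from the speaker
    with SSD or dysarthria (steps 1 and 3).
    graft_normal: the replacement phonemes taken from the normal speaker
    (step 2), attenuated so the patient's vowels dominate (step 5).
    """
    grafted = graft_normal.copy()
    unit = None
    for _ in range(n_iterations):            # repeat until the result sounds good
        grafted = attenuation * grafted       # e.g. 0.66 lowers the amplitude by 34%
        unit = crossfade_concat(left_patient, grafted)      # step 4: join and smooth
        unit = crossfade_concat(unit, right_patient)
    return unit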

4.5. Impact of the Language Model on ASR of Utterances with SSD. The performance of any recognition system depends on many factors, but the size and the perplexity of the vocabulary are among the most critical ones. In our systems, the size of the vocabulary is relatively small since it is very difficult to collect huge amounts of pathologic speech.


A language model (LM) is essential for effective speech recognition. In a previous work [30], we tested the effect of the LM on the automatic recognition of accented speech. The results we obtained showed that the introduction of an LM masks numerous pronunciation errors due to foreign accents. This leads us to investigate the impact of the LM on errors caused by SSD.

Typically, the LM will restrict the allowed sequences of words in an utterance. It can be expressed by the formula giving the a priori probability P(W):

P(W) = p(w_1, \ldots, w_m) = p(w_1) \prod_{i=2}^{m} p\bigl(w_i \mid \underbrace{w_{i-n+1}, \ldots, w_{i-1}}_{n-1}\bigr),    (1)

where W = w_1, \ldots, w_m is the sequence of words. In the n-gram approach described by (1), n is typically restricted to n = 2 (bigram) or n = 3 (trigram).

The language model used in our experiments is a bigram, which mainly depends on the statistical counts generated from the phonetic transcription. All input transcriptions (labels) are mapped to a set of unique integers in the range 1 to L, where L is the number of distinct labels. For each adjacent pair of labels i and j, the total number of occurrences O(i, j) is counted. For a given label i, the total number of occurrences is given by

O(i) = \sum_{j=1}^{L} O(i, j).    (2)

For both word and phonetic matrix bigrams, the bigram probability p(i, j) is given by

p(i, j) =
\begin{cases}
\alpha\, O(i, j)/O(i), & \text{if } O(i, j) > 0,\\
1/L, & \text{if } O(i) = 0,\\
\beta, & \text{otherwise},
\end{cases}    (3)

where β is a floor probability, and α is chosen to ensure that

\sum_{j=1}^{L} p(i, j) = 1.    (4)

For back-off bigrams, the unigram probabilities p(i) are given by

p(i) =
\begin{cases}
O(i)/O, & \text{if } O(i) > \gamma,\\
\gamma/O, & \text{otherwise},
\end{cases}    (5)

where γ is the unigram floor count, and O is determined as follows:

O = \sum_{i=1}^{L} \max\bigl[O(i), \gamma\bigr].    (6)

The backed-off bigram probabilities are given by

p(i, j) =
\begin{cases}
\bigl(O(i, j) - D\bigr)/O(i), & \text{if } O(i, j) > \theta,\\
b(i)\, p(j), & \text{otherwise},
\end{cases}    (7)

where D is a discount and θ is a bigram count threshold. The discount D is fixed at 0.5. The back-off weight b(i) is calculated to ensure that

\sum_{j=1}^{L} p(i, j) = 1.    (8)

These statistics are generated by using the HLStats function, which is a tool of the HTK toolkit [31]. This function computes the occurrences of all labels in the system and then generates the back-off bigram probabilities based on the phoneme-based dictionary of the corpus; it counts the occurrences of every consecutive pair of labels in all labelled words of our dictionary. A second HTK function, HBuild, takes the back-off probabilities file as input and generates the bigram language model. We expect the language model, through its unigram and bigram components, to correct nonword utterances. For instance, if at the phonetic level the HMMs identify the word "fwee" (instead of "free"), the unigram will exclude this word because it does not exist in the lexicon. When SSD produces real words, as with the French words "crée" (create) and "clé" (key), errors may occur, but the bigram is expected to reduce them. Another aspect that must be taken into account is the fact that the system is trained only by the speaker with SSD. This leads to the adaptation of the system to the "particularities" of the speaker.
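The counting and back-off computations of (2) and (5)–(8) can be sketched directly in Python; this is a minimal illustration of the formulas (the matrix-bigram floor of (3)–(4) is omitted) and not a re-implementation of HLStats or HBuild:

from collections import defaultdict

def backoff_bigram(label_sequences, L, gamma=1.0, D=0.5, theta=0):
    """Back-off bigram estimation over integer labels 1..L, following (2), (5)-(8)."""
    O2 = defaultdict(int)   # O(i, j): counts of adjacent label pairs
    O1 = defaultdict(int)   # O(i): row totals, as in (2)
    for seq in label_sequences:
        for i, j in zip(seq, seq[1:]):
            O2[(i, j)] += 1
            O1[i] += 1

    # Unigram probabilities with floor count gamma, (5)-(6).
    O_total = sum(max(O1[i], gamma) for i in range(1, L + 1))
    p_uni = {i: (O1[i] / O_total if O1[i] > gamma else gamma / O_total)
             for i in range(1, L + 1)}

    # Backed-off bigrams (7), with b(i) chosen so that each row sums to one, (8).
    p_bi = {}
    for i in range(1, L + 1):
        seen = {j for j in range(1, L + 1) if O2[(i, j)] > theta}
        kept_mass = sum((O2[(i, j)] - D) / O1[i] for j in seen)
        backoff_mass = sum(p_uni[j] for j in range(1, L + 1) if j not in seen)
        b_i = (1.0 - kept_mass) / backoff_mass if backoff_mass > 0 else 0.0
        for j in range(1, L + 1):
            p_bi[(i, j)] = ((O2[(i, j)] - D) / O1[i] if j in seen
                            else b_i * p_uni[j])
    return p_uni, p_bi

For a word-level model, label_sequences would be the integer-mapped word labels of the training transcriptions; HLStats and HBuild perform the same bookkeeping and additionally produce the HTK word-network format used for decoding.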

5. Experiments and Results

5.1. Speech Recognition Platform. In order to evaluate the proposed approach, the HTK-based speech recognition system described in [31] has been used throughout all experiments. HTK is an HMM-based speech recognition system. The toolkit was designed to support continuous-density HMMs with any number of states and mixture components. It also implements a general parameter-tying mechanism which allows the creation of complex model topologies to suit a variety of speech recognition applications. Each phoneme is represented by a 5-state HMM model with two nonemitting states (the 1st and 5th states). Mel-frequency cepstral coefficients (MFCCs) and cepstral pseudoenergy are calculated for all utterances and used as parameters to train and test the system [8]. In our experiments, 12 MFCCs were calculated on a Hamming window advanced by 10 milliseconds each frame. An FFT is performed to calculate a magnitude spectrum for the frame, which is averaged into 20 triangular bins arranged at equal Mel-frequency intervals. Finally, a cosine transform is applied to these data to calculate the 12 MFCCs. Moreover, the normalized log energy is also found, which is added to the 12 MFCCs to form a 13-dimensional (static) vector. This static vector is then expanded to produce a 39-dimensional vector by adding the first and second derivatives of the static parameters.
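A comparable 39-dimensional front end can be sketched with standard Python tools; the snippet below uses librosa as a stand-in for the HTK parameterisation described above (pre-emphasis, liftering, and energy normalisation details differ from HTK), so it is an approximation of the feature pipeline rather than the exact one:

import numpy as np
import librosa

def mfcc_39(wav_path, win_ms=30.0, hop_ms=10.0, sr=16000):
    """12 MFCCs plus log energy, with first and second derivatives (39 dims)."""
    y, sr = librosa.load(wav_path, sr=sr)
    win = int(sr * win_ms / 1000.0)          # e.g. 480 samples for 30 ms at 16 kHz
    hop = int(sr * hop_ms / 1000.0)          # 10 ms frame advance
    # 20 Mel bands, Hamming window; keep c1..c12 (c0 is replaced by log energy).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=win, win_length=win,
                                hop_length=hop, window="hamming", n_mels=20)[1:13]
    rms = librosa.feature.rms(y=y, frame_length=win, hop_length=hop)
    log_e = np.log(rms ** 2 + 1e-10)         # stand-in for HTK's normalized log energy
    T = min(mfcc.shape[1], log_e.shape[1])
    static = np.vstack([mfcc[:, :T], log_e[:, :T]])          # 13 x T
    delta = librosa.feature.delta(static, order=1)
    delta2 = librosa.feature.delta(static, order=2)
    return np.vstack([static, delta, delta2]).T              # T x 39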


Figure 6: Block diagram of the PESQ measure computation [32]: the reference signal and the degraded signal produced by the system under test are each pre-processed and passed through an auditory transform; after time alignment, disturbance processing (with identification of bad intervals) and time averaging yield the PESQ score.

5.2. Perceptual Evaluation of the Speech Quality (PESQ) Measure. One reliable method to measure speech quality is the Perceptual Evaluation of Speech Quality (PESQ), standardized in ITU-T Recommendation P.862 [33]. PESQ measurement provides an objective and automated method for speech quality assessment. As illustrated in Figure 6, the measure is performed by an algorithm that compares a reference speech sample to the speech sample processed by a system. Theoretically, the results can be mapped to relevant mean opinion scores (MOSs) based on the degradation of the sample [34]. The PESQ algorithm is designed to predict subjective opinion scores of a degraded speech sample. PESQ returns a score from 0.5 to 4.5, with higher scores indicating better quality. For our experiments we used the code provided by Loizou in [32]. This technique is generally used to evaluate speech enhancement systems: usually, the reference signal refers to an original (clean) signal, and the degraded signal refers to the same utterance pronounced by the same speaker as in the original signal but submitted to diverse adverse conditions. The idea of using the PESQ algorithm arises because a reference voice is available for both databases. In fact, the Nemours waveform directories contain parallel productions from a normal adult male talker who pronounced exactly the same sentences as those uttered by the dysarthric speakers. Reference speakers and sentences are also available for the Acadian French corpus of pathologic speech. These references and sentences are extracted from the RACAD corpus we have built to develop automatic speech recognition systems for the regional varieties of French spoken in the province of New Brunswick, Canada [35]. The sentences of RACAD are the same as those used for recording pathologic speech. These sentences are phonetically balanced, which justifies their use in the Acadian French corpora we have built for both normal speakers and speakers with speech disorders. The PESQ method is used to perceptually compare the original pathologic speech with the speech corrected by our systems. The reference speech is taken from the normal speaker utterances. In the PESQ algorithm, the reference and degraded signals are level-equalized to a standard listening level by the preprocessing stage. The gain of the two signals is not known a priori and may vary considerably.

In our case, the reference signal differs from the degraded signal since it is not the same speaker who utters the sentence, and the acoustic conditions also differ. In the original PESQ algorithm, the gains of the reference, degraded, and corrected signals are computed based on the root mean square values of band-pass-filtered (350–3250 Hz) speech. In our scaled version of normalized signals, the full frequency band is kept. The filter with a response similar to that of a telephone handset, present in the original PESQ algorithm, is also removed. The PESQ method is used throughout all our experiments to evaluate the synthetic speech generated to replace both English dysarthric speech and Acadian French speech affected by SSD. The PESQ has the advantage of being independent of the listeners and their number.
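When a reference and a degraded (or corrected) recording are available as files, a PESQ-style comparison can be scripted with an off-the-shelf implementation. The sketch below uses the open-source pesq package rather than the modified Loizou code described above, so the scores it produces will differ from ours; the file names follow the Nemours convention mentioned later but are placeholders here.

import soundfile as sf
from pesq import pesq      # pip install pesq soundfile

def pesq_pair(reference_wav, test_wav, fs=16000):
    """Return the PESQ score of test_wav measured against reference_wav (wideband mode)."""
    ref, fs_ref = sf.read(reference_wav)
    deg, fs_deg = sf.read(test_wav)
    assert fs_ref == fs_deg == fs, "both files must share the expected sampling rate"
    return pesq(fs, ref, deg, "wb")

# Placeholder file names: JP-prefixed reference, original and corrected test utterances.
original = pesq_pair("JPBB1.wav", "BB1.wav")
corrected = pesq_pair("JPBB1.wav", "BB1_corrected.wav")
print(f"PESQ original: {original:.2f}  corrected: {corrected:.2f}")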

5.3. Experiments on Dysarthric Speech. Four dysarthric speakers of the Nemours database are used for the evaluation of the ASR. The ASR uses feature vectors computed with varying Hamming window sizes. The training is performed on a limited amount of speaker-specific material. A previous study showed that ASR of dysarthric speech is more suitable for low-perplexity tasks [14]. A speaker-dependent ASR is generally more efficient and can reasonably be used in a practical and useful application. For each speaker, the training set is composed of 50 sentences (300 words), and the test set is composed of 24 sentences (144 words). The recognition task is carried out within the sentence structure of the Nemours corpus. The models for each speaker are triphone left-right HMMs with Gaussian mixture output densities decoded with the Viterbi algorithm on a lexical-tree structure. Due to the limited amount of training data, for each speaker we initialize the HMM acoustic parameters of the speaker-dependent model with the reference utterances as baseline training.

Figure 7 shows the sentence "The bin is pairing the tin" pronounced by the dysarthric speaker referred to by his initials, BK, and by the nondysarthric (normal) speaker. Note that the signal of the dysarthric speaker is relatively long; this is due to his slow articulation. As for standard speech, to estimate the dysarthric speech parameters, the analysis should be done frame by frame and with overlapping. Therefore, we carried out many experiments in order to find the optimal frame size of the acoustical analysis window. The tested window lengths are 15, 20, 25, and 30 milliseconds.


The determination of the frame size is not controlled only by the stationarity and ergodicity conditions but also by the information contained in each frame. The choice of analysis frame length is a trade-off between having frames long enough to get reliable estimates of the acoustical parameters, but not so long that rapid events are averaged out [8]. In our application we propose to adjust the frame length in order to control the smoothness of the parameter trajectories over time. Table 1 shows the recognition accuracy for different Hamming window lengths and the best result obtained for the BB, BK, FB, and MH speakers. These results show that the recognition accuracy can increase by 6% when the window length is doubled (15 milliseconds to 30 milliseconds). This leads us to conclude that, in the context of dysarthric speech recognition, the frame length plays a crucial role. The average recognition rate for the four dysarthric speakers is about 70%, which is a very satisfactory result. In order to give an idea of the suitability of ASR for dysarthric speaker assistance, 10 human listeners who had never heard the recordings before were asked to recognize the same dysarthric utterances as those presented to the ASR system. A correct recognition rate of less than 20% was obtained. Note that, in the perspective of a complete communication system, the ASR is coupled with speech synthesis that uses a voice that is very close to the one of the patient thanks to the grafting technique.
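The window-length search itself is a simple sweep. In the schematic harness below, extract_features is any per-window feature function (for instance the mfcc_39 sketch given earlier) and train_and_score_asr is a hypothetical placeholder for the speaker-dependent HTK training and decoding pipeline, not an actual HTK call:

WINDOW_SIZES_MS = [15, 20, 25, 30]

def pick_best_window(train_wavs, test_wavs, extract_features, train_and_score_asr):
    """Sweep candidate Hamming window sizes and keep the most accurate one.

    extract_features(path, win_ms) -> feature matrix for one utterance.
    train_and_score_asr(train, test) -> recognition accuracy in percent.
    """
    results = {}
    for win_ms in WINDOW_SIZES_MS:
        # Features are recomputed for every candidate window; 10 ms advance throughout.
        train = [extract_features(w, win_ms) for w in train_wavs]
        test = [extract_features(w, win_ms) for w in test_wavs]
        results[win_ms] = train_and_score_asr(train, test)
    best = max(results, key=results.get)
    return best, results[best]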

The PESQ-based objective test is used to evaluate the Text-To-Speech system aimed at correcting the dysarthric speech. Thirteen sentences generated by the TTS for each dysarthric speaker are evaluated. These sentences have the same structure as those of the Nemours database (THE noun1 IS verb-ing THE noun2). We used the combination of 74 words and 34 verbs in "ing" form to generate utterances as pronounced by each dysarthric speaker in the Nemours database. We also generated random utterances that had never been pronounced. The advantage of using PESQ for evaluation is that it generates an output Mean Opinion Score (MOS) that is a prediction of the perceived quality that would be assigned to the test signal by auditors in a subjective listening test [33, 34]. PESQ determines the audible difference between the reference and dysarthric signals. The PESQ value of the original dysarthric signal is computed and compared to the PESQ of the signal corrected by the grafting technique. The cognitive model used by PESQ computes an objective listening quality MOS ranging between 0.5 and 4.5. In our experiments, the reference signal is the normal utterance, whose filename has the code JP prefixed to that of the dysarthric speaker (e.g., JPBB1.wav); the original test utterance is the dysarthric utterance without correction (e.g., BB1.wav), while the corrected utterance is generated after application of the grafting technique. Note that the designed TTS system can generate sentences that were never pronounced before by the dysarthric speaker, thanks to the recorded dictionary of corrected units and the concatenating algorithm. For instance, this TTS system can easily be incorporated in a voicemail system to allow the dysarthric speaker to record messages with his own voice.

The BB and BK dysarthric speakers, who are the most severe cases, were selected for the test.

Figure 7: Example of utterance extracted from the Nemours database: (a) "The bin is pairing the tin" uttered by the dysarthric speaker BK (about 12 s); (b) the same sentence uttered by the normal speaker (about 3.5 s).

The speech from the BK speaker, who had head trauma and is quadriplegic, was extremely unintelligible. Results of the PESQ evaluation confirm the severity of BK's dysarthria when compared with the BB case. Figure 8 shows the variation of PESQ for 13 sentences of the two speakers. The BB speaker achieves 2.68 and 3.18 PESQ on average for the original (without correction) and corrected signals, respectively. The BK speaker, affected by the most severe dysarthria, achieves 1.66 and 2.2 PESQ on average for the 13 original and corrected utterances, respectively. This represents an improvement of, respectively, about 20% and 30% in the PESQ of the BB and BK speakers. These results confirm the efficacy of the proposed method in improving the intelligibility of dysarthric speech.

5.4. Experiments on Acadian French Pathologic Utterances. We carried out two experiments to test our assistive speech-enabled systems. The first experiment assessed the general ASR performance. The second investigated the impact of a language model on the reduction of errors due to SSD. The ASR was evaluated using data from three speakers, two females and one male, who substitute /k/ by /a/, /s/ by /th/, and /r/ by /a/, referred to as F1, F2, and M1, respectively. The experiments involve a total of 150 sentences (1368 words), among which 60 (547 words) were used for testing. Table 2 presents the overall system accuracies of the two experiments at both the word level (using the LM) and the phoneme level (without any LM, i.e., considering any two sequences of phonemes equally probable). Experiments are carried out using triphone left-right HMMs with Gaussian mixture output densities decoded with the Viterbi algorithm on a lexical-tree structure. The HMMs are initialized with the reference speakers' models. For the considered word units, the overall performance of the system increases by around 38%, as shown in Table 2. Obviously, better accuracy is obtained when the LM is introduced. When the recognition performance is analyzed at the phonetic level, we were not able to distinguish the errors corrected by the language model from those that are adapted in the training process. In fact, the use of the speaker-dependent system with the LM masks numerous pronunciation errors due to SSD.


Table 1: The ASR accuracy using 13 MFCCs and their first and second derivatives and variable Hamming window size.

Dysarthric speaker | Recognition accuracy (%) for different Hamming window sizes
                   | 15 ms   | 20 ms   | 25 ms   | 30 ms
BB                 | 62.50   | 63.89   | 65.28   | 68.66
BK                 | 52.08   | 55.56   | 56.86   | 54.17
FB                 | 74.31   | 76.39   | 76.39   | 80.65
MH                 | 74.31   | 71.53   | 70.14   | 72.92

Figure 8: PESQ scores of original (degraded) and corrected utterances pronounced by the BK and BB dysarthric speakers, plotted against utterance number (1–13): (a) BK speaker; (b) BB speaker.

Figure 9: PESQ scores of original (degraded) and corrected utterances pronounced by the F1 and M1 Acadian French speakers affected by SSD, plotted against utterance number (1–13): (a) F1 speaker; (b) M1 speaker.

Table 2: Speaker-dependent ASR system performance with and without language model, using the Acadian French pathologic corpus.

           | Without bigram-based language model          | With bigram-based language model
           | F1 (423/161) | F2 (517/192) | M1 (428/194)   | F1 (423/161) | F2 (517/192) | M1 (428/194)
Corr (%)   | 43.09        | 40.87        | 46.45          | 81.58        | 78.44        | 83.48
Del (%)    | 4.38         | 4.96         | 4.02           | 3.13         | 3.26         | 2.88
Sub (%)    | 52.22        | 54.58        | 48.47          | 15.04        | 16.57        | 14.55


The PESQ algorithm is used to objectively evaluate the quality of utterances after correcting the phonemes. The results for F1, who substitutes /k/ by /a/, and M1, who substitutes /r/ by /a/, for thirteen sentences, are given in Figure 9. Even though the correction of this substitution disorder is effective and very noticeable to listeners, the PESQ criterion does not clearly show this drastic improvement in pronunciation. For speaker F1, PESQ averages of 3.76 and 3.98 were achieved for the thirteen original (degraded) and corrected utterances, respectively. The male speaker M1 achieves PESQ averages of 3.47 and 3.64 for the original and corrected utterances, respectively. An improvement of 5% in the PESQ is achieved for each of the two speakers.

6. Conclusion

Millions of people in the world have some type of communication disorder associated with speech, voice, and/or language trouble. The personal and societal costs of these disorders are high. On a personal level, such disorders affect every aspect of daily life. This motivated us to propose a system which combines robust speech recognition and a new speech synthesis technique to assist speakers with severe speech disorders in their verbal communications. In this paper, we report results of experiments on speech disorders. We must underline the fact that very few studies have been carried out in the field of speech-based assistive technologies. We have also noticed the quasi-absence of corpora of pathologic speech. Because speech pathologies are specific to each speaker, the designed system is speaker-dependent. The results showed that the frame length plays a crucial role in dysarthric speech recognition; the best recognition rate is generally obtained when the Hamming window size is greater than 25 milliseconds. The synthesis system, built for two selected speakers characterized by severe dysarthria, improved the PESQ by more than 20%. This demonstrates that the grafting technique we proposed considerably improved the intelligibility of these speakers. We have collected data of Acadian French pathologic speech. These data allowed us to assess an automatic speech recognition system in the case of SSD. The combination of the language model and the proposed grafting technique has proven effective in removing the SSD errors. We train the systems using MFCCs, but we are currently investigating the impact of using other parameters based on ear modeling, particularly in the case of SSD.

Acknowledgments

This research was supported by grants from the Natural Sciences and Engineering Research Council of Canada (NSERC), the Canadian Foundation for Innovation (CFI), and the New Brunswick Innovation Foundation (NBIF) to Sid-Ahmed Selouani (Université de Moncton). The authors would like to thank Melissa Chiasson and the French Health Authorities of New Brunswick (Beauséjour, Acadie-Bathurst, and Restigouche) for their valuable contributions to the development of the Acadian pathologic speech corpus.

References

[1] S. J. Stoeckli, M. Guidicelli, A. Schneider, A. Huber, and S. Schmid, "Quality of life after treatment for early laryngeal carcinoma," European Archives of Oto-Rhino-Laryngology, vol. 258, no. 2, pp. 96–99, 2001.

[2] Canadian Association of Speech-Language Pathologists and Audiologists, "General Speech & Hearing Fact Sheet," Report, http://www.caslpa.ca/PDF/fact%20sheets/speechhearingfactsheet.pdf.

[3] E. Yairi, N. Ambrose, and N. Cox, "Genetics of stuttering: a critical review," Journal of Speech, Language, and Hearing Research, vol. 39, no. 4, pp. 771–784, 1996.

[4] J. R. Duffy, Motor Speech Disorders: Substrates, Differential Diagnosis, and Management, Mosby, St. Louis, Mo, USA, 1995.

[5] R. D. Kent, The MIT Encyclopedia of Communication Disorders, MIT Press, Cambridge, Mass, USA, 2003.

[6] M. S. Yakcoub, S.-A. Selouani, and D. O'Shaughnessy, "Speech assistive technology to improve the interaction of dysarthric speakers with machines," in Proceedings of the 3rd IEEE International Symposium on Communications, Control, and Signal Processing (ISCCSP '08), pp. 1150–1154, Malta, March 2008.

[7] X. Menendez-Padial, et al., "The Nemours database of dysarthric speech," in Proceedings of the 4th International Conference on Spoken Language Processing (ICSLP '96), Philadelphia, Pa, USA, October 1996.

[8] D. O'Shaughnessy, Speech Communications: Human and Machine, IEEE Press, New York, NY, USA, 2nd edition, 2000.

[9] M. Wisniewski, W. Kuniszyk-Jozkowiak, E. Smołka, and W. Suszynski, "Automatic detection of disorders in a continuous speech with the hidden Markov models approach," in Computer Recognition Systems 2, vol. 45 of Advances in Soft Computing, pp. 445–453, Springer, Berlin, Germany, 2007.

[10] J. I. Godino-Llorente and P. Gomez-Vilda, "Automatic detection of voice impairments by means of short-term cepstral parameters and neural network based detectors," IEEE Transactions on Biomedical Engineering, vol. 51, no. 2, pp. 380–384, 2004.

[11] A. Aronson, Dysarthria: Differential Diagnosis, vol. 1, Mentor Seminars, Rochester, Minn, USA, 1993.

[12] F. L. Darley, A. Aronson, and J. R. Brown, Motor Speech Disorders, Saunders, Philadelphia, Pa, USA, 1975.

[13] R. D. Kent, G. Weismer, J. F. Kent, and J. C. Rosenbek, "Toward phonetic intelligibility testing in dysarthria," Journal of Speech and Hearing Disorders, vol. 54, no. 4, pp. 482–499, 1989.

[14] E. Sanders, M. Ruiter, L. Beijer, and H. Strik, "Automatic recognition of Dutch dysarthric speech: a pilot study," in Proceedings of the 7th International Conference on Speech and Language Processing (ICSLP '02), pp. 661–664, Denver, Colo, USA, September 2002.

[15] M. Hasegawa-Johnson, J. Gunderson, A. Perlman, and T. Huang, "HMM-based and SVM-based recognition of the speech of talkers with spastic dysarthria," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '06), vol. 3, pp. 1060–1063, Toulouse, France, May 2006.

[16] W. A. Lynn, "Clinical perspectives on speech sound disorders," Topics in Language Disorders, vol. 25, no. 3, pp. 231–242, 2005.

[17] B. Hodson and D. Paden, Targeting Intelligible Speech: A Phonological Approach to Remediation, PRO-ED, Austin, Tex, USA, 2nd edition, 1991.


[18] J. A. Gierut, "Differential learning of phonological oppositions," Journal of Speech and Hearing Research, vol. 33, no. 3, pp. 540–549, 1990.

[19] J. A. Gierut, "Complexity in phonological treatment: clinical factors," Language, Speech, and Hearing Services in Schools, vol. 32, no. 4, pp. 229–241, 2001.

[20] A.-M. Oster, D. House, A. Hatzis, and P. Green, "Testing a new method for training fricatives using visual maps in the Ortho-Logo-Paedia project (OLP)," in Proceedings of the Annual Swedish Phonetics Meeting, vol. 9, pp. 89–92, Lovanger, Sweden, 2003.

[21] H. Timothy Bunnell, D. M. Yarrington, and J. B. Polikoff, "STAR: articulation training for young children," in Proceedings of the International Conference on Spoken Language Processing (ICSLP '00), vol. 4, pp. 85–88, Beijing, China, October 2000.

[22] A. Hatzis, P. Green, J. Carmichael, et al., "An integrated toolkit deploying speech technology for computer based speech training with application to dysarthric speakers," in Proceedings of the 8th European Conference on Speech Communication and Technology (Eurospeech '03), Geneva, Switzerland, September 2003.

[23] S. Rvachew and M. Nowak, "The effect of target-selection strategy on phonological learning," Journal of Speech, Language, and Hearing Research, vol. 44, no. 3, pp. 610–623, 2001.

[24] M. Hawley, P. Enderby, P. Green, et al., "STARDUST: speech training and recognition for dysarthric users of assistive technology," in Proceedings of the 7th European Conference for the Advancement of Assistive Technology in Europe, Dublin, Ireland, 2003.

[25] Statistics Canada, New Brunswick (table), Community Profiles 2006 Census, Statistics Canada Catalogue no. 92-591-XWE, Ottawa, Canada, March 2007.

[26] P. Combescure, "20 listes de dix phrases phonétiquement équilibrées," Revue d'Acoustique, vol. 56, pp. 34–38, 1981.

[27] M. Lennig, "3 listes de 10 phrases françaises phonétiquement équilibrées," Revue d'Acoustique, vol. 56, pp. 39–42, 1981.

[28] J. P. Cabral and L. C. Oliveira, "Pitch-synchronous time-scaling for prosodic and voice quality transformations," in Proceedings of the 9th European Conference on Speech Communication and Technology (INTERSPEECH '05), pp. 1137–1140, Lisbon, Portugal, 2005.

[29] S. David and B. Antonio, "Frequency domain vs. time domain VTLN," in Proceedings of the Signal Theory and Communications, Universitat Politecnica de Catalunya (UPC), Spain, 2005.

[30] Y. A. Alotaibi, S.-A. Selouani, and D. O'Shaughnessy, "Experiments on automatic recognition of nonnative Arabic speech," EURASIP Journal on Audio, Speech, and Music Processing, vol. 2008, Article ID 679831, 9 pages, 2008.

[31] Cambridge University Speech Group, The HTK Book (Version 3.3), Cambridge University Engineering Department, Cambridge, UK.

[32] P. Loizou, Speech Enhancement: Theory and Practice, CRC Press, Boca Raton, Fla, USA, 2007.

[33] ITU, "Perceptual evaluation of speech quality (PESQ), an objective method for end-to-end speech quality assessment of narrowband telephone networks and speech codecs," ITU-T Recommendation P.862, 2000.

[34] ITU-T Recommendation P.800, "Methods for Subjective Determination of Speech Quality," International Telecommunication Union, Geneva, Switzerland, 2003.

[35] W. Cichocki, S.-A. Selouani, and L. Beaulieu, "The RACAD speech corpus of New Brunswick Acadian French: design and applications," Canadian Acoustics, vol. 36, no. 4, pp. 3–10, 2008.