UNIVERSIDADE DE LISBOA
INSTITUTO SUPERIOR TÉCNICO
Speech and language technologies applied to diagnosis and therapy of brain diseases
Anna Maria Pompili
Supervisor: Doctor Alberto Abad Gareta
Co-Supervisor: Doctor Isabel Pavão Martins
Thesis approved in public session to obtain the PhD Degree in
Information Systems and Computer Engineering
Jury final classification: Pass with Distinction
2019
UNIVERSIDADE DE LISBOA
INSTITUTO SUPERIOR TÉCNICO
Speech and language technologies applied to diagnosis and therapy of brain diseases
Anna Maria Pompili
Supervisor: Doctor Alberto Abad Gareta
Co-Supervisor: Doctor Isabel Pavão Martins
Thesis approved in public session to obtain the PhD Degree in
Information Systems and Computer Engineering
Jury final classification: Pass with Distinction
Jury
Chairperson: Doctor José Manuel da Costa Alves Marques,
Instituto Superior Técnico, Universidade de Lisboa
Members of the Committee:
Doctor Maria de São Luís de Vasconcelos Fonseca e Castro Schoner,
Faculdade de Psicologia e de Ciências da Educação, Universidade do Porto
Doctor Mário Jorge Costa Gaspar da Silva,
Instituto Superior Técnico, Universidade de Lisboa
Doctor António Joaquim da Silva Teixeira, Universidade de Aveiro
Doctor David Manuel Martins de Matos,
Instituto Superior Técnico, Universidade de Lisboa
Doctor Alberto Abad Gareta, Instituto Superior Técnico, Universidade de Lisboa
Doctor Ana Rita Mendes Londral Gamboa,
UniSpital, University Hospital of Zurich, Switzerland
Funding Institutions
Fundação para a Ciência e a Tecnologia
2019
Resumo

As doenças cerebrais, e mais especificamente os distúrbios neurodegenerativos, incluem uma gama de condições que afetam o cérebro, causando danos irreversíveis e progressivos. Não há uma cura para muitas dessas doenças, mas a deteção precoce do início dos sintomas pode atenuar o seu progresso. O processo atual para rastrear os distúrbios neurodegenerativos apresenta desvantagens importantes, sendo altamente dispendioso e demorado. Estes fatores tornam-se particularmente onerosos quando é necessária uma reavaliação frequente, de forma a ajustar a dosagem dos fármacos.

Esta tese aborda o uso de tecnologias da fala e da linguagem para contribuir para o diagnóstico clínico de doenças neurodegenerativas. O uso dessas tecnologias pode facilitar o processo de triagem dessas doenças e fornecer aos médicos uma ferramenta complementar e objetiva de diagnóstico. De acordo com a manifestação dos sintomas clínicos, são identificadas três áreas distintas nas quais esta dissertação pode contribuir para o avanço do atual estado da arte: monitorização das habilidades de fala, cognitivas e da linguagem. Relativamente a essas áreas, esta tese faz as seguintes contribuições: (1) definição de um conjunto geral e padrão de características capazes de modelar os sintomas de um distúrbio que afete a produção motora da fala, como a doença de Parkinson. Este conjunto de características é usado para avaliar a relevância de diferentes tarefas de fala em português dedicadas à análise da fonação, respiração e articulação. Os resultados mostram que as tarefas de produção mais importantes são a leitura de frases prosódicas e a narração de histórias, que permitem alcançar uma precisão de classificação da doença de Parkinson de 85.10% e 82.32%, respetivamente; (2) implementação online de um conjunto representativo de testes neuropsicológicos utilizados na triagem da demência, como o Défice Cognitivo Ligeiro, utilizando a tecnologia de reconhecimento automático de fala. Para avaliar a viabilidade do instrumento de monitorização, recolheu-se um corpus de fala em português, que inclui gravações de cinco pessoas com diagnóstico de défice cognitivo e de cinco sujeitos saudáveis. O erro entre a avaliação manual e a automática é relativamente pequeno, entre 0.80 e 3.00 para os pacientes e entre 0.80 e 2.80 para o grupo de controlo, confirmando a viabilidade desse tipo de sistemas; (3) desenvolvimento de um método automático de análise de aspetos pragmáticos da produção de discurso, como a coerência de tópico. Este método é ainda complementado com aspetos lexicais, sintáticos e semânticos do discurso, de modo a obter uma avaliação abrangente da produção do discurso, tarefa que já provou ser útil para a deteção da doença de Alzheimer. Com este método, os resultados da classificação atingem uma precisão de 85.5% na identificação automática da doença.
Abstract

Brain diseases, and more specifically neurodegenerative disorders, include a range of conditions that affect the brain, causing irreversible and progressive damage. There is no cure for many of these diseases, but the early detection of symptom onset may mitigate their progress. The current process to screen neurodegenerative disorders presents important disadvantages, being both highly costly and time-consuming. These factors become particularly burdensome when frequent re-assessment is required to fine-tune drug dosage.

This thesis addresses the use of speech and language technologies to contribute to the clinical diagnosis of neurodegenerative diseases. The use of these technologies may ease the screening process of these disorders and provide clinicians with a complementary, objective diagnostic tool. According to the manifestation of the clinical symptoms, three distinct areas are identified in which this dissertation can contribute to the advance of the current state of the art: monitoring of speech, cognitive, and language abilities. With respect to these areas, this thesis makes the following contributions: (1) definition of a general and standard set of features capable of modeling the symptoms of a disorder affecting motor production of speech, such as Parkinson's disease. This set of features is used to assess the relevance of different speech tasks in Portuguese dedicated to evaluating phonation, respiration, and articulation. Results show that the most important production tasks are the reading of prosodic sentences and storytelling, achieving a Parkinson's disease classification accuracy of 85.10% and 82.32%, respectively; (2) online implementation of a representative set of neuropsychological tests used in the screening of dementia, such as Mild Cognitive Impairment, exploiting automatic speech recognition technology. To evaluate the feasibility of the monitoring tool, a Portuguese speech corpus including the recordings of five people diagnosed with cognitive impairments and five healthy control subjects was collected. The error between the manual and the automatic evaluation is relatively small, from 0.80 to 3.00 for the patients and from 0.80 to 2.80 for the control group, confirming the feasibility of this type of system; (3) development of an automatic method to analyze pragmatic aspects of discourse production, in particular topic coherence. This method is further complemented with lexical, syntactic, and semantic aspects of discourse, in order to provide a comprehensive evaluation of discourse production that is shown to be useful for the detection of Alzheimer's disease. In this way, classification results achieve an accuracy of 85.5% in the automatic identification of the disease.
Palavras-Chave
Keywords
Palavras-Chave
Diagnóstico Clínico Automático
Avaliação de Doenças Neurodegenerativas
Análise Automática de Fala
Reconhecimento Automático de Fala
Processamento de Linguagem Natural
Keywords
Automatic Clinical Diagnosis
Neurodegenerative Diseases Assessment
Automatic Speech Analysis
Automatic Speech Recognition
Natural Language Processing
Acknowledgments

This doctoral research has been conducted at the Spoken Language Systems Laboratory (L2F) at INESC-ID. The support of many people and various institutions contributed, directly and indirectly, to the fulfillment of this work. I would like to take this opportunity to express my gratitude to all of them.

My deepest gratitude goes to my scientific advisors, Prof. Alberto Abad and Prof. Isabel Pavão Martins, for their technical knowledge, vision, and suggestions regarding both where to direct my research and how to pursue the expectations of this thesis. Given the interdisciplinary nature of this doctoral research, it would not have been possible to fulfill the objectives of this thesis without their joint and complementary guidance. In different ways, they always supported my work and encouraged my choices. Prof. Alberto Abad taught me the right approach to address complicated problems, and readily helped me to overcome the many difficulties that I had to tackle while pursuing the goals of this thesis. Prof. Isabel Pavão Martins allowed me to acquire knowledge in the area of neuroscience, sharing with me her clinical point of view, interesting research papers, and inspiring discussions. She has also been an active supporter of this work by constantly pursuing new ideas and collaborations.

I wish to express my gratitude to Professor Isabel Trancoso, not only for her continuous guidance, but also for having pushed me to start this doctoral research and for having supported it over the years with her enthusiasm and precious advice. During the time spent at the L2F, she never missed a chance to show me her trust, and always gave me her full support and availability.

I am grateful to all the members of the Laboratório de Estudos de Linguagem of the University Clinic of Neurology for warmly welcoming me to their weekly meetings. Attending the presentations and discussions of research papers and clinical cases has been an enlightening experience from both a personal and a technical point of view.

I owe a very special acknowledgment to all the people who made the accomplishment of this thesis possible by providing their valuable contributions:

• Dr. Filipa Miranda, for introducing me to the subject of topic coherence and for her important advice during the development of this study;

• Dr. Rita Cardoso, Dr. Helena Santos, Dr. Joana Carvalho, Prof. Isabel Guimarães, and Prof. Joaquim J. Ferreira, for sharing the FrasuloPark database;

• Dr. José Salgado, Dr. Inês Cunha, and Dr. Vitorina Passao, for having allowed the data collection of cognitively impaired patients and for their availability, support, and precious feedback during the development of a tool for the screening of cognitive impairments;

• Prof. Nuno Mamede, for having kindly provided an important resource that constituted the baseline for some of the results achieved in this dissertation.

Next, I express my gratitude to the members of the Comissão de Acompanhamento da Tese, Prof. Mário Silva, Prof. António Teixeira, and Prof. David de Matos, for their suggestions and recommendations, both in terms of scientific research and the organization of this document.

I would also like to thank the Portuguese research funding agency Fundação para a Ciência e a Tecnologia (FCT) for its support through the PhD scholarship SFRH/BD/97187/2013 during the first four years of this work, as well as the Instituto Superior Técnico funding during the last year.

Thank you also to all the colleagues and roommates I have had the pleasure to know during these years. With their kindness and friendship, they have been active supporters of this experience.

Finally, my special thanks go to Paolo, my companion. He dealt with my frustrations on a daily basis, giving me shine and structure; I am very grateful for his patience and understanding. His advice, care, and support have been invaluable in overcoming the hardest difficulties and achieving this result.
Contents
1 Introduction 1
1.1 Motivations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Structure of this document . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Technical Background 7
2.1 Natural language processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.1 Bag-of-words . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1.2 N-gram models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.3 Word embedding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 Spoken language processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.1 Speech analysis and speaker characterization . . . . . . . . . . . . . . . . . 12
2.2.1.1 Speech analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.1.2 Speaker characterization . . . . . . . . . . . . . . . . . . . . . . . 19
2.2.2 Automatic speech recognition . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3 Machine learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.3.1 Machine learning models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3.1.1 Support Vector Machine . . . . . . . . . . . . . . . . . . . . . . . 25
2.3.1.2 Decision Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3.1.3 Random Forest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.3.1.4 Artificial Neural Network . . . . . . . . . . . . . . . . . . . . . . 27
2.3.2 Feature selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.3.3 Model evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.3.3.1 Evaluation metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3 Characterization of Neurodegenerative Diseases 33
3.1 Mild Cognitive Impairment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2 Alzheimer’s disease . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.3 Parkinson disease . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.4 Dementia with Lewy bodies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.5 Fronto Temporal Dementia . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.6 Amyotrophic Lateral Sclerosis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.7 Huntington disease . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.8 Neurodegenerative diseases and SLT . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4 Related Work: SLT for Diagnosis of Neurodegenerative Diseases 45
4.1 Monitoring of speech abilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.2 Monitoring of cognitive abilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.2.1 Semantic fluency tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.2.2 Cognitive tests assessing memory, attention, orientation . . . . . . . . . . 52
4.3 Monitoring of language abilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5 Contributions to the Monitoring of Speech Abilities 65
5.1 Automatic detection of Parkinson’s Disease: an analysis of speech production
tasks used for diagnosis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.1.1 Corpus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.1.2 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.2 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
6 Contributions to the Monitoring of Cognitive Abilities 71
6.1 Semantic verbal fluency test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
6.1.1 Corpus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
6.1.2 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
6.2 Automatic monitoring and training of cognitive functions . . . . . . . . . . . . . 76
6.2.1 Extending VITHEA for neuropsychological screening . . . . . . . . . . . . 76
6.2.2 Corpus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
6.2.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
6.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
7 Contributions to the Monitoring of Language Abilities 83
7.1 Evaluating pragmatic aspects of discourse production for the automatic identifi-
cation of Alzheimer’s disease . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
7.1.1 Corpus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
7.1.2 The proposed model to analyze topic coherence . . . . . . . . . . . . . . . 86
7.1.2.1 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
7.1.2.2 Clause segmentation . . . . . . . . . . . . . . . . . . . . . . . . . 87
7.1.2.3 Coreference analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 87
7.1.2.4 Sentence embeddings . . . . . . . . . . . . . . . . . . . . . . . . . 88
7.1.2.5 Topic hierarchy analysis . . . . . . . . . . . . . . . . . . . . . . . 88
7.1.3 Features for AD spoken discourse characterization . . . . . . . . . . . . . 90
7.1.3.1 Topic coherence features . . . . . . . . . . . . . . . . . . . . . . . 90
7.1.3.2 Other linguistic features . . . . . . . . . . . . . . . . . . . . . . . 91
7.1.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
7.1.4.1 Experiments using manual transcriptions . . . . . . . . . . . . . 95
7.1.4.2 Experiments using automatic transcriptions . . . . . . . . . . . . 98
7.2 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
8 Conclusions and Future Work 103
8.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
8.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
A Appendix 109
A.1 Excerpts of input/output processing . . . . . . . . . . . . . . . . . . . . . . . . . . 109
A.1.1 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
A.1.2 Clause segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
A.1.3 Coreference analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
A.2 Computation of semantic features . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
A.2.1 Specifications of an ICU list . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
A.2.2 Computing ICUs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
Bibliography 113
List of Tables
5.1 Description of the acoustic features based on 53 low-level descriptors plus 6
functionals. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.2 Demographic and clinical data for patient and control groups. . . . . . . . . . . . 67
5.3 Task-dependent recognition results on the 2-class detection task (PD vs. control). 69
6.1 WER for different language models: i) Generic ASR system: general purpose
language model trained on broadcast news, ii) Prebuilt list based: constrained
keyword model created from the list used in the STRING project, iii) Ontology
based: constrained keyword model created from the ontology Temanet. . . . . . 74
6.2 Performance of AuToBI using English and European Portuguese models and
three segmentation strategies: ASR-based, ontology-based (TemaNet), and
phone-based. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
6.3 Implemented cognitive tests. KWS: Keyword spotting, RBG: Rule-based gram-
mar, ALM: ad-hoc language model for keyword spotting. . . . . . . . . . . . . . . 78
6.4 Accuracy and WER according to the type of question. . . . . . . . . . . . . . . . . 79
6.5 MAE and MRAE (in brackets) by type of question and by neuropsychological test. 81
7.1 Statistical information on the Cookie Theft corpus . . . . . . . . . . . . . . . . . . 86
7.2 Summary of all extracted features (141 in total). The number of each type of
features is reported in parenthesis. . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
7.3 Summary of AD classification results (avg. and range accuracy % ) . . . . . . . . 99
List of Figures
2.1 Vocabulary and BOW vector representations for an example corpus of two doc-
uments containing each one sentence. . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Visual representation of the computation of P(Mary loves that person) using a
bi-gram model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3 (a) A schematic diagram of the human speech production apparatus. (b) Wave-
form of /sees/, showing a voiceless phoneme /s/, followed by a voiced sound,
the vowel /iy/. The final sound, /z/, is a type of voiced consonant (Huang et al.
2001). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.4 Schematic representation of the extraction of a sequence of 39-dimensional
MFCC feature vectors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.5 Jitter and Shimmer perturbation measures in a speech signal (Teixeira & Fernan-
des 2014). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.6 Quadrilateral and triangular vowel space area for healthy subjects. The picture
was extracted from the work of Vizza et al. (2017), on the use of speech signal for
studying sclerosis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.7 Schematic representation of the main modules constituting an automatic speech
recognizer. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.8 A typical workflow used in the training process of a machine learning model. . . 24
2.9 (a) 3 possible hyperplanes for an SVM trained with linearly separable data, the
best hyperplane is shown with a solid line. (b) The hyperplane with the max-
imum distance from data points. (c) Soft-margin allowing some classification
errors. (d) Non-linearly separable data . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.10 A decision tree for the concept PlayTennis. This tree classifies Saturday mornings
according to whether or not they are suitable for playing tennis (Mitchell 1997). . 26
2.11 (a) A schematic drawing of a biological neural network with two neurons, (b) a
diagram of an artificial neuron, (c) an architecture of an artificial neural network
with three layers (Negnevitsky 2005). . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.12 An illustration of the k-fold cross-validation method with k=10. . . . . . . . . . . 31
6.1 An excerpt of an audio recording showing, respectively, from the top: the spec-
trogram, the F0, the textual transcriptions of the sound, and prosodic events clas-
sification. Red arrows indicate a continuation rise contour, while the yellow ar-
row indicates a finality contour. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
6.2 On the left side, MMSE scores of the human and automatic evaluations for the
patient speakers. On the right side, MMSE scores of the human and automatic
evaluations for the healthy speakers. . . . . . . . . . . . . . . . . . . . . . . . . . . 81
7.1 (a) The Cookie Theft picture, from the Boston Diagnostic Aphasia Examination
(Goodglass et al. 2001). (b) An excerpt of a topic hierarchy for the Cookie Theft
picture found in the work of Miranda (2015). . . . . . . . . . . . . . . . . . . . . . 85
7.2 The proposed method for modeling discourse as a hierarchy of topics. . . . . . . 87
7.3 Topic hierarchy building algorithm. (a) The current sentence is compared with
the topic clusters to identify its topic. (b) Identification of the level of special-
ization of the current sentence. If there are no nodes with the same topic of the
current sentence, this is considered a new topic. (c) If the current hierarchy con-
tains one or more nodes with the same topic of the current sentence, each of them
is analyzed with respect to the current one. (d) As a result, the current sentence
is added as a child of its closest node. . . . . . . . . . . . . . . . . . . . . . . . . . . 89
7.4 Variation of the classification accuracy with the SFS method, while increasing the
number of features. Results are presented for the set of topic coherence features
that provided the maximum accuracy. Features are computed on the manual
transcriptions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
7.5 Accuracy achieved with the top selected features using the fusion of different
sets. Results are computed on the manual transcriptions (top) and on the auto-
matic transcriptions (bottom). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
List of Acronyms
AD Alzheimer’s Disease. xvii, 34, 35, 37, 38, 41–43, 50, 52, 56–62, 71, 76, 83–85, 91, 92, 94–101,
108
ADAS-Cog Alzheimer’s Disease Assessment Scale - Cognitive Subscale. 35, 71, 76–78, 104
ALS Amyotrophic Lateral Sclerosis. 38–41, 43, 65
ANN Artificial Neural Network. xiii, 27, 28, 56, 73
ANOVA analysis of variance. 29, 51, 57
ASR Automatic Speech Recognition. xvii, 5, 21, 50, 52–55, 72–76, 80, 98–101, 107
BDAE Boston Diagnostic Aphasia Examination. 56
BN Bayes Network. 56
BOW bag-of-words. xix, 8, 9, 11, 61, 62
CART Classification and regression tree. 26
CFD Castiglioni Fractal Dimension. 51
CHAT Codes for the Human Analysis of Transcripts. 56
CSR Continuous Speech Recognition. 23
DFT Discrete Fourier Transform. 15
DLB Dementia with Lewy bodies. 36, 37, 41
DT Decision Tree. xiii, 26, 27, 56, 57
EM Expectation-Maximization. 20
FTD Frontotemporal Dementia. 37, 38, 42, 43, 57
GMM Gaussian Mixture Model. 19, 20, 22, 48, 49
HD Huntington Disease. 39–41, 43, 45, 65
HMM Hidden Markov Model. 21, 22, 73
ICU Information Content Unit. 92, 94, 110, 111
ID3 Iterative Dichotomiser 3. 26
IDFT inverse Discrete Fourier Transform. 15
IG Information Gain. 56
IVR interactive voice response. 50, 51
IWR Isolated Word Recognition. 23
KWS keyword spotting. 22, 73, 79
LASSO least absolute shrinkage and selection operator. 29
LOO leave-one-out. 30, 48, 49, 59
LPC Linear Prediction Coefficients. 16, 46, 47
LPCC Linear Prediction Cepstral Coefficients. 46, 47
LPO leave-p-out. 30
LSA Latent Semantic Analysis. 11
LVCSR Large Vocabulary Continuous Speech Recognition. 22, 23
MAE Mean Absolute Error. xvii, 59, 80–82
MAP Maximum a Posteriori. 20
MATTR moving-average type-token ratio. 58
MCI Mild Cognitive Impairment. 33, 34, 41, 43, 50–52, 54, 56, 60–62, 71, 76
MFCC Mel-frequency cepstral coefficient. xix, 15, 16, 19, 21, 46–49, 51, 58–60, 66
MLP Multilayer Perceptron. 27, 57, 73
MMSE Mini-Mental State Examination. 34, 35, 37, 53, 59, 71, 76, 77, 80, 104
MRAE Mean Relative Absolute Error. xvii, 80, 81
NB Naïve Bayes. 56
NLP Natural Language Processing. 8, 12, 87, 92
NNLM neural network language model. 11, 12
OOV Out-Of-Vocabulary. 10, 79
PCA Principal Component Analysis. 46
PD Parkinson’s Disease. 35–37, 41, 43, 45–49, 65–69, 103, 107
PE Permutation Entropy. 51
PLP Perceptual Linear Prediction coefficients. 16, 17, 21, 46, 47, 73
POS Part of Speech. 57, 86, 87, 92, 109
PPA Primary Progressive Aphasia. 37, 38, 43
RASTA Relative Spectra coefficients. 46
RF Random Forest. xiii, 27, 60, 67, 69, 94
SFS sequential forward selection. xx, 95–97
SLP Speech Language Pathologist. 2, 3, 36, 39–41, 65
SLT Speech and Language Technology. xiv, 3, 4, 33, 40–42, 45, 50, 52, 62, 71, 82, 103–105
STT Speech to Text. 98
SVM Support Vector Machine. xiii, xix, 25, 26, 47–49, 51, 54, 56
TTR type-token ratio. 58, 91
UBM Universal Background Model. 19, 20, 48, 49
UHDRS Unified Huntington’s Disease Rating Scale. 40
UPDRS Unified Parkinson’s Disease Rating Scale. 36
VAI Vowel Articulation Index. 18, 19, 47
VSA Vowel Space Area. 17, 18, 47
VUV voiced/unvoiced. 48
WAB Western Aphasia Battery. 54
WAIS-III Wechsler Adult Intelligence Scale - III. 34
WAIS-IV Wechsler Adult Intelligence Scale - IV. 51
WER Word Error Rate. 31, 32, 52, 54, 55, 73, 74, 79, 80, 82, 98, 100, 101
WLM Wechsler Logical Memory. 54
WMS Wechsler Memory Scale. 54
WMS-IV Wechsler Memory Scale fourth edition. 51
1 Introduction
Language is a fundamental ability in our daily lives, as it is used to communicate with the
world around us. The production of language is a complex, multidimensional skill that in-
volves different, interdependent cognitive domains. From a very general point of view, to
express meaning through language our thoughts have to be converted into a conceptual rep-
resentation, which corresponds to the generation of the message to be conveyed. This phase
implies access to semantic information for the retrieval and selection of the words to be used.
Then, syntactic properties of the identified lexical items are elaborated and the appropriate or-
der of words within the sentence is established. Finally, the conceptual representation of the
words to be spoken is transformed into a sequence of speech sounds to be pronounced, which
are sent from the brain to the articulatory system. In order for sounds to be produced correctly,
the lips, tongue, jaw, velum, and larynx must produce accurate movements at the right time
or the intended sounds become distorted (Dell et al. 1999, Dronkers & Ogar 2004, Garrett 1975,
Jay 2002). In conclusion, the utterance of a simple sentence requires the activation of a large-
scale neural network dedicated to semantic, syntactic, and phonological processing. Engaging
in a conversation is an even more cognitively demanding task, since besides linguistic processing it may also require access to memory, world knowledge, or high-order cognitive functions
like planning. When considering also the integration of sensory and motor functions, and the
corresponding brain regions that control them, it is not surprising that speech production has
been described as one of the most complex human behaviors. A considerable set of widely distributed brain regions is involved in speech production; a lesion in any of these areas may disturb the equilibrium of this complex system and produce alterations in the resulting speech.
Brain disease is an umbrella term for a range of conditions that affect the brain in different forms: neurodegenerative disorders, infections, trauma, stroke, and tumors. Neurodegenerative diseases are incurable and debilitating conditions that result in the progressive degeneration and death of neurons in different regions of the nervous system. As the damage is permanent, the condition tends to worsen as the disease progresses. There are more than six hundred disorders affecting the nervous system. According to their clinical symptoms, they can be classified into three main categories: i) diseases presenting cognitive decline, dementia, and alterations in high-order brain functions, ii) movement disorders, and iii) a combination of both types of symptoms (Kovacs 2014). Major clinical features representative of the first category include deficits in various cognitive domains, such as memory, attention, and language (e.g., Alzheimer's Disease, Dementia with Lewy bodies). Movement disorders are clinically associated with hyperkinetic, hypokinetic, and akinetic symptoms, such as uncontrollable movements, slowness of movement, and lack of spontaneous motility (e.g., Parkinson's Disease). Current clinical practice to screen neurodegenerative diseases requires an examination by an expert neurologist, which may then be followed by an examination with a Speech Language Pathologist (SLP). The assessment typically includes a medical examination, the manual administration of standardized neuropsychological tests, and, possibly, the perceptual evaluation of voice quality. The patient's history is of particular relevance, both to consider similar previous cases in the family and to observe the evolution of the clinical picture according to the patient's complaints. The clinical evaluation can be quite long, depending on the types of tests and exams performed and on the patient's cooperation. Neuroimaging studies are both invasive and expensive, and have limited use as a preliminary screening tool. Additionally, although sophisticated quantitative and objective image analysis methods exist, standard clinical practice in diagnostic imaging is qualitative in nature.
1.1 Motivations
There is no cure for neurodegenerative diseases, but treatment can still be of help. In this re-
gard, the diagnosis of early onset symptoms is critical to start the appropriate intervention and
mitigate disease progression (Sheinerman & Umansky 2013). However, the current process
to screen neurodegenerative disorders presents important disadvantages, being both highly
costly and time-consuming. These factors become particularly burdensome when frequent
re-assessment is required to fine-tune dosage of drugs. Another important concern regards the
reduced availability of specialized neurologists. Data published by the European Brain Council
show that 220.7 million people in Europe suffer from at least one neurological disease (Wittchen
et al. 2011). On the other hand, the number of specialized neurologists in the EU countries is
around 25,000. Depending on the country there are between 4 and 13 neurologists per 100,000
people (Olesen et al. 2012, Steck et al. 2013). When considering the rapidly aging global popu-
lation and an expected dramatic increase of neurological disorders, the number of specialized
clinicians is clearly insufficient to meet the growing needs. This problem is also of particular
relevance in remote areas with reduced medical resources, where the availability of specialized
physicians is even more limited.
For all the above reasons, nowadays, there is an increasing need for additional, noninva-
sive, and cost-effective tools allowing a preliminary identification of diseases in their early
clinical stages. Further examinations using additional screening measures could then be
performed in a subsequent step by an established clinician. Speech production, being the primary means
of interaction, plays a fundamental role in the diagnostic process. In fact, speech is used to
provide information about ourselves and to fulfill part of the clinical evaluation. Speech is an
ecological way to collect biometric information, as it can be elicited and recorded automatically
relatively easily, and at much lower cost than in-person clinical assessment. Additionally, it
naturally conveys important cues that can be further investigated and analyzed. In this regard,
Speech and Language Technology (SLT) could supply an important contribution to this area.
In fact, the development of automated methods based on the evaluation of speech and language
functionalities could be of great support in clinical diagnosis, not only by providing
a complementary diagnostic tool, but also by assessing disease progression over time objectively
and accurately. As a matter of fact, the automatic analysis of voice and language makes it possible
to offer an objective evaluation that is consistent and independent of the experience of the clinician,
thus excluding possible differences due to inter-expert variability. Finally, the ability to
offer an alternative, remote assessment represents an opportunity to provide access to medical
services for those who might otherwise be deprived of an SLP.
In the light of these considerations, the major aim of this thesis was to conduct research
in the area of diagnosis of neurodegenerative diseases by means of SLT that can contribute to
an improvement of the current state of the art methods. Research on SLT applied to neurode-
generative diseases is an interdisciplinary area, which requires knowledge from the linguistic
domain, computer and electronics engineering, and partially from neuroscience. In fact, to
actively contribute to the diagnosis and therapy of neurological disorders, an understanding
of the symptoms caused by neural damage is fundamental and, even more importantly, of how
these signs impact speech and language functionalities. This is an area that is relatively
new to the research group in which I am involved, whose primary focus is the automatic
processing of natural spoken language. The group holds a strong background in linguistics,
computer engineering, and electronics engineering, gathered over more than 20 years of research. However, until
the development of this thesis, research on the application of these technologies to the health
area was quite limited. In this context, it is also the purpose of this thesis to set the stage for
future investigations in this field to be developed in the group.
1.2 Contributions
To contribute to the diagnosis of neurodegenerative diseases by means of SLT, it is necessary
to investigate in depth existing solutions applied to speech impairments, identifying current ap-
proaches and their limitations. However, before diving into the literature review of existing
solutions, it is first necessary to analyze the most common neurodegenerative diseases, to
understand their symptoms and also the methods used in clinical practice for their diagnosis.
The ultimate goal is to identify common patterns between different disorders and then focus
the research on a sample of diseases representative of these disorders. From the results of this
investigation, I identified three main areas where SLT could provide its contributions. When
the impairment affects the organs related with the production of sounds, the resulting speech
may become distorted or unintelligible (e.g., Parkinson’s Disease). In this case, alterations in
voice could be analyzed and monitored through the analysis of the speech signal. When speech
production is preserved, it becomes a reliable means to investigate cognitive decline through
neuropsychological tests (e.g., Dementias). Finally, when the neurological damage affects the
areas of the brain related with the processing of language, the resulting speech could be im-
paired in different ways (e.g., aphasia, Frontotemporal Dementia). In this case, the alterations
that occur in discourse production could also represent an important clue to screen cognitive
deterioration. The main contributions of this work in the three areas identified are explained
in more detail hereafter:
• Monitoring of speech abilities: when the brain lesions cause a dysfunction in the regu-
lation of the major brain structures involved in the control of movements, the production
of speech may also be affected. In fact, in these cases the muscles implicated in speech
production are also subject to specific dysfunctions, causing patients to experience dif-
ficulties in communication despite the existence of language competence. Different dis-
eases involving a neural degeneration of the brain may cause similar disorders on motor
speech abilities. The most common motor speech disorder caused by neurological injury
is dysarthria. This is characterized by a problem in any of the speech subsystems (tongue,
throat, lips, or lungs), leading to impairments in intelligibility, audibility, naturalness,
and efficiency of vocal communication. These kinds of disorders are typically assessed
through a battery of vocal tests aiming at evaluating the patient's abilities in tasks
involving phonation, respiration, articulation, and prosody. Through the understanding
of how these impairments affect the motor speech system, and the ability assessed by
each task, it is possible to identify suitable features able to model the problem under consideration.
In the literature, there is an extensive body of research targeting the automatic
characterization of dysarthria through different sets of speech features, speech tasks, and
machine learning models. However, from these studies it is not clear which among the
several tasks administered could be most relevant for characterizing and monitoring the
disorder. For this reason, my contribution to the evaluation of motor speech disorders aims
at filling this gap, with a study focused on analyzing the importance of each individual
vocal task and, correspondingly, the importance of individual speech impairments in the
identification of the disease.
• Monitoring of cognitive abilities: many neurodegenerative diseases cause cognitive
dysfunctions in multiple cognitive domains. Impairments depend on the area of the
brain affected and may include alterations in visuospatial ability, planning, reasoning,
attention, memory, language, and personality. The process of screening neurodegenera-
tive diseases partly relies on a cognitive assessment in which several tests are adminis-
tered to the patient. In fact, in the clinical practice, a considerable number of batteries
of neuropsychological tests have been developed, since each test aims to assess different
cognitive functions. The majority of these tests require a verbal interaction from the pa-
tient to provide the desired answers. Thus, when language ability is spared, many of
these tests are eligible to be automated through Automatic Speech Recognition (ASR)
technology, reducing the need for the physical presence of a clinician. An online imple-
mentation of these tests could ease the screening process, providing great benefits to the
health community, for both speech and language pathologists and the elderly population.
A review of the state of the art in this area highlighted very few works targeting
the automatic assessment of cognitive tests. Moreover, the implementation of systems able
to automatically administer and evaluate batteries of neuropsychological tests was even
more limited. Thus, one contribution to the screening of neurodegenerative diseases con-
sists of the automatic implementation of two widely used neuropsychological batteries.
These tests have been integrated into an online platform (Abad et al. 2013), whose flexibility
easily allows the creation of different types of tests, and are evaluated automatically by
means of ASR. Additionally, another contribution is related to the automatic evaluation of
the semantic-verbal fluency test, a sensitive test for distinguishing between Mild Cognitive
Impairment and Alzheimer’s Disease, a challenging task for current ASR systems.
• Monitoring of language abilities: neurological lesions that involve the areas of the brain
related with the production or understanding of language may compromise language
abilities in a variety of different ways. Speech may be affected at the phonological or
syntactical level, may become non-fluent with problems in word-finding and words rep-
etitions, or fluent but poor in meaning, as in the case of aphasia. More generally, problems
may arise in language production at the level of discourse structure. Speech may be
poorly organized, may lack coherence, or may present a substantial use of word repetitions
and a reduction in the use of more complex expressions. The analysis of discourse produc-
tion is a complex and broad task, being evaluated along a micro and a macro dimension
that together address different aspects, such as coherence, cohesion, and lexical and syntactic
analysis. Discourse evaluation is usually performed through the analysis of spontaneous
speech elicited with different types of stimuli. In the literature, there has been a growing
interest in investigating the computational analysis of language impairment in neurodegenerative
disorders. Overall, existing works assess the quality of discourse production
through the automatic analysis of a combination of lexical, syntactic, acoustic, and se-
mantic features. Few studies, however, approached linguistic deficits at a higher level of
processing, considering macrolinguistic aspects of discourse production such as cohesion
and coherence. For these reasons, my contribution to the assessment of language abili-
ties focuses on pragmatic aspects of discourse, and in particular, on the analysis of topic
coherence. This method is further complemented with lexical, syntactic, and semantic
aspects of discourse, in order to provide a comprehensive evaluation of discourse production.
Finally, the impact of using a speech recognition system to automatically obtain
the transcriptions of the speech samples is also evaluated.
1.3 Structure of this document
The remainder of this dissertation is structured as follows. Chapter 2 provides an introduc-
tion to some technical notions and methods that are frequently used in the areas of machine
learning, speech, and natural language processing that are relevant for the rest of this docu-
ment. In Chapter 3, I carry out a characterization of neurodegenerative diseases. This study
is required in order to identify a subset of disorders that are the focus of this thesis. Then, in
Chapter 4, I survey the related work relevant for this thesis. The state of the art of existing
speech technology solutions applied to the areas of motor speech disorders, cognitive screen-
ing, and analysis of discourse production is reported. Afterwards, Chapters 5, 6, and 7 present
the contributions provided in each of the areas of interest of this thesis, reporting the key re-
sults achieved. The document ends with Chapter 8, where the conclusions of this dissertation
and some directions for future work are presented.
2 Technical Background
This chapter reports on a brief review of common topics frequently used in speech and language
technologies. In particular, three areas are considered: natural language processing,
spoken language processing, and machine learning. The goal of natural language processing
is to enable computers to perform useful tasks involving human language, such as human-
machine communication. It has a very broad scope, which extends to speech recognition and
language understanding. Spoken language processing, on the other hand, is focused on the
study of speech signals, including their digital processing, representations, and coding. Both
areas extensively rely on machine learning methods to accomplish their tasks. Machine learn-
ing is concerned with the study of mathematical models that allow computers to perform a
task without explicit instructions.
In the following sections, basic principles and more advanced concepts are introduced for
these three areas. The topics described were selected with the aim of providing the necessary
knowledge for the understanding of the concepts mentioned in this dissertation. Section 2.1 de-
scribes some text modeling techniques commonly used in the area of natural language process-
ing. Section 2.2 is dedicated to spoken language processing, with a focus on speech analysis,
speaker characterization, and speech recognition. Finally, Section 2.3 reports on some machine
learning models, feature selection approaches, and evaluation methods used in the machine
learning area.
2.1 Natural language processing
Natural language processing (NLP) is an area of computer science concerned with the compu-
tational processing and analysis of human language data. It is an interdisciplinary field with
a very wide scope, which includes machine translation, text summarization, and conversational
agents, to mention just a few applications. In this section, the focus is limited to providing an
overview of some commonly used language models. In general, the goal of a language model
is to capture salient statistical characteristics of the distribution of sequences of words in a
natural language, making it possible to predict the next word given the preceding ones.
Broadly, there are two main categories of language models: statistical or count-based models,
and predictive models. Count-based methods compute statistics of how often words occur
with their neighboring words in a large text corpus. Predictive models directly try to predict
a word from its neighbors in terms of learned embedding vectors. Both approaches require a
large text corpus, either to compute the statistics of the model or to learn compact, distributed
representations of words.
In the following, Section 2.1.1 introduces a straightforward approach for representing text
data, while Section 2.1.2 describes a statistical language model widely used in NLP and speech
recognition. Finally, Section 2.1.3 briefly reports on neural language models and word embeddings.
2.1.1 Bag-of-words
The bag-of-words (BOW) model (Manning et al. 2010) is a simple representation used in NLP
and information retrieval. The main idea behind this approach is that important words will
occur repeatedly in various documents. Under this assumption, the number of occurrences
represents the importance of a word. In a straightforward implementation, documents may be
modeled with a fixed-length vector representation. The length is determined by the number of
unique words existing in the corpus, which is usually referred to as the vocabulary. Then, each
position of the fixed-length vector accounts for the number of times a word of the vocabulary
exists in the current document.
To provide a practical example, consider a corpus composed of two text documents con-
taining a single sentence: 1) /Mary likes french movies, but Sara prefers horror movies ./, and 2)
/Mary loves that person ./. In this case the vocabulary will be composed of 11 words: but, french,
horror, likes, loves, mary, movies, person, prefers, sara, that. The corresponding BOW vector
representations are shown in Figure 2.1.
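The construction described above can be sketched in a few lines of Python. This is a minimal illustration (punctuation is omitted to keep tokenization trivial); libraries such as scikit-learn provide equivalent functionality through their CountVectorizer class:

```python
from collections import Counter

def bow_vectors(documents):
    """Build the sorted vocabulary and one count vector per document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted(set(word for doc in tokenized for word in doc))
    # Each vector position holds the count of one vocabulary word
    vectors = [[Counter(doc)[word] for word in vocabulary] for doc in tokenized]
    return vocabulary, vectors

corpus = ["Mary likes french movies but Sara prefers horror movies",
          "Mary loves that person"]
vocab, vecs = bow_vectors(corpus)
print(vocab)    # ['but', 'french', 'horror', 'likes', 'loves', 'mary', ...]
print(vecs[0])  # [1, 1, 1, 1, 0, 1, 2, 0, 1, 1, 0]
```

Note that the word *movies* receives a count of 2 in the first vector, illustrating how multiplicity is preserved.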
Despite being a very simple approach, according to this implementation, the BOW model
preserves multiplicity by accounting for the occurrence of repeated words within a document.
Nevertheless, this approach presents some important limitations that should be mentioned.
First, it is an order-less representation of documents. This means that any information related
with the order or structure of words is disregarded. To also account for word-order information,
one should consider probabilistic language models, like n-grams, described in the next section.
Another limitation is related to the fact that word occurrences may be a very poor representation
for a text. Function words like the, a, to are usually the most frequent terms in a document,
although they are clearly not the most important. To address this problem, the frequency of
a term is usually normalized by the inverse of the document frequency (idf) (Manning et al.
Vocabulary: but, french, horror, likes, loves, mary, movies, person, prefers, sara, that

1. "Mary likes french movies, but Sara prefers horror movies"
   BOW: [1, 1, 1, 1, 0, 1, 2, 0, 1, 1, 0]

2. "Mary loves that person"
   BOW: [0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1]

Figure 2.1: Vocabulary and BOW vector representations for an example corpus of two documents, each containing one sentence.
2010). The document frequency corresponds to the number of documents in the corpus that
contain the term; weighting by its inverse balances the fact that some words appear more
frequently in general. Another inconvenience of the BOW model is that, for a very large corpus,
the vocabulary may contain thousands or millions of entries, thus requiring either more
computational resources or a limit on the size of the vocabulary.
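The idf weighting just described can be sketched as follows. This toy uses the common log(N/df) formulation; practical implementations (e.g., scikit-learn's TfidfVectorizer) apply smoothed variants:

```python
import math

def idf(documents):
    """Inverse document frequency: log(N / df) for each term in the corpus."""
    tokenized = [set(doc.lower().split()) for doc in documents]
    n_docs = len(tokenized)
    terms = set().union(*tokenized)
    # df = number of documents that contain the term
    return {t: math.log(n_docs / sum(t in doc for doc in tokenized))
            for t in terms}

weights = idf(["the cat sat on the mat", "the dog barked"])
print(weights["the"])  # 0.0 -> occurs in every document, carries no weight
print(weights["cat"])  # log(2) -> occurs in only one of the two documents
```

A term appearing in every document receives zero weight, capturing the intuition that ubiquitous function words are uninformative.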
2.1.2 N-gram models
The n-gram model (Jelinek & Mercer 1980, Katz 1987) is a probabilistic language model that
estimates the probability of a word based on the sequence of the previous n − 1 words. In this way,
this model accounts for the fact that words in a sentence respect a particular order. Approaches
based on the n-gram model have been the dominant methodology for probabilistic language
modeling since the 1980s. In fact, due to their simplicity and scalability, they are widely
used in many areas, such as computational biology (Tomovic et al. 2006), image (Soffer 1997),
speech (Hirsimaki et al. 2009) and language processing (Dunning 1994). When the estimation
of the current word depends on the previous two words, one has a tri-gram language model:
P(wi|wi−2, wi−1). Similarly, one can have uni-gram, P(wi), or bi-gram, P(wi|wi−1), language
models. For example, to calculate the probability of the sentence /Mary loves that person ./
Figure 2.2: Visual representation of the computation of P(Mary loves that person) using a bi-gram model: P(Mary|&lt;s&gt;), P(loves|Mary), P(that|loves), P(person|that), P(&lt;/s&gt;|person).
using a bi-gram model, one would take:
P(Mary loves that person)=
P(Mary|<s>)P(loves|Mary)P(that|loves)P(person|that)P(</s>|person).
To make P(wi|wi−1) meaningful for i=1, the beginning of the sentence is padded with a distin-
guished token <s>; pretending in this way that w0 = <s>. In addition, it is necessary to place
a distinguished token </s> at the end of the sentence. The process to compute the probability
of a sentence with a bi-gram model is visually shown in Figure 2.2.
The frequencies with which the word wi occurs given that the previous word is wi−1 are
estimated on a training corpus. This is achieved by counting how often the sequence (wi−1, wi)
occurs and then normalizing the count by the number of times wi−1 occurs.
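This estimation procedure can be sketched as follows (a toy maximum-likelihood estimate on a two-sentence corpus; real systems add smoothing for unseen bi-grams, as discussed below):

```python
from collections import Counter

def bigram_probs(sentences):
    """Maximum-likelihood bi-gram estimates: P(wi|wi-1) = c(wi-1, wi) / c(wi-1)."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(words[:-1])                 # count each history word
        bigrams.update(zip(words[:-1], words[1:]))  # count each adjacent pair
    return {(prev, w): c / unigrams[prev] for (prev, w), c in bigrams.items()}

probs = bigram_probs(["Mary loves that person", "Mary likes movies"])
print(probs[("<s>", "Mary")])   # 1.0 -> both sentences start with Mary
print(probs[("Mary", "loves")]) # 0.5 -> Mary is followed by loves in 1 of 2 cases
```

Multiplying the relevant entries then yields the probability of a whole sentence, exactly as in the worked example above.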
N-grams have been criticized because they explicitly ignore any dependencies on words
beyond the previous n − 1. Furthermore, a new observed sequence typically will have occurred
rarely or not at all in the training corpus. In particular, when modeling the joint distribution of
a sentence, the n-gram model described above would assign a zero probability to those word
sequences that were not encountered in the training corpus. This data sparsity problem is
inherent to n-gram training; in the extreme case, words never seen in training are known as
Out-Of-Vocabulary (OOV) words. Smoothing techniques address this
problem by adjusting the probability estimates for unseen data (Jurafsky & Martin 2014).
Finally, n-grams, like any other statistical language model, suffer from the curse
of dimensionality. This arises when a huge number of different combinations of values of the
input variables must be discriminated from each other, and at least one example per relevant
combination of values is needed (Bishop 2006).
2.1.3 Word embedding
Traditional statistical approaches are not able to capture information about the meaning of a
word or about its context. This means that potential relationships, such as contextual close-
ness, are not captured across collections of words. For example, neither the BOW, nor the
n-gram model can capture simple relationships, such as determining that the words dog and
cat both refer to animals that are often discussed in the context of household pets. Also, traditional
approaches become impractical when dealing with a very large corpus, leading to a
large feature dimension and a sparse representation. Word embedding is a computationally
efficient model for learning distributed representations of words that preserve linear regularities
(Mikolov, Sutskever, Chen, Corrado & Dean 2013). Differently from statistical language
models (e.g., n-grams) or count-based continuous vector representations (e.g., Latent Semantic
Analysis (LSA) (Dumais 2004)), word embeddings are learned by a neural network model.
In this way, the learning algorithm is exploited to discover the features that best characterize
the meaning of a word. These may include grammatical features, like gender and number,
as well as semantic features, like animate or invisible. These features are not mutually exclusive
and are continuous-valued. Consider that each word corresponds to a point in a feature space.
The goal of the learning algorithm, then, is to associate each word with a multidimensional
continuous-valued vector representation wherein each dimension corresponds to a semantic
or grammatical characteristic of words. The idea is that, in this feature space, semantically
similar words are closer to each other. This means that words such as dog and cat should have
similar word vectors to the word pet, whereas the word banana should be quite distant. This
forces the learned word features to correspond to a form of semantic and grammatical similar-
ity, and helps the neural network to compactly represent them. A sequence of words can then
be transformed into a sequence of learned feature vectors. The neural network learns to map
that sequence of feature vectors to a prediction of interest, such as the probability distribution
over the next word in the sequence (Bengio 2008).
Mikolov (Mikolov, Chen, Corrado & Dean 2013) introduced two model architectures for
learning word embeddings that try to minimize computational complexity, the continuous Bag-
of-Words model (CBOW) and the continuous Skip-Gram model. Algorithmically, these models
are similar, as they both rely on a training method composed of two steps. First, continuous
word vectors are learned using a simple model; then, an n-gram neural network language model
(NNLM) is trained on top of these distributed representations of words. The main difference
between the two models proposed by Mikolov is that the CBOW model predicts a target word
from previous and future context words, while the skip-gram model predicts context words
within a certain range before and after the current word. By training high dimensional word
vectors on a large amount of data, the resulting vectors can be used to capture subtle semantic
relationships between words. This can be observed by performing simple algebraic opera-
tions with the vector representations of words, as in the following example:
Paris − France + Italy = Rome
That is, by knowing that the capital of France is Paris, it is possible to infer the capital
of Italy.
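This vector arithmetic can be illustrated with a toy example. The 3-dimensional vectors below are hand-crafted so that the regularity holds by construction; real learned embeddings are typically 100–300 dimensional and trained on large corpora:

```python
import math

# Hand-crafted toy "embeddings": one dimension each for France-ness,
# Italy-ness, and capital-ness. Learned embeddings encode such
# regularities implicitly, in far higher dimension.
vectors = {
    "France": [1.0, 0.0, 0.0],
    "Italy":  [0.0, 1.0, 0.0],
    "Paris":  [1.0, 0.0, 1.0],
    "Rome":   [0.0, 1.0, 1.0],
    "banana": [0.0, 0.0, 0.2],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Compute the query vector Paris - France + Italy
query = [p - f + i for p, f, i in
         zip(vectors["Paris"], vectors["France"], vectors["Italy"])]

# The nearest remaining word (by cosine similarity) answers the analogy
candidates = {w: v for w, v in vectors.items()
              if w not in ("Paris", "France", "Italy")}
answer = max(candidates, key=lambda w: cosine(query, candidates[w]))
print(answer)  # Rome
```

Here *banana* serves as the distant distractor mentioned above: its similarity to the query vector is much lower than that of *Rome*.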
2.2 Spoken language processing
From a general point of view, spoken language processing could be considered an area of NLP,
as speech applications are often integrated in natural language tasks, such as in the case of speech
recognition and conversational agents. However, speech processing is actually an area with a
strong component of digital signal processing, which incorporates knowledge from the electrical
engineering and linguistic fields. In fact, a fundamental part of speech processing deals
with the extraction of useful information from speech signals and with the identification of
efficient representations for these data. The information extracted from speech signals is used
for a wide range of applications, which include the analysis of pathological voices and biometric
identification.
In the remainder of this section, two important areas of speech processing are described.
Section 2.2.1 reports on the many types of information that can be extracted from speech signals,
used both in speech recognition and in the analysis of voice quality. The section concludes
with a brief introduction to an area of speech processing dedicated to the creation of
speaker models, typically applied in the tasks of speaker identification and recognition. Then,
Section 2.2.2 is devoted to detailing the main building blocks of a conventional automatic
speech recognizer.
2.2.1 Speech analysis and speaker characterization
Speech analysis is an area of speech processing dedicated to providing a quantitative evaluation
of voice quality. It is typically used in paralinguistic or clinical speech studies to detect physio-
logical changes in voice production. In fact, there are many medical conditions that adversely
affect the voice. Diseases of the larynx are normally associated with breathiness and hoarse-
ness of the produced voice. Stroke or neurodegenerative diseases may cause the production
of inconsistent speech sounds (e.g., apraxia of speech) or the weakness of the muscles involved
in speech production (e.g., dysarthria). Overall, voice quality is assessed through the analy-
sis of phonation, articulation, and prosody. The study of phonation analyzes the alterations
that occur in the vocal folds vibration process (e.g., respiration). Measures of articulation as-
sess modifications that may happen in the positioning or shape of the speech organs (e.g.,
tongue, lips). Finally, the study of prosody examines variations of loudness, intonation, and
timing. Computationally, phonation, articulation, and prosody are measured through spectral
and temporal parameters of speech.
Speaker characterization is an area of speech processing whose goal is to identify or verify
the identity of a person based upon his/her voice. It is typically used in security, medical, and
forensic applications. In fact, the impressive advancements that have occurred in recent years
in voice-based solutions have raised a growing interest in the automatic verification of a speaker's
identity. Among the new consumer applications based on speech, one should also consider the
digital cloning of voice characteristics and voice conversion. While the latter offers new
solutions for privacy protection, it also brings the possibility of misusing the technology to
spoof someone's identity. Voice is a central part of our identity and offers a low-cost biometric
solution for the authentication of a person. This is due to the fact that speech carries impor-
tant information about a speaker, such as gender, age, language, and dialect. Various acoustic
features can be used to model and characterize a speaker’s identity. After describing some of
these measures in Section 2.2.1.1, a brief introduction to the subject of speaker characterization
is provided in Section 2.2.1.2.
2.2.1.1 Speech analysis
In the remainder of this section, first, a very short review of the human speech production sys-
tem is reported for completeness. Then, some features that are widely used in speech analysis
are introduced.
Speech production
On a physiological level, speech production begins in the lungs, which contract and push out
air that flows through the larynx and the glottis, the orifice between the vocal folds. Airflow
then proceeds through the pharynx, into the mouth between the tongue and palate, and is
finally emitted through the lips and the nose. A schematic representation of the human speech
Figure 2.3: (a) A schematic diagram of the human speech production apparatus. (b) Waveform of /sees/, showing a voiceless phoneme /s/, followed by a voiced sound, the vowel /iy/. The final sound, /z/, is a type of voiced consonant (Huang et al. 2001).
apparatus is shown in Figure 2.3(a).
Upper and lower lips, upper and lower teeth, tongue, and roof of the mouth are among the
major articulators contributing to the production of different sounds. Sounds can be classified
into subgroups with particular properties according to the speech production apparatus and to
the position and motion of the articulators. When the vocal folds are held close together and
oscillate against one another during a speech sound, the sound is said to be voiced. When the
folds are too slack or tense to vibrate periodically, the sound is said to be unvoiced. Voiced
sounds include vowels; their time and frequency structure presents a roughly regular pattern
that voiceless sounds, such as some consonants, lack. A voiced and an unvoiced sound are shown
in Figure 2.3(b). Articulation takes place in the mouth, between the oral cavity, which acts as
a resonator, and the articulators. The place and manner of articulation make it possible to
differentiate most speech sounds.
Fundamental frequency, formants, harmonics
The fundamental frequency, usually known as F0, corresponds to the rate of cycling (opening
and closing) of the vocal folds in the larynx during phonation of voiced sounds. The funda-
mental frequency is the lowest frequency of a speech signal, and it is usually perceived as the
loudest. It contributes more than any other single factor to the perception of pitch in speech,
the semi-musical rising and falling of voice tones.
The periodic glottal wave consists of the fundamental frequency and a number of harmonics
that are integral multiples of F0. When the shape of the vocal tract changes, the harmonics
present in the sound also change. More closure in the vocal folds will create stronger, higher
harmonics. The harmonics are not all of equal intensity; for example, the vowel /a/ typically has more energy than the vowel /o/ or /i/. Regions of frequency space where speech sounds
14
DFT Melfilterbank log IDFT Delta
12MFCC12Δ MFCC12ΔΔMFCC1energy1Δ energy1ΔΔenergy
Figure 2.4: Schematic representation of the extraction of a sequence of 39-dimensional MFCCfeature vectors.
carry a lot of energy are known as formants. They arise from the vocal tract, which filters the original sound source. Speakers change the resonance frequencies by moving the articulators and thereby changing the dimensions of the resonance cavities in the vocal tract. The first two formants, F1 and F2, are, from a linguistic point of view, the most important, since they uniquely identify or characterize the vowels.
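As an illustration of these concepts, a minimal F0 estimator can be sketched with the autocorrelation method: for a voiced frame, the autocorrelation function peaks at lags that are multiples of the glottal period. The function below is a simplified sketch (the signal, sampling rate, and search range are invented for the example; production systems use more robust pitch trackers).

```python
import numpy as np

def estimate_f0(frame, sr, fmin=50.0, fmax=500.0):
    """Estimate the fundamental frequency of a voiced frame with the
    autocorrelation method: F0 is the inverse of the lag of the highest
    autocorrelation peak within a plausible glottal period range."""
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(sr / fmax)   # shortest plausible period, in samples
    lag_max = int(sr / fmin)   # longest plausible period, in samples
    lag = lag_min + int(np.argmax(ac[lag_min:lag_max]))
    return sr / lag

# Synthetic voiced frame: 120 Hz fundamental plus its second harmonic
sr = 16000
t = np.arange(int(0.05 * sr)) / sr
frame = np.sin(2 * np.pi * 120 * t) + 0.5 * np.sin(2 * np.pi * 240 * t)
f0 = estimate_f0(frame, sr)   # close to 120 Hz
```

Note that the harmonic at 240 Hz does not fool the estimator: the autocorrelation at half the true period is negative for the fundamental component.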
Mel-frequency cepstral coefficients
Probably the most widely known parameterization of the speech signal, used both in speech recognition and in speech analysis, is the set of Mel-frequency cepstral coefficients (MFCCs) (Davis & Mermelstein 1980, Mermelstein 1976). They are of particular relevance because they provide
the ability to separate the vocal tract filter (the position of the tongue and the other articulators)
from information about the glottal source (the energy of the lungs). To compute the MFCC, first
spectral information is extracted from a speech sample through the Discrete Fourier Transform
(DFT). Then, the result of the DFT, which corresponds to the amount of energy at each fre-
quency band, is warped onto the mel scale. A mel (Stevens & Volkmann 1940, Stevens et al.
1937) is a unit of pitch defined so that pairs of sounds which are perceptually equidistant in
pitch are separated by an equal number of mels. The mapping between frequency in Hertz and
the mel scale is linear below 1000 Hz and logarithmic above 1000 Hz. This conversion approximates the sensitivity of the human ear, which is less discriminative at higher frequencies, roughly above 1000 Hz. The next step in MFCC feature extraction consists of taking the log of
each of the mel spectrum values. This is motivated by the fact that, in general, human response
to signal level is logarithmic. Humans are less sensitive to slight differences in amplitude at
high amplitudes than at low amplitudes. Finally, the spectrum of the log spectrum is com-
puted through the inverse Discrete Fourier Transform (IDFT). The result of this transformation
is called the cepstrum.
For the purposes of MFCC extraction, generally, the first 12 cepstral values are considered.
These 12 coefficients represent information solely about the vocal tract filter, cleanly separated from information about the glottal source. For speech recognition, feature vectors of 39
coefficients are usually computed. The first twelve features are the MFCCs, whereas the thir-
teenth feature corresponds to the energy of the frame. This is computed as the sum over time
of the power of the samples in the frame. Then, for each of the 13 features, a delta (or velocity) feature and a double-delta (or acceleration) feature are added. Each of the delta features represents the change between frames in the corresponding cepstral feature, while each of the double-delta features represents the change between frames in the corresponding delta feature.
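The core of the pipeline in Figure 2.4 can be sketched in a few lines of NumPy. The sketch below follows standard practice and uses the DCT-II as the final transform (the usual practical stand-in for the IDFT of the log spectrum); windowing, pre-emphasis, and the delta computation are omitted, and the filterbank size and test tone are arbitrary choices for the example.

```python
import numpy as np

def hz_to_mel(f):
    # Approximately linear below 1 kHz, logarithmic above
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters whose centers are equally spaced on the mel scale."""
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def mfcc_frame(frame, sr, n_filters=26, n_ceps=12):
    power = np.abs(np.fft.rfft(frame)) ** 2                  # DFT -> power spectrum
    log_mel = np.log(mel_filterbank(n_filters, len(frame), sr) @ power + 1e-10)
    # DCT-II of the log mel energies; coefficients 1..12 (c0 is replaced by energy)
    i = np.arange(1, n_ceps + 1)[:, None]
    n = np.arange(n_filters)[None, :]
    basis = np.cos(np.pi * i * (2 * n + 1) / (2 * n_filters))
    return basis @ log_mel                                   # 12 cepstral coefficients

sr = 16000
t = np.arange(512) / sr
ceps = mfcc_frame(np.hamming(512) * np.sin(2 * np.pi * 300 * t), sr)
```

In a full front end, this computation is repeated for every frame and the energy, delta, and double-delta terms are appended to form the 39-dimensional vectors described above.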
Linear Prediction Coefficients
Another approach frequently used in speech analysis is Linear Prediction Coefficients
(LPC) (Atal & Schroeder 1968, Burg 1967, Itakura & Saito 1968, Makhoul 1973, Markel & Gray
1973). This method represents the spectral envelope of a speech signal in a compressed form, providing an accurate estimation of speech parameters. The basic idea behind LPC is the source-filter model: the vocal folds produce the sound source, while the vocal tract constitutes the acoustic filter. In particular, LPC assumes that the vocal tract can be approximated as a lossless tube characterized by its resonances, which give rise to formants. The transfer function of a lossless tube can be described by an all-pole linear filter. With a sufficient number of poles,
this type of filter is a good approximation for speech signals. LPC derives its name from the
fact that it predicts the current sample as a linear combination of past samples. The predictor
coefficients are estimated using short-term analysis. A segment of speech is selected in the
proximity of a given sample, then the LPC coefficients are estimated using the criterion of min-
imum mean squared error. The corresponding coefficients are those that minimize the total
prediction error. LPC analyzes the speech signal by estimating the formants, removing their
effects, and then estimating the intensity and frequency of the remaining signal. The process of
removing the formants is called inverse filtering, and the remaining signal after the subtraction
of the filtered modeled signal is called the residue.
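The short-term estimation described above can be sketched as follows, using the autocorrelation method and the Toeplitz normal equations. The AR(2) test process and its coefficients are invented for the example; toolkits typically solve the same equations with the more efficient Levinson-Durbin recursion.

```python
import numpy as np

def lpc(signal, order):
    """Estimate linear prediction coefficients with the autocorrelation
    method: minimize the mean squared prediction error by solving the
    Toeplitz normal equations R a = r."""
    x = np.asarray(signal, dtype=float)
    n = len(x)
    r = np.array([np.dot(x[:n - k], x[k:]) for k in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:])
    # Inverse filtering: the residue is x[t] - sum_k a[k] * x[t - 1 - k]
    return a

# The predictor recovers a known AR(2) process:
# x[t] = 1.3 x[t-1] - 0.6 x[t-2] + e[t]
rng = np.random.default_rng(0)
x = np.zeros(4000)
for t_ in range(2, 4000):
    x[t_] = 1.3 * x[t_ - 1] - 0.6 * x[t_ - 2] + 0.1 * rng.standard_normal()
a = lpc(x, 2)   # approximately [1.3, -0.6]
```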
Perceptual Linear Prediction coefficients
The short term power spectrum of speech estimated by LPC is widely used as it is a simple and
effective way of estimating the main parameters of speech signals. However, one of the main
disadvantages of LPC is that it approximates in the same way all frequencies of the analysis
band. This property is inconsistent with human hearing, as beyond about 800 Hz the spectral
resolution of hearing decreases with frequency. Additionally, hearing is more sensitive in the
middle frequency range of the audible spectrum. As a consequence, LPC does not always preserve
Figure 2.5: Jitter and Shimmer perturbation measures in a speech signal (Teixeira & Fernandes 2014).
or discard spectral details according to their auditory prominence. Perceptual Linear Prediction coefficients (PLP) (Hermansky 1990) overcome this drawback by applying critical-band filtering to the linear predictive analysis. The power spectrum is first warped onto the Bark scale, and the Bark-scaled spectrum is then convolved with the power spectrum of the critical-band filter. This simulates the frequency resolution of the ear, which is approximately constant on the Bark scale. In this way, PLP features approximate the behavior of the human auditory system.
Jitter, shimmer
Jitter and shimmer are two measures of period perturbation commonly used in a comprehen-
sive voice examination (Brockmann et al. 2011, Farrus et al. 2007, Kreiman & Gerratt 2003, Silva
et al. 2009, Titze & Martin 1998). They assess the micro-instability of vocal fold vibrations from one glottal cycle to the next. As shown in Figure 2.5, jitter measures the cycle-to-cycle variations of the fundamental glottal period, while shimmer measures the cycle-to-cycle variations of the amplitude of the glottal periods. Jitter is affected mainly by the lack of control of the vibration of the vocal folds. The voices of patients with
pathologies often have higher values of jitter. The shimmer changes with the reduction of glot-
tal resistance and mass lesions on the vocal cords and it is correlated with the presence of noise
emission and breathiness (Teixeira et al. 2013). It is expected that patients with pathologies
have higher values of shimmer.
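The local (cycle-to-cycle) variants of the two measures can be sketched directly from their definitions; the period and amplitude values below are invented for the example, and practical tools (e.g. Praat) also offer several related variants (RAP, PPQ5, APQ3, and so on).

```python
import numpy as np

def local_jitter(periods):
    """Local jitter (%): mean absolute difference between consecutive
    glottal periods, relative to the mean period."""
    periods = np.asarray(periods, dtype=float)
    return 100.0 * np.mean(np.abs(np.diff(periods))) / np.mean(periods)

def local_shimmer(amplitudes):
    """Local shimmer (%): the same perturbation measure applied to the
    peak amplitude of each glottal cycle."""
    amplitudes = np.asarray(amplitudes, dtype=float)
    return 100.0 * np.mean(np.abs(np.diff(amplitudes))) / np.mean(amplitudes)

# Cycle-to-cycle periods (ms) and peak amplitudes of a sustained vowel
periods = [8.0, 8.1, 7.9, 8.05, 7.95]
amps = [0.80, 0.82, 0.78, 0.81, 0.79]
j = local_jitter(periods)    # about 1.72 %
s = local_shimmer(amps)      # about 3.44 %
```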
Vowel space area
The Vowel Space Area (VSA) is an acoustic index commonly used to assess the ability to prop-
erly articulate vowels (Kent & Kim 2003, Kuhl et al. 1997, Vorperian & Kent 2007). Some speech
Figure 2.6: Quadrilateral and triangular vowel space area for healthy subjects. The picture was extracted from the work of Vizza et al. (2017), on the use of the speech signal for studying sclerosis.
disorders, like dysarthria for instance, may be characterized by a centralization of vowels. This
phenomenon is associated with a reduction in the amount of movement of the tongue in pronouncing the vowel. As a result, a centralized vowel is closer to the midpoint of the vowel space than its referent vowel. In fact, due to centralization, formants that normally have a high center frequency tend to shift to a lower frequency, while formants that normally have a low center frequency tend to shift to a higher frequency.
VSA refers to the two-dimensional area bounded by the lines connecting the coordinate vertices of the different vowels. The coordinates are obtained by plotting the F1 frequency as a
function of the F2 frequency. When the vowels /i/, /u/, and /A/ are considered, one is dealing
with a triangular vowel space. When the vowels /i/, /u/, /A/, and /æ/ are considered, the
resulting shape is a quadrilateral. A representation of a quadrilateral and triangular vowel
space area of healthy adults is shown in Figure 2.6.
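Given (F1, F2) coordinates for the corner vowels, the VSA is simply the area of the resulting polygon, which can be computed with the shoelace formula. The formant values below are invented, broadly plausible numbers for a healthy adult, used only to illustrate the computation.

```python
import numpy as np

def vowel_space_area(formants):
    """Area (shoelace formula) of the polygon whose vertices are the
    (F1, F2) coordinates of the corner vowels, given in vertex order."""
    pts = np.asarray(formants, dtype=float)
    x, y = pts[:, 0], pts[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

# Illustrative (F1, F2) values in Hz for the three corner vowels
tri = vowel_space_area([(300, 2300),   # /i/
                        (750, 1200),   # /A/
                        (350, 800)])   # /u/
# tri is expressed in Hz^2; adding /ae/ as a fourth vertex gives the
# quadrilateral variant
```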
Vowel articulation index
Although the VSA should reflect changes in articulatory function, several studies found that
it failed to differentiate individuals perceptually judged to have abnormal articulation or poor
speech intelligibility (Ansel & Kent 1992, Bunton & Weismer 2001, Sapir et al. 2007, Weismer
et al. 2001). One possible explanation could be the large inter-speaker variability associated
with vowel formant measurements in general. To cope with this problem, Sapir (2006) introduced the Vowel Articulation Index (VAI), an acoustic metric of vowel formant production, designed to minimize the effects of inter-speaker variability and to maximize sensitivity to formant
centralization and decentralization. The VAI can be calculated with the following formula: (F2/i/ + F1/A/) / (F2/u/ + F2/A/ + F1/i/ + F1/u/), where F2/i/ refers to the second formant of the vowel /i/, F1/A/ refers to the first formant of the vowel /A/, and so on.
The vowel-formant elements in the VAI ratio are arranged such that elements in the nu-
merator (F2/i/, F1/A/) will decrease, while elements in the denominator (F2/u/, F2/A/, F1/i/,
F1/u/) will increase in the case of vowel centralization. In American English, the normal VAI
values are expected to be close to 1.0, as the sum of formant frequencies in the denominator is
very similar to the sum of formant frequencies in the numerator. For this reason, the VAI may
be considered a function that normalizes the relationships between the vowels across speakers.
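The formula and the direction of its terms can be made concrete with a small sketch; all formant values are invented for the example, and the second call simulates centralization by shifting each formant toward the middle of the range.

```python
def vai(f1_i, f2_i, f1_u, f2_u, f1_a, f2_a):
    """Vowel Articulation Index (Sapir 2006):
    (F2/i/ + F1/A/) / (F2/u/ + F2/A/ + F1/i/ + F1/u/)."""
    return (f2_i + f1_a) / (f2_u + f2_a + f1_i + f1_u)

# Illustrative formant values (Hz) for a non-centralized speaker
healthy = vai(f1_i=300, f2_i=2300, f1_u=350, f2_u=800, f1_a=750, f2_a=1200)
# Centralization lowers the numerator terms and raises the denominator
# terms, so the VAI drops below the healthy value
central = vai(f1_i=400, f2_i=2000, f1_u=450, f2_u=1000, f1_a=650, f2_a=1300)
```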
2.2.1.2 Speaker characterization
In the area of speaker recognition, speaker modeling techniques aim at building dependable
speaker models in order to establish or confirm the identity of a speaker from a speech signal.
These approaches are used in the resolution of two related problems: speaker verification and identification. While the first aims at verifying the truthfulness of a claim of identity, the latter involves establishing the identity of an unknown speaker from a voice sample. Speaker verification requires an enrollment phase to create the speaker model that is then used during the
verification phase. Speaker identification, on the other hand, requires a set of several labeled
speaker models that are used during the comparison with the unknown voice sample. For both
problems, the creation of the speaker models comprises a phase of preprocessing and feature
extraction analogous to the ones described in the previous sections. MFCCs are the features
most typically used to characterize the speaker models.
There are several approaches to the problem of speaker modeling, among them the GMM-
UBM framework is briefly introduced in the following section. This method has been exten-
sively used to establish the identity of a user, primarily in the area of speaker (Campbell et al.
2009, Liu et al. 2006, Zheng et al. 2004), language (Torres-Carrasquillo et al. 2004, Wong & Srid-
haran 2002, Yin et al. 2006), and emotion recognition (Bao et al. 2007, Kockmann et al. 2011, Wu
et al. 2006). Recently, it has been exploited also for modeling speech disorders (Bocklet et al.
2011, Orozco-Arroyave et al. 2016).
GMM-UBM
Reynolds (2009a) defined a Universal Background Model (UBM) as a model used in a biometric
verification system to represent general, person-independent feature characteristics. These fea-
tures are then compared against a model of person-specific feature characteristics to make an
accept or reject decision. Typically, in a speaker verification system, the UBM is modeled by a
Gaussian Mixture Model (GMM) trained with the Expectation-Maximization (EM) algorithm.
This method is used to find maximum likelihood parameters of a statistical model. In order to
represent general speech characteristics, the UBM is typically created using the speech samples
from a large number of speakers. In this case, it is important that the subpopulations composing the data should be balanced. For example, when using gender-independent data, one should ensure that there is a balance of male and female speech. Otherwise, the final model will be biased
towards the dominant subpopulation. Another approach considers the training of individual
UBMs over the subpopulations in the data, such as one for male and one for female speech, and then combining the subpopulation models. This method provides the advantage that one can effectively use unbalanced data and can carefully control the composition of the
final UBM.
A speaker-dependent GMM, on the other hand, may be trained using the speech sam-
ples of a particular enrolled speaker. Alternatively, a speaker-dependent GMM may be de-
rived by adapting the parameters of the UBM using the speaker personal training data and
Maximum a Posteriori (MAP) estimation. This last approach provides a tighter coupling be-
tween the speaker’s model and the UBM, resulting in better performance than decoupled mod-
els (Reynolds et al. 2000). Then, in the case of speaker identification, each test segment is scored
against all speaker models to determine who is speaking. In the case of speaker verification,
each test segment is scored against the background model and a given speaker model to accept
or reject an identity claim. Notably, the GMM-UBM paradigm is the basis for some of the
most successful developments in the field of speaker characterization, including factor analysis
methods, such as i-vectors (Dehak et al. 2010).
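The core of the GMM-UBM recipe, EM training of the UBM, MAP adaptation of the means, and log-likelihood-ratio scoring, can be sketched with scikit-learn's GaussianMixture. The two-dimensional synthetic "features", component counts, and relevance factor are arbitrary choices for the example; real systems operate on MFCC streams with hundreds of components.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# UBM: a GMM trained with EM on features pooled from many background speakers
background = np.concatenate([rng.normal(m, 1.0, size=(500, 2))
                             for m in (-4.0, 0.0, 4.0)])
ubm = GaussianMixture(n_components=3, covariance_type="diag",
                      random_state=0).fit(background)

def map_adapt_means(ubm, X, relevance=16.0):
    """Speaker model via MAP adaptation of the UBM means (Reynolds et al.
    2000); weights and covariances stay coupled to the UBM."""
    resp = ubm.predict_proba(X)                      # frame responsibilities
    n_k = resp.sum(axis=0)                           # soft counts per component
    ex_k = resp.T @ X / np.maximum(n_k, 1e-10)[:, None]
    alpha = (n_k / (n_k + relevance))[:, None]       # data-dependent mixing
    return alpha * ex_k + (1.0 - alpha) * ubm.means_

# Enrollment: derive the speaker-dependent GMM from the UBM
enroll = rng.normal(3.5, 1.0, size=(200, 2))
speaker = GaussianMixture(n_components=3, covariance_type="diag")
speaker.weights_ = ubm.weights_
speaker.covariances_ = ubm.covariances_
speaker.precisions_cholesky_ = ubm.precisions_cholesky_
speaker.means_ = map_adapt_means(ubm, enroll)

# Verification: average log-likelihood ratio of a test segment
test_seg = rng.normal(3.5, 1.0, size=(50, 2))
llr = speaker.score(test_seg) - ubm.score(test_seg)  # > 0 -> accept the claim
```

The decision is then taken by comparing `llr` against a threshold tuned on development data.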
2.2.2 Automatic speech recognition
Traditional speech recognition systems do not actually perform the recognition or decoding
step directly on the speech signal. Rather, the recognition process is composed of several
phases: preprocessing and feature extraction, acoustic modeling, language modeling, and the
decoding phase. The overall process and its major components are represented in Figure 2.7,
and described hereafter.
Preprocessing and feature extraction
Initially, the speech waveform is preprocessed in order to enhance it and better prepare it for
the following phases. Typical preprocessing approaches may include preemphasis, which corresponds to boosting the energy in the high frequencies, and noise reduction. More advanced
Figure 2.7: Schematic representation of the main modules constituting an automatic speech recognizer.
techniques may consider background and overlapped speech removal. Then, the input signal
is divided into short frames of samples, which are converted to a meaningful set of features.
The duration of the frames is selected so that the speech waveform can be regarded as being
stationary. The goal of the feature extraction step is to derive, from each frame, a parameterised
version of the input signal that captures its important qualities while discarding unimportant
and distracting characteristics. More in detail, features should be robust against noise, acoustic variations, and other events that are irrelevant for the recognition process. Features should also be sensitive to linguistic context, making it possible to distinguish between different linguistic units (e.g., phones). Features typically used in speech recognition are the MFCCs or PLPs, introduced in Section 2.2.1.1.
Acoustic model
The next stage in the recognition process is to map the speech vectors obtained in the previous step onto the underlying sequence of acoustic classes modeling concrete symbols
(such as phonemes, letters, and words). Acoustic modeling is arguably the central part of
any speech recognition system, playing a critical role in improving ASR performance. The
practical challenge is how to build accurate acoustic models that can truly reflect the spoken
language to be recognized. Typically, subword models such as phonemes, diphones, or triphones are used as the acoustic modeling unit more often than whole-word models. An extended
and successful statistical parametric approach to speech recognition is the Hidden Markov
Model paradigm (Rabiner 1989, Rabiner et al. 1993) that supports both acoustic and tempo-
ral modeling. HMMs model the sequence of feature vectors as a piecewise stationary pro-
cess. An utterance X=x1, . . . , xn, . . . , xN is modeled as a succession of discrete stationary states
Q=q1, . . . , qk, . . . , qK, K<N, with instantaneous transitions between these states. An HMM is
typically defined as a stochastic finite state automaton, usually with a left-to-right topology. It is called a "hidden" Markov model because the underlying stochastic process (the sequence of states) is not directly observable, but still affects the observed sequence of acoustic features. HMMs have been successfully used in combination with Gaussian Mixture Models, a parametric probability density function represented as a weighted sum of Gaussian component densities (Reynolds 2009b). In this approach, a GMM is associated with each state in order to describe
local characteristics of the data. For many years, the GMM-HMM paradigm represented the
dominant technology in acoustic modeling (Baker et al. 2009). Recently, deep learning methods
have been shown to outperform conventional GMM-based modeling approaches by achieving
important improvements in terms of recognition accuracy (Abdel-Hamid et al. 2012).
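The probability that an HMM assigns to an observation sequence is computed with the forward algorithm, which sums over all hidden state sequences. The sketch below works in the log domain to avoid underflow; the 3-state left-to-right topology and the randomly generated observation log-likelihoods stand in for the per-state GMM scores of a real phone model.

```python
import numpy as np

def logsumexp(a, axis=None):
    m = np.max(a, axis=axis, keepdims=True)
    m = np.where(np.isfinite(m), m, 0.0)        # guard against all -inf slices
    with np.errstate(divide="ignore"):
        s = m + np.log(np.sum(np.exp(a - m), axis=axis, keepdims=True))
    return np.squeeze(s, axis=axis) if axis is not None else s.item()

def forward_loglik(log_pi, log_A, log_B):
    """Forward algorithm in the log domain: total log-likelihood of the
    observations under the HMM, summing over all hidden state paths.
    log_pi: (K,) initial log-probs; log_A: (K, K) transition log-probs;
    log_B: (N, K) per-frame observation log-likelihoods (state output
    models, e.g. one GMM per state)."""
    alpha = log_pi + log_B[0]
    for t in range(1, len(log_B)):
        alpha = logsumexp(alpha[:, None] + log_A, axis=0) + log_B[t]
    return logsumexp(alpha)

# A 3-state left-to-right HMM: each state either loops or advances
with np.errstate(divide="ignore"):
    log_pi = np.log(np.array([1.0, 0.0, 0.0]))
    log_A = np.log(np.array([[0.6, 0.4, 0.0],
                             [0.0, 0.7, 0.3],
                             [0.0, 0.0, 1.0]]))
rng = np.random.default_rng(1)
log_B = rng.normal(-2.0, 0.5, size=(5, 3))      # stand-in observation scores
ll = forward_loglik(log_pi, log_A, log_B)
```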
Language model
Knowledge of the rules of a language, the way in which words are connected together into
phrases, is expressed by the language model. It is an important building block in the recognition process, as it is used to guide the search for an interpretation of the acoustic input. There are
two types of models that describe a language: grammar-based and statistical-based language
models. When the range of sentences to be recognized is very small, it can be captured by a
deterministic grammar that describes the set of allowed phrases. In large vocabulary applica-
tions, on the other hand, it is too difficult to write a grammar with sufficient coverage of the
language; therefore a stochastic grammar, typically an n-gram model, is often used. When subword models are used, the word model is then obtained by concatenating the subword models
according to the pronunciation transcription of the words provided by a dictionary or lexical
model. The purpose of a vocabulary is to map the orthography of the words to the units that
model the actual acoustic realization of the vocabulary entries. Lexicon generation may rely
on manual dictionaries or on automatic grapheme-to-phoneme modules, which may follow rule-based, data-driven, or hybrid approaches.
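A stochastic grammar of the kind mentioned above can be sketched as a bigram model with add-alpha smoothing, so that unseen word pairs still receive a small probability. The training sentences, smoothing constant, and sentence markers are invented for the example.

```python
from collections import Counter

def train_bigram_lm(sentences, alpha=0.1):
    """Bigram language model with add-alpha smoothing:
    P(w2 | w1) = (c(w1, w2) + alpha) / (c(w1) + alpha * |V|)."""
    unigrams, bigrams = Counter(), Counter()
    vocab = set()
    for s in sentences:
        toks = ["<s>"] + s.split() + ["</s>"]
        vocab.update(toks)
        unigrams.update(toks[:-1])               # histories only
        bigrams.update(zip(toks, toks[1:]))
    V = len(vocab)

    def prob(w1, w2):
        return (bigrams[(w1, w2)] + alpha) / (unigrams[w1] + alpha * V)
    return prob

prob = train_bigram_lm(["the patient speaks slowly",
                        "the patient speaks clearly"])
p_seen = prob("patient", "speaks")    # seen continuation: most of the mass
p_unseen = prob("patient", "walks")   # unseen continuation: smoothed remainder
```

During decoding, the log of these probabilities is added to the acoustic scores to rank competing word sequences.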
A different speech recognition task, known as keyword spotting (KWS), can be used when
the expected result of the recognition is limited to a reduced number of isolated words. These approaches search the continuous audio stream for a certain set of words of interest.
Broadly, KWS methods can be classified into two categories: those based on Large Vocabulary Continuous Speech Recognition (LVCSR) and those based on acoustic matching of speech with keyword
models in contrast to a background model (Szoke et al. 2005). Methods based on LVCSR search
for the target keywords in the recognition results, usually in lattices or confusion networks.
Acoustic approaches, on the other hand, are very closely related to Isolated Word Recognition
(IWR). The language model in this case contains the words that should be recognized. Ad-
ditionally, they incorporate an alternative competing model to the list of keywords generally
known as background, garbage or filler speech model. A robust background model must be
able to provide low recognition likelihoods for the keywords and high likelihoods for out-of-
vocabulary words in order to minimize false alarms and false rejections (Abad et al. 2013).
Decoding
The last step in the recognition process is the decoding phase, whose purpose is to find a sequence of words whose corresponding acoustic and language models best match the input
signal. Therefore, such a decoding process with trained acoustic and language models is of-
ten referred to as a search process. Its complexity varies according to the recognition strategy
and to the size of the vocabulary. With IWR, word boundaries are known; thus, the word with the highest forward probability is chosen as the recognized word, and the search problem becomes a simple pattern recognition problem. Search in Continuous Speech Recognition (CSR),
on the other hand, is more complicated since the search algorithm has to consider the possi-
bility of each word starting at any arbitrary time frame. Also, for small vocabulary tasks, it is
possible to expand the whole search network defined by the language and lexical restrictions
to directly apply conventional time-synchronous Viterbi search. However, LVCSR systems require different strategies. These span from graph compaction techniques and on-the-fly expansion of the search space (Ortmanns & Ney 2000) to heuristic methods.
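The time-synchronous Viterbi search mentioned above can be sketched for a tiny, fully expanded network; the two-state model and observation scores are invented so that the best path is forced to switch states halfway through. Real decoders add beam pruning and language model scores on top of this recursion.

```python
import numpy as np

def viterbi(log_pi, log_A, log_B):
    """Time-synchronous Viterbi search: the single most likely state
    sequence for the observation log-likelihoods in log_B (N frames x K
    states), with backtracking through the stored best predecessors."""
    N, K = log_B.shape
    delta = log_pi + log_B[0]
    back = np.zeros((N, K), dtype=int)
    for t in range(1, N):
        scores = delta[:, None] + log_A          # (from-state, to-state)
        back[t] = np.argmax(scores, axis=0)
        delta = scores[back[t], np.arange(K)] + log_B[t]
    path = [int(np.argmax(delta))]
    for t in range(N - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy 2-state example: the observations favor state 0 first, then state 1
log_pi = np.log(np.array([0.9, 0.1]))
log_A = np.log(np.array([[0.8, 0.2],
                         [0.2, 0.8]]))
log_B = np.log(np.array([[0.9, 0.1],
                         [0.9, 0.1],
                         [0.1, 0.9],
                         [0.1, 0.9]]))
best = viterbi(log_pi, log_A, log_B)   # [0, 0, 1, 1]
```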
2.3 Machine learning
Machine learning is a field of artificial intelligence dedicated to the study of adaptive mechanisms that enable computers to learn from experience, by example, and by analogy.
Learning capabilities can improve the performance of an intelligent system over time. Machine learning algorithms build a statistical model from sample data; the model is then used to
perform specific tasks, like making predictions or decisions over new, unknown data. From a
broad perspective, the first step in the definition of a machine learning problem is the identifica-
tion of a domain of interest. This corresponds to defining the feature space. Then, the following
step is to train a model able to identify, from the feature space, significant relations for the prob-
Figure 2.8: A typical workflow used in the training process of a machine learning model.
lem at hand. In practice, this process usually includes several additional intermediate stages.
In fact, defining a feature space, besides the identification of relevant features, typically also involves stages of data preprocessing, feature extraction, and, possibly, feature selection. Then, on
these features, the actual training of the machine learning model is performed. Finally, the
trained model is evaluated to understand its ability to process new information, not previously
analyzed. These steps are visually represented in Figure 2.8.
In the following, in Section 2.3.1, some common models used in machine learning prob-
lems are introduced, followed by a brief review on different feature selection approaches (Sec-
tion 2.3.2). Then, Section 2.3.3 describes two model evaluation approaches that are used to
assess the performance of a machine learning model and two common evaluation metrics used
in the areas of classification and speech recognition.
2.3.1 Machine learning models
Machine learning approaches are usually classified into two categories: supervised and unsu-
pervised learning. In the first one, the learning task infers a function from examples of data
consisting of input-output pairs. For example, if the task consists of determining whether the speech of an individual is affected by a specific impairment, the data provided to the model would include speech signals with an associated label designating whether the clinical condition is verified
or not. Of course, the data provided to the model would include speech signals of people with
and without the impairment. Typically, supervised learning is modeled as a problem of clas-
sification or regression. Classification algorithms are used when the outputs are restricted to a
known set of values. According to the previous example, the output would be the presence or
absence of a clinical condition. Regression algorithms provide a continuous output, meaning
that any value within a given range is possible. An example would be the severity of a disease
on a scale from 0 to 30. In unsupervised learning, the learning task infers a function from examples of data containing only inputs, but no desired output labels. In this way, unsupervised
methods are able to discover structure and patterns in the data.
In the literature, there are many computational models used for a variety of machine learn-
Figure 2.9: (a) Three possible hyperplanes for an SVM trained with linearly separable data; the best hyperplane is shown with a solid line. (b) The hyperplane with the maximum distance from the data points. (c) Soft margin allowing some classification errors. (d) Non-linearly separable data.
ing tasks. Here, this review is limited to introducing some of the most common methods that are later referred to in this document. All of them may be used to model either a classification or a regression problem.
2.3.1.1 Support Vector Machine
Support Vector Machine (SVM) (Vapnik 1963) is a family of discriminative binary linear classi-
fiers. This means that the algorithm learns a decision function directly from the data in order
to classify it into two possible groups. To do that, the algorithm constructs a hyperplane in a
high-dimensional space, which can be used to separate the input data (Figure 2.9(a)). A good separation is achieved by the hyperplane that maximizes the margin, that is, the distance to the nearest data point of any class. An example of a maximum margin is
depicted in Figure 2.9(b).
The traditional SVM algorithm fails to find a hyperplane when the input data is not linearly separable, as shown in Figure 2.9(d). Some extensions have been proposed to solve this issue. The first, known as the soft margin (Cortes & Vapnik 1995), accounts for small deviations by allowing a reduced number of points close to the boundary to be misclassified (Figure 2.9(c)). The number of possible misclassifications is governed by a free parameter called the cost, which corresponds to the penalty associated with a classification error. Higher values of the cost make it less likely that the algorithm will misclassify a point. This method, while it provides a solution for simpler problems, still fails to classify non-separable data such as that shown in Figure 2.9(d). Thus, another approach relies on the use of kernel methods (Boser et al. 1992), which perform a mapping of the original classification problem into another
metric space in which it is separable. Generally, the transformed space has a higher dimen-
sionality, with each of the dimensions being a combination of the original problem variables.
Common types of kernels used to separate non-linear data are polynomial and radial basis
Figure 2.10: A decision tree for the concept PlayTennis. This tree classifies Saturday mornings according to whether or not they are suitable for playing tennis (Mitchell 1997).
kernels. SVM can also be used to solve multiclass classification problems. Typically, this is
achieved by reducing the original task into multiple binary classification problems (Duan &
Keerthi 2005).
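The effect of the kernel can be illustrated with scikit-learn's SVC on synthetic, non-linearly separable data (one class forming a ring around the other); the geometry and the value of the cost parameter C are arbitrary choices for the example.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Non-linearly separable data: one class inside a ring formed by the other
n = 200
radius = np.concatenate([rng.uniform(0.0, 1.0, n), rng.uniform(2.0, 3.0, n)])
angle = rng.uniform(0.0, 2.0 * np.pi, 2 * n)
X = np.column_stack([radius * np.cos(angle), radius * np.sin(angle)])
y = np.concatenate([np.zeros(n), np.ones(n)])

# C is the "cost" parameter governing the soft-margin penalty; the RBF
# kernel implicitly maps the points into a space where they are separable
linear = SVC(kernel="linear", C=1.0).fit(X, y)
rbf = SVC(kernel="rbf", C=1.0).fit(X, y)
acc_linear = linear.score(X, y)   # no line can separate the ring
acc_rbf = rbf.score(X, y)         # near-perfect separation
```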
2.3.1.2 Decision Tree
A Decision Tree (DT) is a decision support tool that uses a tree-like model. Each node in the
tree specifies a test on an attribute, and each branch descending from that node corresponds to
one of the possible outcomes of the test. Each leaf node represents a decision, or a class label.
An instance is classified by sorting it down the tree from the root to some leaf node. A visual
representation of a DT for the problem PlayTennis is shown in Figure 2.10.
In the machine learning area, the Decision Tree (DT) (Mitchell 1997) is used as a non-parametric supervised learning method for both classification and regression. This model predicts the
value of a target variable by learning simple decision rules inferred from the data features.
Tree models where the target variable can take a discrete set of values are called classification
trees. In these tree structures, leaves represent class labels and branches represent conjunctions
of features that lead to those class labels. Decision trees where the target variable can take con-
tinuous values are called regression trees. They are similar to classification trees, except that a
regression model is fitted to each node to give the predicted value of the target variable. Common decision tree algorithms include Iterative Dichotomiser 3 (ID3) (Quinlan 1986), C4.5 (Quinlan 2014), and Classification and Regression Trees (CART) (Breiman 1984).
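A learned tree such as the one in Figure 2.10 amounts to a set of nested decision rules. The sketch below encodes Mitchell's well-known PlayTennis tree directly as conditionals (the function name and attribute values follow that textbook example; this is an illustration of how a tree classifies an instance, not a learning algorithm):

```python
def play_tennis(outlook, humidity, wind):
    """Classify a day by sorting it down the tree of Figure 2.10."""
    if outlook == "Overcast":
        return "Yes"                                  # overcast days are always suitable
    if outlook == "Sunny":
        return "No" if humidity == "High" else "Yes"  # sunny days depend on humidity
    return "No" if wind == "Strong" else "Yes"        # rainy days depend on wind

print(play_tennis("Sunny", "High", "Weak"))  # No
```

Each call traces a single path from the root to a leaf, which is exactly the classification procedure described in the text.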
2.3.1.3 Random Forest
DTs present many advantages: they are simple to understand and interpret, training is straightforward, and classification is fast. However, as with any other machine learning method, they also have important disadvantages. The most important is that DT learners cannot be grown to arbitrary complexity without losing generalization accuracy on unseen data. For this reason, Ho (1995) proposed a method to construct tree-based classifiers whose capacity can be arbitrarily expanded. This is achieved by building multiple trees in randomly selected subspaces of the feature space. Trees built in this way generalize their classification in complementary ways, and their combined classification shows a monotonic improvement. The idea of random subspace selection of Ho (1995) influenced the design of the Random Forest (RF), later proposed by Breiman (2001). An RF is a way of building a forest of uncorrelated trees that are trained on different parts of the same dataset. To do so, RFs rely on bootstrap aggregating, a meta-algorithm in which new training sets are generated by sampling, uniformly and with replacement, from the original dataset. Each tree in the ensemble votes for the most popular class, and the votes of the individual trees are then combined. With respect to standard decision trees, RFs provide reduced interpretability, but they generally boost the performance of the final model considerably.
2.3.1.4 Artificial Neural Network
An Artificial Neural Network (ANN) (Ivakhnenko & Lapa 1966, 1967, Rosenblatt 1958) can
be defined as an information-processing paradigm inspired by the structure and functions of
the human brain. The brain consists of a densely interconnected set of nerve cells, called neu-
rons. A neuron consists of a cell body, soma, a number of fibres called dendrites, and a sin-
gle long fibre called the axon. Electrical or chemical signals between neurons are exchanged
through synapses. An ANN consists of a number of simple and highly interconnected proces-
sors, called neurons, which are analogous to the biological neurons in the brain. In a similar
way to synapses, signals from one neuron to another are passed by weighted links connecting
neurons. A biological and an artificial neuron are depicted in Figure 2.11(a), and Figure 2.11(b).
The simplest form of ANN is the perceptron (Rosenblatt 1958), which consists of a single neuron with adjustable synaptic weights. This type of architecture is able to solve only linearly separable problems. This limitation is overcome with more advanced forms of neural network, namely the Multilayer Perceptron (MLP), a feedforward neural network with one or more hidden layers. Typically, the network consists of an input layer of source neurons, at least one
middle layer of computational neurons, and an output layer of computational neurons. The
Figure 2.11: (a) A schematic drawing of a biological neural network with two neurons, (b) a diagram of an artificial neuron, (c) an architecture of an artificial neural network with three layers (Negnevitsky 2005).
input signals are propagated in a forward direction on a layer-by-layer basis. An example of
an ANN with three layers is provided in Figure 2.11(c). Similarly to the human brain, ANNs are able to learn; in other words, they are able to use experience to improve their performance. While in a biological neural network learning involves adjustments to the synapses, an ANN learns through repeated adjustments of the weights. Different methods have been proposed for learning, the most popular of which is the back-propagation algorithm (Bryson
& Ho 1969). With this approach, the network is presented with a training set of input sam-
ples. The network then propagates the input samples from layer to layer until an output is
generated by the output layer. If this outcome is different from the desired output, an error is
calculated and then propagated backwards through the network from the output layer to the
input layer. The weights are modified as the error is propagated. This process is repeated sev-
eral times until a stop condition is satisfied. The back-propagation algorithm cannot guarantee
an optimal solution and usually converges to a set of suboptimal weights. With the recent
advancements in deep learning, ANN-based architectures have gained much interest in the
area of speech processing, achieving very high performance when a large amount of data is
available for training (Abdel-Hamid et al. 2012).
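The weight-adjustment idea can be illustrated with the simplest case discussed above: a single perceptron trained on a linearly separable problem (logical AND). This is a minimal sketch of the perceptron learning rule, not of full back-propagation; the function names, learning rate, and epoch count are my own choices:

```python
def train_perceptron(samples, targets, lr=0.1, epochs=20):
    """Perceptron learning rule: w <- w + lr * (target - output) * x."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in zip(samples, targets):
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            output = 1 if activation > 0 else 0      # step activation function
            error = target - output                  # zero when correctly classified
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias

def predict(weights, bias, x):
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0

samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
targets = [0, 0, 0, 1]  # logical AND is linearly separable
w, b = train_perceptron(samples, targets)
print([predict(w, b, x) for x in samples])  # [0, 0, 0, 1]
```

As in back-propagation, each misclassification produces an error signal that nudges the weights toward the desired output; the MLP generalizes this by propagating such error signals backwards through the hidden layers.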
2.3.2 Feature selection
In the machine learning area, there are two common approaches for identifying the set of features used to train a model: a traditional one, in which the features are carefully selected based on previous knowledge of the problem under assessment, and a brute-force approach in which thousands of general-purpose features are extracted. In both cases, after feature extraction, the features should be evaluated according to their relevance for the problem under examination and, consequently, a reduced number of them should be selected. In
fact, identifying a relevant subset of features typically improves the performance of the model,
leading to shorter training times and an enhanced generalization ability.
The process of feature selection may be described as a search technique for proposing new
feature subsets. This implies an evaluation measure to score the different subsets. The simplest
algorithm may perform an exhaustive search of the feature space in order to find the feature
set that minimizes the error rate. Except for very small feature sets, this is typically computationally intractable, and a metaheuristic algorithm is often used instead. The choice of the evaluation metric strongly influences the algorithm, and distinguishes three main categories of feature selection approaches: filters, wrappers, and embedded methods (Guyon & Elisseeff 2003).
• Filter feature selection methods score each feature independently from the others with
a statistical measure. Typically, the statistical measure evaluates the correlation of the
feature with the outcome variable. The features are then ranked by the score obtained
and either kept in or removed from the feature set. In this way, the feature selection process is independent of any machine learning algorithm. According to the type of problem, several methods may be used to evaluate the features. Usually, for continuous values, Pearson's correlation coefficient (Pearson 1895) is used, but other methods include ANOVA (Fisher 1919) or the Chi-Square test (Pearson 1992).
• Wrapper methods consider the selection of a set of features as a search problem, evaluating and comparing different combinations of features. Each feature subset is
scored using a predictive model. For this reason, wrapper methods are computationally
very intensive, but usually provide the best performing feature set for the model con-
sidered. Some common examples of wrapper methods are: sequential forward feature
selection (Pudil et al. 1994), backward feature elimination (Pudil et al. 1994), and recur-
sive feature elimination.
• Embedded methods learn which features best contribute to the accuracy of the model
while the model is being created. The most common type of embedded feature selec-
tion methods are regularization methods. These include least absolute shrinkage and
selection operator (LASSO) regression (Santosa & Symes 1986, Tibshirani 1996) and Elastic net (Zou & Hastie 2005), which have built-in penalization functions that discard some coefficients and thus reduce the complexity of the model.
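As an illustration of the filter approach described in the first bullet, the sketch below scores each feature by the absolute Pearson correlation with the outcome variable and ranks them accordingly (all function names and the toy data are my own):

```python
import math

def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def rank_features(feature_columns, outcome):
    """Score each feature independently and rank indices by |correlation|."""
    scores = [abs(pearson(col, outcome)) for col in feature_columns]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

outcome = [0, 0, 1, 1]
features = [
    [0.1, 0.2, 0.9, 1.0],  # strongly correlated with the outcome
    [0.5, 0.4, 0.5, 0.4],  # uninformative
]
print(rank_features(features, outcome))  # [0, 1]
```

Note that, as stated above, each feature is scored in isolation: interactions between features, which wrapper and embedded methods can capture, are invisible to a filter.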
2.3.3 Model evaluation
The machine learning models reviewed in the previous sections are built using a large amount
of data from which the algorithm may infer meaningful patterns to later predict new samples.
This is called the training phase. After this phase, the built model should be evaluated in order
to assess how well the produced result generalizes to new, unseen data. This is called the evalu-
ation or test phase. There are two common approaches to evaluate the performance of a model,
holdout and cross-validation. The first consists in dividing the data into three subsets: training,
validation, and test set. The training set is used to build the model, whereas the validation set is
used to assess the performance of the model when adjustments of its parameters are required.
The test set should not contain data previously used in the training phase, as it is used to assess the future performance of the model on unseen data. Alternatively, one of the most widely used approaches to estimate the accuracy of a predictive model is cross-validation (Bishop 2006). It also involves partitioning the original dataset into two complementary subsets, one for training and one for testing. However, to reduce variability, multiple rounds of cross-validation are performed using different partitions. The validation results are then averaged over the rounds to provide an estimate of the model's predictive performance. This method is usually preferred when the amount of data available for the training and test phases is limited, as is often the case in the health domain.
There are two common families of cross-validation methods, exhaustive and non-exhaustive ones. In the first case, training and test phases are performed on all possible ways of dividing the original dataset. A frequently used example of this approach is known as leave-one-out (LOO)
cross-validation (Geisser 1975, Stone 1974, 1977), which can be considered a special case of
leave-p-out (LPO) cross-validation (Shao 1993) with p = 1. It involves using p observation(s) for
validation and the remaining for training. This process is repeated until all the samples in the
dataset have been divided into a validation set of p observation(s) and a corresponding training
set.
Non-exhaustive cross-validation methods do not compute all the possible ways of splitting the original dataset. These methods are an approximation of LPO cross-validation.
Probably the most widely used implementation is k-fold cross-validation. In this approach, the
original dataset is randomly partitioned into k equal sized subsets. Of the k subsets, a single
subset is retained as the validation data for testing the model, and the remaining k - 1 subsets
are used as training data. The cross-validation process is then repeated k times, with each of
the k subsets used exactly once as the validation data. The k results can then be averaged to
produce a single estimation. The advantage of this method is that all observations are used for
both training and validation, and each observation is used for validation exactly once. A visual
representation of this approach is depicted in Figure 2.12. In performing cross-validation, it
is usually a good practice to verify that each fold contains roughly the same proportions of
Figure 2.12: An illustration of the k-fold cross-validation method with k=10.
observations with a given output value, such as the class outcome. This approach is called
stratified cross-validation.
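The k-fold procedure can be sketched as a simple index-partitioning routine. This is an illustrative sketch only (a stratified variant would additionally balance class proportions within each fold, and library implementations typically shuffle the data first):

```python
def k_fold_splits(n_samples, k):
    """Partition sample indices into k folds; each fold serves once as test set."""
    indices = list(range(n_samples))
    folds = [indices[i::k] for i in range(k)]  # round-robin assignment
    for i, test_fold in enumerate(folds):
        # Training data is the union of the remaining k - 1 folds.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test_fold

# Every observation is used for validation exactly once across the k rounds.
for train, test in k_fold_splits(10, 5):
    print(len(train), len(test))  # 8 2 on each round
```

A per-round score would be computed inside the loop and averaged at the end, which yields the single performance estimate described above.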
2.3.3.1 Evaluation metrics
According to the type of machine learning problem, several measures may be used to describe
the validity of a model. Usually, different areas have different preferences for specific metrics
due to different goals. For instance, in medicine, sensitivity and specificity are often used,
while in information retrieval precision and recall are preferred. In this section, accuracy and the Word Error Rate (WER) are briefly introduced; these are two measures widely used in the areas of classification and speech recognition.
In a classification problem, the accuracy measures the fraction of all instances correctly categorized. Consider as an example a test for the presence of a disease performed on a set of subjects. Some people will actually have the disease, and if the test correctly identifies them, these instances are called true positives (TP). Some other subjects have the disease, but the test incorrectly claims they do not have it; these instances are called false negatives (FN). Subjects who do not have the disease and are correctly identified by the test are called true negatives (TN). Finally, healthy people who have a positive test result are called false positives (FP). Given these definitions, the accuracy of a binary classifier is computed as the sum of the true positive and true negative instances, divided by the total number of instances in the dataset:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
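Equivalently, accuracy can be computed directly from predicted and true labels. A minimal sketch with hypothetical counts (one false negative and one false positive in six test subjects):

```python
def accuracy(predictions, labels):
    """Fraction of instances whose predicted class matches the true class."""
    correct = sum(1 for p, t in zip(predictions, labels) if p == t)
    return correct / len(labels)

# 1 = disease present, 0 = disease absent.
labels      = [1, 1, 0, 0, 1, 0]
predictions = [1, 0, 0, 0, 1, 1]  # one FN, one FP
print(accuracy(predictions, labels))  # 4/6 ≈ 0.667
```

Here TP = 2, TN = 2, FN = 1 and FP = 1, so both formulations give (2 + 2) / 6.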
The WER is a common metric of the performance of a speech recognition system. It measures the differences between a recognized word sequence, also called the hypothesis, and the corresponding spoken word sequence, the reference. Three types of errors may occur when comparing these two sequences. Words that exist only in the hypothesis are called insertions. Likewise, words that exist only in the reference are called deletions. Finally, words existing in both sequences, but improperly recognized, are called substitutions. Thus, the WER is computed as the sum of the number of substitutions (S), deletions (D), and insertions (I), divided by the total number of words (N) in the reference:

WER = (S + D + I) / N
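In practice, the minimal combination of substitutions, deletions, and insertions is found with a word-level edit distance. A minimal dynamic-programming sketch (the function name is my own):

```python
def word_error_rate(reference, hypothesis):
    """WER = (S + D + I) / N via the Levenshtein distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j]: minimum edits turning the first i reference words
    # into the first j hypothesis words.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + sub)  # substitution or match
    return dist[len(ref)][len(hyp)] / len(ref)

# One substitution and one deletion against a four-word reference.
print(word_error_rate("the cat sat down", "the hat sat"))  # 0.5
```

Note that, because insertions are counted against the reference length, the WER can exceed 1 when the hypothesis is much longer than the reference.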
2.4 Summary
In this chapter, some introductory notions of technical concepts required for the understanding
of the rest of this document were reviewed. Topics related to the areas of natural language
processing, spoken language processing, and machine learning were covered. Each of these
areas is very extensive and providing a description of the several existing approaches in each
of them is beyond the scope of this chapter. Regarding natural language processing, a simple method for representing text data was first briefly introduced, and then statistical and predictive language models were discussed. The section dedicated to spoken language processing covered a broader set of topics in order to ease the reading of the next chapters. A characterization of speech production and some of the many types of information that can be extracted from speech signals were provided. Then, a brief review of a speaker modeling technique used in the tasks of speaker identification and recognition was reported. The section concluded with some basic notions about the main components of a speech recognition system. Finally, some models commonly used in machine learning problems were presented, together with an overview of feature selection approaches and a description of standard approaches to evaluate machine learning models. Additionally, two common evaluation metrics in classification problems and in speech recognition were described. These metrics have also been used to assess the contributions provided in this dissertation.
3 Characterization of Neurodegenerative Diseases
Neurodegenerative diseases affect millions of people worldwide. These disorders originate in nerve cells of the brain or of the peripheral nervous system that gradually lose their functionality. The clinical condition becomes progressively worse over time, until ultimately the nerve cells die. The risk of neurodegenerative diseases increases with age. Considering that lifespan has been extended notably in the last decades, it is not surprising that their
prevalence is also increasing. This creates a critical need to improve the understanding of these
disorders to develop new approaches for their prevention and treatment. For these reasons, the
most common neurodegenerative diseases are briefly introduced in this chapter. For each of
them, the main symptoms and the clinical criteria used for diagnosis are reported. The chapter
ends with a summary in which (i) it is shown how Speech and Language Technology (SLT) can
provide benefits to the diagnostic process of the observed neurodegenerative diseases, and (ii)
I identify two diseases on which to focus the rest of this study.
3.1 Mild Cognitive Impairment
Mild Cognitive Impairment (MCI) is a brain function syndrome involving the onset and evolu-
tion of cognitive decline greater than expected for an individual’s age and education level, but
that does not interfere notably with the activities of daily life. Prevalence in population-based
epidemiological studies ranges from 16% to 20% in adults older than 60 years. Some people
with MCI seem to remain stable or regress to normal over time, but 20% to 40% progress to
dementia within five years (Roberts & Knopman 2013). MCI can thus be regarded as a risk
state for dementia and its identification could lead to secondary prevention by controlling risk
factors such as systolic hypertension. The amnestic subtype of MCI has a high risk of progres-
sion to Alzheimer’s disease and could constitute a prodromal stage of this disorder (Gauthier
et al. 2006).
In 2011, the American National Institute on Aging and the Alzheimer’s Association (NIA-
AA) published the core clinical criteria for the diagnosis of MCI. These require the existence
of a concern regarding a change in cognition together with an impairment in one or more
cognitive domains. Symptoms should allow the preservation of independence in functional
abilities, and there should be no evidence of a significant impairment in social or occupational
functioning (Albert et al. 2011).
Cognitive assessment includes tests of episodic memory. These are useful for identifying
amnestic MCI patients who have a high likelihood of progressing to Alzheimer’s disease de-
mentia within a few years. Since other cognitive domains can be impaired in addition to mem-
ory, the assessment typically also includes tests that evaluate executive functions (e.g., reason-
ing, problem solving, planning), language (e.g., naming, fluency, and comprehension), visu-
ospatial skills, and attentional control. Many validated clinical neuropsychological measures
are available to assess these cognitive domains. Among them, the Mini-Mental State Examination (MMSE) (Folstein et al. 1975) is widely used for the screening of MCI and Alzheimer's disease due to its brevity, high sensitivity, and ease of administration and scoring (Nunes 2005).
Another commonly used battery is the Wechsler Adult Intelligence Scale - III (WAIS-III) (Ryan & Lopez 2001). It is considered "the gold standard" in intelligence testing, providing information about the overall level of intellectual functioning and the presence or absence of significant intellectual disability.
3.2 Alzheimer’s disease
Dementias represent a broad category of brain diseases that cause a long-term, gradual decrease in multiple cognitive functions. They are responsible for the greatest burden of neurodegenerative diseases, with Alzheimer's Disease (AD) representing the most common cause
of dementia, contributing to 60%-70% of cases (WHO 2017).
AD is characterized by loss of neurons and synapses in the cerebral cortex and in certain
subcortical regions. It gradually and progressively alters and destroys the nervous tissue. At an early stage, AD is characterized by alterations of memory and of spatial and temporal orientation. With the progression of the disease, other neuropsychological
changes arise, such as language impairments, visuospatial deficits and changes in abstraction
and judgment. At a later stage, the disease may lead to the development of apraxia (difficulty
in organizing motor actions intentionally). AD is diagnosed when there are cognitive or behav-
ioral symptoms that represent a decline from previous levels of functioning and interfere with
the ability to function at work or at usual activities. The diagnostic process may include brain
imaging and cerebrospinal fluid exams. Cognitive impairment should be diagnosed through
an objective cognitive assessment and it should involve at least two of the following domains:
memory, reasoning, visuospatial abilities, language, personality (McKhann et al. 2011). Al-
though memory impairment due to medial temporal lobe damage is the characteristic symp-
tom of AD, language problems are also prevalent and existing literature confirms they are an
important factor. The most well-known symptoms of impaired language abilities include naming and word-finding difficulties, repetitions, an overuse of indefinite and vague terms, and inappropriate use of pronouns (Ahmed, Haigh, de Jager & Garrard 2013, Almor et al. 1999, Forbes
et al. 2002, Kempler 1984, 1995, Kempler et al. 1987, Kim & Thompson 2004, Oppenheim 1994,
Reilly et al. 2011, Salmon et al. 1999, Taler & Phillips 2008, Ulatowska et al. 1988).
When it comes to discourse, syntactic and semantic deficits in language processing con-
strain the production of meaningful speech. The discourse of AD patients is described as flu-
ent but not informative, characterized by incomplete and short sentences (Hier et al. 1985,
Nicholas et al. 1985), poorly organized, and with a disproportionate deficit in maintaining co-
hesion (Shekim & LaPointe 1984), and coherence (Appell et al. 1982, Glosser & Deser 1991,
Hutchinson & Jensen 1980, Obler & Albert 1984, Ripich & Terrell 1988).
No treatments stop or reverse the progression of this disease, though some may temporar-
ily improve the symptoms. The disease onset is often mistakenly attributed to aging or stress.
Detailed neuropsychological testing can reveal mild cognitive difficulties up to eight years be-
fore a person fulfills the clinical criteria for the diagnosis of AD (Backman et al. 2004). Among
the most used neuropsychological measures, there are the MMSE and the Alzheimer’s Dis-
ease Assessment Scale - Cognitive Subscale (ADAS-Cog) (Rosen et al. 1984). The latter is the
most widely administered tool in AD trials (Cano et al. 2010, Robert et al. 2010), being used to
measure cognitive performance and detect therapeutic efficacy in cognition. The ADAS-Cog
consists of eleven tasks assessing six areas of cognition: memory; language; ability to orien-
tate to time, place, person; construction of simple designs; planning; and performing simple
behaviors in pursuit of a predefined goal. The battery takes approximately 30 to 45 minutes to
complete, depending on the AD severity stage of the patient.
3.3 Parkinson's disease
Parkinson's Disease (PD) is due to the progressive death of neurons in the substantia nigra, a region of the midbrain. This has the effect of decreasing the synthesis of dopamine, which causes a dysfunction in the regulation of major brain structures involved in the control of movements.
PD is the second most common neurodegenerative disorder after AD, affecting about 1% of
people older than 60 years (de Lau & Breteler 2006). About 89% of PD patients develop speech
disorders (Ramig et al. 2008).
The cardinal motor signs of PD include the characteristic clinical picture of resting tremor,
rigidity, bradykinesia, and impairment of postural reflexes, while non-motor symptoms in-
clude behavioral disorders, sleep and sensory abnormalities. These symptoms slowly worsen
during the disease with a nonlinear progression. Dementia becomes common in the advanced
stages of the disease (Sveinbjornsdottir 2016). PD patients often develop a speech disorder referred to as hypokinetic dysarthria. This is characterized by weakness, paralysis, and lack of coordination in the motor speech system, affecting respiration, phonation, articulation, and prosody.
These deficits may result in an altered speech characterized by a reduced intelligibility, natu-
ralness, and overall efficiency of vocal communication. Deficits in phonation are related with
vocal fold bowing and incomplete closing of vocal folds. These can result in a decreased loud-
ness and an impaired ability to produce normal phrasing and intensity. Articulation deficits
are manifested as a reduced amplitude and velocity of the articulatory movements in the lips,
tongue, and jaw. Patients may exhibit imprecise stop consonants, produced as fricatives, and defects in the ability to make rapid articulator movements in the repetition of a consonant–vowel
combination. Prosodic impairments comprise changes in loudness, pitch, and timing, which
overall contribute to the resulting intelligibility of speech.
The standard method to evaluate and rate the neurological state of Parkinson’s patients
is based on the revised version, provided by the Movement Disorders Society, of the Unified
Parkinson’s Disease Rating Scale (UPDRS) (MDS 2003). The motor part of the UPDRS (Section
III) addresses speech, evaluating volume, prosody, clarity, and the repetition of syllables. Speech
symptoms of PD are typically assessed by a SLP through several speaking tasks thought to
measure the extent of speech and voice disorders. The most traditional are sustained vowel phonation, rapid syllable repetition (diadochokinesis, or DDK), and the reading of short sentences or longer passages, as well as freely spoken spontaneous speech (Goberman &
Coelho 2002). Each of these tasks evaluates a specific impairment caused by dysarthria, such
as difficulties in consonant-vowel articulation, phonation, respiration, and prosody.
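As an aside on how such tasks can be scored automatically, the DDK rate (syllables per second) can be approximated by counting energy bursts in a recording. The sketch below is a minimal, dependency-free illustration only: the frame length and threshold are arbitrary choices, not values from this thesis, and real systems use far more robust onset detection.

```python
import math

# Illustrative sketch: estimate DDK rate (syllables/second) from a mono
# audio signal by counting short-time energy bursts. Frame size and the
# threshold ratio are arbitrary choices for this example.
def ddk_rate(samples, sample_rate, frame_ms=20, threshold_ratio=0.5):
    """Count energy bursts (rough syllable proxies) per second."""
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    energies = [
        sum(s * s for s in samples[i:i + frame_len])
        for i in range(0, len(samples), frame_len)
    ]
    if not energies or max(energies) == 0:
        return 0.0
    threshold = threshold_ratio * max(energies)
    bursts, above = 0, False
    for e in energies:
        if e >= threshold and not above:
            bursts += 1  # one rising threshold crossing per burst
        above = e >= threshold
    return bursts / (len(samples) / sample_rate)

# Synthetic check: five 80 ms bursts of a 100 Hz tone in one second.
sr = 8000
signal = [0.0] * sr
for k in range(5):
    start = k * sr // 5
    for n in range(sr * 8 // 100):
        signal[start + n] = math.sin(2 * math.pi * 100.0 * n / sr)
print(ddk_rate(signal, sr))  # 5.0
```

On real speech, the bursts correspond to the /pa/-/ta/-/ka/ repetitions, and a reduced rate is one of the articulation deficits described above.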
3.4 Dementia with Lewy bodies
The predominant histological feature of Dementia with Lewy bodies (DLB) is the presence of
cortical and subcortical Lewy bodies, clumps of alpha-synuclein protein in neurons. DLB is the
second most common type of degenerative dementia in the elderly, possibly accounting for up
to 15% of all dementia cases (McKeith et al. 1996).
Dementia, defined as a progressive cognitive decline of sufficient magnitude to interfere
with usual daily activities, is an essential requirement for DLB diagnosis. Prominent or persis-
tent memory impairment may not necessarily occur in the early stages, but is usually evident
as the disease progresses. Deficits on tests of attention, executive function, and visuopercep-
36
tual ability may instead be especially prominent and occur early. Core clinical features also
include fluctuations in cognition, recurrent visual hallucinations, or motor features of parkinsonism (McKeith et al. 2017). This disease presents a pronounced clinical and neuropathological
overlap with AD as well as PD with dementia (PDD).
Dementia screening batteries such as the MMSE and the Montreal Cognitive Assessment
(MoCA) (Nasreddine et al. 2005) are useful to characterize global impairment in DLB. How-
ever, neuropsychological assessment should include measures assessing different cognitive do-
mains that are capable of highlighting clinical deficits typical of this disease. Measures of atten-
tion and executive function that differentiate DLB from AD and normal aging include tests of
processing speed and divided attention (e.g., Stroop tasks, phonemic fluency, and trail making
tests). These tests are particularly important because they assess the brain’s ability to attend
to multiple stimuli simultaneously, while evaluating the reaction time. A version of the Stroop
test (Stroop 1935), for instance, requires reading aloud the name of a color that is printed in an ink color different from the one it names (e.g., the word "red" printed in blue ink instead of red ink).
Tests of verbal fluency (Benton et al. 1994), on the other hand, aim at assessing verbal initiative, inhibition, and the ability to switch among tasks. In these tests, participants must produce as many words as they can think of beginning with a particular letter (phonemic fluency) or belonging to a particular category (semantic fluency), within a constrained time of 60 seconds. Examples of useful probes of spatial and perceptual difficulties
include tasks of figure copy (e.g., intersecting pentagons and complex figure copy). Memory
and object naming tend to be less affected in DLB and are best evaluated through story recall,
verbal list learning, and confrontation naming tasks to detect impairments of word-finding
abilities (McKeith et al. 2017).
3.5 Frontotemporal Dementia
Frontotemporal Dementia (FTD) defines a heterogeneous group of clinical syndromes
marked by the progressive, focal neurodegeneration of the frontal and anterior temporal
lobes (Pasquier & Petit 1997). It is the third most common dementia for individuals older
than 65 years (Brunnstrom et al. 2009, Ratnavalli et al. 2002). FTD affects brain regions im-
plicated with motivation, reward processing, personality, social cognition, attention, executive
functioning, and language.
Currently, FTD incorporates three clinical subtypes known as variants of Primary Progres-
sive Aphasia (PPA): non-fluent, semantic, and logopenic PPA. Patients are first diagnosed with
FTD and then are divided into clinical variants based on the relative presence or absence of
salient speech and language features. The diagnosis of FTD requires initial and progressive
decline in social functioning and changes in personality, characterized by a progressive deteri-
oration of behavior accompanied by three out of the following features: disinhibition, apathy,
loss of empathy, eating behavior changes, compulsive behaviors, and an executive predomi-
nant pattern of dysfunction on cognitive testing. The main language domains considered to
classify disease’s variants include speech production features (e.g., grammar, motor speech,
sound errors, and word-finding pauses), repetition, single-word and syntax comprehension,
confrontation naming, semantic knowledge, and reading/spelling.
The clinical diagnosis of the non-fluent variant of primary progressive aphasia (nfvPPA)
requires either agrammatism in language production or effortful, slow, and labored speech
with inconsistent sound errors and distortions (apraxia of speech) (Gorno-Tempini et al. 2011).
The semantic variant of primary progressive aphasia (svPPA) preserves language fluency, but is
characterized by anomia, severe single-word comprehension deficits, loss of object knowledge,
semantic and paraphasic errors (Gorno-Tempini et al. 2004). These variants affect between 20% and 25% of patients diagnosed with FTD (Johnson et al. 2005). Finally, the third variant, known as
logopenic variant of primary progressive aphasia (lvPPA) presents word retrieval and sentence
repetition deficits. Spontaneous speech is characterized by a slow rate, but without a clear
agrammatism. This syndrome is the most recently identified among the variants of PPA and
presents some patterns that overlap with AD, especially at an early age of onset (Henry & Gorno-Tempini 2010).
Tasks typically used in the diagnostic process include picture description and story retelling
tests to evaluate grammatical structure, a diadochokinesis task to assess motor speech capabilities, repetitions, confrontation naming, and sentence or single-word comprehension (Gorno-
Tempini et al. 2011).
3.6 Amyotrophic Lateral Sclerosis
Amyotrophic Lateral Sclerosis (ALS) is characterized by a rapid, progressive degeneration of
motor neurons in the brain and spinal cord, which ultimately leads to paralysis and prema-
ture death. Overall, the prevalence of ALS is low, approximately 5 in 100,000 individuals, but
incidence increases with age, showing a peak between 55 and 75 years (Bertram & Tanzi 2005).
Primarily characterized by weakness and atrophy in the muscles of the extremities, de-
creased muscle tone, and fasciculations, this disease is often subtyped into several variants
according on the site of onset (i.e., bulbar, spinal, and respiratory). Approximately the 70%
of patients are affected from the spinal form of the disease, the 25% of the cases reports bul-
38
bar onset, and the remaining 5% has initial trunk or respiratory involvements (Kiernan et al.
2011). While the spinal variant presents symptoms that may start with upper and lower limb muscle weakness, the bulbar subtype presents speech and swallowing difficulties, being characterized by respiratory problems, tongue atrophy, and by the eventual loss of speech intelligibility (Yorkston et al. 1993). In this subtype, speech problems are often an early
manifestation of the disease, possibly affecting the phonatory, articulatory, resonatory, and res-
piratory speech subsystems. As a result, ALS patients may experience both dysarthria and
dysphagia (difficulty in swallowing). The pattern of speech impairments includes effortful, slow productions with short phrases, inappropriate pauses, imprecise consonants, hypernasality, strain-strangled voice, as well as a decreased pitch and loudness range (Watts & Vanryckeghem 2001). Acoustic analyses of the voice have confirmed deviations in fundamental frequency, amplitude and frequency perturbation, voice range, vocal quality, and phonatory instability (Silbergleit et al. 1997).
The criteria for the diagnosis of ALS (Brooks 1994) require the presence of signs of degeneration of the lower motor neurons (weakness, or paralysis accompanied by loss of muscle tone) and upper motor neurons (paralysis accompanied by severe spasticity and rigidity), and the progressive spread of signs within a region and to other regions. At the same time, pathological, neuroimaging, and electrophysiological evidence of the absence of other diseases that might explain the observed clinical signs is required. Abnormal speech or swallowing studies and abnormal pulmonary or larynx function are among the clinical features used to support the diagnosis. The
progression of speech impairments can be quite rapid. Case studies report an overall decay of
speech intelligibility (from 98% to 48%) and pulmonary function in an observation period of
only two years (Kent et al. 1991). Due to this aggressive loss, ALS patients should be frequently
re-assessed by SLPs. In order to be comprehensive, the assessment should individually evaluate the articulatory, respiratory, phonatory, and resonatory subsystems. The first is assessed
using kinematic measures (e.g., speed, strength) of facial components (jaw, lips, and tongue),
while the respiratory subsystem is evaluated considering aerodynamic (e.g., oral pressure, airflow) and acoustic variables. Both may use specialized instruments and speaking tasks of
spontaneous speech. The evaluation of the phonatory subsystem relies on voice characteristics
using the maximum phonation time task, while the evaluation of the resonatory subsystem is
based on the analysis of velopharyngeal muscle weakness.
3.7 Huntington disease
Huntington Disease (HD) is caused by a degeneration of neurons in the basal ganglia and in
cortical regions, affecting the areas of the brain involved in movement, cognition, and emotions.
Its prevalence is similar to that of ALS, affecting approximately 2.71 in 100,000 individuals
worldwide (Pringsheim et al. 2012).
HD is characterized by a progressive motor dysfunction, behavioral changes and cognitive
decline resulting in dementia. From a clinical perspective, HD is primarily manifested by in-
voluntary movements known as chorea, which may be accompanied by bradykinesia, motor
impersistence, and deficits in movement planning, aiming, tracing, and termination (Berardelli
et al. 1999, Paulsen 2011). A primary consequence of chorea is the onset of a motor speech
disorder characterized as hyperkinetic dysarthria. The main patterns of this disorder are im-
precise consonants, prolonged intervals, variable rate, monopitch, harsh voice, inappropriate
silence, distorted vowels, and excessive loudness variations (Hartelius et al. 2003, Saldert et al.
2010).
Currently, HD is formally diagnosed based on the presence of the HD gene and on the
development of motor symptoms that are unequivocal signs of HD, matching the fourth confi-
dence level (≥ 99% confidence) in the Diagnostic Confidence Level of the Unified Huntington’s
Disease Rating Scale (UHDRS) (Reilmann et al. 2014). These criteria, however, may miss the
earliest signs and symptoms of the disorder, which could occur up to 10-15 years before the
disease’s onset. In this period, individuals may experience the gradual appearance of subtle
motor, cognitive, and behavioral changes, but do not meet the current criteria for formal HD di-
agnosis. As such, new diagnostic categories for HD are being proposed, based on an improved
understanding of its natural history (Reilmann et al. 2014). Also, some recent studies are starting
to consider motor speech deficits and language difficulties as a clinical indicator of disease on-
set and marker of disease progression (Rusz et al. 2014, Skodda et al. 2014, Vogel et al. 2012).
These studies evaluate major observed impairments in HD patients, including deviations in
phonation (i.e., increased pitch, harsh voice), poor oral motor performance (i.e., reduced co-
ordination of tongue and lips) and alterations in speech timing and prosody (i.e., shortened
phrase length). The tasks typically used are: syllable repetition, sustained vowel phonation,
reading of a passage, and freely spoken spontaneous speech.
3.8 Neurodegenerative diseases and SLT
In the previous sections, the diagnostic process of several neurodegenerative diseases was reviewed, observing that it partially relies on different kinds of tests. Depending on the disease, the assessment may include neuropsychological tests or a perceptual assessment of voice quality, and is typically performed by an expert neurologist or a SLP. In any case, the evaluation
should be repeated over time to monitor the disease progression and to adjust drug administration. Disease monitoring makes the administration of screening tests even more burdensome,
since it implies that one of the two parties, the clinician or the patient, must travel. In both cases, this raises some inconveniences, namely the additional stress that the patient and his/her caregiver have to face on top of the daily routine, and the scarcity of clinicians in remote places with limited resources. The possibility of providing such tests as a service, available for instance on the internet or by phone, would be useful for allowing remote clinical assessment. Depending on the disease, SLT may provide additional advantages to the diagnostic process of neurodegenerative disorders; these are highlighted in the remainder of this section according to each disease's clinical symptoms.
Neurological disorders that present motor impairments (e.g., PD, HD, ALS) affect the
speech apparatus, while preserving syntactic, semantic, and pragmatic abilities of language
production. In these diseases, the neurological injury causes weakness, paralysis, or lack of coordination of the speech organs that contribute to the production of sounds, like the vocal folds, lungs, or jaw, with evident consequences on the resulting voice quality. Given their na-
ture, these disorders are evaluated on speech tasks targeted to analyze phonation, articulation,
and prosody. During the administration of these tasks, the SLP should be able to perform a perceptual evaluation of the speech functionalities, and to compare the results with those of
a previous examination. As such, the assessment is strongly dependent on the expertise of the
clinician. SLT could overcome these limitations and provide a great contribution to the exe-
cution of these tasks. In fact, there is an important area of speech processing that deals with
the identification of representative features of the vocal tract and its possible deviations due
to diseases. This would lead to an objective, deterministic, and reproducible evaluation, and would ease the comparison with previous data from the same patient. Additionally, the evaluation could be performed remotely, with the advantages already highlighted above.
Neurodegenerative disorders affecting cognitive functions require a different evaluation,
based on the administration of several cognitive tests. The type of tests used depends strictly on the disease and on the cognitive domains impaired. The diagnosis of MCI, AD, and
DLB is based on neuropsychological tests assessing primarily memory, orientation and higher
order functions like planning, and attention. The diagnosis of isolated language deficits relies
on cognitive stimuli assessing a specific functionality, such as the ability to repeat a word or
a sentence, or the ability to name an object or a person. In both cases, the majority of these
tests include a verbal component provided in response to a visual or spoken stimulus solicited
by the clinician. Also, the result of a neuropsychological evaluation of cognitive decline typically provides a numerical score, which is adjusted to account for age and literacy. It is then compared with reference normative values in order to establish whether the obtained score is normal or below expectations. Due to their nature, and to the need to continuously monitor cognitive decline over time, neuropsychological tests lend themselves natu-
rally to be automated through SLT. A tool including the digitized version of these tests, with
the possibility of an immediate evaluation through automatic speech recognition, is feasible
and could be of valuable support in health care centers: first, the therapist would have access to an organized archive of tests; second, tests could be administered in the traditional way, or remotely, when the subject's travel is hampered by logistic constraints or physical disabilities; finally, recordings and evaluations could be stored and made available for later
consultation.
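The score-normalization step described above can be sketched as a comparison of the raw score against normative values stratified by age and education. In the sketch below, the normative means and standard deviations, the age and education bands, and the −1.5 SD cutoff are all invented for illustration; real norms come from published normative tables.

```python
# Hypothetical normative table: (age band, education band) -> (mean, SD).
# All values are invented for illustration only.
NORMS = {
    ("60-69", "low"):  (22.0, 3.0),
    ("60-69", "high"): (25.0, 2.5),
    ("70-79", "low"):  (20.0, 3.5),
}

def z_score(raw, age_band, edu_band):
    """Express a raw score in standard deviations from the normative mean."""
    mean, sd = NORMS[(age_band, edu_band)]
    return (raw - mean) / sd

def classify(raw, age_band, edu_band, cutoff=-1.5):
    """Scores below `cutoff` SDs are flagged as below expectations."""
    return ("below expectations"
            if z_score(raw, age_band, edu_band) < cutoff else "normal")

print(classify(24, "60-69", "low"))   # z = +0.67 -> normal
print(classify(14, "70-79", "low"))   # z = -1.71 -> below expectations
```

A digitized battery could apply such a lookup immediately after automatic scoring, giving the clinician both the raw score and its normative interpretation.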
A different evaluation is required for the class of disorders that present language impair-
ments (e.g., FTD, AD), whose assessment includes tests requiring complex language abilities,
like discourse production. In this case, the analysis relies on samples of spontaneous speech,
elicited through different types of cognitive stimuli. Also, the way in which speech is elicited
will directly influence the resulting discourse and its characteristics, and should be considered
in the assessment. Descriptive speech is obtained through the description of an image or an object. Narrative speech emerges through the recall of an event and involves memory, while procedural speech includes instructions directed at explaining how to perform a task, and involves higher order functions like planning. Spontaneous speech samples are typically
recorded and then analyzed in terms of their phonological, syntactic, semantic, and pragmatic
features. The process to obtain these measures is based first on the manual transcription of the recordings and then on the identification and annotation of linguistic elements, such as lexical items (e.g., nouns, verbs, adjectives), sentence and clause boundaries, and cohesive elements. From these annotations, word frequencies and other statistics are then manually computed in order to assess discourse production in terms of its correctness, fluency, information conveyed, and overall coherence. Due to the time that this type of analysis requires, this approach cannot provide immediate feedback in clinical settings. Addi-
tionally, it may also lead to different inter-expert assessments due to the intrinsic, ambiguous
nature of spontaneous language. Providing this kind of analysis automatically, through natural language processing and automatic speech recognition, would allow overcoming these limitations, and would provide clinicians with a complementary tool to evaluate complex language abilities.
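As a toy illustration of the kind of lexical statistics such an automatic analysis could compute, the sketch below derives a type-token ratio and a pronoun rate from a transcript. The tiny pronoun list is a stand-in for a real part-of-speech tagger over full lexica, and the sample sentence is invented.

```python
# Minimal sketch of deriving lexical statistics from a transcript.
# The pronoun list below is a small illustrative stand-in, not a lexicon.
PRONOUNS = {"i", "you", "he", "she", "it", "we", "they", "this", "that"}

def lexical_stats(transcript):
    tokens = [t.strip(".,;!?").lower() for t in transcript.split()]
    tokens = [t for t in tokens if t]
    n = len(tokens)
    return {
        "tokens": n,
        # Lexical diversity: distinct word forms over total words.
        "type_token_ratio": round(len(set(tokens)) / n, 2),
        # Proportion of pronouns, a proxy for reduced noun use.
        "pronoun_rate": round(sum(t in PRONOUNS for t in tokens) / n, 2),
    }

sample = "It is on that. She puts it there and it falls."
print(lexical_stats(sample))
# {'tokens': 11, 'type_token_ratio': 0.82, 'pronoun_rate': 0.45}
```

A high pronoun rate combined with a low type-token ratio is the sort of pattern that, in the clinical literature, is associated with impoverished, semantically empty discourse.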
3.9 Summary
In this chapter, several neurodegenerative disorders were introduced, reporting their major
symptoms and core clinical criteria used for diagnosis. From this analysis, Alzheimer's and Parkinson's diseases emerge as the most important neurodegenerative disorders due to their high prevalence, representing, respectively, the first and second most common neurodegenerative diseases affecting people older than 60 years. Additionally, both diseases can be con-
sidered, individually, representative of other disorders that present similar symptoms. This is
especially the case for the overlap between reported language impairments in AD and some
subtypes of FTD. In fact, although memory impairment is the main symptom of AD, language
problems are also prevalent. Initially, the semantic domain is impaired; as the disease worsens, the syntax and phonology domains are also affected. An extensive use of pronouns, accom-
panied by a reduced use of nouns, is a hallmark of both AD and the semantic variant of PPA.
Moreover, word-finding difficulties, characteristic of the speech of AD patients, are also one of the core features of the logopenic variant of PPA. Another example is MCI and the several existing variants of dementia. Although quite different evolutions may occur along the course of the disease, the initial symptoms are often similar. Amnestic and orientation impairments are commonly referred to as the first symptoms manifested in MCI, AD,
and in some cases, ALS. As for PD, the motor degeneration affecting the speech apparatus, known as dysarthria, can also develop in HD and, to some extent, in ALS. Based on these considerations, automatic methods targeted at the diagnosis of a particular disorder could probably be extended with little effort to other disorders with similar onset or evolution.
Finally, considering that these diseases represent two of the most common neurological
disorders, the choice of focusing on them will intrinsically bring other advantages. Among
them, the most practical one is concerned with the availability of data. This issue becomes
increasingly important when dealing with current machine learning approaches that typically
require a considerable amount of input data to allow for a good generalization of the problem
under consideration. Although nowadays there is an increasing availability of digital resources
in many areas (e.g., online newspapers and encyclopedias), this is not straightforward for the
area of speech processing applied to health. In this context, one is dealing with sensitive data
that are difficult to gather, since they consist of the speech recordings of subjects with a clinical
condition. For this reason, data are also subject to privacy and ethical concerns. The process
of data collection should be supported by a detailed protocol that should be validated by an
ethics commission. The protocol should specify the objectives of the study and how the privacy
and security of the data will be guaranteed. To conclude, the distribution of the observed
population should be balanced in terms of gender, age, and education in order to be validated
with existing normative values. By selecting the two most widespread diseases, it is more likely that publicly available data sets can be found.
4 Related Work: SLT for Diagnosis of Neurodegenerative Diseases
In Chapter 3, several neurodegenerative diseases that affect different speech and language ca-
pabilities were introduced. Their relevance, major symptoms and diagnostic criteria were ana-
lyzed. Then, the study focused on how current speech and language technology may provide
benefits to the diagnostic process of these diseases. Finally, Alzheimer's and Parkinson's diseases were identified as the disorders on which to focus the next part of this research. From this review, it is also possible to observe the link that exists between speech and some
neurodegenerative diseases. In some cases, speech production can be considered a marker of
central nervous system integrity, leading to a frequent motor disorder known as dysarthria
(e.g., Parkinson’s Disease (PD) and Huntington Disease (HD)). In other instances, when speech
production is spared, it becomes a standard way to screen cognitive disorders that may present,
at least initially, similar onset. Finally, speech production could be a link between different
diseases that report similar impact on language functionality, such in the case of dementias
where correctness, fluency, and meaningfulness become impoverished over time. These con-
siderations motivated the three contributions of this thesis, mentioned in Chapter 1, namely,
monitoring of speech, cognitive, and language abilities. In this chapter, the literature review for each of them is presented by describing the most relevant works that provide automated
solutions based on SLT.
4.1 Monitoring of speech abilities
In the last few years, there has been a growing interest from the research community in motor
speech disorders. The current state of the art includes an extensive body of research targeting
the automatic characterization of dysarthria in order to discriminate between PD patients and
healthy subjects. Overall, these studies have considered the analysis of different speech fea-
tures that should be able to reflect the physical impairments caused by the disease. A selection
of the most relevant works existing on this topic is reported in the following paragraphs.
Rusz et al. (2011) investigated a set of quantitative acoustic measurements for the character-
ization of speech and voice disorders in early untreated PD patients. The corpus is composed
of 46 Czech native speakers, 23 individuals diagnosed with an early stage of PD, and 23 healthy
individuals matched for age. Eight vocal tasks were used in the study, including two different
versions of the sustained phonation of vowels /a/, /i/, /u/, the rapid /pa/-/ta/-/ka/ sylla-
bles repetition, a monologue, and various reading tasks with different characteristics (i.e.,
sentences with varied stress patterns, and sentences to be read according to specific emotions
among others). The study considered measures traditionally used for evaluating phonation,
articulation, and prosody in PD and, additionally, introduced some new measurements of ar-
ticulation. Among the set of features, there are: the fundamental frequency (F0), jitter, shimmer,
first and second formant frequency (F1, and F2 respectively), speech rate, pause, variations in
loudness, and articulation accuracy. The most representative features are then identified ac-
cording to two criteria: i) selection of measures with statistically significant differences between
the two groups, and ii) removal of highly correlated variables. After this phase, 19 out of 32
measures were selected. The Wald task (Schlesinger & Hlavac 2002) was used to separately assess each measure for its ability to classify subjects as PD, healthy, or uncertain. Results showed that 26.77% of subjects were correctly classified according to their group, 71.97% fell into the undecided situation, while the remaining 1.26% were assigned to the wrong group. Variations of F0 in the monologue and in the sentences read according to specific emotions were the best method for discriminating PD patients.
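Several of the perturbation measures listed above, such as jitter and shimmer, admit compact definitions over per-cycle measurements. The sketch below uses the common "local" (mean absolute difference) variant of both measures; the period and amplitude values are made up for illustration, and in practice would come from a pitch-period extractor.

```python
# Local jitter and shimmer from per-cycle measurements.
# Local jitter = mean absolute difference between consecutive glottal
# periods, divided by the mean period; local shimmer is the same
# computation applied to per-cycle peak amplitudes.
def mean_abs_diff_ratio(values):
    diffs = [abs(a - b) for a, b in zip(values, values[1:])]
    return (sum(diffs) / len(diffs)) / (sum(values) / len(values))

periods_ms = [8.0, 8.1, 7.9, 8.2, 8.0]       # hypothetical glottal periods
amplitudes = [0.50, 0.48, 0.52, 0.49, 0.51]  # hypothetical cycle peaks

jitter = mean_abs_diff_ratio(periods_ms)
shimmer = mean_abs_diff_ratio(amplitudes)
print(f"jitter={jitter:.3f} shimmer={shimmer:.3f}")  # jitter=0.025 shimmer=0.055
```

Elevated values of either ratio reflect the cycle-to-cycle instability of phonation that such studies try to capture.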
Orozco-Arroyave et al. (2013) explored the discriminant capability of different perceptual
features for automatically classifying between people with PD and healthy individuals. Fea-
tures included in the study considered LPC, Linear Prediction Cepstral Coefficients (LPCC),
MFCC, PLP, and two versions of Relative Spectra coefficients (RASTA), with and without cep-
stral filtering (RASTA-PLP-CEPS, RASTA-PLP-SPEC). The number of coefficients is 12 for each
type of feature except RASTA-PLP-SPEC, for which 27 coefficients are estimated. Four statistics are computed for each kind of feature: mean, standard deviation, skewness, and kurtosis.
The corpus consisted of the speech recordings of 20 patients and 20 healthy subjects while
performing the sustained vowel phonation task for the five Spanish vowels. Each subject re-
peated the task three times for each vowel. Data is balanced by gender and age. Following the
feature extraction phase, the authors performed feature selection using Principal Component
Analysis (PCA). Then, a two-layer classification scheme was implemented. The first stage of
classification considers each kind of feature individually, while the second stage combines the results obtained previously into a new feature space. Then, 70% of the data is used for feature selection
and for training the classifier, while the remaining 30% is used for testing. Each stage of the
classification process is repeated ten times per each pair of subsets (training and testing), form-
ing a total of 100 independent realizations of the experiment. Classification was performed
using SVM (Cortes & Vapnik 1995) trained with a Gaussian kernel (Scholkopf & Smola 2001).
The best accuracy for each vowel was achieved using different features. For vowel /a/ the
PLP parameters exhibited the best results (76.19%), while for vowels /i/ and /u/ the best fea-
tures were the MFCC (75.30%, 76.28%). The best results for vowels /e/ and /o/ are obtained
when five subsets of features are combined. For the case of vowel /e/ the considered features
are RASTA-PLP-SPEC, MFCC, PLP, LPC, and RASTA-PLP-CEPS (77.22%), while for vowel /o/ the set of features includes MFCC, LPCC, RASTA-PLP-CEPS, PLP, and RASTA-PLP-SPEC (81.08%).
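The repeated 70/30 evaluation protocol used in this study can be sketched as follows. To keep the example dependency-free, a simple nearest-centroid classifier on synthetic one-dimensional data stands in for the Gaussian-kernel SVM; only the split-and-repeat logic is meant to reflect the protocol.

```python
import random

def nearest_centroid_fit(xs, ys):
    # Class means serve as the centroids of a trivial stand-in classifier.
    c0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    c1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    return c0, c1

def nearest_centroid_predict(model, x):
    c0, c1 = model
    return 0 if abs(x - c0) <= abs(x - c1) else 1

def repeated_holdout(xs, ys, n_splits=10, train_frac=0.7, seed=0):
    """Average test accuracy over repeated random 70/30 splits."""
    rng = random.Random(seed)
    accs = []
    for _ in range(n_splits):
        idx = list(range(len(xs)))
        rng.shuffle(idx)
        cut = int(train_frac * len(idx))
        tr, te = idx[:cut], idx[cut:]
        model = nearest_centroid_fit([xs[i] for i in tr], [ys[i] for i in tr])
        hits = sum(nearest_centroid_predict(model, xs[i]) == ys[i] for i in te)
        accs.append(hits / len(te))
    return sum(accs) / len(accs)

# Two well-separated synthetic groups stand in for patients and controls.
rng = random.Random(1)
xs = [rng.gauss(0.0, 1.0) for _ in range(20)] + [rng.gauss(5.0, 1.0) for _ in range(20)]
ys = [0] * 20 + [1] * 20
print(round(repeated_holdout(xs, ys), 2))
```

Averaging over many independent train/test realizations, as the study does with 100 of them, reduces the variance of the accuracy estimate on such small corpora.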
Skodda et al. (2011) analyzed the ability to articulate vowels in a group of PD patients
suffering from mild hypokinetic dysarthria. Results were compared with the ones obtained
with a control group. The goal of the study is to confirm the hypothesis that PD patients
present a reduced working space for vowels with respect to healthy controls, even when voice intelligibility is preserved. In fact, limited movements of the articulators, as may be the case in
hypokinetic dysarthria, should be characterized by a lowering of high frequency formants and
by an elevation of normally low frequency formants. To this purpose, the authors resort to the
notions of VAI and triangular VSA. The relationships of vowel space with the net speech rate and with the global motor impairment of the disease were also investigated. Analysis of speech
rate was performed by measuring the length of each syllable and pause. As such, the net speech
rate was defined as syllables per second relative to the total speech time minus the sum of all the pauses. The dataset is composed of German speakers, 68 patients and 32 healthy individuals; each participant performed a reading task composed of four complex sentences. Each of the
vowels /a/, /i/, and /u/ was extracted 10 times from different words within the text. Results
have shown that PD patients present reduced formant transitions and a restricted acoustic
vowel space. However, these impairments were independent of global motor function and
the stage of the disease. The triangular VSA was found to be reduced in male but not in female
PD speakers, whereas measurement of VAI seemed to be more applicable for the differentiation
of PD and healthy speakers.
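The triangular VSA and VAI can be computed directly from the corner-vowel formants. The sketch below uses the standard shoelace formula for the triangle area and the usual VAI ratio; the formant values are rough illustrative numbers, not measurements from the cited study.

```python
# Triangular vowel space area (tVSA) and vowel articulation index (VAI)
# from the corner vowels /a/, /i/, /u/, each given as an (F1, F2) pair in Hz.
def triangular_vsa(f_a, f_i, f_u):
    """Shoelace area of the /a/-/i/-/u/ triangle in the (F1, F2) plane."""
    (f1a, f2a), (f1i, f2i), (f1u, f2u) = f_a, f_i, f_u
    return 0.5 * abs(f1i * (f2a - f2u) + f1a * (f2u - f2i) + f1u * (f2i - f2a))

def vai(f_a, f_i, f_u):
    """VAI = (F2i + F1a) / (F1i + F1u + F2u + F2a); lower means centralized."""
    (f1a, f2a), (f1i, f2i), (f1u, f2u) = f_a, f_i, f_u
    return (f2i + f1a) / (f1i + f1u + f2u + f2a)

a = (850.0, 1300.0)  # illustrative (F1, F2) for /a/
i = (300.0, 2300.0)  # /i/
u = (320.0, 800.0)   # /u/
print(f"tVSA={triangular_vsa(a, i, u):.0f} Hz^2, VAI={vai(a, i, u):.2f}")
# tVSA=402500 Hz^2, VAI=1.16
```

A centralized (hypokinetic) vowel space shrinks the triangle and pushes the VAI toward lower values, which is why the VAI proved more sensitive than the raw area in this study.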
In another work, Orozco-Arroyave et al. (2014) exploited acoustic measures for analyzing
phonation and articulation in PD patients and for distinguishing them from a control group.
The corpus is composed of the recordings of 50 patients and 50 healthy subjects performing three repetitions of the five Spanish vowels. Participants are balanced by gender and age. The acoustic
features considered include: F0 and measures of its variability (jitter, shimmer, correlation di-
mension), F1, F2, the VAI, the triangular VSA, and three new measures: the vocal prism, the
vocal pentagon, and the vocal polyhedron. The base of the vocal prism is the triangular VSA,
while its altitude is given by the variability of the pitch estimated on the vowels /a/, /i/, and
/u/. The vertices of the vocal pentagon are given by the values of F1 and F2 for the five
Spanish vowels. The base of the vocal polyhedron is formed by the vocal pentagon, while its
edges are given by the pitch variability obtained from the five Spanish vowels. From these mea-
sures, the authors computed different features based on their geometrical properties (e.g., area,
volume). For each feature set, mean value, standard deviation, kurtosis, and skewness were
also computed. Classification was performed in two stages: in the first, a linear Bayesian
classifier identified those features with a minimum individual accuracy of 61%. This subset
of features was then included in the second phase, where an SVM (Cortes & Vapnik 1995) with
Gaussian kernel (Scholkopf & Smola 2001) was used to classify between PD patients and healthy
controls. The parameters of the SVM were optimized using a 10-fold cross-validation strategy. Two
of the features introduced by the authors were selected for the second phase of classification
(std[Vprism], centPenta[F2u]). The best result was achieved using a combination of articulation
and phonation features, providing an accuracy of 81.3%.
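The geometric features above reduce to elementary computations: the pentagon area follows from the shoelace formula over the five (F1, F2) vertices, and the prism volume can be approximated as base area times altitude. A sketch under these assumptions (the exact geometric definitions used by the authors may differ in detail):

```python
def polygon_area(vertices):
    """Shoelace formula: area of a polygon from ordered (F1, F2) vertices,
    e.g. the triangular VSA (3 vertices) or the vocal pentagon (5 vertices)."""
    n = len(vertices)
    s = 0.0
    for k in range(n):
        x1, y1 = vertices[k]
        x2, y2 = vertices[(k + 1) % n]
        s += x1 * y2 - x2 * y1
    return abs(s) / 2.0


def prism_volume(base_area, pitch_variability):
    """Vocal prism: the triangular VSA extruded by the pitch variability."""
    return base_area * pitch_variability
```

The vocal polyhedron follows the same idea, with the pentagon as base and per-vowel pitch variabilities as edges.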
Bocklet et al. (2011) investigated acoustic, prosodic and voice-related features to perform
the automatic classification of PD. The analysis was performed using three different systems.
Articulation is characterized through statistical modeling of acoustic features, using the first 39
MFCCs. The authors implemented a GMM-UBM approach to obtain speaker specific GMMs.
The means of each speaker are then used as speaker-specific features. Prosodic analysis was
based on a voiced/unvoiced (VUV) decision; the voiced segments were then used to compute
F0, energy, duration, pauses, jitter, shimmer, and different statistics. Voice and phonation were
modeled by a glottal excitation system based on a two-mass vocal fold modeling. Apart from
a phonation task, the corpus used in this study is the same as the one described in the work of
Rusz et al. (2011). Classification was performed with SVM (Cortes & Vapnik 1995), using LOO
cross-validation. The three systems achieved their best recognition rates on different tasks:
the prosodic system reached 90.5% on the reading of a 136-word text; the acoustic system,
88.1% on the reading of sentences containing words with varied stress patterns; and the glottal
excitation system, 78.6% on the reading of the 136-word text and of sentences read according
to specific emotions. Feature
selection was performed with a correlation-based (Hall 1999) approach, which prefers subsets
of features highly correlated with the class, but with a low inter-correlation among them. After
this approach, the recognition rates of the three systems were: 88.1% (prosodic system), 100%
(acoustic system), and 83.3% (glottal excitation).
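The GMM-UBM scheme used for the articulation features can be sketched compactly: a universal background model is trained on pooled frames, its means are MAP-adapted to each speaker, and the stacked adapted means serve as the speaker-specific feature vector. The following is a simplified NumPy illustration with a tiny diagonal-covariance GMM and means-only relevance MAP (the relevance factor `r`, the number of components, and the toy dimensions are illustrative, not those of the cited system):

```python
import numpy as np


def fit_ubm(frames, k=4, iters=20, seed=0):
    """Tiny diagonal-covariance GMM trained with EM, standing in for the UBM."""
    rng = np.random.default_rng(seed)
    means = frames[rng.choice(len(frames), k, replace=False)]
    var = np.full((k, frames.shape[1]), frames.var(axis=0))
    w = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: log responsibilities under each diagonal Gaussian
        ll = (-0.5 * (((frames[:, None] - means) ** 2 / var)
                      + np.log(2 * np.pi * var)).sum(-1) + np.log(w))
        post = np.exp(ll - ll.max(1, keepdims=True))
        post /= post.sum(1, keepdims=True)
        # M-step: update weights, means, and variances from soft counts
        n = post.sum(0)
        w = n / n.sum()
        means = post.T @ frames / n[:, None]
        var = post.T @ (frames ** 2) / n[:, None] - means ** 2 + 1e-6
    return w, means, var


def supervector(ubm, frames, r=16.0):
    """Means-only relevance-MAP adaptation of the UBM to one speaker's
    frames; the stacked adapted means are the speaker-specific features."""
    w, means, var = ubm
    ll = (-0.5 * (((frames[:, None] - means) ** 2 / var)
                  + np.log(2 * np.pi * var)).sum(-1) + np.log(w))
    post = np.exp(ll - ll.max(1, keepdims=True))
    post /= post.sum(1, keepdims=True)
    n = post.sum(0)                                  # soft counts
    ex = post.T @ frames / np.maximum(n, 1e-8)[:, None]
    alpha = (n / (n + r))[:, None]                   # adaptation weights
    return (alpha * ex + (1 - alpha) * means).ravel()
```

With k components and D-dimensional frames, each speaker is mapped to a k·D-dimensional supervector.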
In a recent work, Orozco-Arroyave et al. (2016) investigated the characterization of the
speech signal into voiced and unvoiced frames to automatically classify PD patients. Voiced
frames are used to compute different features assessing prosody (i.e., F0, jitter, shimmer) and
articulation (i.e., F1, F2, MFCC). The intuition behind the use of unvoiced frames stems from
the fact that PD patients develop problems in the correct pronunciation of stop and voiceless
consonants. Thus, there could also be important information in those frames where the vocal
folds should not vibrate. Unvoiced frames are modeled using 12 MFCC and 25 bands scaled
according to the Bark scale (Zwicker 1961). Similar to the work of Bocklet et al. (2011), acous-
tic features are modeled using a GMM-UBM strategy. Prosodic features were computed using
the Erlangen prosody module (Zeißler et al. 2006). The dataset contains the recordings of PD
patients and healthy individuals while performing four tasks (reading of isolated words, text,
sentences, and rapid syllable repetition) in three different languages (German, Spanish, Czech).
The classification model is a radial basis SVM (Scholkopf & Smola 2001) evaluated with 10-fold
and LOO cross-validation strategies (Geisser 1975, Stone 1974, 1977), according to the language.
The proposed approach is directly compared with other standard approaches classically used
for speech modeling, such as (1) noise measures, MFCC, and vocal formants extracted from
voiced segments; (2) MFCC extracted from the utterances without pauses and modeled using
a GMM-UBM strategy; and (3) different prosodic features extracted with the Erlangen prosody
module. Results obtained using unvoiced frames have proven to be more accurate than classi-
cal approaches, reaching an accuracy that ranges from 85% to 99%, depending on the language
and the speech task. Cross-language experiments were also performed following a two-step
strategy. The system was trained with the recordings of one language and then tested on the
remaining ones. Additionally, subsets of the language used for testing were included in the
training set and excluded from the test set incrementally. In general, the accuracy ranged from
60% to 100% when recordings of the language that was going to be tested were moved from
testing and added to training.
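The characterization of the signal into voiced and unvoiced frames can be approximated with a simple autocorrelation-based voicing decision: a frame with a strong periodicity peak in the plausible pitch range is treated as voiced. This is a rough sketch (frame sizes, the 60-400 Hz pitch range, and the 0.3 threshold are illustrative choices, not those of the cited systems):

```python
import numpy as np


def split_voiced_unvoiced(signal, sr, frame_ms=25, hop_ms=10, thresh=0.3):
    """Crude V/UV decision: a frame is voiced when its normalized
    autocorrelation has a strong peak at a lag in the pitch range."""
    flen = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    lo, hi = int(sr / 400), int(sr / 60)   # lags for 60-400 Hz pitch
    voiced, unvoiced = [], []
    for start in range(0, len(signal) - flen, hop):
        frame = signal[start:start + flen] - signal[start:start + flen].mean()
        energy = np.dot(frame, frame)
        if energy == 0:
            unvoiced.append(frame)
            continue
        # normalized autocorrelation at non-negative lags
        ac = np.correlate(frame, frame, mode="full")[flen - 1:] / energy
        peak = ac[lo:hi].max() if hi < flen else 0.0
        (voiced if peak > thresh else unvoiced).append(frame)
    return voiced, unvoiced
```

Features such as MFCC and Bark-band energies would then be computed separately on each stream.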
4.2 Monitoring of cognitive abilities
Although far from exhaustive, Chapter 3 introduced some of the numerous neuropsychological
tests used for screening cognitive performance and tracking alterations of cognition over time.
The wide range of existing test batteries is partially motivated by the need to have screening
measures able to distinguish the many types of neurodegenerative diseases affecting cogni-
tive abilities. As observed, these disorders present specific patterns regarding the regions of
the brain affected and the consequent symptoms. Accordingly, the screening process and the
measures adopted will also vary, having to be sufficiently sensitive to differentiate between
different disorders.
Since different cognitive stimuli require different underlying solutions, I restrict this review
to two types of studies. First, the studies related to the automatic analysis
of semantic verbal fluency tasks are summarized. I recall that in these tests the participants
should produce as many words as they can remember belonging to a particular semantic cate-
gory. This is an interesting task for the technological challenges it raises for current SLT. Then,
I describe those works that include tests assessing cognitive functions, such as memory, atten-
tion, or orientation, and that provide a completely automated solution based on ASR.
4.2.1 Semantic fluency tests
Pakhomov et al. (2012) were among the first authors targeting an automatic characterization
of verbal fluency tasks. Results on these kinds of tests are related to the ability to organize
semantic information into conceptually related clusters, and with the strategy used to access
these clusters. Thus, to provide an automatic assessment of clustering and switching strategies,
the authors resort to the notions of semantic similarity and semantic relatedness. The compu-
tation of these measures relies on the publicly available lexical database WordNet (Fellbaum
2010, Miller 1995). In this resource, each word is characterized by its morphosyntactic category
(e.g., noun, adjective), senses (possible different meanings), glosses (definitions), and semantic
relations (e.g., synonymy). To estimate how semantically similar two words are, the hyponymic
(i.e., "is-a") relation between words was used. In this way, WordNet was represented as a
hierarchy and the distance between two words was calculated as the distance between the locations
of these words in the hierarchy. Semantic relatedness has been computed using the Gloss
Vectors approach (Patwardhan & Pedersen 2006). This method leverages WordNet and word
co-occurrence frequency information computed from large corpora. A semantic representation
of a word is built as a high-dimensional second-order context vector, wherein each dimension
is represented by a term that co-occurs with the terms contained in the gloss of the word be-
ing analyzed. The corpus is composed of 113 patients with MCI and possible or probable AD.
Data are in the English language. Patients were administered a comprehensive test battery
considering the verbal fluency task for the animals category, and other tests assessing different
cognitive domains. The latter have been included to verify their relationship with the semantic
indices investigated. Results have shown that similarity and relatedness indices were corre-
lated with tests assessing executive functions, attention, and memory. Statistical differences
between the MCI and the probable AD group were also investigated, finding, surprisingly, that
the AD group produced higher scores in the similarity and relatedness indices.
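The path-based similarity used in this line of work can be illustrated with a toy "is-a" hierarchy standing in for WordNet: represent hyponymy as a graph, take the shortest path between two words, and map shorter paths to higher similarity. A self-contained sketch (the miniature hierarchy and the 1/(1+d) scaling are illustrative simplifications):

```python
from collections import deque

# toy "is-a" edges, in the spirit of WordNet's hyponymy hierarchy
IS_A = {
    "dog": "canine", "wolf": "canine", "canine": "carnivore",
    "cat": "feline", "feline": "carnivore",
    "carnivore": "mammal", "horse": "mammal", "mammal": "animal",
}


def path_length(w1, w2):
    """Shortest path between two words in the undirected is-a graph (BFS)."""
    graph = {}
    for child, parent in IS_A.items():
        graph.setdefault(child, set()).add(parent)
        graph.setdefault(parent, set()).add(child)
    seen, queue = {w1}, deque([(w1, 0)])
    while queue:
        node, d = queue.popleft()
        if node == w2:
            return d
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return None


def similarity(w1, w2):
    """Path similarity in [0, 1]: words closer in the hierarchy score higher."""
    d = path_length(w1, w2)
    return 1.0 / (1 + d) if d is not None else 0.0
```

Averaging such pairwise similarities over consecutive responses yields a clustering measure for a fluency transcript.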
Miller et al. (2013) evaluated the feasibility of interactive voice response (IVR) technology to
provide neuropsychological tests to older adults. Participants were administered the Wechsler
Adult Intelligence Scale - IV (WAIS-IV), the Wechsler Memory Scale fourth edition (WMS-IV),
the verbal fluency task for the fruit category, and the digit span forward and backward test.
The study involved 158 English-speaking subjects, with ages ranging from 65 to 92 years. The
algorithms for the IVR tasks were developed by TelAsk technologies (TelAsk 2019); the word
recognition engine used is the Nuance Open Speech Recognizer (Nuance 2019). The system
was not trained to optimize the recognition of individual speakers. No further details on
the recognition approach were provided. The feasibility of the system was analyzed in terms
of its capability to independently administer and score simple neuropsychological tests, and
in terms of its capability to provide results comparable to those of in-person administration. Results
have shown that only 4% of participants were unable to complete all the tasks, indicating that,
overall, the system was easy to use. In the verbal fluency task, 90% of the fruits were correctly
recognized, while in the digit span tests, the percentages of sequences correctly recognized
were 93% and 95%. Overall, clinician and IVR system scores in the three tests were highly
correlated (r=0.89, r=0.95, r=0.94), but the study also reports a lack of high agreement between
clinician and computer scoring (41.1%, 63.8%, 68%). According to the authors, this represents
the greatest obstacle to the use of these systems in clinical practice. To conclude, the authors
also acknowledge that the different mode of administration of the tasks may change what the
tests measure, and that IVR may possibly introduce new variables in the cognitive evaluation.
Lopez-de Ipina et al. (2015) performed an automated analysis of a semantic verbal flu-
ency task in order to distinguish between MCI and control subjects. To this purpose, the au-
thors used a corpus composed of 187 healthy subjects and 38 patients diagnosed with MCI.
Speech samples were processed in order to compute several linear and non-linear features,
among which the first 12 MFCC, the pitch and its variation, the Castiglioni Fractal Dimension
(CFD) (Castiglioni 2010), and the Permutation Entropy (PE). The fractal dimension quantifies
the roughness of a temporal signal and estimates its degrees of freedom. According to the
authors, this feature has the ability to capture the dynamics of a system and thus may reveal
relevant variations in speech utterances. Feature selection was performed with the analysis
of variance (ANOVA) (Fisher 1919) test, which reduced the original feature set by more than
50%. Experiments were performed with SVM (Cortes & Vapnik 1995) using 10-fold
cross-validation; results were provided for the two groups separately. Overall, there was a strong
difference in terms of classification accuracy between the control and the MCI group. In fact,
with a combination of linear features, CFD, and PE, the authors reported an accuracy of 85%
and 50% in classifying the control and the MCI group, respectively. Using feature selection on
the same set of features the accuracy for the control group improved to 90%, while for the MCI
group decreased to 40%.
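Permutation entropy, one of the non-linear features above, measures how uniformly the ordinal patterns of consecutive samples are distributed: a monotone signal uses a single pattern (entropy 0), while an unpredictable one spreads its mass over many patterns (entropy near 1 after normalization). A minimal sketch:

```python
import math
from collections import Counter


def permutation_entropy(x, order=3, delay=1):
    """Normalized permutation entropy: Shannon entropy (in bits) of the
    distribution of ordinal patterns of `order` samples, scaled to [0, 1]
    by the maximum log2(order!)."""
    patterns = Counter()
    for k in range(len(x) - delay * (order - 1)):
        window = [x[k + j * delay] for j in range(order)]
        # ordinal pattern: argsort of the window values
        patterns[tuple(sorted(range(order), key=window.__getitem__))] += 1
    total = sum(patterns.values())
    h = -sum((c / total) * math.log2(c / total) for c in patterns.values())
    return h / math.log2(math.factorial(order))
```

Applied frame-wise to a speech signal, it complements the fractal dimension as a descriptor of signal irregularity.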
In a subsequent work, Pakhomov et al. (2015) exploited ASR to assess the spoken responses
produced to the semantic verbal fluency test for the animals category. The authors used a
combined approach consisting of a constrained language model, a speaker-adapted acoustic
model, and confidence scoring to filter the ASR output. The corpus was composed of 38 En-
glish speaking professional fighters participating in a longitudinal study of effects of repetitive
head trauma on brain function. Responses were recorded and manually evaluated. The
assessment also comprised the reading of a text passage (∼30 seconds) that was used to perform the
adaptation of the acoustic model. The language model was trained using a corpus of responses
to the animal verbal fluency test provided by 1367 subjects. Finally, the authors also exper-
imented with confidence scores to filter the ASR output of the responses. Responses to the
verbal fluency test contained a large number of disfluencies, noise, and non-speech events that
led to a relatively poor baseline ASR performance (WER of 89%). However, both speaker
adaptation and confidence scoring, individually, improved the baseline result, leading to a reduction
of the WER to 70%. Using only the adaptation of the acoustic model, the correlation between
automatically and manually computed scores was relatively high (r=0.80). A closer inspection
revealed the existence of individual samples in which there were clear differences between the
two scores. Extraneous comments were found to be the biggest contributors to these
discrepancies. Using a constrained language model, these non-animal words are likely to result in lower
overall confidence scores, and thus they can be easily filtered out. In fact, after the confidence
scoring approach, the correlation between automatically and manually computed scores im-
proved to 0.86. Overall, the confidence score approach reduced the number of insertions, but
with the trade-off of an increased number of deletions. The combination of speaker adaptation
and confidence scores filtering led to an improvement of the results, with a reduction of the
WER to 53%.
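The confidence-filtering step can be sketched as a simple threshold on per-word ASR confidences, followed by counting the distinct in-category items that survive. A minimal sketch (the word list, confidence values, and 0.5 threshold below are illustrative, not taken from the study):

```python
def filter_by_confidence(hypothesis, vocabulary, threshold=0.5):
    """Keep recognized words whose ASR confidence clears the threshold;
    with a constrained language model, out-of-task words (extraneous
    comments, disfluencies) tend to receive low confidence and are dropped.
    Returns the retained words and the count of distinct in-category items."""
    kept = [w for w, conf in hypothesis if conf >= threshold]
    return kept, len({w for w in kept if w in vocabulary})


# toy hypothesis: (word, confidence) pairs from a fluency response
hyp = [("dog", 0.92), ("uh", 0.21), ("cat", 0.85),
       ("whatever", 0.30), ("dog", 0.88), ("horse", 0.77)]
animals = {"dog", "cat", "horse", "cow"}
words, score = filter_by_confidence(hyp, animals)
print(words, score)  # ['dog', 'cat', 'dog', 'horse'] 3
```

As in the study, filtering trades extra deletions for fewer insertions, which benefits the automatic fluency score.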
4.2.2 Cognitive tests assessing memory, attention, orientation
Currently, there are few works targeting the automatic administration of cognitive tests
through the integration of SLT. Among them, I mention the work of Coulston et al. (2007).
The authors investigated the use of a computerized in-home monitoring system incorporating
SLT for the early detection of AD. The system is designed as a kiosk, providing an unattended
battery of questionnaires and cognitive tests. Appointments are scheduled either in person
or by phone; the selected times are entered by the study coordinators via a web interface. A
session has an approximate duration of 30 minutes; responses are typically recorded and, with
just one exception, processed manually. Instructions to the user are explained through a short
video. Interactions take place exclusively via touchscreen or speech; in fact, speech recogni-
tion is enabled for simple navigation through the interface (e.g., a yes/no question). The client
is synchronized regularly with a remote server either to upload the results of the evaluation,
which is cached on the local file system on the client machine, or to check for newly scheduled
testing appointments. Questionnaires include self-report questions about the quality of life,
medication adherence, and how well participants are able to complete activities of daily life.
Cognitive tests include four tasks relying on speech technology and a task requiring partici-
pants to connect labeled dots using their finger on the touchscreen. Among the speech tasks,
the study considered: word list recall, backward digit span, category fluency and the East
Boston story recall. At the time of the study, speech recognition was used to automatically rec-
ognize and score only the backward digit span. Speech detection was used to determine when
to send the patient an encouragement to continue the task, or to ask if he/she has finished. The
work does not include any evaluation of the system.
Wang & Starren (1999) implemented a speech-enabled version of the MMSE (Folstein et al.
1975) in order to evaluate the capabilities of the Java Speech Application Programming Inter-
face (JSAPI) (Oracle 2019) for speech recognition and synthesis. Interactions with the system
may happen through voice, mouse, and keyboard. In order to implement the MMSE com-
pletely, some questions had to be modified. This was needed for those questions that require
human supervision, such as tasks in which the patient has to perform different actions or those
tasks requiring reading or writing abilities. In this application, the ASR system uses a rule-
based grammar. In fact, rule-based recognition is well suited when the number of inputs is
limited, providing, in general, higher recognition accuracy. For all the tasks except a
question related to the current date, the grammar was programmed in advance. As for the date
question, since it changes daily, the grammar had to be dynamically generated. The
system integrates a scoring module that computes the result of each individual answer. The
scoring component was implemented with a Boolean variable that is set to true for each
question answered correctly. However, at the time of the study, the score for writing
a sentence was still processed manually, but the possibility of feeding the patient’s input into an
external syntactic parser was being explored. Usability tests performed with five graduate stu-
dents revealed an overall satisfaction with the system. Furthermore, the average automatic
score of 24.8 computed by the system was quite close to the average manual score of 26. To
conclude, the authors reported that, for significantly impaired patients, interaction relying
entirely on the computer would probably be impractical, but the system still has potential for
clinical use as a routine screening tool for cognitive disorders.
Lehr et al. (2012) developed a framework to automatically analyze the responses provided
to the Wechsler Logical Memory (WLM) test, part of the Wechsler Memory Scale (WMS) bat-
tery (Wechsler 1997). During the test, which is used to assess memory function, the examiner
reads a brief narrative that the subject is required to retell twice, immediately and after an inter-
val of about 30 minutes. Responses are graded according to how many key story elements are
recalled, in any order, from a list of 25 predetermined story elements. The corpus is composed
of 72 English-speaking subjects, 35 diagnosed with MCI and 37 healthy individuals. Three
different adaptation strategies of the acoustic model were evaluated, leading to important
improvements of the WER (47.5%, 39.8%, and 41.7%). The automatic transcriptions were then
used to derive word-level alignments between each retelling and the WLM source narrative.
The Berkeley aligner (Liang et al. 2006) was used to obtain the alignments; it was trained on
a source-to-retelling and retelling-to-retelling parallel corpus. The alignments, along with the
WLM administration guidelines, allowed the authors to determine which retelling words are matches for
the story elements. Finally, the story elements are used as features for diagnostic classification.
Each subject is associated with a feature vector of length 50, containing 25 story element features
for the immediate retelling and 25 story element features for the delayed retelling. The fea-
tures correspond to the 25 WLM story elements having a value of 1 if the story element was
recalled and 0 otherwise. An SVM (Cortes & Vapnik 1995) model was trained with the story
element feature vectors manually extracted from the held-out dataset. The model is then tested
on the story element feature vectors extracted from the ASR output with the three acoustic
models. Results have shown that when the ASR quality improves, classification accuracy also
improves (75.4%, 77.7%, 80.9%), yielding outcomes comparable to that of manually-derived
features (81.5%).
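The feature construction described above can be sketched directly: each retelling is reduced to a binary indicator vector over the predetermined story elements, and the immediate and delayed vectors are concatenated. A minimal sketch (the element names below are illustrative, not the official WLM scoring list):

```python
def story_element_vector(immediate, delayed, elements):
    """Binary vector of length 2 * len(elements): 1 if the story element
    was recalled in the immediate / delayed retelling, else 0."""
    vec = [1 if e in immediate else 0 for e in elements]
    vec += [1 if e in delayed else 0 for e in elements]
    return vec


# toy element list and recalled sets extracted from two retellings
elements = ["anna", "boston", "cook", "robbed", "fifty-six dollars"]
v = story_element_vector({"anna", "cook"}, {"anna"}, elements)
print(v)  # [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
```

These vectors, whether derived from manual or ASR transcripts, feed the SVM classifier directly.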
Hakkani-Tur et al. (2010) investigated the usability of automated methods for evaluating
verbal cognitive status assessment tests. The work is focused on two types of tests: a story-
recall task, used to assess memory and language functioning, and a picture description task,
used to assess the information content in speech. For the story retelling stimulus, the WLM
subtest of the WMS was used (Wechsler 1997), while for the picture description test, the Picnic
picture included in the Western Aphasia Battery (WAB) was selected (Kertesz 1982). The corpus
is composed of 123 English-speaking participants aged 20 to 102. The goal of the work is to prove that
measures derived automatically from the subject’s speech provide high correlation with cor-
responding measures extracted manually. For these reasons, speech samples were manually
transcribed and annotated, and also processed with an ASR in order to obtain the automatic
transcriptions. The speech recognizer was developed for the recognition of meetings with
close-talking microphones, using acoustic data from young speakers (Stolcke et al. 2008); no model
adaptation was performed. Preliminary results have shown a WER of 30.7% and 26.7% for
the story retelling and the picture description tasks, respectively. For the story retelling test,
the authors extracted 35 atomic semantic content units, a sentence-level piece of information
comparable to a fact, while for the picture description test, a list of 36 units subdivided into 4 key
categories was used. The information content of the descriptions was then evaluated based
on the number of information units produced. Recall, precision, and F-score of uni-grams and
bi-grams are then computed on the story retelling test, while recall is computed for the picture
description test. Finally, the correlation between the manual evaluation scores and automatic
metrics is derived both from manual and ASR transcriptions. With respect to the story
recall test, uni-gram F-score provided the highest correlation, with manual transcriptions
achieving a higher score (r=0.85) than automatic ones (r=0.70). Regarding the picture
description task, the correlation of uni-gram recall computed on the manual and automatic
transcriptions was 0.93 and 0.89, respectively.
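The uni-gram overlap metrics used for the story-recall evaluation reduce to clipped counts between the retelling and the source narrative; a minimal sketch:

```python
from collections import Counter


def unigram_overlap(retelling, reference):
    """Clipped uni-gram precision/recall/F1 between a retelling and the
    source narrative: each word counts at most as often as it occurs in
    the other text (Counter intersection keeps the minimum count)."""
    h, r = Counter(retelling), Counter(reference)
    overlap = sum((h & r).values())
    p = overlap / max(sum(h.values()), 1)
    rec = overlap / max(sum(r.values()), 1)
    f = 2 * p * rec / (p + rec) if p + rec else 0.0
    return p, rec, f


p, r, f = unigram_overlap("the cook was robbed".split(),
                          "the cook was robbed of money".split())
print(round(p, 2), round(r, 2), round(f, 2))  # 1.0 0.67 0.8
```

Bi-gram variants follow the same pattern over adjacent word pairs.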
4.3 Monitoring of language abilities
Language abilities are evaluated through the assessment of isolated, specific functionality (e.g.,
naming), or through the evaluation of discourse production. The latter can provide a broader and
more comprehensive view of linguistic impairments. In fact, discourse production can be
assessed along two dimensions: microlinguistic, concerned with lexical and syntactic processing,
and macrolinguistic, focused on pragmatic processing. The former yields data about language-
specific abilities for processing phonological, lexical, and syntactic aspects of single words and
sentences. The latter depends on the integration of linguistic and non-linguistic knowledge
for maintaining conceptual, semantic, and pragmatic organization at the suprasentential
level (Kintsch 1994, Kintsch & Van Dijk 1978, Marini et al. 2011). In a discourse production task,
microlinguistic aspects are quantified by lexical error measures (i.e., verbal paraphasias and use
of indefinite terms), and by syntactic measures consisting of omissions or errors in grammati-
cal forms and syntactic complexity. Macrolinguistic aspects of language production are instead
quantified by ratings of cohesion and coherence. In the literature reviewed for this area, I have
found several works targeting an automatic evaluation of discourse production. The focus of
these studies is on some aspects of the micro- and macrolinguistic dimensions, with the latter
being approached only very recently. Overall, these works assess the quality of discourse production
through the automatic analysis of a combination of lexical, syntactic, acoustic, and semantic
features.
Among the works targeting an automatic analysis of narrative speech, I briefly recall the
work of Hakkani-Tur et al. (2010), presented at the end of the previous section. In this work,
the authors used a picture description test to assess the information content in speech. To this
end, they considered the number of information units produced with respect to a predefined
list of 36 units subdivided into 4 key categories.
Orimaye et al. (2014) investigated five different computational models for predicting AD
and related dementias using several syntactic and lexical features. The corpus used in this work
is the DementiaBank (MacWhinney et al. 2011, TalkBank 2017), a public database for the study
of communication in dementia. The collection was gathered in the context of a longitudinal
study with yearly visits; demographic data, together with the education level, are provided. It contains
the recordings of healthy individuals and subjects diagnosed with AD, MCI, and other disor-
ders, while performing four tasks. Among these, there are also the descriptions of the Cookie
Theft picture from the BDAE (Goodglass et al. 2001). Recordings were manually transcribed at
word level following the TalkBank Codes for the Human Analysis of Transcripts (CHAT)
protocol (MacWhinney 2000). Data are in the English language. In their work, Orimaye et al. considered
242 samples from patients diagnosed with various dementias, mostly of the AD type, and 242
samples from the control group. The authors identified 21 relevant features: 9 syntactic, 11 lexi-
cal, and age as a confounding feature. Syntactic features relied both on the annotations existing
in the original transcriptions and on the annotations extracted with the Stanford parser (Klein
& Manning 2003a). These features included the set of coordinated, subordinated, and reduced
sentences, the dependency distance used as a measure of grammatical complexity, the number
of dependencies, predicates and their average, and production rules. Lexical features consid-
ered the total number of utterances and their mean length, the total number of function and
unique words, revisions, and the number of word repetitions. A revision indicates that the
patient retraced a preceding error and then made a correction. The feature extraction stage is
followed by statistical tests to identify the most important features. Additionally, the authors
also performed a stage of feature selection with the Information Gain (IG) method. In this
way, they found that the eight features with the highest IG value matched the subset of eight
significant features identified through the statistical tests. Experiments were performed using
five different models: SVM (Cortes & Vapnik 1995), Naïve Bayes (NB), Bayes Network (BN),
J48 decision tree (DT), and ANN. Performance is measured in terms of precision, recall, and F-score;
10-fold cross-validation was implemented for each model. Results identified SVM (Cortes & Vapnik
1995) as the best predicting algorithm, achieving the highest F-score of 74% on the disease
group. Overall, results have demonstrated that the patient group used less complex sentences
than the control group and produced more grammatical errors.
Jarrold et al. (2014a) evaluated the predictive capabilities of different machine learning al-
gorithms in the problem of diagnosing four dementia subtypes. Lexical and acoustic features
were automatically computed from the speech recordings, and the associated transcriptions, of
the Picnic picture description test (Kertesz 1982). The corpus is composed of 9 healthy individ-
uals and 39 patients diagnosed with AD (N=9), FTD (N=9), svPPA (N=13), and nfvPPA (N=8).
Acoustic features were extracted with the Meeting Understanding system (Stolcke et al. 2008)
and consider measures related with the duration of consonants, vowels, pauses, and other
acoustic-phonetic categories. Lexical features included frequencies of different morphosyntac-
tic categories, also known as Part of Speech (POS) annotations, and frequencies of words orga-
nized according to 81 categories. To evaluate the sensitivity to speech recognition errors, fea-
tures were computed relying both on the automatic and on the manual transcriptions. Lexical
and acoustic features were combined to form a unique vector characterizing each speaker. The
most informative features were selected through a one-way ANOVA (Fisher 1919) performed
on each feature in each group with respect to the diagnosis. Evaluation was conducted using
5-fold cross-validation over the set of patients. Logistic regression, MLP (Rosenblatt 1958), and
DT (Mitchell 1997) were evaluated. MLP achieved a slightly better performance with an ac-
curacy of 88% in the classification of AD versus FTD, and AD versus control subjects. When
comparing this result with the one obtained using manual transcriptions, the authors found a
difference in classification accuracy of only 2-3%.
Fraser et al. (2016) automatically computed a number of linguistic and acoustic variables
from the recordings, and the associated transcription, of the Cookie Theft picture description
task (Goodglass et al. 2001), contained in the DementiaBank database (MacWhinney et al. 2011,
TalkBank 2017). The dementia group included participants with a diagnosis of possible AD
or probable AD, resulting in 240 samples from 167 participants. The control group included
233 samples from 97 speakers. The authors considered a large number of features, more than
350, derived from the areas of linguistics, psycholinguistics, and speech processing. Relying on
previous studies showing that AD patients may report altered proportion of nouns, adjectives,
and verbs (Bucks et al. 2000a, Jarrold et al. 2014b), the frequency of occurrence of different POS
tags was calculated. Syntactic complexity was measured through mean length of sentences,
T-units, clauses, and scores calculated on the results of a parse tree computed using the Stan-
ford parser (Klein & Manning 2003a). Then, in order to further explore syntactic differences,
the frequency of occurrence of different grammatical constituents were quantified. Vocabulary
richness was assessed using type-token ratio (TTR), moving-average type-token ratio (MATTR)
(Covington & McFall 2010), Brunet’s index (Brunet 1978), and Honore’s statistic (Honore 1978).
Psycholinguistic features were included with the intuition that a semantic impairment may
be manifested through an increased reliance on familiar words. Thus, different norms were
used to rate content words, nouns and verbs in terms of familiarity, imageability, and age-
of-acquisition. To deal with a decreased information content, the authors computationally
measured the mentioning of relevant lexical items relying on a list of expected information
units (Croisile et al. 1996). Finally, acoustic analysis included several features used in the litera-
ture as indicative of pathological speech and the first 42 MFCC. Multilinear logistic regression
with 10-fold cross-validation was used to classify between AD and healthy controls. At each
iteration, 90% of the data is used to train the model and to select the most useful features, while
the remaining 10% is retained for validation. Feature selection was performed choosing the N
features with the highest Pearson’s correlation coefficient between each feature and the binary
class. The maximum average accuracy was 81.9%, achieved with the 35 top-ranked features.
Using all the features, the classification accuracy drops to 58.5%. Furthermore, to examine the
underlying structure of the data, the authors performed an exploratory factor analysis, finding
that four factors were the most relevant: semantic impairment, acoustic abnormality, syntactic
impairment, and information impairment.
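The vocabulary-richness measures above are all closed-form functions of token and type counts; a sketch using the standard formulations (TTR = V/N, Brunet's W = N^(V^-0.165), Honoré's R = 100·log N/(1 − V1/V) with V1 the hapax count, and MATTR as the mean TTR over a sliding window):

```python
import math
from collections import Counter


def richness(tokens, window=10):
    """Vocabulary-richness measures from a token sequence."""
    n = len(tokens)
    counts = Counter(tokens)
    v = len(counts)                                   # vocabulary size (types)
    v1 = sum(1 for c in counts.values() if c == 1)    # hapax legomena
    ttr = v / n
    mattr = (sum(len(set(tokens[k:k + window])) / window
                 for k in range(n - window + 1)) / (n - window + 1)
             if n >= window else ttr)
    brunet = n ** (v ** -0.165)                       # lower = richer vocabulary
    honore = 100 * math.log(n) / (1 - v1 / v) if v1 < v else float("inf")
    return {"TTR": ttr, "MATTR": mattr, "Brunet": brunet, "Honore": honore}
```

Unlike plain TTR, MATTR does not shrink mechanically with longer samples, which is why it is often preferred for spontaneous speech.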
In a subsequent work, Fraser & Hirst (2016) investigated distributed word representations to
detect semantic changes that may occur in AD. The authors constructed two semantic spaces,
one for the control and one for the patient group, and analyzed the differences between them.
To this end, they built a simple word-word co-occurrence model with the transcriptions of
the Cookie Theft picture (Goodglass et al. 2001). The corpus used in this study contains the
same data described in the previous work of these authors (Fraser et al. 2016). Differences in
the groups were found in eleven word vectors: /three/, /another/, /put/, /side/, /getting/,
/spill/, /splash/, /which/, /up/, /say/, and /fall/. A contextual analysis for these words re-
vealed two different scenarios. In one case, control participants used a number of context
words not used by the AD participants. These words were associated by the authors with a
certain attention to detail. In the other case, AD participants did not use a number of context
words used by the control group. These words were associated with implausible details. Then,
the authors performed a shifting of word vector representations in order to understand how
the words examined have moved in the vector space. Results showed that, in many cases,
the word representations in the AD and control corpora lay very close to each other. When
vector representations were quite distant, the corresponding words were used in different con-
texts. An example is provided for the verb /getting/. Examining the surrounding vectors, it
appears that /getting/ is closer to /running/, /overflowing/, and /falling/ in the AD corpus,
while it is closer to words like /reaching/ and /ask/ in the control corpus. Thus, in order to
discover the multiple senses and the different contexts in which these terms appear in the two
corpora, the authors also performed a cluster analysis. In this case, results revealed that most
word senses were used by both groups, while some rare senses used only by the AD group
corresponded to semantic errors.
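The core of this comparison, building word-word co-occurrence vectors per corpus and measuring how similarly a word is used across the two, can be sketched as follows. The toy corpora are invented for illustration; the original study used a full co-occurrence model over the DementiaBank transcriptions:

```python
from collections import Counter

def cooccurrence_vectors(tokens, window=2):
    """Word-word co-occurrence counts within a symmetric context window."""
    vectors = {}
    for i, word in enumerate(tokens):
        ctx = vectors.setdefault(word, Counter())
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                ctx[tokens[j]] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = sum(v * v for v in a.values()) ** 0.5
    nb = sum(v * v for v in b.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0
```

Building one model per group and comparing, say, the vector for /getting/ in each with `cosine` reproduces in miniature the analysis described above: disjoint context words yield a similarity of zero, identical usage a similarity of one.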
Yancheva et al. (2015) used a set of 477 automatically extracted lexicosyntactic, acoustic, and
semantic features to estimate clinical MMSE (Folstein et al. 1975) scores along time. The corpus
used in the study consists of the recordings of the Cookie Theft picture (Goodglass et al. 2001),
contained in the DementiaBank database (MacWhinney et al. 2011, TalkBank 2017). The au-
thors considered only subjects with associated MMSE scores, resulting in 393 speech samples
from 255 subjects (165 AD, 90 control subjects). To estimate clinical MMSE scores, a bivari-
ate dynamic Bayes Network was used to represent the longitudinal progression of linguistic
features and MMSE scores. Lexicosyntactic features were extracted from syntactic parse trees
constructed with the Brown parser (Charniak & Johnson 2005) and from the annotations pro-
vided with the transcriptions of the narratives. A total of 182 measures were computed, including
vocabulary richness, syntactic complexity, repetitions, and phrase types. Acoustic mea-
sures included the standard MFCC, formant features, and measures of disruptions in vocal
fold vibration regularity, leading, overall, to 210 features. Finally, semantic measures assessed
the ability to describe concepts and objects of the Cookie Theft picture. To this purpose, 85 fea-
tures were used to verify that a key concept was mentioned, and to compute word frequencies.
Then, three feature selection methods were exploited to identify the most informative mea-
sures. The first method selected a set of top 10 features, which was corroborated by the second
and third methods. Interestingly, acoustic features were not included among the top 10 fea-
tures. Performance was measured in terms of the Mean Absolute Error (MAE) between actual
and predicted MMSE scores. In the experimental phase, both the size of the feature set and
the feature selection methods varied. Experiments were performed with LOO cross-validation.
The lowest MAE of 3.83 was achieved using the correlation with the MMSE as a feature se-
lection method and choosing the top 40 features. To evaluate the effect of longitudinal data
in the prediction of the MMSE score, the authors repeated the same experiment, but dividing
the dataset according to the number of longitudinal samples. In this case, results showed that
the lowest MAE for each feature selection method was found on the dataset with the highest
number of longitudinal visits (≥3).
In a subsequent work, Yancheva & Rudzicz (2016) presented a generalizable method to au-
tomatically generate and evaluate the information content conveyed by the description of
the Cookie Theft picture (Goodglass et al. 2001). The data selected contained 255 speech sam-
ples from 168 participants diagnosed with probable or possible AD, and 241 samples from 98
healthy controls. The authors trained a word vector model on a large general-purpose corpus
composed of Wikipedia 2014 (Wikipedia 2014) and Gigaword 5 (LDC 2019). The trained model
consisted of 400,000 word vectors, in 50 dimensions. Then, vector representations for each
word in the DementiaBank corpus were extracted using the previously trained model. Two
cluster models were built, one for each group, using the k-means algorithm. Clusters represent
topics, or groups of semantically related word vectors, discussed by the respective group of
subjects. All previous works related to the manual definition of content units for the Cookie
Theft picture were combined with the content units defined by a speech language pathologist
expert. To evaluate the generated clusters, the Euclidean distance between each clinical content
unit and its closest cluster centroid, in each model, was computed. For both groups, the
recall was 96.8%. This measure was defined as the percentage of content units whose dis-
tance to the cluster centroid was less than the distance of 99.7% of the datapoints in the cluster.
Different semantic features were extracted from the two models and then used in the classification
between patients and controls. Experiments were performed with a RF classifier and 10-fold
cross-validation, varying the cluster model and the feature set. Using a set of 12 automatically
extracted features, an F-score of 0.74 was achieved. This value is higher than the score
obtained with a set of 85 manual features (0.72). Combining the 12 automatically extracted
features with the set of lexicosyntactic and acoustic features introduced by Fraser
et al. (2016), the F-score improved to 0.80.
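The recall criterion described above, a content unit counts as covered when its distance to the centroid is below the distance of 99.7% of the cluster's own datapoints, can be sketched in a simplified single-cluster form. In the actual study the distance is taken to the closest of the k-means centroids, so this is an illustration of the criterion rather than the full method:

```python
def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def content_unit_recall(units, centroid, cluster_points, coverage=0.997):
    """Fraction of content units lying within the cluster's coverage radius.

    The radius is the distance below which `coverage` of the cluster's own
    datapoints fall, i.e. the 99.7% criterion described in the text.
    """
    dists = sorted(euclidean(p, centroid) for p in cluster_points)
    radius = dists[min(len(dists) - 1, int(coverage * len(dists)))]
    return sum(1 for u in units if euclidean(u, centroid) <= radius) / len(units)
```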
Hernandez-Dominguez et al. (2018) approached the automatic evaluation of information
content from the recordings and associated transcriptions of the Cookie Theft picture (Good-
glass et al. 2001), contained in the DementiaBank database (MacWhinney et al. 2011, TalkBank
2017). The authors selected 262 participants among AD, MCI, and healthy controls, providing
a total of 517 transcriptions (257 AD, 43 MCI, and 217 healthy control samples). Additionally,
25 healthy controls and their transcriptions were retained for the generation of a referent that
is used to automatically evaluate the informativeness and the pertinence of the descriptions.
The referent is created by extracting patterns that consider different manners of describing ac-
tions or situations. Linguistic and phonetic features were also considered, by accounting for
the frequency of different word classes, measures of vocabulary richness, and MFCC. Over-
all, a total of 105 features were computed. The authors investigated the correlation between
each feature and the severity of the disease, which was measured on a three-point rating scale
(healthy = 0, MCI = 1, and AD = 2). From this analysis, they found that the information cov-
erage measures appeared to be the variables most strongly correlated with the severity of the
cognitive impairment. Classification experiments were initially performed between the group
of healthy controls and AD patients. In a second phase, the MCI group was joined with the
AD group. Evaluation was conducted with different classifiers using 10-fold cross-validation.
Results show F-scores of 81% and 82% for the first and second experiments, respectively, in
the identification of patients with cognitive impairments.
Very few studies approached linguistic deficits at a higher level of processing. Among
them, in the remainder of this section, I describe the work of Santos et al. (2017) and the work
of Toledo et al. (2018).
Santos et al. (2017) assessed coherence and cohesion in a population of subjects with MCI. The
dataset used in this study consists of manually transcribed samples of spontaneous speech
elicited with different types of stimuli: (i) the description of the Cookie Theft picture (Goodglass
et al. 2001), contained in the DementiaBank database (MacWhinney et al. 2011, TalkBank 2017),
(ii) the telling of the Cinderella story, and (iii) the immediate and delayed recall of Portuguese
narratives of the Arizona Battery for Communication Disorders of Dementia (ABCD) battery.
From the DementiaBank database, the authors selected 43 transcriptions for the MCI and the
control group. The Cinderella and the ABCD datasets included, respectively, 20 and 23 subjects
with MCI, and 20 elderly controls. Discourse transcripts were modeled as a complex network
using the word adjacency model (i Cancho et al. 2004). With this approach, each distinct word
becomes a node and words that are adjacent in the text are connected by an edge. The authors
trained two 100-dimensional word embeddings models, for English and Portuguese language,
using Wikipedia dumps from October and November 2016, respectively. These models were
then used to enrich the complex networks: new edges were added between words whose word
vectors had a cosine similarity higher than a given threshold. Classification was performed
using topological metrics of the network, BOW representations, and linguistic features. These
were extracted with the tool Coh-Metrix (Graesser et al. 2004), which includes measures of
lexical diversity, syntactic complexity, word information, and text cohesion through latent se-
mantic analysis. For the Portuguese language, the tool Coh-Metrix-Dementia (Aluisio et al.
2016) was used. Experiments were performed with different classifiers, using 5-fold
cross-validation and different combinations of features. Depending on the dataset used, the
accuracy was 52% (Cinderella), 65% (DementiaBank), and 74% (ABCD), achieved with a
combination of topological features computed on the enriched network, BOW, and linguistic
features.
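The word adjacency model and its embedding-based enrichment can be sketched as follows. The tiny sentence and the two-dimensional word vectors are invented for illustration; the study used 100-dimensional embeddings trained on Wikipedia:

```python
def adjacency_network(tokens):
    """Word adjacency model: distinct words are nodes, adjacent words share an edge."""
    edges = set()
    for a, b in zip(tokens, tokens[1:]):
        if a != b:
            edges.add(frozenset((a, b)))
    return edges

def enrich(edges, vectors, threshold=0.8):
    """Add an edge between any two words whose embedding vectors have a
    cosine similarity above the threshold (vectors: word -> list of floats)."""
    def cos(u, v):
        dot = sum(x * y for x, y in zip(u, v))
        nu = sum(x * x for x in u) ** 0.5
        nv = sum(x * x for x in v) ** 0.5
        return dot / (nu * nv)
    words = list(vectors)
    out = set(edges)
    for i, a in enumerate(words):
        for b in words[i + 1:]:
            if cos(vectors[a], vectors[b]) > threshold:
                out.add(frozenset((a, b)))
    return out
```

Topological metrics (degree distribution, clustering, shortest paths) are then computed on the enriched edge set, which is where the classification features described above come from.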
Toledo et al. (2018) analyzed macrolinguistic aspects of speech in a corpus of 60 Portuguese
subjects divided into three groups: AD, MCI, and healthy controls. Participants were required
to narrate the Cinderella story. Discourse samples were recorded, manually preprocessed and
transcribed. In order to extract macrostructural characteristics of discourse, features were com-
puted with the tool Coh-Metrix-Dementia (Aluisio et al. 2016) and by manual marking. The
analysis investigated two variables of discourse production: i) informativity and narrative
structure, and ii) global coherence and modalization (e.g., comments external to the content
of the story). To account for the first variable, the authors considered the number of proposi-
tions of each text. For the analysis of global coherence, the amount of empty emissions, the
total ideas density feature, and the latent semantic analysis features were considered. Statisti-
cal analyses were performed to identify the features and metrics capable of differentiating the three
groups. The nonparametric Kruskal-Wallis (Kruskal & Wallis 1952) test was used to compare
performance among the three groups regarding the variables of interest. Results showed that
AD individuals produced fewer propositions than the MCI and healthy individuals, indicating
less informative discourses with fewer references to what was expected for the narrative. They
also presented higher numbers of empty emissions without reference to the narrative, indicating
greater difficulty in maintaining the theme. Additionally, AD individuals had difficulty in
the planning and organization of the ideas related to the topic, demonstrating a compromise of
the textual macroplane. It was not possible to differentiate each group based on features related
with global coherence.
4.4 Summary
In this chapter, the state of the art of SLT solutions applied to the monitoring of speech, lan-
guage, and cognitive abilities has been presented. From this review, it is possible to under-
stand the accomplishments achieved in each of these areas, as well as the limitations of current
research. Regarding speech abilities, one can observe that existing works investigated different
speech tasks and several acoustic measures to characterize the symptoms caused by dysarthria
in PD. Nevertheless, few studies analyzed common speech production tasks typically used for
diagnosis in terms of their utility for automatic PD discrimination. Research in the diagnosis
of cognitive abilities through neuropsychological tests showed that few works targeted the
automatic administration of cognitive tests through the integration of SLT. Furthermore, none
of these works targeted the Portuguese language. Existing solutions are only partially automated
and are focused on the implementation of a specific test, not providing the possibility to extend
the work to other tests. In the area of automatic monitoring of language abilities, I witnessed
an increasing body of research dedicated to the analysis of lexical, syntactic, and semantic as-
pects of discourse production. Up to now, however, very few works have addressed pragmatic aspects
of language. The limitations highlighted in each of these areas represent opportunities to con-
tribute to the research of automatic diagnostic methods of neurodegenerative diseases and will
be further developed in this dissertation.
5 Contributions to the Monitoring of Speech Abilities
As described in Chapter 3, dysarthria is a motor speech disorder characterized by weakness,
paralysis, or lack of coordination in the motor speech system, affecting respiration, phonation,
articulation and prosody. This impairment is characteristic of diseases such as Parkinson’s
Disease (PD), Huntington’s Disease (HD), and Amyotrophic Lateral Sclerosis (ALS). Several
speaking tasks are used to evaluate the extent of motor voice disorders, the most traditional
ones include the sustained vowel phonation, diadochokinesis, and variable reading of short
sentences, longer passages or freely spoken spontaneous speech (Goberman & Coelho 2002).
These tasks are subject to a perceptual evaluation by the Speech Language Pathologist
(SLP), who should be able to compare current outcomes with those resulting from a previous
evaluation. In this context, an automatic analysis of the result of these tests would provide an
additional evaluation that could be used to support the one provided by the SLP. From the
literature review of Chapter 4, I found an extensive body of research that investigated the use
of sensitive acoustic measures capable of representing the symptoms of this disorder. These studies
differ in many aspects: in the set of features considered, in the speech tasks used for the analysis,
and in the statistical approach used in the characterization of the problem. Few studies
analyzed common speech production tasks typically used for diagnosis in terms of their utility
for automatic PD discrimination. For these reasons, my first approach to this problem targeted
the definition of a standard feature set and a classification strategy that can be suitable to un-
derstand the relevance of the various tasks. This work was published in TSD 2017 (Pompili
et al. 2017).
5.1 Automatic detection of Parkinson’s Disease: an analysis of speech production tasks used for diagnosis
In this study, I am not interested in comparing the large number of different acoustic measures
and learning approaches that have emerged over the years, but rather in defining a feature
set and an evaluation method in order to assess different speech tasks. To this end, I consider
some of the measures that are repeatedly mentioned in the majority of the works examined.
These features were carefully selected considering their sensitivity to represent deficits at
various dimensions of language production: phonation, articulation, and prosody.

Descriptors                                          Functionals
Logarithmic F0 (1), Loudness (1)                     mean and stdev, mean and stdev of the
                                                     slope of rising/falling signal parts (x6)
Jitter (1), Shimmer (1), Formant 1 bandwidth (1),    mean and stdev (x2)
Formant 1, 2, 3 frequency (3), amplitude (3),
Harmonic to Noise Ratio (1),
Harmonic difference: H1-H2 (1), H1-A3 (1),
MFCC [1-12] (12), Log-energy (1), first and second
derivative of MFCC and Log-energy (26)

Table 5.1: Description of the acoustic features based on 53 low-level descriptors plus 6 functionals.

In the literature,
the most traditional measures used in examining phonation include measurement of F0, jitter,
shimmer, and Harmonics to Noise Ratio (HNR) (Orozco-Arroyave et al. 2014, Rusz et al. 2011).
Articulation is typically assessed considering differences in vocal tract resonances. The first
and second formant frequencies and the vowel space area are frequently studied (Proenca et al.
2013, Vasquez-Correa et al. 2013). Prosodic analysis includes measurements of F0, intensity,
articulation rate, pause, and rhythm (Bocklet et al. 2013, Rusz et al. 2011, Skodda & Schlegel
2008).
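As an illustration of the phonation measures mentioned above, local jitter and shimmer can be computed from a sequence of estimated glottal periods and peak amplitudes. This is a minimal sketch of the common "local" formulation, not the exact implementation used by any of the cited works:

```python
def jitter_local(periods):
    """Local jitter: mean absolute difference between consecutive glottal
    periods, relative to the mean period."""
    diffs = [abs(a - b) for a, b in zip(periods, periods[1:])]
    return (sum(diffs) / len(diffs)) / (sum(periods) / len(periods))

def shimmer_local(amplitudes):
    """Local shimmer: the same relative-perturbation ratio, computed on
    successive cycle peak amplitudes instead of periods."""
    diffs = [abs(a - b) for a, b in zip(amplitudes, amplitudes[1:])]
    return (sum(diffs) / len(diffs)) / (sum(amplitudes) / len(amplitudes))
```

A perfectly periodic voice yields zero for both measures; cycle-to-cycle irregularity, characteristic of dysarthric phonation, drives them up.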
The complete set of features is reported in Table 5.1. These features have been extracted with
the openSMILE toolkit (Eyben et al. 2010), in order to allow the reproducibility of the results.
First, these features are computed at the frame level, the so-called low-level descrip-
tors, which are obtained based on a subset of the Geneva Minimalistic Acoustic Parameter Set
(GeMAPS) (Eyben et al. 2016) and the MFCC pre-built configuration files. Then, in a second
step, two functionals (mean and standard deviation) are applied in order to obtain a feature
vector of constant length for the whole utterance. For some features (F0 and loudness), mean
and standard deviation of the slope of rising/falling signal parts were also computed. Finally,
a 114-dimensional feature vector composed of 78 MFCC-based features and 36 GeMAPS-based
features was obtained. Some other features, also frequently mentioned in the literature (e.g., the
articulation rate, pause analysis, or VSA), were not considered in this work in order to build a
general-purpose feature set, which could be suitable for each task under assessment.
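The two-step scheme above, frame-level descriptors collapsed into a fixed-length utterance vector by mean and standard deviation functionals, can be sketched as follows. This is a simplified stand-in for the openSMILE extraction, shown only to make the descriptor-to-functional step concrete:

```python
from statistics import mean, stdev

def functionals(frames):
    """Collapse frame-level low-level descriptors into a fixed-length
    utterance vector by applying mean and stdev per descriptor.

    frames: list of equal-length descriptor vectors (one per analysis frame)
    returns: [mean_1, stdev_1, mean_2, stdev_2, ...]
    """
    vector = []
    for d in range(len(frames[0])):
        values = [frame[d] for frame in frames]
        vector.extend((mean(values), stdev(values)))
    return vector
```

Applied to 53 descriptors this yields 106 values; the extra slope functionals for F0 and loudness bring the total to the 114-dimensional vector described above.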
5.1.1 Corpus
The FraLusoPark database (Pinto et al. 2016) has been used to assess the relevance of different
speech tasks. This is a new corpus of 140 European Portuguese speakers: 65 healthy controls
and 75 PD subjects, age-matched and sex-matched with the control group. Each participant
                     PD patients             Controls
                     M          F            M          F
Gender (n)           38         37           34         31
Age                  64.6±11.9  66.9±8.5     62.4±12.4  66.6±14.4
Years diagnosed      6.7±4.5    10.8±5.6     —          —
MDS-UPDRS-III        32.1±12.9  38.3±14.5    —          —

Table 5.2: Demographic and clinical data for patient and control groups.
was required to perform 8 different speech tasks with an increasing complexity in a fixed or-
der: (1) three repetitions of the sustained phonation of the vowel /a/, (2) two repetitions of
the maximum phonation time (vowel /a/ sustained as long as possible), (3) oral diadochoki-
nesis (repetition of the pseudo-word /pataka/ at a fast rate for 30 s.), (4-5) reading aloud of 10
words and 10 sentences, (6) reading aloud of a short text (“The North Wind and the Sun”), (7)
storytelling speech guided by visual stimuli, and (8) reading aloud of a set of sentences with
specific prosodic properties. The total duration of the recordings is 6 hours and 31 minutes for
the control group, and 7 hours and 30 minutes for the PD group. Demographic data of the
corpus are presented in Table 5.2. The study was approved by the ethics committee of the Fac-
ulty of Medicine at the Santa Maria University Hospital (Lisbon, Portugal). Data was manually
preprocessed in order to remove the therapist’s speech and the spontaneous interventions in-
troduced by the subject that were not directly related to the task. After that, recordings were
down-sampled to 16 kHz.
5.1.2 Evaluation
The selected model is a Random Forest classifier as implemented in the WEKA toolkit (Hall
et al. 2009). This implementation relies on bootstrap aggregating, also known as bagging, a
machine learning ensemble meta-algorithm designed to improve the stability and accuracy of
machine learning algorithms used in statistical classification and regression. Bagging reduces
variance and helps to avoid over-fitting. A stratified k-fold cross-validation per speaker strat-
egy is used for training and evaluation of each speech task separately, with k being equal to 5.
In this way, it is ensured that the train and the test sets at each iteration do not contain the same
speakers. Also, the percentage of speakers of each class is balanced in the two data sets at each
iteration.
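The speaker-level stratification described above can be sketched as follows. Speakers, not utterances, are assigned to folds, so no speaker appears in both train and test, while class proportions stay balanced. The speaker identifiers and labels are hypothetical, and this is a simplified illustration of the splitting strategy, not the WEKA implementation:

```python
import random

def speaker_stratified_folds(speaker_labels, k=5, seed=0):
    """Split speakers into k folds, keeping the proportion of each class
    roughly equal across folds.

    speaker_labels: dict speaker_id -> class label (e.g., 'pd' or 'ctl')
    returns: list of k lists of speaker ids
    """
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    by_class = {}
    for spk, label in speaker_labels.items():
        by_class.setdefault(label, []).append(spk)
    for label, speakers in by_class.items():
        rng.shuffle(speakers)
        for i, spk in enumerate(speakers):
            folds[i % k].append(spk)  # round-robin keeps classes balanced
    return folds
```

At each iteration, one fold's speakers form the test set and the remaining folds the training set; because segment-level features inherit their speaker's fold, the segmental strategy described below also respects this separation.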
On a first attempt, the recordings of every speech production task for each speaker have
been processed as described previously to obtain a feature vector of 114 elements. I refer to this
approach as sentence-level feature extraction. This strategy results in a single feature vector
per speaker and task. In other words, cross-validation experiments for each task are limited
to only 140 sample vectors, which will probably result in poorly trained models and less reli-
able results. Alternatively, in order to increase the number of samples, I have also performed
a segmental feature extraction strategy. In this case, this strategy results in a feature vector,
as previously described, for each audio subsegment of fixed length equal to 4 seconds with a
time shift of 2 seconds. This approach permits increasing the amount of training samples for
the cross-validation experiments, besides extracting more detailed information of the speech
productions. Table 5.3 shows classification accuracy (%) results for each speech production
task following the two feature extraction strategies described previously: sentence-level and
segmental. As expected, the former approach led to poorer results, mostly motivated by the
reduced number of training samples (only 112 in each task at each cross-validation iteration).
However, one may also argue that in this way valuable information is lost when applying
the functionals to long speech segments as the ones corresponding to each speech produc-
tion. On the other hand, the segmental feature extraction strategy leads to very remarkable im-
provements in terms of classification accuracy. In particular, the reading words task achieves a
maximum of 40.6% relative improvement, followed by the reading sentences task with 31.5%
relative improvement.
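The segmental strategy above slices each recording into fixed 4-second windows with a 2-second shift; the window boundaries can be computed as below. Handling of any trailing audio shorter than one window is an assumption here (it is simply dropped):

```python
def segment_bounds(duration, win=4.0, shift=2.0):
    """Start/end times (seconds) of fixed-length analysis windows taken over
    a recording of the given duration, with the given window length and shift."""
    bounds = []
    start = 0.0
    while start + win <= duration:
        bounds.append((start, start + win))
        start += shift
    return bounds
```

A 10-second recording thus yields four overlapping segments instead of a single sentence-level sample, which is how the strategy multiplies the training data available to the cross-validation experiments.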
Overall, from these results it is possible to observe that the reading prosodic sentences task
achieved the best recognition accuracy (85.10%). In fact, this is the best performing task also
in the case of sentence-level feature extraction. This observation confirms the relevance of this
task, which was carefully designed in order to explore language-general and language-specific
details of PD dysprosody. The second most discriminant task in terms of automatic PD classi-
fication is the storytelling one (82.32%). As a matter of fact, this task corresponds to the pro-
duction of spontaneous speech, since the subject has to create a story based on temporal events
represented in a picture. Although its overall duration is extremely variable and dependent on
the speaker, this task definitely contains a wealth of acoustic and prosodic information.
This result is very encouraging for the development of tele-monitoring applications that may
use spontaneous speech recorded over the telephone.
The next most discriminant tasks are those consisting of reading short passages of text and
sentences. Again, I believe that these productions are richer in terms of acoustic and prosodic
information, which makes them more convenient for automatic PD detection in contrast to less
informative rapid syllables repetitions or maximum phonation time of vowel /a/. In general, it
is likely that more complex tasks will contain more linguistic phenomena, such as coarticulation,
that may provide important cues for discrimination. Moreover, these more com-
plex tasks consist generally of longer speech productions, which is expected to be beneficial
                                           Accuracy (%)
Task                                       Sentence Level    Segmental
Sustained vowel phonation (/a/)            55.00             58.14
Maximum phonation time (/a/)               60.00             75.65
Rapid syllables repetitions                60.71             73.28
Reading of words                           54.29             76.35
Reading of sentences                       62.14             81.74
Reading of text                            65.00             79.86
Storytelling guided by visual stimuli      66.43             82.32
Reading of prosodic sentences              70.71             85.10
Table 5.3: Task-dependent recognition results on the 2-class detection task (PD vs. control).
for the segmental feature extraction approach. Nevertheless, both feature extraction strategies
provide coherent results in terms of identifying the top-4 most significant speech production
tasks. Finally, I observe that the sustained phonation of vowel /a/ is the task that achieved the
worst results with the segmental approach by a large margin (58.14%). However, this perfor-
mance is in line with the results of Bocklet et al. (2011), where the authors found that read texts
and monologue were the most meaningful tasks for the automatic detection of PD, while the
phonation task achieved the poorest recognition rate.
5.2 Summary
In this chapter, the potential discriminative ability of a large set of speech production tasks
used in the automatic detection of PD has been analyzed. For this purpose, the FraLusoPark
database (Pinto et al. 2016) has been used. This resource contains data from European Por-
tuguese PD and healthy speakers while performing 8 tasks designed to assess speech disor-
ders at various dimensions. For each task, automatic classification experiments have been con-
ducted using a RF classifier and a custom set of acoustic features carefully selected based on the
study of the state of the art. The experimental results have shown that the most important pro-
duction tasks are reading of prosodic sentences and storytelling, achieving a PD classification
accuracy of 85.10% and 82.32%, respectively. These tasks definitely contain more acoustic and
prosodic information than the sustained vowel phonation or the reading of a word. Their selec-
tion, thus, may indicate that in the identification of PD a comprehensive evaluation of speech
impairments is more important than the assessment of isolated abilities, such as phonation or
articulation.
6 Contributions to the Monitoring of Cognitive Abilities
In Chapter 3, some of the numerous neuropsychological tests used for screening cognitive
performance were introduced. As previously observed, many of them are eligible to be ad-
ministered remotely through current Speech and Language Technology (SLT) solutions. Their
automatic implementation represents an appealing target both for the technical challenges it
may raise and for the advantages it may bring to the community. However, from the literature
reviewed in Chapter 4, it was possible to observe that existing solutions present important limi-
tations. In fact, in some cases, they are only partially automated. Also, these works are focused
on the implementation of a specific test, not providing the possibility to easily extend the work
to other tests.
In this doctoral research, the first approach to neuropsychological tests targeted the seman-
tic verbal fluency task. This test proves particularly useful in the screening of dementia,
allowing differentiation between Alzheimer’s Disease (AD) and Mild Cognitive Impairment
(MCI) (Pakhomov et al. 2012). The inclusion of this test in the Mini-Mental State Examination
(MMSE) has been recommended as a means of increasing its sensitivity (Strauss et al. 2006).
This is a demanding task for current speech recognition technology, as it requires the spon-
taneous production of a list of items belonging to an unconstrained domain. This problem
has been approached both with the construction of a tailored language model and by exploit-
ing prosodic hints from the linguistic area. The results of this work were published in ICPhS
2015 (Moniz et al. 2015).
Next, the focus of my work shifted towards the automatic implementation of two widely
used neuropsychological batteries: the MMSE and Alzheimer’s Disease Assessment Scale -
Cognitive Subscale (ADAS-Cog). They are two general batteries used in the screening of var-
ious dementias, in particular of AD and MCI. The MMSE is so widely used in preliminary
screening methods precisely because it is somewhat “general purpose”, providing a quick and
generic evaluation to understand whether a condition of initial cognitive decline is met. This work
takes a step towards introducing a set of neuropsychological tests for AD and MCI, intended
for the Portuguese population. These tests have been integrated into an online platform, ex-
tending a system previously used for the remote rehabilitation of aphasia. As far as I know, it
is the only platform of this type implemented for the Portuguese population. This work was
published in SLPAT 2015 (Pompili et al. 2015).
6.1 Semantic verbal fluency test
Semantic verbal fluency tests require the patient to name as many items as possible belong-
ing to a specific category, within a time-constrained interval, typically one minute. The most
common category is animals (Strauss et al. 2006), though other commonly used categories are
food and first names. In animal naming tests, the target of this work, the score corresponds to the sum
of all admissible words, where names of extinct, imaginary, or magic animals are considered
admissible, while inflected forms and repetitions are considered inadmissible. ASR could be of
valuable support in the automation of fluency tasks, even though their implementation raises
important challenges. The first is related to the open-domain nature of the task.
In fact, current speech recognition technology is able to provide quite reliable results when
dealing with problems whose domain is somewhat limited (Abad et al. 2013). One of
the components of a speech recognition system, the language model, is particularly affected by
the nature of the task, since it should contain the knowledge of the rules of a language, being
used to guide the search for an interpretation of the acoustic input. If the language model does
not contain a given word, it will never be recognized. Another challenge is represented by
the number of disfluencies produced in these kinds of tasks, which may significantly affect the
recognition accuracy. In fact, disfluencies are relatively frequent in spontaneous speech, but,
in this context, they may be particularly relevant, because of the cognitive load required by
the test and its duration. The first challenge is addressed with the automatic construction of a
language model suited to the task. The second challenge is faced by investigating the prosodic
patterns that the nature of the task induces.
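The scoring rule described above, counting admissible animal names while discarding repetitions and inflected forms of animals already named, can be sketched as follows. The Portuguese word forms and the lemma map are illustrative; a real admissible list would also include extinct, imaginary, and magic animals:

```python
def fluency_score(recognized, admissible, lemma=None):
    """Count admissible animal names, skipping repetitions.

    recognized: words output by the recognizer, in temporal order
    admissible: set of accepted animal names (lemma forms)
    lemma:      optional map inflected form -> lemma, so inflections of an
                already-named animal are not counted again
    """
    lemma = lemma or {}
    named = set()
    score = 0
    for word in recognized:
        base = lemma.get(word, word)
        if base in admissible and base not in named:
            named.add(base)
            score += 1
    return score
```

Given a keyword-spotting output, this reduces the test to a set-membership problem, which is why the quality of the admissible-word list matters as much as the recognizer itself.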
6.1.1 Corpus
The corpus used in this study consists of a database of recordings of 42 native Portuguese
healthy speakers (19 females and 23 males), with ages varying from 20 to 65 years, and different
educational, socioeconomic, and cultural backgrounds. The corpus was collected by the
author of this dissertation with the aim of having a diverse sample of adults for assessing the
animal naming task. Orthographic transcriptions were manually produced for each session,
and all events were classified as a word from an animal list, as a disfluency, or as other events,
namely comments. The overall duration of the corpus is approximately 43 minutes, of which
about 21 minutes are silent, about 15 minutes include speech, while the remaining ones con-
tain disfluencies and other paralinguistic events (e.g., laugh, cough) or background noise. The
number of valid words is 1171, while the number of disfluencies is 321, representing 27% of the
whole corpus. This percentage is clearly not in line with those reported in the literature (Clark 1996, Lev-
elt 1989, Moniz et al. accepted, Moniz, Batista, Mata & Trancoso 2014, Shriberg 1994, 2001, Tree
1995), which indicate an interval of 5%-10% of disfluencies in human-human conversations.
This very high disfluency rate is interpretable by task effect, in particular, naming animals un-
der strict temporal constraints. 75% of the data, corresponding to 31 speakers, have been used
for training the animal naming task, while the remaining ones have been used for testing.
6.1.2 Evaluation
The experiments described here use the in-house ASR engine named Audimus (Meinedo et al.
2010, 2003), a large vocabulary continuous speech recognition module. Audimus is a hybrid
recognizer that combines the strengths of ANNs and HMMs (Morgan & Bourlard 1995). The
baseline system incorporates three MLP outputs trained with PLP features, log-RelAtive
SpecTrAl features, and Modulation SpectroGram features. This version integrates a generic
language model trained on broadcast news, encompassing 100k words. As expected, due to the
challenges described above, initial experiments using the standard version of Audimus led to
very poor results, with an average WER around 105%. Thus, in order to improve these results, a
technique known as keyword spotting (KWS) was exploited. This approach has already proved
to be appropriate for dealing with naming tasks and also for filtering speech disfluencies (Abad
et al. 2013, Moniz et al. 2007, Pompili et al. 2011). I recall that keyword spotting aims at
detecting a certain set of words of interest in a continuous audio stream. This is achieved
through the acoustic match of speech with a keyword model in contrast to a background model
(everything that is not a keyword). In this approach, the keyword model contains the names of
admissible animals that will be accepted by the speech recognition system. The size of this list
may have a significant impact on the outcome of the recognizer: if a keyword is missing from
the list, it will never be detected; on the other hand, longer lists will increase the perplexity
of the keyword model. The initial model consisted of an existing list used in the context
of the STRING project by a finite-state incremental parser to add semantic information to the
output of a part-of-speech tagger (Mamede et al. 2012). This list contains 6044 animal names,
grouped, classified, and labeled with their semantic category, without inflected forms. To
compute the likelihood of the different target terms, it was taken into account that some names
are more common than others. Thus, the total number of results returned by a web search
engine for a particular term was exploited to compute this likelihood. The retrieval strategy
had to be refined several times in order to find the optimal approach. In fact, initial queries have
Language Model        Train set   Test set
Generic ASR system    88.95       105.47
Prebuilt list based   16.80       21.22
Ontology based        11.94       19.94
Table 6.1: WER for different language models: i) Generic ASR system: general purpose language model trained on broadcast news, ii) Prebuilt list based: constrained keyword model created from the list used in the STRING project, iii) Ontology based: constrained keyword model created from the ontology TemaNet.
led to incorrect counts due to homonyms of some terms. The final approach consisted of
querying the animal name together with its associated semantic category. Finally, the
likelihood associated with each term also allows sorting the list numerically and thus reducing
its size by filtering out less popular terms. After several experiments, the language model that
achieved the best results contained the 802 most popular animal names. The WER achieved with
this language model decreased to 21.22%. In a subsequent step, in order to further improve these results,
and to easily extend the animal naming task to different semantic domains, different resources
were considered. In particular, the Portuguese lexical-conceptual network TemaNet (Marrafa
et al. 2006) has been used. This resource is organized in twelve semantic domains. TemaNet
is of particular interest for the animal naming task because it is highly structured. In fact, the
hyponyms of animals are organized in a hierarchy of several layers that include, among others,
the separation between male and female. This is relevant not only because in Portuguese,
unlike English, there are different words to express the gender of an animal, but also because
the evaluation rules of the animal naming task require accounting for gender differences.
The first approach with this resource, however, did not impose constraints on the depth of
the hierarchy or on the type of the extracted information. All the subtypes of animals have
been accepted, leading to a keyword model composed of 400 keywords. Then, in a similar
way to the keyword model created from the prebuilt list, the likelihood of the target terms
has been computed by exploiting the total number of results provided by a web search engine.
Experiments with the ontology-based keyword model showed a reduction of the average WER
of up to 4% and 2% for the train and test corpora, respectively, with respect to the
prebuilt-list-based keyword model.
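The popularity-based pruning used for both keyword models can be sketched as follows; the hit counts, animal names, and cutoff below are hypothetical placeholders, not the actual values retrieved for the real lists:

```python
# Illustrative sketch: rank candidate keywords by (hypothetical) web-search
# hit counts and keep only the most popular ones for the keyword model.
def build_keyword_model(hit_counts, top_n):
    """Sort terms by hit count (descending) and keep the top_n entries."""
    ranked = sorted(hit_counts, key=hit_counts.get, reverse=True)
    return ranked[:top_n]

# Hypothetical counts, as if obtained by querying "<name> <semantic category>"
# to avoid inflated counts from homonyms.
hit_counts = {
    "cao": 9_200_000,         # dog
    "gato": 8_700_000,        # cat
    "cavalo": 5_100_000,      # horse
    "ornitorrinco": 140_000,  # platypus: rare, likely filtered out
}

keywords = build_keyword_model(hit_counts, top_n=3)
print(keywords)  # ['cao', 'gato', 'cavalo']
```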
One characteristic of this task that hinders the performance of the ASR-based approach is
related to the extraordinary number of disfluencies present in list evocations. It was observed
that the elements that actually belong to the list display a characteristic prosodic list
effect (e.g., /dog, cat, cow, horse/) that is not present in the other events unrelated to the task
(e.g., /ah! I already said that. . . /). Thus, one can hypothesize whether it is possible to differentiate a
Figure 6.1: An excerpt of an audio recording showing, respectively, from the top: the spectrogram, the F0, the textual transcriptions of the sound, and the prosodic events classification. Red arrows indicate a continuation rise contour, while the yellow arrow indicates a finality contour.
word as an element of a list from other types of events by exploiting prosodic patterns. It has
been shown that list effects, or serial recall tasks, display prosodic features mostly characterized
by two patterns: i) a continuation rise contour, a rising F0 movement from the nuclear
or prenuclear syllables up to the end of the phrase; and ii) a finality contour, a fall from the
nuclear or prenuclear syllables until the end of the phrase (Savino 2004, Savino et al. 2014). The
continuation contour expresses that the list is to be continued, while the finality contour
expresses that the item is the last one of a recall series or the last one in the entire file. These
two patterns are visually shown in Figure 6.1.
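As a rough illustration of these two patterns (not the AuToBI-based method used in the experiments below), a word-level F0 track could be labeled by the direction of its phrase-final F0 movement; the F0 values here are invented:

```python
# Toy illustration of the two list-effect patterns: classify a word's contour
# from the direction of its F0 movement toward the end of the word.
# The F0 tracks below are invented values in Hz.
def classify_contour(f0_track):
    """Compare the mean F0 of the final third against the preceding part."""
    split = 2 * len(f0_track) // 3
    head, tail = f0_track[:split], f0_track[split:]
    mean = lambda xs: sum(xs) / len(xs)
    return "continuation rise" if mean(tail) > mean(head) else "finality"

rising = [180, 182, 185, 190, 198, 210]   # mid-list item: F0 rises at the end
falling = [210, 205, 200, 190, 175, 160]  # final item: F0 falls to the end

print(classify_contour(rising))   # continuation rise
print(classify_contour(falling))  # finality
```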
To confirm the hypothesis that prosodic hints may distinguish possible animal names, the
AuToBI tool (Rosenberg 2009, 2010) has been used. AuToBI, the Automatic ToBI annotation
system for Standard American English (SAE), is a publicly available tool that detects and
classifies tones and break indices using acoustic correlates (pitch, intensity, spectral balance,
and pause/duration). This tool requires an initial segmentation of the events to classify. To this
end, the results of the recognition experiments performed in the previous section have been used
to produce the required segmentation input. Two types of ASR configurations were selected to
segment the input data: one using a generic language model (Generic ASR system), and
another using a language model specifically built for this task through the Portuguese wordnet
TemaNet (Ontology based). As an alternative to the language-model-based configurations, a
phone-based segmentation is also obtained with the same ASR using a phone-loop grammar.
Results are shown in Table 6.2 for the different segmentation strategies. Regarding AuToBI,
experiments were performed with both the English and European Portuguese models (Moniz,
Mata, Hirschberg, Batista, Rosenberg & Trancoso 2014), although the latter were trained on a
Segmentation         AuToBI model   Accuracy
Generic ASR system   PT             71.8%
                     EN             89.1%
Ontology based       PT             72.7%
                     EN             84.3%
Phone based          PT             77.8%
                     EN             91.8%
Table 6.2: Performance of AuToBI using English and European Portuguese models and three segmentation strategies: ASR-based, ontology-based (TemaNet), and phone-based.
small data set of about 33 minutes. By applying the AuToBI English models with the generic
ASR language model, the system achieved an accuracy of 89.1% in the prediction of potential
animal names. With the Portuguese prosodic models, the accuracy decreases to 71.8%. Such
a degradation in performance may be explained by the fact that the Portuguese models were
trained with significantly less data than the English ones. The best performance is achieved
with the phone recognizer using the AuToBI English models, with an accuracy of 91.8%.
6.2 Automatic monitoring and training of cognitive functions
After approaching the semantic verbal fluency task, the focus of this doctoral research shifted
towards the automatic implementation of two popular neuropsychological test batteries: the
MMSE (Folstein et al. 1975) and the ADAS-Cog (Rosen et al. 1984). These tests are used for
screening cognitive performance and tracking alterations of cognition over time in AD and
MCI. They involve the assessment of different capabilities, such as orientation to time and
place, attention and calculation, language (naming, repetition, and comprehension), or immediate
and delayed recall. In the remainder of this chapter, I summarize some of the key results of this
work; more details can be found in (Pompili et al. 2015).
6.2.1 Extending VITHEA for neuropsychological screening
The baseline for the integration of the MMSE and ADAS-Cog was an automatic web-based
system named VITHEA (Abad et al. 2013). The system aims at acting as a "virtual therapist",
incorporating an animated character with speech synthesis capabilities and ASR to provide word
naming exercises for the remote rehabilitation of aphasia. The ASR engine integrated in
the monitoring tool corresponds to the in-house speech recognizer Audimus (Meinedo et al.
2010, 2003). In order to provide robust feedback to word naming exercises, the speech
recognizer resorts to a keyword spotting technique (Abad et al. 2012). The platform comprises two
specific modules, dedicated respectively to the patients, for carrying out the therapy sessions,
and to the clinicians, for the administration of the related functionalities (e.g., managing
patient data, managing exercises, and monitoring user performance). VITHEA is used daily by
patients and speech therapists and has received several awards from both the speech and the
health-care communities. The success of this platform and its flexibility, which allows the
creation of different exercises, have motivated its use as a foundation for this study. The main goal was
the automation of the exercises in the MMSE and ADAS-Cog that involve speech. Additionally,
the animal naming test was also implemented in the platform. As explained in the following,
the automation of such tests has raised several technological challenges, both for the automatic
speech recognition and text-to-speech synthesis technologies. Extending VITHEA to include
neuropsychological tests also involved important alterations to the original platform, both
in the patient and the clinician modules. However, the flexibility of this platform allows for
the easy addition of new categories of exercises. These can then be combined in multiple ways
by the clinician to form new tests, and to create different exercises of the same type. Extensions
were related to the usability of the interface, adapted to meet the needs of an aging population
with cognitive impairments, and to the presentation of the tests. In fact, following
the feedback received from the neurologists involved in the study, optional instructions and
semantic hints were provided for some stimuli. The behavior of the animated character was
altered to provide random feedback when the patient switches among different classes of
stimuli. Finally, the platform was updated to store additional information from the patient's
profile needed for the assessment of some subtests (i.e., place of birth, age, etc.), and the result of
the assessment in terms of the score obtained.
Since the selected neuropsychological tests comprise common or similar questions, their
concrete implementation may be organized by type of question and the underlying
technology with which they were implemented, rather than per test. They are briefly
summarized in Table 6.3. Each type of question has posed different challenges, each of which has been
addressed individually with ad-hoc solutions. Overall, a total of 185 stimuli belonging to
different types of tests have been selected for implementation in the platform. The scoring
methodology of the implemented tests is straightforward: one point is given for each correct
answer provided. For all the tests but one, this corresponds to a single item correctly produced;
only in the case of the repetition exercise does the patient have to repeat the whole sentence to get a
score of 1.
Test                                    Battery         Description                                                                  Technology
Naming objects and fingers              MMSE/ADAS-Cog   Name a series of objects shown in pictures                                   KWS
Repetition                              MMSE            Repeat the sentence /O rato roeu a rolha/ (/the mouse gnawed the stopper/)   KWS
Attention and calculation               MMSE            Starting at 30, successively subtract 3                                      KWS
Orientation to time, place and person   MMSE/ADAS-Cog   Questions about time, city                                                   KWS/RBG
Word recognition                        ADAS-Cog        Learn a list of 12 words. Recall the words from a new list containing        RBG
                                                        12 new distractors. The process is repeated 3 times
Evocation                               MMSE/ADAS-Cog   Recall a list of words (previously learned or not)                           ALM
Verbal Fluency                          –               Name as many items as possible in a given category in 1 min.                 ALM

Table 6.3: Implemented cognitive tests. KWS: Keyword spotting, RBG: Rule-based grammar, ALM: ad-hoc language model for keyword spotting.
6.2.2 Corpus
To evaluate the feasibility of the monitoring tool, an ad-hoc speech corpus has been collected.
This includes recordings of five people diagnosed with cognitive impairments and five healthy
control subjects. All the participants are Portuguese native speakers. Recordings took place in
different environments with different acoustic conditions. Healthy subjects were recorded in a
quiet, domestic environment, while patients were recorded at CHPL, the Psychiatric Hospital
of Lisbon. No particular constraints were imposed on background noise conditions. Each
session consisted of an approximately 20-30 minute recording. The data was originally captured
with the platform at 16 kHz, and later down-sampled to 8 kHz to match the sampling
frequency of the acoustic models.
6.2.3 Evaluation
Due to the extensiveness of the ADAS-Cog test, it was unfeasible to evaluate all the
implemented neuropsychological tests. In fact, an estimation of the total duration of the
evaluation indicated that it would take more than two hours, which was considered unacceptable.
Thus, only a representative subset of all the tests was selected, comprising a total of 41 stimuli.
The system was evaluated by considering its ability to correctly transcribe the participants'
speech. In fact, incorrectly recognized answers will produce an automatic test score that
underestimates or overestimates the actual result of the participant. Thus, the performance
of the recognition process provides a measure of the reliability of the platform as a screening
tool. To assess the result of the automatic recognition, two evaluation metrics have been
considered, depending on the type of automated test. On the one hand, the answers to the tests
(a) Accuracy (%)

Question type / technology   Patients   Healthy
KWS                          77.39      88.70
RBG                          74.29      88.57

(b) WER (%)

Question type / technology   Patients   Healthy
ALM (w/o animals)            20.00      8.16
ALM (animals)                74.42      46.48
Table 6.4: Accuracy and WER according to the type of question.
based on KWS are evaluated as right or wrong; thus, their performance can be computed as
the number of coincidences between the manual and automatic results divided by
the total number of exercises, that is, the metric used is the classification accuracy. On the other
hand, the evocation exercises differ from the KWS exercises since the answers to these stimuli
cannot be evaluated as right or wrong; instead, the number of terms correctly recalled needs
to be counted. For this reason, the WER between the manual and the automatic transcriptions
of the users' answers has been computed. These results are summarized in Tables 6.4a and 6.4b.
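For reference, the WER between a manual and an automatic transcription can be computed with the standard word-level Levenshtein distance normalized by the reference length; this is a generic sketch, not the exact scoring code of the platform:

```python
# Sketch of the WER computation used to compare manual (reference) and
# automatic (hypothesis) transcriptions: word-level edit distance divided
# by the number of reference words.
def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution and one deletion against a 4-word reference -> WER 0.5
print(wer("dog cat cow horse", "dog bat cow"))
```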
Results on KWS-based exercises have shown an average accuracy of around 77% and 89% for
patients and healthy subjects, respectively. These results can be considered quite promising, being
comparable to those reported in (Abad et al. 2012) in an evaluation with aphasia patients. No
significant differences were found between the tests relying on simple KWS and those
relying on rule-based grammars. Evocation exercises are divided into two categories:
those requiring a limited number of words to recall and those considering an open domain of
possible answers complying with a specific semantic category (e.g., the animal naming test). The
average WER computed for the patient and control groups in the class of evocation exercises with a
closed domain was 20.00% and 8.16%, respectively. However, the average WER computed for
the patient and control groups on the animal naming test was much higher, 74.42% and 46.48%,
respectively. Previous results obtained for the same task, but using a corpus of healthy subjects
(Section 6.1), showed a WER of around 20% on the test data. Thus, in comparison, the current
outcomes show a strong degradation in performance. This result was partially expected,
since the previous work used a different corpus, containing data from a heterogeneous set of
healthy subjects. The corpus used in this study, on the other hand, is composed exclusively
of elderly participants whose average age is around 73 and 75 years for the healthy and
patient groups, respectively. This is reflected in the resulting speech, which is characterized by
a reduced intensity, a reduced pitch, and a hoarse voice. These characteristics represent an
additional challenge for the speech recognizer. Also, after a closer analysis, it was possible to
note that deletions and substitutions were the main sources of error. This may be partly
explained by the number of OOV keywords used by both the healthy and patient groups. In fact,
the healthy group uttered 71 unique animal names, but 23 of them were missing from the
keyword model, generating 39% of the deletions and 42% of the substitutions. The patient group
uttered a total of 43 unique animal names, of which 26 were missing from the keyword model.
In this case, 66% of the errors were deletions, while 28% were substitutions. These results
can be partially compared with the ones achieved by Pakhomov et al. (2015). In that work, the
authors assessed the spoken responses of the animal naming test using a combined approach
that exploits a constrained language model, a speaker-adapted acoustic model, and confidence
scoring to filter the ASR output. Nevertheless, they achieved a WER of 53% using a corpus
composed of younger subjects (mean age 28.4). While this represents a quite high WER, it is
still lower than the error obtained in this study for the patient group, implying that both the
acoustic and the language models could still be improved.
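The OOV analysis above can be sketched as a simple set difference between the names uttered by a group and the keyword model; the word lists below are short hypothetical examples rather than the actual 71- and 43-name sets:

```python
# Sketch of the OOV analysis: names uttered by a group that are absent from
# the keyword model can never be recognized, surfacing as deletions or
# substitutions. The word lists here are short hypothetical examples.
def oov_report(uttered, keyword_model):
    """Return the set of out-of-vocabulary names and the OOV rate."""
    oov = set(uttered) - set(keyword_model)
    return oov, len(oov) / len(set(uttered))

keyword_model = {"dog", "cat", "horse", "cow"}
uttered = ["dog", "cat", "platypus", "axolotl"]

oov, rate = oov_report(uttered, keyword_model)
print(sorted(oov), rate)  # ['axolotl', 'platypus'] 0.5
```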
A global evaluation of the platform has also been performed in order to assess its reliability
as an automatic screening tool. A straightforward evaluation method consists of comparing
the manual and the automatic scores achieved by the user for each type of test. The scores
were calculated according to the traditional assessment performed when applying a
neuropsychological test. Table 6.5 reports the MAE, the Mean Relative Absolute Error (MRAE), and the
maximum score for the previously described subsets of stimuli. The maximum score for each
test depends on the number of stimuli selected. In addition, the results for the MMSE are also
reported. For this test, the scores achieved by each speaker are also shown in Figure 6.2.
In general, the results were better for healthy people than for patients. This was expected
due to the impaired condition of the patients, which was reflected in the quality
of their speech. The MAE reported per question type and technology ranges from 0.80
to 3.00 for the patients, and from 0.80 to 2.80 for the control group. Comparing the MAE with
the maximum possible score, it can be observed that the difference between the automatic and
the manual evaluation is relatively small. For instance, observing the results for the patient
group, one may notice that the MAE for the questions based on keyword spotting is 3.00 out
of a maximum score of 23, which corresponds to a relative error of 13%. The same analysis for
the control group leads to a relative error of 11.3%.
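The comparison between the MAE and the maximum attainable score can be sketched as follows; the per-subject scores are hypothetical, and only the maximum score of 23 comes from the text:

```python
# Sketch of the global evaluation: the MAE between manual and automatic
# scores, and its value relative to the maximum attainable score, as in the
# 3.00-out-of-23 example above. The per-subject scores below are hypothetical.
def mae(manual, automatic):
    return sum(abs(m - a) for m, a in zip(manual, automatic)) / len(manual)

manual_scores = [20, 18, 15, 22, 17]  # hypothetical manual KWS scores
auto_scores = [18, 17, 12, 22, 14]    # hypothetical automatic scores
max_score = 23                        # maximum KWS score from the text

err = mae(manual_scores, auto_scores)
print(err, err / max_score)  # 1.8, about 7.8% of the maximum score
```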
Through the evaluation and data collection, I had the opportunity to gather important feedback
about the platform. I acknowledge that individuals with an advanced impaired condition
may have more difficulties in using the system, especially when the condition is worsened by
deafness or computer illiteracy, two factors rather common in the elderly. Patients with a more
pronounced cognitive impairment or with auditory impairments may have difficulties in
understanding the questions being asked. Computer illiteracy, however, may no longer be a
Figure 6.2: On the left side, MMSE scores of the human and automatic evaluations for the patient speakers. On the right side, MMSE scores of the human and automatic evaluations for the healthy speakers.
Question type / Technology   Max. Score   Patients     Healthy
KWS                          23           3.00 (26%)   2.60 (12%)
RBG                          14           2.80 (37%)   1.60 (15%)
ALM (w/o animals)            11           0.80 (23%)   0.80 (11%)
ALM (animals)                –            2.60 (34%)   1.80 (17%)
MMSE                         22           2.20 (21%)   2.80 (14%)
Table 6.5: MAE and MRAE (in brackets) by type of question and by neuropsychological test.
problem in the not-so-distant future.
Nevertheless, both the patients and the healthy subjects demonstrated their appreciation for the
platform, showing interest in using it regularly. In particular, some of the patients
were captivated by the animated virtual character: they liked its cartoon nature and the fact
that it interacted with them verbally.
6.3 Summary
In this chapter, the automatic implementation of some widely used neuropsychological tests
has been addressed. Initially, a specific task, the semantic verbal fluency test, has been targeted.
In this type of test, the patient has to name as many items as possible belonging to a specific
category, within a time interval of one minute. This is a challenging task for current speech
recognition technology, both for its open domain nature, and because of the cognitive load
it imposes, which induces the production of a considerable number of disfluencies. The first
problem has been addressed with the automatic construction of a language model suited for
the task. In this way, it was possible to witness an important improvement in recognition performance,
with a reduction of the WER from 105.5%, achieved with a general-purpose language model, to
19.9%, achieved with the developed language model. The second problem, the high number
of disfluencies, is approached by investigating the use of prosodic patterns to predict potential
animal names. In this case, it was possible to distinguish disfluencies with an accuracy of
91.8%.
Then, the focus of my work shifted towards the automatic implementation of two
neuropsychological test batteries commonly applied by neurologists to assess the cognitive condition of
a person. They have been developed as an automatic web-based tool with SLT integration. In
this way, the tool can also be used for the remote monitoring of cognitive impairments. As far
as I know, it represents the only platform of this type implemented for the Portuguese
population. The system has been assessed both with healthy subjects and with patients. The MAE
between the manual and the automatic evaluation was relatively small, showing the feasibility of
this type of system. Additionally, the flexibility of the underlying platform allows the very easy
creation of new exercises of the same type with different stimuli. In this way, it could be easily
extended to include different types of exercises that can be used for the daily training of
cognitive abilities.
7 Contributions to the Monitoring of Language Abilities
As observed at the end of Chapter 3, the evaluation of discourse impairments requires both the
manual transcription of the speech samples and the subsequent identification and annotation
of predefined linguistic features. These requirements preclude the applicability of discourse
analysis to clinical settings. Additionally, a manual analysis may also lead to different
inter-expert assessments due to the intrinsic, ambiguous nature of spontaneous language. Thus,
an automated analysis of discourse impairments would provide clinicians with an
additional screening tool that could be used in an objective way in clinical settings. The
literature review presented in Chapter 4 showed that lexical, syntactic, and semantic aspects of
language production have been widely investigated, confirming the viability of these methods
in the identification of Alzheimer's Disease (AD). However, very few studies have approached
linguistic deficits at a higher level of processing, considering macrolinguistic aspects of discourse
production. For this reason, I developed a method targeting the analysis of pragmatic aspects
of discourse. This approach is further complemented by considering lower-level aspects of
language processing, such as lexical, syntactic, and semantic abilities. Overall, the analysis of
such a wide set of language characteristics should provide a comprehensive evaluation of
discourse production. Additionally, with the aim of automating the entire process of this type of
analysis, a speech recognition system is used to obtain the transcriptions of the recordings. This
work was published in IberSPEECH 2018 (Pompili et al. 2018) and accepted for publication in
a special issue of the IEEE Journal of Selected Topics in Signal Processing (JSTSP) on automatic
assessment of health disorders based on voice, speech, and language processing (Pompili et al.
2019).
7.1 Evaluating pragmatic aspects of discourse production for the automatic identification of Alzheimer's disease
Macrolinguistic aspects of language production are quantified by rating cohesion and
coherence. While cohesion expresses the semantic relationship between elements, coherence is
related to the conceptual organization of speech, and is usually analyzed through the study of
local, global, and topic coherence. Local coherence refers to the conceptual links that maintain
meaning between proximate propositions within smaller textual units. Global coherence refers
to the way in which the discourse is organized with respect to an overall plan. Finally, topic
coherence refers to the organization and maintenance of the topics used within the discourse.
In fact, topics should be structured according to an internal organization, in order to achieve
an information hierarchy, which is essential for effective communication (Ulatowska & Chapman
1994b). To the best of my knowledge, there are no computational studies targeting an
automatic analysis of topic coherence. This subject has been investigated only in the clinical
literature. It was first addressed in the work of Mentis & Prutting (1991), whose focus was
the study of topic introduction and management. A topic was described as a clause identifying
the question of immediate concern, while a subtopic was an elaboration or expansion of one
aspect of the main topic. Several years later, Brady et al. (2003) analyzed topic coherence and
topic maintenance in individuals with right-hemisphere brain damage. This work extended
the one of Mentis & Prutting (1991) with the inclusion of the notions of sub-subtopic and
sub-sub-subtopic. Topic and subdivisional structures were further categorized as new, related, or
reintroduced. In a later study, Mackenzie et al. (2007) used discourse samples elicited through
a picture description task to determine the influences of age, education, and gender on the
concepts and topic coherence of 225 healthy adults. Results confirmed education level as a highly
important variable affecting the performance of healthy adults. More recently, Miranda (2015)
investigated the influence of education on the macrolinguistic dimension of discourse
evaluation, considering concept analysis, local, global, and topic coherence, and cohesion. Results
corroborated the ones obtained by Mackenzie et al. (2007), confirming the effect of literacy on
this type of analysis.
In this chapter, I propose a novel approach to automatically discriminate AD based on the
analysis of topic coherence. A discourse is modeled as a graph encoding a hierarchy of topics;
a relatively small set of pragmatic features is extracted from this hierarchical structure and
used to discriminate AD. Results have shown classification performance comparable with the
current state of the art. Then, this approach is further extended with the introduction of a wider
set of linguistic features. In fact, the set of topic coherence features is broadened with new
measures assessing pragmatic aspects of discourse. This set of features is also integrated with
lexical, syntactic, and semantic features. It is expected that the analysis of different aspects of
discourse production will contribute to an improved automatic discrimination of AD.
Additionally, the proposed method depends on accurate manually produced transcriptions of the
speech narratives. This is a common requirement of many studies targeting an automatic
characterization of linguistic impairments in AD (Fraser & Hirst 2016, Fraser et al. 2016, Orimaye
Figure 7.1: (a) The Cookie Theft picture, from the Boston Diagnostic Aphasia Examination (Goodglass et al. 2001). (b) An excerpt of a topic hierarchy for the Cookie Theft picture found in the work of Miranda (2015).
et al. 2014, Santos et al. 2017, Toledo et al. 2018, Yancheva et al. 2015, Yancheva & Rudzicz 2016).
However, this requirement still hampers the applicability of computational approaches to clinical
settings. Thus, I assess the impact of using transcriptions of the spoken narratives
automatically generated by a speech recognition system. In this sense, the types of errors introduced
and the way they impact the performance of the proposed method are analyzed.
7.1.1 Corpus
Data used in this study are obtained from the DementiaBank database (MacWhinney et al. 2011,
TalkBank 2017), which is part of the larger TalkBank project (Becker et al. 1994, MacWhinney
et al. 2004). The collection was already introduced in Chapter 4; I briefly recall that, among other
assessments, participants were required to provide a description of the Cookie Theft picture,
shown in Figure 7.1(a). Each speech sample was recorded and then manually transcribed at
the word level. Narratives were segmented into utterances and annotated with disfluencies, filled
pauses, repetitions, and other more complex events. Among these, retracing and reformulation
are used to indicate abandoned sentences, where the speaker starts to say something but then
stops. While in the former the speaker may maintain the same idea while changing the syntax, the
latter involves a complete restatement of the idea. For the purposes of this study, only
participants diagnosed with probable AD were selected, resulting in 234 speech samples from 147
patients. Control participants were also included, resulting in 241 speech samples from 98
speakers. Table 7.1 reports additional information about the size of the corpus and demographic
and clinical data. More details about the study cohort can be found in Becker et al. (1994).
            Age range (avg.)   MMSE range (avg.)   Audio duration   N. of words
Controls    46-80 (63.84)      26-30 (29.06)       04h:13m          26591
AD          53-88 (71.31)      8-30 (19.36)        05h:04m          23029

Table 7.1: Statistical information on the Cookie Theft corpus.
7.1.2 The proposed model to analyze topic coherence
The topics used during discourse production should follow an internal, structural organization that yields an information hierarchy. This structure allows a gradual organization of information that is essential for effective communication (Ulatowska &
Chapman 1994a). Being important for both the speaker and the listener, this type of organiza-
tion highlights the key concepts and indicates the degrees of importance and relevance within
the discourse. Mackenzie et al. (2007) provided an example of a topic hierarchy based on the
Cookie Theft picture description task, which was later extended in the study of Miranda (2015).
To allow a better understanding of the problem at hand, an excerpt of this hierarchy is also
reported in Figure 7.1(b).
The number of relevant topics that can be described from the Cookie Theft picture is limited to the concepts that are explicitly represented in the image (e.g., garden) and to those that
can be implicitly suggested by the scene (e.g., weather). Taking this into account, the problem of
building a topic hierarchy from a transcript can be modeled with a semi-supervised approach
in which a predefined set of topics clusters is used to guide the assignment of a new topic to a
level in the hierarchy.
Both for the creation of the topic clusters and for the analysis of a new discourse sample,
a multistage approach is followed to transform the original transcriptions into a representation
suitable for subsequent analysis, as shown in Figure 7.2. Initially, the transcriptions are pre-
processed, then syntactic information is extracted and used to separate sentences into clauses
and to identify coreferential expressions. Finally, a sentence vector representation is computed
based on the word embeddings extracted for each word in a clause. Each stage of this process
is further described in the following sections.
7.1.2.1 Preprocessing
In order to prepare the transcriptions for the next stage of the pipeline, all the annotations are removed, disfluencies are disregarded, and contractions are expanded to their canonical form. Once the preprocessing phase is concluded, POS tags are automatically extracted
using the Stanford University lexicalized probabilistic parser (Klein & Manning 2003b). Ap-
Figure 7.2: The proposed method for modeling discourse as a hierarchy of topics.
pendix A.1.1 provides an excerpt of a transcription before and after the preprocessing phase,
and its corresponding POS tag annotations.
7.1.2.2 Clause segmentation
The next step requires identifying dependent and independent clauses. In fact, complex, com-
pound, or complex-compound sentences may contain references to multiple topics. The sen-
tence /the sink is overflowing while she is wiping a plate and not looking/ is an example of this
problem, as it is composed of an independent clause (/the sink is overflowing/) and a dependent
one (/she is wiping a plate and not looking/).
A possible way to cope with the separation of different sentence types is by using syntactic
parse trees. Thus, in a similar way to the work of Feng et al. (2012), clause and phrase-level tags
are used for the identification of dependent and independent clauses. For the former, the tag
SBAR is used, while for the latter, the proposed solution checks the sequence of nodes along
the tree to verify if the tag S or the tags [NP VP] appear in the sequence (Treebank 2019). The
corresponding parse tree for the sentence shown as an example is reported in Appendix A.1.2.
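The clause-identification logic described above can be sketched as follows. Parse trees are given here as nested lists of the form [label, child1, child2, ...], a stand-in for the Stanford parser output; the tree below is a simplified, illustrative analysis, not the parser's actual output.

```python
def find_clauses(tree, in_sbar=False, dep=None, ind=None):
    """Collect dependent (SBAR) and independent (S with [NP VP]) clauses."""
    if dep is None:
        dep, ind = [], []
    children = [c for c in tree[1:] if isinstance(c, list)]
    child_labels = [c[0] for c in children]
    if tree[0] == "SBAR":          # dependent-clause marker
        dep.append(tree)
        in_sbar = True
    elif tree[0] == "S" and not in_sbar and child_labels[:2] == ["NP", "VP"]:
        ind.append(tree)           # canonical subject-predicate clause
    for child in children:
        find_clauses(child, in_sbar, dep, ind)
    return dep, ind

# /the sink is overflowing while she is wiping a plate/ (simplified tree)
tree = ["S",
        ["NP", ["DT", "the"], ["NN", "sink"]],
        ["VP", ["VBZ", "is"], ["VBG", "overflowing"],
         ["SBAR", ["IN", "while"],
          ["S", ["NP", ["PRP", "she"]],
           ["VP", ["VBZ", "is"], ["VBG", "wiping"],
            ["NP", ["DT", "a"], ["NN", "plate"]]]]]]]

dependent, independent = find_clauses(tree)
```

The `in_sbar` flag prevents the clause inside an SBAR from also being counted as independent, which matches the analysis of the example sentence above.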
7.1.2.3 Coreference analysis
The analysis of coreference proves to be particularly useful in higher level NLP applications
that involve language understanding, such as in discourse analysis (Boytcheva et al. 2001).
Strictly related to the notions of anaphora and cataphora, coreference resolution goes beyond the relation of dependence implied by these concepts: it makes it possible to identify when two or more expressions in a text refer to the same entity.
In this study, the analysis of coreference has been performed with the Stanford coreference
resolution system (Manning et al. 2014), taking into account the segmentation performed in the
previous step.
During the process of building the hierarchy, coreference information is used to guide the
assignment of a subtopic to the corresponding level in the hierarchy. For this purpose, the re-
sults provided by the coreference system are constrained to those relationships in which the
referent and the referred terms are mentioned in different clauses, and to those referred men-
tions that belong to the set of third-person personal pronouns (i.e., he, she, it, they). Examples
of an accepted and a rejected relationship are provided in Appendix A.1.3.
One may argue that there are some limitations to this method. First, by constraining the results of the coreference analysis in this way, potentially interesting relationships may be
lost. Also, the use of third-person personal pronouns is a fragile approach and may lead to
incorrect relationships in the case of grammatically incorrect sentences.
However, this method still provides a simple and preliminary way to exploit the coreference
information, which is valuable in the process of building the topic hierarchy.
7.1.2.4 Sentence embeddings
In the last step of the pipeline, discourse transcripts are converted into a representation suitable
to compare and measure differences between sentences. In particular, the transformed tran-
scripts should be robust to syntactic and lexical differences and should provide the capability
to capture semantic regularities between sentences. For this purpose, I rely on a pre-trained
model of word vector representations containing 2 million word vectors, in 300 dimensions,
trained with fastText on Common Crawl (Mikolov et al. 2018). In the process of converting a
sentence into its vector space representation, first a selection of four lexical classes (nouns, pronouns, verbs, and adjectives) is performed. Then, for each word, the corresponding word vector is
extracted and finally the average over the whole sentence is computed.
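The averaging step can be sketched as below, under the assumption that a word-vector lookup table is available. The toy 4-dimensional lexicon stands in for the 300-dimensional fastText vectors, and the POS tags follow the Penn Treebank convention used by the Stanford parser; all values are illustrative.

```python
import numpy as np

# Tag prefixes for the four content classes: nouns, pronouns, verbs, adjectives.
CONTENT_TAGS = ("NN", "PRP", "VB", "JJ")

toy_vectors = {  # hypothetical values, not real fastText embeddings
    "mother": np.array([0.9, 0.1, 0.0, 0.2]),
    "washing": np.array([0.1, 0.8, 0.1, 0.0]),
    "dishes": np.array([0.2, 0.6, 0.3, 0.1]),
}

def sentence_embedding(tagged_words, vectors, dim=4):
    """Average the vectors of content words; tagged_words = [(word, pos), ...]."""
    selected = [vectors[w] for w, pos in tagged_words
                if pos.startswith(CONTENT_TAGS) and w in vectors]
    return np.mean(selected, axis=0) if selected else np.zeros(dim)

emb = sentence_embedding(
    [("the", "DT"), ("mother", "NN"), ("is", "VBZ"),
     ("washing", "VBG"), ("dishes", "NNS")], toy_vectors)
```

Words outside the four content classes (here, the determiner) and out-of-vocabulary words are simply skipped before averaging.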
7.1.2.5 Topic hierarchy analysis
To create a topic hierarchy from a transcript, a methodology that is partly inspired by current
clinical practice is followed. Thus, in modeling the problem, it is not possible to impose a
predefined order or structure in the way topics and subtopics may be presented, as this will
depend on how the discourse is organized. However, one can take advantage of the closed
domain nature of the task to define a reduced number of clusters of broad topics that will help
to guide the construction of the hierarchy and the identification of off-topic clauses.
Topic clusters definition
As mentioned, the proposed solution relies on the supervised creation of a predefined number
of clusters of broad topics. Each cluster contains a representative set of sentences that are re-
lated with the topic of the cluster. Ten clusters were defined: main scene, mother, boy, girl, children,
garden, weather, unrelated, incomplete, and no-content. The purpose of the cluster unrelated was
to match those sentences in which the participant is not performing the task (e.g., questions
Figure 7.3: Topic hierarchy building algorithm. (a) The current sentence is compared with the topic clusters to identify its topic. (b) Identification of the level of specialization of the current sentence. If there are no nodes with the same topic as the current sentence, it is considered a new topic. (c) If the current hierarchy contains one or more nodes with the same topic as the current sentence, each of them is analyzed with respect to the current one. (d) As a result, the current sentence is added as a child of its closest node.
directed to the interviewer). The clusters incomplete and no-content are expected to match sen-
tences that may be characteristic of a language impairment. They identify fragments of text
that do not represent a complete sentence (e.g., /overflowing sink/) and expressions that do
not add semantic information about the image (e.g., /fortunately there is nothing happening out
there/, /what is going on/). To build the clusters, around 35% of the data from the control group
is used. Each sentence has been manually annotated with the corresponding cluster label and
clusters are simply modeled by the complete set of sentences belonging to them. These clusters
are used as topic references for building the topic hierarchy of new transcriptions, as described
next.
Topic hierarchy building algorithm
The algorithm to build the topic hierarchy relies on the cosine similarity between sentence em-
beddings. The first step consists of verifying which is, among the 10 topic clusters defined, the
one that best matches the content of the current sentence (Figure 7.3(a)). This is achieved by
computing the cosine similarity between the current sentence embedding and each sentence embedding in each topic cluster. The highest result determines the cluster for the new sentence. In the following step, one needs to assign the current sentence embedding to a level in
the current hierarchy (Figure 7.3(b)). This implies establishing whether one is dealing with a
new or a repeated topic and its level of specialization (i.e., subtopic, sub-subtopic, etc.). This
is achieved by first identifying, in the current hierarchy, the subgraph whose nodes belong to
the same cluster of the current sentence (e.g., the subgraph corresponding to the mother cluster
in Figure 7.3). Then, the cosine similarity between the current sentence and each node of this
subgraph is computed. This process is shown in Figure 7.3(c). The new sentence is considered
a child of the closest node if the similarity is higher than a threshold. Otherwise, it is considered
a repeated topic (Figure 7.3(d)). If there is no subgraph, the sentence embedding is added as
a new topic. If the new topic turns out to be a coreferential expression, this kind of informa-
tion supersedes the cosine metric strategy, and the new topic is added directly as a child of its
referent.
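The two decision steps of the algorithm can be sketched as follows: cluster identification by maximum cosine similarity, then attachment of the sentence under its closest same-cluster node when the similarity exceeds a threshold. The vectors, cluster contents, flat-node subgraph, and the 0.5 threshold are all illustrative simplifications.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def best_cluster(vec, clusters):
    """clusters: {label: [sentence vectors]} -> label with the highest similarity."""
    return max(clusters, key=lambda lab: max(cosine(vec, v) for v in clusters[lab]))

def attach(vec, label, topics, threshold=0.5):
    """topics: list of nodes {'vec', 'label', 'children'}. Returns the decision taken."""
    same_topic = [n for n in topics if n["label"] == label]
    if not same_topic:                           # no subgraph: new topic
        topics.append({"vec": vec, "label": label, "children": []})
        return "new topic"
    closest = max(same_topic, key=lambda n: cosine(vec, n["vec"]))
    if cosine(vec, closest["vec"]) > threshold:  # specialization of the closest node
        closest["children"].append({"vec": vec, "label": label, "children": []})
        return "subtopic"
    return "repeated topic"

clusters = {"mother": [np.array([1.0, 0.0])], "boy": [np.array([0.0, 1.0])]}
hierarchy = []
v1 = np.array([0.9, 0.1])
d1 = attach(v1, best_cluster(v1, clusters), hierarchy)   # new topic
v2 = np.array([0.8, 0.2])
d2 = attach(v2, best_cluster(v2, clusters), hierarchy)   # child of the first node
```

The coreference override described above (attaching a coreferential expression directly under its referent) would simply bypass the threshold test in `attach`.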
7.1.3 Features for AD spoken discourse characterization
In this section, I report the set of features used in this study. First, pragmatic features are
described. These are computed from the topic hierarchy and include measures of topic, global,
and local coherence. Then, a set of additional lexical, morphosyntactic, and semantic features
is introduced. The complete set of features is listed in Table 7.2.
7.1.3.1 Topic coherence features
From the output that is produced at each step of the processing pipeline (i.e., from processing
steps described in Sections 7.1.2.1 to 7.1.2.4, and shown in Figure 7.2), and from the final topic
hierarchy, a set of 37 measurements was identified as of potential interest to characterize topic
coherence.
In a similar way to the standard clinical evaluation, the number of topics, subtopics, sub-subtopics, and sub-sub-subtopics introduced, as well as the total number of repeated topics, are accounted for.
Then, with the aim of investigating more thoroughly the subtopics produced, the number of
topics in each topic cluster related with the Cookie Theft picture are also considered. Addition-
ally, the number of sentences that, during the creation of the topic hierarchy, were classified as
unrelated, incomplete, or no-content is computed. These features were added to explicitly model
language impairments. In fact, the sentences classified as unrelated should identify those ex-
pressions in which the participant is not performing the task, but instead asks questions to
the interviewer. In a similar way, sentences classified as incomplete should model fragments of
a sentence, while sentences classified as no-content should identify those expressions that do
not add semantic information about the image. Furthermore, in the literature, the mean cosine value between all possible pairs and between adjacent pairs of sentences has been used as a measure of global and local coherence, respectively (Graesser et al. 2004). This approach has been extended to the set
of topics constituting the final hierarchy, to the set of repeated topics, and to those classified
as unrelated or no-content. Finally, features characterizing the topology of the hierarchy, such
as the average number of outgoing edges, were also included. The complete set of features is
reported in Table 7.2.
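The two coherence measures that this feature set extends can be sketched as follows: local coherence averages the cosine between temporally adjacent vectors, global coherence averages it over all pairs. The input vectors are illustrative stand-ins for sentence or topic embeddings.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def local_coherence(vecs):
    """Mean cosine between temporally adjacent vectors."""
    sims = [cosine(vecs[i], vecs[i + 1]) for i in range(len(vecs) - 1)]
    return sum(sims) / len(sims)

def global_coherence(vecs):
    """Mean cosine over all unordered pairs of vectors."""
    sims = [cosine(vecs[i], vecs[j])
            for i in range(len(vecs)) for j in range(i + 1, len(vecs))]
    return sum(sims) / len(sims)

vecs = [np.array([1.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 1.0])]
```

The standard deviation and coefficient of variation used in features T19-T33 are computed over the same lists of pairwise similarities.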
7.1.3.2 Other linguistic features
As observed in Chapters 3 and 4, language deficits in AD typically include word-finding diffi-
culties, naming impairments, and semantic errors. These problems contribute to an incoherent,
circumlocutory speech characterized by the use of indefinite and vague terms. To analyze these
deficits in the context of narrative speech, and, thus, providing a more comprehensive evalu-
ation of language abilities, I integrate the topic coherence feature set with a number of lexical,
syntactic, and semantic features. These features and the methodology used to compute them
are detailed in the remainder of this section.
Lexical features
An excessive use of indefinite and generic terms can be analyzed through measures that aim at revealing the richness and diversity of the lexicon. For this purpose, one of the most widely reported metrics, which has been used in many linguistic and clinical research studies (Bucks et al. 2000b, Fraser et al. 2016, Johansson 2009, Kettunen 2014), is the TTR. It is a sample measure
of vocabulary size, representing the ratio of the total vocabulary to the overall text length.
In addition, Brunet's index (Brunet 1978) and Honore's statistic (Honore 1978), two alternative measures of vocabulary richness, were also computed. Brunet's index quantifies lexical richness without being sensitive to text length. It is calculated according to the following formula: W = N^(V^(−0.165)), where N is the total text length and V is the total vocabulary used
by the participant. The Honore’s statistic evaluates the richness of a lexicon by counting the
number of words that occur only once. It is calculated according to the following formula: R = 100 · log N / (1 − V1/V), where V1 is the number of words spoken only once, V is the total vocabulary used, and N is the total text length.
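The three lexical measures can be sketched directly from the formulas above. The natural logarithm is assumed in Honore's statistic, the sample sentence is illustrative, and note that R is undefined when every word occurs exactly once (V1 = V).

```python
import math
from collections import Counter

def lexical_richness(tokens):
    counts = Counter(tokens)
    N, V = len(tokens), len(counts)
    V1 = sum(1 for c in counts.values() if c == 1)   # words spoken only once
    ttr = V / N                                      # type-token ratio
    brunet = N ** (V ** -0.165)                      # W = N^(V^(-0.165))
    honore = 100 * math.log(N) / (1 - V1 / V)        # R = 100 * log N / (1 - V1/V)
    return ttr, brunet, honore

tokens = "the boy takes the cookie while the mother washes dishes".split()
ttr, brunet, honore = lexical_richness(tokens)       # N = 10, V = 8, so ttr = 0.8
```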
Morphosyntactic and syntactic features
Problems with pronominal reference may also contribute to a less informative speech. In fact,
several studies have found that the discourse of AD patients contains an overuse, often inap-
propriate, of pronouns (Almor et al. 1999, Kempler 1984, 1995, Kempler et al. 1987, Ripich &
Terrell 1988). This type of problems, in particular the overuse of demonstratives (e.g., /here/)
and references without a clear antecedent (e.g., /this/) have also been found in the work of Ula-
towska et al. (1988). AD patients have also shown impaired verb production, verb naming, and
impaired verb knowledge in sentence processing (Kim & Thompson 2004, Reilly et al. 2011).
To account for these and other linguistic phenomena, the frequency of occurrence of differ-
ent word classes was computed by relying on the POS information obtained with the Stanford
parser (Klein & Manning 2003b). The frequency of each class is computed at the sentence level
and then normalized by the total number of words in a narrative. Finally, frequencies are aver-
aged over all the sentences.
In the same vein as the word class frequencies, the frequency of different types of produc-
tion rules is accounted for. This type of analysis has been used in several NLP classification
tasks (Post & Bergsma 2013, Wong & Dras 2010), and in problems aiming at identifying AD and
related dementias (Fraser et al. 2016, Orimaye et al. 2014, Yancheva et al. 2015). In a context-free
grammar, production rules describe how the symbols of the grammar can be rewritten. A production
is typically a relation of the form A→β, where A is a non-terminal symbol (each non-terminal
represents a different type of phrase or clause), and β is a string of symbols (the actual content).
More concretely, instances of these rules may be of the form: NP→NP VP, a noun phrase (NP)
that consists of a noun phrase and a verb phrase (VP), or NN→ /mother/, a noun (NN) that
is a terminal.
To identify a meaningful set of production rules, the phrase structure trees of the 35% of the
data that were retained for building the topic clusters were examined. This analysis provided
more than three thousand different production rules; thus, a reduced subset of rules was selected by imposing the following two conditions: i) the left-hand side should belong to a restricted set of constituent tags, and ii) the right-hand side should not be a terminal symbol.
Given this list of production rules, the frequency of occurrence of each rule is computed. The result is then normalized by the total number of rules in the
narrative. Overall, the total number of syntactic and morphosyntactic features is 52.
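The production-rule counting can be sketched as follows, reusing a nested-list tree format [label, children...] as a stand-in for the parser output. ALLOWED_LHS is an illustrative placeholder for the restricted set of constituent tags in condition i); rules whose right-hand side is a terminal are skipped, as in condition ii).

```python
from collections import Counter

ALLOWED_LHS = {"S", "NP", "VP", "PP", "SBAR"}  # illustrative restricted tag set

def production_rules(tree, rules=None):
    """Count (lhs, rhs) productions whose right-hand side is non-terminal."""
    if rules is None:
        rules = Counter()
    children = [c for c in tree[1:] if isinstance(c, list)]
    if children:  # a terminal right-hand side yields no rule (condition ii)
        if tree[0] in ALLOWED_LHS:  # condition i
            rules[(tree[0], tuple(c[0] for c in children))] += 1
        for child in children:
            production_rules(child, rules)
    return rules

tree = ["S",
        ["NP", ["DT", "the"], ["NN", "sink"]],
        ["VP", ["VBZ", "is"], ["VBG", "overflowing"]]]
rules = production_rules(tree)
total = sum(rules.values())
frequencies = {r: c / total for r, c in rules.items()}  # normalized per narrative
```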
Semantic features
A decline in semantic content is consistent with the claims that describe the discourse of AD
patients as empty, containing little or no information (Ahmed, Haigh, de Jager & Garrard 2013).
Analyzing the narratives of a picture description task, several authors have found that the AD
group, in comparison to the control group, produced fewer content elements and made shorter descriptions with fewer categories of information (Ahmed, Haigh, de Jager & Garrard 2013, Croisile
et al. 1996, Hier et al. 1985, Nicholas et al. 1985). Computationally, the identification of infor-
mation content has been approached by several authors (Bucks et al. 2000b, Fraser et al. 2016,
Hakkani-Tur et al. 2010, Hernandez-Domınguez et al. 2018, Jarrold et al. 2014a, Sirts et al. 2017,
Yancheva et al. 2015, Yancheva & Rudzicz 2016). In these studies, the mention of a given con-
cept is assessed through a predefined list of Information Content Units (ICUs). In this respect,
Topic coherence:
• Number of topics (T1), subtopics (T2), sub-subtopics (T3), and sub-sub-subtopics (T4) introduced.
• Number of topics produced in each topic cluster, namely main scene (T5), mother (T6), boy (T7), girl (T8), children (T9), garden (T10), weather (T11).
• Proportion of dependent (T12) and independent (T13) clauses to the total number of sentences.
• Total number of coreferential mentions (T14).
• Total number of repeated topics, subtopics, sub-subtopics, and sub-sub-subtopics (T15).
• Number of sentences that were classified as unrelated (T16), incomplete (T17), or no-content (T18) in the first step of the main algorithm.
• Mean, standard deviation, and coefficient of variation (the ratio of the standard deviation to the mean) of the cosine similarity between two temporally consecutive topics (T19-T21), all pairs of topics (T22-T24), all pairs of repeated topics (T25-T27), and those classified as unrelated (T28-T30) or no-content (T31-T33).
• Length of the longest path from the root node to all leaves (T34).
• Average number of outgoing edges of all nodes (T35).
• Total number of sentences (T36).
• Ratio of dependent to independent clauses (T37).

Lexical:
• TTR (L1), Brunet's index (L2), and Honore's statistic (L3).

Morphosyntactic and syntactic:
• Word class frequencies: adverbs (M1), verbs (M2), nouns (M3), pronouns (M4), and adjectives (M5).
• Number of times a production rule is used (M6-M41).
• Rate, proportion, and average length of noun (M42-M44), verb (M45-M47), and prepositional phrases (M48-M50). Ratio of nouns to verbs (M51) and of pronouns to nouns (M52).

Semantic:
• ICUs to consider the mention of a key concept in the Cookie Theft picture (S1-S23).
• Frequency of occurrence of specific keywords relevant for the Cookie Theft picture (S24-S49).

Table 7.2: Summary of all extracted features (141 in total). The identifiers of each type of feature are reported in parentheses.
some works have recently approached a completely automatic evaluation of information con-
tent (Hakkani-Tur et al. 2010, Hernandez-Domınguez et al. 2018, Jarrold et al. 2014a, Yancheva
& Rudzicz 2016).
To account for information content features, the approach described in the work of Croisile
et al. (1996) has been followed. The authors examined 23 information units in four categories
(i.e., subjects, places, objects, and actions). Overall, these concepts are assumed to constitute a
complete description of the Cookie Theft picture:
• three subjects: boy, girl, and mother,
• two places: kitchen and exterior seen through the window,
• eleven objects: cookie, jar, stool, sink, plate, dishcloth, water, window, cupboard, dishes,
and curtains,
• seven actions: boy taking or stealing, boy or stool falling, woman drying or washing
dishes/plate, water overflowing or spilling, action performed by the girl, woman uncon-
cerned by the overflowing, woman indifferent to the children.
Information units are used to identify the mention of a concept in a narrative; as such, they should be robust to lexical or semantic variations of the same content (e.g., mother, lady, wife, mom). Another issue concerns the ICU 'action performed by the girl', whose definition is too vague and needs to be further specified in order to be approached computationally.
Details on the definition of the ICUs can be found in Appendix A.2.1. Given the ICU defi-
nitions, the features corresponding to the categories subjects, objects, and places were computed by simply verifying the mention of the corresponding items in the text. To compute the mention of the ICU category actions, the dependency representations provided by the Stanford
Parser (Klein & Manning 2003b) were examined. More details about the ICU features compu-
tation can be found in Appendix A.2.2.
Finally, in a similar way to the work of Fraser et al. (2016), the set of ICU features is ad-
ditionally complemented with the occurrences of specific words that may be of relevance to
the Cookie Theft picture. That is, the number of times that a given uni-gram is referred to is
computed. This may highlight possible subtle variations in the linguistic patterns used by the
AD and the control group. Overall, the total number of semantic features is 49.
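The ICU and keyword features can be sketched as follows. Each ICU is modelled as a set of lexical variants so that the feature is robust to paraphrase; the variant lists and keywords below are illustrative placeholders, not the actual definitions (those are given in Appendix A.2.1).

```python
# Illustrative ICU variant sets and keyword list (hypothetical, for sketching only).
ICUS = {
    "mother": {"mother", "woman", "lady", "mom", "wife"},
    "cookie": {"cookie", "cookies", "biscuit"},
    "stool": {"stool"},
}
KEYWORDS = ["overflowing", "sink", "falling"]

def semantic_features(tokens):
    """Binary ICU mentions (by variant-set intersection) and keyword counts."""
    words = set(tokens)
    mentions = {icu: int(bool(words & variants)) for icu, variants in ICUS.items()}
    keyword_counts = {k: tokens.count(k) for k in KEYWORDS}
    return mentions, keyword_counts

tokens = "the lady is washing dishes and the sink is overflowing".split()
mentions, keyword_counts = semantic_features(tokens)
```

The action ICUs are not captured by this word-matching sketch; as noted above, they require dependency structures rather than surface forms.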
7.1.4 Evaluation
AD classification experiments have been performed on a subset of the Cookie Theft corpus using an RF classifier. As described in Section 7.1.2, 35% of the data was held out to define the topic
clusters. Hence, only 65% of the data has been used for experimental validation, that is, for
training and testing AD classifiers. This consists of 148 discourse samples from the control
group, and 153 from the dementia group. The two sets together contained 1241 unique words,
885 from control subjects, and 878 from AD patients. On average, each transcribed narrative
contained around 12-13 sentences, with the patient group producing shorter descriptions. A
stratified k-fold cross-validation per subject strategy was implemented, with k = 10 folds. In the following, I report the average and range accuracy at the 90% confidence
level computed from the results of each fold.
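The per-subject stratified split can be sketched as follows: all samples from one speaker end up in the same fold, and folds keep the AD/control proportion. The subject IDs and labels are synthetic; scikit-learn's StratifiedGroupKFold is an off-the-shelf alternative for the same purpose.

```python
from collections import defaultdict

def subject_folds(subject_labels, k=10):
    """subject_labels: {subject_id: class_label} -> {subject_id: fold index}."""
    by_class = defaultdict(list)
    for subj in sorted(subject_labels):
        by_class[subject_labels[subj]].append(subj)
    folds = {}
    for subjects in by_class.values():
        for i, subj in enumerate(subjects):  # round-robin within each class
            folds[subj] = i % k
    return folds

subjects = {f"ad{i}": "AD" for i in range(4)}
subjects.update({f"ct{i}": "Control" for i in range(4)})
folds = subject_folds(subjects, k=2)
```

Assigning folds at the subject level (rather than the sample level) prevents recordings from the same speaker from appearing in both training and test partitions.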
In order to identify the most discriminant features for AD classification, I implemented a method based on sequential forward selection (SFS) (Pudil et al. 1994). The SFS algorithm
is an iterative search approach in which a model is trained with an incremental number of
features. Starting with no features, at each iteration the accuracy of the model is tested by
adding, one at a time, each of the features that were not selected in a previous iteration. The
feature that yields the best accuracy is retained for further processing. The method ends when
the addition of a new feature does not improve the performance of the model. In this study, I
implemented a variation of the SFS that explores a larger feature space in order to find better
solutions to the problem at hand. That is, I removed the constraint of terminating the search
as soon as a first local maximum is found, and performed an extended search until the last
feature was selected. Then, the modified approach selects the minimal set of features that meet
a certain performance convergence criterion. In this case, this is defined by the attainment of a
classification performance of at most 1% worse than the global maximum.
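The modified search can be sketched as follows: the forward selection is run until every feature has been added, and the shortest prefix whose score is within the tolerance of the global maximum is kept. Here `score` stands for any subset-evaluation function (e.g. cross-validated accuracy); the additive gains below are synthetic, not the thesis's measured values.

```python
def exhaustive_sfs(features, score, tolerance=0.01):
    """Forward selection over all features, then keep the minimal prefix
    scoring within `tolerance` of the global maximum."""
    selected, trajectory = [], []
    remaining = list(features)
    while remaining:
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
        trajectory.append(score(selected))
    peak = max(trajectory)
    for i, acc in enumerate(trajectory):  # minimal set meeting the criterion
        if acc >= peak - tolerance:
            return selected[: i + 1], acc

# Synthetic, diminishing per-feature gains (illustrative only).
gains = {"T1": 0.20, "T21": 0.10, "T16": 0.02, "T10": 0.0}
score = lambda subset: 0.5 + sum(gains[f] for f in subset)
subset, acc = exhaustive_sfs(list(gains), score)
```

With these synthetic gains the search keeps three of the four features: the fourth adds nothing, and the third is needed to come within 1% of the peak.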
7.1.4.1 Experiments using manual transcriptions
This section reports on the validation of different sets of features computed on the manual transcriptions: experiments with topic coherence features, with additional linguistic features, and
fusion of these two sets.
Topic coherence features results
Using only the set of features computed through the multistage approach yielding the topic
hierarchy, the average accuracy in classifying AD was 79.0% ± 4.8%. This performance is
achieved with a selection of 11 features out of 37. Figure 7.4 reports the classification accuracy
obtained with the SFS algorithm on this subset of features.
From this figure, it is possible to note that the number of topics (T1) was the first feature se-
lected, providing, alone, an average accuracy of 66%. The second and the fifth features selected
were the coefficient of variation between two temporally consecutive topics (T21) and between
all pairs of topics (T24). Nevertheless, other statistical measures related with the mean and
standard deviation of the cosine value between adjacent (T19, T20) and all pairs of topics (T22,
T23) were not considered discriminant for classification by the SFS algorithm. These findings
are in agreement with the results achieved by Toledo et al. (2018). However, these features
were introduced because in the literature they have been used as an index of local and global
[Figure 7.4 plot: classification accuracy (y-axis, 0.50-0.80) as features are added in the order T1, T21, T16, T10, T24, T14, T12, T35, T8, T34, T3.]

Figure 7.4: Variation of the classification accuracy with the SFS method, while increasing the number of features. Results are presented for the set of topic coherence features that provided the maximum accuracy. Features are computed on the manual transcriptions.
coherence (Graesser et al. 2004, Santos et al. 2017, Toledo et al. 2018). In order to understand
why they were not considered relevant for classification, their average values were analyzed in
more detail. In this way, one can note that differences between the AD and the control group
are relatively small, which may represent a possible explanation. From this analysis, I also dis-
covered that, in line with the results achieved by Toledo et al. (2018), the AD group achieved
higher scores in the mean value of the cosine between adjacent topics, rather than between
all pairs of topics. This difference has been associated with a greater difficulty in keeping the
theme throughout the discourse. A significant impairment in global, but not in local coherence
was also found by Glosser & Deser (1991) while assessing macrolinguistic patterns of discourse
production in AD patients. Dijkstra et al. (2002) justified this finding with the assumption that
global coherence and elaborations on a topic require more cognitive resources than local co-
herence, which needs only the activation of information that is relevant between continuous
sentences.
Another relevant result regards the number of coreferential mentions (T14) and the proportion of dependent clauses (T12). Analyzing these data in more detail, it is confirmed that, on
average, there is a large difference between the two groups for these statistics. Two opposite
patterns also emerge. AD patients produced a greater number of coreferential mentions and
a reduced number of dependent clauses with respect to the control group. These results are
in agreement with the findings that AD speech is characterized by an increased use of pro-
nouns and a reduced number of subordinate clauses (Ahmed, Haigh, de Jager & Garrard 2013,
Croisile et al. 1996, Kemper et al. 1993, Ripich & Terrell 1988, Ulatowska et al. 1988).
Linguistic features results
Using the SFS feature selection approach with the linguistic features, the accuracy improves
to 82.6% ± 5.1%. This result was achieved with the selection of 55 features, out of 104. This
outcome is consistent with the results achieved in similar state-of-the-art works (Hernandez-Domınguez et al. 2018, Yancheva & Rudzicz 2016), and in particular with that of Fraser
et al. (2016). I recall, in fact, that the range of linguistic measures analyzed in this study can
be considered a subset of those implemented by Fraser et al. In this way, it is possible to draw
a comparison between the two approaches. In both studies, the frequencies of adverbs (M1),
verbs (M2), and nouns (M3) were identified in the set of the most important features. The same
applies to the rate of prepositional phrases (M48). An analysis of the mean values of these
features in the two groups confirmed a trend in agreement with the current state of the art.
AD speech contains a reduced number of nouns, verbs, and prepositional phrases (Ahmed,
de Jager, Haigh & Garrard 2013, Croisile et al. 1996, Kave & Goral 2016, Kemper et al. 1993).
Concerning semantic features, a partial overlap was also found in the set of ICUs and
word occurrences that were considered relevant for classification. I acknowledge that in this
study a larger number of semantic features was selected in comparison to the one of Fraser
et al. These differences may be explained either by the different feature sets, since those implemented by Fraser et al. also contain acoustic and psycholinguistic measures, or by
different computational implementations.
Fusion of features results
When combining the topic coherence features with the set of lexical, syntactic, and semantic
features, the accuracy improves from 82.6% ± 5.1% to 85.5% ± 2.9%. This result is achieved
with the identification of a restricted number of features, only 19 out of 141, corresponding to
13% of the total number of features. This is a relevant outcome considering that using these two sets individually required the selection of 30% and 53% of the total number of features, while achieving a lower accuracy. As shown in Figure 7.5 (top), the selected subset includes
16 features assessing syntactic and semantic abilities, and 3 features evaluating pragmatic as-
pects of language. This distribution confirms that the set of other linguistic features is more
comprehensive and covers a wide range of phenomena, assessing language impairment at dif-
ferent levels of processing. On the other hand, pragmatic features encode in a compact way a
different type of information, complementing lower level aspects of language production.
Figure 7.5: Accuracy achieved with the top selected features using the fusion of different sets. Results are computed on the manual transcriptions (top) and on the automatic transcriptions (bottom).
7.1.4.2 Experiments using automatic transcriptions
The generation of manual transcriptions of discourse samples is a laborious, time-consuming
task that also requires expert linguistic knowledge. This need hampers the applicability of the
proposed type of computational analysis in clinical settings. Thus, using a speech recognition
system to automatically produce the transcriptions can remove this constraint, although possibly
at the cost of a negative impact on performance due to recognition errors. Nowadays,
state-of-the-art automatic speech recognition (ASR) systems can achieve accuracy levels
comparable to human transcribers (Word Error Rate (WER) of ∼5-6%) in certain spontaneous
speech recognition tasks (Saon et al. 2017, Stolcke & Droppo 2017). Nevertheless, when it
comes to the recognition of atypical speech, such as speech from elderly people, recognition
errors typically get worse. This performance drop is further exacerbated in the case of diseases
affecting speech and language. Hakkani-Tur et al. (2010) reported WERs of 30.7% and 26.7%
when recognizing the speech of healthy elderly adults performing picture description and
story recall tasks, respectively. Lehr et al. (2012) achieved a WER of 34.5% on a corpus of MCI
and elderly subjects performing a story recall task.
In this section, the approach to automatic transcription generation is described. Then, the
impact of automatic transcriptions on the AD classification task is assessed.
Automatic transcription generation
The Google Cloud Platform (Google 2019a) and the Google Cloud Speech to Text (STT) (Google
2019b) API have been used to obtain the automatic transcriptions for the Cookie Theft corpus.
Recordings that were originally encoded in MP3 format were converted to 16 kHz sampling
rate WAV audio files, a coding format accepted by the Google API.

              Manual transcriptions                      Automatic transcriptions
              Topic coherence   Linguistic   Fusion      Fusion
Accuracy      79.0±4.8          82.6±5.1     85.5±2.9    79.7±3.5

Table 7.3: Summary of AD classification results (avg. and range accuracy %)

The quality of the recordings is quite poor, as they were originally collected on tapes in the
late ’80s. Besides, some of them contain background noise or a very low voice. Nevertheless,
no speech enhancement technique was applied. Moreover, before performing ASR, the original
recording sessions were segmented in order to remove the clinician interventions, which are
extraneous to the participant’s description and need to be ignored. For this segmentation task,
a speaker diarization system could have been used (Bonastre et al. 2005, Meignier & Merlin
2010); instead, the sentence-level manual transcriptions were used to obtain the utterance
boundaries.
Notice that overlapped speech segments due to the superimposition of the clinician’s voice
were also kept. In addition to speaker changes, the corpus manual annotations were also used
to obtain sentence boundaries. This was necessary because the ASR system provided automatic
sentence boundaries based on long pauses, which did not correspond to the actual ends of
sentences. In fact, a correct identification of sentence boundaries is essential for modeling the
topic hierarchy in the proposed method and for the extraction of some of the features described
in Section 7.1.3. While there are natural language processing tools that perform automatic
sentence segmentation, most of them require the speech transcripts to include accurate
punctuation marks, which the ASR output does not currently provide.
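The removal of clinician turns, given utterance-level time annotations, can be sketched as follows. The speaker codes and timestamps below are hypothetical: DementiaBank transcripts follow the CHAT convention, where participant and investigator tiers are typically labeled PAR and INV, but the exact annotation format used is not reproduced here.

```python
# Hypothetical segment list derived from the corpus' manual annotations:
# (speaker, start_sec, end_sec).
segments = [
    ("INV", 0.0, 2.1),   # clinician instruction - to be removed
    ("PAR", 2.1, 9.8),   # participant description - kept
    ("INV", 9.8, 10.4),
    ("PAR", 10.4, 18.0),
]

def participant_spans(segs, keep="PAR"):
    """Keep only the spans attributed to the participant, merging contiguous ones."""
    spans = []
    for spk, start, end in segs:
        if spk != keep:
            continue
        if spans and abs(spans[-1][1] - start) < 1e-6:
            spans[-1] = (spans[-1][0], end)  # extend the previous span
        else:
            spans.append((start, end))
    return spans

print(participant_spans(segments))  # → [(2.1, 9.8), (10.4, 18.0)]
```

The resulting spans would then be used to cut the audio before sending it to the recognizer; note that this simple filter deliberately keeps spans with overlapped speech, as in the procedure described above.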
AD classification results
Automatic transcriptions were obtained for both the training and test data. Apart from the
way in which the transcriptions were generated, these experiments follow the exact same
procedure used in the previous section (7.1.4.1). This means that the algorithm used to build
the hierarchy, the dataset separation, and the partitions used in the cross-validation experiments
are exactly the same. The experiments were performed using the complete set of features,
which includes topic coherence and the other linguistic features. In this way, a classification accuracy of
79.7% ± 3.5% was achieved. This result was obtained through the selection of 10 features out
of 141, corresponding to 7% of the total number of features. Table 7.3 reports a summary of the
results achieved in the task of AD classification, using both manual and automatic transcrip-
tions.
A comparison between the set of features that provided the best classification accuracy on
the automatic transcriptions and those identified on the manual transcriptions can be done by
observing Figure 7.5. The selected subset includes 8 features assessing lexical, syntactic, and
semantic abilities, and 2 features evaluating pragmatic aspects of language, that is, 20% in
contrast to the 13% obtained when using manual transcriptions. Interestingly, the total number
of coreferential mentions (T14) is the only pragmatic feature that appears in both selections.
Although the type of information used in the analysis of coreference is strongly affected by
recognition errors, this feature continues to be particularly relevant for the discrimination of
the disease, as confirmed by an analysis of its mean values. The number of topics (T1) is now
selected, while the number of incomplete sentences (T17) and the average number of outgoing
edges (T35) are no longer selected. These observations suggest, on the one hand, that ASR
errors affect some of the proposed features differently and, on the other hand, that a small
number of these features seem relatively insensitive to these errors (20% selected in contrast to
13%, as noted previously).
Transcription error analysis
As expected, recognition errors have a negative impact on the model’s performance in terms
of classification accuracy. The WER was computed using the manual transcriptions of the
Cookie Theft corpus as ground truth. The WER for the control and the AD groups was 37%
and 43%, respectively. An analysis of the audio recordings confirmed that such a high error
rate is partly due to the poor quality of the recordings. In fact, the audio segments that
exhibited a higher error rate either contained considerable background noise or a recorded
voice of very low energy. Another possible source of error for the recognizer is the speaking
style of the subjects, which in some cases was very fast, with a high rate of coarticulation.
Nevertheless, it is worth noting the robustness of the selected set of features and the proposed
method, which achieves up to 79.7% AD classification accuracy even with such a high WER.
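The deletion, substitution, and insertion counts discussed in this analysis fall out of the standard Levenshtein alignment used to compute WER. A self-contained sketch of that alignment follows; the example sentences are illustrative, not taken from the corpus:

```python
def wer_breakdown(ref, hyp):
    """Levenshtein word alignment, returning (WER, substitutions, deletions, insertions)."""
    ref, hyp = ref.split(), hyp.split()
    n, m = len(ref), len(hyp)
    # cost[i][j]: edit distance between ref[:i] and hyp[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Backtrace to attribute each error to a substitution, deletion, or insertion.
    i, j, S, D, I = n, m, 0, 0, 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            S += ref[i - 1] != hyp[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            D += 1
            i -= 1
        else:
            I += 1
            j -= 1
    return (S + D + I) / max(n, 1), S, D, I

wer, S, D, I = wer_breakdown("the sink is overflowing while she is wiping a plate",
                             "the sink overflowing while she wiping plate")
print(wer, S, D, I)  # → 0.3 0 3 0  (three deleted words, no substitutions)
```

A dropped word in the hypothesis surfaces as a deletion, which is why heavy deletion rates systematically shrink count-based linguistic features.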
Analyzing the ASR results in more detail, it was possible to verify that deletions (i.e., words or
sentences that were not recognized) were the main source of error, followed by substitutions
(i.e., words that were incorrectly recognized). Indeed, for both the AD and the control groups,
the number of deletions was almost double that of substitutions, accounting for 26% and 22%
of the errors, respectively. Additionally, for both groups, the features obtained with the manual
and the automatic transcripts were analyzed. In this way, it was possible to observe that the
majority of the features computed with the automatic transcriptions showed a
lower average value than their manual counterparts. This outcome is in agreement with the
high rate of deletions found. Only a small number of features showed the opposite trend, that
is, an average value higher than the corresponding value computed on the manual
transcriptions. Among the topic coherence features, this phenomenon was observed for the
number of sentences classified as incomplete (T17). As expected, deletions considerably
increase its average value, making it, however, a less discriminant feature for the AD
classification task, as the experiments with automatic transcriptions confirm.
Lexical and syntactic features also reflected the alterations in the transcriptions by showing
an increasing trend in some measures. Among them are Honore’s statistic (Honore 1978) (L3),
the frequency of nouns (M3), and other features related to this measure. Notably, a higher
value of Honore’s statistic is associated with the use of a richer vocabulary; this phenomenon
is most likely due to substitution errors. The frequency of nouns (M3), the proportion of nouns
to verbs (M51), and the rate, proportion, and average length of noun phrases (M42-M44) also
showed an increasing trend in both groups. A closer inspection revealed that this increment
was explained by a reduction in the number of noun phrases and by a general reduction in
sentence length.
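Honore's statistic mentioned above is commonly computed as R = 100·log N / (1 − V1/V), where N is the number of tokens, V the vocabulary size, and V1 the number of hapax legomena (words occurring exactly once). A brief sketch with illustrative word lists (not corpus data):

```python
import math
from collections import Counter

def honore_statistic(tokens):
    """Honore's R = 100 * log(N) / (1 - V1/V); higher values suggest a richer,
    less repetitive vocabulary."""
    counts = Counter(tokens)
    N, V = len(tokens), len(counts)
    V1 = sum(1 for c in counts.values() if c == 1)
    if V1 == V:  # every word is a hapax: guard the division by zero
        return float("inf")
    return 100 * math.log(N) / (1 - V1 / V)

repetitive = "the boy the girl the boy the cookie the boy".split()
varied = "the boy takes cookies while the sister asks quietly".split()
print(honore_statistic(repetitive) < honore_statistic(varied))  # → True
```

This makes the observed effect of ASR substitutions concrete: replacing a repeated word with a wrong one raises V1 relative to V, shrinking the denominator and inflating R, so the transcript looks lexically richer than the speech actually was.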
7.2 Summary
In this chapter, I explored an approach to automatically classify AD based on the analysis of
a wide set of linguistic features. Pragmatic abilities of language processing are evaluated by
modeling discourse samples into a hierarchy of information. Through this process, a number
of features related to the analysis of topic coherence are computed. This set is then
complemented with various lexical, syntactic, and semantic features in order to provide a
comprehensive evaluation of language deficits in AD. Classification experiments achieved an
accuracy of 85.5%, which is in line with the current state of the art. These results confirm that,
by incorporating features representing pragmatic aspects of discourse, it is possible to attain a
better characterization of language impairments in AD. This contributes to improving the
ability of current computational approaches to provide an objective, complementary
evaluation. Additionally, I evaluated the impact of using a speech recognizer to automatically
produce the transcriptions needed for this type of computational analysis, an important step
towards the applicability of this kind of approach in a clinical setting. In this case, the
automatic identification of the disease achieved a lower accuracy of 79.7%, mostly due to the
negative effect of ASR deletion errors. Nevertheless, this is still a remarkable disease
classification result considering that the WER on these data was around 40%.
8Conclusions and Future Work
This chapter presents the final remarks of this dissertation. First, the major achievements
of the research carried out in the areas of monitoring speech, cognitive, and language abilities
are summarized. Then, these results are discussed in the context of the central aim of this
thesis, which is to contribute to the current state of the art on the diagnosis of
neurodegenerative diseases by means of Speech and Language Technology (SLT) (Section 8.1).
Finally, new directions for future research are suggested (Section 8.2).
8.1 Contributions
This dissertation addressed the problem of providing complementary diagnostic methods
based on SLT for the diagnosis of neurodegenerative diseases. To this end, a preliminary study
of the major symptoms and the core criteria used in the clinical diagnosis of these disorders
was conducted. This research led to the identification of three distinct areas deserving further
investigation: the monitoring of speech, cognitive, and language abilities. For each of these
areas, the relevant state of the art was reviewed, identifying current progress and limitations.
In the monitoring of speech abilities, I found an extensive body of research investigating
acoustic measures able to represent the symptoms of dysarthria, a motor speech disorder
typically developed in Parkinson’s Disease (PD). Dysarthria is commonly assessed through
the evaluation of several speech production tasks, yet few studies have analyzed the
discriminative ability of these tasks for the identification of PD. In my research, I have defined
a standard set of acoustic features, which includes measures evaluating speech disorders along
various dimensions and is thus suitable for assessing the relevance of the different speech
tasks. To this end, I have used a database containing Portuguese PD patients and healthy
speakers performing 8 tasks designed to assess phonation, articulation, and prosody. Results
have shown that the most important production tasks were the reading of prosodic sentences
and storytelling, achieving a PD classification accuracy of 85.10% and 82.32%, respectively.
These tasks elicit the production of continuous speech and contain considerably more acoustic
and prosodic information than sustained vowel phonation or the reading of a word. Their
selection may indicate that, in the identification of PD, a comprehensive evaluation of speech
impairments is more important than the assessment of isolated abilities, such as phonation or
articulation.
Regarding the monitoring of cognitive abilities, I found that many neuropsychological tests
are eligible to be administered through current SLT solutions, providing evident advantages
to the community. However, the literature review showed that existing solutions present
important limitations. For these reasons, I targeted the automatic implementation of some
neuropsychological tests widely used for screening cognitive performance. First, I addressed
the semantic verbal fluency task. This test represents a challenge for current speech recognition
technology, as it requires the spontaneous production of a list of items belonging to an
unconstrained domain. To assess its automatic implementation, I collected a database
containing the recordings of 42 healthy speakers. The challenge was addressed by constructing
a tailored language model and by exploiting prosodic cues. Both methods provided an
improvement in the prediction of potential animal names. Then, I approached the automatic
implementation of the MMSE and ADAS-Cog, two neuropsychological test batteries. They
have been developed as an automatic web-based tool with SLT integration, so that the tool can
also be used for remote monitoring of cognitive impairments. As far as I know, it represents
the only platform of this type implemented for the Portuguese population. To evaluate the
feasibility of the monitoring tool, a speech corpus including the recordings of 5 people
diagnosed with cognitive impairments and 5 healthy control subjects was collected. The error
between the manual and the automatic evaluation was relatively small (from 0.80 to 3.00 for
the patients, and from 0.80 to 2.80 for the control group), confirming the feasibility of such a
system. Additionally, the flexibility of the underlying platform makes it easy to extend the
implemented tests with different types of exercises that can be used for the daily training of
cognitive abilities.
In the monitoring of language abilities, I found that the analysis of discourse impairments
requires both the manual transcription of continuous speech samples and the subsequent
identification and annotation of predefined linguistic features. These requirements preclude
the applicability of discourse analysis in clinical settings and may also lead to different
inter-expert assessments. I developed an automatic method targeting the analysis of the
pragmatic aspects of a discourse. This approach is further complemented by considering
lower-level aspects of language processing, such as lexical, syntactic, and semantic abilities.
Overall, the analysis of such a wide set of language characteristics should provide a
comprehensive evaluation of discourse production. The method has been evaluated with a
publicly available corpus devoted to the study of communication in dementia. In the
experiments, I show that pragmatic features provide complementary information, increasing
accuracy from 82.6% to 85.5% in the detection of dementia. Additionally, with the aim of
automating the entire process of this type of analysis, a speech recognition system has been
used to obtain the transcriptions of the recordings.
To conclude, I believe that these contributions provide relevant advances in the current state
of the art on the diagnosis of neurodegenerative diseases by means of SLT, fulfilling the major
aim of this thesis. In particular, the research carried out in the areas of monitoring speech,
cognitive, and language abilities is a step towards a future in which clinicians may be provided
with complementary diagnostic tools. The work carried out during this doctoral research also
accomplished another important objective: introducing the research group to which I belong
to the interdisciplinary area of diagnosis of neurodegenerative diseases. This is confirmed by
the number of new studies in this area that have started in the group over these years,
including an international project targeting speech therapy, in which I have been directly
involved, and a European consortium dedicated to the analysis of pathological speech.
Furthermore, new master theses related to the topics of this dissertation have been developed,
in some of which I was directly involved as a co-supervisor. Finally, the work carried out in
this doctoral research made it possible to establish new collaborations with neurologists from
different institutions, specialized in different areas, which will be of great importance for the
development of future interdisciplinary research.
Publication list
The work carried out in the context of this dissertation has led to the following publications:
• Anna Pompili, Alberto Abad, David Martins de Matos, Isabel P. Martins, Pragmatic as-
pects of discourse production for the automatic identification of Alzheimer’s disease, accepted
for publication to a special issue of the IEEE Journal of Selected Topics in Signal Pro-
cessing (JSTSP) on Automatic assessment of health disorders based on voice, speech and
language processing, May 2019.
• Anna Pompili, Alberto Abad, David Martins de Matos, Isabel P. Martins, Topic coher-
ence analysis for the classification of Alzheimer’s disease, In Proceedings IberSPEECH 2018,
Barcelona, Spain, November 2018.
• Anna Pompili, Cristiana Filipa Lopes Amorim, Alberto Abad, Isabel Trancoso, Speech and
language technologies for the automatic monitoring and training of cognitive functions, In Work-
shop on Speech and Language Processing for Assistive Technologies (SLPAT), Dresden,
Germany, September 2015.
• Helena Moniz, Anna Pompili, Fernando Batista, Isabel Trancoso, Alberto Abad, Cristiana
Filipa Lopes Amorim, Automatic Recognition of Prosodic Patterns in Semantic Verbal Fluency
Tests - an Animal Naming Task for Edutainment Applications, In International Congress of
Phonetic Sciences (ICPhS 2015), Glasgow, Scotland, UK, August 2015.
• Anna Pompili, Alberto Abad, Paolo Romano, Isabel P. Martins, Rita Cardoso, Helena
Santos, Joana Carvalho, Isabel Guimaraes, and Joaquim J. Ferreira. Automatic Detection
of Parkinson’s Disease: An Experimental Analysis of Common Speech Production Tasks Used
for Diagnosis. In International Conference on Text, Speech, and Dialogue, pp. 411–419,
Springer, August 2017.
Some of the works carried out during the doctoral research have been omitted from the previous
list because they were not directly related to the topics identified in this dissertation.
However, it is relevant to mention them here:
• Ruben Solera-Urena, Helena Moniz, Fernando Batista, Vera Cabarrao, Anna Pompili,
Ramon Astudillo and Isabel Trancoso, Uma abordagem de aprendizagem semi-supervisionada
para a percepcao automatica de personalidade, baseada em pistas acustico-prosodicas em domınios
com poucos recursos, accepted for publication to Revista da Associacao Portuguesa de
Linguıstica, March 2019.
• Javier Tejedor, Doroteo T. Toledano, Paula Lopez-Otero, Laura Docio-Fernandez, Jorge
Proenca, Fernando Perdigao, Fernando Garcıa-Granada, Emilio Sanchis, Anna Pompili
and Alberto Abad, ALBAYZIN Query-by-example Spoken Term Detection 2016 evaluation,
EURASIP Journal on Audio, Speech, and Music Processing, April 2018.
• Sonia Reis, Anna Pompili, Alberto Abad, Jorge Baptista, O proverbio como estımulo num
terapeuta virtual, VI Simposio mundial de estudos da lıngua portuguesa, Santarem, Por-
tugal, October 2017.
• Ruben Solera Urena, Helena Moniz, Fernando Batista, Vera Cabarrao, Anna Pompili, Ra-
mon Fernandez Astudillo, Joana Carvalho Filipe de Campos, Ana Paiva, Isabel Trancoso,
A Semi-Supervised Learning Approach for Acoustic-Prosodic Personality Perception in Under-
Resourced Domains, In Proc. of Interspeech 2017, pages 929–933, Stockholm, Sweden,
August 2017.
• Anna Pompili, Alberto Abad, The L2F Query-by-Example Spoken Term Detection system
for the ALBAYZIN 2016, In Albayzin Evaluation - IberSPEECH 2016, Lisbon, Portugal,
November 2016.
List of invited presentations:
• Vania Mendonca, Anna Pompili, Ruben Santos, Isabel Trancoso, Luısa Coheur e Alberto
Abad. E-Inclusao no L2F: as tecnologias da lıngua ao servico da saude, educacao e comunicacao.
VII Workshop on Linguistics, Language Development and Impairment, Lisbon, Portugal.
2017
• Isabel Trancoso, Alberto Abad, Luısa Coheur, Anna Pompili, Cristiana Amorim,
Vania Mendonca. Virtual Therapists. Clef 2016 - Conference and Labs of the Evaluation
Forum. VII Conferencia sobre Information Access Evaluation meets Multilinguality, Mul-
timodality, and Interaction. Evora, Portugal, 2016
• Vania Mendonca, Anna Pompili, Alberto Sardinha, Luısa Coheur. VITHEA-Kids: Adapt-
ing the VITHEA platform to children with Autism Spectrum Disorder. VI Workshop on Lin-
guistics, Language Development and Impairment. Lisbon, Portugal, 2016
• Anna Pompili, Alberto Abad, Isabel Trancoso, Joao Paulo Carvalho. Pos-VITHEA: as
tecnologias da fala para melhorar as funcionalidades cognitivas. V Workshop on Linguistics,
Language Development and Impairment, Lisbon, Portugal, 2015
• Alberto Abad, Anna Pompili, Isabel Trancoso, Jose Fonseca, Pedro Fialho. VITHEA -
Terapia remota para patologias da fala. II encontro de terapeutas da fala do Alentejo, Evora,
Portugal, 2014.
• Alberto Abad, Anna Pompili, Isabel Trancoso, Jose Fonseca. VITHEA O potencial das
tecnologias da fala no tratamento da afasia. IV Workshop on Linguistics, Language Develop-
ment and Impairment, Lisbon, Portugal, 2014.
8.2 Future Work
Considering the different topics addressed in this dissertation, there are naturally a number
of directions that can be taken as future work. Regarding the monitoring of speech abilities,
it would be important to validate the results achieved in the automatic identification of PD
with different datasets, in order to confirm their robustness. In the area of monitoring cognitive
abilities, the results achieved with the implementation of the semantic verbal fluency test have
shown that there is still much room for improvement. Future extensions may consider using
confidence scores to filter ASR results, as in the approach of Pakhomov et al. (2015), or training
a specific acoustic model suited to elderly voice characteristics. Additionally, regarding
neuropsychological tests, future research could target the development of new tests addressing
cognitive stimulation rather than the diagnosis of diseases. In the area of monitoring language
abilities, it would be important to incorporate automatic speaker and sentence segmentation,
in order to fully remove the dependence on manual transcriptions. A more challenging line of
research is the extension of this kind of analysis to other types of discourse production tasks,
including open-domain ones. Finally, it is important to control some clinical and demographic
variables of the samples in future studies, namely the participants’ degree of literacy and the
severity or stage of disease. Controlling these factors will improve the diagnostic value of the
results.
Probably one of the most important limitations of these studies is the lack of standard,
consistent, publicly available datasets. This was foreseen in the definition of the objectives of
this thesis; however, it was expected that focusing this research on two widespread diseases
would make large databases easier to find. The DementiaBank is one of the largest collections
used in the literature for assessing language impairments in AD. Nevertheless, its size still
represents a problem for modern computational approaches. For this reason, an ambitious
future line of research envisions the development of tools for the general analysis of speech
and language abilities, rather than for detecting specific diseases. The results provided by
these tools would contain general-purpose statistics and information about the analyzed
samples. These data could be interpreted by a clinician in much the same way as current blood
tests are. In this way, without the aim of identifying a specific disorder, the dependence on a
database containing data of subjects diagnosed with that disease is also removed. Without
this restriction, it becomes possible to share different types of data for model training.
AAppendix
This appendix contains additional information about the study on the monitoring of language
abilities presented in Chapter 7. Section A.1 provides some examples of input/output for the
intermediate steps of the topic coherence analysis. Section A.2 describes some technical details
related to the computation of semantic features.
A.1 Excerpts of input/output processing
A.1.1 Preprocessing
The examples below show an excerpt of an original transcription contained in the Dementia-
Bank database (MacWhinney et al. 2011, TalkBank 2017), before and after the preprocessing
phase. Its corresponding POS tag annotations are also shown.
- /there’s [//] the sink is overflowing while she’s wiping &uh &uh &k &uh a plate and not looking
&=laughs ./
- /the sink is overflowing while she is wiping a plate and not looking ./
- the/DT sink/NN is/VBZ overflowing/VBG while/IN she/PRP is/VBZ wiping/VBG a/DT
plate/NN and/CC not/RB looking/VBG ./
A.1.2 Clause segmentation
The example below shows the syntactic parse tree obtained with the Stanford University lex-
icalized probabilistic parser (Klein & Manning 2003b) for the sentence: /the sink is overflowing
while she is wiping a plate and not looking ./
(ROOT
(S
(NP (DT the) (NN sink))
(VP (VBZ is)
(VP (VBG overflowing)
(SBAR (IN while)
(S
(NP (PRP she))
(VP (VBZ is)
(VP
(VP (VBG wiping)
(NP (DT a) (NN plate)))
(CC and)
(RB not)
(VP (VBG looking))))))))
(. .)))
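Clause segmentation can be derived from such a tree by collecting the clause-level constituents (S, SBAR). The following stdlib-only sketch walks a shortened version of the parse above; in practice the parser's own tree API would be used instead of this minimal bracketed-expression reader:

```python
def parse_sexpr(text):
    """Parse a bracketed constituency tree into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    def read(pos):
        assert tokens[pos] == "("
        label = tokens[pos + 1]
        pos += 2
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                child, pos = read(pos)
                children.append(child)
            else:
                children.append(tokens[pos])  # leaf word
                pos += 1
        return (label, children), pos + 1
    tree, _ = read(0)
    return tree

def clause_labels(tree, wanted=("S", "SBAR")):
    """Collect clause-level node labels in depth-first order."""
    label, children = tree
    found = [label] if label in wanted else []
    for child in children:
        if isinstance(child, tuple):
            found.extend(clause_labels(child, wanted))
    return found

tree = parse_sexpr("(ROOT (S (NP (DT the) (NN sink)) (VP (VBZ is) (VP (VBG overflowing) "
                   "(SBAR (IN while) (S (NP (PRP she)) (VP (VBZ is) (VP (VBG wiping)))))))))")
print(clause_labels(tree))  # → ['S', 'SBAR', 'S']
```

The example yields one main clause plus an embedded subordinate clause with its own S node, which is the structure a clause segmenter would split on.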
A.1.3 Coreference analysis
The excerpts below provide some examples of the relations identified by the Stanford coref-
erence resolution system (Manning et al. 2014). Subscripts r and a denote a rejected and an
accepted relationship according to the methodology described in Section 7.1.2.3.
1. /a boya is standing up on a stoolr ./
/hea is falling of a stoolr ./
2. /and shea is getting her feet wet.. ./
/shea is also oblivious to the fact ./
/that herr kids are stealing cookies.. ./
A.2 Computation of semantic features
A.2.1 Specifications of an ICU list
The definition of the Information Content Units (ICUs) list required different specifications for
some of the categories indicated in the work of Croisile et al. (1996).
The sets of subjects and objects were simply defined by complementing each individual con-
cept with its synonyms and semantic variations. This was achieved by consulting the work
of Pakhomov et al. (2010) and online dictionaries (Merriam-Webster 2019, Thesaurus 2019).
The category places required the specification of several n-grams that may match a description
related to something seen in the kitchen, or in the garden, that is, seen through the kitchen
window. Finally, ICUs in the actions category were defined through the triple subject-verb-object.
The set of acceptable terms for the subject element corresponds to those specified in the ICU
subjects category. The definition of the verb element includes: synonyms, a complete specifica-
tion of the phrasal verbs that are accepted, and n-grams (e.g., is unaware). The object element of
the triple subject-verb-object was defined only in case of ambiguity (i.e., woman unconcerned
by the overflowing or woman indifferent to the children). The ICU ’action performed by the
girl’ was identified with the following activities: girl asking for cookies, girl with her finger to
her mouth, girl saying to be quiet, girl trying to help, girl reacting to the fall.
A.2.2 Computing ICUs
The ICU categories subjects, objects, and places were computed by simply verifying the mention
of the corresponding items in the text. This approach suffers from important limitations that
were already highlighted by Fraser et al. (2016). In fact, if a noun is inappropriately substituted,
or a concept is defined in an unpredictable way, it could either be attributed to the wrong ICU
or not be counted at all.
To compute mentions of the ICU category actions, similarly to the work of Fraser et al.
(2016), the dependency representations provided by the Stanford Parser (Klein & Manning
2003b) were examined. An example of this representation for the sentence /a boy is standing up
on a stool ./ is shown in the snippet below:
det(boy-2, a-1)
nsubj(standing-4, boy-2)
aux(standing-4, is-3)
root(ROOT-0, standing-4)
compound:prt(standing-4, up-5)
case(stool-8, on-6)
det(stool-8, a-7)
nmod(standing-4, stool-8)
In particular, the method verifies whether there is a typed dependency that identifies the
subject of the sentence (i.e., nsubj) and matches one of the elements specified in the subjects
category. If there is a correspondence, the same process is applied to the verb, and then to the
object of the sentence. Verbs are normalized to their root form, and compounds are accounted
for. N-grams are computed by searching for sequences of n words.
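The subject-then-verb-then-object check just described can be sketched as follows. Both the ICU specification and the dependency dictionary are hypothetical simplifications of the lists described in Section A.2.1 and of the parser output, not the actual implementation:

```python
# Hypothetical ICU "action" specification: acceptable subjects, verb lemmas, and
# (optionally) objects. The real lists are considerably richer.
ICU_ACTIONS = {
    "boy_standing_on_stool": {"subjects": {"boy", "brother", "kid"},
                              "verbs": {"stand"},
                              "objects": {"stool"}},
}

def match_action(deps, spec):
    """deps: {relation: (head_lemma, dependent_lemma)} for one sentence.
    Mirrors the described order: subject first, then verb, then object."""
    nsubj = deps.get("nsubj")
    if not nsubj or nsubj[1] not in spec["subjects"]:
        return False
    verb = nsubj[0]  # the head of nsubj is the governing verb
    if verb not in spec["verbs"]:
        return False
    obj = deps.get("nmod") or deps.get("obj")
    return obj is not None and obj[1] in spec["objects"]

# Lemmatized dependencies for "a boy is standing up on a stool"
deps = {"nsubj": ("stand", "boy"), "nmod": ("stand", "stool")}
print(match_action(deps, ICU_ACTIONS["boy_standing_on_stool"]))  # → True
```

As noted above, this strategy shares the limitation of the mention-based checks: an unexpected paraphrase or a substituted noun simply fails to match any specification.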
Bibliography
Abad, A., Pompili, A., Costa, A. & Trancoso, I. (2012), Automatic word naming recognition for
treatment and assessment of aphasia, in ‘Proc. Interspeech’.
Abad, A., Pompili, A., Costa, A., Trancoso, I., Fonseca, J., Leal, G., Farrajota, L. & Martins, I. P.
(2013), ‘Automatic word naming recognition for an on-line aphasia treatment system’, Com-
puter Speech & Language 27(6), 1235 – 1248. Special Issue on Speech and Language Processing
for Assistive Technology.
Abdel-Hamid, O., Mohamed, A.-r., Jiang, H. & Penn, G. (2012), Applying convolutional neural
networks concepts to hybrid NN-HMM model for speech recognition, in ‘2012 IEEE interna-
tional conference on Acoustics, speech and signal processing (ICASSP)’, IEEE, pp. 4277–4280.
Ahmed, S., de Jager, C. A., Haigh, A.-M. F. & Garrard, P. (2013), ‘Semantic processing in con-
nected speech at a uniformly early stage of autopsy-confirmed Alzheimer’s disease.’, Neu-
ropsychology 27(1), 79.
Ahmed, S., Haigh, A.-M. F., de Jager, C. A. & Garrard, P. (2013), ‘Connected speech as a marker
of disease progression in autopsy-proven Alzheimer’s disease’, Brain 136(12), 3727–3737.
Albert, M. S., DeKosky, S. T., Dickson, D., Dubois, B., Feldman, H. H., Fox, N. C., Gamst, A.,
Holtzman, D. M., Jagust, W. J., Petersen, R. C., Snyder, P. J., Carrillo, M. C., Thies, B. & Phelps,
C. H. (2011), ‘The diagnosis of mild cognitive impairment due to Alzheimer’s disease: Rec-
ommendations from the National Institute on Aging-Alzheimer’s Association workgroups
on diagnostic guidelines for Alzheimer’s disease’, Alzheimer’s & Dementia 7(3), 270 – 279.
Almor, A., Kempler, D., MacDonald, M. C., Andersen, E. S. & Tyler, L. K. (1999), ‘Why do
Alzheimer patients have difficulty with pronouns? Working memory, semantics, and refer-
ence in comprehension and production in Alzheimer’s disease’, Brain and language 67(3), 202–
227.
Aluísio, S., Cunha, A. & Scarton, C. (2016), Evaluating progression of Alzheimer's disease by regression and classification methods in a narrative language test in Portuguese, in 'International Conference on Computational Processing of the Portuguese Language', Springer, pp. 109–114.
Ansel, B. M. & Kent, R. D. (1992), ‘Acoustic-phonetic contrasts and intelligibility in the
dysarthria associated with mixed cerebral palsy’, Journal of Speech, Language, and Hearing
Research 35(2), 296–308.
Appell, J., Kertesz, A. & Fisman, M. (1982), ‘A study of language functioning in Alzheimer
patients’, Brain and language 17(1), 73–91.
Atal, B. S. & Schroeder, M. R. (1968), 'Predictive coding of speech signals', Report of the 6th Int. Congress on Acoustics. Tokyo, Japan.
Bäckman, L., Jones, S., Berger, A.-K., Laukka, E. J. & Small, B. J. (2004), 'Multiple cognitive deficits during the transition to Alzheimer's disease', Journal of internal medicine 256(3), 195–204.
Baker, J. M., Deng, L., Glass, J., Khudanpur, S., Lee, C., Morgan, N. & O’Shaughnessy, D.
(2009), ‘Developments and directions in speech recognition and understanding, Part 1 [DSP
Education]’, IEEE Signal Processing Magazine 26(3), 75–80.
Bao, H., Xu, M.-X. & Zheng, T. F. (2007), Emotion attribute projection for speaker recognition on
emotional speech, in ‘Eighth Annual Conference of the International Speech Communication
Association’.
Becker, J. T., Boller, F., Lopez, O. L., Saxton, J. & McGonigle, K. L. (1994), 'The natural history of Alzheimer's disease: description of study cohort and accuracy of diagnosis', Archives of Neurology 51(6), 585–594.
Bengio, Y. (2008), ‘Neural net language models’, Scholarpedia 3(1), 3881.
Benton, A., Hamsher, K. & Sivan, A. (1994), Multilingual Aphasia Examination: Manual of Instruc-
tions, AJA Assoc.
Berardelli, A., Noth, J., Thompson, P. D., Bollen, E. L., Currà, A., Deuschl, G., van Dijk, J. G., Töpper, R., Schwarz, M. & Roos, R. A. (1999), 'Pathophysiology of chorea and bradykinesia in Huntington's disease.', Mov Disord 14(3), 398–403.
Bertram, L. & Tanzi, R. E. (2005), ‘The genetic epidemiology of neurodegenerative disease’,
Journal of Clinical Investigation 115(6), 1449–1457.
Bishop, C. M. (2006), Pattern recognition and machine learning, Springer.
Bocklet, T., Nöth, E., Stemmer, G., Růžičková, H. & Rusz, J. (2011), Detection of persons with Parkinson's disease by acoustic, vocal, and prosodic analysis, in '2011 IEEE Workshop on Automatic Speech Recognition & Understanding', pp. 478–483.
Bocklet, T., Steidl, S., Nöth, E. & Skodda, S. (2013), Automatic evaluation of Parkinson's speech - acoustic, prosodic and voice related cues., in 'Interspeech', pp. 1149–1153.
Bonastre, J., Wils, F. & Meignier, S. (2005), ALIZE, a free toolkit for speaker recognition, in
‘Proceedings. (ICASSP ’05). IEEE International Conference on Acoustics, Speech, and Signal
Processing, 2005.’, Vol. 1, pp. I/737–I/740 Vol. 1.
Boser, B. E., Guyon, I. M. & Vapnik, V. N. (1992), A training algorithm for optimal margin
classifiers, in ‘Proceedings of the fifth annual workshop on Computational learning theory’,
ACM, pp. 144–152.
Boytcheva, S., Dobrev, P. & Angelova, G. (2001), CGExtract: Towards extraction of conceptual graphs from controlled English, in 'Contributions to ICCS-2001, 9th International Conference on Conceptual Structures'.
Brady, M., Mackenzie, C. & Armstrong, L. (2003), ‘Topic use following right hemisphere
brain damage during three semi-structured conversational discourse samples’, Aphasiology
17(9), 881–904.
Breiman, L. (1984), Classification and Regression Trees, New York: Routledge.
Breiman, L. (2001), ‘Random forests’, Machine learning 45(1), 5–32.
Brockmann, M., Drinnan, M. J., Storck, C. & Carding, P. N. (2011), ‘Reliable jitter and shimmer
measurements in voice clinics: the relevance of vowel, gender, vocal intensity, and funda-
mental frequency effects in a typical clinical task’, Journal of voice 25(1), 44–53.
Brooks, B. R. (1994), ‘El Escorial World Federation of Neurology criteria for the diagnosis of
amyotrophic lateral sclerosis’, Journal of the Neurological Sciences 124, 96–107.
Brunet, E. (1978), Le vocabulaire de Jean Giraudoux. Structure et évolution., number 1 in 'Collection "Travaux de linguistique quantitative"', Slatkine.
Brunnström, H., Gustafson, L., Passant, U. & Englund, E. (2009), 'Prevalence of dementia subtypes: a 30-year retrospective survey of neuropathological reports.', Arch Gerontol Geriatr 49(1), 146–149.
Bryson, A. E. & Ho, Y.-C. (1969), Applied Optimal Control: Optimization, Estimation, and Control,
Waltham, Mass: Blaisdell Pub. Co.
Bucks, R. S., Singh, S., Cuerden, J. M. & Wilcock, G. K. (2000a), ‘Analysis of spontaneous, con-
versational speech in dementia of Alzheimer type: Evaluation of an objective technique for
analysing lexical performance’, Aphasiology 14(1), 71–91.
Bucks, R. S., Singh, S., Cuerden, J. M. & Wilcock, G. K. (2000b), ‘Analysis of spontaneous, con-
versational speech in dementia of Alzheimer type: Evaluation of an objective technique for
analysing lexical performance’, Aphasiology 14(1), 71–91.
Bunton, K. & Weismer, G. (2001), ‘The relationship between perception and acoustics for a high-
low vowel contrast produced by speakers with dysarthria’, Journal of Speech, Language, and
Hearing Research .
Burg, J. P. (1967), 'Maximum Entropy Spectral Analysis', Proceedings of 37th Meeting, Society of Exploration Geophysicists. Oklahoma City.
Campbell, J. P., Shen, W., Campbell, W. M., Schwartz, R., Bonastre, J.-F. & Matrouf, D. (2009),
Forensic speaker recognition, in ‘IEEE Signal Processing Magazine’, Institute of Electrical and
Electronics Engineers, pp. 95–103.
Cano, S. J., Posner, H. B., Moline, M. L., Hurt, S. W., Swartz, J., Hsu, T. & Hobart, J. C. (2010),
‘The ADAS-cog in Alzheimer’s disease clinical trials: psychometric evaluation of the sum
and its parts’, J Neurol Neurosurg Psychiatry 81(12), 1363–1368.
Castiglioni, P. (2010), ‘Letter to the Editor: What is wrong in Katz’s method? Comments on:
A note on fractal dimensions of biomedical waveforms’, Computers in biology and medicine
40(11-12), 950–952.
Charniak, E. & Johnson, M. (2005), Coarse-to-fine n-best parsing and MaxEnt discriminative
reranking, in ‘Proceedings of the 43rd annual meeting on association for computational lin-
guistics’, Association for Computational Linguistics, pp. 173–180.
Clark, H. H. (1996), Using language, Cambridge University Press.
Cortes, C. & Vapnik, V. (1995), ‘Support-vector networks’, Machine learning 20(3), 273–297.
Coulston, R., Klabbers, E., Villiers, J. d. & Hosom, J.-P. (2007), Application of speech technol-
ogy in a home based assessment kiosk for early detection of Alzheimer’s disease, in ‘Eighth
Annual Conference of the International Speech Communication Association’.
Covington, M. A. & McFall, J. D. (2010), ‘Cutting the Gordian knot: The moving-average type–
token ratio (MATTR)’, Journal of quantitative linguistics 17(2), 94–100.
Croisile, B., Ska, B., Brabant, M. J., Duchêne, A., Lepage, Y., Aimard, G. & Trillet, M. (1996), 'Comparative study of oral and written picture description in patients with Alzheimer's disease.', Brain Lang 53(1), 1–19.
Davis, S. & Mermelstein, P. (1980), ‘Comparison of parametric representations for monosyllabic
word recognition in continuously spoken sentences’, IEEE transactions on acoustics, speech, and
signal processing 28(4), 357–366.
de Lau, L. M. & Breteler, M. M. (2006), ‘Epidemiology of Parkinson’s disease’, The Lancet Neu-
rology 5(6), 525–535.
Dehak, N., Kenny, P. J., Dehak, R., Dumouchel, P. & Ouellet, P. (2010), ‘Front-end factor anal-
ysis for speaker verification’, IEEE Transactions on Audio, Speech, and Language Processing
19(4), 788–798.
Dell, G. S., Chang, F. & Griffin, Z. M. (1999), ‘Connectionist models of language production:
Lexical access and grammatical encoding’, Cognitive Science 23(4), 517–542.
Dijkstra, K., Bourgeois, M., Petrie, G., Burgio, L. & Allen-Burge, R. (2002), ‘My Recaller is on Va-
cation: Discourse Analysis of Nursing-Home Residents With Dementia’, Discourse Processes
33(1), 53–76.
Dronkers, N. & Ogar, J. (2004), ‘Brain areas involved in speech production’.
Duan, K.-B. & Keerthi, S. S. (2005), Which is the best multiclass SVM method? An empirical
study, in ‘International workshop on multiple classifier systems’, Springer, pp. 278–285.
Dumais, S. T. (2004), ‘Latent semantic analysis’, Annual review of information science and technol-
ogy 38(1), 188–230.
Dunning, T. (1994), Statistical identification of language, Computing Research Laboratory, New
Mexico State University Las Cruces, NM, USA.
Eyben, F., Scherer, K. R., Schuller, B. W., Sundberg, J., André, E., Busso, C., Devillers, L. Y., Epps, J., Laukka, P., Narayanan, S. S. & Truong, K. P. (2016), 'The Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for Voice Research and Affective Computing', IEEE Transactions on Affective Computing 7(2), 190–202.
Eyben, F., Wöllmer, M. & Schuller, B. (2010), Opensmile: The Munich Versatile and Fast Open-source Audio Feature Extractor, in 'Proceedings of the 18th ACM International Conference on Multimedia', MM '10, ACM, New York, NY, USA, pp. 1459–1462.
Farrús, M., Hernando, J. & Ejarque, P. (2007), Jitter and shimmer measurements for speaker recognition, in 'Eighth annual conference of the international speech communication association'.
Fellbaum, C. (2010), WordNet, in ‘Theory and applications of ontology: computer applications’,
Springer, pp. 231–243.
Feng, S., Banerjee, R. & Choi, Y. (2012), Characterizing stylistic elements in syntactic structure,
in ‘Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language
Processing and Computational Natural Language Learning’, Association for Computational
Linguistics, pp. 1522–1533.
Fisher, R. A. (1919), ‘XV.—The correlation between relatives on the supposition of Mendelian
inheritance.’, Earth and Environmental Science Transactions of the Royal Society of Edinburgh
52(2), 399–433.
Folstein, M. F., Folstein, S. E. & McHugh, P. R. (1975), ‘”Mini-mental state”. A practical method
for grading the cognitive state of patients for the clinician’, J Psychiatr Res 12(3), 189–198.
Forbes, K. E., Venneri, A. & Shanks, M. F. (2002), ‘Distinct patterns of spontaneous speech
deterioration: an early predictor of Alzheimer’s disease.’, Brain Cogn 48(2-3), 356–361.
Fraser, K. C. & Hirst, G. (2016), Detecting semantic changes in Alzheimer's disease with vector space models, in 'Proceedings of LREC 2016 Workshop. Resources and processing of linguistic and extra-linguistic data from people with various forms of cognitive/psychiatric impairments (RaPID-2016), Monday 23rd of May 2016', number 128, Linköping University Electronic Press, Linköpings universitet, pp. 1–8.
Fraser, K. C., Meltzer, J. A. & Rudzicz, F. (2016), ‘Linguistic Features Identify Alzheimer’s Dis-
ease in Narrative Speech.’, J Alzheimers Dis 49(2), 407–422.
Garrett, M. F. (1975), ‘Syntactic process in sentence production’, Psychology of learning and moti-
vation: Advances in research and theory. (9), 133–177.
Gauthier, S., Reisberg, B., Zaudig, M., Petersen, R. C., Ritchie, K., Broich, K., Belleville, S.,
Brodaty, H., Bennett, D., Chertkow, H., Cummings, J. L., de Leon, M., Feldman, H., Ganguli,
M., Hampel, H., Scheltens, P., Tierney, M. C., Whitehouse, P. & Winblad, B. (2006), ‘Mild
cognitive impairment’, The Lancet 367(9518), 1262 – 1270.
Geisser, S. (1975), ‘The predictive sample reuse method with applications’, Journal of the Ameri-
can statistical Association 70(350), 320–328.
Glosser, G. & Deser, T. (1991), ‘Patterns of discourse production among neurological patients
with fluent language disorders’, Brain and language 40(1), 67–88.
Goberman, A. M. & Coelho, C. (2002), ‘Acoustic analysis of Parkinsonian speech I: speech
characteristics and L-Dopa therapy.’, NeuroRehabilitation 17(3), 237–246.
Goodglass, H., Kaplan, E. & Barresi, B. (2001), The Boston Diagnostic Aphasia Examination, Balti-
more: Lippincott, Williams & Wilkins.
Google (2019a), 'Google cloud platform', https://cloud.google.com/. [Accessed 15-January-2019].
Google (2019b), 'Google cloud speech-to-text api', https://cloud.google.com/speech-to-text/. [Accessed 15-January-2019].
Gorno-Tempini, M. L., Dronkers, N. F., Rankin, K. P., Ogar, J. M., Phengrasamy, L., Rosen, H. J.,
Johnson, J. K., Weiner, M. W. & Miller, B. L. (2004), ‘Cognition and anatomy in three variants
of primary progressive aphasia.’, Ann Neurol 55(3), 335–346.
Gorno-Tempini, M. L., Hillis, A. E., Weintraub, S., Kertesz, A., Mendez, M., Cappa, S. F., Ogar,
J. M., Rohrer, J. D., Black, S., Boeve, B. F., Manes, F., Dronkers, N. F., Vandenberghe, R., Ras-
covsky, K., Patterson, K., Miller, B. L., Knopman, D. S., Hodges, J. R., Mesulam, M. M. &
Grossman, M. (2011), ‘Classification of primary progressive aphasia and its variants.’, Neu-
rology 76(11), 1006–1014.
Graesser, A. C., McNamara, D. S., Louwerse, M. M. & Cai, Z. (2004), ‘Coh-Metrix: Analysis of
text on cohesion and language’, Behavior research methods, instruments, & computers 36(2), 193–
202.
Guyon, I. & Elisseeff, A. (2003), ‘An introduction to variable and feature selection’, Journal of
machine learning research 3(Mar), 1157–1182.
Hakkani-Tür, D., Vergyri, D. & Tür, G. (2010), Speech-based automated cognitive status assessment, in 'Eleventh Annual Conference of the International Speech Communication Association'.
Hall, M. A. (1999), ‘Correlation-based feature selection for machine learning’.
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P. & Witten, I. H. (2009), ‘The WEKA
Data Mining Software: An Update’, SIGKDD Explor. Newsl. 11(1), 10–18.
Hartelius, L., Carlstedt, A., Ytterberg, M., Lillvik, M. & Laakso, K. (2003), ‘Speech disorders in
mild and moderate Huntington’s disease: results of dysarthria assessment of 19 individuals’,
Journal of Medical Speech-Language Pathology 1, 1–14.
Henry, M. & Gorno-Tempini, M. (2010), ‘The logopenic variant of primary progressive aphasia’,
Current opinion in neurology 23(6), 633–637.
Hermansky, H. (1990), ‘Perceptual linear predictive (PLP) analysis of speech’, the Journal of the
Acoustical Society of America 87(4), 1738–1752.
Hernández-Domínguez, L., Ratté, S., Sierra-Martínez, G. & Roche-Bergua, A. (2018), 'Computer-based evaluation of Alzheimer's disease and mild cognitive impairment patients during a picture description task', Alzheimer's & Dementia: Diagnosis, Assessment & Disease Monitoring 10, 260–268.
Hier, D. B., Hagenlocker, K. & Shindler, A. G. (1985), ‘Language disintegration in dementia:
Effects of etiology and severity’, Brain and language 25(1), 117–133.
Hirsimäki, T., Pylkkönen, J. & Kurimo, M. (2009), 'Importance of high-order n-gram models in morph-based speech recognition', IEEE Transactions on Audio, Speech, and Language Processing 17(4), 724–732.
Ho, T. K. (1995), Random decision forests, in ‘Proceedings of 3rd international conference on
document analysis and recognition’, Vol. 1, IEEE, pp. 278–282.
Honoré, A. (1978), Some Simple Measures of Richness of Vocabulary, number 7, Association of Literary and Linguistic Computing Bulletin.
Huang, X., Acero, A., Hon, H.-W. & Reddy, R. (2001), Spoken language processing: A guide to
theory, algorithm, and system development, Vol. 1, Prentice hall PTR Upper Saddle River.
Hutchinson, J. M. & Jensen, M. (1980), A pragmatic evaluation of discourse communication
in normal and senile elderly in a nursing home., in ‘In L. K. Obler & M. L. Albert (Eds.)
Language and communication in the Elderly. Lexington, MA: Lexington Books.’, pp. 59–73.
i Cancho, R. F., Solé, R. V. & Köhler, R. (2004), 'Patterns in syntactic dependency networks', Physical Review E 69(5), 051915.
Itakura, F. & Saito, S. (1968), Analysis synthesis telephony based upon the maximum likelihood method, in 'Proc. 6th Int. Congress on Acoustics'. Tokyo, Japan.
Ivakhnenko, A. G. & Lapa, V. G. (1966), Cybernetic predicting devices, Technical report, Purdue
University School of Electrical Engineering.
Ivakhnenko, A. G. & Lapa, V. G. (1967), Cybernetics and Forecasting Techniques Modern Analytic
and Computational Method in Science and Mathematics, New York: American Elsevier Publish-
ing Company, Inc.
Jarrold, W., Peintner, B., Wilkins, D., Vergryi, D., Richey, C., Gorno-Tempini, M. L. & Ogar, J.
(2014a), Aided diagnosis of dementia type through computer-based analysis of spontaneous
speech, in ‘Proceedings of the ACL Workshop on Computational Linguistics and Clinical
Psychology’, pp. 27–36.
Jarrold, W., Peintner, B., Wilkins, D., Vergryi, D., Richey, C., Gorno-Tempini, M. L. & Ogar, J.
(2014b), Aided diagnosis of dementia type through computer-based analysis of spontaneous
speech, in ‘Proceedings of the ACL Workshop on Computational Linguistics and Clinical
Psychology’, pp. 27–36.
Jay, T. B. (2002), The Psychology of Language, New Jersey: Pearson Education.
Jelinek, F. & Mercer, R. L. (1980), Interpolated estimation of Markov source parameters from
sparse data, in E. S. Gelsema & L. N. Kanal, eds, ‘Proceedings, Workshop on Pattern Recog-
nition in Practice’, North Holland, Amsterdam, pp. 381–397.
Johansson, V. (2009), ‘Lexical diversity and lexical density in speech and writing: A develop-
mental perspective’, Working Papers in Linguistics 53, 61–79.
Johnson, J. K., Diehl, J., Mendez, M. F., Neuhaus, J., Shapira, J. S., Forman, M., Chute, D. J.,
Roberson, E. D., Pace-Savitsky, C., Neumann, M., Chow, T. W., Rosen, H. J., Förstl, H., Kurz,
A. & Miller, B. L. (2005), ‘Frontotemporal lobar degeneration: demographic characteristics of
353 patients.’, Arch Neurol 62(6), 925–930.
Jurafsky, D. & Martin, J. H. (2014), Speech and language processing, Vol. 3, Pearson London.
Katz, S. (1987), ‘Estimation of probabilities from sparse data for the language model component
of a speech recognizer’, IEEE transactions on acoustics, speech, and signal processing 35(3), 400–
401.
Kavé, G. & Goral, M. (2016), 'Word retrieval in picture descriptions produced by individuals with Alzheimer's disease', Journal of clinical and experimental neuropsychology 38(9), 958–966.
Kemper, S., LaBarge, E., Ferraro, F. R., Cheung, H., Cheung, H. & Storandt, M. (1993), ‘On the
preservation of syntax in Alzheimer’s disease: Evidence from written sentences’, Archives of
neurology 50(1), 81–86.
Kempler, D. (1984), Syntactic and symbolic abilities in Alzheimer’s disease, PhD thesis, UCLA.
Kempler, D. (1995), ‘Language changes in dementia of the Alzheimer type’, Dementia and com-
munication pp. 98–114.
Kempler, D., Curtiss, S. & Jackson, C. (1987), ‘Syntactic preservation in Alzheimer’s disease’,
Journal of Speech, Language, and Hearing Research 30(3), 343–350.
Kent, R. D. & Kim, Y.-J. (2003), ‘Toward an acoustic typology of motor speech disorders’, Clinical
linguistics & phonetics 17(6), 427–445.
Kent, R. D., Sufit, R. L., Rosenbek, J. C., Kent, J. F., Weismer, G., Martin, R. E. & Brooks, B. R.
(1991), ‘Speech deterioration in amyotrophic lateral sclerosis: a case study.’, J Speech Hear Res
34(6), 1269–1275.
Kertesz, A. (1982), The Western aphasia battery, Grune and Stratton, New York, NY.
Kettunen, K. (2014), ‘Can type-token ratio be used to show morphological complexity of lan-
guages?’, Journal of Quantitative Linguistics 21(3), 223–245.
Kiernan, M. C., Vucic, S., Cheah, B. C., Turner, M. R., Eisen, A., Hardiman, O., Burrell, J. R. &
Zoing, M. C. (2011), ‘Amyotrophic Lateral Sclerosis.’, Lancet 377(9769), 942–955.
Kim, M. & Thompson, C. K. (2004), ‘Verb deficits in Alzheimer’s disease and agrammatism:
Implications for lexical organization’, Brain and language 88(1), 1–20.
Kintsch, W. (1994), ‘Text comprehension, memory, and learning.’, American Psychologist
49(4), 294.
Kintsch, W. & Van Dijk, T. A. (1978), ‘Toward a model of text comprehension and production.’,
Psychological review 85(5), 363.
Klein, D. & Manning, C. D. (2003a), Accurate Unlexicalized Parsing, in ‘Proceedings of the
41st Annual Meeting on Association for Computational Linguistics - Volume 1’, ACL ’03,
Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 423–430.
Klein, D. & Manning, C. D. (2003b), Accurate unlexicalized parsing, in ‘Proceedings of the 41st
Annual Meeting on Association for Computational Linguistics-Volume 1’, Association for
Computational Linguistics, pp. 423–430.
Kockmann, M., Burget, L. et al. (2011), ‘Application of speaker-and language identification
state-of-the-art techniques for emotion recognition’, Speech Communication 53(9-10), 1172–
1185.
Kovacs, G. G. (2014), Neuropathology of Neurodegenerative Diseases: A Practical Guide, Cambridge
University Press.
Kreiman, J. & Gerratt, B. R. (2003), Jitter, shimmer, and noise in pathological voice quality
perception, in ‘ISCA Tutorial and Research Workshop on Voice Quality: Functions, Analysis
and Synthesis’.
Kruskal, W. H. & Wallis, W. A. (1952), ‘Use of Ranks in One-Criterion Variance Analysis’, Journal
of the American Statistical Association 47(260), 583–621.
Kuhl, P. K., Andruski, J. E., Chistovich, I. A., Chistovich, L. A., Kozhevnikova, E. V., Ryskina,
V. L., Stolyarova, E. I., Sundberg, U. & Lacerda, F. (1997), ‘Cross-language analysis of phonetic
units in language addressed to infants’, Science 277(5326), 684–686.
LDC (2019), ‘English Gigaword Fifth Edition’, https://catalog.ldc.upenn.edu/LDC2011T07.
[Accessed 22-February-2019].
Lehr, M., Shafran, I. & Roark, B. (2012), Fully automated neuropsychological assessment for
detecting mild cognitive impairment, in ‘In Interspeech’.
Levelt, W. (1989), Speaking, MIT Press, Cambridge, Massachusetts.
Liang, P., Taskar, B. & Klein, D. (2006), Alignment by Agreement, in ‘Proceedings of the Main
Conference on Human Language Technology Conference of the North American Chapter of
the Association of Computational Linguistics’, HLT-NAACL ’06, Association for Computa-
tional Linguistics, Stroudsburg, PA, USA, pp. 104–111.
Liu, M., Dai, B., Xie, Y. & Yao, Z. (2006), Improved GMM-UBM/SVM for speaker verification,
in ‘2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceed-
ings’, Vol. 1, IEEE, pp. I–I.
López-de-Ipiña, K., Martínez-de-Lizarduy, U., Barroso, N., Ecay-Torres, M., Martínez-Lage, P., Torres, F. & Faúndez-Zanuy, M. (2015), Automatic analysis of Categorical Verbal Fluency for Mild Cognitive impairment detection: A non-linear language independent approach, in 'Bioinspired Intelligence (IWOBI), 2015 4th International Work Conference on', IEEE, pp. 101–104.
Mackenzie, C., Brady, M., Norrie, J. & Poedjianto, N. (2007), ‘Picture description in neurologi-
cally normal adults: Concepts and topic coherence’, Aphasiology 21(3-4), 340–354.
MacWhinney, B. (2000), ‘The CHILDES Project: Tools for analyzing talk, 3rd edition.’, Lawrence
Erlbaum Associates, Mahwah, New Jersey.
MacWhinney, B., Bird, S., Cieri, C. & Martell, C. (2004), ‘TalkBank: Building an Open Unified
Multimodal Database of Communicative Interaction’, 4th International Conference on Language
Resources and Evaluation pp. 525–528.
MacWhinney, B., Fromm, D., Forbes, M. & Holland, A. (2011), ‘AphasiaBank: Methods for
Studying Discourse’, Aphasiology 25(11), 1286–1307.
Makhoul, J. (1973), ‘Spectral analysis of speech by linear prediction’, IEEE Transactions on Audio
and Electroacoustics 21(3), 140–148.
Mamede, N. J., Baptista, J., Diniz, C. & Cabarrão, V. (2012), STRING: An Hybrid Statistical and Rule-Based Natural Language Processing Chain for Portuguese., in 'International Conference on Computational Processing of Portuguese, PROPOR'.
Manning, C. D., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S. J. & McClosky, D. (2014), The
Stanford CoreNLP Natural Language Processing Toolkit, in ‘Association for Computational
Linguistics (ACL) System Demonstrations’, pp. 55–60.
Manning, C., Raghavan, P. & Schütze, H. (2010), 'Introduction to information retrieval', Natural Language Engineering 16(1), 100–103.
Marini, A., Andreetta, S., Del Tin, S. & Carlomagno, S. (2011), ‘A multi-level approach to the
analysis of narrative language in aphasia’, Aphasiology 25(11), 1372–1392.
Markel, J. & Gray, A. (1973), ‘On autocorrelation equations as applied to speech analysis’, IEEE
Transactions on Audio and Electroacoustics 21(2), 69–79.
Marrafa, P., Amaro, R., Mendes, S., Lourosa, S. & Chaves, R. P. (2006), 'TemaNet - wordnets temáticas do português. http://www.instituto-camoes.pt/temanet.'.
McKeith, I. G., Boeve, B. F., Dickson, D. W., Halliday, G., Taylor, J.-P., Weintraub, D., Aarsland,
D., Galvin, J., Attems, J., Ballard, C. G., Bayston, A., Beach, T. G., Blanc, F., Bohnen, N., Bo-
nanni, L., Bras, J., Brundin, P., Burn, D., Chen-Plotkin, A., Duda, J. E., El-Agnaf, O., Feldman,
H., Ferman, T. J., ffytche, D., Fujishiro, H., Galasko, D., Goldman, J. G., Gomperts, S. N.,
Graff-Radford, N. R., Honig, L. S., Iranzo, A., Kantarci, K., Kaufer, D., Kukull, W., Lee, V. M.,
Leverenz, J. B., Lewis, S., Lippa, C., Lunde, A., Masellis, M., Masliah, E., McLean, P., Mollen-
hauer, B., Montine, T. J., Moreno, E., Mori, E., Murray, M., O’Brien, J. T., Orimo, S., Postuma,
R. B., Ramaswamy, S., Ross, O. A., Salmon, D. P., Singleton, A., Taylor, A., Thomas, A., Tira-
boschi, P., Toledo, J. B., Trojanowski, J. Q., Tsuang, D., Walker, Z., Yamada, M. & Kosaka, K.
(2017), ‘Diagnosis and management of dementia with Lewy bodies: Fourth consensus report
of the DLB Consortium’, Neurology .
McKeith, I. G., Galasko, D., Kosaka, K., Perry, E. K., Dickson, D. W., Hansen, L. A., Salmon,
D. P., Lowe, J., Mirra, S. S., Byrne, E. J., Lennox, G., Quinn, N. P., Edwardson, J. A., Ince, P. G.,
Bergeron, C., Burns, A., Miller, B. L., Lovestone, S., Collerton, D., Jansen, E. N., Ballard, C.,
de Vos, R. A., Wilcock, G. K., Jellinger, K. A. & Perry, R. H. (1996), ‘Consensus guidelines
for the clinical and pathologic diagnosis of dementia with Lewy bodies (DLB): report of the
consortium on DLB international workshop.’, Neurology 47(5), 1113–1124.
McKhann, G. M., Knopman, D. S., Chertkow, H., Hyman, B. T., Jack, C. R., Jr., Kawas, C. H., Klunk, W. E., Koroshetz, W. J., Manly, J. J., Mayeux, R., Mohs, R. C., Morris, J. C., Rossor, M. N., Scheltens, P., Carrillo, M. C., Thies, B., Weintraub, S. & Phelps, C. H. (2011), 'The diagnosis of dementia due to Alzheimer's disease: Recommendations from the National Institute on Aging-Alzheimer's Association workgroups on diagnostic guidelines for Alzheimer's disease', Alzheimer's & Dementia: The Journal of the Alzheimer's Association 7(3), 263–269.
MDS (2003), ‘Movement Disorder Society Task Force on Rating Scales for Parkinson’s Disease.
The Unified Parkinson’s Disease Rating Scale (UPDRS): Status and recommendations’.
Meignier, S. & Merlin, T. (2010), LIUM SpkDiarization: an open source toolkit for diarization,
in ‘CMU SPUD Workshop’.
Meinedo, H., Abad, A., Pellegrini, T., Trancoso, I. & Neto, J. (2010), The L2F Broadcast News
Speech Recognition System, in ‘Proc. Fala2010’.
Meinedo, H., Caseiro, D., Neto, J. & Trancoso, I. (2003), AUDIMUS.Media: a Broadcast News
speech recognition system for the European Portuguese language, in ‘Proc. International
Conference on Computational Processing of Portuguese Language (PROPOR)’.
Mentis, M. & Prutting, C. A. (1991), ‘Analysis of Topic as Illustrated in a Head-Injured and a
Normal Adult’, Journal of Speech, Language, and Hearing Research 34(3), 583–595.
Mermelstein, P. (1976), ‘Distance measures for speech recognition, psychological and instru-
mental’, Pattern recognition and artificial intelligence 116, 374–388.
Merriam-Webster (2019), 'Merriam-Webster Online dictionary and thesaurus', https://www.merriam-webster.com. [Accessed 15-January-2019].
Mikolov, T., Chen, K., Corrado, G. & Dean, J. (2013), ‘Efficient estimation of word representa-
tions in vector space’, arXiv preprint arXiv:1301.3781 .
Mikolov, T., Grave, E., Bojanowski, P., Puhrsch, C. & Joulin, A. (2018), Advances in Pre-Training
Distributed Word Representations, in ‘Proceedings of the International Conference on Lan-
guage Resources and Evaluation (LREC 2018)’.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S. & Dean, J. (2013), Distributed representa-
tions of words and phrases and their compositionality, in ‘Advances in neural information
processing systems’, pp. 3111–3119.
Miller, D. I., Talbot, V., Gagnon, M. & Messier, C. (2013), ‘Administration of neuropsychological
tests using interactive voice response technology in the elderly: validation and limitations’,
Frontiers in neurology 4.
Miller, G. A. (1995), ‘WordNet: a lexical database for English’, Communications of the ACM
38(11), 39–41.
Miranda, A. F. H. (2015), Influência da Escolaridade na Dimensão Macrolinguística do Discurso, Master's thesis, Universidade Católica Portuguesa.
Mitchell, T. M. (1997), Machine Learning, McGraw-Hill Education.
Moniz, H., Batista, F., Mata, A. I. & Trancoso, I. (accepted), Towards automatic language processing and intonational labeling in European Portuguese, in N. Henriksen, M. Armstrong & M. Vanrell, eds, 'Interdisciplinary approaches to intonational grammar in Ibero-Romance', John Benjamins.
Moniz, H., Batista, F., Mata, A. I. & Trancoso, I. (2014), ‘Speaking style effects in the production
of disfluencies’, Speech Communication 65, 20–35.
Moniz, H., Mata, A. I., Hirschberg, J., Batista, F., Rosenberg, A. & Trancoso, I. (2014), ‘Extending
AuToBI to prominence detection in European Portuguese’, Speech Prosody 2014 .
Moniz, H., Mata, A. I. & Viana, M. C. (2007), On filled pauses and prolongations in European
Portuguese, in ‘Interspeech 2007’, Belgium.
Moniz, H., Pompili, A., Batista, F., Trancoso, I., Abad, A. & Amorim, C. (2015), Automatic Recognition of Prosodic Patterns in Semantic Verbal Fluency Tests - An Animal Naming Task for Edutainment Applications, in '18th International Congress of Phonetic Sciences', International Phonetic Association.
Morgan, N. & Bourlard, H. (1995), ‘An introduction to hybrid HMM/connectionist continuous
speech recognition’, IEEE Signal Processing Magazine 12(3), 25–42.
Nasreddine, Z. S., Phillips, N. A., Bédirian, V., Charbonneau, S., Whitehead, V., Collin, I., Cummings, J. L. & Chertkow, H. (2005), 'The Montreal Cognitive Assessment, MoCA: A Brief Screening Tool For Mild Cognitive Impairment', Journal of the American Geriatrics Society 53(4), 695–699.
Negnevitsky, M. (2005), Artificial intelligence: a guide to intelligent systems, Pearson education.
Nicholas, M., Obler, L. K., Albert, M. L. & Helm-Estabrooks, N. (1985), ‘Empty speech in
Alzheimer’s disease and fluent aphasia’, Journal of Speech, Language, and Hearing Research
28(3), 405–410.
Nuance (2019), 'Nuance Open Speech Recognizer', https://www.nuance.com/omni-channel-customer-engagement/voice-and-ivr/automatic-speech-recognition/nuance-recognizer.html. Burlington, Massachusetts, USA, [Accessed 12-March-2019].
Nunes, B. (2005), A Demência em Números, In A. Castro-Caldas & A. Mendonça. A Doença de Alzheimer e Outras Demências em Portugal. LIDEL.
Obler, L. K. & Albert, M. L. (1984), Language in the elderly aphasic and dementing patient., in
‘In M. T. Sarno (Ed.), Acquired aphasia. New York: Academic Press’, pp. 385–398.
Olesen, J., Gustavsson, A., Svensson, M., Wittchen, H.-U., Jönsson, B., CDBE2010 Study Group & European Brain Council (2012), 'The economic cost of brain disorders in Europe', European journal of neurology 19(1), 155–162.
Oppenheim, G. (1994), ‘The earliest signs of Alzheimer’s disease.’, J Geriatr Psychiatry Neurol
7(2), 116–120.
Oracle (2019), 'Java Speech API Specifications', https://www.oracle.com/technetwork/java/speech-138007.html. [Accessed 12-March-2019].
Orimaye, S. O., Wong, J. S.-M. & Golden, K. J. (2014), Learning predictive linguistic features for
Alzheimer’s disease and related dementias using verbal utterances, in ‘Proceedings of the 1st
Workshop on Computational Linguistics and Clinical Psychology (CLPsych)’, sn, pp. 78–87.
Orozco-Arroyave, J. R., Arias-Londoño, J. D., Vargas-Bonilla, J. & Nöth, E. (2013), Perceptual analysis of speech signals from people with Parkinson's disease, in 'International Work-Conference on the Interplay Between Natural and Artificial Computation', Springer, pp. 201–211.
Orozco-Arroyave, J. R., Belalcázar-Bolaños, E. A., Arias-Londoño, J. D., Vargas-Bonilla, J. F.,
Haderlein, T. & Nöth, E. (2014), Phonation and articulation analysis of Spanish vowels for
automatic detection of Parkinson’s disease, in ‘Text, Speech and Dialogue: 17th International
Conference. Proceedings’, Springer International Publishing, Cham, pp. 374–381.
Orozco-Arroyave, J. R., Hönig, F., Arias-Londoño, J. D., Vargas-Bonilla, J. F., Daqrouq, K.,
Skodda, S., Rusz, J. & Nöth, E. (2016), ‘Automatic detection of Parkinson’s disease in
running speech spoken in three different languages.’, J Acoust Soc Am 139(1), 481–500.
Ortmanns, S. & Ney, H. (2000), ‘The time-conditioned approach in dynamic programming
search for LVCSR’, IEEE Transactions on Speech and Audio Processing 8(6), 676–687.
Pakhomov, S. V. S., Hemmy, L. S. & Lim, K. O. (2012), ‘Automated semantic indices related to
cognitive function and rate of cognitive decline.’, Neuropsychologia 50(9), 2165–2175.
Pakhomov, S. V. S., Marino, S. E., Banks, S. & Bernick, C. (2015), ‘Using Automatic Speech
Recognition to Assess Spoken Responses to Cognitive Tests of Semantic Verbal Fluency.’,
Speech Commun 75, 14–26.
Pakhomov, S. V., Smith, G. E., Chacon, D., Feliciano, Y., Graff-Radford, N., Caselli, R. & Knop-
man, D. S. (2010), ‘Computerized analysis of speech and language to identify psycholin-
guistic correlates of frontotemporal lobar degeneration’, Cognitive and Behavioral Neurology
23(3), 165.
Pasquier, F. & Petit, H. (1997), ‘Frontotemporal dementia: its rediscovery.’, Eur Neurol 38(1), 1–6.
Patwardhan, S. & Pedersen, T. (2006), Using WordNet-based context vectors to estimate the
semantic relatedness of concepts, in ‘Proceedings of the EACL 2006 Workshop on Making
Sense of Sense: Bringing Computational Linguistics and Psycholinguistics Together’, Vol. 1501,
Trento, pp. 1–8.
Paulsen, J. S. (2011), ‘Cognitive impairment in Huntington disease: diagnosis and treatment.’,
Curr Neurol Neurosci Rep 11(5), 474–483.
Pearson, K. (1895), ‘Notes on regression and inheritance in the case of two parents’, Proceedings
of the Royal Society of London 58, 240–242.
Pearson, K. (1992), On the criterion that a given system of deviations from the probable in the case of a
correlated system of variables is such that it can be reasonably supposed to have arisen from random
sampling, Springer New York, New York, NY, pp. 11–28.
Pinto, S., Cardoso, R., Sadat, J., Guimarães, I., Mercier, C., Santos, H., Atkinson-Clement, C.,
Carvalho, J., Welby, P., Oliveira, P., D’Imperio, M., Frota, S., Letanneux, A., Vigário, M., Cruz,
M., Martins, I. P., Viallet, F. & Ferreira, J. J. (2016), ‘Dysarthria in individuals with Parkin-
son’s disease: a protocol for a binational, cross-sectional, case-controlled study in French and
European Portuguese (FraLusoPark)’, BMJ Open 6(11).
Pompili, A., Abad, A., Martins de Matos, D. & Pavão Martins, I. (2018), Topic coherence
analysis for the classification of Alzheimer’s disease, in ‘Proc. IberSPEECH 2018’, pp. 281–285.
Pompili, A., Abad, A., Martins de Matos, D. & Pavão Martins, I. (2019), Evaluating pragmatic
aspects of discourse production for the automatic identification of Alzheimer’s disease. Sub-
mitted to a special issue of the IEEE Journal of Selected Topics in Signal Processing (JSTSP) on
Automatic assessment of health disorders based on voice, speech and language processing.
Pompili, A., Abad, A., Romano, P., Martins, I. P., Cardoso, R., Santos, H., Carvalho, J.,
Guimarães, I. & Ferreira, J. J. (2017), Automatic Detection of Parkinson’s Disease: An Exper-
imental Analysis of Common Speech Production Tasks Used for Diagnosis, in ‘International
Conference on Text, Speech, and Dialogue’, Springer, pp. 411–419.
Pompili, A., Abad, A., Trancoso, I., Fonseca, J., Martins, I. P., Leal, G. & Farrajota, L. (2011),
An on-line system for remote treatment of aphasia, in ‘Proceedings of the Second Workshop
on Speech and Language Processing for Assistive Technologies’, SLPAT ’11, Association for
Computational Linguistics, pp. 1–10.
Pompili, A., Amorim, C., Abad, A. & Trancoso, I. (2015), Speech and language technologies for
the automatic monitoring and training of cognitive functions, in ‘Proceedings of SLPAT 2015:
6th Workshop on Speech and Language Processing for Assistive Technologies’, Association
for Computational Linguistics, Dresden, Germany, pp. 103–109.
Post, M. & Bergsma, S. (2013), Explicit and implicit syntactic features for text classification, in
‘Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics
(Volume 2: Short Papers)’, Vol. 2, pp. 866–872.
Pringsheim, T., Wiltshire, K., Day, L., Dykeman, J., Steeves, T. & Jette, N. (2012), ‘The incidence
and prevalence of Huntington’s disease: a systematic review and meta-analysis.’, Mov Disord
27(9), 1083–1091.
Proença, J., Veiga, A., Candeias, S. & Perdigão, F. (2013), Acoustic, Phonetic and Prosodic
Features of Parkinson’s disease Speech, in ‘STIL-IX Brazilian Symposium in Information and
Human Language Technology, 2nd Brazilian Conference on Intelligent Systems, Brazil’.
Pudil, P., Novovičová, J. & Kittler, J. (1994), ‘Floating search methods in feature selection’,
Pattern Recognition Letters 15(11), 1119–1125.
Quinlan, J. R. (1986), ‘Induction of decision trees’, Machine Learning 1(1), 81–106.
Quinlan, J. R. (2014), C4.5: Programs for Machine Learning, Elsevier.
Rabiner, L. R. (1989), ‘A tutorial on hidden Markov models and selected applications in speech
recognition’, Proceedings of the IEEE 77(2), 257–286.
Rabiner, L. R., Juang, B.-H. & Rutledge, J. C. (1993), Fundamentals of speech recognition, Vol. 14,
PTR Prentice Hall, Englewood Cliffs, NJ.
Ramig, L. O., Fox, C. & Sapir, S. (2008), ‘Speech treatment for Parkinson’s disease’, Expert Review
of Neurotherapeutics 8(2), 297–309.
Ratnavalli, E., Brayne, C., Dawson, K. & Hodges, J. R. (2002), ‘The prevalence of frontotemporal
dementia.’, Neurology 58(11), 1615–1621.
Reilly, J., Troche, J. & Grossman, M. (2011), ‘Language processing in dementia’, The handbook of
Alzheimer’s disease and other dementias pp. 336–368.
Reilmann, R., Leavitt, B. R. & Ross, C. A. (2014), ‘Diagnostic criteria for Huntington’s disease
based on natural history.’, Mov Disord 29(11), 1335–1341.
Reynolds, D. (2009a), ‘Universal background models’, Encyclopedia of Biometrics pp. 1349–1352.
Reynolds, D. A. (2009b), Gaussian Mixture Models, in ‘Encyclopedia of Biometrics’.
Reynolds, D. A., Quatieri, T. F. & Dunn, R. B. (2000), ‘Speaker verification using adapted
Gaussian mixture models’, Digital Signal Processing 10(1-3), 19–41.
Ripich, D. N. & Terrell, B. Y. (1988), ‘Patterns of discourse cohesion and coherence in
Alzheimer’s disease’, Journal of Speech and Hearing Disorders 53(1), 8–15.
Robert, P., Ferris, S., Gauthier, S., Ihl, R., Winblad, B. & Tennigkeit, F. (2010), ‘Review of
Alzheimer’s disease scales: is there a need for a new multi-domain scale for therapy evalua-
tion in medical practice?’, Alzheimer’s Res Ther 2(4), 24.
Roberts, R. & Knopman, D. S. (2013), ‘Classification and Epidemiology of MCI’, Clinics in Geri-
atric Medicine 29(4).
Rosen, W. G., Mohs, R. C. & Davis, K. L. (1984), ‘A new rating scale for Alzheimer’s disease.’,
Am J Psychiatry 141(11), 1356–1364.
Rosenberg, A. (2009), Automatic Detection and Classification of Prosodic Events, PhD thesis,
Columbia University.
Rosenberg, A. (2010), AuToBI – A Tool for Automatic ToBI annotation, in ‘Interspeech 2010’.
Rosenblatt, F. (1958), ‘The perceptron: a probabilistic model for information storage and orga-
nization in the brain.’, Psychological review 65(6), 386.
Rusz, J., Čmejla, R., Růžičková, H. & Růžička, E. (2011), ‘Quantitative acoustic measurements
for characterization of speech and voice disorders in early untreated Parkinson’s disease’,
The Journal of the Acoustical Society of America 129(1), 350–367.
Rusz, J., Klempíř, J., Tykalová, T., Baborová, E., Čmejla, R., Růžička, E. & Roth, J. (2014),
‘Characteristics and occurrence of speech impairment in Huntington’s disease: possible influence
of antipsychotic medication.’, J Neural Transm (Vienna) 121(12), 1529–1539.
Ryan, J. & Lopez, S. (2001), Wechsler Adult Intelligence Scale-III, in W. I. Dorfman &
M. Hersen (eds), ‘Understanding Psychological Assessment. Perspectives on Individual
Differences’, Springer, Boston, MA.
Saldert, C., Fors, A., Ströberg, S. & Hartelius, L. (2010), ‘Comprehension of complex discourse
in different stages of Huntington’s disease.’, Int J Lang Commun Disord 45(6), 656–669.
Salmon, D. P., Butters, N. & Chan, A. S. (1999), ‘The deterioration of semantic memory in
Alzheimer’s disease.’, Canadian Journal of Experimental Psychology/Revue canadienne de
psychologie expérimentale 53(1), 108.
Santos, L., Corrêa Junior, E. A., Oliveira Jr, O., Amancio, D., Mansur, L. & Aluísio, S. (2017),
Enriching Complex Networks with Word Embeddings for Detecting Mild Cognitive Impair-
ment from Speech Transcripts, in ‘Proceedings of the 55th Annual Meeting of the Association
for Computational Linguistics (Volume 1: Long Papers)’, Association for Computational Lin-
guistics, pp. 1284–1296.
Santosa, F. & Symes, W. W. (1986), ‘Linear inversion of band-limited reflection seismograms’,
SIAM Journal on Scientific and Statistical Computing 7(4), 1307–1330.
Saon, G., Kurata, G., Sercu, T., Audhkhasi, K., Thomas, S., Dimitriadis, D., Cui, X., Ramabhad-
ran, B., Picheny, M., Lim, L.-L., Roomi, B. & Hall, P. (2017), English Conversational Telephone
Speech Recognition by Humans and Machines, in ‘Proc. Interspeech’, pp. 132–136.
Sapir, S. (2006), ‘Effects of LSVT on speech articulation in dysarthric individuals with Parkin-
son’s disease: Acoustic and perceptual correlates.’, A paper presented at the Congress of the
European Federation of Neurological Societies, Istanbul, Turkey .
Sapir, S., Spielman, J. L., Ramig, L. O., Story, B. H. & Fox, C. (2007), ‘Effects of intensive voice
treatment (the Lee Silverman Voice Treatment [LSVT]) on vowel articulation in dysarthric
individuals with idiopathic Parkinson disease: Acoustic and perceptual findings’, Journal of
Speech, Language, and Hearing Research 50, 899–912.
Savino, M. (2004), Intonational cues to discourse structure in a variety of Italian, in P. Gilles &
J. Peters (eds), ‘Regional Variation in Intonation’, Niemeyer, Tübingen, pp. 145–159.
Savino, M., Bosco, A. & Grice, M. (2014), Intonational cues to item position in lists: Evidence
from a serial recall task, in ‘Proceedings of the International Conference on Speech Prosody’,
pp. 708–712.
Schlesinger, M. I. & Hlaváč, V. (2002), Ten Lectures on Statistical and Structural Pattern
Recognition, Vol. 24 of Computational Imaging and Vision, Kluwer Academic Publishers,
Dordrecht.
Schölkopf, B. & Smola, A. J. (2001), Learning with kernels: support vector machines, regularization,
optimization, and beyond, MIT press.
Shao, J. (1993), ‘Linear model selection by cross-validation’, Journal of the American Statistical
Association 88(422), 486–494.
Sheinerman, K. S. & Umansky, S. R. (2013), ‘Early detection of neurodegenerative diseases:
circulating brain-enriched microRNA’, Cell cycle (Georgetown, Tex.) 12(1), 1–2.
Shekim, L. O. & LaPointe, L. L. (1984), ‘Production of discourse in individuals with Alzheimer’s
Disease.’, Paper presented at International Neuropsychological Society Meetings, Houston,
TX.
Shriberg, E. (1994), Preliminaries to a Theory of Speech Disfluencies, PhD thesis, University of
California, Berkeley.
Shriberg, E. (2001), ‘To ”Errrr” is Human: Ecology and Acoustics of Speech Disfluencies’, Jour-
nal of the International Phonetic Association 31, 153–169.
Silbergleit, A. K., Johnson, A. F. & Jacobson, B. H. (1997), ‘Acoustic analysis of voice in indi-
viduals with amyotrophic lateral sclerosis and perceptually normal vocal quality’, Journal of
Voice 11(2), 222–231.
Silva, D. G., Oliveira, L. C. & Andrea, M. (2009), ‘Jitter estimation algorithms for detection of
pathological voices’, EURASIP Journal on Advances in Signal Processing 2009, 9.
Sirts, K., Piguet, O. & Johnson, M. (2017), Idea density for predicting Alzheimer’s disease from
transcribed speech, in ‘CoNLL’.
Skodda, S. & Schlegel, U. (2008), ‘Speech rate and rhythm in Parkinson’s disease’, Movement
Disorders 23(7), 985–992.
Skodda, S., Schlegel, U., Hoffmann, R. & Saft, C. (2014), ‘Impaired motor speech performance
in Huntington’s disease.’, J Neural Transm (Vienna) 121(4), 399–407.
Skodda, S., Visser, W. & Schlegel, U. (2011), ‘Vowel articulation in Parkinson’s disease.’, J Voice
25(4), 467–472.
Soffer, A. (1997), Image categorization using texture features, in ‘Proceedings of the Fourth
International Conference on Document Analysis and Recognition’, Vol. 1, IEEE, pp. 233–237.
Steck, A., Struhal, W., Sergay, S. M., Grisold, W. & the Education Committee of the World
Federation of Neurology (2013), ‘The global perspective on neurology training: the World
Federation of Neurology survey’, Journal of the Neurological Sciences 334(1-2), 30–47.
Stevens, S. S. & Volkmann, J. (1940), ‘The relation of pitch to frequency: A revised scale’, The
American Journal of Psychology 53(3), 329–353.
Stevens, S. S., Volkmann, J. & Newman, E. B. (1937), ‘A scale for the measurement of the psy-
chological magnitude pitch’, The Journal of the Acoustical Society of America 8(3), 185–190.
Stolcke, A., Anguera, X., Boakye, K., Çetin, Ö., Janin, A., Magimai-Doss, M., Wooters, C. &
Zheng, J. (2008), The SRI-ICSI Spring 2007 meeting and lecture recognition system, Springer,
pp. 450–463.
Stolcke, A. & Droppo, J. (2017), Comparing Human and Machine Errors in Conversational
Speech Transcription, in ‘Proc. Interspeech’, pp. 137–141.
Stone, M. (1974), ‘Cross-validatory choice and assessment of statistical predictions’, Journal of
the Royal Statistical Society: Series B (Methodological) 36(2), 111–133.
Stone, M. (1977), ‘Asymptotics For and Against Cross-Validation’, Biometrika 64(1), 29–35.
Strauss, E., Sherman, E. & Spreen, O. (2006), A Compendium of Neuropsychological Tests:
Administration, Norms, and Commentary, 3rd edn, Oxford University Press.
Stroop, J. R. (1935), ‘Studies of interference in serial verbal reactions’, Journal of Experimental
Psychology 18(6), 643–662.
Sveinbjörnsdóttir, S. (2016), ‘The clinical symptoms of Parkinson’s disease’, Journal of
Neurochemistry 139, 318–324.
Szőke, I., Schwarz, P., Matějka, P., Burget, L., Karafiát, M., Fapšo, M. & Černocký, J. (2005),
Comparison of keyword spotting approaches for informal continuous speech, in ‘Ninth
European Conference on Speech Communication and Technology’.
Taler, V. & Phillips, N. A. (2008), ‘Language performance in Alzheimer’s disease and mild cog-
nitive impairment: a comparative review.’, J Clin Exp Neuropsychol 30(5), 501–556.
TalkBank (2017), ‘DementiaBank database’, https://dementia.talkbank.org. [Accessed 15-
January-2019].
Teixeira, J. P. & Fernandes, P. O. (2014), ‘Jitter, Shimmer and HNR classification within gender,
tones and vowels in healthy voices’, Procedia Technology 16, 1228–1237.
Teixeira, J. P., Oliveira, C. & Lopes, C. (2013), ‘Vocal acoustic analysis – jitter, shimmer and HNR
parameters’, Procedia Technology 9, 1112–1122.
TelAsk (2019), ‘TelAsk Technologies’, https://telask.com. Ottawa, ON, Canada, [Accessed
12-March-2019].
Thesaurus (2019), ‘Online thesaurus’, https://www.thesaurus.com. [Accessed 15-January-
2019].
Tibshirani, R. (1996), ‘Regression shrinkage and selection via the lasso’, Journal of the Royal
Statistical Society: Series B (Methodological) 58(1), 267–288.
Titze, I. R. & Martin, D. W. (1998), ‘Principles of voice production’.
Toledo, C. M., Aluísio, S. M., dos Santos, L. B., Brucki, S. M. D., Tres, E. S., de Oliveira, M. O. &
Mansur, L. L. (2018), ‘Analysis of macrolinguistic aspects of narratives from individuals with
Alzheimer’s disease, mild cognitive impairment, and no cognitive impairment’, Alzheimer’s
& Dementia: Diagnosis, Assessment & Disease Monitoring 10, 31–40.
Tomović, A., Janičić, P. & Kešelj, V. (2006), ‘n-Gram-based classification and unsupervised
hierarchical clustering of genome sequences’, Computer Methods and Programs in Biomedicine
81(2), 137–153.
Torres-Carrasquillo, P. A., Gleason, T. P. & Reynolds, D. A. (2004), Dialect identification using
Gaussian mixture models, in ‘ODYSSEY04-The Speaker and Language Recognition Work-
shop’.
Tree, J. E. F. (1995), ‘The effects of false starts and repetitions on the processing of subsequent
words in spontaneous speech’, Journal of Memory and Language 34(6), 709–738.
Treebank, P. (2019), ‘Penn Treebank II Constituent Tags’, http://www.surdeanu.info/mihai/
teaching/ista555-fall13/readings/PennTreebankConstituents.html. [Accessed 15-
January-2019].
Ulatowska, H. & Chapman, S. (1994a), Discourse macrostructure in aphasia, in ‘Discourse
Analysis and Applications. Studies In Adult Clinical Populations’, Lawrence Erlbaum
Associates, Hillsdale, pp. 29–46.
Ulatowska, H. K., Allard, L., Donnell, A., Bristow, J., Haynes, S. M., Flower, A. & North, A. J.
(1988), Discourse performance in subjects with dementia of the Alzheimer type, in ‘Neu-
ropsychological studies of nonfocal brain damage’, Springer, pp. 108–131.
Ulatowska, H. K. & Chapman, S. B. (1994b), ‘Discourse macrostructure in aphasia’, Discourse
analysis and applications: Studies in adult clinical populations pp. 29–46.
Vapnik, V. (1963), ‘Pattern recognition using generalized portrait method’, Automation and
Remote Control 24, 774–780.
Vásquez-Correa, J., Orozco-Arroyave, J. R., Arias-Londoño, J. D., Vargas-Bonilla, J. F. & Nöth,
E. (2013), Design and implementation of an embedded system for real time analysis of speech
from people with Parkinson’s disease, in ‘Symposium of Signals, Images and Artificial Vision
- 2013: STSIVA - 2013’, pp. 1–5.
Vizza, P., Tradigo, G., Mirarchi, D., Bossio, R. & Veltri, P. (2017), ‘On the Use of Voice Signals for
Studying Sclerosis Disease’, Computers 6(4), 30.
Vogel, A. P., Shirbin, C., Churchyard, A. J. & Stout, J. C. (2012), ‘Speech acoustic markers of early
stage and prodromal Huntington’s disease: a marker of disease onset?’, Neuropsychologia
50(14), 3273–3278.
Vorperian, H. K. & Kent, R. D. (2007), ‘Vowel acoustic space development in children: A
synthesis of acoustic and anatomic data’, Journal of Speech, Language, and Hearing Research.
Wang, S. & Starren, J. (1999), A Java Speech Implementation of the Mini Mental Status Exam,
in ‘Proc. AMIA Symposium’.
Watts, C. R. & Vanryckeghem, M. (2001), ‘Laryngeal dysfunction in Amyotrophic Lateral Scle-
rosis: a review and case report.’, BMC Ear Nose Throat Disord 1(1), 1.
Wechsler, D. (1997), Wechsler Memory Scale - Third Edition Manual, The Psychological Corpora-
tion.
Weismer, G., Jeng, J.-Y., Laures, J. S., Kent, R. D. & Kent, J. F. (2001), ‘Acoustic and intelligibility
characteristics of sentence production in neurogenic speech disorders’, Folia Phoniatrica et
Logopaedica 53(1), 1–18.
WHO (2017), ‘World Health Organization. Dementia Fact Sheet No. 362’.
Wikipedia (2014), ‘Wikipedia 2014’, http://dumps.wikimedia.org/enwiki/20140102/. [No
longer available as of 22-February-2019].
Wittchen, H.-U., Jacobi, F., Rehm, J., Gustavsson, A., Svensson, M., Jónsson, B., Olesen, J.,
Allgulander, C., Alonso, J., Faravelli, C. et al. (2011), ‘The size and burden of mental
disorders and other disorders of the brain in Europe 2010’, European Neuropsychopharmacology
21(9), 655–679.
Wong, E. & Sridharan, S. (2002), Methods to improve Gaussian mixture model based language
identification system, in ‘Seventh International Conference on Spoken Language Processing’.
Wong, S.-M. J. & Dras, M. (2010), Parser features for sentence grammaticality classification, in
‘Proceedings of the Australasian Language Technology Association Workshop 2010’, pp. 67–
75.
Wu, W., Zheng, T. F., Xu, M.-X. & Bao, H.-J. (2006), Study on speaker verification on emotional
speech, in ‘Ninth International Conference on Spoken Language Processing’.
Yancheva, M., Fraser, K. C. & Rudzicz, F. (2015), Using linguistic features longitudinally to
predict clinical scores for Alzheimer’s disease and related dementias, in ‘Proceedings of
SLPAT 2015: 6th Workshop on Speech and Language Processing for Assistive Technologies’,
pp. 134–139.
Yancheva, M. & Rudzicz, F. (2016), Vector-space topic models for detecting Alzheimer’s dis-
ease., in ‘Proceedings of the 54th Annual Meeting of the Association for Computational Lin-
guistics (Volume 1: Long Papers)’, pp. 2337–2346.
Yin, B., Ambikairajah, E. & Chen, F. (2006), Combining cepstral and prosodic features in lan-
guage identification, in ‘18th International Conference on Pattern Recognition (ICPR’06)’,
Vol. 4, IEEE, pp. 254–257.
Yorkston, K., Strand, E., Miller, R., Hillel, A. & Smith, K. (1993), ‘Speech Deterioration in Amy-
otrophic Lateral Sclerosis: Implications for the Timing of Intervention’, Journal of Medical
Speech-Language Pathology 1, 35–46.
Zeißler, V., Adelhardt, J., Batliner, A., Frank, C., Nöth, E., Shi, R. P. & Niemann, H. (2006), The
prosody module, in ‘SmartKom: foundations of multimodal dialogue systems’, Springer,
pp. 139–152.
Zheng, R., Zhang, S. & Xu, B. (2004), Text-independent speaker identification using GMM-UBM
and frame level likelihood normalization, in ‘2004 International Symposium on Chinese Spo-
ken Language Processing’, IEEE, pp. 289–292.
Zou, H. & Hastie, T. (2005), ‘Regularization and variable selection via the Elastic Net’, Journal
of the Royal Statistical Society, Series B 67, 301–320.
Zwicker, E. (1961), ‘Subdivision of the Audible Frequency Range into Critical Bands (Frequen-
zgruppen)’, The Journal of the Acoustical Society of America 33(2), 248–248.