
UNIVERSIDADE DE LISBOA

INSTITUTO SUPERIOR TÉCNICO

Speech and language technologies applied to diagnosis and therapy of brain diseases

Anna Maria Pompili

Supervisor: Doctor Alberto Abad Gareta
Co-Supervisor: Doctor Isabel Pavão Martins

Thesis approved in public session to obtain the PhD Degree in Information Systems and Computer Engineering

Jury final classification: Pass with Distinction

2019


UNIVERSIDADE DE LISBOA

INSTITUTO SUPERIOR TÉCNICO

Speech and language technologies applied to diagnosis and therapy of brain diseases

Anna Maria Pompili

Supervisor: Doctor Alberto Abad Gareta

Co-Supervisor: Doctor Isabel Pavão Martins

Thesis approved in public session to obtain the PhD Degree in Information Systems and Computer Engineering

Jury final classification: Pass with Distinction

Jury

Chairperson: Doctor José Manuel da Costa Alves Marques, Instituto Superior Técnico, Universidade de Lisboa

Members of the Committee:

Doctor Maria de São Luís de Vasconcelos Fonseca e Castro Schoner, Faculdade de Psicologia e de Ciências da Educação, Universidade do Porto

Doctor Mário Jorge Costa Gaspar da Silva, Instituto Superior Técnico, Universidade de Lisboa

Doctor António Joaquim da Silva Teixeira, Universidade de Aveiro

Doctor David Manuel Martins de Matos, Instituto Superior Técnico, Universidade de Lisboa

Doctor Alberto Abad Gareta, Instituto Superior Técnico, Universidade de Lisboa

Doctor Ana Rita Mendes Londral Gamboa, UniSpital, University Hospital of Zurich, Switzerland

Funding Institutions

Fundação para a Ciência e a Tecnologia

2019


Resumo

As doenças cerebrais, e mais especificamente os distúrbios neurodegenerativos, incluem uma gama de condições que afetam o cérebro, causando danos irreversíveis e progressivos. Não há uma cura para muitas dessas doenças, mas a deteção precoce do início dos sintomas pode atenuar o seu progresso. O processo atual para rastrear os distúrbios neurodegenerativos apresenta desvantagens importantes, sendo altamente dispendioso e demorado. Estes fatores tornam-se particularmente onerosos quando é necessária uma reavaliação frequente, de forma a ajustar a dosagem dos fármacos.

Esta tese aborda o uso de tecnologias da fala e da linguagem para contribuir para o diagnóstico clínico de doenças neurodegenerativas. O uso dessas tecnologias pode facilitar o processo de triagem dessas doenças e fornecer aos médicos uma ferramenta complementar e objetiva de diagnóstico. De acordo com a manifestação dos sintomas clínicos, são identificadas três áreas distintas nas quais esta dissertação pode contribuir para o avanço do atual estado da arte: monitorização das habilidades de fala, cognitivas e da linguagem. Relativamente a essas áreas, esta tese faz as seguintes contribuições: (1) definição de um conjunto geral e padrão de características capazes de modelar os sintomas de um distúrbio que afete a produção motora da fala, como a doença de Parkinson. Este conjunto de características é usado para avaliar a relevância de diferentes tarefas de fala em português dedicadas à análise da fonação, respiração e articulação. Os resultados mostram que as tarefas de produção mais importantes são a leitura de frases prosódicas e a narração de histórias, que permitem alcançar uma precisão de classificação da doença de Parkinson de 85.10% e 82.32%, respetivamente; (2) implementação online de um conjunto representativo de testes neuropsicológicos utilizados na triagem da demência, como o Défice Cognitivo Ligeiro, utilizando a tecnologia de reconhecimento automático de fala. Para avaliar a viabilidade do instrumento de monitorização, recolheu-se um corpus de fala em português, que inclui gravações de cinco pessoas com diagnóstico de défice cognitivo e de cinco sujeitos saudáveis. O erro entre a avaliação manual e a automática é relativamente pequeno, entre 0.80 e 3.00 para os pacientes e entre 0.80 e 2.80 para o grupo de controlo, confirmando a viabilidade desse tipo de sistemas; (3) desenvolvimento de um método automático de análise de aspetos pragmáticos da produção de discurso, como a coerência de tópico. Este método é ainda complementado com aspetos lexicais, sintáticos e semânticos do discurso, de modo a obter uma avaliação abrangente da produção do discurso, tarefa que já provou ser útil para a deteção da doença de Alzheimer. Com este método, os resultados da classificação atingem uma precisão de 85.5% na identificação automática da doença.

Abstract

Brain diseases, and more specifically neurodegenerative disorders, include a range of conditions that affect the brain, causing irreversible and progressive damage. There is no cure for many of these diseases, but early detection of symptom onset may mitigate their progress. The current process to screen for neurodegenerative disorders presents important disadvantages, being both highly costly and time-consuming. These factors become particularly burdensome when frequent re-assessment is required to fine-tune drug dosage.

This thesis addresses the use of speech and language technologies to contribute to the clinical diagnosis of neurodegenerative diseases. The use of these technologies may ease the screening process for these disorders and provide clinicians with a complementary, objective diagnostic tool. According to the manifestation of the clinical symptoms, three distinct areas are identified in which this dissertation can contribute to the advance of the current state of the art: monitoring of speech, cognitive, and language abilities. With respect to these areas, this thesis makes the following contributions: (1) definition of a general and standard set of features capable of modeling the symptoms of a disorder affecting the motor production of speech, such as Parkinson's disease. This set of features is used to assess the relevance of different speech tasks in Portuguese dedicated to evaluating phonation, respiration, and articulation. Results show that the most important production tasks are the reading of prosodic sentences and storytelling, achieving Parkinson's disease classification accuracies of 85.10% and 82.32%, respectively; (2) online implementation of a representative set of neuropsychological tests used in the screening of dementia, such as Mild Cognitive Impairment, exploiting automatic speech recognition technology. To evaluate the feasibility of the monitoring tool, a Portuguese speech corpus was collected, including recordings of five people diagnosed with cognitive impairments and five healthy control subjects. The error between the manual and the automatic evaluation is relatively small, from 0.80 to 3.00 for the patients, and from 0.80 to 2.80 for the control group, confirming the feasibility of this type of system; (3) development of an automatic method to analyze pragmatic aspects of discourse production, in particular topic coherence. This method is further complemented with lexical, syntactic, and semantic aspects of discourse, in order to provide a comprehensive evaluation of discourse production that is shown to be useful for the detection of Alzheimer's disease. In this way, classification results achieve an accuracy of 85.5% in the automatic identification of the disease.

Palavras-Chave

Diagnóstico Clínico Automático
Avaliação de Doenças Neurodegenerativas
Análise Automática de Fala
Reconhecimento Automático de Fala
Processamento de Linguagem Natural

Keywords

Automatic Clinical Diagnosis
Neurodegenerative Diseases Assessment
Automatic Speech Analysis
Automatic Speech Recognition
Natural Language Processing

Acknowledgments

This doctoral research has been conducted at the Spoken Language Systems Laboratory (L2F) at INESC-ID. The support of many people and various institutions contributed, directly and indirectly, to the fulfillment of this work. I would like to take this opportunity to express my gratitude to all of them.

My deepest gratitude goes to my scientific advisors, Prof. Alberto Abad and Prof. Isabel Pavão Martins, for their technical knowledge, vision, and suggestions regarding both where to direct my research and how to meet the expectations of this thesis. Given the interdisciplinary nature of this doctoral research, it would not have been possible to fulfill its objectives without their joint and complementary guidance. In different ways, they always supported my work and encouraged my choices. Prof. Alberto Abad taught me the right approach to address complicated problems, and readily helped me overcome the many difficulties that I had to tackle while pursuing the goals of this thesis. Prof. Isabel Pavão Martins allowed me to acquire knowledge in the area of neuroscience; she shared with me her clinical point of view, interesting research papers, and inspiring discussions. She has also been an active supporter of this work, constantly pursuing new ideas and collaborations.

I wish to express my gratitude to Professor Isabel Trancoso, not only for her continuous guidance, but also for having pushed me to start this doctoral research and for having supported it over the years with her enthusiasm and precious advice. During the time spent at the L2F, she never missed a chance to demonstrate her trust in me, and always provided me with her full support and availability.

I am grateful to all the members of the Laboratório de Estudos de Linguagem of the University Clinic of Neurology for warmly welcoming me to their weekly meetings. Attending the presentations and discussions of research papers and clinical cases has been an enlightening experience from both a personal and a technical point of view.

I owe a very special acknowledgment to all the people who made the accomplishment of this thesis possible by providing their valuable contributions:

• Dr. Filipa Miranda, for introducing me to the subject of topic coherence and for her important advice during the development of this study;

• Dr. Rita Cardoso, Dr. Helena Santos, Dr. Joana Carvalho, Prof. Isabel Guimarães, and Prof. Joaquim J. Ferreira, for sharing the FrasuloPark database;

• Dr. José Salgado, Dr. Inês Cunha, and Dr. Vitorina Passão, for having allowed the data collection of cognitively impaired patients, and for their availability, support, and precious feedback during the development of a tool for the screening of cognitive impairments;

• Prof. Nuno Mamede, for having kindly provided an important resource that constituted the baseline for some of the results achieved in this dissertation.

Next, I express my gratitude to the members of the Comissão de Acompanhamento da Tese, Prof. Mário Silva, Prof. António Teixeira, and Prof. David de Matos, for their suggestions and recommendations, both in terms of scientific research and the organization of this document.

I would also like to thank the Portuguese research funding agency Fundação para a Ciência e a Tecnologia (FCT) for its support through the PhD scholarship SFRH/BD/97187/2013 during the first four years of this work, as well as the Instituto Superior Técnico funding during the last year.

Thank you also to all the colleagues and roommates that I have had the pleasure to know during these years. With their kindness and friendship, they have been active supporters of this experience.

Finally, my special thanks go to Paolo, my companion. He dealt with my frustrations on a daily basis, giving me shine and structure; I am very grateful for his patience and understanding. His advice, care, and support have been invaluable in overcoming the hardest difficulties and achieving this result.

Contents

1 Introduction
  1.1 Motivations
  1.2 Contributions
  1.3 Structure of this document

2 Technical Background
  2.1 Natural language processing
    2.1.1 Bag-of-words
    2.1.2 N-gram models
    2.1.3 Word embedding
  2.2 Spoken language processing
    2.2.1 Speech analysis and speaker characterization
      2.2.1.1 Speech analysis
      2.2.1.2 Speaker characterization
    2.2.2 Automatic speech recognition
  2.3 Machine learning
    2.3.1 Machine learning models
      2.3.1.1 Support Vector Machine
      2.3.1.2 Decision Tree
      2.3.1.3 Random Forest
      2.3.1.4 Artificial Neural Network
    2.3.2 Feature selection
    2.3.3 Model evaluation
      2.3.3.1 Evaluation metrics
  2.4 Summary

3 Characterization of Neurodegenerative Diseases
  3.1 Mild Cognitive Impairment
  3.2 Alzheimer's disease
  3.3 Parkinson disease
  3.4 Dementia with Lewy bodies
  3.5 Fronto Temporal Dementia
  3.6 Amyotrophic Lateral Sclerosis
  3.7 Huntington disease
  3.8 Neurodegenerative diseases and SLT
  3.9 Summary

4 Related Work: SLT for Diagnosis of Neurodegenerative Diseases
  4.1 Monitoring of speech abilities
  4.2 Monitoring of cognitive abilities
    4.2.1 Semantic fluency tests
    4.2.2 Cognitive tests assessing memory, attention, orientation
  4.3 Monitoring of language abilities
  4.4 Summary

5 Contributions to the Monitoring of Speech Abilities
  5.1 Automatic detection of Parkinson's Disease: an analysis of speech production tasks used for diagnosis
    5.1.1 Corpus
    5.1.2 Evaluation
  5.2 Summary

6 Contributions to the Monitoring of Cognitive Abilities
  6.1 Semantic verbal fluency test
    6.1.1 Corpus
    6.1.2 Evaluation
  6.2 Automatic monitoring and training of cognitive functions
    6.2.1 Extending VITHEA for neuropsychological screening
    6.2.2 Corpus
    6.2.3 Evaluation
  6.3 Summary

7 Contributions to the Monitoring of Language Abilities
  7.1 Evaluating pragmatic aspects of discourse production for the automatic identification of Alzheimer's disease
    7.1.1 Corpus
    7.1.2 The proposed model to analyze topic coherence
      7.1.2.1 Preprocessing
      7.1.2.2 Clause segmentation
      7.1.2.3 Coreference analysis
      7.1.2.4 Sentence embeddings
      7.1.2.5 Topic hierarchy analysis
    7.1.3 Features for AD spoken discourse characterization
      7.1.3.1 Topic coherence features
      7.1.3.2 Other linguistic features
    7.1.4 Evaluation
      7.1.4.1 Experiments using manual transcriptions
      7.1.4.2 Experiments using automatic transcriptions
  7.2 Summary

8 Conclusions and Future Work
  8.1 Contributions
  8.2 Future Work

A Appendix
  A.1 Excerpts of input/output processing
    A.1.1 Preprocessing
    A.1.2 Clause segmentation
    A.1.3 Coreference analysis
  A.2 Computation of semantic features
    A.2.1 Specifications of an ICU list
    A.2.2 Computing ICUs

Bibliography

List of Tables

5.1 Description of the acoustic features based on 53 low-level descriptors plus 6 functionals.
5.2 Demographic and clinical data for patient and control groups.
5.3 Task-dependent recognition results on the 2-class detection task (PD vs. control).
6.1 WER for different language models: i) Generic ASR system: general-purpose language model trained on broadcast news; ii) Prebuilt-list based: constrained keyword model created from the list used in the STRING project; iii) Ontology based: constrained keyword model created from the ontology TemaNet.
6.2 Performance of AuToBI using English and European Portuguese models and three segmentation strategies: ASR-based, ontology-based (TemaNet), and phone-based.
6.3 Implemented cognitive tests. KWS: keyword spotting; RBG: rule-based grammar; ALM: ad-hoc language model for keyword spotting.
6.4 Accuracy and WER according to the type of question.
6.5 MAE and MRAE (in brackets) by type of question and by neuropsychological test.
7.1 Statistical information on the Cookie Theft corpus.
7.2 Summary of all extracted features (141 in total). The number of each type of feature is reported in parentheses.
7.3 Summary of AD classification results (avg. and range accuracy %).

List of Figures

2.1 Vocabulary and BOW vector representations for an example corpus of two documents, each containing one sentence.
2.2 Visual representation of the computation of P(Mary loves that person) using a bi-gram model.
2.3 (a) A schematic diagram of the human speech production apparatus. (b) Waveform of /sees/, showing a voiceless phoneme /s/, followed by a voiced sound, the vowel /iy/. The final sound, /z/, is a type of voiced consonant (Huang et al. 2001).
2.4 Schematic representation of the extraction of a sequence of 39-dimensional MFCC feature vectors.
2.5 Jitter and shimmer perturbation measures in a speech signal (Teixeira & Fernandes 2014).
2.6 Quadrilateral and triangular vowel space area for healthy subjects. The picture was extracted from the work of Vizza et al. (2017) on the use of the speech signal for studying sclerosis.
2.7 Schematic representation of the main modules constituting an automatic speech recognizer.
2.8 A typical workflow used in the training process of a machine learning model.
2.9 (a) Three possible hyperplanes for an SVM trained with linearly separable data; the best hyperplane is shown with a solid line. (b) The hyperplane with the maximum distance from data points. (c) Soft margin allowing some classification errors. (d) Non-linearly separable data.
2.10 A decision tree for the concept PlayTennis. This tree classifies Saturday mornings according to whether or not they are suitable for playing tennis (Mitchell 1997).
2.11 (a) A schematic drawing of a biological neural network with two neurons, (b) a diagram of an artificial neuron, (c) an architecture of an artificial neural network with three layers (Negnevitsky 2005).
2.12 An illustration of the k-fold cross-validation method with k=10.
6.1 An excerpt of an audio recording showing, from the top: the spectrogram, the F0, the textual transcription of the sound, and the prosodic event classification. Red arrows indicate a continuation rise contour, while the yellow arrow indicates a finality contour.
6.2 On the left, MMSE scores of the human and automatic evaluations for the patient speakers. On the right, MMSE scores of the human and automatic evaluations for the healthy speakers.
7.1 (a) The Cookie Theft picture, from the Boston Diagnostic Aphasia Examination (Goodglass et al. 2001). (b) An excerpt of a topic hierarchy for the Cookie Theft picture found in the work of Miranda (2015).
7.2 The proposed method for modeling discourse as a hierarchy of topics.
7.3 Topic hierarchy building algorithm. (a) The current sentence is compared with the topic clusters to identify its topic. (b) Identification of the level of specialization of the current sentence. If there are no nodes with the same topic as the current sentence, it is considered a new topic. (c) If the current hierarchy contains one or more nodes with the same topic as the current sentence, each of them is analyzed with respect to the current one. (d) As a result, the current sentence is added as a child of its closest node.
7.4 Variation of the classification accuracy with the SFS method while increasing the number of features. Results are presented for the set of topic coherence features that provided the maximum accuracy. Features are computed on the manual transcriptions.
7.5 Accuracy achieved with the top selected features using the fusion of different sets. Results are computed on the manual transcriptions (top) and on the automatic transcriptions (bottom).

List of Acronyms

AD Alzheimer's Disease
ADAS-Cog Alzheimer's Disease Assessment Scale - Cognitive Subscale
ALS Amyotrophic Lateral Sclerosis
ANN Artificial Neural Network
ANOVA analysis of variance
ASR Automatic Speech Recognition
BDAE Boston Diagnostic Aphasia Examination
BN Bayes Network
BOW bag-of-words
CART Classification and regression tree
CFD Castiglioni Fractal Dimension
CHAT Codes for the Human Analysis of Transcripts
CSR Continuous Speech Recognition
DFT Discrete Fourier Transform
DLB Dementia with Lewy bodies
DT Decision Tree
EM Expectation-Maximization
FTD Frontotemporal Dementia
GMM Gaussian Mixture Model
HD Huntington Disease
HMM Hidden Markov Model
ICU Information Content Unit
ID3 Iterative Dichotomiser 3
IDFT inverse Discrete Fourier Transform
IG Information Gain
IVR interactive voice response
IWR Isolated Word Recognition
KWS keyword spotting
LASSO least absolute shrinkage and selection operator
LOO leave-one-out
LPC Linear Prediction Coefficients
LPCC Linear Prediction Cepstral Coefficients
LPO leave-p-out
LSA Latent Semantic Analysis
LVCSR Large Vocabulary Continuous Speech Recognition
MAE Mean Absolute Error
MAP Maximum a Posteriori
MATTR moving-average type-token ratio
MCI Mild Cognitive Impairment
MFCC Mel-frequency cepstral coefficient
MLP Multilayer Perceptron
MMSE Mini-Mental State Examination
MRAE Mean Relative Absolute Error
NB Naïve Bayes
NLP Natural Language Processing
NNLM neural network language model
OOV Out-Of-Vocabulary
PCA Principal Component Analysis
PD Parkinson's Disease
PE Permutation Entropy
PLP Perceptual Linear Prediction coefficients
POS Part of Speech
PPA Primary Progressive Aphasia
RASTA Relative Spectra coefficients
RF Random Forest
SFS sequential forward selection
SLP Speech Language Pathologist
SLT Speech and Language Technology
STT Speech to Text
SVM Support Vector Machine
TTR type-token ratio
UBM Universal Background Model
UHDRS Unified Huntington's Disease Rating Scale
UPDRS Unified Parkinson's Disease Rating Scale
VAI Vowel Articulation Index
VSA Vowel Space Area
VUV voiced/unvoiced
WAB Western Aphasia Battery
WAIS-III Wechsler Adult Intelligence Scale - III
WAIS-IV Wechsler Adult Intelligence Scale - IV
WER Word Error Rate
WLM Wechsler Logical Memory
WMS Wechsler Memory Scale
WMS-IV Wechsler Memory Scale fourth edition

1 Introduction

Language is a fundamental ability in our daily lives, as it is used to communicate with the

world around us. The production of language is a complex, multidimensional skill that in-

volves different, interdependent cognitive domains. From a very general point of view, to

express meaning through language our thoughts have to be converted into a conceptual rep-

resentation, which corresponds to the generation of the message to be conveyed. This phase

implies access to semantic information for the retrieval and selection of the words to be used.

Then, syntactic properties of the identified lexical items are elaborated and the appropriate or-

der of words within the sentence is established. Finally, the conceptual representation of the

words to be spoken is transformed into a sequence of speech sounds to be pronounced, which

are sent from the brain to the articulatory system. In order for sounds to be produced correctly,

the lips, tongue, jaw, velum, and larynx must produce accurate movements at the right time

or the intended sounds become distorted (Dell et al. 1999, Dronkers & Ogar 2004, Garrett 1975,

Jay 2002). In conclusion, the utterance of a simple sentence requires the activation of a large-

scale neural network dedicated to semantic, syntactic, and phonological processing. Engaging

in a conversation is an even more cognitively demanding task, since besides linguistic processing

it may also require access to memory, world knowledge, or to high-order cognitive functions

like planning. When considering also the integration of sensory and motor functions, and the

corresponding brain regions that control them, it is not surprising that speech production has

been described as one of the most complex human behaviors. A considerable set of widely

distributed brain regions is involved in speech production, and a lesion in any of these areas may

disturb the equilibrium of this complex system and produce alterations in the resulting speech.

Brain disease is an umbrella term for a range of conditions that affect the brain in different

forms: neurodegenerative disorders, infections, trauma, stroke, and tumors. Neurodegenera-

tive diseases are incurable and debilitating conditions that result in a progressive degeneration

and death of neurons in different regions of the nervous system. As these result in permanent

damage, the condition tends to get worse as the disease progresses. There are more than six

hundred disorders affecting the nervous system. According to the clinical symptoms, they can

be classified into three main categories: i) diseases presenting cognitive decline, dementia and

alterations in high-order brain functions, ii) movement disorders, and iii) a combination of both


symptoms (Kovacs 2014). Major clinical features representative of the first category include

deficits in various cognitive domains, like memory, attention, and language (e.g., Alzheimer’s

Disease, Dementia with Lewy bodies). Movement disorders are clinically associated with hy-

perkinetic, hypokinetic, and akinetic symptoms, like uncontrollable or slowness of movement,

and lack of spontaneous motility (e.g., Parkinson’s Disease). Current clinical practice to screen

neurodegenerative diseases requires an examination by an expert neurologist, which may

then be followed by an examination with a Speech Language Pathologist (SLP). The as-

sessment typically includes a medical examination, the manual administration of standardized

neuropsychological tests, and, possibly, the perceptual evaluation of voice quality. The his-

tory of the patient is of particular relevance, both to consider similar previous cases in the

family and to observe the evolution of the clinical picture according to the patient's complaints.

The clinical evaluation can be quite long, depending on the types of tests and exams per-

formed and on the cooperation of the patient. Neuroimaging studies are both invasive and

expensive, and have a limited use as a preliminary screening tool. Additionally, although there

are sophisticated quantitative and objective image analysis methods, standard clinical practice

in diagnostic imaging is qualitative in nature.

1.1 Motivations

There is no cure for neurodegenerative diseases, but treatment can still be of help. In this re-

gard, the diagnosis of early onset symptoms is critical to start the appropriate intervention and

mitigate disease progression (Sheinerman & Umansky 2013). However, the current process

to screen neurodegenerative disorders presents important disadvantages, being both highly

costly and time-consuming. These factors become particularly burdensome when frequent

re-assessment is required to fine-tune dosage of drugs. Another important concern regards the

reduced availability of specialized neurologists. Data published by the European Brain Council

show that 220.7 million people in Europe suffer from at least one neurological disease (Wittchen

et al. 2011). On the other hand, the number of specialized neurologists in the EU countries is

around 25,000. Depending on the country there are between 4 and 13 neurologists per 100,000

people (Olesen et al. 2012, Steck et al. 2013). When considering the rapidly aging global popu-

lation and an expected dramatic increase of neurological disorders, the number of specialized

clinicians is clearly insufficient to meet the growing needs. This problem is also of particular

relevance in remote areas with reduced medical resources, where the availability of specialized

physicians is even more limited.


For all the above reasons, nowadays, there is an increasing need for additional, noninva-

sive, and cost-effective tools allowing a preliminary identification of diseases in their early

clinical stages. Further examinations using additional screening measures could then be per-

formed in a next step by an established clinician. Speech production, being the primary means

of interaction, plays a fundamental role in the diagnostic process. In fact, speech is used to

provide information about ourselves and to fulfill part of the clinical evaluation. Speech is an

ecological way to collect biometric information, as it can be elicited and recorded automatically

relatively easily, and at much lower cost than in-person clinical assessment. Additionally, it

naturally conveys important cues that can be further investigated and analyzed. In this regard,

Speech and Language Technology (SLT) could supply an important contribution to this area.

In fact, the development of automated methods based on the evaluation of speech and lan-

guage functionalities could be of great support in clinical diagnosis, not only by providing

a complementary diagnostic tool, but also by assessing disease progression over time objectively

and accurately. Indeed, the automatic analysis of voice and language makes it possible

to offer an objective evaluation that is consistent and independent of the experience of the clinician,

thus excluding possible differences due to inter-expert variability. Finally, the ability to

offer an alternative, remote assessment represents an opportunity to provide access to medical

services for those who might otherwise be deprived of an SLP.

In the light of these considerations, the major aim of this thesis was to conduct research

in the area of diagnosis of neurodegenerative diseases by means of SLT that can contribute to

an improvement of current state-of-the-art methods. Research on SLT applied to neurode-

generative diseases is an interdisciplinary area, which requires knowledge from the linguistic

domain, computer and electronics engineering, and partially from neuroscience. In fact, to

actively contribute to the diagnosis and therapy of neurological disorders, an understanding

of the symptoms caused by neural damage is fundamental and, even more importantly, of how

these signs impact speech and language functionalities. This is an area that is rel-

atively new to the research group where I am involved, whose primary focus is the automatic

processing of natural spoken language. The group has a strong background in linguistics,

computer, and electronics engineering, built up over 20 years of research. However, until

the development of this thesis, research on the application of these technologies to the health

area was quite limited. In this context, it is also the purpose of this thesis to set the stage for

future investigations in this field to be developed in the group.


1.2 Contributions

To contribute to the diagnosis of neurodegenerative diseases by means of SLT, it is necessary

to investigate in depth existing solutions applied to speech impairments, identifying current ap-

proaches and their limitations. However, before diving into the literature review of existing

solutions, there is first the need to analyze the most common neurodegenerative diseases, to

understand their symptoms and also the methods used in clinical practice for their diagnosis.

The ultimate goal is to identify common patterns between different disorders and then focus

the research on a sample of diseases representative of these disorders. From the results of this

investigation, I identified three main areas where SLT could provide its contributions. When

the impairment affects the organs related with the production of sounds, the resulting speech

may become distorted or unintelligible (e.g., Parkinson’s Disease). In this case, alterations in

voice could be analyzed and monitored through the analysis of the speech signal. When speech

production is preserved, it becomes a reliable means to investigate cognitive decline through

neuropsychological tests (e.g., Dementias). Finally, when the neurological damage affects the

areas of the brain related with the processing of language, the resulting speech could be im-

paired in different ways (e.g., aphasia, Fronto Temporal Dementia). In this case, the alterations

that occur in discourse production could also represent an important clue to screen cognitive

deterioration. The main contributions of this work in the three areas identified are explained

in more detail hereafter:

• Monitoring of speech abilities: when the brain lesions cause a dysfunction in the regu-

lation of the major brain structures involved in the control of movements, the production

of speech may also be affected. In fact, in these cases the muscles implicated in speech

production are also subject to specific dysfunctions, causing patients to experience dif-

ficulties in communication despite the existence of language competence. Different dis-

eases involving a neural degeneration of the brain may cause similar disorders on motor

speech abilities. The most common motor speech disorder caused by neurological in-

jury is dysarthria. This is characterized by a problem in any of the speech subsystems

(tongue, throat, lips, or lungs), leading to impairments in intelligibility, audibility, natural-

ness, and efficiency of vocal communication. These kinds of disorders are typically as-

sessed through a battery of vocal tests aiming at evaluating the patient’s abilities in tasks

involving phonation, respiration, articulation, and prosody. Through the understanding

of how these impairments affect the motor speech system, and the ability assessed by

each task, it is possible to identify suitable features able to model the problem under con-


sideration. In the literature, there is an extensive body of research targeting the automatic

characterization of dysarthria through different sets of speech features, speech tasks, and

machine learning models. However, from these studies it is not clear which among the

several tasks administered could be more relevant for characterizing and monitoring the dis-

order. For this reason, my contribution to the evaluation of motor speech disorders aims

at filling this gap, with a study focused on analyzing the importance of each individual

vocal task and, correspondingly, the importance of individual speech impairments in the

identification of the disease.

• Monitoring of cognitive abilities: many neurodegenerative diseases cause cognitive

dysfunctions in multiple cognitive domains. Impairments depend on the area of the

brain affected and may include alterations in visuospatial ability, planning, reasoning,

attention, memory, language, and personality. The process of screening neurodegenera-

tive diseases partly relies on a cognitive assessment in which several tests are adminis-

tered to the patient. In fact, in the clinical practice, a considerable number of batteries

of neuropsychological tests have been developed, since each test aims to assess different

cognitive functions. The majority of these tests require a verbal interaction from the pa-

tient to provide the desired answers. Thus, when language ability is spared, many of

these tests are eligible to be automated through Automatic Speech Recognition (ASR)

technology, reducing the need for the physical presence of a clinician. An online imple-

mentation of these tests could ease the screening process, providing great benefits to the

health community, both for speech and language pathologists and the elderly popula-

tion. A revision of the state of the art for this area highlighted very few works targeting

the automatic assessment of cognitive tests. Also, the implementation of systems able

to automatically administer and evaluate batteries of neuropsychological tests was even

more limited. Thus, one contribution to the screening of neurodegenerative diseases con-

sists of the automatic implementation of two widely used neuropsychological batteries.

These tests have been integrated into an online platform (Abad et al. 2013), whose flexibil-

ity easily allows the creation of different types of tests, and are evaluated automatically by

means of ASR. Additionally, another contribution is related to the automatic evaluation of

the semantic-verbal fluency test, a sensitive test for distinguishing between Mild Cognitive

Impairment and Alzheimer’s Disease, a challenging task for current ASR systems.

• Monitoring of language abilities: neurological lesions that involve the areas of the brain

related with the production or understanding of language may compromise language

abilities in a variety of different ways. Speech may be affected at the phonological or


syntactic level, may become non-fluent with problems in word-finding and word rep-

etitions, or fluent but poor in meaning, as in the case of aphasia. More generally, problems

may arise in language production at the level of discourse structure. Speech may be

poorly organized, may lack coherence, or may present a substantial use of word repetitions

and a reduced use of more complex expressions. The analysis of discourse produc-

tion is a complex and broad task, being evaluated along a micro and a macro dimension

that together address different aspects, such as coherence, cohesion, lexical, and syntactic

analysis. Discourse evaluation is usually performed through the analysis of spontaneous

speech elicited with different types of stimuli. In the literature, there has been a growing

interest in investigating the computational analysis of language impairment of neurode-

generative disorders. Overall, existing works assess the quality of discourse production

through the automatic analysis of a combination of lexical, syntactic, acoustic, and se-

mantic features. Few studies, however, approached linguistic deficits at a higher level of

processing, considering macrolinguistic aspects of discourse production such as cohesion

and coherence. For these reasons, my contribution to the assessment of language abili-

ties focuses on pragmatic aspects of discourse, and in particular, on the analysis of topic

coherence. This method is further complemented with lexical, syntactic, and semantic

aspects of discourse, in order to provide a comprehensive evaluation of discourse pro-

duction. Finally, the impact of using a speech recognition system to automatically obtain

the transcriptions of the speech samples is also evaluated.

1.3 Structure of this document

The remainder of this dissertation is structured as follows. Chapter 2 provides an introduc-

tion to some technical notions and methods that are frequently used in the areas of machine

learning, speech, and natural language processing that are relevant for the rest of this docu-

ment. In Chapter 3, I carry out a characterization of neurodegenerative diseases. This study

is required in order to identify a subset of disorders that are the focus of this thesis. Then, in

Chapter 4, I survey the related work relevant for this proposal. The state of the art of existing

speech technologies solutions applied to the areas of motor speech disorders, cognitive screen-

ing, and analysis of discourse production is reported. Afterwards, Chapters 5, 6, and 7 present

the contributions provided in each of the areas of interest of this thesis, reporting the key re-

sults achieved. The document ends with Chapter 8, where the conclusions of this dissertation

and some directions for future work are presented.


2 Technical Background

This chapter provides a brief review of common topics frequently used in speech and lan-

guage technologies. In particular, three areas are considered: natural language processing,

spoken language processing, and machine learning. The goal of natural language processing

is to enable computers to perform useful tasks involving human language, such as human-

machine communication. It has a very broad scope, which extends to speech recognition and

language understanding. Spoken language processing, on the other hand, is focused on the

study of speech signals, including their digital processing, representations, and coding. Both

areas extensively rely on machine learning methods to accomplish their tasks. Machine learn-

ing is concerned with the study of mathematical models that allow computers to perform a

task without explicit instructions.

In the following sections, basic principles and more advanced concepts are introduced for

these three areas. The topics described were selected with the aim of providing the necessary

knowledge for the understanding of the concepts mentioned in this dissertation. Section 2.1 de-

scribes some text modeling techniques commonly used in the area of natural language process-

ing. Section 2.2 is dedicated to spoken language processing, with a focus on speech analysis,

speaker characterization, and speech recognition. Finally, Section 2.3 reports on some machine

learning models, feature selection approaches, and evaluation methods used in the machine

learning area.

2.1 Natural language processing

Natural language processing (NLP) is an area of computer science concerned with the compu-

tational processing and analysis of human language data. It is an interdisciplinary field with

a very wide scope, which includes machine translation, text summarization, and conversa-

tional agents, to mention just a few. In this section, the focus will be limited to providing an

overview of some commonly used language models. In general, the goal of a language model

is to capture salient statistical characteristics of the distribution of sequences of words in a nat-

ural language, making it possible to probabilistically predict the next word given the preceding

ones. Broadly, there are two main categories of language models: statistical or count-based,


and predictive models. Count-based methods compute the statistics of how often some words

occur with their neighbor words in a large text corpus. Predictive models directly try to predict

a word from its neighbors in terms of learned embedding vectors. Both approaches require

a large text corpus, either to compute the statistics of the model or to learn compact, distributed

representations of words.

In the following, Section 2.1.1 introduces a straightforward approach for representing text

data, while Section 2.1.2 describes a statistical language model widely used in NLP and speech

recognition. Finally, Section 2.1.3 briefly reports on neural language models and word embed-

dings.

2.1.1 Bag-of-words

The bag-of-words (BOW) model (Manning et al. 2010) is a simple representation used in NLP

and information retrieval. The main idea behind this approach is that important words will

occur repeatedly in various documents. Under this assumption, the number of occurrences

represents the importance of a word. In a straightforward implementation, documents may be

modeled with a fixed-length vector representation. The length is determined by the number of

unique words existing in the corpus, which is usually referred to as the vocabulary. Then, each

position of the fixed-length vector accounts for the number of times a word of the vocabulary

exists in the current document.

To provide a practical example, consider a corpus composed of two text documents con-

taining a single sentence: 1) /Mary likes french movies, but Sara prefers horror movies ./, and 2)

/Mary loves that person ./. In this case the vocabulary will be composed of 11 words: but, french,

horror, likes, loves, mary, movies, person, prefers, sara, that. The corresponding BOW vector

representations are shown in Figure 2.1 for a clearer presentation.

Despite being a very simple approach, according to this implementation, the BOW model

preserves multiplicity by accounting for the occurrence of repeated words within a document.

Nevertheless, this approach presents some important limitations that should be mentioned.

First, it is an order-less representation of documents. This means that any information related

with the order or structure of words is disregarded. To account also for spatial information,

one should consider probabilistic language models like n-grams, described in the next section.

Another limitation is related with the fact that word occurrences may be a very poor representa-

tion for a text. Function words like the, a, to, are usually the most frequent terms in a document,

although clearly, they not are the most important. To address this problem, the frequency of

a term is usually normalized by the inverse of the document frequency (idf) (Manning et al.


Vocabulary:  but  french  horror  likes  loves  mary  movies  person  prefers  sara  that

1. "Mary likes french movies, but Sara prefers horror movies"
   BOW: [1, 1, 1, 1, 0, 1, 2, 0, 1, 1, 0]

2. "Mary loves that person"
   BOW: [0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1]

Figure 2.1: Vocabulary and BOW vector representations for an example corpus of two documents, each containing one sentence.

2010). The document frequency corresponds to the number of documents in the corpus that contain

the term; weighting a term's frequency by its inverse balances the fact that some words appear

more frequently in general. Another inconvenience of the BOW model is that, for a very large corpus, the length of the vocabulary might be

thousands or millions of positions, thus requiring either more computational resources or to

limit the size of the vocabulary.
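As an illustration of the representation just described, the two example documents of Figure 2.1 and their BOW vectors can be reproduced in a few lines of code. This is a minimal sketch: tokenization is naive lowercased whitespace splitting, punctuation is ignored, and the function names are illustrative.

```python
def build_vocabulary(documents):
    """Sorted list of the unique (lowercased) tokens across all documents."""
    return sorted({token for doc in documents for token in doc.lower().split()})

def bow_vector(document, vocabulary):
    """Count how many times each vocabulary word occurs in the document."""
    tokens = document.lower().split()
    return [tokens.count(word) for word in vocabulary]

docs = ["Mary likes french movies but Sara prefers horror movies",
        "Mary loves that person"]
vocab = build_vocabulary(docs)            # 11 unique words, as in Figure 2.1
vectors = [bow_vector(d, vocab) for d in docs]
# vectors[0] -> [1, 1, 1, 1, 0, 1, 2, 0, 1, 1, 0]
# vectors[1] -> [0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1]
```

Note that the count 2 for movies in the first vector is exactly the multiplicity the BOW model preserves, while any information about word order is lost.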

2.1.2 N-gram models

The n-gram model (Jelinek & Mercer 1980, Katz 1987) is a probabilistic language model that

estimates the probability of a word based on the sequence of the previous n − 1 words. In this way,

this model accounts for the fact that words in a sentence respect a particular order. Approaches

based on the n-gram model have been the dominant methodology for probabilistic language

modeling since the 1980s. In fact, due to their simplicity and scalability, they are widely

used in many areas, such as, computational biology (Tomovic et al. 2006), image (Soffer 1997),

speech (Hirsimaki et al. 2009) and language processing (Dunning 1994). When the estimation

of the current word depends on the previous two words, one has a tri-gram language model:

P(w_i | w_{i−2}, w_{i−1}). Similarly, one can have uni-gram, P(w_i), or bi-gram, P(w_i | w_{i−1}), language

models. For example, to calculate the probability of the sentence /Mary loves that person ./


<s> Mary loves that person </s>

P(Mary|<s>), P(loves|Mary), P(that|loves), P(person|that), P(</s>|person)

Figure 2.2: Visual representation of the computation of P(Mary loves that person) using a bi-gram model: each word is conditioned on its immediate predecessor, from <s> to </s>.

using a bi-gram model, one would take:

P(Mary loves that person)=

P(Mary|<s>)P(loves|Mary)P(that|loves)P(person|that)P(</s>|person).

To make P(w_i | w_{i−1}) meaningful for i = 1, the beginning of the sentence is padded with a distin-

guished token <s>, pretending in this way that w_0 = <s>. In addition, it is necessary to place

a distinguished token </s> at the end of the sentence. The process to compute the probability

of a sentence with a bi-gram model is visually shown in Figure 2.2.

The frequencies with which the word w_i occurs given that the previous word is w_{i−1} are

estimated on a training corpus. This is achieved by counting how often the sequence (w_{i−1}, w_i)

occurs and then normalizing the count by the number of times w_{i−1} occurs.
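The counting procedure just described can be sketched as a minimal maximum-likelihood bi-gram model. The function names and the toy corpus are illustrative, and no smoothing is applied, so an unseen bi-gram yields a zero sentence probability.

```python
from collections import Counter

def train_bigram(corpus):
    """MLE bi-gram counts; each sentence is padded with <s> and </s>."""
    bigrams, contexts = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        contexts.update(tokens[:-1])             # counts of w_{i-1}
        bigrams.update(zip(tokens, tokens[1:]))  # counts of (w_{i-1}, w_i)
    return bigrams, contexts

def sentence_probability(sentence, bigrams, contexts):
    """P(sentence) = product of P(w_i | w_{i-1}) = c(w_{i-1}, w_i) / c(w_{i-1})."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        if bigrams[(prev, cur)] == 0:
            return 0.0  # unseen bi-gram: zero probability without smoothing
        p *= bigrams[(prev, cur)] / contexts[prev]
    return p

bg, ctx = train_bigram(["Mary loves that person", "Mary likes movies"])
p = sentence_probability("Mary loves that person", bg, ctx)
# P(Mary|<s>) = 2/2 and P(loves|Mary) = 1/2; the remaining factors are 1, so p == 0.5
```

The early return for unseen bi-grams makes concrete the zero-probability problem discussed next, which smoothing techniques are designed to correct.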

N-grams have been criticized because they explicitly ignore any dependencies on words

more than n − 1 positions back. Furthermore, a newly observed sequence typically will have occurred

rarely or not at all in the training corpus. In particular, when modeling the joint distribution of

a sentence, the n-gram model described above would assign a zero probability to those word

sequences that were not encountered in the training corpus. This is an inherent problem of n-

gram training, related to data sparsity and to Out-Of-Vocabulary (OOV) words. Smoothing

techniques address this problem by adjusting the probability estimates for unseen data (Jurafsky & Martin 2014).

Finally, n-grams, as any other statistical language model, suffer from the impact of the curse

of dimensionality. This arises when a huge number of different combinations of values of the


input variables must be discriminated from each other, and at least one example per relevant

combination of values is needed (Bishop 2006).

2.1.3 Word embedding

Traditional statistical approaches are not able to capture information about the meaning of a

word or about its context. This means that potential relationships, such as contextual close-

ness, are not captured across collections of words. For example, neither the BOW, nor the

n-gram model can capture simple relationships, such as determining that the words dog and

cat both refer to animals that are often discussed in the context of household pets. Also, tra-

ditional approaches become impractical when dealing with a very large corpus, leading to a

large feature dimension and a sparse representation. Word embedding is a computationally-

efficient model for learning distributed representations of words that preserve linear regular-

ities (Mikolov, Sutskever, Chen, Corrado & Dean 2013). Differently from statistical language

models (e.g., n-grams) or count-based continuous vector representations (e.g., Latent Seman-

tic Analysis (LSA) (Dumais 2004)), word embeddings are learned by a neural network model.

In this way, the learning algorithm is exploited to discover the features that best characterize

the meaning of a word. These may include grammatical features like gender and number,

as well as semantic features like animate or invisible. These features, which are not mutually exclusive,

are continuous-valued. Consider that each word corresponds to a point in a feature space.

The goal of the learning algorithm, then, is to associate each word with a multidimensional

continuous-valued vector representation wherein each dimension corresponds to a semantic

or grammatical characteristic of words. The idea is that, in this feature space, semantically

similar words are closer to each other. This means that words such as dog and cat should have

similar word vectors to the word pet, whereas the word banana should be quite distant. This

forces the learned word features to correspond to a form of semantic and grammatical similar-

ity, and helps the neural network to compactly represent them. A sequence of words can then

be transformed into a sequence of learned feature vectors. The neural network learns to map

that sequence of feature vectors to a prediction of interest, such as the probability distribution

over the next word in the sequence (Bengio 2008).

Mikolov (Mikolov, Chen, Corrado & Dean 2013) introduced two model architectures for

learning word embeddings that try to minimize computational complexity, the continuous Bag-

of-Words model (CBOW) and the continuous Skip-Gram model. Algorithmically, these models

are similar as they both rely on a training method composed of two steps. First, continuous

word vectors are learned using a simple model; then, an n-gram neural network language model


(NNLM) is trained on top of these distributed representations of words. The main difference

between the two models proposed by Mikolov is that the CBOW model predicts a target word

from previous and future context words, while the skip-gram model predicts context words

within a certain range before and after the current word. By training high dimensional word

vectors on a large amount of data, the resulting vectors can be used to uncover subtle semantic

relationships between words. This could be achieved by performing simple algebraic opera-

tions with the vector representations of words, as in the following example:

Paris − France + Italy = Rome

That is, by knowing that the capital of France is Paris, it is possible to query which is the capital

of Italy.
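The analogy above can be illustrated with a toy nearest-neighbor search in vector space. The 3-dimensional vectors below are hand-crafted for the example, not learned embeddings; a real system would use high-dimensional vectors trained with, e.g., the CBOW or skip-gram models.

```python
import math

# Hand-crafted vectors, constructed so that capital - country points in a
# roughly constant direction, mimicking the regularity learned from data.
vectors = {
    "paris":  [0.1, 0.9, 0.2],
    "france": [0.9, 0.1, 0.2],
    "italy":  [0.9, 0.1, 0.7],
    "rome":   [0.1, 0.9, 0.7],
    "banana": [0.3, 0.2, 0.1],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def analogy(a, b, c, vectors):
    """Word whose vector is most similar to v(a) - v(b) + v(c), excluding a, b, c."""
    query = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

answer = analogy("paris", "france", "italy", vectors)  # -> "rome"
```

The key design choice is that the semantic relation (country, capital) corresponds to an approximately constant offset between vectors, which is exactly the linear regularity that word embeddings are observed to preserve.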

2.2 Spoken language processing

From a general point of view, spoken language processing could be considered an area of NLP,

as speech applications are often integrated into natural language tasks, as in the case of speech

recognition and conversational agents. However, speech processing is actually an area with a

strong component of digital signal processing, that incorporates knowledge from the electrical

engineering and the linguistic fields. In fact, a fundamental part of speech processing deals

with the extraction of useful information from speech signals and with the identification of

efficient representations for these data. The information extracted from speech signals is used

for a wide range of applications, which include analysis of pathological voices and biometrics

identification.

In the remainder of this section, two important areas of speech processing are described.

Section 2.2.1 reports on the many types of information that can be extracted from speech sig-

nals, used both in speech recognition and in the analysis of voice quality. The section con-

cludes with a brief introduction to an area of speech processing dedicated to the creation of

speaker models typically applied in the tasks of speaker identification and recognition. Then,

Section 2.2.2 is devoted to detailing the main building blocks of a conventional automatic

speech recognizer.

2.2.1 Speech analysis and speaker characterization

Speech analysis is an area of speech processing dedicated to providing a quantitative evaluation

of voice quality. It is typically used in paralinguistic or clinical speech studies to detect physio-

logical changes in voice production. In fact, there are many medical conditions that adversely


affect the voice. Diseases of the larynx are normally associated with breathiness and hoarse-

ness of the produced voice. Stroke or neurodegenerative diseases may cause the production

of inconsistent speech sounds (e.g., apraxia of speech) or weakness of the muscles involved

in speech production (e.g., dysarthria). Overall, voice quality is assessed through the analy-

sis of phonation, articulation, and prosody. The study of phonation analyzes the alterations

that occur in the vocal folds vibration process (e.g., respiration). Measures of articulation as-

sess modifications that may happen in the positioning or shape of the speech organs (e.g.,

tongue, lips). Finally, the study of prosody examines variations of loudness, intonation, and

timing. Computationally, phonation, articulation, and prosody are measured through spectral

and temporal parameters of speech.

Speaker characterization is an area of speech processing whose goal is to identify or verify

the identity of a person based upon his/her voice. It is typically used in security, medical, and

forensic applications. In fact, the impressive advancements that have occurred in recent years

in voice-based solutions have raised a growing interest in the automatic verification of a speaker’s

identity. Among the new consumer applications based on speech, one should also consider the

digital cloning of voice characteristics and voice conversion. While the latter offers new solu-

tions for privacy protection, it also brings the possibility of misusing the technology to

spoof someone’s identity. Voice is a central part of our identity and offers a low-cost biometric

solution for the authentication of a person. This is due to the fact that speech carries impor-

tant information about a speaker, such as gender, age, language, and dialect. Various acoustic

features can be used to model and characterize a speaker’s identity. After describing some of

these measures in Section 2.2.1.1, a brief introduction to the subject of speaker characterization

is provided in Section 2.2.1.2.

2.2.1.1 Speech analysis

In the remainder of this section, first, a very short review of the human speech production sys-

tem is reported for completeness. Then, some features that are widely used in speech analysis

are introduced.

Speech production

On a physiological level, speech production begins in the lungs which contract and push out

air that flows through the larynx and the glottis, the orifice between the vocal folds. Airflow

then proceeds through the pharynx, into the mouth between the tongue and palate, and is

finally emitted through the lips and the nose. A schematic representation of the human speech



Figure 2.3: (a) A schematic diagram of the human speech production apparatus. (b) Waveform of /sees/, showing a voiceless phoneme /s/, followed by a voiced sound, the vowel /iy/. The final sound, /z/, is a type of voiced consonant (Huang et al. 2001).

apparatus is shown in Figure 2.3(a).

Upper and lower lips, upper and lower teeth, tongue, and roof of the mouth are among the

major articulators contributing to the production of different sounds. Sounds can be classified

into subgroups with particular properties according to the speech production apparatus and to

the position and motion of the articulators. When the vocal folds are held close together and

oscillate against one another during a speech sound, the sound is said to be voiced. When the

folds are too slack or tense to vibrate periodically, the sound is said to be unvoiced. Voiced

sounds include vowels; their time and frequency structure presents a roughly regular pattern

that voiceless sounds, such as consonants, lack. A voiced and an unvoiced sound are shown

in Figure 2.3(b). Articulation takes place in the mouth, between the oral cavity, which acts as

a resonator, and the articulators. The place and manner of articulation make it possible to differentiate

most speech sounds.

Fundamental frequency, formants, harmonics

The fundamental frequency, usually known as F0, corresponds to the rate of cycling (opening

and closing) of the vocal folds in the larynx during phonation of voiced sounds. The funda-

mental frequency is the lowest frequency of a speech signal, and it is usually perceived as the

loudest. It contributes more than any other single factor to the perception of pitch in speech,

the semi-musical rising and falling of voice tones.

The periodic glottal wave consists of the fundamental frequency and a number of harmonics

that are integral multiples of F0. When the shape of the vocal tract changes, the harmonics

present in the sound also change. More closure in the vocal folds will create stronger, higher

harmonics. The harmonics are not all of equal intensity, for example the vowel /a/ typically

has more energy than the vowel /o/ or /i/. Regions of frequency space where speech sounds



Figure 2.4: Schematic representation of the extraction of a sequence of 39-dimensional MFCC feature vectors.

carry a lot of energy are known as formants. They arise from the resonances of the vocal tract,

which filters the original sound source. Speakers change the resonance frequencies by moving the articulators and

thereby changing the dimensions of the resonance cavities in the vocal tract. The first two for-

mants, F1 and F2, are, from a linguistic point of view, the most important, since they uniquely

identify or characterize the vowels.
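
As an illustration of the concepts above, the following sketch estimates F0 with the classical autocorrelation method. The sampling rate, the search range, and the synthetic frame are hypothetical values chosen for this example, not material from the thesis.

```python
# Illustrative F0 estimation by locating the autocorrelation peak within a
# plausible pitch range. All constants here are example values.
import numpy as np

def estimate_f0(frame, fs, fmin=75.0, fmax=400.0):
    """Return an F0 estimate in Hz from the autocorrelation peak."""
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(fs / fmax)                 # shortest period considered
    lag_max = min(int(fs / fmin), len(ac) - 1)
    lag = lag_min + np.argmax(ac[lag_min:lag_max + 1])
    return fs / lag

fs = 16000
t = np.arange(int(0.04 * fs)) / fs           # 40 ms synthetic voiced frame
frame = np.sin(2 * np.pi * 120.0 * t)        # 120 Hz tone standing in for F0
print(estimate_f0(frame, fs))                # close to 120 Hz
```

In practice the autocorrelation is computed on short windowed frames of real speech, and post-processing (e.g., octave-error correction) is usually added.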

Mel-frequency cepstral coefficients

Probably the most widely known parameterization of the speech signal, used both in speech

recognition and in speech analysis, is that of Mel-frequency cepstral coefficients (MFCCs) (Davis &

Mermelstein 1980, Mermelstein 1976). They are of particular relevance because they provide

the ability to separate the vocal tract filter (the position of the tongue and the other articulators)

from information about the glottal source (the energy of the lungs). To compute the MFCC, first

spectral information is extracted from a speech sample through the Discrete Fourier Transform

(DFT). Then, the result of the DFT, which corresponds to the amount of energy at each fre-

quency band, is warped onto the mel scale. A mel (Stevens & Volkmann 1940, Stevens et al.

1937) is a unit of pitch defined so that pairs of sounds which are perceptually equidistant in

pitch are separated by an equal number of mels. The mapping between frequency in Hertz and

the mel scale is linear below 1000 Hz and logarithmic above 1000 Hz. This conversion

approximates the sensitivity of the human ear, which is less sensitive at higher frequencies,

roughly above 1000 Hz. The next step in MFCC feature extraction consists of taking the log of

each of the mel spectrum values. This is motivated by the fact that, in general, human response

to signal level is logarithmic. Humans are less sensitive to slight differences in amplitude at

high amplitudes than at low amplitudes. Finally, the spectrum of the log spectrum is com-

puted through the inverse Discrete Fourier Transform (IDFT). The result of this transformation

is called the cepstrum.

For the purposes of MFCC extraction, generally, the first 12 cepstral values are considered.

These 12 coefficients will represent information solely about the vocal tract filter, cleanly sepa-


rated from information about the glottal source. For speech recognition, feature vectors of 39

coefficients are usually computed. The first twelve features are the MFCCs, whereas the thir-

teenth feature corresponds to the energy of the frame. This is computed as the sum over time

of the power of the samples in the frame. Then, for each of the 13 features, a delta (or veloc-

ity) feature and a double delta (or acceleration) feature are added. Each of the delta features

represents the change between frames in the corresponding cepstral feature, while each of the

delta delta features represents the change between frames in the corresponding delta features.
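
The pipeline just described can be sketched in a few lines. This is a minimal illustrative implementation, not the thesis code: the filterbank size, the toy frame, and the use of a DCT-II in place of the IDFT are common choices assumed here. It yields 12 cepstral coefficients plus the frame energy for a single frame.

```python
# Minimal MFCC sketch: DFT -> mel filterbank -> log -> DCT-II, plus energy.
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, fs, n_filters=26, n_ceps=12):
    spectrum = np.abs(np.fft.rfft(frame)) ** 2            # power spectrum
    freqs = np.fft.rfftfreq(len(frame), 1.0 / fs)
    # Triangular filters equally spaced on the mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2), n_filters + 2)
    hz_pts = mel_to_hz(mel_pts)
    fbank = np.zeros(n_filters)
    for i in range(n_filters):
        lo, mid, hi = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        up = np.clip((freqs - lo) / (mid - lo), 0, 1)
        down = np.clip((hi - freqs) / (hi - mid), 0, 1)
        fbank[i] = np.sum(spectrum * np.minimum(up, down))
    logmel = np.log(fbank + 1e-10)                        # log mel energies
    n = np.arange(n_filters)
    ceps = np.array([np.sum(logmel * np.cos(np.pi * k * (2 * n + 1)
                                            / (2 * n_filters)))
                     for k in range(1, n_ceps + 1)])      # 12 coefficients
    energy = np.log(np.sum(frame ** 2) + 1e-10)           # frame energy
    return np.concatenate([ceps, [energy]])

fs = 16000
frame = np.sin(2 * np.pi * 300 * np.arange(400) / fs)     # 25 ms toy frame
feats = mfcc_frame(frame, fs)
print(feats.shape)  # (13,)
```

Delta and delta-delta features would then be obtained by differencing these 13-dimensional vectors across consecutive frames, giving the 39-dimensional vectors described above.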

Linear Prediction Coefficients

Another approach frequently used in speech analysis is Linear Prediction Coefficients

(LPC) (Atal & Schroeder 1968, Burg 1967, Itakura & Saito 1968, Makhoul 1973, Markel & Gray

1973). This method represents the spectral envelope of a speech signal in a compressed

form, providing an accurate estimation of speech parameters. The basic idea behind LPC is the

source-filter model. The vocal cords produce the sound source, while the vocal tract constitutes

the acoustic filter. In particular, LPC assumes the vocal tract can be approximated as a loss-less

tube characterized by its resonances, which give rise to formants. The transfer function of a

loss-less tube can be described by an all-pole linear filter. With a sufficient number of poles,

this type of filter is a good approximation for speech signals. LPC derives its name from the

fact that it predicts the current sample as a linear combination of past samples. The predictor

coefficients are estimated using short-term analysis. A segment of speech is selected in the

proximity of a given sample, then the LPC coefficients are estimated using the criterion of min-

imum mean squared error. The corresponding coefficients are those that minimize the total

prediction error. LPC analyzes the speech signal by estimating the formants, removing their

effects, and then estimating the intensity and frequency of the remaining signal. The process of

removing the formants is called inverse filtering, and the remaining signal after the subtraction

of the filtered modeled signal is called the residue.
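
The minimum mean squared error estimation mentioned above can be sketched with the autocorrelation method and the Levinson-Durbin recursion. This is an illustrative implementation under assumed conventions (a[0] = 1, prediction over the previous `order` samples), not the thesis code; the AR(2) test signal is a toy stand-in for a speech frame.

```python
# LPC by the autocorrelation method with Levinson-Durbin recursion.
import numpy as np

def lpc(signal, order):
    """Return (a, err): coefficients with a[0] = 1 minimizing the mean
    squared error of the model sum_k a[k] * s[n - k] = e[n]."""
    n = len(signal)
    r = np.correlate(signal, signal, mode="full")[n - 1:n + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                        # reflection coefficient
        a_new = a.copy()
        a_new[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a_new[i] = k
        a = a_new
        err *= 1.0 - k * k                    # remaining prediction error
    return a, err

# Synthetic AR(2) "vocal tract": s[n] = 1.3 s[n-1] - 0.6 s[n-2] + e[n]
rng = np.random.default_rng(0)
e = rng.standard_normal(4000)
s = np.zeros(4000)
for n in range(2, 4000):
    s[n] = 1.3 * s[n - 1] - 0.6 * s[n - 2] + e[n]
a, err = lpc(s, order=2)
print(np.round(a, 2))   # close to [1, -1.3, 0.6]
```

The inverse filter defined by `a` removes the estimated formant structure; applying it to the signal leaves the residue discussed above.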

Perceptual Linear Prediction coefficients

The short term power spectrum of speech estimated by LPC is widely used as it is a simple and

effective way of estimating the main parameters of speech signals. However, one of the main

disadvantages of LPC is that it approximates all frequencies of the analysis band in the same

way. This property is inconsistent with human hearing, as beyond about 800 Hz the spectral

resolution of hearing decreases with frequency. Additionally, hearing is more sensitive in the

middle frequency range of the audible spectrum. As a consequence, LPC does not always preserve


Figure 2.5: Jitter and Shimmer perturbation measures in a speech signal (Teixeira & Fernandes 2014).

or discard spectral details according to their auditory prominence. Perceptual Linear Prediction

coefficients (PLP) (Hermansky 1990) overcome this drawback by applying a critical-band filtering

over the linear predictive analysis. The power spectrum is first warped onto a Bark scale,

then the Bark-scaled spectrum is convolved with the power spectrum of the critical-band filter.

This simulates the frequency resolution of the ear, which is approximately constant on the Bark

scale. In this way, PLP approximate the behavior of the human auditory system.
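
The Bark warping mentioned above can be illustrated with Traunmüller's (1990) approximation; this helper is an assumption for illustration, not part of the PLP recipe cited in the text.

```python
# Hertz -> Bark conversion (Traunmüller's approximation), the kind of
# warping PLP applies before critical-band filtering.
import numpy as np

def hz_to_bark(f):
    f = np.asarray(f, dtype=float)
    return 26.81 * f / (1960.0 + f) - 0.53

print(hz_to_bark([100.0, 1000.0, 8000.0]))  # roughly 0.8, 8.5, 21.0 Bark
```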

Jitter, shimmer

Jitter and shimmer are two measures of period perturbation commonly used in a comprehen-

sive voice examination (Brockmann et al. 2011, Farrus et al. 2007, Kreiman & Gerratt 2003, Silva

et al. 2009, Titze & Martin 1998). They assess the micro-instability of vocal fold vibrations by

representing the variability of the fundamental frequency from one cycle to the next. As shown

in Figure 2.5, the jitter measures the variations of the fundamental glottal period, while the

shimmer measures the variations of the fundamental glottal period amplitudes. The jitter is

affected mainly by the lack of control of vibration of the vocal cords. The voices of patients with

pathologies often have higher values of jitter. The shimmer changes with the reduction of glot-

tal resistance and mass lesions on the vocal cords and it is correlated with the presence of noise

emission and breathiness (Teixeira et al. 2013). It is expected that patients with pathologies

have higher values of shimmer.
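
The cycle-to-cycle definitions above can be sketched as follows: local jitter as the mean absolute difference between consecutive glottal periods relative to the mean period, and local shimmer analogously over cycle peak amplitudes. These are the common "local" variants, assumed here for illustration; a real analysis would first extract the periods and amplitudes from the waveform, and the example values are hypothetical.

```python
# Local jitter and shimmer from per-cycle measurements.
import numpy as np

def local_jitter(periods):
    periods = np.asarray(periods, dtype=float)
    return np.mean(np.abs(np.diff(periods))) / np.mean(periods)

def local_shimmer(amplitudes):
    amplitudes = np.asarray(amplitudes, dtype=float)
    return np.mean(np.abs(np.diff(amplitudes))) / np.mean(amplitudes)

periods = [0.0100, 0.0102, 0.0099, 0.0101]   # seconds per glottal cycle
amps = [0.80, 0.78, 0.82, 0.79]              # cycle peak amplitudes
print(round(100 * local_jitter(periods), 2), "% jitter")
print(round(100 * local_shimmer(amps), 2), "% shimmer")
```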

Vowel space area

The Vowel Space Area (VSA) is an acoustic index commonly used to assess the ability to prop-

erly articulate vowels (Kent & Kim 2003, Kuhl et al. 1997, Vorperian & Kent 2007). Some speech


Figure 2.6: Quadrilateral and triangular vowel space area for healthy subjects. The picture was extracted from the work of Vizza et al. (2017) on the use of speech signals for studying sclerosis.

disorders, such as dysarthria, may be characterized by a centralization of vowels. This

phenomenon is associated with a reduction in the amount of movement of the tongue in pro-

nouncing the vowel. As a result, a centralized vowel is closer to the midpoint of the vowel

space than its referent vowel. In fact, due to centralization problems, vowels that normally

possess a high center formant frequency tend to have a lower frequency, while vowel formants

that normally have a low center frequency tend to have a higher frequency.

VSA refers to the two-dimensional area bounded by the lines connecting the coordinate

vertices of different vowels. The coordinates are obtained by plotting the F1 frequency as a

function of the F2 frequency. When the vowels /i/, /u/, and /A/ are considered, one is dealing

with a triangular vowel space. When the vowels /i/, /u/, /A/, and /æ/ are considered, the

resulting shape is a quadrilateral. A representation of a quadrilateral and triangular vowel

space area of healthy adults is shown in Figure 2.6.
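
The area just described can be computed with the shoelace formula over the (F2, F1) vertices. The formant values below are hypothetical adult values chosen for illustration, not data from the thesis.

```python
# Triangular VSA as the polygon area of the corner-vowel formant coordinates.
import numpy as np

def vowel_space_area(f1, f2):
    """Area (in Hz^2) enclosed by vertices (f2[i], f1[i]), shoelace formula."""
    f1, f2 = np.asarray(f1, float), np.asarray(f2, float)
    return 0.5 * abs(np.dot(f2, np.roll(f1, -1)) - np.dot(f1, np.roll(f2, -1)))

# Hypothetical formant values for /i/, /u/, /A/
f1 = [270.0, 300.0, 730.0]
f2 = [2290.0, 870.0, 1090.0]
print(vowel_space_area(f1, f2))  # area in Hz^2
```

Centralized vowels move all three vertices toward the middle of the F1-F2 plane, shrinking this area.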

Vowel articulation index

Although the VSA should reflect changes in articulatory function, several studies found that

it failed to differentiate individuals perceptually judged to have abnormal articulation or poor

speech intelligibility (Ansel & Kent 1992, Bunton & Weismer 2001, Sapir et al. 2007, Weismer

et al. 2001). One possible explanation could be the large inter-speaker variability associated

with vowel formant measurements in general. To cope with this problem, Sapir (2006) intro-

duced the Vowel Articulation Index (VAI), an acoustic metric of vowel formant production, de-

signed to minimize the effects of inter-speaker variability and maximize sensitivity to formant

centralization and decentralization. The VAI can be calculated with the following formula:


(F2/i/ + F1/A/) / (F2/u/ + F2/A/ + F1/i/ + F1/u/), where F2/i/ refers to the second formant of

the vowel /i/, F1/A/ refers to the first formant of the vowel /A/, and so on.

The vowel-formant elements in the VAI ratio are arranged such that elements in the nu-

merator (F2/i/, F1/A/) will decrease, while elements in the denominator (F2/u/, F2/A/, F1/i/,

F1/u/) will increase in the case of vowel centralization. In American English, the normal VAI

values are expected to be close to 1.0, as the sum of formant frequencies in the denominator is

very similar to the sum of formant frequencies in the numerator. For this reason, the VAI may

be considered a function that normalizes the relationships between the vowels across speakers.
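
The VAI formula above transcribes directly to code; the formant values in the example are hypothetical and serve only to show that typical, non-centralized formants give a ratio near 1.0.

```python
# Direct transcription of the VAI formula from the text.
def vai(f2_i, f1_a, f2_u, f2_a, f1_i, f1_u):
    return (f2_i + f1_a) / (f2_u + f2_a + f1_i + f1_u)

# Hypothetical formant values (Hz); centralization pulls the ratio down
print(round(vai(f2_i=2290, f1_a=730, f2_u=870, f2_a=1090,
                f1_i=270, f1_u=300), 2))
```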

2.2.1.2 Speaker characterization

In the area of speaker recognition, speaker modeling techniques aim at building dependable

speaker models in order to establish or confirm the identity of a speaker from a speech signal.

These approaches are used in the resolution of two related problems, speaker verification and

identification. While the former aims at verifying the truthfulness of a claim of identity, the latter

consists of establishing the identity of an unknown speaker from a voice sample. Speaker verifi-

cation requires an enrollment phase to create the speaker model that is then used during the

verification phase. Speaker identification, on the other hand, requires a set of several labeled

speaker models that are used during the comparison with the unknown voice sample. For both

problems, the creation of the speaker models comprises a phase of preprocessing and feature

extraction analogous to the ones described in the previous sections. MFCCs are the features

most typically used to characterize the speaker models.

There are several approaches to the problem of speaker modeling; among them, the GMM-

UBM framework is briefly introduced in the following section. This method has been exten-

sively used to establish the identity of a user, primarily in the area of speaker (Campbell et al.

2009, Liu et al. 2006, Zheng et al. 2004), language (Torres-Carrasquillo et al. 2004, Wong & Srid-

haran 2002, Yin et al. 2006), and emotion recognition (Bao et al. 2007, Kockmann et al. 2011, Wu

et al. 2006). Recently, it has been exploited also for modeling speech disorders (Bocklet et al.

2011, Orozco-Arroyave et al. 2016).

GMM-UBM

Reynolds (2009a) defined a Universal Background Model (UBM) as a model used in a biometric

verification system to represent general, person-independent feature characteristics. These fea-

tures are then compared against a model of person-specific feature characteristics to make an

accept or reject decision. Typically, in a speaker verification system, the UBM is modeled by a


Gaussian Mixture Model (GMM) trained with the Expectation-Maximization (EM) algorithm.

This method is used to find maximum likelihood parameters of a statistical model. In order to

represent general speech characteristics, the UBM is typically created using the speech samples

from a large number of speakers. In this case, it is important that the subpopulations compos-

ing the data be balanced. For example, when using gender-independent data, one should

be sure there is a balance of male and female speech. Otherwise, the final model will be biased

towards the dominant subpopulation. Another approach considers the training of individual

UBMs over the subpopulations in the data, such as one for male and one for female speech,

and then combine the subpopulation models together. This method provides the advantages

that one can effectively use unbalanced data and can carefully control the composition of the

final UBM.

A speaker-dependent GMM, on the other hand, may be trained using the speech sam-

ples of a particular enrolled speaker. Alternatively, a speaker-dependent GMM may be de-

rived by adapting the parameters of the UBM using the speaker’s personal training data and

Maximum a Posteriori (MAP) estimation. This last approach provides a tighter coupling be-

tween the speaker’s model and the UBM, resulting in better performance than decoupled mod-

els (Reynolds et al. 2000). Then, in the case of speaker identification, each test segment is scored

against all speaker models to determine who is speaking. In the case of speaker verification,

each test segment is scored against the background model and a given speaker model to accept

or reject an identity claim. Notably, the GMM-UBM paradigm is the basis for some of the

most successful developments in the field of speaker characterization, including factor analysis

methods, such as i-vectors (Dehak et al. 2010).
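
The scoring scheme described in this section can be sketched compactly with a diagonal-covariance GMM: a fixed UBM, MAP adaptation of its means from enrollment frames, and an average log-likelihood ratio for a test segment. This is an illustration under assumed toy parameters (2 components, 2 dimensions, relevance factor tau), not the thesis implementation.

```python
# GMM-UBM sketch: UBM likelihoods, MAP mean adaptation, and LLR scoring.
import numpy as np

def log_gmm(x, w, mu, var):
    """Per-frame log-likelihood under a diagonal-covariance GMM."""
    diff = x[:, None, :] - mu[None, :, :]                  # (n, m, d)
    ll = (-0.5 * np.sum(diff ** 2 / var + np.log(2 * np.pi * var), axis=2)
          + np.log(w))                                     # (n, m)
    mx = ll.max(axis=1, keepdims=True)
    return mx[:, 0] + np.log(np.exp(ll - mx).sum(axis=1))  # log-sum-exp

def map_adapt_means(x, w, mu, var, tau=10.0):
    """MAP adaptation of the UBM means from enrollment frames x."""
    ll = (-0.5 * np.sum((x[:, None, :] - mu) ** 2 / var
                        + np.log(2 * np.pi * var), axis=2) + np.log(w))
    post = np.exp(ll - ll.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)                # responsibilities
    n_k = post.sum(axis=0)                                 # soft counts
    ex = post.T @ x / np.maximum(n_k, 1e-8)[:, None]       # data means
    alpha = (n_k / (n_k + tau))[:, None]                   # adaptation weight
    return alpha * ex + (1 - alpha) * mu

rng = np.random.default_rng(0)
w = np.array([0.5, 0.5])
mu = np.array([[0.0, 0.0], [3.0, 3.0]])                    # toy UBM means
var = np.ones((2, 2))
enroll = rng.normal([1.0, 1.0], 1.0, size=(200, 2))        # speaker's data
mu_spk = map_adapt_means(enroll, w, mu, var)
test_same = rng.normal([1.0, 1.0], 1.0, size=(50, 2))
score = np.mean(log_gmm(test_same, w, mu_spk, var)
                - log_gmm(test_same, w, mu, var))
print(score > 0)   # True: the identity claim would be accepted
```

A verification system would compare this score against a threshold tuned on development data; identification would instead score the segment against all enrolled speaker models.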

2.2.2 Automatic speech recognition

Traditional speech recognition systems do not actually perform the recognition or decoding

step directly on the speech signal. Rather, the recognition process is composed of several

phases: preprocessing and feature extraction, acoustic modeling, language modeling, and the

decoding phase. The overall process and its major components are represented in Figure 2.7,

and described hereafter.

Preprocessing and feature extraction

Initially, the speech waveform is preprocessed in order to enhance it and better prepare it for

the following phases. Typical preprocessing approaches may include preemphasis, which cor-

responds to boosting the energy in the high frequencies, and noise reduction. More advanced


Figure 2.7: Schematic representation of the main modules constituting an automatic speech recognizer.

techniques may consider background and overlapped speech removal. Then, the input signal

is divided into short frames of samples, which are converted to a meaningful set of features.

The duration of the frames is selected so that the speech waveform can be regarded as being

stationary. The goal of the feature extraction step is to derive, from each frame, a parameterised

version of the input signal that captures its important qualities while discarding unimportant

and distracting characteristics. More in detail, features should be robust against noise, acoustic

variations, and other events that are irrelevant for the recognition process. Features should also

be sensitive to linguistic context, making it possible to distinguish between different linguistic units (e.g.,

phones). Features typically used in speech recognition are the MFCC or PLP, introduced in Sec-

tion 2.2.1.1.

Acoustic model

The next stage in the recognition process is to map the speech vectors produced in

the previous step to the underlying sequence of acoustic classes modeling concrete symbols

(such as phonemes, letters, and words). Acoustic modeling is arguably the central part of

any speech recognition system, playing a critical role in improving ASR performance. The

practical challenge is how to build accurate acoustic models that can truly reflect the spoken

language to be recognized. Typically, subword models like phonemes, diphones, or triphones

are used as the acoustic modeling unit more often than whole-word models. An extended

and successful statistical parametric approach to speech recognition is the Hidden Markov


Model paradigm (Rabiner 1989, Rabiner et al. 1993) that supports both acoustic and tempo-

ral modeling. HMMs model the sequence of feature vectors as a piecewise stationary pro-

cess. An utterance X = x1, ..., xn, ..., xN is modeled as a succession of discrete stationary states

Q = q1, ..., qk, ..., qK, with K < N, and instantaneous transitions between these states. An HMM is

typically defined as a stochastic finite state automaton, usually with a left-to-right topology. It

is called a "hidden" Markov model because the underlying stochastic process (the sequence of

states) is not directly observable, but still affects the observed sequence of acoustic features.

HMMs have been successfully used in combination with Gaussian Mixture Models, a parametric

probability density function represented as a weighted sum of Gaussian component densi-

ties (Reynolds 2009b). In this approach, a GMM is associated to each state in order to describe

local characteristics of the data. For many years, the GMM-HMM paradigm represented the

dominant technology in acoustic modeling (Baker et al. 2009). Recently, deep learning methods

have been shown to outperform conventional GMM-based modeling approaches by achieving

important improvements in terms of recognition accuracy (Abdel-Hamid et al. 2012).

Language model

Knowledge of the rules of a language, the way in which words are connected together into

phrases, is expressed by the language model. It is an important building block in the recogni-

tion process, as it is used to guide the search for an interpretation of the acoustic input. There are

two types of models that describe a language: grammar-based and statistical-based language

models. When the range of sentences to be recognized is very small, it can be captured by a

deterministic grammar that describes the set of allowed phrases. In large vocabulary applica-

tions, on the other hand, it is too difficult to write a grammar with sufficient coverage of the

language; therefore a stochastic grammar, typically an n-gram model, is often used. When sub-

word models are used, the word model is then obtained by concatenating the subword models

according to the pronunciation transcription of the words provided by a dictionary or lexical

model. The purpose of a vocabulary is to map the orthography of the words to the units that

model the actual acoustic realization of the vocabulary entries. Lexicon generation may rely

on manual dictionaries or on automatic grapheme-to-phoneme modules, which may be rule-based,

data-driven, or hybrid.
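
The statistical language models mentioned above can be illustrated with a minimal bigram model estimated from a tiny toy corpus, with add-one (Laplace) smoothing so that unseen word pairs keep a nonzero probability. The corpus and sentence markers are hypothetical examples, not material from the thesis.

```python
# A tiny bigram language model with add-one smoothing.
from collections import Counter

corpus = [["<s>", "call", "the", "doctor", "</s>"],
          ["<s>", "call", "the", "nurse", "</s>"]]
unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((a, b) for sent in corpus for a, b in zip(sent, sent[1:]))
vocab = len(unigrams)                 # vocabulary size for smoothing

def p_bigram(prev, word):
    """P(word | prev) with add-one smoothing."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab)

print(p_bigram("call", "the"))        # seen bigram: relatively high
print(p_bigram("the", "call"))        # unseen bigram: small but nonzero
```

During decoding, these probabilities weight competing word sequences alongside the acoustic model scores.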

A different speech recognition task, known as keyword spotting (KWS), can be used when

the expected result of the recognition is limited to a reduced number of isolated words. This

kind of approach searches the continuous audio stream for a certain set of words of interest.

Broadly, KWS methods can be classified into two categories: based on Large Vocabulary Con-


tinuous Speech Recognition (LVCSR) or based on acoustic matching of speech with keyword

models in contrast to a background model (Szoke et al. 2005). Methods based on LVCSR search

for the target keywords in the recognition results, usually in lattices or confusion networks.

Acoustic approaches, on the other hand, are very closely related to Isolated Word Recognition

(IWR). The language model in this case contains the words that should be recognized. Ad-

ditionally, they incorporate an alternative competing model to the list of keywords generally

known as the background, garbage, or filler speech model. A robust background model must be

able to provide low recognition likelihoods for the keywords and high likelihoods for out-of-

vocabulary words in order to minimize false alarms and false rejections (Abad et al. 2013).

Decoding

The last step in the recognition process is the decoding phase, whose purpose is to find a se-

quence of words whose corresponding acoustic and language models best match the input

signal. Therefore, such a decoding process with trained acoustic and language models is of-

ten referred to as a search process. Its complexity varies according to the recognition strategy

and to the size of the vocabulary. With IWR, word boundaries are known; thus the word with

the highest forward probability is chosen as the recognized word and the search problem be-

comes a simple pattern recognition problem. Search in Continuous Speech Recognition (CSR),

on the other hand, is more complicated since the search algorithm has to consider the possi-

bility of each word starting at any arbitrary time frame. Also, for small vocabulary tasks, it is

possible to expand the whole search network defined by the language and lexical restrictions

to directly apply conventional time-synchronous Viterbi search. However, in LVCSR systems

different strategies are required. These span from graph compaction techniques and on-

the-fly expansion of the search space (Ortmanns & Ney 2000) to heuristic methods.
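
The time-synchronous Viterbi search mentioned above can be sketched over a small HMM in log probabilities. All model values below are toy numbers standing in for a real acoustic and language model; the two states could be read as two subword units in a left-to-right topology.

```python
# Time-synchronous Viterbi decoding over a toy 2-state left-to-right HMM.
import numpy as np

def viterbi(log_init, log_trans, log_obs):
    """Most likely state sequence; log_obs[t, s] = log p(x_t | state s)."""
    T, S = log_obs.shape
    delta = log_init + log_obs[0]                 # best score per state
    back = np.zeros((T, S), dtype=int)            # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_trans       # (from, to)
        back[t] = np.argmax(scores, axis=0)
        delta = scores[back[t], np.arange(S)] + log_obs[t]
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):                 # backtrack
        path.append(int(back[t, path[-1]]))
    return path[::-1]

log_init = np.log([1.0 - 1e-9, 1e-9])             # start in state 0
log_trans = np.log([[0.7, 0.3], [1e-9, 1.0]])     # left-to-right topology
log_obs = np.log([[0.9, 0.1], [0.8, 0.2],         # toy acoustic scores
                  [0.2, 0.8], [0.1, 0.9]])
print(viterbi(log_init, log_trans, log_obs))      # [0, 0, 1, 1]
```

In a real recognizer the search runs over a network composed from the language model, the lexicon, and the acoustic models, with pruning to keep it tractable.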

2.3 Machine learning

Machine learning is a field of artificial intelligence dedicated to the study of adaptive mecha-

nisms that enable computers to learn from experience, learn by example, and learn by analogy.

Learning capabilities can improve the performance of an intelligent system over time. Ma-

chine learning algorithms build a statistical model from sample data, the model is then used to

perform specific tasks, like making predictions or decisions over new, unknown data. From a

broad perspective, the first step in the definition of a machine learning problem is the identifica-

tion of a domain of interest. This corresponds to defining the feature space. Then, the following

step is to train a model able to identify, from the feature space, significant relations for the prob-


Figure 2.8: A typical workflow used in the training process of a machine learning model.

lem at hand. In practice, this process usually includes several additional intermediate stages.

In fact, defining a feature space, besides an identification of relevant features, typically involves

also stages of data preprocessing, feature extraction, and, possibly, feature selection. Then, on

these features, the actual training of the machine learning model is performed. Finally, the

trained model is evaluated to understand its ability to process new information, not previously

analyzed. These steps are visually represented in Figure 2.8.

In the following, in Section 2.3.1, some common models used in machine learning prob-

lems are introduced, followed by a brief review of different feature selection approaches (Sec-

tion 2.3.2). Then, Section 2.3.3 describes two model evaluation approaches that are used to

assess the performance of a machine learning model and two common evaluation metrics used

in the areas of classification and speech recognition.

2.3.1 Machine learning models

Machine learning approaches are usually classified into two categories: supervised and unsu-

pervised learning. In the first one, the learning task infers a function from examples of data

consisting of input-output pairs. For example, if the task consists of determining if the speech

of an individual is affected by a specific impairment, the data provided to the model would

include speech signals with an associated label designating if the clinical condition is verified

or not. Of course, the data provided to the model would include speech signals of people with

and without the impairment. Typically, supervised learning is modeled as a problem of clas-

sification or regression. Classification algorithms are used when the outputs are restricted to a

known set of values. According to the previous example, the output would be the presence or

absence of a clinical condition. Regression algorithms provide a continuous output, meaning

that any value within a given range is possible. An example would be the severity of a disease

on a scale from 0 to 30. In unsupervised learning, the learning task infers a function from ex-

amples of data containing only inputs, but no desired output labels. In this way, unsupervised

methods are able to discover structure and patterns in the data.
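The two supervised settings can be contrasted with a toy illustration (a hypothetical example with made-up numbers, assuming scikit-learn):

```python
# A classifier predicts a label from a known set, while a regressor
# predicts a continuous value such as a severity score on a 0-30 scale.
from sklearn.linear_model import LogisticRegression, LinearRegression

X = [[0.1], [0.4], [0.5], [0.9]]      # one input feature per subject
y_class = [0, 0, 1, 1]                # impairment present (1) or absent (0)
y_severity = [2.0, 8.0, 14.0, 27.0]   # hypothetical severity on a 0-30 scale

clf = LogisticRegression().fit(X, y_class)
reg = LinearRegression().fit(X, y_severity)

label = clf.predict([[0.8]])[0]       # discrete output: 0 or 1
severity = reg.predict([[0.8]])[0]    # continuous output within the range
```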

Figure 2.9: (a) Three possible hyperplanes for an SVM trained with linearly separable data; the best hyperplane is shown with a solid line. (b) The hyperplane with the maximum distance from the data points. (c) Soft margin allowing some classification errors. (d) Non-linearly separable data.

In the literature, there are many computational models used for a variety of machine learning tasks. Here, this review is limited to introducing some of the most common methods that are referred to later in this document. All of them may be used to model a classification or a

regression problem.

2.3.1.1 Support Vector Machine

Support Vector Machine (SVM) (Vapnik 1963) is a family of discriminative binary linear classi-

fiers. This means that the algorithm learns a decision function directly from the data in order

to classify it into two possible groups. To do that, the algorithm constructs a hyperplane in a

high-dimensional space, which can be used to separate the input data (Figure 2.9(a)). A good

separation is achieved by the hyperplane that maximizes the margin, that is, the distance to the nearest data point of any class. An example of a maximum margin is

depicted in Figure 2.9(b).

The traditional SVM algorithm fails to find a hyperplane when the input data is not linearly separable, as shown in Figure 2.9(d). Some extensions have been proposed to solve

this issue. The first, known as the soft margin (Cortes & Vapnik 1995), accounts for small deviations by allowing a small number of points close to the boundary to be misclassified (Figure 2.9(c)).

The number of allowed misclassifications is governed by a free parameter called the cost, which corresponds to the penalty associated with a classification error. Higher values of the cost make it less likely that the algorithm will misclassify a training point. This method, while it provides a solution for simpler problems, still fails to classify non-separable data such as that shown in Figure 2.9(d). Thus, another approach relies on the use of kernel methods (Boser et al. 1992), which perform a mapping of the original classification problem into another

metric space in which it is separable. Generally, the transformed space has a higher dimen-

sionality, with each of the dimensions being a combination of the original problem variables.

Common types of kernels used to separate non-linear data are polynomial and radial basis

25

Page 50: UNIVERSIDADE DE LISBOA INSTITUTO SUPERIOR TECNICO´

Figure 2.10: A decision tree for the concept PlayTennis. This tree classifies Saturday mornings according to whether or not they are suitable for playing tennis (Mitchell 1997).

kernels. SVM can also be used to solve multiclass classification problems. Typically, this is

achieved by reducing the original task into multiple binary classification problems (Duan &

Keerthi 2005).
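This behaviour can be illustrated with a sketch, assuming scikit-learn's SVC implementation, in which a soft-margin linear SVM and an RBF-kernel SVM are fitted to data that is not linearly separable:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric circles: not linearly separable, as in Figure 2.9(d).
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# C is the cost parameter governing the soft-margin penalty.
linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)  # fails on this data
rbf_svm = SVC(kernel="rbf", C=1.0).fit(X, y)        # kernel mapping succeeds

print(linear_svm.score(X, y))  # near chance level
print(rbf_svm.score(X, y))     # near 1.0
```

The RBF kernel implicitly maps the two-dimensional circles into a space in which a separating hyperplane exists, which is why its training accuracy far exceeds that of the linear model.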

2.3.1.2 Decision Tree

A Decision Tree (DT) is a decision support tool that uses a tree-like model. Each node in the

tree specifies a test on an attribute, and each branch descending from that node corresponds to

one of the possible outcomes of the test. Each leaf node represents a decision, or a class label.

An instance is classified by sorting it down the tree from the root to some leaf node. A visual

representation of a DT for the problem PlayTennis is shown in Figure 2.10.

In machine learning, the DT (Mitchell 1997) is used as a non-parametric supervised learning method for both classification and regression. This model predicts the

value of a target variable by learning simple decision rules inferred from the data features.

Tree models where the target variable can take a discrete set of values are called classification

trees. In these tree structures, leaves represent class labels and branches represent conjunctions

of features that lead to those class labels. Decision trees where the target variable can take con-

tinuous values are called regression trees. They are similar to classification trees, except that a

regression model is fitted to each node to give the predicted value of the target variable. Com-

mon decision tree algorithms include Iterative Dichotomiser 3 (ID3) (Quinlan 1986), C4.5 (Quinlan 2014), and Classification and Regression Trees (CART) (Breiman 1984).
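A minimal sketch, assuming scikit-learn and a simplified numeric encoding loosely inspired by the PlayTennis example (the encoding, labels, and regression targets are illustrative assumptions, not Mitchell's original data):

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Hypothetical encoded weather data:
# features = [outlook (0=sunny, 1=overcast, 2=rain), humidity (0=normal, 1=high)]
X = [[0, 1], [0, 0], [1, 1], [1, 0], [2, 1], [2, 0]]
y_play = [0, 1, 1, 1, 0, 1]      # classification tree: class labels

clf = DecisionTreeClassifier(random_state=0).fit(X, y_play)
print(clf.predict([[0, 0]]))     # a sunny day with normal humidity

# A regression tree fits a value at each leaf instead of a class label.
y_minutes = [0.0, 90.0, 60.0, 120.0, 0.0, 45.0]  # hypothetical minutes of play
reg = DecisionTreeRegressor(random_state=0).fit(X, y_minutes)
```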


2.3.1.3 Random Forest

DTs present many advantages: they are simple to understand and interpret, training is straightforward, and classification is fast. However, as with any other machine learning method, they

also include important disadvantages. The most important is related to the fact that DT

learners cannot be grown to arbitrary complexity because they may lose generalization accu-

racy on unseen data. For these reasons, Ho (1995) proposed a method to construct tree-based

classifiers with a capacity that can be arbitrarily expanded. This is achieved by building multi-

ple trees in randomly selected subspaces of the feature space. The trees built in this way gen-

eralize their classification in complementary ways, and their combined classification shows monotonic improvement. The idea of random subspace selection of Ho (1995) influenced

the design of the Random Forest (RF), later proposed by Breiman (2001). RFs are a way of building a forest of uncorrelated trees that are trained on different parts of the same dataset. To do

that, RF relies on bootstrap aggregating, a meta-algorithm in which new training sets are gen-

erated by sampling, uniformly and with replacement, from the original dataset. For classification, each tree in the ensemble votes for a class and the most popular class is chosen; for regression, the predictions of the individual trees are averaged. With

respect to standard decision trees, RFs provide reduced interpretability, but generally they

greatly boost the performance of the final model.
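The contrast between a single tree and a forest can be sketched as follows, assuming scikit-learn (the dataset and hyperparameter values are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)

tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=100,     # number of trees
                                bootstrap=True,       # sample with replacement
                                max_features="sqrt",  # random feature subspace
                                random_state=0)

tree_acc = cross_val_score(tree, X, y, cv=5).mean()     # single tree
forest_acc = cross_val_score(forest, X, y, cv=5).mean() # usually higher
print(round(tree_acc, 3), round(forest_acc, 3))
```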

2.3.1.4 Artificial Neural Network

An Artificial Neural Network (ANN) (Ivakhnenko & Lapa 1966, 1967, Rosenblatt 1958) can

be defined as an information-processing paradigm inspired by the structure and functions of

the human brain. The brain consists of a densely interconnected set of nerve cells, called neu-

rons. A neuron consists of a cell body, soma, a number of fibres called dendrites, and a sin-

gle long fibre called the axon. Electrical or chemical signals between neurons are exchanged

through synapses. An ANN consists of a number of simple and highly interconnected proces-

sors, called neurons, which are analogous to the biological neurons in the brain. In a similar

way to synapses, signals from one neuron to another are passed by weighted links connecting

neurons. A biological and an artificial neuron are depicted in Figures 2.11(a) and 2.11(b), respectively.

The simplest form of ANN is the perceptron (Rosenblatt 1958), which consists of a single neuron with adjustable synaptic weights. This type of architecture is able to solve only linearly separable problems. This limitation is overcome with more advanced forms of neural networks,

namely the Multilayer Perceptron (MLP), a feedforward neural network with one or more hid-

den layers. Typically, the network consists of an input layer of source neurons, at least one

middle layer of computational neurons, and an output layer of computational neurons. The


Figure 2.11: (a) A schematic drawing of a biological neural network with two neurons, (b) a diagram of an artificial neuron, (c) an architecture of an artificial neural network with three layers (Negnevitsky 2005).

input signals are propagated in a forward direction on a layer-by-layer basis. An example of

an ANN with three layers is provided in Figure 2.11(c). Similarly to the human brain, ANNs

are able to learn, or in other words, they are able to use experience to improve their perfor-

mance. While in a biological neural network, learning involves adjustments to the synapses,

an ANN learns through repeated adjustments of the weights. Different methods have been

proposed for learning; the most popular of them is the back-propagation algorithm (Bryson

& Ho 1969). With this approach, the network is presented with a training set of input sam-

ples. The network then propagates the input samples from layer to layer until an output is

generated by the output layer. If this outcome is different from the desired output, an error is

calculated and then propagated backwards through the network from the output layer to the

input layer. The weights are modified as the error is propagated. This process is repeated sev-

eral times until a stop condition is satisfied. The back-propagation algorithm cannot guarantee

an optimal solution and usually converges to a set of suboptimal weights. With the recent

advancements in deep learning, ANN-based architectures have gained much interest in the

area of speech processing, achieving very high performance when a large amount of data is

available for training (Abdel-Hamid et al. 2012).
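The back-propagation procedure just described can be sketched in a few lines of NumPy. This is an illustrative toy under simplifying assumptions (sigmoid activations, squared error, batch updates), not the exact formulation of the cited references:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([[0], [1], [1], [0]], dtype=float)   # desired outputs (XOR)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)     # input -> hidden weights
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)     # hidden -> output weights

initial_loss = float(np.mean((sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2) - t) ** 2))

lr = 0.5                                          # learning rate
for epoch in range(5000):
    # Forward pass: propagate the inputs layer by layer.
    h = sigmoid(X @ W1 + b1)
    y = sigmoid(h @ W2 + b2)
    # Backward pass: propagate the error from the output layer back.
    delta_out = (y - t) * y * (1 - y)
    delta_hid = (delta_out @ W2.T) * h * (1 - h)
    # Modify the weights as the error is propagated.
    W2 -= lr * h.T @ delta_out
    b2 -= lr * delta_out.sum(axis=0)
    W1 -= lr * X.T @ delta_hid
    b1 -= lr * delta_hid.sum(axis=0)

final_loss = float(np.mean((y - t) ** 2))
print(initial_loss, final_loss)   # the training error decreases over epochs
```

XOR is used here because it is the classic example of a problem a single perceptron cannot solve, while an MLP with one hidden layer can.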

2.3.2 Feature selection

In the machine learning area, there are two common approaches for identifying the set of fea-

tures used to train a model: a traditional one, in which the features are carefully selected based on previous knowledge of the problem under assessment, and a brute-force approach, in which thousands of general-purpose features are typically extracted. In both cases, after feature extraction, the features should be evaluated according to their relevance for the prob-

lem under examination and, consequently, a reduced number of them should be selected. In

fact, identifying a relevant subset of features typically improves the performance of the model,

leading to shorter training times and an enhanced generalization ability.


The process of feature selection may be described as a search technique for proposing new

feature subsets. This implies an evaluation measure to score the different subsets. The simplest

algorithm may perform an exhaustive search of the feature space in order to find the feature

set that minimizes the error rate. Except for very small feature sets, this is typically computa-

tionally intractable and a metaheuristic algorithm is often used. The choice of the evaluation

metric strongly influences the algorithm, and allows distinguishing among three main cate-

gories of feature selection approaches: filters, wrappers, and embedded methods (Guyon &

Elisseeff 2003).

• Filter feature selection methods score each feature independently from the others with

a statistical measure. Typically, the statistical measure evaluates the correlation of the

feature with the outcome variable. The features are then ranked by the score obtained

and either selected to be kept or removed from the feature set. In this way, the process

of features selection is independent of any machine learning algorithm. According to

the type of problem, several methods may be used to evaluate the features. Usually, for

continuous values, the Pearson’s correlation coefficient (Pearson 1895) is used, but other

methods include ANOVA (Fisher 1919) or the Chi-Square test (Pearson 1992).

• Wrapper methods consider the selection of a set of features as a search problem, evaluating and comparing different combinations of features. Each feature subset is

scored using a predictive model. For this reason, wrapper methods are computationally

very intensive, but usually provide the best performing feature set for the model con-

sidered. Some common examples of wrapper methods are: sequential forward feature

selection (Pudil et al. 1994), backward feature elimination (Pudil et al. 1994), and recur-

sive feature elimination.

• Embedded methods learn which features best contribute to the accuracy of the model

while the model is being created. The most common type of embedded feature selec-

tion methods are regularization methods. These include least absolute shrinkage and

selection operator (LASSO) regression (Santosa & Symes 1986, Tibshirani 1996) and Elas-

tic net (Zou & Hastie 2005), which have built-in penalization functions to discard some

coefficients and thus reduce the complexity of the model.
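As a brief sketch of a filter and an embedded method from the categories above, assuming scikit-learn (the dataset and parameter values are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Lasso

# 10 candidate features, only 3 of which actually inform the target.
X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=1.0, random_state=0)

# Filter: score each feature independently and keep the best k.
filter_selector = SelectKBest(f_regression, k=3).fit(X, y)
kept = filter_selector.get_support(indices=True)

# Embedded: LASSO's L1 penalty shrinks irrelevant coefficients to zero
# while the model is being fitted.
lasso = Lasso(alpha=1.0).fit(X, y)
nonzero = np.flatnonzero(lasso.coef_)

print(kept)      # indices chosen by the filter
print(nonzero)   # indices retained by the embedded method
```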

2.3.3 Model evaluation

The machine learning models reviewed in the previous sections are built using a large amount

of data from which the algorithm may infer meaningful patterns to later predict new samples.


This is called the training phase. After this phase, the built model should be evaluated in order

to assess how well the produced result generalizes to new, unseen data. This is called the evalu-

ation or test phase. There are two common approaches to evaluate the performance of a model,

holdout and cross-validation. The first consists in dividing the data into three subsets: training,

validation, and test set. The training set is used to build the model, whereas the validation set is

used to assess the performance of the model when adjustments of its parameters are required.

The test set should not contain data previously used in the training phase, as it is used to assess

the future performance of a model on unseen data. Alternatively, one of the most widely used

approaches to estimate the accuracy of a predictive model is cross-validation (Bishop 2006). It

also involves the partition of the original dataset in two complementary subsets, one for train-

ing, one for testing. However, to reduce variability, multiple rounds of cross-validation are

performed using different partitions. The validation results are then averaged over the rounds

to provide an estimate of the model’s predictive performance. This method is usually preferred

when the amount of data available for the training and test phases is limited, as, for example, in the area of health.

There are two common types of cross-validation methods: exhaustive and non-exhaustive.

In the first case, training and test phases are performed on all possible ways of dividing the

original dataset. A frequently used example of this approach is leave-one-out (LOO)

cross-validation (Geisser 1975, Stone 1974, 1977), which can be considered a special case of

leave-p-out (LPO) cross-validation (Shao 1993) with p = 1. It involves using p observation(s) for

validation and the remaining for training. This process is repeated until all the samples in the

dataset have been divided into a validation set of p observation(s) and a corresponding training

set.

Non-exhaustive cross-validation methods do not evaluate all the possible ways of splitting the original dataset. These methods are an approximation of LPO cross-validation.

Probably the most widely used implementation is k-fold cross-validation. In this approach, the

original dataset is randomly partitioned into k equally sized subsets. Of the k subsets, a single

subset is retained as the validation data for testing the model, and the remaining k - 1 subsets

are used as training data. The cross-validation process is then repeated k times, with each of

the k subsets used exactly once as the validation data. The k results can then be averaged to

produce a single estimation. The advantage of this method is that all observations are used for

both training and validation, and each observation is used for validation exactly once. A visual

representation of this approach is depicted in Figure 2.12. In performing cross-validation, it

is usually a good practice to verify that each fold contains roughly the same proportions of

observations with a given output value, such as the class outcome. This approach is called stratified k-fold cross-validation.

Figure 2.12: An illustration of the k-fold cross-validation method with k = 10.
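Stratified k-fold splitting can be sketched as follows, assuming scikit-learn's StratifiedKFold (the imbalanced toy labels are an illustrative assumption):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([0] * 80 + [1] * 20)    # imbalanced labels: 80% vs 20%
X = np.arange(100).reshape(-1, 1)    # dummy single-feature data

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
# Every test fold preserves the 20% positive rate of the full dataset.
fold_rates = [y[test_idx].mean() for _, test_idx in skf.split(X, y)]
print(fold_rates)
```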

2.3.3.1 Evaluation metrics

According to the type of machine learning problem, several measures may be used to describe

the validity of a model. Usually, different areas have different preferences for specific metrics

due to different goals. For instance, in medicine, sensitivity and specificity are often used,

while in information retrieval precision and recall are preferred. In this section, the accuracy

and Word Error Rate (WER) are briefly introduced; they are two measures widely used in the

areas of classification and speech recognition.

In a classification problem, the accuracy measures the fraction of all instances correctly

categorized. Consider as an example a test for the presence of a disease performed on a set

of subjects. Some people will actually have the disease, and if the test correctly identifies them,

then these instances are called true positives (TP). Some other subjects have the disease, but the

test incorrectly claims they do not have it. In this case, these instances are called false negatives

(FN). Subjects who do not have the disease and are correctly identified by the test are called true

negatives (TN). Finally, healthy people who have a positive test result are called false positives

(FP). Provided with these definitions, the accuracy of a binary classifier is then computed as the

sum of the true positive and true negative instances, divided by the total number of instances

existing in the dataset:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
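The computation can be illustrated with a small, hypothetical set of test results:

```python
reference = [1, 1, 1, 0, 0, 0, 0, 1]  # 1 = subject actually has the disease
predicted = [1, 1, 0, 0, 0, 1, 0, 1]  # hypothetical test outcomes

# Count the four outcome types defined above.
tp = sum(r == 1 and p == 1 for r, p in zip(reference, predicted))
tn = sum(r == 0 and p == 0 for r, p in zip(reference, predicted))
fp = sum(r == 0 and p == 1 for r, p in zip(reference, predicted))
fn = sum(r == 1 and p == 0 for r, p in zip(reference, predicted))

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 6 correct decisions out of 8 -> 0.75
```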

The WER is a common metric of the performance of a speech recognition system. It mea-

sures the differences between a recognized word sequence, also called the hypothesis, and the

corresponding spoken word sequence, the reference. There are three types of errors that may


occur as a result of comparing these two sequences. Words that exist only in the hypothesis are

called insertions. Likewise, words that exist only in the reference are called deletions. Finally,

words existing in both sequences, but improperly recognized, are called substitutions. Thus, the

WER is measured as the sum of the numbers of substitutions (S), deletions (D), and insertions (I), divided by the total number of words existing in the reference (N):

WER = (S + D + I) / N
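In practice, the error counts are obtained from a word-level alignment between the two sequences, typically computed with edit distance. A minimal sketch follows (an illustrative implementation, not the one used in this work):

```python
def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum number of edits to turn the first i reference
    # words into the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                       # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                       # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub,                # substitution (or match)
                          d[i - 1][j] + 1,    # deletion
                          d[i][j - 1] + 1)    # insertion
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2/6 ≈ 0.33
```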

2.4 Summary

In this chapter, some introductory notions of technical concepts required for the understanding

of the rest of this document were reviewed. Topics related to the areas of natural language

processing, spoken language processing, and machine learning were covered. Each of these

areas is very extensive and providing a description of the several existing approaches in each

of them is beyond the scope of this chapter. Regarding the area of natural language process-

ing, a simple method for representing text data was first briefly introduced, then statistical and

predictive language models were covered. The section dedicated to spoken language processing required covering a broader set of topics in order to ease the reading of the next chapters.

A characterization of speech production and some of the many types of information that can

be extracted from speech signals were provided. Then, a brief review on a speaker modeling

technique used in the tasks of speaker identification and recognition was also provided. The sec-

tion concluded with some basic notions about the main components of a speech recognition

system. Finally, some models commonly used in machine learning problems were presented together with an overview of feature selection approaches, and a description of a standard

approach to evaluate machine learning models. Additionally, two common evaluation metrics

in classification problems and in speech recognition were described. These metrics have also been used to assess the contributions presented in this dissertation.


3 Characterization of Neurodegenerative Diseases

Neurodegenerative diseases affect millions of people worldwide. These disorders originate when nerve cells in the brain or in the peripheral nervous system gradually lose their functionality. The clinical condition becomes progressively worse over time, until

ultimately nerve cells die. The risk of neurodegenerative diseases increases with age. Consid-

ering that lifespan has been extended notably in the last decades, it is not surprising that their

prevalence is also increasing. This creates a critical need to improve the understanding of these

disorders to develop new approaches for their prevention and treatment. For these reasons, the

most common neurodegenerative diseases are briefly introduced in this chapter. For each of

them, the main symptoms and the clinical criteria used for diagnosis are reported. The chapter

ends with a summary in which (i) it is shown how Speech and Language Technology (SLT) can

provide benefits to the diagnostic process of the observed neurodegenerative diseases, and (ii)

I identify two diseases on which to focus the rest of this study.

3.1 Mild Cognitive Impairment

Mild Cognitive Impairment (MCI) is a brain function syndrome involving the onset and evolu-

tion of cognitive decline greater than expected for an individual’s age and education level, but

that does not interfere notably with the activities of daily life. Prevalence in population-based

epidemiological studies ranges from 16% to 20% in adults older than 60 years. Some people

with MCI seem to remain stable or regress to normal over time, but 20% to 40% progress to

dementia within five years (Roberts & Knopman 2013). MCI can thus be regarded as a risk

state for dementia and its identification could lead to secondary prevention by controlling risk

factors such as systolic hypertension. The amnestic subtype of MCI has a high risk of progres-

sion to Alzheimer’s disease and could constitute a prodromal stage of this disorder (Gauthier

et al. 2006).

In 2011, the American National Institute on Aging and the Alzheimer’s Association (NIA-

AA) published the core clinical criteria for the diagnosis of MCI. These require the existence

of a concern regarding a change in cognition together with an impairment in one or more

cognitive domains. Symptoms should allow the preservation of independence in functional


abilities, and there should be no evidence of a significant impairment in social or occupational

functioning (Albert et al. 2011).

Cognitive assessment includes tests of episodic memory. These are useful for identifying

amnestic MCI patients who have a high likelihood of progressing to Alzheimer’s disease de-

mentia within a few years. Since other cognitive domains can be impaired in addition to mem-

ory, the assessment typically also includes tests that evaluate executive functions (e.g., reason-

ing, problem solving, planning), language (e.g., naming, fluency, and comprehension), visu-

ospatial skills, and attentional control. Many validated clinical neuropsychological measures

are available to assess these cognitive domains. Among them, the Mini-Mental State Examina-

tion (MMSE) (Folstein et al. 1975) is widely used for screening of MCI and Alzheimer’s

disease due to its brevity, high sensitivity, and ease of administration and scoring (Nunes 2005).

Another commonly used battery is the Wechsler Adult Intelligence Scale - III (WAIS-III) (Ryan

& Lopez 2001). It is considered “the gold standard” in intelligence testing, providing informa-

tion about the overall level of intellectual functioning and the presence or absence of significant

intellectual disability.

3.2 Alzheimer’s disease

Dementias represent a broad category of brain diseases that cause a long-term, gradual de-

crease in multiple cognitive functions. They are responsible for the greatest burden of neu-

rodegenerative diseases, with Alzheimer’s Disease (AD) representing the most common cause

of dementia, contributing to 60%-70% of cases (WHO 2017).

AD is characterized by loss of neurons and synapses in the cerebral cortex and in certain

subcortical regions. It gradually and progressively alters and destroys the

nervous tissue. At an early stage, AD is characterized by alterations of memory and of spa-

tial and temporal orientation. With the progression of the disease, other neuropsychological

changes arise, such as language impairments, visuospatial deficits and changes in abstraction

and judgment. At a later stage, the disease may lead to the development of apraxia (difficulty

in organizing motor actions intentionally). AD is diagnosed when there are cognitive or behav-

ioral symptoms that represent a decline from previous levels of functioning and interfere with

the ability to function at work or at usual activities. The diagnostic process may include brain

imaging and cerebrospinal fluid exams. Cognitive impairment should be diagnosed through

an objective cognitive assessment and it should involve at least two of the following domains:

memory, reasoning, visuospatial abilities, language, personality (McKhann et al. 2011). Al-

though memory impairment due to medial temporal lobe damage is the characteristic symp-


tom of AD, language problems are also prevalent and existing literature confirms they are an

important factor. The most well-known symptoms of impaired language abilities include naming

and word-finding difficulties, repetitions, an overuse of indefinite and vague terms, and inap-

propriate use of pronouns (Ahmed, Haigh, de Jager & Garrard 2013, Almor et al. 1999, Forbes

et al. 2002, Kempler 1984, 1995, Kempler et al. 1987, Kim & Thompson 2004, Oppenheim 1994,

Reilly et al. 2011, Salmon et al. 1999, Taler & Phillips 2008, Ulatowska et al. 1988).

When it comes to discourse, syntactic and semantic deficits in language processing con-

strain the production of meaningful speech. The discourse of AD patients is described as flu-

ent but not informative, characterized by incomplete and short sentences (Hier et al. 1985,

Nicholas et al. 1985), poorly organized, and with a disproportionate deficit in maintaining co-

hesion (Shekim & LaPointe 1984), and coherence (Appell et al. 1982, Glosser & Deser 1991,

Hutchinson & Jensen 1980, Obler & Albert 1984, Ripich & Terrell 1988).

No treatments stop or reverse the progression of this disease, though some may temporar-

ily improve the symptoms. The disease onset is often mistakenly attributed to aging or stress.

Detailed neuropsychological testing can reveal mild cognitive difficulties up to eight years be-

fore a person fulfills the clinical criteria for the diagnosis of AD (Backman et al. 2004). Among

the most used neuropsychological measures, there are the MMSE and the Alzheimer’s Dis-

ease Assessment Scale - Cognitive Subscale (ADAS-Cog) (Rosen et al. 1984). The latter is the

most widely administered tool in AD trials (Cano et al. 2010, Robert et al. 2010), being used to

measure cognitive performance and detect therapeutic efficacy in cognition. The ADAS-Cog

consists of eleven tasks assessing six areas of cognition: memory; language; ability to orien-

tate to time, place, person; construction of simple designs; planning; and performing simple

behaviors in pursuit of a predefined goal. The battery takes approximately 30 to 45 minutes to

complete, depending on the AD severity stage of the patient.

3.3 Parkinson’s disease

Parkinson’s Disease (PD) is due to the progressive death of neurons in the substantia nigra, a

region of the midbrain. This decreases the synthesis of dopamine, which causes

a dysfunction in the regulation of major brain structures involved in the control of movements.

PD is the second most common neurodegenerative disorder after AD, affecting about 1% of

people older than 60 years (de Lau & Breteler 2006). About 89% of PD patients develop speech

disorders (Ramig et al. 2008).

The cardinal motor signs of PD include the characteristic clinical picture of resting tremor,

rigidity, bradykinesia, and impairment of postural reflexes, while non-motor symptoms in-


clude behavioral disorders, sleep and sensory abnormalities. These symptoms slowly worsen

during the disease with a nonlinear progression. Dementia becomes common in the advanced

stages of the disease (Sveinbjornsdottir 2016). PD patients often develop a speech disorder

referred to as hypokinetic dysarthria. This is characterized by weakness, paralysis, and lack of coordi-

nation in the motor speech system, affecting respiration, phonation, articulation, and prosody.

These deficits may result in an altered speech characterized by a reduced intelligibility, natu-

ralness, and overall efficiency of vocal communication. Deficits in phonation are related with

vocal fold bowing and incomplete closing of vocal folds. These can result in a decreased loud-

ness and an impaired ability to produce normal phrasing and intensity. Articulation deficits

are manifested as a reduced amplitude and velocity of the articulatory movements in the lips,

tongue, and jaw. Patients may present imprecise stop consonants, produced as fricatives, and de-

fects in the ability to make rapid articulator movements in the repetition of a consonant–vowel

combination. Prosodic impairments comprise changes in loudness, pitch, and timing, which

overall contribute to the resulting intelligibility of speech.

The standard method to evaluate and rate the neurological state of Parkinson’s patients

is based on the revised version, provided by the Movement Disorders Society, of the Unified

Parkinson’s Disease Rating Scale (UPDRS) (MDS 2003). The motor part of the UPDRS (Section

III) addresses speech, evaluating volume, prosody, clarity, and repetition of syllables. Speech

symptoms of PD are typically assessed by an SLP through several speaking tasks thought to

measure the extent of speech and voice disorders. The most traditional of them are the sus-

tained vowel phonation, rapid syllable repetition (diadochokinesis or DDK), and variable read-

ing of short sentences, longer passages, or freely spoken spontaneous speech (Goberman &

Coelho 2002). Each of these tasks evaluates a specific impairment caused by dysarthria, such

as difficulties in consonant-vowel articulation, phonation, respiration, and prosody.

3.4 Dementia with Lewy bodies

The predominant histological feature of Dementia with Lewy bodies (DLB) is the presence of

cortical and subcortical Lewy bodies, clumps of alpha-synuclein protein in neurons. DLB is the

second most common type of degenerative dementia in the elderly, possibly accounting for up

to 15% of all dementia cases (McKeith et al. 1996).

Dementia, defined as a progressive cognitive decline of sufficient magnitude to interfere

with usual daily activities, is an essential requirement for DLB diagnosis. Prominent or persis-

tent memory impairment may not necessarily occur in the early stages, but is usually evident

as the disease progresses. Deficits on tests of attention, executive function, and visuopercep-


tual ability may instead be especially prominent and occur early. Core clinical features also

include fluctuation in cognition, recurrent visual hallucination, or motor features of parkinson-

ism (McKeith et al. 2017). This disease presents a pronounced clinical and neuropathological

overlap with AD as well as PD with dementia (PDD).

Dementia screening batteries such as the MMSE and the Montreal Cognitive Assessment

(MOCA) (Nasreddine et al. 2005) are useful to characterize global impairment in DLB. How-

ever, neuropsychological assessment should include measures assessing different cognitive do-

mains that are capable of highlighting clinical deficits typical of this disease. Measures of atten-

tion and executive function that differentiate DLB from AD and normal aging include tests of

processing speed and divided attention (e.g., Stroop tasks, phonemic fluency, and trail making

tests). These tests are particularly important because they assess the brain’s ability to attend

to multiple stimuli simultaneously, while evaluating the reaction time. A version of the Stroop

test (Stroop 1935), for instance, requires the participant to read aloud the name of a color printed

in an ink color different from the name (e.g., the word “red” printed in blue ink instead of red ink).

Tests of verbal fluency (Benton et al. 1994), on the other hand, aim at assessing verbal initia-

tive ability, inhibition ability, and the difficulty in switching among tasks. In these tests, the

participants should produce as many words as they can think of beginning with a particular

letter (phonemic fluency) or belonging to a particular category (semantic fluency), in a con-

strained time of 60 seconds. Examples of useful probes of spatial and perceptual difficulties

include tasks of figure copy (e.g., intersecting pentagons and complex figure copy). Memory

and object naming tend to be less affected in DLB and are best evaluated through story recall,

verbal list learning, and confrontation naming tasks to detect impairments of word-finding

abilities (McKeith et al. 2017).
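The fluency scoring procedure described above, counting valid productions within the time limit, can be sketched as a small function. This is a minimal illustration: the function name and the rules for excluding intrusions (wrong initial letter) and perseverations (repeated words) are assumptions, and clinical manuals specify further criteria (e.g., proper nouns or morphological variants).

```python
def score_phonemic_fluency(words, letter):
    """Count unique valid productions beginning with the target letter.

    Repetitions (perseverations) and words starting with another letter
    (intrusions) are excluded, mirroring common scoring conventions.
    """
    seen = set()
    score = 0
    for w in words:
        w = w.strip().lower()
        if not w or not w.startswith(letter.lower()):
            continue  # intrusion: wrong initial letter
        if w in seen:
            continue  # perseveration: word already produced
        seen.add(w)
        score += 1
    return score
```

In practice, the word list would come from a manually or automatically transcribed 60-second recording.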

3.5 Frontotemporal Dementia

Frontotemporal Dementia (FTD) defines a heterogeneous group of clinical syndromes

marked by the progressive, focal neurodegeneration of the frontal and anterior temporal

lobes (Pasquier & Petit 1997). It is the third most common dementia for individuals older

than 65 years (Brunnstrom et al. 2009, Ratnavalli et al. 2002). FTD affects brain regions im-

plicated with motivation, reward processing, personality, social cognition, attention, executive

functioning, and language.

Currently, FTD incorporates three clinical subtypes known as variants of Primary Progres-

sive Aphasia (PPA): non-fluent, semantic, and logopenic PPA. Patients are first diagnosed with

FTD and then are divided into clinical variants based on the relative presence or absence of


salient speech and language features. The diagnosis of FTD requires initial and progressive

decline in social functioning and changes in personality, characterized by a progressive deteri-

oration of behavior accompanied by three out of the following features: disinhibition, apathy,

loss of empathy, eating behavior changes, compulsive behaviors, and an executive predomi-

nant pattern of dysfunction on cognitive testing. The main language domains considered to

classify the disease’s variants include speech production features (e.g., grammar, motor speech,

sound errors, and word-finding pauses), repetition, single-word and syntax comprehension,

confrontation naming, semantic knowledge, and reading/spelling.

The clinical diagnosis of the non-fluent variant of primary progressive aphasia (nfvPPA)

requires either agrammatism in language production or effortful, slow, and labored speech

with inconsistent sound errors and distortions (apraxia of speech) (Gorno-Tempini et al. 2011).

The semantic variant of primary progressive aphasia (svPPA) preserves language fluency, but is

characterized by anomia, severe single-word comprehension deficits, loss of object knowledge,

semantic and paraphasic errors (Gorno-Tempini et al. 2004). These variants affect 20%-

25% of patients diagnosed with FTD (Johnson et al. 2005). Finally, the third variant known as

logopenic variant of primary progressive aphasia (lvPPA) presents word retrieval and sentence

repetition deficits. Spontaneous speech is characterized by a slow rate, but without a clear

agrammatism. This syndrome is the most recently identified among the variants of PPA and

presents some patterns overlapping with AD, especially in the early age of onset (Henry & Gorno-

Tempini 2010).

Tasks typically used in the diagnostic process include picture description and story retelling

tests to evaluate grammatical structure, a diadochokinesis task to assess motor speech capabil-

ities, repetitions, confrontation naming, and sentence or single-word comprehension (Gorno-

Tempini et al. 2011).

3.6 Amyotrophic Lateral Sclerosis

Amyotrophic Lateral Sclerosis (ALS) is characterized by a rapid, progressive degeneration of

motor neurons in the brain and spinal cord, which ultimately leads to paralysis and prema-

ture death. Overall, the prevalence of ALS is low, approximately 5 in 100,000 individuals, but

incidence increases with age, showing a peak between 55 and 75 years (Bertram & Tanzi 2005).

Primarily characterized by weakness and atrophy in the muscles of the extremities, de-

creased muscle tone, and fasciculations, this disease is often subtyped into several variants

according to the site of onset (i.e., bulbar, spinal, and respiratory). Approximately 70%

of patients are affected by the spinal form of the disease, 25% of cases report bul-


bar onset, and the remaining 5% has initial trunk or respiratory involvements (Kiernan et al.

2011). While the spinal variant presents symptoms that may start with upper and lower limbs

muscle weakness, the bulbar subtype presents speech and swallowing difficulties, being char-

acterized by respiratory problems, tongue atrophy, and by the eventual loss of speech intelli-

gibility (Yorkston et al. 1993). In this subtype, speech problems are often present as an early

manifestation of the disease, possibly affecting the phonatory, articulatory, resonatory, and res-

piratory speech subsystems. As a result, ALS patients may experience both dysarthria and

dysphagia (difficulty in swallowing). The pattern of speech impairments includes effortful,

slow productions with short phrases, inappropriate pauses, imprecise consonants, hypernasal-

ity, strain-strangled voice, as well as a decreased pitch and loudness range (Watts & Vanryck-

eghem 2001). Acoustic analysis of the voice has confirmed a deviant fundamental frequency,

amplitude and frequency perturbations, voice range, vocal quality, and phonatory instabil-

ity (Silbergleit et al. 1997).
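The frequency and amplitude perturbations referred to here are commonly quantified as jitter and shimmer. The following is a simplified sketch of their relative "local" formulations, assuming that cycle-to-cycle period and peak-amplitude sequences have already been extracted by a pitch tracker; it is an illustration, not the exact measures used in the cited study.

```python
def local_jitter(periods):
    """Relative local jitter (%): mean absolute difference between
    consecutive glottal cycle periods, divided by the mean period."""
    diffs = [abs(b - a) for a, b in zip(periods, periods[1:])]
    return 100.0 * (sum(diffs) / len(diffs)) / (sum(periods) / len(periods))

def local_shimmer(amplitudes):
    """Relative local shimmer (%): the same formulation applied to
    cycle peak amplitudes instead of periods."""
    diffs = [abs(b - a) for a, b in zip(amplitudes, amplitudes[1:])]
    return 100.0 * (sum(diffs) / len(diffs)) / (sum(amplitudes) / len(amplitudes))
```

Healthy sustained phonation yields low values for both measures, while the phonatory instability described above inflates them.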

The criteria for the diagnosis of ALS (Brooks 1994) require the presence of signs of degen-

eration of lower motor neuron (weakness, or paralysis accompanied by loss of muscle tone),

upper motor neuron (paralysis accompanied by severe spasticity and rigidity), and progres-

sive spread of signs within a region or to other regions. At the same time, pathological,

neuroimaging, and electrophysiological evidence of the absence of other diseases that might

explain the observed clinical signs is required. Abnormal speech or swallowing studies and abnormal pul-

monary or larynx function are among the clinical features used to support the diagnosis. The

progression of speech impairments can be quite rapid. Case studies report an overall decay of

speech intelligibility (from 98% to 48%) and pulmonary function in an observation period of

only two years (Kent et al. 1991). Due to this aggressive loss, ALS patients should be frequently

re-assessed by SLPs. In order to be comprehensive, the assessment should individually eval-

uate the articulatory, respiratory, phonatory, and resonatory subsystems. The first is assessed

using kinematic measures (e.g., speed, strength) of facial components (jaw, lips, and tongue),

while the respiratory subsystem is evaluated considering aerodynamic (e.g., oral pressure, air-

flow) and acoustics variables. Both may use specialized instruments and speaking tasks of

spontaneous speech. The evaluation of the phonatory subsystem relies on voice characteristics

using the maximum phonation time task, while the evaluation of the resonatory subsystem is

based on the analysis of velopharyngeal muscle weakness.

3.7 Huntington’s disease

Huntington’s Disease (HD) is caused by a degeneration of neurons in the basal ganglia and in

cortical regions, affecting the areas of the brain involved in movement, cognition, and emotions.


Its prevalence is similar to that of ALS, affecting approximately 2.71 in 100,000 individuals

worldwide (Pringsheim et al. 2012).

HD is characterized by a progressive motor dysfunction, behavioral changes and cognitive

decline resulting in dementia. From a clinical perspective, HD is primarily manifested by in-

voluntary movements known as chorea, which may be accompanied by bradykinesia, motor

impersistence, and deficits in movement planning, aiming, tracing, and termination (Berardelli

et al. 1999, Paulsen 2011). A primary consequence of chorea is the onset of a motor speech

disorder characterized as hyperkinetic dysarthria. The main patterns of this disorder are im-

precise consonants, prolonged intervals, variable rate, monopitch, harsh voice, inappropriate

silence, distorted vowels, and excessive loudness variations (Hartelius et al. 2003, Saldert et al.

2010).

Currently, HD is formally diagnosed based on the presence of the HD gene and on the

development of motor symptoms that are unequivocal signs of HD, matching the fourth confi-

dence level (≥ 99% confidence) in the Diagnostic Confidence Level of the Unified Huntington’s

Disease Rating Scale (UHDRS) (Reilmann et al. 2014). These criteria, however, may miss the

earliest signs and symptoms of the disorder, which could occur up to 10-15 years before the

disease’s onset. In this period, individuals may experience the gradual appearance of subtle

motor, cognitive, and behavioral changes, but do not meet the current criteria for formal HD di-

agnosis. As such, new diagnostic categories for HD are being proposed, based on an improved

understanding of natural history (Reilmann et al. 2014). Also, some recent studies are starting

to consider motor speech deficits and language difficulties as a clinical indicator of disease on-

set and marker of disease progression (Rusz et al. 2014, Skodda et al. 2014, Vogel et al. 2012).

These studies evaluate major observed impairments in HD patients, including deviations in

phonation (i.e., increased pitch, harsh voice), poor oral motor performance (i.e., reduced co-

ordination of tongue and lips) and alterations in speech timing and prosody (i.e., shortened

phrase length). The tasks typically used are: syllable repetition, sustained vowel phonation,

reading of a passage, and freely spoken spontaneous speech.

3.8 Neurodegenerative diseases and SLT

In the previous sections, the diagnostic process of several neurodegenerative diseases was re-

ported, observing that it partially relies on different kinds of tests. Depending on the disease,

the assessment may include neuropsychological tests or a perceptual assessment of voice qual-

ity, and is typically performed by an expert neurologist or an SLP. In any case, the evaluation

should be repeated over time to monitor the disease progression and adjust drug administra-


tion. Disease monitoring makes the administration of screening tests even more burdensome,

since it requires one of the two parties, the clinician or the patient, to travel. In

both cases, this raises inconveniences, namely the additional stress that the patient

and his/her caregiver have to face on top of their daily routine, and the scarcity of clinicians

in remote places with limited resources. The possibility of providing such tests as a

service, available for instance on the internet or by phone, would be useful for enabling remote

clinical assessment. Depending on the disease, SLT may provide additional advantages to the

diagnostic process of neurodegenerative disorders; these are highlighted in the remainder of

this section according to the diseases’ clinical symptoms.

Neurological disorders that present motor impairments (e.g., PD, HD, ALS) affect the

speech apparatus, while preserving syntactic, semantic, and pragmatic abilities of language

production. In these kinds of diseases, the neurological injury causes weakness, paralysis, or

lack of coordination of the speech organs that contribute to the production of sounds, like vocal

folds, lungs or jaws, with evident consequences on the resulting voice quality. Given their na-

ture, these disorders are evaluated on speech tasks targeted to analyze phonation, articulation,

and prosody. During the administration of these tasks, the SLP should be able to perform a

perceptual evaluation of the speech functionalities, and to compare these results with those of

a previous examination. As such, the assessment is strongly dependent on the expertise of the

clinician. SLT could overcome these limitations and provide a great contribution to the exe-

cution of these tasks. In fact, there is an important area of speech processing that deals with

the identification of representative features of the vocal tract and its possible deviations due

to diseases. This would lead to an objective, deterministic, and reproducible evaluation, and would

ease the comparison with previous data of the same patient. Additionally, the evaluation could

be performed remotely, with obvious advantages already highlighted previously.

Neurodegenerative disorders affecting cognitive functions require a different evaluation,

based on the administration of several cognitive tests. The type of tests used is strictly de-

pendent on the disease and on the cognitive domains impaired. The diagnosis of MCI, AD, and

DLB is based on neuropsychological tests assessing primarily memory, orientation and higher

order functions like planning, and attention. The diagnosis of isolated language deficits relies

on cognitive stimuli assessing a specific functionality, such as the ability to repeat a word or

a sentence, or the ability to name an object or a person. In both cases, the majority of these

tests include a verbal component provided in response to a visual or spoken stimulus solicited

by the clinician. Also, the result of a neuropsychological evaluation involving cognitive de-

cline typically provides a numerical score, which is adjusted by accounting for age and literacy.


Then, it is compared with reference normative values in order to establish if the obtained score

is considered normal or below expectations. Due to their nature, and to the need to continu-

ously monitor cognitive decline over time, neuropsychological tests lend themselves natu-

rally to be automated through SLT. A tool including the digitized version of these tests, with

the possibility of an immediate evaluation through automatic speech recognition, is feasible

and could be of valuable support in health care centers: first, the therapist would have access

to an organized archive of tests; second, tests could be administered in the traditional way, or

remotely, when the subject’s travel is hampered by logistic constraints or

physical disabilities; finally, recordings and evaluations could be stored and available for later

consultation.
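The score-to-norm comparison described above can be sketched as follows. The normative table, the age and education bands, and the −1.5 z-score cutoff are purely illustrative assumptions, not real normative data.

```python
# Hypothetical normative table: (age_band, education_band) -> (mean, sd).
# All numbers are illustrative placeholders, not real normative values.
NORMS = {
    ("65-74", "low"):  (24.0, 3.0),
    ("65-74", "high"): (27.0, 2.0),
}

def evaluate_score(raw_score, age_band, education_band, cutoff=-1.5):
    """Convert a raw test score to a z-score against the matching
    normative stratum and flag scores below the chosen cutoff."""
    mean, sd = NORMS[(age_band, education_band)]
    z = (raw_score - mean) / sd
    label = "below expectations" if z < cutoff else "within normal range"
    return z, label
```

A digitized test battery could apply this comparison immediately after automatic scoring, instead of requiring a manual lookup in printed normative tables.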

A different evaluation is required for the class of disorders that present language impair-

ments (e.g., FTD, AD), whose assessment includes tests requiring complex language abilities,

like discourse production. In this case, the analysis relies on samples of spontaneous speech,

elicited through different types of cognitive stimuli. Also, the way in which speech is elicited

will directly influence the resulting discourse and its characteristics, and should be considered

in the assessment. A descriptive speech is obtained through the description of an image or

an object. A narrative speech emerges through the recall of an event and involves memory,

while, a procedural speech includes instructions directed to explain how to perform a task,

and involves higher order functions like planning. Spontaneous speech samples are typically

recorded and then analyzed in terms of their phonological, syntactic, semantic, and pragmatic

features. The process to obtain these measures is based, first on the manual transcription of

the recordings and then on the subsequent identification and annotation of linguistic elements,

such as lexical items (e.g., noun, verb, adjective), sentence clause boundaries, and cohesive el-

ements. From these annotations, word frequencies and other statistics are then manually

computed in order to assess discourse production in terms of its correctness, fluency, informa-

tion conveyed, and overall coherence. Due to the time that this type of analysis requires, this

approach cannot provide immediate feedback in clinical settings. Addi-

tionally, it may also lead to different inter-expert assessments due to the intrinsic, ambiguous

nature of spontaneous language. Providing this kind of analysis in an automatic way, through

natural language processing and automatic speech recognition, would make it possible to overcome these

limitations, and provide clinicians with a complementary tool to evaluate complex language

abilities.
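As an illustration of the statistics mentioned above, a minimal sketch of automatic lexical analysis of a transcript is given below. The pronoun and vague-term lists are tiny illustrative stand-ins for a proper part-of-speech tagger and lexicon.

```python
import re
from collections import Counter

# Tiny illustrative word lists; a real pipeline would use a POS tagger.
PRONOUNS = {"he", "she", "it", "they", "this", "that"}
VAGUE_TERMS = {"thing", "stuff", "something", "somewhere"}

def discourse_stats(transcript):
    """Compute simple lexical measures used to characterize discourse:
    type-token ratio and the proportions of pronouns and vague terms."""
    tokens = re.findall(r"[a-z']+", transcript.lower())
    counts = Counter(tokens)
    n = len(tokens)
    return {
        "tokens": n,
        "type_token_ratio": len(counts) / n if n else 0.0,
        "pronoun_rate": sum(counts[w] for w in PRONOUNS) / n if n else 0.0,
        "vague_rate": sum(counts[w] for w in VAGUE_TERMS) / n if n else 0.0,
    }
```

Elevated pronoun and vague-term rates, combined with a low type-token ratio, are the kind of signal such an automatic analysis could surface to the clinician.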


3.9 Summary

In this chapter, several neurodegenerative disorders were introduced, reporting their major

symptoms and core clinical criteria used for diagnosis. From this analysis, Alzheimer’s and

Parkinson’s diseases are among the most important neurodegenerative disorders due to their

high prevalence, representing, respectively, the first and second most common neurodegen-

erative diseases affecting people older than 60 years. Additionally, both diseases can be con-

sidered, individually, representative of other disorders that present similar symptoms. This is

especially the case for the overlap between reported language impairments in AD and some

subtypes of FTD. In fact, although memory impairment is the main symptom of AD, language

problems are also prevalent. Initially, the semantic domain is impaired; as the disease worsens,

the syntactic and phonological domains are also affected. An extensive use of pronouns, accom-

panied by a reduced use of nouns, is a hallmark of both AD and the semantic variant of PPA.

Moreover, word-finding difficulties, characteristic of the speech of AD patients, are also one

of the core features of the logopenic variant of PPA. Another example is MCI and the several

existing variants of dementia. Although along the course of the disease quite different evolu-

tions may occur, the appearance of the initial symptoms may often be similar. Amnestic and

orientation impairments are commonly referred to as the first symptoms manifested in MCI, AD,

and, in some cases, ALS. As for PD, one can observe that the motor degeneration af-

fecting the speech apparatus, known as dysarthria, can also develop in HD and, to some

extent, in ALS. Based on these considerations, it is probably the case that the development of

automatic methods targeted at the diagnosis of a particular disorder could likely be extended

to other disorders with similar onset or evolution.

Finally, considering that these diseases represent two of the most common neurological

disorders, the choice of focusing on them will intrinsically bring other advantages. Among

them, the most practical one is concerned with the availability of data. This issue becomes

increasingly important when dealing with current machine learning approaches that typically

require a considerable amount of input data to allow for a good generalization of the problem

under consideration. Although nowadays there is an increasing availability of digital resources

in many areas (e.g., online newspapers and encyclopedias), this is not straightforward for the

area of speech processing applied to health. In this context, one is dealing with sensitive data

that are difficult to gather, since they consist of the speech recordings of subjects with a clinical

condition. For this reason, data are also subject to privacy and ethical concerns. The process

of data collection should be supported by a detailed protocol that should be validated by an

ethics commission. The protocol should specify the objectives of the study and how the privacy


and security of the data will be guaranteed. To conclude, the distribution of the observed

population should be balanced in terms of gender, age, and education in order to be validated

with existing normative values. By selecting the two most widespread diseases, it is more

likely that publicly available data sets will be found.


4 Related Work: SLT for Diagnosis

of Neurodegenerative Diseases

In Chapter 3, several neurodegenerative diseases that affect different speech and language ca-

pabilities were introduced. Their relevance, major symptoms and diagnostic criteria were ana-

lyzed. Then, the study focused on how current speech and language technology may provide

benefits to the diagnostic process of these diseases. Finally, Alzheimer’s and Parkinson’s diseases

were identified as the disorders on which to focus the next part of this research. From this re-

port, it is also possible to observe and identify the link that exists between speech and some

neurodegenerative diseases. In some cases, speech production can be considered a marker of

central nervous system integrity, leading to a frequent motor disorder known as dysarthria

(e.g., Parkinson’s Disease (PD) and Huntington’s Disease (HD)). In other instances, when speech

production is spared, it becomes a standard way to screen cognitive disorders that may present,

at least initially, similar onset. Finally, speech production could be a link between different

diseases that report a similar impact on language functionality, such as in the case of dementias

where correctness, fluency, and meaningfulness become impoverished over time. These con-

siderations motivated the three contributions of this thesis, mentioned in Chapter 1, namely,

monitoring of speech, cognitive, and language abilities. In this chapter, for each of them the

literature review is presented by describing the most relevant works that provide automated

solutions based on SLT.

4.1 Monitoring of speech abilities

In the last few years, there has been a growing interest from the research community in motor

speech disorders. The current state of the art includes an extensive body of research targeting

an automatic characterization of dysarthria in order to discriminate between PD patients and

healthy subjects. Overall, these studies have considered the analysis of different speech fea-

tures that should be able to reflect the physical impairments caused by the disease. A selection

of the most relevant works existing on this topic is reported in the following paragraphs.

Rusz et al. (2011) investigated a set of quantitative acoustic measurements for the character-

ization of speech and voice disorders in early untreated PD patients. The corpus is composed

of 46 Czech native speakers, 23 individuals diagnosed with an early stage of PD, and 23 healthy


individuals matched for age. Eight vocal tasks were used in the study, including two different

versions of the sustained phonation of vowels /a/, /i/, /u/, the rapid repetition of the

/pa/-/ta/-/ka/ syllables, a monologue, and the reading of various texts with different characteristics (i.e.,

sentences with varied stress patterns, and sentences to be read according to specific emotions

among others). The study considered measures traditionally used for evaluating phonation,

articulation, and prosody in PD and, additionally, introduced some new measurements of

articulation. Among the features are the fundamental frequency (F0), jitter, shimmer,

first and second formant frequencies (F1 and F2, respectively), speech rate, pauses, variations in

loudness, and articulation accuracy. The most representative features are then identified ac-

cording to two criteria: i) selection of measures with statistically significant differences between

the two groups, and ii) removal of highly correlated variables. After this phase, 19 out of 32

measures were selected. The Wald task (Schlesinger & Hlavac 2002) was used to separately

assess each measure for its ability to classify subjects as PD, healthy, or not sure. Results have

shown that 26.77% of subjects were correctly classified according to their group, 71.97% were

classified as indecisive, while the remaining 1.26% were assigned to the opposite group.

Variations of F0 in the monologue and in sentences read according to specific emotions were

the best method for discriminating PD patients.
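The two selection criteria used by Rusz et al. (significant group differences, then removal of redundant variables) can be sketched as follows. This is a toy illustration, not the authors' implementation: the Welch t-statistic threshold and the correlation cutoff below are illustrative choices.

```python
import numpy as np

def select_features(X_pd, X_hc, min_t=2.0, max_corr=0.9):
    """Toy two-step feature selection: keep features with a large Welch
    t-statistic between the PD and healthy groups, then greedily drop one
    of each pair of highly correlated surviving features."""
    t = np.abs(X_pd.mean(0) - X_hc.mean(0)) / np.sqrt(
        X_pd.var(0, ddof=1) / len(X_pd) + X_hc.var(0, ddof=1) / len(X_hc))
    candidates = [i for i in np.argsort(-t) if t[i] > min_t]
    X = np.vstack([X_pd, X_hc])
    selected = []
    for i in candidates:
        if all(abs(np.corrcoef(X[:, i], X[:, j])[0, 1]) < max_corr
               for j in selected):
            selected.append(i)
    return selected
```

On synthetic data where feature 1 duplicates feature 0 and feature 2 carries no group difference, only feature 0 survives both criteria.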

Orozco-Arroyave et al. (2013) explored the discriminant capability of different perceptual

features for automatically classifying between people with PD and healthy individuals. Fea-

tures included in the study considered LPC, Linear Prediction Cepstral Coefficients (LPCC),

MFCC, PLP, and two versions of Relative Spectra coefficients (RASTA), with and without cep-

stral filtering (RASTA-PLP-CEPS, RASTA-PLP-SPEC). The number of coefficients is 12 for each

type of feature except RASTA-PLP-SPEC, for which 27 coefficients are estimated. Four

statistics are computed for each kind of feature: mean, standard deviation, skewness, and kurtosis.

The corpus consisted of the speech recordings of 20 patients and 20 healthy subjects while

performing the sustained vowel phonation task for the five Spanish vowels. Each subject re-

peated the task three times for each vowel. Data is balanced by gender and age. Following the

feature extraction phase, the authors performed feature selection using Principal Component

Analysis (PCA). Then, a two-layer classification scheme was implemented. The first stage of

classification considers each kind of feature individually; the second stage combines the results

obtained previously into a new feature space. 70% of the data is used for feature selection

and for training the classifier, while the remaining 30% is used for testing. Each stage of the

classification process is repeated ten times for each pair of subsets (training and testing), form-

ing a total of 100 independent realizations of the experiment. Classification was performed


using SVM (Cortes & Vapnik 1995) trained with a Gaussian kernel (Scholkopf & Smola 2001).

The best accuracy for each vowel was achieved using different features. For vowel /a/ the

PLP parameters exhibited the best results (76.19%), while for vowels /i/ and /u/ the best fea-

tures were the MFCC (75.30%, 76.28%). The best results for vowels /e/ and /o/ are obtained

when five subsets of features are combined. For the case of vowel /e/ the considered features

are RASTA-PLP-SPEC, MFCC, PLP, LPC, and RASTA-PLP-CEPS (77.22%), while for vowel

/o/ the set of features includes MFCC, LPCC, RASTA-PLP-CEPS, PLP, and RASTA-PLP-SPEC

(81.08%).

Skodda et al. (2011) analyzed the ability to articulate vowels in a group of PD patients

suffering from mild hypokinetic dysarthria. Results were compared with the ones obtained

with a control group. The goal of the study is to confirm the hypothesis that PD patients

present a reduced working space for vowels with respect to healthy controls, even when voice

intelligibility is preserved. In fact, limited movements of the articulators, as may be the case in

hypokinetic dysarthria, should be characterized by a lowering of high frequency formants and

by an elevation of normally low frequency formants. To this purpose, the authors resort to the

notions of VAI, and triangular VSA. The relationship of vowel space with the net speech rate

and with the global motor impairment of the disease were also investigated. Analysis of speech

rate was performed by measuring the length of each syllable and pause. As such, the net speech

rate was defined as syllables per second related to the total speech time minus the sum of all the

pauses. The dataset is composed of German speakers, 68 patients and 32 healthy individuals,

each participant performed a reading task composed of four complex sentences. Each of the

vowels /a/, /i/, and /u/ were extracted 10 times from different words within the text. Results

have shown that PD patients present reduced formant transitions and a restricted acoustic

vowel space. However, these impairments were independent of global motor function and

the stage of the disease. The triangular VSA was found to be reduced in male but not in female

PD speakers, whereas measurement of VAI seemed to be more applicable for the differentiation

of PD and healthy speakers.
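Both metrics can be computed directly from the mean first and second formant frequencies of the corner vowels /a/, /i/, /u/. The formant values below are hypothetical, chosen only to illustrate that centralized articulation shrinks both measures.

```python
def triangular_vsa(f1a, f2a, f1i, f2i, f1u, f2u):
    """Triangular vowel space area: area of the /a/-/i/-/u/ triangle in the
    F1-F2 plane (shoelace formula), in Hz^2."""
    return 0.5 * abs(f1a * (f2i - f2u) + f1i * (f2u - f2a) + f1u * (f2a - f2i))

def vai(f1a, f2a, f1i, f2i, f1u, f2u):
    """Vowel articulation index: formants that are high in clearly articulated
    vowels go in the numerator, those that rise with centralization in the
    denominator, so VAI decreases as articulation centralizes."""
    return (f2i + f1a) / (f1i + f1u + f2u + f2a)

# Hypothetical mean formants (Hz): a typical speaker vs. a centralized one.
clear = dict(f1a=800, f2a=1300, f1i=300, f2i=2300, f1u=350, f2u=800)
central = dict(f1a=650, f2a=1300, f1i=400, f2i=1900, f1u=450, f2u=1000)
```

With these values, centralization reduces both measures (`vai(**clear)` is about 1.13 versus about 0.81 for `vai(**central)`), which is the pattern reported for hypokinetic dysarthria.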

In another work, Orozco-Arroyave et al. (2014) exploited acoustic measures for analyzing

phonation and articulation in PD patients and for distinguishing them from a control group.

The corpus is composed of 50 patients and 50 healthy subjects while performing three repeti-

tions of the five Spanish vowels. Participants are balanced by gender and age. The acoustic

features considered include: F0 and measures of its variability (jitter, shimmer, correlation di-

mension), F1, F2, the VAI, the triangular VSA, and three new measures: the vocal prism, the

vocal pentagon, and the vocal polyhedron. The base of the vocal prism is the triangular VSA,


while its altitude is given by the variability of the pitch estimated on the vowels /a/, /i/, and

/u/. The vertexes of the vocal pentagon are composed by the values of the F1 and F2 for the five

Spanish vowels. The base of the vocal polyhedron is formed by the vocal pentagon, while its

edges are given by the pitch variability obtained from the five Spanish vowels. From these mea-

sures, the authors computed different features based on their geometrical properties (e.g., area,

volume). For each feature set, mean value, standard deviation, kurtosis, and skewness were

also computed. Classification was performed in two stages: in the first, a linear Bayesian

classifier allowed the identification of those features with a minimum accuracy of 61%. This subset

of features is then included in the second phase, where an SVM (Cortes & Vapnik 1995) with

Gaussian kernel (Scholkopf & Smola 2001) was used to classify between PD and healthy

controls. The parameters of the SVM are optimized using a 10-fold cross-validation strategy. Two

of the features introduced by the authors were selected for the second phase of classification

(std[Vprism], centPenta[F2u]). The best result was achieved using a combination of articulation

and phonation features, providing an accuracy of 81.3%.
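The geometric features derived from these constructs reduce to standard computations; for instance, the area of the vocal pentagon follows from the shoelace formula over its five (F2, F1) vertexes. The function and the formant values below are an illustrative sketch, not the authors' code.

```python
def polygon_area(vertexes):
    """Shoelace area of a polygon given as an ordered list of (x, y) vertexes;
    for the vocal pentagon, each vertex is the (F2, F1) pair of one of the
    five Spanish vowels."""
    s = 0.0
    n = len(vertexes)
    for k in range(n):
        x1, y1 = vertexes[k]
        x2, y2 = vertexes[(k + 1) % n]
        s += x1 * y2 - x2 * y1
    return abs(s) / 2.0

# Hypothetical (F2, F1) values in Hz for /a/, /e/, /i/, /u/, /o/, in order
# around the vowel space.
pentagon = [(1300, 800), (1900, 500), (2300, 300), (800, 350), (900, 500)]
```

Statistics such as mean, standard deviation, kurtosis, and skewness would then be computed over such geometric quantities across repetitions, as described in the text.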

Bocklet et al. (2011) investigated acoustic, prosodic and voice-related features to perform

the automatic classification of PD. The analysis was performed using three different systems.

Articulation is characterized through statistical modeling of acoustic features, using the first 39

MFCCs. The authors implemented a GMM-UBM approach to obtain speaker specific GMMs.

The means of each speaker are then used as speaker-specific features. Prosodic analysis was

based on a voiced/unvoiced (VUV) decision, voiced segments are then used to compute the

F0, energy, duration, pauses, jitter, shimmer, and different statistics. Voice and phonation were

modeled by a glottal excitation system based on a two-mass vocal fold modeling. Apart from

a phonation task, the corpus used in this study is the same as the one described in the work of

Rusz et al. (2011). Classification was performed with SVM (Cortes & Vapnik 1995), using LOO

cross-validation. The three systems achieved their best recognition rates on different tasks:

the prosodic system reached 90.5% on the reading of a 136-word text, the acoustic system 88.1%

on the reading of sentences containing words with varied stress patterns, and the glottal excitation

system 78.6% on the reading of the 136-word text and of sentences read according to specific emotions. Feature

selection was performed with a correlation-based (Hall 1999) approach, which prefers subsets

of features highly correlated with the class, but with a low inter-correlation among them. After

this approach, the recognition rates of the three systems are 88.1% (prosodic system), 100%

(acoustic system), and 83.3% (glottal excitation).
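The GMM-UBM supervector idea can be sketched with scikit-learn (assumed available; the relevance factor and the toy data below are illustrative): a universal background model is fitted on pooled frames, and each speaker's feature vector is the concatenation of the MAP-adapted component means.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def map_mean_supervector(ubm, frames, r=16.0):
    """MAP-adapt only the UBM means to one speaker's frames (Reynolds-style
    relevance MAP) and return the stacked adapted means as a speaker-specific
    feature vector."""
    post = ubm.predict_proba(frames)             # (T, K) responsibilities
    n_k = post.sum(axis=0)                       # soft counts per component
    f_k = post.T @ frames                        # first-order statistics (K, D)
    alpha = (n_k / (n_k + r))[:, None]           # data/prior interpolation
    adapted = alpha * (f_k / np.maximum(n_k, 1e-10)[:, None]) \
        + (1 - alpha) * ubm.means_
    return adapted.ravel()

rng = np.random.default_rng(0)
pooled = rng.normal(size=(500, 3))               # stand-in for MFCC frames
ubm = GaussianMixture(n_components=4, random_state=0).fit(pooled)
sv = map_mean_supervector(ubm, pooled[:50])      # one "speaker"
```

As the relevance factor grows, the adapted means fall back to the UBM means, which is the usual safeguard for speakers with little data.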

In a recent work, Orozco-Arroyave et al. (2016) investigated the characterization of the

speech signal into voiced and unvoiced frames to automatically classify PD patients. Voiced


frames are used to compute different features assessing prosody (i.e., F0, jitter, shimmer) and

articulation (i.e., F1, F2, MFCC). The intuition behind the use of unvoiced frames stems from

the fact that PD patients develop problems in the correct pronunciation of stop and voiceless

consonants. Thus, there could be also important information in those frames where the vocal

folds should not vibrate. Unvoiced frames are modeled using 12 MFCC and 25 bands scaled

according to the Bark scale (Zwicker 1961). Similar to the work of Bocklet et al. (2011), acous-

tic features are modeled using a GMM-UBM strategy. Prosodic features were computed using

the Erlangen prosody module (Zeißler et al. 2006). The dataset contains the recordings of PD

patients and healthy individuals while performing four tasks (reading of isolated words, text,

sentences, and rapid syllable repetition) in three different languages (German, Spanish, Czech).

The classification model is a radial basis SVM (Scholkopf & Smola 2001) evaluated with 10-fold

and LOO cross-validation strategies (Geisser 1975, Stone 1974, 1977), according to the language.

The proposed approach is directly compared with other standard approaches classically used

for speech modeling, such as (1) noise measures, MFCC, and vocal formants extracted from

voiced segments; (2) MFCC extracted from the utterances without pauses and modeled using

a GMM-UBM strategy; and (3) different prosodic features extracted with the Erlangen prosody

module. Results obtained using unvoiced frames have proven to be more accurate than classi-

cal approaches, reaching an accuracy that ranges from 85% to 99%, depending on the language

and the speech task. Cross-language experiments were also performed following a two-step

strategy. The system was trained with the recordings of one language and then tested on the

remaining ones. Additionally, subsets of the language used for testing were included in the

training set and excluded from the test set incrementally. In general, the accuracy ranged from

60% to 100% when recordings of the language that was going to be tested were moved from

testing and added to training.
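The voiced/unvoiced split at the heart of this approach can be sketched with a simple autocorrelation heuristic. This is a crude stand-in for a proper F0 tracker such as the Erlangen module used by the authors; the frame length, pitch range, and threshold are illustrative.

```python
import numpy as np

def voiced_mask(x, sr, frame=0.025, hop=0.010, thr=0.3):
    """Label frames voiced/unvoiced from the normalized autocorrelation peak
    in the 75-400 Hz pitch range: frames with a strong periodic component in
    that range are marked voiced."""
    n, h = int(frame * sr), int(hop * sr)
    lo, hi = int(sr / 400), int(sr / 75)         # candidate pitch lags (samples)
    mask = []
    for start in range(0, len(x) - n, h):
        f = x[start:start + n] - np.mean(x[start:start + n])
        energy = float(np.dot(f, f))
        if energy < 1e-10:                       # silent frame: unvoiced
            mask.append(False)
            continue
        ac = np.correlate(f, f, mode="full")[n - 1:]   # lags 0..n-1
        mask.append(bool(ac[lo:hi].max() / ac[0] > thr))
    return np.array(mask)
```

Features such as F0, jitter, and formants would then be computed on the voiced frames, and the Bark-band/MFCC descriptors on the unvoiced ones.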

4.2 Monitoring of cognitive abilities

Far from being complete, Chapter 3 also introduced some of the numerous neuropsychological

tests used for screening cognitive performance and tracking alterations of cognition over time.

The wide range of existing test batteries is partially motivated by the need to have screening

measures able to distinguish the many types of neurodegenerative diseases affecting cogni-

tive abilities. As observed, these disorders present specific patterns regarding the regions of

the brain affected and the consequent symptoms. Accordingly, the screening process and the

measures adopted will also vary, having to be sufficiently sensitive to differentiate between

different disorders.


Since different cognitive stimuli require different underlying solutions, I restrict this

revision to two types of studies. First, the studies related to the automatic analysis

of semantic verbal fluency tasks are summarized. I recall that in these tests the participants

should produce as many words as they can remember belonging to a particular semantic cate-

gory. This is an interesting task for the technological challenges it raises for current SLT. Then,

I describe those works that include tests assessing cognitive functions, such as memory, atten-

tion, or orientation, and that provide a completely automated solution based on ASR.

4.2.1 Semantic fluency tests

Pakhomov et al. (2012) were among the first authors targeting an automatic characterization

of verbal fluency tasks. Results on these kinds of tests are related to the ability to organize

semantic information into conceptually related clusters, and with the strategy used to access

these clusters. Thus, to provide an automatic assessment of clustering and switching strategies,

the authors resort to the notions of semantic similarity and semantic relatedness. The compu-

tation of these measures relies on the publicly available lexical database WordNet (Fellbaum

2010, Miller 1995). In this resource, each word is characterized by its morphosyntactic category

(e.g., noun, adjective), senses (possible different meanings), glosses (definitions), and semantic

relations (e.g., synonymy). To estimate how semantically similar two words are, the hyponymic

(i.e., "is-a") relation between words was used. In this way, WordNet was represented as a hier-

archy and the distance between two words was calculated as the distance between the locations

of these words in the hierarchy. Semantic relatedness has been computed using the Gloss Vec-

tors approach (Patwardhan & Pedersen 2006). This method leverages WordNet and word

co-occurrence frequency information computed from large corpora. A semantic representation

of a word is built as a high-dimensional second-order context vector, wherein each dimension

is represented by a term that co-occurs with the terms contained in the gloss of the word be-

ing analyzed. The corpus is composed of 113 patients with MCI and possible or probable AD.

Data are in the English language. Patients were administered a comprehensive test battery

considering the verbal fluency task for the animals category, and other tests assessing different

cognitive domains. The latter have been included to verify their relationship with the semantic

indices investigated. Results have shown that similarity and relatedness indices were corre-

lated with tests assessing executive functions, attention, and memory. Statistical differences

between the MCI and the probable AD group were also investigated, finding, surprisingly, that

the AD group produced higher scores in the similarity and relatedness indices.
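The hierarchy-distance idea can be illustrated with a toy "is-a" taxonomy. The hand-built `PARENT` table below is a hypothetical stand-in for the full WordNet noun hierarchy used by the authors; the similarity definition mirrors WordNet's path similarity.

```python
# Toy "is-a" hierarchy: each word maps to its immediate hypernym.
PARENT = {
    "dog": "canine", "wolf": "canine", "cat": "feline",
    "canine": "carnivore", "feline": "carnivore",
    "carnivore": "mammal", "cow": "mammal", "mammal": "animal",
    "trout": "fish", "fish": "animal",
}

def ancestors(word):
    """The word itself followed by its chain of hypernyms up to the root."""
    chain = [word]
    while word in PARENT:
        word = PARENT[word]
        chain.append(word)
    return chain

def path_similarity(a, b):
    """1 / (shortest is-a path length + 1), as in WordNet path similarity."""
    ca, cb = ancestors(a), ancestors(b)
    best = None
    for i, node in enumerate(ca):
        if node in cb:
            d = i + cb.index(node)
            best = d if best is None else min(best, d)
    return 1.0 / (best + 1) if best is not None else 0.0
```

Averaging the similarity of consecutive responses then yields a clustering index: a run of close animals (dog, wolf) scores high, while a switch to a distant one (trout) pulls the mean down.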

Miller et al. (2013) evaluated the feasibility of interactive voice response (IVR) technology to


provide neuropsychological tests to older adults. Participants were administered the Wechsler

Adult Intelligence Scale - IV (WAIS-IV), the Wechsler Memory Scale fourth edition (WMS-IV),

the verbal fluency task for the fruit category, and the digit span forward and backward test.

The study involved 158 English speaking subjects, with an age ranging from 65 to 92 years. The

algorithms for the IVR tasks were developed by TelAsk technologies (TelAsk 2019), the word

recognition engine used is the Nuance Open Speech Recognizer (Nuance 2019). The system

was not trained to optimize the recognition of individual speakers. No other further details on

the recognition approach were provided. The feasibility of the system was analyzed in terms

of its capability to independently administer and score simple neuropsychological tests, and

in terms of its capability to provide comparable results to in-person administration. Results

have shown that only 4% of participants were unable to complete all the tasks, indicating that,

overall, the system was easy to use. In the verbal fluency task, 90% of the fruits were correctly

recognized, while in the digit span tests, the number of sequences correctly recognized was

93% and 95%, respectively. Overall, clinician and IVR system scoring in the three tests were highly

correlated (r=0.89, r=0.95, r=0.94), but the study also reports a lack of high agreement between

clinician and computer scoring (41.1%, 63.8%, 68%). According to the authors, this represents

the greatest obstacle to the use of these systems in clinical practice. To conclude, the authors

also acknowledge that the different mode of administration of the tasks may change what the

tests measure, and that IVR may possibly introduce new variables in the cognitive evaluation.
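The distinction between correlation and exact agreement is worth making concrete. The scores below are invented solely to show that a systematic one-point offset leaves Pearson's r nearly untouched while collapsing exact agreement, which is the pattern the study reports between clinician and computer scoring.

```python
import numpy as np

def corr_and_agreement(clinician, system):
    """Pearson correlation vs. proportion of exactly matching scores."""
    a = np.asarray(clinician, dtype=float)
    b = np.asarray(system, dtype=float)
    r = np.corrcoef(a, b)[0, 1]
    agreement = float(np.mean(a == b))
    return r, agreement

# Hypothetical test scores: the system is almost always one point low.
r, agree = corr_and_agreement([10, 12, 14, 16, 18], [9, 11, 13, 15, 18])
```

Here r exceeds 0.99 while only one score in five matches exactly, so a high correlation alone does not guarantee clinically interchangeable scores.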

Lopez-de Ipina et al. (2015) performed an automated analysis of a semantic verbal flu-

ency task in order to distinguish between MCI and control subjects. To this purpose, the au-

thors used a corpus composed of 187 healthy subjects and 38 patients diagnosed with MCI.

Speech samples were processed in order to compute several linear and non linear features,

among which the first 12 MFCC, the pitch and its variation, the Castiglioni Fractal Dimension

(CFD) (Castiglioni 2010), and the Permutation Entropy (PE). The fractal dimension quantifies

the roughness of a temporal signal and estimates its degrees of freedom. According to the

authors, this feature has the ability to capture the dynamics of a system and thus may reveal

relevant variations in speech utterances. Feature selection was performed with the analysis

of variance (ANOVA) (Fisher 1919) test, which reduced the original feature set by more than

50%. Experiments were performed with SVM (Cortes & Vapnik 1995) using 10-fold cross-

validation, results were provided for the two groups separately. Overall, there was a strong

difference in terms of classification accuracy between the control and the MCI group. In fact,

with a combination of linear features, CFD, and PE, the authors reported an accuracy of 85%

and 50% in classifying the control and the MCI group, respectively. Using feature selection on


the same set of features, the accuracy for the control group improved to 90%, while for the MCI

group decreased to 40%.
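Permutation Entropy has a compact standard definition (Bandt and Pompe's ordinal-pattern entropy). The sketch below uses common default parameters, which are not necessarily those chosen in the study.

```python
import numpy as np
from math import factorial

def permutation_entropy(x, order=3, delay=1):
    """Normalized permutation entropy: Shannon entropy of the distribution of
    ordinal patterns of `order` samples spaced by `delay`, divided by its
    maximum log(order!); 0 for a monotone signal, close to 1 for white noise."""
    x = np.asarray(x)
    n = len(x) - (order - 1) * delay
    counts = {}
    for i in range(n):
        pattern = tuple(np.argsort(x[i:i + order * delay:delay]))
        counts[pattern] = counts.get(pattern, 0) + 1
    p = np.array(list(counts.values())) / n
    return float(-(p * np.log(p)).sum() / np.log(factorial(order)))
```

A monotone ramp yields a single ordinal pattern and entropy 0, while Gaussian noise spreads mass over all order! patterns and approaches 1, which is why the measure can capture irregularity in the speech dynamics.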

In a following work, Pakhomov et al. (2015) exploited ASR to assess the spoken responses

produced to the semantic verbal fluency test for the animals category. The authors used a

combined approach consisting of a constrained language model, a speaker-adapted acoustic

model, and confidence scoring to filter the ASR output. The corpus was composed of 38 En-

glish speaking professional fighters participating in a longitudinal study of effects of repetitive

head trauma on brain function. Responses were recorded and manually evaluated. The

assessment also comprised the reading of a text passage (∼30 seconds) that was used to perform the

adaptation of the acoustic model. The language model was trained using a corpus of responses

to the animal verbal fluency test provided by 1367 subjects. Finally, the authors also exper-

imented with confidence scores to filter the ASR output of the responses. Responses to the

verbal fluency test contained a large number of disfluencies, noise, and non-speech events that

led to relatively poor baseline ASR performances (WER 89%). However, both speaker adapta-

tion and confidence scoring, individually, improved the baseline result leading to a reduction

of the WER to 70%. Using only the adaptation of the acoustic model, the correlation between

automatically and manually computed scores was relatively high (r=0.80). A closer inspection

revealed the existence of individual samples in which there were clear differences between the

two scores. Extraneous comments were found to be the biggest contributors to these

discrepancies. Using a constrained language model, these non-animal words are likely to result in lower

overall confidence scores, and thus they can be easily filtered out. In fact, after the confidence

scoring approach, the correlation between automatically and manually computed scores im-

proved to 0.86. Overall, the confidence score approach reduced the number of insertions, but

with the trade-off of an increased number of deletions. The combination of speaker adaptation

and confidence scores filtering led to an improvement of the results, with a reduction of the

WER to 53%.
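The confidence-filtering step reduces to discarding low-confidence hypotheses before counting distinct responses. The word/confidence pairs and the threshold below are invented for illustration only.

```python
def fluency_score(hypotheses, min_conf=0.6):
    """Count distinct accepted words among (word, confidence) ASR hypotheses.
    With a language model constrained to animal names, extraneous comments and
    disfluencies tend to receive low confidence and are dropped here."""
    return len({word for word, conf in hypotheses if conf >= min_conf})

# Hypothetical ASR output for one verbal fluency response.
hyp = [("cat", 0.93), ("dog", 0.88), ("cat", 0.91),
       ("the", 0.21), ("horse", 0.74)]
```

`fluency_score(hyp)` counts three distinct animals; raising the threshold trades fewer insertions for more deletions, the same trade-off the authors observed.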

4.2.2 Cognitive tests assessing memory, attention, orientation

Currently, there are few works targeting the automatic administration of cognitive tests

through the integration of SLT. Among them, I mention the work of Coulston et al. (2007).

The authors investigated the use of a computerized in-home monitoring system incorporating

SLT for the early detection of AD. The system is designed as a kiosk, providing an unattended

battery of questionnaires and cognitive tests. Appointments are scheduled either in person

or by phone; the selected times are entered by the study coordinators via a web interface. A


session has an approximate duration of 30 minutes; responses are typically recorded and, with

just one exception, processed manually. Instructions to the user are explained through a short

video. Interactions take place exclusively via touchscreen or speech; in fact, speech recogni-

tion is enabled for simple navigation through the interface (e.g., a yes/no question). The client

is synchronized regularly with a remote server either to upload the results of the evaluation,

which is cached on the local file system on the client machine, or to check for newly scheduled

testing appointments. Questionnaires include self-report questions about the quality of life,

medication adherence, and how well participants are able to complete activities of daily life.

Cognitive tests include four tasks relying on speech technology and a task requiring partici-

pants to connect labeled dots using their finger on the touchscreen. Among the speech tasks,

the study considered word list recall, backward digit span, category fluency, and the East

Boston story recall. At the time of the study, speech recognition was used to automatically rec-

ognize and score only the backward digit span. Speech detection was used to determine when

to send the patient an encouragement to continue the task, or to ask if he/she has finished. The

work does not include any evaluation of the system.

Wang & Starren (1999) implemented a speech-enabled version of the MMSE (Folstein et al.

1975) in order to evaluate the capabilities of the Java Speech Application Programming Inter-

face (JSAPI) (Oracle 2019) for speech recognition and synthesis. Interactions with the system

may happen through voice, mouse, and keyboard. In order to implement the MMSE com-

pletely, some questions had to be modified. This was needed for those questions that require

human supervision, such as tasks in which the patient has to perform different actions or those

tasks requiring reading or writing abilities. In this application, the ASR system uses a rule-

based grammar. In fact, rule-based recognition is well suited when the number of inputs is

limited, providing, in general, higher accuracy of recognition. For all the tasks except a ques-

tion related with the current date, the grammar was programmed in advance. In what concerns

the date question, since it changes daily, the grammar had to be dynamically generated. The

system integrates a scoring module that computes the result of each individual answer. The

scoring component was implemented with a Boolean variable that is set to true for each ques-

tion answered correctly. However, at the time the study was completed, the score for writing

a sentence was processed manually, but the possibility of feeding the patient’s input into an

external syntactic parser was being explored. Usability tests performed with five graduate stu-

dents revealed an overall satisfaction with the system. Furthermore, the average automatic

score of 24.8 computed by the system was quite close to the average manual score of 26. To

conclude, the authors reported that, for significantly impaired patients, interaction relying

entirely on the computer would probably be impractical, but the system still has potential for

clinical use as a routine screening tool for cognitive disorders.
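The dynamically generated date grammar can be sketched as follows. The JSGF-style rule text is a hypothetical illustration of the approach, not the grammar of the original JSAPI application.

```python
import datetime

def date_grammar(today=None):
    """Build a rule-based grammar accepting the current date. Unlike the
    static rules for the other MMSE questions, this one must be regenerated
    every day because the set of correct answers changes."""
    today = today or datetime.date.today()
    day = today.strftime("%A").lower()
    month = today.strftime("%B").lower()
    return (f"#JSGF V1.0; grammar date; public <answer> = "
            f"[today is] [{day}] {month} {today.day} {today.year};")
```

A rule-based recognizer loaded with this grammar only has to distinguish among a handful of phrasings, which is why this style of recognition is accurate when the input space is small.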

Lehr et al. (2012) developed a framework to automatically analyze the responses provided

to the Wechsler Logical Memory (WLM) test, part of the Wechsler Memory Scale (WMS) bat-

tery (Wechsler 1997). During the test, which is used to assess memory function, the examiner

reads a brief narrative that the subject is required to retell twice, immediately and after an inter-

val of about 30 minutes. Responses are graded according to how many key story elements are

recalled, in any order, from a list of 25 predetermined story elements. The corpus is composed

of 72 English-speaking subjects, 35 diagnosed with MCI and 37 healthy individuals. Three

different adaptation strategies of the acoustic model were evaluated, leading to important im-

provements of the WER (47.5%, 39.8%, and 41.7%). The automatic transcriptions were then

used to derive word-level alignments between each retelling and the WLM source narrative.

The Berkeley aligner (Liang et al. 2006) was used to obtain the alignments; it was trained on

a source-to-retelling and retelling-to-retelling parallel corpus. The alignments, along with the

WLM administration guidelines, made it possible to determine which retelling words match

the story elements. Finally, the story elements are used as features for diagnostic classification.

Each subject is associated with a feature vector of length 50, containing 25 story element features

for the immediate retelling and 25 story element features for the delayed retelling. The fea-

tures correspond to the 25 WLM story elements having a value of 1 if the story element was

recalled and 0 otherwise. An SVM (Cortes & Vapnik 1995) model was trained with the story el-

ements feature vectors manually extracted from the held-out dataset. The model is then tested

on the story element feature vectors extracted from the ASR output with the three acoustic

models. Results have shown that when the ASR quality improves, classification accuracy also

improves (75.4%, 77.7%, 80.9%), yielding outcomes comparable to those of manually derived

features (81.5%).
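The feature construction can be sketched with a toy element list. The real WLM uses 25 predetermined story elements and matches come from the word alignments rather than exact string lookup; the four elements below are hypothetical stand-ins.

```python
import numpy as np

# Hypothetical stand-ins for the predetermined story elements.
ELEMENTS = ["anna", "boston", "robbed", "police"]

def retelling_features(immediate, delayed):
    """Binary vector of length 2 * len(ELEMENTS): 1 if an element was recalled
    in the immediate retelling (first half) or in the delayed retelling
    (second half), 0 otherwise."""
    def hits(text):
        words = set(text.lower().split())
        return [1 if element in words else 0 for element in ELEMENTS]
    return np.array(hits(immediate) + hits(delayed))
```

One such vector per subject is what the SVM classifier is trained and tested on.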

Hakkani-Tur et al. (2010) investigated the usability of automated methods for evaluating

verbal cognitive status assessment tests. The work is focused on two types of tests: a story-

recall task, used to assess memory and language functioning, and a picture description task,

used to assess the information content in speech. For the story retelling stimulus, the WLM

subtest of the WMS was used (Wechsler 1997), while for the picture description test, the Picnic

picture included in the Western Aphasia Battery (WAB) was selected (Kertesz 1982). The corpus

is composed of 123 English-speaking participants aged 20 to 102. The goal of the work is to prove that

measures derived automatically from the subject’s speech provide high correlation with cor-

responding measures extracted manually. For these reasons, speech samples were manually


transcribed and annotated, and also processed with an ASR in order to obtain the automatic

transcriptions. The speech recognizer was developed for recognition of meetings with close

talking microphones, using acoustic data of young speakers (Stolcke et al. 2008), no model

adaptation was performed. Preliminary results have shown a WER of 30.7% and 26.7%, for

the story retelling and the picture description tasks, respectively. For the story retelling test,

the authors extracted 35 atomic semantic content units, a sentence-level piece of information

comparable to a fact, while for the picture description test, a list of 36 units subdivided into 4 key

categories was used. The information content of the descriptions was then evaluated based

on the number of information units produced. Recall, precision, and F-score of uni-grams and

bi-grams are then computed on the story retelling test, while recall is computed for the picture

description test. Finally, the correlation between the manual evaluation scores and automatic

metrics is derived both from manual and ASR transcriptions. With respect to the story re-

call test, uni-gram F-score provided the highest correlation, with manual transcriptions

achieving a higher score (r=0.85) than automatic ones (r=0.70). Regarding the picture descrip-

tion task, the correlation of uni-gram recall computed on the manual and automatic transcrip-

tions was 0.93 and 0.89, respectively.
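The content-unit metrics reduce to set operations over words. The sketch below simplifies the sentence-level units to their bags of words, and the toy units are invented, not those of the WLM narrative or the Picnic picture.

```python
def unigram_prf(retelling, content_units):
    """Unigram precision, recall, and F-score of a retelling against the
    vocabulary of the predefined content units."""
    hyp = retelling.lower().split()
    ref = {w for unit in content_units for w in unit.lower().split()}
    precision = sum(w in ref for w in hyp) / len(hyp) if hyp else 0.0
    recall = len(ref & set(hyp)) / len(ref) if ref else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f
```

Correlating such automatic scores with the examiner's manual scores, on both manual and ASR transcripts, is exactly the evaluation the authors carried out.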

4.3 Monitoring of language abilities

Language abilities are evaluated through the assessment of isolated, specific functionality (e.g.,

naming), or through the evaluation of discourse production. The latter can provide a broader and

more comprehensive vision of linguistic impairments. In fact, discourse production can be as-

sessed along two dimensions: microlinguistic, concerned with lexical and syntactic processing,

and macrolinguistic, focused on pragmatic processing. The former yields data about language-

specific abilities for processing phonological, lexical, and syntactic aspects of single words and

sentences. The latter depends on the integration of linguistic and non-linguistic knowl-

edge for maintaining conceptual, semantic, and pragmatic organization at the suprasentential

level (Kintsch 1994, Kintsch & Van Dijk 1978, Marini et al. 2011). In a discourse production task,

microlinguistic aspects are quantified by lexical error measures (i.e., verbal paraphasias and use

of indefinite terms), and by syntactic measures consisting of omissions or errors in grammati-

cal forms and syntactic complexity. Macrolinguistic aspects of language production are instead

quantified by ratings of cohesion and coherence. In the literature reviewed for this area, I have

found several works targeting an automatic evaluation of discourse production. The focus of

these studies is on some aspects of the micro- and macrolinguistic dimensions, with the latter

being approached only very recently. Overall, these works assess the quality of discourse production


through the automatic analysis of a combination of lexical, syntactic, acoustic, and semantic

features.

Among the works targeting an automatic analysis of narrative speech, I briefly recall the

work of Hakkani-Tur et al. (2010), presented at the end of the previous section. In this work,

the authors used a picture description test to assess the information content in speech. To this

end, they considered the number of information units produced with respect to a predefined

list of 36 units subdivided in 4 key categories.

Orimaye et al. (2014) investigated five different computational models for predicting AD

and related dementias using several syntactic and lexical features. The corpus used in this work

is the DementiaBank (MacWhinney et al. 2011, TalkBank 2017), a public database for the study

of communication in dementia. The collection was gathered in the context of a longitudinal study with yearly visits; demographic data, including education level, are provided. It contains

the recordings of healthy individuals and subjects diagnosed with AD, MCI, and other disor-

ders, while performing four tasks. Among these, there are also the descriptions of the Cookie

Theft picture from the BDAE (Goodglass et al. 2001). Recordings were manually transcribed at

word level following the TalkBank Codes for the Human Analysis of Transcripts (CHAT) proto-

col (MacWhinney 2000). The data are in English. In their work, Orimaye et al. considered

242 samples from patients diagnosed with various dementias, mostly of the AD type, and 242

samples from the control group. The authors identified 21 relevant features: 9 syntactic, 11 lexi-

cal, and age as a confounding feature. Syntactic features relied both on the annotations existing

in the original transcriptions and on the annotations extracted with the Stanford parser (Klein

& Manning 2003a). These features included the set of coordinated, subordinated, and reduced

sentences, the dependency distance used as a measure of grammatical complexity, the number

of dependencies, predicates and their average, and production rules. Lexical features consid-

ered the total number of utterances and their mean length, the total number of function and

unique words, revisions, and the number of word repetitions. A revision indicates that the

patient retraced a preceding error and then made a correction. The feature extraction stage is

followed by statistical tests to identify the most important features. Additionally, the authors

also performed a stage of feature selection with the Information Gain (IG) method. In this

way, they found that the eight features with the highest IG value matched the subset of eight

significant features identified through the statistical tests. Experiments were performed using

five different models: SVM (Cortes & Vapnik 1995), Naïve Bayes (NB), Bayes Network (BN), J48 Decision Tree (DT), and ANN. Performance is measured in terms of precision, recall, and F-score; 10-fold

cross-validation was implemented for each model. Results identified SVM (Cortes & Vapnik


1995) as the best predicting algorithm, achieving the highest F-score of 74% on the disease

group. Overall, results have demonstrated that the patient group used less complex sentences

than the control group and produced more grammatical errors.
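The Information Gain criterion used in the feature selection stage is the reduction in class entropy obtained by conditioning on a feature. A minimal sketch for discrete features follows; the authors' actual pipeline and feature values are not reproduced here:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """IG of a discrete feature with respect to the class: H(Y) - H(Y|X)."""
    n = len(labels)
    conditional = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        conditional += len(subset) / n * entropy(subset)
    return entropy(labels) - conditional

# A feature that perfectly separates a balanced binary problem has IG = 1 bit;
# a feature independent of the class has IG = 0.
y = [0, 0, 1, 1]
ig_perfect = information_gain([0, 0, 1, 1], y)
ig_useless = information_gain([0, 1, 0, 1], y)
```

Ranking features by this quantity and keeping the top scorers reproduces the selection idea; continuous features would first need discretization.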

Jarrold et al. (2014a) evaluated the predictive capabilities of different machine learning al-

gorithms in the problem of diagnosing four dementia subtypes. Lexical and acoustic features

were automatically computed from the speech recordings, and the associated transcriptions, of

the Picnic picture description test (Kertesz 1982). The corpus is composed of 9 healthy individ-

uals and 39 patients diagnosed with AD (N=9), FTD (N=9), svPPA (N=13), and nfvPPA (N=8).

Acoustic features were extracted with the Meeting Understanding system (Stolcke et al. 2008)

and consider measures related with the duration of consonants, vowels, pauses, and other

acoustic-phonetic categories. Lexical features included frequencies of different morphosyntac-

tic categories, also known as Part of Speech (POS) annotations, and frequencies of words orga-

nized according to 81 categories. To evaluate the sensitivity to speech recognition errors, fea-

tures were computed relying both on the automatic and on the manual transcriptions. Lexical

and acoustic features were combined to form a unique vector characterizing each speaker. The

most informative features were selected through a one-way ANOVA (Fisher 1919) performed

on each feature in each group with respect to the diagnosis. Evaluation was conducted using

5-fold cross-validation over the set of patients. Logistic regression, MLP (Rosenblatt 1958), and

DT (Mitchell 1997) were evaluated. MLP achieved a slightly better performance with an ac-

curacy of 88% in the classification of AD versus FTD, and AD versus control subjects. When

comparing this result with the one obtained using manual transcriptions, the authors found a

difference in classification accuracy of only 2-3%.
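The ANOVA-based selection step can be sketched as ranking features by their F-value across diagnostic groups. The feature names and values below are invented, and scipy's `f_oneway` stands in for whatever statistical package the authors used:

```python
from scipy.stats import f_oneway

def anova_rank(feature_by_group):
    """Rank features by one-way ANOVA F-value across diagnostic groups.

    feature_by_group maps a feature name to a list of per-group value lists;
    returns (F, p, name) tuples sorted by decreasing F.
    """
    scored = []
    for name, groups in feature_by_group.items():
        f_value, p_value = f_oneway(*groups)
        scored.append((f_value, p_value, name))
    return sorted(scored, reverse=True)

# Invented values for two hypothetical features and two groups.
features = {
    "pause_duration": [[0.9, 1.0, 1.1], [2.0, 2.1, 1.9]],   # clearly separated
    "vowel_duration": [[1.0, 1.2, 0.8], [1.1, 0.9, 1.0]],   # overlapping
}
ranking = anova_rank(features)
```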

Fraser et al. (2016) automatically computed a number of linguistic and acoustic variables

from the recordings, and the associated transcription, of the Cookie Theft picture description

task (Goodglass et al. 2001), contained in the DementiaBank database (MacWhinney et al. 2011,

TalkBank 2017). The dementia group included participants with a diagnosis of possible AD

or probable AD, resulting in 240 samples from 167 participants. The control group included

233 samples from 97 speakers. The authors considered a large number of features, more than

350, derived from the areas of linguistics, psycholinguistics, and speech processing. Relying on

previous studies showing that AD patients may exhibit altered proportions of nouns, adjectives,

and verbs (Bucks et al. 2000a, Jarrold et al. 2014b), the frequency of occurrence of different POS

tags was calculated. Syntactic complexity was measured through mean length of sentences,

T-units, clauses, and scores calculated on the results of a parse tree computed using the Stan-

ford parser (Klein & Manning 2003a). Then, in order to further explore syntactic differences,


the frequency of occurrence of different grammatical constituents were quantified. Vocabulary

richness was assessed using type-token ratio (TTR), moving-average type-token ratio (MATTR)

(Covington & McFall 2010), Brunet's index (Brunet 1978), and Honoré's statistic (Honoré 1978).

Psycholinguistic features were included with the intuition that a semantic impairment may

be manifested through an increased reliance on familiar words. Thus, different norms were

used to rate content words, nouns and verbs in terms of familiarity, imageability, and age-

of-acquisition. To capture decreased information content, the authors computationally

measured the mentioning of relevant lexical items relying on a list of expected information

units (Croisile et al. 1996). Finally, acoustic analysis included several features used in the litera-

ture as indicative of pathological speech and the first 42 MFCC. Multilinear logistic regression

with 10-fold cross-validation was used to classify between AD and healthy controls. At each iteration, 90% of the data is used to train the model and to select the most useful features, while

the remaining 10% is retained for validation. Feature selection was performed choosing the N

features with the highest Pearson’s correlation coefficient between each feature and the binary

class. The maximum average accuracy was 81.9%, achieved with the 35 top-ranked features.

Using all the features, the classification accuracy drops to 58.5%. Furthermore, to examine the

underlying structure of the data, the authors performed an exploratory factor analysis, finding

that four factors were the most relevant: semantic impairment, acoustic abnormality, syntactic

impairment, and information impairment.
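The vocabulary-richness measures mentioned above follow standard definitions: TTR = V/N; MATTR averages TTR over a fixed-size sliding window (Covington & McFall 2010); Brunet's index is W = N^(V^-0.165) (Brunet 1978); and Honoré's statistic is R = 100 log N / (1 - V1/V) (Honoré 1978), with N tokens, V types, and V1 hapax legomena. A minimal sketch:

```python
import math
from collections import Counter

def lexical_richness(tokens, window=3):
    """TTR, MATTR, Brunet's index, and Honore's statistic for a token list.

    N = number of tokens, V = number of types, V1 = hapax legomena.
    Honore's statistic is undefined when every type is a hapax (V1 == V).
    """
    n = len(tokens)
    counts = Counter(tokens)
    v = len(counts)
    v1 = sum(1 for c in counts.values() if c == 1)
    ttr = v / n
    # moving-average TTR over all windows of fixed size
    mattr = sum(len(set(tokens[i:i + window])) / window
                for i in range(n - window + 1)) / (n - window + 1)
    brunet = n ** (v ** -0.165)
    honore = 100 * math.log(n) / (1 - v1 / v)
    return {"ttr": ttr, "mattr": mattr, "brunet": brunet, "honore": honore}

m = lexical_richness("the boy takes the cookie the girl laughs".split())
```

In practice MATTR uses much larger windows (e.g., 10 or more tokens); a window of 3 is used here only so the toy transcript yields multiple windows.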

In a later work, Fraser & Hirst (2016) investigated distributed word representations to

detect semantic changes that may occur in AD. The authors constructed two semantic spaces,

one for the control and one for the patient group, and analyzed the differences between them.

To this end, they built a simple word-word co-occurrence model with the transcriptions of

the Cookie Theft picture (Goodglass et al. 2001). The corpus used in this study contains the

same data described in the previous work of these authors (Fraser et al. 2016). Differences in

the groups were found in eleven word vectors: /three/, /another/, /put/, /side/, /getting/,

/spill/, /splash/, /which/, /up/, /say/, and /fall/. A contextual analysis for these words re-

vealed two different scenarios. In one case, control participants used a number of context

words not used by the AD participants. These words were associated by the authors with a

certain attention to detail. In the other case, AD participants did not use a number of context

words used by the control group. These words were associated with implausible details. Then,

the authors performed a shifting of word vector representations in order to understand how

the words examined have moved in the vector space. Results showed that, in many cases,

the word representations in the AD and control corpora lay very close to each other. When


vector representations were quite distant, the corresponding words were used in different con-

texts. An example is provided for the verb /getting/. Examining the surrounding vectors, it

appears that /getting/ is closer to /running/, /overflowing/, and /falling/ in the AD corpus,

while it is closer to words like /reaching/ and /ask/ in the control corpus. Thus, in order to

discover the multiple senses and the different context in which these terms appear in the two

corpora, the authors also performed a cluster analysis. In this case, results revealed that most word senses were used by both groups; some rare senses used only by the AD group corresponded to semantic errors.
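The word-word co-occurrence model at the core of this analysis can be sketched as counting context words within a symmetric window and comparing words by cosine similarity; the toy sentence below is illustrative, not drawn from the corpus:

```python
import math
from collections import defaultdict

def cooccurrence_vectors(tokens, window=2):
    """Word-word co-occurrence counts within a symmetric context window."""
    vectors = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                vectors[word][tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (dicts)."""
    dot = sum(u[k] * v[k] for k in set(u) & set(v))
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

vectors = cooccurrence_vectors("the water is overflowing and the sink is overflowing".split())
sim = cosine(vectors["water"], vectors["sink"])
```

Building one such model per group and comparing the resulting vectors for the same word is the essence of the between-group analysis described above.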

Yancheva et al. (2015) used a set of 477 automatically extracted lexicosyntactic, acoustic, and

semantic features to estimate clinical MMSE (Folstein et al. 1975) scores along time. The corpus

used in the study consists of the recordings of the Cookie Theft picture (Goodglass et al. 2001),

contained in the DementiaBank database (MacWhinney et al. 2011, TalkBank 2017). The au-

thors considered only subjects with associated MMSE scores, resulting in 393 speech samples

from 255 subjects (165 AD, 90 control subjects). To estimate clinical MMSE scores, a bivari-

ate dynamic Bayes Network was used to represent the longitudinal progression of linguistic

features and MMSE scores. Lexicosyntactic features were extracted from syntactic parse trees

constructed with the Brown parser (Charniak & Johnson 2005) and from the annotations pro-

vided with the transcriptions of the narratives. A total of 182 measures was computed, among

which vocabulary richness, syntactic complexity, repetitions, and phrase types. Acoustic mea-

sures included the standard MFCC, formant features, and measures of disruptions in vocal

fold vibration regularity, leading, overall, to 210 features. Finally, semantic measures assessed

the ability to describe concepts and objects of the Cookie Theft picture. To this purpose, 85 fea-

tures were used to verify that a key concept was mentioned, and to compute word frequencies.

Then, three feature selection methods were exploited to identify the most informative mea-

sures. The first method selected a set of top 10 features, which was corroborated by the second

and third methods. Interestingly, acoustic features were not included among the top 10 fea-

tures. Performance was measured in terms of the Mean Absolute Error (MAE) between actual

and predicted MMSE scores. In the experimental phase, both the size of the feature set and

the feature selection methods varied. Experiments were performed with LOO cross-validation.

The lowest MAE of 3.83 was achieved using the correlation with the MMSE as a feature se-

lection method and choosing the top 40 features. To evaluate the effect of longitudinal data

in the prediction of the MMSE score, the authors repeated the same experiment, but dividing

the dataset according to the number of longitudinal samples. In this case, results showed that

the lowest MAE for each feature selection method was found on the dataset with the highest


number of longitudinal visits (≥3).

In a later work, Yancheva & Rudzicz (2016) presented a generalizable method to au-

tomatically generate and evaluate the information content conveyed from the description of

the Cookie Theft picture (Goodglass et al. 2001). The data selected contained 255 speech sam-

ples from 168 participants diagnosed with probable or possible AD, and 241 samples from 98

healthy controls. The authors trained a word vector model on a large general-purpose corpus

composed of Wikipedia 2014 (Wikipedia 2014) and Gigaword 5 (LDC 2019). The trained model

consisted of 400,000 word vectors in 50 dimensions. Then, vector representations for each

word in the DementiaBank corpus were extracted using the previously trained model. Two

cluster models were built, one for each group, using the k-means algorithm. Clusters represent

topics, or groups of semantically related word vectors, discussed by the respective group of

subjects. All previous works related to the manual definition of content units for the Cookie

Theft picture were combined with the content units defined by a speech language pathologist

expert. To evaluate the generated clusters, the Euclidean distance between each clinical con-

tent unit and its closest cluster centroid, in each model, was computed. For both groups, the

recall was 96.8%. This measure was defined as the percentage of content units whose dis-

tance to the cluster centroid was less than the distance of 99.7% of the datapoints in the cluster.

Different semantic features were extracted from the two models and then used in the classifi-

cation among patient or control. Experiments were performed with a RF classifier and 10-fold

cross-validation, varying the cluster model and the feature set. Using a set of 12 automatically extracted features, an F-score of 0.74 was achieved. This value is higher than the score

obtained with a set of 85 manual features (0.72). With a combination of the 12 features auto-

matically extracted, and the set of lexicosyntactic and acoustic features introduced by Fraser

et al. (2016), the F-score improved to 0.80.
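The cluster-evaluation step can be sketched as measuring the Euclidean distance from each clinical content unit to its closest topic centroid. The vectors, centroids, and fixed coverage radius below are illustrative simplifications of the authors' 50-dimensional setup and their 99.7%-of-datapoints criterion:

```python
import math

def nearest_centroid_distance(vector, centroids):
    """Euclidean distance from a content-unit vector to its closest topic centroid."""
    return min(math.dist(vector, c) for c in centroids)

# Toy 2-D "semantic space": two topic centroids and three clinical content units.
centroids = [(0.0, 0.0), (5.0, 5.0)]
content_units = {"cookie": (0.2, 0.1), "sink": (4.9, 5.2), "stool": (2.5, 2.5)}

distances = {u: nearest_centroid_distance(v, centroids)
             for u, v in content_units.items()}
covered = [u for u, d in distances.items() if d < 1.0]  # within a fixed radius
recall = len(covered) / len(content_units)
```

In the actual method the centroids come from k-means over word vectors of each group's transcripts, and the coverage threshold is data-driven rather than a fixed radius.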

Hernández-Domínguez et al. (2018) approached the automatic evaluation of information

content from the recordings and associated transcriptions of the Cookie Theft picture (Good-

glass et al. 2001), contained in the DementiaBank database (MacWhinney et al. 2011, TalkBank

2017). The authors selected 262 participants among the AD, MCI, and healthy control groups, providing

a total of 517 transcriptions (257 AD, 43 MCI, and 217 healthy control samples). Additionally,

25 healthy controls and their transcriptions were retained for the generation of a referent that

is used to automatically evaluate the informativeness and the pertinence of the descriptions.

The referent is created by extracting patterns that consider different manners of describing ac-

tions or situations. Linguistic and phonetic features were also considered, by accounting for

the frequency of different word classes, measures of vocabulary richness, and MFCC. Over-


all, a total of 105 features were computed. The authors investigated the correlation between

each feature and the severity of the disease, which was measured on a three-point rating scale

(healthy = 0, MCI = 1, and AD = 2). From this analysis, they found that the information cov-

erage measures appeared to be the variables most strongly correlated with the severity of the

cognitive impairment. Classification experiments were initially performed between the group

of healthy control and AD patients. In a following phase, the MCI group was joined with the

AD group. Evaluation was conducted with different classifiers using 10-fold cross-validation.

Results show F-scores of 81% and 82% for the first and second experiments, respectively, in

the identification of patients with cognitive impairments.

Very few studies approached linguistic deficits at a higher level of processing. Among

them, in the remainder of this section, I describe the work of Santos et al. (2017) and the work

of Toledo et al. (2018).

Santos et al. (2017) assessed coherence and cohesion in a population of subjects with MCI. The

dataset used in this study consists of manually transcribed samples of spontaneous speech

elicited with different types of stimuli: (i) the description of the Cookie Theft picture (Goodglass

et al. 2001), contained in the DementiaBank database (MacWhinney et al. 2011, TalkBank 2017),

(ii) the telling of the Cinderella story, and (iii) the immediate and delayed recall of Portuguese

narratives from the Arizona Battery for Communication Disorders of Dementia (ABCD).

From the DementiaBank database, the authors selected 43 transcriptions for the MCI and the

control group. The Cinderella and the ABCD datasets included, respectively, 20 and 23 subjects

with MCI, and 20 elderly controls. Discourse transcripts were modeled as a complex network

using the word adjacency model (i Cancho et al. 2004). With this approach, each distinct word

becomes a node and words that are adjacent in the text are connected by an edge. The authors

trained two 100-dimensional word embedding models, for English and Portuguese,

using Wikipedia dumps from October and November 2016 respectively. These models are

then used to enrich the complex networks. New edges are added between words whose word

vectors had a cosine similarity higher than a given threshold. Classification was performed

using topological metrics of the network, BOW representations, and linguistic features. These

were extracted with the tool Coh-Metrix (Graesser et al. 2004), which includes measures of

lexical diversity, syntactic complexity, word information, and text cohesion through latent semantic analysis. For the Portuguese language, the tool Coh-Metrix-Dementia (Aluísio et al. 2016) was used. Experiments were performed with different classifiers, using 5-fold

cross-validation and different combinations of features. Depending on the dataset used, the

accuracy was 52% (Cinderella), 65% (DementiaBank), and 74% (ABCD), achieved with a


combination of topological features computed on the enriched network, BOW, and linguistic

features.
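The word adjacency model, and its enrichment with embedding similarity, can be sketched with plain sets of edges. The similarity scores below are invented, whereas Santos et al. derived them from 100-dimensional Wikipedia-trained embeddings:

```python
def adjacency_network(tokens):
    """Word adjacency model: distinct words are nodes, adjacent words share an edge."""
    edges = set()
    for a, b in zip(tokens, tokens[1:]):
        if a != b:
            edges.add(frozenset((a, b)))
    return edges

def enrich(edges, similarity, threshold=0.8):
    """Add edges between word pairs whose embedding cosine similarity exceeds a threshold."""
    extra = {frozenset(pair) for pair, s in similarity.items() if s > threshold}
    return edges | extra

tokens = "the prince found the slipper".split()
base = adjacency_network(tokens)
# Invented similarity scores standing in for embedding-derived cosine values.
sims = {("prince", "slipper"): 0.85, ("found", "slipper"): 0.3}
full = enrich(base, sims)
```

Topological metrics (degree distributions, clustering coefficients, shortest paths) are then computed on the enriched network and fed to the classifiers.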

Toledo et al. (2018) analyzed macrolinguistic aspects of speech on a corpus of 60 Portuguese

subjects divided into three groups: AD, MCI, and healthy controls. Participants were required

to narrate the Cinderella story. Discourse samples were recorded, manually preprocessed and

transcribed. In order to extract macrostructural characteristics of discourse, features were com-

puted with the tool Coh-Metrix-Dementia (Aluísio et al. 2016) and by manual marking. The

analysis investigated two variables of discourse production: i) informativity and narrative

structure, and ii) global coherence and modalization (e.g., comments external to the content

of the story). To account for the first variable, the authors considered the number of proposi-

tions of each text. For the analysis of global coherence, the number of empty emissions, the

total ideas density feature, and the latent semantic analysis features were considered. Statisti-

cal analyses were performed to identify the features and metrics capable of differentiating the three

groups. The nonparametric Kruskal-Wallis (Kruskal & Wallis 1952) test was used to compare

performance among the three groups regarding the variables of interest. Results showed that

AD individuals produced fewer propositions than the MCI and healthy individuals, indicating less informative discourses with fewer references to what was expected for the narrative. They

also presented higher numbers of empty emissions without reference to the narrative, indicat-

ing greater difficulty in maintaining the theme. Additionally, AD individuals had difficulty in the planning and organization of ideas related to the topic, demonstrating a compromised textual macroplane. It was not possible to differentiate the groups based on features related

with global coherence.
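The three-group comparison can be sketched with scipy's implementation of the Kruskal-Wallis test; the proposition counts below are invented for illustration:

```python
from scipy.stats import kruskal

# Invented proposition counts per group, for illustration only.
ad = [4, 5, 3, 4, 5]
mci = [8, 9, 7, 8, 10]
control = [11, 12, 10, 13, 11]

h_stat, p_value = kruskal(ad, mci, control)
significant = p_value < 0.05  # the groups differ on this measure
```

The H statistic is compared against a chi-squared distribution with (number of groups - 1) degrees of freedom, so no normality assumption is needed, which suits the small per-group sample sizes typical of these studies.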

4.4 Summary

In this chapter, the state of the art of SLT solutions applied to the monitoring of speech, lan-

guage, and cognitive abilities has been presented. From this review, it is possible to under-

stand the accomplishments achieved in each of these areas, as well as limitations of current

research. Regarding speech abilities, one can observe that existing works investigated different

speech tasks and several acoustic measures to characterize the symptoms caused by dysarthria

in PD. Nevertheless, few studies analyzed common speech production tasks typically used for

diagnosis in terms of their utility for automatic PD discrimination. Research in the diagnosis

of cognitive abilities through neuropsychological tests showed that few works targeted the

automatic administration of cognitive tests through the integration of SLT. Furthermore, none

of these works target the Portuguese language. Existing solutions are only partially automated


and are focused on the implementation of a specific test, not providing the possibility to extend

the work to other tests. In the area of automatic monitoring of language abilities, I witnessed

an increasing body of research dedicated to the analysis of lexical, syntactic, and semantic as-

pects of discourse production. Up to now, however, very few works have addressed pragmatic aspects

of language. The limitations highlighted in each of these areas represent opportunities to con-

tribute to the research of automatic diagnostic methods of neurodegenerative diseases and will

be further developed in this dissertation.


5 Contributions to the Monitoring of Speech Abilities

As described in Chapter 3, dysarthria is a motor speech disorder characterized by weakness,

paralysis, or lack of coordination in the motor speech system, affecting respiration, phonation,

articulation and prosody. This impairment is characteristic of diseases such as Parkinson’s

Disease (PD), Huntington Disease (HD), and Amyotrophic Lateral Sclerosis (ALS). Several

speaking tasks are used to evaluate the extent of motor voice disorders, the most traditional

ones include the sustained vowel phonation, diadochokinesis, and variable reading of short

sentences, longer passages or freely spoken spontaneous speech (Goberman & Coelho 2002).

These tasks are subjected to a perceptual evaluation by the Speech Language Pathologist (SLP), who should be able to compare current outcomes with those resulting from a previous

evaluation. In this context, an automatic analysis of the result of these tests would provide an

additional evaluation that could be used to support the one provided by the SLP. From the

literature review of Chapter 4, I found an extensive body of research that investigated the use

of sensitive acoustic measures able to represent the symptoms of this disorder. These studies

differ in many aspects: the set of features considered, the speech tasks used for the analysis, and the statistical approach used in the characterization of the problem. Few studies

analyzed common speech production tasks typically used for diagnosis in terms of their utility

for automatic PD discrimination. For these reasons, my first approach to this problem targeted

the definition of a standard feature set and a classification strategy that can be suitable to un-

derstand the relevance of the various tasks. This work was published in TSD 2017 (Pompili

et al. 2017).

5.1 Automatic detection of Parkinson's Disease: an analysis of speech production tasks used for diagnosis

In this study, I am not interested in comparing the large amount of different acoustic measures

and learning approaches that have emerged along the years, but rather in defining a feature

set and an evaluation method in order to assess different speech tasks. To this end, I consider

some of the measures that are repeatedly mentioned in the majority of the works examined.

These features were carefully selected considering their sensitivity to represent deficits at various dimensions of language production: phonation, articulation, and prosody. In the literature, the most traditional measures used in examining phonation include measurement of F0, jitter, shimmer, and Harmonics to Noise Ratio (HNR) (Orozco-Arroyave et al. 2014, Rusz et al. 2011). Articulation is typically assessed considering differences in vocal tract resonances. The first and second formant frequencies and the vowel space area are frequently studied (Proenca et al. 2013, Vasquez-Correa et al. 2013). Prosodic analysis includes measurements of F0, intensity, articulation rate, pause, and rhythm (Bocklet et al. 2013, Rusz et al. 2011, Skodda & Schlegel 2008).

Descriptors                                          Functionals
Logarithmic F0 (1), Loudness (1)                     mean and stdev, mean and stdev of the
                                                     slope of rising/falling signal parts (x6)
Jitter (1), Shimmer (1), Formant 1 bandwidth (1),    mean and stdev (x2)
Formant 1, 2, 3 frequency (3), amplitude (3),
Harmonic to Noise Ratio (1),
Harmonic difference: H1-H2 (1), H1-A3 (1),
MFCC [1-12] (12), Log-energy (1), First and second
derivative of MFCC and Log-energy (26)

Table 5.1: Description of the acoustic features based on 53 low-level descriptors plus 6 functionals.

The complete set of features is reported in Table 5.1. These features have been extracted with

the openSMILE toolkit (Eyben et al. 2010), in order to allow the reproducibility of the results.

These features are first computed at the frame level, the so-called low-level descrip-

tors, which are obtained based on a subset of the Geneva Minimalistic Acoustic Parameter Set

(GeMAPS) (Eyben et al. 2016) and the MFCC pre-built configuration files. Then, in a second

step, two functionals (mean and standard deviation) are applied in order to obtain a feature

vector of constant length for the whole utterance. For some features (F0 and loudness), mean

and standard deviation of the slope of rising/falling signal parts were also computed. Finally,

a 114-dimensional feature vector composed of 78 MFCC-based features and 36 GeMAPS-based

features was obtained. Some other features, also frequently mentioned in the literature (e.g., the

articulation rate, pause analysis, or VSA), were not considered in this work in order to build a

general-purpose feature set, which could be suitable for each task under assessment.
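The collapse of frame-level low-level descriptors into a fixed-length utterance vector via functionals can be sketched as follows. In the actual pipeline, openSMILE computes the descriptors and the slope functionals apply only to F0 and loudness, yielding 114 dimensions; this sketch applies just the two main functionals, mean and standard deviation:

```python
import numpy as np

def apply_functionals(frames):
    """Collapse a (num_frames x num_descriptors) matrix of frame-level features
    into a fixed-length utterance vector via mean and standard deviation."""
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])

# Toy example: 100 frames of 53 low-level descriptors -> a 106-dimensional vector.
rng = np.random.default_rng(0)
frames = rng.normal(size=(100, 53))
vector = apply_functionals(frames)
```

Whatever the utterance length, the output dimensionality depends only on the number of descriptors and functionals, which is what allows fixed-size classifiers to be trained on variable-length recordings.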

5.1.1 Corpus

The FraLusoPark database (Pinto et al. 2016) has been used to assess the relevance of different

speech tasks. This is a new corpus of 140 European Portuguese speakers: 65 healthy controls

and 75 PD subjects, age-matched and sex-matched with the control group. Each participant


was required to perform 8 different speech tasks with an increasing complexity in a fixed order: (1) three repetitions of the sustained phonation of the vowel /a/, (2) two repetitions of the maximum phonation time (vowel /a/ sustained as long as possible), (3) oral diadochokinesis (repetition of the pseudo-word /pataka/ at a fast rate for 30 s.), (4-5) reading aloud of 10 words and 10 sentences, (6) reading aloud of a short text ("The North Wind and the Sun"), (7) storytelling speech guided by visual stimuli, and (8) reading aloud of a set of sentences with specific prosodic properties. The total duration of the recordings is 6 hours and 31 minutes for the control group, and 7 hours and 30 minutes for the PD group. Demographic data of the corpus are presented in Table 5.2. The study was approved by the ethics committee of the Faculty of Medicine at the Santa Maria University Hospital (Lisbon, Portugal). Data was manually preprocessed in order to remove the therapist's speech and the spontaneous interventions introduced by the subject that were not directly related to the task. After that, recordings were down-sampled to 16 kHz.

                      PD patients               Controls
                      M            F            M            F
Gender                38           37           34           31
Age                   64.6±11.9    66.9±8.5     62.4±12.4    66.6±14.4
Years diagnosed       6.7±4.5      10.8±5.6     —            —
MDS-UPDRS-III         32.1±12.9    38.3±14.5    —            —

Table 5.2: Demographic and clinical data for patient and control groups.

5.1.2 Evaluation

The selected model is a Random Forest classifier as implemented in the WEKA toolkit (Hall

et al. 2009). This implementation relies on bootstrap aggregating, also known as bagging, a

machine learning ensemble meta-algorithm designed to improve the stability and accuracy of

machine learning algorithms used in statistical classification and regression. Bagging reduces

variance and helps to avoid over-fitting. A stratified k-fold cross-validation per speaker strat-

egy is used for training and evaluation of each speech task separately, with k being equal to 5.

In this way, it is ensured that the train and the test sets at each iteration do not contain the same

speakers. Also, the percentage of speakers of each class is balanced in the two data sets at each

iteration.
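The speaker-stratified fold assignment can be sketched as follows: speakers, not individual samples, are distributed over the folds, with each class spread evenly. This is a hypothetical re-implementation of the idea, not the WEKA-based setup used in the experiments:

```python
from collections import defaultdict

def speaker_stratified_folds(speaker_labels, k=5):
    """Assign speakers to k folds so that all samples of a speaker share a fold
    and each class is spread evenly across the folds."""
    by_class = defaultdict(list)
    for speaker, label in sorted(speaker_labels.items()):
        by_class[label].append(speaker)
    folds = defaultdict(set)
    for speakers in by_class.values():
        for i, speaker in enumerate(speakers):
            folds[i % k].add(speaker)
    return dict(folds)

# 10 PD and 10 control speakers -> 5 folds of 2 PD + 2 control speakers each.
labels = {f"pd_{i}": "PD" for i in range(10)}
labels.update({f"hc_{i}": "HC" for i in range(10)})
folds = speaker_stratified_folds(labels, k=5)
```

At each cross-validation iteration, one fold's speakers (and all their segments) form the test set, which prevents the same speaker from leaking into both training and test data.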

In a first approach, the recordings of every speech production task for each speaker have

been processed as described previously to obtain a feature vector of 114 elements. I refer to this

approach as sentence-level feature extraction. This strategy results in a single feature vector

per speaker and task. In other words, cross-validation experiments for each task are limited


to only 140 sample vectors, which will probably result in poorly trained models and less reli-

able results. Alternatively, in order to increase the number of samples, I have also performed

a segmental feature extraction strategy. In this case, this strategy results in a feature vector,

as previously described, for each audio subsegment of fixed length equal to 4 seconds with a

time shift of 2 seconds. This approach permits increasing the amount of training samples for

the cross-validation experiments, besides extracting more detailed information of the speech

productions. Table 5.3 shows classification accuracy (%) results for each speech production

task following the two feature extraction strategies described previously: sentence-level and

segmental. As expected, the former approach led to poorer results, mostly motivated by the

reduced number of training samples (only 112 in each task at each cross-validation iteration).

However, one may also argue that in this way valuable information is lost when applying

the functionals to long speech segments as the ones corresponding to each speech produc-

tion. On the other hand, the segmental feature extraction strategy leads to remarkable im-

provements in terms of classification accuracy. In particular, the reading words task achieves a

maximum of 40.6% relative improvement, followed by the reading sentences task with 31.5%

relative improvement.
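The segmentation underlying the segmental strategy (4-second windows with a 2-second shift) can be sketched as:

```python
def segment_indices(duration, window=4.0, shift=2.0):
    """Start/end times (s) of fixed-length analysis segments over a recording."""
    segments = []
    start = 0.0
    while start + window <= duration:
        segments.append((start, start + window))
        start += shift
    return segments

# A 10-second recording yields four overlapping 4-second segments.
segments = segment_indices(10.0)
```

Each segment is then processed like a full utterance (low-level descriptors plus functionals), multiplying the number of training samples per speaker and task.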

Overall, from these results it is possible to observe that the reading prosodic sentences task

achieved the best recognition accuracy (85.10%). In fact, this is the best performing task also

in the case of sentence-level feature extraction. This observation confirms the relevance of this

task, which was carefully designed in order to explore language-general and language-specific

details of PD dysprosody. The second most discriminant task in terms of automatic PD classi-

fication is the storytelling one (82.32%). As a matter of fact, this task corresponds to the pro-

duction of spontaneous speech, since the subject has to create a story based on temporal events

represented in a picture. Although its overall duration is extremely variable and dependent on

the speaker, this task definitely contains much important acoustic and prosodic information.

This result is very encouraging for the development of tele-monitoring applications that may

use spontaneous speech recorded over the telephone.

The next most discriminant tasks are those consisting of reading short passages of text and

sentences. Again, I believe that these productions are richer in terms of acoustic and prosodic

information, which makes them more suitable for automatic PD detection, in contrast to less

informative rapid syllables repetitions or maximum phonation time of vowel /a/. In general, it

is likely that more complex tasks will contain more linguistic phenomena, for instance coarticulation, that may provide important cues for discrimination. Moreover, these more complex tasks generally consist of longer speech productions, which is expected to be beneficial

                                          Accuracy (%)
Task                                    Sentence Level   Segmental
Sustained vowel phonation (/a/)              55.00          58.14
Maximum phonation time (/a/)                 60.00          75.65
Rapid syllables repetitions                  60.71          73.28
Reading of words                             54.29          76.35
Reading of sentences                         62.14          81.74
Reading of text                              65.00          79.86
Storytelling guided by visual stimuli        66.43          82.32
Reading of prosodic sentences                70.71          85.10

Table 5.3: Task-dependent recognition results on the 2-class detection task (PD vs. control).

for the segmental feature extraction approach. Nevertheless, both feature extraction strategies

provide coherent results in terms of identifying the top-4 most significant speech production

tasks. Finally, I observe that the sustained phonation of vowel /a/ is the task that achieved the

worst results with the segmental approach by a large margin (58.14%). However, this perfor-

mance is in line with the results of Bocklet et al. (2011), where the authors found that read texts

and monologue were the most meaningful tasks for the automatic detection of PD, while the

phonation task achieved the poorest recognition rate.
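As a quick sanity check, the relative improvements of the segmental over the sentence-level strategy quoted above follow directly from the accuracies in Table 5.3:

```python
def rel_improvement(baseline, improved):
    """Relative improvement (%) of the segmental accuracy over the
    sentence-level accuracy, as quoted in the text."""
    return 100.0 * (improved - baseline) / baseline

# Accuracies (%) from Table 5.3: sentence-level vs. segmental.
words = rel_improvement(54.29, 76.35)      # reading of words
sentences = rel_improvement(62.14, 81.74)  # reading of sentences
print(f"{words:.1f}% / {sentences:.1f}%")  # → 40.6% / 31.5%
```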

5.2 Summary

In this chapter, the potential discriminative ability of a large set of speech production tasks

used in the automatic detection of PD has been analyzed. For this purpose, the FraLusoPark

database (Pinto et al. 2016) has been used. This resource contains data from European Por-

tuguese PD and healthy speakers performing 8 tasks designed to assess speech disorders along various dimensions. For each task, automatic classification experiments have been con-

ducted using an RF classifier and a custom set of acoustic features carefully selected based on the

study of the state of the art. The experimental results have shown that the most important pro-

duction tasks are reading of prosodic sentences and storytelling, achieving a PD classification

accuracy of 85.10% and 82.32%, respectively. These tasks definitely contain more acoustic and

prosodic information than the sustained vowel phonation or the reading of a word. Their selec-

tion, thus, may indicate that in the identification of PD a comprehensive evaluation of speech

impairments is more important than the assessment of isolated abilities, such as phonation or

articulation.

6 Contributions to the Monitoring of Cognitive Abilities

In Chapter 3, some of the numerous neuropsychological tests used for screening cognitive

performance were introduced. As previously observed, many of them are eligible to be ad-

ministered remotely through current Speech and Language Technology (SLT) solutions. Their

automatic implementation represents an appealing target both for the technical challenges it

may raise and for the advantages it may bring to the community. However, from the literature

reviewed in Chapter 4, it was possible to observe that existing solutions present important limitations. In fact, in some cases, they are only partially automated. Moreover, each of these works focuses on the implementation of a specific test, without providing the possibility to easily extend it to other tests.

In this doctoral research, the first approach to neuropsychological tests targeted the semantic verbal fluency task. This test has proved particularly useful in the screening of dementia, allowing clinicians to differentiate between Alzheimer’s Disease (AD) and Mild Cognitive Impairment

(MCI) (Pakhomov et al. 2012). The inclusion of this test in the Mini-Mental State Examination

(MMSE) has been recommended as a means of increasing its sensitivity (Strauss et al. 2006).

This is a demanding task for current speech recognition technology, as it requires the spon-

taneous production of a list of items belonging to an unconstrained domain. This problem

has been approached both with the construction of a tailored language model and with the exploitation of prosodic cues. The results of this work were published in ICPhS

2015 (Moniz et al. 2015).

Next, the focus of my work shifted towards the automatic implementation of two widely

used neuropsychological batteries: the MMSE and Alzheimer’s Disease Assessment Scale -

Cognitive Subscale (ADAS-Cog). They are two general batteries used in the screening of var-

ious dementias, in particular AD and MCI. The MMSE is widely used in preliminary screening precisely because it is, to some extent, "general purpose", providing a quick and generic evaluation to determine whether initial cognitive decline is present. This work

takes a step towards introducing a set of neuropsychological tests for AD and MCI, intended

for the Portuguese population. These tests have been integrated into an online platform, ex-

tending a system previously used for the remote rehabilitation of aphasia. As far as I know, it

is the only platform of this type implemented for the Portuguese population. This work was

published in SLPAT 2015 (Pompili et al. 2015).

6.1 Semantic verbal fluency test

Semantic verbal fluency tests require the patient to name as many items as possible belong-

ing to a specific category, within a time-constrained interval, typically one minute. The most common category is animals (Strauss et al. 2006), though other commonly used categories are food or

first names. In animal naming tests, the target of this work, the score corresponds to the sum

of all admissible words, where names of extinct, imaginary, or magic animals are considered

admissible, while inflected forms and repetitions are considered inadmissible. ASR could be of

valuable support in the automation of fluency tasks, even though their implementation raises

important challenges. The first is related to the open-domain nature of the task.

In fact, current speech recognition technology is able to provide quite reliable results when

dealing with problems whose domain could be somehow limited (Abad et al. 2013). One of

the components of a speech recognition system, the language model, is particularly affected by

the nature of the task, since it should contain the knowledge of the rules of a language, being

used to guide the search for an interpretation of the acoustic input. If the language model does

not contain a given word, it will never be recognized. Another challenge is represented by

the number of disfluencies produced in this kind of task, which may significantly affect the

recognition accuracy. In fact, disfluencies are relatively frequent in spontaneous speech, but,

in this context, they may be particularly relevant, because of the cognitive load required by

the test and its duration. The first challenge is addressed with the automatic construction of a

language model suited for the task. The second challenge is faced by investigating the prosodic

patterns that the nature of the task induces speakers to produce.
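The scoring rule described above for the animal naming test (count unique admissible names, ignoring repetitions and non-admissible tokens) can be sketched as follows; the token list and animal names are hypothetical.

```python
def score_fluency(tokens, admissible):
    """Count admissible animal names; repetitions score only once,
    and tokens outside the admissible list (disfluencies, comments,
    inflected forms) are ignored."""
    seen = set()
    for tok in tokens:
        if tok in admissible and tok not in seen:
            seen.add(tok)
    return len(seen)

# Hypothetical one-minute evocation with a repetition and a comment.
admissible = {"cao", "gato", "vaca", "cavalo"}
answer = ["cao", "gato", "uh", "gato", "ja disse esse", "vaca"]
print(score_fluency(answer, admissible))  # → 3
```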

6.1.1 Corpus

The corpus used in this study consists of a database of recordings of 42 native Portuguese

healthy speakers (19 females and 23 males), with ages varying from 20 to 65 years and different educational, socioeconomic, and cultural backgrounds. The corpus was collected by the

author of this dissertation with the aim of having a diverse sample of adults for assessing the

animal naming task. Orthographic transcriptions were manually produced for each session, and all the events were classified as a word from an animal list, as a disfluency, or as other events,

namely comments. The overall duration of the corpus is approximately 43 minutes, of which

about 21 minutes are silent, about 15 minutes include speech, and the remaining time contains disfluencies and other paralinguistic events (e.g., laughter, coughing) or background noise. The

number of valid words is 1171, while the number of disfluencies is 321, representing 27% of the

whole corpus. This percentage is clearly not in line with those reported by (Clark 1996, Levelt 1989, Moniz et al. accepted, Moniz, Batista, Mata & Trancoso 2014, Shriberg 1994, 2001, Tree

1995), which indicate an interval of 5%-10% of disfluencies in human-human conversations.

This very high disfluency rate is attributable to a task effect, in particular, naming animals under strict temporal constraints. 75% of the data, corresponding to 31 speakers, has been used for training the animal naming task, while the remaining data has been used for testing.

6.1.2 Evaluation

The experiments here described use the in-house ASR engine named Audimus (Meinedo et al.

2010, 2003), a large vocabulary continuous speech recognition module. Audimus is a hybrid

recognizer that combines the strengths of ANN and HMM (Morgan & Bourlard 1995). The

baseline system incorporates three MLP outputs trained with PLP features, log-RelAtive Spec-

TrAl features, and Modulation SpectroGram features. This version integrates a generic language model trained on broadcast news, encompassing 100k words. As expected, given the challenges described above, initial experiments using the standard version of Audimus led to very poor results, with an average WER of around 105%. Thus, in order to improve these results, a

technique known as keyword spotting (KWS) was exploited. This approach already proved to

be appropriate for dealing with naming tasks and also for filtering speech disfluencies (Abad

et al. 2013, Moniz et al. 2007, Pompili et al. 2011). I recall that keyword spotting aims at detect-

ing a certain set of words of interest in the continuous audio stream. This is achieved through

the acoustic match of speech with a keyword model in contrast to a background model (ev-

erything that is not a keyword). In this approach, the keyword model contains the names of

admissible animals that will be accepted by the speech recognition system. The size of this list

may have a significant impact on the outcome of the recognizer. In fact, if a keyword is missing

from the list, it will never be detected; on the other hand, longer lists will increase the perplex-

ity of the keyword model. The initial model consisted of an existing list used in the context

of the STRING project by a finite-state incremental parser to add semantic information to the

output of a part-of-speech tagger (Mamede et al. 2012). This list contains 6044 animal names,

grouped, classified, and labeled with their semantic category, without inflected forms. To com-

pute the likelihood of the different target terms, it was taken into account that some names are

more common than others. Thus, the total number of results provided by a web search engine

for a particular term has been exploited to compute this likelihood. The retrieval strategy had

to be refined several times in order to find an optimal approach. In fact, initial queries

Language Model        Train set   Test set
Generic ASR system      88.95      105.47
Prebuilt list based     16.80       21.22
Ontology based          11.94       19.94

Table 6.1: WER for different language models: i) Generic ASR system: general-purpose language model trained on broadcast news; ii) Prebuilt list based: constrained keyword model created from the list used in the STRING project; iii) Ontology based: constrained keyword model created from the ontology TemaNet.

led to incorrect counts due to homonyms of some terms. The final approach consisted of using

the animal name and the associated semantic category. Finally, the likelihood associated with

each term also makes it possible to sort the list and thus to reduce its size by filtering out less

popular terms. After several experiments, the language model that achieved the best results

contained the 802 most popular animal names. The WER achieved using this language model

decreased to 21.22%. In a subsequent step, in order to further improve these results,

and to easily extend the animal naming task to different semantic domains, different resources

were considered. In particular, the Portuguese lexical-conceptual network TemaNet (Marrafa

et al. 2006) has been used. This resource is organized in twelve semantic domains. TemaNet

is of particular interest for the animal naming task because it is highly structured. In fact, the

hyponyms of animals are organized in a hierarchy of several layers that include, among others,

the separation between male and female. This is relevant not only because in Portuguese, unlike English, there are different words to express the gender of an animal, but also because the animal naming task evaluation rules require accounting for gender differences.

The first approach with this resource, however, did not impose constraints on the depth of

the hierarchy or on the type of the extracted information. All the subtypes of animals have

been accepted, leading to a keyword model composed of 400 keywords. Then, in a similar

way to the keyword model created from the prebuilt list, the likelihood of the target terms

has been computed by exploiting the total number of results provided by a web search engine.

Experiments with the ontology-based keyword model reported a reduction of the average WER

up to 4% and 2% for the train and test corpus, with respect to the prebuilt list based keyword

model.
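The construction of the web-count-based keyword list can be sketched as follows; the hit counts and the helper name are hypothetical, while the text reports that the best-performing list kept the 802 most popular names.

```python
def build_keyword_model(hit_counts, top_n):
    """Rank candidate keywords by (hypothetical) web-search hit
    counts, keep the top_n most popular, and normalize the counts
    so they can serve as unigram likelihoods for the keyword model."""
    ranked = sorted(hit_counts.items(), key=lambda kv: kv[1], reverse=True)
    kept = ranked[:top_n]
    total = sum(c for _, c in kept)
    return {w: c / total for w, c in kept}

# Hypothetical counts for queries combining the animal name with
# its semantic category, as described in the text.
counts = {"cao": 9000, "gato": 8000, "ornitorrinco": 40, "vaca": 5000}
model = build_keyword_model(counts, top_n=3)
```

Less popular terms (here, "ornitorrinco") fall outside the truncated list, trading recall for a lower-perplexity keyword model.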

One characteristic of this task that hinders the performance of the ASR-based approach is

related to the extraordinary number of disfluencies present in list evocations. It was observed that the elements that actually belong to the list display a characteristic prosodic list

effect (e.g., /dog, cat, cow, horse/), that is not present in the other events unrelated with the task

(e.g., /ah! I already said that.../). Thus, one can ask whether it is possible to differentiate a

Figure 6.1: An excerpt of an audio recording showing, from the top: the spectrogram, the F0, the textual transcription of the sound, and the prosodic event classification. Red arrows indicate a continuation rise contour, while the yellow arrow indicates a finality contour.

word as an element of a list from other types of events by exploiting prosodic patterns. It has

been shown that list effects, or serial recall tasks, display prosodic features mostly character-

ized by two patterns: i) a continuation rise contour, a rising F0 movement from the nuclear

or prenuclear syllables up to the end of the phrase; and ii) a finality contour, a fall from the

nuclear or prenuclear syllables until the end of the phrase (Savino 2004, Savino et al. 2014). The

continuation contour expresses that the list is to be continued and the finality contour that the

item is the last one of a recall series or the last one in the entire recording. These two patterns are

visually shown in Figure 6.1.
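As a toy illustration of these two patterns (not AuToBI itself), the final F0 movement of an item can be classified by the sign of its tail slope; the frame values and the tail length are hypothetical.

```python
def contour_type(f0, tail=5):
    """Classify the final F0 movement of an item: a rising tail
    suggests a continuation rise (the list goes on), a falling tail
    a finality contour (last item). Toy stand-in for ToBI labeling."""
    voiced = [v for v in f0 if v > 0]          # drop unvoiced frames
    if len(voiced) < tail + 1:
        return "undefined"
    delta = voiced[-1] - voiced[-1 - tail]
    return "continuation-rise" if delta > 0 else "finality"

print(contour_type([180, 185, 0, 190, 200, 210, 215, 220]))  # rising F0
print(contour_type([220, 210, 0, 200, 190, 170, 150, 140]))  # falling F0
```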

To confirm the hypothesis that prosodic hints may distinguish possible animal names, the

AuToBI (Rosenberg 2009, 2010) tool has been used. AuToBI, the Automatic ToBI annotation sys-

tem for Standard American English (SAE), is a publicly available tool that detects and classifies tones and break indices using their acoustic correlates (pitch, intensity, spectral balance, and

pause/duration). This tool requires an initial segmentation of the events to classify. To this end,

the results of the recognition experiments performed in the previous section have been used to

produce the required segmentation input. Two types of configurations of the ASR were selected

to segment the input data: one using a generic language model (Generic ASR system), and

another using a language model specifically built for this task through the Portuguese word-

net TemaNet (Ontology based). Alternatively to the language model based configurations, a

phone-based segmentation is also obtained with the same ASR using a phone-loop grammar.

Results are shown in Table 6.2 for the different segmentation strategies. Regarding AuToBI,

experiments were performed with both the English and European Portuguese models (Moniz,

Mata, Hirschberg, Batista, Rosenberg & Trancoso 2014), although the latter were trained on a

Segmentation          AuToBI model   Accuracy
Generic ASR system    PT             71.8%
                      EN             89.1%
Ontology based        PT             72.7%
                      EN             84.3%
Phone based           PT             77.8%
                      EN             91.8%

Table 6.2: Performance of AuToBI using English and European Portuguese models and three segmentation strategies: ASR-based, ontology-based (TemaNet), and phone-based.

small data set of about 33 minutes. By applying the AuToBI English models with the generic

ASR language model, the system achieved an accuracy of 89.1% in the prediction of potential

animal names. With the Portuguese prosodic models, the accuracy decreases to 71.8%. Such

degradation in performance may be explained by the fact that the Portuguese models

used significantly less training data than the English ones. The best performance is achieved

with the phone recognizer using AuToBI English models, with an accuracy of 91.8%.

6.2 Automatic monitoring and training of cognitive functions

After approaching the semantic verbal fluency task, the focus of this doctoral research shifted

towards the automatic implementation of two popular neuropsychological test batteries: the

MMSE (Folstein et al. 1975) and the ADAS-Cog (Rosen et al. 1984). These tests are used for

screening cognitive performance and tracking alterations of cognition over time in AD and

MCI. They involve the assessment of different capabilities, such as orientation to time and

place, attention and calculation, language (naming, repetition, and comprehension), or immediate

and delayed recall. In the remainder of this chapter, I summarize some of the key results of this

work; more details can be found in (Pompili et al. 2015).

6.2.1 Extending VITHEA for neuropsychological screening

The baseline for the integration of the MMSE and ADAS-Cog was an automatic web-based

system named VITHEA (Abad et al. 2013). The system aims at acting as a ”virtual therapist”,

incorporating an animated character with speech syntesis capability and ASR to provide word

naming exercises for the remote rehabilitation from aphasia. The ASR engine integrated in

the monitoring tool corresponds to the in-house speech recognizer Audimus (Meinedo et al.

2010, 2003). In order to provide robust feedback to word naming exercises, the speech recog-

nizer resorts to a keyword spotting technique (Abad et al. 2012). The platform comprises two

specific modules, dedicated respectively to the patients, for carrying out the therapy sessions,

and to the clinicians, for the administration of the functionalities related to them (e.g., man-

age patient data, manage exercises, and monitor user performance). VITHEA is used daily by

patients and speech therapists and has received several awards from both the speech and the

health-care communities. The success of this platform and its flexibility, which allows the creation of different exercises, motivated its use as a foundation for this study. The main goal was

the automation of the exercises in MMSE and ADAS-Cog that involve speech. Additionally,

the animal naming test was also implemented in the platform. As explained in the following,

the automation of such tests has raised several technological challenges, both for the automatic

speech recognition and text-to-speech synthesis technologies. Extending VITHEA for includ-

ing neuropsychological tests also involved important alterations in the original platform, both

on the patient and the clinician modules. However, the flexibility of this platform allows for

the easy addition of new categories of exercises. These can then be combined in multiple ways

by the clinician to form new tests, and to create different exercises of the same type. Extensions

were related to the usability of the interface, adapted to meet the needs of an aging population with cognitive impairments, and to the presentation of the tests. In fact, following

the feedback received from the neurologists involved in the study, for some stimuli, optional

instructions and semantic hints were provided. The behavior of the animated character was

altered to provide random feedback when the patient switches among different classes of

stimuli. Finally, the platform was updated to store additional information in the patient's profile, needed for the assessment of some subtests (e.g., place of birth, age), as well as the result of

the assessment in terms of the score obtained.

Since the selected neuropsychological tests comprise common or similar questions, their implementation can be organized by type of question and by the underlying technology with which each was implemented, rather than per test. The question types are briefly summarized in Table 6.3. Each type of question has posed different challenges, each of which has been

addressed individually with ad-hoc solutions. Overall, a total of 185 stimuli belonging to dif-

ferent types of tests have been selected for their implementation in the platform. The scoring

methodology of the implemented tests is straightforward: one point is given for each correct answer provided. For all the tests but one, this corresponds to a single item correctly produced; only in the case of the repetition exercise does the patient have to repeat the whole sentence to get a score of 1.

Test                          Battery           Description                                          Technology
Naming objects and fingers    MMSE / ADAS-Cog   Name a series of objects shown in pictures           KWS
Repetition                    MMSE              Repeat the sentence /O rato roeu a rolha/            KWS
                                                (/the mouse gnawed the stopper/)
Attention and calculation     MMSE              Starting at 30, successively subtract 3              KWS
Orientation to time, place    MMSE / ADAS-Cog   Questions about time, city                           KWS / RBG
and person
Word recognition              ADAS-Cog          Learn a list with 12 words. Recall the words from    RBG
                                                a new list containing 12 new distractors. The
                                                process is repeated 3 times
Evocation                     MMSE / ADAS-Cog   Recall a list of words (previously learned or not)   ALM
Verbal Fluency                –                 Name as many items as possible in a given            ALM
                                                category in 1 min.

Table 6.3: Implemented cognitive tests. KWS: keyword spotting; RBG: rule-based grammar; ALM: ad-hoc language model for keyword spotting.

6.2.2 Corpus

To evaluate the feasibility of the monitoring tool, an ad-hoc speech corpus has been collected.

This includes recordings of five people diagnosed with cognitive impairments and five healthy

control subjects. All the participants are Portuguese native speakers. Recordings took place in

different environments with different acoustic conditions. Healthy subjects were recorded in a

quiet, domestic environment, while patients were recorded at CHPL, the Psychiatric Hospital

of Lisbon. No particular constraints were imposed on background noise conditions. Each session consisted of an approximately 20-30 minute recording. The data was originally captured with the platform at 16 kHz, and later down-sampled to 8 kHz to match the acoustic models' sampling frequency.
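The down-sampling step can be reproduced with a standard polyphase resampler; this is a sketch with an illustrative test tone, not the platform's actual pipeline.

```python
import numpy as np
from scipy.signal import resample_poly

# Down-sample a 16 kHz capture to the 8 kHz expected by the
# acoustic models, as done for the collected corpus.
sr_in, sr_out = 16000, 8000
t = np.arange(sr_in) / sr_in                 # 1 s of audio
x = np.sin(2 * np.pi * 440 * t)              # 440 Hz test tone
y = resample_poly(x, up=1, down=2)           # anti-aliasing built in
```

Polyphase resampling applies a low-pass filter before decimation, so frequencies above the new Nyquist limit (4 kHz) do not alias into the result.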

6.2.3 Evaluation

Due to the extensiveness of the ADAS-Cog test, it was unfeasible to evaluate all the imple-

mented neuropsychological tests. In fact, an estimation of the total duration of the evalua-

tion indicated that it would take more than two hours, which was considered unacceptable.

Thus, only a representative subset of all the tests was selected, comprising a total of 41 stimuli.

The system was evaluated by considering its ability to correctly transcribe the participants’

speech. In fact, incorrectly recognized answers will produce an automatic score of the test that

will underestimate or overestimate the actual result of the participant. Thus, the performance

of the recognition process provides a measure of the reliability of the platform as a screening

tool. To assess the result of the automatic recognition, two evaluation metrics have been con-

sidered, depending on the type of automated tests. On the one hand, the answers to the tests

(a) Accuracy (%)
Question type / technology   Patients   Healthy
KWS                            77.39      88.70
RBG                            74.29      88.57

(b) WER (%)
Question type / technology   Patients   Healthy
ALM (w/o animals)              20.00       8.16
ALM (animals)                  74.42      46.48

Table 6.4: Accuracy and WER according to the type of question.

based on KWS are evaluated as right or wrong; thus, their performance can be computed as the number of agreements between the manual and automatic results divided by the total number of exercises, that is, the classification accuracy. On the other

hand, the evocation exercises differ from the KWS exercises since the answers to these stimuli

cannot be evaluated as right or wrong, but instead the number of terms correctly recalled needs

to be counted. For this reason, the WER between the manual and the automatic transcriptions of the users’ answers has been computed. These results are summarized in Tables 6.4a and 6.4b.
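The WER used here is the standard word-level edit distance between the manual and automatic transcriptions, normalized by the reference length; a minimal sketch with hypothetical answers:

```python
def wer(reference, hypothesis):
    """Word error rate: Levenshtein distance between word sequences
    (substitutions + deletions + insertions) divided by the number
    of reference words, expressed as a percentage."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)

# One deleted word out of four reference words → 25% WER.
print(wer("cao gato vaca cavalo", "cao gato cavalo"))  # → 25.0
```

Note that WER can exceed 100% when the hypothesis contains many insertions, which is how the generic system's 105% WER in Section 6.1 arises.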

Results for the KWS-based exercises show an average accuracy of around 77% and 89% for patients and healthy subjects, respectively. These results can be considered quite promising, being comparable to those reported in (Abad et al. 2012) for an evaluation with aphasia patients. No significant differences were found between the tests relying on simple KWS and those relying on rule-based grammars. Evocation exercises are divided into two categories:

those requiring a limited number of words to recall and those considering an open domain of

possible answers complying with a specific semantic category (e.g., the animal naming test). The av-

erage WER computed for patients and control group in the class of evocation exercises with a

closed domain was 20.00% and 8.16%, respectively. However, the average WER computed for

patients and control group on the animal naming test was much higher, 74.42% and 46.48%,

respectively. Previous results obtained for the same task, but using a corpus of healthy subjects

(Section 6.1), showed a WER of around 20% on the test data. Thus, in comparison, the current outcomes show a strong degradation in performance. This result was partially expected

since the previous work used a different corpus, containing data from a heterogeneous set of

healthy subjects. The corpus used in this study, on the other hand, is composed exclusively

of elderly participants, whose average age is around 73 and 75 years for the healthy and patient groups, respectively. This is reflected in the resulting speech, which is characterized by

a reduced intensity, a reduced pitch, and a hoarse voice. These characteristics represent an

additional challenge for the speech recognizer. Also, after a closer analysis, it was possible to

note that deletions and substitutions were the main source of error. This may be partly explained by the number of OOV keywords uttered by both the healthy and patient groups. In fact,

the healthy group uttered 71 unique animal names, but 23 of them were missing from the key-

word model, generating 39% of the deletions and 42% of the substitutions. The patient group uttered a total of 43 unique animal names, of which 26 were missing from the keyword model. In this case, 66% of the errors were deletions, while 28% were substitutions. These results

can be partially compared with the ones achieved by Pakhomov et al. (2015). In that work, the

authors assessed the spoken responses of the animal naming test using a combined approach

that exploits a constrained language model, a speaker-adapted acoustic model, and confidence

scoring to filter the ASR output. Nevertheless, they achieved a WER of 53% using a corpus

composed of younger subjects (mean age 28.4). While this represents quite a high WER, it is

still lower than the error obtained in this study for the patient group, implying that both the acoustic and the language models could still be improved.

A global evaluation of the platform has also been performed in order to assess its reliability

as an automatic screening tool. A straightforward evaluation method consists of comparing

the manual and the automatic scores achieved by the user for each type of test. The scores

were calculated according to the traditional assessment performed when applying a neuropsy-

chological test. Table 6.5 reports the MAE, the Mean Relative Absolute Error (MRAE), and the

maximum score for the previously described subsets of stimuli. The maximum score for each

test depends on the number of stimuli selected. In addition, the results for the MMSE are also

reported. For this test, the scores achieved by each speaker are also shown in Figure 6.2.

In general, the results were better for healthy people than for patients. This was expected due to the impaired condition of the patients, which was reflected in the quality of their speech. The MAE reported by question type and technology ranges from 0.80

to 3.00 for the patients, and from 0.80 to 2.80 for the control group. Comparing the MAE with

the maximum possible score, it can be observed that the difference between the automatic and

the manual evaluation is relatively small. For instance, observing the results for the patient

group, one may notice that the MAE for the questions based on keyword spotting is 3.00 out

of a maximum score of 23, which corresponds to a relative error of 13%. The same analysis for

the control group leads to a relative error of 11.3%.
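The relative errors quoted above are simply the MAE divided by the maximum attainable score; recomputing them from the figures in the text:

```python
def relative_error(mae, max_score):
    """MAE expressed relative to the maximum attainable score (%)."""
    return 100.0 * mae / max_score

# Values quoted in the text: KWS-based questions, maximum score 23.
patients = relative_error(3.00, 23)   # ≈ 13.0% for the patient group
controls = relative_error(2.60, 23)   # ≈ 11.3% for the control group
```

Note that this quantity differs from the MRAE reported in brackets in Table 6.5, which averages per-subject relative errors rather than dividing the MAE by the maximum score.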

Through the evaluation and data collection, I had the opportunity to gather important feed-

back about the platform. I acknowledge that individuals with more advanced impairment may have more difficulties in using the system, especially when the condition is worsened by deafness or computer illiteracy, two factors rather common in the elderly. Patients with a more pronounced cognitive impairment or with auditory impairments may have difficulties in understanding the questions being asked. Computer illiteracy, however, may no longer be a

Figure 6.2: On the left side, MMSE scores of the human and automatic evaluations for the patient speakers. On the right side, MMSE scores of the human and automatic evaluations for the healthy speakers.

Question type / Technology   Max. Score   Patients       Healthy
KWS                              23       3.00 (26%)     2.60 (12%)
RBG                              14       2.80 (37%)     1.60 (15%)
ALM (w/o animals)                11       0.80 (23%)     0.80 (11%)
ALM (animals)                     –       2.60 (34%)     1.80 (17%)
MMSE                             22       2.20 (21%)     2.80 (14%)

Table 6.5: MAE and MRAE (in brackets) by type of question and by neuropsychological test.

problem in the not so distant future.

Nevertheless, both the patients and the healthy subjects demonstrated their appreciation for the platform, showing interest in using it regularly. In particular, some of the patients were captivated by the animated virtual character; they liked its cartoon nature and the fact that it interacted with them verbally.

6.3 Summary

In this chapter, the automatic implementation of some widely used neuropsychological tests

has been addressed. Initially, a specific task, the semantic verbal fluency test, has been targeted.

In this type of test, the patient has to name as many items as possible belonging to a specific

category, within a time interval of one minute. This is a challenging task for current speech

recognition technology, both for its open domain nature, and because of the cognitive load

it imposes, which induces the production of a considerable number of disfluencies. The first

problem has been addressed with the automatic construction of a language model suited for

the task. This led to an important improvement in recognition,


with a reduction of the WER from 105.5%, achieved with a general-purpose language model, to 19.9%, achieved with the developed one. The second problem, the high number of disfluencies, was approached by investigating the use of prosodic patterns to predict potential animal names. In this case, it was possible to distinguish disfluencies with an accuracy of

91.8%.

Then, the focus of my work shifted towards the automatic implementation of two neuropsy-

chological test batteries commonly applied by neurologists to assess the cognitive condition of

a person. They have been developed as an automatic web-based tool with SLT integration, so that the tool can also be used for remote monitoring of cognitive impairments. As far as I know, it represents the only platform of this type implemented for the Portuguese population. The system has been assessed with both healthy subjects and patients. The MAE between the manual and the automatic evaluation was relatively small, showing the feasibility of this type of system. Additionally, the flexibility of the underlying platform makes it easy to create new exercises of the same type with different stimuli. It could thus be extended to include different types of exercises for the daily training of cognitive abilities.


7 Contributions to the Monitoring of Language Abilities

As observed at the end of Chapter 3, the evaluation of discourse impairments requires both the

manual transcription of the speech samples and the subsequent identification and annotation

of predefined linguistic features. These requirements preclude the applicability of discourse

analysis to clinical settings. Additionally, a manual analysis may also lead to different inter-

expert assessments due to the intrinsically ambiguous nature of spontaneous language. For these reasons, an automated analysis of discourse impairments would provide clinicians with an additional screening tool that could be used objectively in clinical settings. The literature review presented in Chapter 4 showed that lexical, syntactic, and semantic aspects of

language production have been widely investigated, confirming the viability of these methods

in the identification of Alzheimer’s Disease (AD). However, very few studies approached lin-

guistic deficits at a higher level of processing, considering macrolinguistic aspects of discourse

production. For this reason, I developed a method targeting the analysis of pragmatic aspects

of a discourse. This approach is further complemented by considering lower levels aspects of

language processing, such as lexical, syntactic, and semantic abilities. Overall, the analysis of

such a wide set of language characteristics should provide a comprehensive evaluation of dis-

course production. Additionally, with the aim of automating the entire process of this type of

analysis, a speech recognition system is used to obtain the transcriptions of the recordings. This

work was published in IberSPEECH 2018 (Pompili et al. 2018) and accepted for publication in

a special issue of the IEEE Journal of Selected Topics in Signal Processing (JSTSP) on Automatic

assessment of health disorders based on voice, speech and language processing (Pompili et al.

2019).

7.1 Evaluating pragmatic aspects of discourse production for the automatic identification of Alzheimer's disease

Macrolinguistic aspects of language production are quantified by rating cohesion and coherence. While cohesion expresses the semantic relationship between elements, coherence is

related to the conceptual organization of speech, and is usually analyzed through the study of

local, global, and topic coherence. Local coherence refers to the conceptual links that maintain


meaning between proximate propositions within smaller textual units. Global coherence refers

to the way in which the discourse is organized with respect to an overall plan. Finally, topic

coherence refers to the organization and maintenance of the topics used within the discourse.

In fact, topics should be structured according to an internal organization, in order to achieve

an information hierarchy, which is essential for effective communication (Ulatowska & Chap-

man 1994b). To the best of my knowledge, there are no computational studies targeting an

automatic analysis of topic coherence. This subject has been investigated only in the clinical

literature. It was introduced in the work of Mentis & Prutting (1991), whose focus was

the study of topic introduction and management. A topic was described as a clause identifying

the question of immediate concern, while a subtopic was an elaboration or expansion of one

aspect of the main topic. Several years later, Brady et al. (2003) analyzed topic coherence and

topic maintenance in individuals with right hemisphere brain damage. This work extended that of Mentis & Prutting (1991) with the inclusion of the notions of sub-subtopic and sub-sub-subtopic. Topic and subdivisional structures were further categorized as new, related, or

reintroduced. In a later study, Mackenzie et al. (2007) used discourse samples elicited through

a picture description task to determine the influences of age, education, and gender on the con-

cepts and topic coherence of 225 healthy adults. Results confirmed education level as a highly

important variable affecting the performance of healthy adults. More recently, Miranda (2015)

investigated the influence of education in the macrolinguistic dimension of discourse evalua-

tion, considering concepts analysis, local, global and topic coherence, and cohesion. Results

corroborated those obtained by Mackenzie et al. (2007), confirming the effect of literacy in

this type of analysis.

In this chapter, I propose a novel approach to automatically discriminate AD based on the

analysis of topic coherence. A discourse is modeled as a graph encoding a hierarchy of topics; a relatively small set of pragmatic features is extracted from this hierarchical structure and used to discriminate AD. Results have shown classification performance comparable with the current state of the art. Then, this approach is further extended with the introduction of a wider

set of linguistic features. In fact, the set of topic coherence features is broadened with new

measures assessing pragmatic aspects of discourse. This set of features is also integrated with

lexical, syntactic, and semantic features. It is expected that the analysis of different aspects of

discourse production will contribute to an improved automatic discrimination of AD. Addi-

tionally, the proposed method depends on accurate, manually produced transcriptions of the speech narratives. This is a requirement common to many studies targeting an automatic characterization of linguistic impairments in AD (Fraser & Hirst 2016, Fraser et al. 2016, Orimaye


Figure 7.1: (a) The Cookie Theft picture, from the Boston Diagnostic Aphasia Examination (Goodglass et al. 2001). (b) An excerpt of a topic hierarchy for the Cookie Theft picture found in the work of Miranda (2015).

et al. 2014, Santos et al. 2017, Toledo et al. 2018, Yancheva et al. 2015, Yancheva & Rudzicz 2016).

However, this requirement still hampers the applicability of computational approaches to clinical settings. Thus, I assess the impact of using transcriptions of the spoken narratives generated automatically by a speech recognition system. The types of errors introduced and the way they affect the performance of the proposed method are analyzed.

7.1.1 Corpus

Data used in this study are obtained from the DementiaBank database (MacWhinney et al. 2011,

TalkBank 2017), which is part of the larger TalkBank project (Becker et al. 1994, MacWhinney

et al. 2004). The collection was already introduced in Chapter 4; I briefly recall that, among other assessments, participants were required to provide a description of the Cookie Theft picture,

shown in Figure 7.1(a). Each speech sample was recorded and then manually transcribed at

word level. Narratives were segmented into utterances and annotated with disfluencies, filled

pauses, repetitions, and other more complex events. Among these, retracing and reformulation

are used to indicate abandoned sentences where the speaker starts to say something, but then

stops. While in the former the speaker may maintain the same idea changing the syntax, the

latter involves a complete restatement of the idea. For the purposes of this study, only partic-

ipants diagnosed with probable AD were selected, resulting in 234 speech samples from 147

patients. Control participants were also included, resulting in 241 speech samples from 98

speakers. Table 7.1 reports additional information about the corpus size and the participants' demographic and clinical data. More details about the study cohort can be found in Becker et al. (1994).


           Age range (avg.)   MMSE range (avg.)   Audio duration   N. of words
Controls   46-80 (63.84)      26-30 (29.06)       04h:13m          26591
AD         53-88 (71.31)      8-30 (19.36)        05h:04m          23029

Table 7.1: Statistical information on the Cookie Theft corpus

7.1.2 The proposed model to analyze topic coherence

The topics used during discourse production should follow an internal, structural organization, in order to achieve an information hierarchy. This structure allows a gradual presentation of information that is essential for effective communication (Ulatowska &

Chapman 1994a). Being important for both the speaker and the listener, this type of organiza-

tion highlights the key concepts and indicates the degrees of importance and relevance within

the discourse. Mackenzie et al. (2007) provided an example of a topic hierarchy based on the

Cookie Theft picture description task, which was later extended in the study of Miranda (2015).

To allow a better understanding of the problem at hand, an excerpt of this hierarchy is also

reported in Figure 7.1(b).

The number of relevant topics that can be described from the Cookie Theft picture is limited to the concepts that are explicitly represented in the image (e.g., garden) and to those that are implicitly suggested by the scene (e.g., weather). Taking this into account, the problem of

building a topic hierarchy from a transcript can be modeled with a semi-supervised approach

in which a predefined set of topic clusters is used to guide the assignment of a new topic to a

level in the hierarchy.

Both for the creation of the topic clusters and for the analysis of a new discourse sample,

a multistage approach is followed to transform the original transcriptions into a representation

suitable for subsequent analysis, as shown in Figure 7.2. Initially, the transcriptions are pre-

processed, then syntactic information is extracted and used to separate sentences into clauses

and to identify coreferential expressions. Finally, a sentence vector representation is computed

based on the word embeddings extracted for each word in a clause. Each stage of this process

is further described in the following sections.

7.1.2.1 Preprocessing

In order to prepare the transcriptions for the next stage of the pipeline, all the annotations

were removed. Disfluencies were disregarded and contractions were expanded to their canon-

ical form. Once the preprocessing phase is concluded, POS tags are automatically extracted

using the Stanford University lexicalized probabilistic parser (Klein & Manning 2003b). Appendix A.1.1 provides an excerpt of a transcription before and after the preprocessing phase, and its corresponding POS tag annotations.

Figure 7.2: The proposed method for modeling discourse as a hierarchy of topics.

7.1.2.2 Clause segmentation

The next step requires identifying dependent and independent clauses. In fact, complex, com-

pound, or complex-compound sentences may contain references to multiple topics. The sen-

tence /the sink is overflowing while she is wiping a plate and not looking/ is an example of this

problem, as it is composed of an independent clause (/the sink is overflowing/) and a dependent

one (/she is wiping a plate and not looking/).

A possible way to cope with the separation of different sentence types is by using syntactic

parse trees. Thus, in a similar way to the work of Feng et al. (2012), clause and phrase-level tags

are used for the identification of dependent and independent clauses. For the former, the tag

SBAR is used, while for the latter, the proposed solution checks the sequence of nodes along

the tree to verify if the tag S or the tags [NP VP] appear in the sequence (Treebank 2019). The

corresponding parse tree for the sentence shown as an example is reported in Appendix A.1.2.
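A minimal sketch of this tag-based segmentation, here implemented with NLTK's `Tree` over a bracketed parse of the example sentence (the parse string and the heuristic details are simplified assumptions, not the exact Stanford parser output):

```python
from nltk import Tree

def split_clauses(parse_str):
    """Split a bracketed parse into dependent (SBAR) and independent clauses,
    following the tag-based heuristic described above."""
    tree = Tree.fromstring(parse_str)
    dependent, independent = [], []
    for pos in tree.treepositions():
        node = tree[pos]
        if not isinstance(node, Tree):
            continue  # skip leaf tokens
        ancestors = [tree[pos[:i]].label() for i in range(1, len(pos))]
        words = " ".join(node.leaves())
        if node.label() == "SBAR":
            dependent.append(words)
        elif node.label() == "S" and "SBAR" not in ancestors:
            child_labels = [c.label() for c in node if isinstance(c, Tree)]
            if "NP" in child_labels and "VP" in child_labels:
                # drop the words of embedded SBARs to keep only the
                # independent part of the clause
                for sbar in node.subtrees(lambda t: t.label() == "SBAR"):
                    words = words.replace(" ".join(sbar.leaves()), "").strip()
                independent.append(words)
    return dependent, independent

parse = ("(ROOT (S (NP (DT the) (NN sink)) (VP (VBZ is) (VP (VBG overflowing)"
         " (SBAR (IN while) (S (NP (PRP she)) (VP (VBZ is) (VP (VBG wiping)"
         " (NP (DT a) (NN plate))))))))))")
dep, ind = split_clauses(parse)
```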

7.1.2.3 Coreference analysis

The analysis of coreference proves to be particularly useful in higher level NLP applications

that involve language understanding, such as in discourse analysis (Boytcheva et al. 2001).

Strictly related to the notions of anaphora and cataphora, coreference resolution goes beyond the relation of dependence implied by these concepts. It makes it possible to identify when two or more

expressions refer to the same entity in a text.

In this study, the analysis of coreference has been performed with the Stanford coreference

resolution system (Manning et al. 2014), taking into account the segmentation performed in the

previous step.

During the process of building the hierarchy, coreference information is used to guide the

assignment of a subtopic to the corresponding level in the hierarchy. For this purpose, the re-

sults provided by the coreference system are constrained to those relationships in which the

87

Page 112: UNIVERSIDADE DE LISBOA INSTITUTO SUPERIOR TECNICO´

referent and the referred terms are mentioned in different clauses, and to those referred men-

tions that belong to the set of third-person personal pronouns (i.e., he, she, it, they). Examples

of an accepted and a rejected relationships are provided in Appendix A.1.3.
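A toy sketch of this filtering step, using a deliberately simplified representation of the coreference chains (the tuple format and function names are mine, not the output format of the Stanford system):

```python
THIRD_PERSON = {"he", "she", "it", "they"}

def filter_links(chains):
    """Keep only links whose referred mention is a third-person personal
    pronoun and whose two mentions occur in different clauses.

    Each chain is a list of (surface_form, clause_index) pairs, with the
    first mention taken as the referent."""
    accepted = []
    for chain in chains:
        referent, ref_clause = chain[0]
        for mention, clause in chain[1:]:
            if mention.lower() in THIRD_PERSON and clause != ref_clause:
                accepted.append((referent, mention, clause))
    return accepted

chains = [
    [("the mother", 0), ("she", 2)],   # accepted: pronoun, different clause
    [("the boy", 1), ("the boy", 1)],  # rejected: referred mention not a pronoun
    [("the sink", 3), ("it", 3)],      # rejected: same clause
]
links = filter_links(chains)
```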

One may argue that there are some limitations to this method. First, by constraining in this

way the results provided by the coreference analysis, possible interesting relationships may be

lost. Also, the use of third-person personal pronouns is a fragile approach and may lead to

incorrect relationships in the case of grammatically incorrect sentences.

However, this method still provides a simple and preliminary way to exploit the coreference

information, which is valuable in the process of building the topic hierarchy.

7.1.2.4 Sentence embeddings

In the last step of the pipeline, discourse transcripts are converted into a representation suitable

to compare and measure differences between sentences. In particular, the transformed tran-

scripts should be robust to syntactic and lexical differences and should provide the capability

to capture semantic regularities between sentences. For this purpose, I rely on a pre-trained

model of word vector representations containing 2 million word vectors, in 300 dimensions,

trained with fastText on Common Crawl (Mikolov et al. 2018). In the process of converting a

sentence into its vector space representation, first a selection of four classes of lexical items (nouns, pronouns, verbs, and adjectives) is performed. Then, for each word, the corresponding word vector is

extracted and finally the average over the whole sentence is computed.
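The averaging step can be sketched as follows, with toy 3-dimensional vectors standing in for the 300-dimensional fastText model (the exact tag set used for the selection is my assumption):

```python
import numpy as np

# Toy vectors standing in for the pre-trained fastText Common Crawl model.
WORD_VECTORS = {
    "mother": np.array([0.9, 0.1, 0.0]),
    "wearing": np.array([0.2, 0.8, 0.1]),
    "apron": np.array([0.7, 0.2, 0.3]),
}
# A few Penn Treebank tags covering nouns, pronouns, verbs, and adjectives.
CONTENT_TAGS = {"NN", "NNS", "PRP", "VB", "VBZ", "VBG", "JJ"}

def sentence_embedding(tagged_words):
    """Average the word vectors of the selected lexical items in a clause."""
    vectors = [WORD_VECTORS[w] for w, tag in tagged_words
               if tag in CONTENT_TAGS and w in WORD_VECTORS]
    return np.mean(vectors, axis=0) if vectors else None

emb = sentence_embedding([("the", "DT"), ("mother", "NN"), ("is", "VBZ"),
                          ("wearing", "VBG"), ("an", "DT"), ("apron", "NN")])
```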

7.1.2.5 Topic hierarchy analysis

To create a topic hierarchy from a transcript, a methodology that is partly inspired by current

clinical practice is followed. Thus, in modeling the problem, it is not possible to impose a

predefined order or structure in the way topics and subtopics may be presented, as this will

depend on how the discourse is organized. However, one can take advantage of the closed

domain nature of the task to define a reduced number of clusters of broad topics that will help

to guide the construction of the hierarchy and the identification of off-topic clauses.

Topic clusters definition

As mentioned, the proposed solution relies on the supervised creation of a predefined number

of clusters of broad topics. Each cluster contains a representative set of sentences that are re-

lated to the topic of the cluster. Ten clusters were defined: main scene, mother, boy, girl, children,

garden, weather, unrelated, incomplete, and no-content. The purpose of the cluster unrelated was

to match those sentences in which the participant is not performing the task (e.g., questions



Figure 7.3: Topic hierarchy building algorithm. (a) The current sentence is compared with the topic clusters to identify its topic. (b) Identification of the level of specialization of the current sentence. If there are no nodes with the same topic as the current sentence, this is considered a new topic. (c) If the current hierarchy contains one or more nodes with the same topic as the current sentence, each of them is analyzed with respect to the current one. (d) As a result, the current sentence is added as a child of its closest node.

directed to the interviewer). The clusters incomplete and no-content are expected to match sen-

tences that may be characteristic of a language impairment. They identify fragments of text

that do not represent a complete sentence (e.g., /overflowing sink/) and expressions that do

not add semantic information about the image (e.g., /fortunately there is nothing happening out

there/, /what is going on/). To build the clusters, around 35% of the data from the control group

is used. Each sentence has been manually annotated with the corresponding cluster label and

clusters are simply modeled by the complete set of sentences belonging to them. These clusters

are used as topic references for building the topic hierarchy of new transcriptions, as described

next.

Topic hierarchy building algorithm

The algorithm to build the topic hierarchy relies on the cosine similarity between sentence em-

beddings. The first step consists of verifying which is, among the 10 topic clusters defined, the

one that best matches the content of the current sentence (Figure 7.3(a)). This is achieved by

computing the cosine similarity between the current sentence embedding and each sentence embedding in each topic cluster. The highest result determines the cluster for the new sentence. In the following step, one needs to assign the current sentence embedding to a level in

the current hierarchy (Figure 7.3(b)). This implies establishing whether one is dealing with a

new or a repeated topic and its level of specialization (i.e., subtopic, sub-subtopic, etc.). This

is achieved by first identifying, in the current hierarchy, the subgraph whose nodes belong to

the same cluster of the current sentence (e.g., the subgraph corresponding to the mother cluster

in Figure 7.3). Then, the cosine similarity between the current sentence and each node of this


subgraph is computed. This process is shown in Figure 7.3(c). The new sentence is considered

a child of the closest node if the similarity is higher than a threshold. Otherwise, it is considered

a repeated topic (Figure 7.3(d)). If there is no subgraph, the sentence embedding is added as

a new topic. If the new topic turns out to be a coreferential expression, this kind of informa-

tion supersedes the cosine metric strategy, and the new topic is added directly as a child of its

referent.
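Steps (a)-(d) above can be condensed into the following sketch (the data structures and the threshold value are illustrative assumptions; the coreference override is omitted):

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def assign_sentence(embedding, clusters, hierarchy, threshold=0.75):
    """Assign one clause embedding to the topic hierarchy.

    `clusters` maps a topic label to reference sentence embeddings;
    `hierarchy` maps a topic label to the node embeddings already in the
    graph (a flat stand-in for the subgraph of that topic)."""
    # (a) topic identification: closest reference sentence over all clusters
    label = max(clusters, key=lambda lab: max(cosine(embedding, ref)
                                              for ref in clusters[lab]))
    nodes = hierarchy.get(label)
    if not nodes:
        # (b) no subgraph with this topic: the sentence starts a new topic
        hierarchy[label] = [embedding]
        return label, "new topic"
    # (c) compare against every node of the same-topic subgraph
    best_sim = max(cosine(embedding, node) for node in nodes)
    if best_sim > threshold:
        # (d) added as a child of the closest node
        nodes.append(embedding)
        return label, "child"
    return label, "repeated topic"

clusters = {"mother": [np.array([1.0, 0.0])], "boy": [np.array([0.0, 1.0])]}
hierarchy = {}
first = assign_sentence(np.array([0.9, 0.1]), clusters, hierarchy)
second = assign_sentence(np.array([0.85, 0.15]), clusters, hierarchy)
```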

7.1.3 Features for AD spoken discourse characterization

In this section, I report the set of features used in this study. First, pragmatic features are

described. These are computed from the topic hierarchy and include measures of topic, global,

and local coherence. Then, a set of additional lexical, morphosyntactic, and semantic features

is introduced. The complete set of features is listed in Table 7.2.

7.1.3.1 Topic coherence features

From the output that is produced at each step of the processing pipeline (i.e., from processing

steps described in Sections 7.1.2.1 to 7.1.2.4, and shown in Figure 7.2), and from the final topic

hierarchy, a set of 37 measurements was identified as of potential interest to characterize topic

coherence.

In a similar way to the standard clinical evaluation, the number of topics, subtopics, sub-subtopics, and sub-sub-subtopics introduced, as well as the total number of repeated topics, are accounted for.

Then, with the aim of investigating the subtopics produced more thoroughly, the number of topics in each topic cluster related to the Cookie Theft picture is also considered. Additionally, the number of sentences that, during the creation of the topic hierarchy, were classified as

unrelated, incomplete, or no-content is computed. These features were added to explicitly model

language impairments. In fact, the sentences classified as unrelated should identify those ex-

pressions in which the participant is not performing the task, but instead asks questions to

the interviewer. In a similar way, sentences classified as incomplete should model fragments of

a sentence, while sentences classified as no-content should identify those expressions that do

not add semantic information about the image. Furthermore, in the literature, the mean cosine values between all possible pairs and between adjacent pairs of sentences have been used as measures of global and local coherence, respectively (Graesser et al. 2004). This approach has been extended to the set

of topics constituting the final hierarchy, to the set of repeated topics, and to those classified

as unrelated or no-content. Finally, features characterizing the topology of the hierarchy, such

as the average number of outgoing edges, were also included. The complete set of features is


reported in Table 7.2.
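For illustration, the coherence statistics derived from cosine similarities (features T19-T33) can be sketched as follows (function and variable names are mine; the same statistics over all pairs, rather than consecutive ones, give the global-coherence variant):

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def coherence_stats(embeddings):
    """Mean, standard deviation, and coefficient of variation of the cosine
    similarity between temporally consecutive topic embeddings."""
    sims = [cosine(a, b) for a, b in zip(embeddings, embeddings[1:])]
    mean, std = float(np.mean(sims)), float(np.std(sims))
    return mean, std, std / mean

stats = coherence_stats([np.array([1.0, 0.0]),
                         np.array([1.0, 0.0]),
                         np.array([0.0, 1.0])])
```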

7.1.3.2 Other linguistic features

As observed in Chapters 3 and 4, language deficits in AD typically include word-finding diffi-

culties, naming impairments, and semantic errors. These problems contribute to an incoherent,

circumlocutory speech characterized by the use of indefinite and vague terms. To analyze these

deficits in the context of narrative speech, and thus provide a more comprehensive evaluation of language abilities, I integrate the topic coherence feature set with a number of lexical,

syntactic, and semantic features. These features and the methodology used to compute them

are detailed in the remainder of this section.

Lexical features

An excessive use of indefinite and generic terms could be analyzed through measures that aim at revealing the richness and diversity of the lexicon. For this purpose, one of the most widely reported metrics, which has been used in many linguistic and clinical research studies (Bucks

et al. 2000b, Fraser et al. 2016, Johansson 2009, Kettunen 2014), is the TTR. It is a sample measure

of vocabulary size, representing the ratio of the total vocabulary to the overall text length.

In addition, the Brunet’s index (Brunet 1978) and the Honore’s statistic (Honore 1978), two

alternative measures of richness of vocabulary were also computed. Brunet’s index quantifies

lexical richness without being sensitive to text length. It is calculated according to the following

formula: W=NV (−.165), where N is the total text length and V is the total vocabulary used

by the participant. The Honore’s statistic evaluates the richness of a lexicon by counting the

number of words that occur only once. It is calculated according to the following formula:

R=100 · log N/(1− V1/V), where V1 is the number of words spoken only once, V is the total

vocabulary used, and N is the total text length.
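The three lexical measures can be computed directly from a token list; a sketch under the formulas above (note that Honore's statistic is undefined when every word occurs exactly once):

```python
import math
from collections import Counter

def lexical_richness(words):
    """TTR, Brunet's index W = N^(V^-0.165), and Honore's statistic
    R = 100 * log(N) / (1 - V1/V), from a list of tokens."""
    counts = Counter(w.lower() for w in words)
    n = len(words)                                   # total text length
    v = len(counts)                                  # vocabulary size
    v1 = sum(1 for c in counts.values() if c == 1)   # words occurring once
    ttr = v / n
    brunet = n ** (v ** -0.165)
    honore = 100 * math.log(n) / (1 - v1 / v)        # diverges if v1 == v
    return ttr, brunet, honore

ttr, brunet, honore = lexical_richness(
    "the boy is taking the cookies".split())
```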

Morphosyntactic and syntactic features

Problems with pronominal reference may also contribute to less informative speech. In fact, several studies have found that the discourse of AD patients contains an often inappropriate overuse of pronouns (Almor et al. 1999, Kempler 1984, 1995, Kempler et al. 1987, Ripich &

Terrell 1988). This type of problem, in particular the overuse of demonstratives (e.g., /here/)

and references without a clear antecedent (e.g., /this/) have also been found in the work of Ula-

towska et al. (1988). AD patients have also shown impaired verb production, verb naming, and

impaired verb knowledge in sentence processing (Kim & Thompson 2004, Reilly et al. 2011).


To account for these and other linguistic phenomena, the frequency of occurrence of differ-

ent word classes was computed by relying on the POS information obtained with the Stanford

parser (Klein & Manning 2003b). The frequency of each class is computed at the sentence level

and then normalized by the total number of words in a narrative. Finally, frequencies are aver-

aged over all the sentences.
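A sketch of this computation over POS-tagged sentences (the Penn Treebank tag groupings are my assumption):

```python
WORD_CLASSES = {
    "adverbs": {"RB", "RBR", "RBS"},
    "verbs": {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"},
    "nouns": {"NN", "NNS", "NNP", "NNPS"},
    "pronouns": {"PRP", "PRP$"},
    "adjectives": {"JJ", "JJR", "JJS"},
}

def word_class_frequencies(tagged_sentences):
    """Per-sentence counts of each word class, normalized by the total number
    of words in the narrative and averaged over all sentences."""
    total_words = sum(len(s) for s in tagged_sentences)
    frequencies = {}
    for cls, tags in WORD_CLASSES.items():
        per_sentence = [sum(1 for _, tag in s if tag in tags) / total_words
                        for s in tagged_sentences]
        frequencies[cls] = sum(per_sentence) / len(tagged_sentences)
    return frequencies

freqs = word_class_frequencies([
    [("the", "DT"), ("boy", "NN")],
    [("she", "PRP"), ("runs", "VBZ")],
])
```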

In the same vein as the word class frequencies, the frequency of different types of production rules is accounted for. This type of analysis has been used in several NLP classification

tasks (Post & Bergsma 2013, Wong & Dras 2010), and in problems aiming at identifying AD and

related dementias (Fraser et al. 2016, Orimaye et al. 2014, Yancheva et al. 2015). In a context-free

grammar, a set of production rules describes how the symbols of the grammar may be rewritten. A production

is typically a relation of the form A→β, where A is a non-terminal symbol (each non-terminal

represents a different type of phrase or clause), and β is a string of symbols (the actual content).

More concretely, instances of these rules may be of the form: NP→NP VP, a noun phrase (NP)

that consists of a noun phrase and a verb phrase (VP), or NN→ /mother/, a noun (NN) that

is a terminal.

To identify a meaningful set of production rules, the phrase structure trees of the 35% of the

data that were retained for building the topic clusters were examined. This analysis provided

more than three thousand different production rules; thus, a reduced subset of rules was selected by defining the following two conditions: i) the left-hand side should belong to a restricted set of constituent tags, and ii) the right-hand side should not be a terminal symbol.

Provided with this list of production rules, for each of them, the corresponding frequency of

occurrence is accounted for. The result is then normalized by the total number of rules in the

narrative. Overall, the total number of syntactic and morphosyntactic features is 52.
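The production-rule frequencies can be sketched with NLTK's `Tree.productions()` (an assumed stand-in for the Stanford parser output; the rule set shown is illustrative):

```python
from collections import Counter
from nltk import Tree

def production_rule_frequencies(parse_str, rule_set):
    """Relative frequency of each selected production rule, normalized by
    the total number of rules in the narrative; rules whose right-hand side
    is a terminal symbol are discarded, as in the selection criteria."""
    tree = Tree.fromstring(parse_str)
    rules = [str(p) for p in tree.productions() if p.is_nonlexical()]
    counts = Counter(rules)
    return {r: counts[r] / len(rules) for r in rule_set}

freqs = production_rule_frequencies(
    "(S (NP (DT the) (NN boy)) (VP (VBZ runs)))",
    {"NP -> DT NN", "S -> NP VP"})
```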

Semantic features

A decline in semantic content is consistent with the claims that describe the discourse of AD

patients as empty, containing little or no information (Ahmed, Haigh, de Jager & Garrard 2013).

Analyzing the narratives of a picture description task, several authors have found that the AD

group, in comparison to the control, produced fewer content elements and made shorter descriptions with fewer categories of information (Ahmed, Haigh, de Jager & Garrard 2013, Croisile

et al. 1996, Hier et al. 1985, Nicholas et al. 1985). Computationally, the identification of infor-

mation content has been approached by several authors (Bucks et al. 2000b, Fraser et al. 2016,

Hakkani-Tur et al. 2010, Hernandez-Dominguez et al. 2018, Jarrold et al. 2014a, Sirts et al. 2017,

Yancheva et al. 2015, Yancheva & Rudzicz 2016). In these studies, the mention of a given con-

cept is assessed through a predefined list of Information Content Units (ICUs). In this respect,


Type: Topic coherence
• Number of topics (T1), subtopics (T2), sub-subtopics (T3), and sub-sub-subtopics introduced (T4).
• Number of topics produced in each topic cluster, namely main scene (T5), mother (T6), boy (T7), girl (T8), children (T9), garden (T10), weather (T11).
• Proportion of dependent (T12) and independent clauses to the total number of sentences (T13).
• Total number of coreferential mentions (T14).
• Total number of repeated topics, subtopics, sub-subtopics, and sub-sub-subtopics (T15).
• Number of sentences that were classified as unrelated (T16), incomplete (T17), or no-content (T18) in the first step of the main algorithm.
• Mean, standard deviation, and coefficient of variation (the ratio of the standard deviation to the mean) of the cosine similarity between two temporally consecutive topics (T19-T21), all pairs of topics (T22-T24), all pairs of repeated topics (T25-T27), and those classified as unrelated (T28-T30) or no-content (T31-T33).
• Length of the longest path from the root node to all leaves (T34).
• Average number of outgoing edges of all nodes (T35).
• Total number of sentences (T36).
• Ratio of dependent to independent clauses (T37).

Type: Lexical
• TTR (L1), Brunet's index (L2), and Honore's statistic (L3).

Type: Morphosyntactic and syntactic
• Word class frequencies: adverbs (M1), verbs (M2), nouns (M3), pronouns (M4), and adjectives (M5).
• Number of times a production rule is used (M6-M41).
• Rate, proportion, and average length of noun (M42-M44), verb (M45-M47), and prepositional phrases (M48-M50). Ratio of nouns to verbs (M51) and of pronouns to nouns (M52).

Type: Semantic
• ICUs to consider the mention of a key concept in the Cookie Theft picture (S1-S23).
• Frequency of occurrence of specific keywords relevant for the Cookie Theft picture (S24-S49).

Table 7.2: Summary of all extracted features (141 in total). The identifiers of the features of each type are reported in parentheses.

some works have recently approached a completely automatic evaluation of information con-

tent (Hakkani-Tur et al. 2010, Hernandez-Dominguez et al. 2018, Jarrold et al. 2014a, Yancheva

& Rudzicz 2016).

To account for information content features, the approach described in the work of Croisile

et al. (1996) has been followed. The authors examined 23 information units in four categories

(i.e., subjects, places, objects, and actions). Overall, these concepts are assumed to constitute a


complete description of the Cookie Theft picture:

• three subjects: boy, girl, and mother,

• two places: kitchen and exterior seen through the window,

• eleven objects: cookie, jar, stool, sink, plate, dishcloth, water, window, cupboard, dishes,

and curtains,

• seven actions: boy taking or stealing, boy or stool falling, woman drying or washing

dishes/plate, water overflowing or spilling, action performed by the girl, woman uncon-

cerned by the overflowing, woman indifferent to the children.

Information units are used to identify the mention of a concept in a narrative; as such, they should be robust to lexical or semantic variations of the same content (e.g., mother, lady, wife, mom). Another issue concerns the ICU ’action performed by the girl’, whose definition is too vague and needs to be further specified in order to be approached computationally.

Details on the definition of the ICUs can be found in Appendix A.2.1. Given the ICU definitions, the features corresponding to the categories subjects, objects, and places were computed by simply verifying the mention of the corresponding items in the text. To detect mentions of the ICU category actions, the dependency representations provided by the Stanford Parser (Klein & Manning 2003b) were examined. More details about the computation of the ICU features can be found in Appendix A.2.2.
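Computationally, the ICU features for the subjects, places, and objects categories amount to checking whether any lexical variant of a concept occurs in the transcript. A minimal sketch of this lookup follows; the synonym sets are illustrative assumptions for the sketch, not the actual ICU definitions of Appendix A.2.1:

```python
import re

# Illustrative synonym sets for three ICUs; these lists are assumptions,
# not the actual definitions from Appendix A.2.1.
ICU_VARIANTS = {
    "mother":  {"mother", "mom", "mum", "lady", "woman", "wife"},
    "cookie":  {"cookie", "cookies", "biscuit", "biscuits"},
    "kitchen": {"kitchen"},
}

def icu_features(transcript):
    """Return one binary feature per ICU: 1 if any lexical variant occurs."""
    tokens = set(re.findall(r"[a-z']+", transcript.lower()))
    return {icu: int(bool(tokens & variants))
            for icu, variants in ICU_VARIANTS.items()}

feats = icu_features("the mom is washing dishes while the boy takes a cookie from the jar")
# feats == {"mother": 1, "cookie": 1, "kitchen": 0}
```

The actions category cannot be handled by such a lookup, which is why dependency representations are required for it.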

Finally, similarly to the work of Fraser et al. (2016), the set of ICU features is additionally complemented with the occurrences of specific words that may be of relevance to the Cookie Theft picture. That is, the number of times that a given unigram occurs is computed. This may highlight possible subtle variations in the linguistic patterns used by the AD and the control groups. Overall, the total number of semantic features is 49.

7.1.4 Evaluation

AD classification experiments have been performed on a subset of the Cookie Theft corpus using

an RF classifier. As described in Section 7.1.2, 35% of the data was held out to define the topic

clusters. Hence, only 65% of the data has been used for experimental validation, that is, for

training and testing AD classifiers. This consists of 148 discourse samples from the control

group, and 153 from the dementia group. The two sets together contained 1241 unique words,

885 from control subjects, and 878 from AD patients. On average, each transcribed narrative

contained around 12-13 sentences, with the patients group producing shorter descriptions. A

stratified per-subject k-fold cross-validation strategy was implemented, with the number of folds k


equal to 10. In the following, I report the average and range accuracy at the 90% confidence

level computed from the results of each fold.
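This reporting scheme can be sketched as follows, with synthetic fold accuracies; the z = 1.645 normal approximation for the 90% interval is an assumption, as the exact interval formula is not stated in the text:

```python
import math

def mean_and_margin(fold_accuracies, z=1.645):
    """Average fold accuracy and its 90% confidence half-width.

    A normal approximation is assumed (z = 1.645 for 90%); whether a
    normal or Student-t interval was used is not stated in the text.
    """
    k = len(fold_accuracies)
    mean = sum(fold_accuracies) / k
    var = sum((a - mean) ** 2 for a in fold_accuracies) / (k - 1)
    return mean, z * math.sqrt(var / k)

# Synthetic accuracies for k = 10 folds (not the actual fold results).
accs = [0.81, 0.77, 0.83, 0.79, 0.76, 0.80, 0.82, 0.78, 0.75, 0.79]
mean, margin = mean_and_margin(accs)  # reported in the "79.0% ± ..." style
```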

In order to identify the most discriminant features for AD classification, I implemented a

method based on the sequential forward selection (SFS) (Pudil et al. 1994). The SFS algorithm

is an iterative search approach in which a model is trained with an incremental number of

features. Starting with no features, at each iteration the accuracy of the model is tested by

adding, one at a time, each of the features that were not selected in a previous iteration. The

feature that yields the best accuracy is retained for further processing. The method ends when

the addition of a new feature does not improve the performance of the model. In this study, I

implemented a variation of the SFS that explores a larger feature space in order to find better

solutions to the problem at hand. That is, I removed the constraint of terminating the search

as soon as a first local maximum is found, and performed an extended search until the last

feature was selected. Then, the modified approach selects the minimal set of features that meets

a certain performance convergence criterion. In this case, this is defined by the attainment of a

classification performance of at most 1% worse than the global maximum.
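The modified SFS can be sketched as follows; the scoring function is a toy stand-in (an assumption) for training and testing the RF classifier on a candidate feature subset:

```python
def extended_sfs(features, score, tolerance=0.01):
    """Sequential forward selection without early stopping.

    Greedily adds the feature that maximizes score(subset) until every
    feature has been added, then returns the smallest prefix whose score
    is within `tolerance` of the global maximum (the convergence
    criterion described in the text).
    """
    selected, history, remaining = [], [], list(features)
    while remaining:
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
        history.append(score(selected))
    target = max(history) - tolerance
    for i, s in enumerate(history):
        if s >= target:
            return selected[:i + 1], s

# Toy scoring function: an assumption standing in for the accuracy of an
# RF classifier trained on the candidate subset.
WEIGHTS = {"T1": 0.66, "T21": 0.08, "T16": 0.04, "noise": 0.005}

def toy_score(subset):
    return sum(WEIGHTS[f] for f in subset)

subset, acc = extended_sfs(list(WEIGHTS), toy_score)
# subset == ["T1", "T21", "T16"]: the noise feature's tiny gain falls
# within the 1% tolerance, so the smaller subset is preferred.
```

Because the search continues past the first local maximum, the returned prefix can be smaller than the subset attaining the global maximum.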

7.1.4.1 Experiments using manual transcriptions

This section reports on the validation of different set of features computed on the manual tran-

scriptions: experiments with topic coherence features, with additional linguistic features, and

fusion of these two sets.

Topic coherence features results

Using only the set of features computed through the multistage approach yielding the topic

hierarchy, the average accuracy in classifying AD was 79.0% ± 4.8%. This performance is

achieved with a selection of 11 features out of 37. Figure 7.4 reports the classification accuracy

obtained with the SFS algorithm on this subset of features.

From this figure, it is possible to note that the number of topics (T1) was the first feature selected, providing, alone, an average accuracy of 66%. The second and the fifth features selected

were the coefficient of variation between two temporally consecutive topics (T21) and between

all pairs of topics (T24). Nevertheless, other statistical measures related to the mean and

standard deviation of the cosine value between adjacent (T19, T20) and all pairs of topics (T22,

T23) were not considered discriminant for classification by the SFS algorithm. These findings

are in agreement with the results achieved by Toledo et al. (2018). However, these features

were introduced because in the literature they have been used as an index of local and global


Figure 7.4: Variation of the classification accuracy with the SFS method, while increasing the number of features (x-axis: T1, T21, T16, T10, T24, T14, T12, T35, T8, T34, T3; y-axis: accuracy, from 0.50 to 0.80). Results are presented for the set of topic coherence features that provided the maximum accuracy. Features are computed on the manual transcriptions.

coherence (Graesser et al. 2004, Santos et al. 2017, Toledo et al. 2018). In order to understand

why they were not considered relevant for classification, their average values were analyzed in

more detail. In this way, one can note that differences between the AD and the control group

are relatively small, which may represent a possible explanation. From this analysis, I also dis-

covered that, in line with the results achieved by Toledo et al. (2018), the AD group achieved

higher scores in the mean value of the cosine between adjacent topics, rather than between

all pairs of topics. This difference has been associated with a greater difficulty in keeping the

theme throughout the discourse. A significant impairment in global, but not in local coherence

was also found by Glosser & Deser (1991) while assessing macrolinguistic patterns of discourse

production in AD patients. Dijkstra et al. (2002) justified this finding with the assumption that

global coherence and elaborations on a topic require more cognitive resources than local co-

herence, which needs only the activation of information that is relevant between continuous

sentences.
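The cosine-based statistics discussed here can be sketched as follows, with made-up two-dimensional topic vectors standing in for the actual topic representations:

```python
import math
from itertools import combinations

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    return num / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def coherence_stats(sims):
    """Mean, standard deviation, and coefficient of variation of a list of
    cosine similarities (the pattern of features T19-T21 and T22-T24)."""
    mean = sum(sims) / len(sims)
    std = math.sqrt(sum((s - mean) ** 2 for s in sims) / len(sims))
    return mean, std, std / mean

# Made-up 2-D topic vectors standing in for the actual representations.
topics = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0)]

local = [cosine(topics[i], topics[i + 1]) for i in range(len(topics) - 1)]
global_ = [cosine(u, v) for u, v in combinations(topics, 2)]

local_stats = coherence_stats(local)     # adjacent topics: local coherence
global_stats = coherence_stats(global_)  # all pairs: global coherence
```

With these toy vectors the adjacent-topic mean exceeds the all-pairs mean, the pattern the text associates with local versus global coherence.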

Another relevant result regards the number of coreferential mentions (T14) and the proportion of dependent clauses (T12). Analyzing these data in more detail, it is confirmed that, on

average, there is a large difference between the two groups for these statistics. Two opposite

patterns also emerge. AD patients produced a greater number of coreferential mentions and

a reduced number of dependent clauses with respect to the control group. These results are

in agreement with the findings that AD speech is characterized by an increased use of pro-

nouns and a reduced number of subordinate clauses (Ahmed, Haigh, de Jager & Garrard 2013,

Croisile et al. 1996, Kemper et al. 1993, Ripich & Terrell 1988, Ulatowska et al. 1988).


Linguistic features results

Using the SFS feature selection approach with the linguistic features, the accuracy improves

to 82.6% ± 5.1%. This result was achieved with the selection of 55 features, out of 104. This

outcome is consistent with those achieved in similar state-of-the-art works (Hernandez-Dominguez et al. 2018, Yancheva & Rudzicz 2016), and in particular with that of Fraser

et al. (2016). I recall, in fact, that the range of linguistic measures analyzed in this study can

be considered a subset of those implemented by Fraser et al. In this way, it is possible to draw

a comparison between the two approaches. In both studies, the frequencies of adverbs (M1),

verbs (M2), and nouns (M3) were identified in the set of the most important features. The same

applies to the rate of prepositional phrases (M48). An analysis of the mean values of these

features in the two groups confirmed a trend in agreement with the current state of the art.

AD speech contains a reduced number of nouns, verbs, and prepositional phrases (Ahmed,

de Jager, Haigh & Garrard 2013, Croisile et al. 1996, Kave & Goral 2016, Kemper et al. 1993).

Concerning semantic features, a partial overlap was also found in the set of ICUs and

word occurrences that were considered relevant for classification. I acknowledge that in this

study a larger number of semantic features was selected in comparison to that of Fraser

et al. These differences may be explained either by the different feature sets, since those implemented by Fraser et al. also contain acoustic and psycholinguistic measures, or by different computational implementations.

Fusion of features results

When combining the topic coherence features with the set of lexical, syntactic, and semantic

features, the accuracy improves from 82.6% ± 5.1% to 85.5% ± 2.9%. This result is achieved

with the identification of a restricted number of features, only 19 out of 141, corresponding to

13% of the total number of features. This is a relevant outcome considering that the use of

these two sets individually required the selection of 30% and 53% of the total number of fea-

tures, achieving a lower accuracy. As shown in Figure 7.5 (top), the selected subset includes

16 features assessing syntactic and semantic abilities, and 3 features evaluating pragmatic aspects of language. This distribution confirms that the set of other linguistic features is more

comprehensive and covers a wide range of phenomena, assessing language impairment at dif-

ferent levels of processing. On the other hand, pragmatic features encode in a compact way a

different type of information, complementing lower level aspects of language production.


Figure 7.5: Accuracy achieved with the top selected features using the fusion of different sets. Results are computed on the manual transcriptions (top) and on the automatic transcriptions (bottom).

7.1.4.2 Experiments using automatic transcriptions

The generation of manual transcriptions of discourse samples is a laborious, time-consuming

task that also requires expert linguistic knowledge. This requirement hampers the applicability of the proposed type of computational analysis in clinical settings. Thus, the use of a speech recognition system to automatically produce the transcriptions can remove this constraint. However,

this may be at the cost of a negative impact on performance due to recognition errors. Nowa-

days, state of the art automatic speech recognition (ASR) systems can achieve accuracy levels

comparable to human transcribers (Word Error Rate (WER) of ∼5-6%) in certain spontaneous

speech recognition tasks (Saon et al. 2017, Stolcke & Droppo 2017). Nevertheless, when it

comes to the recognition of atypical speech, such as speech from elderly people, recognition

errors typically get worse. This performance drop is even exacerbated in the case of speech

and language-affecting diseases. Hakkani-Tur et al. (2010) reported WERs of 30.7% and 26.7% in recognizing the speech of elderly, healthy adults performing a picture description and a story recall task, respectively. Lehr et al. (2012) achieved a WER of 34.5% on a corpus of MCI and elderly

subjects while performing a story recall task.

In this section, the approach to automatic transcription generation is described. Then, the

impact of automatic transcriptions on the AD classification task is assessed.

Automatic transcription generation

The Google Cloud Platform (Google 2019a) and the Google Cloud Speech to Text (STT) (Google

2019b) API have been used to obtain the automatic transcriptions for the Cookie Theft corpus.

Recordings originally encoded in MP3 format were converted to 16 kHz sampling


           Manual transcriptions                       Automatic transcriptions
           Topic coherence   Linguistic   Fusion       Fusion
Accuracy   79.0±4.8          82.6±5.1     85.5±2.9     79.7±3.5

Table 7.3: Summary of AD classification results (avg. and range accuracy %)

rate WAV audio files, a coding format accepted by the Google API. The quality of the recordings is quite poor, as they were originally collected on tapes in the late ’80s. Besides, some of them contain background noise or a very low voice. However, no speech enhancement technique was applied. Moreover, before performing ASR, the original recording sessions were

segmented in order to remove the clinician interventions, which are extraneous to the descrip-

tion of the participant and need to be ignored. For this segmentation task, a speaker diariza-

tion system could have been used (Bonastre et al. 2005, Meignier & Merlin 2010). However,

the sentence level manual transcriptions have been used to obtain the utterance boundaries.

Notice that overlapped speech segments due to the superimposition of the clinician voice were

also kept. In addition to speaker changes, the corpus manual annotations were also used to

obtain manual sentence boundaries. This is due to the fact that the ASR system provided au-

tomatic sentence boundaries based on long pauses that did not correspond to the actual end

of the sentence. In fact, a correct identification of sentence boundaries is essential for model-

ing the topic hierarchy in the proposed method and for the extraction of some of the features

described in Section 7.1.3. While there are natural language processing tools that perform au-

tomatic sentence segmentation, most of them require the speech transcripts to include accurate

punctuation marks, a feature that is not currently available.

AD classification results

Automatic transcriptions were obtained both for the train and test data. Apart from the way in

which the transcriptions were generated, these experiments are performed with the exact same

procedure used in the previous section (7.1.4.1). This means that the algorithm used to build the

hierarchy, the dataset separation, and also the partitions used in cross validation experiments

are exactly the same. The experiments were performed using the complete set of features that

includes topic coherence and other linguistic features. In this way, a classification accuracy of

79.7% ± 3.5% was achieved. This result was obtained through the selection of 10 features out

of 141, corresponding to 7% of the total number of features. Table 7.3 reports a summary of the

results achieved in the task of AD classification, using both manual and automatic transcrip-

tions.


A comparison between the set of features that provided the best classification accuracy on

the automatic transcriptions and those identified on the manual transcriptions can be done by

observing Figure 7.5. In this way, it is possible to notice that the selected subset includes 8

features assessing lexical, syntactic, and semantic abilities, and 2 features evaluating pragmatic

aspects of language. That is, 20% in contrast to the 13% when using manual transcriptions.

Interestingly, the total number of coreferential mentions (T14) is the only pragmatic feature that

appears in both selections. Although the type of information used in the analysis of coreference

is strongly affected by recognition errors, this feature continues to be of particular relevance for

the discrimination of the disease, as confirmed by an analysis of its mean values. The number

of topics (T1) is now selected, while the number of incomplete sentences (T17) and the average

number of outgoing edges (T35) are no longer selected. These observations suggest, on the one hand, that ASR errors affect the proposed features unevenly and, on the other hand, that a small number of these features seem to be less sensitive to these errors (20% selected in contrast to 13%, as noted previously).

Transcription errors analysis

As expected, due to recognition errors, one can observe a negative impact on the model’s per-

formance in terms of classification accuracy. The WER was computed using the manual tran-

scriptions of the Cookie Theft corpus as ground truth. The WER for the control and the AD

groups was 37% and 43%, respectively. An analysis of the audio recordings confirmed that

such a high error rate is partly due to the poor quality of the recordings. In fact, those audio

segments that reported a higher error rate either contained a lot of background noise, or the

recorded voice was of very low energy. Another possible source of error for the recognizer is

related to the speaking style of the subjects, which in some cases was very fast, presenting

a high rate of coarticulation. Nevertheless, it is worth noting the robustness of the selected set

of features and the proposed method, which is able to achieve up to a 79.7% AD classification

accuracy even with such a high WER.
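The reported WERs decompose into substitutions, deletions, and insertions obtained from a minimum-edit-distance alignment of reference and hypothesis words; a minimal sketch with invented sentences (not corpus data):

```python
def wer_counts(ref, hyp):
    """Align reference and hypothesis with minimum edit distance and
    return the (substitutions, deletions, insertions) of an optimal path."""
    ref, hyp = ref.split(), hyp.split()
    # d[i][j] = (cost, subs, dels, ins) for ref[:i] vs hyp[:j]
    d = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    d[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        d[i][0] = (i, 0, i, 0)       # delete all reference words
    for j in range(1, len(hyp) + 1):
        d[0][j] = (j, 0, 0, j)       # insert all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                d[i][j] = d[i - 1][j - 1]
            else:
                c, s, dl, ins = d[i - 1][j - 1]
                sub = (c + 1, s + 1, dl, ins)
                c, s, dl, ins = d[i - 1][j]
                dele = (c + 1, s, dl + 1, ins)
                c, s, dl, ins = d[i][j - 1]
                inse = (c + 1, s, dl, ins + 1)
                d[i][j] = min(sub, dele, inse)
    return d[-1][-1][1:]

def wer(ref, hyp):
    s, dl, ins = wer_counts(ref, hyp)
    return (s + dl + ins) / len(ref.split())

# Invented example sentences (not corpus data).
ref = "the boy is taking a cookie from the jar"
hyp = "the boy taking the cookie jar"
rate = wer(ref, hyp)  # 4 errors over 9 reference words
```

When several optimal alignments exist, the tuple comparison breaks ties toward fewer substitutions; the total error count is unaffected.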

Analyzing ASR results in more detail, it was possible to verify that deletions (i.e., words or

sentences that were not recognized) were the main source of error, followed by substitutions (i.e., words that were incorrectly recognized). Indeed, both for the AD and the control group, the number of deletions was almost double the number of substitutions, contributing, respectively, 26% and 22% of the error. Additionally, for both groups, the features obtained

with the manual and the automatic transcripts were analyzed. In this way, it was possible to

observe that the majority of the features computed with the automatic transcriptions showed a


lower average value in comparison to the manual ones. This outcome is in agreement with the

high rate of deletions found. Only a reduced number of features showed an opposing trend.

That is, their average value was found to be higher than the corresponding value computed on the manual transcriptions. Among the topic coherence features, this phenomenon

was observed for the number of sentences classified as incomplete (T17). As expected, due to

deletions its average value considerably increases, resulting however in a less discriminant fea-

ture for the AD classification task, as the experiments with automatic transcriptions confirm.

Lexical and syntactic features also reflected the alterations existing in the transcriptions by

showing an increasing trend in some measures. Among them are Honore’s statistic (Honore 1978) (L3), the frequency of nouns (M3), and other features related to this measure. Notably, a higher value of Honore’s statistic is associated with the use of a richer vocabulary. This phenomenon is most likely due to substitution errors. The frequency

of nouns (M3), the ratio of nouns to verbs (M51), and the rate, proportion, and average length

of noun phrases (M42-M44) also provided an increasing trend in both groups. A closer inspec-

tion revealed that this increment was indeed justified by a reduction in the number of noun

phrases and by a general reduction in sentence length.
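The lexical measures involved here (L1-L3) follow standard formulations; a sketch under the usual definitions, which may differ in detail from the implementation used in this work:

```python
import math
from collections import Counter

def lexical_richness(tokens):
    """TTR (L1), Brunet's index (L2), and Honore's statistic (L3).

    N = total tokens, V = vocabulary size, V1 = words occurring once.
    These are the usual formulations; the exact parameterization used in
    this work may differ. Honore's statistic is undefined when every
    word is a hapax legomenon (V1 == V).
    """
    n = len(tokens)
    counts = Counter(tokens)
    v = len(counts)
    v1 = sum(1 for c in counts.values() if c == 1)
    ttr = v / n                                  # type-token ratio
    brunet = n ** (v ** -0.165)                  # lower = richer vocabulary
    honore = 100 * math.log(n) / (1 - v1 / v)    # higher = richer vocabulary
    return ttr, brunet, honore

tokens = "the boy takes a cookie and the girl asks for a cookie".split()
ttr, brunet, honore = lexical_richness(tokens)
# ttr == 0.75 (9 types over 12 tokens)
```

ASR substitutions tend to inflate the number of distinct word types, which is consistent with the increase in Honore's statistic observed above.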

7.2 Summary

In this chapter, I explored an approach to automatically classify AD based on the analysis of

a wide set of linguistic features. Pragmatic abilities of language processing are evaluated by

modeling discourse samples into a hierarchy of information. Through this process, a number

of features related to the analysis of topic coherence is computed. This set is then complemented with various lexical, syntactic, and semantic features in order to provide a comprehensive evaluation of language deficits in AD. Classification experiments achieved an accuracy of

85.5%, which is in line with the current state of the art. These results confirm that by incorporating features representing pragmatic aspects of discourse, it is possible to attain a better characterization of language impairments in AD. This contributes to an improved ability of current computational approaches to provide an objective, complementary evaluation. Additionally, I evaluated the impact of using a speech recognizer to automatically produce the

transcriptions needed for this type of computational analysis. This is an important step towards the applicability of this kind of approach in a clinical setting. In this case, results

provided a lower accuracy of 79.7% in the automatic identification of the disease, mostly due

to the negative effect of ASR deletion errors. Nevertheless, this is still a remarkable disease

classification result considering that the WER in these data was around 40%.


8 Conclusions and Future Work

This chapter presents the final remarks of this dissertation. Initially, the major achievements

of the research carried out in the areas of monitoring of speech, cognitive, and language abili-

ties are summarized. Then, these results are discussed in the context of the central aim of this

thesis, which is to contribute to the current state of the art on the diagnosis of neurodegener-

ative diseases by means of Speech and Language Technology (SLT) (Section 8.1). Finally, new

directions for future research are suggested (Section 8.2).

8.1 Contributions

This dissertation addressed the problem of providing complementary diagnostic methods

based on SLT for the diagnosis of neurodegenerative diseases. To this end, a preliminary study

over the major symptoms and the core criteria used in the clinical diagnosis of these disorders

has been conducted. This research allowed the identification of three distinct areas deserving further investigation: the monitoring of speech, cognitive, and language abilities. For each of these

areas, the relevant state of the art was reviewed, identifying current progress and limitations.

In monitoring of speech abilities, I found an extensive body of research that investigated

several acoustic measures able to represent the symptoms of dysarthria, a motor speech disor-

der typically developed in Parkinson’s Disease (PD). Dysarthria is commonly assessed through

the evaluation of several speech production tasks. Few studies analyzed the discriminative

ability of these tasks with respect to the identification of PD. In my research, I have defined a

standard set of acoustic features, which includes measures evaluating speech disorders along various dimensions and thus can be suitable to assess the relevance of the different speech tasks. To

this end, I have used a database containing Portuguese PD patients and healthy speakers per-

forming 8 tasks designed to assess phonation, articulation, and prosody. Results have shown

that the most important production tasks were reading of prosodic sentences and storytelling,

achieving a PD classification accuracy of 85.10% and 82.32%, respectively. These tasks elicit the

production of continuous speech and definitely contain more acoustic and prosodic informa-

tion than the sustained vowel phonation or the reading of a word. Their selection may indicate

that, in the identification of PD, a comprehensive evaluation of speech impairments is more


important than the assessment of isolated abilities, such as phonation or articulation.

Regarding the monitoring of cognitive abilities, I found that many neuropsychological

tests are eligible to be administered through current SLT solutions, providing evident advan-

tages to the community. However, the literature review showed that existing solutions present

important limitations. For these reasons, I targeted the automatic implementation of some neu-

ropsychological tests widely used for screening cognitive performance. First, I addressed the

semantic verbal fluency task. This test represents a challenge for current speech recognition

technology, as it requires the spontaneous production of a list of items belonging to an uncon-

strained domain. To assess its automatic implementation, I collected a database containing the

recordings of 42 healthy speakers. This challenge has been addressed through the construction of a tailored language model and the exploitation of prosodic cues from the linguistic area. Both methods

provided an improvement in the prediction of potential animal names. Then, I approached the

automatic implementation of the MMSE and ADAS-Cog, two neuropsychological test batter-

ies. They have been developed as an automatic web-based tool with SLT integration. In this

way, the tool can also be used for remote monitoring of cognitive impairments. As far as I

know, it represents the only platform of this type implemented for the Portuguese population.

To evaluate the feasibility of the monitoring tool, a speech corpus including the recordings of

5 people diagnosed with cognitive impairments and 5 healthy control subjects was collected.

The error between the manual and the automatic evaluation was relatively small (from 0.80 to

3.00 for the patients, and from 0.80 to 2.80 for the control group), confirming the feasibility of

such a type of system. Additionally, the flexibility of the platform allows the implemented tests to be easily extended with different types of exercises that can be used for the daily training of cognitive abilities.

In monitoring of language abilities, I found that the analysis of discourse impairments re-

quires both the manual transcriptions of continuous speech samples and the subsequent iden-

tification and annotation of predefined linguistic features. These requirements preclude the

applicability of discourse analysis in clinical settings and may also lead to different inter-expert

assessments. I developed an automatic method targeting the analysis of pragmatic aspects of a

discourse. This approach is further complemented by considering lower-level aspects of language processing, such as lexical, syntactic, and semantic abilities. Overall, the analysis of such

a wide set of language characteristics should provide a comprehensive evaluation of discourse

production. The method has been evaluated with a publicly available corpus devoted to the

study of communication in dementia. In the experiments, I show that pragmatic features pro-

vide complementary information, increasing accuracy from 82.6% to 85.5% in the detection of


dementia. Additionally, with the aim of automating the entire process of this type of analysis,

a speech recognition system has been used to obtain the transcriptions of the recordings.

To conclude, I believe that these contributions provide relevant advances in the current state

of the art on the diagnosis of neurodegenerative diseases by means of SLT, fulfilling the major

aim of this thesis. In particular, the research carried out in the areas of monitoring speech,

cognitive and language abilities makes a step ahead towards a future in which clinicians may

be provided with complementary diagnostic tools. The work carried out during this doctoral

research accomplished also another important objective, to introduce the research group where

I am involved to the interdisciplinary area of diagnosis of neurodegenerative diseases. This

result can be confirmed by the amount of new studies in this area that have started in the

group along these years. This statement is based on an international project targeting speech

therapy, in which I have been directly involved, and on a European consortium dedicated at the

analysis of pathological speech. Furthermore, new master theses related with the topics of this

dissertation have been developed, in some of which, I was directly involved as co-supervisor.

Finally, the work carried out in this doctoral research allowed to establish new collaborations

with neurologists from different institutions, specialized in different areas, which will be of

great importance for the development of future interdisciplinary research.

Publication list

The work carried out in the context of this dissertation has led to the following publications:

• Anna Pompili, Alberto Abad, David Martins de Matos, Isabel P. Martins, Pragmatic as-

pects of discourse production for the automatic identification of Alzheimer’s disease, accepted

for publication in a special issue of the IEEE Journal of Selected Topics in Signal Processing (JSTSP) on Automatic assessment of health disorders based on voice, speech and

language processing, May 2019.

• Anna Pompili, Alberto Abad, David Martins de Matos, Isabel P. Martins, Topic coher-

ence analysis for the classification of Alzheimer’s disease, In Proceedings IberSPEECH 2018,

Barcelona, Spain, November 2018.

• Anna Pompili, Cristiana Filipa Lopes Amorim, Alberto Abad, Isabel Trancoso, Speech and

language technologies for the automatic monitoring and training of cognitive functions, In Work-

shop on Speech and Language Processing for Assistive Technologies (SLPAT), Dresden,

Germany, September 2015.


• Helena Moniz, Anna Pompili, Fernando Batista, Isabel Trancoso, Alberto Abad, Cristiana

Filipa Lopes Amorim, Automatic Recognition of Prosodic Patterns in Semantic Verbal Fluency

Tests - an Animal Naming Task for Edutainment Applications, In International Congress of

Phonetic Sciences (ICPhS 2015), Glasgow, Scotland, UK, August 2015.

• Anna Pompili, Alberto Abad, Paolo Romano, Isabel P. Martins, Rita Cardoso, Helena

Santos, Joana Carvalho, Isabel Guimaraes, and Joaquim J. Ferreira. Automatic Detection

of Parkinson’s Disease: An Experimental Analysis of Common Speech Production Tasks Used

for Diagnosis. In International Conference on Text, Speech, and Dialogue, pp. 411–419,

Springer, August 2017.

Some of the works carried out during the doctoral research have been omitted from the previous list because they were not directly related to the topics identified in this dissertation. However, it is relevant to mention them here:

• Ruben Solera-Urena, Helena Moniz, Fernando Batista, Vera Cabarrao, Anna Pompili,

Ramon Astudillo and Isabel Trancoso, Uma abordagem de aprendizagem semi-supervisionada

para a percepcao automatica de personalidade, baseada em pistas acustico-prosodicas em dominios com poucos recursos [a semi-supervised learning approach for automatic personality perception, based on acoustic-prosodic cues in under-resourced domains], accepted for publication in Revista da Associacao Portuguesa de Linguistica, March 2019.

• Javier Tejedor, Doroteo T. Toledano, Paula López-Otero, Laura Docío-Fernández, Jorge Proença, Fernando Perdigão, Fernando García-Granada, Emilio Sanchis, Anna Pompili and Alberto Abad, ALBAYZIN Query-by-example Spoken Term Detection 2016 evaluation, EURASIP Journal on Audio, Speech, and Music Processing, April 2018.

• Sónia Reis, Anna Pompili, Alberto Abad, Jorge Baptista, O provérbio como estímulo num terapeuta virtual, VI Simpósio Mundial de Estudos da Língua Portuguesa, Santarém, Portugal, October 2017.

• Rubén Solera-Ureña, Helena Moniz, Fernando Batista, Vera Cabarrão, Anna Pompili, Ramón Fernández Astudillo, Joana Carvalho Filipe de Campos, Ana Paiva, Isabel Trancoso, A Semi-Supervised Learning Approach for Acoustic-Prosodic Personality Perception in Under-Resourced Domains, In Proc. of Interspeech 2017, pages 929–933, Stockholm, Sweden, August 2017.

• Anna Pompili, Alberto Abad, The L2F Query-by-Example Spoken Term Detection system

for the ALBAYZIN 2016, In Albayzin Evaluation - IberSPEECH 2016, Lisbon, Portugal,

November 2016.


List of invited presentations:

• Vânia Mendonça, Anna Pompili, Ruben Santos, Isabel Trancoso, Luísa Coheur and Alberto Abad. E-Inclusão no L2F: as tecnologias da língua ao serviço da saúde, educação e comunicação. VII Workshop on Linguistics, Language Development and Impairment, Lisbon, Portugal, 2017.

• Isabel Trancoso, Alberto Abad, Luísa Coheur, Anna Pompili, Cristiana Amorim, Vânia Mendonça. Virtual Therapists. CLEF 2016 - Conference and Labs of the Evaluation Forum. VII Conferência sobre Information Access Evaluation meets Multilinguality, Multimodality, and Interaction. Évora, Portugal, 2016.

• Vânia Mendonça, Anna Pompili, Alberto Sardinha, Luísa Coheur. VITHEA-Kids: Adapting the VITHEA platform to children with Autism Spectrum Disorder. VI Workshop on Linguistics, Language Development and Impairment. Lisbon, Portugal, 2016.

• Anna Pompili, Alberto Abad, Isabel Trancoso, João Paulo Carvalho. Pós-VITHEA: as tecnologias da fala para melhorar as funcionalidades cognitivas. V Workshop on Linguistics, Language Development and Impairment, Lisbon, Portugal, 2015.

• Alberto Abad, Anna Pompili, Isabel Trancoso, José Fonseca, Pedro Fialho. VITHEA - Terapia remota para patologias da fala. II Encontro de Terapeutas da Fala do Alentejo, Évora, Portugal, 2014.

• Alberto Abad, Anna Pompili, Isabel Trancoso, José Fonseca. VITHEA: O potencial das tecnologias da fala no tratamento da afasia. IV Workshop on Linguistics, Language Development and Impairment, Lisbon, Portugal, 2014.

8.2 Future Work

Considering the different topics addressed in this dissertation, there are naturally several directions for future work. Regarding the monitoring of speech abilities, it would be important to validate the results achieved in the automatic identification of PD on different datasets, in order to confirm their robustness. In the area of monitoring cognitive abilities, the results achieved with the implementation of the semantic verbal fluency test show that there is still much room for improvement. Future extensions may consider using confidence scores to filter ASR results, as in the approach of Pakhomov et al. (2015), or training an acoustic model suited to the voice characteristics of elderly speakers. Additionally, concerning neuropsychological tests, future research could target the development of new tests


addressing cognitive stimulation, rather than the diagnosis of diseases. In the area of moni-

toring language abilities, it would be important to incorporate automatic speaker and sentence

segmentation processing, in order to fully remove the dependence on manual transcriptions. A

more challenging line of research is the extension of this kind of analysis to other types of discourse production tasks, including open-domain ones. Finally, future studies should control certain clinical and demographic variables of the samples, namely the participants' degree of literacy and the severity or stage of the disease. The control of these factors will improve

the diagnostic value of the results.
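The confidence-filtering extension mentioned above can be sketched as follows. This is a minimal illustration, not the setup used in the thesis: the (word, confidence) hypothesis format, the 0.6 threshold, and the toy animal lexicon are all assumptions.

```python
# Sketch of confidence-based filtering of ASR output for a semantic verbal
# fluency (animal naming) test. Hypotheses below the confidence threshold
# are discarded before distinct animal names are counted.
# Assumptions: (word, confidence) tuples, a 0.6 threshold, a toy lexicon.

ANIMALS = {"cat", "dog", "lion", "horse", "tiger"}  # toy lexicon

def fluency_score(hypotheses, threshold=0.6):
    """Count distinct animal names among sufficiently confident hypotheses."""
    confident = (word for word, conf in hypotheses if conf >= threshold)
    return len({word for word in confident if word in ANIMALS})

asr_output = [("cat", 0.92), ("bag", 0.31), ("dog", 0.85),
              ("dog", 0.80), ("lion", 0.55)]
print(fluency_score(asr_output))  # 2: "cat" and "dog"; low-confidence "lion" is dropped
```

Raising the threshold trades recall for precision: misrecognitions are suppressed at the cost of discarding softly spoken but correct responses, which is precisely why the threshold would need tuning on elderly speech.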

Probably one of the most important limitations of these studies is the lack of standard, consistent, publicly available datasets. This limitation was foreseen when defining the objectives of this thesis; nevertheless, by focusing the research on two widespread diseases, it was expected that large databases would be easier to find. DementiaBank is one of the largest collections used in the literature for assessing language impairments in AD, yet its size still represents a problem for modern computational approaches. For this reason, an ambitious future line of research envisions the development of tools for the general analysis of speech and language abilities, rather than for the detection of specific diseases. The results provided by these tools would contain general-purpose statistics and information about the analyzed samples, which a clinician could interpret much as current blood tests are interpreted. In this way, without the aim of identifying a specific disorder, the dependence on a database containing data from subjects diagnosed with that disease is also removed. Without this restriction, different types of data can be shared for model training.


Appendix A

This appendix contains additional information about the study on the monitoring of language abilities presented in Chapter 7. Section A.1 provides examples of the input/output of the intermediate steps of the topic coherence analysis process. Section A.2 describes some technical details related to the computation of semantic features.

A.1 Excerpts of input/output processing

A.1.1 Preprocessing

The examples below show an excerpt of an original transcription contained in the Dementia-

Bank database (MacWhinney et al. 2011, TalkBank 2017), before and after the preprocessing

phase. Its corresponding POS tag annotations are also shown.

- /there’s [//] the sink is overflowing while she’s wiping &uh &uh &k &uh a plate and not looking

&=laughs ./

- /the sink is overflowing while she is wiping a plate and not looking ./

- the/DT sink/NN is/VBZ overflowing/VBG while/IN she/PRP is/VBZ wiping/VBG a/DT

plate/NN and/CC not/RB looking/VBG ./
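The cleanup step illustrated above can be approximated with a few regular expressions. The sketch below covers only the annotation codes visible in the excerpt (retracings marked with [//], &-prefixed fillers and paralinguistic events, and a tiny contraction table); the actual preprocessing handles many more CHAT codes.

```python
import re

# Minimal sketch of the CHAT-transcription cleanup shown above. It covers
# only the codes in the excerpt: retracings ([//]), &-prefixed fillers and
# paralinguistic events, and an illustrative contraction table.
CONTRACTIONS = {"she's": "she is", "he's": "he is", "there's": "there is"}

def preprocess(utterance):
    # drop the retraced material preceding a [//] marker
    utterance = re.sub(r"[\w']+\s*\[//\]", "", utterance)
    # drop filled pauses and events such as &uh and &=laughs
    utterance = re.sub(r"&\S+", "", utterance)
    # expand contractions (illustrative subset only)
    for short, full in CONTRACTIONS.items():
        utterance = utterance.replace(short, full)
    # normalize whitespace
    return re.sub(r"\s+", " ", utterance).strip()

raw = ("/there's [//] the sink is overflowing while she's wiping "
       "&uh &uh &k &uh a plate and not looking &=laughs ./")
print(preprocess(raw))
```

On the excerpt above this yields the cleaned utterance with the retraced "there's", the fillers, and the laugh event removed, and "she's" expanded to "she is".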

A.1.2 Clause segmentation

The example below shows the syntactic parse tree obtained with the Stanford University lex-

icalized probabilistic parser (Klein & Manning 2003b) for the sentence: /the sink is overflowing

while she is wiping a plate and not looking ./

(ROOT

(S

(NP (DT the) (NN sink))

(VP (VBZ is)

(VP (VBG overflowing)

(SBAR (IN while)


(S

(NP (PRP she))

(VP (VBZ is)

(VP

(VP (VBG wiping)

(NP (DT a) (NN plate)))

(CC and)

(RB not)

(VP (VBG looking))))))))

(. .)))
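From such a bracketed parse, clause-level constituents can be located by scanning the node labels. The sketch below counts S and SBAR nodes as a rough proxy for clause segmentation; it is an illustrative simplification of the procedure, not the segmentation algorithm itself.

```python
import re

def count_clauses(parse, labels=("S", "SBAR")):
    """Count clause-level constituents (S, SBAR) in a bracketed parse
    string by inspecting the label that follows each opening parenthesis."""
    return sum(1 for m in re.finditer(r"\(([A-Z]+)", parse)
               if m.group(1) in labels)

tree = ("(ROOT (S (NP (DT the) (NN sink)) (VP (VBZ is) (VP (VBG overflowing) "
        "(SBAR (IN while) (S (NP (PRP she)) (VP (VBZ is) (VP (VP (VBG wiping) "
        "(NP (DT a) (NN plate))) (CC and) (RB not) (VP (VBG looking))))))))) (. .))")
print(count_clauses(tree))  # 3: the matrix S, the embedded S, and the SBAR
```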

A.1.3 Coreference analysis

The excerpts below provide some examples of the relations identified by the Stanford coref-

erence resolution system (Manning et al. 2014). Subscripts r and a denote a rejected and an

accepted relationship according to the methodology described in Section 7.1.2.3.

1. /a boya is standing up on a stoolr ./

/hea is falling of a stoolr ./

2. /and shea is getting her feet wet.. ./

/shea is also oblivious to the fact ./

/that herr kids are stealing cookies.. ./

A.2 Computation of semantic features

A.2.1 Specifications of an ICU list

The definition of the Information Content Units (ICUs) list required different specifications for

some of the categories indicated in the work of Croisile et al. (1996).

The sets of subjects and objects were simply defined by complementing each individual con-

cept with its synonyms and semantic variations. This was achieved by consulting the work

of Pakhomov et al. (2010) and online dictionaries (Merriam-Webster 2019, Thesaurus 2019).

The category places required the specification of several n-grams that may match a description

related to something seen in the kitchen, or in the garden, that is, seen through the kitchen window. Finally, ICUs in the actions category were defined through the triple subject-verb-object.


The set of acceptable terms for the subject element corresponds to those specified in the ICU

subjects category. The definition of the verb element includes: synonyms, a complete specifica-

tion of the phrasal verbs that are accepted, and n-grams (e.g., is unaware). The object element of

the triple subject-verb-object was defined only in case of ambiguity (i.e., woman unconcerned

by the overflowing or woman indifferent to the children). The ICU ’action performed by the

girl’ was identified with the following activities: girl asking for cookies, girl with her finger to

her mouth, girl saying to be quiet, girl trying to help, girl reacting to the fall.
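One way to encode such an action ICU is as a triple of accepted word sets, with the object set consulted only when the ICU specifies one. The entries and synonym lists below are invented examples, not the ICU list actually used in the study.

```python
# Illustrative encoding of "actions" ICUs as subject-verb-object sets.
# The concrete entries and synonym lists are invented examples.
ICU_ACTIONS = {
    "boy standing on stool": {
        "subjects": {"boy", "son", "child"},
        "verbs": {"stand", "stand up"},
        "objects": set(),  # empty: object left unspecified (no ambiguity)
    },
    "woman indifferent to children": {
        "subjects": {"woman", "mother", "lady"},
        "verbs": {"ignore", "be indifferent"},
        "objects": {"children", "kids"},  # specified to resolve ambiguity
    },
}

def matches_icu(icu, subject, verb, obj=None):
    """Match subject and verb; check the object only when the ICU lists one."""
    ok = subject in icu["subjects"] and verb in icu["verbs"]
    if ok and icu["objects"] and obj is not None:
        ok = obj in icu["objects"]
    return ok

print(matches_icu(ICU_ACTIONS["woman indifferent to children"],
                  "mother", "ignore", "kids"))  # True
```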

A.2.2 Computing ICUs

The ICU categories subjects, objects, and places were computed by simply verifying the mention of the corresponding items in the text. This approach suffers from important limitations already highlighted by Fraser et al. (2016): if a noun is inappropriately substituted, or a concept is described in an unpredictable way, it may either be attributed to the wrong ICU or not be accounted for at all.

To detect mentions of the ICU category actions, similarly to the work of Fraser et al.

(2016), the dependency representations provided by the Stanford Parser (Klein & Manning

2003b) were examined. An example of this representation for the sentence /a boy is standing up

on a stool ./ is shown in the snippet below:

det(boy-2, a-1)

nsubj(standing-4, boy-2)

aux(standing-4, is-3)

root(ROOT-0, standing-4)

compound:prt(standing-4, up-5)

case(stool-8, on-6)

det(stool-8, a-7)

nmod(standing-4, stool-8)

In particular, the method verifies whether there is a typed dependency identifying the subject of the sentence (i.e., nsubj) that matches one of the elements specified in the subjects category. If there is a correspondence, the same process is applied to the verb and, when specified, to the object of the sentence. Verbs are normalized to their root form, and phrasal-verb compounds are accounted for. N-grams are computed by searching for sequences of n words.
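The matching procedure described above can be sketched over Stanford-style dependency triples. The (relation, head, dependent) tuple format and the tiny lemma table are illustrative assumptions made for this sketch; the study worked on the parser's own output structures.

```python
# Sketch of detecting an action ICU from typed dependencies, following the
# order described above: find an nsubj whose dependent is an accepted
# subject, then check its governing verb (lemmatized, with any particle
# from a compound:prt relation appended). Triples are
# (relation, head, dependent); the lemma table is a toy stand-in.
LEMMAS = {"standing": "stand", "wiping": "wipe"}

def detect_action(deps, subjects, verbs):
    particles = {head: dep for rel, head, dep in deps if rel == "compound:prt"}
    for rel, head, dep in deps:
        if rel != "nsubj" or dep not in subjects:
            continue
        verb = LEMMAS.get(head, head)
        if head in particles:          # e.g., "standing" + "up"
            verb = f"{verb} {particles[head]}"
        if verb in verbs:
            return True
    return False

# Dependencies of "a boy is standing up on a stool", as in the snippet above.
deps = [("det", "boy", "a"), ("nsubj", "standing", "boy"),
        ("aux", "standing", "is"), ("compound:prt", "standing", "up"),
        ("case", "stool", "on"), ("det", "stool", "a"),
        ("nmod", "standing", "stool")]
print(detect_action(deps, {"boy"}, {"stand up"}))  # True
```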


Bibliography

Abad, A., Pompili, A., Costa, A. & Trancoso, I. (2012), Automatic word naming recognition for

treatment and assessment of aphasia, in ‘Proc. Interspeech’.

Abad, A., Pompili, A., Costa, A., Trancoso, I., Fonseca, J., Leal, G., Farrajota, L. & Martins, I. P.

(2013), ‘Automatic word naming recognition for an on-line aphasia treatment system’, Com-

puter Speech & Language 27(6), 1235 – 1248. Special Issue on Speech and Language Processing

for Assistive Technology.

Abdel-Hamid, O., Mohamed, A.-r., Jiang, H. & Penn, G. (2012), Applying convolutional neural

networks concepts to hybrid NN-HMM model for speech recognition, in ‘2012 IEEE interna-

tional conference on Acoustics, speech and signal processing (ICASSP)’, IEEE, pp. 4277–4280.

Ahmed, S., de Jager, C. A., Haigh, A.-M. F. & Garrard, P. (2013), ‘Semantic processing in con-

nected speech at a uniformly early stage of autopsy-confirmed Alzheimer’s disease.’, Neu-

ropsychology 27(1), 79.

Ahmed, S., Haigh, A.-M. F., de Jager, C. A. & Garrard, P. (2013), ‘Connected speech as a marker

of disease progression in autopsy-proven Alzheimer’s disease’, Brain 136(12), 3727–3737.

Albert, M. S., DeKosky, S. T., Dickson, D., Dubois, B., Feldman, H. H., Fox, N. C., Gamst, A.,

Holtzman, D. M., Jagust, W. J., Petersen, R. C., Snyder, P. J., Carrillo, M. C., Thies, B. & Phelps,

C. H. (2011), ‘The diagnosis of mild cognitive impairment due to Alzheimer’s disease: Rec-

ommendations from the National Institute on Aging-Alzheimer’s Association workgroups

on diagnostic guidelines for Alzheimer’s disease’, Alzheimer’s & Dementia 7(3), 270 – 279.

Almor, A., Kempler, D., MacDonald, M. C., Andersen, E. S. & Tyler, L. K. (1999), ‘Why do

Alzheimer patients have difficulty with pronouns? Working memory, semantics, and refer-

ence in comprehension and production in Alzheimer’s disease’, Brain and language 67(3), 202–

227.

Aluísio, S., Cunha, A. & Scarton, C. (2016), Evaluating progression of Alzheimer’s disease by

regression and classification methods in a narrative language test in Portuguese, in ‘Inter-

national Conference on Computational Processing of the Portuguese Language’, Springer,

pp. 109–114.


Ansel, B. M. & Kent, R. D. (1992), ‘Acoustic-phonetic contrasts and intelligibility in the

dysarthria associated with mixed cerebral palsy’, Journal of Speech, Language, and Hearing

Research 35(2), 296–308.

Appell, J., Kertesz, A. & Fisman, M. (1982), ‘A study of language functioning in Alzheimer

patients’, Brain and language 17(1), 73–91.

Atal, B. S. & Schroeder, M. R. (1968), ‘Predictive coding of speech signals’, Report of the 6th Int. Congress on Acoustics, Tokyo, Japan.

Bäckman, L., Jones, S., Berger, A., Laukka, E. & Small, B. (2004), ‘Multiple cognitive deficits

during the transition to Alzheimer’s disease’, Journal of internal medicine 256(3), 195–204.

Baker, J. M., Deng, L., Glass, J., Khudanpur, S., Lee, C., Morgan, N. & O’Shaughnessy, D.

(2009), ‘Developments and directions in speech recognition and understanding, Part 1 [DSP

Education]’, IEEE Signal Processing Magazine 26(3), 75–80.

Bao, H., Xu, M.-X. & Zheng, T. F. (2007), Emotion attribute projection for speaker recognition on

emotional speech, in ‘Eighth Annual Conference of the International Speech Communication

Association’.

Becker, J. T., Boller, F., Lopez, O. L., Saxton, J. & McGonigle, K. L. (1994), ‘The natural history

of Alzheimer’s disease: description of study cohort and accuracy of diagnosis’, Archives of

Neurology 51(6), 585–594.

Bengio, Y. (2008), ‘Neural net language models’, Scholarpedia 3(1), 3881.

Benton, A., Hamsher, K. & Sivan, A. (1994), Multilingual Aphasia Examination: Manual of Instruc-

tions, AJA Assoc.

Berardelli, A., Noth, J., Thompson, P. D., Bollen, E. L., Currà, A., Deuschl, G., van Dijk, J. G., Töpper, R., Schwarz, M. & Roos, R. A. (1999), ‘Pathophysiology of chorea and bradykinesia

in Huntington’s disease.’, Mov Disord 14(3), 398–403.

Bertram, L. & Tanzi, R. E. (2005), ‘The genetic epidemiology of neurodegenerative disease’,

Journal of Clinical Investigation 115(6), 1449–1457.

Bishop, C. M. (2006), Pattern recognition and machine learning, Springer.

Bocklet, T., Nöth, E., Stemmer, G., Ruzickova, H. & Rusz, J. (2011), Detection of persons with

Parkinson’s disease by acoustic, vocal, and prosodic analysis, in ‘2011 IEEE Workshop on

Automatic Speech Recognition & Understanding’, pp. 478–483.


Bocklet, T., Steidl, S., Nöth, E. & Skodda, S. (2013), Automatic evaluation of Parkinson’s speech - acoustic, prosodic and voice related cues, in ‘Interspeech’, pp. 1149–1153.

Bonastre, J., Wils, F. & Meignier, S. (2005), ALIZE, a free toolkit for speaker recognition, in

‘Proceedings. (ICASSP ’05). IEEE International Conference on Acoustics, Speech, and Signal

Processing, 2005.’, Vol. 1, pp. I/737–I/740 Vol. 1.

Boser, B. E., Guyon, I. M. & Vapnik, V. N. (1992), A training algorithm for optimal margin

classifiers, in ‘Proceedings of the fifth annual workshop on Computational learning theory’,

ACM, pp. 144–152.

Boytcheva, S., Dobrev, P. & Angelova, G. (2001), Cgextract: Towards extraction of conceptual

graphs from controlled english, Contributions to ICCS-2001, 9th International Conference of

Conceptual Structures,.

Brady, M., Mackenzie, C. & Armstrong, L. (2003), ‘Topic use following right hemisphere

brain damage during three semi-structured conversational discourse samples’, Aphasiology

17(9), 881–904.

Breiman, L. (1984), Classification and Regression Trees, New York: Routledge.

Breiman, L. (2001), ‘Random forests’, Machine learning 45(1), 5–32.

Brockmann, M., Drinnan, M. J., Storck, C. & Carding, P. N. (2011), ‘Reliable jitter and shimmer

measurements in voice clinics: the relevance of vowel, gender, vocal intensity, and funda-

mental frequency effects in a typical clinical task’, Journal of voice 25(1), 44–53.

Brooks, B. R. (1994), ‘El Escorial World Federation of Neurology criteria for the diagnosis of

amyotrophic lateral sclerosis’, Journal of the Neurological Sciences 124, 96–107.

Brunet, E. (1978), Le vocabulaire de Jean Giraudoux. Structure et évolution, number 1 in ‘Collection “Travaux de linguistique quantitative”’, Slatkine.

Brunnström, H., Gustafson, L., Passant, U. & Englund, E. (2009), ‘Prevalence of dementia subtypes: a 30-year retrospective survey of neuropathological reports.’, Arch Gerontol Geriatr

49(1), 146–149.

Bryson, A. E. & Ho, Y.-C. (1969), Applied Optimal Control: Optimization, Estimation, and Control,

Waltham, Mass: Blaisdell Pub. Co.


Bucks, R. S., Singh, S., Cuerden, J. M. & Wilcock, G. K. (2000a), ‘Analysis of spontaneous, con-

versational speech in dementia of Alzheimer type: Evaluation of an objective technique for

analysing lexical performance’, Aphasiology 14(1), 71–91.

Bucks, R. S., Singh, S., Cuerden, J. M. & Wilcock, G. K. (2000b), ‘Analysis of spontaneous, con-

versational speech in dementia of Alzheimer type: Evaluation of an objective technique for

analysing lexical performance’, Aphasiology 14(1), 71–91.

Bunton, K. & Weismer, G. (2001), ‘The relationship between perception and acoustics for a high-

low vowel contrast produced by speakers with dysarthria’, Journal of Speech, Language, and

Hearing Research .

Burg, J. P. (1967), ‘Maximum Entropy Spectral Analysis’, Proceedings of 37th Meeting, Society of

Exploration Geophysics . Oklahoma City.

Campbell, J. P., Shen, W., Campbell, W. M., Schwartz, R., Bonastre, J.-F. & Matrouf, D. (2009),

Forensic speaker recognition, in ‘IEEE Signal Processing Magazine’, Institute of Electrical and

Electronics Engineers, pp. 95–103.

Cano, S. J., Posner, H. B., Moline, M. L., Hurt, S. W., Swartz, J., Hsu, T. & Hobart, J. C. (2010),

‘The ADAS-cog in Alzheimer’s disease clinical trials: psychometric evaluation of the sum

and its parts’, J Neurol Neurosurg Psychiatry 81(12), 1363–1368.

Castiglioni, P. (2010), ‘Letter to the Editor: What is wrong in Katz’s method? Comments on:

A note on fractal dimensions of biomedical waveforms’, Computers in biology and medicine

40(11-12), 950–952.

Charniak, E. & Johnson, M. (2005), Coarse-to-fine n-best parsing and MaxEnt discriminative

reranking, in ‘Proceedings of the 43rd annual meeting on association for computational lin-

guistics’, Association for Computational Linguistics, pp. 173–180.

Clark, H. H. (1996), Using language, Cambridge University Press.

Cortes, C. & Vapnik, V. (1995), ‘Support-vector networks’, Machine learning 20(3), 273–297.

Coulston, R., Klabbers, E., Villiers, J. d. & Hosom, J.-P. (2007), Application of speech technol-

ogy in a home based assessment kiosk for early detection of Alzheimer’s disease, in ‘Eighth

Annual Conference of the International Speech Communication Association’.

Covington, M. A. & McFall, J. D. (2010), ‘Cutting the Gordian knot: The moving-average type–

token ratio (MATTR)’, Journal of quantitative linguistics 17(2), 94–100.


Croisile, B., Ska, B., Brabant, M. J., Duchêne, A., Lepage, Y., Aimard, G. & Trillet, M. (1996),

‘Comparative study of oral and written picture description in patients with Alzheimer’s dis-

ease.’, Brain Lang 53(1), 1–19.

Davis, S. & Mermelstein, P. (1980), ‘Comparison of parametric representations for monosyllabic

word recognition in continuously spoken sentences’, IEEE transactions on acoustics, speech, and

signal processing 28(4), 357–366.

de Lau, L. M. & Breteler, M. M. (2006), ‘Epidemiology of Parkinson’s disease’, The Lancet Neu-

rology 5(6), 525–535.

Dehak, N., Kenny, P. J., Dehak, R., Dumouchel, P. & Ouellet, P. (2010), ‘Front-end factor anal-

ysis for speaker verification’, IEEE Transactions on Audio, Speech, and Language Processing

19(4), 788–798.

Dell, G. S., Chang, F. & Griffin, Z. M. (1999), ‘Connectionist models of language production:

Lexical access and grammatical encoding’, Cognitive Science 23(4), 517–542.

Dijkstra, K., Bourgeois, M., Petrie, G., Burgio, L. & Allen-Burge, R. (2002), ‘My Recaller is on Va-

cation: Discourse Analysis of Nursing-Home Residents With Dementia’, Discourse Processes

33(1), 53–76.

Dronkers, N. & Ogar, J. (2004), ‘Brain areas involved in speech production’.

Duan, K.-B. & Keerthi, S. S. (2005), Which is the best multiclass SVM method? An empirical

study, in ‘International workshop on multiple classifier systems’, Springer, pp. 278–285.

Dumais, S. T. (2004), ‘Latent semantic analysis’, Annual review of information science and technol-

ogy 38(1), 188–230.

Dunning, T. (1994), Statistical identification of language, Computing Research Laboratory, New

Mexico State University Las Cruces, NM, USA.

Eyben, F., Scherer, K. R., Schuller, B. W., Sundberg, J., André, E., Busso, C., Devillers, L. Y., Epps,

J., Laukka, P., Narayanan, S. S. & Truong, K. P. (2016), ‘The Geneva Minimalistic Acoustic

Parameter Set (GeMAPS) for Voice Research and Affective Computing’, IEEE Transactions on

Affective Computing 7(2), 190–202.

Eyben, F., Wöllmer, M. & Schuller, B. (2010), Opensmile: The Munich Versatile and Fast Open-source Audio Feature Extractor, in ‘Proceedings of the 18th ACM International Conference

on Multimedia’, MM ’10, ACM, New York, NY, USA, pp. 1459–1462.


Farrús, M., Hernando, J. & Ejarque, P. (2007), Jitter and shimmer measurements for speaker

recognition, in ‘Eighth annual conference of the international speech communication associ-

ation’.

Fellbaum, C. (2010), WordNet, in ‘Theory and applications of ontology: computer applications’,

Springer, pp. 231–243.

Feng, S., Banerjee, R. & Choi, Y. (2012), Characterizing stylistic elements in syntactic structure,

in ‘Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language

Processing and Computational Natural Language Learning’, Association for Computational

Linguistics, pp. 1522–1533.

Fisher, R. A. (1919), ‘XV.—The correlation between relatives on the supposition of Mendelian

inheritance.’, Earth and Environmental Science Transactions of the Royal Society of Edinburgh

52(2), 399–433.

Folstein, M. F., Folstein, S. E. & McHugh, P. R. (1975), ‘”Mini-mental state”. A practical method

for grading the cognitive state of patients for the clinician’, J Psychiatr Res 12(3), 189–198.

Forbes, K. E., Venneri, A. & Shanks, M. F. (2002), ‘Distinct patterns of spontaneous speech

deterioration: an early predictor of Alzheimer’s disease.’, Brain Cogn 48(2-3), 356–361.

Fraser, K. C. & Hirst, G. (2016), Detecting semantic changes in Alzheimer’s disease with vector

space models, in ‘Proceedings of LREC 2016 Workshop. Resources and processing of linguistic and extra-linguistic data from people with various forms of cognitive/psychiatric impairments (RaPID-2016), Monday 23rd of May 2016’, number 128, Linköping University Electronic Press, Linköpings universitet, pp. 1–8.

Fraser, K. C., Meltzer, J. A. & Rudzicz, F. (2016), ‘Linguistic Features Identify Alzheimer’s Dis-

ease in Narrative Speech.’, J Alzheimers Dis 49(2), 407–422.

Garrett, M. F. (1975), ‘Syntactic process in sentence production’, Psychology of learning and moti-

vation: Advances in research and theory. (9), 133–177.

Gauthier, S., Reisberg, B., Zaudig, M., Petersen, R. C., Ritchie, K., Broich, K., Belleville, S.,

Brodaty, H., Bennett, D., Chertkow, H., Cummings, J. L., de Leon, M., Feldman, H., Ganguli,

M., Hampel, H., Scheltens, P., Tierney, M. C., Whitehouse, P. & Winblad, B. (2006), ‘Mild

cognitive impairment’, The Lancet 367(9518), 1262 – 1270.

Geisser, S. (1975), ‘The predictive sample reuse method with applications’, Journal of the Ameri-

can statistical Association 70(350), 320–328.


Glosser, G. & Deser, T. (1991), ‘Patterns of discourse production among neurological patients

with fluent language disorders’, Brain and language 40(1), 67–88.

Goberman, A. M. & Coelho, C. (2002), ‘Acoustic analysis of Parkinsonian speech I: speech

characteristics and L-Dopa therapy.’, NeuroRehabilitation 17(3), 237–246.

Goodglass, H., Kaplan, E. & Barresi, B. (2001), The Boston Diagnostic Aphasia Examination, Balti-

more: Lippincott, Williams & Wilkins.

Google (2019a), ‘Google cloud platform’, https://cloud.google.com/. [Accessed on 15-January-2019].

Google (2019b), ‘Google cloud speech-to-text api’, https://cloud.google.com/speech-to-text/. [Accessed on 15-January-2019].

Gorno-Tempini, M. L., Dronkers, N. F., Rankin, K. P., Ogar, J. M., Phengrasamy, L., Rosen, H. J.,

Johnson, J. K., Weiner, M. W. & Miller, B. L. (2004), ‘Cognition and anatomy in three variants

of primary progressive aphasia.’, Ann Neurol 55(3), 335–346.

Gorno-Tempini, M. L., Hillis, A. E., Weintraub, S., Kertesz, A., Mendez, M., Cappa, S. F., Ogar,

J. M., Rohrer, J. D., Black, S., Boeve, B. F., Manes, F., Dronkers, N. F., Vandenberghe, R., Ras-

covsky, K., Patterson, K., Miller, B. L., Knopman, D. S., Hodges, J. R., Mesulam, M. M. &

Grossman, M. (2011), ‘Classification of primary progressive aphasia and its variants.’, Neu-

rology 76(11), 1006–1014.

Graesser, A. C., McNamara, D. S., Louwerse, M. M. & Cai, Z. (2004), ‘Coh-Metrix: Analysis of

text on cohesion and language’, Behavior research methods, instruments, & computers 36(2), 193–

202.

Guyon, I. & Elisseeff, A. (2003), ‘An introduction to variable and feature selection’, Journal of

machine learning research 3(Mar), 1157–1182.

Hakkani-Tur, D., Vergyri, D. & Tur, G. (2010), Speech-based automated cognitive status assess-

ment, in ‘Eleventh Annual Conference of the International Speech Communication Associa-

tion’.

Hall, M. A. (1999), ‘Correlation-based feature selection for machine learning’.

Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P. & Witten, I. H. (2009), ‘The WEKA

Data Mining Software: An Update’, SIGKDD Explor. Newsl. 11(1), 10–18.


Hartelius, L., Carlstedt, A., Ytterberg, M., Lillvik, M. & Laakso, K. (2003), ‘Speech disorders in

mild and moderate Huntington’s disease: results of dysarthria assessment of 19 individuals’,

Journal of Medical Speech-Language Pathology 1, 1–14.

Henry, M. & Gorno-Tempini, M. (2010), ‘The logopenic variant of primary progressive aphasia’,

Current opinion in neurology 23(6), 633–637.

Hermansky, H. (1990), ‘Perceptual linear predictive (PLP) analysis of speech’, the Journal of the

Acoustical Society of America 87(4), 1738–1752.

Hernández-Domínguez, L., Ratté, S., Sierra-Martínez, G. & Roche-Bergua, A. (2018),

‘Computer-based evaluation of Alzheimer’s disease and mild cognitive impairment patients

during a picture description task’, Alzheimer’s & Dementia: Diagnosis, Assessment & Disease

Monitoring 10, 260–268.

Hier, D. B., Hagenlocker, K. & Shindler, A. G. (1985), ‘Language disintegration in dementia:

Effects of etiology and severity’, Brain and language 25(1), 117–133.

Hirsimäki, T., Pylkkönen, J. & Kurimo, M. (2009), ‘Importance of high-order n-gram models in

morph-based speech recognition’, IEEE Transactions on Audio, Speech, and Language Processing

17(4), 724–732.

Ho, T. K. (1995), Random decision forests, in ‘Proceedings of 3rd international conference on

document analysis and recognition’, Vol. 1, IEEE, pp. 278–282.

Honoré, A. (1978), Some Simple Measures of Richness of Vocabulary, number 7, Association of

Literary and Linguistic Computing Bulletin.

Huang, X., Acero, A., Hon, H.-W. & Reddy, R. (2001), Spoken language processing: A guide to

theory, algorithm, and system development, Vol. 1, Prentice hall PTR Upper Saddle River.

Hutchinson, J. M. & Jensen, M. (1980), A pragmatic evaluation of discourse communication

in normal and senile elderly in a nursing home., in ‘In L. K. Obler & M. L. Albert (Eds.)

Language and communication in the Elderly. Lexington, MA: Lexington Books.’, pp. 59–73.

i Cancho, R. F., Solé, R. V. & Köhler, R. (2004), ‘Patterns in syntactic dependency networks’,

Physical Review E 69(5), 051915.

Itakura, F. & Saito, S. (1968), Analysis synthesis telephony based upon the maximum likelihood

method, in ‘Proc. 6th Int. Congress on Acoustics’, Tokyo, Japan.


Ivakhnenko, A. G. & Lapa, V. G. (1966), Cybernetic predicting devices, Technical report, Purdue

University School of Electrical Engineering.

Ivakhnenko, A. G. & Lapa, V. G. (1967), Cybernetics and Forecasting Techniques Modern Analytic

and Computational Method in Science and Mathematics, New York: American Elsevier Publish-

ing Company, Inc.

Jarrold, W., Peintner, B., Wilkins, D., Vergryi, D., Richey, C., Gorno-Tempini, M. L. & Ogar, J.

(2014a), Aided diagnosis of dementia type through computer-based analysis of spontaneous

speech, in ‘Proceedings of the ACL Workshop on Computational Linguistics and Clinical

Psychology’, pp. 27–36.

Jarrold, W., Peintner, B., Wilkins, D., Vergryi, D., Richey, C., Gorno-Tempini, M. L. & Ogar, J.

(2014b), Aided diagnosis of dementia type through computer-based analysis of spontaneous

speech, in ‘Proceedings of the ACL Workshop on Computational Linguistics and Clinical

Psychology’, pp. 27–36.

Jay, T. B. (2002), The Psychology of Language, New Jersey: Pearson Education.

Jelinek, F. & Mercer, R. L. (1980), Interpolated estimation of Markov source parameters from

sparse data, in E. S. Gelsema & L. N. Kanal, eds, ‘Proceedings, Workshop on Pattern Recog-

nition in Practice’, North Holland, Amsterdam, pp. 381–397.

Johansson, V. (2009), ‘Lexical diversity and lexical density in speech and writing: A develop-

mental perspective’, Working Papers in Linguistics 53, 61–79.

Johnson, J. K., Diehl, J., Mendez, M. F., Neuhaus, J., Shapira, J. S., Forman, M., Chute, D. J.,

Roberson, E. D., Pace-Savitsky, C., Neumann, M., Chow, T. W., Rosen, H. J., Förstl, H., Kurz,

A. & Miller, B. L. (2005), ‘Frontotemporal lobar degeneration: demographic characteristics of

353 patients.’, Arch Neurol 62(6), 925–930.

Jurafsky, D. & Martin, J. H. (2014), Speech and language processing, Vol. 3, Pearson London.

Katz, S. (1987), ‘Estimation of probabilities from sparse data for the language model component

of a speech recognizer’, IEEE transactions on acoustics, speech, and signal processing 35(3), 400–

401.

Kavé, G. & Goral, M. (2016), ‘Word retrieval in picture descriptions produced by individuals

with Alzheimer’s disease’, Journal of clinical and experimental neuropsychology 38(9), 958–966.


Kemper, S., LaBarge, E., Ferraro, F. R., Cheung, H., Cheung, H. & Storandt, M. (1993), ‘On the

preservation of syntax in Alzheimer’s disease: Evidence from written sentences’, Archives of

neurology 50(1), 81–86.

Kempler, D. (1984), Syntactic and symbolic abilities in Alzheimer’s disease, PhD thesis, UCLA.

Kempler, D. (1995), ‘Language changes in dementia of the Alzheimer type’, Dementia and com-

munication pp. 98–114.

Kempler, D., Curtiss, S. & Jackson, C. (1987), ‘Syntactic preservation in Alzheimer’s disease’,

Journal of Speech, Language, and Hearing Research 30(3), 343–350.

Kent, R. D. & Kim, Y.-J. (2003), ‘Toward an acoustic typology of motor speech disorders’, Clinical

linguistics & phonetics 17(6), 427–445.

Kent, R. D., Sufit, R. L., Rosenbek, J. C., Kent, J. F., Weismer, G., Martin, R. E. & Brooks, B. R.

(1991), ‘Speech deterioration in amyotrophic lateral sclerosis: a case study.’, J Speech Hear Res

34(6), 1269–1275.

Kertesz, A. (1982), The Western aphasia battery, Grune and Stratton, New York, NY.

Kettunen, K. (2014), ‘Can type-token ratio be used to show morphological complexity of lan-

guages?’, Journal of Quantitative Linguistics 21(3), 223–245.

Kiernan, M. C., Vucic, S., Cheah, B. C., Turner, M. R., Eisen, A., Hardiman, O., Burrell, J. R. &

Zoing, M. C. (2011), ‘Amyotrophic Lateral Sclerosis.’, Lancet 377(9769), 942–955.

Kim, M. & Thompson, C. K. (2004), ‘Verb deficits in Alzheimer’s disease and agrammatism:

Implications for lexical organization’, Brain and language 88(1), 1–20.

Kintsch, W. (1994), ‘Text comprehension, memory, and learning.’, American Psychologist

49(4), 294.

Kintsch, W. & Van Dijk, T. A. (1978), ‘Toward a model of text comprehension and production.’,

Psychological review 85(5), 363.

Klein, D. & Manning, C. D. (2003a), Accurate Unlexicalized Parsing, in ‘Proceedings of the

41st Annual Meeting on Association for Computational Linguistics - Volume 1’, ACL ’03,

Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 423–430.

Klein, D. & Manning, C. D. (2003b), Accurate unlexicalized parsing, in ‘Proceedings of the 41st

Annual Meeting on Association for Computational Linguistics-Volume 1’, Association for

Computational Linguistics, pp. 423–430.


Kockmann, M., Burget, L. et al. (2011), ‘Application of speaker- and language identification

state-of-the-art techniques for emotion recognition’, Speech Communication 53(9-10), 1172–

1185.

Kovacs, G. G. (2014), Neuropathology of Neurodegenerative Diseases: A Practical Guide, Cambridge

University Press.

Kreiman, J. & Gerratt, B. R. (2003), Jitter, shimmer, and noise in pathological voice quality

perception, in ‘ISCA Tutorial and Research Workshop on Voice Quality: Functions, Analysis

and Synthesis’.

Kruskal, W. H. & Wallis, W. A. (1952), ‘Use of Ranks in One-Criterion Variance Analysis’, Journal

of the American Statistical Association 47(260), 583–621.

Kuhl, P. K., Andruski, J. E., Chistovich, I. A., Chistovich, L. A., Kozhevnikova, E. V., Ryskina,

V. L., Stolyarova, E. I., Sundberg, U. & Lacerda, F. (1997), ‘Cross-language analysis of phonetic

units in language addressed to infants’, Science 277(5326), 684–686.

LDC (2019), ‘English Gigaword Fifth Edition’, https://catalog.ldc.upenn.edu/LDC2011T07.

[Accessed 22-February-2019].

Lehr, M., Shafran, I. & Roark, B. (2012), Fully automated neuropsychological assessment for

detecting mild cognitive impairment, in ‘Interspeech’.

Levelt, W. (1989), Speaking, MIT Press, Cambridge, Massachusetts.

Liang, P., Taskar, B. & Klein, D. (2006), Alignment by Agreement, in ‘Proceedings of the Main

Conference on Human Language Technology Conference of the North American Chapter of

the Association of Computational Linguistics’, HLT-NAACL ’06, Association for Computa-

tional Linguistics, Stroudsburg, PA, USA, pp. 104–111.

Liu, M., Dai, B., Xie, Y. & Yao, Z. (2006), Improved GMM-UBM/SVM for speaker verification,

in ‘2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceed-

ings’, Vol. 1, IEEE, pp. I–I.

Lopez-de Ipina, K., Martinez-de Lizarduy, U., Barroso, N., Ecay-Torres, M., Martinez-Lage, P., Torres, F. & Faundez-Zanuy, M. (2015), Automatic analysis of Categorical Verbal Fluency for Mild Cognitive impairment detection: A non-linear language independent approach, in ‘Bioinspired Intelligence (IWOBI), 2015 4th International Work Conference on’, IEEE, pp. 101–104.


Mackenzie, C., Brady, M., Norrie, J. & Poedjianto, N. (2007), ‘Picture description in neurologi-

cally normal adults: Concepts and topic coherence’, Aphasiology 21(3-4), 340–354.

MacWhinney, B. (2000), The CHILDES Project: Tools for Analyzing Talk, 3rd edn, Lawrence Erlbaum Associates, Mahwah, New Jersey.

MacWhinney, B., Bird, S., Cieri, C. & Martell, C. (2004), ‘TalkBank: Building an Open Unified

Multimodal Database of Communicative Interaction’, 4th International Conference on Language

Resources and Evaluation pp. 525–528.

MacWhinney, B., Fromm, D., Forbes, M. & Holland, A. (2011), ‘AphasiaBank: Methods for

Studying Discourse’, Aphasiology 25(11), 1286–1307.

Makhoul, J. (1973), ‘Spectral analysis of speech by linear prediction’, IEEE Transactions on Audio

and Electroacoustics 21(3), 140–148.

Mamede, N. J., Baptista, J., Diniz, C. & Cabarrao, V. (2012), STRING: An Hybrid Statistical and

Rule-Based Natural Language Processing Chain for Portuguese., in ‘International Confer-

ence on Computational Processing of Portuguese, Propor’.

Manning, C. D., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S. J. & McClosky, D. (2014), The

Stanford CoreNLP Natural Language Processing Toolkit, in ‘Association for Computational

Linguistics (ACL) System Demonstrations’, pp. 55–60.

Manning, C., Raghavan, P. & Schutze, H. (2010), ‘Introduction to information retrieval’, Natural

Language Engineering 16(1), 100–103.

Marini, A., Andreetta, S., Del Tin, S. & Carlomagno, S. (2011), ‘A multi-level approach to the

analysis of narrative language in aphasia’, Aphasiology 25(11), 1372–1392.

Markel, J. & Gray, A. (1973), ‘On autocorrelation equations as applied to speech analysis’, IEEE

Transactions on Audio and Electroacoustics 21(2), 69–79.

Marrafa, P., Amaro, R., Mendes, S., Lourosa, S. & Chaves, R. P. (2006), ‘TemaNet - wordnets temáticas do português’, http://www.instituto-camoes.pt/temanet.

McKeith, I. G., Boeve, B. F., Dickson, D. W., Halliday, G., Taylor, J.-P., Weintraub, D., Aarsland,

D., Galvin, J., Attems, J., Ballard, C. G., Bayston, A., Beach, T. G., Blanc, F., Bohnen, N., Bo-

nanni, L., Bras, J., Brundin, P., Burn, D., Chen-Plotkin, A., Duda, J. E., El-Agnaf, O., Feldman,

H., Ferman, T. J., ffytche, D., Fujishiro, H., Galasko, D., Goldman, J. G., Gomperts, S. N.,

Graff-Radford, N. R., Honig, L. S., Iranzo, A., Kantarci, K., Kaufer, D., Kukull, W., Lee, V. M.,


Leverenz, J. B., Lewis, S., Lippa, C., Lunde, A., Masellis, M., Masliah, E., McLean, P., Mollen-

hauer, B., Montine, T. J., Moreno, E., Mori, E., Murray, M., O’Brien, J. T., Orimo, S., Postuma,

R. B., Ramaswamy, S., Ross, O. A., Salmon, D. P., Singleton, A., Taylor, A., Thomas, A., Tira-

boschi, P., Toledo, J. B., Trojanowski, J. Q., Tsuang, D., Walker, Z., Yamada, M. & Kosaka, K.

(2017), ‘Diagnosis and management of dementia with Lewy bodies: Fourth consensus report

of the DLB Consortium’, Neurology .

McKeith, I. G., Galasko, D., Kosaka, K., Perry, E. K., Dickson, D. W., Hansen, L. A., Salmon,

D. P., Lowe, J., Mirra, S. S., Byrne, E. J., Lennox, G., Quinn, N. P., Edwardson, J. A., Ince, P. G.,

Bergeron, C., Burns, A., Miller, B. L., Lovestone, S., Collerton, D., Jansen, E. N., Ballard, C.,

de Vos, R. A., Wilcock, G. K., Jellinger, K. A. & Perry, R. H. (1996), ‘Consensus guidelines

for the clinical and pathologic diagnosis of dementia with Lewy bodies (DLB): report of the

consortium on DLB international workshop.’, Neurology 47(5), 1113–1124.

McKhann, G. M., Knopman, D. S., Chertkow, H., Hyman, B. T., Clifford R. Jack, J., Kawas,

C. H., Klunk, W. E., Koroshetz, W. J., Manly, J. J., Mayeux, R., Mohs, R. C., Morris, J. C.,

Rossor, M. N., Scheltens, P., Carrillo, M. C., Thies, B., Weintraub, S. & Phelps, C. H. (2011),

‘The diagnosis of dementia due to Alzheimer’s disease: Recommendations from the Na-

tional Institute on Aging-Alzheimer’s Association workgroups on diagnostic guidelines for

Alzheimer’s disease’, Alzheimer’s Dementia: The Journal of the Alzheimer’s Association 7(3), 263–

269.

MDS (2003), ‘Movement Disorder Society Task Force on Rating Scales for Parkinson’s Disease.

The Unified Parkinson’s Disease Rating Scale (UPDRS): Status and recommendations’.

Meignier, S. & Merlin, T. (2010), LIUM SpkDiarization: an open source toolkit for diarization,

in ‘CMU SPUD Workshop’.

Meinedo, H., Abad, A., Pellegrini, T., Trancoso, I. & Neto, J. (2010), The L2F Broadcast News

Speech Recognition System, in ‘Proc. Fala2010’.

Meinedo, H., Caseiro, D., Neto, J. & Trancoso, I. (2003), AUDIMUS.Media: a Broadcast News

speech recognition system for the European Portuguese language, in ‘Proc. International

Conference on Computational Processing of Portuguese Language (PROPOR)’.

Mentis, M. & Prutting, C. A. (1991), ‘Analysis of Topic as Illustrated in a Head-Injured and a

Normal Adult’, Journal of Speech, Language, and Hearing Research 34(3), 583–595.

Mermelstein, P. (1976), ‘Distance measures for speech recognition, psychological and instru-

mental’, Pattern recognition and artificial intelligence 116, 374–388.


Merriam-Webster (2019), ‘Merriam-Webster Online dictionary and thesaurus’, https://www.

merriam-webster.com. [Accessed 15-January-2019].

Mikolov, T., Chen, K., Corrado, G. & Dean, J. (2013), ‘Efficient estimation of word representa-

tions in vector space’, arXiv preprint arXiv:1301.3781 .

Mikolov, T., Grave, E., Bojanowski, P., Puhrsch, C. & Joulin, A. (2018), Advances in Pre-Training

Distributed Word Representations, in ‘Proceedings of the International Conference on Lan-

guage Resources and Evaluation (LREC 2018)’.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S. & Dean, J. (2013), Distributed representa-

tions of words and phrases and their compositionality, in ‘Advances in neural information

processing systems’, pp. 3111–3119.

Miller, D. I., Talbot, V., Gagnon, M. & Messier, C. (2013), ‘Administration of neuropsychological

tests using interactive voice response technology in the elderly: validation and limitations’,

Frontiers in neurology 4.

Miller, G. A. (1995), ‘WordNet: a lexical database for English’, Communications of the ACM

38(11), 39–41.

Miranda, A. F. H. (2015), Influência da Escolaridade na Dimensão Macrolinguística do Discurso, Master's thesis, Universidade Católica Portuguesa.

Mitchell, T. M. (1997), Machine Learning, McGraw-Hill Education.

Moniz, H., Batista, F., Mata, A. I. & Trancoso, I. (accepted), Towards automatic language processing and intonational labeling in European Portuguese, in N. Henriksen, M. Armstrong

& M. Vanrell, eds, ‘Interdisciplinary approaches to intonational grammar in Ibero-Romance’,

John Benjamins.

Moniz, H., Batista, F., Mata, A. I. & Trancoso, I. (2014), ‘Speaking style effects in the production

of disfluencies’, Speech Communication 65, 20–35.

Moniz, H., Mata, A. I., Hirschberg, J., Batista, F., Rosenberg, A. & Trancoso, I. (2014), ‘Extending

AuToBI to prominence detection in European Portuguese’, Speech Prosody 2014 .

Moniz, H., Mata, A. I. & Viana, M. C. (2007), On filled pauses and prolongations in European

Portuguese, in ‘Interspeech 2007’, Belgium.

Moniz, H., Pompili, A., Batista, F., Trancoso, I., Abad, A. & Amorim, C. (2015), Automatic

Recognition of Prosodic Patterns in Semantic Verbal Fluency Tests - An Animal Naming Task


for Edutainment Applications, in ‘18TH International Congress of Phonetic Sciences’, Inter-

national Phonetic Association.

Morgan, N. & Bourlard, H. (1995), ‘An introduction to hybrid HMM/connectionist continuous

speech recognition’, IEEE Signal Processing Magazine 12(3), 25–42.

Nasreddine, Z. S., Phillips, N. A., Bedirian, V., Charbonneau, S., Whitehead, V., Collin, I.,

Cummings, J. L. & Chertkow, H. (2005), ‘The Montreal Cognitive Assessment, MoCA: A

Brief Screening Tool For Mild Cognitive Impairment’, Journal of the American Geriatrics Soci-

ety 53(4), 695–699.

Negnevitsky, M. (2005), Artificial intelligence: a guide to intelligent systems, Pearson education.

Nicholas, M., Obler, L. K., Albert, M. L. & Helm-Estabrooks, N. (1985), ‘Empty speech in

Alzheimer’s disease and fluent aphasia’, Journal of Speech, Language, and Hearing Research

28(3), 405–410.

Nuance (2019), ‘Nuance Open Speech Recognizer’, https://www.nuance.com/

omni-channel-customer-engagement/voice-and-ivr/automatic-speech-recognition/

nuance-recognizer.html. Burlington, Massachusetts, EUA, [Accessed 12-March-2019].

Nunes, B. (2005), A Demência em Números, in A. Castro-Caldas & A. Mendonça, eds, ‘A Doença de Alzheimer e Outras Demências em Portugal’, LIDEL.

Obler, L. K. & Albert, M. L. (1984), Language in the elderly aphasic and dementing patient, in M. T. Sarno, ed., ‘Acquired Aphasia’, Academic Press, New York, pp. 385–398.

Olesen, J., Gustavsson, A., Svensson, M., Wittchen, H.-U., Jonsson, B., the CDBE2010 Study Group & the European Brain Council (2012), ‘The economic cost of brain disorders in Europe’, European journal of neurology 19(1), 155–162.

Oppenheim, G. (1994), ‘The earliest signs of Alzheimer’s disease.’, J Geriatr Psychiatry Neurol

7(2), 116–120.

Oracle (2019), ‘Java Speech API Specifications’, https://www.oracle.com/technetwork/java/

speech-138007.html. [Accessed 12-March-2019].

Orimaye, S. O., Wong, J. S.-M. & Golden, K. J. (2014), Learning predictive linguistic features for

Alzheimer’s disease and related dementias using verbal utterances, in ‘Proceedings of the 1st

Workshop on Computational Linguistics and Clinical Psychology (CLPsych)’, pp. 78–87.


Orozco-Arroyave, J. R., Arias-Londono, J. D., Vargas-Bonilla, J. & Noth, E. (2013), Perceptual

analysis of speech signals from people with Parkinson’s disease, in ‘International Work-

Conference on the Interplay Between Natural and Artificial Computation’, Springer, pp. 201–

211.

Orozco-Arroyave, J. R., Belalcazar-Bolanos, E. A., Arias-Londono, J. D., Vargas-Bonilla, J. F.,

Haderlein, T. & Noth, E. (2014), Phonation and articulation analysis of Spanish vowels for

automatic detection of Parkinson’s disease, in ‘Text, Speech and Dialogue: 17th International

Conference. Proceedings’, Springer International Publishing, Cham, pp. 374–381.

Orozco-Arroyave, J. R., Honig, F., Arias-Londono, J. D., Vargas-Bonilla, J. F., Daqrouq, K.,

Skodda, S., Rusz, J. & Noth, E. (2016), ‘Automatic detection of Parkinson’s disease in run-

ning speech spoken in three different languages.’, J Acoust Soc Am 139(1), 481–500.

Ortmanns, S. & Ney, H. (2000), ‘The time-conditioned approach in dynamic programming

search for LVCSR’, IEEE Transactions on Speech and Audio Processing 8(6), 676–687.

Pakhomov, S. V. S., Hemmy, L. S. & Lim, K. O. (2012), ‘Automated semantic indices related to

cognitive function and rate of cognitive decline.’, Neuropsychologia 50(9), 2165–2175.

Pakhomov, S. V. S., Marino, S. E., Banks, S. & Bernick, C. (2015), ‘Using Automatic Speech

Recognition to Assess Spoken Responses to Cognitive Tests of Semantic Verbal Fluency.’,

Speech Commun 75, 14–26.

Pakhomov, S. V., Smith, G. E., Chacon, D., Feliciano, Y., Graff-Radford, N., Caselli, R. & Knop-

man, D. S. (2010), ‘Computerized analysis of speech and language to identify psycholin-

guistic correlates of frontotemporal lobar degeneration’, Cognitive and Behavioral Neurology

23(3), 165.

Pasquier, F. & Petit, H. (1997), ‘Frontotemporal dementia: its rediscovery.’, Eur Neurol 38(1), 1–6.

Patwardhan, S. & Pedersen, T. (2006), Using WordNet-based context vectors to estimate the

semantic relatedness of concepts, in ‘Proceedings of the eacl 2006 workshop making sense of

sense-bringing computational linguistics and psycholinguistics together’, Vol. 1501, Trento,

pp. 1–8.

Paulsen, J. S. (2011), ‘Cognitive impairment in Huntington disease: diagnosis and treatment.’,

Curr Neurol Neurosci Rep 11(5), 474–483.

Pearson, K. (1895), ‘Notes on regression and inheritance in the case of two parents’, Proceedings

of the Royal Society of London 58, 240–242.


Pearson, K. (1992), On the criterion that a given system of deviations from the probable in the case of a

correlated system of variables is such that it can be reasonably supposed to have arisen from random

sampling, Springer New York, New York, NY, pp. 11–28.

Pinto, S., Cardoso, R., Sadat, J., Guimaraes, I., Mercier, C., Santos, H., Atkinson-Clement, C.,

Carvalho, J., Welby, P., Oliveira, P., D’Imperio, M., Frota, S., Letanneux, A., Vigario, M., Cruz,

M., Martins, I. P., Viallet, F. & Ferreira, J. J. (2016), ‘Dysarthria in individuals with Parkin-

son’s disease: a protocol for a binational, cross-sectional, case-controlled study in French and

European Portuguese (FraLusoPark)’, BMJ Open 6(11).

Pompili, A., Abad, A., Martins de Matos, D. & Pavao Martins, I. (2018), Topic coherence analy-

sis for the classification of Alzheimer’s disease, in ‘Proc. IberSPEECH 2018’, pp. 281–285.

Pompili, A., Abad, A., Martins de Matos, D. & Pavao Martins, I. (2019), Evaluating pragmatic

aspects of discourse production for the automatic identification of Alzheimer’s disease. Sub-

mitted to a special issue of the IEEE Journal of Selected Topics in Signal Processing (JSTSP) on

Automatic assessment of health disorders based on voice, speech and language processing.

Pompili, A., Abad, A., Romano, P., Martins, I. P., Cardoso, R., Santos, H., Carvalho, J.,

Guimaraes, I. & Ferreira, J. J. (2017), Automatic Detection of Parkinson’s Disease: An Exper-

imental Analysis of Common Speech Production Tasks Used for Diagnosis, in ‘International

Conference on Text, Speech, and Dialogue’, Springer, pp. 411–419.

Pompili, A., Abad, A., Trancoso, I., Fonseca, J., Martins, I. P., Leal, G. & Farrajota, L. (2011),

An on-line system for remote treatment of aphasia, in ‘Proceedings of the Second Workshop

on Speech and Language Processing for Assistive Technologies’, SLPAT ’11, Association for

Computational Linguistics, pp. 1–10.

Pompili, A., Amorim, C., Abad, A. & Trancoso, I. (2015), Speech and language technologies for

the automatic monitoring and training of cognitive functions, in ‘Proceedings of SLPAT 2015:

6th Workshop on Speech and Language Processing for Assistive Technologies’, Association

for Computational Linguistics, Dresden, Germany, pp. 103–109.

Post, M. & Bergsma, S. (2013), Explicit and implicit syntactic features for text classification, in

‘Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics

(Volume 2: Short Papers)’, Vol. 2, pp. 866–872.

Pringsheim, T., Wiltshire, K., Day, L., Dykeman, J., Steeves, T. & Jette, N. (2012), ‘The incidence

and prevalence of Huntington’s disease: a systematic review and meta-analysis.’, Mov Disord

27(9), 1083–1091.


Proenca, J., Veiga, A., Candeias, S. & Perdigao, F. (2013), Acoustic, Phonetic and Prosodic Fea-

tures of Parkinson’s disease Speech, in ‘STIL-IX Brazilian Symposium in Information and

Human Language Technology, 2nd Brazilian Conference on Intelligent Systems, Brazil’.

Pudil, P., Novovicova, J. & Kittler, J. (1994), ‘Floating search methods in feature selection’, Pat-

tern recognition letters 15(11), 1119–1125.

Quinlan, J. R. (1986), ‘Induction of decision trees’, Machine learning 1(1), 81–106.

Quinlan, J. R. (2014), C4.5: Programs for Machine Learning, Elsevier.

Rabiner, L. R. (1989), ‘A tutorial on hidden Markov models and selected applications in speech

recognition’, Proceedings of the IEEE 77(2), 257–286.

Rabiner, L. R., Juang, B.-H. & Rutledge, J. C. (1993), Fundamentals of speech recognition, Vol. 14,

PTR Prentice Hall Englewood Cliffs.

Ramig, L. O., Fox, C. & Sapir, S. (2008), ‘Speech treatment for Parkinson’s disease’, Expert Review

of Neurotherapeutics 8(2), 297–309.

Ratnavalli, E., Brayne, C., Dawson, K. & Hodges, J. R. (2002), ‘The prevalence of frontotemporal

dementia.’, Neurology 58(11), 1615–1621.

Reilly, J., Troche, J. & Grossman, M. (2011), ‘Language processing in dementia’, The handbook of

Alzheimer’s disease and other dementias pp. 336–368.

Reilmann, R., Leavitt, B. R. & Ross, C. A. (2014), ‘Diagnostic criteria for Huntington’s disease

based on natural history.’, Mov Disord 29(11), 1335–1341.

Reynolds, D. (2009a), ‘Universal background models’, Encyclopedia of biometrics pp. 1349–1352.

Reynolds, D. A. (2009b), Gaussian Mixture Models, in ‘Encyclopedia of Biometrics’.

Reynolds, D. A., Quatieri, T. F. & Dunn, R. B. (2000), ‘Speaker verification using adapted Gaus-

sian mixture models’, Digital signal processing 10(1-3), 19–41.

Ripich, D. N. & Terrell, B. Y. (1988), ‘Patterns of discourse cohesion and coherence in

Alzheimer’s disease’, Journal of Speech and Hearing Disorders 53(1), 8–15.

Robert, P., Ferris, S., Gauthier, S., Ihl, R., Winblad, B. & Tennigkeit, F. (2010), ‘Review of

Alzheimer’s disease scales: is there a need for a new multi-domain scale for therapy evalua-

tion in medical practice?’, Alzheimer’s Res Ther 2(4), 24.


Roberts, R. & Knopman, D. S. (2013), ‘Classification and Epidemiology of MCI’, Clinics in Geri-

atric Medicine 29(4).

Rosen, W. G., Mohs, R. C. & Davis, K. L. (1984), ‘A new rating scale for Alzheimer’s disease.’,

Am J Psychiatry 141(11), 1356–1364.

Rosenberg, A. (2009), Automatic Detection and Classification of Prosodic Events, PhD thesis, Columbia University.

Rosenberg, A. (2010), AuToBI – A Tool for Automatic ToBI annotation, in ‘Interspeech 2010’.

Rosenblatt, F. (1958), ‘The perceptron: a probabilistic model for information storage and orga-

nization in the brain.’, Psychological review 65(6), 386.

Rusz, J., Cmejla, R., Ruzickova, H. & Ruzicka, E. (2011), ‘Quantitative acoustic measurements

for characterization of speech and voice disorders in early untreated Parkinson’s disease’,

The Journal of the Acoustical Society of America 129(1), 350–367.

Rusz, J., Klempir, J., Tykalova, T., Baborova, E., Cmejla, R., Ruzicka, E. & Roth, J. (2014), ‘Char-

acteristics and occurrence of speech impairment in Huntington’s disease: possible influence

of antipsychotic medication.’, J Neural Transm (Vienna) 121(12), 1529–1539.

Ryan, J. & Lopez, S. (2001), Wechsler Adult Intelligence Scale-III, in W. I. Dorfman & M. Hersen, eds, ‘Understanding Psychological Assessment. Perspectives on Individual Differences’, Springer, Boston, MA.

Saldert, C., Fors, A., Stroberg, S. & Hartelius, L. (2010), ‘Comprehension of complex discourse

in different stages of Huntington’s disease.’, Int J Lang Commun Disord 45(6), 656–669.

Salmon, D. P., Butters, N. & Chan, A. S. (1999), ‘The deterioration of semantic memory in

Alzheimer’s disease.’, Canadian Journal of Experimental Psychology/Revue canadienne de psy-

chologie experimentale 53(1), 108.

Santos, L., Correa Junior, E. A., Oliveira Jr, O., Amancio, D., Mansur, L. & Aluısio, S. (2017),

Enriching Complex Networks with Word Embeddings for Detecting Mild Cognitive Impair-

ment from Speech Transcripts, in ‘Proceedings of the 55th Annual Meeting of the Association

for Computational Linguistics (Volume 1: Long Papers)’, Association for Computational Lin-

guistics, pp. 1284–1296.

Santosa, F. & Symes, W. W. (1986), ‘Linear inversion of band-limited reflection seismograms’,

SIAM Journal on Scientific and Statistical Computing 7(4), 1307–1330.


Saon, G., Kurata, G., Sercu, T., Audhkhasi, K., Thomas, S., Dimitriadis, D., Cui, X., Ramabhad-

ran, B., Picheny, M., Lim, L.-L., Roomi, B. & Hall, P. (2017), English Conversational Telephone

Speech Recognition by Humans and Machines, in ‘Proc. Interspeech’, pp. 132–136.

Sapir, S. (2006), ‘Effects of LSVT on speech articulation in dysarthric individuals with Parkin-

son’s disease: Acoustic and perceptual correlates.’, A paper presented at the Congress of the

European Federation of Neurological Societies, Istanbul, Turkey .

Sapir, S., Spielman, J. L., Ramig, L. O., Story, B. H. & Fox, C. (2007), ‘Effects of intensive voice

treatment (the Lee Silverman Voice Treatment [LSVT]) on vowel articulation in dysarthric

individuals with idiopathic Parkinson disease: Acoustic and perceptual findings’, Journal of

Speech, Language, and Hearing Research 50, 899–912.

Savino, M. (2004), Intonational cues to discourse structure in a variety of Italian, in ‘Regional

Variation in Intonation’, P. Gilles & J. Peters (eds.), Niemeyer: Tuebingen, pp. 145–159.

Savino, M., Bosco, A. & Grice, M. (2014), Intonational cues to item position in lists: Evidence

from a serial recall task, in ‘Proceedings of the International Conference on Speech Prosody’,

pp. 708–712.

Schlesinger, M. I. & Hlavac, V. (2002), Ten Lectures on Statistical and Structural Pattern Recognition, Vol. 24 of Computational Imaging and Vision, Kluwer Academic Publishers, Dordrecht.

Scholkopf, B. & Smola, A. J. (2001), Learning with kernels: support vector machines, regularization,

optimization, and beyond, MIT press.

Shao, J. (1993), ‘Linear model selection by cross-validation’, Journal of the American statistical

Association 88(422), 486–494.

Sheinerman, K. S. & Umansky, S. R. (2013), ‘Early detection of neurodegenerative diseases:

circulating brain-enriched microRNA’, Cell cycle (Georgetown, Tex.) 12(1), 1–2.

Shekim, L. O. & LaPointe, L. L. (1984), ‘Production of discourse in individuals with Alzheimer’s

Disease.’, Paper presented at International Neuropsychological Society Meetings, Houston,

TX.

Shriberg, E. (1994), Preliminaries to a Theory of Speech Disfluencies, PhD thesis, University of

California.


Shriberg, E. (2001), ‘To “Errrr” is Human: Ecology and Acoustics of Speech Disfluencies’, Journal of the International Phonetic Association 31, 153–169.

Silbergleit, A. K., Johnson, A. F. & Jacobson, B. H. (1997), ‘Acoustic analysis of voice in indi-

viduals with amyotrophic lateral sclerosis and perceptually normal vocal quality’, Journal of

Voice 11(2), 222–231.

Silva, D. G., Oliveira, L. C. & Andrea, M. (2009), ‘Jitter estimation algorithms for detection of

pathological voices’, EURASIP Journal on Advances in Signal Processing 2009, 9.

Sirts, K., Piguet, O. & Johnson, M. (2017), Idea density for predicting Alzheimer’s disease from

transcribed speech, in ‘CoNLL’.

Skodda, S. & Schlegel, U. (2008), ‘Speech rate and rhythm in Parkinson’s disease’, Movement

Disorders 23(7), 985–992.

Skodda, S., Schlegel, U., Hoffmann, R. & Saft, C. (2014), ‘Impaired motor speech performance

in Huntington’s disease.’, J Neural Transm (Vienna) 121(4), 399–407.

Skodda, S., Visser, W. & Schlegel, U. (2011), ‘Vowel articulation in Parkinson’s disease.’, J Voice

25(4), 467–472.

Soffer, A. (1997), Image categorization using texture features, in ‘Proceedings of the Fourth

International Conference on Document Analysis and Recognition’, Vol. 1, IEEE, pp. 233–237.

Steck, A., Struhal, W., Sergay, S. M., Grisold, W. & the Education Committee of the World Federation of Neurology (2013), ‘The global perspective on neurology training: the World Federation of Neurology survey’, Journal of the neurological sciences 334(1-2), 30–47.

Stevens, S. S. & Volkmann, J. (1940), ‘The relation of pitch to frequency: A revised scale’, The

American Journal of Psychology 53(3), 329–353.

Stevens, S. S., Volkmann, J. & Newman, E. B. (1937), ‘A scale for the measurement of the psy-

chological magnitude pitch’, The Journal of the Acoustical Society of America 8(3), 185–190.

Stolcke, A., Anguera, X., Boakye, K., Cetin, O., Janin, A., Magimai-Doss, M., Wooters, C. &

Zheng, J. (2008), The SRI-ICSI Spring 2007 meeting and lecture recognition system, Springer,

pp. 450–463.

Stolcke, A. & Droppo, J. (2017), Comparing Human and Machine Errors in Conversational

Speech Transcription, in ‘Proc. Interspeech’, pp. 137–141.


Stone, M. (1974), ‘Cross-validatory choice and assessment of statistical predictions’, Journal of

the Royal Statistical Society: Series B (Methodological) 36(2), 111–133.

Stone, M. (1977), ‘Asymptotics For and Against Cross-Validation’, Biometrika 64(1), 29–35.

Strauss, E., Sherman, E. & Spreen, O. (2006), A Compendium of Neuropsychological Tests: Admin-

istration, Norms, and Commentary, 3 edn, Oxford University Press.

Stroop, J. R. (1935), ‘Studies of interference in serial verbal reactions’, Journal of Experimental

Psychology 18(6), 643–662.

Sveinbjornsdottir, S. (2016), ‘The clinical symptoms of Parkinson’s disease’, Journal of Neuro-

chemistry 139, 318–324.

Szoke, I., Schwarz, P., Matejka, P., Burget, L., Karafiat, M., Fapso, M. & Cernocky, J. (2005), Com-

parison of keyword spotting approaches for informal continuous speech, in ‘Ninth European

conference on speech communication and technology’.

Taler, V. & Phillips, N. A. (2008), ‘Language performance in Alzheimer’s disease and mild cog-

nitive impairment: a comparative review.’, J Clin Exp Neuropsychol 30(5), 501–556.

TalkBank (2017), ‘DementiaBank database’, https://dementia.talkbank.org. [Accessed 15-

January-2019].

Teixeira, J. P. & Fernandes, P. O. (2014), ‘Jitter, Shimmer and HNR classification within gender,

tones and vowels in healthy voices’, Procedia technology 16, 1228–1237.

Teixeira, J. P., Oliveira, C. & Lopes, C. (2013), ‘Vocal acoustic analysis - jitter, shimmer and HNR parameters’, Procedia Technology 9, 1112–1122.

TelAsk (2019), ‘TelAsk Technologies’, https://telask.com. Ottawa, ON, Canada, [Accessed

12-March-2019].

Thesaurus (2019), ‘Online thesaurus’, https://www.thesaurus.com. [Accessed 15-January-

2019].

Tibshirani, R. (1996), ‘Regression shrinkage and selection via the lasso’, Journal of the Royal

Statistical Society: Series B (Methodological) 58(1), 267–288.

Titze, I. R. & Martin, D. W. (1998), ‘Principles of voice production’.

Toledo, C. M., Aluısio, S. M., dos Santos, L. B., Brucki, S. M. D., Tres, E. S., de Oliveira, M. O. &

Mansur, L. L. (2018), ‘Analysis of macrolinguistic aspects of narratives from individuals with


Alzheimer’s disease, mild cognitive impairment, and no cognitive impairment’, Alzheimer’s

& Dementia: Diagnosis, Assessment & Disease Monitoring 10, 31–40.

Tomovic, A., Janicic, P. & Keselj, V. (2006), ‘n-Gram-based classification and unsupervised

hierarchical clustering of genome sequences’, Computer methods and programs in biomedicine

81(2), 137–153.

Torres-Carrasquillo, P. A., Gleason, T. P. & Reynolds, D. A. (2004), Dialect identification using

Gaussian mixture models, in ‘ODYSSEY04-The Speaker and Language Recognition Work-

shop’.

Tree, J. E. F. (1995), ‘The effects of false starts and repetitions on the processing of subsequent

words in spontaneous speech’, Journal of Memory and Language 34(6), 709–738.

Treebank, P. (2019), ‘Penn Treebank II Constituent Tags’, http://www.surdeanu.info/mihai/

teaching/ista555-fall13/readings/PennTreebankConstituents.html. [Accessed 15-

January-2019].

Ulatowska, H. & Chapman, S. (1994a), Discourse Analysis and Applications: Studies in Adult Clinical Populations, Lawrence Erlbaum Associates, Hillsdale, chapter Discourse macrostructure in aphasia, pp. 29–46.

Ulatowska, H. K., Allard, L., Donnell, A., Bristow, J., Haynes, S. M., Flower, A. & North, A. J. (1988), Discourse performance in subjects with dementia of the Alzheimer type, in ‘Neuropsychological studies of nonfocal brain damage’, Springer, pp. 108–131.

Ulatowska, H. K. & Chapman, S. B. (1994b), ‘Discourse macrostructure in aphasia’, Discourse Analysis and Applications: Studies in Adult Clinical Populations, pp. 29–46.

Vapnik, V. (1963), ‘Pattern recognition using generalized portrait method’, Automation and Remote Control 24, 774–780.

Vásquez-Correa, J., Orozco-Arroyave, J. R., Arias-Londoño, J. D., Vargas-Bonilla, J. F. & Nöth, E. (2013), Design and implementation of an embedded system for real time analysis of speech from people with Parkinson’s disease, in ‘Symposium of Signals, Images and Artificial Vision - 2013: STSIVA - 2013’, pp. 1–5.

Vizza, P., Tradigo, G., Mirarchi, D., Bossio, R. & Veltri, P. (2017), ‘On the Use of Voice Signals for Studying Sclerosis Disease’, Computers 6(4), 30.


Vogel, A. P., Shirbin, C., Churchyard, A. J. & Stout, J. C. (2012), ‘Speech acoustic markers of early stage and prodromal Huntington’s disease: a marker of disease onset?’, Neuropsychologia 50(14), 3273–3278.

Vorperian, H. K. & Kent, R. D. (2007), ‘Vowel acoustic space development in children: A synthesis of acoustic and anatomic data’, Journal of Speech, Language, and Hearing Research.

Wang, S. & Starren, J. (1999), A Java Speech Implementation of the Mini Mental Status Exam, in ‘Proc. AMIA Symposium’.

Watts, C. R. & Vanryckeghem, M. (2001), ‘Laryngeal dysfunction in Amyotrophic Lateral Sclerosis: a review and case report’, BMC Ear, Nose and Throat Disorders 1(1), 1.

Wechsler, D. (1997), Wechsler Memory Scale - Third Edition Manual, The Psychological Corporation.

Weismer, G., Jeng, J.-Y., Laures, J. S., Kent, R. D. & Kent, J. F. (2001), ‘Acoustic and intelligibility characteristics of sentence production in neurogenic speech disorders’, Folia Phoniatrica et Logopaedica 53(1), 1–18.

WHO (2017), ‘Dementia. Fact sheet No. 362’, World Health Organization.

Wikipedia (2014), ‘Wikipedia 2014’, http://dumps.wikimedia.org/enwiki/20140102/. (No longer available as of 22-February-2019).

Wittchen, H.-U., Jacobi, F., Rehm, J., Gustavsson, A., Svensson, M., Jönsson, B., Olesen, J., Allgulander, C., Alonso, J., Faravelli, C. et al. (2011), ‘The size and burden of mental disorders and other disorders of the brain in Europe 2010’, European Neuropsychopharmacology 21(9), 655–679.

Wong, E. & Sridharan, S. (2002), Methods to improve Gaussian mixture model based language identification system, in ‘Seventh International Conference on Spoken Language Processing’.

Wong, S.-M. J. & Dras, M. (2010), Parser features for sentence grammaticality classification, in ‘Proceedings of the Australasian Language Technology Association Workshop 2010’, pp. 67–75.

Wu, W., Zheng, T. F., Xu, M.-X. & Bao, H.-J. (2006), Study on speaker verification on emotional speech, in ‘Ninth International Conference on Spoken Language Processing’.

Yancheva, M., Fraser, K. C. & Rudzicz, F. (2015), Using linguistic features longitudinally to predict clinical scores for Alzheimer’s disease and related dementias, in ‘Proceedings of SLPAT 2015: 6th Workshop on Speech and Language Processing for Assistive Technologies’, pp. 134–139.

Yancheva, M. & Rudzicz, F. (2016), Vector-space topic models for detecting Alzheimer’s disease, in ‘Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)’, pp. 2337–2346.

Yin, B., Ambikairajah, E. & Chen, F. (2006), Combining cepstral and prosodic features in language identification, in ‘18th International Conference on Pattern Recognition (ICPR’06)’, Vol. 4, IEEE, pp. 254–257.

Yorkston, K., Strand, E., Miller, R., Hillel, A. & Smith, K. (1993), ‘Speech Deterioration in Amyotrophic Lateral Sclerosis: Implications for the Timing of Intervention’, Journal of Medical Speech-Language Pathology 1, 35–46.

Zeißler, V., Adelhardt, J., Batliner, A., Frank, C., Nöth, E., Shi, R. P. & Niemann, H. (2006), The prosody module, in ‘SmartKom: Foundations of Multimodal Dialogue Systems’, Springer, pp. 139–152.

Zheng, R., Zhang, S. & Xu, B. (2004), Text-independent speaker identification using GMM-UBM and frame level likelihood normalization, in ‘2004 International Symposium on Chinese Spoken Language Processing’, IEEE, pp. 289–292.

Zou, H. & Hastie, T. (2005), ‘Regularization and variable selection via the Elastic Net’, Journal of the Royal Statistical Society, Series B 67, 301–320.

Zwicker, E. (1961), ‘Subdivision of the Audible Frequency Range into Critical Bands (Frequenzgruppen)’, The Journal of the Acoustical Society of America 33(2), 248–248.
