Universidade de Aveiro
Departamento de Eletrónica, Telecomunicações e Informática

2013

Tiago Santos Barata Nunes

Recuperação de informação baseada em frases para textos biomédicos

A sentence-based information retrieval system for biomedical corpora

The problems are solved, not by giving new information, but by arranging what we have known since long.

Ludwig Wittgenstein

Dissertation presented to the Universidade de Aveiro in fulfilment of the requirements for the degree of Master in Computer and Telematics Engineering, carried out under the scientific supervision of Dr. José Luís Oliveira, Associate Professor at the Departamento de Eletrónica, Telecomunicações e Informática of the Universidade de Aveiro, and of Dr. Sérgio Matos, Assistant Researcher at the Universidade de Aveiro.

To my parents, José and Mila, for making it possible for me to get here and for all their unconditional support along the way. Without you this document would not exist.

the jury

president: Tomás António Mendes Oliveira e Silva
Associate Professor at the Universidade de Aveiro

examiners committee: Erik M. van Mulligen
Assistant Professor at the Medical Informatics Department of the Erasmus Medical Center Rotterdam (main examiner)

Sérgio Guilherme Aleixo de Matos
Assistant Researcher at the Universidade de Aveiro (co-supervisor)

acknowledgements

Writing a research thesis about a relatively complex subject is not an easy task and is certainly not something one can do alone.

I would first like to thank my professors and supervisors, José Luís Oliveira and Sérgio Matos, for their invaluable guidance and support throughout the process of writing this thesis. You helped me stay focused and on track, especially when I felt overwhelmed by all the different possibilities and directions that could be pursued.

To my external advisors, Erik van Mulligen and Jan Kors, for making me feel at home during my short internship at the Biosemantics Research Group, for always being available and supportive, and for helping me ask the right questions at the right time.

To my colleagues at the University of Aveiro Bioinformatics Group and at the Erasmus Medical Center in Rotterdam. You helped me professionally and personally through our countless conversations, both about work and other topics.

A big thank you to all my friends, especially the closest ones, for encouraging me to focus on work and for joining me in those unforgettable stress-relieving moments that allowed me to keep going.

Finally, I want to thank my family for their unconditional support and interest in my progress. A special mention goes to my parents and sister, who made all this possible and gave me the strength and motivation to reach the end of this stage of my life.

Palavras Chave: information retrieval, information extraction, text mining, machine learning, natural language processing, bioinformatics.

Resumo

The development of new experimental methods and high-throughput technologies in the biomedical field has triggered accelerated growth in the volume of scientific publications in the area. Numerous structured repositories for biological data have been created over the last decades; however, users are increasingly turning to information retrieval systems, or search engines, instead of these repositories. Search engines are easier to use thanks to their flexibility and their ability to interpret user requirements, typically expressed as queries composed of a few words.

Traditional search systems return complete documents, which usually demand considerable reading effort to find the information sought, even though in most cases that information is described in a passage of only a few sentences. In addition, these systems frequently fail to find the intended information because, although the query is normally semantically aligned with the language used in the target documents, the terms used are lexically different.

This dissertation focuses on the development of sentence-based information retrieval techniques that, for a given user query, find relevant sentences from the scientific literature that answer the user's requirements. The work is presented in two parts. First, exploratory research was carried out to identify features of informative sentences in biomedical texts; a machine learning method was used for this purpose. Next, a search system for informative sentences was developed. This system supports free-text and concept-based queries, its search results are enriched with annotations of relevant concepts, and they can be ordered according to several ranking strategies.

Keywords: information retrieval, information extraction, text mining, machine learning, natural language processing, bioinformatics.

Abstract

Modern advances in experimental methods and high-throughput technology in the biomedical domain are causing fast-paced growth in the volume of published scientific literature in the field. While a myriad of structured data repositories for biological knowledge have sprouted over the last decades, Information Retrieval (IR) systems are increasingly replacing them. IR systems are easier to use due to their flexibility and their ability to interpret user needs expressed as queries, typically formed by a few words.

Traditional document retrieval systems return entire documents, which may require a lot of subsequent reading to find the specific information sought, frequently contained in a small passage of only a few sentences. Additionally, IR often fails to find what is wanted because the words used in the query are lexically different from, although semantically aligned with, the words used in relevant sources.

This thesis focuses on the development of sentence-based information retrieval approaches that, for a given user query, find relevant sentences from the scientific literature that answer the user's information need. The presented work is two-fold. First, exploratory research experiments were conducted to identify features of informative sentences in biomedical texts. A supervised machine learning method was used for this purpose. Second, an information retrieval system for informative sentences was developed. It supports free-text and concept-based queries, its search results are enriched with relevant concept annotations, and sentences can be ranked using multiple configurable strategies.

Contents

List of Figures
List of Tables
Acronyms

1 Introduction
  1.1 Motivation
  1.2 Objectives
  1.3 Thesis outline

2 Background
  2.1 Information Extraction
    2.1.1 Natural Language Processing
    2.1.2 Named Entity Recognition
    2.1.3 Relationship and Event Extraction
  2.2 Information Retrieval
    2.2.1 Model Types
    2.2.2 Evaluating Performance
    2.2.3 Indexing and Searching
    2.2.4 Query Expansion
    2.2.5 Text Classification
    2.2.6 Sentence Retrieval Applications

3 Model Proposal and Implementation
  3.1 Sentence Classification
    3.1.1 Methods
      3.1.1.1 Classification Framework
    3.1.2 Learning Resources