Literature Mining and Ontology
BMI/IBGP 730
Autumn, 2011
Yang Xiang, Ph.D. in Computer Science
yxiang@bmi.osu.edu
Department of Biomedical Informatics
The Ohio State University


Outline

• What is Literature Mining?
  – Popular Tools for Literature Mining
  – Basic Techniques
  – Information Retrieval (Indexing): Expediting searching
  – Linguistic Processing
  – Other Processing
• What is Ontology?
  – Simple ontology examples
  – Gene ontology
  – Unified Medical Language System
  – Use and index ontology
• Applications of Literature Mining and Ontology


What is Literature (Text) Mining?

• The purposes of Literature Mining
  – Find relevant documents
  – Discover knowledge (what is knowledge?)
    • e.g. opinion mining (sentiment analysis)
    • e.g. document similarity
• The advantages of computer-based Literature Mining
  – Simply put, computers can search many more documents!
  – Computers can ‘think’ and discover knowledge.
• We will focus on biomedical literature mining in what follows.


Why Is Literature Mining So Popular in Biomedical Science?

• Biomedical science studies natural subjects.
  – Species
  – Genes
  – Phenotypes
  – Diseases
  – …


Popular Tools for Biomedical Literature Mining – Document search

• Google
  – Google Scholar: http://scholar.google.com
• ISI Web of Knowledge
  – www.isiknowledge.com
• PubMed
  – www.ncbi.nlm.nih.gov/pubmed
• Scopus
  – www.scopus.com


Tools for Biomedical Literature Mining – Knowledge discovery

• The Gene Ontology
  – http://www.geneontology.org/
• Gene answer
  – www.geneanswers.com


Techniques Behind Literature Mining

• Interdisciplinary
  – Computer Science
    • Information retrieval
    • Data mining
    • Natural Language Processing
    • Machine learning
  – Library Science
  – Biomedical Science
  – Linguistics
    • Computational linguistics
  – Statistics
  – And more!
• Two main research areas (with some overlap)
  – Information Retrieval
  – Natural Language Processing


Basic Text Search Algorithm

• Assume the text size is n.
• Assume the search string size is m.
• How do we design an efficient algorithm to find all matches in the text?
  – Brute-force algorithm: O(mn).
  – Boyer-Moore heuristics: O(mn) in the worst case, but fast in most cases for English text.
  – KMP (Knuth-Morris-Pratt) algorithm: O(m+n).

[Figure: the text "Hello, world ……" with the string to match, "world", aligned beneath it.]
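
As an illustration of the O(m+n) bound, here is a minimal Python sketch of KMP search. The function names build_failure and kmp_search are chosen for this example only and do not come from the slides.

```python
def build_failure(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(text, pattern):
    """Return the start index of every occurrence of pattern in text."""
    if not pattern:
        return []
    fail = build_failure(pattern)
    matches = []
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]  # fall back without re-reading the text
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = fail[k - 1]  # keep going to find overlapping matches
    return matches


print(kmp_search("Hello, world", "world"))  # [7]
```

The text pointer i never moves backwards, which is where the linear bound comes from; the brute-force approach restarts the comparison at every text position instead.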


Information Retrieval (Indexing)

• Archiving (preprocessing) documents for fast search
  – Preprocessing time
  – Query time
  – Index size
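
One common indexing structure is the inverted index, which maps each word to the set of documents containing it. The slides do not name a specific structure, so the sketch below is only one possible illustration; the sample documents and the function names are made up for this example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Preprocessing step: map each lower-cased word to the ids of the
    documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for raw in text.lower().split():
            word = raw.strip(".,;:!?()\"'")
            if word:
                index[word].add(doc_id)
    return index


def search(index, query):
    """Query step: return ids of documents containing every query word."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical mini-corpus.
docs = {
    1: "Gene1 interacts with Gene2.",
    2: "Coffee may be cardioprotective.",
    3: "Gene2 is associated with anxiety.",
}
index = build_inverted_index(docs)
print(sorted(search(index, "gene2")))          # [1, 3]
print(sorted(search(index, "gene2 anxiety")))  # [3]
```

The three costs on the slide show up directly: building the index takes time once, each query becomes a few set lookups, and the index itself occupies memory that grows with the vocabulary and corpus size.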


Programming language processing (C++, Java, etc)

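• Lexical analysis of the statement y=x+10;

lexeme    Token type
y         identifier
=         assignment operator
x         identifier
+         addition operator
10        number
;         end of statement

• Syntax analysis

[Figure: syntax tree for y=x+10. The root is the assignment operator (=); its children are the identifier y and an expression. That expression is the addition (+) of two expressions: the identifier x and the number 10.]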
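
The lexical-analysis step can be sketched with regular expressions. This is a toy tokenizer that covers only the token types in the table for y=x+10; and is not how a production C++ or Java lexer is built. TOKEN_RULES and tokenize are names invented for this example.

```python
import re

# (token type, pattern) pairs; types follow the table for y=x+10;
TOKEN_RULES = [
    ("identifier", re.compile(r"[A-Za-z_]\w*")),
    ("number", re.compile(r"\d+")),
    ("assignment operator", re.compile(r"=")),
    ("addition operator", re.compile(r"\+")),
    ("end of statement", re.compile(r";")),
]


def tokenize(source):
    """Split source into (lexeme, token type) pairs, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(source):
        if source[pos].isspace():
            pos += 1
            continue
        for token_type, pattern in TOKEN_RULES:
            match = pattern.match(source, pos)
            if match:
                tokens.append((match.group(), token_type))
                pos = match.end()
                break
        else:
            raise ValueError(f"unexpected character {source[pos]!r} at {pos}")
    return tokens


for lexeme, token_type in tokenize("y=x+10;"):
    print(lexeme, token_type)
```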


Natural Language Processing

• Lexical level
  – Stemming (including lemmatizing): find the root of a word
    swimming, swam, swim, swimmer → swim
  – Stemming rules may vary (a balance between overstemming and understemming)
  – Typical algorithm: the Porter stemming algorithm
  – Aliases, synonyms
• Grammatical level
  – Parsing
    “…We find Gene1 interacts with Gene2…”

[Figure: parse tree. Sentence → Noun phrase (Gene1) + Verb phrase; Verb phrase → Verb (interact) + Noun phrase (Gene2).]
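
To make the stemming idea concrete, here is a deliberately tiny rule-based stemmer. It is NOT the Porter algorithm: the suffix list, the irregular-form table, and the name toy_stem are assumptions made for this illustration, chosen so that the swim example above works.

```python
# Irregular forms need a lookup table (this is the lemmatizing part).
IRREGULAR = {"swam": "swim", "swum": "swim"}


def toy_stem(word):
    """Strip one common suffix, then undo a doubled final consonant."""
    w = word.lower()
    if w in IRREGULAR:
        return IRREGULAR[w]
    for suffix in ("ing", "er", "s"):
        # Keep at least three characters so short words stay intact.
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            w = w[: -len(suffix)]
            break
    if len(w) >= 2 and w[-1] == w[-2] and w[-1] not in "aeiou":
        w = w[:-1]  # "swimm" -> "swim"
    return w


for word in ("swimming", "swam", "swim", "swimmer"):
    print(word, "->", toy_stem(word))  # every line ends in "swim"

print(toy_stem("paper"))  # "pap": an example of overstemming
```

The last line shows why stemming rules need tuning: a rule aggressive enough to handle "swimmer" also damages "paper".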


Statistical and Data Mining Processing

• Statistical
  – Count the word frequency
  – Count the expression frequency
• Data Mining
  – Mining the set of frequent words
  – Association Rule Mining
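
Both ideas can be sketched in a few lines of Python. The example counts word frequencies and then finds word pairs that co-occur in at least min_support documents, which is only the frequent-itemset step that association rule mining starts from, not a full rule miner. The sample sentences and function names are made up for illustration.

```python
import re
from collections import Counter
from itertools import combinations


def words_of(text):
    """Lower-case alphanumeric tokens of a text."""
    return re.findall(r"[a-z0-9]+", text.lower())


def word_frequencies(docs):
    """Total number of occurrences of each word across all documents."""
    counts = Counter()
    for text in docs:
        counts.update(words_of(text))
    return counts


def frequent_pairs(docs, min_support):
    """Word pairs that appear together in at least min_support documents."""
    pair_counts = Counter()
    for text in docs:
        unique_words = sorted(set(words_of(text)))
        pair_counts.update(combinations(unique_words, 2))
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}


# Hypothetical sentences.
docs = [
    "Gene1 interacts with Gene2",
    "Gene1 regulates Gene3",
    "Gene2 interacts with Gene3",
]
print(word_frequencies(docs)["gene1"])  # 2
print(sorted(frequent_pairs(docs, min_support=2)))
# [('gene2', 'interacts'), ('gene2', 'with'), ('interacts', 'with')]
```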


Document Classification

• E.g., classify all documents related to coffee and health.
• Various machine learning algorithms can be applied here.

[Figure: classification tree. Coffee and health related documents split into "Documents show benefits" and "Documents show risk", with leaf topics Cardioprotective, Laxative, Cholesterol, and Anxiety.]
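
The slides do not say which learning algorithm to use. As one possible choice, the sketch below trains a small multinomial naive Bayes classifier with add-one smoothing. The training sentences and the labels "benefit" and "risk" are invented for this example and are not real data.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def train(labeled_docs):
    """Collect per-class document counts and word counts."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_docs:
        class_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab


def classify(model, text):
    """Return the label with the highest log-posterior score."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word in vocab:  # ignore words never seen in training
                count = word_counts[label][word] + 1  # add-one smoothing
                score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training set.
training = [
    ("coffee shows a cardioprotective effect", "benefit"),
    ("coffee has a mild laxative effect", "benefit"),
    ("coffee raises cholesterol levels", "risk"),
    ("coffee may increase anxiety", "risk"),
]
model = train(training)
print(classify(model, "unfiltered coffee raises cholesterol"))  # risk
print(classify(model, "a laxative effect"))                     # benefit
```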


Accuracy vs. Relevancy in Pattern Recognition/Machine Learning

• Precision = |{relevant docs} ∩ {retrieved docs}| / |{retrieved docs}|

• Recall = |{relevant docs} ∩ {retrieved docs}| / |{relevant docs}|

• Fall-out = |{nonrelevant docs} ∩ {retrieved docs}| / |{nonrelevant docs}|
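
The three ratios translate directly into set operations. In this sketch the document ids and the function name retrieval_metrics are made up; a ratio with an empty denominator is reported as 0.0, which is a convention chosen here rather than part of the definitions.

```python
def retrieval_metrics(relevant, retrieved, all_docs):
    """Return (precision, recall, fall_out) for one query."""
    relevant, retrieved, all_docs = set(relevant), set(retrieved), set(all_docs)
    nonrelevant = all_docs - relevant
    hits = relevant & retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    fall_out = len(nonrelevant & retrieved) / len(nonrelevant) if nonrelevant else 0.0
    return precision, recall, fall_out


# Hypothetical collection of 10 documents.
all_docs = range(1, 11)
relevant = {1, 2, 3, 4}
retrieved = {3, 4, 5, 6, 7}
print(retrieval_metrics(relevant, retrieved, all_docs))  # (0.4, 0.5, 0.5)
```

Here 2 of the 5 retrieved documents are relevant (precision 0.4), 2 of the 4 relevant documents were retrieved (recall 0.5), and 3 of the 6 nonrelevant documents were retrieved (fall-out 0.5).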


Ontology

• In philosophy, ontology is a systematic account of existence.

• In information science, an ontology is a representation of concepts and their relationships, often as a directed graph.


Ontology Example (Informal)

[Figure: informal fish ontology. Node labels, top to bottom: fish; fresh water, salt water; North American, Asian, Europe, ……; Common Carp, mirror Carp, Crappie. Additional labels: invasive, native.]


Ontology Example: Scientific classification

Rank          Taxa
Kingdom       Animalia
Phylum        Chordata, Hemichordata, …
Class         Actinopterygii, Sarcopterygii, …
Subclass      Neopterygii, Chondrostei, …
Infraclass    Teleostei, …
Order         Cypriniformes
Family        Cyprinidae


Gene Ontology (GO) Consortium

[Figure: excerpt of the GO graph. Node labels: Molecular function; Nucleic acid binding; enzyme; helicase; DNA binding; DNA helicase; ATP-dependent DNA helicase; DNA metabolism; cell; …]

Reference: Gene Ontology: tool for the unification of biology. Nature Genetics, 2000. http://dx.doi.org/10.1038/75556


Unified Medical Language System (UMLS)

• A compendium of controlled vocabularies in the biomedical sciences (since 1986). It contains:
  – Metathesaurus
  – Semantic Network
  – SPECIALIST Lexicon
• UMLS contains more than just ontologies.
• Maintained by the US National Library of Medicine
• Website: http://www.nlm.nih.gov/research/umls/


UMLS - Metathesaurus

• Number of biomedical concepts > 1 million
• Drawn from over 100 incorporated controlled source vocabularies:
  – ICD (International Statistical Classification of Diseases and Related Health Problems)
  – MeSH (Medical Subject Headings)
  – SNOMED CT (Systematized Nomenclature of Medicine – Clinical Terms)
  – LOINC (Logical Observation Identifiers Names and Codes)
  – Gene Ontology
  – OMIM (Online Mendelian Inheritance in Man)
  – …

http://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/release/source_vocabularies.html


UMLS - Semantic Network

• Semantic types (categories)
  – Entity
    • Physical Object
      – Organism
      – …
  – Event
    • Activity
      – Behavior
      – …
• Semantic relationships (connecting two concepts)
  – isa
  – associated_with
    • physically_related_to
      – part_of
    • spatially_related_to
      – location_of
    • …

http://www.nlm.nih.gov/research/umls/META3_current_semantic_types.html
http://www.clres.com/semrels/umls_relation_list.html

[Figure: example relationships among Drug A, Disease B, and Gene A, with edge labels treats, treated_by, and disease_is_marked_by_gene.]


Use of ontology systems

• Statistical
  – Gene ontology enrichment test
• Indexing
  – Reachability
  – Distance
  – Path
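
The slides do not spell out the enrichment test. One common formulation is a one-sided hypergeometric test: given a population of N genes, K of which carry a GO term, how likely is it that a sample of n genes contains at least k annotated ones by chance? The sketch below assumes that formulation; the function name and the numbers are illustrative only, and math.comb requires Python 3.8 or later.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, sample_annotated):
    """P(X >= sample_annotated) for X ~ Hypergeometric(population, annotated, sample)."""
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, i) * comb(population - annotated, sample - i)
        for i in range(sample_annotated, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 20 genes in total, 5 annotated with the term,
# and 2 of the 4 genes in our list carry the annotation.
p = enrichment_p_value(population=20, annotated=5, sample=4, sample_annotated=2)
print(round(p, 4))  # 0.2487
```

A small p-value would suggest the term is over-represented in the gene list; in practice many terms are tested at once, so a multiple-testing correction is usually applied on top.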


Represent Ontology by Graphs

• Directed Graph
• Directed Acyclic Graph (DAG): most ontologies fall into this type.
• Directed Tree

[Figure: three example graphs, labeled Directed Graph, DAG, and Tree.]


Reachability

The problem: Given two vertices u and v in a directed graph G, is there a path from u to v?

[Figure: example directed graph on vertices 1–15.]

• Query(1, 11)? Yes
• Query(3, 9)? No
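
Without any index, a reachability query can be answered by a breadth-first search from u. The sketch below shows this baseline; the edge list is hypothetical (the edges of the slide's figure are not recoverable from the text) and was chosen only so that the two example queries give the same answers.

```python
from collections import deque


def reachable(graph, u, v):
    """Return True if there is a directed path from u to v."""
    seen = {u}
    queue = deque([u])
    while queue:
        node = queue.popleft()
        if node == v:
            return True
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


# Hypothetical adjacency lists: vertex -> list of successors.
graph = {1: [3, 4], 3: [6], 4: [7], 7: [11], 2: [5], 5: [9]}
print(reachable(graph, 1, 11))  # True
print(reachable(graph, 3, 9))   # False
```

Each query costs time proportional to the size of the graph in the worst case, which is what reachability indexing schemes aim to avoid on very large graphs.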


Distance

The problem: Given two vertices u and v in a (directed) graph G, what is the distance from u to v?

[Figure: example directed graph on vertices 1–15.]

• Query dG(1, 11) = 3
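
For an unweighted graph, the distance is the number of edges on a shortest path, and breadth-first search computes it directly. This is a baseline sketch rather than an indexing scheme, and the edge list is hypothetical, picked so that the distance from 1 to 11 comes out as 3.

```python
from collections import deque


def distance(graph, u, v):
    """Number of edges on a shortest directed path from u to v,
    or None if v is not reachable from u."""
    dist = {u: 0}
    queue = deque([u])
    while queue:
        node = queue.popleft()
        if node == v:
            return dist[node]
        for nxt in graph.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return None


# Hypothetical adjacency lists: vertex -> list of successors.
graph = {1: [3, 4], 3: [6], 4: [7], 7: [11], 2: [5], 5: [9]}
print(distance(graph, 1, 11))  # 3
print(distance(graph, 3, 9))   # None
```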


Path

The problem: Given two vertices u and v in a (directed) graph G, what is a path (or what are the paths) connecting u to v?

[Figure: example directed graph on vertices 1–15.]

• Example: find a path from 1 to 11.
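
A path query can be answered by the same breadth-first search if each vertex remembers its predecessor. The sketch returns one shortest path only, not all paths, and again uses a hypothetical edge list since the figure's edges are not available in the text.

```python
from collections import deque


def find_path(graph, u, v):
    """Return one shortest directed path from u to v as a list of
    vertices, or None if no path exists."""
    parent = {u: None}
    queue = deque([u])
    while queue:
        node = queue.popleft()
        if node == v:
            path = []
            while node is not None:  # walk back to the start vertex
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in graph.get(node, ()):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None


# Hypothetical adjacency lists: vertex -> list of successors.
graph = {1: [3, 4], 3: [6], 4: [7], 7: [11], 2: [5], 5: [9]}
print(find_path(graph, 1, 11))  # [1, 4, 7, 11]
```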


The estimated difficulty of building very efficient indexing schemes for graph databases (based on current research)

                          Reachability   Distance   Path
Directed Tree             easy           easy       easy
Directed Acyclic Graph    medium         hard       hard
Directed Graph            medium         hard       hard

References:
R. Jin, Y. Xiang, N. Ruan, H. Wang, "Efficiently Answering Reachability Queries on Very Large Directed Graphs", Proc. of ACM SIGMOD Conference, Vancouver, June 9-12, 2008, pp. 595-608.
R. Jin, Y. Xiang, N. Ruan, D. Fuhry, "3-HOP: A High-Compression Indexing Scheme for Reachability Query", Proc. of ACM SIGMOD Conference, Providence, Rhode Island, June 29-July 2, 2009, pp. 813-826.


Applications of Literature Mining and Ontology - I

• Build confirmed gene-phenotype relations
  – Human Phenotype Ontology (HPO)
  – Built from the Online Mendelian Inheritance in Man (OMIM) database.
  – http://human-phenotype-ontology.org/

Reference: Robinson PN, Mundlos S. The Human Phenotype Ontology. Clinical Genetics 77(6) 2010: 525–534. http://dx.doi.org/10.1111/j.1399-0004.2010.01436.x


Applications of Literature Mining and Ontology - II

• MetaMap program and CKC Mining
  – MetaMap: maps biomedical text to the UMLS Metathesaurus.
  – CKC (Conceptual Knowledge Construct): represents a path connecting several concepts in the UMLS.
  – Knowledge discovery using MetaMap and CKC mining.

References:
Aronson, A.: Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. In: AMIA Symposium, p. 17 (2001)
Payne, P., Borlawsky, T., Kwok, A., Greaves, A.: Supporting the design of translational clinical studies through the generation and verification of conceptual knowledge-anchored hypotheses. In: AMIA Annual Symposium Proceedings, p. 566 (2008)

[Figure: workflow diagram with elements labeled Literature, MetaMap, CKCs, phenotypes, and bio-molecular.]


Thanks!

Questions?