NLP for Biomedicine: Ontology Building and Text Mining
Junichi Tsujii
GENIA Project (http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/)
Computer Science, Graduate School of Information Science and Technology
University of Tokyo, JAPAN
My Talk
1. Background: Why NLP in Biomedicine
2. Examples of NLP in Biomedicine
3. Text Mining and NLP
4. Our Current Work
   4.1 Terms and NE
   4.2 Resource Building
   4.3 Event Recognition
5. Concluding Remarks
Why NLP in Biomedicine?
From Biology and Medical Sciences
From Natural Language Processing
[Figure by D. Devos: Genome sequencing. Sequence, structure and function]
Information Exploitation
Scientists in areas such as molecular biology and biochemistry aim to discover new biological entities and their functions. Typical cases could be discoveries of the implications of new proteins and genes in an already known process, or implication of proteins with previously characterized functions in a separate process.
The use of available information (published papers, etc.) is a key step for the discovery process, since in many cases weak or indirect evidences about possible relations hidden in the literature are used to substantiate working hypothesis that are experimentally explored.
[C. Blaschke, A. Valencia: 2001]
From Natural Language Processing
Revolution in LT in the last decade
Information
Knowledge
Language
Texts
Grammar
Syntax-Semantic Mapping
Interpretation based on Knowledge
Machine Learning
Knowledge Acquisition
Statistical Biases
Huge Ontology: Next Revolution?
Bio-Medical Application: UMLS, Gene Ontology, etc.
What can we do with NLP in biomedical domains?
Examples
Protein-Protein Interaction extracted from texts
by C. Blaschke
Organized Knowledge through terms
by C. Blaschke
From Data to Understanding: Interpretation by Language
Oliveros, Blaschke et al., GIW 2000
Information Extraction from Texts
Question Answering (QA) Systems
Characteristics of Signal Pathway (1)
• Granularity of Knowledge Units
  Different types of entities which are interrelated with each other:
  – Cells, sub-locations of cells
  – Proteins, substructures of proteins, subclasses of proteins
  – Ions, other chemical substances
  – Genes, RNA, DNA
G-protein coupled receptor pathway model (figure from TRANSPATH)
CSNDB (National Institute of Health Sciences)
• A data- and knowledge-base for signaling pathways of human cells.
  – It compiles the information on biological molecules, sequences, structures, functions, and biological reactions which transfer the cellular signals.
  – Signaling pathways are compiled as binary relationships of biomolecules and represented by graphs drawn automatically.
  – CSNDB is constructed on ACEDB and the inference engine CLIPS, and has a linkage to TRANSFAC.
  – The final goal is to make a computerized model for various biological phenomena.
Example 1
• A Standard Reaction (excerpted from [Takai98])
Signal_Reaction: “EGF receptor Grb2”
  From_molecule “EGF receptor”
  To_molecule “Grb2”
  Tissue “liver”
  Effect “activation”
  Interaction “SH2+phosphorylated Tyr”
  Reference [Yamauchi_1997]
Example 3
• A Polymerization Reaction (excerpted from [Takai98])
Signal_Reaction: “Ah receptor + HSP90”
  Component “Ah receptor” “HSP90”
  Effect “activation dissociation”
  Interaction “PAS domain of Ah receptor”
  Activity “inactivation of Ah receptor”
  Reference [Powell-Coffman_1998]
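CSNDB is described above as storing pathways as binary relationships that are then drawn as graphs. A minimal sketch of that idea follows, assuming a plain dictionary per reaction; the field names `from`, `to`, `effect`, `tissue` and `reference` and the function `build_graph` are illustrative choices, not the CSNDB schema or API.

```python
from collections import defaultdict

# Hypothetical record modelled on the "Standard Reaction" excerpt above.
reactions = [
    {
        "from": "EGF receptor",
        "to": "Grb2",
        "effect": "activation",
        "tissue": "liver",
        "reference": "Yamauchi_1997",
    },
]


def build_graph(records):
    """Turn binary relationships into an adjacency list of (target, effect) edges."""
    graph = defaultdict(list)
    for record in records:
        graph[record["from"]].append((record["to"], record["effect"]))
    return dict(graph)


graph = build_graph(reactions)
for source, edges in graph.items():
    for target, effect in edges:
        print(f"{source} --{effect}--> {target}")
```

Each record contributes one directed edge, so a pathway is simply the set of edges reachable from a starting molecule.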
[Diagram: Data Mining. Labels: Theories in Science; Observed Data; Objects of Science; Knowledge in Mind (non-observable); Descriptions of Knowledge (observable); Quantitative Data; Mathematical Formula; Qualitative, Structures, Classification; Ontology; Texts]
[Diagram: Objects of Science; Knowledge in Mind (non-observable); Descriptions of Knowledge (observable). Natural Language: incomplete system, diversity, ambiguity]
[Diagram: Data Mining + Text Mining. Labels: Knowledge in Mind (non-observable); Descriptions of Knowledge (observable); Characteristics of Language; Text Mining; Objects of Science; Data Mining; Characteristics of Knowledge]
Terms are the basic units of knowledge
Classification, Features
NE Recognition
Event Recognition
Semantic Disambiguation
Task difficulties in molecular biology
• Inconsistent naming conventions
  e.g. IL-2, IL2, Interleukin 2, Interleukin-2, Il-2
  NF kappa B, NF-kappa B, (NF)-kappa B, NF-Kappa B, …
• Widespread synonymy
  Many synonyms in wide usage, e.g. PKB and Akt
  cyclin-dependent kinase inhibitor p27, p27kip1
  <cdc25, cdc25a>, <p52shc, p52(Shc)>
• Open, growing vocabulary for many classes
• Cross-over of names between classes depending on context
  – Protein vs DNA
• Frequent use of coordination inside term formations
[Diagram: Diversity: Linking Problem, Lexicon, Static Processing. Ambiguity: Term Recognition, Context-Dependent, Dynamic Processing]
Ambiguity
• Abbreviation Extraction (Schwartz 2003)
  – Extracts short and long form pairs

Short form   Long form
AA           Alcoholic Anonymous
             American
             Americans
             Arachidonic acid
             arachidonic acid
             amino acid
             amino acids
             anaemia
             anemia
             :
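The short/long form pairs above come from an abbreviation extractor. The sketch below is a simplified matcher in the spirit of the Schwartz 2003 approach cited on the slide: it looks for a parenthesised short form and aligns its characters right to left against the preceding words. The function names, the length limits and the window size are assumptions made for illustration, not a reproduction of the system used in the talk.

```python
import re


def find_best_long_form(short_form, candidate):
    """Align short-form characters right to left against the candidate text."""
    s = len(short_form) - 1
    l = len(candidate) - 1
    while s >= 0:
        ch = short_form[s].lower()
        if not ch.isalnum():
            s -= 1
            continue
        # Move left until the character matches; the first short-form
        # character must additionally sit at the start of a word.
        while (l >= 0 and candidate[l].lower() != ch) or (
            s == 0 and l > 0 and candidate[l - 1].isalnum()
        ):
            l -= 1
        if l < 0:
            return None
        l -= 1
        s -= 1
    start = candidate.rfind(" ", 0, l + 1) + 1
    return candidate[start:]


def extract_pairs(sentence):
    """Return (short form, long form) pairs for patterns like 'long form (SF)'."""
    pairs = []
    for match in re.finditer(r"\(([^()]+)\)", sentence):
        short_form = match.group(1).strip()
        if not 2 <= len(short_form) <= 10 or not short_form[0].isalnum():
            continue
        words = sentence[: match.start()].split()
        window = min(len(short_form) + 5, len(short_form) * 2)
        candidate = " ".join(words[-window:])
        long_form = find_best_long_form(short_form, candidate)
        if long_form:
            pairs.append((short_form, long_form))
    return pairs


print(extract_pairs("Release of arachidonic acid (AA) was measured."))
print(extract_pairs("Each essential amino acid (AA) was quantified."))
```

Because the same short form aligns with different long forms in different sentences, collecting the pairs over a corpus yields exactly the kind of one-to-many table shown for "AA".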
Experiment [Tsuruoka et al. 03, SIGIR]
• Corpus
  – MEDLINE: the largest collection of abstracts in the biomedical domain
• Rule learning
  – 83,142 abstracts
  – Obtained rules: 14,158
• Evaluation
  – 18,930 abstracts
  – Count the occurrences of each generated variant.
Results: “NF-kappa B”

Generation probability   Generated variant   Frequency
1.0 (input)              NF-kappa B          857
0.417                    NF-kappaB           692
0.417                    nF-kappa B          0
0.337                    Nf-kappa B          0
0.275                    NF kappa B          25
0.226                    NF-kappa b          0
:                        :                   :
Results: “antiinflammatory effect”

Generation probability   Generated variant           Frequency
1.0 (input)              antiinflammatory effect     7
0.462                    anti-inflammatory effect    33
0.393                    antiinflammatory effects    6
0.356                    Antiinflammatory effect     0
0.286                    antiinflammatory-effect     0
0.181                    anti-inflammatory effects   23
:                        :                           :
Results: “tumour necrosis factor alpha”

Generation probability   Generated variant              Frequency
1.0 (input)              tumour necrosis factor alpha   15
0.492                    tumor necrosis factor alpha    126
0.356                    tumour necrosis factor-alpha   30
0.235                    Tumour necrosis factor alpha   2
0.175                    tumor necrosis factor-alpha    182
0.115                    Tumor necrosis factor alpha    8
:                        :                              :
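In the tables above, the probability of a variant that needs two rewrites equals the product of the single-rewrite probabilities (for example 0.492 × 0.356 ≈ 0.175). The sketch below illustrates that scoring under stated assumptions: the two rules and their probabilities are read off the last table for illustration only, the learned rule set itself is not given in the slides, and `generate_variants` is a hypothetical helper.

```python
def generate_variants(term, rules, min_prob=0.1):
    """Apply each rewrite rule at most once and score variants by the product of rule probabilities."""
    variants = {term: 1.0}
    for old, new, prob in rules:
        # Iterate over a snapshot so variants created by this rule are not rewritten again by it.
        for text, p in list(variants.items()):
            if old in text and p * prob >= min_prob:
                changed = text.replace(old, new, 1)
                variants[changed] = max(variants.get(changed, 0.0), p * prob)
    return sorted(variants.items(), key=lambda item: -item[1])


# Illustrative rules (assumed): British to American spelling, space to hyphen.
rules = [
    ("tumour", "tumor", 0.492),
    (" alpha", "-alpha", 0.356),
]

for variant, probability in generate_variants("tumour necrosis factor alpha", rules):
    print(f"{probability:.3f}  {variant}")
```

The `min_prob` cut-off keeps the number of generated variants small, which matters once thousands of rules can apply to the same term.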
GENIA Ontology: Substance
+-substance-+-compound-+-organic-+-nucleic_acid-+-poly_nucleotides
|           |          |         |              +-nucleotide
|           |          |         |              +-DNA
|           |          |         |              +-RNA
|           |          |         +-amino_acid-+-peptide
|           |          |         |            +-amino_acid_monomer
|           |          |         |            +-protein
|           |          |         +-lipid
|           |          |         +-carbohydrate
|           |          |         +-other_organic_compounds
|           |          +-inorganic
|           +-atom
GENIA Ontology: Source
+-source-+-natural-+-organism-+-multi_cell
|        |         |          +-mono_cell
|        |         |          +-virus
|        |         +-body_part
|        |         +-tissue
|        |         +-cell_type
|        +-artificial-+-cell_line
|                     +-other_artificial_sources
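The two trees above can be held in a nested dictionary, which makes it easy to look up the chain of superclasses for any leaf class. This is a minimal sketch: the class names are copied from the trees, while the variable `ONTOLOGY` and the helper `path_to` are illustrative names, not part of any GENIA tool.

```python
ONTOLOGY = {
    "substance": {
        "compound": {
            "organic": {
                "nucleic_acid": {
                    "poly_nucleotides": {},
                    "nucleotide": {},
                    "DNA": {},
                    "RNA": {},
                },
                "amino_acid": {
                    "peptide": {},
                    "amino_acid_monomer": {},
                    "protein": {},
                },
                "lipid": {},
                "carbohydrate": {},
                "other_organic_compounds": {},
            },
            "inorganic": {},
        },
        "atom": {},
    },
    "source": {
        "natural": {
            "organism": {"multi_cell": {}, "mono_cell": {}, "virus": {}},
            "body_part": {},
            "tissue": {},
            "cell_type": {},
        },
        "artificial": {"cell_line": {}, "other_artificial_sources": {}},
    },
}


def path_to(label, tree=ONTOLOGY, prefix=()):
    """Return the tuple of classes from the root down to `label`, or None if absent."""
    for name, children in tree.items():
        here = prefix + (name,)
        if name == label:
            return here
        found = path_to(label, children, here)
        if found:
            return found
    return None


print(" > ".join(path_to("protein")))
print(" > ".join(path_to("cell_line")))
```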
Number of Tagged Objects
• Texts: 2,500 MEDLINE abstracts
  – Papers on transcription factors in human blood cells
  – 550,000 words, 20,000 sentences
• Tagged objects: 147,000
  – Protein: ~77,000
  – DNA: ~24,000
  – RNA: ~2,400
  – Source: ~27,000
  – Other: ~37,000
Distributions of Semantic Classes
[Chart labels: cell line, artificial source, protein, peptide, amino acid monomer, DNA, RNA, polynucleotides, nucleotides, lipid, carbohydrate, other organic compound, atom, inorganic compound, cell component, cell type, tissue, organism, others]
Extension of GENIA Ontology
• Small classes (to be embedded in UMLS)
  – 5242 terms labelled with the ‘other_names’ class
• Events, biological reactions: 3800
• Disease: 636
  – Names of diseases: 501
  – Treatments: 61
  – Diagnoses: 52
  – Pathology: 3
  – Others: 39
• Experiments: 578
  – Methods: 493
  – Materials: 25
  – Others: 60
• Others: 228
[Chart: classification of "other_names" into Event or Reaction, Disease, Experiment, Other]
[Chart: sub-classification of "Disease" into Disease name, Treatment method, Diagnosis, Pathology, Other]
[Chart: sub-classification of "Experiment" into Method, Material, Other]
Biomedical NE Task (Collier Coling00, Kazama ACL02, Kim ISMB02)
• Recognize “names” in the text
  – Technical terms expressing proteins, genes, cells, etc.
• Identify and classify
  Thus, CIITA not only activates the expression of class II genes but recruits another B cell-specific coactivator to increase transcriptional activity of class II promoters in B cells.
  Classes marked on the slide: PROTEIN, DNA, CELLTYPE
NE Task as Classification
• Classify each word into a class (tag) representing the semantic class and the position in the term
  – The task is reduced to a tagging task
    • We can use methods developed for tagging
  – The structure is encoded in a tag
    • BIO (Begin, Inside, and Other) tagging

Words:      … (term of class X) …  (term of class Y) …
BIO tags:   B-X  I-X  I-X  O  B-Y  O  O  O  O      (O = OTHER)
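The BIO encoding described above can be sketched in a few lines. The example assumes that term spans are given as token offsets with an exclusive end; the helper name `to_bio` and the two spans (taken from the CIITA example sentence) are illustrative.

```python
def to_bio(tokens, spans):
    """Encode (start, end_exclusive, label) token spans as one BIO tag per token."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


tokens = "transcriptional activity of class II promoters in B cells .".split()
# Assumed annotation: "class II promoters" is DNA, "B cells" is a cell type.
spans = [(3, 6, "DNA"), (7, 9, "CELLTYPE")]

for token, tag in zip(tokens, to_bio(tokens, spans)):
    print(f"{token}\t{tag}")
```

Once terms are encoded this way, any per-token classifier or sequence tagger can be applied, and the spans can be recovered by reading each `B-` tag together with the `I-` tags that follow it.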
NE Tagging Illustrated
• Classify a word depending on the context

Words:      activity   of   class   II      promoters   in
POS tags:   N          P    N       Sym     Ns          P
BIO tags:   O          O    B-DNA   I-DNA

Labels on the slide: context; conversion to features; classifier
Deterministic tagging:
- Only the most probable tag at each word (SVM)
The Viterbi tagging:
- The most probable sequence among all (probabilistic models)
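The difference between the two strategies can be shown on a toy example. In the sketch below all numbers are invented for illustration: `EMISSIONS` stands for per-word tag scores from some local model and `TRANSITIONS` for tag-to-tag probabilities, with the invalid step O → I-DNA given probability zero. Neither table comes from the talk.

```python
import math

TAGS = ["O", "B-DNA", "I-DNA"]

# Invented local scores for the three words "class II promoters".
EMISSIONS = [
    {"O": 0.5, "B-DNA": 0.4, "I-DNA": 0.1},
    {"O": 0.1, "B-DNA": 0.2, "I-DNA": 0.7},
    {"O": 0.2, "B-DNA": 0.1, "I-DNA": 0.7},
]

# Invented transition probabilities; I-DNA may not follow START or O.
TRANSITIONS = {
    "START": {"O": 0.6, "B-DNA": 0.4, "I-DNA": 0.0},
    "O": {"O": 0.7, "B-DNA": 0.3, "I-DNA": 0.0},
    "B-DNA": {"O": 0.2, "B-DNA": 0.1, "I-DNA": 0.7},
    "I-DNA": {"O": 0.4, "B-DNA": 0.1, "I-DNA": 0.5},
}


def log(p):
    """Logarithm that maps probability zero to minus infinity."""
    return math.log(p) if p > 0 else float("-inf")


def greedy(emissions):
    """Deterministic tagging: take the locally best tag at each word."""
    return [max(TAGS, key=scores.get) for scores in emissions]


def viterbi(emissions, transitions):
    """Return the highest-scoring tag sequence over the whole input."""
    best = {
        t: (log(transitions["START"][t]) + log(emissions[0][t]), [t]) for t in TAGS
    }
    for scores in emissions[1:]:
        step = {}
        for t in TAGS:
            prev = max(TAGS, key=lambda p: best[p][0] + log(transitions[p][t]))
            score = best[prev][0] + log(transitions[prev][t]) + log(scores[t])
            step[t] = (score, best[prev][1] + [t])
        best = step
    return max(best.values(), key=lambda item: item[0])[1]


print("greedy: ", greedy(EMISSIONS))
print("viterbi:", viterbi(EMISSIONS, TRANSITIONS))
```

With these numbers the greedy tagger starts the term with an `I-DNA` tag directly after `O`, which is not a well-formed BIO sequence, while the Viterbi search returns a consistent `B-DNA I-DNA I-DNA` sequence.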
The GENIA Corpus [Tateishi HLT02, Ohta PSB00, ISMB02]
Annotated MEDLINE abstracts
A gold standard for biomedical NLP tasks

# of abstracts:          670
# of sentences:          5,109
# of tokens (words):     152,216
# of named entities:     23,793
# of semantic classes:   24

- 2,000-abstract version soon
http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/
Big enough to make SVM usage nontrivial
Small enough to make sparseness serious
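The ME Method
• Maximum Entropy model

$$P(y \mid h) = \frac{1}{Z(h)} \prod_i \alpha_i^{F_i(h, y)}$$

where $h$ is the context, $y$ is the tag, $F_i$ is a feature function and $\alpha_i$ is the weight for $F_i$.

Feature function:

$$F(h, y) = \begin{cases} 1 & \text{if } y = T \text{ and } f(h) = 1 \\ 0 & \text{otherwise} \end{cases}$$

where $T$ is the target term and $f(h)$ is the same as the feature in SVMs.

The Viterbi algorithm is used for tagging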
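The maximum entropy formula can be evaluated directly: each active feature multiplies the score of its target tag by its weight, and Z(h) normalises over all tags. The sketch below assumes features are given as (target tag, predicate on the context) pairs; the function name `me_probability`, the context keys and the weights are invented for illustration.

```python
def me_probability(history, tags, features, weights):
    """Compute P(y | h) = (1 / Z(h)) * product of weight_i ** F_i(h, y) for every tag y."""
    scores = {}
    for y in tags:
        score = 1.0
        for (target, predicate), weight in zip(features, weights):
            # F_i(h, y) is 1 only when y is the feature's target and f(h) holds.
            if y == target and predicate(history):
                score *= weight
        scores[y] = score
    z = sum(scores.values())  # normalisation term Z(h)
    return {y: score / z for y, score in scores.items()}


tags = ["O", "B-DNA", "I-DNA"]
# Hypothetical features and weights, not learned from any corpus.
features = [
    ("I-DNA", lambda h: h["prev_tag"] in ("B-DNA", "I-DNA")),
    ("O", lambda h: h["word"] == "in"),
]
weights = [3.0, 5.0]

history = {"word": "promoters", "prev_tag": "I-DNA"}
print(me_probability(history, tags, features, weights))
```

In this example only the first feature fires, so the unnormalised scores are 1, 1 and 3, and the model assigns 0.6 to `I-DNA`.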
SOHMM modeling (J. Kim et al., ACL03)
• SOHMM modeling
  – No assumption is made arbitrarily.
  – Instead, a context classification function is induced from a corpus.
• SOHMM learning
  – Inducing the context classification function
  – Estimating parameters

$$T(W_{1,l}) = \arg\max \prod_{i=1}^{l} P(t_i \mid c_i)\, P(w_i \mid t_i)$$

$c_i$: a set of contextual feature values which are visible at the moment of predicting $t_i$.
A classification function from sets of contextual feature values to context patterns grouped appropriately.
Experimental Results
• Biological source recognition

Matching method        Precision   Recall   F-score
hard matching          59.72       68.92    63.99
soft matching left     63.23       72.97    67.75
soft matching right    61.36       70.81    65.75
soft matching either   64.87       74.86    69.51

• Biological substance recognition

Matching method        Precision   Recall   F-score
hard matching          73.76       66.92    70.17
soft matching left     77.64       70.67    73.99
soft matching right    75.19       68.22    71.54
soft matching either   79.07       71.98    75.36
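The F-scores in these tables are the harmonic mean of precision and recall. The sketch below recomputes two of them and also shows one plausible reading of the four matching criteria; the slides do not define hard and soft matching, so the boundary-based interpretation in `matches` is an assumption.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def matches(gold, predicted, mode):
    """Compare two (start, end, label) spans under an assumed matching criterion."""
    g_start, g_end, g_label = gold
    p_start, p_end, p_label = predicted
    if g_label != p_label:
        return False
    left = g_start == p_start    # left boundaries agree
    right = g_end == p_end       # right boundaries agree
    return {
        "hard": left and right,
        "left": left,
        "right": right,
        "either": left or right,
    }[mode]


print(round(f_score(59.72, 68.92), 2))
print(round(f_score(73.76, 66.92), 2))

gold = (3, 6, "DNA")
predicted = (4, 6, "DNA")  # misses the first word of the term
for mode in ("hard", "left", "right", "either"):
    print(mode, matches(gold, predicted, mode))
```

Under this reading, a prediction that gets only one boundary right counts as an error for hard matching but as a hit for the corresponding soft criterion, which is why the soft scores are higher.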
Event Recognition
Identity of events in our mind
Disambiguation of different events by context
Problem: Syntactic Variations
RAF6 activates NF-kappaB.
Lck is activated by autophosphorylation at Tyr 394.
Anandamide induces vasodilation by activating vanilloid receptors.
the activation of Rap1 by C3G
the GTPase-activating protein rhoGAP
the stress-activated group of MAP kinases
ACTIVATOR activate ACTIVATEE
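The point of the examples above is that different surface forms map to one frame, ACTIVATOR activate ACTIVATEE. The toy sketch below normalises three of those forms with regular expressions. It is a deliberately shallow illustration: the patterns and the function `extract_frame` are assumptions, they cover only the active, passive and nominal forms, and they leave the gerund and compound-modifier examples unhandled, which is where deeper syntactic analysis is needed.

```python
import re

# Each pattern names the ACTIVATOR as `agent` and the ACTIVATEE as `theme`.
PATTERNS = [
    re.compile(r"the activation of (?P<theme>.+?) by (?P<agent>.+?)\.?"),
    re.compile(r"(?P<theme>.+?) is activated by (?P<agent>.+?)\.?"),
    re.compile(r"(?P<agent>.+?) activates (?P<theme>.+?)\.?"),
]


def extract_frame(text):
    """Return (ACTIVATOR, ACTIVATEE) for a recognised surface form, else None."""
    for pattern in PATTERNS:
        match = pattern.fullmatch(text)
        if match:
            return match.group("agent"), match.group("theme")
    return None


examples = [
    "RAF6 activates NF-kappaB.",
    "Lck is activated by autophosphorylation at Tyr 394.",
    "the activation of Rap1 by C3G",
    "the GTPase-activating protein rhoGAP",
]
for sentence in examples:
    print(sentence, "->", extract_frame(sentence))
```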
Verbs Related to Biological Events
Frequent Verbs in 100 MEDLINE Abstracts

Verb          Count   Verb            Count   Verb        Count   Verb           Count
be            255     involve         16      determine   9       explain        6
induce        56      identify        16      construct   9       exert          6
bind          50      act             15      associate   9       enhance        6
show          49      stimulate       14      reduce      8       display        6
suggest       42      provide         14      prevent     8       characterize   6
activate      42      express         13      locate      8       participate    5
factor        36      affect          13      line        8       localize       5
demonstrate   35      type            12      differ      8       investigate    5
inhibit       26      report          12      trigger     7       imply          5
have          25      form            12      synergize   7       establish      5
reveal        21      contribute      12      examine     7       conclude       5
require       21      study           11      block       7       compare        5
regulate      21      observe         11      become      7       use            4
indicate      21      lead            11      analyze     7       transform      4
find          21      function        11      target      6       transfect      4
result        20      assay           11      signal      6       test           4
play          19      appear          11      remain      6       suppress       4
interact      18      occur           10      produce     6       support        4
mediate       17      increase        10      present     6       substitute     4
contain       17      phosphorylate   9       possess     6       share          4
Argument Frame Extractor
133 argument structures, marked by a domain specialist in 97 sentences among the 180 sentences
Extracted uniquely:         31
Extracted with ambiguity:   32
Parsing failures
  Extractable from pp’s:    26
  Not extractable:          27
  Memory limitation, etc.:  17
68%
Actual demands in the real world, with more homogeneous user groups and more concrete criteria for evaluating results
http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/
Resources available
• MEDLINE abstracts (4000, about 1 million words)
• GENIA ontology
• POS tags
• Semantic tags
• Structural tags
• Co-reference annotations, with a Singaporean team
• Lexical resources mapped to existing ontology