74
Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand Balagurunathan

Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Embed Size (px)

Citation preview

Page 1: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Processing Complex Sentences for Information Extraction

Deepthi Chidambaram

December 22, 2004BY 510

Committee

Dr. Hasan Davulcu

Dr. Chitta Baral

Dr. Yoganand Balagurunathan

Page 2: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Outline

• Information extraction – motivation.• Current Trends in Information Extraction.• Sentence structures.• Complex Sentence Processing

– A Rule Engineering Approach.– An Automated Approach

• Experimental Results.• Conclusion.

Page 3: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Information Extraction - Motivation

• Enormous amount of knowledge stored in unstructured text.

• Lack of easy access to information for processing.• Biomedical domain - Piping hot results that curated

databases don’t offer.

Page 4: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

What can we get from text

• Literature– Protein functional information.– Interactions between genes, proteins, chemicals and

drugs.– Signal transduction pathways.

• Clinical records– Diagnosis patterns.– Drugs to disease association.

Page 5: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

The information extraction process

Page 6: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Text Retrieval

• Keyword retrieval (PubMed)

• Vector-Space Model

• Probabilistic models

• Latent Semantic Indexing

Page 7: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Text Classification

• Knowledge Engineering– Rules specified by a domain expert.– ‘knowledge acquisition bottleneck’.

• Machine Learning– ‘Hard’ and ‘soft’ classification.– Traditional classifiers with automatic feature selection

algorithms.– Frequent word co-occurrences and/or Natural Language

patterns.– Ontologies to identify word synonyms.

Page 8: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Named Entity Recognition

• Dictionary look up.– LocusLink, Gene Ontology, HUGO, SwissProt and

UMLS provide extensive lists of biomedical terms.– Constrained to identify names present in the

vocabularies alone.

• Regular Expressions.– Patterns defined on words and phrases.– E.g.. Alphanumeric words, hyphenated mixed cases.

Page 9: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Named Entity Recognition

• Boosted Wrapper Induction– Name boundaries to identify entities.

• Supervised Classifiers– Hidden Markov Models.– Support Vector Machines.

• Supervised Rule Learners to tag entity names – the RAPIER system– Rules based on word patterns.

• Probabilistic Rule Learner – ABGene– Rules based on co-occurrence of POS tags and neighboring words.

Page 10: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Named Entity Recognition

Results from Shared task in JNLPBA, 2004

(Trained and tested on GENIA corpus)

 Technique used Precision / Recall / F-measure

Combination of Hidden Markov Models and Support Vector Machines

76.0 / 69.4 / 72.6

Classifier trained on Local and syntactic features of entity names 71.6 / 68.6 / 70.1

Conditional Random Fields (CRF) and Novel Feature Sets 70.3 / 69.3 / 69.8

SVM + CRF 67.8 / 64.8 / 66.3

HMM model 69.1 / 61.0 / 64.8

Training SVM classifiers on linguistic knowledge 67.4 / 61.0 / 64.0

Page 11: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Extracting Relationships

• Word co-occurrences– Presence of a functional word and entity

names in same sentence.

• Regular Expressions – Regular expressions defined on part of speech

tags.– Processing complex sentence structures using

regular expressions.

Page 12: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Extracting Relationships

• Grammar rules based on sentence structures – shallow parses.– Context Free grammars. – Probabilistic Context Free Grammars. – Automatic learning of grammar rules using supervised rule

learners – FOIL, PROGOL.– Combination of Naïve Bayes Classifier and FOIL.

Page 13: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Extracting Relationships

• Hidden Markov Models– Learning templates based on words and sentence structures.

• Preposition based templates – the Arizona Relation parser– Slot filler approach.– Shallow parsing of sentence structure.

• Decision trees - GIS– Wording and term distributions represented as a decision tree to

identify protein interactions.

Page 14: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Extracting relationships

• Manually built templates– SUISEKI - Frame based system.– BioRAT - Templates based on sentence structure and semantics.– GENIES, GeneWays – Extension of the MedLEE frame based

system.

• Full sentence parses– The structure of the entire sentence is considered.– Potentially more accurate.– Rule learners based on Link grammar analysis of sentences.– Dependency sub-graph from the link grammar word-dependency

output.

Page 15: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Evaluation of Extraction Systems

• Precision– Correctness of system

True Positives True Positives + False Positives

• Recall– Sensitivity of system

True PositivesTrue Positive + False Negative

• F-measure– Combination of Precision and Recall (harmonic mean)

2 * P * R(P + R)

Page 16: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Common problems

• Complex sentence structures.c-Abl tyrosine kinase activity is blocked by pRb, which binds to the c-Abl kinase domain.

• Contextual information.c-Abl phosphorylates tyrosines in the C-terminal domain (CTD) of RNA polymerase II (RPase II; Km 5 0.5 mM).

Page 17: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Common problems

• Negation.DNA-PK does not bind detectably to Ku in the absence of DNA.

• Modality and hypothesis.Cdk4 may be inhibited by tyrosine phosphorylation.

Gadd45 binds to Cdk1 and Gadd45 inhibits Cdk1 activity, probably by displacing Cyclin B1.

Page 18: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Complex Sentence Processing - Motivation

• Sentences structures in text are seldom simple.

• Simple, Complex, compound and Complex-compound structures are addressed.

• Dependent and independent clauses.

• Multiple clauses used to specify interactions.

• Multiple interactions specified in clauses.

Page 19: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Sentence Structures - Examples

• SimplePhosphorylation by ATM activates c-Abl.

• Complexc-Abl tyrosine kinase activity is blocked by pRb, which binds to the c-Abl kinase domain.

• Compoundc-Abl phosphorylates tyrosines in the C-terminal domain (CTD) of RNA polymerase II (RPase II; Km 5 0.5 mM) ; the c-Abl SH2 domain is a specificity determinant for this reaction.

• Complex-compoundBecause the same signals that induce somitic expression of Pax-3 also induce myogenesis and because expression of Pax-3 precedes that of MyoD and Myf-5 in vivo and that of MyoD in vitro, we investigated whether Pax-3 might play a role in activating the myogenic program in somitic cell.

Page 20: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Complex Sentence Processing

• Manual Rule Engineering approach– Based on Part of Speech tags.– Domain expert engineers rules for complex sentence

structures.

• Automated approach– Based on syntactic relationships between the words in

a sentence.– Link grammar parser.

Page 21: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Evaluation of Our Systems

• PrecisionNo. of Correct interactions

Total no. of interactions extracted

• RecallNo of correct interactions

Total no of interactions identified in text

• F-measure(2*P*R)/P+R

Page 22: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Rule Engineering Approach

• Patterns learnt by interaction with domain expert.

• Sentences tagged by a natural Language processor.

• Extraction rules written based on Part of speech tags.

• False positives reduced by use of Ontologies.

Page 23: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Rule Engineering Approach

Toufeeq

Page 24: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Rule Engineering Approach

• System developed in XSB prolog.

• A Java GUI integrated to the back-end processing system.

• Dictionaries coded as knowledge bases in prolog.

• Input – Sentences in a text file.

• Ouput – An interaction triplet.– interact (upstream-gene, interaction word,

downstream-gene).

Page 25: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Pos patterns for simple sentences

extract([word([tag = NNP],_h13160),word([tag = VBZ],_h13161), word([tag=JJ],_h13162)], interact(_h13160,_h13161,_h13162),true]).

extract([ng(_h99513),vg(_h99514),ng(_h99515)],

interact(_h99513,_h99514,_h99515),true).

extract([ng(_h108321),vg(_h108322),word([tag=NNP],_h108323)],

interact(_h108321,_h108322,_h108323),true).

Extract an interaction from a (Noun, Verb, Adjective) pattern.

Page 26: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Knowledge Bases Used

• LocusLink (NCBI)– compilation of the genes and related information such as alias names, protein

products, PubMed IDs of articles referring to the gene etc.– Gene, protein names and their aliases.

• UMLS (NLM)– Electronic "Knowledge Sources" and associated lexical programs for biological

domain.– Specialist Lexicon used to extract interaction words.

• WordNet– English nouns, verbs, adjectives and adverbs are organized into synonym sets,

each representing one underlying lexical concept. – Used in conjunction with UMLS to extract interaction words.– Interaction words are stemmed using the Porter Stemming algorithm to match

across numbers and tense.

Page 27: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Representation of Knowledge Bases

%gene and protein namesisa('3''-nucleotidase',gene).isa('3'' repair exonuclease 1',protein).isa('E2F4',gene).isa('5''-nucleotidase, cytosolic IA',protein).isa('M1',gene).isa('PP2A, subunit B''',protein)

% interaction wordsinteraction('accumulat').interaction('activat').interaction('elevat').interaction('hasten').interaction('incite').interaction('increas').interaction('induc').

Page 28: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Complex Sentence Processing

• Extraction Methodology– Identify the Complex Sentence patterns.– Split the complex and compound sentences to simple

sentences based on extraction rules for the pattern.– Process the sub-sentences using pattern matching

against simple sentence patterns.

• An iterative process was used to build the rules and add simple sentence patterns manually.

• Training and Testing – curated text from a 1999 Molecular Biology of the Cell article by Kurtis Kohn.

Page 29: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Rules Written to Address Complex Sentence Structures

Case - List of arguments - a comma separated list or a list separated by conjunctions.

Rule - Sentence is rewritten with each of the arguments. Each rewrite is processed as a single sentence.

Sample Input- Dyhydrofolate reductase (DHFR) is activated via the E2F transactivation domain, whereas B-myb, Cyclin E, E2F-1, E2F-2, and Cdc2 are regulated via the repression domain of pRb family proteins.

Sample output - interact([Dyhydrofolate,reductase],[is,activated],

[the,E2F,transactivation,domain],1.0).interact(B-myb,[are,regulated],[the,repression,domain],0.6667).interact([Cyclin,E],[are,regulated],[the,repression,domain],0.6667).interact(E2F-1,[are,regulated],[the,repression,domain],0.6667).interact(E2F-2,[are,regulated],[the,repression,domain],0.6667).interact(Cdc2,[are,regulated],[the,repression,domain],0.6667).

Page 30: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Rules – Continued

extractListStart([],_).extractListStart([X|Y],Part) :-

conc([X],Part,Whole),extractComplex(Whole),extractListStart(Y,Part).

extractComplex(S):-

contains([W1,word([_],','),_,word([_],','),_],S),splitList([W1],S,S1,S2),S1=[],getList(S2,List),length(List,Len),ith(Len,List,Ele),sublist([Ele],S,Part1,Part2),extractListStart(List,Part2),!.

Page 31: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Snapshot of the system GUI

Page 32: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Sample Run of the system

Page 33: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Results

Training data Testing data Precision (%)

Recall

(%)Parts A, B Parts A, B 98 88

Parts A, B Part C 57 43

Parts A, B, C Part E 81.25 71

Page 34: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Discussion

• Pronoun resolution done by hand.

• Dependent on skills of domain expert.

• Extension to other sentence structures will involve re-engineering rules.

• Recall tends to be higher as rules are manually created.

Page 35: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Automated Extraction System

• Automatic Integrated system for – Pronoun resolution.– Entity Tagging.– Complex sentence processing.– Interaction Extraction.

• Uses dependencies generated by the Link grammar parser

• Scalable• Less error prone

Page 36: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Automated Approach to CSP

Page 37: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Anaphora Resolution

• Anaphora Resolution– Identification of pairs of noun phrases that refer to

each other.

• Need for Anaphora resolution– Entities are often referred to by pronouns and referent

noun phrases.– Identifying the noun phrases referred to improves

recall of the system.

Page 38: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Types of Anaphora in Literature

Page 39: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Pronoun Resolution Module

• Identifies and resolves pronominal anaphora.

• System written in XSB prolog.

• Interprolog Java API is used to integrate the module to the extraction system.

• Addresses third person pronouns such as it, they.

• Handles both singular and plural pronouns as well as reflexives (itself, themselves) and possessives

(its, their).

Page 40: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Pronoun Resolution - Algorithm

1. Process sentences to get POS tags for the words.

2. Perform chunking on the words to form noun phrases.

3. Identify pronouns in the sentence

4. Remove redundant reflexives

(Gene A, by itself remains activated.)

5. Replace identified pronouns with noun phrases matching number and voice.

Page 41: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Pronoun Resolution: Walkthrough

Ku loads onto dsDNA ends and it can diffuse along the DNA in an energy-independent manner.

Ku loads onto dsDNA ends and Ku can diffuse along the DNA in an

energy-independent manner.

When breast cancers were examined for NGAL mRNA and protein levels, they were found to exhibit heterogeneous expression.

When breast cancers were examined for NGAL mRNA and protein levels , breast cancers were found to exhibit heterogeneous expression .

Page 42: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Pronoun Resolution: Example 2

Page 43: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Entity Tagging

• Combination of dictionary approach and heuristic rules.

• Dictionaries built from LocusLink, UMLS, GO and WordNet.

• Entities addressed by dictionaries - genes, proteins, chemicals, location and interaction words.

• Regular expression developed manually using the information from the dictionaries.

• Heuristic rules combine regular expression and POS tags to mark entities.

Page 44: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Gene and Protein Name Dictionary

• Source - NCBI’s LocusLink.• Fields in LocusLink contain information about gene

loci.– Official nomenclature, aliases, sequence accessions, phenotypes, EC numbers,

MIM numbers, UniGene clusters, homology, map locations, related web sites, etc.

• Perl parser written to identify gene and protein names and their aliases.

• Names and aliases maintained in a list loaded by the system during processing.

Page 45: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Gene and Protein Name Dictionary

(a) LocusLink fields considered for extraction (b) Partial LocusLink record (c) Regular expression used for gene name matching

Page 46: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Gene and Protein Name Tagging

• Exact matches with dictionary• Process of relaxation

– Mark the entire noun group as a gene if any of the word has been identified as gene / protein name.

‘The c-abl kinase activity’ is a gene type

• Words that match a manually built regular expression.– POS tags are used to exclude false positives such as adverbs,

modals, prepositions, etc.

• 508,477 entries in dictionary

Page 47: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Chemical and Location Names

• Used to tag location of interaction and agents in interaction for modifier analysis.

• Source - UMLS Metathesaurus and Semantic Network, GO.

• Location names dictionary contains the GO cellular components ontology.

• Semantic types identified from Semantic Network using the ‘isa’ relationship.

Page 48: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Chemical and Location Tagging

• Concepts categorized under the identified semantic types are added to the corresponding dictionaries.

• A perfect match with an entry in the dictionary is tagged as chemical name / cellular location.

• 790,000 chemical names and 118, 018 location names were identified.

Page 49: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Interaction Words

• Source – Verbs from UMLS Specialist Lexicon and WordNet.

• Interaction words are identified as– (Specialist verbs ∩ WordNet verbs)

• Missing interaction words are added manually.

• 1400 verbs in their stemmed form are stored in the list.

• Porter Stemming Algorithm allows matches across tense and number.

Page 50: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Complex Sentence Processor (CSP)

• Tool used – Link Grammar Parser (LGP) from CMU.

• Verb based extraction.– John played the pipes.

• Linkages returned by the link grammar used to identify syntactic constructs.

• Syntactic constructs associated with a verb form simple sentences.

• Internal clause format representation.

• Java system with a GUI.

Page 51: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Link Grammar Parser

• Based on the Link Grammar for English Syntax.

• Dependency based grammar system written in C.

• Every word in a sentence is connected to every other word directly or indirectly through links.

• Dictionary specifies linking requirements for each word.

• Linkages in a sentence should satisfy– Connectivity.– Planarity.

Page 52: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Linking Requirement

• Link Grammar Parser (LGP) trained on sentences from conversational English (70 telephone conversations).

• Linking requirements specified using ‘&’ and ‘or’ operators.

the: ({AL-} & {@L+} & (D+ or DD+)) or DG+ or (TR- & U+);

• Each word should follow its linking requirement.

Page 53: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Linkage

• Words in sentence matched against dictionary– Linkage formed using the satisfied linking requirements.

• Null linkages – ambiguous / unknown words– LG ignores some words if an acceptable guess of the requirement is

not found.

• Linkages returned in increasing order of cost.

Page 54: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Using Link Grammar Parser in CSP

• LGP considers each sentence as a having one verb.

• Provides sub-linkages for each clause in complex sentence.

• Sub linkages are combined to a single linkage using !union option.

• Combined linkage analyzed and split into clauses.

Page 55: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Combined sub-linkages - !union operator

Page 56: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Pre-processing Text for Link Grammar

• Gene names are not in the vocabulary of LG.– Force LGP to identify entities as nouns.– Capitalized and hyphenated phrases with entity tags.– E.g. ‘The C-abl protein kinase activity’ is converted to

‘Gene-The-C-abl-protein-kinase-activity’.

• Phrases / words in parentheses cause incorrect linkages.– Remove words within parentheses.

Page 57: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Preprocessing – removing parentheses

Rapid degradation of Cyclin D1 requires phosphorylation at threonine-286 (kinase unknown, but not Cdk2 or Cdk4); degradation is by way of the ubiquitin-proteasome pathway

Sentence crashes link grammar if processed as is, with parentheses.

Page 58: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Pre-processing Text for Link Grammar

• Windows version of Link Grammar Parser– Does not handle sentences of more than 70 words

• Mark the sentences of more than 70 words and discard them for further processing

– Words of more than 50 characters also crash the LGP• Mark sentences and discard them from further processing

• Sentences without two participants and one interaction word do not contain interactions– Discard sentences to improve processing time– Discarding is done after pronoun resolution so

referrals are not disturbed.

Page 59: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Complex Sentence Processing

• Input– Tagged and pre-processed text.

• Output– Clauses from the sentence representing the simple

sentences.

• Internal clause format representationSubject + Verb + Object (s) + Verb Modifying Phrases (s)

• Simple Sentences processed by Interaction Extractor module to get interactions from text.

Page 60: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Complex Sentence Processor- Goal

Kohn-partC3|2|Another DMP1-regulated gene is CD13/aminopeptidase N, which is activated cooperatively by DMP1 and c-Myb; CD13/aminopeptidase N activation by DMP1 is inhibited by cyclin D, independent of Cdk4/6.

Complex Sentence Processing

Kohn-partC3|2|Gene-Another-DMP1-regulated-gene|is|Gene-CD13/aminopeptidase-N , which#||Kohn-partC3|2|Gene-CD13/aminopeptidase-N |is activated||cooperatively#by Gene-DMP1 and Gene-c-Myb#|Kohn-partC3|2|Gene-N-activation by Gene-DMP1|is inhibited||by Gene-cyclin-D , independent of Gene-Cdk4/6#|

Abstract ID | Sentence ID | Subject | Verb | Objects(s) | Verb Modifier(s)

Page 61: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Our Contribution: CSP Algorithm

1.Processing complex sentences - Set ranges in sentence1. Identify the S links in the sentence. Each S link forms the beginning

of a simple sentence.

2. Identify the Components from each S link and expand the range to include prepositional, adverbial and adjectival phrases.

2.Extract the components of the simple sentence from the original sentence using the ranges as the substring markers.

Page 62: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

CSP - Illustration

Upon growth factor stimulation of quiescent cells, p130 declines late in G1 and p130 is replaced by p107, which is absent in quiescent cells.

Kohn-partE2|8|Upon growth factor stimulation of quiescent cells, Gene-p130 declines late in Gene-G1 and Gene-p130 is replaced by Gene-p107, which is absent in quiescent cells. |

Entity Tagging and Pre-processing

Page 63: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Illustration of the CSP algorithm

Kohn-partE2|8| upon growth factor stimulation of quiescent cells, Gene-p130 |declines || late # in Gene-G1 #|

Kohn-partE2|8| Gene-p130 | is replaced || by Gene-p107 #|

Kohn-partE2|8| Gene-p107 | is absent || in quiescent cells#|

Page 64: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

CSP Example – Pronoun resolution, preprocessing

Page 65: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Experimental results

• System compared with two other systems– BioRAT [Corney et al, 2004]

• Uses manually generated sentence structure templates.

• Partial parses of sentences.

– GeneWays [Rzhetsky et al, 2004]• An improvement on GENIES system.

• Extraction rules are semantic grammars defined on the sentence structures

• Partial parses of the sentences.

Page 66: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Evaluation - BioRAT

• DIP entries with both participants present in SwissProt.

• 229 abstracts, 660 entries in DIP.

• Gene name dictionary restricted to SwissProt names to emulate the BioRAT system.

• 452 interactions obtained.

• Precision of 54.39%

• Recall of 19.98%.

• F-measure of 29.18%

Page 67: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Comparison with BioRAT

Comparison with BioRAT and DIP was undertaken as a combined work with Toufeeq

Page 68: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Comparison with DIP

Difference in Recall - the interaction is not mentioned in the abstract. (Manual and semi-automatic curation of entries in DIP)

Page 69: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Comparison with GeneWays

• 2500 interactions obtained from the authors corresponding to more than 10 citations in PubMed.

• 30 interactions with most citations were selected and PubMed abstracts (495) obtained through NCBI’s eutilities facility corresponding to the entities.

• Extraction system obtained 1485 interactions.• 156 interactions occurred more than twice in the

abstracts with Precision of 76.13 %.• Recall performed on a single full text article quoted in

GENIES. • 42 interactions extracted from the article with a recall of

63.64%.

Page 70: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Comparison with GeneWays

Page 71: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Analysis of errors in comparison with GeneWays

Page 72: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Conclusion

• Extraction system is able to extract multiple interactions mentioned through complex sentence structure.

• Limitations and bottlenecks– Protein name identification.– Link Grammar Parsing problems.– Identification of relationships between clauses.

Page 73: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Thank you

• My advisors – Dr. Davulcu, Dr. Baral for their guidance and encouragement

• Yoga, for the periodic inputs and pointers• Dr. David Corney , for sharing the data from BioRAT• Dr. Andrew Rzhetsky for providing data from the

GeneWays system• Toufeeq, for sharing his code and for his contribution in

evaluation.

Page 74: Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand

Questions