BioNLP for NLPeople

CS5832/HLT-NAACL/RANLP

The weirdest job in the world

How I got here

• Voice Input Technologies
• Linguistix
• Nationwide Insurance
• MapQuest
• Berdy Medical Systems
• OneRealm [sic]

How I got here

• Perl hacker, SLM data preprocessing
• Linguist, corpus construction
• Senior Programmer/Analyst, Interactive Voice Response (yuck)
• Software test dept. manager; senior software engineer
• Consultant/Perl hacker
• Senior software engineer

What is BioNLP?

• Natural language processing applied to biomedical language
– Publications
– Medical records
– Ontologies

Part 0

Why a field called BioNLP?

There is little reason for the data on which a linguist works to have the right to name that work.

Shuy 2002:8

(One lab’s) funding for NLP in computational biology

• INIA (Neuroinformatics of Alcoholism) $5M, 5 years
• Wyeth Genomics Institute ($200K, 2 years)
• National Library of Medicine ($4.2M, 3 years)
• National Library of Medicine ($XM, 3 years)

Why biologists care

• High-throughput data interpretation
• Literature search
• Annotation
• Database construction

But, I’m a NLPerson (computer scientist, mathematician, engineer…)

• Hard, but might be possible
• Might be harder in biomedical domain than in newswire text
• Might be more possible in biomedical domain than in newswire text

Resources

The big drawing point for NLPeople

• Data
– Lexical resources
– 500 * 16M words of text
– Labelled training data
• Tools
– NER, POS taggers, parsers, semantic normalizers....

$$$

Job market

• Academia: great
– US, Europe
• Industry: not bad, but genomics-specific right now

Surely Shuy jests...

There is little reason for the data on which a linguist works to have the right to name that work.

It really is different on every level

• Tokenization
• Named entity recognition
• Corpus construction
• Semantic representation

NLP actually could make the world a better place....

An embarrassing truth about BioNLP...

www.chilibot.net

Part 1: Just enough biology

Cells and proteins

<illustration: cell, structures, proteins>

How biologists see the world

Wattarujeekrit et al. (2004)

The Central Dogma: from genes to proteins

http://www.swbic.org/products/clipart/images/dogmag.jpg

http://www.ncbi.nlm.nih.gov/Class/MLACourse/Modules/MolBioReview/images/central_dogma.gif

Higher-level structures

• Genotype, phenotype
• Tissue, organ, organism

Biological structures are complex

SNAP Receptor

Vesicle SNARE

V-SNARE

N-Ethylmaleimide-Sensitive Fusion Protein

Soluble NSF Attachment Protein

Maleic acid N-ethylimide

Vesicle Soluble Maleic acid N-ethylimide Sensitive Fusion Protein Attachment Protein Receptor

(Alex Morgan, MITRE)

Part 2: Why bioscientists fund and publish research in BioNLP

Two basic markets, multiple user types

• Medical
– Clinicians
– Consumers
– “Informationists”
– Administrators (billing, quality assurance, ...)
• “MolBio” (genomic)
– High-throughput experimentalists
– “Bench scientists”
– Model organism database curators

Structured vocabulary

Free text (phenotypes)

122 references...

Medical

1997

<scanned picture of business card>

<happy-face photo>

One year later…

A sad story: physicians don’t buy a lot of NLP software

Another sad story: trying to sell “gisting” to physicians

Sold for $400K: 14.5 or 2.9¢ on the dollar…

Salesperson’s thought process

Physician’s thought process

Genomics

Why biologists care

• High-throughput data interpretation
• Literature search
• Annotation
• Database construction

10 years ago...

Today....

Double exponential growth in the literature

New entries in Medline with publication date in Jan-Aug 2005: 431,478 (avg. 1775/day)
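The per-day average can be checked with a few lines of arithmetic. This is only a sanity check of the figure on the slide, assuming "Jan-Aug 2005" means 1 January through 31 August inclusive; the variable names are illustrative.

```python
from datetime import date

# Figure from the slide: new Medline entries with publication date Jan-Aug 2005.
NEW_ENTRIES = 431_478

# Assumption: the period runs from 1 January through 31 August 2005 inclusive.
days = (date(2005, 9, 1) - date(2005, 1, 1)).days

entries_per_day = NEW_ENTRIES / days
print(f"{days} days, about {int(entries_per_day)} new entries per day")
```

Under that assumption the result agrees with the 1775/day quoted above.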

Biological Nomenclature: “V-SNARE”

SNAP Receptor

Vesicle SNARE

V-SNARE

N-Ethylmaleimide-Sensitive Fusion Protein

Soluble NSF Attachment Protein

Maleic acid N-ethylimide

Vesicle Soluble Maleic acid N-ethylimide Sensitive Fusion Protein Attachment Protein Receptor

(Alex Morgan, MITRE)

Part 3

Some things that make BioNLP different

Named Entity Recognition

Genes have names??

Suzanna Lewis

• Fruitfly geneticist
• 5 kids
• Latte + 3 shots

It is the middle of the night (2:38 to be precise), I am away from friends and family, It has been this way for over 2 years, I can't sleep because of all the work there is yet to do, and there is no end in sight. So when do the magic little elves appear out of nowhere and get everything done?

p.s. I am serious.

pray for elves

D. melanogaster gene Pray For Elves, abbreviated as PFE, is reported here. It has also been known in FlyBase as CG15151. Similar sequences have been identified in Caenorhabditis elegans, Homo sapiens, Mus musculus, Rattus norvegicus and Saccharomyces cerevisiae.

(FlyBase report FBal0138651)

Named entity recognition

• Molecular biology entity identification problem:
– large list of classes
– some of them much harder
• Usual case-related cues don't help
• More variability of content
• Huge lexical ambiguity problem
• Common English
– as posed, not useful

white

"wild-type" (not mutated)

"mutant"

Case is meaningful

white

Symbol: w

White

Symbol: W
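A minimal sketch of why case folding hurts here, using only the two entries from the slide. The function name `build_index` and the dictionary layout are illustrative assumptions, not any real gene database API.

```python
from collections import defaultdict

# The two entries from the slide: gene name -> symbol.
GENES = {"white": "w", "White": "W"}

def build_index(genes, fold_case):
    """Map each lookup key to the set of gene symbols it could refer to."""
    index = defaultdict(set)
    for name, symbol in genes.items():
        key = name.lower() if fold_case else name
        index[key].add(symbol)
    return dict(index)

# Case-sensitive lookup keeps the two genes apart.
print(build_index(GENES, fold_case=False))

# The usual "lowercase everything" preprocessing merges them into one
# ambiguous key.
print(build_index(GENES, fold_case=True))
```

With case folding, the single key "white" points at both symbols, so a downstream step can no longer tell which gene was meant.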

Yes, there are genes with the symbols I, a, R, p....

Case is meaningful

Misshapen (Msn) has been proposed to shut down Drosophila photoreceptor (R cell) growth cone motility in response to targeting signals linked by the SH2/SH3 adaptor protein Dock. (Ruan et al. 2002)

…even sentence-initially.

sunday driver (syd) was identified in a screen for novel axonal transport mutants in Drosophila. Syd is a ~137kDa protein that is broadly conserved in evolution with homologous proteins identified in C. elegans, mouse and human. (Bowman 2000)

Misshapen (Msn) has been proposed to shut down Drosophila photoreceptor (R cell) growth cone motility in response to targeting signals linked by the SH2/SH3 adaptor protein Dock. Here, we show that Bifocal (Bif), a putative cytoskeletal regulator, is a component of the Msn pathway for regulating R cell growth targeting. bif displays strong genetic interaction with msn.

Surely you could determine on a document-by-document basis…

Axonal traffic jams with a sunday driver: Identification of a broadly conserved transmembrane protein required for axonal transport in Drosophila. (Bowman 2000)

Evolution

• What it looks like
• What it acts like
• Metaphor
• …

Looks like…

• white
• swiss cheese
• clown
• dachshund
• dreadlocks

Acts like…

• ether a go-go
• lush
• agnostic
• amontillado

Metaphor/metonymy

• lot
• maggie
• scott of the antarctic
• always early -> british rail
• asp -> cleopatra
• tudor -> vasa -> gustavus
• nanos -> smaug

whimsy

• chablis, merlot, zinfandel, retsina, moonshine (16 zebrafish genes)
• milkah, murashka, zolotistyuy, zloday (32 Drosophila genes)

But, that’s not the only way of naming genes....

• Breast cancer 1 (BRCA1)
• p53
• Ribosomal protein S27
• Heat shock protein 110
• Mitogen activated protein kinase 15
• Mitogen activated protein kinase kinase kinase 5

• fuculokinase
• GABA
• Heat shock protein 60
• calmodulin
• dHAND
• suppressor of p53

• cheap date
• lush
• ken and barbie
• ring
• to
• the
• there
• a

Worst gene names

• sema domain, seven thrombospondin repeats (type 1 and type 1-like), transmembrane domain (TM) and short cytoplasmic domain, (semaphorin) 5A
• SEMA5A
• Tyrosine kinase with immunoglobulin and epidermal growth factor homology domains
• tie

“Gene mention” (NER)

• What doesn’t work
• What does (as of 2004)

Yeh et al. (2005)

Good systems?

• Handle multi-word names (heat shock protein 60) (base NP chunking, abbreviation definitions, post-processing)
• Use some form of machine learning (MaxEnt, HMM, CRF, SVM) (or a clever hack)
• Do some rule-based post-processing
• Don’t rely on dictionaries

The Jim Martin technique really works

Kinoshita et al. (2005)

...which isn’t to say that external knowledge is bad

• Markert/Nissim’s extensions of Poesio’s use of Google

Most feature sets include...

• Typo/orthographic features
– Patterns like \w+-?\d+
– Contains Greek letters
• Local/distant context
– Next word is “protein”
– Followed by “protein” somewhere else in document
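The four features listed above could be computed per token roughly as follows. This is a sketch under assumptions, not the feature extractor of any particular system: the function names, the short list of spelled-out Greek letter names, and the reading of the pattern as a whole-token match are all illustrative choices.

```python
import re

# The orthographic pattern quoted on the slide, applied to the whole token.
SYMBOL_PATTERN = re.compile(r"\w+-?\d+")

# Assumption: a small, incomplete list of spelled-out Greek letter names.
GREEK_NAMES = {"alpha", "beta", "gamma", "delta", "kappa"}

def contains_greek(token):
    """True for Greek characters or a hyphen-separated Greek letter name."""
    if any("\u0370" <= ch <= "\u03ff" for ch in token):
        return True
    return any(part.lower() in GREEK_NAMES for part in token.split("-"))

def followed_by_protein(tokens, i):
    """True if the token right after position i is the word 'protein'."""
    return i + 1 < len(tokens) and tokens[i + 1].lower() == "protein"

def token_features(tokens, i):
    """Feature dictionary for tokens[i] within one tokenized document."""
    token = tokens[i]
    elsewhere = any(
        tokens[j] == token and followed_by_protein(tokens, j)
        for j in range(len(tokens))
        if j != i
    )
    return {
        "matches_symbol_pattern": SYMBOL_PATTERN.fullmatch(token) is not None,
        "contains_greek": contains_greek(token),
        "next_word_is_protein": followed_by_protein(tokens, i),
        "followed_by_protein_elsewhere": elsewhere,
    }

tokens = "Hsp-60 protein binds p53 . p53 protein is a tumor suppressor .".split()
print(token_features(tokens, 3))  # features for the first mention of p53
```

The last feature is the "distant context" idea: the first mention of p53 is not followed by "protein", but a later mention in the same document is.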

Why not better?

• Length
• Case
• Tokenization
• Annotation issues
– Inconsistency
– Multiple correct answers
– Inter-corpus differences in definition

Yeh et al. (2005)

Length effect (and why the Jim Martin technique works so well for this)

Kinoshita et al. (2005)

A great research project

• Build an NER system for...
– Species
– Laboratory techniques
– Cell types
– Cell lines
– Tissues
– ....

...and, NER isn’t what you need anyways

• GN task and results

Tokenization

• How to build a cheap base noun phrase chunker
– Start from right, move left
  • If next token is not conjunction, preposition, comma, period, or right parenthesis, add it
  • Else start a new chunk
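The recipe above fits in a few lines. This is a minimal sketch, assuming the input is already tokenized; the word lists are small hypothetical stand-ins for real conjunction and preposition lists, and dropping the boundary token itself is one reading of "start a new chunk".

```python
# Assumption: tiny illustrative word lists, nowhere near complete.
CONJUNCTIONS = {"and", "or", "but", "nor"}
PREPOSITIONS = {"of", "in", "on", "by", "with", "for", "to", "from", "at"}
PUNCTUATION = {",", ".", ")"}
BOUNDARIES = CONJUNCTIONS | PREPOSITIONS | PUNCTUATION

def cheap_chunks(tokens):
    """Scan right to left; boundary tokens close the current chunk."""
    chunks = []
    current = []
    for token in reversed(tokens):
        if token.lower() in BOUNDARIES:
            # Boundary token: close the chunk collected so far, if any.
            if current:
                chunks.append(current[::-1])
                current = []
        else:
            current.append(token)
    if current:
        chunks.append(current[::-1])
    # Chunks were collected right to left; restore reading order.
    return chunks[::-1]

tokens = ["expression", "of", "heat", "shock", "protein", "60", "in", "yeast", "."]
print(cheap_chunks(tokens))
```

Note that nothing here knows about verbs or determiners, which is part of what makes it cheap: it only splits at the closed-class tokens named on the slide.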

Tokenization

• Commas
– 2,6-diaminohexanoic acid
– tricyclo(3.3.1.13,7)decanone

Four kinds of hyphens

• “Syntactic:”
– Calcium-dependent
– Hsp-60
• Knocked-out gene: lush-- flies
• Negation: -fever
• Electric charge: Cl-
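To see what goes wrong, here is what a generic newswire-style tokenizer does to the examples above. The regular expression is a common naive recipe chosen for illustration, not the tokenizer of any specific toolkit.

```python
import re

def naive_tokenize(text):
    """Split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

for example in ["2,6-diaminohexanoic acid",
                "tricyclo(3.3.1.13,7)decanone",
                "Hsp-60",
                "Cl-"]:
    print(example, "->", naive_tokenize(example))
```

The chemical names are shredded into digits and punctuation, and all four kinds of hyphen end up as the same standalone "-" token, so the distinction between them is lost before any later step sees the text.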

B-cell-CD4(+)-T-cell interactions

• PMID: 10516078

Special challenges in biomedical corpus construction

• How do you parse rat epithelial growth factor receptor 2?
• Don’t: pretag all named entities

• How do you tokenize tricyclo(3.3.1.13,7)decanone
• Don’t: pretag all named entities

• How do you hire a linguistics graduate student to tag rat epithelial growth factor receptor 2?
• You can’t...

• How do you do PAS tagging when you don’t have syntactically tagged text?
• Sigh...

Some specific cases of word sense disambiguation

Abbreviation disambiguation

• Incidence of ambiguous abbreviations (Jeff Chang’s paper)
• Statistical approaches
– Chang
• Rule-based
– Schwartz and Hearst

Part 4: getting up to speed

(about) 10 papers and resources that will let you read most other papers in BioNLP

Named entity recognition 1: rule-based

• Fukuda et al. (1998): first NER paper
– Find something that looks like a symbol for a yeast gene (ABC1)
– Extend name to the left (yeast ABC1)
– Extend name to the right (ABC1 protein)
• Results in 90s
– Never replicated
– Yeast is easy
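The three steps above (find a core symbol, extend left, extend right) can be illustrated with a toy version. This is not a reimplementation of Fukuda et al.'s system: the core pattern and the extension word lists are hypothetical, picked only so that the slide's ABC1 example works.

```python
import re

# Assumption: a yeast-style symbol is three capital letters plus digits.
CORE_PATTERN = re.compile(r"[A-Z]{3}\d+")

# Assumption: hypothetical word lists that license extending the name.
LEFT_EXTENDERS = {"yeast"}
RIGHT_EXTENDERS = {"protein", "gene"}

def find_names(tokens):
    """Return name strings built around tokens that look like gene symbols."""
    names = []
    for i, token in enumerate(tokens):
        if not CORE_PATTERN.fullmatch(token):
            continue
        start, end = i, i + 1
        # Extend the name one word to the left, then one word to the right.
        if start > 0 and tokens[start - 1].lower() in LEFT_EXTENDERS:
            start -= 1
        if end < len(tokens) and tokens[end].lower() in RIGHT_EXTENDERS:
            end += 1
        names.append(" ".join(tokens[start:end]))
    return names

print(find_names("The yeast ABC1 protein is essential".split()))
```

A pattern this regular is plausible for yeast symbols, which is one way to read "Yeast is easy"; names like the fly genes shown earlier would not be caught by anything of this shape.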

Named entity recognition 2: machine learning

• Collier et al. (XXX)

NER 3: state of the art

Information extraction 1: rule-based

• Blaschke 1998

Information extraction 2: machine learning

• Craven and Kumlien 199X
• Identify entity pairs
– Protein/protein
– Protein/disease
– Protein/?
• Use naïve Bayes to classify sentences as +/- positing a relation
– Features: bag-of-words
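A bag-of-words naive Bayes sentence classifier of the kind described above can be sketched from scratch. This is a generic multinomial naive Bayes with add-one smoothing, not the original authors' code; the four training sentences and the PROT1/PROT2 placeholders are invented toy data.

```python
import math
from collections import Counter

def train(sentences, labels):
    """Count words per class; labels are True (relation) or False (none)."""
    word_counts = {True: Counter(), False: Counter()}
    class_counts = Counter(labels)
    for sentence, label in zip(sentences, labels):
        word_counts[label].update(sentence.lower().split())
    vocab = set(word_counts[True]) | set(word_counts[False])
    return word_counts, class_counts, vocab

def predict(model, sentence):
    """Return the label with the higher log posterior for the sentence."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        n_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total)
        for word in sentence.lower().split():
            if word in vocab:  # words never seen in training are ignored
                # Add-one (Laplace) smoothing over the vocabulary.
                score += math.log(
                    (word_counts[label][word] + 1) / (n_words + len(vocab))
                )
        scores[label] = score
    return max(scores, key=scores.get)

# Invented toy data: entity mentions replaced by placeholders.
sentences = [
    "PROT1 interacts with PROT2",
    "PROT1 binds PROT2 in yeast",
    "PROT1 and PROT2 were sequenced",
    "PROT1 was cloned before PROT2",
]
labels = [True, True, False, False]

model = train(sentences, labels)
print(predict(model, "PROT1 binds PROT2"))
print(predict(model, "PROT1 was sequenced"))
```

Replacing entity mentions with placeholders before counting is a design choice made here so that the classifier learns from the surrounding words rather than from specific protein names.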

Information extraction 3: rules, linguistics, knowledge

• Friedman: MedLEE, BioMedLEE
• NER
• Syntax

Corpora: 1

• PubMed/MEDLINE
– MEDLINE: database of 16M+ abstracts
– PubMed: interface for searching MEDLINE
– ASCII and free

NOT a corpus—not really even a “text collection”

Corpora: 2

• GENIA
– Fully annotated corpus
– 2,000 abstracts
– X00,000 words
– Now: POS, named entities, 25% treebanked
– Coming: anaphora; events?; PAS?; dependency parses?

Lexical resources: 1

• Gene Ontology
– Molecular functions
– Biological processes
– Cell components
• Building blocks
– Terms + definitions
– Is-a, part-of

Lexical resources: 2

• Entrez Gene (formerly LocusLink)
– Names
– Symbols
– Synonyms
– Protein products
– “Summary”
– Gene References Into Function

Lexical resources: 3

• UMLS (Unified Medical Language System)
– MetaThesaurus
– Semantic Network

Tools overview

• Probably something available
• Might work decently
• Definitely improvable for your specific task

Tools: 1

• POS tagging:
– GENIA
– MEDPOST
– LingPipe?

Tools: 2

• Named entity recognition
– ABNER (Settles 200x)
– KeX
– AbGene
• LESSON: distribute a .jar file and the world will beat a path to your door

Part 6: Current hot topics

What’s the right model for semantic representation?

• So far: binary relations
• Arguments that that’s not good enough
– Rzhetsky/GeneWays paper
– Penn folks/IE paper
– Native speaker intuitions (Juliane, etc.)
• Two ways forward
– Differentiating binary relations
  • Marti HLT/EMNLP; Tsujii
– PAS
  • PASBio/Wattarujeekrit et al.
  • Kogan et al.

Karin: how do these representational choices affect what a biologist would get out of the text?

The ontology wars

• Point:
– Hunter; PASBio; Barry Smith; L&C....
– GOA; MGI; EBI; ...
• Counterpoint:
– Tsujii/Ananiadou; Pedersen/Pakhomov; Markert/Nissim...

True integration of NLP into laboratory data interpretation

• <Last chapter of Sophia and John’s book>

The embarrassing truth about BioNLP (take 2)...

References

• Shuy, Roger (2002) Linguistic battles in trademark disputes. Palgrave.
• Yeh, Alexander; Alexander Morgan; Marc Colosimo; and Lynette Hirschman (2005) BioCreative Task 1A: gene mention finding evaluation. BMC Bioinformatics 6(Suppl. 1):S2.
