Applied text mining

Page 1: Applied text mining
Page 2: Applied text mining

>10 km

Page 3: Applied text mining

too much to read

Page 4: Applied text mining

exponential growth

Page 5: Applied text mining

~40 seconds per paper

Page 6: Applied text mining

computer

Page 7: Applied text mining

as smart as a dog

Page 8: Applied text mining

teach it specific tricks

Page 9: Applied text mining
Page 10: Applied text mining
Page 11: Applied text mining

information retrieval

Page 12: Applied text mining

named entity recognition

Page 13: Applied text mining

information extraction

Page 14: Applied text mining

text/data integration

Page 15: Applied text mining

medical text mining

Page 16: Applied text mining

information retrieval

Page 17: Applied text mining

find the relevant papers

Page 18: Applied text mining

ad hoc retrieval

Page 19: Applied text mining

user-specified query

Page 20: Applied text mining

“yeast AND cell cycle”

Page 21: Applied text mining

PubMed

Page 22: Applied text mining
Page 23: Applied text mining

indexing

Page 24: Applied text mining

fast lookup
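
A minimal sketch of why indexing gives fast lookup: an inverted index maps each word to the documents that contain it, so a Boolean AND query becomes a set intersection instead of a scan of every paper. The documents and the function names `build_index` and `search_and` are invented for illustration; this is not how PubMed is implemented.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercase token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def search_and(index, terms):
    """Return ids of documents containing every term (Boolean AND)."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()

# Toy documents, invented for illustration only.
docs = {
    1: "yeast cell cycle regulation",
    2: "human cell cycle kinases",
    3: "yeast metabolism",
}
index = build_index(docs)
hits = search_and(index, ["yeast", "cell", "cycle"])
print(hits)
```

The sketch treats "cell cycle" as two independent words; matching it as a phrase would also require storing word positions.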

Page 25: Applied text mining

stemming

Page 26: Applied text mining

word endings
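
A rough sketch of stemming as suffix stripping, so that different word endings map to the same index entry. The suffix list and the function `naive_stem` are assumptions made up for this example; production systems typically use a more careful algorithm such as the Porter stemmer.

```python
# Hypothetical suffix rules, checked in order (longest first).
SUFFIXES = ("ations", "ation", "ing", "ed", "es", "s")

def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

words = ["phosphorylates", "phosphorylated", "phosphorylating"]
stems = [naive_stem(w) for w in words]
print(stems)
```

All three forms reduce to the same stem, so a query for one of them would retrieve papers using any of them.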

Page 27: Applied text mining

dynamic query expansion

Page 28: Applied text mining

MeSH terms
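
A small sketch of query expansion: each user term is replaced by an OR group of its synonyms, and the groups are joined with AND. The `SYNONYMS` table is invented for this example; a real system would take its synonyms from a controlled vocabulary such as MeSH, and the actual expansion rules of any particular search engine are not shown here.

```python
# Hypothetical synonym table, for illustration only.
SYNONYMS = {
    "yeast": ["yeast", "saccharomyces cerevisiae"],
    "cell cycle": ["cell cycle", "cell division cycle"],
}

def expand_query(terms):
    """OR together the synonyms of each term, then AND the groups."""
    groups = []
    for term in terms:
        variants = SYNONYMS.get(term.lower(), [term])
        groups.append("(" + " OR ".join(f'"{v}"' for v in variants) + ")")
    return " AND ".join(groups)

expanded = expand_query(["yeast", "cell cycle"])
print(expanded)
```

Terms without an entry in the table pass through unchanged as a single quoted phrase.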

Page 29: Applied text mining

Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1

and this modification served as a priming step to promote subsequent

Cdc5-dependent Swe1 hyperphosphorylation and degradation

Page 30: Applied text mining

no tool will find that

Page 31: Applied text mining

named entity recognition

Page 32: Applied text mining

identify the concepts

Page 33: Applied text mining

Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1

and this modification served as a priming step to promote subsequent

Cdc5-dependent Swe1 hyperphosphorylation and degradation

Page 34: Applied text mining

comprehensive lexicon

Page 35: Applied text mining

CDC2

Page 36: Applied text mining

cyclin dependent kinase 1

Page 37: Applied text mining

orthographic variation

Page 38: Applied text mining

flexible matching

Page 39: Applied text mining

upper- and lower-case

Page 40: Applied text mining

CDC2

Page 41: Applied text mining

Cdc2

Page 42: Applied text mining

spaces and hyphens

Page 43: Applied text mining

cyclin dependent kinase 1

Page 44: Applied text mining

cyclin-dependent kinase 1

Page 45: Applied text mining

name expansions

Page 46: Applied text mining

prefixes and postfixes

Page 47: Applied text mining

CDC2

Page 48: Applied text mining

hCDC2

Page 49: Applied text mining

“black list”

Page 50: Applied text mining

SDS
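
The matching rules on the preceding slides can be sketched as a normalization step plus a black list. This is an illustration under stated assumptions, not the tagger cited on the following slides: the lexicon is a toy, the prefix list is hypothetical, and "SDS" is placed in the lexicon only so that the black list has something to suppress.

```python
import re

LEXICON = {"CDC2", "cyclin dependent kinase 1", "SDS"}  # toy lexicon
BLACKLIST = {"SDS"}        # names judged too ambiguous to tag
PREFIXES = ("h",)          # hypothetical prefix rule, as in hCDC2

def normalize(name):
    """Lowercase and drop spaces and hyphens so orthographic variants collide."""
    return re.sub(r"[\s\-]+", "", name.lower())

NORMALIZED = {normalize(n) for n in LEXICON}
BLOCKED = {normalize(n) for n in BLACKLIST}

def matches(candidate):
    """True if candidate is a lexicon name under flexible matching."""
    key = normalize(candidate)
    if key in BLOCKED:
        return False
    if key in NORMALIZED:
        return True
    # Allow a known prefix in front of a lexicon name, unless that name is blocked.
    for prefix in PREFIXES:
        rest = key[len(prefix):]
        if key.startswith(prefix) and rest in NORMALIZED and rest not in BLOCKED:
            return True
    return False

for name in ["Cdc2", "cyclin-dependent kinase 1", "hCDC2", "SDS"]:
    print(name, matches(name))
```

Normalizing both the lexicon and the text once keeps each lookup a simple set membership test, which is one way to keep tagging fast on a large corpus.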

Page 51: Applied text mining

efficient tagger

Page 52: Applied text mining

Pafilis et al., PLOS ONE, 2013

Page 53: Applied text mining

benchmarking

Page 54: Applied text mining

the formal way

Page 55: Applied text mining

manually annotated corpus

Page 56: Applied text mining
Page 57: Applied text mining

precision

Page 58: Applied text mining

recall
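
For reference, precision is the fraction of predicted annotations that are correct, and recall is the fraction of gold-standard annotations that were found. A minimal sketch follows; the annotation sets are invented and `precision_recall` is a name chosen for this example.

```python
def precision_recall(predicted, gold):
    """Compare predicted annotations with a gold standard, both given as sets."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Invented annotations as (document id, entity) pairs.
gold = {(1, "Cdc28"), (1, "Swe1"), (1, "Cdc5"), (1, "Clb2")}
predicted = {(1, "Cdc28"), (1, "Swe1"), (1, "SDS")}
precision, recall = precision_recall(predicted, gold)
print(precision, recall)
```

Here two of three predictions are correct and two of four gold annotations are found. Computing recall is what requires the manually annotated corpus, since it needs the full list of what should have been found.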

Page 59: Applied text mining

much work

Page 60: Applied text mining

the pragmatic way

Page 61: Applied text mining

random sampling

Page 62: Applied text mining
Page 63: Applied text mining

precision

Page 64: Applied text mining

no recall

Page 65: Applied text mining

much less work
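
A sketch of the pragmatic alternative: draw a random sample of the tagger's own hits, have a person judge each one, and report the fraction judged correct. The helper names and the toy hit list are assumptions for illustration.

```python
import random

def sample_for_review(tagged_hits, sample_size, seed=0):
    """Draw a reproducible random sample of tagger hits for manual inspection."""
    rng = random.Random(seed)
    return rng.sample(tagged_hits, min(sample_size, len(tagged_hits)))

def estimated_precision(judgements):
    """Fraction of reviewed hits that a human judged correct."""
    return sum(judgements) / len(judgements)

# Toy hits as (document id, matched name) pairs, invented for illustration.
hits = [(i, "CDC2") for i in range(1000)]
to_review = sample_for_review(hits, 100)

# Manual judgements would go here; these four are placeholders.
judgements = [True, True, False, True]
print(len(to_review), estimated_precision(judgements))
```

Because the sample is drawn only from what the tagger found, mentions it missed never show up, which is why this approach yields precision but no recall.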

Page 66: Applied text mining

augmented browsing

Page 67: Applied text mining

Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1

and this modification served as a priming step to promote subsequent

Cdc5-dependent Swe1 hyperphosphorylation and degradation

Page 68: Applied text mining

Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1

and this modification served as a priming step to promote subsequent

Cdc5-dependent Swe1 hyperphosphorylation and degradation

Page 69: Applied text mining

Reflect

Page 70: Applied text mining

reflect.ws

Page 71: Applied text mining

information extraction

Page 72: Applied text mining

formalize the facts

Page 73: Applied text mining

Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1

and this modification served as a priming step to promote subsequent

Cdc5-dependent Swe1 hyperphosphorylation and degradation

Page 74: Applied text mining

two approaches

Page 75: Applied text mining

the formal way

Page 76: Applied text mining

NLP: Natural Language Processing

Page 77: Applied text mining

grammatical analysis

Page 78: Applied text mining

part-of-speech tagging

Page 79: Applied text mining

multiword detection

Page 80: Applied text mining

semantic tagging

Page 81: Applied text mining

sentence parsing

Page 82: Applied text mining

Gene and protein names
Cue words for entity recognition
Verbs for relation extraction

[nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]

Page 83: Applied text mining

extract stated facts

Page 84: Applied text mining

high precision

Page 85: Applied text mining

poor recall

Page 86: Applied text mining

the pragmatic way

Page 87: Applied text mining

guilt by association

Page 88: Applied text mining
Page 89: Applied text mining

co-mentioning

Page 90: Applied text mining

counting

Page 91: Applied text mining

within documents

Page 92: Applied text mining

within paragraphs

Page 93: Applied text mining

within sentences

Page 94: Applied text mining

quality score
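
One way to turn co-mention counts into a score is to weight co-occurrence in the same sentence more than in the same paragraph, and that more than in the same document. The sketch below assumes entities are already tagged; the weights and the nested-list input layout are hypothetical, and this is not presented as the scoring scheme of any specific resource.

```python
from collections import Counter
from itertools import combinations

# Hypothetical weights: closer co-mentions count for more.
WEIGHTS = {"document": 1.0, "paragraph": 2.0, "sentence": 3.0}

def pairs(entities):
    """All unordered entity pairs, in a canonical (sorted) order."""
    return combinations(sorted(entities), 2)

def comention_scores(documents):
    """documents -> paragraphs -> sentences -> set of entity names."""
    scores = Counter()
    for paragraphs in documents:
        doc_entities = set()
        for sentences in paragraphs:
            par_entities = set()
            for sentence in sentences:
                for pair in pairs(sentence):
                    scores[pair] += WEIGHTS["sentence"]
                par_entities |= sentence
            for pair in pairs(par_entities):
                scores[pair] += WEIGHTS["paragraph"]
            doc_entities |= par_entities
        for pair in pairs(doc_entities):
            scores[pair] += WEIGHTS["document"]
    return scores

documents = [
    [                                     # one document
        [{"Cdc28", "Swe1"}, {"Cdc5"}],    # paragraph 1: two sentences
        [{"Clb2"}],                       # paragraph 2: one sentence
    ],
]
scores = comention_scores(documents)
print(scores.most_common(3))
```

In this sketch a pair sharing a sentence also collects the paragraph and document weights, so the levels add up rather than exclude each other. The result is undirected and says nothing about the type of association.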

Page 95: Applied text mining
Page 96: Applied text mining
Page 97: Applied text mining

high recall

Page 98: Applied text mining

high precision

Page 99: Applied text mining

undirected associations

Page 100: Applied text mining

unknown type

Page 101: Applied text mining

text/data integration

Page 102: Applied text mining

STRING

Page 103: Applied text mining

protein associations

Page 104: Applied text mining

string-db.org

Page 105: Applied text mining

STITCH

Page 106: Applied text mining

STRING + 300k chemicals

Page 107: Applied text mining

stitch-db.org

Page 108: Applied text mining

COMPARTMENTS

Page 109: Applied text mining

subcellular localization

Page 110: Applied text mining

compartments.jensenlab.org

Page 111: Applied text mining

TISSUES

Page 112: Applied text mining

tissue expression

Page 113: Applied text mining

tissues.jensenlab.org

Page 114: Applied text mining

DISEASES

Page 115: Applied text mining

disease–gene associations

Page 116: Applied text mining

diseases.jensenlab.org

Page 117: Applied text mining

curated knowledge

Page 118: Applied text mining

pathways

Page 119: Applied text mining

Letunic & Bork, Trends in Biochemical Sciences, 2008

Page 120: Applied text mining

experimental data

Page 121: Applied text mining

gene expression

Page 122: Applied text mining
Page 123: Applied text mining

computational predictions

Page 124: Applied text mining

gene neighborhood

Page 125: Applied text mining

Korbel et al., Nature Biotechnology, 2004

Page 126: Applied text mining

many databases

Page 127: Applied text mining

different formats

Page 128: Applied text mining

different identifiers

Page 129: Applied text mining

variable quality

Page 130: Applied text mining

not comparable

Page 131: Applied text mining

hard work

Page 132: Applied text mining

common identifiers
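
A minimal sketch of mapping records from different sources onto common identifiers before combining them. The alias table, the placeholder identifiers "P1" and "P2", and the record layout are all invented for illustration; they are not real accessions or a real database schema.

```python
# Hypothetical alias table; "P1" and "P2" are placeholders, not real accessions.
ALIAS_TO_COMMON = {
    "cdc28": "P1", "cdc28p": "P1",
    "swe1": "P2", "swe1p": "P2",
}

def to_common(records):
    """Rewrite (name_a, name_b, score) records onto common identifiers.

    Records with a name that cannot be mapped are dropped.
    """
    mapped = []
    for name_a, name_b, score in records:
        a = ALIAS_TO_COMMON.get(name_a.lower())
        b = ALIAS_TO_COMMON.get(name_b.lower())
        if a and b:
            mapped.append((a, b, score))
    return mapped

source_a = [("CDC28", "SWE1", 0.9)]
source_b = [("Cdc28p", "Swe1p", 0.4), ("Cdc28p", "Unknown1", 0.7)]
merged = to_common(source_a + source_b)
print(merged)
```

After mapping, the two sources describe the same pair, but their raw scores are still on different scales, which is where quality scores and calibration come in.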

Page 133: Applied text mining

quality scores

Page 134: Applied text mining

score calibration

Page 135: Applied text mining

visualization

Page 136: Applied text mining

web interfaces

Page 137: Applied text mining

bulk download

Page 138: Applied text mining

why so many resources?

Page 139: Applied text mining

Swiss army knife syndrome

Page 140: Applied text mining
Page 141: Applied text mining

medical text mining

Page 142: Applied text mining

electronic health records

Page 143: Applied text mining
Page 144: Applied text mining

opt-out

Page 145: Applied text mining

opt-in

Page 146: Applied text mining

structured data

Page 147: Applied text mining

Jensen et al., Nature Reviews Genetics, 2012

Page 148: Applied text mining

unstructured data

Page 149: Applied text mining

clinical narrative

Page 150: Applied text mining
Page 151: Applied text mining

Danish

Page 152: Applied text mining

busy doctors

Page 153: Applied text mining

psychiatric patients

Page 154: Applied text mining

named entity recognition

Page 155: Applied text mining

custom dictionaries

Page 156: Applied text mining

diseases

Page 157: Applied text mining

drugs

Page 158: Applied text mining

adverse events

Page 159: Applied text mining

expansion rules

Page 160: Applied text mining

phonetic spelling

Page 161: Applied text mining

typos
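
A sketch of tolerating typos when matching against a custom dictionary, using approximate string matching from Python's standard `difflib` module. The dictionary entries are invented English stand-ins, the cutoff is an arbitrary choice, and phonetic spelling rules are not covered here.

```python
import difflib

# Invented English stand-ins for a custom dictionary.
DICTIONARY = ["headache", "nausea", "dizziness"]

def match_with_typos(token, cutoff=0.8):
    """Return the closest dictionary term, or None if nothing is similar enough."""
    found = difflib.get_close_matches(token.lower(), DICTIONARY, n=1, cutoff=cutoff)
    return found[0] if found else None

for token in ["headach", "Headache", "aspirin"]:
    print(token, "->", match_with_typos(token))
```

A lower cutoff catches more misspellings but also produces more false matches, so the threshold trades recall against precision.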

Page 162: Applied text mining

sentence filters

Page 163: Applied text mining

negations

Page 164: Applied text mining

family members

Page 165: Applied text mining

delusions
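
The sentence filters above can be sketched as cue-word checks: a sentence is discarded if it contains a negation cue or mentions a family member, so that the tagged terms are not wrongly attributed to the patient. The cue lists are hypothetical English examples (the slides concern Danish text), and filtering delusional statements is not attempted here.

```python
import re

NEGATION_CUES = ("no", "not", "denies", "without")        # hypothetical cue words
FAMILY_CUES = ("mother", "father", "sister", "brother")   # hypothetical cue words

def has_cue(sentence, cues):
    """True if any cue occurs as a whole word in the sentence."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    return any(cue in words for cue in cues)

def keep_sentence(sentence):
    """Keep a sentence only if it has no negation or family-member cue."""
    return not (has_cue(sentence, NEGATION_CUES) or has_cue(sentence, FAMILY_CUES))

notes = [
    "Patient reports severe headache.",
    "Patient denies nausea.",
    "Mother has diabetes.",
]
kept = [s for s in notes if keep_sentence(s)]
print(kept)
```

Dropping the whole sentence is a blunt rule: it loses some valid mentions in exchange for fewer false ones.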

Page 166: Applied text mining

detailed disease profiles

Page 167: Applied text mining

Roque et al., PLOS Computational Biology, 2011

[Figure: assigned codes vs. text-mined codes; the counts were fused in extraction as 3262638254947]

Page 168: Applied text mining

comorbidity

Page 169: Applied text mining

Roque et al., PLOS Computational Biology, 2011

Page 170: Applied text mining

patient stratification

Page 171: Applied text mining

Roque et al., PLOS Computational Biology, 2011

Page 172: Applied text mining

pharmacovigilance

Page 173: Applied text mining

structured medication data

Page 174: Applied text mining

text-mined adverse events

Page 175: Applied text mining

Eriksson et al., submitted, 2013

Page 176: Applied text mining

EMBO Practical Course, Computational Biology: Genomes to Systems, Puerto Varas, 3-9 April 2014

Thank you!
