Applied text mining

>10 km

too much to read

exponential growth

~40 seconds per paper

computer

as smart as a dog

teach it specific tricks

information retrieval

named entity recognition

information extraction

text/data integration

medical text mining

information retrieval

find the relevant papers

ad hoc retrieval

user-specified query

“yeast AND cell cycle”

PubMed

indexing

fast lookup

stemming

word endings

dynamic query expansion

MeSH terms
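
The indexing and stemming steps above can be sketched in a few lines. This is a toy illustration, not how PubMed is implemented: the suffix-stripping `stem` function is a crude stand-in for a real stemmer, the three documents are made up, and the query is treated as a plain AND of its words rather than as a phrase search.

```python
import re
from collections import defaultdict

def stem(word):
    """Crude suffix stripping (assumption: a stand-in for a real stemmer)."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    """Lower-case, split into alphanumeric words, and stem each word."""
    return [stem(w) for w in re.findall(r"[a-z0-9]+", text.lower())]

def build_index(docs):
    """Inverted index: stemmed word -> set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

def search_and(index, query):
    """Return the documents that contain every query word (AND is dropped)."""
    terms = [t for t in tokenize(query) if t != "and"]
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Made-up documents for illustration only.
docs = {
    1: "Yeast cell cycle regulation by cyclins",
    2: "Cell cycles in human cells",
    3: "Yeast metabolism",
}
index = build_index(docs)
print(search_and(index, "yeast AND cell cycle"))
```

Because both the documents and the query pass through the same stemmer, "cycles" in document 2 is found by a query for "cycle". Each lookup is a dictionary access followed by set intersections, which is what makes retrieval fast.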

Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1

and this modification served as a priming step to promote subsequent

Cdc5-dependent Swe1 hyperphosphorylation and degradation

no tool will find that

named entity recognition

identify the concepts

Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1

and this modification served as a priming step to promote subsequent

Cdc5-dependent Swe1 hyperphosphorylation and degradation

comprehensive lexicon

CDC2

cyclin dependent kinase 1

orthographic variation

flexible matching

upper- and lower-case

CDC2

Cdc2

spaces and hyphens

cyclin dependent kinase 1

cyclin-dependent kinase 1

name expansions

prefixes and postfixes

CDC2

hCDC2

“black list”

SDS

efficient tagger

Pafilis et al., PLOS ONE, 2013
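
A minimal sketch of dictionary-based tagging with flexible matching follows. It is not the tagger described by Pafilis et al.; the lexicon, the identifiers, the single prefix `h`, and the four-word window are all assumptions chosen to mirror the examples on the slides.

```python
import re

# Hypothetical mini-lexicon mapping names to identifiers (illustrative only).
LEXICON = {
    "CDC2": "CDK1",
    "cyclin dependent kinase 1": "CDK1",
    "SDS": "EXAMPLE_GENE",
}
BLACKLIST = {"sds"}   # normalized names that are too ambiguous to tag
PREFIXES = ("h",)     # assumed species prefix, as in hCDC2

def normalize(name):
    """Ignore case, spaces, and hyphens when comparing names."""
    return re.sub(r"[\s\-]+", "", name.lower())

def build_lookup(lexicon, blacklist, prefixes):
    """Expand the lexicon with prefixed variants and drop blacklisted names."""
    lookup = {}
    for name, ident in lexicon.items():
        key = normalize(name)
        if key in blacklist:
            continue
        lookup[key] = ident
        for prefix in prefixes:
            lookup.setdefault(prefix + key, ident)
    return lookup

def tag(text, lookup, max_words=4):
    """Greedy longest-match tagging; returns (start, end, matched text, id)."""
    tokens = [(m.start(), m.end()) for m in re.finditer(r"[A-Za-z0-9]+", text)]
    matches = []
    i = 0
    while i < len(tokens):
        step = 1
        for n in range(min(max_words, len(tokens) - i), 0, -1):
            start, end = tokens[i][0], tokens[i + n - 1][1]
            ident = lookup.get(normalize(text[start:end]))
            if ident is not None:
                matches.append((start, end, text[start:end], ident))
                step = n
                break
        i += step
    return matches

LOOKUP = build_lookup(LEXICON, BLACKLIST, PREFIXES)
for match in tag("hCDC2 and cyclin-dependent kinase 1 were run on SDS gels", LOOKUP):
    print(match)
```

Normalizing both the lexicon and the text handles case, space, and hyphen variation with a single dictionary lookup per candidate span, and the blacklist keeps "SDS" from being tagged even though it is in the lexicon.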

benchmarking

the formal way

manually annotated corpus

precision

recall

much work

the pragmatic way

random sampling

precision

no recall

much less work
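
The two benchmarking routes can be compared in code. The sketch below assumes predictions and gold-standard annotations are sets of comparable items; `is_correct` is a hypothetical callback standing in for a human who inspects each sampled prediction.

```python
import random

def precision_recall(predicted, gold):
    """Formal way: compare predictions with a manually annotated corpus."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

def sampled_precision(predicted, is_correct, sample_size, seed=0):
    """Pragmatic way: judge a random sample of predictions by hand.

    Only precision can be estimated, because nothing tells us what was missed.
    """
    items = sorted(set(predicted))
    if not items:
        return 0.0
    rng = random.Random(seed)
    sample = rng.sample(items, min(sample_size, len(items)))
    return sum(1 for item in sample if is_correct(item)) / len(sample)

# Made-up example: four predicted entities, three annotated ones.
predicted = {"CDC2", "SWE1", "SDS", "PAGE"}
gold = {"CDC2", "SWE1", "CDC5"}
print(precision_recall(predicted, gold))
print(sampled_precision(predicted, lambda name: name in gold, sample_size=4))
```

Recall needs the full set of true annotations, which is why it requires an annotated corpus, whereas precision only needs a verdict on the items that were actually predicted.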

augmented browsing

Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1

and this modification served as a priming step to promote subsequent

Cdc5-dependent Swe1 hyperphosphorylation and degradation

Reflect

reflect.ws

information extraction

formalize the facts

Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1

and this modification served as a priming step to promote subsequent

Cdc5-dependent Swe1 hyperphosphorylation and degradation

two approaches

the formal way

NLP (Natural Language Processing)

grammatical analysis

part-of-speech tagging

multiword detection

semantic tagging

sentence parsing

Gene and protein names

Cue words for entity recognition

Verbs for relation extraction

[nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]

extract stated facts

high precision

poor recall

the pragmatic way

guilt by association

co-mentioning

counting

within documents

within paragraphs

within sentences

quality score

high recall

high precision

undirected associations

unknown type
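
Co-mention counting at the three levels can be sketched as follows. The weights are hypothetical and the scoring is a plain weighted count, not the scheme used by any particular resource; paragraphs are assumed to be separated by blank lines and sentences by terminal punctuation.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical weights: closer co-mentions count for more.
WEIGHTS = {"document": 1.0, "paragraph": 2.0, "sentence": 3.0}

def entities_in(text, names):
    """Names that occur in the text as whole words."""
    return {n for n in names if re.search(r"\b" + re.escape(n) + r"\b", text)}

def comention_scores(documents, names, weights=WEIGHTS):
    """Weighted co-mention count for every unordered pair of names."""
    scores = Counter()
    for doc in documents:
        units = [("document", doc)]
        for paragraph in re.split(r"\n\s*\n", doc):
            if not paragraph.strip():
                continue
            units.append(("paragraph", paragraph))
            for sentence in re.split(r"(?<=[.!?])\s+", paragraph):
                if sentence.strip():
                    units.append(("sentence", sentence))
        for level, text in units:
            for pair in combinations(sorted(entities_in(text, names)), 2):
                scores[pair] += weights[level]
    return scores

# Made-up two-paragraph document for illustration.
documents = ["Clb2 binds Cdc28. Swe1 is degraded.\n\nCdc5 acts later."]
names = ["Clb2", "Cdc28", "Swe1", "Cdc5"]
for pair, score in comention_scores(documents, names).most_common():
    print(pair, score)
```

The pairs are stored in sorted order because the association is undirected, and the score says nothing about the type of relation, only how often and how closely the two names appear together.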

text/data integration

STRING

protein associations

string-db.org

STITCH

STRING + 300k chemicals

stitch-db.org

COMPARTMENTS

subcellular localization

compartments.jensenlab.org

TISSUES

tissue expression

tissues.jensenlab.org

DISEASES

disease–gene associations

diseases.jensenlab.org

curated knowledge

pathways

Letunic & Bork, Trends in Biochemical Sciences, 2008

experimental data

gene expression

computational predictions

gene neighborhood

Korbel et al., Nature Biotechnology, 2004

many databases

different formats

different identifiers

variable quality

not comparable

hard work

common identifiers

quality scores

score calibration

visualization

web interfaces

bulk download
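
One possible reading of score calibration is sketched below: raw scores from a source are ranked, split into bins, and each bin is assigned the fraction of its pairs found in a gold standard, so that scores from different sources end up on the same scale. The binning strategy, the function names, and the example data are assumptions made for illustration, not the procedure used by the resources listed above.

```python
def calibrate(scored_pairs, gold, bin_size):
    """Build a table of (lowest raw score in bin, fraction of bin in gold)."""
    ranked = sorted(scored_pairs, key=lambda item: item[1], reverse=True)
    table = []
    for i in range(0, len(ranked), bin_size):
        chunk = ranked[i:i + bin_size]
        hits = sum(1 for pair, _ in chunk if pair in gold)
        table.append((chunk[-1][1], hits / len(chunk)))
    return table

def calibrated_score(raw, table):
    """Look up the calibrated value for a raw score."""
    for threshold, fraction in table:
        if raw >= threshold:
            return fraction
    # Below every bin: fall back to the lowest bin's value.
    return table[-1][1] if table else 0.0

# Made-up raw scores and gold standard.
scored = [(("A", "B"), 9.0), (("A", "C"), 8.0), (("B", "C"), 2.0), (("C", "D"), 1.0)]
gold = {("A", "B"), ("A", "C"), ("B", "C")}
table = calibrate(scored, gold, bin_size=2)
print(table)
print(calibrated_score(8.5, table), calibrated_score(1.5, table))
```

After calibration, a value of 0.5 means the same thing regardless of which source produced the raw score, which is what makes the sources comparable.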

why so many resources?

Swiss army knife syndrome

medical text mining

electronic health records

opt-out

opt-in

structured data

Jensen et al., Nature Reviews Genetics, 2012

unstructured data

clinical narrative

Danish

busy doctors

psychiatric patients

named entity recognition

custom dictionaries

diseases

drugs

adverse events

expansion rules

phonetic spelling

typos

sentence filters

negations

family members

delusions
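
The sentence filters can be illustrated with a small sketch that discards any sentence containing a cue for negation, a family member, or a delusion before entities are extracted from it. The cue words are hypothetical English examples (the clinical narrative in question is Danish), and dropping the whole sentence is an assumption about how such a filter might be applied.

```python
import re

# Hypothetical English cue words, one pattern per filter.
FILTERS = {
    "negation": re.compile(r"\b(no|not|denies|without)\b", re.IGNORECASE),
    "family": re.compile(r"\b(mother|father|sister|brother|family)\b", re.IGNORECASE),
    "delusion": re.compile(r"\b(believes|imagines|convinced)\b", re.IGNORECASE),
}

def split_sentences(text):
    """Naive sentence splitter on terminal punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def triggered_filters(sentence):
    """Names of the filters whose cue words occur in the sentence."""
    return [name for name, pattern in FILTERS.items() if pattern.search(sentence)]

def reliable_sentences(note):
    """Keep only the sentences that trigger no filter."""
    return [s for s in split_sentences(note) if not triggered_filters(s)]

# Made-up clinical note.
note = ("Patient reports headache. No nausea. "
        "Mother has diabetes. He believes he has cancer.")
print(reliable_sentences(note))
```

Only "Patient reports headache." survives: the other three sentences mention a symptom or disease that the patient does not actually have, so tagging them would produce false positives.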

detailed disease profiles

Roque et al., PLOS Computational Biology, 2011

Assigned codes

Text mined codes

comorbidity

Roque et al., PLOS Computational Biology, 2011

patient stratification

Roque et al., PLOS Computational Biology, 2011

pharmacovigilance

structured medication data

text-mined adverse events

Eriksson et al., submitted, 2013

EMBO Practical Course, Computational Biology: Genomes to Systems, Puerto Varas, 3-9 April 2014

Thank you!
