Strategies towards improving the utility of scientific big data Evan Bolton, PhD National Center for Biotechnology Information (NCBI) National Library of Medicine (NLM) National Institutes of Health (NIH) Sep. 4, 2014


Page 1: Strategies towards improving the utility of scientific big  data Evan Bolton, PhD

Strategies towards improving the utility of scientific big data

Evan Bolton, PhD
National Center for Biotechnology Information (NCBI)
National Library of Medicine (NLM)
National Institutes of Health (NIH)

Sep. 4, 2014

Page 2: Strategies towards improving the utility of scientific big  data Evan Bolton, PhD

http://www.nlm.nih.gov/

Page 3: Strategies towards improving the utility of scientific big  data Evan Bolton, PhD

U.S. National Center for Biotechnology Information

https://www.ncbi.nlm.nih.gov/

Page 4: Strategies towards improving the utility of scientific big  data Evan Bolton, PhD

https://pubchem.ncbi.nlm.nih.gov/

PubChem website

Page 5: Strategies towards improving the utility of scientific big  data Evan Bolton, PhD

PubChem primary goal

… to be an on-line resource providing comprehensive information on the biological activities of substances

where “substance” means any biologically testable entity

Small molecules, RNAs, carbohydrates, peptides, plant extracts, etc.

Page 6: Strategies towards improving the utility of scientific big  data Evan Bolton, PhD

PubChem data growth over ten years

[Charts: Contributors; Chemicals; Biological Assays; Bioactivity Results; Tested Chemicals; Protein Targets]

+280 substance contributors, +60 assay contributors, +150M substances, +50M compounds, +1.0M bioassays, +6.1T protein targets, +2.9M tested substances, +2.0M tested compounds, +225M bioactivity result sets

[M=millions, T=thousands, MLP = Molecular Libraries Program]

Page 7: Strategies towards improving the utility of scientific big  data Evan Bolton, PhD

CAVEAT! All data has “errors”

Page 8: Strategies towards improving the utility of scientific big  data Evan Bolton, PhD

Big data has “big errors”

Hypothetical

If your average data error rate is 1 in 1,000,000, you have 99.9999% data accuracy

If you have one trillion facts (10^12), can you accept one million errors (10^6)?
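
The hypothetical can be checked with a few lines of arithmetic. This is only an illustrative sketch; the function names are invented here, and exact fractions are used so that no rounding creeps into the result.

```python
from fractions import Fraction


def expected_errors(fact_count: int, error_rate: Fraction) -> Fraction:
    """Expected number of wrong facts at a given average error rate."""
    return fact_count * error_rate


def accuracy_percent(error_rate: Fraction) -> Fraction:
    """Accuracy as a percentage, i.e. (1 - error rate) * 100."""
    return (1 - error_rate) * 100


rate = Fraction(1, 1_000_000)       # 1 error per million facts
facts = 10**12                      # one trillion facts
errors = expected_errors(facts, rate)
accuracy = accuracy_percent(rate)

print(f"accuracy: {float(accuracy):.4f}%")
print(f"expected errors: {int(errors):,}")
```

Even at this error rate, a trillion-fact collection is expected to carry about a million wrong facts.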

Strategies to mitigate errors?

Manual curation has its limits (accuracy, cost, time)

So .. what do you do?

Page 9: Strategies towards improving the utility of scientific big  data Evan Bolton, PhD

Error suppression strategies for scientific big data

1. Identify quality {un}known known/unknowns
   use to formulate an error suppression strategy

2. Perform data normalization
   improves utility by helping to refine identification

3. “Trust but verify”
   cross compare authoritative and curated data

4. Consistency filtering
   improves precision by removal of outliers

5. Address error feedback loops
   use “is”, “can be”, and, if all else fails, “is not” lists

Page 10: Strategies towards improving the utility of scientific big  data Evan Bolton, PhD

Error suppression strategies for scientific big data

1. Identify quality {un}known known/unknowns
   use to formulate an error suppression strategy

“there are known knowns; there are things that we know that we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns, the ones we don't know we don't know”

Feb. 2002 news briefing

Image credit: http://en.wikipedia.org/wiki/Donald_Rumsfeld

Tautomers and resonance forms of same chemical structure are prolific

(+)-Iridodial
Defense chemicals from abdominal glands of 13 rove beetle species of subtribe Staphylinina

Ring Closed / Ring Open

Salt-form drawing variations are common

Chemical meaning of a substance may change upon context

Page 11: Strategies towards improving the utility of scientific big  data Evan Bolton, PhD

Error suppression strategies for scientific big data

2. Perform data normalization
   improves utility by helping to refine identification

• Verify chemical content
  – Atoms defined/real
  – Implicit hydrogen
  – Functional group
  – Atom valence sanity

• Normalize representation
  – Tautomer invariance
  – Aromaticity detection
  – Stereochemistry
  – Explicit hydrogen

• Calculate
  – Coordinates
  – Properties
  – Descriptors

• Detect components
  – Isolate covalent units
  – Neutralize (+/- proton)
  – Reprocess
  – Detect unique
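
Two of the steps above, "Atom valence sanity" and "Isolate covalent units", can be sketched on a toy molecule representation. This is not PubChem's pipeline: the atom/bond encoding, the `MAX_VALENCE` table, and the function names are all assumptions made for illustration, and hydrogens are left implicit.

```python
from collections import defaultdict

# Assumed maximum bond-order sums for a few elements (illustrative only).
MAX_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2, "Na": 1, "Cl": 1}


def covalent_units(atom_count, bonds):
    """Group atom indices into connected components (covalent units).

    bonds is a list of (atom_a, atom_b, bond_order) tuples.
    """
    parent = list(range(atom_count))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b, _order in bonds:
        parent[find(a)] = find(b)

    units = defaultdict(list)
    for i in range(atom_count):
        units[find(i)].append(i)
    return sorted(units.values())


def valence_problems(atoms, bonds):
    """Return indices of atoms whose summed bond order exceeds MAX_VALENCE."""
    used = [0] * len(atoms)
    for a, b, order in bonds:
        used[a] += order
        used[b] += order
    return [i for i, el in enumerate(atoms) if used[i] > MAX_VALENCE[el]]


# Sodium acetate drawn as two units: the acetate heavy atoms plus a lone Na.
salt_atoms = ["C", "C", "O", "O", "Na"]
salt_bonds = [(0, 1, 1), (1, 2, 2), (1, 3, 1)]
salt_units = covalent_units(len(salt_atoms), salt_bonds)
salt_issues = valence_problems(salt_atoms, salt_bonds)

# A carbon drawn with five single bonds should be flagged.
bad_atoms = ["C", "H", "H", "H", "H", "H"]
bad_bonds = [(0, i, 1) for i in range(1, 6)]
bad_issues = valence_problems(bad_atoms, bad_bonds)
```

A production workflow would also handle charges, aromaticity, and tautomers, which this sketch deliberately ignores.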

Page 12: Strategies towards improving the utility of scientific big  data Evan Bolton, PhD

Error suppression strategies for scientific big data

3. “Trust but verify”
   cross compare authoritative and curated data

Доверяй, но проверяй (doveryai, no proveryai)
Russian proverb used extensively by Ronald Reagan when discussing relations with the Soviet Union

Image credit: http://en.wikipedia.org/wiki/Ronald_Reagan

or

John Kerry’s more recent adaptation of the phrase when discussing Syria’s chemical weapons disposal: “Verify and verify”

Image credit: http://en.wikipedia.org/wiki/John_Kerry

Cross concept count %

         CTD     HDO     KEG     MED     NDF     ORD
CTD    100.0    14.3    79.1    40.7    49.7    35.8
HDO     26.0   100.0    38.7    52.4    48.3    26.2
KEG     24.8     6.7   100.0    10.7     6.4    25.2
MED     97.2    68.9    81.6   100.0    93.8    79.6
NDF     30.4    16.3    12.5    24.0   100.0    10.8
ORD     31.9    12.8    71.6    29.7    15.7   100.0

Cross-reference overlaps between various disease resources: Human Disease Ontology (HDO), NCBI MedGen (MED), CTD MEDIC (CTD), KEGG Disease (KEG), NDF-RT (NDF), and OrphaNet (ORD) using NLM Medical Subject Headings (MeSH) as the basis of comparison.
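
A cross-comparison table of this kind can be sketched as set overlaps on a shared identifier. The slide does not say which resource is the denominator of each percentage, so the direction used here (share of the column resource's identifiers that also appear in the row resource) is an assumption. The resource names and identifiers in the toy data are hypothetical.

```python
def cross_overlap_percent(resources):
    """Return {row: {col: percent}} for a dict of resource -> set of shared IDs.

    Assumed direction: percent of the column resource's IDs that are
    also present in the row resource.
    """
    table = {}
    for row, row_ids in resources.items():
        table[row] = {}
        for col, col_ids in resources.items():
            if not col_ids:
                table[row][col] = 0.0
                continue
            shared = len(row_ids & col_ids)
            table[row][col] = round(100.0 * shared / len(col_ids), 1)
    return table


# Hypothetical resources mapped to a common vocabulary.
toy = {
    "A": {"id1", "id2", "id3", "id4"},
    "B": {"id2", "id3"},
    "C": {"id3", "id5"},
}
overlap = cross_overlap_percent(toy)

for row in toy:
    print(row, [overlap[row][col] for col in toy])
```

Low off-diagonal values point at concepts that only one resource asserts, which are natural candidates for closer review.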

Page 13: Strategies towards improving the utility of scientific big  data Evan Bolton, PhD

Error suppression strategies for scientific big data

4. Consistency filtering
   improves precision by removal of outliers

Keep consensus, remove the rest

Image credit: http://withfriendship.com/images/c/11229/Accuracy-and-precision-picture.png

[Figure: Histogram of Fate of CID-MNID Pairs. Categories: Original, Total, Added, Removed, Same. Series: Many votes, 70%; Many votes, 60%; One Vote, 70%; One Vote, 60%. Axis ticks up to 120,000.]

[Figure: Histogram of MNIDs per CID. Series: Original; Many votes, 70%; Many votes, 60%; One Vote, 70%; One Vote, 60%. Axis ticks from 1 to 1,000,000 in powers of ten.]
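
"Keep consensus, remove the rest" can be sketched as a vote-share filter. The slide does not describe how votes were counted, so everything here is an assumption: each data source asserts (CID, MNID) pairs, and a pair is kept when its share of the votes cast for that CID reaches a threshold such as 70% or 60%. The `one_vote_per_source` flag is only one possible reading of the "One Vote" and "Many votes" series.

```python
from collections import Counter, defaultdict


def consistency_filter(assertions, threshold=0.7, one_vote_per_source=True):
    """Keep (cid, mnid) pairs whose vote share for that cid >= threshold.

    assertions: iterable of (source, cid, mnid) tuples.
    one_vote_per_source: if True, repeated identical assertions from the
    same source count once; if False, every assertion counts.
    """
    items = list(assertions)
    if one_vote_per_source:
        items = set(items)

    votes = defaultdict(Counter)
    for _source, cid, mnid in items:
        votes[cid][mnid] += 1

    kept = set()
    for cid, counter in votes.items():
        total = sum(counter.values())
        for mnid, count in counter.items():
            if count / total >= threshold:
                kept.add((cid, mnid))
    return kept


# Hypothetical assertions: three sources agree on CID 1, one disagrees
# (and repeats itself); two sources split evenly on CID 2.
toy_assertions = [
    ("s1", 1, "a"), ("s2", 1, "a"), ("s3", 1, "a"),
    ("s4", 1, "b"), ("s4", 1, "b"), ("s4", 1, "b"),
    ("s1", 2, "c"), ("s2", 2, "d"),
]
kept_one_vote = consistency_filter(toy_assertions, 0.7, True)
kept_many_votes = consistency_filter(toy_assertions, 0.7, False)
```

With one vote per source the majority pair for CID 1 survives; when every repeated assertion counts, the loud dissenting source dilutes the consensus and nothing passes the 70% bar.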

Page 14: Strategies towards improving the utility of scientific big  data Evan Bolton, PhD

Error suppression strategies for scientific big data

5. Address error feedback loops
   use “is”, “can be”, and, if all else fails, “is not” lists

Prevent error proliferation at the data source, when possible
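
One way to read the “is”, “can be”, and “is not” lists is as three curated lookups consulted in that order. The sketch below is an assumption about how such lists might be applied, not a description of PubChem's implementation; the verdict labels and the example pairs are hypothetical.

```python
def judge(pair, is_list, can_be_list, is_not_list):
    """Classify a (name, record_id) association against curated lists.

    Lists are checked in the order given on the slide: "is", then
    "can be", and only then "is not".
    """
    if pair in is_list:
        return "accept"
    if pair in can_be_list:
        return "accept in context"
    if pair in is_not_list:
        return "reject"
    return "unreviewed"


# Hypothetical curated lists of (name, record_id) pairs.
IS = {("name-A", 101)}
CAN_BE = {("name-A", 102)}
IS_NOT = {("name-A", 999)}

verdicts = {
    rid: judge(("name-A", rid), IS, CAN_BE, IS_NOT)
    for rid in (101, 102, 999, 500)
}
```

Rejected pairs can be blocked before they are redistributed, which is what stops a known error from being re-imported through downstream copies of the data.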

Page 15: Strategies towards improving the utility of scientific big  data Evan Bolton, PhD

Error suppression strategies for scientific big data

1. Identify quality {un}known known/unknowns
   use to formulate an error suppression strategy

2. Perform data normalization
   improves utility by helping to refine identification

3. “Trust but verify”
   cross compare authoritative and curated data

4. Consistency filtering
   improves precision by removal of outliers

5. Address error feedback loops
   use “is”, “can be”, and, if all else fails, “is not” lists

Page 16: Strategies towards improving the utility of scientific big  data Evan Bolton, PhD

Okay … now what?

… you have cleaned up your data … but it is huge, unwieldy, unstructured

How can it be made more useful?

Page 17: Strategies towards improving the utility of scientific big  data Evan Bolton, PhD

Data organization strategies for scientific big data

1. Crosslink and annotate data
   provides context and identifies associated concepts

2. Establish similarity schemes
   enables identification of related records

3. Associate to concept hierarchies
   improves navigation between related records

4. Perform data reduction
   suppresses “redundant” information

5. Be succinct
   simplifies presentation by hiding details

Page 18: Strategies towards improving the utility of scientific big  data Evan Bolton, PhD

Data organization strategies for scientific big data

1. Crosslink and annotate data
   provides context and identifies associated concepts

[Diagram: record types Compound, Substance, Protein, Gene, Drug, Publication, Patent, Disease, and Pathway, linked by relationships labeled cites, inhibit, encode, ingredient, treat, associates, and participates]
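
Crosslinks between record types can be sketched as a small labeled graph. The extracted slide text does not preserve which record types each relationship connects, so the specific links below are assumptions, and the class name and record identifiers are hypothetical.

```python
from collections import defaultdict


class CrosslinkIndex:
    """Stores labeled links between records and reports a record's context."""

    def __init__(self):
        self._outgoing = defaultdict(set)
        self._incoming = defaultdict(set)

    def link(self, source, relation, target):
        """Record that `source` is related to `target` by `relation`."""
        self._outgoing[source].add((relation, target))
        self._incoming[target].add((relation, source))

    def context(self, record):
        """Return all links that leave or reach the given record."""
        return {
            "outgoing": sorted(self._outgoing.get(record, ())),
            "incoming": sorted(self._incoming.get(record, ())),
        }


index = CrosslinkIndex()
# Assumed link directions; records are (type, hypothetical id) pairs.
index.link(("Compound", "c1"), "inhibit", ("Protein", "p1"))
index.link(("Gene", "g1"), "encode", ("Protein", "p1"))
index.link(("Publication", "pub1"), "cites", ("Compound", "c1"))

protein_context = index.context(("Protein", "p1"))
compound_context = index.context(("Compound", "c1"))
```

Looking up a protein then surfaces both the gene that encodes it and the compounds reported to inhibit it, which is the kind of context crosslinking is meant to provide.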

Page 20: Strategies towards improving the utility of scientific big  data Evan Bolton, PhD

Data organization strategies for scientific big data

3. Associate to concept hierarchies
   improves navigation between related records

[Diagram: records (chemical, protein, gene, patent, publication, pathway, …) are matched to concepts in an independent hierarchy, giving organized records]
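
Matching records to an independent hierarchy can be sketched with a parent map. The hierarchy, concept names, and record identifiers below are hypothetical and chosen only to show how navigation between related records falls out of the structure.

```python
# Hypothetical concept hierarchy: child concept -> parent concept.
PARENT = {
    "statin": "lipid-lowering agent",
    "fibrate": "lipid-lowering agent",
    "lipid-lowering agent": "drug",
    "drug": None,
}


def ancestors(concept, parent):
    """Return the concept followed by all of its ancestors, nearest first."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = parent.get(concept)
    return chain


def records_under(concept, record_concept, parent):
    """Return records matched to `concept` or to any concept beneath it."""
    return sorted(
        record
        for record, matched in record_concept.items()
        if concept in ancestors(matched, parent)
    )


# Hypothetical records, each matched to one concept.
record_concept = {"rec1": "statin", "rec2": "fibrate", "rec3": "drug"}

statin_chain = ancestors("statin", PARENT)
lipid_records = records_under("lipid-lowering agent", record_concept, PARENT)
drug_records = records_under("drug", record_concept, PARENT)
```

Because the hierarchy is independent of the records, the same tree can organize chemicals, proteins, patents, or publications once each is matched to a concept.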

Page 21: Strategies towards improving the utility of scientific big  data Evan Bolton, PhD

Data organization strategies for scientific big data

4. Perform data reduction
   suppresses “redundant” information

5. Be succinct
   simplifies presentation by hiding details

“subject-predicate-object”
“atorvastatin may treat hypercholesterolemia”

subject / predicate / object

Evidence citation (PMID)

From whom? (Data Source)

Provenance information
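
The subject-predicate-object form with provenance can be sketched as a small record type plus a reduction step that merges repeated statements. The field names, the source names, and the placeholder citation strings are assumptions for illustration; they are not real PMIDs or a PubChem schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    subject: str
    predicate: str
    obj: str
    source: str   # from whom (data source)
    pmid: str     # evidence citation; placeholder values below


def reduce_statements(statements):
    """Merge statements with the same triple, pooling their provenance."""
    merged = defaultdict(lambda: {"sources": set(), "pmids": set()})
    for s in statements:
        key = (s.subject, s.predicate, s.obj)
        merged[key]["sources"].add(s.source)
        merged[key]["pmids"].add(s.pmid)
    return dict(merged)


def succinct(triple):
    """Render a triple as a short sentence, hiding provenance details."""
    return " ".join(triple)


raw = [
    Statement("atorvastatin", "may treat", "hypercholesterolemia", "source-1", "PMID-A"),
    Statement("atorvastatin", "may treat", "hypercholesterolemia", "source-2", "PMID-B"),
    Statement("atorvastatin", "may treat", "hypercholesterolemia", "source-1", "PMID-A"),
]
reduced = reduce_statements(raw)
triple = ("atorvastatin", "may treat", "hypercholesterolemia")
sentence = succinct(triple)
```

Three raw rows collapse to one displayed sentence, while the pooled sources and citations remain available for anyone who wants the details.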

Page 22: Strategies towards improving the utility of scientific big  data Evan Bolton, PhD

Data organization strategies for scientific big data

1. Crosslink and annotate data
   provides context and identifies associated concepts

2. Establish similarity schemes
   enables identification of related records

3. Associate to concept hierarchies
   improves navigation between related records

4. Perform data reduction
   suppresses “redundant” information

5. Be succinct
   simplifies presentation by hiding details

Page 23: Strategies towards improving the utility of scientific big  data Evan Bolton, PhD

Concluding remarks

Scientific “big data” …

… contains an amazing amount of information

… provides opportunities to make discoveries

… benefits from strategies to massage it

PubChem is doing its part …

… making chemical substance data broadly accessible

… cross-integrating it to key scientific resources

… suppressing errors and their propagation

… organizing the data and making it available

https://pubchem.ncbi.nlm.nih.gov

Page 24: Strategies towards improving the utility of scientific big  data Evan Bolton, PhD

PubChem Crew …

Steve Bryant

Tiejun Chen

Gang Fu

Lewis Geer

Renata Geer

Asta Gindulyte

Volker Hahnke

Lianyi Han

Jane He

Siqian He

Sunghwan Kim

Ben Shoemaker

Paul Thiessen

Jiyao Wang

Yanli Wang

Bo Yu

Jian Zhang

Special thanks to the NCBI Help Desk, especially Rana Morris

Page 25: Strategies towards improving the utility of scientific big  data Evan Bolton, PhD

Any questions?

If you think of one later, email me:

[email protected]