2015 6 bd2k_biobranch_knowbio


Three TSRI Tools for capturing, sharing, and applying community knowledge

Benjamin Good, The Scripps Research Institute

@bgood

Outline

• Gene wiki: quick recap, update
• Introducing:
  – http://knowledge.bio
  – http://biobranch.org

Gene Wiki (on Wikipedia)


Protein structure

Symbols and identifiers

Tissue expression pattern

Gene Ontology annotations

Links to structured databases

Gene summary

Protein interactions

Linked references

Huss, PLoS Biol, 2008

Wikidata


is a

regulates

Interacts with

Protein

Glycoprotein

Neural development

VLDL receptor

Amyloid precursor protein

Property:P31

Property:P128

Property:P129

Q8054

Q187126

Q1345738

Q1979313

Q423510

Q414043

Reelin

http://www.wikidata.org/wiki/Q414043
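The statements on this slide can be read as subject-property-object triples. A minimal Python sketch follows. The property IDs and labels are taken from the slide; which object belongs to which property is an assumption read off the slide layout, and objects are kept as labels rather than item IDs because the slide does not pair them explicitly.

```python
from collections import defaultdict

# Property IDs and their labels as they appear on the slide.
PROPERTIES = {"P31": "is a", "P128": "regulates", "P129": "interacts with"}

REELIN = "Q414043"  # http://www.wikidata.org/wiki/Q414043

# (subject, property, object label) statements.
# ASSUMPTION: the object-to-property pairing is inferred from the slide layout.
STATEMENTS = [
    (REELIN, "P31", "Protein"),
    (REELIN, "P31", "Glycoprotein"),
    (REELIN, "P128", "Neural development"),
    (REELIN, "P129", "VLDL receptor"),
    (REELIN, "P129", "Amyloid precursor protein"),
]


def group_by_property(statements):
    """Group object labels under the human-readable property label."""
    grouped = defaultdict(list)
    for _subject, prop, obj in statements:
        grouped[PROPERTIES[prop]].append(obj)
    return dict(grouped)


if __name__ == "__main__":
    for label, objects in group_by_property(STATEMENTS).items():
        print(f"Reelin {label}: {', '.join(objects)}")
```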

A computable Gene (& Disease & Drug) Wiki


Structured data

Here now

Soon

Downstream (but exciting potential...)

Wikipedia(s)

Status Update

• Genes and diseases are in Wikidata (drugs are coming any minute)

• Demonstrations of incorporating this content into Wikipedia are functional

• We’ve been slowed a little by Wikidata governance policies (our bot was temporarily blocked)

Wikidata activities

• YOU can help!
• https://www.wikidata.org/wiki/User:ProteinBoxBot

Join in one of these discussions and voice your support

Outline

• Gene wiki
• knowledge.bio
• biobranch.org

Knowledge.bio

• Provides a concept-centric view of the scientific literature
  – You search and interact with concepts rather than documents
• Main purpose is hypothesis generation
• 2 data sources mined from PubMed
  – 70 million explicit semantic relations (‘triples’)
  – 200 million implicit gene-disease associations

http://knowledge.bio

Explicit relations view

Search for concept

View related concepts

(67 results)

Filter results

View text where triple was extracted

Diseases implicitly related to queried concept: CYP2R1

Concepts linking CYP2R1 to Smith-Lemli Opitz Syndrome

Table views complemented by a Network view for taking notes..

Network (“Map”) view

Cytoscape.js canvas

Auto and manual layout

Save Map as local text file

Load saved map

Step 1: find candidate relation

What new diseases might be related to CYP2R1?

Implicit prediction

Step 2: find linking concepts

How is CYP2R1 related to SLO syndrome?

Step 3: Start building a hypothesis to explain the predicted relation

Do CYP2R1 and DHCR7 participate in a process related to SLO syndrome?
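Step 2 can be sketched as a search for concepts that are related to both the query and the candidate disease. The talk does not say how knowledge.bio computes this, so the Python below is only an illustration of the idea; `linking_concepts` is a hypothetical helper, and the triples are placeholders with a made-up `related_to` predicate, not relations extracted from PubMed.

```python
def linking_concepts(triples, source, target):
    """Return concepts directly related to both source and target."""
    neighbors = {}
    for subj, _pred, obj in triples:
        # Treat every relation as undirected for the purpose of linking.
        neighbors.setdefault(subj, set()).add(obj)
        neighbors.setdefault(obj, set()).add(subj)
    shared = neighbors.get(source, set()) & neighbors.get(target, set())
    return sorted(shared - {source, target})


# Placeholder triples for illustration only (ASSUMPTION, not real data).
TOY_TRIPLES = [
    ("CYP2R1", "related_to", "DHCR7"),
    ("DHCR7", "related_to", "SLO syndrome"),
    ("CYP2R1", "related_to", "concept_X"),
]

if __name__ == "__main__":
    print(linking_concepts(TOY_TRIPLES, "CYP2R1", "SLO syndrome"))
```

Each linking concept returned this way is a starting point for Step 3, not evidence on its own.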

Explicit relations view

Warning, may prove addictive..

Next steps for knowledge.bio

• Enhanced community sharing
• Integration with http://ndexbio.org from the Cytoscape consortium
• Allow user actions to feed back into underlying NLP systems
• Include access to other structured knowledge sources, e.g. Gene Ontology

Outline

• Gene wiki
• knowledge.bio
• biobranch.org

Breast cancer prognosis: 10 year survival?

find patterns

Inferring class predictors

No

van't Veer, Laura J., et al. "Gene expression profiling predicts clinical outcome of breast cancer." Nature 415.6871 (2002): 530-536.

Yes

make predictions on new samples

No

Yes

10 year survival?

find patterns make predictions

inferring survival predictors

1) select genes

2) infer predictor from data (e.g. decision tree, SVM, etc.)

Out of the 25,000+ genes, which small set works together the best?
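The two steps can be sketched in plain Python. This is a toy illustration under stated assumptions, not the method used in any of the cited studies: genes are ranked by the absolute difference in class means, and the predictor is a one-split decision stump. The gene names `g1` and `g2` and all values are made up.

```python
from statistics import mean


def select_genes(samples, labels, k):
    """Step 1: rank genes by absolute difference in class means, keep top k."""
    def score(gene):
        pos = [s[gene] for s, y in zip(samples, labels) if y]
        neg = [s[gene] for s, y in zip(samples, labels) if not y]
        return abs(mean(pos) - mean(neg))
    return sorted(samples[0].keys(), key=score, reverse=True)[:k]


def fit_stump(samples, labels, gene):
    """Step 2: one-split predictor with the threshold between class means."""
    pos = mean(s[gene] for s, y in zip(samples, labels) if y)
    neg = mean(s[gene] for s, y in zip(samples, labels) if not y)
    threshold = (pos + neg) / 2
    high_is_positive = pos > neg
    return lambda s: (s[gene] > threshold) == high_is_positive


# Made-up expression values; True means survival beyond 10 years.
TOY_SAMPLES = [
    {"g1": 1.0, "g2": 5.0},
    {"g1": 1.2, "g2": 1.0},
    {"g1": 3.0, "g2": 4.8},
    {"g1": 3.1, "g2": 1.1},
]
TOY_LABELS = [False, False, True, True]

if __name__ == "__main__":
    top = select_genes(TOY_SAMPLES, TOY_LABELS, k=1)
    predictor = fit_stump(TOY_SAMPLES, TOY_LABELS, top[0])
    print(top, predictor({"g1": 2.5, "g2": 0.0}))
```

With 25,000+ genes, a ranking like this scores each gene in isolation, which is one reason it can miss small sets that only work well together.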

No

Yes

10 year survival?

Problem: gene selection instability

instability: different methods, different datasets produce different gene sets for the same phenotype [1]

[1] Griffith, Obi L., et al. "A robust prognostic signature for hormone-positive node-negative breast cancer." Genome Medicine 5.10 (2013).
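One simple way to quantify this instability is the overlap between the gene sets produced by two methods or datasets. The sketch below uses the Jaccard index; that choice is an assumption made for illustration and is not taken from [1], and the gene names are placeholders.

```python
def jaccard(genes_a, genes_b):
    """Jaccard index of two gene sets: shared genes / all genes seen."""
    a, b = set(genes_a), set(genes_b)
    if not a and not b:
        return 1.0  # two empty signatures are trivially identical
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    # Placeholder signatures (ASSUMPTION, not real published gene sets).
    signature_1 = {"GENE_A", "GENE_B", "GENE_C"}
    signature_2 = {"GENE_C", "GENE_D", "GENE_E"}
    print(jaccard(signature_1, signature_2))
```

A value near 0 means the two signatures share almost no genes even though they target the same phenotype.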

Problem: the validation gap

training data, test data

validation

validation: predictive signatures often perform worse on independent data created for validation.

Photograph by Richard Hallman, National Geographic Adventure Blog

find patterns

make predictions

Adding prior knowledge to the discovery algorithm

<10 yr survival

>10 yr survival

Ex.) Network guided forests

Use protein interaction network to find good gene combinations

Dutkowski & Ideker (2011) Protein Networks as Logic Functions in Development and Cancer. PLoS Computational Biology

But most knowledge is not structured

[Chart: number of articles added to PubMed per year, 2000 to 2012; vertical axis from 500,000 to 1,000,000]

>100 publications/hour

>194715 publications linked to “breast cancer” since 2000 http://tinyurl.com/brsince2000

How can we use unstructured knowledge to improve predictors?

Need a distributed network of intelligent systems that are good at reading and hypothesizing

Like you and your friends

A game with a purpose: The Cure

• http://genegames.org/cure
• http://games.jmir.org/2014/2/e7/
• The Cure: Design and Evaluation of a Crowdsourcing Game for Gene Selection for Breast Cancer Survival Prediction. JMIR Serious Games. PMID: 25654473

People wanted to control the trees

http://biobranch.org

Branch Goals

• Provide easy, visual way for non-programmers to use large datasets to answer questions

• Construct libraries of manually crafted predictive models

• Use the collected models to generate ensemble predictors that incorporate the knowledge of the users
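The last goal, combining collected models into an ensemble, can be sketched as a majority vote. The talk does not say how Branch combines models, so voting is an assumption here, and the three example models (including the `tumor_size` feature and its cutoff) are hypothetical stand-ins for user-built trees.

```python
from collections import Counter


def ensemble_predict(models, sample):
    """Return the label predicted by the most models.

    Ties go to the label that was voted for first (Counter keeps
    insertion order for equal counts).
    """
    votes = Counter(model(sample) for model in models)
    return votes.most_common(1)[0][0]


# Hypothetical user-built models, each mapping a sample to a label.
EXAMPLE_MODELS = [
    lambda s: "relapse" if s["age"] < 34.5 else "no relapse",
    lambda s: "relapse" if s["tumor_size"] > 2.0 else "no relapse",
    lambda s: "no relapse",
]

if __name__ == "__main__":
    print(ensemble_predict(EXAMPLE_MODELS, {"age": 30, "tumor_size": 3.1}))
```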

Branch walkthrough: Choose a dataset

Select evaluation option

Tree Builder

Split node builder

Each button is a different way to compose a split node in your decision tree

Split node

Predictions at leaf nodes

100% correct

56% accurate

View data, adjust split point

If age is less than 34.5, predict relapse

If greater, predict no relapse
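A split node like this one is a threshold test on a single feature. The sketch below is a minimal Python version with an accuracy check, which is the number that changes when you adjust the split point. The function names are hypothetical, the sample ages and outcomes are made up, and a value exactly equal to the threshold is sent to the "greater" side by assumption.

```python
def make_split(feature, threshold, below, above):
    """Build a split node: `below` if sample[feature] < threshold, else `above`."""
    def predict(sample):
        return below if sample[feature] < threshold else above
    return predict


def accuracy(predict, samples, labels):
    """Fraction of samples whose predicted label matches the true label."""
    hits = sum(predict(s) == y for s, y in zip(samples, labels))
    return hits / len(samples)


age_split = make_split("age", 34.5, "relapse", "no relapse")

# Made-up patients for illustration only.
TOY_PATIENTS = [{"age": 28}, {"age": 31}, {"age": 40}, {"age": 52}]
TOY_OUTCOMES = ["relapse", "relapse", "no relapse", "relapse"]

if __name__ == "__main__":
    print(age_split({"age": 30}))
    print(accuracy(age_split, TOY_PATIENTS, TOY_OUTCOMES))
```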

Single feature splits

Pick from genes or clinical features

Type-ahead search

Statistical ranker

Custom feature combination

BRCA2, TOP2B

BRCA2 + TOP2B

Allows user to use a manually composed linear combination of other features

E.g.: 21 Gene Signature from OncoType Dx

Proliferation: Ki67, STK15, Survivin, CCNB1 (cyclin B1), MYBL2

Invasion: MMP11, CTSL2

HER2: GRB7, HER2

Estrogen: ER, PGR, BCL2, SCUBE2

GSTM1, CD68, BAG1

Reference: ACTB (b-actin), GAPDH, RPLPO, GUS, TFRC

Recurrence Score Algorithm

1. HER2 group score = 0.9 × GRB7 + 0.1 × HER2 (if the result is less than 8, then the HER2 group score is considered 8)
2. ER group score = (0.8 × ER + 1.2 × PGR + BCL2 + SCUBE2) ÷ 4
3. Proliferation group score = (Survivin + KI67 + MYBL2 + CCNB1 [the gene encoding cyclin B1] + STK15) ÷ 5 (if the result is less than 6.5, then the proliferation group score is considered 6.5)
4. Invasion group score = (CTSL2 [the gene encoding cathepsin L2] + MMP11 [the gene encoding stromelysin 3]) ÷ 2

RSU = 0.47 × HER2 group score - 0.34 × ER group score + 1.04 × proliferation group score + 0.10 × invasion group score + 0.05 × CD68 - 0.08 × GSTM1 - 0.07 × BAG1

*A Multigene Assay to Predict Recurrence of Tamoxifen-Treated, Node-Negative Breast Cancer
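The Recurrence Score formula is an example of a manually composed linear combination, so it translates directly into code. The sketch below follows the four group scores and the RSU line from the slide. It assumes the input values are already in whatever normalized form the assay expects, and it does not cover any later rescaling of RSU; the function name and dictionary layout are hypothetical.

```python
def recurrence_score_unscaled(expr):
    """Compute RSU from a dict of per-gene expression values."""
    # Group scores, with the two floors stated on the slide.
    her2_group = max(0.9 * expr["GRB7"] + 0.1 * expr["HER2"], 8.0)
    er_group = (0.8 * expr["ER"] + 1.2 * expr["PGR"]
                + expr["BCL2"] + expr["SCUBE2"]) / 4
    proliferation = max(
        (expr["Survivin"] + expr["KI67"] + expr["MYBL2"]
         + expr["CCNB1"] + expr["STK15"]) / 5,
        6.5,
    )
    invasion = (expr["CTSL2"] + expr["MMP11"]) / 2

    # Linear combination of group scores and single genes.
    return (0.47 * her2_group - 0.34 * er_group
            + 1.04 * proliferation + 0.10 * invasion
            + 0.05 * expr["CD68"] - 0.08 * expr["GSTM1"]
            - 0.07 * expr["BAG1"])


GENES = ["GRB7", "HER2", "ER", "PGR", "BCL2", "SCUBE2", "Survivin", "KI67",
         "MYBL2", "CCNB1", "STK15", "CTSL2", "MMP11", "CD68", "GSTM1", "BAG1"]

if __name__ == "__main__":
    # Made-up input: every gene set to the same value, for illustration only.
    print(recurrence_score_unscaled({gene: 10.0 for gene in GENES}))
```

In Branch, a combination like this becomes a single custom feature that a split node can then threshold.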

Classifier nodes

Classifier Node

Class B

Class A


Use a trained predictive model, such as a support vector machine, as a node in your tree

Use

Build

biobranch tree nodes

Branch decision tree

Class B

Class A


Use a previously constructed tree as a node
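Classifier nodes and tree nodes share one idea: anything that maps a sample to a class can act as a split. The sketch below shows that composition in Python. It is an illustration only, not Branch's implementation; `make_node`, the cutoffs, and the class-to-outcome mapping are all hypothetical.

```python
def make_node(classifier, children):
    """Build a node from any classifier.

    `classifier` maps a sample to a class name; `children` maps each class
    name to either a final label (str) or another node (callable).
    """
    def predict(sample):
        child = children[classifier(sample)]
        return child(sample) if callable(child) else child
    return predict


# A previously built tree (hypothetical), reused below as a single node.
saved_tree = make_node(
    lambda s: "young" if s["age"] < 34.5 else "older",
    {"young": "Class A", "older": "Class B"},
)

# A new tree whose root split is the saved tree's output.
new_tree = make_node(
    saved_tree,
    {
        "Class A": "relapse",
        "Class B": make_node(
            # Hypothetical cutoff on the BRCA2 + TOP2B custom feature.
            lambda s: "high" if s["BRCA2"] + s["TOP2B"] > 1.0 else "low",
            {"high": "relapse", "low": "no relapse"},
        ),
    },
)

if __name__ == "__main__":
    print(new_tree({"age": 50, "BRCA2": 0.2, "TOP2B": 0.3}))
```

A trained model such as a support vector machine would slot in the same way, by passing its predict function as `classifier`.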

Visually set decision boundary nodes

Visual split

Class B

Class A


Creating a visual split

Draw polygon

Add to tree

Select feature

Select feature
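A visual split amounts to a point-in-polygon test on two selected features. The sketch below uses the standard ray-casting test; whether Branch uses this exact test is an assumption, and the feature names, polygon, and function names are placeholders.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: count edge crossings of a ray going right from (x, y)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can cross.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def make_visual_split(feature_x, feature_y, polygon, inside_label, outside_label):
    """Build a split node from a drawn polygon over two features."""
    def predict(sample):
        hit = point_in_polygon(sample[feature_x], sample[feature_y], polygon)
        return inside_label if hit else outside_label
    return predict


# Placeholder polygon and feature names (ASSUMPTION, for illustration only).
SQUARE = [(0, 0), (4, 0), (4, 4), (0, 4)]
visual_split = make_visual_split("feature_x", "feature_y", SQUARE,
                                 "Class A", "Class B")

if __name__ == "__main__":
    print(visual_split({"feature_x": 2, "feature_y": 2}))
```

Because the polygon can be drawn as tightly around the training points as you like, this node type makes overfitting easy to demonstrate.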

Teach students about overfitting..

Tree Builder

Evaluation panel

View training and testing sets

Performance metrics

Confusion matrix

ROC curve
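The quantities in the evaluation panel can be computed from actual and predicted labels. The sketch below builds a two-class confusion matrix and derives accuracy plus one ROC point (true-positive rate against false-positive rate); a full ROC curve repeats this at several decision thresholds. Function names and the example labels are hypothetical.

```python
def confusion_matrix(actual, predicted, positive):
    """Count true/false positives and negatives for one positive class."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["tp" if a == positive else "fp"] += 1
        else:
            counts["fn" if a == positive else "tn"] += 1
    return counts


def summarize(cm):
    """Accuracy and one ROC point (fpr, tpr) from a confusion matrix."""
    total = sum(cm.values())
    positives = cm["tp"] + cm["fn"]
    negatives = cm["fp"] + cm["tn"]
    return {
        "accuracy": (cm["tp"] + cm["tn"]) / total if total else 0.0,
        "tpr": cm["tp"] / positives if positives else 0.0,
        "fpr": cm["fp"] / negatives if negatives else 0.0,
    }


if __name__ == "__main__":
    # Made-up labels for illustration only.
    actual = ["relapse", "relapse", "no relapse", "no relapse"]
    predicted = ["relapse", "no relapse", "no relapse", "relapse"]
    print(summarize(confusion_matrix(actual, predicted, positive="relapse")))
```

Comparing these numbers between the training and testing sets is the quickest way to see whether a hand-built tree has overfit.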

Navigation

Save your tree

New tree

Tree Collection

Open and edit shared tree

Search trees you create and trees shared with the community

Editing shared tree

Tracks which user created each node

Next steps

• More user testing
• More datasets
• Lots of users?
• Better models?

training data, test data

validation

Thanks

Funding and Support

BioGPS: GM83924
Gene Wiki: GM089820
BD2K COE: GM114833

Andra Waagmeester, Sebastian Burgstaller, Elvira Mitraka

Lynn Schriml, Gang Fu, Evan Bolton, Paul Pavlidis, Peter Robinson, many WikiDatans

Richard Bruskiewich, http://starinformatics.com

Karthik Gangavarapu, Vyshakh Babji

Andrew Su

The Prince of Crowdsourcing

Implicitome: Kristina Hettne, Leiden University

Contact: bgood@scripps.edu, @bgood
