2017-07-19 john ismb2017 poster · ZEB1 SAA EGF VPS37A PI3K HGF CDH1 MMP9 MET EGFR invasion TRIM11 ERRFI1 aldosterone IL6ST KLF5 DPP4 CYP11B2 artemisinin Integrins IFNL1 CXCL8 HDAC6

The capacity of modern experimental methods to generate data about biological processes has surpassed the ability of existing informatics approaches to generate meaningful mechanistic explanations. Mechanistic systems biology models could potentially address this gap, but model construction remains a labor-intensive process requiring both biological knowledge and modeling expertise. As a result, modeling studies remain fairly small in scope and are disconnected from genome-scale research. For mechanistic models to attain the necessary scope, methods for the automated assembly and analysis of large models from available knowledge sources will be required. Here we describe the use of the Integrated Network and Dynamical Reasoning Assembler (INDRA) [1] to assemble mechanistic facts from databases and literature into a rule-based Kappa [2] model in order to explain observations in a previously published phosphoproteomic dataset [3]. Explanations were generated by identifying paths through the rule influence map between drug targets and measured protein nodes. The model yielded detailed, biochemically plausible explanations for 20 of the 22 largest effects (91%) and 95/135 (70%) of smaller effects. Performance could be improved further by supplying manually curated mechanistic information in the form of natural language.

Explanation of drug effects using a mechanistic model automatically assembled from natural language, databases, and literature

John A. Bachman1*, Benjamin M. Gyori1*, and Peter K. Sorger1

*These authors contributed equally to this work

1 Laboratory of Systems Pharmacology, Harvard Medical School, Boston, MA, USA

INTRODUCTION

RESULTS

Availability: https://github.com/sorgerlab/indra

Funding: DARPA Big Mechanism program, ARO contract W911NF-14-1-0397

[Figure: Statements at increasing levels of specificity: Phosphorylation(RAF, MEK); Phosphorylation(BRAF, MAP2K1); Phosphorylation(BRAF, MAP2K1, S); Phosphorylation(BRAF, MAP2K1, 218); Phosphorylation(BRAF, MAP2K1, 222); Phosphorylation(BRAF, MAP2K1, S, 218); Phosphorylation(BRAF, MAP2K1, S, 222)]

Mechanisms are normalized into Statements

Correcting systematic errors in entity grounding

Applying mechanistic models to the interpretation of data requires that the named entities extracted from text (genes, proteins, small molecules, biological processes, etc.) be appropriately “grounded” to identifiers in the relevant databases. This is often challenging due to overlapping synonyms among gene names and ambiguous acronyms. A key problem is that mechanisms described in the literature often refer to protein families and named complexes, which cannot be directly related to the genes and proteins measured in experimental data. To solve this problem, we created Bioentities (http://github.com/sorgerlab/bioentities), a resource accounting for the hierarchical relationships between genes/proteins, protein families, and named complexes.

In addition, INDRA includes a curated “grounding map” that maps commonly encountered entities in text to the relevant database identifiers.
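The idea of a grounding map can be illustrated with a small lookup table. The sketch below is a simplification under stated assumptions: the entries, the `FAMILY` and `CHEMICAL` namespace labels, and the `ground` helper are hypothetical and are not taken from INDRA's curated map.

```python
# Illustrative grounding map. The entries and namespace labels are
# made-up examples, not the contents of INDRA's curated map.
GROUNDING_MAP = {
    "jnk": {"FAMILY": "JNK"},
    "c-jun n-terminal kinase": {"FAMILY": "JNK"},
    "mek": {"FAMILY": "MEK"},
}


def ground(entity_text, default_refs=None):
    """Return database references for an entity string.

    A curated entry, when present, overrides whatever grounding the
    reading system proposed (default_refs).
    """
    key = entity_text.strip().lower()
    curated = GROUNDING_MAP.get(key)
    if curated is not None:
        return dict(curated)
    return dict(default_refs or {})


# A reader that mistook "MEK" for the solvent is corrected by the map.
corrected = ground("MEK", {"CHEMICAL": "methyl ethyl ketone"})
```

A curated override of this kind targets systematic misgroundings, such as an acronym that a reader consistently resolves to the wrong entity.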

Representation of “AMPK” in the Bioentities hierarchy

Synonyms for JNK protein family in the INDRA grounding map

Identifying relationships between mechanisms

Relations can be organized into a hierarchy based on their specificity

A key challenge in assembling detailed mechanistic networks is that a single mechanism may be described at different levels of specificity among the literature and various databases. Reconciling these overlapping mechanisms is essential to eliminate spuriously distinct edges in the assembled model. Using hierarchical ontologies of protein modification types, activity types, and the protein family information provided in Bioentities, INDRA implements duplicate removal, hierarchy-based redundancy resolution, and other forms of error correction and mechanism linking.
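Hierarchy-based redundancy resolution can be sketched as a refinement test between statements. The code below is a minimal illustration, not INDRA's implementation: the `Phos` tuple, the `PARENT` table (a stand-in for Bioentities), and the function names are all assumptions made for this example.

```python
from typing import NamedTuple, Optional

# Illustrative family membership (child -> parent); a stand-in for the
# Bioentities hierarchy, not its actual contents.
PARENT = {"BRAF": "RAF", "MAP2K1": "MEK"}


class Phos(NamedTuple):
    enzyme: str
    substrate: str
    residue: Optional[str] = None
    position: Optional[str] = None


def is_a(child, ancestor):
    """True if child equals ancestor or descends from it."""
    while child is not None:
        if child == ancestor:
            return True
        child = PARENT.get(child)
    return False


def refines(specific, general):
    """True if `specific` states everything `general` states, or more."""
    return (
        is_a(specific.enzyme, general.enzyme)
        and is_a(specific.substrate, general.substrate)
        and general.residue in (None, specific.residue)
        and general.position in (None, specific.position)
    )


def most_specific(stmts):
    """Drop exact duplicates and statements refined by another one."""
    unique = list(dict.fromkeys(stmts))
    return [s for s in unique
            if not any(o != s and refines(o, s) for o in unique)]


stmts = [
    Phos("RAF", "MEK"),
    Phos("BRAF", "MAP2K1"),
    Phos("BRAF", "MAP2K1", "S"),
    Phos("BRAF", "MAP2K1", None, "218"),
    Phos("BRAF", "MAP2K1", "S", "218"),
    Phos("BRAF", "MAP2K1", "S", "222"),
    Phos("BRAF", "MAP2K1", "S", "218"),  # exact duplicate
]
top_level = most_specific(stmts)
```

With these inputs only the two site-specific statements (S218 and S222) remain, since every other statement is a less specific description of one of them.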

The Integrated Network and Dynamical Reasoning Assembler (INDRA) [1] automatically assembles mechanistic models from pathway databases, literature, and expert knowledge expressed in natural language. INDRA draws on three existing natural language processing systems [4,5,6] and uses a modular architecture to build different types of models from a variety of sources.

Mechanisms extracted from each source format are normalized into Statements, an SBO-compatible internal representation, where they are processed to remove errors, identify overlaps, and estimate reliability. Statements are designed to correspond in both specificity and ambiguity to descriptions of biochemistry as found in text (e.g., “MEK1 phosphorylates ERK2”, rather than a detailed reaction mechanism). The representation currently encompasses post-translational modifications, chemical conversions, protein expression and degradation, and generic activation/inhibition relationships.

[Figure: class diagram with two panels, “Statements” and “Agent and components”. Legend: "is a" (inheritance); composition (has one or more, life-cycle dependence).]

Statements

Statement: evidence : Evidence
Modification: enzyme : Agent; substrate : Agent; residue : string; position : string
AddModification, RemoveModification
Modification types shown: Phosphorylation, Dephosphorylation, Hydroxylation, Dehydroxylation, Ubiquitination, Deubiquitination, Acetylation, Deacetylation, Glycosylation, Deglycosylation, Sumoylation, Desumoylation, Farnesylation, Defarnesylation, Ribosylation, Deribosylation, Geranylgeranylation, Degeranylgeranylation, Palmitoylation, Depalmitoylation, Myristoylation, Demyristoylation, Methylation, Demethylation
SelfModification: enzyme : Agent; residue : string; position : string
Autophosphorylation, Transphosphorylation
ActiveForm: agent : Agent; activity_type : string; is_active : boolean
Conversion: subj : Agent; obj_from : list[Agent]; obj_to : list[Agent]
RegulateActivity: subject : Agent; object : Agent; obj_activity : string
Activation, Inhibition
RegulateAmount: subject : Agent; object : Agent
IncreaseAmount, DecreaseAmount
Gef: gef : Agent; gtpase : Agent; gef_activity : string
Gap: gap : Agent; gtpase : Agent; gap_activity : string
Complex: members : list[Agent]
Evidence: text : string; source_api : string; source_id : string; pmid : string; annotations : dict; epistemics : dict

Agent and components

Agent: name : string; mods : list[ModCondition]; mutations : list[MutCondition]; bound_conditions : list[BoundCondition]; location : string; activity : ActivityCondition; db_refs : dict
ModCondition: mod_type : string; residue : string; position : string; is_modified : boolean
MutCondition: from_residue : string; to_residue : string; position : string
BoundCondition: agent : Agent; is_bound : string
ActivityCondition: activity_type : string; is_active : boolean
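The structure of a Statement and its Agents can be sketched with Python dataclasses. This is a deliberately reduced illustration that reuses a few attribute names from the diagram; it is not INDRA's actual class code, and the subset of fields chosen here is an assumption made for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Evidence:
    text: Optional[str] = None
    source_api: Optional[str] = None
    pmid: Optional[str] = None


@dataclass
class Agent:
    name: str
    db_refs: Dict[str, str] = field(default_factory=dict)


@dataclass
class Statement:
    # Every Statement carries the evidence that supports it.
    evidence: List[Evidence] = field(default_factory=list)


@dataclass
class Modification(Statement):
    enzyme: Optional[Agent] = None
    substrate: Optional[Agent] = None
    residue: Optional[str] = None
    position: Optional[str] = None


class Phosphorylation(Modification):
    """A Modification whose type is phosphorylation ("is a")."""


# "MEK1 phosphorylates ERK2 at T185", using gene symbols as Agent names.
stmt = Phosphorylation(
    enzyme=Agent("MAP2K1"),
    substrate=Agent("MAPK1"),
    residue="T",
    position="185",
    evidence=[Evidence(text="MEK1 phosphorylates ERK2 at T185")],
)
```

Leaving `residue` and `position` optional lets the same class hold both a vague assertion and a site-specific one.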

Conceptual overview of automated assembly

System architecture and approach

INDRA software architecture

Estimating the reliability of extracted mechanisms

Even state-of-the-art NLP and text mining algorithms have limited accuracy, with roughly 20-30% of extracted relations representing a misinterpretation of the corresponding sentence (“reader error”). Given empirical estimates of the per-sentence error rate for different readers, INDRA’s BeliefEngine component aggregates results to estimate the overall probability that a relation is the result of reader error. It accomplishes this by:

1) aggregating evidence from multiple sentences read by the same reader

2) aggregating results from different reading algorithms on the same sentence

3) propagating error estimates through the network of related statements

Mechanisms can then be filtered with a precision threshold (e.g., 95% confidence).
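Steps 1 and 2 above, followed by threshold filtering, can be sketched numerically. The poster does not specify the BeliefEngine's probability model, so the code below assumes the simplest case: every sentence is misread independently at its reader's error rate. The function names, reader names, and rates are hypothetical, and step 3 (propagation through the hierarchy) is not covered.

```python
def belief(sentence_counts, error_rates):
    """Probability that a relation is not purely the result of reader error.

    sentence_counts: {reader: number of supporting sentences}
    error_rates: {reader: assumed per-sentence error rate}
    Assumes independent errors, which is a simplification.
    """
    p_all_wrong = 1.0
    for reader, n in sentence_counts.items():
        p_all_wrong *= error_rates[reader] ** n
    return 1.0 - p_all_wrong


def filter_by_belief(relations, error_rates, threshold=0.95):
    """Keep the relations whose belief meets the precision threshold."""
    return [name for name, counts in relations.items()
            if belief(counts, error_rates) >= threshold]


# Hypothetical readers with 30% and 20% per-sentence error rates.
rates = {"reader_a": 0.3, "reader_b": 0.2}
relations = {
    "one sentence": {"reader_a": 1},
    "two readers": {"reader_a": 1, "reader_b": 1},
    "well supported": {"reader_a": 2, "reader_b": 1},
}
kept = filter_by_belief(relations, rates, threshold=0.95)
```

Under these assumptions a single sentence from a 30%-error reader falls well short of a 95% threshold, while agreement among several sentences and readers can reach it.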

Reading systems produce partially overlapping extractions

Reliability estimates are propagated through the specificity hierarchy

Use case for explanation: interpreting phosphoproteomic data

REFERENCES

1. B. M. Gyori*, J. A. Bachman*, K. Subramanian, J. L. Muhlich, L. Galescu, and P. K. Sorger. “From word models to executable models of signaling networks using automated assembly.” bioRxiv, 2017.
2. V. Danos, J. Feret, W. Fontana, R. Harmer, and J. Krivine. “Rule-Based Modeling of Cellular Signaling.” Concurrency Theory (CONCUR) 2007, Lecture Notes in Computer Science, 4703:17–41, 2007.
3. E. J. Molinelli, A. Korkut, et al. “Perturbation biology: Inferring signaling networks in cellular systems.” PLoS Computational Biology, 9(12):e1003290, Dec 2013.
4. J. Allen, W. de Beaumont, L. Galescu, and C. M. Teng. “Complex event extraction using DRUM.” 2015.
5. M. A. Valenzuela-Escarcega, G. Hahn-Powell, T. Hicks, and M. Surdeanu. “A domain-independent rule-based framework for event extraction.” In Proc. 53rd Annual Meeting of the ACL-IJCNLP, 2015.
6. D. McDonald et al. “Extending Biology Models with Deep NLP over Scientific Articles.” Workshops at the 30th AAAI Conference on Artificial Intelligence, 2016.
7. C. F. Lopez*, J. L. Muhlich*, J. A. Bachman*, and P. K. Sorger. “Programming biological models in python using PySB.” Molecular Systems Biology, 9(1):646–646, Apr 2014.

Curated mechanisms for MTOR feedback inhibition on AKT

[Figure: network diagram; node labels (genes, protein families, drugs, and biological processes) omitted]


Model representations for statically identifying causal paths

Drug combinations

RPPA measurements

How did this happen?

http://www.sanderlab.org/pertbio/

Directed protein interaction graph

Kappa rule influence map [2]

Chemical reaction network

Mechanistic detail/causal context

More false positive paths (less stringent context)

More false negative paths (more stringent context)

Boolean network

The assembly challenge

MEK phosphorylates ERK

ERK phosphorylates MEK

MEK1 phosphorylates ERK2 at T185

MEK1p218p222 phosphorylates ERK2 at T184

MEK1p218p222 phosphorylates ERK2 at T185.

Methyl Ethyl Ketone phosphorylates ERK

“Raw” mechanisms

MEK phosphorylates ERK

MEK phosphorylates ERK

Assembled mechanisms

Generating mechanistic models from assembled Statements

In directed interaction graphs, the relatively limited causal context leads to an explosion of paths between any two proteins. This leads to many false positive paths and makes identification of long causal chains difficult (or even intractable) in large networks.

Generating explanations from the Kappa2 rule influence map

- identifying rules whose activity is increased by the abundance of the subject (e.g., drug)

- searching for a path to an observable representing the object (e.g., a measured protein) with the appropriate overall polarity

- scoring paths by whether the signs of measured intermediate nodes are correctly predicted
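The path search and scoring steps above can be sketched on a small signed graph. This is an illustration under stated assumptions: it operates on a toy node-level graph rather than a Kappa rule influence map, the node `X` and all function names are hypothetical, and at most one edge per ordered pair of nodes is assumed.

```python
def find_paths(edges, source, target, polarity, max_len=6):
    """Enumerate simple paths whose edge signs multiply to `polarity`.

    edges: {node: [(successor, sign), ...]} with sign +1 or -1.
    """
    results = []
    stack = [(source, [source], 1)]
    while stack:
        node, path, sign = stack.pop()
        if node == target and len(path) > 1:
            if sign == polarity:
                results.append(path)
            continue
        if len(path) > max_len:
            continue
        for nxt, edge_sign in edges.get(node, []):
            if nxt not in path:  # keep paths simple (no revisits)
                stack.append((nxt, path + [nxt], sign * edge_sign))
    return results


def score_path(edges, path, source_sign, measured):
    """Fraction of measured intermediate nodes predicted with the right sign.

    Returns None when no intermediate node on the path was measured.
    """
    sign = source_sign
    hits = total = 0
    for a, b in zip(path, path[1:]):
        sign *= dict(edges[a])[b]
        if b in measured and b != path[-1]:
            total += 1
            hits += measured[b] == sign
    return hits / total if total else None


# Toy graph: the drug inhibits DUSP, and DUSP removes the phosphate.
edges = {
    "Pervanadate": [("DUSP", -1), ("X", 1)],
    "DUSP": [("MAPK1_pT185", -1)],
    "X": [("MAPK1_pT185", -1)],
}
paths = find_paths(edges, "Pervanadate", "MAPK1_pT185", polarity=1)
score = score_path(edges, paths[0], source_sign=1, measured={"DUSP": -1})
```

The route through the hypothetical node `X` has net negative polarity and is rejected, leaving the double-negative route through DUSP as the only explanation of an increase.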

Causal path for “Pervanadate increases MAPK1 phosphorylation”

[Figure: rule influence map path through the nodes Pvd_binds_DUSP, Pvd_binds_DUSP_rev, DUSP_binds_MAPK1_phosT185, DUSP_binds_MAPK1_phosT185_rev, and DUSP_dephos_MAPK1_at_T185, ending at MAPK1_pT185; edge labels such as [0->0];[1->1], [1->0], and [0->1] omitted]

Extending the model by describing mechanisms in English

“IGF1R phosphorylates IRS1 at tyrosine. Tyrosine-phosphorylated IRS1 binds PI3K. Serine phosphorylated IRS1 is degraded. Active PPP2CA dephosphorylates IRS1 at serine. Active MTOR inhibits PPP2CA.”

To build a mechanistic model, high-level assertions such as “MEK1 phosphorylates ERK1” must be converted into specific reaction mechanisms. INDRA uses user-specified policies that determine how the different Statement types are implemented as PySB [7] rules and corresponding reactions.

Phosphorylation(MEK1, ERK1)

- one-step (pseudo-first-order)
- one-step (Michaelis-Menten)
- two-step (enzyme-substrate complex formation)
- ATP-dependent (unordered bi-bi reaction)
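How a policy turns one assertion into different reaction mechanisms can be sketched as follows. The function name, the policy keys, and the ad hoc reaction notation are assumptions for illustration only; the sketch does not use PySB and covers only two of the policies listed above.

```python
def assemble_phosphorylation(enzyme, substrate, policy="one_step"):
    """Expand Phosphorylation(enzyme, substrate) into reaction strings.

    The "_u"/"_p" suffixes mark the unphosphorylated and phosphorylated
    substrate; the notation is ad hoc, not Kappa or PySB syntax.
    """
    if policy == "one_step":
        # The enzyme converts the substrate without an explicit complex.
        return [f"{enzyme} + {substrate}_u -> {enzyme} + {substrate}_p"]
    if policy == "two_step":
        # Reversible enzyme-substrate binding, then catalysis and release.
        cplx = f"{enzyme}:{substrate}_u"
        return [
            f"{enzyme} + {substrate}_u <-> {cplx}",
            f"{cplx} -> {enzyme} + {substrate}_p",
        ]
    raise ValueError(f"unknown policy: {policy!r}")


one_step = assemble_phosphorylation("MEK1", "ERK1")
two_step = assemble_phosphorylation("MEK1", "ERK1", policy="two_step")
```

Keeping the policy separate from the Statement means the same assembled knowledge can be rendered at different levels of mechanistic detail.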

Genome assembly

Sequence reads

Assembled sequence

Knowledge assembly

Assembly of a large number of mechanistic facts is analogous to genome assembly: databases and literature yield a large number of redundant, partially overlapping facts that may contain errors. Mechanisms must be corrected and “aligned” in order to produce a set of facts suitable for generating a non-redundant, non-degenerate model.

To evaluate the ability of INDRA to systematically generate explanations of high-throughput data, we assembled a rule-based executable model to explain a previously published dataset of the phosphoproteomic response of a melanoma cell line to 12 different drugs [3]. A rule-based model containing 221 proteins and 1451 rules was assembled from mechanisms extracted from databases and ~95,000 publications (abstracts and full texts). Static analysis of the rule influence map provided by Kappa identified possible mechanistic paths linking drug targets to experimentally observed effects on phosphoprotein abundances.

Drug target   Antibody         Fold-change
MEK           MAPK pT202       0.47
SRC           CHK2 pT68        1.75
SRC           4EBP1 pT37       0.44
AKT           AKT pT308        0.25
AKT           GSK3A/B pS21     0.44
AKT           AKT pS473        0.17
AKT           S6 pS235         0.36
CDK4          4EBP1 pS65       0.44
CDK4          YBI pS102        2.13
MTOR          AKT pT308        2.19
MTOR          S6 pS240         0.05
MTOR          AKT pS473        3.19
MTOR          p70S6K pT389     0.33
MTOR          S6 pS235         0.06
PKC           GSK3A/B pS21     1.59
PKC           S6 pS240         0.47
PKC           S6 pS235         0.3
PI3K          p70S6K pT389     0.5
PI3K          S6 pS240         0.44
PI3K          AKT pS473        0.2
PI3K          S6 pS235         0.27

[Column “Path?” not recoverable]

SRC phosphorylated on Y418 phosphorylates PAK2 on S20. PAK2 phosphorylated on S20 phosphorylates RAF1 on S338. RAF1 phosphorylated on S338, T269 and S471 phosphorylates MAPK1 on T185. MAPK1 phosphorylated on T185 and Y187 phosphorylates TP53 on S15. TP53 phosphorylated on S20 and S15 decreases the amount of PLK1. PLK1 phosphorylates CHEK2 on T68, which is measured by CHK2_pT68.

Example explanation: How does Src inhibition increase CHK2 pT68?

Performance: For the largest effects in the data (>50% fold-change), the model generated biochemically plausible explanations for 20 of the 22 effects (91%). For effects at the 20% fold-change level, the model explained 95/135 (70%) of effects. Notably, performance was biased toward drug targets well represented in the literature corpus: the model explained 94/106 (89%) of effects due to PI3K, PKC, SRC, MTOR, MEK, AKT, RAF, and JAK inhibition, but only 1/29 (3%) of effects due to CDK, STAT, or MDM2 inhibition.

Where the model was unable to identify a causal path between a drug perturbation and an observed effect, we manually curated a causal path in simplified English, processed it with NLP, and co-assembled it with the automated model.

Overall, this study shows the potential of automatically assembled models to systematically explain high-throughput data, generating mechanistic hypotheses and identifying genuinely novel phenomena.

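The Kappa influence map captures detailed context while avoiding the combinatorial explosion of chemical species. Paths through it are obtained by the three steps listed under “Generating explanations from the Kappa rule influence map”.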
