Novartis Institutes for BioMedical Research (NIBR) Connecting the dots in early drug discovery Stephan Reiling Senior Scientist, Novartis Institutes for BioMedical Research

Connecting the Dots in Early Drug Discovery

Page 1: Connecting the Dots in Early Drug Discovery

Connecting the dots in early drug discovery
Stephan Reiling
Senior Scientist, Novartis Institutes for BioMedical Research

Page 2: Connecting the Dots in Early Drug Discovery

Connecting the dots in early drug discovery
Stephan Reiling
In-Silico Lead Discovery Group
Novartis Institutes for BioMedical Research (NIBR) Cambridge

GraphConnect 2016, San Francisco

Page 3: Connecting the Dots in Early Drug Discovery

Why (might you be interested in this talk)

• The talk shows how a lot of heterogeneous data can be integrated into one big graph
– Greater than the sum of its parts

• Text mining and pattern detection can lead to valuable insights
– Nobody can read 25 million scientific papers

• Data mining this graph can give novel biological insights
– Connecting the dots

Page 4: Connecting the Dots in Early Drug Discovery

Why (did we build the graph)

Treatment effects in cellular phenotypic assays

Compound treatment

Page 5: Connecting the Dots in Early Drug Discovery

Why

• What we have (the dots)
– Almost 1 billion data points of compound activity data on protein targets (~99% of which can be summarized as “not active”)
– More and more results of phenotypic assays

• What we lack (the connections)
– A good way to use biological knowledge or background information to make a connection
– A store for “biological knowledge” that can be “queried”

[Diagram: Compound, Gene, Disease (Phenotype)]

Page 6: Connecting the Dots in Early Drug Discovery

How (did we build the graph)

Text mining for chemicals, diseases, proteins

In continuation of our investigation on novel stearoyl-CoA desaturase (SCD) 1 inhibitors, we have already reported on the structural modification of the benzoylpiperidines that led to a series of novel and highly potent spiropiperidine-based SCD1 inhibitors. In this report, we would like to extend the scope of our previous investigation and disclose details of the synthesis, SAR, ADME, PK, and pharmacological evaluation of the spiropiperidines with high potency for SCD1 inhibition. Our current efforts have culminated in the identification of 5-fluoro-1'-{6-[5-(pyridin-3-ylmethyl)-1,3,4-oxadiazol-2-yl]pyridazin-3-yl}-3,4-dihydrospiro[chromene-2,4'-piperidine] (10e), which demonstrated a very strong potency for liver SCD1 inhibition (ID(50)=0.6 mg/kg). This highly efficacious inhibition is presumed to be the result of a combination of strong enzymatic inhibitory activity (IC(50) (mouse)=2 nM) and good oral bioavailability (F >95%). Pharmacological evaluation of 10e has demonstrated potent, dose-dependent reduction of the plasma desaturation index in C57BL/6J mice on a high carbohydrate diet after a 7-day oral administration (q.d.). In addition, it did not cause any noticeable skin abnormalities up to the highest dose (10 mg/kg).

Page 7: Connecting the Dots in Early Drug Discovery

How (did we build the graph)

Text mining for chemicals, diseases, proteins

Hit  Type           Recognized text (SMILES where resolved)
T1   GeneOrProtein  stearoyl-CoA desaturase
T2   Mechanism      inhibitors
T3   G              benzoylpiperidines
T4   D              spiropiperidine
                    SMILES: O=C(NC(Cc1c[nH]c2ccccc12)C(=O)N3CCC4(CC3)CCc5ccccc45)NC6CN7CCC6CC7
T5   GeneOrProtein  SCD1
T6   Mechanism      inhibitors
T7   GeneOrProtein  SCD1
T8   M              5-fluoro-1'-{6-[5-(pyridin-3-ylmethyl)-1,3,4-oxadiazol-2-yl]pyridazin-3-yl}-3,4-dihydrospiro[chromene-2,4'-piperidine]
                    SMILES: FC1=C2CCC3(OC2=CC=C1)CCN(CC3)C=3N=NC(=CC3)C=3OC(=NN3)CC=3C=NC=CC3
T9   GeneOrProtein  SCD1
T10  G              carbohydrate
T11  Disease        skin abnormalities

Page 8: Connecting the Dots in Early Drug Discovery

How (did we build the graph)

National Institutes of Health (NIH) PubMed: http://www.ncbi.nlm.nih.gov/pubmed

• ~25,000,000 article abstracts
• 5,600 journals
• 1946 – current
• Tagged with “MeSH terms” (MeSH: Medical Subject Headings)

http://www.ncbi.nlm.nih.gov/pubmed/?term=20801551

Page 9: Connecting the Dots in Early Drug Discovery

How

Structure of the MeSH term hierarchy (partial)

Yellow: Diseases
Blue: Processes and Mechanisms
Green: Anatomy
Red: Chemicals and Drugs
Grey: Organisms

Page 10: Connecting the Dots in Early Drug Discovery

Page 11: Connecting the Dots in Early Drug Discovery

Page 12: Connecting the Dots in Early Drug Discovery

How

Association rule mining of co-occurrences

Article 1
• Compound A
• Gene 1
• Gene 2

Article 2
• Compound A
• Compound B
• Gene 1

Article 3
• Compound A
• Mesh term X
• Gene 1

Article 4
• Compound C
• Gene 1

• Identification of entities (compounds, MeSH terms, genes, diseases, …) from PubMed annotations or text mining

• The Apriori algorithm from association rule mining is used to identify frequently co-mentioned entities (aka market basket analysis)

• Associations above a certain association strength (lift) and number of articles in which they are co-mentioned (support) are stored

• The association strength is scaled to 0-1 and stored as the uncertainty of the association (high lift = low uncertainty)

• Articles are stored as well, including the entities that are mentioned in them

• This only captures the fact that something is frequently co-mentioned with something else, not any causality (similar to correlation)
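
As a minimal sketch of the support and lift calculation described in the bullets above, the following Python snippet counts pairwise co-mentions in the four toy articles. It covers pairs only rather than the full Apriori itemset search. The thresholds (`MIN_SUPPORT`, `MIN_LIFT`) and the min-max scaling in `to_uncertainty` are assumptions made for illustration; the slides do not state the actual cut-offs or scaling.

```python
from collections import Counter
from itertools import combinations

# The four toy articles from the slide; each is the set of entities it mentions.
articles = [
    {"Compound A", "Gene 1", "Gene 2"},
    {"Compound A", "Compound B", "Gene 1"},
    {"Compound A", "Mesh term X", "Gene 1"},
    {"Compound C", "Gene 1"},
]


def pair_statistics(articles):
    """Return {(a, b): (support, lift)} for every co-mentioned pair."""
    n = len(articles)
    single = Counter(entity for art in articles for entity in art)
    pairs = Counter(
        pair for art in articles for pair in combinations(sorted(art), 2)
    )
    stats = {}
    for (a, b), support in pairs.items():
        # lift = P(a and b) / (P(a) * P(b))
        lift = (support / n) / ((single[a] / n) * (single[b] / n))
        stats[(a, b)] = (support, lift)
    return stats


def to_uncertainty(lift, lo, hi):
    """ASSUMED min-max scaling: highest lift -> 0.0, lowest lift -> 1.0."""
    if hi == lo:
        return 0.0
    return 1.0 - (lift - lo) / (hi - lo)


MIN_SUPPORT = 1   # hypothetical threshold (number of articles)
MIN_LIFT = 1.0    # hypothetical threshold (association strength)

stats = pair_statistics(articles)
kept = {
    pair: (support, lift)
    for pair, (support, lift) in stats.items()
    if support >= MIN_SUPPORT and lift >= MIN_LIFT
}
lifts = [lift for _, lift in kept.values()]
lo, hi = min(lifts), max(lifts)
associations = {
    pair: to_uncertainty(lift, lo, hi) for pair, (_, lift) in kept.items()
}
```

In this toy set Gene 1 is mentioned in every article, so its lift with Compound A is exactly 1.0 despite three co-mentions, while rarer pairs such as Compound A and Gene 2 score higher. This is a reminder that lift measures surprise, not raw frequency.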

Page 13: Connecting the Dots in Early Drug Discovery

What (can you do with this)

Example: disease – compound – target from text mining
Every relationship in the graph has a property “uncertainty” in the range of 0-1.
This makes it possible to query for the connections with the highest confidence.

From Wikipedia: Tafamidis (INN, or Fx-1006A, trade name Vyndaqel) is a drug for the amelioration of transthyretin-related hereditary amyloidosis (also familial amyloid polyneuropathy, or FAP), a rare but deadly neurodegenerative disease.

From Wikipedia: Canavan disease is caused by a defective ASPA gene which is responsible for the production of the enzyme aspartoacylase. Decreased aspartoacylase activity prevents the normal breakdown of N-acetyl aspartate, wherein the accumulation of N-acetylaspartate, or lack of its further metabolism interferes with growth of the myelin sheath of the nerve fibers of the brain.

Color code: Disease, Gene, Compound

MATCH p = (cpd:Compound) -[:is_associated]-> (g:Gene) -[:is_associated]-> (d:Disease) <-[:is_associated]- (cpd)
RETURN p, reduce(u=0.0, r in relationships(p) | u+r.uncertainty) as unc
ORDER BY unc
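
To make the ranking logic of the query above concrete, here is a small in-memory Python sketch that finds compound, gene, disease triangles and orders them by summed uncertainty, mirroring the `reduce(...)` expression. The node names, labels and uncertainty values are invented for illustration and are not taken from the graph.

```python
# Toy directed edges: (source, target) -> uncertainty. All values are invented.
edges = {
    ("cpd1", "geneA"): 0.2,
    ("geneA", "diseaseX"): 0.3,
    ("cpd1", "diseaseX"): 0.1,
    ("cpd2", "geneB"): 0.5,
    ("geneB", "diseaseY"): 0.4,
    ("cpd2", "diseaseY"): 0.6,
    ("cpd2", "geneA"): 0.7,  # cpd2 has no edge to diseaseX, so no triangle
}
labels = {
    "cpd1": "Compound", "cpd2": "Compound",
    "geneA": "Gene", "geneB": "Gene",
    "diseaseX": "Disease", "diseaseY": "Disease",
}


def ranked_triangles(edges, labels):
    """Return [((compound, gene, disease), summed uncertainty)], best first."""
    results = []
    for (cpd, gene), u_cg in edges.items():
        if labels[cpd] != "Compound" or labels[gene] != "Gene":
            continue
        for (src, disease), u_gd in edges.items():
            if src != gene or labels[disease] != "Disease":
                continue
            # The pattern closes the triangle with a direct compound-disease edge.
            u_cd = edges.get((cpd, disease))
            if u_cd is None:
                continue
            results.append(((cpd, gene, disease), u_cg + u_gd + u_cd))
    return sorted(results, key=lambda item: item[1])


ranked = ranked_triangles(edges, labels)
```

Because low uncertainty means high confidence, the first entry of `ranked` is the triangle whose three relationships are jointly best supported.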

Page 14: Connecting the Dots in Early Drug Discovery

What (can you do with this)

So why not just load Wikipedia?

MATCH p = (cpd:Compound {name: 'N-acetylaspartate'}) -[r:is_associated]-> (m:Disease)
RETURN m.name as Disease, r.uncertainty as Uncertainty
ORDER BY r.uncertainty LIMIT 5

Disease                        Uncertainty
Canavan Disease                0.1
Pelizaeus-Merzbacher Disease   0.364
Alexander Disease              0.432
Diffuse Axonal Injury          0.432
Brain Diseases, Metabolic      0.451

Page 15: Connecting the Dots in Early Drug Discovery

What (can you do with this)

Now this is getting more interesting (for us)

N-acetylaspartate association with cellular components

MATCH p = (cpd:Compound {name: 'N-acetylaspartate'}) -[r:is_associated]-> (m:CellularComponent)
RETURN m.name as CellularComponent, r.uncertainty as Uncertainty
ORDER BY r.uncertainty LIMIT 5

CellularComponent     Uncertainty
Axons                 0.582
Myelin Sheath         0.611
Extracellular Fluid   0.772

N-acetylaspartate association with biological processes

MATCH p = (cpd:Compound {name: 'N-acetylaspartate'}) -[r:is_associated]-> (m:BiologicalProcess)
RETURN m.name as BiologicalProcess, r.uncertainty as Uncertainty
ORDER BY r.uncertainty LIMIT 5

BiologicalProcess             Uncertainty
Energy Metabolism             0.476
Dominance, Cerebral           0.532
Functional Laterality         0.586
Cerebrovascular Circulation   0.653
Lipid Metabolism              0.72

Page 16: Connecting the Dots in Early Drug Discovery

How (did we build the graph)

Data sources:
1. MeSH Hierarchy
2. Pubmed articles (pubmed_id, title, abstract, Lucene full text searches enabled)
3. Pubmed Associations
4. Comparative Toxicogenomics Database (CTD)
5. Compound Target Scores*
6. Public compound annotations
7. Entity relations from sentences
8. Protein-protein interactions data set from CCSB
9. MetaCore gene - gene interactions (binds, activates, regulates expression, …)
10. Similarity relations for all the compounds in the graph* (~2M compounds)
11. Gene ontology
12. Protein annotations
13. Pathways / gene sets

Other data sources integrated
(*: NIBR internal data)
See Acknowledgments / References slide

Objects:
• 25,430,635 articles
• 1,951,819 compounds
• 257,000 Mesh and SCR terms
• 59,859 Genes
• 24,769 GO terms
• 10,570 Diseases

Relationships: 91 different relationships
• Compound – is_active – Gene
• X – is_associated – X
• Gene – binding – Gene
• Gene – ubiquitinates – Gene
• Compound – affects_ubiquitination – Gene
• Article – mentions – (compound, gene, mesh)

209,031,615 mentions
50,334,440 is_similar
6,951,257 literature_association
762,002 is_active

30 million nodes, 480 million relationships

Page 17: Connecting the Dots in Early Drug Discovery

How

The different relationships and nodes in the graph

15 Nodes
Article, BiologicalProcess, CellType, CellularComponent, Compound, Disease, Gene, GeneSet, Go, Mesh, Pathway, Pfam, Phenotype, Similar2D, Tissue

91 Relationships
acetylation, adp_ribosylation, affects_ADP_ribosylation, affects_N_linked_glycosylation, affects_O_linked_glycosylation, affects_abundance, affects_acetylation, affects_activity, affects_acylation, affects_alkylation, affects_amination, affects_binding, affects_carbamoylation, affects_carboxylation, affects_chemical_synthesis, affects_cleavage, affects_cotreatment, affects_degradation, affects_ethylation, affects_export, affects_expression, affects_farnesylation, affects_folding,
affects_geranoylation, affects_glucuronidation, affects_glutathionylation, affects_glycation, affects_glycosylation, affects_hydrolysis, affects_hydroxylation, affects_import, affects_lipidation, affects_localization, affects_metabolic_processing, affects_methylation, affects_mutagenesis, affects_nitrosation, affects_oxidation, affects_phosphorylation, affects_prenylation, affects_reaction, affects_reduction, affects_response_to_substance, affects_ribosylation, affects_secretion, affects_splicing,
affects_stability, affects_sulfation, affects_sumoylation, affects_transport, affects_ubiquitination, affects_uptake, binding, cleavage, co_regulation_of_transcription, complex_formation, covalent_modification, deacetylation, demethylation, deneddylation, dephosphorylation, desumoylation, deubiquitination, glycosylation, go_component, go_function, go_process, gpi_anchor, hydroxylation,
is_active, is_associated, is_child_of, is_part_of, is_query, is_similar, member_of, mentions, methylation, mirna_binding, neddylation, oxidation, phosphorylation, ppi, receptor_binding, s_nitrosylation, sulfation, sumoylation, transcription_regulation, transformation, transport, ubiquitination

Page 18: Connecting the Dots in Early Drug Discovery

How (did we build the graph)

Overall build process

[Diagram elements: Pubmed xml files (Titles, Abstracts); Internal data sources, MeSH hierarchies, ctdbase, Pubchem, ChEMBL, ChEBI, CCSB, MetaStore; Information extraction; Compound similarities, Gene sets, Protein annotations, Gene ontologies; MongoDB, PostgreSQL; CSV file staging; Neo4J]

• Information extraction (entity recognition, relationship detection, association rule mining) is done on a Linux cluster
• Neo4J “endpoint” focused on graph mining
• MongoDB and PostgreSQL are also used for data mining purposes
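
The CSV file staging step can be sketched as follows. The header style (`:ID`, `:LABEL`, `:START_ID`, `:END_ID`, `:TYPE`) follows the convention of Neo4j's bulk CSV import tool. Whether the original pipeline used that tool is not stated in the slides, so treat the layout, the helper names and the node ids (`cpd:1`, `dis:1`, `dis:2`) as assumptions. The two uncertainty values are the ones shown for N-acetylaspartate on the disease query slide.

```python
import csv
import io


def stage_nodes(nodes):
    """Serialize node dicts to CSV text with a bulk-import style header."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["id:ID", "name", ":LABEL"])
    for node in nodes:
        writer.writerow([node["id"], node["name"], node["label"]])
    return buf.getvalue()


def stage_relationships(rels):
    """Serialize relationship dicts, carrying uncertainty as a float property."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow([":START_ID", ":END_ID", ":TYPE", "uncertainty:float"])
    for rel in rels:
        writer.writerow(
            [rel["start"], rel["end"], rel["type"], rel["uncertainty"]]
        )
    return buf.getvalue()


# Hypothetical ids; names and uncertainties are taken from the slides.
nodes = [
    {"id": "cpd:1", "name": "N-acetylaspartate", "label": "Compound"},
    {"id": "dis:1", "name": "Canavan Disease", "label": "Disease"},
    {"id": "dis:2", "name": "Brain Diseases, Metabolic", "label": "Disease"},
]
rels = [
    {"start": "cpd:1", "end": "dis:1", "type": "is_associated",
     "uncertainty": 0.1},
    {"start": "cpd:1", "end": "dis:2", "type": "is_associated",
     "uncertainty": 0.451},
]

nodes_csv = stage_nodes(nodes)
rels_csv = stage_relationships(rels)
```

Using the `csv` module rather than string joins matters here because MeSH names such as "Brain Diseases, Metabolic" contain commas and must be quoted.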

Page 19: Connecting the Dots in Early Drug Discovery

What (can you do with this)

Example: Analysis of compound activities

[Figure: compounds A to H, grouped into active compounds and inactive compounds]

Page 20: Connecting the Dots in Early Drug Discovery

What

Example: Analysis of compound activities

[Figure: compounds A to H (active and inactive) connected to numbered nodes 1 to 6]

1. Find genes directly affected by the compounds

Page 21: Connecting the Dots in Early Drug Discovery

What

Example: Analysis of compound activities

[Figure: compounds A to H (active and inactive) connected to numbered nodes 1 to 10]

1. Find genes directly affected by the compounds
2. Find all genes that are indirectly affected with some confidence (below a given uncertainty)

Page 22: Connecting the Dots in Early Drug Discovery

What

Example: Analysis of compound activities

[Figure: compounds A to H (active and inactive) connected to numbered nodes 1 to 10]

1. Find genes directly affected by the compounds
2. Find all genes that are indirectly affected with some confidence (below a given uncertainty)
3. Assign a large distance to nodes that cannot be reached
4. Identify nodes that
• cannot be reached by most of the inactive compounds
• or are “closer” to the actives than the inactives
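
The four steps above can be sketched in plain Python on a toy graph. The graph, compound names and uncertainty values below are invented. The two-hop limit, the 0.9 cut-off and the distance of 1.0 for unreachable nodes follow the query and matrix shown later in the deck, while comparing mean distances in step 4 is an assumed simplification.

```python
def reachable(graph, start, max_hops=2, cutoff=0.9):
    """Smallest summed uncertainty from start to each node within max_hops."""
    best = {}
    frontier = [(start, 0.0)]
    for _ in range(max_hops):
        next_frontier = []
        for node, dist in frontier:
            for neighbour, unc in graph.get(node, {}).items():
                total = dist + unc
                # Keep only confident paths that improve on what we have.
                if total < cutoff and total < best.get(neighbour, float("inf")):
                    best[neighbour] = total
                    next_frontier.append((neighbour, total))
        frontier = next_frontier
    return best


def distance_matrix(graph, compounds, nodes, unreachable=1.0):
    """Steps 1 to 3: distances per compound, unreachable nodes get 1.0."""
    matrix = {}
    for cpd in compounds:
        found = reachable(graph, cpd)
        matrix[cpd] = {node: found.get(node, unreachable) for node in nodes}
    return matrix


def closer_to_actives(matrix, actives, inactives, nodes):
    """Step 4 (assumed criterion): mean distance is smaller for the actives."""
    def mean(cpds, node):
        return sum(matrix[c][node] for c in cpds) / len(cpds)
    return [n for n in nodes if mean(actives, n) < mean(inactives, n)]


# Toy graph: node -> {neighbour: uncertainty}. All values are invented.
graph = {
    "A": {"g1": 0.2},
    "B": {"g1": 0.3},
    "C": {"g2": 0.4},
    "D": {"g2": 0.5},
    "g1": {"g3": 0.3},
    "g2": {"g3": 0.6},
}
actives, inactives = ["A", "B"], ["C", "D"]
nodes = ["g1", "g2", "g3"]

matrix = distance_matrix(graph, actives + inactives, nodes)
candidates = closer_to_actives(matrix, actives, inactives, nodes)
```

Here `g3` is reachable from the actives through `g1`, but the only route from the inactives exceeds the cut-off, so it is assigned the large distance and shows up as a candidate alongside `g1`.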

Page 23: Connecting the Dots in Early Drug Discovery

What

Example: Analysis of compound activities

MATCH (cpd:Compound)
WHERE any(nvs in cpd.cpd_id WHERE nvs in ['cpd1','cpd2',…])
WITH cpd
MATCH p = (cpd) -[r*1..2]-> (m)
WITH cpd, p, m, reduce(u=0.0, r in relationships(p) | u+r.uncertainty) as uncertainty
WHERE uncertainty < 0.9
RETURN cpd.cpd_id as Compound_ID, m.id as ID, uncertainty as Distance
ORDER BY uncertainty

Query reachable nodes

Matrix of compound – node “distances”

Compound_ID  Active  C582554  C495901  C495900
1            0       1.00     1.00     1.00
2            1       0.78     0.89     0.88
3            1       1.00     1.00     1.00
4            0       1.00     1.00     1.00
5            0       1.00     0.78     0.67
6            0       1.00     1.00     1.00
7            0       1.00     1.00     1.00
8            0       0.88     0.88     0.90
9            0       1.00     0.88     0.82
10           1       1.00     1.00     1.00
11           0       1.00     1.00     1.00
12           0       1.00     0.80     0.83
13           0       1.00     1.00     1.00
14           1       1.00     1.00     1.00
15           1       0.82     1.00     1.00
16           1       0.78     0.89     0.88
17           1       0.80     1.00     1.00
18           1       0.80     1.00     1.00
19           1       0.78     0.89     0.88
20           1       0.80     1.00     1.00

Sum of relationship uncertainty is used as distance from compound to node.
Distance to unreachable node is set to 1.0.

Result of recursive partitioning (decision tree)
(and one surrogate split with equivalent performance: 2 nodes of interest)
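
To illustrate what a single step of recursive partitioning does with such a matrix, the sketch below searches for the one threshold split with the lowest weighted Gini impurity, using the 20 rows shown on the slide. It is a one-level stand-in written with the standard library only. The slides do not say which implementation or settings were used, and the excerpt shown here need not reproduce the tree reported in the talk.

```python
# Distance matrix from the slide: (compound id, active flag, three distances).
COLUMNS = ["C582554", "C495901", "C495900"]
ROWS = [
    (1, 0, 1.00, 1.00, 1.00), (2, 1, 0.78, 0.89, 0.88),
    (3, 1, 1.00, 1.00, 1.00), (4, 0, 1.00, 1.00, 1.00),
    (5, 0, 1.00, 0.78, 0.67), (6, 0, 1.00, 1.00, 1.00),
    (7, 0, 1.00, 1.00, 1.00), (8, 0, 0.88, 0.88, 0.90),
    (9, 0, 1.00, 0.88, 0.82), (10, 1, 1.00, 1.00, 1.00),
    (11, 0, 1.00, 1.00, 1.00), (12, 0, 1.00, 0.80, 0.83),
    (13, 0, 1.00, 1.00, 1.00), (14, 1, 1.00, 1.00, 1.00),
    (15, 1, 0.82, 1.00, 1.00), (16, 1, 0.78, 0.89, 0.88),
    (17, 1, 0.80, 1.00, 1.00), (18, 1, 0.80, 1.00, 1.00),
    (19, 1, 0.78, 0.89, 0.88), (20, 1, 0.80, 1.00, 1.00),
]


def gini(flags):
    """Gini impurity of a list of 0/1 activity labels."""
    if not flags:
        return 0.0
    p = sum(flags) / len(flags)
    return 2.0 * p * (1.0 - p)


def best_split(rows, columns):
    """Return (weighted impurity, column, threshold) of the best single split."""
    best = None
    n = len(rows)
    for idx, name in enumerate(columns):
        values = sorted({row[2 + idx] for row in rows})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2.0
            left = [row[1] for row in rows if row[2 + idx] < threshold]
            right = [row[1] for row in rows if row[2 + idx] >= threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, name, threshold)
    return best


score, column, threshold = best_split(ROWS, COLUMNS)
```

On this excerpt the best single split is a distance below roughly 0.85 to node C582554, which isolates seven compounds, all of them active.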

Page 24: Connecting the Dots in Early Drug Discovery

What

Example: Analysis of compound activities

[Figure: graph connecting Compound1 to Compound13 with the identified nodes]

Green: relationships derived from in-house data
Grey: relationships found from text mining

Only showing the active compounds and their connections to the identified nodes.

Page 25: Connecting the Dots in Early Drug Discovery

[Figure: graph of Compound1 to Compound13]

MATCH p = (g1:Gene) -[r*1..2 {datasource: 'metacore'}]-> (g2:Gene)
WHERE g2.gene_symbol in ['FOXO','MTOR']
  and g1.gene_symbol in ['PRKAB1', 'PRKAA1','PRKAA2']
RETURN p, reduce(u=0.0, r in relationships(p) | u+r.uncertainty) as unc
ORDER BY unc LIMIT 20

Page 26: Connecting the Dots in Early Drug Discovery

MATCH p = (g1:Gene) <-[:mentions]- (a:Article) -[:mentions]-> (g2:Gene)
WHERE g2.gene_symbol in ['FOXO','MTOR']
  and g1.gene_symbol in ['PRKAB1', 'PRKAA1','PRKAA2']
RETURN p

MATCH p = (g1:Gene) -[r*1..2 {datasource: 'metacore'}]-> (g2:Gene)
WHERE g2.gene_symbol in ['FOXO','MTOR']
  and g1.gene_symbol in ['PRKAB1', 'PRKAA1','PRKAA2']
RETURN p, reduce(u=0.0, r in relationships(p) | u+r.uncertainty) as unc
ORDER BY unc LIMIT 20

Page 27: Connecting the Dots in Early Drug Discovery

[Figure: graph of Compound1 to Compound13]

Page 28: Connecting the Dots in Early Drug Discovery

Where (is this going)

• More tweaks to what we have
– Improvements to text mining
– Analysis of verbs (actions) / information extraction
– Monitor change over time (what is new “emerging knowledge”)

• Full text analysis
– Enable analysis and inclusion of internal documents

• Incorporate additional data sources
– Gene Expression data (tissue expression and perturbations)
– Mutations
– Proteomics

• Refining the “uncertainty” measure
– How best to compare uncertainties from different data sources

• Expand user base
• Automated updates

Page 29: Connecting the Dots in Early Drug Discovery

Acknowledgments / References

• ISLD group
– John Davies
– Miguel Camargo
– Eugen Lounkine
– Elisabet Gregori-Puigjane
– Mark Bray
– Pierre Farmer
– Ansgar Schuffenhauer

• Text mining group
– Therese Vachon
– Pierre Parrisot
– Andrea Splendiani
– Fatima Oezdemir-Zaech
– Frederic Sutter

• CPC
– Sylvain Cottens
– Doug Auld

• DMP
– Jeremy Jenkins
– Ben Cornett
– Florian Nigsch

• NX
– Stephen Litster

Source References

• Protein information:
– Pfam: R.D. Finn et al., The Pfam protein families database: towards a more sustainable future, Nucleic Acids Research (2016) Database Issue 44:D279-D285. http://pfam.xfam.org/
– Uniprot: The UniProt Consortium, UniProt: a hub for protein information, Nucleic Acids Res. 43: D204-D212 (2015). http://www.uniprot.org/

• Comparative Toxicogenomics database:
– Davis AP et al., The Comparative Toxicogenomics Database's 10th year anniversary: update 2015. Nucleic Acids Res. 2015 Jan;43 (Database issue): D914-20.
– Curated chemical–gene data were retrieved from the Comparative Toxicogenomics Database (CTD), MDI Biological Laboratory, Salisbury Cove, Maine, and NC State University, Raleigh, North Carolina. World Wide Web (URL: http://ctdbase.org/). [May 2016].

• MetaCore
– Thomson Reuters LifeSciences. http://thomsonreuters.com/en/products-services/pharma-life-sciences/pharmaceutical-research/metacore.html

• Protein-Protein interaction data set:
– Center for Cancer Systems Biology (CCSB) at the Dana Farber Cancer Institute. http://ccsb.dfci.harvard.edu/

• Gene Ontology
– The Gene Ontology Consortium. Gene Ontology Consortium: going forward. (2015) Nucl Acids Res 43 Database issue D1049–D1056. http://geneontology.org/

• Pathways
– Reactome pathway database: A. Fabregat et al., The Reactome pathway Knowledgebase, Nucl. Acids Res. (04 January 2016) 44 (D1): D481-D487; D. Croft et al., The Reactome pathway knowledgebase, Nucl. Acids Res. (1 January 2014) 42 (D1): D472-D477. http://reactome.org/

Page 30: Connecting the Dots in Early Drug Discovery

Thank you