Exploring Chemical Space with Computers— Challenges and Opportunities Pierre Baldi UCI

Page 1: Exploring Chemical Space with Computers—Challenges and Opportunities

Exploring Chemical Space with Computers—Challenges and Opportunities

Pierre Baldi, UCI

Page 2: Exploring Chemical Space with Computers—Challenges and Opportunities

Chemical Informatics

Historical perspective: physics, chemistry and biology
Understanding chemical space
Small molecules (systems biology, chemical synthesis, drug design, nanotechnology)

Page 3: Exploring Chemical Space with Computers—Challenges and Opportunities

Chemical Space

            Stars        Small Mol.
Existing    10^22        10^7
Virtual     0            10^60 (?)
Access      Difficult    “Easy”
Mode        Individual   Combinatorial

Page 4: Exploring Chemical Space with Computers—Challenges and Opportunities

Chemical Space

Page 5: Exploring Chemical Space with Computers—Challenges and Opportunities

Chemical Informatics

Historical perspective: physics, chemistry and biology
Understanding chemical space
Small molecules (systems biology, chemical synthesis, drug design, nanotechnology)
Predict physical, chemical, biological properties (classification/regression)
Build filters/tools to efficiently navigate chemical space to discover new drugs, new galaxies, etc.

Page 6: Exploring Chemical Space with Computers—Challenges and Opportunities

Methods

Spectrum:
  Schrödinger Equation
  Molecular Dynamics
  Machine Learning (e.g. SS prediction)

Page 7: Exploring Chemical Space with Computers—Challenges and Opportunities

Chemical Informatics

Informatics must be able to deal with variable-size structured data
  Graphical Models
  (Recursive) Neural Networks
  ILP
  GA
  SGs
  Kernels

Page 8: Exploring Chemical Space with Computers—Challenges and Opportunities

Two Essential Ingredients

1. Data
2. Similarity Measures

Bioinformatics analogy and differences:
  Data (GenBank, Swissprot, PDB)
  Similarity (BLAST)

Page 9: Exploring Chemical Space with Computers—Challenges and Opportunities

Data

Mutag (Mutagenicity)
  200 compounds (125/63), mutagenicity in Salmonella
PTC (Predictive Toxicity Challenge)
  A few hundred compounds, carcinogenicity (FM, MM, FR, MR)
NCI (Anti-cancer activity)
  70,000 compounds screened for ability to inhibit growth in 60 human tumor cell lines
Alkanes (Boiling points)
  All 150 non-cyclic alkanes (C_nH_{2n+2}) with n < 11 and their boiling points ([-164, 174] °C)
Benzodiazepines (QSAR)
  79 1,4-benzodiazepin-2-ones, affinity towards GABA_A
ChemDB
  7M compounds

Page 10: Exploring Chemical Space with Computers—Challenges and Opportunities

Similarity

Rapid Searches of Large Databases

Predictive Methods (Kernel Methods)

Why it is not hopeless

Page 11: Exploring Chemical Space with Computers—Challenges and Opportunities

Similarity

Rapid Search of Large Databases
  Protein Receptor (Docking)
  Small Molecule/Ligand (Similarity)
Predictive Methods (Kernel Methods)
Why it is not hopeless

Organic Chemicals

Page 12: Exploring Chemical Space with Computers—Challenges and Opportunities

Linear Classifiers

Page 13: Exploring Chemical Space with Computers—Challenges and Opportunities

Classification

Learning to Classify
  Limited number of training examples (molecules, patients, sequences, etc.)
  Learning algorithm (how to build the classifier?)
  Generalization: should correctly classify test data.
Formalization
  X is the input space
  Y (e.g. toxic/non toxic, or {1,-1}) is the target class
  f: X→Y is the classifier.

Page 14: Exploring Chemical Space with Computers—Challenges and Opportunities

Classification

Fundamental Point: f is entirely determined by the dot products $\langle x_i, x_j \rangle$ measuring the similarity between pairs of data points
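To make the point about dot products concrete, here is a minimal sketch showing that a linear decision function can be evaluated in "dual" form, touching the training points only through dot products. The points, labels, and coefficients are made-up illustrative values, and the names `f_primal` and `f_dual` are hypothetical, not taken from the talk.

```python
# Sketch: the same linear decision function written two ways.
# All numbers below are illustrative assumptions, not data from the slides.

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

train_x = [(1.0, 2.0), (2.0, 0.5), (-1.0, -1.5)]
train_y = [1, 1, -1]
alpha = [0.5, 0.25, 1.0]  # hypothetical dual coefficients

# Primal weight vector: w = sum_i alpha_i * y_i * x_i
w = [sum(a * y * x[d] for a, y, x in zip(alpha, train_y, train_x))
     for d in range(2)]

def f_primal(x):
    """Score computed from the explicit weight vector."""
    return dot(w, x)

def f_dual(x):
    """Same score computed from dot products with the training points only."""
    return sum(a * y * dot(xi, x) for a, y, xi in zip(alpha, train_y, train_x))

query = (0.5, -1.0)
print(f_primal(query), f_dual(query))  # the two scores agree
```

Because `f_dual` never needs `w`, any similarity function that behaves like a dot product can be substituted for `dot`.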

Page 15: Exploring Chemical Space with Computers—Challenges and Opportunities

Nonlinear Classification (Kernel Methods)

We can transform a nonlinear problem into a linear one using a kernel.

Page 16: Exploring Chemical Space with Computers—Challenges and Opportunities

Nonlinear Classification (Kernel Methods)

We can transform a nonlinear problem into a linear one using a kernel K.

Fundamental property: the linear decision surface depends on $K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$.
All we need is the Gram similarity matrix K.
K defines the local metric of the embedding space.
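As a small worked check of the kernel property, the sketch below uses the degree-2 polynomial kernel on 2D points, chosen here only as a familiar example (the slides do not name a specific kernel at this step). It compares the kernel value with the dot product of an explicit feature map and builds a Gram matrix. The function names `phi` and `K` mirror the symbols on the slide; the sample points are assumptions.

```python
import math

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel in 2D."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def K(x, z):
    """Kernel value computed without ever building phi(x)."""
    return dot(x, z) ** 2

# Illustrative points (assumed, not from the talk).
points = [(1.0, 2.0), (3.0, -1.0), (0.5, 0.5)]

# Gram matrix: all pairwise kernel values.
gram = [[K(p, q) for q in points] for p in points]

for p in points:
    for q in points:
        # K(p, q) matches <phi(p), phi(q)> up to rounding.
        assert abs(K(p, q) - dot(phi(p), phi(q))) < 1e-9
```

A kernel classifier only ever consults `gram`, which is why the explicit map `phi` can stay implicit.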

Page 17: Exploring Chemical Space with Computers—Challenges and Opportunities

Similarity: Data Representations

NC(O)C(=O)O

[Figure: 2D structure drawing of the same molecule]

Page 18: Exploring Chemical Space with Computers—Challenges and Opportunities

Molecular Representations

1D: SMILES strings
2D: Graph of bonds
2.5D: Surfaces
3D: Atomic coordinates
4D: Temporal evolution

Page 19: Exploring Chemical Space with Computers—Challenges and Opportunities

1D SMILES Kernel

Molecule 1: CCCCCCc1ccc(cc1O)O
Molecule 2: CCCCCc1ccc(cc1)CO

[Figure: 2D structure drawings of the two molecules]

Kmer counts, molecule 1:
Kmer   Count
CCCC   3
CCCc   1
CCc1   1
Cc1c   1
c1cc   1
1ccc   1
ccc(   1
cc(c   1
c(cc   1
(cc1   1
cc1O   1
c1O)   1
1O)O   1

Kmer counts, molecule 2:
Kmer   Count
CCCC   2
CCCc   1
CCc1   1
Cc1c   1
c1cc   1
1ccc   1
ccc(   1
cc(c   1
c(cc   1
(cc1   1
cc1)   1
c1)C   1
1)CO   1

Kmer   Count1  Count2  Product
(cc1   1       1       1
1)CO   0       1       0
1O)O   1       0       0
1ccc   1       1       1
CCCC   3       2       6
CCCc   1       1       1
CCc1   1       1       1
Cc1c   1       1       1
c(cc   1       1       1
c1)C   0       1       0
c1O)   1       0       0
c1cc   1       1       1
cc(c   1       1       1
cc1)   0       1       0
cc1O   1       0       0
ccc(   1       1       1

Total: 15
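The k-mer counting on this slide can be sketched in a few lines. This is a minimal illustration assuming k = 4 (the length of the k-mers shown) and plain substring counting on the SMILES text; the function names `kmer_counts` and `spectrum_kernel` are hypothetical, not the authors' code.

```python
from collections import Counter

def kmer_counts(smiles, k=4):
    """Count every length-k substring of a SMILES string."""
    return Counter(smiles[i:i + k] for i in range(len(smiles) - k + 1))

def spectrum_kernel(s1, s2, k=4):
    """Sum over shared k-mers of count-in-s1 times count-in-s2."""
    c1, c2 = kmer_counts(s1, k), kmer_counts(s2, k)
    return sum(c1[m] * c2[m] for m in c1.keys() & c2.keys())

mol1 = "CCCCCCc1ccc(cc1O)O"
mol2 = "CCCCCc1ccc(cc1)CO"

print(kmer_counts(mol1)["CCCC"], kmer_counts(mol2)["CCCC"])  # 3 2
print(spectrum_kernel(mol1, mol2))  # 15, the total shown on the slide
```

Only k-mers present in both strings contribute, so the sum over the intersection of keys equals the sum of the Product column.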

Page 20: Exploring Chemical Space with Computers—Challenges and Opportunities

2D Molecule Graph Kernel

For chemical compounds
  atom/node labels: A = {C, N, O, H, …}
  bond/edge labels: B = {s, d, t, ar, …}
Count labeled paths
Fingerprints
Example labeled path: (CsNsCdO)
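One way to count labeled paths is a depth-first walk from every atom, concatenating atom and bond labels along the way. The sketch below is an assumption-laden illustration: the four-atom fragment (hydrogens omitted), the `max_bonds` depth limit, and the function name `labeled_paths` are all hypothetical choices, not details given in the talk.

```python
from collections import Counter

def labeled_paths(atoms, bonds, max_bonds=3):
    """Count label strings of all simple paths with at most max_bonds bonds."""
    adjacency = {i: [] for i in range(len(atoms))}
    for a, b, bond_label in bonds:
        adjacency[a].append((b, bond_label))
        adjacency[b].append((a, bond_label))

    counts = Counter()

    def walk(node, visited, label):
        counts[label] += 1
        if len(visited) - 1 == max_bonds:  # depth limit reached
            return
        for neighbor, bond_label in adjacency[node]:
            if neighbor not in visited:  # simple paths only, no revisits
                walk(neighbor, visited | {neighbor},
                     label + bond_label + atoms[neighbor])

    for start in range(len(atoms)):
        walk(start, {start}, atoms[start])
    return counts

# Hypothetical fragment C-N-C=O with s = single and d = double bonds.
atoms = ["C", "N", "C", "O"]
bonds = [(0, 1, "s"), (1, 2, "s"), (2, 3, "d")]

paths = labeled_paths(atoms, bonds)
print(paths["CsNsCdO"])  # 1
```

Each path is counted once per direction of traversal, so the reversed label `OdCsNsC` also appears; a fingerprint would typically hash these label strings into a fixed-length bit vector.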

Page 21: Exploring Chemical Space with Computers—Challenges and Opportunities

Similarity Measures
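The content of this slide did not survive extraction. The results table later in the deck names Tanimoto and MinMax kernels, so here is a sketch of both under their standard definitions: Tanimoto as shared features over total distinct features, and MinMax as the sum of minimum counts over the sum of maximum counts. The feature counts and function names are illustrative assumptions.

```python
from collections import Counter

def tanimoto(a, b):
    """Tanimoto similarity on feature presence: |A and B| / |A or B|."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

def minmax(a, b):
    """MinMax similarity on feature counts: sum of mins / sum of maxes."""
    keys = set(a) | set(b)
    total_max = sum(max(a[k], b[k]) for k in keys)
    return sum(min(a[k], b[k]) for k in keys) / total_max if total_max else 0.0

# Hypothetical labeled-path counts for two molecules.
mol_a = Counter({"CsN": 2, "CdO": 1})
mol_b = Counter({"CsN": 1, "CdO": 1, "CsO": 1})

print(tanimoto(mol_a, mol_b))  # 2 shared features out of 3 distinct
print(minmax(mol_a, mol_b))    # (1 + 1 + 0) / (2 + 1 + 1) = 0.5
```

Tanimoto ignores how often a feature occurs, while MinMax uses the counts; on binary fingerprints the two coincide.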

Page 22: Exploring Chemical Space with Computers—Challenges and Opportunities

3D Coordinate Kernel

[Figure: molecule annotated with interatomic distances 1.4 Å, 2.0 Å, 2.8 Å, 3.4 Å, 4.2 Å]

Atom Distance Histogram (x-axis: Distance (Angstroms); y-axis: Count)

Distance  Count
0         0
1         5
2         7
3         3
4         1
5         0
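An atom distance histogram can be sketched as follows. The binning rule (1 Å wide bins, floor of the distance), the coordinates, and the use of a histogram dot product as the kernel are assumptions made for illustration; the slides do not specify these details.

```python
import math
from collections import Counter
from itertools import combinations

def distance_histogram(coords, bin_width=1.0):
    """Histogram of all pairwise atom distances, keyed by bin index."""
    hist = Counter()
    for p, q in combinations(coords, 2):
        hist[int(math.dist(p, q) // bin_width)] += 1
    return hist

def histogram_kernel(h1, h2):
    """Dot product of two histograms over their shared bins."""
    return sum(h1[b] * h2[b] for b in h1.keys() & h2.keys())

# Hypothetical 3D coordinates in Angstroms.
coords_a = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
coords_b = [(0.0, 0.0, 0.0), (1.2, 0.0, 0.0)]

hist_a = distance_histogram(coords_a)  # two distances in bin 1, one in bin 3
hist_b = distance_histogram(coords_b)  # one distance in bin 1
print(histogram_kernel(hist_a, hist_b))  # 2
```

Because only distances are used, the histogram does not change when the molecule is rotated or translated.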

Page 23: Exploring Chemical Space with Computers—Challenges and Opportunities

Example of Results

Page 24: Exploring Chemical Space with Computers—Challenges and Opportunities

Results

Page 25: Exploring Chemical Space with Computers—Challenges and Opportunities

Results

Page 26: Exploring Chemical Space with Computers—Challenges and Opportunities

Results

[Figure: prediction accuracy by cell line; x-axis: Cell Line, y-axis: Prediction Accuracy (0.65 to 0.75)]

1D SMILES (71.7% avg, 1.17% stdev)
2D Molecule Graph (72.3% avg, 0.99% stdev)
3D Coordinates (69.8% avg, 1.27% stdev)

Page 27: Exploring Chemical Space with Computers—Challenges and Opportunities

Example of Results

Page 28: Exploring Chemical Space with Computers—Challenges and Opportunities

Summary

Derived a variety of kernels for small molecules
State-of-the-art performance on several benchmark datasets
2D kernels slightly better than 1D and 3D kernels
Many possible extensions: 2.5D kernels, isomers, etc.
Need for larger data sets and new models of cooperation in the chemistry community
Many open (ML) questions (e.g. clustering and visualizing 10^7 compounds, intelligent recognition of useful molecules, information retrieval from literature, docking, prediction of reaction rates, matching table of all proteins against all known compounds, origin of life)
Chemistry version of the Turing test

Page 29: Exploring Chemical Space with Computers—Challenges and Opportunities

ChemDB

7M compounds (3.5M unique)
Commercially available
PostgreSQL/Oracle
Annotation (Experimental, Computational)
Searchable Web interface
Similarity, in silico reactions

Page 30: Exploring Chemical Space with Computers—Challenges and Opportunities

Acknowledgements

Informatics: Liva Ralaivola, J. Chen, S. J. Swamidass, Yimeng Dou, Peter Phung, Jocelyne Bruand
Funding: NIH, NSF, IGB
Pharmacology: Daniele Piomelli
Chemistry: G. Weiss, J. S. Nowick, R. Chamberlin

Page 31: Exploring Chemical Space with Computers—Challenges and Opportunities

New Questions

Predict drug-like molecules? Toxicity?
  New strategies
How can we search efficiently? Intelligently?
  New data structures and algorithms
  Optimizing old structures
How can we understand this much data?
  Cluster and visualize millions of data points
  Define commercially accessible space.
Are there other useful things we can do with this?
  Discover new polymers, etc.
  Wonder about the origin of life.
  Combinatorially combine all known chemicals.

Page 32: Exploring Chemical Space with Computers—Challenges and Opportunities

Acknowledgements

Jocelyne Bruand, Peter Phung, Liva Ralaivola, S. Joshua Swamidass, Yimeng Dou
NIH/NSF/IGB

Questions

Page 33: Exploring Chemical Space with Computers—Challenges and Opportunities

Docking

Database of potential drugs: 6 million small molecules
Query: Binding Site of Protein
Scoring Function & Efficient Minimizer

Page 34: Exploring Chemical Space with Computers—Challenges and Opportunities

Some Targets

P53 (Luecke)
ACCD5 (Tsai)
IMPDH, PPAR, etc. (Luecke)
HIV Integrase (Robinson)

Page 35: Exploring Chemical Space with Computers—Challenges and Opportunities

P53

Page 36: Exploring Chemical Space with Computers—Challenges and Opportunities

Drug Rescue of P53 Mutants

Page 37: Exploring Chemical Space with Computers—Challenges and Opportunities

Docking → ChemDB

~6 million commercially available compounds

Searchable, annotated, downloadable.

Other Databases: Cambridge Structural Database, ChemBank, PubChem

Page 38: Exploring Chemical Space with Computers—Challenges and Opportunities

Chemical Toxicity Prediction

By Kernel Methods

Jonathan Chen, S. Joshua Swamidass

The Baldi Lab

Page 39: Exploring Chemical Space with Computers—Challenges and Opportunities

Data Flow

Toxicity State List -> Kernel -> Gram Matrix -> Linear Classifier -> Predictions

Toxicity State List (structure drawings omitted):
ID  Toxic?
1   No
2   No
3   Yes
4   Yes

Gram Matrix:
ID   1    2    3    4    …
1    21   4    5    10   …
2    4    14   5    3    …
3    5    5    15   6    …
4    10   3    6    23   …
…    …    …    …    …    …
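The data flow above can be sketched end to end on the four molecules shown. The slides do not say which linear classifier was used, so a kernel perceptron stands in here as an assumption; it trains directly on the precomputed Gram matrix and never looks at the molecules themselves. The Gram entries and labels are the ones on the slide (Yes = +1, No = -1); the function names are hypothetical.

```python
def train_kernel_perceptron(gram, labels, max_epochs=100):
    """Learn one mistake count per training example from a Gram matrix."""
    n = len(labels)
    alpha = [0] * n
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(n):
            score = sum(alpha[j] * labels[j] * gram[j][i] for j in range(n))
            if labels[i] * score <= 0:  # misclassified (or undecided)
                alpha[i] += 1
                mistakes += 1
        if mistakes == 0:  # a full pass with no errors: stop
            break
    return alpha

def predict(kernel_values, alpha, labels):
    """Classify from kernel values between a molecule and the training set."""
    score = sum(a * y * k for a, y, k in zip(alpha, labels, kernel_values))
    return 1 if score > 0 else -1

# Gram matrix and toxicity labels for molecule IDs 1 to 4, from the slide.
gram = [
    [21, 4, 5, 10],
    [4, 14, 5, 3],
    [5, 5, 15, 6],
    [10, 3, 6, 23],
]
labels = [-1, -1, 1, 1]  # No, No, Yes, Yes

alpha = train_kernel_perceptron(gram, labels)
predictions = [predict([gram[j][i] for j in range(4)], alpha, labels)
               for i in range(4)]
print(predictions)  # matches labels on the training set
```

To score a new molecule, one would compute its kernel value against each training molecule and pass that list to `predict`.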

Page 40: Exploring Chemical Space with Computers—Challenges and Opportunities

Results

[Figure: prediction accuracy by cell line; x-axis: Cell Line, y-axis: Prediction Accuracy (0.50 to 1.00)]

1D SMILES (71.7% avg, 1.17% stdev)
2D Molecule Graph (72.3% avg, 0.99% stdev)
3D Coordinates (69.8% avg, 1.27% stdev)
Default (54.2% avg, 3.49% stdev)

Page 41: Exploring Chemical Space with Computers—Challenges and Opportunities

Example of Results

Kernel/Method                  Mutag  MM    FM    MR    FR
Kashima (2003)                 89.1   61.0  61.0  62.8  66.7
Kashima (2003)                 85.1   64.3  63.4  58.4  66.1
1D SMILES spec.                84.0   66.1  61.3  57.3  66.1
1D SMILES spec+                85.6   66.4  63.0  57.6  67.0
2D Tanimoto                    87.8   66.4  64.2  63.7  66.7
2D MinMax                      86.2   64.0  64.5  64.5  66.4
2D Tanimoto, l = 1024, b = 1   87.2   66.1  62.4  65.7  66.9
2D Hybrid l = 1024, b = 1      87.2   65.2  61.9  64.2  65.8
2D Tanimoto, l = 512, b = 1    84.6   66.4  59.9  59.9  66.1
2D Hybrid l = 512, b = 1       86.7   65.2  61.0  60.7  64.7
2D Tanimoto, l = 1024 + MI     84.6   63.1  63.0  61.9  66.7
2D Hybrid l = 1024 + MI        84.6   62.8  63.7  61.9  65.5
2D Tanimoto, l = 512 + MI      85.6   60.1  61.0  61.3  62.4
2D Hybrid l = 512 + MI         86.2   63.7  62.7  62.2  64.4
3D Histogram                   81.9   59.8  61.0  60.8  64.4

Page 42: Exploring Chemical Space with Computers—Challenges and Opportunities

Chemical Informatics

Historical perspective: physics, chemistry and biology
Understanding chemical space
Small molecules (systems biology, chemical synthesis, drug design, nanotechnology)
Catalog
Predict physical, chemical, biological properties
Build filters/tools to efficiently navigate chemical space to discover new drugs, new galaxies, etc.

Page 43: Exploring Chemical Space with Computers—Challenges and Opportunities

Datasets

Page 44: Exploring Chemical Space with Computers—Challenges and Opportunities

Small Molecules as Undirected Labeled Graphs of Bonds

atom/node labels: A = {C, N, O, H, …}
bond/edge labels: B = {s, d, t, ar, …}

Page 45: Exploring Chemical Space with Computers—Challenges and Opportunities

Chemical Informatics

Historical perspective: physics, chemistry and biology
Understanding chemical space
Small molecules (systems biology, chemical synthesis, drug design, nanotechnology)
Bioinformatics analogy:
  Catalog (GenBank)
  Search (BLAST)
Predict physical, chemical, biological properties
Build filters/tools to efficiently navigate chemical space to discover new drugs, new galaxies, etc.
