40
Random Forest Ensembles Applied to MLSCN Screening Data for Prediction and Feature Selection Rajarshi Guha School of Informatics Indiana University and Stephan Schürer Scripps, FL 23rd August, 2007

Random Forest Ensembles Applied to MLSCN Screening Data for Prediction and Feature Selection

  • Upload
    rguha

  • View
    1.029

  • Download
    3

Embed Size (px)

Citation preview

Page 1: Random Forest Ensembles Applied to MLSCN Screening Data for Prediction and Feature Selection

Random Forest Ensembles Applied toMLSCN Screening Data for Prediction

and Feature Selection

Rajarshi GuhaSchool of Informatics

Indiana Universityand

Stephan SchürerScripps, FL

23rd August, 2007

Page 2: Random Forest Ensembles Applied to MLSCN Screening Data for Prediction and Feature Selection

Broad Goals Understand and possibly predict cytotoxicity

− Utilizing MLSCN screening data and external data

− Characterize and visualize various screening results

− Relate screening data to known information

Model and predict acute toxicity in animals− Relate large cytotoxicty data sets to animal toxicity(?)

Modelling protocols to handle the characteristics of HTS data− Large datasets, imbalanced classes, applicability

Make models publicly available− For use in multiple scenarios and accessible by a variety of methods

Page 3: Random Forest Ensembles Applied to MLSCN Screening Data for Prediction and Feature Selection

Datasets Animal Acute Toxicity Data was extracted from the

ToxNet database (available from MDL)− Selected only LD50 data for mouse and rat and three routes of

administration

− Summarized LD50 data by structure, species and route(140,808 LD50 data points, 103,040 structures)

− Classified into Toxic/Nontoxic using a cutoff

Cytotoxicity Data was taken as published in PubChemfrom Scripps and NCGC− Scripps Jurkat cytotoxicity assay

(59,805 structures with %Inhib, 801 IC50 values)

− NCGC data from PubChem for 13 cell lines (non-MLSMR structures):summarized multiple sample data by unique structures and extracted IC50data: 1,334 structure, 13 x 1,334 IC50 values for different cell lines

Page 4: Random Forest Ensembles Applied to MLSCN Screening Data for Prediction and Feature Selection

Scripps Cytotoxicity Models

57,469 valid structures

775 structures with measured IC50− Skipped 26 structures that BCI could not parse

How do we model this dataset?− Use all data. Very poor results− Use a sampling procedure to get an ensemble of

models− Consider just the 775 structures

Page 5: Random Forest Ensembles Applied to MLSCN Screening Data for Prediction and Feature Selection

Scripps Cytotoxicity Models

First considered the 775 structures

Evaluated 1052 bit BCI fingerprints

Selected a cutoff pIC50− >= 5.5 - toxic− < 5.5 – nontoxic

Used sampling tocreate 10-memberensemble

Page 6: Random Forest Ensembles Applied to MLSCN Screening Data for Prediction and Feature Selection

Handling Imbalanced Classes

Toxic Nontoxic

TSETPSET

PSET TSET-1 TSET-2 TSET-N

RF 1 RF 2 RF N

Preds 1 Preds 2 Preds N

Dataset

Consensus Predictions

Page 7: Random Forest Ensembles Applied to MLSCN Screening Data for Prediction and Feature Selection

Scripps Cytotoxicity Models % correct

(ensemble average) = 69%

% correct(consensus) = 71%

Not very goodperformance

Nontoxic Toxic

Nontoxic 39 17

Toxic 15 41

Page 8: Random Forest Ensembles Applied to MLSCN Screening Data for Prediction and Feature Selection

Do More Negatives Help?

Include 10,000 structures, randomly selected− Primary data, assumed to be nontoxic

Selected a cutoff pIC50− >= 5.0 - toxic− < 5.0 – nontoxic

Used sampling tocreate 10-memberensemble

Page 9: Random Forest Ensembles Applied to MLSCN Screening Data for Prediction and Feature Selection

Expanded Cytotoxicity Dataset % correct (averaged over

the ensemble) = 69%

% correct (consensusprediction) = 71%

Not muchimprovement

Insufficient samplingof the nontoxics

Nontoxic Toxic

Nontoxic 79 24

Toxic 35 68

Page 10: Random Forest Ensembles Applied to MLSCN Screening Data for Prediction and Feature Selection

We Need More Positives

The two datasets (775 vs 10,775 compounds)are quite similar in terms of bit spectrum

Normalized Manhattan distance = 0.016

Page 11: Random Forest Ensembles Applied to MLSCN Screening Data for Prediction and Feature Selection

Selecting Important Features Identifying the important features in an ensemble

− Each RF model can rank the input features

− A consistent ensemble should have similar, but notnecessarily identical, features highly ranked

− Consider the unique set of top N

− The size of the unique set of top N features providesinformation about the robustness of the ensemble

4

3

15

23

4

8

16

15

3

23

15

8

4

5

28

23

3 4 5 8 15 16 23 28

Ranked features from each member of the ensemble

Unique set oftop 4 features

Page 12: Random Forest Ensembles Applied to MLSCN Screening Data for Prediction and Feature Selection

Important Structural Features

The 10 most important features for predictiveability across the ensemble leads to 43 uniqueimportant bits

This is a total of 66 structural features− The toxic compounds are characterized by having a

slightly larger number of these features, on average

Page 13: Random Forest Ensembles Applied to MLSCN Screening Data for Prediction and Feature Selection

Predicting Animal Toxicity

Should we use cytotoxicity model to predictanimal toxicity?

Normalized Manhattan distance = 0.037

Page 14: Random Forest Ensembles Applied to MLSCN Screening Data for Prediction and Feature Selection

Predicting Animal Toxicity

Performance really depends on the modelcutoff and our goals

Cutoff = 0.6, 93% correct

Cutoff = 0.5, 75% correct

Cutoff = 0.4, 46% correct

Nontoxic Toxic

Nontoxic 43072 1683

Toxic 1748 140

Nontoxic Toxic

Nontoxic 34674 1158

Toxic 10146 665

Nontoxic Toxic

Nontoxic 20369 587

Toxic 24451 1236

Page 15: Random Forest Ensembles Applied to MLSCN Screening Data for Prediction and Feature Selection

NCGC Toxicity Dataset

Considered 13 cell lines, pIC50's

1334 compounds, including− metals− inorganics

Classified into toxic / nontoxic using a cutoff− mean + 2 * SD

Built models for each cell line

Page 16: Random Forest Ensembles Applied to MLSCN Screening Data for Prediction and Feature Selection

NCGC – Class Distributions

Cutoff values ranged from 3.56 to 4.72

Classes are severely imbalanced

Developed ensembles of RF models

Page 17: Random Forest Ensembles Applied to MLSCN Screening Data for Prediction and Feature Selection

NCGC Model ROC Curves

Page 18: Random Forest Ensembles Applied to MLSCN Screening Data for Prediction and Feature Selection

NCGC Model ROC Curves

Page 19: Random Forest Ensembles Applied to MLSCN Screening Data for Prediction and Feature Selection

NCGC – Model Performance(Prediction Set)

Page 20: Random Forest Ensembles Applied to MLSCN Screening Data for Prediction and Feature Selection

NCGC – Using the Models

Predicted toxicity class for the ScrippsCytotoxicity dataset (775 compounds) usingmodel built for NCGC Jurkat cell line

Predictions for the Scripps Cytotox dataset, using the original cutoffs (32% correct)

Predictions for the Scripps Cytotox dataset, using the NCGC cutoff (75% correct)

Nontoxic Toxic

Nontoxic 67 49

Toxic 432 227

Nontoxic Toxic

Nontoxic 26 90

Toxic 109 550

Page 21: Random Forest Ensembles Applied to MLSCN Screening Data for Prediction and Feature Selection

Comparing the datasets as a whole

Comparing NCGC & Scripps Datasets

Page 22: Random Forest Ensembles Applied to MLSCN Screening Data for Prediction and Feature Selection

Comparing NCGC & Scripps Datasets

Comparing datasets class-wise

Page 23: Random Forest Ensembles Applied to MLSCN Screening Data for Prediction and Feature Selection

Important Features

We consider the NCGC Jurkat cell line

The 10 most important features for predictiveability across the ensemble leads to 53 uniqueimportant bits

This is a total of 72 structural features− The toxic compounds are characterized by having a

larger number of these features, on average

Page 24: Random Forest Ensembles Applied to MLSCN Screening Data for Prediction and Feature Selection

Feature matches for example structure

[C;D2](-;@[C,c])-;@[C,c] [O,o;D2](~[C,c])~[C,c] [O,o]!@[C,c]@[C,c]@[C,c]!@[C,c]

[O,o]!@[C,c]@[C,c]@[C,c]@[C,c]!@[O,o]

[O,o]~[C,c]~[O,o]~[C,c]~[C,c]~[O,o]

[O,o,S,s,Se,Te,Po]!@[C,c,Si,si,Ge,Sn,Pb]@[C,c,Si,si,Ge,Sn,Pb]@[C,c,Si,si,Ge,Sn,Pb]

[*]-;@[A]-;@[A]=;@[A]-;!@[*]

[*]!@[*]@[*]@[*]@[*]!@[*]

[*]1@[*]@[*]@[*]@[*]@[*]@1

DigoxinH3C

O

O H

O H

O

H O O

O H

O

C H3

O H

C H3

O H

O

O

OH3C

O

C H3

Page 25: Random Forest Ensembles Applied to MLSCN Screening Data for Prediction and Feature Selection

Important FeaturesAnimal Toxicity vs. Cytotoxicity

The ToxNet (Mouse/IP) and NCGC Jurkat modelshave 130 important features in common

These features are more common in the NCGC toxiccompounds than in the NCGC nontoxic compounds

The average number of these features present in theNCGC dataset, overall, is 18.8− Very low, might indicate that the NCGC model is not going to

be applicable to the ToxNet data

Page 26: Random Forest Ensembles Applied to MLSCN Screening Data for Prediction and Feature Selection

Toxicity vs No. of Features - NCGC Dataset

The important featuresfocus on the toxic class

No correlation between number offeatures and pIC50 for nontoxic class

Dactinomycin

Daunorubicin

Leukerin deriv.

N H

2

N

S H

N

H N

N

CH3

CH3

NH

OHN

O

NH2

OO

CH3

ONH

CH3

O

O

H3C CH3

N

H3C

O

NH3C

O

N

O

NHO

CH3

H3C

N

CH3

H3C

OO

H3C

H3C

NH3C

O N

CH3

O

N

O

HC l

CH3

O

OOH

HO

OH2N

HO

CH3

O

OH O

O

HO

C H3

C H3

Page 27: Random Forest Ensembles Applied to MLSCN Screening Data for Prediction and Feature Selection

Predicting Animal Toxicity

Predictions for the ToxNet Mouse/IP dataset. 29% correct overall. 70% correct on the toxic class

Overall predictive performance is poor

Possible causes− Poor sampling of the nontoxics during training− Feature distributions between the two datasets

Nontoxic Toxic

Nontoxic 12182 558

Toxic 32638 1265

Page 28: Random Forest Ensembles Applied to MLSCN Screening Data for Prediction and Feature Selection

Feature Distributions – ToxNet vsNCGC

ToxNet (Mouse/IP)

Page 29: Random Forest Ensembles Applied to MLSCN Screening Data for Prediction and Feature Selection

Whats Next? Relate structural features to mechanisms of toxicity

Incorporate these into models / build class models− Different cell-lines vs. animal toxicity

− Structural features vs. mechanisms?

Based on prediction confidence and modelapplicability, can we suggest alternative assays?

Use the vote fraction & common bit counts to prioritizecompounds, which may be toxic− Improve assessment of model applicability

Page 30: Random Forest Ensembles Applied to MLSCN Screening Data for Prediction and Feature Selection

Summary

Applying models to predict other datasets is atricky affair− Are the features distributed in a similar manner

between training data & the new data?− Do toxic/nontoxic labels transfer between datasets?

More secondary data required− But this is not the final solution since the NCGC

dataset is small but leads to (some) good models

Page 31: Random Forest Ensembles Applied to MLSCN Screening Data for Prediction and Feature Selection

Summary Fingerprints may not be the optimal way to get the best

predictive ability− They do let us look at structural features easily

We have investigated Molconn-Z descriptors− Preliminary results don't indicate significant improvements

We cannot globally model animal toxicity based oncytotoxicity− Animal data sets are biased to toxic compounds

− Different structural classes behave differently (mechanism ofaction, metabolic effects

Page 32: Random Forest Ensembles Applied to MLSCN Screening Data for Prediction and Feature Selection

Acknowledgements MLSCN data sets / PubChem NCGC Scripps

− Screening (Peter Hodder)− Informatics (Nick Tsinoremas, Chris Mader)− Hugh Rosen

Alex Tropsha, UNC Digital Chemistry Tudor Oprea, UNM

NIH

Page 33: Random Forest Ensembles Applied to MLSCN Screening Data for Prediction and Feature Selection

Extras

Page 34: Random Forest Ensembles Applied to MLSCN Screening Data for Prediction and Feature Selection

Structure Sets: Fingerprint Similarity Only a small fraction of MLSMR structures

are similar to ToxNet structures; and viceversa; 4 to 5 % of MLSMR and ToxNet haveat least one >50 % similar structure to eachother

NCGC structures are much more similar toToxNet (86% >50 % max similar) thanMLSMR (9% >50 % max similar)

Page 35: Random Forest Ensembles Applied to MLSCN Screening Data for Prediction and Feature Selection

Important Features - Distributions

[N,n]~[C,c]~[C,c]~[C,c]

[C,c]~[N,n]~[C,c]~[C,c]

[*]1@[*]@[*]@[*]@[*]@[*]@1

[A]=;@[A]-;@[A]-;@[A]=;@[A]

[C,c;D1]-;!@[N,n]

Page 36: Random Forest Ensembles Applied to MLSCN Screening Data for Prediction and Feature Selection

Are The Cytotox Classes Distinct?

Poor predictive ability may be explained by thelack of separation between toxic & nontoxic

Normalized Manhattan distance = 0.017

Page 37: Random Forest Ensembles Applied to MLSCN Screening Data for Prediction and Feature Selection

Are The Cytotox Classes Distinct?

But the situation is a little better if we just lookat the important bits

Normalized Manhattan distance = 0.06

Page 38: Random Forest Ensembles Applied to MLSCN Screening Data for Prediction and Feature Selection

Standardization Issues - Data Extracting data sets out of PubChem requires manual

curation and post-processing and aggregation of data− No standard measures or column definitions− Activity score and outcome only valid within one experiment− Assay results are not globally comparable− No standardization of assay format (e.g. type, readout, etc.)− Limited ability to query PubChem for specific data sets

rpubchem package for R is one option− Need better way to access specific bulk data sets− No aggregation of assay (sample) data by compound

PubChem seems better suited to browse individual datathan access large standardized data sets

Page 39: Random Forest Ensembles Applied to MLSCN Screening Data for Prediction and Feature Selection

Model Deployment

Final models are deployed in our R WSinfrastructure− Currently the Scripps Jurkat model is available

Model file can be downloaded− http://www.chembiogrid.org/cheminfo/rws/mlist

A web page client is available at− http://www.chembiogrid.org/cheminfo/rws/scripps

Incorporated the model into a Pipeline Pilotworkflow

Page 40: Random Forest Ensembles Applied to MLSCN Screening Data for Prediction and Feature Selection

Toxicity vs No. of Features - Mouse/IP

No correlation betweenthe number of importantfeatures and the pLD50for the toxic class

Batrachotoxin

Conotoxin GI

Ciguatoxin CTX 3C

O

C H3

H O

O

H O

N

O

H3C

H3C

O

H3C

N H

H3C

H

N

O

HN O

H

N

CH3

O

NH2

O

HN

O

N

H

S

O

S

H

N

O S

NH

S

O

NH

N

H

O

H2N

NH2

O

O

HN

NH

O

HO O

OH

O

HN

HN O

N

NH2

HN

HN

OH

O

O

O

O

O

O

O

O

O

O

O

O

O

OH

OH

H3C

HO

H3C

CH3

CH3

H3C

H

H

H

H

H

H

H

H

H

H

H

H

H

H

H

H

HH

H

H

H