Chemical Data Mining of the NCI Human Tumor Cell Line Database H. Wang, J. Klinginsmith, X. Dong, A. C. Lee, R. Guha, Y. Wu, G. M. Crippen, and D. J. Wild Yoon Soo Pyon [email protected] October. 19 th , 2007

13:06, October 19, 2007

Page 1: 13:06, October 19, 2007

Chemical Data Mining of the NCI Human Tumor Cell Line Database

H. Wang, J. Klinginsmith, X. Dong, A. C. Lee, R. Guha, Y. Wu, G. M. Crippen, and D. J. Wild

Yoon Soo Pyon [email protected]

October 19th, 2007

Page 2: 13:06, October 19, 2007

Outline

1. Introduction

2. Characterization of the chemical compounds

3. Characterization of the cell line screening growth inhibition values

4. Characterization of the gene expression results

5. Relating dictionary-based structural keys to cellular screening activities

6. Predictive models of activity

7. Relating freely generated SMARTS structures to cellular screening activities

Page 3: 13:06, October 19, 2007

Introduction

• The NCI Developmental Therapeutics Program (DTP) Human Tumor Cell Line dataset is a publicly available database containing cellular assay screening data for over 40,000 compounds tested in sixty human tumor cell lines.

• It also contains microarray gene expression data (961 values) for the cell lines, making it a valuable resource that links chemical, biological, and genomic information.

• The authors take a formal knowledge discovery approach to characterizing and mining this set in order to discover relationships between compounds and biological activity values.

Page 4: 13:06, October 19, 2007

NCI 60 Cell Lines data

• The 60 cell lines include melanoma, leukemia, and cancers of the breast, prostate, lung, colon, ovary, kidney, and central nervous system.

• Screening results include three parameters:

    • GI50: 50% growth inhibition
    • TGI: total growth inhibition
    • LC50: 50% lethal concentration

• Analyses usually use –log(GI50).

COMPARE algorithm
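The –log(GI50) transform mentioned above can be sketched as follows. This is a minimal illustration that assumes a base-10 logarithm and GI50 values expressed in mol/L; the slides state neither, and the function name is hypothetical.

```python
import math


def neg_log_gi50(gi50_molar):
    """Return -log10(GI50) for a GI50 concentration given in mol/L.

    Assumption: base-10 logarithm and molar units (not stated on the slide).
    """
    if gi50_molar <= 0:
        raise ValueError("GI50 must be a positive concentration")
    return -math.log10(gi50_molar)


if __name__ == "__main__":
    # A lower GI50 (more potent compound) gives a larger -log(GI50).
    for gi50 in (1e-4, 1e-5, 1e-6):
        print(gi50, neg_log_gi50(gi50))
```

Under these assumptions, a more potent compound (smaller GI50) maps to a larger value, which is why higher –log(GI50) is read as higher activity.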

Page 5: 13:06, October 19, 2007

What we have examined

[Diagram: seed compound; similarity search; NCI 60 cell line data correlation search; target compounds]
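A correlation search over cell line activity profiles might look like the sketch below. It assumes that each compound has a profile of –log(GI50) values across the cell lines, that missing measurements are stored as `None`, and that profiles are compared with the Pearson correlation coefficient. The slides do not specify the coefficient, so treat that choice, and all function names, as assumptions rather than the authors' implementation.

```python
import math


def pearson(x, y):
    """Pearson correlation over positions where both profiles have a value.

    Returns None when fewer than 3 shared values exist or a profile is flat.
    """
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 3:
        return None
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        return None  # correlation is undefined for a constant profile
    return sxy / math.sqrt(sxx * syy)


def correlation_search(seed_profile, profiles, top_n=5):
    """Rank compounds by correlation of their profile with the seed profile."""
    scored = []
    for name, profile in profiles.items():
        r = pearson(seed_profile, profile)
        if r is not None:
            scored.append((name, r))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]


if __name__ == "__main__":
    seed = [4.0, 5.2, 6.1, None, 7.0]
    library = {
        "A": [4.1, 5.0, 6.3, 5.5, 6.8],  # tracks the seed closely
        "B": [7.0, 6.0, 5.0, 4.0, 4.0],  # opposite pattern
        "C": [4.0, 4.0, 4.0, 4.0, 4.0],  # flat profile, skipped
    }
    for name, r in correlation_search(seed, library):
        print(name, round(r, 3))
```

Skipping positions with missing values keeps compounds with incomplete screening data in the search, at the cost of comparing correlations computed over different numbers of cell lines.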

Page 6: 13:06, October 19, 2007

Characterization of the chemical compounds

• They implemented a local version of the database, containing 44,653 compounds, screening results, and gene expression values, using PostgreSQL and gNova CHORD.

• gNova CHORD supports chemical searching and the generation of 166-bit structural key fingerprints.
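To illustrate how structural key fingerprints can drive a similarity search, here is a minimal sketch in plain Python. It does not use gNova CHORD or PostgreSQL. It assumes each fingerprint is given as the set of its "on" bit positions and that similarity is measured with the Tanimoto coefficient, which the slides do not name. The threshold value and function names are likewise hypothetical.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union


def similarity_search(seed_bits, library, threshold=0.7):
    """Return (compound_id, similarity) pairs at or above the threshold."""
    hits = []
    for compound_id, bits in library.items():
        score = tanimoto(seed_bits, bits)
        if score >= threshold:
            hits.append((compound_id, score))
    return sorted(hits, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    seed = {99, 105, 127, 145}
    library = {
        "cmpd-1": {99, 105, 127, 145},   # identical key set
        "cmpd-2": {99, 105, 127, 152},   # 3 shared bits out of 5 in the union
        "cmpd-3": {77, 92, 110},         # no shared bits
    }
    for compound_id, score in similarity_search(seed, library, threshold=0.5):
        print(compound_id, score)
```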

Page 7: 13:06, October 19, 2007

Characterization of the chemical compounds

• Predicted property values were calculated and profiled against two other datasets:

    • FDA's Maximum Recommended Therapeutic Dose (MRTD) set: representative of currently marketed drugs
    • A randomly selected 40,000-compound subset of PubChem: representative of a diverse set

• Properties (molecular weight, XlogP, polar surface area, and numbers of hydrogen bond donors and acceptors) were calculated using OpenEye FILTER.

Page 8: 13:06, October 19, 2007

[Figure panels: H-bond donors, H-bond acceptors, molecular weight, XlogP, solubility, polar surface area]

Page 9: 13:06, October 19, 2007

Characterization of the chemical compounds

• The drug compounds in the MRTD set were compared with their most similar compounds in the tumor cell line set.

Page 10: 13:06, October 19, 2007

Characterization of the cell line screening growth inhibition value

• They examined the distribution of –log(GI50) data points across cell lines and compounds.

• Overall, 12.1% of the cell line screening data points are missing.

• Overall, 44.9% of growth inhibition values are equal to 4.0.

• Inactive: –log(GI50) < 5

• Active: –log(GI50) ≥ 5

• Overall, 19.6% of compounds are considered active.
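The activity threshold and the summary statistics above can be sketched as below. The cutoff of 5 comes from the slide. The denominators are assumptions made here for illustration: the missing fraction is taken over all data points, while the fractions equal to 4.0 and classed as active are taken over measured points only. Missing values are represented as `None`.

```python
def classify(neg_log_gi50, threshold=5.0):
    """Label a -log(GI50) value as 'active' (>= threshold) or 'inactive'."""
    if neg_log_gi50 is None:
        return None  # missing measurement: no label
    return "active" if neg_log_gi50 >= threshold else "inactive"


def summarize(values, threshold=5.0, floor=4.0):
    """Summarize a list of -log(GI50) data points (None marks missing)."""
    total = len(values)
    measured = [v for v in values if v is not None]
    if total == 0 or not measured:
        raise ValueError("need at least one measured data point")
    n_active = sum(1 for v in measured if v >= threshold)
    n_floor = sum(1 for v in measured if v == floor)
    return {
        "missing_fraction": (total - len(measured)) / total,
        "floor_fraction": n_floor / len(measured),
        "active_fraction": n_active / len(measured),
    }


if __name__ == "__main__":
    data = [4.0, 4.0, 5.0, 6.3, None, 4.7, None, 4.0]
    print([classify(v) for v in data])
    print(summarize(data))
```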

Page 11: 13:06, October 19, 2007

Characterization of the cell line screening growth inhibition value (Cont’d.)

Page 12: 13:06, October 19, 2007

Characterization of the gene expression results

• Under-expression from the norm: < 0

• Over-expression from the norm: ≥ 0

Page 13: 13:06, October 19, 2007

Relating dictionary-based structural keys to cellular screening activities

• The activity classification (active/inactive) and the structural key fingerprint bits were used to determine which structural features were more prevalent or scarcer in active compounds than in inactive compounds.

• The active-structural ratio for feature j:

$$R_{a,j} = \frac{T_{a,j}}{\#C_a} = \frac{\text{total of active compounds with feature } j}{\text{number of compounds in the set of active compounds}}$$

• The overall-structural ratio for feature j:

$$R_j = \frac{T_j}{\#C} = \frac{\text{total of compounds with feature } j}{\text{number of compounds in the complete set}}$$

Page 14: 13:06, October 19, 2007

Relating dictionary-based structural keys to cellular screening activities (cont’d.)

$$\mathrm{diff}_j = R_{a,j} - R_j$$

• diff_j ≥ 0: feature j appears in a greater percentage of the active compounds than of all compounds.

• diff_j < 0: feature j is scarcer in the active compounds than in all compounds.

Since nearly all 60 cell lines follow the same trend, the average difference between the active ratio and the overall ratio can be used to find the most important substructures in determining “global” activity and inactivity.
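The ratio difference for each fingerprint bit, and its average across cell lines, can be computed as in the sketch below. It assumes fingerprints are stored as sets of "on" bit positions numbered from 1, and that activity labels are available per cell line. The data layout and function names are illustrative, not taken from the paper.

```python
def ratio_differences(fingerprints, is_active, n_bits=166):
    """Return {j: R_aj - R_j} for bits 1..n_bits.

    fingerprints: list of sets of on-bit positions, one per compound
    is_active:    list of booleans, same order as fingerprints
    """
    n_all = len(fingerprints)
    n_active = sum(is_active)
    if n_all == 0 or n_active == 0:
        raise ValueError("need at least one compound and one active compound")
    diffs = {}
    for j in range(1, n_bits + 1):
        t_j = sum(1 for fp in fingerprints if j in fp)
        t_aj = sum(1 for fp, act in zip(fingerprints, is_active) if act and j in fp)
        diffs[j] = t_aj / n_active - t_j / n_all
    return diffs


def average_differences(per_cell_line_diffs):
    """Average diff_j over a list of per-cell-line {j: diff_j} dictionaries."""
    n = len(per_cell_line_diffs)
    bits = per_cell_line_diffs[0].keys()
    return {j: sum(d[j] for d in per_cell_line_diffs) / n for j in bits}


if __name__ == "__main__":
    fps = [{1, 2}, {1}, {2}, {3}]
    line_1 = ratio_differences(fps, [True, True, False, False], n_bits=3)
    line_2 = ratio_differences(fps, [True, False, False, False], n_bits=3)
    avg = average_differences([line_1, line_2])
    # Bits with the largest positive average are candidates for "global" activity.
    for j, d in sorted(avg.items(), key=lambda item: item[1], reverse=True):
        print(j, round(d, 3))
```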

Page 15: 13:06, October 19, 2007

Relating dictionary-based structural keys to cellular screening activities (cont’d.)

• Features associated with global activity indicate a tendency to bind to anything.

• Features associated with global inactivity indicate a tendency to stop binding to tumor-growth-related properties.

• Bits 105, 127, 145, 152, and 99 are the most important for activity.

• Bits 117, 110, 92, 77, and 95 are the most important for inactivity.

Page 16: 13:06, October 19, 2007

Relating dictionary-based structural keys to cellular screening activities (cont’d.)

Page 17: 13:06, October 19, 2007

Relating dictionary-based structural keys to cellular screening activities (cont’d.)

Page 18: 13:06, October 19, 2007

Predictive Models of Activity

• They built machine learning models using WEKA to predict activity in each of the 60 cell lines individually.

• They applied the AD-Tree and Ridor methods, which worked best, to various feature sets using cell line 60 (UO-31).

Page 19: 13:06, October 19, 2007

Predictive Models of Activity

• Not all 166 features are useful in determining cell line activity.

• Thus, feature selection helps increase the prediction accuracy.

Page 20: 13:06, October 19, 2007

Predictive Models of Activity (cont’d.)

• http://www.chembiogrid.org/cheminfo/ncidtp/dtp

Page 21: 13:06, October 19, 2007

Relating freely generated SMARTS structure to cellular screening activities

• Previously, experiments used a constrained dictionary of 166 SMARTS fragments.

• The approach was modified to generate a larger number of SMARTS-based keys (University of Michigan's method).

• SMARTS strings are lengthened and scored in order to establish strings up to seven atoms long that have a strong tendency to identify active or inactive compounds.

Page 22: 13:06, October 19, 2007

Relating freely generated SMARTS structure to cellular screening activities

• Scoring:

$$\text{ratio} = \frac{\text{number of active compounds identified by a SMARTS string}}{\text{number of inactive compounds identified by the same SMARTS string}}$$

• The ratio of active to inactive compounds in the NCI/DTP dataset is 7274 to 35664, roughly one active compound for every five inactive compounds.

• Thus, the baseline ratio of significance is 1:5, or 0.2.

• Considering a tenfold improvement over this baseline:

    • active compounds: ratio > 2.0
    • inactive compounds: ratio < 0.02
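The scoring rule can be sketched as follows. The counts 7274 and 35664 and the cutoffs 2.0 and 0.02 come from the slide. The sketch assumes the hit counts for a SMARTS string have already been obtained from a substructure search, which is not shown. The handling of a string that matches no inactive compounds, and the label names, are choices made here for illustration.

```python
N_ACTIVE = 7274      # active compounds in the NCI/DTP dataset (from the slide)
N_INACTIVE = 35664   # inactive compounds in the NCI/DTP dataset (from the slide)
BASELINE = N_ACTIVE / N_INACTIVE  # roughly 0.2, i.e. about 1 active per 5 inactive


def smarts_ratio(n_active_hits, n_inactive_hits):
    """Ratio of active to inactive compounds matched by one SMARTS string."""
    if n_inactive_hits == 0:
        # Assumption: a string matching only actives gets an unbounded ratio,
        # and a string matching nothing gets no score at all.
        return float("inf") if n_active_hits > 0 else None
    return n_active_hits / n_inactive_hits


def label(ratio, active_cutoff=2.0, inactive_cutoff=0.02):
    """Apply the tenfold-improvement cutoffs around the 0.2 baseline."""
    if ratio is None:
        return "no matches"
    if ratio > active_cutoff:
        return "active-associated"
    if ratio < inactive_cutoff:
        return "inactive-associated"
    return "not significant"


if __name__ == "__main__":
    print("baseline ratio:", round(BASELINE, 3))
    for hits in [(30, 10), (20, 100), (1, 100)]:
        r = smarts_ratio(*hits)
        print(hits, r, label(r))
```

A string that matches few compounds can cross a cutoff by chance, so in practice a minimum hit count would likely be needed alongside the ratio; the slides do not say whether one was used.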

Page 23: 13:06, October 19, 2007

Relating freely generated SMARTS structure to cellular screening activities

Page 24: 13:06, October 19, 2007

Thank You