13:06, October 19, 2007

Chemical Data Mining of the NCI Human Tumor Cell Line Database

H. Wang, J. Klinginsmith, X. Dong, A. C. Lee, R. Guha, Y. Wu, G. M. Crippen, and D. J. Wild

Yoon Soo Pyon [email protected]

October 19th, 2007

Outline

1. Introduction

2. Characterization of the chemical compounds

3. Characterization of the cell line screening growth inhibition values

4. Characterization of the gene expression results

5. Relating dictionary-based structural keys to cellular screening activities

6. Predictive models of activity

7. Relating freely generated SMARTS structures to cellular screening activities

Introduction

• The NCI Developmental Therapeutics Program (DTP) Human Tumor Cell Line dataset is a publicly available database containing cellular assay screening data for over 40,000 compounds tested in sixty human tumor cell lines.

• It also contains microarray gene expression data (961 values) for the cell lines, making it a rich resource that links chemical, biological, and genomic information.

• The authors take a formal knowledge discovery approach to characterizing and mining this dataset in order to discover relationships between compounds and biological activity values.

NCI 60 Cell Lines data

• 60 cell lines include melanomas, leukemia, and cancers of the breast, prostate, lung, colon, ovary, kidney and central nervous system.

• Screening results include three parameters:

GI50 – 50% growth inhibition

TGI – total growth inhibition

LC50 – 50% lethal concentration

Usually, -log(GI50) is used.
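As a small illustration of the -log(GI50) transform, the sketch below converts a GI50 concentration into that scale. It assumes GI50 is expressed in molar units and that the logarithm is base 10; the function name `neg_log_gi50` is hypothetical and not taken from the paper.

```python
import math

def neg_log_gi50(gi50_molar: float) -> float:
    """Return -log10(GI50) for a GI50 concentration in mol/L (assumed units)."""
    if gi50_molar <= 0:
        raise ValueError("GI50 must be a positive concentration")
    return -math.log10(gi50_molar)

# A more potent compound (lower GI50) gets a larger value on this scale.
weak = neg_log_gi50(1e-4)    # GI50 of 100 micromolar
potent = neg_log_gi50(1e-7)  # GI50 of 100 nanomolar
```

On this scale, larger values mean stronger growth inhibition, which is why the later activity cutoff is written as a lower bound.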

COMPARE algorithm

What we have examined

[Diagram: a seed compound is used for a similarity search and for an NCI 60 cell line data correlation search, each of which retrieves target compounds.]

Characterization of the chemical compounds

• They implemented a local version of the database containing 44,653 compounds, screening results, and gene expression values using PostgreSQL and gNova CHORD.

• gNova CHORD allows chemical searching and generation of 166-bit structural key fingerprints.


• Predicted property values were calculated and profiled against two other datasets:

• FDA’s Maximum Recommended Therapeutic Dose (MRTD) set : representative of current marketed drugs

• Randomly selected 40,000-compound subset of PubChem : representative of a diverse set

• Properties (molecular weight, XlogP, polar surface area, number of hydrogen bond donors and acceptors) were calculated using OpenEye FILTER.

[Figure panels: H-bond donors, H-bond acceptors, molecular weight, XlogP, solubility, polar surface area]


• Compared the similarity of the drug compounds in the MRTD with the most similar compounds in the tumor cell line set
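The nearest-neighbor comparison described above can be sketched as follows. The slides do not say which similarity measure was used, so the Tanimoto coefficient on fingerprint bits is an assumption here, and the compound names and bit positions are made-up toy data.

```python
from typing import Dict, Set, Tuple

def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)

def most_similar(query: Set[int], library: Dict[str, Set[int]]) -> Tuple[str, float]:
    """Return (name, similarity) of the library compound closest to the query."""
    name = max(library, key=lambda k: tanimoto(query, library[k]))
    return name, tanimoto(query, library[name])

# Toy fingerprints (hypothetical bit positions, not real compounds).
drug = {3, 17, 42, 99, 152}
tumor_set = {
    "cmpd_A": {3, 17, 42, 99},
    "cmpd_B": {5, 17, 110},
    "cmpd_C": {42, 77, 95, 117},
}
best_name, best_sim = most_similar(drug, tumor_set)
```

Repeating this for every MRTD drug would give one best-match similarity per drug, which could then be summarized as a distribution.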

Characterization of the cell line screening growth inhibition values

• They examined the distribution of –log(GI50) data points across cell lines and compounds.

• Overall 12.1% of the cell line screening data points are missing.

• Overall 44.9% of growth inhibition values are equal to 4.0.

• inactive : –log(GI50) < 5

• active : –log(GI50) ≥ 5

• Overall, 19.6% of compounds are considered active.
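A minimal sketch of the classification rule above, applied to individual data points. Missing measurements are represented as `None`, which is an assumption about the data layout; the values are toy numbers, and the sketch does not attempt to reproduce how the per-compound 19.6% figure was aggregated.

```python
from typing import Dict, List, Optional

ACTIVE_THRESHOLD = 5.0  # cutoff from the slide: active if -log(GI50) >= 5

def classify(value: Optional[float]) -> Optional[str]:
    """Label one -log(GI50) data point; missing values stay unlabeled."""
    if value is None:
        return None
    return "active" if value >= ACTIVE_THRESHOLD else "inactive"

def summarize(values: List[Optional[float]]) -> Dict[str, float]:
    """Fraction of missing points, and fraction of measured points that are active."""
    labels = [classify(v) for v in values]
    measured = [lab for lab in labels if lab is not None]
    if not labels or not measured:
        raise ValueError("need at least one measured data point")
    return {
        "missing_fraction": 1 - len(measured) / len(labels),
        "active_fraction": measured.count("active") / len(measured),
    }

# Toy -log(GI50) values for one compound across eight cell lines.
toy_values = [4.0, 4.0, None, 5.0, 6.3, 4.8, None, 4.0]
summary = summarize(toy_values)
```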

Characterization of the cell line screening growth inhibition values (cont’d.)

Characterization of the gene expression results

• Under-expression from the norm : < 0

• Over-expression from the norm : ≥ 0

Relating dictionary-based structural keys to cellular screening activities

• The activity classification (active/inactive) and the structural key fingerprint bits were used to determine which structural features were more prevalent or more scarce in active compounds than in inactive compounds.

• The active structural ratio for feature j:

$$R_{a,j} = \frac{T_{a,j}}{\#C_a} = \frac{\text{total of active compounds with feature } j}{\text{size of the set of active compounds}}$$

• The overall structural ratio for feature j:

$$R_j = \frac{T_j}{\#C} = \frac{\text{total of compounds with feature } j}{\text{size of the complete set of compounds}}$$

Relating dictionary-based structural keys to cellular screening activities (cont’d.)

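• The difference between the active ratio and the overall ratio for feature j:

$$\mathrm{diff}_j = R_{a,j} - R_j$$

• diff_j ≥ 0 : the feature appears in a greater percentage of the active compounds than of all compounds

• diff_j < 0 : the feature is lacking in the active compounds compared with all compounds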
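The two ratios and their difference can be sketched in a few lines for a single cell line. Fingerprints are represented as sets of on-bits keyed by compound ID, which is an assumed data layout; the function name and the toy compounds are hypothetical.

```python
from typing import Dict, Iterable, Set

def structural_diffs(
    fingerprints: Dict[str, Set[int]],
    active_ids: Set[str],
    bits: Iterable[int],
) -> Dict[int, float]:
    """Return diff_j = R_{a,j} - R_j for each feature bit j."""
    n_all = len(fingerprints)
    n_active = sum(1 for cid in fingerprints if cid in active_ids)
    if n_all == 0 or n_active == 0:
        raise ValueError("need at least one compound and one active compound")
    diffs = {}
    for j in bits:
        # T_j: compounds with feature j; T_{a,j}: active compounds with feature j.
        t_j = sum(1 for fp in fingerprints.values() if j in fp)
        t_aj = sum(
            1 for cid, fp in fingerprints.items() if cid in active_ids and j in fp
        )
        diffs[j] = t_aj / n_active - t_j / n_all
    return diffs

# Toy data: four compounds, two of them active, three feature bits.
fps = {"c1": {1, 2}, "c2": {1}, "c3": {2, 3}, "c4": {3}}
diffs = structural_diffs(fps, active_ids={"c1", "c2"}, bits=[1, 2, 3])
```

In the toy data, bit 1 occurs only in active compounds (positive difference), bit 3 only in inactive ones (negative difference), and bit 2 is equally common in both (zero).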

Since nearly all 60 cell lines follow the same trend, the average difference between the active ratio and the overall ratio across cell lines can be used to find the most important substructures in determining “global” activity and inactivity.


• Features associated with global activity indicate the tendency to bind anything.

• Features associated with global inactivity indicate the tendency to stop binding to tumor growth related properties.

• 105, 127, 145, 152, 99 are the most important bits for activity.

• 117, 110, 92, 77, 95 are the most important bits for inactivity.
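One way to picture how such a ranking of bits could be derived is to average each bit's difference over the cell lines and sort. This is a sketch under the assumption that every cell line reports a difference for the same set of bits; the cell line labels, bit numbers, and values below are made up and are not the paper's results.

```python
from typing import Dict, List, Tuple

def rank_global_bits(
    diffs_by_cell_line: Dict[str, Dict[int, float]], top_n: int
) -> Tuple[List[int], List[int]]:
    """Average diff_j over cell lines; return (top activity bits, top inactivity bits)."""
    if not diffs_by_cell_line:
        raise ValueError("need at least one cell line")
    n_lines = len(diffs_by_cell_line)
    bits = next(iter(diffs_by_cell_line.values())).keys()
    mean_diff = {
        j: sum(d[j] for d in diffs_by_cell_line.values()) / n_lines for j in bits
    }
    ranked = sorted(mean_diff, key=lambda j: mean_diff[j], reverse=True)
    # Largest positive means first; most negative means first for inactivity.
    return ranked[:top_n], ranked[::-1][:top_n]

# Toy differences for two hypothetical cell lines and three hypothetical bits.
toy = {
    "line_1": {10: 0.30, 20: -0.10, 30: 0.00},
    "line_2": {10: 0.10, 20: -0.30, 30: 0.05},
}
activity_bits, inactivity_bits = rank_global_bits(toy, top_n=1)
```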



Predictive Models of Activity

• They designed machine learning models using WEKA to predict individual activity in each of the 60 cell lines.

• They applied the AD-Tree and Ridor methods, which worked best, to various feature sets using cell line 60 (UO-31).

Predictive Models of Activity

• Not all 166 features are useful in determining the cell line activity.

• Thus, feature selection helps increase the prediction accuracy.
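To make the idea of feature selection concrete, the sketch below ranks fingerprint bits by information gain with respect to the active/inactive label and keeps the top k. The slides do not state which selection method was used in WEKA, so information gain is an assumption chosen for illustration, and the data are toy values.

```python
import math
from typing import List, Sequence

def entropy(labels: Sequence[int]) -> float:
    """Binary entropy (in bits) of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def information_gain(bit_column: Sequence[int], labels: Sequence[int]) -> float:
    """Reduction in label entropy after splitting on one fingerprint bit."""
    n = len(labels)
    on = [y for x, y in zip(bit_column, labels) if x]
    off = [y for x, y in zip(bit_column, labels) if not x]
    return entropy(labels) - (len(on) / n) * entropy(on) - (len(off) / n) * entropy(off)

def select_bits(X: List[List[int]], y: List[int], k: int) -> List[int]:
    """Return the indices of the k bits with the highest information gain."""
    n_bits = len(X[0])
    gains = [information_gain([row[j] for row in X], y) for j in range(n_bits)]
    return sorted(range(n_bits), key=lambda j: gains[j], reverse=True)[:k]

# Toy data: bit 0 tracks the label, bit 1 is constant, bit 2 is uninformative.
X = [[1, 1, 0], [1, 1, 1], [0, 1, 0], [0, 1, 1]]
y = [1, 1, 0, 0]
selected = select_bits(X, y, k=1)
```

A classifier trained only on the selected bits sees less noise from uninformative features, which is the intuition behind the accuracy gain mentioned above.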

Predictive Models of Activity (cont’d.)

• http://www.chembiogrid.org/cheminfo/ncidtp/dtp


Relating freely generated SMARTS structures to cellular screening activities

• Previously, experiments used a constrained dictionary of 166 SMARTS fragments.

• The approach was modified to generate a larger number of SMARTS-based keys (University of Michigan’s method).

• SMARTS strings are lengthened and scored in order to establish strings up to seven atoms long that have a strong tendency to identify active or inactive compounds.


• Scoring:

$$\text{ratio} = \frac{\text{number of active compounds identified by a SMARTS string}}{\text{number of inactive compounds identified by the same SMARTS string}}$$

• The ratio of active to inactive compounds in the NCI/DTP dataset is 7274 to 35664, or about one active compound for every five inactive compounds.

• Thus, the ratio of significance is 1:5, or 0.2.

• Considering a tenfold improvement over this ratio:

• active compounds : ratio > 2.0

• inactive compounds : ratio < 0.02
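The scoring rule can be sketched as a small function that takes the hit counts for one SMARTS string. The function names and labels are hypothetical, and the treatment of a string with zero inactive hits is an assumption, since the slides do not cover that case.

```python
import math

BASELINE = 7274 / 35664   # active:inactive ratio in the dataset, about 0.2
ACTIVE_CUTOFF = 2.0       # tenfold above the rounded baseline of 0.2
INACTIVE_CUTOFF = 0.02    # tenfold below the rounded baseline of 0.2

def smarts_ratio(n_active_hits: int, n_inactive_hits: int) -> float:
    """Ratio of active to inactive compounds matched by one SMARTS string."""
    if n_active_hits == 0 and n_inactive_hits == 0:
        raise ValueError("the SMARTS string matches no compounds")
    if n_inactive_hits == 0:
        return math.inf  # assumption: only actives matched counts as maximally enriched
    return n_active_hits / n_inactive_hits

def classify_smarts(n_active_hits: int, n_inactive_hits: int) -> str:
    """Apply the tenfold-improvement cutoffs from the slide."""
    ratio = smarts_ratio(n_active_hits, n_inactive_hits)
    if ratio > ACTIVE_CUTOFF:
        return "active"
    if ratio < INACTIVE_CUTOFF:
        return "inactive"
    return "not significant"

# Toy hit counts for three hypothetical SMARTS strings.
enriched = classify_smarts(30, 10)    # ratio 3.0
depleted = classify_smarts(1, 100)    # ratio 0.01
typical = classify_smarts(20, 100)    # ratio 0.2, same as the baseline
```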


Thank You
