Induction of Decision Trees Using Genetic Programming for the Development of SAR Toxicity Models Xue Z Wang

Page 1: Induction of Decision Trees Using Genetic Programming for the Development of SAR Toxicity Models Xue Z Wang

Induction of Decision Trees Using Genetic Programming for the Development of SAR Toxicity Models

Xue Z Wang

Page 2: Induction of Decision Trees Using Genetic Programming for the Development of SAR Toxicity Models Xue Z Wang

The Background

Toxicity values are not known!

• 26 million distinct organic and inorganic chemicals known
• > 80,000 in commercial production
• Combinatorial chemistry adds more than 1 million new compounds to the library every year
• In the UK, > 10,000 are evaluated for possible production every year
• Biggest cost factor

Page 3: Induction of Decision Trees Using Genetic Programming for the Development of SAR Toxicity Models Xue Z Wang

What is toxicity?
• "The dose makes the poison" - Paracelsus (1493-1541)
• Toxicity endpoints: EC50, LC50, ...

Page 4: Induction of Decision Trees Using Genetic Programming for the Development of SAR Toxicity Models Xue Z Wang

Toxicity tests are
• expensive,
• time consuming and
• disliked by many people

In Silico Toxicity Prediction:
SAR & QSAR - (Quantitative) Structure Activity Relationships
TOPKAT, DEREK, MultiCase

Page 5: Induction of Decision Trees Using Genetic Programming for the Development of SAR Toxicity Models Xue Z Wang

SAR & QSARs
e.g. Neural Networks, PLS, Expert Systems

Toxicity Endpoints
• Daphnia magna EC50
• Carcinogenicity
• Mutagenicity
• Rat oral LD50
• Mouse inhalation LC50
• Skin sensitisation
• Eye irritancy

Molecular Modelling

Descriptors: physicochemical, biological, structural
• Molecular weight
• HOMO (highest occupied molecular orbital)
• LUMO (lowest unoccupied molecular orbital)
• Heat of formation
• Log D at pH 2, 7.4, 10
• Dipole moment
• Polarisability
• Total energy
• Molecular volume
• ...

No. of descriptors, cost, time

Page 6: Induction of Decision Trees Using Genetic Programming for the Development of SAR Toxicity Models Xue Z Wang

Aims of Research

• Integrated data mining environment (IDME) for in silico toxicity prediction
• Decision tree induction technique for eco-toxicity modelling
• In silico techniques for mixture toxicity prediction

Page 7: Induction of Decision Trees Using Genetic Programming for the Development of SAR Toxicity Models Xue Z Wang

Why Data Mining System for In Silico Toxicity Prediction

Existing systems:

• Unknown confidence level of prediction

• Extrapolation

• Models built from small datasets

• Fixed descriptors

• May not cover the endpoint required

Users' own data resources, often commercially sensitive, not fully exploited

Page 8: Induction of Decision Trees Using Genetic Programming for the Development of SAR Toxicity Models Xue Z Wang

Data Mining: Discover Useful Information and Knowledge from Data

Data: records of numerical data, symbols, images, documents

[Figure: Data, Information, Knowledge, Decision, with axes labelled Volume and Value]

The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data

More importantly: better understanding

Knowledge:
• Rules: IF .. THEN ..
• Cause-effect relationships
• Decision trees
• Patterns: abnormal, normal operation
• Predictive equations
• ...

Page 9: Induction of Decision Trees Using Genetic Programming for the Development of SAR Toxicity Models Xue Z Wang

• Clustering
• Classification
• Conceptual clustering
• Inductive learning
• Dependency modelling
• Summarisation
• Regression
• Case-based learning

Page 10: Induction of Decision Trees Using Genetic Programming for the Development of SAR Toxicity Models Xue Z Wang

e.g. Dependency Modelling or Link Analysis

x1  x2  x3
1   0   0
1   1   1
0   0   1
1   1   1
0   0   0
0   1   1
1   1   1
0   0   0
1   1   1
0   0   0

[Figure: graph with nodes x1, x2 and x3]

Page 11: Induction of Decision Trees Using Genetic Programming for the Development of SAR Toxicity Models Xue Z Wang

Data pre-processing
- Wavelet for on-line signal feature extraction and dimension reduction
- Fuzzy approach for dynamic trend interpretation

Clustering - Supervised classification
- BPNN
- Fuzzy set covering approach

Unsupervised classification
- ART2 (Adaptive resonance theory)
- AutoClass
- PCA

Dependency modelling
- Bayesian networks
- Fuzzy-SDG (signed directed graph)
- Decision trees

Others
- Automatic rules extraction from data using Fuzzy-NN and Fuzzy SDG
- Visualisation

Page 12: Induction of Decision Trees Using Genetic Programming for the Development of SAR Toxicity Models Xue Z Wang

Modern control systems

Cost due to PREVENTABLE abnormal operations: e.g. $20 billion per year in the petrochemical industry

Fault detection & diagnosis: very complex
- sensor faults, equipment faults, control-loop, interaction of variables …

Page 13: Induction of Decision Trees Using Genetic Programming for the Development of SAR Toxicity Models Xue Z Wang
Page 14: Induction of Decision Trees Using Genetic Programming for the Development of SAR Toxicity Models Xue Z Wang

Yussel’s work

Start point

End point

Process Operational Safety Envelopes

Loss Prevention in Process Ind. 2002

Page 15: Induction of Decision Trees Using Genetic Programming for the Development of SAR Toxicity Models Xue Z Wang

Integrated Data Mining Environment

Results presentation: graphs, tables, ASCII files

Discovery validation: statistical significance, results for training and test sets

Data mining toolbox: regression, PCA & ICA, ART2 networks, Kohonen networks, K-nearest neighbour, fuzzy c-means, decision trees and rules, feedforward neural networks (FFNN), summary statistics, visualisation

Data import: Excel, ASCII files, database, XML

Data pre-processing: scaling, missing values, outlier identification, feature extraction

Descriptor calculation

- Toxicity

Page 16: Induction of Decision Trees Using Genetic Programming for the Development of SAR Toxicity Models Xue Z Wang

User Interface

Page 17: Induction of Decision Trees Using Genetic Programming for the Development of SAR Toxicity Models Xue Z Wang

Quantitative Structure Activity Relationship

75 organic compounds with 1094 descriptors and endpoint Log(1/EC50) to Vibrio fischeri (Zhao et al., QSAR 17(2), 1998, pages 131-138)

Log(1/EC50) = -0.3766 + 0.0444 Vx (r2 = 0.7078, MSE = 0.2548)

Vx: McGowan's characteristic volume
r2: Pearson's correlation coefficient (squared)
q2: leave-one-out cross-validated correlation coefficient
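A minimal sketch of how r2 and the leave-one-out q2 can be computed for a one-descriptor model of this form. The function names are hypothetical, the fitting is plain least squares rather than whatever IDME uses, and the Vx value passed to the slide's equation in the usage note is an arbitrary illustrative number, not a compound from the data set.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def r_squared(x, y):
    """Fraction of variance explained when fitting on all points."""
    a, b = fit_line(x, y)
    my = mean(y)
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


def q_squared_loo(x, y):
    """Leave-one-out q2: each point is predicted by a model fitted without it."""
    my = mean(y)
    press = 0.0
    for i in range(len(x)):
        a, b = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (a + b * x[i])) ** 2
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1 - press / ss_tot


def predict_log_inv_ec50(vx):
    """The slide's fitted equation: Log(1/EC50) = -0.3766 + 0.0444 Vx."""
    return -0.3766 + 0.0444 * vx
```

For example, `predict_log_inv_ec50(100.0)` evaluates the slide's equation for a made-up Vx of 100. For a least-squares line q2 never exceeds r2, which is why it is the more conservative figure to report.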

Page 18: Induction of Decision Trees Using Genetic Programming for the Development of SAR Toxicity Models Xue Z Wang

Principal Component Analysis

Page 19: Induction of Decision Trees Using Genetic Programming for the Development of SAR Toxicity Models Xue Z Wang

Clustering in IDME

Page 20: Induction of Decision Trees Using Genetic Programming for the Development of SAR Toxicity Models Xue Z Wang

Multidimensional Visualisation

Page 21: Induction of Decision Trees Using Genetic Programming for the Development of SAR Toxicity Models Xue Z Wang

Feedforward neural networks
Input layer: PC1, PC2, PC3, ..., PCm
Hidden layer
Output layer: Log(1/EC50)

Page 22: Induction of Decision Trees Using Genetic Programming for the Development of SAR Toxicity Models Xue Z Wang

FFNN Results graph

Page 23: Induction of Decision Trees Using Genetic Programming for the Development of SAR Toxicity Models Xue Z Wang

QSAR Model for Mixture Toxicity Prediction

TRAINING
• Similar constituents
• Dissimilar constituents

TESTING
• Similar constituents
• Dissimilar constituents

Page 24: Induction of Decision Trees Using Genetic Programming for the Development of SAR Toxicity Models Xue Z Wang

Why Inductive Data Mining for In Silico Toxicity Prediction?

• Lack of knowledge on what descriptors are important to toxicity endpoints (feature selection)
• Expert systems: subjective knowledge obtained from human experts
• Linear vs nonlinear
• Black box models

Page 25: Induction of Decision Trees Using Genetic Programming for the Development of SAR Toxicity Models Xue Z Wang

What is inductive learning?

Aims at Developing a Qualitative Causal Language for Grouping Data Patterns into Clusters

Decision trees or production rules

Explicit and transparent

Page 26: Induction of Decision Trees Using Genetic Programming for the Development of SAR Toxicity Models Xue Z Wang

Expert Systems
+ Human expert knowledge; knowledge transparent, causal
- Knowledge subjective; data not used; often qualitative

Statistical Methods
+ Data driven; quantitative
- Black-box; human knowledge not used

Neural Networks
+ Data driven; quantitative; nonlinear; easy setup
- Black-box; human knowledge not used

Inductive DM
+ Combines advantages of ESs, and SMs & NNs; qualitative & quantitative, nonlinear; data & human knowledge used; knowledge transparent and causal
- More research needed: continuous valued output, dynamics / interactions

Page 27: Induction of Decision Trees Using Genetic Programming for the Development of SAR Toxicity Models Xue Z Wang

Discretization techniques | Methods tested
Binary discretization, information entropy (Quinlan 1986 & 1993) | C5.0
LERS (Learning from Examples using Rough Sets, Grzymala-Busse 1997) | LERS_C5.0
Probability distribution histogram | Histogram_C5.0
Equal width interval | EQI_C5.0
KEX (Knowledge EXplorer, Berka & Bruha 1998) | KEX_chi_C5.0, KEX_fre_C5.0, KEX_fuzzy_C5.0
CN4 (Berka & Bruha 1998) | CN4_C5.0
Chi2 (Liu & Setiono 1995, Kerber 1992) | Chi2_C5.0
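Two of the techniques listed above are simple enough to sketch: equal width intervals, and a binary split chosen by information entropy. This is an illustration under stated assumptions, not the code of C5.0 or of the tools named in the table: the function names are hypothetical, and the entropy criterion shown is the plain weighted entropy of the two halves.

```python
import math
from collections import Counter


def equal_width_edges(values, n_bins):
    """Interior cut points that divide [min, max] into n_bins equal intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(1, n_bins)]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_binary_split(values, labels):
    """Cut point minimising the size-weighted entropy of the two sides."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_h, best_cut = float("inf"), None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cut possible between equal values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        h = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if h < best_h:
            best_h, best_cut = h, (pairs[i - 1][0] + pairs[i][0]) / 2
    return best_cut
```

Equal width intervals ignore the class labels entirely, whereas the entropy split places its cut where the classes separate best, which is one reason the two can lead to quite different trees.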

Page 28: Induction of Decision Trees Using Genetic Programming for the Development of SAR Toxicity Models Xue Z Wang

Decision Tree Generation Based on Genetic Programming

• Traditional tree generation methods: greedy search, can miss potential models
• Genetic algorithm: optimisation approach that can effectively avoid local minima and simultaneously evaluate many solutions
• GA has been used in decision tree generation to decide the splitting points and attributes to be used whilst growing a tree
• Genetic (evolutionary) programming: not only simultaneously evaluates many solutions and avoids local minima, but also does not require parameter encoding into fixed-length vectors called chromosomes
• Based on direct application of the GA to tree structures

Page 29: Induction of Decision Trees Using Genetic Programming for the Development of SAR Toxicity Models Xue Z Wang

Genetic Computation

(1) Generate a population of solutions
(2) Repeat steps (i) and (ii) until the stop criteria are satisfied
    (i) calculate the fitness function value for each solution candidate
    (ii) perform crossover and mutation to generate the next generation
(3) The best solution in all generations is regarded as the solution
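The three steps can be sketched as a generic loop. This is a toy illustration, not the talk's implementation: the stop criterion is assumed to be a fixed number of generations, parents are assumed to be drawn from the fitter half of the population, and all names and default values are hypothetical. The example problem (maximising the number of 1s in a bit string) is only there to make the loop runnable.

```python
import random


def evolve(fitness, random_solution, crossover, mutate,
           pop_size=30, generations=40, mutation_rate=0.2, seed=0):
    rng = random.Random(seed)
    # (1) generate a population of solutions
    population = [random_solution(rng) for _ in range(pop_size)]
    best = max(population, key=fitness)
    # (2) repeat until the stop criterion (here: a generation count) is met
    for _ in range(generations):
        # (i) calculate fitness for each candidate and rank them
        ranked = sorted(population, key=fitness, reverse=True)
        if fitness(ranked[0]) > fitness(best):
            best = ranked[0]
        parents = ranked[: pop_size // 2]
        # (ii) crossover and mutation generate the next generation
        population = []
        while len(population) < pop_size:
            a, b = rng.sample(parents, 2)
            child = crossover(a, b, rng)
            if rng.random() < mutation_rate:
                child = mutate(child, rng)
            population.append(child)
    # (3) the best solution seen in any generation is returned
    last = max(population, key=fitness)
    return last if fitness(last) > fitness(best) else best


def random_bits(rng, n=20):
    return [rng.randint(0, 1) for _ in range(n)]


def one_point(a, b, rng):
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def flip_one(bits, rng):
    i = rng.randrange(len(bits))
    return bits[:i] + [1 - bits[i]] + bits[i + 1:]


best_bits = evolve(sum, random_bits, one_point, flip_one)
```

Swapping the bit strings for trees, and `sum` for classification accuracy, gives the general shape of the tree-based variant described in the following slides.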

Page 30: Induction of Decision Trees Using Genetic Programming for the Development of SAR Toxicity Models Xue Z Wang

Crossover

Genetic algorithms

Genetic (Evolutionary) Programming / EPTree

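[Figure: crossover drawn as parent + parent = child, once for genetic algorithms and once for genetic programming trees]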

Page 31: Induction of Decision Trees Using Genetic Programming for the Development of SAR Toxicity Models Xue Z Wang

1. Divide data into training and test sets

2. Generate the 1st population of trees
- Randomly choose a row (i.e. a compound) and a column (i.e. a descriptor)
- Use the value of the slot, s, to split: the left child takes those data points with selected attribute values <= s, whilst the right child takes those > s

[Figure: data matrix of molecules (rows) by descriptors (columns), split at s]

DeLisle & Dixon, J Chem Inf Comput Sci 44, 862-870 (2004)
Buontempo & Wang et al, J Chem Inf Comput Sci 45, 904-912 (2005)

Page 32: Induction of Decision Trees Using Genetic Programming for the Development of SAR Toxicity Models Xue Z Wang

- If a child will not cover enough rows (e.g. 10% of the training rows), another combination is tried.
- A child node becomes a leaf node if pure, i.e. all the rows covered are in the same class, or near pure, whilst the other nodes grow children.
- When all nodes either have two children or are leaf nodes, the tree is fully grown and added to the first generation.
- A leaf node is assigned a class label corresponding to the majority class of points partitioned there.
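The growth rules above can be sketched as a short recursive function. It is a simplified illustration with hypothetical names: only exactly pure nodes become leaves (the "near pure" rule is left out), and a node that finds no acceptable split after `max_tries` attempts is turned into a majority-class leaf, which is an assumption not stated on the slides.

```python
import random
from collections import Counter


def grow_tree(rows, labels, rng, min_frac=0.1, n_total=None, max_tries=50):
    """Grow one random tree: rows are descriptor vectors, labels are classes."""
    n_total = n_total or len(rows)
    counts = Counter(labels)
    majority = counts.most_common(1)[0][0]
    if len(counts) == 1:
        return {"leaf": majority}  # pure node
    min_rows = max(1, int(min_frac * n_total))
    for _ in range(max_tries):
        r = rng.randrange(len(rows))      # random compound
        c = rng.randrange(len(rows[0]))   # random descriptor
        s = rows[r][c]                    # value of the slot
        left = [i for i, row in enumerate(rows) if row[c] <= s]
        right = [i for i, row in enumerate(rows) if row[c] > s]
        if len(left) >= min_rows and len(right) >= min_rows:
            return {
                "col": c,
                "split": s,
                "left": grow_tree([rows[i] for i in left],
                                  [labels[i] for i in left],
                                  rng, min_frac, n_total, max_tries),
                "right": grow_tree([rows[i] for i in right],
                                   [labels[i] for i in right],
                                   rng, min_frac, n_total, max_tries),
            }
    return {"leaf": majority}  # assumed fallback: no acceptable split found


def predict(tree, row):
    """Follow <= s to the left child and > s to the right until a leaf."""
    while "leaf" not in tree:
        tree = tree["left"] if row[tree["col"]] <= tree["split"] else tree["right"]
    return tree["leaf"]
```

Calling `grow_tree` repeatedly with the same data but a shared random generator yields a varied first population, since each tree draws different rows and columns.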

Page 33: Induction of Decision Trees Using Genetic Programming for the Development of SAR Toxicity Models Xue Z Wang

3. Crossover, Mutation
- Tournament: randomly select a group of trees, e.g. 16
- Calculate fitness values
- Generate the first parent
- Similarly generate the second parent
- Crossover to generate a child
- Generate other children
- Select a percentage for mutation

[Figure: crossover of two parent trees giving a child tree]
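A sketch of tournament selection and subtree crossover on trees stored as nested dictionaries. The representation (`"leaf"`, `"col"`, `"split"`, `"left"`, `"right"` keys) and the function names are assumptions made for illustration; the fitness function is whatever the caller supplies, for instance training accuracy. Re-assigning leaf labels after crossover is not shown.

```python
import copy
import random


def tournament(population, fitness, rng, size=16):
    """Pick `size` trees at random and return the fittest as a parent."""
    contestants = rng.sample(population, min(size, len(population)))
    return max(contestants, key=fitness)


def nodes(tree, path=()):
    """Yield the path of 'left'/'right' keys leading to every node."""
    yield path
    if "leaf" not in tree:
        yield from nodes(tree["left"], path + ("left",))
        yield from nodes(tree["right"], path + ("right",))


def subtree(tree, path):
    for key in path:
        tree = tree[key]
    return tree


def crossover(parent_a, parent_b, rng):
    """Copy parent_a and replace one random node with a random subtree of parent_b."""
    child = copy.deepcopy(parent_a)
    donor = copy.deepcopy(subtree(parent_b, rng.choice(list(nodes(parent_b)))))
    target = rng.choice(list(nodes(child)))
    if not target:  # the root was chosen, so the donor becomes the child
        return donor
    subtree(child, target[:-1])[target[-1]] = donor
    return child
```

Running `tournament` twice gives the two parents, and `crossover` then produces one child; repeating this fills the next generation.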

Page 34: Induction of Decision Trees Using Genetic Programming for the Development of SAR Toxicity Models Xue Z Wang

Mutation Methods
- Random change of split point (i.e. choosing a different row's value for the current attribute)
- Choosing a new attribute whilst keeping the same row
- Choosing a new attribute and a new row
- Re-growing part of the tree
- If no improvement in accuracy for k generations, trees generated were mutated
- …
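The first mutation in the list, changing a split point by taking another row's value for the same attribute, might look like the sketch below. The nested-dictionary tree layout and the function name are hypothetical; the other mutation types would follow the same pattern of copying the tree and editing one node.

```python
import copy
import random


def mutate_split_point(tree, rows, rng):
    """Return a copy of `tree` in which one internal node has a new split value,
    taken from a randomly chosen row for that node's current attribute."""
    mutant = copy.deepcopy(tree)
    internal = []
    stack = [mutant]
    while stack:
        node = stack.pop()
        if "leaf" not in node:
            internal.append(node)
            stack.extend((node["left"], node["right"]))
    if internal:
        node = rng.choice(internal)
        node["split"] = rows[rng.randrange(len(rows))][node["col"]]
    return mutant
```

Because the new threshold is always an observed descriptor value, the mutated split stays within the range of the data.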

Page 35: Induction of Decision Trees Using Genetic Programming for the Development of SAR Toxicity Models Xue Z Wang

Two Data Sets

Data Set 1:
Concentration lethal to 50% of the population, LC50, 1/Log(LC50), of Vibrio fischeri, a bioluminescent bacterium
75 compounds, 1069 molecular descriptors

Data Set 2:
Concentration affecting 50% of the population, EC50, of algae Chlorella vulgaris, by causing fluorescein diacetate to disappear
80 compounds, 1150 descriptors

Page 36: Induction of Decision Trees Using Genetic Programming for the Development of SAR Toxicity Models Xue Z Wang

• 600 trees were grown in each generation
• 16 trees competed in each tournament to select trees for crossover
• 66.7% were mutated for the bacterial dataset, and 50% for the algae dataset

Data set | Minimum | Class 1 range | Class 2 range | Class 3 range | Class 4 range | Maximum
Bacteria | 0.90 | ≤3.68 | ≤4.05 | ≤4.50 | >4.50 | 6.32
Algae | -4.06 | ≤-1.05 | ≤-0.31 | ≤0.81 | >0.81 | 3.10
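The class ranges in the table amount to three upper bounds per data set. A small sketch of the lookup, with hypothetical names, using the thresholds exactly as listed:

```python
from bisect import bisect_left

# Upper bounds of classes 1 to 3, copied from the table; class 4 is everything above.
UPPER_BOUNDS = {
    "bacteria": [3.68, 4.05, 4.50],
    "algae": [-1.05, -0.31, 0.81],
}


def toxicity_class(value, data_set):
    """Map an endpoint value to class 1..4; a value equal to a bound
    falls in the lower class, matching the <= signs in the table."""
    return bisect_left(UPPER_BOUNDS[data_set], value) + 1
```

For instance, `toxicity_class(4.50, "bacteria")` gives class 3, while any value above 4.50 gives class 4.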

Page 37: Induction of Decision Trees Using Genetic Programming for the Development of SAR Toxicity Models Xue Z Wang

Evolutionary Programming Results: Dataset 1

[Figure: decision tree with Yes/No branches; the branch structure is not recoverable from the extracted text]

Split conditions:
- Highest eigenvalue of Burden matrix weighted by atomic mass ≤ 2.15
- Lowest eigenvalue of Burden matrix weighted by van der Waals vol ≤ 3.304
- Self-returning walk count of order 8 ≤ 4.048
- Cl attached to C2 (sp3) ≤ 1
- Distance Degree Index ≤ 15.124
- Summed atomic weights of angular scattering function ≤ -1.164
- R autocorrelation of lag 7 weighted by atomic mass ≤ 3.713

Leaf nodes: Class 4 (7/7), Class 1 (12/12), Class 3 (8/8), Class 4 (5/6), Class 4 (5/6), Class 2 (5/6), Class 2 (7/8), Class 3 (6/7)

For data set 1 (bacteria data), in generation 37:
91.7% for training (60 cases)
73.3% for the test set (15 cases)

Page 38: Induction of Decision Trees Using Genetic Programming for the Development of SAR Toxicity Models Xue Z Wang

Decision Tree Using C5.0 for the Same Data

[Figure: decision tree with Yes/No branches; the branch structure is not recoverable from the extracted text]

Split conditions:
- Valence connectivity index ≤ 3.346
- Cl attached to C1 (sp2) ≤ 1
- H autocorrelation lag 5 weighted by atomic mass ≤ 0.007
- Summed atomic weights of angular scattering function ≤ -0.082
- Gravitational index ≤ 7.776

Leaf nodes: Class 2 (11/12), Class 4 (3/6), Class 3 (5/6), Class 1 (13/14), Class 4 (14/15), Class 3 (7/7)

For data set 1 (bacteria data):
88.3% for training (60 cases)
60.0% for the test set (15 cases)

Page 39: Induction of Decision Trees Using Genetic Programming for the Development of SAR Toxicity Models Xue Z Wang

Evolutionary Programming Results: Dataset 2

[Figure: decision tree with Yes/No branches; the branch structure is not recoverable from the extracted text]

Split conditions:
- Self-returning walk count order 8 ≤ 3.798
- H autocorrelation of lag 2 weighted by Sanderson electronegativities ≤ 0.401
- Molecular multiple path count order 3 ≤ 92.813
- Solvation connectivity index ≤ 2.949
- 2nd component symmetry directional WHIM index weighted by van der Waals volume ≤ 0.367

Leaf nodes: Class 2 (14/15), Class 1 (16/16), Class 3 (6/8), Class 4 (6/7), Class 3 (9/10), Class 4 (8/8)

2nd dataset (algae data), GP tree, generation 9:
Training: 92.2%
Test: 81.3%

Page 40: Induction of Decision Trees Using Genetic Programming for the Development of SAR Toxicity Models Xue Z Wang

Decision Tree Using See5.0 for the Same Data

[Figure: decision tree with Yes/No branches; the branch structure is not recoverable from the extracted text]

Split conditions:
- Broto-Moreau autocorrelation of topological structure lag 4 weighted by atomic mass ≤ 9.861
- Total accessibility index weighted by van der Waals vol ≤ 0.281
- Max eigenvalue of Burden matrix weighted by van der Waals vol ≤ 3.769

Leaf nodes: Class 3 (15/20), Class 1 (16/16), Class 2 (15/16), Class 4 (12/12)

2nd dataset (algae data), See5:
Training: 90.6%
Test: 75.0%

Page 41: Induction of Decision Trees Using Genetic Programming for the Development of SAR Toxicity Models Xue Z Wang

Summary of Results

Data set 1: Bacteria data
 | C5.0 | GP method
Tree size | 6 | 8
Training accuracy | 88.3% | 91.7%
Test accuracy | 60.0% | 73.3%

Data set 2: Algae data
 | C5.0 | GP method
Tree size | 4 | 6
Training accuracy | 90.6% | 92.2%
Test accuracy | 75.0% | 81.3%

Page 42: Induction of Decision Trees Using Genetic Programming for the Development of SAR Toxicity Models Xue Z Wang

Comparison of Test Accuracy for See5.0 and GP Trees Having the Same Training Accuracy

Data set 1: Bacteria data
 | C5.0 | GP (Generation 31)
Tree size | 6 | 8
Training accuracy | 88.3% | 88.3%
Test accuracy | 60.0% | 73.3%

Data set 2: Algae data
 | C5.0 | GP (Generation 9)
Tree size | 4 | 6
Training accuracy | 90.6% | 90.6%
Test accuracy | 75.0% | 87.5%

Page 43: Induction of Decision Trees Using Genetic Programming for the Development of SAR Toxicity Models Xue Z Wang

Application to Wastewater Treatment Plant Data

[Figure: plant flowsheet. Labels: Inflow, Screening, Grit Removal, Primary Settler (Primary Treatment); Aeration Tank, Secondary Settler (Secondary Treatment); Outflow]

Page 44: Induction of Decision Trees Using Genetic Programming for the Development of SAR Toxicity Models Xue Z Wang

[Figure: plant sections. Labels: Input, Pre-Treatment, Screws, Primary Treatment, Primary Settler, Secondary Treatment, Aeration Tanks, Secondary Settler, Sludge Line, Output]

Data corresponding to 527 days' operation
38 variables

Page 45: Induction of Decision Trees Using Genetic Programming for the Development of SAR Toxicity Models Xue Z Wang

Decision tree for prediction of suspended solids in effluents: training data

[Figure: decision tree; the branch structure is not recoverable from the extracted text]

Split conditions: SS-P ≤ -2.9572; DQO-D ≤ 1.80444; SS-P ≤ -1.8445; DBO-D ≤ 0.47006; RD-DBO-G ≤ 0.8097; SS-P ≤ -3.167930; ZN-E ≤ 2.2447; SS-P ≤ -3.6479; PH-D ≤ 0.8699; DQO-D ≤ 2.53335; PH-D ≤ 0.65534; RD-DQO-S ≤ 0.31152; SS-P ≤ -1.20793; SS-P ≤ -1.68597; PH-D ≤ 0.59323; SS-P ≤ -1.58468; SSV-P ≤ 0.17786; DBO-SS ≤ 0.81806; PH-D ≤ 0.68569

Leaf nodes: L 2, N 7, L 3, N 3, N 3, N 4, N 2, N 3, H 3, N 27, H 4, H 2, N 20, N 320/1, N 16, N 2, L 3, N 30, N 11, N 5

Total no. of obs. = 470
Training accuracy: 99.8%
Test accuracy: 93.0%
Leaf nodes = 20
L = Low, N = Normal, H = High

SS-P: input SS to primary settler
DQO-D: input COD to secondary settler
DBO-D: input BOD to secondary settler
PH-D: input pH to secondary settler
SSV-P: input volatile SS to primary settler

Page 46: Induction of Decision Trees Using Genetic Programming for the Development of SAR Toxicity Models Xue Z Wang

Using all the data of 527 days

[Figure: decision tree; the branch structure is not recoverable from the extracted text]

Split conditions: DBO-E ≤ 0.49701; SS-P ≤ -3.08361; SS-P ≤ -1.86019; SS-P ≤ -1.86017; RD-DQO-S ≤ 0.35794; RD-SS-G ≤ 0.50018; DBO-D ≤ 0.408557; SED-P ≤ -2.81193; DBO-E ≤ 0.71809; RD-SS-P ≤ 0.491144; SS-P ≤ -1.20793; PH-D ≤ 0.65537; RD-DQO-S ≤ 0.357935; PH-P ≤ 0.17333; PH-P ≤ 0.41833; COND-S ≤ 0.49438; SS-P ≤ -3.39768

Leaf nodes: N 76/3, L 3, N 25, N 2, H 9, N 11, N 8, H 3, N 4, N 13, N 234/1, N 11, N 69, L 3, L 3, N 31, L 2, N 20

No. of obs. = 527
Accuracy = 99.25%
Leaf nodes = 18
L = Low, N = Normal, H = High

Page 47: Induction of Decision Trees Using Genetic Programming for the Development of SAR Toxicity Models Xue Z Wang

Final Remarks

• An Integrated Data Mining Prototype System for Toxicity Prediction of Chemicals and Mixtures Has Been Developed

• An Evaluation of Current Inductive Data Mining Approaches to Toxicity Prediction Has Been Conducted

• A New Methodology for Inductive Data Mining Based on a Novel Use of Genetic Programming is Proposed, Giving Promising Results in Three Case Studies

Page 48: Induction of Decision Trees Using Genetic Programming for the Development of SAR Toxicity Models Xue Z Wang

On-going Work

1) Adaptive Discretization of End-point Values through Simultaneous Mutation of the Output

[Figure: accuracy (0 to 100) against generation (1 to 35) for 2-class, 3-class and 4-class trees]

The best training accuracy in each generation for the trees grown for the algae data using the SSRD. The 2-class trees no longer dominate, and very accurate 3-class trees have been found.

SSRD - sum of squared differences in rank

Page 49: Induction of Decision Trees Using Genetic Programming for the Development of SAR Toxicity Models Xue Z Wang

Future Work

2) Extend the Method to Model Trees & Fuzzy Model Trees Generation

Rule 1: If antecedent one applies, with degree μ1 = μ1,1 × μ1,2 × … × μ1,9
then y1 = 0.1910 PC1 + 0.6271 PC2 + 0.2839 PC3 + 1.2102 PC4 + 0.2594 PC5 + 0.3810 PC6 - 0.3695 PC7 + 0.8396 PC8 + 1.0986 PC9 - 0.5162

Rule 2: If antecedent two applies, with degree μ2 = μ2,1 × μ2,2 × … × μ2,9
then y2 = 0.7403 PC1 + 0.5453 PC2 - 0.0662 PC3 - 0.8266 PC4 + 0.1699 PC5 - 0.0245 PC6 + 0.9714 PC7 - 0.3646 PC8 - 0.3977 PC9 - 0.0511

Final output: crisp value (μ1×y1 + μ2×y2) / (μ1 + μ2)

where μi=μi,1×μi,2×……×μi,10
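The two rules and their combination can be evaluated as in the sketch below. The coefficients are copied from the rules above; the function names are hypothetical, and the membership values are supplied by the caller because the membership functions themselves are only shown graphically on the next slide.

```python
import math

# (coefficients for PC1..PC9, constant term), as given in Rule 1 and Rule 2
RULE_1 = ([0.1910, 0.6271, 0.2839, 1.2102, 0.2594,
           0.3810, -0.3695, 0.8396, 1.0986], -0.5162)
RULE_2 = ([0.7403, 0.5453, -0.0662, -0.8266, 0.1699,
           -0.0245, 0.9714, -0.3646, -0.3977], -0.0511)


def rule_output(rule, pcs):
    """Linear consequent of one rule evaluated at the principal component scores."""
    weights, constant = rule
    return sum(w * p for w, p in zip(weights, pcs)) + constant


def crisp_output(pcs, memberships_1, memberships_2):
    """Weighted average of the rule outputs; each rule's degree is the
    product of its antecedent membership values."""
    mu1 = math.prod(memberships_1)
    mu2 = math.prod(memberships_2)
    y1 = rule_output(RULE_1, pcs)
    y2 = rule_output(RULE_2, pcs)
    return (mu1 * y1 + mu2 * y2) / (mu1 + mu2)
```

If one rule's degree is zero the output reduces to the other rule's linear model; if both degrees are zero the division is undefined, so a real implementation would need to decide how to handle inputs that no rule covers.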

Page 50: Induction of Decision Trees Using Genetic Programming for the Development of SAR Toxicity Models Xue Z Wang

Fuzzy Membership Functions Used in RulesFuzzy Membership Functions Used in Rules

Page 51: Induction of Decision Trees Using Genetic Programming for the Development of SAR Toxicity Models Xue Z Wang

Future Work

3) Extend the Method to Mixture Toxicity Prediction

TRAINING
• Similar constituents
• Dissimilar constituents

TESTING
• Similar constituents
• Dissimilar constituents

Page 52: Induction of Decision Trees Using Genetic Programming for the Development of SAR Toxicity Models Xue Z Wang

Acknowledgements

Crystal Faraday Partnership on Green Technology

AstraZeneca Brixham Environmental Laboratory

NERC Centre of Ecology and Hydrology

FV Buontempo, M Mwense, A Young, D Osborn

Page 53: Induction of Decision Trees Using Genetic Programming for the Development of SAR Toxicity Models Xue Z Wang

Type of descriptor | Definition | Examples
Constitutional | Physical description of the compound | Molecular weight, atom count
Topological | 2D descriptors taken from the molecular graph | Wiener index, Balaban index
Walk counts | Obtained from molecular graphs | Total walk count
Burden eigenvalues (BCUT) | Eigenvalues of the adjacency matrix, weighting the diagonals by atom weights, reflecting the topology of the whole compound | Weighted by atomic mass, volume, electronegativity or polarizability
Galvez topological charge indices | Describes charge transfer between pairs of atoms calculated from the eigenvalues of the adjacency matrix | Topological and mean charge index of various orders
2D autocorrelation | Sum of the atom weights of the terminal atoms of all the paths of a given length (lag) | Moreau, Moran, and Geary autocorrelations

Page 54: Induction of Decision Trees Using Genetic Programming for the Development of SAR Toxicity Models Xue Z Wang

Charge descriptors | Charges estimated by quantum molecular methods | Total positive charge, dipole index
Aromaticity indices | Estimated from geometrical distance between aromatically bonded atoms | Harmonic oscillator model of aromaticity
Randic molecular profiles | Derived from distance distribution moments of the geometry matrix | Molecular profile, shape profile
Geometrical descriptors | Conformation-dependent, based on molecular geometry | 3D Wiener index, gravitational index

Page 55: Induction of Decision Trees Using Genetic Programming for the Development of SAR Toxicity Models Xue Z Wang

Radial distribution function descriptors | Obtained from radial basis functions centred at different distances | Unweighted or weighted by atomic mass, volume, electronegativity or polarizability
3D Molecule Representation of Structure based on Electron diffraction (MoRSE) | Calculated by summing atomic weights viewed by different angular scattering functions |
GEometry, Topology, and Atom Weights AssemblY (GETAWAY) | Calculated from the leverage matrix, representing the influence of each atom in determining the shape of the molecule, obtained by centred atomic coordinates |

Page 56: Induction of Decision Trees Using Genetic Programming for the Development of SAR Toxicity Models Xue Z Wang

Weighted holistic invariant molecular (WHIM) | Statistical indices calculated from the atoms projected onto 3 principal components from a weighted covariance matrix of atomic coordinates | Unweighted or weighted by atomic mass, volume, electronegativity, polarizability or electrotopological state
Functional groups | Counts of various atoms and functional groups | Primary carbons, aliphatic ethers
Atom-centred fragments | From 120 atom-centred fragments defined by Ghose-Crippen | Cl-086; Cl attached to C1 (sp3)
Various others | Unsaturation index (number of non-single bonds); Hy (a function of the count of hydrophilic groups); Aromaticity ratio (aromatic bonds / total number of bonds in a H-depleted atom); Ghose-Crippen molecular refractivity; Fragment-based polar surface area