
Hepatic Injury Classification

STAT W4240 Section 3 Data Mining Individual Project Two

Michael Jiang zj2160


1 Linear Classification Models (Part I)

1.1 Classification Imbalance

The hepatic injury status response variable has three classes: "None", "Mild", and "Severe". Its distribution is shown in Figure 1. As we can see, the distribution is highly imbalanced, which poses a serious problem for model training. There are several ways to handle this; one of the most popular is to use sampling techniques to reconstruct a balanced training dataset. Ling and Li (1998)[1] provide an up-sampling approach in which cases from the minority classes are sampled with replacement until every class has approximately the same number of observations. I prefer up-sampling over down-sampling in this context because the number of "Severe" observations is so limited that a down-sampled training dataset would be too small for the models to be well trained[2]. Therefore, the training dataset is created as follows:

1) First, set the random seed (derived from my UNI, zj2160) and randomly assign each sample in the full dataset to either the training or the test dataset, with an 80% probability of assignment to the training dataset.

2) Next, use the upSample function from the caret package to reconstruct the training dataset so that the new training dataset is balanced (see the sketch below).

Do I need to up-sample the test dataset as well? The answer is no: while the training dataset is sampled to be balanced, the test dataset should remain consistent with the state of nature and reflect the imbalance, so that honest estimates of future performance can be computed.
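The following is a minimal sketch of this procedure in R. It assumes the data come from the hepatic dataset in the AppliedPredictiveModeling package (which provides the objects bio, chem, and injury) and uses a numeric stand-in for the UNI-based seed; the remaining object names are hypothetical.

library(caret)
library(AppliedPredictiveModeling)

data(hepatic)                        # assumed source: provides bio, chem, injury

## 1) Bernoulli 80/20 split (2160 is a numeric stand-in for the seed zj2160)
set.seed(2160)
inTrain  <- runif(length(injury)) < 0.8
trainBio <- bio[inTrain, ];  trainY <- injury[inTrain]
testBio  <- bio[!inTrain, ]; testY  <- injury[!inTrain]

## 2) Up-sample minority classes with replacement; the test set is left
##    untouched so it keeps the natural class imbalance
up <- upSample(x = trainBio, y = trainY, yname = "injury")
table(up$injury)                     # all three classes now have equal counts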

1.2 Classification Statistic

There are three common classification statistics: AUC (area under the ROC curve), Kappa, and Accuracy. For a 2-class classification problem, we usually use AUC as the classification statistic. However, in this context the response variable contains three classes. Two solutions are presented below:

1) Use Kappa or Accuracy as the classification statistic, since AUC is only appropriate for 2-class problems. However, some models, such as the logistic regression model, are not natively suited to multi-class classification (although multinomial logistic regression can compensate).

[1] Ling C, Li C (1998). "Data Mining for Direct Marketing: Problems and Solutions." In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, pp. 73–79.
[2] I have tested both the down-sampling and the up-sampling methods; the comparison appears later in this report. In fact, the prediction performance of most models improves when down-sampling is replaced with up-sampling. The reason may be that, with so few "Severe" samples, down-sampling produces a small training dataset, so the models are not well trained.


2) Still use Kappa or Accuracy as the final classification statistic, but build k sub-models for the k classes. More specifically, create k binary variables whose value is 1 if the sample belongs to the corresponding class and 0 otherwise. In this context, the binary variables are as follows:

$$\mathrm{None}_i = \begin{cases} 1, & \text{if sample } i \text{ is of None severity} \\ 0, & \text{otherwise} \end{cases}$$

$$\mathrm{Mild}_i = \begin{cases} 1, & \text{if sample } i \text{ is of Mild severity} \\ 0, & \text{otherwise} \end{cases}$$

$$\mathrm{Severe}_i = \begin{cases} 1, & \text{if sample } i \text{ is of Severe severity} \\ 0, & \text{otherwise} \end{cases}$$

I then train three separate models using these three response variables. When selecting tuning parameters, I use AUC to choose the optimal value. When predicting, I combine the three probability predictions using the softmax transformation (Bridle 1990)[3], which is defined as

$$\hat{p}_k^{*} = \frac{e^{\hat{p}_k}}{\sum_{l=1}^{3} e^{\hat{p}_l}}$$

where $\hat{p}_k$ is the probability prediction for the $k$th class and $\hat{p}_k^{*}$ is the transformed value between 0 and 1. The final prediction is the class with the largest $\hat{p}_k^{*}$.

I decided to use the latter approach, since it can accommodate all of the models. The final classification statistic for measuring prediction performance is Kappa, with Accuracy used as a reference. A sketch of the one-vs-all setup is given below.
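Below is a minimal sketch of this one-vs-all setup in R, continuing the hypothetical object names from the sketch above. The binary outcomes are given named factor levels so that caret can compute class probabilities, and PLSDA (caret method "pls") stands in for whichever sub-model is being tuned.

## One-vs-all responses built from the up-sampled training data
upX     <- up[, setdiff(names(up), "injury")]
upY     <- up$injury
yNone   <- factor(ifelse(upY == "None",   "Yes", "No"), levels = c("Yes", "No"))
yMild   <- factor(ifelse(upY == "Mild",   "Yes", "No"), levels = c("Yes", "No"))
ySevere <- factor(ifelse(upY == "Severe", "Yes", "No"), levels = c("Yes", "No"))

## Tune each sub-model by AUC (metric = "ROC")
ctrl <- trainControl(method = "cv", classProbs = TRUE,
                     summaryFunction = twoClassSummary)
fitOne <- function(y) {
  set.seed(2160)
  train(x = upX, y = y, method = "pls", metric = "ROC",
        tuneLength = 10, preProc = c("center", "scale"), trControl = ctrl)
}
fitNone   <- fitOne(yNone)
fitMild   <- fitOne(yMild)
fitSevere <- fitOne(ySevere)

## Combine the three probability predictions with the softmax transformation
softmax <- function(p) exp(p) / sum(exp(p))
probs <- cbind(None   = predict(fitNone,   testBio, type = "prob")[, "Yes"],
               Mild   = predict(fitMild,   testBio, type = "prob")[, "Yes"],
               Severe = predict(fitSevere, testBio, type = "prob")[, "Yes"])
pred  <- colnames(probs)[apply(probs, 1, function(p) which.max(softmax(p)))]

## Kappa (primary) and Accuracy (reference) on the imbalanced test set
confusionMatrix(factor(pred, levels = levels(testY)),
                testY)$overall[c("Kappa", "Accuracy")]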

1.3 Comparison between Models Based Separately on Bio and Chem

There are in total four linear classification models discussed in Chapter 12: the logistic regression model (LRM), linear discriminant analysis (LDA), partial least squares discriminant analysis (PLSDA), and penalized models. The results are shown in Tables 1 and 2. As we can see, when we only use biological predictors, the penalized models yield the best performance, with a Kappa of 0.13 under up-sampling and 0.193 under down-sampling. When we only use chemical fingerprint predictors, PLSDA yields the best performance, with a Kappa of 0.277 under up-sampling and 0.246 under down-sampling.

Based on these results, it is quite clear that the chemical fingerprint predictors contain the most information about hepatic toxicity. This point is further demonstrated when we consider the nonlinear models.

[3] Bridle J (1990). "Probabilistic Interpretation of Feedforward Classification Network Outputs, with Relationships to Statistical Pattern Recognition." In Neurocomputing: Algorithms, Architectures and Applications, pp. 227–236. Springer-Verlag.

1.4 Top Predictors


For the optimal model on biological predictors, the penalized model (up-sampling), the top 5 important predictors are as follows:

1) When predicting whether it’s “None”, the top 5 important variables are Z130, Z118, Z98, Z48, Z64. See Figure 2 for details.

2) When predicting whether it’s “Mild”, the top 5 important variables are Z20, Z38, Z99, Z53, Z79. See Figure 3 for details.

3) When predicting whether it’s “Severe”, the top 5 important variables are Z100, Z83, Z102, Z15, Z59. See Figure 4 for details.

For the optimal model on chemical fingerprint predictors, PLSDA (up-sampling), the top 5 important predictors are as follows (the sketch after this list shows how these rankings can be extracted):

1) When predicting whether it’s “None”, the top 5 important variables are X134, X188, X154, X83, X72. See Figure 5 for details.

2) When predicting whether it’s “Mild”, the top 5 important variables are X140, X147, X31, X134, X67. See Figure 6 for details.

3) When predicting whether it’s “Severe”, the top 5 important variables are X72, X113, X44, X136, X81. See Figure 7 for details.
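These rankings can be read off the fitted caret models with varImp; a minimal sketch, reusing the hypothetical fitted sub-model fitNone from the earlier sketch:

## Scaled variable importance for the "None" sub-model; the top 5 entries
## are what the corresponding figures plot
impNone <- varImp(fitNone, scale = TRUE)
plot(impNone, top = 5)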

1.5 Comparison between Models Based on Both Bio and Chem

The optimal model using both biological and chemical predictors is PLSDA, which yields a Kappa of 0.372 under up-sampling and 0.186 under down-sampling. With both sets of predictors, the PLSDA model performs significantly better than the models built on only one set of predictors.

The top 5 predictors for the PLSDA model (up-sampling) are as follows:

1) When predicting whether it’s “None”, the top 5 important variables are X134, X154, Z116, Z149, Z38. See Figure 8 for details.

2) When predicting whether it’s “Mild”, the top 5 important variables are Z116, Z93, X38, X98, X155. See Figure 9 for details.

3) When predicting whether it’s “Severe”, the top 5 important variables are Z69, Z100, X72, Z102, Z93. See Figure 10 for details.

Comparing these top 5 lists with the previous results, we can see that for "None" and "Severe" the top 5 predictors largely come from the earlier top 5 lists for the biological and chemical predictors separately; for example, for "Severe", Z100 and Z102 both appeared among the top 5 predictors previously. Another interesting observation is that Z-predictors (biological) make up a larger share of these top 5 lists than X-predictors, suggesting that the biological predictors still contribute complementary information once the two sets are combined.

1.6 Suggestion

I recommend using both the biological and the chemical predictor information and training a PLSDA model with up-sampling; this yields a reasonably accurate prediction. As Table 3 shows, almost every down-sampling result is worse than its up-sampling counterpart, so we should use up-sampling to train the model. Also, among all the linear classification models, PLSDA outperforms the others with a Kappa of 0.372, which qualifies as a good prediction.

2 Nonlinear Classification Models (Part II)

2.1 Comparison between Models Based Separately on Bio and Chem

There are in total seven nonlinear classification models discussed in Chapter 13: regularized discriminant analysis (RDA; I folded quadratic discriminant analysis into it by setting lambda to 1), the neural network (NNet), the averaged neural network (AvNNet), flexible discriminant analysis (FDA), the support vector machine (SVM), k-nearest neighbors (kNN), and Naïve Bayes. The results are shown in Tables 4 and 5. As we can see, when we only use biological predictors, AvNNet yields the best performance, with a Kappa of 0.368 under up-sampling and 0.119 under down-sampling. When we only use chemical fingerprint predictors, the SVM yields the best performance, with a Kappa of 0.328 under up-sampling and 0.235 under down-sampling.
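All of these models can be fit through the same one-vs-all caret interface used for the linear models, changing only the method argument. A hedged sketch for two of them, using caret's "avNNet" and "nb" methods (which require the nnet and klaR packages, respectively) and the hypothetical objects from the earlier sketches:

## Averaged neural network sub-model for the "None" class
set.seed(2160)
avnnetFit <- train(x = upX, y = yNone, method = "avNNet", metric = "ROC",
                   preProc = c("center", "scale"), trControl = ctrl,
                   trace = FALSE)     # silence nnet's iteration log

## Naive Bayes sub-model for the "None" class
set.seed(2160)
nbFit <- train(x = upX, y = yNone, method = "nb", metric = "ROC",
               trControl = ctrl)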

Compared with the linear classification models, when we only use biological predictors the nonlinear structure of these models greatly improves classification performance: the best linear model could only yield a Kappa of 0.13, while almost all of the nonlinear models yield a higher Kappa, the highest being 0.368.

However, when we only use chemical predictors, the nonlinear structure helps, but not as much as in the biological case: the highest Kappa with a nonlinear model is 0.372, versus 0.277 with a linear model.

2.2 Top Predictors

For the optimal model on biological predictors, AvNNet (up-sampling), the top 5 important predictors are as follows:

1) When predicting whether it’s “None”, the top 5 important variables are Z130, Z118, Z98, Z48, Z64. See Figure 11 for details.

2) When predicting whether it’s “Mild”, the top 5 important variables are Z20, Z38, Z99, Z53, Z79. See Figure 12 for details.

3) When predicting whether it's "Severe", the top 5 important variables are Z100, Z83, Z102, Z15, Z59. See Figure 13 for details.

For the optimal model on chemical fingerprint predictors, the SVM (up-sampling), the top 5 important predictors are as follows:

1) When predicting whether it’s “None”, the top 5 important variables are X132, X1, X95, X133, X120. See Figure 14 for details.

2) When predicting whether it’s “Mild”, the top 5 important variables are X135, X1, X132, X28, X125. See Figure 15 for details.

3) When predicting whether it’s “Severe”, the top 5 important variables are X144, X145, X133, X139, X81. See Figure 16 for details.

2.3 Comparison between Models Based on Both Bio and Chem

The optimal model using both biological and chemical predictors is Naïve Bayes, which yields a Kappa of 0.306 under up-sampling and 0.403 under down-sampling. With both sets of predictors, the Naïve Bayes model performs slightly better than the models built on only one set of predictors.

The top 5 predictors for the Naïve Bayes model (up-sampling) are as follows:

1) When predicting whether it’s “None”, the top 5 important variables are X132, X1, X95, X133, Z130. See Figure 17 for details.

2) When predicting whether it’s “Mild”, the top 5 important variables are X135, X1, X132, X28, X125. See Figure 18 for details.

3) When predicting whether it’s “Severe”, the top 5 important variables are X144, X145, X133, X139, X81. See Figure 19 for details.

Compared with the previous results, the top 5 important variables are almost identical to those obtained using only the chemical fingerprint predictors; the only difference is that, when predicting "None", the 5th most important variable is Z130 rather than X120. This again strongly confirms the earlier conclusion that the chemical fingerprint predictors contain most of the information about hepatic toxicity, since almost all of the important variables are X-predictors (chemical fingerprint predictors).

2.4 Suggestion

I recommend using both the biological and the chemical predictor information and training a Naïve Bayes model. The nonlinear structure indeed improves performance over the linear models: with a Kappa of 0.306 under up-sampling and 0.403 under down-sampling, a well-trained Naïve Bayes model outperforms the optimal linear model. Therefore, I recommend using Naïve Bayes to predict hepatic toxicity.


3 Tree-based Classification Models (Part III)

3.1 CART & Conditional Inference Trees

Both the CART tree and the conditional inference tree models are built using the chemical predictors, with Kappa as the tuning metric. When comparing performance in predicting the whole dataset, the CART random forest (tuning parameter mtry = 100) has an Accuracy of 0.568 and a Kappa of 0.21, while the conditional inference forest (mtry = 10) has an Accuracy of 0.534 and a Kappa of 0.0996. The random forest built on CART trees clearly performs better.

3.2 Computation Time Comparison

The computation-time output for each model is as follows:

> ## Obtain the computation time for each model
> rfCART$times$everything
   user  system elapsed
492.665   2.582 171.341
> rfcForest$times$everything
   user  system elapsed
581.095  52.354 169.595

As we can see, the CART forest not only performs better but also uses less CPU (user) time than the conditional inference forest. Therefore, I prefer CART over conditional inference trees.
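A minimal sketch of how the two forests might be fit and timed through caret; the method strings "rf" (randomForest package) and "cforest" (party package) are caret's, while the data objects are hypothetical chemical-predictor analogues of the earlier up-sampled training set:

## Up-sampled chemical training set, built like the biological one earlier
upC    <- upSample(x = chem[inTrain, ], y = injury[inTrain], yname = "injury")
upChem <- upC[, setdiff(names(upC), "injury")]
upYc   <- upC$injury

set.seed(2160)
rfCART <- train(x = upChem, y = upYc, method = "rf", metric = "Kappa",
                tuneGrid = data.frame(mtry = 100),
                trControl = trainControl(method = "cv"))
set.seed(2160)
rfcForest <- train(x = upChem, y = upYc, method = "cforest", metric = "Kappa",
                   tuneGrid = data.frame(mtry = 10),
                   trControl = trainControl(method = "cv"))

## The user/system/elapsed timings reported above come from these fields
rfCART$times$everything
rfcForest$times$everything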

3.3 Top Predictors

Figures 20 and 21 show the top 10 important variables for the two models.

More specifically, for CART, the top 10 important variables are: X1, X132, X71, X28, X31, X29, X147, X30, X11, X6.

For conditional inference tree, the top 10 important variables are: X132, X134, X1, X71, X35, X95, X139, X38, X98, X160.

The top 10 most important variables differ substantially between CART and conditional inference trees because the two methods choose splits differently. Conditional inference trees use statistical hypothesis tests to search exhaustively across predictors and their possible split points: for every candidate split, a significance test evaluates the difference between the two groups created by the split. The CART model, in contrast, has a different objective function when choosing split points: it maximizes the reduction in node impurity (for classification, the Gini index). This difference in objective function may be why the two models rank variables so differently. A small sketch of the two tree types follows.
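For illustration, a single tree of each type can be grown on the same data, which makes the different split criteria easy to inspect; a hedged sketch using a hypothetical data frame chemDF that binds the chemical predictors to the response:

library(rpart)    # CART: impurity-based (Gini) splits
library(party)    # conditional inference trees: permutation-test splits

chemDF <- data.frame(chem, injury = injury)    # hypothetical combined frame

cartTree <- rpart(injury ~ ., data = chemDF)   # splits maximize Gini reduction
condTree <- ctree(injury ~ ., data = chemDF)   # splits chosen by significance tests

printcp(cartTree)   # variables actually used by the CART tree
print(condTree)     # shows the test statistic behind each split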


Table 1: Linear models, biological predictors. ROC columns are per-class AUCs on the training dataset; Kappa is measured on the test dataset.

                    Up-Sampling Method                              Down-Sampling Method
Model               ROC(None)  ROC(Mild)  ROC(Severe)  Kappa        ROC(None)  ROC(Mild)  ROC(Severe)  Kappa
LRM                 0.619      0.546      0.753        0.0977       0.578      0.552      0.628        0.0147
LDA                 0.556      0.555      0.846        0.0749       0.554      0.567      0.567        0.0989
PLSDA               0.601      0.569      0.925        0.125        0.642      0.623      0.806        0.102
Penalized Models    0.622      0.577      0.891        0.13         0.612      0.61       0.811        0.193

Table 2: Linear models, chemical fingerprint predictors. ROC columns are per-class AUCs on the training dataset; Kappa is measured on the test dataset.

                    Up-Sampling Method                              Down-Sampling Method
Model               ROC(None)  ROC(Mild)  ROC(Severe)  Kappa        ROC(None)  ROC(Mild)  ROC(Severe)  Kappa
LRM                 0.645      0.593      0.863        0.167        0.625      0.613      0.674        0.115
LDA                 0.729      0.633      0.91         0.205        0.596      0.615      0.706        0.176
PLSDA               0.741      0.659      0.97         0.277        0.643      0.651      0.76         0.246
Penalized Models    0.704      0.672      0.922        0.166        0.642      0.621      0.717        0.212


Table 3: Linear models, biological and chemical predictors. ROC columns are per-class AUCs on the training dataset; Kappa is measured on the test dataset.

                    Up-Sampling Method                              Down-Sampling Method
Model               ROC(None)  ROC(Mild)  ROC(Severe)  Kappa        ROC(None)  ROC(Mild)  ROC(Severe)  Kappa
LRM                 0.624      0.579      0.646        0.159        0.601      0.55       0.628        0.0287
LDA                 0.647      0.619      0.776        0.0739       0.587      0.584      0.717        0.0695
PLSDA               0.783      0.648      0.983        0.372        0.634      0.627      0.785        0.186
Penalized Models    0.698      0.634      0.96         0.353        0.615      0.621      0.833        0.186


Table 4: Nonlinear models, biological predictors. ROC columns are per-class AUCs on the training dataset; Kappa is measured on the test dataset.

                    Up-Sampling Method                              Down-Sampling Method
Model               ROC(None)  ROC(Mild)  ROC(Severe)  Kappa        ROC(None)  ROC(Mild)  ROC(Severe)  Kappa
RDA                 0.81       0.589      0.979        0.0811       0.633      0.645      0.792        0.0926
NNet                0.75       0.62       0.971        0.2          0.666      0.62       0.733        -0.0622
AvNNet              0.793      0.597      0.987        0.368        0.645      0.621      0.825        0.119
FDA                 0.593      0.579      0.902        0.284        0.573      0.512      0.642        0.135
SVM                 0.688      0.598      0.945        0.253        0.596      0.618      0.9          -0.069
kNN                 0.764      0.625      0.958        0.14         0.628      0.643      0.725        0.0353
Naïve Bayes         0.669      0.575      0.921        0.0245       0.597      0.583      0.711        0.162


Table 5: Nonlinear models, chemical fingerprint predictors. ROC columns are per-class AUCs on the training dataset; Kappa is measured on the test dataset.

                    Up-Sampling Method                              Down-Sampling Method
Model               ROC(None)  ROC(Mild)  ROC(Severe)  Kappa        ROC(None)  ROC(Mild)  ROC(Severe)  Kappa
RDA                 0.785      0.601      0.974        0.249        0.648      0.605      0.7          0.225
NNet                0.854      0.708      0.991        0.314        0.665      0.672      0.758        0.21
AvNNet              0.877      0.714      0.998        0.225        0.68       0.64       0.767        0.295
FDA                 0.748      0.695      0.89         0.215        0.67       0.666      0.65         0.0911
SVM                 0.821      0.659      0.98         0.328        0.648      0.628      0.7          0.235
kNN                 0.786      0.629      0.96         0.372        0.655      0.592      0.718        0.0155
Naïve Bayes         0.77       0.561      0.829        0.247        0.627      0.617      0.625        0.222


Table 6: Nonlinear models, biological and chemical fingerprint predictors. ROC columns are per-class AUCs on the training dataset; Kappa is measured on the test dataset.

                    Up-Sampling Method                              Down-Sampling Method
Model               ROC(None)  ROC(Mild)  ROC(Severe)  Kappa        ROC(None)  ROC(Mild)  ROC(Severe)  Kappa
RDA                 0.801      0.597      0.999        0.274        0.63       0.596      0.769        0.252
NNet                0.803      0.612      0.957        0.176        0.622      0.617      0.747        0.242
AvNNet              0.856      0.679      0.995        0.248        0.62       0.625      0.747        0.208
FDA                 0.746      0.637      0.931        0.213        0.62       0.641      0.75         0.0495
SVM                 0.836      0.601      0.99         0.137        0.621      0.629      0.775        -0.0088
kNN                 0.789      0.628      0.933        0.289        0.644      0.617      0.8          0.0976
Naïve Bayes         0.778      0.547      0.896        0.306        0.601      0.61       0.667        0.403

Figure 1: Distribution of the hepatic injury response variable.
Figure 2: Top predictors for "None", penalized model, biological predictors (up-sampling).
Figure 3: Top predictors for "Mild", penalized model, biological predictors (up-sampling).
Figure 4: Top predictors for "Severe", penalized model, biological predictors (up-sampling).
Figure 5: Top predictors for "None", PLSDA, chemical predictors (up-sampling).
Figure 6: Top predictors for "Mild", PLSDA, chemical predictors (up-sampling).
Figure 7: Top predictors for "Severe", PLSDA, chemical predictors (up-sampling).
Figure 8: Top predictors for "None", PLSDA, biological and chemical predictors (up-sampling).
Figure 9: Top predictors for "Mild", PLSDA, biological and chemical predictors (up-sampling).
Figure 10: Top predictors for "Severe", PLSDA, biological and chemical predictors (up-sampling).
Figure 11: Top predictors for "None", AvNNet, biological predictors (up-sampling).
Figure 12: Top predictors for "Mild", AvNNet, biological predictors (up-sampling).
Figure 13: Top predictors for "Severe", AvNNet, biological predictors (up-sampling).
Figure 14: Top predictors for "None", SVM, chemical predictors (up-sampling).
Figure 15: Top predictors for "Mild", SVM, chemical predictors (up-sampling).
Figure 16: Top predictors for "Severe", SVM, chemical predictors (up-sampling).
Figure 17: Top predictors for "None", Naïve Bayes, biological and chemical predictors (up-sampling).
Figure 18: Top predictors for "Mild", Naïve Bayes, biological and chemical predictors (up-sampling).
Figure 19: Top predictors for "Severe", Naïve Bayes, biological and chemical predictors (up-sampling).
Figure 20: Top 10 important variables, CART random forest.
Figure 21: Top 10 important variables, conditional inference forest.